
Infrared-visible image patches matching via convolutional neural networks

Mao Yuanhong, Ma Zhong, He Zhanzhuang

毛远宏, 马钟, 贺占庄. 采用卷积神经网络的红外和可见光图像块匹配[J]. 红外与激光工程, 2021, 50(5): 20200364. doi: 10.3788/IRLA20200364
Mao Yuanhong, Ma Zhong, He Zhanzhuang. Infrared-visible image patches matching via convolutional neural networks[J]. Infrared and Laser Engineering, 2021, 50(5): 20200364. doi: 10.3788/IRLA20200364

采用卷积神经网络的红外和可见光图像块匹配

doi: 10.3788/IRLA20200364
  • CLC number: TP391


More Information
    Author Bio:

    Mao Yuanhong, male, Ph.D. candidate; research interests: computer vision, machine learning, and artificial intelligence

    He Zhanzhuang, male, research fellow, doctoral supervisor, Ph.D.; research interests: embedded system architecture, computer operating systems, and computer control

    Corresponding author: Ma Zhong, male, senior engineer, postdoctoral researcher; research interests: computer vision, machine learning, and artificial intelligence.
  • Abstract: Infrared-visible image patch matching is widely used in tasks such as visual navigation and target recognition. Because infrared and visible sensors have different imaging principles, matching infrared and visible image patches is more challenging. Deep learning has achieved strong performance in patch matching for visible images, but it has rarely been applied to infrared-visible image patches. This paper proposes an infrared-visible image patch matching network based on convolutional neural networks, consisting of a feature extraction part and a feature matching part. In feature extraction, contrastive and triplet loss functions maximize the feature distance between patches of different classes and reduce the distance within the same class, so that the network focuses on the common features of the patches and ignores the differences between infrared and visible imaging. In infrared and visible images, spatial features at different scales provide richer region and contour information, and fusing the high-level and low-level features of infrared and visible image patches effectively improves feature representation. Compared with previous convolutional matching networks, the improved network increases accuracy by 9.8%.

  • [1] Yang Weiping, Shen Zhenkang. Matching technique and its application in aided inertial navigation [J]. Infrared and Laser Engineering, 2007, 36(S2): 15-17. (in Chinese) doi:  10.3969/j.issn.1007-2276.2007.z2.003
    [2] Li Hongguang, Ding Wenrui, Cao Xianbin, et al. Image registration and fusion of visible and infrared integrated camera for medium-altitude unmanned aerial vehicle remote sensing [J]. Remote Sensing, 2017, 9(5): 441. doi:  10.3390/rs9050441
    [3] Wang Ning, Zhou Ming, Du Qinglei. A method for infrared visible image fusion and target recognition [J]. Journal of Air Force Early Warning Academy, 2019, 33(5): 328-332.
    [4] Mao Yuanhong, He Zhanzhuang, Ma Zhong. Infrared target classification with reconstruction transfer learning [J]. Journal of University of Electronic Science and Technology of China, 2020, 49(4): 609-614. (in Chinese)
    [5] Lowe D G. Distinctive image features from scale-invariant keypoints [J]. International Journal of Computer Vision, 2004, 60(2): 91-110. doi:  10.1023/B:VISI.0000029664.99615.94
    [6] Bay H, Tuytelaars T, Gool L V. SURF: Speeded up robust features[C]//European Conference on Computer Vision, 2006, 3951: 404–417.
    [7] Rublee E, Rabaud V, Konolige K, et al. ORB: An efficient alternative to SIFT or SURF[C]//International Conference on Computer Vision, 2011: 2564-2571.
    [8] Sima A A, Buckley S J. Optimizing SIFT for matching of short wave infrared and visible wavelength images [J]. Remote Sensing, 2013, 5(5): 2037-2056. doi:  10.3390/rs5052037
    [9] Li D M, Zhang J L. A improved infrared and visible images matching based on SURF [J]. Applied Mechanics and Materials, 2013, 2418(651): 1637-1640. doi:  10.4028/www.scientific.net/AMM.325-326.1637
    [10] Chao Zhiguo, Wu Bo. Approach on scene matching based on histograms of oriented gradients [J]. Infrared and Laser Engineering, 2012, 41(2): 513-516. (in Chinese) doi:  10.3969/j.issn.1007-2276.2012.02.044
    [11] Cao Zhiguo, Yan Ruicheng, Song Jie. Approach on fuzzy shape context matching between infrared images and visible images [J]. Infrared and Laser Engineering, 2008, 37(12): 1095-1100. (in Chinese)
    [12] Jiao Anbo, Shao Liyun, Li Chenxi, et al. Automatic target recognition algorithm based on affine invariant feature of line grouping [J]. Infrared and Laser Engineering, 2019, 48(S2): S226003. (in Chinese) doi:  10.3788/IRLA201948.S226003
    [13] Han X, Leung T, Jia Y, et al. MatchNet: Unifying feature and metric learning for patch-based matching[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 3279-3286.
    [14] Zagoruyko S, Komodakis N. Learning to compare image patches via convolutional neural networks[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 4353-4361.
    [15] Hanif M S. Patch match networks: Improved two-channel and Siamese networks for image patch matching [J]. Pattern Recognition Letters, 2019, 120: 54-61. doi:  10.1016/j.patrec.2019.01.005
    [16] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[C]//ICLR 2015: International Conference on Learning Representations, 2015.
    [17] Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 1-9.
    [18] Hadsell R, Chopra S, LeCun Y. Dimensionality reduction by learning an invariant mapping[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), 2006, 2: 1735-1742.
    [19] Schroff F, Kalenichenko D, Philbin J. FaceNet: A unified embedding for face recognition and clustering[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 815-823.
    [20] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks[C]//Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010: 249-256.
    [21] Van der Maaten L, Hinton G. Visualizing data using t-SNE [J]. Journal of Machine Learning Research, 2008, 9(86): 2579-2605.
Publication history
  • Received: 2020-09-20
  • Revised: 2020-11-04
  • Published: 2021-05-21


English Abstract

    • Infrared-visible image patches matching is a fundamental task of infrared-visible image processing. It compares the object or region by analyzing the similarity of content, features, structures, relationships, textures, and grayscales in infrared-visible images. The infrared-visible image matching is often used as a subroutine that plays an important role in a wide variety of applications, such as visual navigation[1-2] and target recognition[3-4].

      Infrared-visible image patch matching is more challenging than matching traditional visible images. Since infrared and visible sensors use different imaging principles, images taken by multiple sensors differ more than those taken by a single sensor. Object edges are blurred in infrared images, and fewer texture and color features are found on the objects. Infrared-visible image pairs also exhibit significant grayscale distortion and illumination change.

      Hand-crafted descriptors such as SIFT[5], SURF[6], and ORB[7] are used to extract features. Features extracted with these descriptors should be invariant to illumination, rotation, scale, and affine transformation. After feature extraction, patch matching is predicted by comparing feature similarity. In traditional infrared-visible imaging systems, most work has focused on improving infrared and visible image descriptors. Sima[8] optimized SIFT for infrared-visible images. Li[9] detected object edges and extracted SURF features to match infrared-visible images. Chao Zhiguo[10] proposed a matching method using histograms of oriented gradients as the matching feature and the correlation coefficient as the similarity measure. Cao Zhiguo[11] adopted a shape-context approach to match infrared-visible images based on their similar shapes. Jiao Anbo[12] proposed an image matching algorithm using line-grouping geometric primitives for infrared-visible template matching.

      Hand-crafted descriptors must be continuously reworked for each new application to extract efficient features. Moreover, feature extraction and similarity measurement are two independent, unrelated stages that cannot be optimized end-to-end. With the widespread application of deep learning in computer vision, patch matching based on deep learning has become a trend. MatchNet[13] extracts image features with two CNN branches and uses two fully connected (FC) layers to determine whether the extracted features are similar. Deep Compare Network[14] compares image patches with Siamese, 2-channel, and pseudo-Siamese models. Patch match networks[15] proposed improved two-channel and Siamese architectures to compare visible image patches. The networks above have achieved excellent performance on visible images. However, they do not handle infrared-visible patch matching well, because the patches come from different imaging principles. A new deep neural network is needed to achieve better performance in infrared-visible image matching.

      This paper proposes an infrared-visible image deep matching network (InViNet) to tackle these challenges above. Two CNN branches extract the infrared and visible image features independently. The full connection layers compare their similarity.

      In infrared-visible image patch matching, we argue that the differences between unrelated patches are still larger than those within similar patches, even though the infrared and visible images are taken by different sensors. The feature extraction subnetwork uses the contrastive loss and triplet loss to maximize the feature distance between unrelated patches and minimize it within similar patches. This makes the distribution of high-level features more compact within classes and more separated between classes.

      For infrared images, regions and shapes remain essential references in infrared-visible image matching, so it is necessary to integrate spatial features with semantic features. We combine multi-scale spatial features with high-level features to enhance performance. Compared to previous CNNs, our method increases accuracy from 78.95% to 88.75%.

    • Our network mainly consists of two parts, the feature extraction network and the metric network, as shown in Fig.1. The feature extraction network extracts features from infrared and visible images; the metric network measures the similarity of the features.

      Figure 1.  Infrared-visible image deep matching network. The black line with the arrow indicates the data-flow. The blue lines represent shortcut connections through the reshape layers. This figure describes the process of the infrared-visible image patches matching

      The feature extraction network extracts the distinguishing features of visible and infrared images. Infrared and visible images are input into two VGG16[16] branches, which constitute a Siamese network. To be compatible with the visible image's three channels, we copy the infrared image into three channels as the other branch's input. The weights are shared between the two branches. A single VGG16 branch has five blocks and two FC layers; a block consists of two or three convolution layers, an activation layer, and a pooling layer. In each branch, we retain the original FC6 and FC7 layers of VGG16, for two reasons. First, the FC7 layer produces a 1×4096-dimensional feature rather than the 7×7×512 output of the Conv5 block, which significantly reduces the parameters and computation in the metric network. Second, we find in training that a branch with FC layers performs better than one without.
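The three-channel replication of the infrared input can be sketched in a few lines of NumPy (the array names are illustrative; the actual implementation is in Caffe):

```python
import numpy as np

def to_three_channels(ir_patch):
    """Replicate a single-channel infrared patch (H, W) into (3, H, W)
    so it matches the 3-channel input expected by a VGG16 branch."""
    return np.repeat(ir_patch[np.newaxis, :, :], 3, axis=0)

ir = np.random.rand(224, 224)   # one infrared patch
x = to_three_channels(ir)
print(x.shape)                  # (3, 224, 224)
```

All three channels are identical copies, so the branch sees a valid RGB-shaped tensor without inventing color information.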

      For infrared and visible images, although the imaging principles differ, the same target is very similar in semantic features. Therefore, the branches share network weights in our design. We believe deep convolutional networks have strong feature representation capacity and can extract common features from infrared and visible images. Network branches trained with contrastive or triplet loss traditionally share weights; the shared weights map high-level features into the same feature space for distance comparison.

      The metric network is composed of two FC layers with the softmax loss as the objective function. It estimates the probability that the visible and infrared patches are similar. Ideally, the prediction is 1 if they match and 0 if they do not.
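The final decision reduces to a two-way softmax over the metric network's output; a NumPy sketch of that last step (the logits here are illustrative values, not network outputs):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

# Illustrative 2-way output of the metric network: [non-match, match]
logits = np.array([0.5, 2.0])
probs = softmax(logits)
print(probs.argmax())   # 1 -> predicted as a matching pair
```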

    • Compared with visible images, infrared images have no color, less texture, and usually blurred edges. However, objects still retain rough outlines and region information in infrared images, and these outlines and shapes are common to visible and infrared images. We therefore believe that spatial information in infrared images is essential for matching, and that it is necessary to integrate spatial features with semantic features to enhance feature representation.

      On the other hand, the hierarchical structure of a deep neural network makes it natural to extract features at multiple scales. Features extracted from the low-level layers are similar to those produced by hand-crafted descriptors such as SIFT and SURF. As the CNN layers deepen, the feature maps focus less on the imaging difference, and semantic features gradually emerge in the high-level layers. In our network, the multi-scale features are input into the metric network, so it can use more comprehensive information for similarity decisions. Each block connects directly to the input of the metric network, preserving more multi-scale spatial information for similarity comparison, as shown in Fig.2.

      Figure 2.  Multi-scale spatial feature integration in a single branch. The output feature map in each block shorts to the concatenation layer. The output of the concatenation layer is one input of the metric network

      In multi-scale spatial feature extraction, two problems must be solved. First, the shortcut feature should keep the original feature-map size of each block, to preserve spatial information. Second, the shortcut feature's dimensionality should not be too high after it is reshaped into a vector; excessively high dimensionality results in vast parameters and heavy computation in the metric network.

      The 1×1 convolution, widely used in GoogLeNet[17], is adopted in our network to solve these problems. The multi-channel feature maps are compressed into a single-channel feature map, which preserves the spatial information while avoiding excessively high dimensionality. To connect features of different dimensions, each map is converted into a vector of length N×N with a reshape layer, where N is the size of the corresponding feature map. All multi-scale feature maps, together with the semantic feature from FC7, are concatenated as the input of the metric network, which receives inputs from both the infrared branch and the visible branch.
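A minimal NumPy sketch of this compress-reshape-concatenate step (the weights are random stand-ins for the learned 1×1 filters; the channel counts follow VGG16's early blocks):

```python
import numpy as np

def conv1x1(feature_map, weights):
    """Compress a (C, N, N) feature map to (N, N) with a 1x1 convolution:
    a weighted sum over channels at each spatial location."""
    # tensordot over the channel axis == 1x1 conv with one output channel
    return np.tensordot(weights, feature_map, axes=([0], [0]))

# Illustrative block outputs at three scales (channel counts as in VGG16)
blocks = [np.random.rand(64, 112, 112),
          np.random.rand(128, 56, 56),
          np.random.rand(256, 28, 28)]
compressed = [conv1x1(f, np.random.rand(f.shape[0])) for f in blocks]
# Reshape each N x N map to a length-N*N vector and concatenate
multi_scale = np.concatenate([c.reshape(-1) for c in compressed])
print(multi_scale.shape)   # (112*112 + 56*56 + 28*28,) = (16464,)
```

Each scale keeps its full spatial grid but contributes only one channel, so the concatenated vector stays manageable for the metric network.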

    • As shown in Fig.3, the feature extraction network consists of two branches that are identical in structure and share weights. A visible image and an infrared image make up an image pair. The contrastive loss was first used for dimensionality reduction[18]; here, it is the objective function used to train the two branches.

      Figure 3.  (a) Feature extraction network architecture with the contrastive loss; (b) Input data for feature extraction network with the contrastive loss. The visual patches are in the first row. The infrared patches are in the second row. The positive samples are in odd columns. The negative ones are in even columns

      The contrastive loss is shown in Eq.(1).

      $$l({x_1},{x_2}) = \left\{ \begin{array}{ll} d(f({x_1}),f({x_2})), & {p_1} = {p_2}\\ \max (0,\;{\rm{margin}} - d(f({x_1}),f({x_2}))), & {p_1} \ne {p_2} \end{array} \right.$$ (1)

      where $d(f({x_1}),f({x_2}))$ represents the Euclidean distance between the two sample features; ${p_1}$ is the label of the input visible image and ${p_2}$ is the label of the input infrared image. ${p_1} = {p_2}$ denotes a similar patch pair; ${p_1} \ne {p_2}$ denotes an unrelated pair. The margin in Eq.(1) is a threshold: the minimum distance by which unrelated features should be separated. In our experiment, the margin is set to 1.
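The piecewise loss of Eq.(1) can be sketched directly (a NumPy illustration with toy feature vectors, not the training code):

```python
import numpy as np

def contrastive_loss(f1, f2, same_class, margin=1.0):
    """Contrastive loss of Eq.(1): pull similar pairs together,
    push unrelated pairs at least `margin` apart."""
    d = np.linalg.norm(f1 - f2)          # Euclidean distance of features
    return d if same_class else max(0.0, margin - d)

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.1])                 # close to a (distance 0.1)
print(contrastive_loss(a, b, same_class=True))    # 0.1: small, as desired
print(contrastive_loss(a, b, same_class=False))   # 0.9: unrelated pair too close
```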

    • As shown in Fig.4(a), the network consists of three branches that are identical in structure and share weights. A visible patch (anchor sample), an infrared patch (positive sample), and another infrared patch (negative sample) form a triplet, and one triplet is input at a time to train the feature extraction network. The triplet loss was first used for face recognition[19]; here, it is the objective function used to train the three branches.

      Figure 4.  (a) Feature extraction network architecture with the triplet loss; (b) Input data for feature extraction network with the triplet loss. The anchor patches are in the first row. The positive patches are in the second row. The negative patches are in the third row. Each column is triple patches input

      The triplet loss is shown in Eq.(2).

      $$\max (d(f({x_a}),f({x_p})) - d(f({x_a}),f({x_n})) + {\rm{margin}},\;0)$$ (2)

      The input data comprise an anchor sample (${x_a}$), a positive sample (${x_p}$), and a negative sample (${x_n}$). $d(f({x_a}),f({x_p}))$ is the Euclidean distance between the anchor and the positive sample; $d(f({x_a}),f({x_n}))$ is the distance between the anchor and the negative sample. Optimizing this function makes the anchor-positive distance smaller than the anchor-negative distance. The anchor is randomly selected from the sample set; the positive sample belongs to the same class as the anchor, while the negative sample belongs to a different class.

      $$d(f({x_a}),f({x_p})) + {\rm{margin}} \leqslant d(f({x_a}),f({x_n}))$$ (3)

      Eq.(3) shows that a margin separates $d(f({x_a}),f({x_p}))$ from $d(f({x_a}),f({x_n}))$ to distinguish positive and negative samples. Unlike the contrastive loss, the triplet loss compares the distances of positive and negative samples within a single forward and backward pass. Compared with visible image matching, the imaging difference of the same object is relatively large in multi-source patch matching, so we find that a larger margin achieves better performance in our experiment. The margin is set to 3 for the best performance.
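Eq.(2) and the margin condition of Eq.(3) can be illustrated with a small NumPy sketch (the vectors are toy values chosen so the distances are easy to read):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=3.0):
    """Triplet loss of Eq.(2); margin=3 is the value used in the paper."""
    d_ap = np.linalg.norm(f_a - f_p)   # anchor-positive distance
    d_an = np.linalg.norm(f_a - f_n)   # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)

anchor   = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])        # distance 1 from the anchor
negative = np.array([5.0, 0.0])        # distance 5 from the anchor
print(triplet_loss(anchor, positive, negative))   # 0.0: Eq.(3) holds (1 + 3 <= 5)
print(triplet_loss(anchor, negative, positive))   # 7.0: violated triplet is penalized
```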

    • No infrared-visible image patch matching dataset is publicly available, so we collected image pairs ourselves. For data acquisition, the visible camera is the default equipment of a DJI UAV, and the infrared camera is manufactured by FLIR; its wavelength ranges from 7.5 to 13.5 µm. For image resolution, the UAV acquires infrared and visible images at different altitudes, so the proportion of the same target to the image size in the original images is 0.8×, 0.5×, and 0.25×, respectively. In preprocessing, we crop the target area from the original images and resize the network inputs to 224×224. Therefore, images of different resolutions are used during training and testing.

      Our dataset contains 2 000 images in 25 classes. For scene selection, the targets captured by the UAV should differ in shape and outline; the classes cover bridges, buildings, roads, parking lots, factories, houses, towers, gas storage tanks, etc., as shown in Fig.5. The ratio of visible to infrared images is 1∶1. 80% of the images are used for training and the rest for testing. A sample comprises an infrared patch and a visible patch: if the pair is similar, it is a positive sample with ground truth 1; otherwise it is a negative sample with ground truth 0. In both the training and test sets, the ratio of positive to negative samples is 1∶1.
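Pair construction with a 1∶1 positive-to-negative ratio can be sketched as follows (class names and patch identifiers are illustrative placeholders, not the actual dataset):

```python
import random

def make_pairs(classes, pairs_per_class=2, seed=0):
    """Build an equal number of positive (label 1) and negative (label 0)
    infrared-visible pairs. `classes` maps class name ->
    (visible patches, infrared patches)."""
    rng = random.Random(seed)
    names = list(classes)
    samples = []
    for name in names:
        vis, ir = classes[name]
        for _ in range(pairs_per_class):
            # positive: visible and infrared patch of the same target
            samples.append((rng.choice(vis), rng.choice(ir), 1))
            # negative: infrared patch drawn from a different class
            other = rng.choice([n for n in names if n != name])
            samples.append((rng.choice(vis), rng.choice(classes[other][1]), 0))
    return samples

classes = {"bridge": (["b_v1", "b_v2"], ["b_i1", "b_i2"]),
           "tower":  (["t_v1"], ["t_i1"])}
pairs = make_pairs(classes)
labels = [p[2] for p in pairs]
print(len(pairs), labels.count(1), labels.count(0))   # 8 4 4
```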

      Figure 5.  Infrared-visible image samples. Ten image pairs were randomly selected. The ground truth of the first five columns is 0; the ground truth of the last five columns is 1

    • InViNet with two-stage training outperforms a traditionally trained classification network. In two-stage training, the feature network improves feature representation, which significantly increases the accuracy of the metric network in the later stage. By comparing InViNet with and without shortcut connections, we confirm that low-level spatial features are a useful complement to high-level semantic information. We use the following settings to train our network in two stages.

      The feature extraction network is trained in the first stage. Its branches are initialized with VGG16 weights trained on the ImageNet dataset; new or modified layers are initialized with the Xavier[20] method. The low-level filters of VGG16 are widely acknowledged to capture generic shallow features, while higher-level features are more task-specific, so the learning-rate multiplier of each layer is set differently. The learning rate of each layer is the base learning rate multiplied by its multiplier. The base learning rate is 10−3. The multiplier is 0.01 in Block1, Block2, and Block3, and 0.05 in Block4 and Block5; the multiplier of FC6 and FC7 remains 1. Since all branches share weights, only one copy of the weights exists in the feature extraction network. The optimizer is SGD with momentum 0.9. The mini-batch size is 16, the number of epochs is 2 000, and the weight decay is 10−4.
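The first-stage per-layer learning rates can be written out as a small sketch (layer names follow the description above; the dictionary layout is illustrative, not Caffe's actual configuration format):

```python
def layer_learning_rates(base_lr=1e-3):
    """Effective per-layer learning rate in the first training stage:
    base learning rate times each layer's multiplier."""
    multipliers = {"Block1": 0.01, "Block2": 0.01, "Block3": 0.01,
                   "Block4": 0.05, "Block5": 0.05,
                   "FC6": 1.0, "FC7": 1.0}
    return {layer: base_lr * m for layer, m in multipliers.items()}

lrs = layer_learning_rates()
print(lrs["Block1"], lrs["Block4"], lrs["FC7"])
```

Early blocks move a hundred times slower than the FC layers, so the generic low-level filters are barely disturbed while the task-specific layers adapt.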

      The metric network and shortcut connections are trained in the second stage. The weights trained in the first stage initialize the branches, which change only slightly during this training: their learning-rate multipliers are less than 10−2, with a base learning rate of 10−3. New layers are initialized with the Xavier method, and their multipliers are 1 in the metric network and shortcut layers. The number of epochs is increased to 2 500. The remaining training parameters are the same as in the first stage.

      All experiments run on a computer equipped with Nvidia TITAN XP GPU. Our experiment is implemented with Caffe.

    • To validate our approach, we implemented the following experiments with different network architectures.

      (1) Traditional method[9]. We enhance object edges and use SURF to extract infrared-visible image features. Similarity is measured by matching the feature points of the infrared and visible images.

      (2) Baseline network. MatchNet[13] is used as the baseline. The softmax loss directly optimizes the whole network in a single phase, and the two VGG16 branches are trained from scratch. The network over-fits quickly.

      (3) MatchNet[13] (F). MatchNet improved with fine-tuning. Unlike the baseline, its VGG16 branches are initialized with weights trained on the ImageNet dataset.

      (4) Pseudo-SiamNet[14] (F). The pseudo-Siamese Deep Compare Network improved with fine-tuning. In the two VGG16 branches, Conv1, Conv2, and Conv3 use their own weights, whereas Conv4 and Conv5 share weights. The weights are also initialized from pretrained VGG16 to avoid over-fitting.

      (5) InViNet (F+C). InViNet with fine-tuning and contrastive loss, trained in the two phases described in Sec. 2.2.

      (6) InViNet (F+C+S). InViNet with fine-tuning, contrastive loss, and shortcut connection. The network adds shortcut connections.

      (7) InViNet (F+T+S). InViNet with fine-tuning, triplet loss, and shortcut connections. This network mainly compares the triplet loss with the contrastive loss.

      The ROC curve is commonly used to measure binary classification performance while avoiding the imbalance between positive and negative samples. The usual evaluation metric is the false positive rate at 95% recall (Error@95%); the lower, the better. ROC curves for the different methods, drawn from the experimental results, are shown in Fig.6.
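Computing the FPR at 95% recall from raw matching scores can be sketched as follows (a NumPy illustration with toy scores; the real evaluation uses the test-set predictions):

```python
import numpy as np

def fpr_at_recall(scores, labels, recall=0.95):
    """False positive rate at the threshold achieving the requested recall
    (Error@95% in the text): take the smallest score threshold that still
    recovers `recall` of the positives, then measure the FPR there."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos_scores = np.sort(scores[labels == 1])[::-1]   # positives, high to low
    k = int(np.ceil(recall * len(pos_scores)))        # positives to recover
    threshold = pos_scores[k - 1]
    neg = scores[labels == 0]
    return float(np.mean(neg >= threshold))

# Toy scores: positives mostly high, negatives mostly low
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.65, 0.2, 0.1, 0.05, 0.4]
labels = [1,   1,   1,   1,   1,   0,    0,   0,   0,    0]
print(fpr_at_recall(scores, labels))   # 0.4: 2 of 5 negatives pass threshold 0.3
```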

      Figure 6.  ROC curves for various methods. The numbers in the legends are FPR95 values. In the legend, the symbol “F” means the network uses fine-tuning with VGG16. The symbol “C” means that the contrastive loss is used in the extraction feature network. The symbol “T” means that the triplet loss is used in the extraction feature network. The symbol “S” means that shortcut connection is used

      From our experiments, the following conclusions can be summarized.

      (1) In infrared-visible image patch matching, it is hard for traditional methods to extract common features from infrared and visible images due to the different imaging principles, and the result is unsatisfactory.

      (2) With few samples, training from scratch easily leads to over-fitting. With fine-tuning, all deep learning networks outperform the traditional algorithm; fine-tuning effectively avoids over-fitting.

      (3) The pseudo-Siamese network performs better than the Siamese network. A likely explanation is that the low-level convolution layers do not share weights in the pseudo-Siamese network: given the different imaging principles of infrared and visible images, the two separate branches can each extract their own shallow features.

      To be concrete, we visualize the learned deep features using t-SNE[21], a common tool for visualizing high-dimensional data. As Fig.7 shows, our approach effectively reduces the intra-class distance and enlarges the inter-class distance, which benefits patch matching.

      Figure 8 shows some top-ranking correct and incorrect InViNet results. We find that the incorrect results could easily be mistaken by a human as well.

      Figure 8.  Top-ranking false and true results in overpass and factory image patches. (a) True positive samples; (b) True negative samples; (c) False positive samples; (d) False negative samples

      To further analyze our results, we report the mean average precision (MAP) on the test set, which contains five classes never used during training. As shown in Fig.9, InViNet outperforms the other approaches.

      Figure 7.  Visualization of the five class features in the test data set by the feature extraction network. (a) Features from the original network; (b) Features from the network with the contrastive loss; (c) Features from the network with the triplet loss

      Figure 9.  Matching performance on the test data set. In the legend, the symbols “F”, “C”, “T” and “S” have the same meaning as in Fig. 6

    • Given the difficulty of infrared-visible image patch matching, this paper proposes an improved network based on deep learning. Compared to the previous method, our approach increases accuracy from 78.95% to 88.75%. At present, visible-infrared image pairs are difficult to obtain: many multi-sensor datasets are available on the Internet, but they are not fully utilized because they lack corresponding similar visible images. We believe that unsupervised learning can exploit these multi-sensor images to further improve matching performance in the future.
