Landmark recognition based on ArcFace loss and multiple feature fusion
2020, Vol. 25, No. 8, Pages: 1567-1577
Received: 2019-08-11; Revised: 2019-12-13; Accepted: 2019-12-20; Published in print: 2020-08-16
DOI: 10.11834/jig.190418
Objective
Landmark recognition is an applied problem in the image and vision field. To address the weaknesses of single features in landmark recognition, such as global features being sensitive to viewpoint changes and local features being sensitive to illumination changes, this paper proposes a weakly supervised landmark recognition model based on the additive angular margin loss (ArcFace loss) that fuses multiple features.
Method
The recognition task is completed by image retrieval, taking the Top-1 result. The valid range of the ArcFace loss parameters is first proven and then used as the basis for parameter selection during model training; an effective method for fusing local and global features is then used to obtain image features for retrieval. Model training consists of two steps: the first step fine-tunes ImageNet-pretrained weights on the Google landmarks dataset with the ArcFace loss, and the second step adds an attention mechanism and trains the attention network. Inference is divided into three parts: extracting global features, obtaining local features, and feature fusion. Specifically, for an input query image, the global feature is first extracted from the feature embedding layer of the fine-tuned convolutional neural network; local features are then extracted from an intermediate layer of the network using the attention mechanism; finally, the two feature vectors are concatenated horizontally, and image retrieval returns the database result most similar to the current query image.
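For reference, the ArcFace loss (Deng et al., 2019) used to fine-tune the trunk has the form

$$
L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\ne y_i}e^{s\cos\theta_j}}
$$

where $\theta_j$ is the angle between the $i$-th embedding and the weight vector of class $j$, $y_i$ is the ground-truth class, $s$ is the feature scale, and $m$ is the additive angular margin whose valid range this paper derives.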
Result
Experimental results show that, on the Paris and Oxford buildings datasets, the feature fusion method enables a shallow network to reach the performance of a deep pretrained network, with the fused features improving the mean average precision (mAP) by about 1% over global features alone. Experiments also show that no additional feature whitening is needed on the neural network embedding features. Finally, the model achieves fairly satisfactory results on city-scale street view images.
Conclusion
The model is trained with the ArcFace loss and makes the similarity results of multiple features complement one another effectively, which improves the model's robustness to interference in practical application scenarios.
Objective
Landmark recognition, a relatively new application in computer vision, has been increasingly investigated in the past several years and is widely used to implement landmark image recognition within image retrieval. However, many problems remain unsolved; for example, global features are sensitive to viewpoint changes, while local features are sensitive to illumination changes. Most existing methods use convolutional neural networks (CNNs) to extract image features in place of traditional feature extraction methods such as the scale-invariant feature transform (SIFT) or speeded-up robust features (SURF). At present, the best model is deep local features (DeLF), but its retrieval requires combining product quantization (PQ) with K-dimensional (KD) trees. The process is complex, consumes approximately 6 GB of GPU memory, and its most time-consuming step is random sample consensus (RANSAC), which makes it unsuitable for rapid deployment and use.
Method
A multiple feature fusion method is needed to address the problems of any single feature; multiple features can be horizontally concatenated into a single vector to improve on the performance of CNN global features. For large-scale landmark data, manually labeling images is time-consuming and laborious, and labeling carries artificial cognitive bias. To minimize human labeling work, a weakly supervised loss such as the additive angular margin loss (ArcFace loss), which improves on the standard cross-entropy loss by moving the margin from Euclidean distance to the angular domain, is used to train the model with image-level annotations. The ArcFace loss performs well in face recognition and image classification and is easy to use in other deep learning applications. This paper provides the valid range of the ArcFace loss parameters together with the proof.
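A minimal PyTorch sketch of an ArcFace classification head is given below for concreteness. It follows Deng et al. (2019); the class name ArcFaceHead and the defaults s = 64, m = 0.5 are illustrative assumptions, whereas the paper itself derives the valid range for these parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin head (Deng et al., 2019).

    Produces scaled logits from L2-normalized embeddings; the output
    is fed to a standard nn.CrossEntropyLoss.
    """
    def __init__(self, embedding_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # Cosine of the angle between each embedding and each class center.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # The additive angular margin m is applied to the target class only.
        is_target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(is_target, torch.cos(theta + self.m), cosine)
        return self.s * logits
```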
Thus, a weakly supervised recognition model based on the ArcFace loss and multiple feature fusion is proposed for landmark recognition. The proposed model uses ResNet50 as its trunk, and training has two steps: fine-tuning the trunk and training the attention layer. Fine-tuning uses the Google landmarks dataset, starting from weights pretrained on the ImageNet dataset. The average pooling layer is replaced by a generalized mean (GeM) pooling layer, which has been proven useful in image retrieval. The attention mechanism is built from two convolutional layers with 1×1 kernels, so that the trained features focus on the local features that are needed.
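The two components just mentioned could be sketched in PyTorch as follows; the 512-channel hidden layer, the ReLU/softplus activations, and the initial p = 3 for GeM are assumptions of this sketch rather than details stated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized mean pooling (Radenović et al., 2018a): p = 1 is
    average pooling, and p -> infinity approaches max pooling."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))  # p is learned
        self.eps = eps

    def forward(self, x):  # x: (B, C, H, W) feature map
        pooled = F.avg_pool2d(x.clamp(min=self.eps).pow(self.p), x.size()[-2:])
        return pooled.pow(1.0 / self.p).flatten(1)  # (B, C)

class AttentionHead(nn.Module):
    """Two 1x1 convolutions that score each spatial location of a
    mid-level feature map, as described in the text."""
    def __init__(self, in_channels, hidden_channels=512):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden_channels, kernel_size=1)
        self.conv2 = nn.Conv2d(hidden_channels, 1, kernel_size=1)

    def forward(self, features):  # features: (B, C, H, W)
        scores = self.conv2(F.relu(self.conv1(features)))  # (B, 1, H, W)
        return F.softplus(scores)  # non-negative attention map
```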
Image preprocessing is required before training and consists of three stages: center crop, resize, and random crop. People usually place buildings and themselves at the center of images, so a center crop sidesteps the problems that arise from padding or plain resizing. The proposed model uses classification training to complete the image retrieval task. The final input image size is set to 448×448 pixels, a compromise value: input sizes in image retrieval are usually 800×800 to 1 500×1 500 pixels, while classification sizes are 224×224 to 300×300 pixels. During training, the image is first center cropped and then resized to 500×500 pixels, which allows the data to be augmented by random cropping. For inference, the image is center cropped and directly resized to 448×448 pixels, so that only two operations are needed.
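The two pipelines might look as follows in torchvision; the square crop on the shorter side and the default interpolation are assumptions of this sketch.

```python
from torchvision import transforms
import torchvision.transforms.functional as TF

# Square center crop on the shorter side (PIL size is (W, H)).
square_crop = transforms.Lambda(lambda im: TF.center_crop(im, min(im.size)))

# Training: center crop -> resize to 500x500 -> random 448x448 crop.
train_tf = transforms.Compose([
    square_crop,
    transforms.Resize((500, 500)),
    transforms.RandomCrop(448),
    transforms.ToTensor(),
])

# Inference: center crop -> direct resize to 448x448 (two operations).
test_tf = transforms.Compose([
    square_crop,
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
])
```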
Inference is divided into three parts: extracting global features, obtaining local features, and feature fusion. For the input query image, the global feature is first extracted from the embedding layer of the CNN fine-tuned with the ArcFace loss; second, the attention mechanism extracts local features from the middle layer of the network, keeping only the local features whose attention scores exceed a threshold; finally, the two features are fused, and the results most similar to the current query image in the database are obtained through image retrieval.
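A minimal sketch of the fusion and Top-1 retrieval step, assuming each descriptor is L2-normalized before horizontal concatenation and that ranking uses cosine similarity (the abstract does not fix either choice):

```python
import torch
import torch.nn.functional as F

def fuse(global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
    """Horizontally concatenate the two L2-normalized descriptors.
    local_feat is assumed to be already aggregated into one vector."""
    return torch.cat([F.normalize(global_feat, dim=1),
                      F.normalize(local_feat, dim=1)], dim=1)

def top1(query: torch.Tensor, database: torch.Tensor) -> int:
    """Index of the most similar database image by cosine similarity;
    query is (1, D) and database is (N, D)."""
    sims = F.normalize(query, dim=1) @ F.normalize(database, dim=1).t()
    return int(sims.argmax(dim=1))
```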
Result
We compare the proposed model with several state-of-the-art models, including both traditional approaches and deep learning methods, on two public revisited datasets: the Oxford and Paris buildings datasets. The two datasets were reconstructed in 2018 and are split into three difficulty levels: easy, medium, and hard. Three groups of comparisons are run, all on the revisited Oxford and Paris datasets. The first group compares the proposed model's performance with that of other models, such as HesAff-rSIFT-VLAD and VggNet-NetVLAD. The second group compares the performance of the single global feature with that of the fused features. The last group compares the results of whitening the features extracted at different layers of the proposed model. Results show that the feature fusion method lets a shallow network achieve the effect of a deep pretrained network, and the mean average precision (mAP) increases by approximately 1% over the global features alone on the two datasets. The proposed model also achieves satisfactory results on urban street view images.
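For reference, the mAP reported above can be computed per query from the 0/1 relevance of the ranked retrieval list; this is the standard definition, not code released with the paper:

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query: precision is averaged over the ranks at
    which a relevant database image appears."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(rankings):
    """mAP: the mean of per-query AP values."""
    return float(np.mean([average_precision(r) for r in rankings]))

# Example: average_precision([1, 0, 1]) == (1/1 + 2/3) / 2 ~= 0.833
```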
Conclusion
In this study, we proposed a composite model that combines a CNN, an attention model, and a fusion algorithm that fuses the two types of features. Experimental results show that the proposed model performs well, that the fusion algorithm improves its performance, and that its results on urban street datasets confirm the practical application value of the proposed model.
Appalaraju S and Chaoji V. 2019. Image similarity using deep CNN and curriculum learning[EB/OL]. [2019-05-17]. https://arxiv.org/pdf/1709.08761.pdf
Arandjelovic R, Gronat P, Torii A, Pajdla T and Sivic J. 2016. NetVLAD: CNN architecture for weakly supervised place recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 5297-5307 [DOI: 10.1109/CVPR.2016.572]
Babenko A, Slesarev A, Chigorin A and Lempitsky V. 2014. Neural codes for image retrieval//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 584-599 [DOI: 10.1007/978-3-319-10590-1_38]
Berman M, Jégou H, Vedaldi A, Kokkinos I and Douze M. 2019. MultiGrain: a unified image embedding for classes and instances[EB/OL]. [2019-05-17]. https://arxiv.org/pdf/1902.05509.pdf
Deng J K, Guo J, Xue N N and Zafeiriou S. 2019. ArcFace: additive angular margin loss for deep face recognition[EB/OL]. [2019-05-17]. https://arxiv.org/pdf/1801.07698.pdf
Fischler M A and Bolles R C. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6): 381-395 [DOI: 10.1145/358669.358692]
Gordo A, Almazán J, Revaud J and Larlus D. 2016. Deep image retrieval: learning global representations for image search//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 241-257 [DOI: 10.1007/978-3-319-46466-4_15]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Hu W, Huang Y Y, Zhang F and Li R R. 2019. Noise-tolerant paradigm for training face recognition CNNs[EB/OL]. [2019-09-26]. https://arxiv.org/pdf/1903.10357.pdf
Jain H, Zepeda J, Pérez P and Gribonval R. 2018. Learning a complete image indexing pipeline//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4933-4941 [DOI: 10.1109/CVPR.2018.00518]
Jégou H, Douze M and Schmid C. 2011. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1): 117-128 [DOI: 10.1109/TPAMI.2010.57]
Lowry S, Sünderhauf N, Newman P, Leonard J J, Cox D, Corke P and Milford M J. 2016. Visual place recognition: a survey. IEEE Transactions on Robotics, 32(1): 1-19 [DOI: 10.1109/TRO.2015.2496823]
Ng J Y H, Yang F and Davis L S. 2015. Exploiting local features from deep networks for image retrieval//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Boston, USA: IEEE: 53-61 [DOI: 10.1109/CVPRW.2015.7301272]
Noh H, Araujo A, Sim J, Weyand T and Han B. 2017. Large-scale image retrieval with attentive deep local features//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 3476-3485 [DOI: 10.1109/ICCV.2017.374]
Ozaki K and Yokoo S. 2019. Large-scale landmark retrieval/recognition under a noisy and diverse dataset[EB/OL]. [2019-09-26]. https://arxiv.org/pdf/1906.04087.pdf
Radenović F, Iscen A, Tolias G, Avrithis Y and Chum O. 2018b. Revisiting Oxford and Paris: large-scale image retrieval benchmarking//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5706-5715 [DOI: 10.1109/CVPR.2018.00598]
Radenović F, Tolias G and Chum O. 2018a. Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7): 1655-1668 [DOI: 10.1109/TPAMI.2018.2846566]
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S A, Huang Z H, Karpathy A, Khosla A, Bernstein M, Berg A C and Li F F. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252 [DOI: 10.1007/s11263-015-0816-y]
Schroff F, Kalenichenko D and Philbin J. 2015. FaceNet: a unified embedding for face recognition and clustering//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 815-823 [DOI: 10.1109/CVPR.2015.7298682]
Sivic J and Zisserman A. 2003. Video Google: a text retrieval approach to object matching in videos//Proceedings of the 9th IEEE International Conference on Computer Vision. Nice, France: IEEE: 1470-1477 [DOI: 10.1109/ICCV.2003.1238663]
Wang H, Wang Y T, Zhou Z, Ji X, Gong D H, Zhou J C, Li Z F and Liu W. 2018. CosFace: large margin cosine loss for deep face recognition//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5265-5274 [DOI: 10.1109/CVPR.2018.00552]
Weyand T, Kostrikov I and Philbin J. 2016. PlaNet-photo geolocation with convolutional neural networks//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 37-55 [DOI: 10.1007/978-3-319-46484-8_3]
Zheng L, Yang Y and Tian Q. 2018. SIFT meets CNN: a decade survey of instance retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(5): 1224-1244 [DOI: 10.1109/TPAMI.2017.2709749]
Zhou S. 2017. Research on the Key Technologies of Landmark-based Pedestrian Navigation. Wuhan: China University of Geosciences