Selecting and fusing coarse-and-fine granularity features for fine-grained image recognition
2023, Vol. 28, No. 7: 2081-2092
Print publication date: 2023-07-16
DOI: 10.11834/jig.220052
Yang Zhimin, Song Wei. 2023. Selecting and fusing coarse-and-fine granularity features for fine-grained image recognition. Journal of Image and Graphics, 28(07):2081-2092
Objective
In fine-grained image recognition, intraclass pose variance is large, so commonalities with small intraclass variation must be found, which means the task depends on the fine-grained features of discriminative parts; interclass local differences are small, so more comprehensive interclass distinctions must be found, which means the task also requires the coarse-grained features of diverse parts. Existing methods focus mainly on part localization at coarse and fine granularities and do not consider how to select coarse- and fine-grained features or how to fuse features across granularities. We therefore propose a fine-grained image recognition method that selects and fuses coarse-and-fine granularity features.
Method
We design a fine-grained feature selection module that highlights the fine-grained discriminative features of parts through spatial and channel selection; we construct a coarse-grained feature selection module that, based on the parts selected by the fine-grained module, mines the semantic and positional relationships among the parts to obtain coarse-grained diverse features that supplement the fine-grained parts; and we fuse the fine- and coarse-grained features extracted by the two modules into complementary coarse-and-fine granularity representations to improve the accuracy of fine-grained image recognition.
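The abstract does not give the selection equations, so the following PyTorch sketch only illustrates the general idea of the fine-grained selection step: suppress weakly activated (background) locations, then keep the most responsive channels. The threshold rule, `keep_channels`, and the mean-based criterion are assumptions, not the authors' actual formulation.

```python
import torch

def select_fine_grained(feat: torch.Tensor, keep_channels: int = 256,
                        spatial_thresh: float = 1.0) -> torch.Tensor:
    """Hypothetical fine-grained selection over CNN features.

    feat: (B, C, H, W) convolutional feature maps from a pretrained backbone.
    Spatial selection: keep locations whose mean activation exceeds the
    image-level mean (a stand-in for the paper's unspecified criterion).
    Channel selection: keep the top-k channels by global average response
    (requires keep_channels <= C).
    """
    B, C, H, W = feat.shape

    # Spatial selection: mask out weakly activated (background) locations.
    act = feat.mean(dim=1, keepdim=True)                  # (B, 1, H, W)
    mask = (act > spatial_thresh * act.mean(dim=(2, 3), keepdim=True)).float()
    feat = feat * mask                                    # background suppressed

    # Channel selection: rank channels by global average activation, keep top-k.
    score = feat.mean(dim=(2, 3))                         # (B, C)
    idx = score.topk(keep_channels, dim=1).indices        # (B, k)
    gather = idx[:, :, None, None].expand(-1, -1, H, W)
    return feat.gather(1, gather)                         # (B, k, H, W)
```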
Result
Extensive experiments on three public benchmark datasets, CUB-200-2011 (Caltech-UCSD Birds-200-2011), Stanford Cars, and FGVC-Aircraft (fine-grained visual classification of aircraft), show that the proposed method achieves recognition accuracies of 90.3%, 95.6%, and 94.8%, respectively, clearly outperforming current mainstream fine-grained image recognition methods and improving on the best result among the comparison methods by 0.7%, 0.5%, and 1.4%.
Conclusion
The proposed method extracts both coarse- and fine-grained visual features while ensuring their discrimination and diversity, making fine-grained image recognition results more accurate.
Objective
With the rapid development of deep learning techniques, machine intelligence on general image recognition has reached or even surpassed human-level performance. This progress encourages researchers to tackle more complicated tasks in fine-grained vision. As one of the classic tasks in this field, fine-grained image recognition (FGIR) distinguishes subordinate categories, such as bird species, car models, and aircraft types. FGIR is more challenging than general image recognition because subordinate species often exhibit smaller interclass differences (e.g., similar geometry or texture) and larger intraclass variations (e.g., illumination or pose variations). Therefore, exploring subtle differences and reducing intraclass variations across the diverse parts of an object is the central challenge of FGIR. Most current FGIR methods address this challenge by locating discriminative parts at coarse and fine granularities, with the goal of generating powerful part representations for fine-grained recognition. Recent studies have demonstrated that part-based methods that locate diverse parts can distinguish different subclasses and thus enhance FGIR performance. Their success can largely be attributed to the ability to select and locate multiple clearly distinct parts for downstream recognition. Strongly supervised part-based methods leverage part annotations to establish the connections between parts, but this usually requires heavy manual labeling, which is time consuming and inefficient. By contrast, weakly supervised part-based methods show that the complementary relations between parts can be exploited in a learnable way. However, methods based on part-level features may not be robust to appearance distortion; e.g., the poses of bird heads are uncontrollable in real scenes, and such pose variations may undermine the validity of spatial features. To compensate for this deficiency of spatial features, multigranularity methods have been adopted for feature learning. However, these multigranularity methods devote little or no effort to determining at which granularities the diverse parts are most discriminative and how information across different granularities can be fused effectively for accurate recognition. Therefore, we argue that FGIR needs not only to efficiently select coarse-and-fine granularity features but also to effectively fuse them through part relations. To address these two limitations, we propose an FGIR method that selects and fuses coarse-and-fine granularity features for fine-grained recognition.
Method
In this study, an FGIR method is built on a pretrained convolutional neural network that extracts basic convolutional features, and feature selection and fusion are carried out via three modules. First, we design a fine-grained feature selection module that highlights the fine-grained discriminative features of parts through spatial and channel selection. Considering that the parts in a fine-grained image are usually spatially connected and activated in most feature channels, we perform spatial selection to discard the background and channel selection to highlight the informative channels of the convolutional features, which yields discriminative image representations. Second, a coarse-grained feature selection module is constructed to complement the subtle part features. Based on the parts selected by the fine-grained module, it mines the semantic and positional relationships among the parts to generate coarse-grained diverse features that provide context information for the fine-grained parts. Most previous methods focused on local subtle features for FGIR and ignored the potential influence of the relationship between parts and their coarse-grained context: they usually fed each part individually into a subnetwork or simply concatenated part features for recognition prediction. However, the relationships between parts, especially the semantic and positional relationships between parts and their coarse-grained contexts, contain valuable information that benefits feature learning and recognition. Therefore, inspired by the self-attention mechanism, this module constructs the relationship between each object part and its context from two perspectives, semantic modeling and spatial modeling, and attentively selects the information related to the object, thus providing coarse-grained diversity features for the fine-grained parts. Lastly, we design a coarse-and-fine granularity feature fusion module to improve the accuracy of the FGIR method. It establishes communication between the fine- and coarse-grained features and fuses them complementarily to form coarse-and-fine granularity representations.
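To make the coarse-grained relation modeling concrete, here is a minimal self-attention sketch over pooled part descriptors. The inputs (part vectors plus normalized box centers), the linear positional embedding, and the concatenation-based fusion are all assumptions standing in for the paper's unspecified design; the real module may differ in both inputs and attention form.

```python
import torch
import torch.nn as nn

class PartRelation(nn.Module):
    """Illustrative coarse-grained selection: self-attention over parts."""

    def __init__(self, dim: int = 512, heads: int = 4):
        super().__init__()
        self.pos_embed = nn.Linear(2, dim)           # encode (x, y) part centers
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)          # coarse + fine fusion

    def forward(self, parts: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # parts:   (B, N, dim) pooled part descriptors (fine-grained features)
        # centers: (B, N, 2) normalized part positions in [0, 1]
        tokens = parts + self.pos_embed(centers)     # positional relation
        coarse, _ = self.attn(tokens, tokens, tokens)  # semantic relation
        # Complementary fusion: concatenate the fine and coarse views per part.
        return self.fuse(torch.cat([parts, coarse], dim=-1))
```

A fused representation of this kind can then be pooled over the N parts and passed to the classifier, mirroring the "communication between fine- and coarse-grained features" described above.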
Result
Extensive experiments were carried out to verify the effectiveness of the proposed method. Specifically, we compared our method with seven state-of-the-art FGIR methods on three public benchmark datasets, namely, Caltech-UCSD Birds-200-2011 (CUB-200-2011), Stanford Cars, and fine-grained visual classification of aircraft (FGVC-Aircraft). The quantitative evaluation metrics were accuracy, mean average precision, and precision-recall curves (higher is better), and we also report the parameter counts and floating-point operations (lower is better) of several methods for comparison. The experimental results show that our method outperforms all compared FGIR methods on the three datasets, achieving recognition accuracies of 90.3%, 95.6%, and 94.8%, respectively. Compared with the best comparison result, obtained by progressive multigranularity (PMG) training, recognition accuracy improves by 0.7%, 0.5%, and 1.4% on the three datasets, respectively. Moreover, compared with PMG, floating-point operations decrease by 17.8 G and parameters by 17.2 M. We also conducted a series of ablation-style comparisons within our method to show the effectiveness of its individual modules. Furthermore, we used the widely adopted class activation mapping technique to visualize the recognition results of our method and other high-performing methods, enabling a fair qualitative comparison and an analysis of the successes and failures of the proposed method.
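For reference, class activation mapping (CAM), the visualization technique mentioned above, projects the classifier weights of the predicted class onto the final convolutional feature maps. The sketch below follows Zhou et al.'s original CAM formulation on a torchvision ResNet-50 stand-in; the authors' exact visualization setup and trained model are not specified in the abstract.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

# Stand-in backbone for illustration only (not the authors' trained network).
model = resnet50(weights="IMAGENET1K_V2").eval()

feats = {}
# Capture the last convolutional feature maps via a forward hook.
model.layer4.register_forward_hook(lambda m, i, o: feats.update(out=o))

@torch.no_grad()
def cam(image: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W) normalized input; returns an (H, W) heat map."""
    logits = model(image)
    cls = logits.argmax(dim=1)                       # predicted class
    fmap = feats["out"]                              # (1, 2048, h, w)
    w = model.fc.weight[cls]                         # (1, 2048) class weights
    heat = torch.einsum("bchw,bc->bhw", fmap, w)     # weighted sum of maps
    heat = F.relu(heat)
    heat = heat / (heat.max() + 1e-6)                # normalize to [0, 1]
    return F.interpolate(heat[None], size=image.shape[-2:],
                         mode="bilinear", align_corners=False)[0, 0]
```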
Conclusion
In this study, we propose an FGIR method that selects and fuses coarse-and-fine granularity features for fine-grained recognition. The experimental results show that our method outperforms several state-of-the-art FGIR methods, indicating that it can extract both coarse- and fine-grained visual features while ensuring their discrimination and diversity.
Keywords: fine-grained recognition; coarse-and-fine granularity; feature selection; feature fusion; discrimination; diversity
References
Branson S, van Horn G, Belongie S and Perona P. 2014. Bird species categorization using pose normalized deep convolutional nets [EB/OL]. [2022-02-14]. https://arxiv.org/pdf/1406.2952.pdf
Chang D L, Ding Y F, Xie J Y, Bhunia A K, Li X X, Ma Z, Wu M, Guo J and Song Y Z. 2020. The devil is in the channels: mutual-channel loss for fine-grained image classification. IEEE Transactions on Image Processing, 29: 4683-4695 [DOI: 10.1109/TIP.2020.2973812]
Ding Y, Zhou Y Z, Zhu Y, Ye Q X and Jiao J B. 2019. Selective sparse sampling for fine-grained image recognition//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 6598-6607 [DOI: 10.1109/ICCV.2019.00670]
Ding Y F, Ma Z Y, Wen S G, Xie J Y, Chang D L, Si Z W, Wu M and Ling H B. 2021. AP-CNN: weakly supervised attention pyramid convolutional neural network for fine-grained visual classification. IEEE Transactions on Image Processing, 30: 2826-2836 [DOI: 10.1109/TIP.2021.3055617]
Du R Y, Chang D L, Bhunia A K, Xie J Y, Ma Z Y, Song Y Z and Guo J. 2020. Fine-grained visual classification via progressive multi-granularity training of jigsaw patches//Proceedings of 2020 European Conference on Computer Vision. Glasgow, UK: Springer: 153-168 [DOI: 10.1007/978-3-030-58565-5_10]
Fu J L, Zheng H L and Mei T. 2017. Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4476-4484 [DOI: 10.1109/CVPR.2017.476]
Ge W F, Lin X R and Yu Y Z. 2019. Weakly supervised complementary parts models for fine-grained image classification from the bottom up//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3029-3038 [DOI: 10.1109/CVPR.2019.00315]
Hu T, Qi H G, Huang Q M and Lu Y. 2019. See better before looking closer: weakly supervised data augmentation network for fine-grained visual classification [EB/OL]. [2022-02-14]. https://arxiv.org/pdf/1901.09891.pdf
Jiang S Q, Min W Q, Liu L H and Luo Z D. 2020. Multi-scale multi-view deep feature aggregation for food recognition. IEEE Transactions on Image Processing, 29(1): 265-276 [DOI: 10.1109/TIP.2019.2929447]
Krause J, Stark M, Deng J and Li F F. 2013. 3D object representations for fine-grained categorization//Proceedings of 2013 IEEE International Conference on Computer Vision Workshops. Sydney, Australia: IEEE: 554-561 [DOI: 10.1109/ICCVW.2013.77]
Lam M, Mahasseni B and Todorovic S. 2017. Fine-grained recognition as HSnet search for informative image parts//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6497-6506 [DOI: 10.1109/CVPR.2017.688]
Lin T Y, RoyChowdhury A and Maji S. 2018. Bilinear convolutional neural networks for fine-grained visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6): 1309-1322 [DOI: 10.1109/TPAMI.2017.2723400]
Maji S, Rahtu E, Kannala J, Blaschko M and Vedaldi A. 2013. Fine-grained visual classification of aircraft [EB/OL]. [2022-02-14]. http://arxiv.org/pdf/1306.5151.pdf
Meng F M, Huang K X, Li H L, Chen S, Wu Q B and Ngan K N. 2021. Hierarchical class grouping with orthogonal constraint for class activation map generation. Neural Computing and Applications, 33(13): 7371-7380 [DOI: 10.1007/s00521-020-05416-2]
Niu Y, Jiao Y and Shi G M. 2021. Attention-shift based deep neural network for fine-grained visual categorization. Pattern Recognition, 116: #107947 [DOI: 10.1016/j.patcog.2021.107947]
Pei Y T, Huang Y P, Zou Q, Zhang X Y and Wang S. 2021. Effects of image degradation and degradation removal to CNN-based image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(4): 1239-1253 [DOI: 10.1109/TPAMI.2019.2950923]
Peng Y X, He X T and Zhao J J. 2018. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing, 27(3): 1487-1500 [DOI: 10.1109/TIP.2017.2774041]
Qi L, Lu X Q and Li X L. 2019. Exploiting spatial relation for fine-grained image classification. Pattern Recognition, 91: 47-55 [DOI: 10.1016/j.patcog.2019.02.007]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Wei X S, Xie C W, Wu J X and Shen C H. 2018. Mask-CNN: localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76: 704-714 [DOI: 10.1016/j.patcog.2017.10.002]
Wu L, Wang Y, Li X and Gao J B. 2019. Deep attention-based spatially recursive networks for fine-grained visual recognition. IEEE Transactions on Cybernetics, 49(5): 1791-1802 [DOI: 10.1109/TCYB.2018.2813971]
Xu H, Ma J Y and Zhang X P. 2020. MEF-GAN: multi-exposure image fusion via generative adversarial networks. IEEE Transactions on Image Processing, 29: 7203-7216 [DOI: 10.1109/TIP.2020.2999855]
Yan Z X, Hou Z Q, Xiong L, Liu X Y, Yu W S and Ma S G. 2021. Fine-grained classification based on bilinear feature fusion and YOLOv3. Journal of Image and Graphics, 26(4): 847-856 [DOI: 10.11834/jig.200031]
Zhang D M, Jin G Q, Dai F, Yuan Q S, Bao X G and Zhang Y D. 2019. Salient object detection based on deep fusion of hand-crafted features. Chinese Journal of Computers, 42(9): 2076-2086 [DOI: 10.11897/SP.J.1016.2019.02076]
Zhang Y B, Jia K and Wang Z X. 2020. Part-aware fine-grained object categorization using weakly supervised part detection network. IEEE Transactions on Multimedia, 22(5): 1345-1357 [DOI: 10.1109/TMM.2019.2939747]
Zheng H L, Fu J L, Mei T and Luo J B. 2017. Learning multi-attention convolutional neural network for fine-grained image recognition//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5219-5227 [DOI: 10.1109/ICCV.2017.557]
Zheng H L, Fu J L, Zha Z J, Luo J B and Mei T. 2020. Learning rich part hierarchies with progressive attention networks for fine-grained image recognition. IEEE Transactions on Image Processing, 29: 476-488 [DOI: 10.1109/TIP.2019.2921876]