Complementary attention diversity feature fusion network for fine-grained classification
2023, Vol. 28, No. 8, Pages: 2420-2431
Print publication date: 2023-08-16
DOI: 10.11834/jig.220295
Huang Gang, Zheng Yuanlin, Liao Kaiyang, Lin Guangfeng, Cao Congjun and Song Xuefang. 2023. Complementary attention diversity feature fusion network for fine-grained classification. Journal of Image and Graphics, 28(08): 2420-2431
Objective
Transformer-based networks have shown excellent performance in image classification. However, attention mechanisms tend to focus only on the most salient features of an image while ignoring secondary salient information in other regions, and the self-attention-based Transformer is no exception. To capture more effective information and learn more discriminative features from distinctive latent features, we propose a complementary attention diversity feature fusion network (CADF), which attends to sub-salient features and jointly encodes channel and spatial features to enhance the attention perception of feature diversity.
Method
CADF consists of a potential feature module (PFM) and a diversity feature fusion module (DFFM). PFM obtains salient features by aggregating regions of interest across the spatial and channel dimensions, and then suppresses their saliency to force the network to mine potential features, thereby strengthening its perception of subtle discriminative cues. DFFM explores the correlations among features and models the interactions between features of different scales to obtain richer complementary information and thus produce stronger fine-grained features.
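The abstract does not specify the internal structure of PFM and DFFM, so the following PyTorch sketch only illustrates the two ideas it describes: suppressing the most salient responses so that secondary (potential) regions are mined, and modeling cross-scale feature interactions before fusion. All module names, layer choices, and the suppression ratio below are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PotentialFeatureModule(nn.Module):
    """Hypothetical PFM sketch: aggregate channel attention, then mask the
    most salient positions to push the network toward secondary regions."""

    def __init__(self, channels: int, suppress_ratio: float = 0.1):
        super().__init__()
        # Simple SE-style channel gate; the real aggregation is unspecified.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.suppress_ratio = suppress_ratio

    def forward(self, x: torch.Tensor):
        attended = x * self.channel_gate(x)            # channel-refined features
        saliency = attended.mean(dim=1, keepdim=True)  # B x 1 x H x W map
        b, _, h, w = saliency.shape
        k = max(1, int(self.suppress_ratio * h * w))
        # Threshold at the k-th largest activation per image.
        thresh = saliency.view(b, -1).topk(k, dim=1).values[:, -1]
        mask = (saliency < thresh.view(b, 1, 1, 1)).float()
        potential = attended * mask                    # salient peaks suppressed
        return attended, potential


class DiversityFeatureFusionModule(nn.Module):
    """Hypothetical DFFM sketch: two feature maps of different scales gate
    each other, and the interaction results are fused."""

    def __init__(self, c_low: int, c_high: int, c_out: int):
        super().__init__()
        self.proj_low = nn.Conv2d(c_low, c_out, kernel_size=1)
        self.proj_high = nn.Conv2d(c_high, c_out, kernel_size=1)
        self.fuse = nn.Conv2d(2 * c_out, c_out, kernel_size=1)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        low_p = self.proj_low(low)
        high_p = F.interpolate(self.proj_high(high), size=low.shape[-2:],
                               mode="bilinear", align_corners=False)
        # Cross-feature interaction: each branch gated by the other's saliency.
        low_i = low_p * torch.sigmoid(high_p.mean(dim=1, keepdim=True))
        high_i = high_p * torch.sigmoid(low_p.mean(dim=1, keepdim=True))
        return self.fuse(torch.cat([low_i, high_i], dim=1))
```

For example, `PotentialFeatureModule(1024)(features)` would return the salient and saliency-suppressed feature pair for a 1024-channel backbone stage output, which could then be paired with another stage's output through `DiversityFeatureFusionModule`.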
Result
The proposed method can be trained end to end without bounding boxes or multi-stage training. It is validated on four benchmark datasets, CUB-200-2011 (Caltech-UCSD Birds-200-2011), Stanford Dogs, Stanford Cars, and FGVC-Aircraft (fine-grained visual classification of aircraft), reaching accuracies of 92.6%, 94.5%, 95.3%, and 93.5%, respectively. The experimental results show that the method outperforms current mainstream approaches and performs well across multiple datasets. Ablation studies verify the effectiveness of each module of the model.
Conclusion
The proposed method performs strongly: complementary attention effectively increases feature diversity, capturing as many rich discriminative features as possible and making classification more accurate.
Objective
Fine-grained classification aims to divide a basic category, such as wild birds or vehicles, into more detailed subcategories by extracting features that separate them. Because inter-category differences are subtle while intra-category variations are large, capturing the subtle differences of specific regions is challenging. Although Transformer-based networks have shown great potential for image classification, their attention mechanisms, including the self-attention of the Transformer, still tend to attend only to the salient features of an image, while most of the latent features are ignored. To obtain more effective information, feature representations derived from discriminative latent features need to be learned for fine-grained classification. We therefore develop a complementary attention diversity feature fusion (CADF) network, which extracts multi-scale features and models the channel and spatial feature interactions of images.
Method
The proposed complementary attention diversity feature fusion network consists of two modules. 1) The potential feature module (PFM) focuses on the features of different parts: it enhances the salient features while preserving the latent ones. 2) The diversity feature fusion module (DFFM) models the channel and spatial information interactions among multiple features to enrich them, and the information of specific feature parts is reinforced through fusion. Features at different scales exchange mutually beneficial information, which improves their robustness and makes them more discriminative. The network is implemented in PyTorch on an NVIDIA 2080Ti GPU. Its weights are initialized with Swin Transformer parameters pre-trained on the ImageNet classification dataset. Optimization uses the AdamW optimizer with a momentum of 0.9 and a cosine annealing scheduler. The batch size is set to 6, the learning rate of the backbone layers to 0.000 1, that of the newly added layers to 0.000 01, and the weight decay to 0.05. During training, the input images are resized to 550 × 550 pixels, randomly cropped to 448 × 448 pixels, and further augmented with random horizontal flips. During testing, the input images are resized to 550 × 550 pixels and center-cropped to 448 × 448 pixels. The hyper-parameters are set to λ = 1 and β = 0.5.
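The training configuration above maps directly onto standard PyTorch components. The sketch below reproduces the reported settings (550 → 448 cropping, horizontal flips, backbone learning rate 0.000 1, new-layer learning rate 0.000 01, weight decay 0.05, cosine annealing); the `backbone`/`new_layers` grouping and the epoch count are placeholders, and the reported momentum of 0.9 is interpreted here as AdamW's first beta coefficient.

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

# Preprocessing as reported: resize to 550 x 550, then crop to 448 x 448.
train_tf = transforms.Compose([
    transforms.Resize((550, 550)),
    transforms.RandomCrop(448),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
test_tf = transforms.Compose([
    transforms.Resize((550, 550)),
    transforms.CenterCrop(448),
    transforms.ToTensor(),
])

# Stand-ins for the real network parts (the paper uses a pre-trained
# Swin Transformer backbone plus newly added layers).
model = nn.ModuleDict({
    "backbone": nn.Linear(8, 8),    # placeholder for the Swin backbone
    "new_layers": nn.Linear(8, 8),  # placeholder for the added heads
})

# Separate learning rates for the backbone and the newly added layers.
optimizer = AdamW(
    [
        {"params": model["backbone"].parameters(), "lr": 1e-4},
        {"params": model["new_layers"].parameters(), "lr": 1e-5},
    ],
    betas=(0.9, 0.999),   # the reported 0.9 momentum read as the first beta
    weight_decay=0.05,
)
num_epochs = 100  # placeholder; the epoch count is not reported
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
```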
Result
To verify its effectiveness, experiments are carried out on four fine-grained datasets: CUB-Birds, Stanford Dogs, Stanford Cars, and FGVC-Aircraft, where the classification accuracy reaches 92.6%, 94.5%, 95.3%, and 93.5%, respectively. Ablation experiments further verify the effectiveness of the PFM and DFFM modules. Compared with the baseline framework, adding the PFM module alone already improves accuracy considerably: Swin-B + PFM improves accuracy by 1.4%, 1.4%, and 0.8% on the CUB-Birds, Stanford Dogs, and Stanford Cars datasets, respectively. Compared with the PFM-only network, adding the feature exchange fusion module (Swin-B + PFM + DFFM) further improves accuracy by 0.4%, 0.5%, and 0.3%. These results indicate that the CADF model has strong feature extraction ability and confirm the effectiveness of each structure of the network on these datasets. Feature visualization is also conducted to show intuitively which regional features the attention mechanism focuses on.
Conclusion
To address the insufficient feature extraction of attention mechanisms, we develop a latent feature extraction method for fine-grained image classification.
Keywords: fine-grained classification; diversity features; potential features; feature fusion; end-to-end learning