融合弱监督目标定位的细粒度小样本学习
Weakly-supervised object localization based fine-grained few-shot learning
2022年27卷第7期, 页码: 2226-2239
收稿日期: 2021-01-13
修回日期: 2021-04-13
录用日期: 2021-04-20
纸质出版日期: 2022-07-16
DOI: 10.11834/jig.200849
目的
小样本学习旨在通过一幅或几幅图像来学习全新的类别。目前许多小样本学习方法基于图像的全局表征,可以很好地实现常规小样本图像分类任务。但是,细粒度图像分类需要依赖局部的图像特征,而基于全局表征的方法无法有效地获取图像的局部特征,导致很多小样本学习方法不能很好地处理细粒度小样本图像分类问题。为此,提出一种融合弱监督目标定位的细粒度小样本学习方法。
方法
在数据量有限的情况下,目标定位是一个有效的方法,能直接提供最具区分性的区域。受此启发,提出了一个基于自注意力的互补定位模块来实现弱监督目标定位,生成筛选掩膜进行特征描述子的筛选。基于筛选的特征描述子,设计了一种语义对齐距离来度量图像最具区分性区域的相关性,进而完成细粒度小样本图像分类。
结果
在miniImageNet数据集上,本文方法在1-shot和5-shot下的分类精度相较性能第2的方法高出0.56%和5.02%。在细粒度数据集Stanford Dogs和Stanford Cars上,本文方法在1-shot和5-shot下的分类精度相较性能第2的方法分别提高了4.18%、7.49%和16.13%、5.17%。在CUB 200-2011(Caltech-UCSD birds)数据集中,本文方法在5-shot下的分类精度相较性能第2的方法提升了1.82%。泛化性实验也显示出本文方法可以更好地同时处理常规小样本学习和细粒度小样本学习。此外,可视化结果显示出所提出的弱监督目标定位模块可以更完整地定位出目标。
结论
融合弱监督目标定位的细粒度小样本学习方法显著提高了细粒度小样本图像分类的性能,而且可以同时处理常规的和细粒度的小样本图像分类。
Objective
Few-shot learning (FSL) aims to recognize novel visual categories from only one or a few labeled samples. In a typical FSL scenario, a model is trained with a classification strategy in the meta-train phase and must then recognize previously unseen classes from few labeled examples in the meta-test phase. Most current few-shot image classification methods learn a robust global representation. Although such methods handle generic few-shot classification well, a global representation cannot capture the local and subtle features that are critical for fine-grained image recognition. In addition, fine-grained datasets contain few samples because of the high cost of annotation, so fine-grained recognition is itself a natural few-shot scenario that lacks labeled data. Fine-grained recognition relies on locating the most discriminative regions and exploiting the discriminative features within them, yet many existing fine-grained methods cannot be transferred directly to the fine-grained few-shot task because the extra annotations they require (e.g., bounding boxes) are unavailable. A method that can handle both generic few-shot learning and fine-grained few-shot learning is therefore needed.
Method
Weakly-supervised object localization (WSOL) is well suited to the fine-grained few-shot classification task: most fine-grained few-shot datasets provide only image-level labels because pixel-level annotation is expensive, and WSOL can directly expose the most discriminative regions, which benefits both generic and fine-grained image classification. However, many existing WSOL methods localize objects only partially. For example, class activation mapping (CAM) retrains only the last few layers of a classification network and derives the activation map through global pooling and a fully connected layer, so the map tends to cover just the most discriminative part of the object. To address these issues, we propose a self-attention based complementary module (SACM) to perform WSOL. SACM consists of a channel-based attention module (CBAM) and a classifier. Using the spatial attention mechanism over the feature maps, CBAM directly generates a saliency mask, and a complementary non-saliency mask is obtained from it by thresholding. Multiplying each mask spatial-wise with the feature maps yields the saliency and complementary non-saliency feature maps, respectively. By forcing the classifier to assign both sets of feature maps to the same category, SACM produces a more complete class activation map. We then use this activation map to filter the local feature descriptors and keep only those useful for classification, which forms the descriptor representation of the image. In addition, metric methods designed for generic few-shot classification cannot be applied directly to fine-grained few-shot image classification. We therefore design a semantic alignment distance that measures the distance between two fine-grained images using the filtered feature descriptors and the naive Bayes nearest neighbor (NBNN) algorithm. First, for each query feature descriptor, we find its nearest support-set descriptor under cosine distance, denoted the nearest-neighbor cosine distance. Then, we accumulate the nearest-neighbor cosine distances of all filtered query descriptors to obtain the semantic alignment distance. These two steps constitute the semantic alignment module (SAM). Through the nearest-neighbor cosine distance, each feature descriptor in the query image is accurately aligned to a support feature descriptor, which guarantees that the content of the query image and the support image is semantically aligned. Meanwhile, each feature descriptor has a larger search space than a single high-dimensional feature vector, which is equivalent to classification in a relatively "high-data" regime and thereby improves the tolerance of the metric to noise.
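The masking step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, the channel-mean attention, and the 0.5 threshold are assumptions; in the paper the saliency mask comes from CBAM's spatial attention, and both masked feature maps are fed to a shared classifier with the same class label.

```python
import numpy as np

def sacm_masks(feature_maps, threshold=0.5):
    """Generate saliency and complementary non-saliency features (sketch).

    feature_maps: (C, H, W) array of convolutional features for one image.
    threshold: cut-off on the normalized attention map (value assumed here).
    """
    # Spatial attention proxy: averaging over channels gives an (H, W) map.
    attention = feature_maps.mean(axis=0)
    # Normalize to [0, 1] so a fixed threshold is meaningful.
    rng = attention.max() - attention.min()
    attention = (attention - attention.min()) / (rng + 1e-8)
    # Saliency mask and its complement partition the spatial locations.
    saliency_mask = (attention >= threshold).astype(feature_maps.dtype)
    complementary_mask = 1.0 - saliency_mask
    # Spatial-wise multiplication keeps only the regions each mask selects
    # (the (H, W) masks broadcast over the channel dimension).
    salient_feats = feature_maps * saliency_mask
    non_salient_feats = feature_maps * complementary_mask
    return salient_feats, non_salient_feats, saliency_mask
```

Because the two masks are complementary, the salient and non-salient branches together preserve the full feature maps; classifying both branches into the same category is what pushes the activation map to cover the whole object.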
Result
We conducted extensive experiments to verify the performance. On the miniImageNet dataset, the proposed method outperforms the second-best method by 0.56% and 5.02% under the 1-shot and 5-shot settings, respectively. On the fine-grained Stanford Dogs and Stanford Cars datasets, our method improves on the second-best method by 4.18% and 7.49%, and by 16.13% and 5.17%, under the 1-shot and 5-shot settings, respectively. On CUB 200-2011 (Caltech-UCSD Birds), our method also improves by 1.82% under the 5-shot setting. Our approach can be applied to both generic few-shot learning and fine-grained few-shot learning. The ablation study demonstrates that filtering feature descriptors with the SACM-based class activation map improves fine-grained few-shot recognition, and that, under the same conditions, the proposed semantic alignment distance outperforms the Euclidean distance for few-shot classification. Additional visualizations illustrate that the proposed SACM localizes the key objects more completely using image-level labels alone.
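As a rough illustration of the semantic alignment distance that is compared against the Euclidean distance above, the following NumPy sketch accumulates the nearest-neighbor cosine distance of each query descriptor over a class's support descriptors; the function name is hypothetical, and the SACM-based descriptor filtering is omitted:

```python
import numpy as np

def semantic_alignment_distance(query_desc, support_desc):
    """NBNN-style image-to-class distance over local descriptors (sketch).

    query_desc:   (Nq, D) filtered local descriptors of the query image.
    support_desc: (Ns, D) descriptors pooled from one support class.
    A lower value means the query is closer to that class.
    """
    # L2-normalize rows so the dot product equals cosine similarity.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    s = support_desc / np.linalg.norm(support_desc, axis=1, keepdims=True)
    sim = q @ s.T                 # (Nq, Ns) cosine similarities
    nearest = sim.max(axis=1)     # best-matching support descriptor per query descriptor
    # Accumulate the nearest-neighbor cosine distances over all query descriptors.
    return np.sum(1.0 - nearest)
```

A query image would then be assigned to the support class with the smallest accumulated distance; because every local descriptor searches the whole support pool, the metric degrades gracefully when a few descriptors are noisy.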
Conclusion
The proposed WSOL-based fine-grained few-shot learning method significantly improves fine-grained few-shot image classification, and it handles both generic and fine-grained few-shot classification well.
Bertinetto L, Henriques J, Torr P H S and Vedaldi A. 2019. Meta-learning with differentiable closed-form solvers//Proceedings of the 7th International Conference on Learning Representations. Seoul, Korea (South): ICLR: 1-15
Boiman O, Shechtman E and Irani M. 2008. In defense of nearest-neighbor based image classification//Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, USA: IEEE: 1-8[DOI: 10.1109/CVPR.2008.4587598]
Choe J and Shim H. 2019. Attention-based dropout layer for weakly supervised object localization//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 2214-2223[DOI: 10.1109/CVPR.2019.00232]
Feng Y S and Wang Z L. 2016. Fine-grained image categorization with segmentation based on top-down attention map. Journal of Image and Graphics, 21(9): 1147-1154
冯语姗, 王子磊. 2016. 自上而下注意图分割的细粒度图像分类. 中国图象图形学报, 21(9): 1147-1154[DOI: 10.11834/jig.20160904]
Finn C, Abbeel P and Levine S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks//Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: JMLR.org: 1126-1135
Fu J L, Zheng H L and Mei T. 2017. Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4476-4484[DOI: 10.1109/CVPR.2017.476]
Garcia V and Bruna J. 2018. Few-shot learning with graph neural networks//Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada: ICLR: 1-13
Gidaris S and Komodakis N. 2018. Dynamic few-shot visual learning without forgetting//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4367-4375[DOI: 10.1109/CVPR.2018.00459]
Hariharan B and Girshick R. 2017. Low-shot visual recognition by shrinking and hallucinating features//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 3037-3046[DOI: 10.1109/ICCV.2017.328]
Huang H X, Zhang J J, Zhang J, Xu J S and Wu Q. 2021. Low-rank pairwise alignment bilinear network for few-shot fine-grained image classification. IEEE Transactions on Multimedia, 23: 1666-1680[DOI: 10.1109/tmm.2020.3001510]
Khosla A, Jayadevaprakash N, Yao B and Li F F. 2011. Novel dataset for fine-grained image categorization//Proceedings of the CVPR Workshop on Fine-Grained Visual Categorization (FGVC). Colorado Springs, USA: IEEE: 1-2
Li W B, Wang L, Xu J L, Huo J, Gao Y and Luo J B. 2019a. Revisiting local descriptor based image-to-class measure for few-shot learning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 7253-7260[DOI: 10.1109/CVPR.2019.00743]
Li W B, Xu J L, Huo J, Wang L, Gao Y and Luo J B. 2019b. Distribution consistency based covariance metric networks for few-shot learning//Proceedings of the AAAI Conference on Artificial Intelligence, 33: 8642-8649[DOI: 10.1609/aaai.v33i01.33018642]
Li X M, Yu L Q, Fu C W, Fang M and Heng P A. 2020. Revisiting metric learning for few-shot image classification. Neurocomputing, 406: 49-58[DOI: 10.1016/j.neucom.2020.04.040]
Lifchitz Y, Avrithis Y, Picard S and Bursuc A. 2019. Dense classification and implanting for few-shot learning//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 9250-9259[DOI: 10.1109/CVPR.2019.00948]
Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3431-3440[DOI: 10.1109/CVPR.2015.7298965]
Makadia A and Yumer M E. 2015. Learning 3D part detection from sparsely labeled data//Proceedings of the 2nd International Conference on 3D Vision. Tokyo, Japan: IEEE: 311-318[DOI: 10.1109/3DV.2014.108]
Oquab M, Bottou L, Laptev I and Sivic J. 2015. Is object localization for free? Weakly-supervised learning with convolutional neural networks//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 685-694[DOI: 10.1109/CVPR.2015.7298668]
Qiao S Y, Liu C X, Shen W and Yuille A. 2018. Few-shot image recognition by predicting parameters from activations//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7229-7238[DOI: 10.1109/CVPR.2018.00755]
Ravi S and Larochelle H. 2017. Optimization as a model for few-shot learning//Proceedings of the 5th International Conference on Learning Representations. Toulon, France: ICLR
Recht B, Roelofs R, Schmidt L and Shankar V. 2019. Do ImageNet classifiers generalize to ImageNet?//Proceedings of the 36th International Conference on Machine Learning. Long Beach, USA: ICML: 9413-9424
Santoro A, Bartunov S, Botvinick M, Wierstra D and Lillicrap T. 2016. Meta-learning with memory-augmented neural networks//Proceedings of the 33rd International Conference on Machine Learning. New York, USA: ICML: 1842-1850
Snell J, Swersky K and Zemel R. 2017. Prototypical networks for few-shot learning//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 4080-4090
Sun X, Xv H, Dong J Y, Zhou H Y, Chen C R and Li Q. 2021. Few-shot learning for domain-specific fine-grained image classification. IEEE Transactions on Industrial Electronics, 68(4): 3588-3598[DOI: 10.1109/TIE.2020.2977553]
Sung F, Yang Y X, Zhang L, Xiang T, Torr P H S and Hospedales T M. 2018. Learning to compare: relation network for few-shot learning//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1199-1208[DOI: 10.1109/CVPR.2018.00131]
Vinyals O, Blundell C, Lillicrap T, Kavukcuoglu K and Wierstra D. 2016. Matching networks for one shot learning//Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona, Spain: Curran Associates Inc.: 3637-3645
Wah C, Branson S, Welinder P, Perona P and Belongie S. 2011. The Caltech-UCSD Birds-200-2011 Dataset. Computation and Neural Systems Technical Report
Wei X S, Luo J H, Wu J X and Zhou Z H. 2017. Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Transactions on Image Processing, 26(6): 2868-2881[DOI: 10.1109/TIP.2017.2688133]
Wei X S, Xie C W, Wu J X and Shen C H. 2018. Mask-CNN: localizing parts and selecting descriptors for fine-grained bird species categorization. Pattern Recognition, 76: 704-714[DOI: 10.1016/j.patcog.2017.10.002]
Weng Y C, Tian Y, Lu D M and Li Q Y. 2017. Fine-grained bird classification based on deep region networks. Journal of Image and Graphics, 22(11): 1521-1531
翁雨辰, 田野, 路敦民, 李琼砚. 2017. 深度区域网络方法的细粒度图像分类. 中国图象图形学报, 22(11): 1521-1531[DOI: 10.11834/jig.170262]
Zhang H G, Zhang J and Koniusz P. 2019. Few-shot learning via saliency-guided hallucination of samples//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 2765-2774[DOI: 10.1109/CVPR.2019.00288]
Zhang X L, Wei Y C, Feng J S, Yang Y and Huang T. 2018a. Adversarial complementary learning for weakly supervised object localization//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1325-1334[DOI: 10.1109/CVPR.2018.00144]
Zhang X L, Wei Y C, Kang G L, Yang Y and Huang T. 2018b. Self-produced guidance for weakly-supervised object localization//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 610-625[DOI: 10.1007/978-3-030-01258-8_37]
Zheng H L, Fu J L, Mei T and Luo J B. 2017. Learning multi-attention convolutional neural network for fine-grained image recognition//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5219-5227[DOI: 10.1109/ICCV.2017.557]
Zhou B L, Khosla A, Lapedriza A, Oliva A and Torralba A. 2016. Learning deep features for discriminative localization//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 2921-2929[DOI: 10.1109/CVPR.2016.319]
Zhu Y H, Liu C L and Jiang S Q. 2020. Multi-attention meta learning for few-shot fine-grained image recognition//Proceedings of the 29th International Joint Conference on Artificial Intelligence. Yokohama, Japan: IJCAI: 1090-1096[DOI: 10.24963/ijcai.2020/152]