Yang Zhimin1, Song Wei1,2 (1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China; 2. Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Wuxi 214122, China)

Abstract
Objective In fine-grained image recognition, intraclass pose variation is large, so commonalities with small intraclass change must be found, which makes the task depend on discriminative fine-grained part features; interclass part differences are small, so more comprehensive interclass distinctions must be found, which means the task also needs diverse coarse-grained part features. Existing methods mainly focus on locating parts at coarse and fine granularity, without considering how to select coarse- and fine-granularity features or how to fuse features across granularities. To this end, a fine-grained image recognition method that selects and fuses coarse- and fine-granularity features is proposed. Method First, a fine-grained feature selection module is designed to highlight discriminative fine-grained part features through spatial and channel selection. Second, a coarse-grained feature selection module is constructed that, based on the parts selected by the fine-grained module, mines the semantic and positional relations among parts to obtain diverse coarse-grained features that supplement the fine-grained parts. Finally, the fine-grained and coarse-grained features extracted by these two modules are fused into complementary coarse-and-fine granularity representations to improve recognition accuracy. Result Extensive experiments on three public benchmark datasets, CUB-200-2011 (Caltech-UCSD Birds-200-2011), Stanford Cars, and FGVC-Aircraft (fine-grained visual classification aircraft), show that the proposed method achieves recognition accuracies of 90.3%, 95.6%, and 94.8%, respectively, clearly outperforming current mainstream fine-grained image recognition methods, with accuracy gains of 0.7%, 0.5%, and 1.4% over the best compared results. Conclusion The proposed method extracts both coarse-grained and fine-grained visual features while ensuring their discrimination and diversity, making fine-grained image recognition more accurate.
Selecting and fusing coarse-and-fine granularity features for fine-grained image recognition

Yang Zhimin1, Song Wei1,2(1.School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China;2.Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Wuxi 214122, China)

Objective With the rapid development of deep learning techniques, machine intelligence in general image recognition has reached or even surpassed the human level. This encourages researchers to tackle highly complicated tasks in the field of fine-grained vision. As one of the classic tasks in this field, fine-grained image recognition (FGIR) distinguishes subordinate categories, such as bird species, car models, and aircraft types. FGIR is more challenging than general image recognition because subordinate species often exhibit smaller interclass differences (e.g., similar geometry or texture) and larger intraclass variations (e.g., illumination or pose changes). Therefore, exploring subtle differences and reducing intraclass variation across diverse object parts is the central challenge of FGIR. Facing this challenge, most current FGIR methods focus on locating discriminative parts at coarse and fine granularity, with the goal of generating powerful part representations for fine-grained recognition. Recent studies have demonstrated that part-based methods that locate diverse parts can distinguish different subclasses and thus enhance FGIR performance. The success of part-based methods can largely be attributed to their ability to select and locate multiple clearly distinct parts for downstream recognition. Strongly supervised part-based methods leverage part annotations to establish the connections among parts. However, this usually requires heavy manual labeling, which is time consuming and inefficient. By contrast, weakly supervised part-based methods show that the complementary relations between parts can be exploited in a learnable way. Nevertheless, methods based on part-level features may not be robust to appearance distortion; e.g., the poses of bird heads are uncontrollable in real scenes.
More importantly, pose variations may undermine the validity of spatial features. To compensate for this deficiency, multigranularity methods have been adopted for feature learning. However, these multigranularity methods devote little or no effort to determining at which granularities the diverse parts are most discriminative and how information across granularities can be fused effectively for higher recognition accuracy. Therefore, we argue that FGIR needs not only to efficiently select coarse- and fine-granularity features but also to effectively fuse them through part relations. To address these two limitations, we propose an FGIR method that selects and fuses coarse- and fine-granularity features for fine-grained recognition.

Method In this study, an FGIR method is built on a pretrained convolutional neural network that extracts basic convolutional features, and feature selection and fusion are carried out by three modules. First, we design a fine-grained feature selection module that highlights discriminative fine-grained part features through spatial and channel selection. Considering that the parts in a fine-grained image are usually spatially connected and activated in most feature channels, we perform spatial and channel selection to discard the background and highlight the informative channels of the convolutional features, thereby obtaining discriminative image representations. Second, a coarse-grained feature selection module is constructed to focus on the subtle features of parts. Based on the parts selected by the fine-grained module, the semantic and positional relationships among parts are mined to generate diverse coarse-grained features that provide context for the fine-grained parts. Most previous methods focused on local subtle features for FGIR, ignoring the potential influence of the relationships between parts and their coarse-grained context.
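The spatial and channel selection described above can be illustrated with a minimal sketch. The thresholding rule (keeping locations above the mean of the channel-aggregated activation map) and the top-k channel criterion are illustrative assumptions for exposition, not the paper's exact design:

```python
import numpy as np

def finegrained_select(feat, keep_channels=1):
    """Sketch of spatial + channel selection on a C x H x W feature map.

    Spatial selection keeps locations whose aggregated activation exceeds
    the mean (suppressing background); channel selection keeps the top-k
    channels ranked by global average pooling. Both rules are assumptions.
    """
    # Spatial selection: aggregate over channels, mask low-activation areas.
    act = feat.sum(axis=0)                      # H x W activation map
    spatial_mask = (act > act.mean()).astype(feat.dtype)
    selected = feat * spatial_mask              # broadcast mask over channels

    # Channel selection: rank channels by global average pooling.
    channel_score = selected.mean(axis=(1, 2))  # one score per channel
    topk = np.argsort(channel_score)[::-1][:keep_channels]
    return selected[topk], spatial_mask, topk
```

For example, on a 2-channel map whose activation is concentrated at one location, the mask zeroes the remaining locations and the stronger channel is retained.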
These methods usually fed each part individually into a subnetwork or simply concatenated the part features to make predictions. However, the relationships between parts, especially the semantic and positional relationships between parts and their coarse-grained contexts, contain valuable information that benefits feature learning and recognition. Therefore, inspired by the self-attention mechanism, this module constructs the relationship between each object part and its context from two perspectives, semantic modeling and spatial modeling. It attentively selects the information related to the object, thus providing diverse coarse-grained features for the fine-grained parts. Lastly, we design a coarse-and-fine granularity feature fusion module to improve the accuracy of the FGIR method. It establishes communication between fine- and coarse-grained features and fuses them into complementary coarse-and-fine granularity representations.

Result Extensive experiments were carried out to verify the effectiveness of the proposed method. Specifically, we compared our method with seven state-of-the-art FGIR methods on three public benchmark datasets, namely, Caltech-UCSD Birds-200-2011 (CUB-200-2011), Stanford Cars, and FGVC-Aircraft (fine-grained visual classification aircraft). The quantitative evaluation metrics were accuracy, mean average precision, and precision-recall curves (higher is better), and we also reported the parameters and floating-point operations (lower is better) of several methods for comparison. The experimental results showed that our method outperformed all other FGIR methods on the three datasets, achieving recognition accuracies of 90.3%, 95.6%, and 94.8%, respectively.
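The idea of relating selected parts through semantic and spatial modeling can be sketched as a simple self-attention step. Here `parts` is an N x D matrix of part features and `centers` an N x 2 matrix of part locations; combining a dot-product semantic term with a negative-distance positional term in one attention map is an illustrative assumption, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def part_relation_attention(parts, centers, pos_scale=1.0):
    """Sketch: enrich each part feature with context from related parts.

    The attention logits mix a semantic affinity (scaled dot product of
    part features) and a positional affinity (nearby parts attend more).
    """
    n, d = parts.shape
    semantic = parts @ parts.T / np.sqrt(d)            # semantic modeling
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    positional = -pos_scale * dist                     # spatial modeling
    attn = softmax(semantic + positional, axis=-1)     # rows sum to 1
    return attn @ parts                                # context-enriched parts
```

Each output row is a convex combination of the part features, weighted toward parts that are semantically similar and spatially close, which is the kind of coarse-grained context the module is meant to supply.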
Compared with the best results of progressive multigranularity (PMG) among the compared methods, the recognition accuracy was improved by 0.7%, 0.5%, and 1.4% on the three datasets, respectively. Moreover, compared with PMG, the floating-point operations decreased by 17.8 G and the parameters were reduced by 17.2 M. We also conducted a series of ablation experiments to clearly show the effectiveness of the different modules. Furthermore, we used the widely adopted class activation mapping technique to visualize the recognition results of our method and other high-performing methods, enabling a fair qualitative comparison and an analysis of the successes and failures of the proposed method.

Conclusion Facing the task of fine-grained image recognition, we propose an FGIR method that selects and fuses coarse- and fine-granularity features. The experimental results show that our method outperforms several state-of-the-art FGIR methods, indicating that it can extract both coarse- and fine-grained visual features while ensuring the discrimination and diversity of the features.
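The class activation mapping used for the qualitative comparison follows the standard recipe: weight each feature channel by the classifier weight of the target class and sum into a spatial heat map. The sketch below assumes a global-average-pooling classifier whose weight matrix `fc_weights` has one row per class; it is a generic illustration, not the visualization code of this paper:

```python
import numpy as np

def class_activation_map(feat, fc_weights, class_idx):
    """Standard class activation mapping over a C x H x W feature map.

    Each channel is weighted by the classifier weight for `class_idx`,
    summed into an H x W map, rectified, and normalized to [0, 1].
    """
    cam = np.tensordot(fc_weights[class_idx], feat, axes=1)  # H x W map
    cam = np.maximum(cam, 0)                                 # keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                                # normalize to [0, 1]
    return cam
```

Upsampled to the input resolution and overlaid on the image, such a map shows which regions drive the prediction, which is how the success and failure cases above would typically be inspected.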