Published: 2019-04-16
DOI: 10.11834/jig.180423
2019 | Volume 24 | Number 4

Image Analysis and Recognition

聚焦—识别网络架构的细粒度图像分类

王永雄, 张晓兵

上海理工大学光电信息与计算机工程学院, 上海 200093

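Received: 2018-07-27; revised: 2018-09-18

Funding: National Natural Science Foundation of China (61673276, 61603255)

First author: Wang Yongxiong, born in 1970, male, professor; main research interests are intelligent robots and machine vision. E-mail: wyxiong@usst.edu.cn;
Zhang Xiaobing, female, master's student; main research interests are machine vision and image processing. E-mail: bingzxcn@163.com.

CLC number: TP391

Document code: A

Article ID: 1006-8961(2019)04-0493-10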

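Abstract

Objective Fine-grained image classification subdivides a broad category into finer subcategories, for example distinguishing bird species, car makes and models, or dog breeds. To address the excess of irrelevant information and the background interference in fine-grained images, this paper builds a joint focus-recognition learning framework on deep convolutional networks. The framework removes the background, highlights the target to be recognized, and automatically locates discriminative regions, thereby raising the fine-grained classification rate. Method A Yolov2 (you only look once v2) network first detects the target object quickly, which removes the influence of background interference and irrelevant information on the classification result and focuses on the discriminative region. The detected object (the Yolov2 output) is then fed into a bilinear convolutional neural network for training and classification. The framework can be trained end to end and relies only on class labels, with no other manual annotation. Result In experiments on the fine-grained datasets CUB-200-2011, Cars196, and Aircrafts100, the model reaches classification accuracies of 84.5%, 92%, and 88.4%, respectively. These are 0.4, 0.7, and 3.9 percentage points above the best accuracy of comparable classification algorithms, and 0.5, 1.4, and 4.5 percentage points above the method that uses two identical D-Net networks. Conclusion Extracting discriminative regions with the focus-recognition deep learning framework benefits fine-grained image classification. It filters out most regions that do not contribute to the classification, so the network learns more features that are useful for the task, which reduces the influence of background interference on the result and raises the recognition rate.

Keywords

fine-grained image classification; object detection; bilinear convolutional neural network; focus-recognition framework; discrimination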

Fine-grained image classification with network architecture of focus and recognition

Wang Yongxiong, Zhang Xiaobing

School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

Supported by: National Natural Science Foundation of China (61673276, 61603255)

Abstract

Objective Research in image classification has shifted in recent years from coarse-grained to fine-grained classification, which is now an active topic in computer vision. Its purpose is to subdivide a broad category into detailed subcategories, for example distinguishing bird species, car makes and models, or dog breeds. Fine-grained classification is in high demand. In ecological protection, identifying different species is key to ecological research; in botany, the variety and number of flowers and the similarity between them make the task harder. Computer vision makes low-cost fine-grained classification possible. However, fine-grained classes typically show small inter-class differences and large intra-class differences, so the task is more challenging than ordinary image classification. Fine-grained images also contain much irrelevant information and background interference, which prevents the network from learning the truly discriminative features and degrades classification performance. Finding the discriminative regions in an image is therefore important for improving performance. To this end, a joint deep learning framework of focus and recognition is constructed for fine-grained image classification. The framework removes the background, highlights the target to be identified, and automatically locates the discriminative region, so that the deep convolutional neural network extracts more useful and discriminative features and the classification rate improves. Method First, the Yolov2 (you only look once v2) detection algorithm rapidly detects the object in the image and eliminates the influence of background interference and unrelated information. The detected target regions then form the datasets used to train a bilinear convolutional neural network, and the trained model is used for fine-grained classification. Yolov2 improves on the Yolov1 detection algorithm and localizes small objects more precisely. It automatically finds the target in the picture and filters out most regions that do not contribute to classification. The bilinear convolutional neural network is designed for fine-grained classification: it uses two convolutional neural networks to extract features from the same picture simultaneously and obtains a bilinear feature vector by bilinear pooling. The bilinear feature vector is fed into a softmax layer to produce the final classification result. The bilinear network does not depend on additional manual annotation and can be trained end to end as a single system; it relies only on class labels, which greatly reduces the difficulty and complexity of fine-grained classification. Result Verification experiments are performed on the open standard fine-grained datasets CUB-200-2011, Cars196, and Aircrafts100. A pre-trained Yolov2 detection model is applied to each dataset to obtain the discriminative regions, and the bilinear convolutional neural network is trained on the processed datasets. The resulting model achieves classification accuracies of 84.5%, 92%, and 88.4% on the three datasets. Compared with the highest accuracy obtained by the same classification algorithm without discriminative-region extraction, accuracy improves by 0.4, 0.7, and 3.9 percentage points. Compared with the same algorithm extracting features from two identical D-Net networks, it improves by 0.5, 1.4, and 4.5 percentage points. We also compare with other fine-grained classification algorithms such as Spatial Transformer Networks, which perform well, form a single end-to-end system, and depend only on label information; the accuracy of our method on the bird dataset is still 0.4 percentage points higher. Conclusion A method based on a focus-recognition network architecture is proposed to improve the recognition rate of fine-grained image classification. The experiments show that detecting the discriminative region with this architecture benefits classification: it filters out most of the image area that does not contribute to fine-grained classification and thereby reduces the influence of background interference. The bilinear convolutional neural network then learns more features that are useful for fine-grained classification, and the recognition rate improves. Comparisons with other fine-grained classification algorithms on several datasets further support the effectiveness of the method.

Key words

fine-grained image classification; target detection; bilinear convolutional neural network; framework of focus and recognition; discrimination

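0 Introduction

Image classification is a classic topic in computer vision research and covers both coarse-grained and fine-grained classification. Fine-grained image classification subdivides a broad category into finer subcategories, for example distinguishing bird species, car makes and models, or dog breeds, and in many cases has greater practical value. Because image acquisition varies in pose, viewpoint, illumination, occlusion, and background clutter, fine-grained classification usually exhibits subtle inter-class differences and large intra-class differences. Compared with ordinary image classification, it is therefore more challenging.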

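Early fine-grained classification algorithms based on handcrafted features generally first extracted local features such as SIFT (scale-invariant feature transform)^[1] or HOG (histogram of oriented gradient)^[2] from the image, and then encoded them with models such as VLAD (vector of locally aggregated descriptors)^[3] or the Fisher vector^[4-5]. Because selecting handcrafted features is tedious and their representational power is limited, classification performance was poor. With the rise of deep learning, features learned automatically by convolutional neural networks proved far more descriptive than handcrafted ones, and the many algorithms built on convolutional features have driven rapid progress in fine-grained image classification.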

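Depending on whether manual annotation is required during training, deep-learning-based fine-grained classification algorithms fall into two groups: strongly supervised and weakly supervised. Strongly supervised methods need not only the class label of each image but also manual annotations such as bounding boxes and part locations; weakly supervised methods rely on class labels alone. In either case, most algorithms follow the same idea: first find the foreground object and local regions in the image, then extract features from each region with a convolutional neural network, and finally concatenate the features to train the classifier and predict^[6]. Zhang et al.^[7] proposed the part-based R-CNN algorithm, which first applies R-CNN^[8] to detect local regions, extracts convolutional features from each region, concatenates them into one feature vector, and trains a support vector machine (SVM) for classification. However, the selective search algorithm^[9] it relies on produces a large number of irrelevant candidate regions, which wastes computation. Branson et al.^[10] proposed a pose-normalized convolutional neural network (CNN) algorithm, which aligns image pose using prototypes and extracts features from different network layers for different local regions; however, the keypoints it detects with the DPM (deformable parts model)^[11] algorithm deviate considerably from the annotated keypoints. Xiao et al.^[12] proposed a two-level attention algorithm that needs no extra annotation and uses only class labels. The model has three processing stages, handled by three sub-models: preprocessing, object level, and part level. However, it obtains local regions by clustering, so its accuracy is quite limited. Zhang et al.^[13] proposed selecting discriminative local-region features from candidate convolutional features; because the region proposals come from selective search, the method is effective but incurs a large computational cost and wastes resources. Simon et al.^[14] used a convolutional neural network to generate keypoints, derived local regions from those keypoints, and then extracted features from the local regions with a convolutional neural network; for the foreground object they still used conventional selective search. All of these algorithms use convolutional neural networks only for feature extraction: the processing steps remain separate, and the pipeline is not optimized end to end as a whole.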

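The difficulty of fine-grained image classification is that differences between subclasses are small and easily swamped by complex background information, so the network struggles to learn the truly distinguishing features. Finding regions with discriminative power (discriminative regions) is therefore crucial. Lin et al.^[15] proposed the bilinear convolutional neural network (B-CNN), a weakly supervised fine-grained classification algorithm that achieves high accuracy on three classic datasets, can be trained end to end, and relies only on class labels without any other image annotation, which makes it more practical. The bilinear model can be viewed as one network detecting local regions of the object while the other extracts features; the two networks cooperate to perform region detection and feature extraction and together complete the classification task. B-CNN generalizes well and, compared with earlier fine-grained classification algorithms, has lower computational complexity and good recognition performance.

The more discriminative information an image contains, or the larger its share of the image, the more discriminative features a convolutional neural network can extract and the higher the classification accuracy; this resembles how humans focus when recognizing fine-grained objects. Following this idea, this paper proposes a deep-learning-based focus-recognition network framework for fine-grained image classification. The Yolov2 algorithm of Redmon et al.^[16] first locates the object quickly, focusing on the discriminative region and filtering out regions irrelevant to fine-grained classification. A B-CNN model then extracts features from the discriminative region and classifies it, which reduces the influence of background interference and raises the recognition rate. The method needs only image class labels, which avoids a great deal of tedious manual annotation.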

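1 Deep learning framework based on focus and recognition

1.1 Overall focus-recognition framework

First, the Yolov2 detection network finds the discriminative region in the image and removes background information irrelevant to classification. The result (the Yolov2 output) is then fed into the bilinear convolutional neural network to obtain the final classification. The system framework is shown in Fig. 1.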

Fig. 1 System architecture of Yolov2 and B-CNN

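1.2 Detection method based on Yolov2

1.2.1 Focusing on discriminative regions with the Yolov2 network

Yolov2 improves on the Yolov1 object detection algorithm^[17]. It first divides the input image into $S \times S$ grid cells; the finer grid lets the model localize small objects more precisely. The Yolov2 detection network predicts $K$ bounding boxes for each grid cell. During training the model keeps adjusting the width and height of the predicted boxes, so if representative prior box dimensions are chosen at the outset, the predictions become more accurate. Yolov2 clusters the annotated boxes of the training set with K-means to find suitable priors. If Euclidean distance were used as the metric, larger boxes would generate more error than smaller ones. Using the intersection over union (IoU) instead makes the error independent of box size, and the final distance metric is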

$ d\left( {\mathit{\boldsymbol{g}},\mathit{\boldsymbol{h}}} \right) = 1 - I\left( {\mathit{\boldsymbol{g}},\mathit{\boldsymbol{h}}} \right) $

(1)

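where $\boldsymbol{h}$ denotes a cluster-center box and $\boldsymbol{g}$ an annotated box. $I\left( {\boldsymbol{g}, \boldsymbol{h}} \right)$ is the IoU of the cluster-center box and the annotated box, that is, the ratio of the area of their intersection to the area of their union, which indicates how accurate the predicted box is. It is defined as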

$ I\left( {\mathit{\boldsymbol{g}},\mathit{\boldsymbol{h}}} \right) = \frac{{\mathit{\boldsymbol{g}} \cap \mathit{\boldsymbol{h}}}}{{\mathit{\boldsymbol{g}} \cup \mathit{\boldsymbol{h}}}} $

(2)
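The distance in Eqs. (1) and (2) can be illustrated with a short sketch. It assumes, as is usual when clustering box dimensions, that each box is reduced to its (width, height) pair and that two boxes are aligned at a common corner before the IoU is taken; the function names and the sample boxes are hypothetical and are not taken from the paper.

```python
def iou_wh(box, centroid):
    """IoU of two boxes given as (width, height) and aligned at a common corner."""
    w1, h1 = box
    w2, h2 = centroid
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union


def distance(box, centroid):
    """Eq. (1): d(g, h) = 1 - I(g, h)."""
    return 1.0 - iou_wh(box, centroid)


def assign(boxes, centroids):
    """One K-means assignment step: index of the nearest centroid for each box."""
    return [
        min(range(len(centroids)), key=lambda k: distance(box, centroids[k]))
        for box in boxes
    ]


# Hypothetical annotated boxes (two tall, two wide) and two cluster centers.
boxes = [(10, 20), (12, 22), (40, 15), (42, 18)]
centroids = [(11, 21), (41, 16)]
labels = assign(boxes, centroids)
```

Because the IoU is a ratio of areas, scaling both boxes by the same factor leaves the distance unchanged, which is the size independence the text refers to.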

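The resulting priors are mostly tall and narrow; flat, wide ones are fewer. To balance model complexity against recall, the number of priors is set to 5. Yolov2 predicts directly on the feature map of the last layer of the detection network using these priors: each grid cell predicts 5 bounding boxes, and each box has 5 predicted values, ${t_x}, {t_y}, {t_{\rm{w}}}, {t_{\rm{h}}}$ and a confidence value ${t_o}$. Introducing priors can make the model unstable during training, so the predicted location is constrained relative to the grid cell. Let ${p_{\rm{w}}}$ and ${p_{\rm{h}}}$ denote the width and height of the prior, let ${t_x}$ and ${t_y}$ be passed through the logistic function so that they fall in the range 0 to 1, and let ${c_x}$ and ${c_y}$ denote the offset of the grid cell from the top-left corner of the image. The corresponding predictions are

$ \left\{ \begin{array}{l} {b_x} = \sigma ({t_x}) + {c_x}\\ {b_y} = \sigma ({t_y}) + {c_y}\\ {b_{\rm{w}}} = {p_{\rm{w}}}{{\rm{e}}^{{t_{\rm{w}}}}}\\ {b_{\rm{h}}} = {p_{\rm{h}}}{{\rm{e}}^{{t_{\rm{h}}}}}\\ c = \sigma ({t_o}) \end{array} \right. $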

(3)

where $\sigma ({t_o})$ is the confidence value, $\sigma $ is the logistic activation function, and ${b_x}, {b_y}, {b_{\rm{w}}}, {b_{\rm{h}}}$ are the center coordinates, width, and height of the predicted box. Constraining the location prediction in this way makes the parameters easier to learn and the model more stable. The corresponding loss function ($loss$) is

$ \begin{array}{l} loss = {\lambda _{{\rm{coord}}}}\sum\limits_{i = 0}^{{s^2}} {\sum\limits_{j = 0}^K {\Delta _{ij}^{{\rm{obj}}}[{{({b_{xi}} - {{\hat b}_{xi}})}^2} + {{({b_{yi}} - {{\hat b}_{yi}})}^2}] + } } \\ {\lambda _{{\rm{coord}}}}\sum\limits_{i = 0}^{{s^2}} {\sum\limits_{j = 0}^K {\Delta _{ij}^{{\rm{obj}}}[{{(\sqrt {{b_{{\rm{w}}i}}} - \sqrt {{{\hat b}_{{\rm{w}}i}}} )}^2} + {{(\sqrt {{b_{{\rm{h}}i}}} - \sqrt {{{\hat b}_{{\rm{h}}i}}} )}^2}] + } } \\ \sum\limits_{i = 0}^{{s^2}} {\sum\limits_{j = 0}^K {\Delta _{ij}^{{\rm{obj}}}{{({C_{ij}} - {{\hat C}_{ij}})}^2} + } } \\ {\lambda _{{\rm{noobj}}}}\sum\limits_{i = 0}^{{s^2}} {\sum\limits_{j = 0}^K {\Delta _{ij}^{{\rm{noobj}}}{{({C_{ij}} - {{\hat C}_{ij}})}^2} + } } \\ \sum\limits_{i = 0}^{{s^2}} {\Delta _i^{{\rm{obj}}}} \sum\limits_{c \in {\rm{classes}}} {{{({p_i}(c) - {{\hat p}_i}(c))}^2}} \end{array} $

(4)

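where ${s^2}$ is the number of grid cells into which the image is divided (the $S \times S$ grid above), $K$ is the number of boxes predicted per cell, ${\hat C_{ij}}$ is the confidence of the object in the predicted box, ${C_{ij}}$ is the confidence of the object in the annotated box, ${\hat p_i}(c)$ is the predicted probability that the cell contains an object and that the object belongs to class $c$, ${p_i}(c)$ is the true conditional class probability of the cell, $\Delta _{ij}^{{\rm{obj}}}$ indicates that an object is present in cell $i$ and that the $j$-th box predicted by that cell is responsible for it (with $\Delta _{ij}^{{\rm{noobj}}}$ marking the boxes for which this is not the case), $\Delta _i^{{\rm{obj}}}$ indicates whether an object appears in cell $i$, and ${\lambda _{{\rm{coord}}}}$ and ${\lambda _{{\rm{noobj}}}}$ are the weighting coefficients of the location term and of the confidence term for boxes that contain no object, respectively.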
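As an illustration of the box decoding in Eq. (3), the following minimal sketch maps the five raw network outputs of one box to a center, a size, and a confidence. It writes the prior width and height as `p_w` and `p_h`, following the notation of Redmon and Farhadi^[16]; the function names and the sample values are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic function, which squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t, cell, prior):
    """Decode raw outputs t = (t_x, t_y, t_w, t_h, t_o) for one prior box.

    cell  = (c_x, c_y): offset of the grid cell from the top-left image corner
    prior = (p_w, p_h): width and height of the prior (assumed notation)
    """
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x      # center stays inside the responsible cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)     # prior size rescaled by a positive factor
    b_h = p_h * math.exp(t_h)
    confidence = sigmoid(t_o)
    return b_x, b_y, b_w, b_h, confidence


# All-zero outputs put the center in the middle of cell (3, 4) and keep the prior size.
box = decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(2.0, 5.0))
```

Whatever the value of `t_x`, the decoded center cannot leave the cell, which is the constraint that stabilizes training.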

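1.2.2 Network structure for discriminative region detection

Yolov2 introduces the Darknet19 classification network, which borrows from the VGG (visual geometry group network)^[18] classification architecture and consists of 19 convolutional layers and 5 max-pooling layers. Darknet19 mostly uses 3×3 kernels and doubles the number of channels after every pooling step, as shown in Table 1.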

Table 1 Network structure of Darknet19

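Type	Filters	Size/stride	Output (pixels)
Convolution	32	3×3	224×224
Max pooling		2×2/2	112×112
Convolution	64	3×3	112×112
Max pooling		2×2/2	56×56
Convolution	128	3×3	56×56
Convolution	64	1×1	56×56
Convolution	128	3×3	56×56
Max pooling		2×2/2	28×28
Convolution	256	3×3	28×28
Convolution	128	1×1	28×28
Convolution	256	3×3	28×28
Max pooling		2×2/2	14×14
Convolution	512	3×3	14×14
Convolution	256	1×1	14×14
Convolution	512	3×3	14×14
Convolution	256	1×1	14×14
Convolution	512	3×3	14×14
Max pooling		2×2/2	7×7
Convolution	1 024	3×3	7×7
Convolution	512	1×1	7×7
Convolution	1 024	3×3	7×7
Convolution	512	1×1	7×7
Convolution	1 024	3×3	7×7
Convolution	1 000	1×1	7×7
Average pooling		Global	1 000
Softmax			1 000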
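The layer counts and output sizes in Table 1 can be checked with a small trace. The sketch assumes that every convolution is padded so that it preserves the spatial size and that each 2×2/2 max pooling halves it; the `BLOCKS` list simply transcribes the filter counts from the table, and the names are hypothetical.

```python
# Filter counts of the convolutions between consecutive pooling layers (Table 1).
BLOCKS = [
    [32],
    [64],
    [128, 64, 128],
    [256, 128, 256],
    [512, 256, 512, 256, 512],
    [1024, 512, 1024, 512, 1024],
]


def trace(input_size=224):
    """Return (layer type, filters, output side length) for each layer up to the final conv."""
    size = input_size
    rows = []
    for index, block in enumerate(BLOCKS):
        for filters in block:
            rows.append(("conv", filters, size))
        if index < len(BLOCKS) - 1:      # a max pooling follows every block but the last
            size //= 2
            rows.append(("maxpool", None, size))
    rows.append(("conv", 1000, size))    # final 1x1 classification convolution
    return rows


rows = trace(224)
num_conv = sum(1 for kind, _, _ in rows if kind == "conv")
num_pool = sum(1 for kind, _, _ in rows if kind == "maxpool")
```

For a 224×224 input the trace ends at a 7×7 map, since five halvings divide the side length by 32.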

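The Yolov2 detection network is obtained by modifying the classification network. Darknet19 is first trained on the ImageNet^[19] dataset with 224×224 pixel inputs for 160 epochs, and then fine-tuned for 10 epochs on 448×448 pixel images so that the convolution kernels adapt better to high-resolution input. The last convolutional layer of the original network is removed and three 3×3 convolutional layers with 1 024 kernels each are added, followed by a final 1×1 convolutional layer with 125 kernels, the number of outputs needed for detection at each grid cell. To capture fine-grained features, a passthrough layer is added that stacks the features of the last 3×3×512 convolutional layer with those of the second-to-last convolutional layer as additional channels, which helps detect small targets (fine-grained features).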
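The figure of 125 outputs per grid cell follows from simple arithmetic, sketched below. The sketch assumes the configuration of the Yolov2 paper^[16] on a 20-class detection dataset: each of the 5 prior boxes predicts 4 coordinates, 1 confidence value, and 20 class scores. The function name is hypothetical.

```python
def detection_channels(num_priors, num_classes):
    """Outputs per grid cell: every prior predicts 4 coordinates, 1 confidence, and class scores."""
    return num_priors * (4 + 1 + num_classes)


print(detection_channels(num_priors=5, num_classes=20))  # 125
```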

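1.3 Feature extraction and recognition based on the bilinear convolutional neural network

1.3.1 Structure of the bilinear convolutional neural network model

The bilinear convolutional neural network model is a quadruple $\boldsymbol{\beta} = \left( {{f_A}, {f_B}, P, C} \right)$, where ${f_A}$ and ${f_B}$ are two feature extraction functions based on convolutional neural networks, corresponding to network $A$ and network $B$ in Fig. 2, $P$ is a pooling function, and $C$ is a classification function. A feature extraction function $f$ takes an image $\boldsymbol{i} \in \boldsymbol{I}$ and a location $\boldsymbol{l} \in \boldsymbol{L}$ and outputs a feature of size $K \times D$. The outputs at each location are combined by the matrix outer product, which fuses the features of ${f_A}$ and ${f_B}$ at location $\boldsymbol{l}$ into a bilinear feature: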

$ \begin{array}{l} b(\mathit{\boldsymbol{l}},\mathit{\boldsymbol{i}},{f_A},{f_B}) = {f_A}{\left( {\mathit{\boldsymbol{l}},\mathit{\boldsymbol{i}}} \right)^{\rm{T}}}{f_B}\left( {\mathit{\boldsymbol{l}},\mathit{\boldsymbol{i}}} \right)\\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\mathit{\boldsymbol{l}} \in \mathit{\boldsymbol{L}},\mathit{\boldsymbol{i}} \in \mathit{\boldsymbol{I}} \end{array} $

(5)

Fig. 2 Bilinear convolutional neural network architecture

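where ${f_A}$ and ${f_B}$ must have the same feature dimension $K$, whose value depends on the specific model. The pooling function $P$ aggregates the bilinear features over all locations to obtain the global image feature $\phi \left( \boldsymbol{I} \right)$: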

$ \begin{array}{l} \phi \left( \mathit{\boldsymbol{I}} \right) = \sum\limits_{\mathit{\boldsymbol{l}} \in \mathit{\boldsymbol{L}}} {b(\mathit{\boldsymbol{l}},\mathit{\boldsymbol{i}},{f_A},{f_B})} = \sum\limits_{\mathit{\boldsymbol{l}} \in \mathit{\boldsymbol{L}}} {{f_A}{{\left( {\mathit{\boldsymbol{l}},\mathit{\boldsymbol{i}}} \right)}^{\rm{T}}}{f_B}\left( {\mathit{\boldsymbol{l}},\mathit{\boldsymbol{i}}} \right)} \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\mathit{\boldsymbol{i}} \in \mathit{\boldsymbol{I}} \end{array} $

(6)

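Because pooling discards the location of the features, the bilinear feature $\phi \left( \boldsymbol{I} \right)$ is an orderless representation. If the features extracted by ${f_A}$ and ${f_B}$ have sizes $K \times M$ and $K \times N$, then $\phi \left( \boldsymbol{I} \right)$ is an $M \times N$ matrix, which is reshaped into an $MN \times 1$ column vector that serves as the final bilinear feature vector. Classification is then performed by a softmax layer.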
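A minimal sketch of the sum pooling in Eq. (6) follows. It assumes that each network contributes one descriptor per location (an $M$-dimensional vector from network $A$ and an $N$-dimensional vector from network $B$) and uses plain Python lists; the function names and the toy features are hypothetical.

```python
def bilinear_pool(feats_a, feats_b):
    """Sum over locations of the outer product of the two descriptors (Eq. (6)).

    feats_a: list over locations of M-dimensional descriptors from network A
    feats_b: list over locations of N-dimensional descriptors from network B
    Returns an M x N matrix as a list of rows.
    """
    m, n = len(feats_a[0]), len(feats_b[0])
    phi = [[0.0] * n for _ in range(m)]
    for fa, fb in zip(feats_a, feats_b):
        for i in range(m):
            for j in range(n):
                phi[i][j] += fa[i] * fb[j]
    return phi


def flatten(matrix):
    """Reshape the M x N matrix into a vector of length M*N (row-major)."""
    return [value for row in matrix for value in row]


# Toy example: 2 locations, M = 2, N = 3.
feats_a = [[1.0, 2.0], [3.0, 4.0]]
feats_b = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
phi = bilinear_pool(feats_a, feats_b)
vector = flatten(phi)
```

Visiting the locations in a different order gives the same matrix, which is what makes the representation orderless.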

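1.3.2 Feature extraction

B-CNN uses the D-Net (VGG) and M-Net^[20] architectures, which consist of convolutional and pooling layers. Because the target datasets are small, this paper uses networks pre-trained on ImageNet and takes features from the relu5 layer of M-Net and the relu5_3 layer of D-Net. In the forward pass, ${f_A}$ and ${f_B}$ can share no layers, some layers, or all layers, as shown in Fig. 3. The experiments use the fully shared form, that is, two identical D-Net networks truncated at relu5_3. Input images are resized to 448×448 pixels, for which D-Net outputs 512 feature maps of size 28×28. The outer product followed by flattening yields a bilinear feature vector of 512×512 dimensions.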

Fig. 3 Forward operation modes of the bilinear convolutional neural network ((a) no sharing; (b) partially shared; (c) fully shared)

1.3.3 Normalization and classification

To improve classification accuracy, the bilinear feature $\boldsymbol{x} = \phi \left( \boldsymbol{I} \right)$ is passed through a signed square root followed by ${l_2}$ normalization:

$ \mathit{\boldsymbol{y}} = {\rm{sgn}}\left( \mathit{\boldsymbol{x}} \right)\sqrt {\left| \mathit{\boldsymbol{x}} \right|} $

(7)

$ \mathit{\boldsymbol{z}} = \frac{\mathit{\boldsymbol{y}}}{{{{\left\| \mathit{\boldsymbol{y}} \right\|}_2}}} $

(8)

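After reasonably accurate detection and feature extraction, a conventional classifier such as logistic regression or a linear SVM can be used for recognition. Here a softmax classification layer is used.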
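The two normalization steps in Eqs. (7) and (8) can be sketched as follows for a flattened bilinear feature. The small `eps` added to the norm is an assumption of this sketch, introduced only to avoid division by zero for an all-zero vector; the function names are hypothetical.

```python
import math


def signed_sqrt(x):
    """Eq. (7): y = sgn(x) * sqrt(|x|), applied element-wise."""
    return [math.copysign(math.sqrt(abs(value)), value) for value in x]


def l2_normalize(y, eps=1e-12):
    """Eq. (8): z = y / ||y||_2 (eps is an added safeguard, not part of the equation)."""
    norm = math.sqrt(sum(value * value for value in y))
    return [value / (norm + eps) for value in y]


x = [4.0, -9.0, 0.0]
y = signed_sqrt(x)
z = l2_normalize(y)
```

The signed square root compresses large entries while keeping their sign, and the second step rescales the vector to unit length.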

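1.3.4 End-to-end training

The B-CNN architecture is a directed acyclic graph. Its parameters are trained by back-propagating the gradient of a classification loss such as cross-entropy, and the bilinear form keeps the gradient computation simple.

If the outputs of the two networks are matrices $\boldsymbol{A}$ and $\boldsymbol{B}$ of sizes $L \times M$ and $L \times N$, the bilinear feature is $\boldsymbol{x} = {\boldsymbol{A}^{\rm{T}}}\boldsymbol{B}$, of size $M \times N$. Let $\frac{{{\rm{d}}l}}{{{\rm{d}}\boldsymbol{x}}}$ denote the gradient of the loss $l$ with respect to $\boldsymbol{x}$; by the chain rule,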

$ \frac{{{\rm{d}}l}}{{{\rm{d}}\mathit{\boldsymbol{A}}}} = \mathit{\boldsymbol{B}}{\left( {\frac{{{\rm{d}}l}}{{{\rm{d}}\mathit{\boldsymbol{x}}}}} \right)^{\rm{T}}},\frac{{{\rm{d}}l}}{{{\rm{d}}\mathit{\boldsymbol{B}}}} = \mathit{\boldsymbol{A}}\left( {\frac{{{\rm{d}}l}}{{{\rm{d}}\mathit{\boldsymbol{x}}}}} \right) $

(9)

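With the gradients with respect to $\boldsymbol{A}$ and $\boldsymbol{B}$ available, the whole model can be trained end to end; the gradient flow is shown in Fig. 4. The remaining parts are trained as in an ordinary CNN.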
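The gradients in Eq. (9) can be checked numerically. The sketch below assumes a toy loss that is a fixed weighted sum of the entries of $x = A^{\rm T}B$, so that the gradient of the loss with respect to $x$ is simply the weight matrix `W`; it then compares the analytic expressions with central finite differences. All names and values are hypothetical.

```python
def matmul(p, q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
            for i in range(len(p))]


def transpose(p):
    return [list(row) for row in zip(*p)]


def loss(a, b, w):
    """Toy loss l = sum_ij W_ij * x_ij with x = A^T B, so dl/dx = W."""
    x = matmul(transpose(a), b)
    return sum(w[i][j] * x[i][j] for i in range(len(w)) for j in range(len(w[0])))


def analytic_grads(a, b, w):
    """Eq. (9): dl/dA = B (dl/dx)^T and dl/dB = A (dl/dx)."""
    return matmul(b, transpose(w)), matmul(a, w)


def numeric_grad(fn, mat, eps=1e-6):
    """Central finite differences of fn() with respect to the entries of mat."""
    grad = [[0.0] * len(mat[0]) for _ in mat]
    for i in range(len(mat)):
        for j in range(len(mat[0])):
            original = mat[i][j]
            mat[i][j] = original + eps
            up = fn()
            mat[i][j] = original - eps
            down = fn()
            mat[i][j] = original
            grad[i][j] = (up - down) / (2 * eps)
    return grad


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]            # L x M = 3 x 2
B = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 0.0]]  # L x N = 3 x 3
W = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]             # M x N = 2 x 3, plays the role of dl/dx
grad_a, grad_b = analytic_grads(A, B, W)
num_a = numeric_grad(lambda: loss(A, B, W), A)
num_b = numeric_grad(lambda: loss(A, B, W), B)
```

The shapes also confirm the formula: `grad_a` is $L \times M$ like $A$, and `grad_b` is $L \times N$ like $B$.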

Fig. 4 Gradient flow of B-CNN

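2 Experimental results and analysis

2.1 Datasets

To evaluate the improved algorithm, tests were run on three commonly used standard fine-grained classification datasets: CUB-200-2011^[21], Cars196^[22], and Aircrafts100^[23]. Sample images from several classes of each dataset are shown in Fig. 5.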

Fig. 5 Example images from the datasets ((a) CUB-200-2011 dataset; (b) Cars196 dataset; (c) Aircrafts100 dataset)

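CUB-200-2011 is the most classic and widely used dataset in fine-grained image classification. It contains 11 788 bird images of 200 classes in varied poses, split into training and test sets of roughly equal size. It also provides rich manual annotation: each image has 15 part locations, 1 bounding box, and a class label. Cluttered backgrounds, the varying share of the image occupied by the target, and frequent occlusion and pose changes make classification on this dataset challenging.

Cars196 contains 16 185 images of 196 classes of vehicles that differ in make, year, and model, split into a training set of 8 144 images and a test set of 8 041 images. Only bounding boxes and class labels are provided.

Aircrafts100 provides photographs of 100 aircraft classes, 100 per class and 10 000 in total, split into training, validation, and test sets of 3 334, 3 333, and 3 333 images. Only bounding boxes and class labels are provided. Fine-grained datasets are characterized by large intra-class variation and small inter-class variation: in Fig. 5(c), for example, the classes Boeing737-300 and Boeing737-400 are very similar in shape and color.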

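2.2 Experimental results and analysis

First, the pre-trained Yolov2 model extracts the discriminative regions from the three standard datasets CUB-200-2011, Cars196, and Aircrafts100, and the regions are resized to 448×448 pixels. The bilinear convolutional neural network is then fine-tuned in two steps: 1) the number of classes in the classification layer is replaced by the number of classes in the fine-grained dataset; 2) the parameters of the last layer are randomly initialized and only that layer is trained. Next, with a relatively small learning rate of 0.001, the whole model is fine-tuned by back-propagation with stochastic gradient descent for between 45 and 100 iterations. The final classification accuracies on CUB-200-2011, Cars196, and Aircrafts100 are 84.5%, 92%, and 88.4%, which are 0.4, 0.7, and 3.9 percentage points higher than the results obtained on the datasets without discriminative-region extraction. The comparison with other methods is given in Table 2.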

Table 2 Classification results on three datasets

Method	CUB-200-2011 accuracy/%	Cars196 accuracy/%	Aircrafts100 accuracy/%
FV-SIFT^[15]	18.8	59.2	61.0
FV+SIFT^[24]	-	82.7	80.7
FV-CNN(M)^[15]	64.1	77.2	71.2
FV-CNN(D)^[15]	74.7	85.7	78.7
FC-CNN(M)^[15]	58.8	58.6	63.4
FC-CNN(D)^[15]	70.4	79.8	76.6
B-CNN(M)^[15]	78.1	86.5	79.5
B-CNN(D)^[15]	84.0	90.6	83.9
B-CNN(M, D)^[15]	84.1	91.3	84.5
Part-based R-CNN^[7]	73.9	-	-
STNs^[25]	84.1	-	-
BaseNet + SegNet^[26]	-	86.74	83.43
Ours (Yolov2+B-CNN(D))	84.5	92.0	88.4
Note: in parentheses, M denotes M-Net and D denotes D-Net; "-" means the cited work did not report results on that dataset.

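In Table 2, FV (Fisher vector)-SIFT^[15] uses a dictionary of size 256 and a one-level spatial pyramid. FV+SIFT^[24] uses a dictionary of size 1 024 with multi-scale SIFT features and a two-level spatial pyramid with 1×1 and 3×1 regions. Classification accuracy with traditional FV features is clearly lower than that of the proposed model. 1) FV-CNN^[15] applies FV encoding to CNN features, using two different networks, M-Net and D-Net, to extract relu5 and relu5_3 features respectively. Because FV builds its codebook with a Gaussian mixture model and the encoded vectors are usually not sparse, accuracy is not high. 2) FC-CNN^[15] likewise uses M-Net and D-Net to classify the target datasets, with limited accuracy. 3) B-CNN has to learn two convolutional networks at once; they may be symmetric (two identical M-Nets or two identical D-Nets) or asymmetric (M-Net together with D-Net). On CUB-200-2011, Cars196, and Aircrafts100, B-CNN with the asymmetric M-Net and D-Net pair reaches best accuracies of 84.1%, 91.3%, and 84.5%, which are 0.4, 0.7, and 3.9 percentage points below the proposed method. With two identical D-Nets the best accuracies are 84.0%, 90.6%, and 83.9%, which are 0.5, 1.4, and 4.5 percentage points below the proposed method. 4) Part-based R-CNN relies on selective search, which produces many irrelevant candidate regions and wastes computation; on the bird dataset its accuracy is 10.6 percentage points below ours. 5) STNs^[25] introduce a learnable, parameterized layer that learns the spatial transformation of an image or feature map for the task at hand; the layer can be inserted into any CNN or FCN (fully convolutional network) and improves the network's learning ability, but its accuracy is still below that of the proposed method. 6) BaseNet+SegNet^[26] first performs an initial classification of the fine-grained dataset with a CNN to obtain a base model. The trained BaseNet then generates a top-down attention map, which initializes the GraphCut^[27] algorithm to segment the key target region and so make the image more discriminative. Finally, CNN features are extracted from the segmented image for fine-grained classification. Its accuracies on Cars196 and Aircrafts100 are 86.74% and 83.43%, but the procedure is rather involved and the accuracy is lower than ours. 7) The proposed method first uses a pre-trained Yolov2 model to locate the discriminative region quickly and then trains and classifies with B-CNN using only class labels; its accuracy exceeds that of every compared algorithm in Table 2 on all three datasets.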
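The percentage-point gains quoted above can be reproduced directly from the rows of Table 2. The short sketch below copies three rows of the table into a dictionary (the variable and function names are hypothetical) and subtracts the baselines from the proposed method, dataset by dataset.

```python
# Accuracy/% on (CUB-200-2011, Cars196, Aircrafts100), copied from Table 2.
RESULTS = {
    "B-CNN(D)": (84.0, 90.6, 83.9),
    "B-CNN(M, D)": (84.1, 91.3, 84.5),
    "Yolov2+B-CNN(D)": (84.5, 92.0, 88.4),
}


def gains(ours, baseline):
    """Percentage-point difference per dataset, rounded to one decimal place."""
    return tuple(round(o - b, 1) for o, b in zip(ours, baseline))


ours = RESULTS["Yolov2+B-CNN(D)"]
gain_over_md = gains(ours, RESULTS["B-CNN(M, D)"])
gain_over_dd = gains(ours, RESULTS["B-CNN(D)"])
```

Both tuples agree with the differences stated in the text for the asymmetric and the symmetric B-CNN baselines.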

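2.3 Time complexity analysis

To assess the complexity of the proposed method, its test-time complexity and running time are compared with those of traditional models and of the original B-CNN model.

If a traditional feature has dimension $m$ and there are $n$ samples, the test-time complexity of a linear SVM is O($m\times n$). In reference [15], the low-level SIFT descriptor has 128 dimensions and the feature of each image has 65 536 dimensions. The feature of each image in the proposed model has 512×512×$k$ dimensions ($k$ depends on the dataset), far more than the traditional feature. Among CNN classification models, VGG has about 138 M parameters, whereas the B-CNN model used here has 58.9 M, clearly fewer. On a Tesla K40 GPU, image features are extracted at 87 frames/s by B-CNN(M), 10 frames/s by B-CNN(D), and 8 frames/s by B-CNN(M, D). The proposed method adds discriminative-region detection to B-CNN; on the same Tesla K40 GPU, Yolov2 detects the discriminative regions at 4 frames/s. Relative to the original B-CNN, the extra running time of the proposed model is the time Yolov2 takes to detect the discriminative region. The feature extraction speeds of the methods are listed in Table 3, which shows that the proposed method increases the time cost only modestly.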

Table 3 The speeds of feature extraction

Method	Speed/(frames/s)
FV-SIFT^[15]	10
part-based R-CNN^[7]	1/47
B-CNN(M)^[15]	87
B-CNN(D)^[15]	10
B-CNN(M, D)^[15]	8
Ours	4

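3 Conclusion

To address background interference in fine-grained image classification, this paper proposes a deep-learning-based focus-recognition framework that first detects the discriminative region in the image and then recognizes it. Specifically, a pre-trained Yolov2 model focuses on the discriminative region, and the result is fed to a B-CNN model, which learns from and classifies the target region and builds a more discriminative feature representation, improving classification performance. The experiments show that adding a focusing step to the recognition model further improves fine-grained classification accuracy. Compared with the original B-CNN, introducing Yolov2 filters out much of the image content that does not contribute to fine-grained classification, so that B-CNN learns more discriminative features and the model becomes more robust. The whole model uses only image class labels and can therefore be applied to other image collections, which makes it more general.

References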

[1] Lowe D G. Object recognition from local scale-invariant features[C]//Proceedings of the 7th IEEE International Conference on Computer Vision. Kerkyra, Greece: IEEE, 1999, 2: 1150-1157.[DOI: 10.1109/ICCV.1999.790410]

[2] Dalal N, Triggs B. Histograms of oriented gradients for human detection[C]//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, CA, USA: IEEE, 2005, 1: 886-893.[DOI: 10.1109/CVPR.2005.177]

[3] Jégou H, Douze M, Schmid C, et al. Aggregating local descriptors into a compact image representation[C]//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA: IEEE, 2010: 3304-3311.[DOI: 10.1109/CVPR.2010.5540039]

[4] Perronnin F, Dance C. Fisher kernels on visual vocabularies for image categorization[C]//Proceedings of 2007 IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis, MN, USA: IEEE, 2007: 1-8.[DOI: 10.1109/CVPR.2007.383266]

[5] Sánchez J, Perronnin F, Mensink T, et al. Image classification with the fisher vector:theory and practice[J]. International Journal of Computer Vision, 2013, 105(3): 222–245. [DOI:10.1007/s11263-013-0636-x]

[6] Weng Y C, Tian Y, Lu D M, et al. Fine-grained bird classification based on deep region networks[J]. Journal of Image and Graphics, 2017, 22(11): 1521–1531. [翁雨辰, 田野, 路敦民, 等. 深度区域网络方法的细粒度图像分类[J]. 中国图象图形学报, 2017, 22(11): 1521–1531. ]

[7] Zhang N, Donahue J, Girshick R, et al. Part-based R-CNNs for fine-grained category detection[C]//Proceedings of the 13th European Conference on Computer Vision-ECCV 2014. Zurich, Switzerland: Springer, 2014: 834-849.[DOI: 10.1007/978-3-319-10590-1_54]

[8] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 580-587.[DOI: 10.1109/CVPR.2014.81]

[9] Uijlings J R R, Van De Sande K E A, Gevers T, et al. Selective search for object recognition[J]. International Journal of Computer Vision, 2013, 104(2): 154–171. [DOI:10.1007/s11263-013-0620-5]

[10] Branson S, Van Horn G, Belongie S, et al. Bird species categorization using pose normalized deep convolutional nets[EB/OL].[2018-07-10]. https://arxiv.org/pdf/1406.2952.pdf.

[11] Branson S, Beijbom O, Belongie S. Efficient large-scale structured learning[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 1806-1813.[DOI: 10.1109/CVPR.2013.236]

[12] Xiao T J, Xu Y C, Yang K Y, et al. The application of two-level attention models in deep convolutional neural network for fine-grained image classification[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 842-850.[DOI: 10.1109/CVPR.2015.7298685]

[13] Zhang Y, Wei X S, Wu J X, et al. Weakly supervised fine-grained categorization with part-based image representation[J]. IEEE Transactions on Image Processing, 2016, 25(4): 1713–1725. [DOI:10.1109/TIP.2016.2531289]

[14] Simon M, Rodner E. Neural activation constellations: unsupervised part model discovery with convolutional networks[C]//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1143-1151.[DOI: 10.1109/ICCV.2015.136]

[15] Lin T Y, RoyChowdhury A, Maji S. Bilinear convolutional neural networks for fine-grained visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6): 1309–1322. [DOI:10.1109/TPAMI.2017.2723400]

[16] Redmon J, Farhadi A. Yolo9000: better, faster, stronger[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017.[DOI: 10.1109/CVPR.2017.690]

[17] Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 779-788.[DOI: 10.1109/CVPR.2016.91]

[18] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL].[2017-07-10].https://arxiv.org/pdf/1409.1556.pdf.

[19] Deng J, Dong W, Socher R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA: IEEE, 2009: 248-255.[DOI: 10.1109/CVPR.2009.5206848]

[20] Chatfield K, Simonyan K, Vedaldi A, et al. Return of the devil in the details: delving deep into convolutional nets[EB/OL].[2017-07-10].https://arxiv.org/pdf/1405.3531.pdf.

[21] Wah C, Branson S, Welinder P, et al. The caltech-UCSD birds-200-2011 dataset[R]. California: California Institute of Technology, 2011.

[22] Krause J, Stark M, Deng J, et al. 3D object representations for fine-grained categorization[C]//Proceedings of 2013 IEEE International Conference on Computer Vision Workshops. Sydney, NSW, Australia: IEEE, 2013: 554-561.[DOI:10.1109/ICCVW.2013.77]

[23] Maji S, Rahtu E, Kannala J, et al. Fine-grained visual classification of aircraft[EB/OL].[2017-07-10].https://arxiv.org/pdf/1306.5151.pdf.

[24] Gosselin P H, Murray N, Jégou H, et al. Revisiting the fisher vector for fine-grained classification[J]. Pattern Recognition Letters, 2014, 49: 92–98. [DOI:10.1016/j.patrec.2014.06.011]

[25] Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks[C]//Proceedings of the 29th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press, 2015: 2017-2025.

[26] Feng Y S, Wang Z L. Fine-grained image categorization with segmentation based on top-down attention map[J]. Journal of Image and Graphics, 2016, 21(9): 1147–1154. [冯语姗, 王子磊. 自上而下注意图分割的细粒度图像分类[J]. 中国图象图形学报, 2016, 21(9): 1147–1154. ] [DOI:10.11834/jig.20160904]

[27] Boykov Y Y, Jolly M P. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images[C]//Proceedings of the 8th IEEE International Conference on Computer Vision. Vancouver, BC, Canada: IEEE, 2001, 1: 105-112.[DOI: 10.1109/ICCV.2001.937505]