Published: 2017-07-16
DOI: 10.11834/jig.160497
2017 | Volume 22 | Number 7

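Image Analysis and Recognition

Hierarchical B-CNN model guided by classification error for fine-grained classification

Shen Haihong¹, Yang Xing¹, Wang Lingfeng², Pan Chunhong²

1. School of Information Engineering, China University of Geosciences (Beijing), Beijing 100083, China;

2. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China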

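Received: 2016-10-11; Revised: 2017-04-13

Funding: Young Backbone Teachers Program of the China Scholarship Council, Ministry of Education (201406405014); Fundamental Research Funds for the Central Universities (2-9-2013-083); National Natural Science Foundation of China (61403376, 91338202)

First author: Shen Haihong (born 1970), female, associate professor. She received her Ph.D. in communication and information systems from Beihang University in 2010. Her main research interest is communications. E-mail: haihong_shen@163.com

CLC number: TP391

Document code: A

Article ID: 1006-8961(2017)07-0906-09

Abstract

Objective Fine-grained classification has attracted growing attention from researchers in recent years; its difficulty is that the differences among target categories are very small. To address this, a hierarchical bilinear convolutional neural network model guided by classification error is proposed. Method The core idea of the model is to retrain and reclassify, group by group, the categories that the bilinear convolutional neural network (B-CNN) algorithm tends to misclassify or confuse. First, to obtain the error-prone categories, a clustering algorithm guided by classification error is proposed. The algorithm is based on the constrained Laplacian rank (CLR) clustering model, and its core "affinity matrix" is constructed from the "classification error matrix". Second, a new hierarchical B-CNN model is built on the clustering result. Result Experiments were conducted with the hierarchical B-CNN model guided by classification error on three standard datasets, CUB-200-2011, FGVC-Aircraft-2013b, and Stanford-cars. Compared with the single-layer B-CNN model, the classification accuracy increased from 84.35%, 83.56%, and 89.45% to 84.67%, 84.11%, and 89.78%, respectively, which verifies the effectiveness of the proposed algorithm. Conclusion This paper proposes using the classification error matrix to guide clustering and then reclassifying. Compared with an affinity matrix constructed from feature similarity, the classification error matrix targets the classification problem directly and can effectively improve the classification accuracy of confusable categories. The method is aimed at similar targets, especially very similar ones: grouping the easily misclassified and confused targets and then retraining and reclassifying them improves the classification results, which makes the method suitable for fine-grained classification.

Keywords

fine-grained classification; classification error; hierarchical model; bilinear convolutional neural network; constrained Laplacian rank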

Hierarchical B-CNN model guided by classification error for fine-grained classification

Shen Haihong¹, Yang Xing¹, Wang Lingfeng², Pan Chunhong²

1. School of Information Engineering, China University of Geosciences(Beijing), Beijing 100083, China;

2. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Supported by: National Natural Science Foundation of China (61403376, 91338202)

Abstract

Objective Fine-grained classification has gained increasing attention in recent years. The subtle differences among categories remain challenging and could be addressed by localizing the parts of the object. This step requires a considerable amount of manual work. In this regard, bilinear convolutional neural network(B-CNN) models have been established using two feature extractors to represent an image. B-CNN only needs image labels and yields accurate results. However, B-CNN cannot distinguish confusing categories because the networks and classifiers are trained on all training images. We propose a hierarchical B-CNN model guided by classification error to combine confusing categories. We then retrain and reclassify the categories. The model can distinguish the categories and improve the classification accuracy on the fine-grained classification targets. Method This work mainly aims to retrain and reclassify confusing categories. First, we propose a clustering algorithm guided by classification error to obtain clusters containing frequently misclassified categories. The algorithm is established using constrained Laplacian rank(CLR) method, and the "affinity matrix" is constructed by the "classification error matrix." Considering that the labels of the test images are unknown, we conduct experiments on validation images. The classification error matrix is obtained by comparing the classification results and the real labels of the validation images. Second, we propose a new hierarchical B-CNN model. In the first layer, the networks and classifiers are trained on the entire training sets, and the test images are preliminarily classified. In the second layer, the networks and classifiers are trained on each cluster, and the test sets are reclassified. We select three datasets, namely, CUB-200-2011, FGVC-Aircraft-2013b, and Stanford-cars. 
First, we train the networks and classifiers on the entire training sets and obtain the classification results of the validation images. The datasets of CUB-200-2011 and Stanford-cars do not have validation sets; as such, a part of the training set is randomly assigned as the validation set. We obtain the "classification error matrix" from the classification errors on the validation set. The matrix comprises two columns designated for the classification result and the real label. Second, we construct the "affinity matrix" of size $c$×$c$ for the dataset containing $c$ categories, where ($i$, $j$) refers to the frequency at which the samples of the $i$th category are misclassified to the $j$th category. We also normalize the "affinity matrix" to obtain improved cluster results. All samples are divided into different groups by using the CLR algorithm. Each group contains only a few categories that are easily confused with one another. Finally, we extract training and testing sets by groups for retraining and reclassification. We retrain the convolutional neural networks and the SVM classifiers on each group of the training set. We re-extract the features of the corresponding testing set and reclassify them. We conduct other experiments to verify the effectiveness of the proposed algorithm. First, we retrain only the SVM classifiers without retraining the convolutional neural networks for simplicity. Second, we retrain the SVM classifiers guided by the distance of the features instead of the classification error. Result The classification accuracies of the single B-CNN model for CUB-200-2011, FGVC-Aircraft-2013b, and Stanford-cars are 84.35%, 83.56%, and 89.45%, respectively, which increase to 84.48%, 84.01%, and 89.66%, respectively, after retraining the SVM classifiers and reclassifying the test samples guided by the classification error; moreover, the accuracies increase to 84.67%, 84.11%, and 89.78%, respectively, when the hierarchical B-CNN model is used. 
However, the accuracy of the results after retraining the SVM classifiers and reclassifying the test samples guided by the distance of the features is lower than that obtained using the single B-CNN model. Experimental results show that retraining SVM classifiers guided by classification error and retraining the networks can improve the classification accuracy. The accuracy of the result obtained with the distance of the features is low. Conclusion In B-CNN models, the networks and classifiers are constructed based on all training samples, resulting in confusing categories that are difficult to classify. In this paper, we propose a new hierarchical B-CNN model guided by classification error. In this model, clusters of the confusing categories are collected together. We retrain and reclassify each cluster to distinguish confusing categories. The classification error matrix is directly related to the classification problem and can provide higher classification accuracy than feature similarity. The experiment results based on the three datasets confirm that the proposed model can effectively improve the classification accuracy but requires a considerable amount of time. The model is suitable for fine-grained classification tasks, especially when dealing with similar targets. In our future work, we will develop our model in terms of two aspects. First, the number of clustering subsets after one cluster operation is relatively high, and the subsets can be further clustered. We will attempt to deepen the model from two to more layers. Second, this paper adopts clustering method for automatic selection. Other effective methods will be explored in our future studies.

Key words

fine-grained classification; classification error; hierarchical model; bilinear convolutional neural network (B-CNN); constrained Laplacian rank (CLR)

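0 Introduction

Fine-grained classification is currently an active research area. It classifies subcategories within the same broad category, such as bird species^[1-5], aircraft models^[6], car models^[2, 6-7], dog breeds^[7-8], flower varieties^[5, 9-10], and human faces^[11]. The targets of a fine-grained classification task are visually very similar; that is, the differences among subcategories are small, which is the main difficulty of the task. To accomplish fine-grained classification, researchers have mainly studied two aspects: feature extraction and classification based on the extracted features.

For image data, feature extraction methods fall roughly into two categories. The first divides the target into several parts, extracts the features of each part, and then combines all part features to classify the target^[12-15]. In fine-grained classification, the targets generally differ little: some of their parts may be very similar, and only a few parts are discriminative. Part-based features can distinguish these parts specifically and therefore often achieve good classification results. The main drawback of this approach is its complexity, because the parts of the target must be divided and localized, which requires a large amount of manual annotation.

The second category extracts robust global features. One example is VLAD^[16], which aggregates all feature points of an image into a single VLAD vector. This vector incorporates every dimension of every feature point, so it can be used directly to represent the whole image. Similar methods include FV-SIFT^[17-18] and FV-CNN^[19]. These methods are simple and need no manual annotation, but the extracted features do not describe image details or parts precisely, so their classification accuracy is lower than that of part-based methods. The B-CNN model is a feature extraction method proposed by Lin et al.^[20]. It extracts features at every location with two neural networks simultaneously and then combines the two features. Although the resulting feature is global, it also contains local information; therefore, B-CNN needs no manual annotation yet achieves accuracy comparable to that of part-based models.

The most widely used classifier at present is the support vector machine (SVM), which is well suited to linear classification (and also supports nonlinear classification). In typical fine-grained classification tasks, the number of categories is very large and each category has many samples. For data of this scale, linear and nonlinear classifiers perform comparably, but a linear classifier greatly reduces the computation time. The B-CNN model uses a linear SVM classifier.

In B-CNN, both the feature extraction networks and the classifier are trained on all training samples together. As a result, some similar, easily confused categories may not be well separated, and the discriminative power is low when there are many categories. To address the difficulty caused by this uniform treatment, this paper proposes a hierarchical B-CNN model guided by classification error. The hierarchical model has two layers: in the first layer, the B-CNN algorithm is trained on and classifies all samples; in the second layer, the categories that are easily misclassified or confused are singled out and trained and classified once more. The key is therefore to group these easily misclassified and confused categories so that they can be processed further. This paper uses the CLR algorithm^[21] for clustering, with the affinity matrix derived from the classification errors: the classification results on the validation set are compared with the true labels to obtain the classification errors, and clustering is performed according to these errors.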

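1 Clustering guided by classification error

In fine-grained classification tasks, certain categories are often easily confused with one another. If these categories are singled out and trained and classified separately, they can be distinguished better. The key is therefore to find these confusable categories. This paper proposes clustering based on classification error: the classification results on the validation set are tallied, and if category $i$ is often classified as category $j$, or category $j$ is often classified as category $i$, the two are regarded as a group of confusable categories. The first task is thus to merge these confusable categories. Here we adopt the CLR clustering algorithm proposed by Nie et al., which clusters automatically according to the affinities among all categories.

1.1 CLR clustering algorithm

The CLR algorithm was proposed by Nie et al. By imposing a constraint on the rank of the Laplacian, it learns the similarity matrix and the clustering simultaneously, which avoids the suboptimal results caused by learning the two separately. The algorithm solves the optimization problem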

$\begin{array}{c} \min\limits_{\boldsymbol{S},\boldsymbol{F}} \left\| \boldsymbol{S} - \boldsymbol{A} \right\|_{\rm F}^2 + 2\lambda\,{\rm tr}\left( \boldsymbol{F}^{\rm T} \boldsymbol{L}_S \boldsymbol{F} \right) \\ {\rm s.t.}\quad \sum\limits_j S_{ij} = 1,\ S_{ij} \geqslant 0,\ \boldsymbol{F} \in \boldsymbol{R}^{n \times k},\ \boldsymbol{F}^{\rm T}\boldsymbol{F} = \boldsymbol{I} \end{array}$

(1)

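where $\boldsymbol{A}$ is the given affinity matrix, $\boldsymbol{S} = \left\{ S_{ij} \right\}$ is the block-diagonal matrix to be learned, $\boldsymbol{L}_S$ is the Laplacian matrix, $\boldsymbol{F}$ consists of eigenvectors of the Laplacian matrix, $\boldsymbol{I}$ is the identity matrix, and $\lambda$ is a sufficiently large value. Their precise meanings and the solution procedure are given in Ref. [21]. Equation (1) means: given the affinity matrix $\boldsymbol{A}$, learn a block-diagonal matrix $\boldsymbol{S}$ whose distance to $\boldsymbol{A}$ is minimal. The number of connected components of the learned $\boldsymbol{S}$ equals the number of clusters, so the clustering can be read directly from it.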
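The last step, reading the clusters off the connected components of the learned matrix, can be sketched as follows. This is only an illustration under stated assumptions: it does not solve Eq. (1), the small matrix `S` is a made-up block-diagonal example, and the function name `connected_components` and the threshold `tol` are hypothetical.

```python
from collections import deque


def connected_components(S, tol=0.0):
    """Label the connected components of the graph whose edge weights are S.

    S is a square list-of-lists matrix; an edge joins u and v when
    S[u][v] or S[v][u] exceeds tol (the graph is treated as undirected).
    """
    n = len(S)
    labels = [-1] * n          # -1 marks a node that has not been visited yet
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:           # breadth-first search from `start`
            u = queue.popleft()
            for v in range(n):
                if labels[v] == -1 and (S[u][v] > tol or S[v][u] > tol):
                    labels[v] = current
                    queue.append(v)
        current += 1
    return labels


# Hypothetical learned matrix with two diagonal blocks; every row sums to 1.
S = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.5, 0.0],
]
labels = connected_components(S)
num_clusters = max(labels) + 1
```

Here categories 0 and 1 form one cluster and categories 2, 3, and 4 form another, so the number of components gives the number of clusters without a separate clustering step.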

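1.2 Constructing the affinity matrix $\boldsymbol{A}$ from classification errors

As described above, the main work in clustering with the CLR algorithm is to construct a suitable affinity matrix $\boldsymbol{A}$. It is a $c$×$c$ square matrix, where $c$ is the number of categories, and its entry in row $i$ and column $j$ represents the affinity between category $i$ and category $j$. The approach adopted here is to count the classification errors: the more often two categories are confused, the larger their affinity.

Because the method is based on classification error, both the classification results and the true labels must be known. The true labels of the test set may be used only to compute the final accuracy and cannot be treated as known. Therefore, the clustering analyzes the classification results on the validation set.

The affinity matrix $\boldsymbol{A}$ is constructed as follows: first, the classification error matrix $\boldsymbol{E}$ is built from the classification results and true labels of the validation set; next, $\boldsymbol{A}$ is built from $\boldsymbol{E}$; finally, $\boldsymbol{A}$ is symmetrized and normalized. The details are given in Algorithm 1.

Algorithm 1: Constructing the affinity matrix $\boldsymbol{A}$

1) Construct the classification error matrix $\boldsymbol{E}$. Use the original B-CNN model to classify the validation set. If a dataset has only training and test sets and no validation set, randomly split the training set into a training set and a validation set. Pair the classification results with the true labels of the validation images to obtain a two-column matrix in which each row represents one image, so the number of rows equals the number of validation samples; the first column holds the true label of the image and the second column holds its classification result. Select the rows whose first and second columns differ, that is, the rows in which the classification result disagrees with the true label, to obtain the classification error matrix $\boldsymbol{E}$.

2) Construct the affinity matrix $\boldsymbol{A}$. Initialize a $c$×$c$ zero matrix $\boldsymbol{A}$, then traverse every row of $\boldsymbol{E}$. If the first and second columns of row $n$ are $i$ and $j$, respectively, category $i$ was classified as category $j$, so the entry $\boldsymbol{A}$($i$, $j$) is incremented by 1, that is,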

$\begin{gathered} A\left( {E\left( {n,1} \right),E\left( {n,2} \right)} \right) = \hfill \\ A\left( {E\left( {n,1} \right),E\left( {n,2} \right)} \right) + 1 \hfill \\ \end{gathered} $

(2)

3) Symmetrize

$\boldsymbol{A} = \frac{1}{2}\left( \boldsymbol{A} + \boldsymbol{A}^{\rm T} \right)$

(3)

4) Normalize

$\boldsymbol{A} = \left( \boldsymbol{D}_A + \varepsilon \boldsymbol{I} \right)^{-1}\boldsymbol{A}$

(4)

where $\boldsymbol{D}_A$ is the degree matrix of $\boldsymbol{A}$, a diagonal matrix whose $j$th diagonal element is the sum of the elements in the $j$th column of $\boldsymbol{A}$; $\varepsilon$ is a constant greater than 0 that prevents the matrix from being singular; and $\boldsymbol{I}$ is the identity matrix. In this paper, $\varepsilon$ is set to 1.

Once the affinity matrix $\boldsymbol{A}$ is obtained, the CLR algorithm of Section 1.1 can be used to group the easily misclassified and confused categories.
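The four steps of Algorithm 1 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `build_affinity` and the toy labels are hypothetical, and categories are numbered from 0 here, whereas the paper numbers them from 1.

```python
def build_affinity(true_labels, predicted_labels, num_classes, eps=1.0):
    """Build the error matrix E and the affinity matrix A (Eqs. (2)-(4))."""
    c = num_classes

    # Step 1: keep only the misclassified validation images as (true, predicted) rows.
    E = [(t, p) for t, p in zip(true_labels, predicted_labels) if t != p]

    # Step 2: A(i, j) counts how often category i was classified as category j.
    A = [[0.0] * c for _ in range(c)]
    for t, p in E:
        A[t][p] += 1.0

    # Step 3: symmetrize, A = (A + A^T) / 2.
    A = [[0.5 * (A[i][j] + A[j][i]) for j in range(c)] for i in range(c)]

    # Step 4: normalize, A = (D_A + eps * I)^{-1} A. Left-multiplying by the
    # inverse of a diagonal matrix divides row i by (degree_i + eps).
    degrees = [sum(A[i][j] for i in range(c)) for j in range(c)]  # column sums
    A = [[A[i][j] / (degrees[i] + eps) for j in range(c)] for i in range(c)]
    return E, A


# Toy validation set with three categories (labels are made up for illustration).
true_labels = [0, 0, 0, 1, 1, 2, 2, 2]
predicted_labels = [1, 1, 0, 0, 1, 2, 2, 0]
E, A = build_affinity(true_labels, predicted_labels, num_classes=3)
```

In this toy case four images are misclassified. Categories 0 and 1 are confused three times in total, so after symmetrization their mutual count is 1.5, and with $\varepsilon$ = 1 the normalized entries are $A(0, 1)$ = 1.5/3 and $A(1, 0)$ = 1.5/2.5. The normalized matrix is no longer symmetric because each row is divided by its own degree.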

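2 Hierarchical B-CNN model

The B-CNN model was proposed by Lin et al. It uses two pretrained networks, M-Net^[22] or D-Net^[23], to extract features simultaneously, obtains the bilinear feature of the image, and classifies it with an SVM classifier. The advantage of the model is that it needs only image-level labels and no part annotations. Its shortcoming is that the feature extraction networks and the classifier are trained on the training samples of all categories, so some confusable categories are hard to separate.

Therefore, this paper proposes a hierarchical B-CNN model guided by classification error, which retrains and reclassifies the confusable categories separately to make them more distinguishable. Its overall structure is shown in Fig. 1.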

Fig. 1 Hierarchical bilinear convolutional neural network model guided by classification error

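The proposed hierarchical B-CNN model has two layers. The first layer (outside the dashed box) first trains the networks on the training set, uses the trained networks to extract the features of the training set, and trains the SVM classifier on these features. The trained networks and SVM are then used to extract the features of the test and validation sets and to classify them. From the classification results on the validation set and the true labels of the images, we obtain the classification errors and cluster the categories with the method described in Section 1.

The second layer (inside the dashed box) first extracts the training and test sets group by group according to the clustering result. Because the true categories of the test images are unknown, the test images cannot be extracted by label; instead, they are extracted according to their first-layer classification results, that is, a test image is treated as belonging to the category to which the first layer assigned it. The newly extracted training sets are then used to continue training the networks and the SVMs, and the corresponding test sets are reclassified.

In short, the first layer of the model produces 1) the pretrained networks, 2) the classification results of the test set, and 3) the clustering result. The second layer uses the clustering result and the classification results from the first layer to extract the clustered training and test sets, respectively, and then continues training the networks and classifying.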

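2.1 First-layer B-CNN

In the first layer, M-Net and D-Net are used as the initial networks and are trained on the training set to obtain the trained networks net-a and net-b. These two networks serve as the initial networks in the second layer and are also used for feature extraction in the first layer.

The trained networks net-a and net-b extract the features of the training and test sets. The training features are used to train the SVM classifier, and the trained SVM classifier classifies the test set from its features, which gives the first-layer classification results of the test set. These results are also used to extract the test sets in the second layer.

The two trained networks also extract the features of the validation set, and the trained SVM classifier classifies the validation set from these features, which gives the first-layer classification results of the validation set. All categories are clustered according to these results and the true labels of the validation set, and the clustering result is used to extract the training and test sets in the second layer.

2.2 Second-layer B-CNN

After the first-layer B-CNN, the easily misclassified and confused categories can be identified and grouped. Each group is then retrained and reclassified separately. Such training is more targeted, so the classification results are expected to be more accurate.

1) Extracting the training and test sets. Given the first-layer clustering result, the training set is extracted group by group using the correspondence between image IDs and labels provided by the dataset.

Because the labels of the test set are unknown, the test set cannot be extracted from the correspondence between image IDs and labels as the training set is; it is extracted according to the first-layer classification results of the test set instead.

2) Retraining and reclassification. When training the networks in the second layer, M-Net and D-Net are no longer used for initialization; the first-layer networks net-a and net-b are used instead. For each group of training and test sets, the corresponding networks and SVM classifier are trained and are then used to extract the features of the corresponding test set and classify it.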
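The extraction in step 1) amounts to grouping image indices by cluster, using true labels for the training set and first-layer predictions for the test set. The sketch below illustrates this bookkeeping only; the function name `split_by_cluster` and the toy labels are hypothetical, and images whose category belongs to no selected cluster are simply left out, so they keep their first-layer result.

```python
def split_by_cluster(clusters, train_labels, test_predictions):
    """Group image indices by cluster.

    clusters:         list of lists of category ids, one list per cluster
    train_labels:     true label of each training image
    test_predictions: first-layer predicted label of each test image
    """
    class_to_cluster = {}
    for k, classes in enumerate(clusters):
        for c in classes:
            class_to_cluster[c] = k

    train_groups = {k: [] for k in range(len(clusters))}
    test_groups = {k: [] for k in range(len(clusters))}

    # Training images are assigned by their true labels.
    for idx, label in enumerate(train_labels):
        k = class_to_cluster.get(label)
        if k is not None:
            train_groups[k].append(idx)

    # Test labels are unknown, so test images are assigned by first-layer predictions.
    for idx, predicted in enumerate(test_predictions):
        k = class_to_cluster.get(predicted)
        if k is not None:
            test_groups[k].append(idx)

    return train_groups, test_groups


clusters = [[1, 5, 6], [2, 3]]           # made-up clustering result
train_labels = [1, 2, 5, 4, 6, 3]        # made-up true labels
test_predictions = [5, 5, 4, 2, 1]       # made-up first-layer predictions
train_groups, test_groups = split_by_cluster(clusters, train_labels, test_predictions)
```

One consequence of assigning test images by prediction is that an image whose true category lies outside the group it lands in cannot be corrected in the second layer, because reclassification is restricted to the categories of that group.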

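2.3 Example of the hierarchical B-CNN model

Suppose that categories 1, 5, and 6 are clustered into one group. The training images of these three categories are used to continue training net-a and net-b, which yields the trained networks f-a-1 and f-b-1. These networks extract the features of the training set, on which an SVM classifier, svm-1, is trained. This classifier reclassifies the test images that the first-layer B-CNN assigned to categories 1, 5, and 6, and assigns them only to categories 1, 5, and 6, as shown in Fig. 2.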

Fig. 2 Reclassification example
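The restriction to the categories of the group can be illustrated with a small sketch. It assumes that the group-specific classifier returns one score per category in the group and that the highest score wins; the function name `reclassify` and the score values are hypothetical.

```python
def reclassify(scores, first_layer_prediction, cluster_classes):
    """Return the second-layer label of one test image.

    scores: dict mapping each category of the cluster to the score given by
            the cluster-specific classifier (svm-1 in the example above).
    """
    # Images assigned by the first layer to a category outside the cluster
    # are not touched by this cluster's classifier.
    if first_layer_prediction not in cluster_classes:
        return first_layer_prediction
    # Only the categories of the cluster compete for the final label.
    return max(cluster_classes, key=lambda c: scores[c])


cluster_classes = (1, 5, 6)
scores = {1: -0.2, 5: 0.7, 6: 0.1}       # made-up classifier scores

label_in_cluster = reclassify(scores, first_layer_prediction=1,
                              cluster_classes=cluster_classes)
label_outside = reclassify(scores, first_layer_prediction=9,
                           cluster_classes=cluster_classes)
```

With these scores, an image that the first layer assigned to category 1 is moved to category 5, while an image assigned to category 9 keeps its first-layer label.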

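3 Experiments

To test the effectiveness of the algorithm, three standard datasets, CUB-200-2011, FGVC-Aircraft-2013b, and Stanford-cars, are selected, and detailed tests are conducted in three settings: 1) simple reclassification; 2) simple reclassification based on feature clustering; 3) retraining and classification guided by classification error.

3.1 Datasets

CUB-200-2011. This dataset contains 200 categories and 11 788 bird images, with 5 994 training images and 5 794 test images; each category has about 30 images in both the training and test sets. Some categories in this dataset are very similar and are easily confused during classification, as shown in Fig. 3.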

Fig. 3 Samples of CUB-200-2011 ((a) black billed cuckoo; (b) mangrove cuckoo; (c) yellow billed cuckoo)

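FGVC-Aircraft-2013b. This dataset contains 100 categories and 10 000 aircraft images, with 3 334 training images, 3 333 validation images, and 3 333 test images; each category has about 33 images in both the training and test sets. This dataset also contains some easily confused categories, as shown in Fig. 4.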

Fig. 4 Samples of FGVC-Aircraft-2013b ((a) 737-200; (b) 737-300; (c) 737-500; (d) 737-600)

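Stanford-cars. This dataset contains 196 categories and 16 185 car images, with 8 144 training images and 8 041 test images; each category has about 41 images in both the training and test sets. This dataset also contains some easily confused categories, as shown in Fig. 5.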

Fig. 5 Samples of Stanford-cars ((a) BMW 3 Series Wagon 2012; (b) BMW X3 SUV 2012)

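3.2 Computing the classification error matrix and clustering

Because the CUB-200-2011 and Stanford-cars datasets contain no validation set, the training images are randomly divided into two parts during clustering: 3/4 serve as the training set and the rest as the validation set. This is repeated five times to obtain the classification results on the validation set.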
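The repeated split can be sketched as below. The text does not say whether the split is drawn per category or over the whole training set, nor how the five rounds are combined, so this sketch assumes a plain random split over all image indices; the function name `repeated_splits` and the fixed seed are hypothetical.

```python
import random


def repeated_splits(num_images, repeats=5, train_fraction=0.75, seed=0):
    """Return `repeats` random (train_indices, validation_indices) pairs."""
    rng = random.Random(seed)            # fixed seed only to make the sketch repeatable
    cut = int(num_images * train_fraction)
    splits = []
    for _ in range(repeats):
        indices = list(range(num_images))
        rng.shuffle(indices)
        splits.append((sorted(indices[:cut]), sorted(indices[cut:])))
    return splits


# 5 994 is the size of the CUB-200-2011 training set given in Section 3.1.
splits = repeated_splits(5994)
train_size = len(splits[0][0])
validation_size = len(splits[0][1])
```

For the 5 994 CUB-200-2011 training images, each round gives 4 495 training and 1 499 validation images under this assumption.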

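For the FGVC-Aircraft-2013b dataset, the networks and classifier are trained directly on the 3 334 training images, and the 3 333 validation images are then classified.

From the classification results on the validation set and the true labels of the images, the classification error matrix is constructed and clustering is performed with the method described in Section 1. The categories of the three datasets are clustered into 50, 25, and 50 groups, respectively. The clustering results for the three datasets are shown in Fig. 6.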

Fig. 6 Clustering results ((a) CUB-200-2011; (b) FGVC-Aircraft-2013b; (c) Stanford-cars)

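Figure 6 shows that the large blocks are very scattered and their clustering quality is poor, so these subsets are ignored. A subset containing only one category is necessarily assigned to that category on reclassification, so its result cannot change; such subsets are also ignored. Therefore, only subsets with a moderate number of categories are selected.

With the clustering results (Tables 1-3), retraining and reclassification can be performed with the hierarchical B-CNN model of Section 2.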

Table 1 CUB-200-2011 cluster result

cluster
1-2-3-45-71-72-84
59-60-62-63-64-66-101
31-32-33-83-156
79-80-81-82-100
141-143-144-145-146
5-7-8-58
17-42-139-140
18-22-92-105
23-25-44-52
49-73-74-134
50-51-53-86
67-68-69-70
116-130-132-133
14-15-54
41-189-191
46-87-88
55-98-157
99-101-181
111-112-138
148-152-164
163-174-194
168-169-171
187-190-192

Table 2 FGVC-Aircraft-2013b cluster result

cluster
2-53-85-86-99
3-4-5-6-7-8-55
9-10-45-90-95
11-12-13-14
24-25-26
35-36-37-78
38-39
41-72-91
43-60-61-62-93
44-54
46-47
59-63-96
65-66-67
75-76-98
77-79
81-82-84
83-92-97

Table 3 Stanford-cars cluster result

cluster
1-124-125-146-169
24-72-73-101-168-173
33-130-134-138
53-62-65-106
64-71-119
88-166
118-120-143-148-149
8-9-10-11-156
30-37
39-40-41-42-43-44
54-69-74-75
83-84
90-91
176-177
14-19-25
31-36
45-46
55-56-57-192
86-87
102-103-104-193

3.3 Methods compared

3.3.1 B-CNN

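The part of Fig. 1 outside the dashed box is a simplified diagram of the B-CNN model. It uses two CNNs to extract features of an image simultaneously, takes the outer product of the two features to obtain the bilinear feature of the image, and classifies it with an SVM. The two networks may be symmetric or asymmetric; in Ref. [20], Lin et al. report experimental results for each single network and for different combinations of the two networks (Table 4). This paper is based on the asymmetric B-CNN[D, M] model.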
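The outer-product step can be illustrated on a tiny example. The sketch assumes that the outer products computed at each location are summed over all locations and flattened into one vector, as in the usual B-CNN formulation; the normalization applied afterwards in Ref. [20] is omitted, and the function name `bilinear_feature` and the toy feature values are hypothetical.

```python
def bilinear_feature(feats_a, feats_b):
    """Sum the outer products of two local feature sets over all locations.

    feats_a: one length-M vector per image location (first network)
    feats_b: one length-N vector per image location (second network)
    Returns the pooled M x N matrix flattened into a list of length M * N.
    """
    m, n = len(feats_a[0]), len(feats_b[0])
    pooled = [[0.0] * n for _ in range(m)]
    for fa, fb in zip(feats_a, feats_b):       # same location in both networks
        for i in range(m):
            for j in range(n):
                pooled[i][j] += fa[i] * fb[j]  # outer product, accumulated
    return [value for row in pooled for value in row]


# Two locations, a 2-dimensional and a 3-dimensional local feature (made up).
feats_a = [[1.0, 2.0], [0.0, 1.0]]
feats_b = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0]]
feature = bilinear_feature(feats_a, feats_b)
```

The resulting vector has length $M$×$N$, which is why the bilinear feature is high-dimensional and why a linear classifier is a practical choice.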

Table 4 Classification result (accuracy, %)

Model	birds	aircrafts	cars
FC-CNN[M]^[20]	58.8	57.3	58.6
FC-CNN[D]^[20]	70.4	74.1	79.8
FV-CNN[M]^[20]	64.1	70.1	77.2
FV-CNN[D]^[20]	74.7	77.6	85.7
B-CNN[D, M]	84.35	83.56	89.45
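Simple reclassification	84.48	84.01	89.66
Feature clustering	84.15	—	—
Hierarchical B-CNN	84.67	84.11	89.78
Note: bold indicates the best result and underline indicates the second-best result.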

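3.3.2 Simple reclassification

To quickly evaluate the proposed clustering method, a simple reclassification experiment is conducted: the training set extracted according to the clustering result is used directly to train an SVM classifier, which then classifies the corresponding test set. The retraining produces an entirely new SVM that contains classifiers only for the categories in the group. During reclassification, only these classifiers score the test images, and each test image is assigned to the category whose classifier gives the highest score. The feature extraction networks are those previously trained on all training samples, which saves a great deal of time.

3.3.3 Simple reclassification based on feature clustering

An experiment in which the affinity matrix is constructed from feature similarity is also conducted: the distance between the features of two categories is computed, and the affinity matrix is obtained from the inter-category feature distances, as follows.

The B-CNN model extracts a feature from every training image, and PCA reduces its dimensionality, which gives a feature matrix of size $m$×$d$ for each category, where $m$ is the number of samples in the category and $d$ is the feature dimension after reduction; $d$ = 1 000 in our experiments. The distance between categories $a$ and $b$ is computed as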

$\mathop {{\text{med}}}\limits_j \left( {\mathop {\min }\limits_i \left\| {{a_i} - {b_j}} \right\|} \right) + \mathop {{\text{med}}}\limits_i \left( {\mathop {\min }\limits_j \left\| {{a_i} - {b_j}} \right\|} \right)$

(5)

The affinity matrix is obtained by computing the distances between all pairs of categories.
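Equation (5) can be evaluated directly from two per-category feature matrices. The sketch below assumes the Euclidean norm and uses tiny 2-dimensional features instead of the 1 000-dimensional PCA features; the function name `class_distance` is hypothetical. How the distances are converted into affinities is not specified in the text, so that step is not shown.

```python
import math
from statistics import median


def class_distance(a, b):
    """Distance of Eq. (5) between the feature sets of two categories.

    a, b: lists of feature vectors (one vector per training image).
    """
    def dist(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    # med over j of (min over i): each sample of b matched to its nearest sample of a.
    term_b = median(min(dist(ai, bj) for ai in a) for bj in b)
    # med over i of (min over j): each sample of a matched to its nearest sample of b.
    term_a = median(min(dist(ai, bj) for bj in b) for ai in a)
    return term_b + term_a


a = [[0.0, 0.0], [0.0, 3.0]]                 # made-up features of category a
b = [[0.0, 1.0], [4.0, 0.0], [0.0, 5.0]]     # made-up features of category b
d_ab = class_distance(a, b)
d_ba = class_distance(b, a)
```

For these values the nearest-neighbour distances are 1, 4, and 2 for the samples of $b$ (median 2) and 1 and 2 for the samples of $a$ (median 1.5), so the distance is 3.5. The two terms swap when the arguments are exchanged, so the measure is symmetric.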

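3.4 Comparison of classification results

The classification results of this paper and of the methods compared are listed in Table 4. Row 5 of Table 4 gives the result of the B-CNN[D, M] model, which is higher than the results obtained with a single network and serves as our baseline. Row 6 gives the simple reclassification experiment guided by classification error, whose result is slightly higher than that of B-CNN[D, M]. Row 7 gives the simple reclassification experiment based on feature clustering, whose result is lower than that of B-CNN[D, M], so no further experiments were conducted with it. The last row gives the hierarchical B-CNN experiment guided by classification error. Because two layers of B-CNN are run, its efficiency is about half that of B-CNN, but its accuracy is higher than that of the simple reclassification guided by classification error, which demonstrates the need to retrain the convolutional networks.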
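The size of the improvements can be read off Table 4 with a short calculation. The accuracies below are copied from Table 4; the dictionary and function names are only illustrative.

```python
# Accuracies (%) from Table 4.
baseline = {"birds": 84.35, "aircrafts": 83.56, "cars": 89.45}       # B-CNN[D, M]
simple = {"birds": 84.48, "aircrafts": 84.01, "cars": 89.66}         # simple reclassification
hierarchical = {"birds": 84.67, "aircrafts": 84.11, "cars": 89.78}   # hierarchical B-CNN


def gains(method, reference):
    """Accuracy gain in percentage points, rounded to two decimals."""
    return {name: round(method[name] - reference[name], 2) for name in reference}


gain_simple = gains(simple, baseline)
gain_hierarchical = gains(hierarchical, baseline)
```

Simple reclassification gains 0.13, 0.45, and 0.21 percentage points on birds, aircrafts, and cars, and the hierarchical model gains 0.32, 0.55, and 0.33 percentage points, so retraining the networks adds between 0.10 and 0.19 points on top of retraining the SVM alone.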

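4 Conclusion

This paper proposes a hierarchical B-CNN model guided by classification error for fine-grained classification. By analyzing the errors on the validation set, the easily confused and misclassified categories are grouped and then retrained and reclassified separately so that they can be distinguished better. Experiments on three datasets show that the model effectively improves the classification accuracy but also increases the running time; that is, it trades time for accuracy. The model is suitable for fine-grained classification tasks and works better when the targets are very similar. Future work will proceed in two directions: 1) after one round of clustering, some subsets are still relatively large and can be clustered and reclassified further; 2) this paper uses clustering for the automatic selection of confusable categories, and other methods will be explored.

References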

[1] Berg T, Liu J X, Lee S W, et al. Birdsnap: large-scale fine-grained visual categorization of birds[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH: IEEE, 2014: 2019-2026. [DOI:10.1109/CVPR.2014.259]

[2] Krause J, Jin H L, Yang J C, et al. Fine-grained recognition without part annotations[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA: IEEE, 2015: 5546-5555. [DOI:10.1109/CVPR.2015.7299194]

[3] Zhang X P, Xiong H K, Zhou W G, et al. Picking deep filter responses for fine-grained image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV: IEEE, 2016: 1134-1142. [DOI:10.1109/CVPR.2016.128]

[4] Zhang H, Xu T, Elhoseiny M, et al. SPDA-CNN: unifying semantic part detection and abstraction for fine-grained recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV: IEEE, 2016: 1143-1152. [DOI:10.1109/CVPR.2016.129]

[5] Cui Y, Zhou F, Lin Y Q, et al. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV: IEEE, 2016: 1153-1162. [DOI:10.1109/CVPR.2016.130]

[6] Wang Y M, Choi J, Morariu V I, et al. Mining discriminative triplets of patches for fine-grained classification[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV: IEEE, 2016: 1163-1172. [DOI:10.1109/CVPR.2016.131]

[7] Xie S N, Yang T B, Wang X Y, et al. Hyper-class augmented and regularized deep learning for fine-grained image classification[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA: IEEE, 2015: 2645-2654. [DOI:10.1109/CVPR.2015.7298880]

[8] Liu J X, Kanazawa A, Jacobs D, et al. Dog breed classification using part localization[C]//Proceedings of European Conference on Computer Vision 2012. Berlin, Heidelberg: Springer, 2012: 172-185. [DOI:10.1007/978-3-642-33718-5_13]

[9] Nilsback M E, Zisserman A. A visual vocabulary for flower classification[C]//Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2006, 2: 1447-1454. [DOI:10.1109/CVPR.2006.42]

[10] Angelova A, Zhu S H, Lin Y Q. Image segmentation for large-scale subcategory flower recognition[C]//Proceedings of 2013 IEEE Workshop on Applications of Computer Vision. Tampa, FL: IEEE, 2013: 39-45. [DOI:10.1109/WACV.2013.6474997]

[11] Wagner A, Wright J, Ganesh A, et al. Toward a practical face recognition system: robust alignment and illumination by sparse representation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(2): 372-386. [DOI:10.1109/TPAMI.2011.112]

[12] Branson S, Van Horn G, Belongie S, et al. Bird species categorization using pose normalized deep convolutional nets[C]//Proceedings of the British Machine Vision Conference. Nottingham: BMVC Press, 2014.

[13] Bourdev L, Maji S, Malik J. Describing people: a poselet-based approach to attribute classification[C]//Proceedings of 2011 IEEE International Conference on Computer Vision. Barcelona: IEEE, 2011: 1543-1550. [DOI:10.1109/ICCV.2011.6126413]

[14] Zhang N, Donahue J, Girshick R, et al. Part-based R-CNNs for fine-grained category detection[C]//Proceedings of European Conference on Computer Vision 2014. Berlin: Springer, 2014: 834-849. [DOI:10.1007/978-3-319-10590-1_54]

[15] Zhang N, Farrell R, Darrell T. Pose pooling kernels for sub-category recognition[C]//Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition. Providence, RI: IEEE, 2012: 3665-3672. [DOI:10.1109/CVPR.2012.6248364]

[16] Jégou H, Douze M, Schmid C, et al. Aggregating local descriptors into a compact image representation[C]//Proceedings of 2010 IEEE Conference on Computer Vision and Pattern Recognition. San Francisco, CA: IEEE, 2010: 3304-3311. [DOI:10.1109/CVPR.2010.5540039]

[17] Lowe D G. Object recognition from local scale-invariant features[C]//Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999. Kerkyra: IEEE, 1999, 2: 1150-1157. [DOI:10.1109/ICCV.1999.790410]

[18] Perronnin F, Sánchez J, Mensink T. Improving the fisher kernel for large-scale image classification[C]//Proceedings of European Conference on Computer Vision 2010. Berlin, Heidelberg: Springer, 2010: 143-156. [DOI:10.1007/978-3-642-15561-1_11]

[19] Cimpoi M, Maji S, Vedaldi A. Deep filter banks for texture recognition and segmentation[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA: IEEE, 2015: 3828-3836. [DOI:10.1109/CVPR.2015.7299007]

[20] Lin T Y, RoyChowdhury A, Maji S. Bilinear CNN models for fine-grained visual recognition[C]//Proceedings of 2015 IEEE International Conference on Computer Vision(ICCV). Santiago: IEEE, 2015: 1449-1457. [DOI:10.1109/ICCV.2015.170]

[21] Nie F P, Wang X Q, Huang H. Clustering and projected clustering with adaptive neighbors[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2014: 977-986. [DOI:10.1145/2623330.2623726]

[22] Chatfield K, Simonyan K, Vedaldi A, et al. Return of the devil in the details: Delving deep into convolutional nets[C]//Proceedings of the British Machine Vision Conference 2014. Nottingham: BMVC Press, 2014. [DOI:10.5244/C.28.6]

[23] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.