Published: 2019-05-16
DOI: 10.11834/jig.180426
2019 | Volume 24 | Number 5

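Image Analysis and Recognition

Flower image classification with selective convolutional descriptor aggregation

Yin Hong, Fu Xiang, Zeng Jiexian, Duan Bin, Chen Ying

School of Software, Nanchang Hangkong University, Nanchang 330063, China

Received: 2018-07-04; Revised: 2018-10-16

Supported by: National Natural Science Foundation of China (61662049, 61763033, 61762067); China Scholarship Council (201608360163)

About the authors: Yin Hong (first author), born in 1992, female, master's degree; main research interests: image processing and pattern recognition, deep learning. E-mail: yinhafu92@163.com;
Zeng Jiexian, male, professor, master's supervisor; main research interests: image processing, pattern recognition, computer vision. E-mail: zengjx58@163.com;
Duan Bin, male, master's student; main research interests: image processing, pattern recognition, computer vision. E-mail: 806659750@qq.com;
Chen Ying, male, associate professor, master's supervisor; main research interests: image processing, pattern recognition, computer vision. E-mail: c_y2008@163.com.

CLC number: TP391.41

Document code: A

Article ID: 1006-8961(2019)05-0762-11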

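Abstract

Objective Flower images suffer from a shortage of labeled samples and high annotation cost, and conventional deep-learning-based fine-grained image classification methods do not localize the flower region well. To address these problems, an unsupervised flower image classification method based on selective deep convolutional descriptor aggregation is proposed. Method A flower image classification network based on selective deep convolutional descriptor aggregation is constructed. First, flower images are preprocessed with an aspect-ratio-preserving size normalization, so that all images have the same size while the object is not deformed and no image detail is lost. Next, a VGG-16 deep convolutional neural network pre-trained on ImageNet learns features from the preprocessed images; effective deep convolutional features are selected according to the distribution of responses in the feature maps, and the deep convolutional features of multiple layers are fused. Finally, a softmax layer performs the classification. Result Comparative experiments on the Oxford 102 Flowers dataset against conventional deep-learning-based flower image classification methods show that the proposed method reaches a classification accuracy of 85.55%, which is 27.67 percentage points higher than the deep learning model Xception. Conclusion A flower image classification method based on selective convolutional descriptor aggregation is proposed. It localizes the salient region of a flower image without supervision, removes the interference of background and noise with the flower object, and improves classification accuracy. It is suited to flower image classification when labeled samples are scarce.

Keywords

flower classification; deep learning; saliency region; feature selection; feature fusion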

Flower image classification with selective convolutional descriptor aggregation

Yin Hong, Fu Xiang, Zeng Jiexian, Duan Bin, Chen Ying

School of Software, Nanchang Hangkong University, Nanchang 330063, China

Supported by: National Natural Science Foundation of China (61662049, 61763033, 61762067)

Abstract

Objective Flower image classification is a fine-grained image classification task. Its main challenges are large intra-class differences and inter-class similarities. Different types of flowers are highly similar in morphology, color, and other aspects, whereas flowers in the same category vary greatly in color, shape, and other aspects. According to research and analysis, current methods of flower image classification can be divided into two categories: methods based on handcrafted features and methods based on deep learning. The former usually obtain flower areas by image segmentation and then extract or design features manually. Finally, the extracted features are combined with a traditional machine learning algorithm to complete classification. These methods rely on the design experience of researchers. By contrast, methods based on deep learning utilize deep networks to learn the features of flowers automatically. Bounding boxes and part annotations are used to define accurate target positions, and then different convolutional neural network models are fine-tuned to obtain the targets' features. Given that currently available flower image datasets lack annotation information, such as bounding boxes and part annotations, these strongly supervised methods are difficult to apply. Furthermore, annotating bounding boxes and parts for many flower images incurs high cost. To solve these problems, this study proposes an unsupervised flower image classification method on the basis of selective convolutional descriptor aggregation. Method A flower image classification network is constructed on the basis of selective deep convolutional descriptor aggregation. The proposed method can be divided into four phases: flower image preprocessing, selection and aggregation of convolutional features in the Pool5 layer, selection and aggregation of convolutional features in the Relu5-2 layer, and multi-layer feature fusion and classification. 
In the first phase, flower images are preprocessed with a normalization method that retains the aspect ratio to make the size of all flower images equal; thus, the dimension of each flower feature generated by the deep convolutional neural network is consistent. The input image size is set to 224×224 pixels in this study. In the second phase, the features of the preprocessed flower images are learned by VGG-16, a deep convolutional neural network model pre-trained on ImageNet. Then, the saliency region is located according to the high response values in the feature maps of the Pool5 layer. However, some background regions also have high response values. The area of a background region with a large response value is smaller than the target area. Thus, the flood filling algorithm is used to calculate the maximum connected region of the saliency region. On the basis of the location information of the saliency region, deep convolutional features within the region are selected and aggregated to form a low-dimensional feature of flower images. In the third phase, deep convolutional features in the Relu5-2 layer are selected and aggregated to form another low-dimensional feature of flowers. Multi-layer convolutional features have been proven to help the network learn features and complete the classification task; thus, the deep convolutional features in the Pool5 and Relu5-2 layers are chosen in this study. Similarly, a saliency region map from the Relu5-2 layer is obtained on the basis of the response values. The saliency map from the Relu5-2 layer locates the flower region more accurately than the saliency map from the Pool5 layer, but it contains numerous noise regions and little semantic information. Thus, the saliency region map from the Relu5-2 layer is combined with the maximum connected region map from the Pool5 layer to produce a true saliency region map with little noise. 
Finally, deep convolutional features are selected and aggregated to form the low-dimensional feature of flower images from the Relu5-2 layer on the basis of the location information of the true saliency region map. In the final phase, the above two low-dimensional features are fused to form the final flower features, which are then entered into the softmax layer for classification. Result To explore the effects of the proposed selective convolutional descriptor aggregation method, we perform the following experiment on Oxford 102 Flowers. The preprocessed flower images are entered into the AlexNet, VGG-16, and Xception models, all of which are pre-trained on ImageNet. Experimental results show that the classification accuracy of the proposed method is superior to that of these models. Experiments are also conducted to compare the proposed method with other current flower image classification methods in the literature. Results indicate that the classification accuracy of this method is higher than that of methods based on handcrafted features and other methods based on deep learning. Conclusion A method for classifying flower images using selective convolutional descriptor aggregation was proposed. A flower image's features were learned by using the transfer learning technique on the basis of a pre-trained network. Effective deep convolutional features were selected according to the response value distribution in the feature map. Then, multi-layer deep convolutional features were fused. Finally, the softmax layer was used for classification. The advantages of this method include locating the conspicuous region in the flower image in an unsupervised manner and selecting deep convolutional features in the located region to exclude other invalid parts, such as background and noise. Therefore, the accuracy of flower image classification can be improved by reducing the disturbing information from invalid parts.

Key words

flower classification; deep learning; saliency region; feature selection; feature aggregation

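0 Introduction

Analyzing the characteristics, nutritional value, and medicinal value of flowering plants is a major focus of botanical research, and flower classification is the first step in such work. Botanists can classify flowers accurately by drawing on their expertise, experience, or reference materials, but doing so takes a great deal of time. As the economy and technology develop, people increasingly photograph flowers with cameras and mobile phones, yet those without botanical knowledge often cannot tell which kind of flower they are looking at. Automatic flower image classification can therefore improve botanists' efficiency and also help the general public identify flowers and enjoy them more. Current flower image classification methods fall into two categories: methods based on handcrafted features and methods that extract features automatically with deep learning.

Methods based on handcrafted features typically obtain the flower region from the original image by image segmentation, apply manually designed feature extractors, and then combine the extracted features with a traditional machine learning algorithm for classification. For example, Nilsback et al.^[1] built Oxford 102 Flowers, a dataset of 102 flower classes, and proposed fusing scale-invariant feature transform (SIFT) features with histogram of oriented gradients (HOG) features and classifying with a support vector machine (SVM). They argued that combining features improves classification on large datasets, and reported a classification accuracy of 72.8% on Oxford 102 Flowers. Angelova et al.^[2] proposed an object segmentation method, extracted HOG features at four scales, and encoded them with locality-constrained linear coding (LLC). Xie et al.^[3] proposed a saliency-detection-based iterative graph cut (GrabCut) method for flower foreground segmentation, which trains foreground and background classifiers and combines them with GrabCut to separate the main flower region from the background. Building on this, they proposed a multi-feature fusion classification method^[4] that uses the color and shape features of flowers. However, in these methods the design of features and extraction algorithms depends on the researchers' experience and on the characteristics of the images at hand, which limits generality.

Methods that extract features automatically with deep learning mainly use convolutional neural networks (CNNs) to learn image features, and they generalize better. Depending on the task, deep-learning-based image classification can be divided into coarse-grained and fine-grained classification. Coarse-grained classification distinguishes unrelated categories such as cars, faces, cats, and dogs. Well-known deep learning classification models include AlexNet^[5], VGG-16^[6], and Xception^[7].

Flower image classification is a fine-grained task: it distinguishes subcategories that all belong to the category of flowers. It poses two main difficulties: 1) flowers of different classes can be highly similar in shape, color, and other respects, as shown in Fig. 1; 2) flowers of the same class can differ greatly, for example in color and shape, as shown in Fig. 2.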

Fig. 1 Inter-class similarity between different flower classes ((a) colt's foot; (b) dandelion)

Fig. 2 Intra-class difference between flowers of the same class ((a) irises of different colors; (b) tulips at different times)

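Fine-grained image classification thus distinguishes subclasses within one broad class. These subclasses are highly similar, so telling them apart requires finding the subtle differences between them. Applying a coarse-grained classification method directly to flower images is problematic: it struggles to learn the fine details of flowers and therefore cannot classify them accurately.

So far, few flower classification methods are based on deep learning. Liu et al.^[8] used a global-contrast-based saliency detection algorithm to obtain a saliency map of the flower image, combined the saliency map with the grayscale image to select the flower region in the original image, learned features of that region with a convolutional neural network, and then classified. On Oxford 102 Flowers the accuracy was 84.02%. As with the handcrafted approaches, the quality of the saliency map depends on a manually designed saliency detection algorithm, which limits generality. Xia et al.^[9] retrained the Inception-v3 model on flower images using transfer learning and improved classification accuracy. However, because the model is complex, this method is time-consuming.

In fine-grained image classification more broadly, several studies address bird classification^[10-12]. Ref. [10] builds an object localization network and a two-branch classification network. During training, it uses bounding boxes of the bird region and bird part annotations to train a fully convolutional network (FCN)^[13] that yields accurate object locations, and then uses CaffeNet^[14] for feature learning and classification. Ref. [11] uses bounding boxes and part annotations to train a region-based convolutional neural network (R-CNN)^[15] that produces the best candidate regions, selects the corresponding convolutional features, and classifies with an SVM. These methods are strongly supervised. Current flower datasets contain only class labels, with no bounding boxes or part annotations for the flower regions, and annotating large numbers of images is costly, so such methods are not applicable to flower image classification. Ref. [12] generates region proposals with a region proposal network, applies max pooling to the features of each target region through a region of interest (ROI) pooling layer, and feeds the pooled regions into a region convolutional network for class prediction and bounding-box regression. Because it needs both a region proposal network and a network for regressing the target region, it is relatively complex to implement. For image retrieval, a problem similar to fine-grained classification, Wei et al.^[16] proposed selective convolutional descriptor aggregation. Without supervision, the method uses a pre-trained convolutional neural network to localize the salient object in a fine-grained image, selects the deep convolutional features at the localized positions as effective features, and aggregates them into a low-dimensional feature vector. In their experiments, the method achieved the best results on content-based fine-grained image retrieval.

Motivated by the strengths of selective convolutional descriptor aggregation in image retrieval, and to address the lack of annotations for flower images, the high cost of annotation, and the inability of conventional deep-learning-based fine-grained classification methods to localize the flower region well, this paper proposes an unsupervised flower image classification method based on selective deep convolutional descriptor aggregation. Selective aggregation is first used to learn effective flower features; these features and the image labels are then fed to a softmax classifier, and the fusion coefficient is re-tuned experimentally for the best classification performance. The advantage of the method is that feature learning uses no annotation at all; image labels are used only in the classification stage, and the learned features are still effective for classification.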

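1 Flower image classification network based on selective convolutional descriptor aggregation

Studies show that different layers of a deep neural network capture different aspects of an object. High-level features carry category information; they are more abstract and more invariant to complex nuisance factors, and can distinguish objects of different categories. Low-level features carry detail information; they are invariant to simple transformations such as translation and can distinguish similar objects within a category. Fusing high-level and low-level convolutional features helps a network perform classification^[17], so this paper fuses the high-level features of the Pool5 layer with the features of the lower Relu5-2 layer for flower recognition. The network has four parts, shown in Fig. 3: flower image preprocessing, selection and aggregation of Pool5 convolutional features, selection and aggregation of Relu5-2 convolutional features, and multi-layer feature fusion and classification. 1) Flower images of varying size are normalized while preserving the aspect ratio, so that all input images have the same size and the object is neither deformed nor stripped of detail. 2) The normalized images are fed into a VGG-16 model pre-trained on ImageNet for feature learning. Salient regions are localized from the high responses in the Pool5 feature maps, the deep convolutional features inside the salient region are selected, and they are aggregated by combining max-pooling and average-pooling into a low-dimensional flower feature SA_Pool5. 3) Deep convolutional features of the Relu5-2 layer are selected and aggregated into another low-dimensional feature SA_Relu5-2, in order to capture the flower region and finer flower details, which benefits classification. 4) SA_Pool5 and SA_Relu5-2 are fused into the flower feature SA⁺, which is fed into the softmax layer for classification.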

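Fig. 3 The flower image classification network based on selective convolutional descriptor aggregation

1.1 Flower image preprocessing

This work uses the Oxford 102 Flowers^[1] dataset provided by the Visual Geometry Group (VGG) at the University of Oxford. The dataset contains 102 flower classes, and the resolution varies from image to image.

For every flower image to yield features of the same dimension from the deep convolutional neural network, so that they can be fed into the softmax layer, the input images must have the same size, which requires size normalization. Traditional image normalization methods include nearest-neighbor, bilinear, and bicubic interpolation; their results are shown in Fig. 4.

Fig. 4 Comparison of traditional normalization methods

((a) original images; (b) nearest neighbor interpolation; (c) bilinear interpolation; (d) bicubic interpolation)

Fig. 4 shows that when the aspect ratios of the target and original images differ, traditional normalization deforms the image and heavily compresses useful information. This deformation and compression of object details harms classification accuracy.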

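To solve this problem, this paper uses an aspect-ratio-preserving normalization method. For each original image whose aspect ratio differs from that of the target image, the method creates $N$ different copies, each of which is a part of the image with the same aspect ratio as the target image. The steps are:

1) Obtain the width $W_{\mathrm{in}}$ (pixels) and height $H_{\mathrm{in}}$ (pixels) of the input image and compute its aspect ratio

$ A_{\mathrm{in}}=\frac{H_{\mathrm{in}}}{W_{\mathrm{in}}} $

(1)

2) Obtain the width $W_{\mathrm{go}}$ and height $H_{\mathrm{go}}$ of the target image and compute its aspect ratio

$ A_{\mathrm{go}}=\frac{H_{\mathrm{go}}}{W_{\mathrm{go}}} $

(2)

3) If $A_{\mathrm{in}}=A_{\mathrm{go}}$, normalize the image size directly with nearest-neighbor interpolation; otherwise, carry out steps 4) to 6).

4) Compute the short and long edges of the input image

$ \mathit{edg}_{\mathrm{sh}}=\min \left\{H_{\mathrm{in}}, W_{\mathrm{in}}\right\} $

(3)

$ \mathit{edg}_{\mathrm{lo}}=\max \left\{H_{\mathrm{in}}, W_{\mathrm{in}}\right\} $

(4)

5) Using $\mathit{edg}_{\mathrm{sh}}$ as the reference, cut $N$ segments of length $\mathit{len}$ from the long edge $\mathit{edg}_{\mathrm{lo}}$, advancing by a stride of $\mathit{step}$ each time

$ \mathit{len}=\left\{\begin{array}{ll}{\mathit{edg}_{\mathrm{sh}} \times A_{\mathrm{go}},} & {\mathit{edg}_{\mathrm{sh}}=W_{\mathrm{in}}} \\ {\frac{\mathit{edg}_{\mathrm{sh}}}{A_{\mathrm{go}}},} & {\mathit{edg}_{\mathrm{sh}}=H_{\mathrm{in}}}\end{array}\right. $

(5)

Crop the image over each segment while keeping the full short edge, so that the input image is converted into $N$ images whose aspect ratio equals the target aspect ratio $A_{\mathrm{go}}$. With $A_{\mathrm{go}}=1$, as used in this paper, each crop is a square of side $\mathit{len}$.

6) Rescale the resulting images proportionally to the common target size.

This paper sets the target size to 224×224 pixels, so the aspect ratio is $A_{\mathrm{go}}=1$; with the stride $\mathit{step}=\left(\mathit{edg}_{\mathrm{lo}}-\mathit{edg}_{\mathrm{sh}}\right) / 3$, this gives $ N = 4 $. Fig. 5 shows the normalization of flower images whose aspect ratio differs from the target.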
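Steps 1) to 5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' MATLAB implementation: the function name `crop_windows` and its return format are hypothetical, the stride is generalized to `(edg_lo - len) / (n - 1)` (which equals the paper's `(edg_lo - edg_sh) / 3` when the target is square and `n = 4`), and offsets are rounded to whole pixels. The final resize of step 6) is left to whatever image library is in use.

```python
def crop_windows(h_in, w_in, h_go=224, w_go=224, n=4):
    """Return (top, left, height, width) crop windows with the target aspect ratio."""
    # Step 3: equal aspect ratios, compared without floating-point division.
    if h_in * w_go == w_in * h_go:
        return [(0, 0, h_in, w_in)]
    a_go = h_go / w_go                       # Eq. (2)
    edg_sh = min(h_in, w_in)                 # Eq. (3)
    edg_lo = max(h_in, w_in)                 # Eq. (4)
    short_is_width = edg_sh == w_in
    # Eq. (5): length of each segment cut from the long edge.
    length = edg_sh * a_go if short_is_width else edg_sh / a_go
    length = int(round(length))
    if length > edg_lo:
        # Assumption: the sketch only handles crops that fit inside the image.
        raise ValueError("crop does not fit along the long edge")
    step = (edg_lo - length) / (n - 1)       # assumed generalization of the stride
    windows = []
    for k in range(n):
        offset = int(round(k * step))
        if short_is_width:                   # portrait image: slide vertically
            windows.append((offset, 0, length, w_in))
        else:                                # landscape image: slide horizontally
            windows.append((0, offset, h_in, length))
    return windows


if __name__ == "__main__":
    for window in crop_windows(500, 800):
        print(window)
```

For a 500×800 (height×width) image and a square target, the four windows start at columns 0, 100, 200, and 300, so together they cover the whole long edge.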

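Fig. 5 Results of the aspect-ratio-preserving normalization method

((a) original image; (b) normalized images)

Fig. 5 shows that the four normalized images share the same aspect ratio. The method both prevents deformation and increases the number of flower images, which benefits training of the deep convolutional neural network.

Preprocessing the original images of Oxford 102 Flowers in this way yields 32 546 flower images of 224×224 pixels.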

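1.2 Selection and aggregation of Pool5 deep convolutional features

The features of the flower images are learned with a VGG-16 model^[6] pre-trained on ImageNet, which consists mainly of 13 convolutional layers and 3 fully connected layers. Pool5 is the max-pooling layer that follows the last convolutional layer of VGG-16; apart from the fully connected layers, it provides the highest-level features. Because high-level features carry category information and are more abstract and more invariant to complex nuisance factors, this paper selects the deep convolutional features of this layer. Salient regions are localized from the high responses in the Pool5 feature maps, the effective deep convolutional features are selected, and the multi-channel features are aggregated into a low-dimensional feature. The framework of this part is shown in Fig. 6.

Fig. 6 The convolutional feature selection and aggregation framework of Pool5 layer

First, the flower images normalized as in Section 1.1 are fed into VGG-16 and the convolutional features of the Pool5 layer are taken. Pool5 has 512 channels of 7×7 feature maps, so each flower image is represented as a 7×7×512 three-dimensional array.

To visualize the Pool5 features, the feature maps are resized to 224×224 pixels by nearest-neighbor interpolation and overlaid on the original image. The feature maps of some channels are shown in Fig. 7(b).

Fig. 7 The feature maps and saliency regions of Pool5 layer

((a) original images; (b) feature maps of Pool5 layer; (c) saliency region $\boldsymbol{M}_{\mathrm{Pool5}}$; (d) max-connected saliency region $\tilde{\boldsymbol{M}}_{\mathrm{Pool5}}$)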

Next, the Pool5 feature maps are summed along the depth direction, turning the 7×7×512 array into a 7×7 two-dimensional matrix $\boldsymbol{A}$, called the high-response position matrix:

$ \boldsymbol{A} = \sum\limits_{n = 1}^{d} \boldsymbol{S}_n $

(6)

where $\boldsymbol{S}_n$ is the feature map of the $ n $-th channel of Pool5 and $ d $ is the number of channels, $ d=512 $. The matrix $\boldsymbol{A}$ holds 7×7 response values, one for each of the 7×7 positions. The higher the response at position $ (i, j) $, the more likely the associated region belongs to the object to be recognized.

The mean response ${\bar a}$ over all positions of $\boldsymbol{A}$ is used as a threshold to decide which positions localize the flower region. If the response at position $ (i, j) $ is higher than ${\bar a}$, the flower region is likely to appear at that position; otherwise it is unlikely to. This gives a saliency region $\boldsymbol{M}_{\mathrm{Pool5}}$ of the same size as $\boldsymbol{A}$:

$ \boldsymbol{M}_{\mathrm{Pool5}}(i, j) = \left\{ \begin{array}{ll} 1, & \boldsymbol{A}_{i, j} > \bar a \\ 0, & \boldsymbol{A}_{i, j} \le \bar a \end{array} \right. $

(7)

where $ (i, j) $ is one of the 7×7 positions.
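Eqs. (6) and (7) amount to a channel-wise sum followed by a mean threshold. The sketch below shows this on a tiny 2-channel, 2×3 example rather than a real 7×7×512 Pool5 tensor; the function names and the nested-list layout (`features[n][i][j]` for channel `n`) are assumptions made for illustration only.

```python
def aggregation_map(features):
    """Sum d channel maps (each h x w) into one h x w matrix A, as in Eq. (6)."""
    h, w = len(features[0]), len(features[0][0])
    return [[sum(channel[i][j] for channel in features) for j in range(w)]
            for i in range(h)]


def saliency_mask(a):
    """Threshold A at its mean response, as in Eq. (7): 1 if A[i][j] > mean, else 0."""
    values = [v for row in a for v in row]
    mean = sum(values) / len(values)
    return [[1 if v > mean else 0 for v in row] for row in a]


if __name__ == "__main__":
    # Two toy channels of size 2 x 3 (a real Pool5 tensor has 512 channels of 7 x 7).
    features = [
        [[0, 1, 4], [0, 0, 3]],
        [[0, 1, 2], [0, 0, 1]],
    ]
    a = aggregation_map(features)
    print(a)                 # summed responses per position
    print(saliency_mask(a))  # positions whose response exceeds the mean
```

Note that a position whose response equals the mean exactly is assigned 0, matching the strict inequality in Eq. (7).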

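The saliency regions $\boldsymbol{M}_{\mathrm{Pool5}}$ of some flower images are shown in Fig. 7(c). $\boldsymbol{M}_{\mathrm{Pool5}}$ still contains some noise: some background parts also have high responses, but they are much smaller than the actual salient region.

This paper uses the flood fill algorithm^[18] to compute the largest connected region of $\boldsymbol{M}_{\mathrm{Pool5}}$, called the max-connected saliency region $\tilde{\boldsymbol{M}}_{\mathrm{Pool5}}$, which removes part of the noise. Examples of $\tilde{\boldsymbol{M}}_{\mathrm{Pool5}}$ are shown in Fig. 7(d).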
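A minimal sketch of extracting the largest connected region by flood fill is given below. The paper does not state which connectivity it uses; 4-connectivity (up, down, left, right) is assumed here, and the function name is hypothetical.

```python
from collections import deque


def largest_connected_region(mask):
    """Keep only the largest 4-connected region of 1s in a binary mask."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for si in range(h):
        for sj in range(w):
            if mask[si][sj] != 1 or seen[si][sj]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground cell.
            region, queue = [], deque([(si, sj)])
            seen[si][sj] = True
            while queue:
                i, j = queue.popleft()
                region.append((i, j))
                for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if (0 <= ni < h and 0 <= nj < w
                            and mask[ni][nj] == 1 and not seen[ni][nj]):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            if len(region) > len(best):
                best = region
    out = [[0] * w for _ in range(h)]
    for i, j in best:
        out[i][j] = 1
    return out


if __name__ == "__main__":
    noisy = [
        [1, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 0],
    ]
    for row in largest_connected_region(noisy):
        print(row)
```

In the toy mask, the isolated 1 in the top-left corner plays the role of a small high-response background patch and is removed, while the three connected cells are kept.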

Once the max-connected saliency region $\tilde{\boldsymbol{M}}_{\mathrm{Pool5}}$ is obtained, it is applied to each of the 512 channels of the Pool5 feature maps to select the effective convolutional features.

When $\tilde{\boldsymbol{M}}_{\mathrm{Pool5}}(i, j)=1$, the convolutional feature $ {x_{(i, j)}} $ at that position is kept; when $\tilde{\boldsymbol{M}}_{\mathrm{Pool5}}(i, j)=0$, the position probably belongs to the background and $ {x_{(i, j)}} $ is discarded:

$ \boldsymbol{F} = \left\{ x_{(i, j)} \mid \tilde{\boldsymbol{M}}_{\mathrm{Pool5}}(i, j) = 1 \right\} $

(8)

where $\boldsymbol{F}$ is the set of selected deep convolutional features.

The selected features $\boldsymbol{F}$ span 512 channels, each holding different deep convolutional features, so they must be aggregated into a low-dimensional feature vector that can be fed into the softmax layer. Based on experimental comparison, this paper aggregates the features by combining max-pooling and average-pooling:

$ \boldsymbol{SA}_{\mathrm{Pool5}} = \left[ \boldsymbol{P}_{\max }; \boldsymbol{P}_{\mathrm{avg}} \right] $

(9)

$ \boldsymbol{P}_{\max } = \mathop {\max }\limits_{i, j} x_{(i, j)}, \quad \boldsymbol{P}_{\mathrm{avg}} = \frac{1}{N}\sum\limits_{i, j} x_{(i, j)} $

(10)

where $\boldsymbol{P}_{\max }$ and $\boldsymbol{P}_{\mathrm{avg}}$ are 1×512 matrices obtained by max-pooling and average-pooling, respectively, that is, by taking the maximum and the mean of the selected features $\boldsymbol{F}$ in each channel. Here $ N $ is the number of selected convolutional features (not the number of image copies in Section 1.1). $\boldsymbol{SA}_{\mathrm{Pool5}}$ is the feature produced by selective aggregation and forms one part of the flower feature.
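Eqs. (8) to (10) can be illustrated with a short sketch: keep only the positions where the mask is 1, then take the per-channel maximum and mean over those positions and concatenate the two vectors. The function name and the nested-list layout (`features[n][i][j]`) are assumptions for illustration; with 512 channels the result would have 1 024 entries.

```python
def select_and_pool(features, mask):
    """Return [P_max; P_avg] over the positions where mask == 1 (Eqs. (8)-(10))."""
    positions = [(i, j)
                 for i, row in enumerate(mask)
                 for j, m in enumerate(row) if m == 1]
    if not positions:
        raise ValueError("mask selects no position")
    n = len(positions)  # N in Eq. (10): number of selected descriptors
    p_max = [max(ch[i][j] for i, j in positions) for ch in features]
    p_avg = [sum(ch[i][j] for i, j in positions) / n for ch in features]
    return p_max + p_avg


if __name__ == "__main__":
    # Two toy channels of size 2 x 2; the mask keeps the right-hand column.
    features = [
        [[1, 2], [3, 4]],
        [[0, 8], [2, 6]],
    ]
    mask = [[0, 1], [0, 1]]
    print(select_and_pool(features, mask))
```

Max-pooling keeps the strongest response per channel and average-pooling summarizes the whole selected region, which is one plausible reading of why Table 2 reports the combination as better than either alone.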

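1.3 Selection and aggregation of Relu5-2 deep convolutional features

In addition to the high-level Pool5 features, this paper selects the deep convolutional features of the Relu5-2 layer. Relu5-2 is an intermediate layer of VGG-16 and can be regarded as a lower layer relative to Pool5; compared with the raw low-level features of the image data, it extracts and abstracts features automatically, with no manual feature design or selection. The Relu5-2 convolutional features are processed as in Section 1.2 to obtain the Relu5-2 saliency region ${\boldsymbol{M}}_{\mathrm{Relu5-2}}$. The max-connected region $\tilde{\boldsymbol{M}}_{\mathrm{Pool5}}$ is then combined with ${\boldsymbol{M}}_{\mathrm{Relu5-2}}$ to give the combined saliency region $\tilde{\boldsymbol{M}}_{\mathrm{Relu5-2}}$: $\tilde{\boldsymbol{M}}_{\mathrm{Pool5}}$ determines the location of the flower, while ${\boldsymbol{M}}_{\mathrm{Relu5-2}}$ delineates a more complete flower region. The Relu5-2 feature maps and saliency regions are shown in Fig. 8.

Fig. 8 The feature maps and saliency regions of Relu5-2 layer

((a) original images; (b) feature maps of Relu5-2 layer; (c) saliency region ${\boldsymbol{M}}_{\mathrm{Relu5-2}}$; (d) combined saliency region $\tilde{\boldsymbol{M}}_{\mathrm{Relu5-2}}$)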
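The text says only that the Pool5 and Relu5-2 maps are combined and does not give the operator. The sketch below shows one plausible reading, stated here as an assumption: the max-connected Pool5 mask is upsampled by nearest-neighbor replication to the Relu5-2 resolution, and a position is kept only when both masks are 1. The function names and the upsampling factor of 2 (a 7×7 map against a 14×14 map in VGG-16 with 224×224 input) are part of that assumption.

```python
def upsample_nearest(mask, factor):
    """Nearest-neighbor upsampling of a binary mask by an integer factor."""
    h, w = len(mask), len(mask[0])
    return [[mask[i // factor][j // factor] for j in range(w * factor)]
            for i in range(h * factor)]


def combine_masks(m_relu, m_pool_cc, factor=2):
    """Keep a Relu5-2 position only if the upsampled Pool5 mask also marks it."""
    up = upsample_nearest(m_pool_cc, factor)
    return [[a & b for a, b in zip(row_relu, row_pool)]
            for row_relu, row_pool in zip(m_relu, up)]


if __name__ == "__main__":
    m_pool_cc = [[1, 0]]                 # toy 1 x 2 max-connected Pool5 mask
    m_relu = [[1, 1, 1, 0],
              [0, 1, 1, 1]]              # toy 2 x 4 Relu5-2 mask
    for row in combine_masks(m_relu, m_pool_cc):
        print(row)
```

Under this reading, the coarse Pool5 mask suppresses high-response Relu5-2 positions that lie outside the located flower, while the finer Relu5-2 mask shapes the region inside it.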

1.4 Multi-layer feature fusion and classification

Sections 1.2 and 1.3 give two low-dimensional features of a flower image, $\boldsymbol{SA}_{\mathrm{Pool5}}$ and $\boldsymbol{SA}_{\mathrm{Relu5-2}}$. Because multi-layer convolutional features help the network learn image features, the aggregated features of Relu5-2 and Pool5 are combined into the final feature of the flower image:

$ \boldsymbol{SA}^{+} = \left[ \boldsymbol{SA}_{\mathrm{Pool5}}, \alpha \times \boldsymbol{SA}_{\mathrm{Relu5-2}} \right] $

(11)

where $\alpha $ is a fusion coefficient. In the experiments, the method classifies best when $\alpha =3$.

$\boldsymbol{SA}^{+}$ is then normalized by its ${\mathrm{L}}_{2}$ norm, giving a 2 048-dimensional vector. To improve accuracy, the original image is also flipped, following the idea of data augmentation; deep convolutional features of the flipped image are selected and aggregated in the same way, and the features of the original and flipped images are combined into the final feature $\boldsymbol{SA}_{\mathrm{flip}}^{+}$ of dimension 4 096. Finally, $\boldsymbol{SA}_{\mathrm{flip}}^{+}$ is fed to softmax, which converts the feature vector into class probabilities and completes the classification.
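The fusion of Eq. (11), the L2 normalization, and the combination with the flipped-image feature can be sketched as follows. The function names are hypothetical, and combining the original and flipped features by concatenation is inferred from the stated dimensions (2 048 and 4 096) rather than spelled out in the text.

```python
import math


def l2_normalize(v):
    """Divide a vector by its L2 norm (left unchanged if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(sa_pool5, sa_relu52, alpha=3.0):
    """Eq. (11) followed by L2 normalization: [SA_Pool5, alpha * SA_Relu5-2]."""
    return l2_normalize(list(sa_pool5) + [alpha * x for x in sa_relu52])


def fuse_with_flip(original, flipped, alpha=3.0):
    """Concatenate the fused features of the original and the flipped image.

    Each argument is a (sa_pool5, sa_relu52) pair; concatenation is an assumption.
    """
    return fuse(original[0], original[1], alpha) + fuse(flipped[0], flipped[1], alpha)


if __name__ == "__main__":
    pair = ([0.1] * 1024, [0.2] * 1024)   # placeholder 1 024-d features per layer
    print(len(fuse(*pair)))                # 2 x 1 024 entries
    print(len(fuse_with_flip(pair, pair))) # twice that after adding the flip
```

Because $\alpha$ scales the Relu5-2 part before normalization, a value above 1 gives that layer more weight in the normalized vector.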

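2 Experimental results and analysis

2.1 Experimental environment and objectives

The experiments were run on an Intel(R) Xeon(R) CPU E5-2603 V3 @ 1.6 GHz with a GeForce GTX 1080 graphics card (8 GB of video memory) and 36 GB of RAM. The programming environment was MATLAB R2014a and the deep learning toolbox was MatConvNet.

The flower images come from the Oxford 102 Flowers^[1] dataset, which contains only 8 189 images. They are preprocessed as described in Section 1.1 into 32 546 flower images of 224×224 pixels, and each image is given a label, namely its flower class, from 1 to 102.

The experiments are divided into Experiment 1 and Experiment 2 according to the training and test sets used. Experiment 1 uses the same training, validation, and test sets as Ref. [1] and examines the comparison with conventional deep learning models, the effect of the feature aggregation method, the effect of multi-layer feature fusion, and the comparison with existing flower classification methods. Experiment 2 uses new training and test sets and compares the method with a recent deep-learning-based flower classification method.

2.2 Experiment 1

Ref. [1] splits the 8 189 images into 1 020 training, 1 020 validation, and 6 149 test images, and Refs. [2, 4, 8] use the same split. For ease of comparison, Experiment 1 uses the same training, validation, and test sets as Refs. [1-2, 4, 8]. After the preprocessing of Section 1.1, the 1 020 training images yield 4 044 new images, the 1 020 validation images yield 4 029, and the 6 149 test images yield 24 473. All new images are 224×224 pixels.

2.2.1 Experimental procedure

1) Training stage. The flower images are fed into the flower image classification network based on selective convolutional descriptor aggregation to train the softmax layer. The parameters of the VGG-16 model used for feature extraction come from ImageNet pre-training and are left unchanged. The training stage produces the trained model.

2) Test stage. The flower images of the test set are fed into the trained model for feature extraction and aggregation and are classified by the trained softmax layer. The predicted classes are then compared with the true labels to compute the classification accuracy.

2.2.2 Results and analysis

1) Comparison with conventional deep learning models. To verify the effect of the proposed selective aggregation method, the preprocessed flower images were fed into AlexNet^[5], VGG-16^[6], and Xception^[7], all pre-trained on ImageNet, and into the proposed method, and each model was trained and tested. The results are shown in Table 1. Top-1 accuracy is the number of correctly predicted samples divided by the total number of samples, where a prediction counts as correct only if the class with the highest predicted probability is the true class.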

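Table 1 Comparison between deep learning networks and the selective convolutional descriptor aggregation method

Input	Deep learning network	Top-1 accuracy/%
Preprocessed flower images	AlexNet^[5]	80.15
	VGG-16^[6]	66.45
	Xception^[7]	57.88
	Proposed method	85.55

Table 1 shows that the proposed method is more accurate than AlexNet, VGG-16, and Xception. The main reason is that these models classify with fully connected features, which describe the whole image and contain considerable background noise, reducing accuracy. The proposed method instead uses the convolutional features of the salient region, which retain the details of that region and exclude background noise, improving accuracy to some extent.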
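The Top-1 accuracy reported in the tables follows directly from its definition: a sample is correct only when the class with the highest predicted probability matches the true label. The sketch below uses a hypothetical function name and 0-based class indices (the paper labels classes 1 to 102), with made-up probabilities purely for illustration.

```python
def top1_accuracy(probabilities, labels):
    """Percentage of samples whose highest-probability class equals the true label."""
    correct = 0
    for probs, label in zip(probabilities, labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


if __name__ == "__main__":
    # Illustrative softmax outputs for four samples and three classes.
    probabilities = [
        [0.1, 0.7, 0.2],
        [0.6, 0.3, 0.1],
        [0.2, 0.3, 0.5],
        [0.5, 0.4, 0.1],
    ]
    labels = [1, 0, 2, 1]
    print(top1_accuracy(probabilities, labels))
```

In this toy example three of the four predictions match, so the function returns 75.0.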

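2) Effect of the feature aggregation method. When aggregating the effective convolutional features of Pool5 or Relu5-2, this paper combines max-pooling and average-pooling. To confirm the benefit, features were also aggregated with average-pooling alone and with max-pooling alone and fed into softmax for classification. The results are shown in Table 2.

Table 2 The classification results of different feature aggregation methods

Feature aggregation method	Top-1 accuracy/%
average-pooling	83.85
max-pooling	84.42
max-pooling + average-pooling	85.55

Table 2 shows that combining max-pooling and average-pooling further improves classification.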

3) Effect of multi-layer feature fusion. To examine how multi-layer fusion affects the results, the Pool5 selective feature $\boldsymbol{SA}_{\mathrm{Pool5}}$, the Relu5-2 selective feature $\boldsymbol{SA}_{\mathrm{Relu5-2}}$, and the fused feature $\boldsymbol{SA}_{\mathrm{flip}}^{+}$ were each used as the softmax input; all three include the features of the flipped image. The results are shown in Table 3.

Table 3 The comparison of classification between multi-layer feature fusion and single-layer features

Softmax input	Top-1 accuracy/%
SA_Pool5	80.56
SA_Relu5-2	82.83
SA_flip⁺	85.55

Table 3 shows that the multi-layer fusion used in this paper outperforms either single-layer feature, indicating that fused multi-layer features capture finer details of flower images and help classification.

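4) Comparison with existing flower classification methods. The proposed method is compared with current mainstream flower image classification methods in Table 4.

Table 4 The classification results of different flower image classification methods

Method	Top-1 accuracy/%
Ref. [1]	72.80
Ref. [2]	80.66
Ref. [4]	79.10
Ref. [8]	84.02
Proposed method	85.55

In Table 4, Refs. [1-2, 4] are traditional methods and have lower accuracy, mainly because: 1) handcrafted features generalize poorly, so features that work well on one image may fail to classify another; 2) low-level features such as color and shape cannot effectively handle inter-class similarity and intra-class variation. The proposed method clearly outperforms the traditional methods of Refs. [1-2, 4] and is also slightly better than the deep-learning-based method of Ref. [8]. The main reason is that Ref. [8], although it also localizes the salient region with a saliency map, learns fully connected features, which describe the whole image and contain considerable background noise, reducing accuracy.

In summary, the proposed method achieves good classification performance.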

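2.3 Experiment 2

Experiment 1 splits the images into fixed sets of 1 020 training, 1 020 validation, and 6 149 test images, which has two shortcomings. First, the small proportion of training images may hurt classification. Second, training images are never tested and test images are never used for training, so the conclusions are somewhat one-sided. Experiment 2 therefore redefines the training and test data: the 32 546 preprocessed flower images are randomly divided into five parts of nearly equal size (6 509, 6 509, 6 510, 6 509, and 6 509 images), each part is tested in turn, and the proposed method is compared with Inception-v3^[9], a recent deep-learning-based flower classification method.

2.3.1 Experimental procedure

The main steps are: 1) choose one of the five parts as the test set and use the other four as the training set; 2) train the proposed network and the Inception-v3 model^[9], test them, and compute the classification accuracy; 3) repeat steps 1) and 2) five times in total so that every part is tested once; 4) average the five accuracies and compare the two methods.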
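The four steps describe a standard five-fold cross-validation loop, which can be sketched as below. The function names, the fixed random seed, and the `train_and_test` callback are hypothetical; the callback stands in for whatever training and evaluation routine is used and is expected to return one accuracy per fold.

```python
import random


def five_fold_splits(num_items, k=5, seed=0):
    """Randomly partition the indices 0..num_items-1 into k nearly equal folds."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(num_items, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)  # spread the remainder over the first folds
        folds.append(indices[start:start + size])
        start += size
    return folds


def cross_validate(folds, train_and_test):
    """Use each fold once as the test set and return the mean of the fold scores."""
    scores = []
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(train_and_test(train, test_fold))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    folds = five_fold_splits(32546)
    print(sorted(len(fold) for fold in folds))
    # Dummy callback: reports the share of the data held out in each round.
    print(cross_validate(folds, lambda train, test: len(test) / (len(train) + len(test))))
```

With 32 546 items the split produces four folds of 6 509 and one of 6 510, matching the part sizes quoted above; which fold receives the extra item is immaterial.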

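When training Inception-v3^[9] on the flower images, the learning rate was set to 0.01, parameters were updated by stochastic gradient descent, the batch size was 32, and the number of epochs was 10.

2.3.2 Results and analysis

Table 5 compares the proposed method with Inception-v3^[9].

Table 5 The classification comparison of deep learning methods

Method	Test part	Test accuracy/%	Mean accuracy/%
Inception-v3^[9]	1	98.10	97.99
	2	98.60
	3	97.51
	4	98.32
	5	97.42
Proposed method	1	98.89	98.96
	2	99.17
	3	99.12
	4	98.92
	5	98.72

Table 5 shows that the proposed method outperforms Inception-v3^[9]. In addition, the average softmax training time of the proposed method is 203.43 s, whereas the average training time of Inception-v3^[9] is 21 h, far longer. The main reason is that the proposed method does not retrain VGG-16 on flower images and only needs to train the softmax layer.

In summary, the proposed method classifies better than the deep learning network Inception-v3^[9].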

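3 Conclusion

A flower image classification network based on selective convolutional descriptor aggregation is proposed. Drawing on transfer learning, the method uses a pre-trained network to learn flower image features, selects effective deep convolutional features according to the distribution of responses in the feature maps, fuses the deep convolutional features of multiple layers, and classifies with a softmax model. Its advantage is that it localizes the salient region of a flower image without supervision and selects the deep convolutional features inside that region, removing the interference of background features. The experiments show that the method performs well on flower image classification, though there is still room for improvement. Future work could study how combining coarse-grained and fine-grained classification affects flower classification.

References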

[1] Nilsback M E, Zisserman A. Automated flower classification over a large number of classes[C]//Proceedings of the 6th Indian Conference on Computer Vision, Graphics & Image Processing. Bhubaneswar, India: IEEE, 2008: 722-729.[DOI:10.1109/ICVGIP.2008.47]

[2] Angelova A, Zhu S H. Efficient object detection and segmentation for fine-grained recognition[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 811-818.[DOI:10.1109/CVPR.2013.110]

[3] Xie X D, Lyu Y P, Cao D L. Saliency detection based flower image foreground segmentation[EB/OL]. 2014-09-18[2018-06-20]. http://www.paper.edu.cn/releasepaper/content/201409-215. [谢晓东, 吕艳萍, 曹冬林.基于显著性检测的花卉图像分割[EB/OL]. 2014-09-18[2018-06-20]. http://www.paper.edu.cn/releasepaper/content/201409-215.]

[4] Xie X D, Lyu Y P, Cao D L. Hierarchical feature fusion based flower image classification[EB/OL]. 2014-10-07[2018-06-20]. http://www.paper.edu.cn/releasepaper/content/201410-33. [谢晓东, 吕艳萍, 曹冬林.基于层次化特征融合的花卉图像分类[EB/OL]. 2014-10-07[2018-06-20]. http://www.paper.edu.cn/releasepaper/content/201410-33.]

[5] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada, United States: ACM, 2012: 1097-1105.

[6] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL].[2018-06-20] https://arxiv.org/pdf/1409.1556.pdf.

[7] Chollet F. Xception: deep learning with depthwise separable convolutions[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 1800-1807.[DOI:10.1109/CVPR.2017.195]

[8] Liu Y Y, Tang F, Zhou D W, et al. Flower classification via convolutional neural network[C]//Proceedings of 2016 IEEE International Conference on Functional-Structural Plant Growth Modeling, Simulation, Visualization and Applications. Qingdao: IEEE, 2016: 110-116.[DOI:10.1109/FSPMA.2016.7818296]

[9] Xia X L, Xu C, Nan B. Inception-v3 for flower classification[C]//Proceedings of the 2nd International Conference on Image, Vision and Computing. Chengdu: IEEE, 2017: 783-787.[DOI:10.1109/ICIVC.2017.7984661]

[10] Huang S L, Xu Z, Tao D C, et al. Part-stacked CNN for fine-grained visual categorization[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, Nevada: IEEE, 2016: 1173-1182.[DOI:10.1109/CVPR.2016.132]

[11] Zhang N, Donahue J, Girshick R, et al. Part-Based R-CNNs for fine-grained category detection[C]//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer, 2014: 834-849.[DOI:10.1007/978-3-319-10590-1_54]

[12] Weng Y C, Tian Y, Lu D M, et al. Fine-grained bird classification based on deep region networks[J]. Journal of Image and Graphics, 2017, 22(11): 1521–1531. [翁雨辰, 田野, 路敦民, 等. 深度区域网络方法的细粒度图像分类[J]. 中国图象图形学报, 2017, 22(11): 1521–1531. ] [DOI:10.11834/jig.170262]

[13] Matan O, Burges C J C, Cun Y L, et al. Multi-digit recognition using a space displacement neural network[C]//Advances in Neural Information Processing Systems. Denver, Colorado, USA: NIPS, 1991: 488-495.

[14] Jia Y Q, Shelhamer E, Donahue J, et al. Caffe: Convolutional architecture for fast feature embedding[C]//Proceedings of the 22nd ACM International Conference on Multimedia. Orlando, Florida, USA: ACM, 2014: 675-678.[DOI:10.1145/2647868.2654889]

[15] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, Ohio: IEEE, 2014: 580-587.[DOI:10.1109/CVPR.2014.81]

[16] Wei X S, Luo J H, Wu J X, et al. Selective convolutional descriptor aggregation for fine-grained image retrieval[J]. IEEE Transactions on Image Processing, 2017, 26(6): 2868–2881. [DOI:10.1109/TIP.2017.2688133]

[17] Hariharan B, Arbelaez P, Girshick R, et al. Hypercolumns for object segmentation and fine-grained localization[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, Massachusetts: IEEE, 2015: 447-456.[DOI:10.1109/CVPR.2015.7298642]

[18] Lee D, Lin A. Computational complexity of art gallery problems[J]. IEEE Transactions on Information Theory, 1986, 32(2): 276–282. [DOI:10.1109/TIT.1986.1057165]