Published: 2018-06-16 DOI: 10.11834/jig.170481 2018 | Volume 23 | Number 6 Image Analysis and Recognition

Received: 2017-08-23; Revised: 2017-12-19. Supported by: National Natural Science Foundation of China (61672202); China Postdoctoral Science Foundation (2014M561817). First author: Yang Juan (1983-), female, lecturer, received her Ph.D. in computer application technology from Hefei University of Technology in 2012; her research interests include deep learning and intelligent information processing. E-mail: yangjuan@hfut.edu.cn. CLC: TP391. Document code: A. Article ID: 1006-8961(2018)06-0837-09


Fine-grained car recognition method based on region proposal networks
Yang Juan, Cao Haoyu, Wang Ronggui, Xue Lixia, Hu Min
School of Computer and Information, Hefei University of Technology, Hefei 230009, China
Supported by: National Natural Science Foundation of China (61672202)

# Abstract

Objective Over the past few decades, studies on visual object recognition have mostly focused on the category level, as in the ImageNet Large-Scale Visual Recognition Challenge and the PASCAL VOC challenge. With the powerful feature extraction of convolutional neural networks (CNNs), many studies have begun to focus on a challenging visual task aimed at the subtle classification of subcategories, which is called fine-grained visual pattern recognition. Fine-grained car model recognition aims to recognize the exact make, model, and year of a car from an arbitrary viewpoint, which is essential in intelligent transportation, public security, and other fields. Research in this field mainly covers three aspects: finding and extracting features of discriminative parts, using alignment algorithms or 3D object representations to eliminate the effects of pose and angle, and looking for powerful feature extractors such as CNN features. All three approaches have defects to various degrees. The bottleneck of most part-based models is accurate part localization, and such methods generally achieve adequate part localization only when a bounding box is given at test time. 3D object representations and many other alignment algorithms need complex preprocessing or post-processing of training samples, such as co-segmentation and 3D geometry estimation. Methods based on CNNs now significantly outperform earlier works that rely on handcrafted features for fine-grained classification, but the location of objects remains essential even at test time because of the subtle differences between categories. These methods are difficult to apply in real intelligent transportation because a video frame in a real traffic monitoring scenario typically shows multiple cars, and each car object and its parts cannot be annotated with a bounding box. To solve these problems, the present study proposes a fine-grained car recognition method based on a deep fully convolutional network called the region proposal network (RPN), which automatically proposes regions of discriminative parts and car objects. Our method can be trained in an end-to-end manner and requires no bounding box at test time. Method The RPN is a fully convolutional network that simultaneously predicts object bounding boxes and objectness scores at each position and has achieved remarkable results in object detection. We improve the RPN with an outstanding deep CNN, the deep residual network (ResNet). First, the deep convolutional features of the image are extracted by the ResNet pipeline. Then, a small network slides over the convolutional feature map, and each sliding window is mapped to a low-dimensional vector. This vector is fed into a box-classification layer and a box-regression layer; the former outputs the probability that a region contains an object, whereas the latter outputs the coordinates of the region through bounding-box regression. Finally, these regional candidates are assigned a specific category and a refined position by the object detection network, and the final recognition result is obtained through the non-maximum suppression algorithm. To quickly optimize the model parameters, we initialize the RPN and the object detection network with ImageNet pre-trained models and then share convolutional features between them through joint optimization. Result First, we verify the performance of the proposed method on several public fine-grained car datasets. The Stanford BMW-10 dataset has 512 images covering 10 BMW series.
However, most CNNs suffer from over-fitting on this dataset and thus obtain poor results because of the limited number of training samples. The Stanford Cars-196 dataset is currently the most widely used dataset for fine-grained car recognition, with 16,185 images of 196 fine-grained car models covering SUVs, coupes, convertibles, pickup trucks, and trucks, among others. Second, apart from using the public datasets, we conduct a recognition experiment on a real traffic monitoring video. Finally, we carefully analyze the misrecognized samples of our model to explore the room for improvement of fine-grained methods. Recognition accuracy can be significantly improved by training-data augmentation; for a fair comparison with other methods, all our experiments use only horizontal flipping as data augmentation. The recognition accuracy of the proposed method is 76.38% on the Stanford BMW-10 dataset and 91.48% on the Stanford Cars-196 dataset. The method also achieves an excellent recognition effect on a traffic monitoring video. In particular, our method is trained in an end-to-end manner and requires no knowledge of object or part bounding boxes at test time. The RPN provides the object detection network not only with the specific location of the car object but also with the distinguishing regions that contribute to classification. The misrecognized samples mostly belong to the same make and show only tiny visual differences. Methods based on handcrafted global feature templates, such as HOG, achieve 28.3% recognition accuracy on the Stanford BMW-10 dataset. The best 3D-object-representation method, trained with 59,040 synthetic images, achieves 76.0% accuracy, which is lower than that of our method. The state-of-the-art method on the Stanford Cars-196 dataset without bounding-box annotation is the recurrent attention CNN (RA-CNN), published at CVPR 2017, which achieves 92.5% recognition accuracy by combining features at three scales through a fully connected layer. Experiments show that our method not only significantly outperforms traditional methods based on handcrafted features but is also comparable to the current state-of-the-art methods. Conclusion We introduce a new deep learning method for fine-grained car recognition that overcomes the dependence of traditional object recognition on object location and can recognize cars with high accuracy in complex scenes, such as those with multiple or densely packed vehicles. The findings of this study can provide new ideas for fine-grained object recognition. Compared with traditional methods, the proposed model is better in terms of robustness and practicability.

# Key words

deep learning; convolutional neural networks; car recognition; fine-grained recognition; image classification

# 0 Introduction

In 2012, Krizhevsky et al. [4] greatly improved the recognition accuracy on a large-scale image dataset with a deep convolutional neural network; since then, deep learning, and convolutional neural networks in particular, has gradually become the norm in every area of computer vision. Yang et al. [5] extracted and classified vehicle features with a deep CNN pre-trained on ImageNet and obtained good recognition results on a large vehicle dataset. This method makes full use of the feature-extraction capability of CNNs, but it is not optimized for the characteristic of fine-grained classification that the visual differences concentrate in local regions. Krause et al. [6] located discriminative local regions through co-segmentation and image alignment and then extracted features with a CNN for classification; however, this method requires extensive and rather complex preprocessing. All of the above methods treat region localization and feature extraction as two independent processes and ignore the correlation between them. Fu et al. [7] proposed the recurrent attention convolutional neural network (RA-CNN), which learns region localization and feature representation in a mutually reinforcing way, iteratively generating region attention from coarse to fine; trained end-to-end, it substantially improved the recognition accuracy of fine-grained classification tasks.

# 2 Network structure for fine-grained car recognition

Faster R-CNN [13] (faster regions with convolutional neural network features) is an outstanding object detection framework. It first extracts the deep convolutional features of an image through a CNN; a region proposal network (RPN) [13] then generates region candidates and region scores, and region-of-interest pooling (ROI pooling) maps each candidate region onto the corresponding convolutional features; finally, the classification and regression branches of the detection network yield the specific category and the refined position of each object. Because the candidate regions are usually numerous and overlap to various degrees, the network obtains the final object positions through the non-maximum suppression (NMS) algorithm. In this paper, the convolutional feature extractor of the original Faster R-CNN is replaced with ResNet-101, which gives the network better recognition and detection performance for fine-grained object recognition. The network structure is shown in Fig. 1: the RPN uses the features of convolutional layer conv4_23, and the detection network uses the features of the last convolutional layer.
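As a rough illustration (a minimal sketch, not the authors' released code), the shared feature extractor described above can be approximated in PyTorch by truncating a ResNet-101 after its conv4_x stage; the input size below is a hypothetical example following the 600-pixel short side used in Section 2.1.

```python
import torch
import torchvision

# ResNet-101 truncated after conv4_x (layer3), the stage whose last unit is
# conv4_23; load ImageNet weights via the pretrained/weights argument of your
# torchvision version to match the paper's initialization.
resnet = torchvision.models.resnet101()
backbone = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,  # output: 1 024 channels, stride 16
).eval()

image = torch.randn(1, 3, 600, 1000)  # short side rescaled to 600 pixels
with torch.no_grad():
    features = backbone(image)
print(features.shape)  # torch.Size([1, 1024, 38, 63]), i.e., 1/16 resolution
```

The RPN and the detection head would both consume these shared features, as shown in Fig. 1.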

# 2.1 RPN

The RPN is a fully convolutional network (FCN). Unlike a classic CNN, an FCN has no fully connected layers, so it can accept input images of arbitrary size and can be trained end-to-end with the back-propagation (BP) algorithm. The RPN takes the convolutional feature maps of an image as input, slides a window over the feature maps to produce a fixed-size low-dimensional feature at each window position, and at each position generates region proposals of multiple scales and aspect ratios through reference boxes (anchors); a classification layer then estimates the probability that each region contains an object, and a region regression layer yields the coarse position of the object.
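A minimal sketch of such an RPN head, assuming the 512 hidden channels and 9 anchors (3 scales × 3 aspect ratios) of the original Faster R-CNN paper [13]; the 1 024 input channels match the conv4_23 features described next:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """3x3 'sliding window' conv followed by two sibling 1x1 convs:
    an objectness classifier and a bounding-box regressor."""
    def __init__(self, in_channels: int = 1024, num_anchors: int = 9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)  # object vs. background
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)  # (dx, dy, dw, dh) per anchor

    def forward(self, features):
        h = torch.relu(self.conv(features))
        return self.cls(h), self.reg(h)

scores, deltas = RPNHead()(torch.randn(1, 1024, 38, 63))
print(scores.shape, deltas.shape)  # (1, 18, 38, 63) and (1, 36, 38, 63)
```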

ResNet-101 accepts a fixed input size of 224×224 pixels; its conv4_23 feature map is 14×14 with 1 024 feature channels, so the stride of conv4_23 corresponds to 16 pixels of the original image. To obtain more accurate object positions, we rescale each image while keeping its aspect ratio so that the short side is 600 pixels.
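A tiny illustrative helper (an assumption for clarity, not code from the paper) for this aspect-preserving rescaling and the resulting feature-map size at stride 16:

```python
def rescale_short_side(width: int, height: int, target: int = 600):
    """Rescale (width, height), keeping the aspect ratio, so that the
    short side equals `target` pixels, as described above."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

w, h = rescale_short_side(1920, 1080)  # e.g., a 1080p traffic-camera frame
print((w, h), "->", (w // 16, h // 16), "feature map at stride 16")
```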

 $L(\{p_i\}, \{t_i\}) = \frac{1}{N_{\rm cls}}\sum_i L_{\rm cls}(p_i, p_i^*) + \frac{\lambda}{N_{\rm reg}}\sum_i p_i^* L_{\rm reg}(t_i, t_i^*)$ (1)
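Eq. (1) is the multi-task RPN training loss of Faster R-CNN [13]: $p_i$ is the predicted objectness probability of anchor $i$, $p_i^*$ its ground-truth label, $t_i$ and $t_i^*$ the predicted and target box offsets, $N_{\rm cls}$ and $N_{\rm reg}$ normalization constants, and $\lambda$ a balancing weight. A minimal sketch, assuming the cross-entropy and smooth-L1 terms and $\lambda = 10$ of [13]:

```python
import torch
import torch.nn.functional as F

def rpn_loss(p_logits, p_star, t, t_star, lam=10.0):
    """Sketch of Eq. (1). p_logits: (N, 2) objectness logits for N sampled
    anchors; p_star: (N,) integer 0/1 ground-truth labels; t, t_star: (N, 4)
    predicted and target box offsets."""
    n_cls = p_logits.shape[0]                # minibatch size N_cls
    l_cls = F.cross_entropy(p_logits, p_star, reduction='sum') / n_cls
    pos = p_star == 1                        # regression counts only positive anchors
    n_reg = max(int(pos.sum()), 1)           # simplification; [13] normalizes by the
                                             # number of anchor locations
    l_reg = F.smooth_l1_loss(t[pos], t_star[pos], reduction='sum') / n_reg
    return l_cls + lam * l_reg
```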

# 2.3 Non-maximum suppression algorithm

The RPN works in a sliding-window fashion and may therefore produce multiple candidate boxes for the same object; the NMS algorithm removes these redundant boxes. Its procedure is as follows (a code sketch is given after Eq. (5)):

1) Iterate over all classes. For each class, sort the boxes whose class score exceeds a threshold in descending order of score and discard all other boxes.

2) Select the box with the highest score and compute the overlap ratio (IoU) between the selected box and every remaining box; any box whose IoU exceeds a given threshold is removed.

3) From the boxes not yet processed, select the one with the highest score and repeat step 2) until all boxes have been processed. The IoU $U_{A,B}$ of two boxes $\boldsymbol{A}$ and $\boldsymbol{B}$, where $s(\cdot)$ denotes area, is computed as

 $U_{A,B} = \frac{s\left(\boldsymbol{A} \cap \boldsymbol{B}\right)}{s\left(\boldsymbol{A} \cup \boldsymbol{B}\right)}$ (5)
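A minimal sketch of Eq. (5) and the greedy procedure above; the score and IoU thresholds are illustrative assumptions:

```python
import torch

def iou(box, boxes):
    """Eq. (5): intersection area over union area; boxes are (x1, y1, x2, y2)."""
    x1 = torch.max(box[0], boxes[:, 0])
    y1 = torch.max(box[1], boxes[:, 1])
    x2 = torch.min(box[2], boxes[:, 2])
    y2 = torch.min(box[3], boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, score_thresh=0.05, iou_thresh=0.7):
    """Greedy NMS for one class, following steps 1)-3); run once per class."""
    keep = []
    idx = torch.nonzero(scores > score_thresh).flatten()  # step 1: score filter
    idx = idx[scores[idx].argsort(descending=True)]       # step 1: sort by score
    while idx.numel() > 0:
        best, rest = idx[0], idx[1:]                      # step 2: best remaining box
        keep.append(int(best))
        idx = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]  # steps 2-3: suppress
    return keep
```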

# 3.1 Experimental details

The Stanford BMW-10 dataset contains images of 10 BMW series taken from various angles, with about 25 training images per class. The dataset is quite challenging because the training samples are few and the differences between classes are very small; deep learning methods therefore tend to overfit and struggle to achieve good recognition results.

Stanford Cars-196 is currently the most widely used dataset in fine-grained car recognition. It contains 16,185 images of 196 common car models, covering sedans, SUVs, vans, sports cars, and many other types.

# 3.2 Results on the BMW-10 dataset

Table 1 Comparison of results on the BMW-10 dataset

| Method | Accuracy /% | With horizontal flip /% |
| :--- | :---: | :---: |
| HOG [15] | 28.3 | - |
| SPM [1] | 52.8 | 66.1 |
| BB [2] | 58.7 | 69.3 |
| SPM-3D-L [3] | 58.7 | 67.3 |
| BB-3D-G [3] | 66.1 | 76.0 |
| CaffeNet [14] | 48.43 | 58.27 |
| GoogLeNet [8] | - | 41.34 |
| Proposed method | - | 76.38 |

# 3.3 Results on the Cars-196 dataset

Table 2 Comparison of results on the Cars-196 dataset

| Method | Accuracy /% |
| :--- | :---: |
| HAR-CNN [16] | 80.8 |
| FV-CNN [17] | 85.7 |
| DVAN [18] | 87.1 |
| FCAN [19] | 89.1 |
| B-CNN [20] | 91.3 |
| RA-CNN [7] | 92.5 |
| Proposed method | 91.48 |

# References

• [1] Lazebnik S, Schmid C, Ponce J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories[C]//Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York, USA: IEEE, 2006: 2169-2178. [DOI:10.1109/CVPR.2006.68]
• [2] Deng J, Krause J, Li F F. Fine-grained crowdsourcing for fine-grained recognition[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE, 2013: 580-587. [DOI:10.1109/CVPR.2013.81]
• [3] Krause J, Stark M, Deng J, et al. 3D object representations for fine-grained categorization[C]//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE, 2013: 554-561. [DOI:10.1109/ICCVW.2013.77]
• [4] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 26th Annual Conference on Neural Information Processing Systems. Lake Tahoe, Nevada, USA: NIPS, 2012: 1097-1105.
• [5] Yang L J, Luo P, Loy C C, et al. A large-scale car dataset for fine-grained categorization and verification[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 3973-3981. [DOI:10.1109/CVPR.2015.7299023]
• [6] Krause J, Gebru T, Deng J, et al. Learning features and parts for fine-grained recognition[C]//Proceedings of the 22nd International Conference on Pattern Recognition. Stockholm, Sweden: IEEE, 2014: 26-33. [DOI:10.1109/ICPR.2014.15]
• [7] Fu J L, Zheng H L, Mei T. Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 4476-4484. [DOI:10.1109/CVPR.2017.476]
• [8] Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 1-9. [DOI:10.1109/CVPR.2015.7298594]
• [9] Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: PMLR, 2015: 448-456.
• [10] Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 2818-2826. [DOI:10.1109/CVPR.2016.308]
• [11] Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning[C]//Proceedings of 2017 Thirty-First AAAI Conference on Artificial Intelligence. California, USA: AAAI, 2017.
• [12] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 770-778. [DOI:10.1109/CVPR.2016.90]
• [13] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. [DOI:10.1109/TPAMI.2016.2577031]
• [14] Jia Y Q, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture for fast feature embedding[C]//Proceedings of the 22nd ACM International Conference on Multimedia. Orlando, USA: ACM, 2014: 675-678. [DOI:10.1145/2647868.2654889]
• [15] Dalal N, Triggs B. Histograms of oriented gradients for human detection[C]//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE, 2005: 886-893. [DOI:10.1109/CVPR.2005.177]
• [16] Xie S N, Yang T B, Wang X Y, et al. Hyper-class augmented and regularized deep learning for fine-grained image classification[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 2645-2654. [DOI:10.1109/CVPR.2015.7298880]
• [17] Cimpoi M, Maji S, Vedaldi A. Deep filter banks for texture recognition and segmentation[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 3828-3836. [DOI:10.1109/CVPR.2015.7299007]
• [18] Zhao B, Wu X, Feng J S, et al. Diversified visual attention networks for fine-grained object classification[J]. IEEE Transactions on Multimedia, 2017, 19(6): 1245-1256. [DOI:10.1109/TMM.2017.2648498]
• [19] Liu X, Xia T, Wang J, et al. Fully convolutional attention localization networks: efficient attention localization for fine-grained recognition[J]. arXiv Preprint, arXiv: 1603.06765, 2016. http://arxiv.org/abs/1603.06765v2
• [20] Lin T Y, Roychowdhury A, Maji S. Bilinear CNN models for fine-grained visual recognition[C]//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1449-1457. [DOI:10.1109/ICCV.2015.170]