Fine-grained car recognition method based on region proposal networks
2018, Vol. 23, No. 6, pp. 837-845
Received: 2017-08-23; Revised: 2017-12-19; Published in print: 2018-06-16
DOI: 10.11834/jig.170481

Objective
Fine-grained car model recognition aims to identify the manufacturer, model, year, and related information of a car from appearance images taken at arbitrary angles and in arbitrary scenes, which is of great significance in intelligent transportation, public security, and other fields. Mainstream approaches to this problem have shifted from handcrafted feature extraction to deep learning methods represented by convolutional neural networks. However, these methods still have drawbacks: they require the exact location of the car to be specified at recognition time, and they cannot fully exploit the fact that the visual differences among fine-grained categories are concentrated in a few key local parts of the object. To address these problems, we propose a fine-grained recognition method based on region proposal networks and successfully apply it to car model recognition.
Method
A region proposal network (RPN) is a fully convolutional neural network. Our method first extracts deep convolutional features from the image with a convolutional neural network and then slides a window over the feature map to generate region candidates. The features of each candidate are fed into a classification layer and a regression layer, which output the probability that the region contains an object and the location of that object, respectively. Finally, the candidates are passed through an object detection network to obtain their specific categories and refined object locations, and the final recognition result is obtained with the non-maximum suppression algorithm.
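The last step above, non-maximum suppression (NMS), keeps only the highest-scoring box among heavily overlapping candidates. The following is a minimal NumPy sketch of the standard greedy algorithm, for illustration only; the IoU threshold of 0.3 is an assumed value, not a setting taken from the paper:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # process highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Overlap of the top box with every remaining candidate
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Keep only candidates that do not overlap the top box too much
        order = order[1:][iou <= iou_threshold]
    return keep

# Two near-duplicate detections of one car and one distinct car:
boxes = np.array([[10, 10, 100, 100],
                  [12, 12, 98, 102],
                  [200, 50, 300, 150]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the lower-scored duplicate is suppressed
```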
Result
The proposed method achieves a recognition accuracy of 76.38% on the Stanford BMW-10 dataset and 91.48% on the Stanford Cars-196 dataset, not only far surpassing traditional handcrafted-feature methods but also matching the performance of current state-of-the-art methods. The method also achieves excellent recognition results in real natural scenes.
Conclusion
The region proposal network provides not only the exact locations of objects for detection but also discriminative local regions, offering a new approach to fine-grained object recognition. The proposed method removes the dependence of traditional object recognition on object location and can perform fine-grained car recognition in complex scenes, such as images containing multiple cars, giving it better robustness and practicality.
Objective
Over the past few decades, studies on visual object recognition have mostly focused on the category level, as in the ImageNet Large Scale Visual Recognition Challenge and the PASCAL VOC challenge. With the powerful feature extraction of convolutional neural networks (CNNs), many studies have begun to focus on challenging visual tasks aimed at the subtle classification of subcategories, which is called fine-grained visual pattern recognition. Fine-grained car model recognition is designed to recognize the exact make, model, and year of a car from an arbitrary viewpoint, which is essential in intelligent transportation, public security, and other fields. Research in this field mainly covers three aspects: finding and extracting features of discriminative parts, using alignment algorithms or 3D object representations to eliminate the effects of pose and angle, and looking for powerful feature extractors such as CNN features. All three approaches have defects to various degrees. The bottleneck of most part-based models is accurate part localization, and such methods generally report adequate part localization only when a known bounding box is given at test time. 3D object representations and many other alignment algorithms need complex preprocessing or post-processing of training samples, such as co-segmentation and 3D geometry estimation. Currently, methods based on CNNs significantly outperform previous works that rely on handcrafted features for fine-grained classification, but the location of objects remains essential even at test time because of the subtle differences between categories. These methods are difficult to apply in real intelligent transportation because a video frame in a real traffic monitoring scenario typically shows multiple cars, where each car object and its parts cannot be assigned a bounding box. To solve these problems, the present study proposes a fine-grained car recognition method based on a deep fully convolutional network called the region proposal network (RPN), which automatically proposes regions of discriminative parts and car objects. Our method can be trained in an end-to-end manner without requiring a bounding box at test time.
Method
The RPN is a fully convolutional network that simultaneously predicts object bounding boxes and objectness scores at each position and has achieved remarkable results in object detection. We improve the RPN with an outstanding deep CNN, the deep residual network (ResNet). First, the deep convolutional features of the image are extracted by the ResNet pipeline. Then, we slide a small network over the convolutional feature map, and each sliding window is mapped to a low-dimensional vector. The vector is fed into a box-classification layer and a box-regression layer; the former outputs the probability that a region contains an object, whereas the latter outputs the coordinates of the region through bounding-box regression. Finally, these object region candidates obtain their specific categories and corrected object positions through the object detection network, and the final recognition result is obtained through the non-maximum suppression algorithm. To quickly optimize the model parameters, we use an ImageNet pre-trained model to initialize the RPN and the object detection network and then share convolutional features between them through joint optimization.
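To make the sliding-window design concrete, here is a minimal PyTorch-style sketch of an RPN head on top of a ResNet feature map. This is a sketch under assumed settings, not the authors' implementation: 9 anchors per location and a 256-channel intermediate layer follow the original Faster R-CNN paper, and the 1024 input channels assume a ResNet conv4-stage feature map.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Minimal RPN head: a 3x3 'sliding window' convolution followed by
    two sibling 1x1 convolutions for objectness scores and box regression."""

    def __init__(self, in_channels=1024, mid_channels=256, num_anchors=9):
        super().__init__()
        # The 3x3 conv implements the sliding window over the feature map;
        # each spatial position is mapped to a mid_channels-dim vector.
        self.conv = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
        # Box-classification layer: 2 scores (object / not object) per anchor.
        self.cls = nn.Conv2d(mid_channels, num_anchors * 2, 1)
        # Box-regression layer: 4 coordinate offsets per anchor.
        self.reg = nn.Conv2d(mid_channels, num_anchors * 4, 1)

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.cls(h), self.reg(h)

# Usage on a dummy ResNet-style feature map (batch 1, 1024 channels, 38x50):
head = RPNHead()
scores, deltas = head(torch.randn(1, 1024, 38, 50))
print(scores.shape, deltas.shape)  # (1, 18, 38, 50) and (1, 36, 38, 50)
```

Because both output layers are 1x1 convolutions, every spatial position of the feature map is scored in a single forward pass, which is what makes the sliding-window search efficient.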
Result
First, we verify the performance of the proposed method on several public fine-grained car datasets. The Stanford BMW-10 dataset has 512 pictures covering 10 BMW series; because of the limited training samples, most CNNs suffer from over-fitting on it and obtain poor results. The Stanford Cars-196 dataset is currently the most widely used dataset for fine-grained car recognition, with 16,185 images of 196 fine-grained car models covering SUVs, coupes, convertibles, pickup trucks, and trucks, among others. Second, apart from the public datasets, we conduct a recognition experiment on a real traffic monitoring video. Finally, we carefully analyze the misrecognized samples of our models to explore the room for improvement of fine-grained methods. Recognition accuracy can be significantly improved by training data augmentation; to compare with other methods under the same standard, all our experiments use only horizontal image flipping as data augmentation. The recognition accuracy of our method is 76.38% on the Stanford BMW-10 dataset and 91.48% on the Stanford Cars-196 dataset. The method also achieves excellent recognition results on a traffic monitoring video. In particular, our method is trained in an end-to-end manner and requires no knowledge of object or part bounding boxes at test time. The RPN provides the object detection network not only with the specific location of the car object but also with the discriminative regions that contribute to classification. Most misrecognized samples belong to the same make and have only subtle visual differences. Methods based on handcrafted global feature templates, such as HOG, achieve 28.3% recognition accuracy on the Stanford BMW-10 dataset. The most valuable 3D object representation method, which is trained on 59,040 synthetic images, achieves 76.0% accuracy, lower than that of our method. The state-of-the-art method on the Stanford Cars-196 dataset without bounding-box annotation is the recurrent attention CNN, published at CVPR 2017, which achieved 92.5% recognition accuracy by combining features at three scales through a fully connected layer. Experiments show that our method not only significantly outperforms traditional methods based on handcrafted features but is also comparable with current state-of-the-art methods.
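For reference, the horizontal-flip augmentation mentioned above simply adds a mirrored copy of every training image. A minimal Python sketch with Pillow follows; the directory names are hypothetical, and the paper's own training pipeline is not shown here:

```python
from pathlib import Path
from PIL import Image

def augment_with_flips(src_dir, dst_dir):
    """Write each training image plus a horizontally mirrored copy,
    doubling the effective training set."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.jpg"):
        img = Image.open(path)
        img.save(dst / path.name)                    # original image
        img.transpose(Image.FLIP_LEFT_RIGHT).save(
            dst / f"{path.stem}_flip{path.suffix}")  # mirrored copy

# Hypothetical directories for the Cars-196 training split:
augment_with_flips("cars196/train", "cars196/train_aug")
```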
Conclusion
We introduce a new deep learning method for fine-grained car recognition that overcomes the dependence of traditional object recognition on object location and can recognize cars with high accuracy in complex scenes, such as those containing multiple or densely packed vehicles. The findings of this study provide new ideas for fine-grained object recognition. Compared with traditional methods, the proposed model is better in terms of robustness and practicality.
References
Lazebnik S, Schmid C, Ponce J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories[C]//Proceedings of 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. New York, USA: IEEE, 2006: 2169-2178. [DOI: 10.1109/CVPR.2006.68]
Deng J, Krause J, Li F F. Fine-grained crowdsourcing for fine-grained recognition[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE, 2013: 580-587. [DOI: 10.1109/CVPR.2013.81]
Krause J, Stark M, Deng J, et al. 3D object representations for fine-grained categorization[C]//Proceedings of 2013 IEEE International Conference on Computer Vision Workshops. Sydney, Australia: IEEE, 2013: 554-561. [DOI: 10.1109/ICCVW.2013.77]
Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 26th Annual Conference on Neural Information Processing Systems. Lake Tahoe, USA: NIPS, 2012: 1097-1105.
Yang L J, Luo P, Loy C C, et al. A large-scale car dataset for fine-grained categorization and verification[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 3973-3981. [DOI: 10.1109/CVPR.2015.7299023]
Krause J, Gebru T, Deng J, et al. Learning features and parts for fine-grained recognition[C]//Proceedings of the 22nd International Conference on Pattern Recognition. Stockholm, Sweden: IEEE, 2014: 26-33. [DOI: 10.1109/ICPR.2014.15]
Fu J L, Zheng H L, Mei T. Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE, 2017: 4476-4484. [DOI: 10.1109/CVPR.2017.476]
Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 1-9. [DOI: 10.1109/CVPR.2015.7298594]
Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]//Proceedings of the 32nd International Conference on Machine Learning. Lille, France: PMLR, 2015: 448-456.
Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 2818-2826. [DOI: 10.1109/CVPR.2016.308]
Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning[C]//Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. California, USA: AAAI, 2017.
He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE, 2016: 770-778. [DOI: 10.1109/CVPR.2016.90]
Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. [DOI: 10.1109/TPAMI.2016.2577031]
Jia Y Q, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture for fast feature embedding[C]//Proceedings of the 22nd ACM International Conference on Multimedia. Orlando, USA: ACM, 2014: 675-678. [DOI: 10.1145/2647868.2654889]
Dalal N, Triggs B. Histograms of oriented gradients for human detection[C]//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, USA: IEEE, 2005: 886-893. [DOI: 10.1109/CVPR.2005.177]
Xie S N, Yang T B, Wang X Y, et al. Hyper-class augmented and regularized deep learning for fine-grained image classification[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 2645-2654. [DOI: 10.1109/CVPR.2015.7298880]
Cimpoi M, Maji S, Vedaldi A. Deep filter banks for texture recognition and segmentation[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE, 2015: 3828-3836. [DOI: 10.1109/CVPR.2015.7299007]
Zhao B, Wu X, Feng J S, et al. Diversified visual attention networks for fine-grained object classification[J]. IEEE Transactions on Multimedia, 2017, 19(6): 1245-1256. [DOI: 10.1109/TMM.2017.2648498]
Liu X, Xia T, Wang J, et al. Fully convolutional attention localization networks: efficient attention localization for fine-grained recognition[J]. arXiv preprint, arXiv:1603.06765, 2016. http://arxiv.org/abs/1603.06765v2
Lin T Y, Roychowdhury A, Maji S. Bilinear CNN models for fine-grained visual recognition[C]//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1449-1457. [DOI: 10.1109/ICCV.2015.170]