Yang Juan, Cao Haoyu, Wang Ronggui, Xue Lixia, Hu Min (School of Computer and Information, Hefei University of Technology, Hefei 230009, China)
Objective Fine-grained car model recognition aims to identify the manufacturer, model, year, and related attributes of a car from appearance images taken at arbitrary angles and in arbitrary scenes, which is important in intelligent transportation, public security, and other fields. Mainstream methods for this problem have shifted from handcrafted feature extraction to deep learning approaches represented by convolutional neural networks. However, such methods still have drawbacks: first, the specific location of the car must be given at recognition time; second, they cannot fully exploit the fact that the visual differences among fine-grained categories are concentrated in key local parts of the object. To solve these problems, we propose a fine-grained recognition method based on region proposal networks and successfully apply it to car model recognition. Method A region proposal network (RPN) is a fully convolutional neural network. Our method first extracts deep convolutional features of the image with a convolutional neural network and then slides a window over the feature map to generate region candidates. The features of each candidate are fed into a classification layer and a regression layer to obtain the probability that it contains an object and the position of the object. Finally, these candidates are passed through an object detection network to obtain their specific categories and precise object positions, and the final recognition result is produced by the non-maximum suppression algorithm. Result The recognition accuracy of the proposed method is 76.38% on the Stanford BMW-10 dataset and 91.48% on the Stanford Cars-196 dataset, which not only far exceeds that of traditional handcrafted-feature methods but is also comparable to the current state-of-the-art methods. The method also achieves excellent recognition results in real natural scenes. Conclusion The region proposal network provides not only the specific location of the object for detection but also discriminative local regions, offering a new approach to fine-grained object recognition. The proposed method overcomes the dependence of traditional object recognition on object location and achieves fine-grained car model recognition in complex scenes, such as multiple cars in one image, with better robustness and practicality.
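The regression layer described above predicts box offsets relative to a candidate region rather than absolute coordinates. A minimal sketch of the standard RPN-style box parameterization, where (dx, dy) shift the box center and (dw, dh) rescale it logarithmically (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def decode_boxes(anchors, deltas):
    """Apply predicted (dx, dy, dw, dh) deltas to anchor boxes.

    anchors: (N, 4) array of [x1, y1, x2, y2] boxes.
    deltas:  (N, 4) array of regression-layer outputs.
    Returns the corrected boxes in [x1, y1, x2, y2] form.
    """
    widths = anchors[:, 2] - anchors[:, 0]
    heights = anchors[:, 3] - anchors[:, 1]
    ctr_x = anchors[:, 0] + 0.5 * widths
    ctr_y = anchors[:, 1] + 0.5 * heights

    dx, dy, dw, dh = deltas[:, 0], deltas[:, 1], deltas[:, 2], deltas[:, 3]
    # Shift the center proportionally to the anchor size,
    # and rescale width/height exponentially.
    pred_ctr_x = ctr_x + dx * widths
    pred_ctr_y = ctr_y + dy * heights
    pred_w = widths * np.exp(dw)
    pred_h = heights * np.exp(dh)

    return np.stack([pred_ctr_x - 0.5 * pred_w,
                     pred_ctr_y - 0.5 * pred_h,
                     pred_ctr_x + 0.5 * pred_w,
                     pred_ctr_y + 0.5 * pred_h], axis=1)
```

With all-zero deltas the anchor is returned unchanged, which is why this parameterization is easy to regress: the network only learns small corrections.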
Fine-grained car recognition method based on region proposal networks
Yang Juan, Cao Haoyu, Wang Ronggui, Xue Lixia, Hu Min (School of Computer and Information, Hefei University of Technology, Hefei 230009, China)
Objective Over the past few decades, studies on visual object recognition have mostly focused on the category level, as in the ImageNet Large Scale Visual Recognition Challenge and the PASCAL VOC challenge. With the powerful feature extraction of convolutional neural networks (CNNs), many studies have begun to focus on the challenging visual task of subtly classifying subcategories, which is called fine-grained visual recognition. Fine-grained car model recognition aims to recognize the exact make, model, and year of a car from an arbitrary viewpoint, which is essential in intelligent transportation, public security, and other fields. Research in this field mainly covers three aspects: finding and extracting features of discriminative parts, using alignment algorithms or 3D object representations to eliminate the effects of pose and angle, and looking for powerful feature extractors such as CNN features. Each of these three approaches has its defects. The bottleneck of most part-based models is accurate part localization, and such methods generally achieve adequate part localization only when a bounding box is given at test time. 3D object representations and many other alignment algorithms need complex preprocessing or post-processing of training samples, such as co-segmentation and 3D geometry estimation. Methods based on CNNs significantly outperform previous works that rely on handcrafted features for fine-grained classification, but the location of the object is still required even at test time because of the subtle differences between categories. These methods are therefore difficult to apply in real intelligent transportation, because a video frame from a real traffic monitoring scenario typically shows multiple cars, and each car object and its parts cannot be assigned a bounding box in advance.
To solve these problems, the present study proposes a fine-grained car recognition method based on a deep fully convolutional network called a region proposal network (RPN), which automatically proposes regions covering discriminative parts and whole car objects. Our method can be trained in an end-to-end manner and requires no bounding box at test time. Method An RPN is a fully convolutional network that simultaneously predicts object bounding boxes and objectness scores at each position and has achieved remarkable results in object detection. We improve the RPN with an outstanding deep CNN, the deep residual network (ResNet). First, the deep convolutional features of the image are extracted by the ResNet pipeline. Then, we slide a small network over the convolutional feature map, and each sliding window is mapped to a low-dimensional vector. The vector is fed into a box-classification layer and a box-regression layer; the former outputs the probability that a region contains an object, whereas the latter outputs the coordinates of the region through bounding-box regression. Finally, these region candidates are passed through the object detection network to obtain their specific categories and corrected object positions, and the final recognition result is produced by the non-maximum suppression algorithm. To quickly optimize the model parameters, we initialize the RPN and the object detection network with an ImageNet pre-trained model and then share convolutional features between them through joint optimization. Result First, we verify the performance of the proposed method on several public fine-grained car datasets. The Stanford BMW-10 dataset has 512 pictures covering 10 BMW series. Most CNNs suffer from over-fitting on it and thus obtain poor results because of the limited number of training samples.
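The non-maximum suppression step that ends the pipeline above can be sketched as a greedy loop: repeatedly keep the highest-scoring remaining box and discard every box that overlaps it too much. A minimal NumPy version (names are illustrative; real detectors typically use an optimized library routine):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2]; scores: (N,) array.
    Returns the indices of the boxes kept, highest score first.
    """
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of box i with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Drop boxes whose overlap with box i exceeds the threshold.
        order = rest[iou < iou_threshold]
    return keep
```

This is what lets the method handle several cars in one frame: overlapping proposals for the same car collapse to one detection, while well-separated cars each survive.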
The Stanford Cars-196 dataset is currently the most widely used dataset for fine-grained car recognition, with 16185 images of 196 fine-grained car models covering SUVs, coupes, convertibles, pickup trucks, and trucks, among others. Second, beyond the public datasets, we conduct a recognition experiment on a real traffic monitoring video. Finally, we carefully analyze the samples misrecognized by our models to explore the room for improvement of fine-grained methods. Recognition accuracy can be significantly improved by training data augmentation; to compare with other methods under the same standard, all our experiments use only horizontal image flipping as data augmentation. The recognition accuracy of the proposed method is 76.38% on the Stanford BMW-10 dataset and 91.48% on the Stanford Cars-196 dataset. The method also achieves excellent recognition results on a traffic monitoring video. In particular, our method is trained in an end-to-end manner and requires no knowledge of object or part bounding boxes at test time. The RPN provides the object detection network not only with the specific location of the car object but also with discriminative regions that contribute to classification. The misrecognized samples mostly belong to the same make and have only tiny visual differences. Methods based on handcrafted global feature templates, such as HOG, achieve 28.3% recognition accuracy on the Stanford BMW-10 dataset. The strong 3D object representation method, trained on 59040 synthetic images, achieves 76.0% accuracy, which is lower than that of our method. The state-of-the-art method on the Stanford Cars-196 dataset without bounding-box annotations is the recurrent attention CNN published at CVPR 2017, which achieves 92.5% recognition accuracy by combining features at three scales through a fully connected layer.
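Since horizontal flipping is the only augmentation used in the experiments above, it is worth noting that for detection-style training the box annotations must be mirrored along with the pixels. A sketch of that transform, assuming [x1, y1, x2, y2] boxes (function name is illustrative):

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Horizontally flip an image and its bounding boxes together.

    image: (H, W, C) array; boxes: (N, 4) array of [x1, y1, x2, y2].
    Returns the flipped image and the mirrored boxes.
    """
    w = image.shape[1]
    flipped = image[:, ::-1]  # reverse the pixel columns
    out = boxes.astype(float).copy()
    # x coordinates mirror about the image width; x1 and x2 swap roles
    # so the output box still satisfies x1 <= x2.
    out[:, 0] = w - boxes[:, 2]
    out[:, 2] = w - boxes[:, 0]
    return flipped, out
```

Doubling the training set this way costs nothing at test time, which keeps the comparison with other published numbers fair.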
Experiments show that our method not only significantly outperforms traditional methods based on handcrafted features but is also comparable to the current state-of-the-art methods. Conclusion We introduce a new deep learning method for fine-grained car recognition that overcomes the dependence of traditional object recognition on object location and recognizes cars with high accuracy in complex scenes, such as those containing multiple or densely packed vehicles. The findings of this study offer a new approach to fine-grained object recognition. Compared with traditional methods, the proposed model is better in terms of robustness and practicability.