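Wang Yongxiong, Zhang Xiaobing (School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093)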
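Objective Fine-grained image classification divides a broad category into finer subcategories, such as distinguishing bird species, car makes and models, or dog breeds. To address the excess of irrelevant information and the background interference in fine-grained image classification, this paper uses deep convolutional networks to build a joint focus-and-recognition learning framework. By removing the background, highlighting the target to be recognized, and automatically locating discriminative regions, the framework improves the recognition rate of fine-grained image classification. Method A network based on Yolov2 (you only look once v2) first detects the target object quickly, which eliminates the influence of background interference and irrelevant information on the classification result and focuses on the discriminative region. The detected object (that is, the Yolov2 output) is then fed into a bilinear convolutional neural network for training and classification. The framework can be trained end to end and relies only on class labels, with no other manual annotation required. Result In experiments on the fine-grained image datasets CUB-200-2011, Cars196, and Aircrafts100, the proposed model reaches classification accuracies of 84.5%, 92%, and 88.4%, respectively. Compared with the highest accuracy obtained by classification algorithms of the same type, accuracy improves by 0.4, 0.7, and 3.9 percentage points, respectively, and it is 0.5, 1.4, and 4.5 percentage points higher than the method that uses two identical D(dense)-Net networks. Conclusion Extracting discriminative regions with the focus-and-recognition deep learning framework benefits fine-grained image classification. It filters out most of the regions that do not contribute to classification, so the network learns more features that are useful for fine-grained classification, which reduces the influence of background interference on the result and raises the recognition rate of the model.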
Fine-grained image classification with network architecture of focus and recognition
Wang Yongxiong, Zhang Xiaobing (School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)
Objective In recent years, research attention has shifted from coarse-grained to fine-grained image classification, which has become an active topic in computer vision. Its purpose is to subdivide a broad category into detailed subcategories, for example distinguishing bird species, car makes and models, or dog breeds. Fine-grained image classification is in strong demand in practice. In ecological protection, for example, identifying different species of organisms is key to ecological research. In botany, the large variety and number of flowers and the similarity between different flowers make fine-grained classification especially difficult. Computer vision makes it possible to perform such classification at low cost. However, fine-grained categories often show small differences between classes and large differences within a class, so the task is more challenging than ordinary image classification. Moreover, fine-grained images contain much irrelevant information and background interference. These problems prevent the network from learning the truly discriminative features and lead to inferior classification performance. Finding the discriminative regions in an image is therefore important for improving fine-grained classification. To solve this problem, a joint deep learning framework of focus and recognition is constructed for fine-grained image classification.
This framework removes the background in the image, highlights the target to be identified, and then automatically locates the discriminative area in the image. As a result, the deep convolutional neural network can extract more useful and discriminative features, and the classification accuracy on fine-grained images improves. Method First, the Yolov2 (you only look once v2) object detection algorithm rapidly detects the object in the image and eliminates the influence of background interference and unrelated information. The datasets made up of the detected target objects are then used to train a bilinear convolutional neural network, and the resulting model is used for fine-grained image classification. Yolov2 is an improvement of the Yolov1 object detection algorithm and localizes small objects more precisely. It automatically finds the target in the picture and thereby filters out most of the regions that do not contribute to image classification. The bilinear convolutional neural network is a network designed for fine-grained image classification. It uses two convolutional neural networks to extract features from the same picture simultaneously and combines them into a bilinear feature vector through bilinear pooling. The bilinear feature vector is then fed into a softmax layer, which produces the final classification result. A further advantage of the bilinear convolutional neural network is that it does not depend on additional manual annotation and forms a single system that can be trained end to end. It relies only on class labels, which greatly reduces the difficulty and complexity of fine-grained image classification. Result We perform verification experiments on the open standard fine-grained image datasets CUB-200-2011, Cars196, and Aircrafts100.
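The focus-then-recognize steps described in the Method part can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `crop_to_box`, `bilinear_pool`, and `normalize` are hypothetical, the bounding box stands in for a Yolov2 detection, the two small feature lists stand in for the outputs of the two convolutional streams, and the signed square root with L2 normalization is assumed from the usual bilinear CNN formulation rather than stated in this abstract.

```python
import math


def crop_to_box(image, box):
    """Keep only the pixels inside box = (top, left, bottom, right), end-exclusive.

    Stands in for the "focus" step: the box would come from the detector.
    """
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def bilinear_pool(feats_a, feats_b):
    """Sum-pool the outer product of two feature streams over all locations.

    feats_a: per-location vectors from stream A (length C each)
    feats_b: per-location vectors from stream B (length D each)
    Returns a flat vector of length C * D.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both streams must cover the same locations")
    c, d = len(feats_a[0]), len(feats_b[0])
    pooled = [0.0] * (c * d)
    for fa, fb in zip(feats_a, feats_b):
        for i, a in enumerate(fa):
            for j, b in enumerate(fb):
                pooled[i * d + j] += a * b
    return pooled


def normalize(vec, eps=1e-12):
    """Signed square root followed by L2 normalization (assumed post-processing)."""
    rooted = [math.copysign(math.sqrt(abs(v)), v) for v in vec]
    norm = math.sqrt(sum(v * v for v in rooted))
    return [v / (norm + eps) for v in rooted]


# Toy example: a 4x4 "image" and a hypothetical detection box.
image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
focused = crop_to_box(image, (1, 1, 3, 3))

# Toy features at two spatial locations from two hypothetical streams.
stream_a = [[1.0, 2.0], [3.0, 4.0]]
stream_b = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
descriptor = normalize(bilinear_pool(stream_a, stream_b))
```

In a full pipeline the normalized descriptor would be passed to the softmax classification layer; here it only shows how the two streams are combined into one vector of length C × D.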
We apply a pre-trained Yolov2 object detection model to each of the three datasets to obtain the discriminative regions in their images, and then train the bilinear convolutional neural network on the processed datasets. The resulting model achieves classification accuracies of 84.5%, 92%, and 88.4% on the three datasets, respectively. Compared with the highest accuracy obtained by the same classification algorithm without discriminative-region extraction, accuracy improves by 0.4, 0.7, and 3.9 percentage points. The recognition rate is also 0.5, 1.4, and 4.5 percentage points higher than that of the same classification algorithm when it extracts features from two identical D(dense)-Net networks. We also compare with other fine-grained image classification algorithms, such as Spatial Transformer Networks, which perform well on fine-grained classification, form a single end-to-end system, and depend only on label information. Even so, our classification accuracy is 0.4 percentage points higher than that of Spatial Transformer Networks on the bird dataset. Conclusion This paper proposes a method based on a focus-and-recognition network architecture to improve the recognition rate of fine-grained image classification. The experimental results show that using this architecture to detect the discriminative region in the image has a positive effect on fine-grained classification. It filters out most of the image area that does not contribute to the classification of fine-grained images, thereby reducing the influence of background interference on the classification results.
Thus, the bilinear convolutional neural network can learn more features that are useful for classifying fine-grained images, and the recognition rate of the model improves. Comparisons with other fine-grained image classification algorithms on several datasets further support the effectiveness of our algorithm.