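Objective Fine-grained image classification divides a broad category into finer subcategories, such as distinguishing bird species, car makes and models, or dog breeds. To address the problems of excessive irrelevant information and background interference in fine-grained image classification, this paper uses deep convolutional networks to build a joint focus-recognition learning framework for fine-grained images. By removing the background and highlighting the target to be recognized, the framework automatically locates discriminative regions and thereby improves fine-grained classification accuracy. Method First, the YOLOv2 network quickly detects the target object. This eliminates the influence of background interference and irrelevant information on the classification result and focuses the model on the discriminative region. The detected objects, i.e., the outputs of YOLOv2, are then fed into a bilinear convolutional neural network for training and classification. The framework can be trained end to end and relies only on class labels, without any other manual annotations. Result Experiments on the fine-grained image datasets CUB-200-2011, Cars196, and Aircrafts100 show that our model achieves classification accuracies of 84.5%, 92%, and 88.4%, respectively. Compared with the highest accuracy obtained by the same classification algorithm, our method improves accuracy by 0.4%, 0.7%, and 3.9%, respectively. It is also 0.5%, 1.4%, and 4.5% higher than the method that uses two identical D-Net networks. Conclusion Using the focus-recognition deep learning framework to extract discriminative regions benefits fine-grained image classification. It filters out most regions that do not contribute to the classification, so the network learns more features that are useful for fine-grained classification. This reduces the impact of background interference on the classification result and improves the recognition rate of the model.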
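The focus step passes only the detected object, not the whole image, to the classifier. It can be sketched as cropping the detector's output box. This is a minimal NumPy sketch under stated assumptions. The detector is not implemented here. It is assumed to return YOLO-style boxes `(x_center, y_center, w, h)`, normalized to [0, 1], each with a confidence score. Only the single highest-confidence box is kept. The helper names `yolo_box_to_pixels` and `focus_crop`, the `min_conf` threshold, and the full-image fallback are hypothetical illustrative choices, not details stated in this paper.

```python
import numpy as np


def yolo_box_to_pixels(box, img_w, img_h):
    """Convert a normalized YOLO-style box (x_center, y_center, w, h)
    into clamped integer pixel corners (x0, y0, x1, y1)."""
    xc, yc, w, h = box
    x0 = int(round((xc - w / 2) * img_w))
    y0 = int(round((yc - h / 2) * img_h))
    x1 = int(round((xc + w / 2) * img_w))
    y1 = int(round((yc + h / 2) * img_h))
    # Clamp so boxes that extend past the image border stay valid.
    x0, x1 = max(0, x0), min(img_w, x1)
    y0, y1 = max(0, y0), min(img_h, y1)
    return x0, y0, x1, y1


def focus_crop(image, detections, min_conf=0.25):
    """Crop the highest-confidence detection out of an (H, W, C) image.

    detections: list of (confidence, (x_center, y_center, w, h)) tuples.
    Assumption (hypothetical): if no box passes min_conf, or the box is
    degenerate, the full image is returned unchanged."""
    img_h, img_w = image.shape[:2]
    kept = [d for d in detections if d[0] >= min_conf]
    if not kept:
        return image
    _, box = max(kept, key=lambda d: d[0])
    x0, y0, x1, y1 = yolo_box_to_pixels(box, img_w, img_h)
    if x1 <= x0 or y1 <= y0:
        return image
    return image[y0:y1, x0:x1]


# Usage with a blank stand-in image and made-up detections.
image = np.zeros((100, 200, 3), dtype=np.uint8)
detections = [(0.9, (0.5, 0.5, 0.5, 0.4)), (0.3, (0.2, 0.2, 0.1, 0.1))]
crop = focus_crop(image, detections)
print(crop.shape)  # (40, 100, 3)
```

The cropped array, rather than the original image, would then be resized and fed to the classification network.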
Objective Fine-grained image classification has been a hot research topic in computer vision in recent years. Its purpose is to subdivide a broad category into finer subcategories, such as distinguishing bird species, car makes and models, or dog breeds. Fine-grained classes often show small inter-class differences and large intra-class differences, so fine-grained image classification is more challenging than ordinary image classification. In addition, fine-grained images contain a great deal of irrelevant information and background interference. This makes it difficult for a network model to learn the truly discriminative characteristics and ultimately degrades classification performance. Therefore, finding the discriminative regions in an image is very important for fine-grained image classification. To solve this problem, we construct a joint focus-recognition deep learning framework for fine-grained image classification. The framework removes the background from the image, highlights the target to be identified, and then automatically locates the discriminative area. As a result, the convolutional neural network can extract more useful and discriminative features, and classification accuracy on fine-grained images improves. Method First, the YOLOv2 algorithm quickly detects the object in each image and eliminates the influence of background interference and unrelated information. The datasets consisting of the detected objects are then used to train a bilinear convolutional neural network. Finally, the trained model is used for fine-grained image classification. YOLOv2 is an improved version of the YOLOv1 object detection algorithm and localizes small objects more precisely.
YOLOv2 automatically finds the target in the image and thereby filters out most of the image regions that do not contribute to classification. The bilinear convolutional neural network is designed specifically for fine-grained image classification. It uses two convolutional neural networks to extract features from the same image simultaneously and combines them into a bilinear feature vector through bilinear pooling. A softmax layer then performs the classification. The bilinear convolutional neural network does not depend on additional manual annotations. It relies only on class labels and can be trained end to end, which greatly reduces the difficulty and complexity of fine-grained image classification. Result We conduct verification experiments on the public standard fine-grained image datasets CUB-200-2011, Cars196, and Aircrafts100. We first apply the trained YOLOv2 detection model to each of the three datasets and then train the bilinear convolutional neural network on the processed datasets. Our bilinear convolutional neural network model achieves classification accuracies of 84.5%, 92%, and 88.4%, respectively. We compare against the highest accuracy obtained by the same classification algorithm without the discriminative-information extraction step. Against that baseline, accuracy on the three datasets improves by 0.4%, 0.7%, and 3.9%, respectively. Against the same classification algorithm extracting features with two identical D-Net networks, the recognition rate increases by 0.5%, 1.4%, and 4.5%, respectively. Conclusion The experiments show that using the focus-recognition network architecture to detect discriminative regions has a positive effect on fine-grained image classification.
The architecture filters out most of the image regions that do not contribute to fine-grained classification and thus reduces the influence of background interference on the classification results. The bilinear convolutional neural network can therefore learn more features that benefit fine-grained image classification, and the recognition rate of the model improves.
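A small NumPy sketch can illustrate the bilinear pooling and softmax classification described above. It assumes the two CNN streams have already produced feature maps of shape (H, W, C_a) and (H, W, C_b) on the same spatial grid. Random arrays stand in for real network outputs. Several choices here are assumptions about this paper's setup rather than details it states: the signed square root and L2 normalization, the hypothetical function names, and the random classifier weights. The normalization steps follow common practice in the original bilinear CNN formulation.

```python
import numpy as np


def bilinear_pool(feat_a, feat_b):
    """Bilinear pooling of two feature maps computed on the same image.

    feat_a: (H, W, C_a) output of stream A; feat_b: (H, W, C_b) output of stream B.
    Returns a normalized vector of length C_a * C_b."""
    h, w, c_a = feat_a.shape
    if feat_b.shape[:2] != (h, w):
        raise ValueError("both streams must share the same spatial grid")
    a = feat_a.reshape(h * w, c_a)              # one row per location l
    b = feat_b.reshape(h * w, feat_b.shape[2])
    # Sum over locations of the outer products a_l b_l^T, written as a^T b.
    phi = (a.T @ b).reshape(-1)
    # Assumption: signed square root + L2 normalization, as in the original B-CNN.
    phi = np.sign(phi) * np.sqrt(np.abs(phi))
    norm = np.linalg.norm(phi)
    return phi / norm if norm > 0 else phi


def softmax_classify(x, weights, bias):
    """Linear layer followed by softmax; weights: (num_classes, D), bias: (num_classes,)."""
    logits = weights @ x + bias
    logits = logits - logits.max()              # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


# Usage with random stand-ins for the two streams' conv features (hypothetical sizes).
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((7, 7, 8))
feat_b = rng.standard_normal((7, 7, 4))
phi = bilinear_pool(feat_a, feat_b)
probs = softmax_classify(phi, rng.standard_normal((5, phi.size)), np.zeros(5))
print(phi.shape, probs.argmax())
```

Summing the outer products over all H×W locations equals the single matrix product a^T b, so no explicit loop is needed. Because of the final L2 normalization, averaging over locations instead of summing would yield the same vector.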