Fine-grained image classification based on YOLOv3 and bilinear feature fusion

Yan Zixu; Hou Zhiqiang; Xiong Lei; Liu Xiaoyi; Yu Wangsheng; Ma Sugang

Published: 2021-04-17
DOI: 10.11834/jig.200031
2021 | Volume 26 | Number 4

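Yan Zixu¹, Hou Zhiqiang¹, Xiong Lei², Liu Xiaoyi¹, Yu Wangsheng³, Ma Sugang¹ (1. College of Computer, Xi'an University of Posts and Telecommunications, Xi'an 710121; 2. College of Telecommunications, Xi'an Jiaotong University, Xi'an 710049; 3. College of Information and Navigation, Air Force Engineering University, Xi'an 710077)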

Abstract

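Objective Fine-grained image classification is a challenging topic in computer vision. Its goal is to divide a broad category into more detailed subcategories, and it is in wide demand in both industry and academia. To address interference from irrelevant background and the difficulty of extracting features that discriminate between categories, an optimized fine-grained classification algorithm is proposed that combines the object detection method YOLOv3 (you only look once) with a bilinear fusion network, with the aim of improving fine-grained image classification performance.

Method A retrained YOLOv3 object detector coarsely locates the target in the image. A background suppression method then removes interfering information outside the target. Finally, the classic fine-grained classification algorithm, the bilinear convolutional neural network (B-CNN), is improved by fusing features from different channels and from convolutional layers at different levels. Fusing the feature vectors of different convolutional layers in the bilinear network yields richer complementary information and thereby improves fine-grained classification accuracy.

Result Experiments show that on the CUB-200-2011 (Caltech-UCSD Birds-200-2011), Cars196, and Aircrafts100 datasets, the proposed algorithm achieves classification accuracies of 86.3%, 92.8%, and 89.0%, respectively, which are 2.2%, 1.5%, and 4.9% higher than those of the classic B-CNN fine-grained classification algorithm, verifying its effectiveness. It also shows certain advantages over existing fine-grained image classification algorithms.

Conclusion The improved algorithm uses YOLOv3 to filter out a large amount of irrelevant background and improves the bilinear convolutional neural classification network through feature fusion, which enriches the feature information and makes the classification results more accurate.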
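The background suppression step can be illustrated with a small sketch. The abstract does not say how the background is suppressed, so this example assumes the simplest reading: every pixel outside the detected bounding box is overwritten with a constant. The function name `suppress_background`, the `(x_min, y_min, x_max, y_max)` box layout, and the `fill` parameter are hypothetical, and plain Python lists stand in for an image array so the sketch has no dependencies.

```python
def suppress_background(image, box, fill=0):
    """Return a copy of `image` with every pixel outside `box` set to `fill`.

    image: list of rows, each row a list of pixel values (grayscale for brevity).
    box:   hypothetical (x_min, y_min, x_max, y_max) layout, max edges exclusive,
           as might be derived from a detector such as YOLOv3.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    x_min, y_min, x_max, y_max = box
    # Clamp the box to the image so an oversized detection stays valid.
    x_min, y_min = max(0, x_min), max(0, y_min)
    x_max, y_max = min(width, x_max), min(height, y_max)
    return [
        [pixel if (y_min <= y < y_max and x_min <= x < x_max) else fill
         for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]


# Toy 3x4 "image"; the box keeps columns 1-2 of rows 0-1.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
masked = suppress_background(image, box=(1, 0, 3, 2))
```

Masking rather than cropping keeps the image size unchanged, which is one plausible reason to prefer it before a fixed-input classification network; whether the authors mask or crop is not stated here.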

Keywords

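fine-grained image classification; target detection; background suppression; feature fusion; bilinear convolutional neural network (B-CNN)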

Fine-grained classification based on bilinear feature fusion and YOLOv3

Yan Zixu¹, Hou Zhiqiang¹, Xiong Lei², Liu Xiaoyi¹, Yu Wangsheng³, Ma Sugang¹ (1. College of Computer, Xi'an University of Posts and Telecommunications, Xi'an 710121, China; 2. College of Telecommunications, Xi'an Jiaotong University, Xi'an 710049, China; 3. College of Information and Navigation, Air Force Engineering University, Xi'an 710077, China)

Abstract

Objective Image classification is a classic topic in computer vision and can be divided into coarse-grained and fine-grained classification. Coarse-grained classification distinguishes objects of different categories, whereas fine-grained image classification subdivides a larger category into finer subcategories, which in many cases has greater practical value. Fine-grained image classification is a challenging research topic with extensive research needs and application scenarios in both industry and academia. Problems remain because of background interference and the difficulty of extracting effective classification features. Compared with general image classification, fine-grained classification is more strongly affected by background interference. This problem can be addressed with object detection methods, whose task is to find the objects of interest in an image and determine their position and size. More and more object detection methods are now based on deep learning, and they fall into two categories: one-stage and two-stage detection methods. One-stage methods are fast, but their accuracy is slightly lower; the main examples are you only look once (YOLO) and the single shot multibox detector (SSD). Two-stage methods first generate candidate regions through region proposals and then process these candidates with a convolutional neural network (CNN); examples include R-CNN (region CNN), SPP-Net (spatial pyramid pooling convolutional network), and Faster R-CNN. Among these frameworks, YOLOv3 of the YOLO series achieves a better balance between detection accuracy and speed than other commonly used object detection frameworks.

Method To improve the accuracy of fine-grained classification, a fine-grained classification algorithm based on YOLOv3 and bilinear feature fusion is proposed in this study. The algorithm first uses a retrained YOLOv3 object detector to coarsely locate the target. A background suppression method then removes irrelevant background interference. Finally, a feature fusion method is used to improve the classic fine-grained classification algorithm, the bilinear convolutional neural network (B-CNN, bilinear CNN). Merging the features of different convolutional layers yields richer complementary information, which is used to improve accuracy. The specific steps are as follows:
1) input the image;
2) use the pre-trained YOLOv3 model to generate discriminative regions;
3) apply background suppression to remove irrelevant background interference outside the detected box;
4) construct a bilinear fine-grained classification network with feature fusion, using a deep convolutional neural network to extract features from the image at multiple convolutional stages;
5) use the outer product to fuse the features of convolutional layers at different stages, and concatenate the resulting fusion features of the three different levels to obtain the final bilinear vector.
Finally, a Softmax layer performs the fine-grained classification.

Result After adding YOLOv3 with background suppression, the classification accuracies on the three datasets are 0.7%, 0.5%, and 3.1% higher than those of B-CNN, respectively, indicating that removing background interference with YOLOv3 effectively improves classification. After using feature fusion to optimize the B-CNN network structure, performance is tested on three datasets, namely CUB-200-2011 (Caltech-UCSD Birds-200-2011), Stanford Cars, and fine-grained visual categorization (FGVC) Aircraft. The results are 1.4% and 1.2% higher than those of B-CNN, which indicates that fusing the features of different convolutional layers and strengthening the spatial relationship of the features effectively improves classification accuracy. With both YOLOv3 background suppression and the feature-fusion B-CNN, the accuracies reach 86.3%, 92.8%, and 89.0% on the three datasets, respectively. Compared with B-CNN, the proposed algorithm improves accuracy by 2.2%, 1.5%, and 4.9% on the three datasets, respectively, indicating its effectiveness. The improved algorithm also shows certain advantages over mainstream algorithms in classification performance.

Conclusion The fine-grained classification algorithm based on YOLOv3 and bilinear feature fusion proposed in this study uses YOLOv3 to filter out a large amount of irrelevant background and obtain discriminative regions of the image. It also improves the bilinear fine-grained classification network by means of feature fusion, so that richer fine-grained features are extracted and the results of fine-grained image classification become more accurate. Because the improved feature-fusion B-CNN learns richer features, the accuracy of fine-grained classification improves to a certain extent. On the three fine-grained datasets, the proposed algorithm outperforms the classic B-CNN algorithm and some mainstream algorithms. On the other hand, new fine-grained classification algorithms continue to emerge. They use many different deep learning models for fine-grained classification but do not use background suppression and feature fusion to extract richer fine-grained features. In the future, we will apply fusion to these new networks and use different types of fusion to further improve the accuracy of fine-grained classification.
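The outer-product fusion and concatenation described in the method can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the fused layers share the same spatial size, pairs every two distinct layers (three layers give three fusion features), and applies the signed square root and L2 normalization from the original B-CNN formulation. Which layers are paired and how they are normalized is not given in the abstract. The names `bilinear_pool`, `normalize`, and `fused_bilinear_vector` are hypothetical, and plain Python lists replace tensors to keep the sketch dependency-free.

```python
import math


def bilinear_pool(feat_a, feat_b):
    """Average outer product of two feature maps over spatial positions.

    feat_a: C_a channels, each a list of N spatial activations.
    feat_b: C_b channels over the same N positions (assumed equal spatial size).
    Returns a flat list of length C_a * C_b.
    """
    n = len(feat_a[0])
    return [sum(a * b for a, b in zip(ch_a, ch_b)) / n
            for ch_a in feat_a for ch_b in feat_b]


def normalize(vec):
    """Signed square root followed by L2 normalization (assumed, as in B-CNN)."""
    rooted = [math.copysign(math.sqrt(abs(v)), v) for v in vec]
    norm = math.sqrt(sum(v * v for v in rooted))
    return [v / norm for v in rooted] if norm > 0 else rooted


def fused_bilinear_vector(layer_feats):
    """Concatenate the bilinear vectors of every pair of distinct layers."""
    fused = []
    for i in range(len(layer_feats)):
        for j in range(i + 1, len(layer_feats)):
            fused.extend(normalize(bilinear_pool(layer_feats[i], layer_feats[j])))
    return fused


# Toy features: 3 layers, 2 channels each, 4 spatial positions.
layer_feats = [
    [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 0.0]],
    [[0.5, 0.5, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]],
    [[0.0, 2.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]],
]
vector = fused_bilinear_vector(layer_feats)  # 3 pairs x (2 x 2) entries = 12 values
```

In a real network the fused vector would feed a fully connected layer followed by Softmax. With realistic channel counts its length grows as the product of the channel counts per pair, so the vector becomes very large.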

Keywords

fine-grained image classification; target detection; background suppression; feature fusion; bilinear convolutional neural network (B-CNN)
