Fine-grained classification based on bilinear feature fusion and YOLOv3
2021, Vol. 26, No. 4, Pages: 847-856
Print publication date: 2021-04-16
Accepted: 2020-10-15
DOI: 10.11834/jig.200031
Zixu Yan, Zhiqiang Hou, Lei Xiong, Xiaoyi Liu, Wangsheng Yu, Sugang Ma. Fine-grained classification based on bilinear feature fusion and YOLOv3[J]. Journal of Image and Graphics, 2021, 26(4): 847-856.
Objective
Fine-grained image classification is a challenging topic in computer vision. Its goal is to divide a broad category into more detailed subcategories, and it is in wide demand in both industry and academia. To mitigate the interference of irrelevant background and the difficulty of extracting inter-class discriminative features in fine-grained image classification, this paper proposes an optimized fine-grained classification algorithm that combines the object detection method YOLOv3 (you only look once) with a bilinear fusion network, thereby improving fine-grained classification performance.
Method
First, the retrained object detection algorithm YOLOv3 is used to roughly locate the target in the image. Second, a background suppression method eliminates the interference of information outside the target. Finally, the classic fine-grained classification algorithm, the bilinear convolutional neural network (B-CNN), is improved by fusing convolutional features from different channels and different levels: fusing the feature vectors of different convolutional layers in the bilinear network yields richer complementary information and thus higher fine-grained classification accuracy.
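To make the background-suppression step concrete, here is a minimal sketch that masks out every pixel outside the box returned by the detector. The abstract does not specify the exact operation, so the zero-masking strategy, the box format, and the function name `suppress_background` are illustrative assumptions.

```python
import numpy as np

def suppress_background(image, box):
    """Zero out all pixels outside the detected bounding box.

    image: H x W x C array; box: (x1, y1, x2, y2) in pixels,
    e.g., the highest-confidence detection returned by YOLOv3.
    Masking (rather than cropping) keeps the original image size, which is
    one plausible way to realize the background-suppression step.
    """
    h, w = image.shape[:2]
    x1, y1 = max(0, int(box[0])), max(0, int(box[1]))
    x2, y2 = min(w, int(box[2])), min(h, int(box[3]))
    mask = np.zeros((h, w, 1), dtype=image.dtype)
    mask[y1:y2, x1:x2] = 1
    return image * mask
```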
Result
Experimental results show that on the CUB-200-2011 (Caltech-UCSD Birds-200-2011), Cars196, and Aircrafts100 datasets, the proposed algorithm achieves classification accuracies of 86.3%, 92.8%, and 89.0%, respectively, which are 2.2%, 1.5%, and 4.9% higher than those of the classic B-CNN fine-grained classification algorithm, verifying its effectiveness. The algorithm also shows certain advantages over existing fine-grained image classification algorithms.
Conclusion
The improved algorithm uses YOLOv3 to effectively filter out a large amount of irrelevant background and improves the bilinear convolutional classification network through feature fusion, enriching the feature information and making the classification results more accurate.
Objective
Image classification is a classic topic in computer vision. It can be divided into coarse-grained classification and fine-grained classification. The purpose of coarse-grained classification is to identify objects of different categories, whereas fine-grained image classification subdivides large categories into finer subcategories, which in many cases have greater practical value. Fine-grained image classification is a challenging research topic, with extensive research needs and application scenarios in both industry and academia. However, it still suffers from background interference and the difficulty of extracting effective discriminative features. Compared with general image classification, fine-grained classification is more sensitive to background interference, a problem that can be addressed by object detection methods. The task of object detection is to find the objects of interest in an image and determine their position and size. At present, most object detection methods are based on deep learning and can be divided into two categories: one-stage and two-stage detectors. One-stage detectors are fast but slightly less accurate; representative examples include you only look once (YOLO) and the single shot multibox detector (SSD). Two-stage detectors first use region proposals to generate candidate targets and then use a convolutional neural network (CNN) to classify and refine the candidates; representative examples include R-CNN (region CNN), SPP-Net (spatial pyramid pooling network), and Faster R-CNN. Among these frameworks, YOLOv3 of the YOLO series achieves a better balance between detection accuracy and speed than other commonly used object detectors.
Method
To improve fine-grained classification accuracy, a fine-grained classification algorithm based on the fusion of YOLOv3 and bilinear features is proposed in this study. The algorithm first uses the retrained object detection algorithm YOLOv3 to coarsely locate the target. Then, a background suppression method removes irrelevant background interference. Finally, feature fusion is applied to the bilinear convolutional neural network (B-CNN), the classic fine-grained classification algorithm, which improves it considerably: merging the features of different convolutional layers provides richer complementary information and thus higher accuracy. The specific steps are as follows: 1) input the image; 2) use the retrained YOLOv3 model to generate the discriminative region; 3) apply background suppression to remove irrelevant background outside the detected box; 4) construct a feature-fusion bilinear classification network and use deep convolutional neural networks to extract features at multiple convolutional stages of the image; 5) fuse the features of convolutional layers at different stages with the outer product, concatenate the three resulting fused features into the final bilinear vector, and use a Softmax layer to obtain the fine-grained classification.
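The fusion in steps 4) and 5) can be made concrete with a short PyTorch sketch. The abstract only states that features of different convolutional layers are fused by outer products and concatenated before the Softmax layer; the VGG-16 backbone, the three stages tapped, the 1×1 projection, the pooled spatial size, and the signed-square-root plus L2 normalization below are assumptions borrowed from the standard B-CNN formulation, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FusedBilinearCNN(nn.Module):
    """Sketch of a feature-fusion bilinear classifier (assumed configuration)."""

    def __init__(self, num_classes=200):
        super().__init__()
        vgg = models.vgg16(weights=None).features  # pretrained weights omitted here
        # Split the backbone so intermediate feature maps can be tapped.
        self.stage1 = vgg[:17]    # through conv3_3 + pool, 256 channels
        self.stage2 = vgg[17:24]  # through conv4_3 + pool, 512 channels
        self.stage3 = vgg[24:31]  # through conv5_3 + pool, 512 channels
        self.proj1 = nn.Conv2d(256, 512, kernel_size=1)  # match channel widths
        self.pool = nn.AdaptiveAvgPool2d(14)             # align spatial sizes
        self.fc = nn.Linear(3 * 512 * 512, num_classes)

    @staticmethod
    def bilinear(x, y):
        # Outer-product pooling of two feature maps, as in B-CNN.
        b, c, h, w = x.shape
        phi = torch.bmm(x.reshape(b, c, h * w),
                        y.reshape(b, -1, h * w).transpose(1, 2)) / (h * w)
        phi = phi.reshape(b, -1)
        phi = torch.sign(phi) * torch.sqrt(torch.abs(phi) + 1e-10)  # signed sqrt
        return nn.functional.normalize(phi)                         # L2 normalize

    def forward(self, img):
        f1 = self.stage1(img)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f1 = self.pool(self.proj1(f1))
        f2 = self.pool(f2)
        f3 = self.pool(f3)
        # Fuse different stages pairwise, then concatenate the bilinear vectors.
        z = torch.cat([self.bilinear(f1, f3),
                       self.bilinear(f2, f3),
                       self.bilinear(f3, f3)], dim=1)
        return self.fc(z)  # logits; Softmax/cross-entropy applied in the loss
```

In practice the backbone would be initialized with ImageNet-pretrained weights and the whole network trained end-to-end on the background-suppressed images.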
Result
After adding the YOLOv3 algorithm with background suppression, the classification accuracies on the three datasets, namely, CUB-200-2011 (Caltech-UCSD Birds-200-2011), Stanford Cars, and FGVC (fine-grained visual categorization) Aircraft, are 0.7%, 0.5%, and 3.1% higher than those of B-CNN, respectively, indicating that removing background interference with YOLOv3 effectively improves classification. After using feature fusion to optimize the B-CNN network structure, the results are 1.4% and 1.2% higher than those of B-CNN, which indicates that fusing the features of different convolutional layers and strengthening their spatial relationships can effectively improve classification accuracy. After combining YOLOv3-based background suppression with the feature-fusion B-CNN, the accuracies reach 86.3%, 92.8%, and 89.0% on the three datasets, respectively. Compared with the B-CNN algorithm, the proposed algorithm improves accuracy by 2.2%, 1.5%, and 4.9% on the three datasets, respectively, indicating its effectiveness. The improved algorithm also shows certain advantages over mainstream fine-grained classification algorithms.
Conclusion
The fine-grained classification algorithm based on YOLOv3 and bilinear feature fusion proposed in this study not only uses YOLOv3 to effectively filter out a large amount of irrelevant background and obtain the discriminative region of the image, but also improves the bilinear fine-grained classification network by means of feature fusion, so richer fine-grained features are extracted and the fine-grained classification results become more accurate. The improved feature-fusion B-CNN learns richer features, which improves fine-grained classification accuracy to a certain extent; on the three fine-grained datasets, the proposed method outperforms the classic B-CNN and some mainstream algorithms. Meanwhile, new fine-grained classification algorithms continue to appear; they employ a variety of deep learning models for fine-grained classification but do not use background suppression and feature fusion to extract richer fine-grained features. In future work, we will apply the fusion strategy to new networks and explore different types of fusion to further improve fine-grained classification accuracy.
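Putting the pieces together, an end-to-end inference pass might be organized as below. `detect_with_yolov3` is a hypothetical stand-in for the retrained detector (not a real API), `suppress_background` and `FusedBilinearCNN` are the sketches given earlier, and preprocessing details such as resizing and mean subtraction are omitted.

```python
import torch

def classify_fine_grained(image, detect_with_yolov3, model):
    """Sketch: detect -> suppress background -> bilinear classification."""
    detections = detect_with_yolov3(image)        # assumed: list of (box, score)
    box, _ = max(detections, key=lambda d: d[1])  # keep the highest-scoring box
    masked = suppress_background(image, box)      # zero out irrelevant background
    x = torch.from_numpy(masked).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        logits = model(x)                         # FusedBilinearCNN logits
    return int(logits.softmax(dim=-1).argmax(dim=-1))
```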
Keywords: fine-grained image classification; object detection; background suppression; feature fusion; bilinear convolutional neural network (B-CNN)
Branson S, Van Horn G, Belongie S and Perona P. 2014. Bird species categorization using pose normalized deep convolutional nets[EB/OL]. [2020-02-11]. http://arxiv.org/pdf/1406.2952.pdf
Fu J L, Zheng H L and Mei T. 2017. Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4476-4484 [DOI: 10.1109/CVPR.2017.476]
Ge W F, Lin X R and Yu Y Z. 2019. Weakly supervised complementary parts models for fine-grained image classification from the bottom up//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3029-3038 [DOI: 10.1109/CVPR.2019.00315]
Huang K Q, Ren W Q and Tan T N. 2014. A review on image object classification and detection. Chinese Journal of Computers, 37(6): 1225-1240 [DOI: 10.3724/SP.J.1016.2014.01225]
Krause J, Stark M, Deng J and Li F F. 2013. 3D object representations for fine-grained categorization//Proceedings of 2013 IEEE International Conference on Computer Vision Workshops. Sydney, Australia: IEEE: 554-561 [DOI: 10.1109/ICCVW.2013.77]
Lin T Y, RoyChowdhury A and Maji S. 2018. Bilinear convolutional neural networks for fine-grained visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6): 1309-1322[DOI:10.1109/TPAMI.2017.2723400]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot MultiBox detector//Proceedings of European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 21-37 [DOI: 10.1007/978-3-319-46448-0_2]
Luo H B, Xu L Y, Hui B and Chang Z. 2017. Status and prospect of target tracking based on deep learning. Infrared and Laser Engineering, 46(5): #0502002 [DOI: 10.3788/IRLA201746.0502002]
Luo J H and Wu J X. 2017. A survey on fine-grained image categorization using deep convolutional features. Acta Automatica Sinica, 43(8): 1306-1318 [DOI: 10.16383/j.aas.2017.c160425]
Maji S, Rahtu E, Kannala J, Blaschko M and Vedaldi A. 2013. Fine-grained visual classification of aircraft[EB/OL]. [2020-02-11]. https://arxiv.org/pdf/1306.5151.pdf
Qi L, Lu X Q and Li X L. 2019. Exploiting spatial relation for fine-grained image classification. Pattern Recognition, 91: 47-55[DOI:10.1016/j.patcog.2019.02.007]
Qin X and Song G F. 2019. Pig face recognition algorithm based on bilinear convolution neural network. Journal of Hangzhou Dianzi University (Natural Sciences), 39(2): 12-17 [DOI: 10.13954/j.cnki.hdu.2019.02.003]
Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 779-788 [DOI: 10.1109/CVPR.2016.91]
Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement[EB/OL]. [2020-02-11]. https://arxiv.org/pdf/1804.02767.pdf
Ren S Q, He K M, Girshick R and Sun J. 2015. Faster R-CNN: towards real-time object detection with region proposal networks//Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge, USA: MIT Press: 91-99
Shi W W, Gong Y H, Tao X Y, Cheng D and Zheng N N. 2019. Fine-grained image classification using modified DCNNs trained by cascaded softmax and generalized large-margin losses. IEEE Transactions on Neural Networks and Learning Systems, 30(3): 683-694[DOI:10.1109/TNNLS.2018.2852721]
Wah C, Branson S, Welinder P, Perona P and Belongie S. 2011. The Caltech-UCSD birds-200-2011 dataset[EB/OL]. [2020-02-11]. http://www.vision.caltech.edu/visipedia/papers/CUB_200_2011.pdf
Wang Y X and Zhang X B. 2019. Fine-grained image classification with network architecture of focus and recognition. Journal of Image and Graphics, 24(4): 493-502 [DOI: 10.11834/jig.180423]
Xiao T J, Xu Y C, Yang K Y, Zhang J X, Peng Y X and Zhang Z. 2015. The application of two-level attention models in deep convolutional neural network for fine-grained image classification//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 842-850 [DOI: 10.1109/CVPR.2015.7298685]
Zhang N, Donahue J, Girshick R and Darrell T. 2014. Part-based R-CNNs for fine-grained category detection//Proceedings of European Conference on Computer Vision. Zurich, Switzerland: Springer: 834-849 [DOI: 10.1007/978-3-319-10590-1_54]
Zhang Y, Wei X S, Wu J X, Cai J F, Lu J B, Nguyen V A and Do M N. 2016. Weakly supervised fine-grained categorization with part-based image representation. IEEE Transactions on Image Processing, 25(4): 1713-1725[DOI:10.1109/TIP.2016.2531289]
Zhao B, Wu X, Feng J S, Peng Q and Yan S C. 2017. Diversified visual attention networks for fine-grained object classification. IEEE Transactions on Multimedia, 19(6): 1245-1256[DOI:10.1109/TMM.2017.2648498]
Zhao H R, Zhang Y and Liu G Z. 2019. Fine-grained image classification algorithm based on RPN and B-CNN. Computer Applications and Software, 36(3): 210-213, 264 [DOI: 10.3969/j.issn.1000-386x.2019.03.038]
Zheng H L, Fu J L, Zha Z J and Luo J B. 2019. Looking for the devil in the details: learning trilinear attention sampling network for fine-grained image recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5007-5016 [DOI: 10.1109/CVPR.2019.00515]