Fine-grained image classification with network architecture of focus and recognition
2019, Vol. 24, No. 4: 493-502
Received: 2018-07-27
Revised: 2018-09-18
Published in print: 2019-04-16
DOI: 10.11834/jig.180423
Objective
Fine-grained image classification refers to the finer subdivision of a broad category, such as distinguishing bird species, car makes and models, or dog breeds. To address the problems of excessive irrelevant information and background interference in fine-grained images, this paper uses deep convolutional networks to build a joint focus-and-recognition learning framework that removes the background, highlights the target to be recognized, and automatically locates discriminative regions, thereby improving fine-grained classification accuracy.
Method
First, a network based on Yolov2 (you only look once v2) rapidly detects the target object, eliminating the influence of background interference and irrelevant information on the classification result and thus focusing on discriminative regions. The detected object (i.e., the Yolov2 output) is then fed into a bilinear convolutional neural network for training and classification. The framework can be trained end to end and relies only on category labels, without any additional manual annotation.
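The "focus" step described above amounts to cropping the detector's output region so that only the target object reaches the classifier. A minimal sketch, assuming a plain (x1, y1, x2, y2) pixel-box format and an optional context margin (Yolov2 itself predicts center/size offsets, so this box format is an assumption for illustration):

```python
import numpy as np

def crop_detection(image, box, pad=0):
    """Crop the detector's bounding box out of the image, clamped to the
    image borders, optionally keeping `pad` pixels of context."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    x1 = max(0, x1 - pad); y1 = max(0, y1 - pad)
    x2 = min(w, x2 + pad); y2 = min(h, y2 + pad)
    return image[y1:y2, x1:x2]

# toy example: crop a 100x120 image with a hypothetical detection box
img = np.zeros((100, 120, 3), dtype=np.uint8)
patch = crop_detection(img, (10, 20, 60, 80), pad=5)
print(patch.shape)  # (70, 60, 3)
```

The cropped patch, rather than the full image, is what the bilinear network is trained on, which is how the framework suppresses background interference.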
Result
Experiments on the fine-grained image datasets CUB-200-2011, Cars196, and Aircrafts100 show that the proposed model achieves classification accuracies of 84.5%, 92%, and 88.4%, respectively. These are 0.4%, 0.7%, and 3.9% higher than the best accuracies obtained by comparable classification algorithms, and 0.5%, 1.4%, and 4.5% higher than the method using two identical D(dense)-Net networks.
Conclusion
Extracting discriminative regions with the focus-and-recognition deep learning framework benefits fine-grained image classification: it filters out most regions that contribute nothing to classification, allows the network to learn more features useful for fine-grained classification, reduces the influence of background interference on the results, and improves the recognition rate of the model.
Objective
2
In recent years
with the rapid progress of science and technology as well as the increasing demand of human life
people's research has shifted from the coarse-grained image classification to the fine-grained image classification. Fine-grained image classification is a hot research topic in the field of computer vision research in recent years. Its purpose is to provide a detailed subdivision of a large category
such as the distinction of bird species
car brand style
and dog breed. Nowadays
the fine-grained image classification has great application requirements. For example
in the field of ecological protection
the identification of different species of organisms is the key to ecological research. And in the field of botany
because of the variety and quantity of flowers as well as the similarity between different flowers which make the fine-grained image classification tasks more difficult. With the help of computer vision technology
we can realize low-cost fine-grained image classification tasks. However
the fine-grained classification often has smaller differences between classes and larger differences within classes. Thus
in comparison with the ordinary image classification
the task of the fine-grained image classification is more challenging. Moreover
the fine-grained image classification has much irrelevant information and background interference. Those problems would influence the network model to learn the actual discriminative characteristics and result in inferior classification performance in fine-grained image classification. Therefore
finding discriminative regions in the image is important for the improvement of fine-grained image classification performance. To solve this problem
a joint deep learning framework of focus and recognition is constructed for fine-grained image classification. This framework can remove the background in the image
highlight the target to be identified
and then automatically locate the discriminative area in the image. Thus
the deep convolutional neural networks can extract more useful and discriminative features
and the classification rate of fine-grained images can be improved naturally.
Method
First, the Yolov2 (you only look once v2) object detection algorithm rapidly detects the object in the image and eliminates the influence of background interference and unrelated information; the detected target objects then form the dataset used to train the bilinear convolutional neural network, and the resulting model performs fine-grained image classification. Yolov2 is a further improvement of the Yolov1 detector and is more precise for small-object localization; it automatically finds the target in the picture and filters out most of the regions that do not contribute to classification. The bilinear convolutional neural network is a network designed for fine-grained classification: two convolutional neural networks extract features from the same picture simultaneously, bilinear pooling combines them into a bilinear feature vector, and that vector is fed into a softmax layer to produce the final classification result. The advantage of this network is that it does not depend on additional manual annotation: it is an end-to-end trainable system that relies only on class label information, which greatly reduces the difficulty and complexity of fine-grained image classification.
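The bilinear pooling step can be sketched as follows: the outer products of the two streams' local features are sum-pooled over all spatial locations, then passed through signed square root and L2 normalization before the softmax classifier. This is an illustrative NumPy version with arbitrary toy feature-map sizes, not the authors' implementation:

```python
import numpy as np

def bilinear_pool(feat_a, feat_b):
    """Combine two feature maps (c1, h, w) and (c2, h, w) into one
    bilinear vector of length c1 * c2, as in bilinear CNNs."""
    c1, h, w = feat_a.shape
    c2 = feat_b.shape[0]
    a = feat_a.reshape(c1, h * w)              # flatten spatial positions
    b = feat_b.reshape(c2, h * w)
    phi = (a @ b.T).flatten()                  # sum of outer products over locations
    phi = np.sign(phi) * np.sqrt(np.abs(phi))  # signed square root
    norm = np.linalg.norm(phi)
    return phi / norm if norm > 0 else phi     # L2 normalization

# toy example: two 4-channel 3x3 feature maps from the two streams
rng = np.random.default_rng(0)
v = bilinear_pool(rng.standard_normal((4, 3, 3)),
                  rng.standard_normal((4, 3, 3)))
print(v.shape)  # (16,)
```

With real VGG-style streams (e.g. 512 channels each), the same operation yields a 262 144-dimensional feature, which is why the pooled vector rather than the raw maps is fed to the classifier.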
Result
We perform verification experiments on the open standard fine-grained image benchmarks CUB-200-2011, Cars196, and Aircrafts100. A pre-trained Yolov2 detection model is applied to each of the three datasets to obtain the discriminative regions in the images, and the bilinear convolutional neural network is then trained on the processed data. The proposed model achieves classification accuracies of 84.5%, 92%, and 88.4% on the three datasets. Compared with the highest accuracy obtained by the same classification algorithm without discriminative-region extraction, accuracy on the three databases improves by 0.4%, 0.7%, and 3.9%; the recognition rate is also 0.5%, 1.4%, and 4.5% higher than the same algorithm extracting features from two identical D(dense)-Net networks. We also compare with other fine-grained classification algorithms such as Spatial Transformer Networks, which likewise performs well on fine-grained tasks, trains end to end, and depends only on label information; even so, our classification accuracy is still 0.4 percentage points higher on the bird dataset.
Conclusion
In this paper, an innovative method based on a focus-and-recognition network architecture is proposed to improve the recognition rate of fine-grained image classification. The experimental results show that using this architecture to detect discriminative regions benefits fine-grained classification: it filters out most of the image area that does not contribute to classification, reducing the influence of background interference on the results, so the bilinear convolutional neural network can learn more features that are useful for fine-grained classification and the recognition rate of the model improves effectively. Comparisons with other fine-grained classification algorithms on several datasets further demonstrate the effectiveness of the proposed method.
Lowe D G. Object recognition from local scale-invariant features[C]//Proceedings of the 7th IEEE International Conference on Computer Vision. Kerkyra, Greece: IEEE, 1999, 2: 1150-1157. [DOI: 10.1109/ICCV.1999.790410]
Dalal N, Triggs B. Histograms of oriented gradients for human detection[C]//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego, CA, USA: IEEE, 2005, 1: 886-893. [DOI: 10.1109/CVPR.2005.177]
Jégou H, Douze M, Schmid C, et al. Aggregating local descriptors into a compact image representation[C]//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA, USA: IEEE, 2010: 3304-3311. [DOI: 10.1109/CVPR.2010.5540039]
Perronnin F, Dance C. Fisher kernels on visual vocabularies for image categorization[C]//Proceedings of 2007 IEEE Conference on Computer Vision and Pattern Recognition. Minneapolis, MN, USA: IEEE, 2007: 1-8. [DOI: 10.1109/CVPR.2007.383266]
Sánchez J, Perronnin F, Mensink T, et al. Image classification with the Fisher vector: theory and practice[J]. International Journal of Computer Vision, 2013, 105(3): 222-245. [DOI: 10.1007/s11263-013-0636-x]
Weng Y C, Tian Y, Lu D M, et al. Fine-grained bird classification based on deep region networks[J]. Journal of Image and Graphics, 2017, 22(11): 1521-1531.
Zhang N, Donahue J, Girshick R, et al. Part-based R-CNNs for fine-grained category detection[C]//Proceedings of the 13th European Conference on Computer Vision-ECCV 2014. Zurich, Switzerland: Springer, 2014: 834-849. [DOI: 10.1007/978-3-319-10590-1_54]
Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 580-587. [DOI: 10.1109/CVPR.2014.81]
Uijlings J R R, Van De Sande K E A, Gevers T, et al. Selective search for object recognition[J]. International Journal of Computer Vision, 2013, 104(2): 154-171. [DOI: 10.1007/s11263-013-0620-5]
Branson S, Van Horn G, Belongie S, et al. Bird species categorization using pose normalized deep convolutional nets[EB/OL]. [2018-07-10]. https://arxiv.org/pdf/1406.2952.pdf.
Branson S, Beijbom O, Belongie S. Efficient large-scale structured learning[C]//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 1806-1813. [DOI: 10.1109/CVPR.2013.236]
Xiao T J, Xu Y C, Yang K Y, et al. The application of two-level attention models in deep convolutional neural network for fine-grained image classification[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 842-850. [DOI: 10.1109/CVPR.2015.7298685]
Zhang Y, Wei X S, Wu J X, et al. Weakly supervised fine-grained categorization with part-based image representation[J]. IEEE Transactions on Image Processing, 2016, 25(4): 1713-1725. [DOI: 10.1109/TIP.2016.2531289]
Simon M, Rodner E. Neural activation constellations: unsupervised part model discovery with convolutional networks[C]//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1143-1151. [DOI: 10.1109/ICCV.2015.136]
Lin T Y, RoyChowdhury A, Maji S. Bilinear convolutional neural networks for fine-grained visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6): 1309-1322. [DOI: 10.1109/TPAMI.2017.2723400]
Redmon J, Farhadi A. YOLO9000: better, faster, stronger[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017. [DOI: 10.1109/CVPR.2017.690]
Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 779-788. [DOI: 10.1109/CVPR.2016.91]
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2017-07-10]. https://arxiv.org/pdf/1409.1556.pdf.
Deng J, Dong W, Socher R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, FL, USA: IEEE, 2009: 248-255. [DOI: 10.1109/CVPR.2009.5206848]
Chatfield K, Simonyan K, Vedaldi A, et al. Return of the devil in the details: delving deep into convolutional nets[EB/OL]. [2017-07-10]. https://arxiv.org/pdf/1405.3531.pdf.
Wah C, Branson S, Welinder P, et al. The Caltech-UCSD birds-200-2011 dataset[R]. California: California Institute of Technology, 2011.
Krause J, Stark M, Deng J, et al. 3D object representations for fine-grained categorization[C]//Proceedings of 2013 IEEE International Conference on Computer Vision Workshops. Sydney, NSW, Australia: IEEE, 2013: 554-561. [DOI: 10.1109/ICCVW.2013.77]
Maji S, Rahtu E, Kannala J, et al. Fine-grained visual classification of aircraft[EB/OL]. [2017-07-10]. https://arxiv.org/pdf/1306.5151.pdf.
Gosselin P H, Murray N, Jégou H, et al. Revisiting the Fisher vector for fine-grained classification[J]. Pattern Recognition Letters, 2014, 49: 92-98. [DOI: 10.1016/j.patrec.2014.06.011]
Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks[C]//Proceedings of the 29th International Conference on Neural Information Processing Systems. Montreal, Canada: MIT Press, 2015: 2017-2025.
Feng Y S, Wang Z L. Fine-grained image categorization with segmentation based on top-down attention map[J]. Journal of Image and Graphics, 2016, 21(9): 1147-1154. [DOI: 10.11834/jig.20160904]
Boykov Y Y, Jolly M P. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images[C]//Proceedings of the 8th IEEE International Conference on Computer Vision. Vancouver, BC, Canada: IEEE, 2001, 1: 105-112. [DOI: 10.1109/ICCV.2001.937505]