Fine-grained image classification based on deep region networks

Weng Yuchen, Tian Ye, Lu Dunmin, Li Qiongyan (School of Technology, Beijing Forestry University, Beijing 100083, China)

Abstract
Objective In fine-grained visual recognition, the difficulty lies in distinguishing subclasses with only subtle differences within a superordinate category at the same level. Achieving accurate classification usually requires expert knowledge, so fine-grained image classification places higher demands on computer vision research. To enable ordinary people without expert knowledge or skills to distinguish fine-grained species categories, we propose a convolutional neural network architecture based on deep region networks.
Method The architecture is built on deep region networks. First, deep features are extracted: the VGG 16-layer network and the 101-layer residual network serve as feature extraction networks that extract deep shared features and produce feature maps. Second, a region proposal network convolves the feature maps to generate object regions, while a region of interest (RoI) pooling layer performs max pooling on the feature maps so that computation is shared across the networks. The pooled object regions are then fed into a region convolutional network for fine-grained class prediction and object boundary regression, and the network finally outputs the predicted class and the regressed bounding-box corner coordinates. Partial-occlusion experiments are also conducted to examine how occluding individual parts affects classification accuracy and to analyze the contribution of local information to bird classification.
Result The model is evaluated on the CUB_200_2011 bird dataset, which contains 200 fine-grained bird categories and 11,788 bird images. After training and testing, the VGG16+R-CNN (RPN) and Res101+R-CNN (RPN) architectures reach validation accuracies of 90.88% and 91.72%, respectively, and the Top-5 validation accuracy of both architectures exceeds 98%. Occlusion experiments on local bird features, simulating occlusion in real-world environments, are conducted to test the classification performance.
Conclusion The convolutional neural network model based on deep region networks improves the classification performance on fine-grained bird images, offering high classification accuracy, good generalization, and strong robustness. The experiments show that head information is very important for fine-grained bird classification and recognition.
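The Method above describes a Faster R-CNN-style pipeline: a shared feature extractor (VGG-16 or ResNet-101), region proposals, RoI max pooling, and a head that predicts the fine-grained class and refines the box. As a rough illustration only, the sketch below shows how such a head could be wired up; the PyTorch/torchvision framework, the FineGrainedHead name, and all layer sizes are assumptions for illustration, not the authors' implementation.

```python
import torch.nn as nn
import torchvision
from torchvision.ops import roi_pool

# Shared feature extractor: VGG-16 conv layers with the final max pool dropped,
# so the feature map has stride 16 (a ResNet-101 trunk could be swapped in).
backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:-1]

class FineGrainedHead(nn.Module):
    """RoI-pools each proposal from the shared map, then predicts one of the
    200 fine-grained bird classes and refines the box coordinates (sketch)."""
    def __init__(self, num_classes=200, roi_size=7, feat_dim=512):
        super().__init__()
        self.roi_size = roi_size
        self.fc = nn.Sequential(
            nn.Linear(feat_dim * roi_size * roi_size, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.cls_score = nn.Linear(4096, num_classes + 1)        # +1 background class
        self.bbox_pred = nn.Linear(4096, 4 * (num_classes + 1))  # per-class box offsets

    def forward(self, feature_map, proposals):
        # proposals: list of (N_i, 4) boxes in image coordinates, one tensor per image;
        # spatial_scale=1/16 maps image coordinates onto the stride-16 feature map.
        pooled = roi_pool(feature_map, proposals,
                          output_size=self.roi_size, spatial_scale=1.0 / 16)
        x = self.fc(pooled.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x)
```

In the full pipeline, `proposals` would come from the region proposal network described in the English abstract below.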
Keywords
Fine-grained bird classification based on deep region networks

Weng Yuchen, Tian Ye, Lu Dunmin, Li Qiongyan (School of Technology, Beijing Forestry University, Beijing 100083, China)

Abstract
Objective Fine-grained visual recognition, which requires domain-specific and expert knowledge, is a difficult issue in computer vision. This study addresses the problem of fine-grained visual categorization in terms of accuracy and speed. The task is challenging because all objects in a fine-grained dataset belong to the same level of category, with only fine, subtle differences between classes. These subtle differences cannot be easily distinguished by ordinary people without expert knowledge, nor can the corresponding features be easily extracted and classified into the correct category by a computer. This study uses a convolutional neural network to extract features and predict the fine-grained class, thereby improving recognition accuracy. The task includes object detection, object recognition, and object classification.
Method We propose a convolutional neural network architecture based on deep region networks. We adopt this architecture because a convolutional neural network can significantly improve feature extraction and classification accuracy, and a deep architecture can extract both fine and coarse features that are useful for classification. First, we extract deep features with a feature extraction network to create a feature map. Every convolution layer extracts features through its weight matrix, which is learned by forward and backward propagation to minimize the loss function. The feature extraction network has two alternative architectures, namely the VGG and residual networks; we use the VGG 16-layer and Res 101-layer deep networks and share the extracted deep features with the subsequent networks. Second, we use a region proposal network to generate region proposals. The shared feature map is convolved with a 3×3 kernel using anchors of three scales and three aspect ratios, producing 512-dimensional vectors that are used for bounding-box classification and regression. A region is kept as a region of interest (RoI) when it is classified as containing an object. RoI pooling then performs max pooling on each RoI, which shares the network computation and avoids repeated computation. Third, the proposed regions are fed into the region convolutional neural network to predict the class and score of every fine-grained category and to regress the object bounding box. Finally, the deep region convolutional neural network outputs the predicted class together with its score and the regressed bounding box, given by the four values x1, y1, x2, and y2, i.e., the lower-left corner (x1, y1) and the upper-right corner (x2, y2) of a rectangular area.
Result We conduct fine-grained bird recognition on the CUB_200_2011 dataset, which is designed for fine-grained image recognition and provides public annotations, including class labels, object bounding boxes, and part locations. The dataset contains 200 fine-grained bird categories and a total of 11,788 images; only the class labels and object bounding boxes are used in our evaluation. The proposed networks are trained and tested and achieve superior performance. On the whole bird, the VGG16+R-CNN (RPN) network achieves a Top-1 accuracy of 90.88% and the Res101+R-CNN (RPN) network achieves a Top-1 accuracy of 91.72%; their Top-5 accuracies are 98.15% and 98.24%, respectively, both above 98%. We also conduct experiments on individual parts of the bird, such as the head. On the head alone, the VGG16+R-CNN (RPN) network achieves a Top-1 accuracy of 90.70% and the Res101+R-CNN (RPN) network achieves a Top-1 accuracy of 91.06%, with Top-5 accuracies of 98.04% and 98.07%, respectively. We further analyze the effect of individual parts of the bird, such as the beak, head, belly, leg, and foot, on the performance of object detection, object recognition, and object classification.
Conclusion This study proposes a network architecture based on deep region convolutional neural networks that achieves superior performance in object detection, object recognition, and object classification compared with other models, even though fine-grained visual recognition is difficult for ordinary people and for other network architectures. Our method exhibits high accuracy and good performance and does not need extra training data, which makes it robust and applicable to various datasets. The proposed model is applicable to object detection and fine-grained image categorization. The experimental results also show that local information, such as the head, is useful for fine-grained image recognition.
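The Method paragraph above specifies the region proposal step: a 3×3 convolution slides over the shared feature map, each position yields a 512-dimensional vector, and anchors of three scales and three aspect ratios are scored for objectness and regressed. The following is a minimal sketch of such a proposal head, assuming PyTorch; the class name and channel sizes are illustrative and not taken from the paper.

```python
import torch.nn as nn

class RegionProposalHead(nn.Module):
    """Sliding 3x3 convolution over the shared feature map; each spatial
    position gives a 512-d vector that is scored for objectness and regressed
    for k = 3 scales x 3 aspect ratios = 9 anchors (Faster R-CNN style)."""
    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls_logits = nn.Conv2d(mid_channels, num_anchors * 2, kernel_size=1)  # object / not object
        self.bbox_deltas = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)  # (dx, dy, dw, dh)

    def forward(self, feature_map):
        x = self.relu(self.conv(feature_map))
        return self.cls_logits(x), self.bbox_deltas(x)
```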
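The Result paragraph reports Top-1 and Top-5 accuracies. For reference, a standard way to compute such metrics from per-class scores is sketched below; this is a generic helper under the assumption of PyTorch tensors, not code from the paper.

```python
import torch

def topk_accuracy(logits, labels, ks=(1, 5)):
    """Return the fraction of samples whose true label is among the k
    highest-scoring predictions, for each k in `ks`."""
    maxk = max(ks)
    _, pred = logits.topk(maxk, dim=1)        # (N, maxk) predicted class indices
    correct = pred.eq(labels.unsqueeze(1))    # (N, maxk) hit matrix
    return {k: correct[:, :k].any(dim=1).float().mean().item() for k in ks}

# Example with random scores for 200 bird classes:
scores = torch.randn(8, 200)
labels = torch.randint(0, 200, (8,))
print(topk_accuracy(scores, labels))          # e.g. {1: 0.0, 5: 0.125}
```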
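Both abstracts mention partial-occlusion experiments, in which individual bird parts (e.g., the head) are masked and the image is re-classified to gauge how much that part contributes. A minimal sketch of such a masking step is given below, assuming a part region supplied as an (x1, y1, x2, y2) box in pixel coordinates (e.g., derived from the dataset's part location annotations); the helper name is hypothetical.

```python
import torch

def occlude_part(image: torch.Tensor, part_box, fill: float = 0.0) -> torch.Tensor:
    """Mask one part of a CHW image tensor with a constant value so that the
    occluded image can be re-classified; part_box is (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = (int(round(v)) for v in part_box)
    occluded = image.clone()
    occluded[:, y1:y2, x1:x2] = fill
    return occluded
```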
Keywords
