Published: 2019-03-16. DOI: 10.11834/jig.180406. 2019 | Volume 24 | Number 3 | ChinaMM 2018

Received: 2018-07-04; revised: 2018-08-18. Supported by: National Natural Science Foundation of China (61472444). First author: Xiao Feng, born in 1993, male, master's student; research interests: deep learning and computer vision. E-mail: 1193221332@qq.com. Ren Tongwei, male, professor; research interests: image analysis and visual saliency. E-mail: rentw@nju.edu.cn. Wang Dong, male, lecturer; research interests: intelligent signal processing. E-mail: dyhkxydfbb@163.com. CLC number: TP301.6. Document code: A. Article number: 1006-8961(2019)03-0474-09


Full convolutional network for semantic segmentation and object detection
Xiao Feng1, Rui Ting2, Ren Tongwei3, Wang Dong2
1. School of Graduate, PLA Army Engineering University, Nanjing 210018, China;
2. School of Field Works, PLA Army Engineering University, Nanjing 210018, China;
3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210018, China
Supported by: National Natural Science Foundation of China (61472444)

# Abstract

Objective Mainstream object detection algorithms must delimit default boxes in advance and then obtain object boxes by filtering them. Sufficiently dense, multi-scale default boxes must be preset to guarantee an adequate recall rate, which leads to repeated detection of the same image regions and considerable wasted computation. This study proposes a multi-task deep learning model, FCDN, which requires no default boxes and improves detection speed while maintaining accuracy.

Method Current mainstream detectors need predefined default boxes because the number of objects in an image is unknown in advance. Deep learning detection networks evolved from image classification models, so the number of objects cannot be predicted and the output size of the detection model cannot be fixed in advance; dense, multi-scale default boxes must therefore be classified or regressed to guarantee recall. Object detection requires category information to recognize different objects and boundary information to localize each of them. A semantic segmentation map provides rich category information that can be used to recognize object categories. Recognition and localization can thus both be completed by adopting the idea of semantic segmentation: a module is designed to extract the boundary key points of objects, and the semantic segmentation map is combined with these key points. Detection methods based on image classification have a rectangular receptive field that contains information from other objects or the background in addition to the object itself; methods based on the semantic segmentation map and boundary key points differ in that their receptive field is at the pixel level. The pixels of a detected object can be removed from the semantic segmentation map and the boundary key point distribution map without affecting the detection of other objects, which helps avoid small objects being missed. Based on this analysis, we propose a new multi-task learning model that adds a boundary key point prediction layer on top of a semantic segmentation model. It performs semantic segmentation and boundary key point prediction simultaneously and combines the segmentation map with the key point distribution map to complete detection: boundary lines are derived from the boundary key points, and object boxes are derived from the boundary lines.

Result We propose an object detection network that needs no default boxes. The algorithm is no longer based on image classification; instead, it uses the idea of semantic segmentation to detect all object boundary key points at the pixel level and obtains object boxes by combining them with the category information of the segmentation result. The method was trained on semantic segmentation data and then evaluated on the PASCAL VOC 2007 test set to verify its feasibility. Performance comparisons with current mainstream detectors show that, trained on the same samples, the new model performs semantic segmentation and object detection simultaneously, and the detection precision of FCDN surpasses that of classic detection models. In running speed, FCDN is 8 ms faster than FCN, close to fast detectors such as YOLO.

Conclusion This study proposes an object detection approach that is no longer based on image classification but instead extracts information from the image through semantic segmentation. Experimental results show that completing object detection from the semantic segmentation map and the boundary key points is feasible. The method avoids repeated detection, and computation can be further reduced by decreasing the number of pixels in the semantic segmentation prediction, improving detection efficiency; the simplified segmentation map does not affect detection accuracy.

# Key words

deep learning; object detection; semantic segmentation; object boundary key points; multi-task learning; transfer learning; default boxes

# 2 FCDN model implementation and training

The FCDN model comprises five modules: a feature extraction module, a semantic segmentation module, a boundary key point prediction module, an object detection module that combines the semantic segmentation map with the boundary key points, and a small-object re-detection module.
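A minimal sketch of how these five modules might compose at inference time; the class and module names here are placeholder assumptions for illustration, not the authors' implementation:

```python
# Hypothetical composition of the five FCDN modules; each module is any
# callable, so internals (backbone choice, head design) are left open.
class FCDN:
    def __init__(self, backbone, seg_head, kps_head, detector, redetector):
        self.backbone = backbone      # shared feature extraction
        self.seg_head = seg_head      # semantic segmentation module
        self.kps_head = kps_head      # boundary key point prediction module
        self.detector = detector      # combine seg map + key points -> boxes
        self.redetector = redetector  # small-object re-detection

    def forward(self, image):
        feats = self.backbone(image)
        seg_map = self.seg_head(feats)     # per-pixel category map
        kps_map = self.kps_head(feats)     # per-pixel boundary key point map
        boxes = self.detector(seg_map, kps_map)
        # Re-detect small objects after removing already-detected pixels.
        boxes = boxes + self.redetector(seg_map, kps_map, boxes)
        return seg_map, kps_map, boxes
```

The two prediction heads share one feature extractor, which is what makes the model a single multi-task network rather than two separate ones.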

# 2.3 Boundary key point prediction module

$\min \left\{ \sum\limits_{(x, y) \in \boldsymbol{T}} L\left(f_\theta(x), y\right) + \lambda R(\theta) : \theta \in \boldsymbol{\Theta} \right\}$ (1)

$L\left(\hat{y}, y\right) = -\frac{1}{n} \sum l_{\hat{y}y}$ (2)
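Equations (1) and (2) describe a standard regularized training objective: the mean cross-entropy loss over training pairs plus a weighted regularization term on the parameters. A minimal sketch, assuming $L$ is cross-entropy and $R(\theta)$ is L2 regularization (the specific form of $R$ is an assumption):

```python
import math

def cross_entropy(probs, labels):
    # Eq. (2): mean negative log-likelihood of the true class; probs[k] is
    # the predicted class distribution for sample k, labels[k] its class.
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / n

def objective(probs, labels, theta, lam=1e-4):
    # Eq. (1): data loss plus lambda-weighted regularization,
    # here R(theta) = sum of squared parameters (L2, an assumption).
    return cross_entropy(probs, labels) + lam * sum(t * t for t in theta)
```

In the key point prediction head, each "sample" is a pixel, so the sum in equation (2) runs over all pixels of the output map.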

# 2.4 Object detection module combining the semantic segmentation map and boundary key points

1) Denoise the semantic segmentation map output by the model to suppress segmentation errors as far as possible;

2) From the boundary key point map output by the model, obtain the distributions of boundary key points along the horizontal and vertical directions of the image;

3) Determine the number and positions of the boundary lines from these distributions;

4) Determine the segmented regions tangent to each boundary line (including the four edges of the segmentation map) and the object category of each such region;

5) Based on the results of steps 3) and 4), combine the boundary lines in groups of four; each group of four lines forms one candidate box;

6) Determine the object category and confidence of each candidate box.
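Steps 2), 3), and 5) above can be sketched as follows; the hit-count threshold used to pick boundary lines is a simplified assumption for illustration, not the paper's exact procedure:

```python
from itertools import combinations

def boundary_lines(kps_map, axis, thresh=1):
    # Steps 2)-3): project key points onto one axis; rows (axis=0) or
    # columns (axis=1) with at least `thresh` hits become boundary lines.
    size = len(kps_map) if axis == 0 else len(kps_map[0])
    counts = [0] * size
    for r, row in enumerate(kps_map):
        for c, v in enumerate(row):
            if v:
                counts[r if axis == 0 else c] += 1
    return [i for i, n in enumerate(counts) if n >= thresh]

def candidate_boxes(kps_map):
    ys = boundary_lines(kps_map, axis=0)  # horizontal boundary lines
    xs = boundary_lines(kps_map, axis=1)  # vertical boundary lines
    # Step 5): every group of four boundary lines (two vertical, two
    # horizontal) forms one candidate box (x0, y0, x1, y1).
    return [(x0, y0, x1, y1)
            for y0, y1 in combinations(ys, 2)
            for x0, x1 in combinations(xs, 2)]
```

Steps 4) and 6) would then use the denoised segmentation map to keep only candidate boxes whose enclosed region belongs to a single object category.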

$C_i = \frac{S_i}{\sum\limits_{j=1}^{N} S_j}$ (3)
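A minimal sketch of the confidence computation in equation (3), assuming $S_i$ denotes the segmented-pixel area associated with the $i$-th of the $N$ candidate boxes (the precise definition of $S_i$ is given elsewhere in the paper):

```python
def confidences(areas):
    # Eq. (3): each candidate's confidence is its area S_i normalized
    # by the total area over all N candidates, so confidences sum to 1.
    total = sum(areas)
    return [s / total for s in areas]
```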

# 3.2 FCDN multi-task deep learning experiments

$\min \left(\alpha L_{\mathrm{seg}} + \beta L_{\mathrm{kps}}\right)$ (4)
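Equation (4) is the joint objective minimized during multi-task training: a weighted sum of the segmentation loss and the key point prediction loss. A sketch, where the default weight values are illustrative assumptions:

```python
def multitask_loss(l_seg, l_kps, alpha=1.0, beta=1.0):
    # Eq. (4): alpha and beta balance the segmentation and boundary
    # key point tasks; both heads are trained from this single scalar.
    return alpha * l_seg + beta * l_kps
```

Because both task losses backpropagate through the shared feature extractor, the choice of alpha and beta controls how the shared features trade off between the two tasks.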

# 3.3 FCDN detection performance comparison experiments

Table 1 Comparisons of FCDN and other detection models on operating efficiency

| Method | Forward pass time/ms |
| :--- | :--- |
| Faster R-CNN | 143 |
| SSD | 53 |
| YOLO v2 | 36 |
| FCN | 62 |
| FCDN | 54 |

Table 2 Comparison of detection accuracy between FCDN and other detection models in VOC 2007 test data set

(unit: %)

| Detection model | aero | bike | bird | bottle | boat | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mAP |
| :--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Faster R-CNN | 73.5 | 73.6 | 66.9 | 42.1 | 65.5 | 73.1 | 74.7 | 73.4 | 37.2 | 74.9 | 53.7 | 72.8 | 72.6 | 67.5 | 76.7 | 38.8 | 67.6 | 63.9 | 65.3 | 62.6 | 64.8 |
| SSD | 62.4 | 64.7 | 58.4 | 33.8 | 53.2 | 66.2 | 63.5 | 66.3 | 32.8 | 67.1 | 45.2 | 64.9 | 65.2 | 53.9 | 69.7 | 30.3 | 68.9 | 73.9 | 56.5 | 57.3 | 57.7 |
| YOLO v2 | 74.8 | 66.9 | 60.2 | 31.3 | 56.9 | 66.5 | 57.8 | 60.9 | 30.3 | 57.6 | 50.5 | 65.6 | 57.5 | 60.2 | 62.1 | 34.4 | 54.8 | 62.6 | 64.3 | 58.6 | 56.7 |
| FCDN | 75.8 | 76.9 | 70.2 | 41.3 | 66.9 | 76.5 | 77.8 | 70.9 | 36.3 | 77.6 | 60.5 | 75.6 | 77.5 | 70.2 | 72.1 | 36.4 | 64.8 | 72.6 | 70.3 | 68.6 | 66.4 |

# References

• [1] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 580-587.[DOI: 10.1109/CVPR.2014.81]
• [2] He K M, Zhang X Y, Ren S Q, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904–1916. [DOI:10.1109/TPAMI.2015.2389824]
• [3] Girshick R. Fast R-CNN[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1440-1448.[DOI: 10.1109/ICCV.2015.169]
• [4] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. [DOI:10.1109/TPAMI.2016.2577031]
• [5] Dai J F, Li Y, He K M, et al. R-FCN: object detection via region-based fully convolutional networks[M]//Lee D D, von Luxburg U, Sugiyama M, et al. Advances in Neural Information Processing Systems: 29. Red Hook, NY: Curran Associates, Inc., 2016: 379-387.
• [6] Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 779-788.[DOI: 10.1109/CVPR.2016.91]
• [7] Redmon J, Farhadi A. YOLO9000: better, faster, stronger[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 6517-6525.[DOI: 10.1109/CVPR.2017.690]
• [8] Liu W, Anguelov D, Erhan D, et al. SSD: single shot MultiBox detector[C]//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016: 21-37.[DOI: 10.1007/978-3-319-46448-0_2]
• [9] He K M, Gkioxari G, Dollar P, et al. Mask R-CNN[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. [DOI:10.1109/TPAMI.2018.2844175]
• [10] Ren S Q, He K M, Girshick R, et al. Object detection networks on convolutional feature maps[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(7): 1476–1481. [DOI:10.1109/TPAMI.2016.2601099]
• [11] Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 2999-3007.[DOI: 10.1109/ICCV.2017.324]
• [12] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 936-944.[DOI: 10.1109/CVPR.2017.106]
• [13] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 3431-3440.[DOI: 10.1109/CVPR.2015.7298965]
• [14] Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481-2495. [DOI:10.1109/TPAMI.2016.2644615]
• [15] Zhao H S, Shi J P, Qi X J, et al. Pyramid scene parsing network[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 6230-6239.[DOI: 10.1109/CVPR.2017.660]
• [16] Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834-848. [DOI:10.1109/TPAMI.2017.2699184]
• [17] Lin G S, Milan A, Shen C H, et al. RefineNet: multi-path refinement networks for high-resolution semantic segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 5168-5177.[DOI: 10.1109/CVPR.2017.549]
• [18] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: Curran Associates Inc., 2012: 1097-1105.
• [19] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2018-06-19]. https://arxiv.org/pdf/1409.1556.pdf.
• [20] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778.[DOI: 10.1109/CVPR.2016.90]
• [21] Shu X B, Qi G J, Tang J H, et al. Weakly-shared deep transfer networks for heterogeneous-domain knowledge propagation[C]//Proceedings of the 23rd ACM International Conference on Multimedia. Brisbane, Australia: ACM, 2015: 35-44.[DOI: 10.1145/2733373.2806216]
• [22] Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation[C]//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer, 2015: 234-241.[DOI: 10.1007/978-3-319-24574-4_28]
• [23] Bulò S R, Neuhold G, Kontschieder P. Loss max-pooling for semantic image segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 7082-7091.[DOI: 10.1109/CVPR.2017.749]
• [24] Everingham M, van Gool L, Williams C K I, et al. The PASCAL visual object classes (VOC) challenge[J]. International Journal of Computer Vision, 2010, 88(2): 303–338. [DOI:10.1007/s11263-009-0275-4]