Fully convolutional network for semantic segmentation and object detection
2019, Vol. 24, No. 3, pp. 474-482
Received: 2018-07-04
Revised: 2018-08-18
Published in print: 2019-03-16
DOI: 10.11834/jig.180406
Objective
Mainstream object detection algorithms need to delimit default boxes in advance and obtain object boxes by screening and pruning them. To guarantee a sufficient recall rate, the preset default boxes must be dense enough and cover multiple scales, which causes every region of the image to be detected repeatedly and wastes a great deal of computation. This paper proposes a multi-task deep learning model (FCDN) that requires no default boxes and performs fully end-to-end semantic segmentation and object detection, so that the detection model can improve detection speed while maintaining accuracy.
Method
We first analyze why mainstream object detection algorithms must delimit default boxes in advance: the number of objects to be detected is unknown. Because current deep learning detectors are extended from image classification models, the unpredictable number of objects makes it impossible to fix the output of the detection model, and to guarantee recall, sufficiently dense and multi-scale default boxes must be classified. An object detection task requires object category information to recognize objects of different classes, and object boundary information to separate and localize individual objects. Semantic segmentation extracts rich category information, so object categories can be read from the segmentation map; following the same idea, a module is designed to extract the boundary key points of objects in the image, and the segmentation map is combined with the boundary key-point distribution map to recognize and localize objects, as sketched below.
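As a rough illustration of this two-head design, the following sketch builds a small fully convolutional model with a shared encoder, a per-pixel classification head, and a per-pixel boundary key-point head. The layer sizes and the names FCDNSketch, seg_head, and boundary_head are illustrative assumptions, not the paper's actual architecture.

# A minimal sketch (assumed layer sizes, not the paper's exact architecture):
# a shared fully convolutional encoder with two prediction heads, one for
# per-pixel class scores and one for per-pixel boundary key-point scores.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCDNSketch(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        # Shared encoder: a small stack of strided convolutions (overall stride 8).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Head 1: per-pixel class scores (background + object classes).
        self.seg_head = nn.Conv2d(256, num_classes, 1)
        # Head 2: per-pixel score for "is this pixel an object boundary key point".
        self.boundary_head = nn.Conv2d(256, 1, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feat = self.encoder(x)
        seg = F.interpolate(self.seg_head(feat), size=(h, w), mode="bilinear", align_corners=False)
        boundary = F.interpolate(self.boundary_head(feat), size=(h, w), mode="bilinear", align_corners=False)
        return seg, boundary  # both upsampled back to the input resolution

# Usage: both tasks are predicted in one forward pass, with no default boxes.
model = FCDNSketch()
seg_logits, boundary_logits = model(torch.randn(1, 3, 320, 320))
print(seg_logits.shape, boundary_logits.shape)  # (1, 21, 320, 320) (1, 1, 320, 320)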
Result
To verify the feasibility of object detection based on the idea of semantic segmentation, the model was trained and tested on the VOC (visual object classes) 2007 test set and compared with current mainstream detection algorithms. The results show that the new model can perform semantic segmentation and object detection simultaneously; when trained on the same training samples, its detection accuracy is better than that of classic detection models. In terms of running time, it is 8 ms faster than FCN, which is close to fast detection algorithms such as YOLO (you only look once).
Conclusion
This paper proposes a new approach to object detection that is no longer based on image classification and does not classify dense, multi-scale preset default boxes. Experimental results show that it is feasible to complete object detection from the semantic segmentation map and the boundary key points by making full use of the rich information extracted by semantic segmentation, which avoids repeated detection of the image and the associated computational waste. Detection efficiency is further improved by reducing the number of pixels predicted in semantic segmentation, and experiments verify that the simplified segmentation result is still sufficient for the object detection task.
Objective
Mainstream object detection algorithms need to delimit default boxes in advance and then acquire object boxes by screening and pruning them. Sufficiently dense and multi-scale default boxes must be preset to ensure a sufficient recall rate, which leads to repeated detection of various areas in an image and great computational waste. This study proposes a multi-task deep learning model (FCDN) that does not need to delimit default boxes, realizes fully end-to-end semantic segmentation and object detection, and improves detection speed while ensuring accuracy.
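For a sense of how dense such presets are, the short snippet below reproduces the well-known default-box count of SSD with a 300×300 input (4 or 6 boxes per cell over six feature maps), every one of which must be classified for every image; the grid sizes are SSD's published configuration, not values from this paper.

# How many default boxes a typical single-shot detector presets per image:
# SSD300 predicts over every cell of six feature maps, with 4 or 6 boxes per cell.
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]  # (grid size, boxes per cell)
total = sum(size * size * boxes for size, boxes in feature_maps)
print(total)  # 8732 default boxes, each of which must be classified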
Method
The reason the current mainstream object detection algorithms need to delimit default boxes in advance is that the number of objects to be detected is unknown. Deep learning detectors are developed from image classification models; because the number of objects is unpredictable, the output of the detection model cannot be fixed, and sufficiently dense and multi-scale default boxes must be classified to ensure the recall rate. The object detection task requires object category information to recognize objects of different classes and object boundary information to separate and localize each object. A semantic segmentation map provides rich category information, which can be used to recognize the categories of objects. Object recognition and positioning can therefore be completed by adopting the idea of semantic segmentation: a module is designed to extract the boundary key points of objects, and the semantic segmentation map is combined with the boundary key-point distribution map. Object detection methods based on image classification have a rectangular receptive field that contains information from other objects or the background in addition to the object itself. Methods based on the semantic segmentation map and boundary key points are different: their receptive field is at the pixel level. Pixels of a detected object can be removed from the semantic segmentation map and the boundary key-point distribution map, which does not affect the detection of other objects and helps prevent small objects from being left undetected. Based on the preceding analysis, we propose a new multi-task learning model that adds a boundary key-point prediction layer to a semantic segmentation model, completes semantic segmentation and boundary key-point prediction at the same time, and combines the segmentation map and the boundary key-point distribution map to complete object detection. Boundary lines are obtained from the boundary key points, and object boxes are derived from the boundary lines.
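The abstract does not give the exact decoding rule, but the simplified sketch below shows one plausible way a class map and a boundary key-point map could be turned into boxes: group the pixels of each class into connected regions and take the extent of the boundary key points inside each region. The helper boxes_from_maps and the connected-component grouping are assumptions; in particular, this simplification does not separate touching objects of the same class, which the paper's boundary key points are intended to handle.

# A simplified post-processing sketch (an assumption, not the paper's exact rule):
# given a per-pixel class map and a boundary key-point map, group each class's
# pixels into connected regions and take the extent of the key points in each region.
import numpy as np
from scipy import ndimage

def boxes_from_maps(class_map, keypoint_map, num_classes=21, background=0):
    """class_map: (H, W) int array of predicted class indices.
    keypoint_map: (H, W) bool array, True where a boundary key point is predicted."""
    boxes = []
    for c in range(num_classes):
        if c == background:
            continue
        regions, n = ndimage.label(class_map == c)  # connected regions of class c
        for r in range(1, n + 1):
            ys, xs = np.nonzero((regions == r) & keypoint_map)
            if ys.size == 0:
                continue  # no boundary key points fall in this region
            boxes.append((c, xs.min(), ys.min(), xs.max(), ys.max()))  # (class, x1, y1, x2, y2)
    return boxes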
Result
An object detection network that does not need to delimit default boxes is proposed. The algorithm is no longer based on image classification; instead, it uses the idea of semantic segmentation to detect all object boundary key points at the pixel level, and object boxes are obtained by combining them with the category information of the semantic segmentation result. To verify its feasibility, the model was trained and then tested on the PASCAL VOC 2007 test set, and its performance was compared with that of current mainstream object detection algorithms. The results show that, trained with the same training samples, the new model can realize semantic segmentation and object detection at the same time, and the detection precision of FCDN is superior to that of classic detection models. In terms of running speed, it is 8 ms faster than FCN, which is close to fast detection algorithms such as YOLO.
Conclusion
This study proposes an approach to object detection that is no longer based on image classification; instead, it uses semantic segmentation to extract rich information from the image to be detected. Experimental results show that completing object detection from the semantic segmentation map and the boundary key points is feasible, and this method avoids repeated detection and the associated computational waste. Detection efficiency is further improved by decreasing the number of pixels predicted in semantic segmentation, and the simplified semantic segmentation map does not reduce detection accuracy.
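One plausible reading of "fewer predicted pixels", shown below as an assumption rather than the paper's actual procedure, is to decode the class map and key-point map at the encoder's coarse stride instead of at full resolution and then scale the resulting box coordinates back up. The snippet reuses the hypothetical FCDNSketch model and boxes_from_maps helper from the sketches above.

# Decode at stride-8 resolution (1/64 as many pixels as the full-resolution map),
# then rescale the box coordinates back to the input image.
import torch

with torch.no_grad():
    image = torch.randn(1, 3, 320, 320)
    feat = model.encoder(image)                                    # stride-8 feature map
    class_map = model.seg_head(feat).argmax(1)[0].numpy()          # (H/8, W/8) class indices
    keypoint_map = (model.boundary_head(feat).sigmoid() > 0.5)[0, 0].numpy()

stride = 8  # scale decoded coordinates back to the input resolution
boxes = [(c, x1 * stride, y1 * stride, x2 * stride, y2 * stride)
         for c, x1, y1, x2, y2 in boxes_from_maps(class_map, keypoint_map)]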
Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 580-587.[DOI:10.1109/CVPR.2014.81]
He K M, Zhang X Y, Ren S Q, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9):1904-1916.[DOI:10.1109/TPAMI.2015.2389824]
Girshick R. Fast R-CNN[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1440-1448.[DOI:10.1109/ICCV.2015.169]
Ren S Q, He K M, Girshick R, et al. Faster R-CNN:towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6):1137-1149.[DOI:10.1109/TPAMI.2016.2577031]
Dai J F, Li Y, He K M, et al. R-FCN: object detection via region-based fully convolutional networks[M]//Lee D D, von Luxburg U, Sugiyama M, et al. Advances in Neural Information Processing Systems: 29. Red Hook, NY: Curran Associates, Inc., 2016: 379-387.
Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 779-788.[DOI:10.1109/CVPR.2016.91]
Redmon J, Farhadi A. YOLO9000: better, faster, stronger[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 6517-6525.[DOI:10.1109/CVPR.2017.690]
Liu W, Anguelov D, Erhan D, et al. SSD: single shot MultiBox detector[C]//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016: 21-37.[DOI:10.1007/978-3-319-46448-0_2]
He K M, Gkioxari G, Dollar P, et al. Mask R-CNN[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.[DOI:10.1109/TPAMI.2018.2844175]
Ren S Q, He K M, Girshick R, et al. Object detection networks on convolutional feature maps[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(7):1476-1481.[DOI:10.1109/TPAMI.2016.2601099]
Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 2999-3007.[DOI:10.1109/ICCV.2017.324]
Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 936-944.[DOI:10.1109/CVPR.2017.106]
Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 3431-3440.[DOI:10.1109/CVPR.2015.7298965]
Badrinarayanan V, Kendall A, Cipolla R. SegNet:a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12):2481-2495.[DOI:10.1109/TPAMI.2016.2644615]
Zhao H S, Shi J P, Qi X J, et al. Pyramid scene parsing network[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 6230-6239.[DOI:10.1109/CVPR.2017.660]
Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4):834-848.[DOI:10.1109/TPAMI.2017.2699184]
Lin G S, Milan A, Shen C H, et al. RefineNet: multi-path refinement networks for high-resolution semantic segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 5168-5177.[DOI:10.1109/CVPR.2017.549]
Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: Curran Associates Inc., 2012: 1097-1105.
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL].[2018-06-19]. https://arxiv.org/pdf/1409.1556.pdf.
He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778.[DOI:10.1109/CVPR.2016.90]
Shu X B, Qi G J, Tang J H, et al. Weakly-shared deep transfer networks for heterogeneous-domain knowledge propagation[C]//Proceedings of the 23rd ACM International Conference on Multimedia. Brisbane, Australia: ACM, 2015: 35-44.[DOI:10.1145/2733373.2806216]
Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation[C]//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer, 2015: 234-241.[DOI:10.1007/978-3-319-24574-4_28]
Rota Bulò S, Neuhold G, Kontschieder P. Loss max-pooling for semantic image segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 7082-7091.[DOI:10.1109/CVPR.2017.749]
Everingham M, van Gool L, Williams C K I, et al. The PASCAL visual object classes (VOC) challenge[J]. International Journal of Computer Vision, 2010, 88(2):303-338.[DOI:10.1007/s11263-009-0275-4]