Full convolutional network for semantic segmentation and object detection
Xiao Feng1, Rui Ting2, Ren Tongwei3, Wang Dong2
1. School of Graduate, PLA Army Engineering University, Nanjing 210018, China;
2. School of Field Works, PLA Army Engineering University, Nanjing 210018, China;
3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210018, China
Supported by: National Natural Science Foundation of China (61472444)

# Abstract

Objective Mainstream object detection algorithms must define default boxes in advance and then obtain object boxes by filtering them. To ensure sufficient recall, the default boxes must be dense and multi-scale, which causes image regions to be detected repeatedly and wastes a great deal of computation. This study proposes a multi-task deep learning model (FCDN) that requires no default boxes and improves detection speed while maintaining accuracy.

Method Mainstream detectors need predefined default boxes because the number of objects in an image is not known in advance. Deep learning detection networks were developed from image classification models. Because the number of objects is unpredictable, the size of the detection output cannot be fixed, so a dense, multi-scale set of default boxes must be classified to ensure recall. Object detection requires category information to recognize different objects and boundary information to locate each one. A semantic segmentation map carries rich category information that can be used to recognize object categories. Recognition and localization can therefore be completed by adopting the idea of semantic segmentation, designing a module that extracts the boundary key points of objects, and combining the semantic segmentation map with those key points. Detection methods based on image classification have a rectangular receptive field that contains other objects or background in addition to the object itself. Methods based on a semantic segmentation map and boundary key points differ in that their receptive field is at the pixel level. The pixels of a detected object can be removed from the semantic segmentation map and the boundary key point distribution map without affecting the detection of other objects, which helps avoid missing small objects. On this basis, we propose a new multi-task learning model that adds a boundary key point prediction layer to a semantic segmentation model, performs semantic segmentation and boundary key point prediction simultaneously, and combines the semantic segmentation map with the boundary key point distribution map to complete object detection. Boundary lines are obtained from the boundary key points, and object boxes are obtained from the boundary lines.

Result An object detection network that requires no default boxes is proposed. The detection algorithm is no longer based on image classification. Instead, it uses the idea of semantic segmentation to detect all object boundary key points at the pixel level, and the object box is obtained by combining these key points with the category information of the semantic segmentation result. The model is trained on the basis of semantic segmentation and then tested on the PASCAL VOC 2007 test set to verify its feasibility. Comparison with current mainstream object detection algorithms shows that the new model, trained with the same training samples, performs semantic segmentation and object detection at the same time. The detection precision of FCDN is superior to that of classic detection models. Its running time is 8 ms shorter than that of FCN and close to that of fast detection algorithms such as YOLO.

Conclusion This study proposes an approach to object detection that is no longer based on image classification and instead uses semantic segmentation to extract information from the image to be detected. Experimental results show that completing object detection from the semantic segmentation map and the boundary key points is feasible. The method avoids repeated detection and reduces wasted computation by decreasing the number of pixels predicted by semantic segmentation, which improves detection efficiency. The simplified semantic segmentation map does not affect detection accuracy.

Key words: deep learning; object detection; semantic segmentation; object boundary key points; multi-task learning; transfer learning; default boxes

# 2 FCDN model implementation and training

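The FCDN model can be divided into five modules: a feature extraction module, a semantic segmentation module, a boundary key point prediction module, an object detection module that combines the semantic segmentation map with the boundary key points, and a small-object re-detection module.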

# 2.3 Boundary key point prediction module

$\min \left\{ \sum\limits_{(x, y) \in \boldsymbol{T}} L\left( f_\theta(x), y \right) + \lambda R(\theta) : \theta \in \boldsymbol{\varTheta} \right\}$ (1)

$L\left( \hat y, y \right) = - \frac{1}{n} \sum l_{\hat y y}$ (2)
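Equations (1) and (2) can be read as a regularized loss-minimization objective. The Python sketch below evaluates that objective for one fixed parameter vector. It rests on assumptions not stated in this section: that $l_{\hat y y}$ is the log-probability predicted for the true class at a pixel, that $n$ is the number of pixels, and that $R(\theta)$ is an L2 penalty. All function names (`pixel_loss`, `l2_penalty`, `objective`, `constant_model`) are hypothetical and serve illustration only.

```python
import math

def pixel_loss(probs, labels):
    """Eq. (2): negative mean of l over n pixels.

    Assumption: l is the log of the probability predicted for the true class.
    probs[k] is the predicted class distribution at pixel k,
    labels[k] is the true class index at pixel k.
    """
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / n

def l2_penalty(theta):
    """Assumed regularizer R(theta): sum of squared parameters."""
    return sum(w * w for w in theta)

def objective(samples, theta, lam, predict):
    """Eq. (1) for one fixed theta: summed data loss plus lam * R(theta)."""
    data_term = sum(pixel_loss(predict(theta, x), y) for x, y in samples)
    return data_term + lam * l2_penalty(theta)

# Toy usage: a stand-in "model" that ignores its input and always
# predicts a uniform distribution over two classes at every pixel.
def constant_model(theta, x):
    return [[0.5, 0.5] for _ in x]

samples = [([0, 0], [0, 1])]  # one image with two pixels and their labels
value = objective(samples, theta=[1.0, 2.0], lam=0.1, predict=constant_model)
```

Training would search over $\theta$ for the smallest such value. The sketch only shows how the data term and the penalty term are combined.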

# 2.4 Object detection module combining the semantic segmentation map and boundary key points

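1) Denoise the semantic segmentation map output by the model to eliminate segmentation errors as far as possible;

2) Obtain the horizontal and vertical distributions of the boundary key points over the image from the boundary key point segmentation map output by the model;

3) Determine the number and positions of the boundary lines from the distribution of the boundary key points;

4) Determine the segmented regions tangent to each boundary line (including the four borders of the semantic segmentation map) and the object category of each region;

5) Based on the results of steps 3) and 4), combine all boundary lines in groups of four, with each group forming one predicted box;

6) Determine the object category and confidence of each predicted box.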
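Steps 2), 3) and 5) can be illustrated with a small Python sketch. It is a simplified reading, not the authors' implementation: the key point map is assumed to be a binary grid, a boundary line is assumed to be any row or column whose key point count reaches a hypothetical threshold `min_count`, and the tangent-region filtering of step 4) is omitted, so every pairing of lines yields a candidate box. All function names are illustrative.

```python
from itertools import combinations

def projections(kp_map):
    """Step 2: count key points per row and per column of a binary map."""
    rows = [sum(r) for r in kp_map]
    cols = [sum(c) for c in zip(*kp_map)]
    return rows, cols

def boundary_lines(counts, min_count):
    """Step 3: one line per run of consecutive positions with enough key points.

    Each run is represented by its middle index.
    """
    lines, run = [], []
    for idx, c in enumerate(counts):
        if c >= min_count:
            run.append(idx)
        elif run:
            lines.append(run[len(run) // 2])
            run = []
    if run:
        lines.append(run[len(run) // 2])
    return lines

def candidate_boxes(kp_map, min_count):
    """Step 5 without the step 4 filter: group two horizontal and two
    vertical lines into a box (x0, y0, x1, y1)."""
    rows, cols = projections(kp_map)
    ys = boundary_lines(rows, min_count)
    xs = boundary_lines(cols, min_count)
    return [(x0, y0, x1, y1)
            for y0, y1 in combinations(ys, 2)
            for x0, x1 in combinations(xs, 2)]

# Toy map: key points along the outline of one rectangle in a 6x6 grid.
kp = [[0] * 6 for _ in range(6)]
for i in range(1, 5):
    kp[1][i] = kp[4][i] = kp[i][1] = kp[i][4] = 1
boxes = candidate_boxes(kp, min_count=3)
```

With several objects the unfiltered pairing produces many spurious boxes, which is why a filter such as step 4) is needed before categories and confidences are assigned.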

$C_i = \frac{S_i}{\sum\limits_{j = 1}^{N} S_j}$ (3)
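A short Python sketch of Eq. (3) follows. This section does not define $S_i$, so the sketch assumes it is the number of pixels of class $i$ that fall inside a predicted box in the semantic segmentation map. The function name `class_confidences` and the box layout `(x0, y0, x1, y1)` with inclusive corners are hypothetical.

```python
from collections import Counter

def class_confidences(seg_map, box):
    """Return {class: C_i} for one box, with C_i = S_i / sum of all S_j.

    Assumption: S_i is the pixel count of class i inside the box.
    seg_map[y][x] holds the class label of pixel (x, y).
    """
    x0, y0, x1, y1 = box
    counts = Counter(
        seg_map[y][x]
        for y in range(y0, y1 + 1)
        for x in range(x0, x1 + 1)
    )
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Toy segmentation map with labels 0, 1 and 2.
seg = [
    [1, 1, 2],
    [1, 1, 2],
    [0, 0, 0],
]
conf = class_confidences(seg, (0, 0, 2, 1))
best_class = max(conf, key=conf.get)
```

Under this reading, the class with the largest share of pixels gives the box its category, and that share is reported as the confidence.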

# 3.2 FCDN multi-task deep learning experiment

$\min \left( \alpha L_{\mathrm{seg}} + \beta L_{\mathrm{kps}} \right)$ (4)
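The effect of the weights $\alpha$ and $\beta$ in Eq. (4) can be seen in a toy Python sketch. The two quadratic losses below are hypothetical stand-ins for $L_{\mathrm{seg}}$ and $L_{\mathrm{kps}}$ over a single shared parameter `w`. They are not the losses used by FCDN, and plain gradient descent is chosen only for illustration.

```python
def combined_loss(w, alpha, beta):
    """Eq. (4) with toy task losses sharing one parameter w."""
    l_seg = (w - 1.0) ** 2  # toy stand-in for L_seg, minimized at w = 1
    l_kps = (w - 3.0) ** 2  # toy stand-in for L_kps, minimized at w = 3
    return alpha * l_seg + beta * l_kps

def combined_grad(w, alpha, beta):
    """Derivative of combined_loss with respect to w."""
    return 2.0 * alpha * (w - 1.0) + 2.0 * beta * (w - 3.0)

def minimize(alpha, beta, w=0.0, lr=0.1, steps=200):
    """Plain gradient descent on the weighted sum of the two losses."""
    for _ in range(steps):
        w -= lr * combined_grad(w, alpha, beta)
    return w

w_equal = minimize(alpha=1.0, beta=1.0)     # settles midway between the two optima
w_seg_heavy = minimize(alpha=3.0, beta=1.0) # pulled toward the L_seg optimum
```

For these toy losses the minimizer is $(\alpha \cdot 1 + \beta \cdot 3)/(\alpha + \beta)$, so raising $\alpha$ moves the shared parameter toward the segmentation optimum at the expense of the key point task.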

# 3.3 FCDN detection performance comparison experiment

Table 1 Comparisons of FCDN and other detection models on operating efficiency

| Method | Forward pass time/ms |
|---|---|
| Faster R-CNN | 143 |
| SSD | 53 |
| YOLO v2 | 36 |
| FCN | 62 |
| FCDN | 54 |

Table 2 Comparison of detection accuracy between FCDN and other detection models in VOC 2007 test data set

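Unit: %

| Detection model | aero | bike | bird | bottle | boat | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | TV | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN | 73.5 | 73.6 | 66.9 | 42.1 | 65.5 | 73.1 | 74.7 | 73.4 | 37.2 | 74.9 | 53.7 | 72.8 | 72.6 | 67.5 | 76.7 | 38.8 | 67.6 | 63.9 | 65.3 | 62.6 | 64.8 |
| SSD | 62.4 | 64.7 | 58.4 | 33.8 | 53.2 | 66.2 | 63.5 | 66.3 | 32.8 | 67.1 | 45.2 | 64.9 | 65.2 | 53.9 | 69.7 | 30.3 | 68.9 | 73.9 | 56.5 | 57.3 | 57.7 |
| YOLO v2 | 74.8 | 66.9 | 60.2 | 31.3 | 56.9 | 66.5 | 57.8 | 60.9 | 30.3 | 57.6 | 50.5 | 65.6 | 57.5 | 60.2 | 62.1 | 34.4 | 54.8 | 62.6 | 64.3 | 58.6 | 56.7 |
| FCDN | 75.8 | 76.9 | 70.2 | 41.3 | 66.9 | 76.5 | 77.8 | 70.9 | 36.3 | 77.6 | 60.5 | 75.6 | 77.5 | 70.2 | 72.1 | 36.4 | 64.8 | 72.6 | 70.3 | 68.6 | 66.4 |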

