Published: 2019-03-16
DOI: 10.11834/jig.180406
2019 | Volume 24 | Number 3

ChinaMM 2018

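Full convolutional network for semantic segmentation and object detection

Xiao Feng¹, Rui Ting², Ren Tongwei³, Wang Dong²

1. School of Graduate, PLA Army Engineering University, Nanjing 210018, China;

2. School of Field Works, PLA Army Engineering University, Nanjing 210018, China;

3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210018, China

Received: 2018-07-04; Revised: 2018-08-18

Supported by: National Natural Science Foundation of China (61472444)

About the first author: Xiao Feng, born in 1993, male, master's student; main research interests are deep learning and computer vision. E-mail:1193221332@qq.com;
Ren Tongwei, male, professor; main research interests are image analysis and visual saliency. E-mail:rentw@nju.edu.cn;
Wang Dong, male, lecturer; main research interest is intelligent signal processing. E-mail:dyhkxydfbb@163.com.

CLC number: TP301.6

Document code: A

Article ID: 1006-8961(2019)03-0474-09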

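Abstract

Objective Mainstream object detection algorithms require default boxes to be laid out in advance and obtain object boxes by filtering and discarding default boxes. To ensure sufficient recall, the preset default boxes must be dense and multi-scale, so every region of the image is detected repeatedly and a great deal of computation is wasted. This paper proposes a multi-task deep learning model (FCDN) that needs no default boxes and performs semantic segmentation and object detection fully end to end, so that the detection model gains speed while maintaining accuracy. Method The paper first shows that the unpredictable number of objects is the reason mainstream detection algorithms need preset default boxes. Current deep learning detection algorithms are all extensions of image classification models; because the number of objects cannot be known in advance, the output of the detection model cannot be fixed, and to ensure recall a sufficiently dense, multi-scale set of default boxes must be classified. Object detection needs category information to recognize different classes of objects and boundary information to separate and localize individual objects. Semantic segmentation extracts rich category information, so object classes can be recognized from the semantic segmentation map. Following the idea of semantic segmentation, a module is designed to extract the boundary key points of the objects in the image, and recognition and localization are completed by combining the semantic segmentation map with the boundary key point map. Result To verify the feasibility of detection based on semantic segmentation, the model is trained and tested on the VOC (visual object classes) 2007 test set and compared with current mainstream detection algorithms. The results show that the new model performs semantic segmentation and object detection at the same time and, trained on the same samples, achieves higher detection accuracy than classic detection models. In running speed, its forward time is 8 ms lower than that of FCN and fairly close to that of fast detectors such as YOLO (you only look once). Conclusion This paper proposes a new approach to object detection that is no longer based on image classification and does not classify a preset set of dense, multi-scale default boxes. The experimental results show that detecting objects from the semantic segmentation map and boundary key points, making full use of the rich information extracted by semantic segmentation, is feasible and avoids repeated detection and wasted computation. Detection efficiency is further improved by reducing the number of pixels predicted by semantic segmentation, and experiments confirm that the simplified segmentation result is still sufficient for object detection.

Keywords

deep learning; object detection; semantic segmentation; boundary key points; multi-task learning; transfer learning; default boxes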

Full convolutional network for semantic segmentation and object detection

Xiao Feng¹, Rui Ting², Ren Tongwei³, Wang Dong²

1. School of Graduate, PLA Army Engineering University, Nanjing 210018, China;

2. School of Field Works, PLA Army Engineering University, Nanjing 210018, China;

3. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210018, China

Supported by: National Natural Science Foundation of China (61472444)

Abstract

Objective Mainstream object detection algorithms need default boxes to be laid out in advance and obtain object boxes by filtering them. To ensure sufficient recall, the default boxes must be dense and multi-scale, so regions of the image are detected repeatedly and a great deal of computation is wasted. This study proposes a multi-task deep learning model (FCDN) that needs no default boxes and improves detection speed while maintaining accuracy. Method The reason current mainstream detection algorithms need predefined default boxes is that the number of objects to be detected is not known in advance. Deep learning detection networks are developed from image classification models; because the number of objects is unpredictable, the output of the detection model cannot be fixed, and sufficiently dense, multi-scale default boxes must be classified to ensure recall. Object detection requires category information to recognize different classes of objects and boundary information to separate and localize each object. A semantic segmentation map carries rich category information, which can be used to recognize object classes. Following the idea of semantic segmentation, a module is designed to extract the boundary key points of objects, and recognition and localization are completed by combining the semantic segmentation map with the boundary key points. Detection methods based on image classification have a rectangular receptive field that contains other objects or background besides the object itself. Methods based on a semantic segmentation map and boundary key points differ in that their receptive field is at the pixel level. The pixels of a detected object can be removed from the semantic segmentation map and the boundary key point map without affecting the detection of other objects, which avoids leaving small objects undetected. On this basis we propose a new multi-task learning model that adds a boundary key point prediction layer to a semantic segmentation model, performs semantic segmentation and boundary key point prediction at the same time, and combines the two maps to complete object detection. Boundary lines are obtained from the boundary key points, and object boxes are obtained from the boundary lines. Result The proposed detection network needs no default boxes. It is no longer based on image classification but uses the idea of semantic segmentation to detect all object boundary key points at the pixel level; predicted boxes are obtained by combining the key points with the category information of the semantic segmentation result. To verify feasibility, the model is trained and then tested on the PASCAL VOC 2007 test set. Comparison with current mainstream detection algorithms shows that the new model performs semantic segmentation and object detection at the same time and, trained on the same samples, achieves higher detection precision than classic detection models. In running speed, its forward pass is 8 ms shorter than that of FCN and close to that of fast detectors such as YOLO. Conclusion This study proposes an approach to object detection that is no longer based on image classification and instead uses the information that semantic segmentation extracts from the image. The experimental results show that detecting objects from the semantic segmentation map and boundary key points is feasible. The method avoids repeated detection and wasted computation, improves efficiency by reducing the number of pixels predicted by semantic segmentation, and the simplified segmentation map remains sufficient for object detection.

Key words

deep learning; object detection; semantic segmentation; object boundary key points; multi-task learning; transfer learning; default boxes

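0 Introduction

Mainstream deep learning object detection models^[1-12] fall into two classes. The first is based on region proposals, with R-CNN^[1], SPP-net^[2], Fast R-CNN^[3], Faster R-CNN^[4], and R-FCN^[5] as typical representatives. The second is based on regression, with YOLO^[6-7] and SSD^[8] as typical representatives. Because the number of objects is not known in advance, both classes must lay out default boxes (also called anchors) in some fixed pattern, so that a relationship among default boxes, predicted boxes, and ground-truth boxes can be established for training. The root cause of hand-specified default boxes is therefore that the number of objects to be detected is unpredictable. Suppose the smallest recognizable pixel region in an image has area ${A_{\min }}$. In theory, the number of objects in a $W \times H$ image can be anywhere from 0 to $\left( {W \times H} \right)/{A_{\min }}$. If a deep learning model were to output ground-truth boxes directly, one box per object, the dimension of its output could not be fixed, because the number of objects is unknown. To fix the output, default boxes must be laid out by hand, each default box is classified with an image classification method, and the default boxes are filtered by the classification results to obtain predicted boxes. The R-CNN family works in two steps: first, a region proposal network (RPN) filters the anchors to produce default boxes that are likely to contain objects; second, the filtered default boxes are further classified and regressed, and non-maximum suppression removes overlapping predictions to give the final predicted boxes. YOLO and SSD detect end to end in a single step, but they only drop the preliminary filtering of the first step: they still lay out default boxes in advance, use the network to compute a confidence for every default box, and keep the high-confidence regions as the final predicted boxes. In essence they too are bottom-up methods that work by excluding default boxes.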

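This paper therefore refers to deep learning detectors such as R-CNN, YOLO, and SSD as exclusion-based object detection models. To ensure high recall, exclusion-based methods want as many default box shapes as possible, distributed as densely as possible; but too many image regions are then examined repeatedly, and the resulting computational cost is too high. Exclusion-based models therefore suffer from low detection efficiency.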
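To give a sense of scale for the two quantities discussed above, the following minimal sketch computes the theoretical upper bound $\left( {W \times H} \right)/{A_{\min }}$ on the object count and the number of default boxes a dense detector evaluates. The function names and the 500 × 375 image with a 5 × 5 minimum region are hypothetical; the feature-map layout is the SSD300 configuration reported in the SSD paper, not a setting taken from this article.

```python
def max_object_count(width, height, min_area):
    """Upper bound on the object count when each object needs min_area pixels."""
    return (width * height) // min_area


def default_box_count(feature_map_sizes, boxes_per_cell):
    """Total default boxes over square feature maps with k boxes per cell."""
    return sum(s * s * k for s, k in zip(feature_map_sizes, boxes_per_cell))


# Hypothetical 500 x 375 image whose smallest recognizable region is 5 x 5.
print(max_object_count(500, 375, 25))  # 7500

# SSD300 layout as reported in the SSD paper (an assumption for illustration).
ssd300 = default_box_count([38, 19, 10, 5, 3, 1], [4, 6, 6, 6, 4, 4])
print(ssd300)  # 8732
```

Every one of these default boxes is scored on each forward pass, even though a typical image contains only a handful of objects.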

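To address this problem, and borrowing the idea of semantic segmentation models, this paper proposes a fully end-to-end object detection model. The model adds an object detection layer on top of a semantic segmentation model and obtains predicted boxes indirectly by predicting boundary key points, so object detection is completed on top of semantic segmentation with very little extra computation.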

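1 Proposed method

The principle behind exclusion-based detection is that candidate boxes containing an object receive a higher score. Current deep learning detection algorithms are all extensions of image classification networks, and an image classification model can only judge which object an image contains and with what confidence. The more completely a box contains the object and the less background it includes, the more accurate the classification output and the higher the predicted confidence. Semantic segmentation^[13-17] models are different: because their training sets carry pixel-level annotations, a segmentation network extracts object information that is more precise and richer. It classifies every pixel of the image, and classifying a pixel requires the pixels around it; that is, the classification must be supported by region information. A semantic segmentation model can therefore be regarded as extracting a large amount of region information as intermediate features. This paper explores how deep learning can make full use of these intermediate features to classify each pixel both by category and by boundary property.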

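Once semantic segmentation can separate different kinds of objects at the pixel level, the segmentation result, and even the intermediate features, already contain a great deal of information for distinguishing the objects in the image. The proposed method builds on semantic segmentation and makes full use of the segmentation result and the intermediate features extracted by the segmentation network to recognize and localize objects, that is, to perform object detection. Information that determines which class of object a pixel belongs to is called category information, and information that determines whether a pixel lies on an object boundary is called boundary information. Object detection needs category information to recognize different classes of objects, and boundary information to separate and localize individual objects.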

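The parts of an object's boundary that are tangent to its ground-truth box are called boundary key points, as shown in Fig. 1. Boundary key points are independent of the object class; they can be viewed as the outermost object edges, the part of the object's edge that determines the ground-truth box. Experiments show that extracting boundary key points with a neural network directly from the original image works poorly and gives large errors. The edges between objects of different classes are already clear in the semantic segmentation map, and combining it with low-level convolutional feature maps, which carry boundary information between different instances of the same class, gives better key point extraction. The semantic segmentation map and the low-level convolutional feature maps are therefore combined, and a fully convolutional network determines the boundary key points on that basis.

Fig. 1 Diagram of boundary key points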

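Semantic segmentation models such as the fully convolutional network (FCN) extract only the category information of each pixel and cannot judge whether a pixel lies on an object boundary, so different instances of the same class cannot be separated from the segmentation result. This paper proposes a new model, the fully convolutional detection network (FCDN), which uses an FCN to complete two dense prediction tasks at once and extracts both category and boundary information for each pixel: the first task is semantic segmentation and the second is boundary key point prediction. The experiments use a fully convolutional network obtained by convolutionalizing the fully connected layers of a pretrained network such as VGG. The network structure is shown in Fig. 2.

Fig. 2 FCDN framework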

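2 FCDN implementation and training

The FCDN model consists of five modules: a feature extraction module, a semantic segmentation module, a boundary key point prediction module, an object detection module that combines the semantic segmentation map with the boundary key points, and a small-object re-detection module.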

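2.1 Feature extraction module

Using transfer learning, a neural network image classification model with good classification performance^[18-22] serves as the basis; its fully connected layers are convolutionalized or removed, and the result is used as the feature extractor. The main role of the feature extractor is to extract, through training, the convolutional feature maps of the image, which are shared by the subsequent semantic segmentation and boundary key point prediction.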

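2.2 Semantic segmentation module

A classic semantic segmentation model such as FCN^[13], SegNet^[14], PSPNet^[15], DeepLab^[16], RefineNet^[17], or U-Net^[22] performs the segmentation, using the features from the feature extraction module to classify every pixel densely. The experiments in this paper are based on the most basic of these models, FCN.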

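2.3 Boundary key point prediction module

The boundary key point prediction module takes as input both the output of the semantic segmentation module and the low-level convolutional feature maps of the feature extractor. Boundary key points are used to localize different objects. As shown in Fig. 3, the intersections of the ground-truth box annotations and the semantic segmentation annotations in the dataset serve as the boundary key point labels. In the detection model, once the semantic segmentation map of the input image has been obtained, the boundary key points partition it: the segmentation map provides category information, the boundary key points provide localization information, and the two are combined to complete object detection.

Fig. 3 Deriving boundary key point labels from the semantic segmentation and object detection annotations of the VOC dataset ((a) semantic segmentation annotation in the dataset; (b) ground-truth box annotation in the dataset; (c) boundary key point labels obtained from (a) and (b))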
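The label construction of Fig. 3 can be sketched as follows. This is one possible reading, assuming that a pixel is a boundary key point when it lies on the border of a ground-truth box and its segmentation label equals the class of that box; the function name, the list-of-lists map format, and the inclusive pixel coordinates are illustrative choices, not details given in the paper.

```python
def boundary_key_points(seg, boxes):
    """Mark box-border pixels whose segmentation label matches the box class.

    seg   : 2D list of class ids (0 = background).
    boxes : list of (xmin, ymin, xmax, ymax, cls), inclusive pixel coordinates.
    Returns a 2D list of 0/1 with the same size as seg.
    """
    h, w = len(seg), len(seg[0])
    kp = [[0] * w for _ in range(h)]
    for xmin, ymin, xmax, ymax, cls in boxes:
        # Top and bottom edges of the box.
        for x in range(xmin, xmax + 1):
            for y in (ymin, ymax):
                if seg[y][x] == cls:
                    kp[y][x] = 1
        # Left and right edges of the box.
        for y in range(ymin, ymax + 1):
            for x in (xmin, xmax):
                if seg[y][x] == cls:
                    kp[y][x] = 1
    return kp


# Toy example: a diamond-shaped object of class 1 inside its tight box.
seg = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
labels = boundary_key_points(seg, [(1, 1, 3, 3, 1)])
for row in labels:
    print(row)
```

In the toy example only the four tips of the diamond touch the box, so only those four pixels are labeled; the object's interior pixel is not.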

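The boundary key point prediction module works on the same principle as the semantic segmentation module: it is a binary classification of each pixel as boundary key point or not. The loss function is therefore cross-entropy^[23]. The training objective of the boundary key point prediction module is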

$ \min \left\{ \sum\limits_{\left( {x, y} \right) \in \boldsymbol{T}} L\left( {f_\theta }\left( x \right), y \right) + \lambda R\left( \theta \right) : \theta \in \boldsymbol{\varTheta} \right\} $

(1)

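where $x$ is the input, ${{f_\theta }\left( x \right)}$ is the output of the boundary key point prediction module, $y$ is the boundary key point label, $\boldsymbol{T}$ is the training set, $L$ is the loss function, $\lambda $ is a constant coefficient, $R$ is the regularization term, $\theta $ is the set of parameters to be determined, and $\boldsymbol{\varTheta}$ is the set of all possible model parameters.

The loss function is usually decomposed into the sum of the per-pixel losses over the segmentation map, normalized by the number of pixels, that is,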

$ L\left( {\hat y, y} \right) = - \frac{1}{n}\left( {\sum {{l_{\hat yy}}} } \right) $

(2)

where ${\hat y}$ is the module output, $y$ is the label, ${\sum {{l_{\hat yy}}} }$ is the sum of the losses over all pixels of the map, and $n$ is the total number of pixels.
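For the binary key point task, Eq. (2) can be illustrated with the standard binary cross-entropy, taking the per-pixel term as $y\log \hat y + \left( {1 - y} \right)\log \left( {1 - \hat y} \right)$. The paper does not spell out the per-pixel term, so this choice, the function name, and the clipping constant `eps` are assumptions made for the sketch.

```python
import math


def pixel_cross_entropy(pred, label, eps=1e-7):
    """Mean binary cross-entropy over all pixels of a map.

    pred  : 2D list of predicted key point probabilities in [0, 1].
    label : 2D list of 0/1 key point labels.
    eps   : clipping constant (assumed) that keeps log() finite.
    """
    total = 0.0
    n = 0
    for p_row, y_row in zip(pred, label):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
            n += 1
    # Leading minus sign and 1/n normalization as in Eq. (2).
    return -total / n


# An uninformative prediction of 0.5 everywhere gives a loss of ln 2.
uniform = pixel_cross_entropy([[0.5, 0.5], [0.5, 0.5]], [[1, 0], [0, 1]])
print(round(uniform, 4))  # 0.6931
```

Because key points are far rarer than non-key-point pixels, a practical implementation might also weight the two classes differently; that refinement is outside what the paper describes.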

2.4 Object detection module combining the semantic segmentation map and boundary key points

This module maps the boundary key points onto the semantic segmentation map and, after processing and discrimination, produces a predicted box for each object. The procedure is shown in Fig. 4.

Fig. 4 Object detection by mapping boundary key points onto the semantic segmentation map ((a) horizontal and vertical distributions of the boundary key point map; (b) all possible object regions; (c) bounding boxes obtained from the semantic segmentation map; (d) object detection result)

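The boundary key points are mapped onto the semantic segmentation map to complete detection: the segmentation map determines the object class, and the boundary key points localize the object. Detection proceeds as follows:

1) Denoise the semantic segmentation map output by the model to remove as much segmentation error as possible;

2) From the boundary key point map output by the model, obtain the distribution of boundary key points along the horizontal and vertical directions of the image;

3) Determine the number and positions of the boundary lines from the distribution of boundary key points;

4) For every boundary line (including the four borders of the semantic segmentation map), determine the segmented regions tangent to it and the object class of each region;

5) Using the results of steps 3) and 4), group the boundary lines four at a time; each group forms a predicted box;

6) Determine the object class and confidence of each predicted box.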
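Steps 2), 3) and 5) can be sketched as below. The paper does not state how boundary lines are derived from the key point distribution, so this sketch assumes a simple rule: project the key point map onto each axis and keep the rows and columns whose counts reach a hypothetical threshold `min_count`. The image borders of step 4) and the region test against the segmentation map are omitted.

```python
from itertools import combinations


def boundary_lines(kp, min_count):
    """Project a 0/1 key point map onto both axes and threshold the counts.

    Returns (xs, ys): column indices of vertical boundary lines and row
    indices of horizontal boundary lines.
    """
    h, w = len(kp), len(kp[0])
    col_hist = [sum(kp[y][x] for y in range(h)) for x in range(w)]
    row_hist = [sum(kp[y]) for y in range(h)]
    xs = [x for x, c in enumerate(col_hist) if c >= min_count]
    ys = [y for y, c in enumerate(row_hist) if c >= min_count]
    return xs, ys


def candidate_boxes(xs, ys):
    """Group two vertical and two horizontal lines into (xmin, ymin, xmax, ymax)."""
    return [(x0, y0, x1, y1)
            for x0, x1 in combinations(xs, 2)
            for y0, y1 in combinations(ys, 2)]


# Toy key point map: an object touching its box along three pixels per side.
kp = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
xs, ys = boundary_lines(kp, min_count=3)
print(xs, ys)                   # [1, 5] [1, 5]
print(candidate_boxes(xs, ys))  # [(1, 1, 5, 5)]
```

With several objects the grouping produces candidates that contain no single object; these would be discarded using the class information of steps 4) and 6).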

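Because the output semantic segmentation map contains some error, the confidence ${C_i}$ that a predicted box contains an object of each class is defined as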

$ {C_i} = \frac{{{S_i}}}{{\sum\limits_{j = 1}^N {{S_j}} }} $

(3)

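where $i$ denotes the class, $N$ is the total number of object classes, and ${{S_i}}$ is the number of pixels in the detection box whose label is class $i$.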
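A direct reading of Eq. (3) is sketched below. It assumes that label 0 is background and is excluded from the denominator, since the sum runs over the object classes 1 to $N$; the function name and the inclusive box coordinates are illustrative.

```python
def class_confidence(seg, box, num_classes):
    """Per-class confidence of Eq. (3) inside one predicted box.

    seg         : 2D list of class ids, 0 = background (assumed), 1..num_classes.
    box         : (xmin, ymin, xmax, ymax), inclusive pixel coordinates.
    Returns {class id: confidence} for the classes present in the box.
    """
    xmin, ymin, xmax, ymax = box
    counts = [0] * (num_classes + 1)
    for y in range(ymin, ymax + 1):
        for x in range(xmin, xmax + 1):
            counts[seg[y][x]] += 1
    total = sum(counts[1:])  # object pixels only, background excluded
    if total == 0:
        return {}
    return {i: counts[i] / total
            for i in range(1, num_classes + 1) if counts[i]}


seg = [
    [1, 1, 2, 0],
    [1, 1, 2, 0],
]
conf = class_confidence(seg, (0, 0, 3, 1), num_classes=2)
print(conf)  # class 1 -> 4/6, class 2 -> 2/6
```

The class with the largest $C_i$ would then be taken as the class of the box, with $C_i$ as its confidence.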

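2.5 Small-object re-detection module

Experiments show that small objects have relatively few boundary key points, so their prediction error is relatively large, and during denoising some small objects may be filtered out as noise. FCDN is therefore less robust on small objects. To address this, a small-object re-detection module was designed; its principle is shown in Fig. 5.

Fig. 5 Principle of the small-object re-detection module ((a) original semantic segmentation map; (b) processed large objects; (c) semantic segmentation map containing only small objects)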

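After each object has been localized with the semantic segmentation map and boundary key points, each predicted box is denoised by removing the pixels that do not belong to its class. The processed semantic segmentation map is then XORed with the original one, which filters out the pixels of all objects already detected and leaves only the pixels of undetected small objects. On the remaining small-object map, a threshold keeps the connected components above a certain size. Finally, the retained small objects and all previously detected objects go through non-maximum suppression together to give the final predicted boxes.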
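The XOR and connected-component steps can be sketched in plain Python as follows. The sketch assumes the XOR is taken between the foreground masks of the two maps, uses 4-connectivity, and takes a hypothetical size threshold `min_pixels`; the final non-maximum suppression is not shown.

```python
from collections import deque


def residual_mask(original, processed):
    """XOR of the foreground masks: 1 where only one of the maps has an object."""
    return [[int((o != 0) != (p != 0)) for o, p in zip(o_row, p_row)]
            for o_row, p_row in zip(original, processed)]


def small_object_boxes(mask, min_pixels):
    """Boxes (xmin, ymin, xmax, ymax) of 4-connected regions with enough pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first search over one connected region.
            queue = deque([(sx, sy)])
            seen[sy][sx] = True
            pixels = []
            while queue:
                x, y = queue.popleft()
                pixels.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(pixels) >= min_pixels:
                xs = [p[0] for p in pixels]
                ys = [p[1] for p in pixels]
                boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes


original = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 2, 2],
    [3, 0, 0, 0, 0],
]
processed = [  # only the already detected class-1 object remains
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
residual = residual_mask(original, processed)
print(small_object_boxes(residual, min_pixels=3))  # [(3, 1, 4, 2)]
```

In the example the four-pixel class-2 region survives as a re-detected small object, while the isolated class-3 pixel falls below the threshold and is treated as noise.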

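3 Experiments and analysis

3.1 Feasibility of boundary key point prediction

To verify that boundary key points can be predicted, two prediction methods were designed. The first predicts boundary key points directly from the convolutional feature maps that the feature extraction module computes from the original image, outputting the boundary key points alongside the semantic segmentation map. The result is the orange curve in Fig. 6: training converges slowly and the key point prediction error is high. The second treats the semantic segmentation map as an intermediate result and predicts boundary key points from it together with the low-level convolutional feature maps. This makes fuller use of the segmentation map without adding network layers, so that it contributes to both class detection and localization. The result is the blue curve in Fig. 6. Fig. 7 shows sample outputs of the two methods. The comparison shows that training key point regression directly on the convolutional feature maps converges less well than training on the segmentation map combined with the low-level feature maps. The reason is that the segmentation map already separates objects of different classes and so carries rich boundary information between classes, but none between different instances of the same class; the low-level convolutional feature maps supply that missing information.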

Fig. 6 Comparison of boundary key point training using the original image only and with the semantic segmentation map added

Fig. 7 Comparison of the results of the two boundary key point prediction methods ((a) test images; (b) boundary key point labels; (c) predictions of the first method after 100 000 training iterations; (d) predictions of the second method after 100 000 training iterations)

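3.2 FCDN multi-task deep learning experiment

A single fully convolutional network performs two dense classification tasks at once. To test whether the boundary key point prediction module can share all the neurons of the semantic segmentation module, the FCDN model was trained; the results are shown in Fig. 8. Both tasks converge well, so boundary key point detection can share all layers of the semantic segmentation model without interfering with segmentation, and the two tasks can be trained together. The training objective of the model is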

$ \min \left( {\alpha {L_{{\rm{seg}}}} + \beta {L_{{\rm{kps}}}}} \right) $

(4)

where ${{L_{{\rm{seg}}}}}$ is the semantic segmentation loss^[23], taken to be the same as in FCN; ${{L_{{\rm{kps}}}}}$ is the boundary key point prediction loss; and $\alpha $ and $\beta $ are constant coefficients.

Fig. 8 Simultaneous training of the semantic segmentation and boundary key point prediction tasks

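3.3 Detection performance comparison

Finally, experiments on the PASCAL VOC dataset^[24], which contains 20 object classes, verify the detection performance of FCDN. Because boundary key point labels must be derived from semantic segmentation annotations, only the 3 335 images with semantic segmentation annotations in the trainval sets of VOC 2007 and VOC 2012 can be used for training; the VOC 2007 test set is used for testing. Detection accuracy is measured by mean average precision (mAP), and all experiments run on a single Titan X GPU. To obtain boundary key point labels quickly, the method of Fig. 3 is used, which means the model is trained only on this subset of VOC. To keep the results comparable, the other detection models are also trained on the object detection annotations of the same 3 335 images and tested on the VOC 2007 test set. All models in the comparison use VGG-16 as the backbone. The results are given in Tables 1 and 2. Table 1 concerns running speed: FCDN adds an object detection layer and a boundary key point prediction layer to FCN, but instead of predicting every pixel as semantic segmentation does, it predicts only equally spaced pixels covering the whole image, so overall each forward pass takes on average 8 ms less than that of FCN. Table 2 shows that, because the training data are relatively few, the comparison models perform worse than the same models trained on the full object detection datasets; detection performance could be improved by adding training samples. The results show that, with the same training samples, FCDN reaches the detection performance of classic object detection models.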
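The saving from predicting only equally spaced pixels can be illustrated with a small count. The paper does not report the spacing it uses, so the stride of 4 and the 500 × 375 image size below are hypothetical, as is the choice to start the grid half a stride from the border.

```python
def sparse_grid(height, width, stride):
    """Equally spaced (x, y) positions covering the image, offset by stride // 2."""
    offset = stride // 2
    return [(x, y)
            for y in range(offset, height, stride)
            for x in range(offset, width, stride)]


points = sparse_grid(375, 500, 4)
print(len(points))                          # 11750
print(round(len(points) / (375 * 500), 4))  # 0.0627
```

Under these assumed settings the prediction layers classify about 6% of the pixels that a dense segmentation head would, which is the kind of reduction that can offset the cost of the added layers.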

Table 1 Comparison of FCDN and other detection models on operating efficiency

Method	Forward time/ms
Faster R-CNN	143
SSD	53
YOLO v2	36
FCN	62
FCDN	54

Table 2 Comparison of detection accuracy between FCDN and other detection models on the VOC 2007 test set

/%
Detection model	aero	bike	bird	bottle	boat	bus	car	cat	chair	cow	table	dog	horse	mbike	person	plant	sheep	sofa	train	TV	mAP
Faster R-CNN	73.5	73.6	66.9	42.1	65.5	73.1	74.7	73.4	37.2	74.9	53.7	72.8	72.6	67.5	76.7	38.8	67.6	63.9	65.3	62.6	64.8
SSD	62.4	64.7	58.4	33.8	53.2	66.2	63.5	66.3	32.8	67.1	45.2	64.9	65.2	53.9	69.7	30.3	68.9	73.9	56.5	57.3	57.7
YOLO v2	74.8	66.9	60.2	31.3	56.9	66.5	57.8	60.9	30.3	57.6	50.5	65.6	57.5	60.2	62.1	34.4	54.8	62.6	64.3	58.6	56.7
FCDN	75.8	76.9	70.2	41.3	66.9	76.5	77.8	70.9	36.3	77.6	60.5	75.6	77.5	70.2	72.1	36.4	64.8	72.6	70.3	68.6	66.4

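4 Conclusion

The experiments show that a single fully convolutional semantic segmentation network can perform both semantic segmentation and localization key point prediction, and that the features extracted during semantic segmentation can be used to predict boundary key points. Object detection, and even instance segmentation, can therefore be built on top of the semantic segmentation map. Adding only two convolutional layers to a semantic segmentation model is enough to predict object boundary key points and thus detect objects, and the experiments show that adding the boundary key point prediction module does not affect semantic segmentation. Further sparsifying the pixels that semantic segmentation must classify lowers the computational cost below that of FCN; segmentation accuracy drops somewhat, but detection accuracy does not drop noticeably. Future work can use the procedure of Fig. 3 to produce more boundary key point labels from the COCO (common objects in context) dataset, optimize the model structure to improve detection accuracy and speed, and try applying the same principle to instance segmentation.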

References

[1] Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 580-587.[DOI: 10.1109/CVPR.2014.81]

[2] He K M, Zhang X Y, Ren S Q, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904–1916. [DOI:10.1109/TPAMI.2015.2389824]

[3] Girshick R. Fast R-CNN[C]//Proceedings of the 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 1440-1448.[DOI: 10.1109/ICCV.2015.169]

[4] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137–1149. [DOI:10.1109/TPAMI.2016.2577031]

[5] Dai J F, Li Y, He K M, et al. R-FCN: object detection via region-based fully convolutional networks[M]//Lee D D, von Luxburg U, Sugiyama M, et al. Advances in Neural Information Processing Systems: 29. Red Hook, NY: Curran Associates, Inc., 2016: 379-387.

[6] Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 779-788.[DOI: 10.1109/CVPR.2016.91]

[7] Redmon J, Farhadi A. YOLO9000: better, faster, stronger[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 6517-6525.[DOI: 10.1109/CVPR.2017.690]

[8] Liu W, Anguelov D, Erhan D, et al. SSD: single shot MultiBox detector[C]//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer, 2016: 21-37.[DOI: 10.1007/978-3-319-46448-0_2]

[9] He K M, Gkioxari G, Dollár P, et al. Mask R-CNN[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018. [DOI:10.1109/TPAMI.2018.2844175]

[10] Ren S Q, He K M, Girshick R, et al. Object detection networks on convolutional feature maps[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(7): 1476–1481. [DOI:10.1109/TPAMI.2016.2601099]

[11] Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE, 2017: 2999-3007.[DOI: 10.1109/ICCV.2017.324]

[12] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 936-944.[DOI: 10.1109/CVPR.2017.106]

[13] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 3431-3440.[DOI: 10.1109/CVPR.2015.7298965]

[14] Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(12): 2481–2495. [DOI:10.1109/TPAMI.2016.2644615]

[15] Zhao H S, Shi J P, Qi X J, et al. Pyramid scene parsing network[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 6230-6239.[DOI: 10.1109/CVPR.2017.660]

[16] Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4): 834–848. [DOI:10.1109/TPAMI.2017.2699184]

[17] Lin G S, Milan A, Shen C H, et al. RefineNet: multi-path refinement networks for high-resolution semantic segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 5168-5177.[DOI: 10.1109/CVPR.2017.549]

[18] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: Curran Associates Inc., 2012: 1097-1105.

[19] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL].[2018-06-19]. https://arxiv.org/pdf/1409.1556.pdf.

[20] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778.[DOI: 10.1109/CVPR.2016.90]

[21] Shu X B, Qi G J, Tang J H, et al. Weakly-shared deep transfer networks for heterogeneous-domain knowledge propagation[C]//Proceedings of the 23rd ACM International Conference on Multimedia. Brisbane, Australia: ACM, 2015: 35-44.[DOI: 10.1145/2733373.2806216]

[22] Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation[C]//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer, 2015: 234-241.[DOI: 10.1007/978-3-319-24574-4_28]

[23] Bulò S R, Neuhold G, Kontschieder P. Loss max-pooling for semantic image segmentation[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, HI, USA: IEEE, 2017: 7082-7091.[DOI: 10.1109/CVPR.2017.749]

[24] Everingham M, van Gool L, Williams C K I, et al. The PASCAL visual object classes (VOC) challenge[J]. International Journal of Computer Vision, 2010, 88(2): 303–338. [DOI:10.1007/s11263-009-0275-4]