Published: 2022-08-16
DOI: 10.11834/jig.210204
2022 | Volume 27 | Number 8

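Image Analysis and Recognition

Single-stage object detection with optimized fusion strategy selection and dual attention

Dai Kun (戴坤), Xu Libo (许立波), Huang Shiyang (黄世旸), Li Yunling (李鋆铃)

School of Computer and Data Engineering, NingboTech University, Ningbo 315000, China

Received: 2021-04-02; Revised: 2021-05-26; Preprint: 2021-06-02

Funding: National Natural Science Foundation of China (61872321); Ningbo Science and Technology Innovation 2025 Major Special Project (2019B10036, 2020Z005)

About the authors: Dai Kun, born in 1999, male, undergraduate student. Research interests: deep learning, computer vision and object detection. E-mail: 351009456@qq.com
Xu Libo, corresponding author, male, engineer. Research interests: intelligent information processing and object detection. E-mail: xlb@nbt.edu.cn
Huang Shiyang, male, undergraduate student. Research interests: deep learning, computer vision and object detection. E-mail: 381667573@qq.com
Li Yunling, female, undergraduate student. Research interests: deep learning, computer vision and object detection. E-mail: 1540363934@qq.com
*Corresponding author: Xu Libo xlb@nbt.edu.cn

CLC number: TP391

Document code: A

Article ID: 1006-8961(2022)08-2430-14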

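Abstract

Objective Feature fusion is an effective way to ease the detection of difficult targets such as blurred images, small objects and occluded objects. To use feature fusion more effectively for integrating feature information from different network levels and for highlighting the important features, this paper proposes FDA-SSD (fusion double attention single shot multibox detector), a single-stage object detection algorithm based on optimized fusion strategy selection and a dual attention mechanism. Method A method for the optimized selection of the fusion strategy is designed. Combined with the feature pyramid network (FPN), it determines the best combination of multi-layer feature maps and the fusion procedure. A dual attention module is then attached, which redistributes the weights of the channel and spatial features, makes the model more sensitive to channel features and spatial information, and finally produces a group of feature maps that carry rich semantic information and highlight the important features. Result Comparative experiments are carried out on the public datasets PASCAL VOC2007 (pattern analysis, statistical modelling and computational learning visual object classes) and TGRS-HRRSD-Dataset (high resolution remote sensing detection). On the PASCAL VOC2007 test set with 300×300 pixel input, FDA-SSD reaches an accuracy of 79.8%, which is 2.6, 1.3, 1.2 and 1.0 percentage points higher than SSD (single shot multibox detector), RSSD (rainbow SSD), DSSD (de-convolution SSD) and FSSD (feature fusion SSD), respectively. Its detection speed on Titan X is 47 frames per second (FPS), comparable to SSD and 12 FPS and 37.5 FPS higher than RSSD and DSSD, respectively. On the TGRS-HRRSD-Dataset test set with 300×300 pixel input, the accuracy is 84.2%; on Tesla V100 the detection speed is 10% higher than that of SSD while the accuracy is 1.5 percentage points higher. Conclusion Introducing fusion strategy selection and a dual attention mechanism into a single-stage detector improves prediction speed and accuracy at the same time, and markedly improves the detection of difficult targets such as small objects, occluded objects and blurred images.

Keywords

single-stage object detection; SSD algorithm; feature pyramid network (FPN); feature fusion; attention mechanism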

Single stage object detection algorithm based on fusing strategy optimization selection and dual attention mechanism

Dai Kun, Xu Libo, Huang Shiyang, Li Yunling

School of Computer and Data Engineering, NingboTech University, Ningbo 315000, China

Supported by: National Natural Science Foundation of China (61872321);Ningbo Science and Technology Innovation 2025 Major Special Project(2019B10036, 2020Z005)

Abstract

Objective Object detection is a central topic in computer vision and deep learning and is widely used in industrial inspection, intelligent transportation, face recognition and other fields. Mainstream detectors fall into two categories. Two-stage algorithms, such as region-based convolutional neural network (R-CNN), Fast R-CNN, online hard example mining (OHEM), Faster R-CNN and Mask R-CNN, first generate candidate boxes and then classify and regress them. Single-stage algorithms, such as you only look once (YOLO) and single shot multibox detector (SSD), predict classes and locations directly. Anchor-free models such as corner network (CornerNet) and center network (CenterNet) discard anchor boxes and detect and match on keypoints; they achieve good results but still trail anchor-based detectors slightly. In practical single-stage detection, the main challenges are difficult targets such as blurred images, small objects and occluded objects, and the trade-off between accuracy and efficiency. Feature fusion, which combines deep and shallow features of the network, effectively improves the detection of difficult targets and is used in many improved SSD models. However, most of these models apply feature fusion directly and pay little attention to the fusion strategy itself, such as which feature maps to fuse and how to process the fused maps. In addition, attention mechanisms give feature maps a "focusing" effect by assigning weights along chosen dimensions, and how to integrate them effectively into single-stage detection deserves further study.
Method The shallow Visual Geometry Group (VGG) network of the original SSD is replaced by a deep residual network as the backbone. First, a method for the optimized selection of the fusion strategy is designed around the feature pyramid network (FPN). FPN is applied to the four output layers of the backbone so that the feature information is complete. During enhanced fusion, low-level features are partly retained through downsampling while the size of the largest feature map is kept unchanged, which balances speed and performance. FPN is applied first and enhanced fusion second, which amounts to one step of reverse FPN and integrates features better than enhancing first and applying FPN afterwards. Finally, the r_c5_conv3 layer, whose role duplicates that of r_c5_conv4, is removed to reduce interference. The resulting feature maps combine the detail of high-resolution maps with the rich semantics of low-resolution maps and therefore describe the target better. Second, drawing on the bottleneck attention module (BAM) and the squeeze-and-excitation network (SENet), a parallel dual attention mechanism is designed to integrate the channel and spatial information of the feature maps. Channel attention and spatial attention are applied to each feature map in parallel and their outputs are added, which strengthens key features, weakens redundant interfering features and extracts the key spatial position information. Finally, a group of feature maps with rich semantic information and distinctive features is obtained.
Result Comparative experiments are carried out on pattern analysis, statistical modelling and computational learning visual object classes (PASCAL VOC 2007) and IEEE Transactions on Geoscience and Remote Sensing-High Resolution Remote Sensing Detection (TGRS-HRRSD-Dataset). On the PASCAL VOC2007 test set with 300×300 pixel input, the fusion double attention SSD (FDA-SSD) model reaches an accuracy of 79.8%, which is 2.6, 1.3, 1.2 and 1.0 percentage points higher than SSD, rainbow single shot detector (RSSD), de-convolution single shot detector (DSSD) and feature fusion single shot detector (FSSD), respectively. The detection speed on Titan X is 47 frames per second (FPS), comparable to SSD and 12 FPS and 37.5 FPS higher than RSSD and DSSD, respectively. With 512×512 pixel input, the accuracy on the PASCAL VOC2007 test set is 81.6% and the detection speed on Titan X is 18 FPS, better than most algorithms in both accuracy and speed. On TGRS-HRRSD-Dataset with 300×300 pixel input, the accuracy is 84.2%; on Tesla V100 the detection speed is 10% higher than that of SSD and the accuracy is 1.5 percentage points higher. The algorithm performs well on both a general-purpose dataset and an aerial dataset, which reflects its stability and portability.
Conclusion This paper proposes an optimized feature-map selection method and a dual attention mechanism. Compared with many existing improved SSD models, the model is advantageous in both accuracy and speed and performs well on small, occluded, blurred and other difficult targets. Although FDA-SSD performs well among SSD-based models, this work focuses on optimizing the feature maps; prediction box generation and non-maximum suppression remain to be studied in future work.

Key words

single-stage object detection; single shot multibox detector(SSD); feature pyramid network(FPN); feature fusion; attention mechanism

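0 Introduction

Object detection has been one of the research hotspots in computer vision and deep learning in recent years and is widely applied in industrial inspection, intelligent transportation, face recognition and other fields. Popular object detection algorithms fall into two categories. The first is two-stage algorithms, such as R-CNN (region convolutional neural network) (Girshick et al., 2014), Fast R-CNN (Girshick, 2015), OHEM (online hard example mining) (Shrivastava et al., 2016), Faster R-CNN (Ren et al., 2017) and Mask R-CNN (He et al., 2020), which first generate candidate boxes and then classify and regress them. The second is single-stage (one-stage) algorithms, such as YOLO (you only look once) (Redmon et al., 2016) and SSD (single shot multibox detector) (Liu et al., 2016), which use a convolutional neural network (CNN) to predict object classes and locations directly. Each category has its strengths: two-stage algorithms are generally more accurate, while single-stage algorithms are faster. In practical applications, single-stage algorithms, which balance speed and performance, are currently favoured. In addition, anchor-free models represented by CornerNet (corner network) (Law and Deng, 2020) and CenterNet (center network) (Zhou et al., 2019) discard anchor boxes and detect and match on keypoints. They also achieve good results but still trail anchor-based detection methods slightly.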

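The earliest single-stage algorithm is YOLO, proposed by Redmon et al. (2016), which treats detection as a regression problem. The same authors later proposed YOLOv2 (Redmon and Farhadi, 2017), an improved version with better efficiency and accuracy. Liu et al. (2016) combined the efficiency of YOLO with the region proposal idea of Faster R-CNN and proposed SSD, a new single-stage algorithm that generates and regresses prediction boxes on different feature maps and balances accuracy and efficiency effectively. Fu et al. (2017) proposed DSSD (de-convolution single shot detector), an improvement on SSD that generates new feature maps by deconvolving the original ones, makes full use of shallow features, and replaces the backbone with the more expressive ResNet (residual neural network) (He et al., 2016). Jeong et al. (2017) proposed RSSD (rainbow single shot detector), another SSD-based improvement that applies pooling and deconvolution to the initial SSD feature maps and fuses the results so that the feature maps carry both forward and backward information. On the PASCAL VOC (pattern analysis, statistical modelling and computational learning visual object classes) dataset its accuracy is comparable to that of DSSD, but its detection speed reaches 35 frames per second (FPS), far exceeding DSSD. YOLOv3 (Redmon and Farhadi, 2018) followed, with much higher accuracy than YOLOv2 and no loss of speed. Lin et al. (2020) proposed RetinaNet, an improved variant of SSD that uses ResNet-101-FPN (feature pyramid network) as the backbone and introduces the focal loss to address the imbalance between positive and negative samples. Li and Zhou (2017) proposed FSSD (fusion single shot multibox detector), which first fuses two feature maps from VGG (Visual Geometry Group) (Simonyan and Zisserman, 2015) and then generates new feature maps. Wang et al. (2018) proposed multi-scale location-aware kernel proposals (MLKP), which exploit high-order statistics in object detection to generate more discriminative and more sensitive proposals and can be applied flexibly to object detection. Zhou et al. (2018) proposed STDN (scale transferrable dense network), which uses a scale-transfer module to obtain multi-scale feature maps with high-level semantics without slowing down the detector. Zhang et al. (2018) proposed the refinement neural network (RefineDet), which combines SSD, the region proposal network (RPN) and FPN to improve small-object detection while keeping detection efficient. Zheng et al. (2020) proposed F_SE_SSD (fusion squeeze and excitation networks single shot multibox detector), which improves the feature fusion of FSSD and adds an attention mechanism to improve the detection of small objects. Bae (2019) proposed the region decomposition and assembly detector (R-DAD), which uses a multi-scale region proposal (MRP) module and a region decomposition and assembly (RDA) module to fuse semantic information from several directions around important features, which helps detect occluded objects. Tang and Hu (2020) proposed an anchor promotion module (APM) and a feature alignment module (FAM), which alleviate anchor imbalance to some extent. Wu et al. (2020) proposed the bidirectional pyramid network (BPN), which addresses the weak expression of faint features in SSD.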

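In practical applications, the main challenges for single-stage object detection are the difficulty of detecting blurred images, small objects and occluded objects (Zhang et al., 2015; Yin et al., 2016; Ge et al., 2018; Fang et al., 2018), and the difficulty of balancing performance and efficiency. It is widely agreed that feature fusion, which combines features from shallow and deep layers of the network, effectively improves the detection of difficult targets, and it is therefore used in many improved SSD models. Most of these models, however, apply feature fusion directly. The fusion strategy itself, such as which maps should be fused and how the fused maps should be processed, has received relatively little study. Attention mechanisms have also attracted wide interest in recent years: by assigning weights along chosen dimensions they give feature maps a "focusing" effect, and how to integrate them effectively into single-stage detection is worth studying in depth. To this end, this paper proposes the FDA-SSD (fusion double attention single shot multibox detector) algorithm. Its first contribution is a strategy for the optimized selection of feature maps, which determines the most effective combination of multi-layer feature maps for fusion and an appropriate post-processing step. Its second contribution is a dual attention mechanism applied to the fused output feature maps, so that the important channel and spatial features are expressed more prominently; prediction accuracy improves while runtime efficiency remains high.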

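1 FDA-SSD algorithm

1.1 Backbone replacement

SSD uses a modified VGG16 network as its backbone. This paper replaces it with ResNet, which is deeper yet has fewer parameters and less computation.

The block forms of ResNet are shown in Fig. 1. Fig. 1(a) is the basic block of ResNet34, composed of two 3×3 convolutional layers, and Fig. 1(b) is the bottleneck block of ResNet50/101/152, composed of two 1×1 convolutional layers and one 3×3 convolutional layer. Let the original multi-layer mapping be $H(\boldsymbol{x})$, the input of each block be $\boldsymbol{x}$, and the residual mapping of ResNet be $F(\boldsymbol{x})$. The residual connection (shortcut) then gives $H(\boldsymbol{x})=F(\boldsymbol{x})+\boldsymbol{x}$. Without adding parameters or computation, the skip connection lets information from earlier layers be reused across layers, which to some extent alleviates vanishing and exploding gradients.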
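
As a minimal sketch of the shortcut relation $H(\boldsymbol{x})=F(\boldsymbol{x})+\boldsymbol{x}$, the snippet below adds the input back onto the output of a residual function. The names `residual_block` and `toy_residual` are hypothetical, and `toy_residual` is only a stand-in for the stacked convolutions of a real block.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return H(x) = F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must match the shape of x for an identity shortcut")
    return [f + xi for f, xi in zip(fx, x)]


def toy_residual(x: Vector) -> Vector:
    # Hypothetical stand-in for the convolutional layers: scale by 0.5, then ReLU.
    return [max(0.0, 0.5 * v) for v in x]


h = residual_block([1.0, -2.0, 4.0], toy_residual)
identity = residual_block([1.0, -2.0, 4.0], lambda v: [0.0] * len(v))
```

When `residual_fn` outputs zeros the block reduces to the identity, which is why very deep stacks of such blocks remain easy to optimize.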

Fig. 1 Block forms of ResNet

((a)basic block; (b)bottleneck)

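1.2 Optimized selection of the fusion strategy

The feature pyramid network (FPN), proposed by Lin et al. (2017), is a classic feature fusion method that can be applied to different deep learning architectures. As shown in Fig. 2, FPN operates on the feature maps of every level through a bottom-up and a top-down pathway, fusing high-level information while retaining low-level information.

Fig. 2 Feature pyramid network

The top-down pathway of Fig. 2 is detailed in Fig. 3. The upsampled higher-level feature map is added element-wise to the lateral feature map that has passed through a 1×1 convolution, which yields a set of feature maps with the same sizes as the original ones. Every lower-level feature map thus carries part of the high-level semantic information, which improves the efficiency and ability of the model in detecting small objects.

Fig. 3 FPN fusion method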
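
The top-down merge step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: feature maps are nested lists in `[channel][row][col]` layout, upsampling is nearest-neighbour, and the helper names (`conv1x1`, `upsample_nearest`, `fpn_merge`) are hypothetical.

```python
from typing import List

Map2D = List[List[float]]
Feature = List[Map2D]  # layout: [channel][row][col]


def conv1x1(x: Feature, weight: List[List[float]]) -> Feature:
    """1x1 convolution: each output channel is a weighted sum of input channels."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wk * x[k][i][j] for k, wk in enumerate(row_w)) for j in range(w)]
         for i in range(h)]
        for row_w in weight
    ]


def upsample_nearest(x: Feature, out_h: int, out_w: int) -> Feature:
    """Nearest-neighbour upsampling of every channel to (out_h, out_w)."""
    in_h, in_w = len(x[0]), len(x[0][0])
    return [
        [[ch[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
         for i in range(out_h)]
        for ch in x
    ]


def fpn_merge(top: Feature, lateral: Feature, lateral_weight: List[List[float]]) -> Feature:
    """Upsample the higher-level map and add it to the 1x1-convolved lateral map."""
    lat = conv1x1(lateral, lateral_weight)
    h, w = len(lat[0]), len(lat[0][0])
    up = upsample_nearest(top, h, w)
    return [
        [[up[c][i][j] + lat[c][i][j] for j in range(w)] for i in range(h)]
        for c in range(len(lat))
    ]


top = [[[10.0]]]                                   # 1 channel, 1x1
lateral = [[[1.0, 2.0], [3.0, 4.0]],               # 2 channels, 2x2
           [[0.0, 0.0], [0.0, 0.0]]]
merged = fpn_merge(top, lateral, lateral_weight=[[1.0, 1.0]])
```

The 1×1 convolution exists to make the channel counts of the two inputs match before the addition; here it reduces two lateral channels to one.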

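To make FPN fusion on the backbone outputs more effective and to produce more expressive feature maps, this paper adopts a strategy for the optimized selection of feature maps. The backbone yields four feature maps of different sizes, r_c2, r_c3, r_c4 and r_c5. Combined with FPN, many fusion schemes can be built; by varying what is fused and how, eight fusion strategies are constructed:

Strategy 1) To test the usefulness of the shallow feature r_c2, FPN is applied to the three backbone outputs r_c3, r_c4 and r_c5. r_c5 is then convolved (kernel size 3, stride 2, padding 1) to generate r_c5_conv1, r_c5_conv2, r_c5_conv3 and r_c5_conv4. Finally, r_c5_conv3 (2×2×256) is deleted so that the resulting group of feature maps matches the original SSD in both number and scale.

Strategy 2) r_c2, r_c3, r_c4 and r_c5 are extracted from the backbone and FPN is applied to all four layers. r_c5 is then convolved to generate r_c5_conv1, r_c5_conv2, r_c5_conv3 and r_c5_conv4. Finally, r_c2 (75×75×256) and r_c5_conv3 (2×2×256) are deleted. This strategy serves as a comparison with strategy 1).

Strategy 3) To explore an effective way of fusing the shallow feature r_c2, r_c2, r_c3, r_c4 and r_c5 are extracted from the backbone, and r_c2 is downsampled and fused with r_c3 into r_c3′. FPN is then applied to r_c3′, r_c4 and r_c5, and r_c5 is convolved to generate r_c5_conv1, r_c5_conv2, r_c5_conv3 and r_c5_conv4. Finally, r_c5_conv3 is deleted.

Strategy 4) r_c2, r_c3, r_c4 and r_c5 are extracted from the backbone, and r_c2 is downsampled and fused with r_c3 into r_c3′. FPN is then applied to r_c3′, r_c4 and r_c5, and r_c5 is convolved to generate r_c5_conv1, r_c5_conv2, r_c5_conv3 and r_c5_conv4. To test the usefulness of r_c5_conv3, it is finally downsampled and fused with r_c5_conv4.

Strategy 5) FPN is applied to the backbone outputs r_c2, r_c3, r_c4 and r_c5 to generate fpn_r_c2, fpn_r_c3, fpn_r_c4 and fpn_r_c5. To examine the usefulness of r_c2 after FPN, fpn_r_c2 is downsampled and fused with fpn_r_c3. r_c5 is then convolved to generate r_c5_conv1, r_c5_conv2, r_c5_conv3 and r_c5_conv4. Finally, r_c5_conv3 (2×2×256) is deleted.

Strategy 6) The first half is the same as strategy 5). r_c5 is then convolved to generate r_c5_conv1, r_c5_conv2, r_c5_conv3 and r_c5_conv4. To test the usefulness of r_c5_conv3 in this setting, it is finally downsampled and fused with r_c5_conv4.

Strategy 7) The first half is the same as strategy 5). r_c5 is then convolved to generate r_c5_conv1 and r_c5_conv2. After that, in a manner similar to how SSD generates its extra feature maps, r_c5_conv2 is convolved again (kernel size 3, stride 1, padding 0) to generate r_c5_conv3.

Strategy 8) The first half is the same as strategy 4). r_c5 is then convolved to generate r_c5_conv1 and r_c5_conv2. After that, in a manner similar to how SSD generates its extra feature maps, r_c5_conv2 is convolved (kernel size 1, stride 1, padding 0) to generate r_c5_conv3.

After testing these strategies in ablation experiments, strategy 5) is selected as the best one. The structure that generates the feature maps is shown in Fig. 4. r_c2, r_c3, r_c4 and r_c5 are extracted from the backbone, with sizes 75×75×256, 38×38×512, 19×19×1 024 and 10×10×2 048, respectively. To extract deeper semantic information, r_c5 is convolved with 256 kernels of size 3, stride 2 and padding 1 to generate r_c5_conv1 (5×5×256). The same operation is repeated to generate r_c5_conv2, r_c5_conv3 and r_c5_conv4, with sizes 3×3×256, 2×2×256 and 1×1×256, respectively. At the same time, FPN is applied to r_c2~r_c5 to generate the new features fpn_r_c2~fpn_r_c5.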
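
The spatial sizes quoted for r_c5_conv1 to r_c5_conv4 follow from the standard convolution output-size rule, floor((n + 2p − k) / s) + 1. The short sketch below checks that arithmetic; the function names are hypothetical and only the kernel size, stride and padding come from the text.

```python
from typing import List


def conv_out_size(n: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution applied to an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1


def extra_layer_sizes(start: int, count: int) -> List[int]:
    """Sizes obtained by repeatedly applying a 3x3, stride-2, padding-1 convolution."""
    sizes = []
    n = start
    for _ in range(count):
        n = conv_out_size(n, kernel=3, stride=2, padding=1)
        sizes.append(n)
    return sizes


# Starting from the 10x10 map r_c5, four convolutions give 5, 3, 2 and 1.
sizes = extra_layer_sizes(10, 4)
```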

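Fig. 4 Optimized selection of feature maps

FPN processing produces the eight feature maps shown in Fig. 4. To keep the model comparison fair and to show the effectiveness of the feature maps generated by the proposed method, only six feature maps are retained, matching SSD in size and number. The largest size is kept at 38×38 pixels. Because the 75×75 pixel feature map contains more detail, discarding it outright could hurt the detection of small objects, so its information is partly retained through enhanced fusion. As shown in Fig. 5, the largest map fpn_r_c2 (75×75×256) first passes through a 1×1 convolution (256 kernels, kernel size 1, stride 1), is then downsampled bilinearly, and is concatenated with fpn_r_c3 after the latter has also passed through a 1×1 convolution (256 kernels, kernel size 1, stride 1). The result, new_fpn_r_c3 (38×38×512), replaces the original fpn_r_c3. The final outputs are r_c5_conv1, r_c5_conv2, r_c5_conv4, r_c5, fpn_r_c4 and new_fpn_r_c3.

Fig. 5 Enhanced fusion of feature maps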
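
A rough sketch of the enhanced fusion step is given below. It covers only the bilinear downsampling and the channel-wise concatenation; the two learned 1×1 convolutions are left out, and the half-pixel sampling convention in `resize_bilinear` is an assumption rather than something the paper specifies. All function names are hypothetical.

```python
from typing import List

Map2D = List[List[float]]
Feature = List[Map2D]  # layout: [channel][row][col]


def resize_bilinear(ch: Map2D, out_h: int, out_w: int) -> Map2D:
    """Bilinear resize of one channel (half-pixel centres, an assumed convention)."""
    in_h, in_w = len(ch), len(ch[0])
    out = []
    for i in range(out_h):
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = ch[y0][x0] * (1 - wx) + ch[y0][x1] * wx
            bottom = ch[y1][x0] * (1 - wx) + ch[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def enhanced_fusion(fine: Feature, coarse: Feature) -> Feature:
    """Downsample the finer map to the coarser size, then concatenate channels."""
    out_h, out_w = len(coarse[0]), len(coarse[0][0])
    resized = [resize_bilinear(ch, out_h, out_w) for ch in fine]
    return resized + coarse


fine = [[[float(4 * i + j) for j in range(4)] for i in range(4)]]   # 1 channel, 4x4
coarse = [[[0.0, 0.0], [0.0, 0.0]]]                                  # 1 channel, 2x2
fused = enhanced_fusion(fine, coarse)                                # 2 channels, 2x2
```

With 256 channels on each side, the same concatenation gives the 512 channels of new_fpn_r_c3, and resizing a 75×75 map to 38×38 keeps the largest output at the SSD size.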

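1.3 Analysis of the selected strategy

In strategy 5), FPN is applied to all four feature layers output by the backbone, which keeps the feature information complete. During enhanced fusion, low-level features are partly retained through downsampling while the size of the largest map stays unchanged, which balances speed and performance. In terms of ordering, FPN comes first and enhanced fusion second, which amounts to one step of reverse FPN and integrates features better than enhancing first and applying FPN afterwards. Finally, r_c5_conv3, whose role duplicates that of r_c5_conv4, is removed to reduce interference. These are the reasons why this strategy outperforms the others.

Fig. 6(a) is the original input image, Fig. 6(b)(c) are the feature maps of size 75×75 and 38×38 pixels output by the ResNet-50 backbone, and Fig. 6(d) is the 38×38 pixel feature map produced by strategy 5). The comparison shows that the high-resolution feature map in Fig. 6(b) contains richer texture and detail, whereas the low-resolution feature map in Fig. 6(c), although blurrier, conveys the contour and shape of the target more directly. Fig. 6(d), obtained with strategy 5), combines the detail of the high-resolution map with the rich semantics of the low-resolution map and therefore describes the target better.

Fig. 6 Comparison of different feature map outputs

((a) original image; (b) backbone network output (75×75 pixels); (c) backbone network output (38×38 pixels); (d) strategy 5) output (38×38 pixels))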

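1.4 Parallel dual attention mechanism

Drawing on the ideas of the bottleneck attention module (BAM) (Woo et al., 2018) and SE (squeeze and excitation) (Hu et al., 2018), this paper designs a parallel dual attention mechanism that integrates the channel and spatial information of the feature maps. Channel attention and spatial attention together sharpen the focus of the feature maps on both the shape and the position of the target, as shown in Fig. 7.

Fig. 7 Core structure of dual attention module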

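The channel attention mechanism consists of three operations. Its input is the feature map $\boldsymbol{U}(W, H, C)$, and a series of pooling and fully connected operations yields a new feature map $\tilde{\boldsymbol{x}}(W, H, C)$. The operations are as follows:

1) Squeeze ($F_{\mathrm{sq}}$). Global average pooling is applied to every channel of $\boldsymbol{U}$, compressing the feature map into a (1, 1, $C$) tensor with one real number per channel:

$ z_{c}=F_{\mathrm{sq}}\left(\boldsymbol{u}_{c}\right)=\frac{1}{H \times W} \sum\limits_{i=1}^{H} \sum\limits_{j=1}^{W} u_{c}(i, j) $

(1)

where $z_c$ is the output of the Squeeze operation for channel $c$, $\boldsymbol{u}_c$ is the $c$-th channel of the feature map, $H$ and $W$ are the height and width of the feature map, and $u_{c}(i, j)$ is the element in row $i$ and column $j$ of channel $c$.

2) Excitation ($F_{\mathrm{ex}}$). Similar to the gating mechanism of recurrent neural networks, a weight is generated for each feature channel through the parameters $\boldsymbol{W}$:

$ \boldsymbol{s}=F_{\mathrm{ex}}(\boldsymbol{z}, \boldsymbol{W})=\sigma(g(\boldsymbol{z}, \boldsymbol{W}))=\sigma\left(\boldsymbol{W}_{2} \delta\left(\boldsymbol{W}_{1} \boldsymbol{z}\right)\right) $

(2)

where $\boldsymbol{s}$ is the vector of weights produced by the Excitation operation, $\boldsymbol{z}$ is the output of the Squeeze operation, $\boldsymbol{W}_{1}$ and $\boldsymbol{W}_{2}$ are fully connected layers, $\delta$ is the ReLU function and $\sigma$ is the sigmoid function.

3) Reweight ($F_{\text {scale }}$). The weights output by Excitation are applied to the original features by multiplication:

$ \tilde{\boldsymbol{x}}_{c}=F_{\text {scale }}\left(\boldsymbol{u}_{c}, s_{c}\right)=s_{c} \boldsymbol{u}_{c} $

(3)

where $\tilde{\boldsymbol{x}}_{c}$ is the $c$-th channel of the final output feature map $\tilde{\boldsymbol{x}}$, $\boldsymbol{u}_c$ is the $c$-th channel of the input feature map, and $s_c$ is the $c$-th weight in $\boldsymbol{s}$.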
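
The three operations of Eqs. (1)~(3) can be sketched in plain Python as below. This is only an illustration of the squeeze-and-excitation pattern: the weights `w1` and `w2` would be learned in practice, biases are omitted, and the function names are hypothetical.

```python
import math
from typing import List

Map2D = List[List[float]]
Feature = List[Map2D]  # layout: [channel][row][col]


def squeeze(u: Feature) -> List[float]:
    """Eq. (1): global average pooling, one value z_c per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]


def matvec(w: List[List[float]], v: List[float]) -> List[float]:
    """Fully connected layer without bias."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def excitation(z: List[float], w1: List[List[float]], w2: List[List[float]]) -> List[float]:
    """Eq. (2): s = sigmoid(W2 * relu(W1 * z))."""
    hidden = [max(0.0, h) for h in matvec(w1, z)]
    return [1.0 / (1.0 + math.exp(-a)) for a in matvec(w2, hidden)]


def channel_attention(u: Feature, w1: List[List[float]], w2: List[List[float]]) -> Feature:
    """Eq. (3): scale every channel u_c by its weight s_c."""
    s = excitation(squeeze(u), w1, w2)
    return [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, u)]


u = [[[1.0, 1.0], [1.0, 1.0]],
     [[0.0, 2.0], [4.0, 6.0]]]
w1 = [[0.0, 0.0]]        # C=2 -> 1 hidden unit (toy reduction)
w2 = [[0.0], [0.0]]      # 1 hidden unit -> C=2
z = squeeze(u)
out = channel_attention(u, w1, w2)   # all-zero weights give s_c = 0.5 for every channel
```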

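The spatial attention mechanism likewise consists of three operations:

1) Pooling layer. Max pooling and average pooling are applied to the original feature map across channels, generating the maps $\boldsymbol{a}(W, H, 1)$ and $\boldsymbol{m}(W, H, 1)$, which are then concatenated so that the resulting weight map carries both maximum and average information.

2) Convolution layer. The concatenated weight map passes through a convolution with kernel size 7 and padding 3, which reduces it from $\boldsymbol{c}(W, H, 2)$ to $\boldsymbol{v}(W, H, 1)$.

3) Reweight layer. The weight map output by the convolution layer is applied to the original features by multiplication, giving the feature map $\boldsymbol{x}(W, H, C)$.

Each feature map is processed by the channel attention and spatial attention mechanisms in parallel and the two outputs are added. This strengthens important detail features in a supervised manner and weakens useless interfering features. At the same time, the spatial information in the feature map undergoes the corresponding spatial transformation so that the important spatial position information is extracted.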
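
The spatial branch and the parallel addition can be sketched as follows. Several points are assumptions made for illustration: the sigmoid gate after the 7×7 convolution (the text only says the weight map is applied by multiplication), zero padding, the absence of a bias term, and all function names. The channel-branch output is passed in from outside so that the snippet stays self-contained.

```python
import math
from typing import List, Tuple

Map2D = List[List[float]]
Feature = List[Map2D]  # layout: [channel][row][col]


def pool_across_channels(u: Feature) -> Tuple[Map2D, Map2D]:
    """Per-position maximum and mean over the channel axis."""
    c, h, w = len(u), len(u[0]), len(u[0][0])
    max_map = [[max(u[k][i][j] for k in range(c)) for j in range(w)] for i in range(h)]
    avg_map = [[sum(u[k][i][j] for k in range(c)) / c for j in range(w)] for i in range(h)]
    return max_map, avg_map


def conv_same(maps: List[Map2D], kernels: List[Map2D]) -> Map2D:
    """Zero-padded convolution (cross-correlation) of stacked maps into one map."""
    size = len(kernels[0])
    pad = size // 2
    h, w = len(maps[0]), len(maps[0][0])
    out = [[0.0] * w for _ in range(h)]
    for m, kern in zip(maps, kernels):
        for i in range(h):
            for j in range(w):
                for di in range(size):
                    for dj in range(size):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:
                            out[i][j] += kern[di][dj] * m[y][x]
    return out


def spatial_attention(u: Feature, kernels: List[Map2D]) -> Feature:
    """Pool, convolve 2 -> 1 maps, gate (sigmoid is an assumption), reweight."""
    max_map, avg_map = pool_across_channels(u)
    v = conv_same([max_map, avg_map], kernels)
    gate = [[1.0 / (1.0 + math.exp(-a)) for a in row] for row in v]
    return [[[g * val for g, val in zip(g_row, row)] for g_row, row in zip(gate, ch)]
            for ch in u]


def parallel_sum(a: Feature, b: Feature) -> Feature:
    """Element-wise addition of the two attention branch outputs."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


u = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
zero_kernels = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]   # 7x7, padding 3
max_map, avg_map = pool_across_channels(u)
spatial_out = spatial_attention(u, zero_kernels)   # gate is 0.5 everywhere
```

In the full module, `parallel_sum` would receive the channel-attention output and `spatial_out` for the same input feature map.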

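The feature maps after the dual attention module are shown in Fig. 8. Fig. 8(a) is the original input image, Fig. 8(b) is the 38×38 pixel feature map output by the ResNet50 backbone, and Fig. 8(c) is the 38×38 pixel feature map after the dual attention module. The comparison shows that Fig. 8(b) essentially retains only edge features, whereas Fig. 8(c), by integrating channel and spatial information, captures edge features, structural information and colour-block information at the same time and highlights the effective features better.

Fig. 8 Comparison of different feature maps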

((a) original picture; (b) backbone network output; (c) dual attention module output)

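1.5 FDA-SSD network structure

After the original input image passes through the backbone and the feature pyramid layers, seven feature layers are obtained, as shown in Fig. 9: new_fpn_r_c3, fpn_r_c4, fpn_r_c5, r_c5_conv1, r_c5_conv2, r_c5_conv3 and r_c5_conv4, with sizes 38×38×512, 19×19×256, 10×10×256, 5×5×256, 3×3×256, 2×2×256 and 1×1×256, respectively. The dual attention mechanism is then applied to compute and assign weights, and the additive fusion yields feature maps that contain both channel and spatial information. FDA-SSD finally generates the same number of default boxes as SSD, but its feature maps contain more effective information, so the bounding boxes are easier to learn.

Fig. 9 Network structure of FDA-SSD model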

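1.6 Prediction boxes and loss design

For a fair comparison, the prediction boxes and the loss follow the design of SSD. The prior boxes are defined by their scale and their aspect ratio. The scale increases linearly as the feature map size decreases:

$ S_{k}=S_{\min }+\frac{S_{\max }-S_{\min }}{m-1}(k-1), \quad k \in[1, m] $

(4)

where $m$ is the number of feature maps. The first layer is set separately; to demonstrate the effectiveness of the proposed method, $m$ is set to 5, as in SSD. $S_{k}$ is the ratio of the prior box to the image (the first box is set to 0.1). $S_{\rm{min}}$ and $S_{\rm{max}}$ are the minimum and maximum ratios, set to the SSD values of 0.2 and 0.9. For the aspect ratio, $\alpha_{r} \in\{1, 2, 3, 1 / 2, 1 / 3\}$ is used, and the width and height of a prior box with a given aspect ratio are

$ w_{k}^{\alpha}=S_{k} \sqrt{\alpha_{r}}, h_{k}^{\alpha}=\frac{S_{k}}{\sqrt{\alpha_{r}}} $

(5)

To ensure that every feature map has prior boxes of aspect ratio 1 at two different sizes, each feature map is given one prior box of scale $S_{k}$ and one of scale $S_{k}^{\prime}=\sqrt{S_{k} S_{k+1}}$. Every feature map therefore generates 6 prior boxes per location, except layers 1, 5 and 6, which do not use the aspect ratios {3, 1/3} and generate 4. In total, FDA-SSD predicts 8 732 bounding boxes.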
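
The prior-box arithmetic can be checked with a few lines of Python. The sketch assumes the standard SSD linear scale rule, S_k = S_min + (S_max − S_min)(k − 1)/(m − 1), with the values quoted in the text, and the six feature map sizes 38, 19, 10, 5, 3 and 1; the function names are hypothetical.

```python
import math
from typing import List, Tuple


def prior_scales(m: int = 5, s_min: float = 0.2, s_max: float = 0.9) -> List[float]:
    """Linearly spaced scales S_1..S_m between s_min and s_max."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def prior_size(scale: float, aspect_ratio: float) -> Tuple[float, float]:
    """Width and height of a prior box: w = S*sqrt(a), h = S/sqrt(a)."""
    root = math.sqrt(aspect_ratio)
    return scale * root, scale / root


def total_priors(map_sizes: List[int], boxes_per_cell: List[int]) -> int:
    """Total number of prior boxes over all square feature maps."""
    return sum(n * n * b for n, b in zip(map_sizes, boxes_per_cell))


scales = prior_scales()                       # roughly 0.2, 0.375, 0.55, 0.725, 0.9
w, h = prior_size(scales[0], 2.0)             # a 2:1 box at the smallest scale
count = total_priors([38, 19, 10, 5, 3, 1],   # layers 1, 5 and 6 use 4 boxes per cell
                     [4, 6, 6, 6, 4, 4])
```

The count works out as 5 776 + 2 166 + 600 + 150 + 36 + 4 = 8 732, matching the figure in the text.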

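The loss function has two parts, the localization loss $L_{\rm{loc}}$ and the confidence loss $L_{\rm{conf}}$. The localization loss uses the SmoothL1 loss to measure the error between the ground-truth box and the predicted box, and the confidence loss uses the Softmax loss to measure classification accuracy. The total loss is

$ L(x, c, l, g)=\frac{1}{N}\left[L_{\mathrm{conf}}(x, c)+\alpha L_{\mathrm{loc}}(x, l, g)\right] $

(6)

where $N$ is the number of positive samples and $x$ is an indicator: $x=0$ means that the prior box does not match the ground-truth box for a given class, and $x=1$ means that it does. $c$ is the predicted class confidence, $l$ is the predicted location of the bounding box, and $g$ is the location of the ground-truth box.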
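
A simplified sketch of Eq. (6) is shown below. It assumes label 0 denotes background, treats every prior with a positive label as matched, and leaves out the hard negative mining and box encoding that SSD uses in practice; the function names are hypothetical.

```python
import math
from typing import List


def smooth_l1(d: float) -> float:
    """Smooth L1 penalty for one coordinate difference."""
    a = abs(d)
    return 0.5 * d * d if a < 1.0 else a - 0.5


def softmax_cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability of the target class under a softmax."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]


def multibox_loss(conf_logits: List[List[float]], labels: List[int],
                  loc_pred: List[List[float]], loc_gt: List[List[float]],
                  alpha: float = 1.0) -> float:
    """L = (L_conf + alpha * L_loc) / N, with N the number of positive priors."""
    n_pos = sum(1 for lab in labels if lab > 0)
    if n_pos == 0:
        return 0.0
    l_conf = sum(softmax_cross_entropy(lg, lab) for lg, lab in zip(conf_logits, labels))
    l_loc = sum(
        smooth_l1(p - g)
        for pred, gt, lab in zip(loc_pred, loc_gt, labels) if lab > 0
        for p, g in zip(pred, gt)
    )
    return (l_conf + alpha * l_loc) / n_pos


loss = multibox_loss(
    conf_logits=[[0.0, 0.0]],             # one prior, two classes
    labels=[1],                           # matched to class 1
    loc_pred=[[0.5, 0.0, 0.0, 0.0]],
    loc_gt=[[0.0, 0.0, 0.0, 0.0]],
)
```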

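2 Experimental results and analysis

2.1 Platform and data

The experiments use PaddlePaddle 1.8.0 as the development framework and run on Tesla V100 and Titan X with CUDA 9.0 and CuDNN 7.6. The datasets are PASCAL VOC (Everingham et al., 2010) 2007+2012 and the remote sensing dataset TGRS-HRRSD-Dataset (high resolution remote sensing detection) (Zhang et al., 2019). The backbone is initialized from a model pre-trained on ImageNet (Russakovsky et al., 2015).

2.2 Data augmentation

To keep the model comparison fair, the preprocessing methods and parameters are the same as in SSD: 1) Normalization. Images are normalized to speed up gradient descent. 2) Random flip. Images are flipped at random. 3) Bounding box normalization. The generated bounding boxes are normalized. 4) Random expansion. Images are expanded with a maximum expansion ratio of 4 and an RGB canvas fill value of (104, 117, 123). 5) Random cropping. Images are cropped at random within a given size range. 6) Random distortion. Brightness, contrast and saturation are randomly distorted within the intervals (0.875, 1.125), (0.5, 1.5) and (0.5, 1.5), respectively.

2.3 Evaluation metrics

The algorithm is evaluated by mean average precision (mAP), which is computed from precision and recall.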

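1) Precision. The proportion of samples predicted as positive that are actually positive:

$ P=\frac{T P}{T P+F P} $

(7)

where $TP$ is the number of samples that are actually positive and predicted as positive, and $FP$ is the number of samples that are actually negative but predicted as positive.

2) Recall. The proportion of samples that are actually positive that are also predicted as positive:

$ R=\frac{T P}{T P+F N} $

(8)

where $FN$ is the number of samples that are actually positive but predicted as negative.

3) The AP (average precision) of each class is computed from precision and recall, and the APs are then averaged to obtain mAP:

$ AP=\sum\limits_{k=1}^{N} p(k) \Delta r(k), \quad m A P=\frac{1}{m} \sum\limits_{c=1}^{m} A P_{c} $

(9)

where $N$ is the total number of images, $p(k)$ is the precision at the $k$-th image, $\Delta r(k)$ is the change in recall from the $(k-1)$-th image to the $k$-th image, $AP_{c}$ is the AP of class $c$, and $m$ is the number of classes over all images.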
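
The metrics of Eqs. (7)~(9) can be sketched as below. The sketch assumes the common reading in which $k$ runs over detections ranked by descending confidence, and it uses the plain sum of precision times recall increment rather than any interpolated VOC protocol; the function names are hypothetical.

```python
from typing import List


def precision(tp: int, fp: int) -> float:
    """Eq. (7): TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Eq. (8): TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(is_tp: List[bool], num_gt: int) -> float:
    """Sum of p(k) * delta_r(k) over detections sorted by descending confidence."""
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        r = tp / num_gt
        ap += precision(tp, fp) * (r - prev_recall)
        prev_recall = r
    return ap


def mean_average_precision(aps: List[float]) -> float:
    """Average of the per-class AP values."""
    return sum(aps) / len(aps)


# Three ranked detections for one class with two ground-truth objects.
ap = average_precision([True, False, True], num_gt=2)
m_ap = mean_average_precision([1.0, 0.5])
```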

2.4 Training parameters

The training parameters are listed in Table 1.

Table 1 Training parameters

Parameter	Value
base_lr	0.001
max_iter	180 000/360 000
image_size	300/512
batch_size	16/8
gamma	0.1
type	“Momentum”
momentum	0.9
start_factor(warmup)	0
steps(warmup)	1 000/2 000
type(regularizer)	“L2”
factor(regularizer)	0.000 5

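2.5 Fusion strategy experiments

Table 2 shows the results of the eight strategies of Section 1.2 on Tesla V100. The input image size is 300×300 pixels, training runs for 150 000 iterations with a batch_size of 16, the training set is PASCAL VOC2007+2012 and the test set is PASCAL VOC2007.

Table 2 Comparison of results of various schemes

Strategy	FPS/(frames/s)	mAP/%
Strategy 1)	207	77.47
Strategy 2)	215	77.87
Strategy 3)	220	77.50
Strategy 4)	218	77.64
Strategy 5)	212	78.59
Strategy 6)	179	78.02
Strategy 7)	220	77.42
Strategy 8)	206	78.19
Note: bold font marks the best result in each column.

According to Table 2, strategy 5) is more accurate than all other schemes. Strategies 3) and 7) are faster, but their accuracy is too low. Weighing accuracy against detection speed, strategy 5) is chosen as the best strategy, and the SSD improved with this strategy is named F-SSD (fusion single shot multibox detector).

Finally, the dual attention module is added on top of F-SSD in an ablation experiment; Table 3 shows the results on Tesla V100. The dual attention module improves accuracy (from 78.59% to 78.97% mAP) while keeping the detection speed close to that of the model without it (194 versus 212 frames/s). The resulting model is named FDA-SSD.

Table 3 Attention ablation experiment

Structure	FPS/(frames/s)	mAP/%
Strategy 5)	212	78.59
Strategy 5) + dual attention module	194	78.97
Note: bold font marks the best result in each column.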

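2.6 Backbone experiments

With strategy 5) as the fusion strategy, different backbones are tested. The training set is PASCAL VOC2007+2012 and the test set is PASCAL VOC 2007. Training runs for 360 000 iterations with a batch_size of 8 and for 150 000 iterations with a batch_size of 16. Table 4 shows the results on Tesla V100. The performance of FDA-SSD varies with the ResNet backbone. Whether the input size is 300×300 or 512×512 pixels, ResNet50 gives the fastest detection, and CBResNet (composite backbone residual neural network) gives the highest accuracy. Replacing ResNet with CBResNet raises accuracy by 1.7% on average and lowers detection speed by 32 frames/s on average. Backbone depth also matters: for both ResNet and CBResNet, the 50-layer version is about 1% less accurate than the 101-layer version but faster. For FDA-SSD, switching to a deeper backbone is therefore an effective way to raise accuracy.

Table 4 Test results of different ResNets and image sizes

Backbone	Image size/pixels	Batch size	FPS/(frames/s)	mAP/%
ResNet50	300×300	16	194	79.0
ResNet101	300×300	16	165	79.8
ResNet50	512×512	8	83	80.7
ResNet101	512×512	8	63	81.6
CBResNet50	300×300	16	145	80.7
CBResNet101	300×300	16	109	81.6
CBResNet50	512×512	8	59	82.1
CBResNet101	512×512	8	46	83.2
Note: bold font marks the best result in each column.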

2.7 Experiments on the PASCAL VOC 2007 dataset

Table 5 compares various models on the PASCAL VOC2007 dataset, including the mainstream detectors Faster RCNN and the YOLO series, SSD and its improved versions, and several other detectors. The results in the table are the best results reported in the original papers or evaluation results from public platforms.

Table 5 Comparison of results of various algorithms

Method	Backbone	GPU	Input size/pixels	FPS/(frames/s)	mAP/%
Faster RCNN (Ren et al., 2017)	VGGNet	Titan X	600×1 000	7	73.2
Faster RCNN (Ren et al., 2017)	ResNet101	K40	600×1 000	2.4	76.4
R-FCN (Dai et al., 2016)	ResNet101	Titan X	300×300	32	80.5
OHEM (Shrivastava et al., 2016)	VGGNet	Titan X	600×1 000	3	78.9
MLKP (Wang et al., 2018)	VGGNet	1080Ti	600×1 000	10	78.1
R-DAD (Bae, 2019)	ResNet101	Titan X	600×1 000	5	77.6
YOLOv2 (Redmon and Farhadi, 2017)	DarkNet19	Titan X	352×352	81	73.7
YOLOv3^*	DarkNet53	Tesla V100	320×320	-	82.2
YOLOv3^*	ResNet34	Tesla V100	320×320	-	80.1
SSD (Liu et al., 2016)	VGGNet	1080Ti	300×300	85	77.2
SSD (Liu et al., 2016)	VGGNet	Titan X	512×512	19	78.5
SSD (Liu et al., 2016)	VGGNet	Titan X	300×300	46	77.2
SSD^*	VGGNet	Tesla V100	300×300	163	78.0
SSD^*	VGGNet	Tesla V100	512×512	71	80.3
DSOD (Shen et al., 2020)	DS/64-192-48-1	Titan X	300×300	17.4	77.7
STDN (Zhou et al., 2018)	DenseNet-169	Titan X	300×300	41.5	78.1
AFP-SSD (Liu and Wang, 2020)	VGGNet	Titan X	300×300	21	79.3
F_SE_SSD (Zheng et al., 2020)	VGGNet	1080Ti	300×300	65	80.4
BPN (Wu et al., 2020)	VGGNet	Titan X	320×320	32.4	80.3
RefineDet (Zhang et al., 2018)	VGGNet	Titan X	320×320	40.3	80.0
DSSD (Fu et al., 2017)	ResNet101	Titan X	321×321	9.5	78.6
DSSD (Fu et al., 2017)	ResNet101	Titan X	513×513	5.5	81.5
RSSD (Jeong et al., 2017)	VGGNet	Titan X	300×300	35	78.5
RSSD (Jeong et al., 2017)	VGGNet	Titan X	512×512	16.6	80.8
FSSD (Li and Zhou, 2017)	VGGNet	1080Ti	300×300	66	78.8
FSSD (Li and Zhou, 2017)	VGGNet	1080Ti	512×512	36	80.9
FDA_SSD (this paper)	ResNet101	Tesla V100/Titan X	300×300	165/47	79.8
FDA_SSD (this paper)	ResNet101	Tesla V100/Titan X	512×512	63/18	81.6
FDA_SSD (this paper)	CBResNet101	Tesla V100/Titan X	512×512	46/13	83.2
Note: bold font marks the best result in each column; "*" denotes an enhanced version optimized by PaddleDetection; "-" means the result is not listed.

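Taking the 300×300 pixel input as an example, the inference speed of FDA-SSD is comparable to that of the original SSD and clearly higher than that of improved SSD algorithms such as DSSD, RSSD, FSSD and AFP-SSD (atrous filter pyramid single shot detector). In accuracy, FDA-SSD already outperforms the great majority of improved SSD models. Considering speed and performance together, FDA-SSD achieves the best balance. The experiments also show that FDA-SSD reaches even higher accuracy with CBResNet as the backbone, which indicates that the proposed algorithm adapts well to different backbones.

Fig. 10 shows the mAP-FPS scatter plot of the models. FDA-SSD512 reaches an mAP of 81.6%, comparable to DSSD512 with a ResNet-101 backbone, the most accurate of the improved SSD models, while being more than three times as fast. FDA-SSD300 reaches an mAP of 79.8%, 1.0 percentage point higher than FSSD300 with a VGG16 backbone, the most accurate of the improved SSD models at this input size, and it is also faster. Among the models with 300×300 pixel input, FDA-SSD300 lies closest to the upper right of the plot, which shows its advantage in both performance and speed.

Fig. 10 FPS and mAP of each algorithm on the PASCAL VOC2007 test set

Fig. 11 compares the predictions of SSD and FDA-SSD on several images. The upper half shows some predictions on the PASCAL VOC2012 test set by an SSD model trained on the PASCAL VOC2007+2012 training set; the lower half shows some predictions by FDA-SSD on the PASCAL VOC2012 test set. The comparison shows that SSD has difficulty with occluded objects (Fig. 11(a)), small objects (Fig. 11(b)), overlapping objects (Fig. 11(c)), blurred images (Fig. 11(d)) and large elongated objects (Fig. 11(e)), whereas FDA-SSD is considerably more robust.

Fig. 11 Comparison of prediction results between the SSD model and the FDA-SSD model on the PASCAL VOC test2007 dataset

((a) occluded target; (b) small target; (c) overlapping targets; (d) blurred target; (e) large elongated target)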

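2.8 Experiments on the TGRS-HRRSD-Dataset

TGRS-HRRSD-Dataset is a dataset for object detection in high-resolution remote sensing images. It contains 21 761 aerial images and 55 740 object instances in 13 classes such as airplane, ship and bridge, with spatial resolutions from 0.15 m to 1.2 m. In the experiments, 10 818 images are used for training and 10 943 for testing. All images are scaled to 300×300 pixels as model input. Training runs for 90 k iterations: the learning rate is 0.001 for the first 60 k iterations, 0.000 1 for the next 20 k and 0.000 01 for the last 10 k. Table 6 shows the results on Tesla V100. With every backbone, FDA-SSD is both faster and more accurate than SSD, by 10% and 1.6% on average, respectively. FDA-SSD with the ResNet50 backbone is the fastest at 61.2 frames/s, 12% faster than SSD. FDA-SSD with the CBResNet101 backbone is the most accurate at 84.9%, 2.2% higher than SSD. FDA-SSD with the ResNet101 backbone is 10% faster than SSD and 1.5% more accurate. For FDA-SSD, a deeper backbone with stronger fitting capacity therefore gives better performance.

Table 6 Comparison of algorithm results on TGRS-HRRSD-Dataset

Algorithm	Backbone	FPS/(frames/s)	mAP/%
SSD300	VGGNet	54.7	82.7
FDA-SSD	ResNet50	61.2	84.0
FDA-SSD	ResNet101	60.1	84.2
FDA-SSD	CBResNet101	58.9	84.9
Note: bold font marks the best result in each column.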
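
The step learning-rate schedule described for the TGRS-HRRSD-Dataset run (0.001 for the first 60 k iterations, 0.000 1 for the next 20 k, 0.000 01 for the last 10 k) can be written as a small lookup. The function name and the zero-based iteration index are assumptions made for illustration.

```python
def hrrsd_learning_rate(iteration: int) -> float:
    """Piecewise-constant learning rate for a 90 k iteration run (0-based index)."""
    if iteration < 0 or iteration >= 90_000:
        raise ValueError("iteration must lie in [0, 90000)")
    if iteration < 60_000:
        return 1e-3
    if iteration < 80_000:
        return 1e-4
    return 1e-5


first, middle, last = (hrrsd_learning_rate(i) for i in (0, 60_000, 89_999))
```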

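3 Conclusion

This paper proposes FDA-SSD, a single-stage object detection model based on optimized fusion strategy selection and a dual attention mechanism. First, the deeper ResNet replaces the shallower VGG16 of the original SSD as the backbone. Second, a method for the optimized selection of the fusion strategy is proposed: a feature pyramid is added after the backbone to produce a candidate set of feature maps that fuse various cross-layer information, and a group of feature maps with strong representational power and rich semantic information is selected through different feature fusion, layer deletion and convolution operations. Finally, a dual attention mechanism is added to further optimize the channel weights of each feature map. The experiments show that, compared with many existing improved SSD models, the proposed model is advantageous in both accuracy and speed and performs well on difficult targets such as small objects, occluded objects and blurred images.

Although FDA-SSD performs well among SSD-based improved models, this work focuses mainly on optimizing the feature maps. Prediction box generation and non-maximum suppression still leave considerable room for research and will be pursued in future work.

References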

Bae S H. 2019. Object detection based on region decomposition and assembly. Proceedings of the AAAI Conference on Artificial Intelligence, 33(1): 8094-8101 [DOI:10.1609/aaai.v33i01.33018094]

Dai J F, Li Y, He K M and Sun J. 2016. R-FCN: object detection via region-based fully convolutional network[EB/OL]. [2021-03-02]. https://arxiv.org/pdf/1605.06409.pdf

Everingham M, Van Gool L, Williams C K I, Winn J, Zisserman A. 2010. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2): 303-338 [DOI:10.1007/s11263-009-0275-4]

Fang L P, He H J, Zhou G M. 2018. Research overview of object detection methods. Computer Engineering and Applications, 54(13): 11-18, 33 (方路平, 何杭江, 周国民. 2018. 目标检测算法研究综述. 计算机工程与应用, 54(13): 11-18, 33) [DOI:10.3778/j.issn.1002-8331.1804-0167]

Fu C Y, Liu W, Ranga A, Tyagi A and Berg A C. 2017. DSSD: deconvolutional single shot detector[EB/OL]. [2021-03-02]. https://arxiv.org/pdf/1701.06659.pdf

Ge B Y, Zuo X Z, Hu Y J. 2018. Review of visual object tracking technology. Journal of Image and Graphics, 23(8): 1091-1107 (葛宝义, 左宪章, 胡永江. 2018. 视觉目标跟踪方法研究综述. 中国图象图形学报, 23(8): 1091-1107) [DOI:10.11834/jig.170604]

Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 580-587[DOI: 10.1109/CVPR.2014.81]

Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 1440-1448[DOI: 10.1109/ICCV.2015.169]

He K M, Gkioxari G, Dollar P, Girshick R. 2020. Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2): 386-397 [DOI:10.1109/TPAMI.2018.2844175]

He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 770-778[DOI: 10.1109/CVPR.2016.90]

Hu J, Shen L, Albanie S, Sun G and Wu E H. 2018. Squeeze-and-excitation networks//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 7132-7141[DOI: 10.1109/CVPR.2018.00745]

Jeong J, Park H and Kwak N. 2017. Enhancement of SSD by concatenating feature maps for object detection//Proceedings of the British Machine Vision Conference. London, UK: BMVA Press: 76.1-76.12[DOI: 10.5244/C.31.76]

Law H, Deng J. 2020. CornerNet: detecting objects as paired keypoints. International Journal of Computer Vision, 128(3): 642-656 [DOI:10.1007/s11263-019-01204-1]

Li Z X and Zhou F Q. 2017. FSSD: feature fusion single shot multibox detector[EB/OL]. [2021-03-02]. https://arxiv.org/pdf/1712.00960.pdf

Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S J. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 936-944[DOI: 10.1109/CVPR.2017.106]

Lin T Y, Goyal P, Girshick R, He K M, Dollár P. 2020. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2): 318-327 [DOI:10.1109/TPAMI.2018.2858826]

Liu T, Wang X L. 2020. Single-stage object detection using filter pyramid and atrous convolution. Journal of Image and Graphics, 25(1): 102-112 (刘涛, 汪西莉. 2020. 采用卷积核金字塔和空洞卷积的单阶段目标检测. 中国图象图形学报, 25(1): 102-112) [DOI:10.11834/jig.190166]

Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot multibox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 21-37[DOI: 10.1007/978-3-319-46448-0_2]

Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, USA: IEEE: 779-788[DOI: 10.1109/CVPR.2016.91]

Redmon J and Farhadi A. 2017. YOLO9000: better, faster, stronger//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Honolulu, USA: IEEE: 6517-6525[DOI: 10.1109/CVPR.2017.690]

Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement[EB/OL]. [2021-04-02]. https://arxiv.org/pdf/1804.02767.pdf

Ren S Q, He K M, Girshick R, Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI:10.1109/TPAMI.2016.2577031]

Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S A, Huang Z H, Karpathy A, Khosla A, Bernstein M, Berg A C, Li F F. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252 [DOI:10.1007/s11263-015-0816-y]

Shen Z Q, Liu Z, Li J G, Jiang Y G, Chen Y R, Xue X Y. 2020. Object detection from scratch with deep supervision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2): 398-412 [DOI:10.1109/TPAMI.2019.2922181]

Shrivastava A, Gupta A and Girshick R. 2016. Training region-based object detectors with online hard example mining//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 761-769[DOI: 10.1109/CVPR.2016.89]

Simonyan K and Zisserman A. 2015. Very deep convolutional networks for large-scale image recognition[EB/OL]. [2021-03-02]. https://arxiv.org/pdf/1409.1556.pdf

Tang Q K, Hu Y. 2020. PosNeg-balanced anchors with aligned features for single-shot object detection. Journal of Computer-Aided Design and Computer Graphics, 32(11): 1773-1783 (唐乾坤, 胡瑜. 2020. 基于正负锚点框均衡及特征对齐的单阶段目标检测算法. 计算机辅助设计与图形学学报, 32(11): 1773-1783) [DOI:10.3724/SP.J.1089.2020.18175]

Wang H, Wang Q L, Gao M Q, Li P H and Zuo W M. 2018. Multi-scale location-aware kernel representation for object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1248-1257[DOI: 10.1109/CVPR.2018.00136]

Woo S, Park J, Lee J Y and Kweon I S. 2018. CBAM: convolutional block attention module//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-19[DOI: 10.1007/978-3-030-01234-2_1]

Wu X W, Sahoo D, Zhang D X, Zhu J K, Hoi S C H. 2020. Single-shot bidirectional pyramid networks for high-quality object detection. Neurocomputing, 401: 1-9 [DOI:10.1016/j.neucom.2020.02.116]

Yin H P, Chen B, Chai Y, Liu Z D. 2016. Vision-based object detection and tracking: a review. Acta Automatica Sinica, 42(10): 1466-1489 (尹宏鹏, 陈波, 柴毅, 刘兆栋. 2016. 基于视觉的目标检测与跟踪综述. 自动化学报, 42(10): 1466-1489) [DOI:10.16383/j.aas.2016.c150823]

Zhang H L, Hu S Q, Yang G S. 2015. Video object tracking based on appearance models learning. Journal of Computer Research and Development, 52(1): 177-190 (张焕龙, 胡士强, 杨国胜. 2015. 基于外观模型学习的视频目标跟踪方法综述. 计算机研究与发展, 52(1): 177-190) [DOI:10.7544/issn1000-1239.2015.20130995]

Zhang S F, Wen L Y, Bian X, Lei Z and Li S Z. 2018. Single-shot refinement neural network for object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4203-4212[DOI: 10.1109/CVPR.2018.00442]

Zhang Y, Yuan Y, Feng Y, Lu X Q. 2019. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Transactions on Geoscience and Remote Sensing, 57(8): 535-554 [DOI:10.1109/TGRS.2019.2900302]

Zheng P, Bai H Y, Li W, Guo H W. 2020. Small target detection algorithm in complex background. Journal of Zhejiang University (Engineering Science), 54(9): 1777-1784 (郑浦, 白宏阳, 李伟, 郭宏伟. 2020. 复杂背景下的小目标检测算法. 浙江大学学报(工学版), 54(9): 1777-1784) [DOI:10.3785/j.issn.1008-973X.2020.09.014]

Zhou P, Ni B B, Geng C, Hu J G and Xu Y. 2018. Scale-transferrable object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 528-537[DOI: 10.1109/CVPR.2018.00062]

Zhou X Y, Wang D Q and Krähenbühl P. 2019. Objects as points[EB/OL]. [2021-03-02]. https://arxiv.org/pdf/1904.07850.pdf