Single-stage object detection with optimized fusion strategy selection and dual attention

Dai Kun, Xu Libo, Huang Shiyang, Li Yunling (School of Computer and Data Engineering, NingboTech University, Ningbo 315000)

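Abstract
Objective Feature fusion is one of the effective ways to ease the detection of difficult targets such as blurred images, small objects, and occluded objects. To use feature fusion more effectively for integrating feature information from different network levels, and to express the important features prominently, this paper proposes FDA-SSD (fusion double attention single shot multibox detector), a single-stage object detection algorithm based on fusion strategy selection and a dual attention mechanism. Method A method for optimizing the choice of fusion strategy is designed and combined with the feature pyramid network (FPN) to determine the best combination of multi-layer feature maps and the fusion process. A dual attention module is then attached, which redistributes the weights of each channel and of the spatial features. This raises the model's sensitivity to channel features and spatial information and finally produces a group of feature maps that contain rich semantic information and highlight important features. Result Comparative experiments were run on the public datasets PASCAL VOC2007 (pattern analysis, statistical modelling and computational learning visual object classes) and TGRS-HRRSD-Dataset (high resolution remote sensing detection). On the PASCAL VOC2007 test set with 300×300 pixel input, the FDA-SSD model reaches an accuracy of 79.8%, which is 2.6%, 1.3%, 1.2%, and 1.0% higher than SSD (single shot multibox detector), RSSD (rainbow SSD), DSSD (de-convolution SSD), and FSSD (feature fusion SSD), respectively. Its detection speed on a Titan X is 47 frames per second (FPS), comparable to SSD and 12 FPS and 37.5 FPS faster than RSSD and DSSD, respectively. On the TGRS-HRRSD-Dataset test set with 300×300 pixel input, the accuracy is 84.2%; on a Tesla V100, detection is 10% faster than the SSD model while accuracy improves by 1.5%. Conclusion Introducing fusion strategy selection and a dual attention mechanism into a single-stage object detection model improves prediction speed and accuracy together, and also considerably improves the detection of difficult targets such as small objects, occluded objects, and blurred images.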
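As a quick sanity check on the reported comparison, the baseline figures implied by the margins above can be back-computed. This is a minimal sketch that assumes the accuracy margins are percentage points and the speed margins are absolute FPS differences; the variable names are illustrative and do not come from the paper, and the baseline values are derived here rather than stated in the abstract.

```python
# Figures quoted in the abstract (PASCAL VOC2007 test set, 300x300 input).
FDA_SSD_ACCURACY = 79.8  # percent
FDA_SSD_FPS = 47.0       # measured on a Titan X

# Reported margins of FDA-SSD over each baseline.
# Assumption: accuracy margins are percentage points, not relative gains.
accuracy_margin = {"SSD": 2.6, "RSSD": 1.3, "DSSD": 1.2, "FSSD": 1.0}
# Assumption: speed margins are absolute differences in FPS.
fps_margin = {"RSSD": 12.0, "DSSD": 37.5}

# Back-compute the baseline values implied by those margins.
implied_accuracy = {
    name: round(FDA_SSD_ACCURACY - margin, 1)
    for name, margin in accuracy_margin.items()
}
implied_fps = {
    name: round(FDA_SSD_FPS - margin, 1)
    for name, margin in fps_margin.items()
}

for name, accuracy in implied_accuracy.items():
    print(f"{name}: implied accuracy {accuracy}%")
for name, fps in implied_fps.items():
    print(f"{name}: implied speed {fps} FPS")
```

Under these assumptions the margins imply, for example, 77.2% for SSD and 9.5 FPS for DSSD, which makes the trade-off concrete: the accuracy gain over DSSD is modest, while the speed gain is roughly fivefold.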
Keywords
Single stage object detection algorithm based on fusing strategy optimization selection and dual attention mechanism

Dai Kun, Xu Libo, Huang Shiyang, Li Yunling (School of Computer and Data Engineering, NingboTech University, Ningbo 315000, China)

Abstract
Objective Object detection is a core task in computer vision and deep learning. It is widely used in industrial inspection, intelligent transportation, face recognition, and other settings. Mainstream object detection algorithms fall into two categories. The first is two-stage algorithms, such as the region-based convolutional neural network (R-CNN), Fast R-CNN, online hard example mining (OHEM), Faster R-CNN, and Mask R-CNN, which first generate candidate boxes and then classify and regress them. The second is single-stage algorithms, such as you only look once (YOLO) and the single shot multibox detector (SSD). In addition, anchor-free models such as corner network (CornerNet) and center network (CenterNet) discard anchor boxes and detect and match objects from key points; they achieve good results, but a small gap remains relative to anchor-based detectors. In practical applications of single-stage detection, the main challenges are difficult targets, such as blurred images, small objects, and occluded objects, and the balance between prediction accuracy and speed. Feature fusion improves the detection of difficult targets by fusing deep and shallow features of the network, and it is common in improved SSD models. However, most of these models apply feature fusion directly and give little consideration to the specific fusion strategy, such as which feature maps to fuse and how to process them. Attention mechanisms, in turn, give a feature map a certain "focus" effect by assigning weights along its dimensions, and how to combine attention effectively with single-stage detection remains worth exploring.
Method The shallow Visual Geometry Group (VGG) network in the original SSD algorithm is replaced with a deep residual network as the backbone. First, a method for optimizing the choice of fusion strategy is designed, following the idea of the feature pyramid network (FPN). FPN is applied to the four output layers of the backbone to capture detailed feature information. During the enhanced fusion step, lower-layer features are retained through downsampling while the size of the largest feature map stays unchanged, which balances speed and performance. As for the order of operations, applying FPN first and enhanced fusion second, which is equivalent to one step of reversed FPN, works better than applying enhanced fusion first and FPN second. The r_c5_conv4 layer, which is the same as the r_c5_conv3 layer, is then removed to reduce interference. To describe the target better, the resulting feature maps combine the detailed features of high-resolution feature maps with the rich semantic representation of low-resolution feature maps. Then, drawing on the bottleneck attention module (BAM) and the squeeze-and-excitation network (SENet), a parallel dual attention mechanism is designed to integrate the channel and spatial information of the feature map. Channel attention and spatial attention together sharpen the feature map's focus on the shape and position of the target. For each feature map, the channel attention and spatial attention branches are computed in parallel and added, which strengthens the supervision of key features: key features are reinforced and redundant, interfering features are weakened. At the same time, the spatial information in the feature map is transformed to extract key spatial position information.
Finally, a group of feature maps with rich semantic information and distinctive features is obtained.
Result Comparative experiments are carried out on PASCAL VOC2007 (pattern analysis, statistical modelling and computational learning visual object classes) and TGRS-HRRSD-Dataset (IEEE Transactions on Geoscience and Remote Sensing, High Resolution Remote Sensing Detection). On the PASCAL VOC2007 test set with 300×300 pixel input, the accuracy of the fusion double attention (FDA)-SSD model reaches 79.8%, which is 2.6%, 1.3%, 1.2%, and 1.0% higher than the SSD, rainbow single shot detector (RSSD), de-convolution single shot detector (DSSD), and feature fusion single shot detector (FSSD) models, respectively. The detection speed on a Titan X is 47 frames per second (FPS), which is comparable to SSD and 12 FPS and 37.5 FPS higher than RSSD and DSSD, respectively. With 512×512 pixel input on the PASCAL VOC2007 test set, the accuracy is 81.6% and the detection speed on a Titan X is 18 FPS, better than most algorithms in both accuracy and speed. On the TGRS-HRRSD-Dataset test set with 300×300 pixel input, the accuracy is 84.2%; on a Tesla V100, detection is 10% faster than the SSD model and accuracy is 1.5% higher. The algorithm performs well on both conventional and aerial datasets, which reflects its stability and portability.
Conclusion This work proposes an optimized feature map selection method and a dual attention mechanism. Compared with many existing improved SSD models, the model is advantageous in both accuracy and speed, and it also performs well on small, occluded, blurred, and other challenging targets. Although FDA-SSD performs well among SSD variants, the analysis focuses mainly on optimizing the feature maps. Prediction box generation and non-maximum suppression remain to be studied in future work.
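The parallel dual attention idea described in the Method can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the learned layers of SENet and BAM are omitted, the spatial gate is a simple stand-in (mean over channels followed by a sigmoid), and combining the two branches as a sum of two reweighted copies of the input is only one plausible reading of "parallel addition". All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate(fmap):
    """One weight per channel: global average pooling, then a sigmoid.

    Assumption: the learned fully connected layers of SENet are omitted.
    """
    gates = []
    for channel in fmap:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        gates.append(sigmoid(total / count))
    return gates


def spatial_gate(fmap):
    """One weight per pixel: mean over channels, then a sigmoid.

    Assumption: a stand-in for the learned convolutions used in BAM.
    """
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [sigmoid(sum(fmap[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]


def dual_attention(fmap):
    """Reweight the input by both gates in parallel and add the results.

    Assumption: output = F * channel_gate + F * spatial_gate.
    """
    cg = channel_gate(fmap)
    sg = spatial_gate(fmap)
    return [
        [
            [value * (cg[k] + sg[i][j]) for j, value in enumerate(row)]
            for i, row in enumerate(channel)
        ]
        for k, channel in enumerate(fmap)
    ]


# Toy feature map with 2 channels of size 2x2 (channels, height, width).
feature_map = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
attended = dual_attention(feature_map)
```

In this toy input the all-zero channel stays zero, while the active channel is scaled by the sum of its channel weight and the per-pixel spatial weight. A trained model would learn both gates instead of computing them from fixed pooling alone.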
Keywords
