Jiang Wentao, Zhang Chi, Zhang Shengchong, Liu Wanjun (College of Software, Liaoning Technical University, Huludao 125105, China; Graduate School, Liaoning Technical University, Huludao 125105, China; Science and Technology on Electro-Optical Information Security Control Laboratory, Tianjin 300308, China)
Objective In natural scene images, the quality of feature extraction is the key factor determining target detection performance. Most detection algorithms exploit the powerful learning ability of convolutional neural networks (CNNs) to obtain prior knowledge of the target and perform detection based on that knowledge. However, the low-level features of a CNN lack representational power, while its high-level features are weak at detecting small-scale targets. Method The original SSD (single shot multibox detector) network is used to extract feature maps, which are unified to 256 channels by 1×1 convolution layers. Deconvolution increases the spatial resolution of the top-down feature maps, and the feature maps from the two directions are fused by element-wise addition. The fused feature maps are convolved with a 3×3 kernel to reduce the aliasing effect of fusion. These steps construct feature maps with strong semantic information while preserving the detail of the original feature maps. Finally, the predicted boxes are aggregated, and non-maximum suppression (NMS) produces the final detections. Result On the PASCAL VOC 2007 and PASCAL VOC 2012 data sets, the model achieves mAP (mean average precision) values of 78.9% and 76.7%, which are 1.4 and 0.9 percentage points higher than those of the classical SSD algorithm, respectively; in addition, on small-scale targets the proposed method improves mAP by 8.3 percentage points over the classical SSD model. Conclusion A multiscale feature map fusion target detection algorithm is proposed that propagates semantic information in a top-down manner and constructs feature maps with strong semantics to achieve accurate target detection.
Multiscale feature map fusion algorithm for target detection
Jiang Wentao,Zhang Chi,Zhang Shengchong,Liu Wanjun(College of Software, Liaoning Technical University, Huludao 125105, China;Graduate School, Liaoning Technical University, Huludao 125105, China;Science and Technology on Electro-Optical Information Security Control Laboratory, Tianjin 300308, China)
Objective The development of science and technology has made it possible to obtain numerous images from imaging equipment, the Internet, and image databases, and has raised people's requirements for image processing. Consequently, image-processing technology has developed rapidly, broadly, and deeply. Target detection is an important research topic in computer vision. Rapid and accurate localization and recognition of specific targets in uncontrolled natural scenes is a vital functional basis of many artificial intelligence applications. However, several major difficulties remain in target detection. First, many small objects are widely distributed in visual scenes, and their existence challenges the agility and reliability of detection algorithms. Second, detection accuracy and speed are coupled, and many technical bottlenecks must be overcome to balance these two factors. Finally, large-scale model parameters are an important obstacle to deploying deep networks on chips; compressing model size while ensuring detection accuracy is a meaningful and urgent problem. Targets with a simple background, sufficient illumination, and no occlusion are relatively easy to detect, whereas targets that blend into the background, are occluded, are weakly illuminated, or vary greatly in posture are difficult to detect. In natural scene images, the quality of feature extraction is the key factor determining target detection performance. Decades of research have produced increasingly robust detection algorithms, and deep learning has achieved great breakthroughs in computer vision in recent years. Target detection frameworks based on deep learning have become mainstream, and two main branches have emerged: algorithms based on candidate regions and algorithms based on regression.
Most current detection algorithms use the powerful learning ability of convolutional neural networks (CNNs) to obtain prior knowledge of the target and perform detection according to that knowledge. The low-level features of a CNN are characterized by high resolution and accurate position information but weak abstract semantics and limited feature representation. High-level features are characterized by strong semantic discrimination but low resolution and a weak ability to detect small-scale targets. Therefore, in this study, contextual semantic information is propagated by combining high- and low-level feature maps so that semantic information becomes complete and evenly distributed. Method While balancing detection speed and accuracy, the multiscale feature map fusion target detection algorithm in this study takes the single-shot multibox detector (SSD) network as the base network and adds a feature fusion module to obtain feature maps with rich, evenly distributed semantic information. The fusion structure transmits the semantic information of feature maps at different levels from top to bottom, reducing the semantic gap between levels. The original SSD network first extracts feature maps, which are unified to 256 channels through 1×1 convolution layers. The spatial resolution of the top-down feature maps is then increased by deconvolution, so that the feature maps coming from the two directions have the same spatial resolution. The feature maps from both directions are then fused by element-wise addition to obtain feature maps with complete, evenly distributed semantic information. The fused feature map is convolved with a 3×3 kernel to reduce the aliasing effect of fusion.
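The fusion step described above (upsample the deeper map, add it element-wise to the shallower map, then smooth) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: nearest-neighbour upsampling stands in for the learned deconvolution, the 1×1 channel-unifying and 3×3 smoothing convolutions are omitted, and the single-channel shapes are toy values.

```python
import numpy as np

def upsample2x(fm):
    """Nearest-neighbour 2x upsampling (stands in for the learned deconvolution)."""
    return fm.repeat(2, axis=0).repeat(2, axis=1)

def fuse(shallow, deep):
    """Element-wise addition of a shallow map and the upsampled deeper map."""
    up = upsample2x(deep)
    assert up.shape == shallow.shape  # both directions must match in resolution
    return shallow + up

# Toy single-channel feature maps: an 8x8 shallow map and a 4x4 deep map.
shallow = np.ones((8, 8))
deep = np.full((4, 4), 2.0)
fused = fuse(shallow, deep)  # every element is 1 + 2 = 3
```

In the actual network a 3×3 convolution would follow the addition to suppress the aliasing introduced by upsampling; here the point is only the resolution matching and element-wise merge.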
A feature map with strong semantic information is constructed according to the abovementioned steps while the details of the original feature map are retained. Lastly, the predicted bounding boxes are aggregated, and non-maximum suppression (NMS) is used to obtain the final detection results. Result Key problems in the practical application of target detection algorithms, along with difficult cases in target detection, are analyzed according to the research progress and task requirements of visual target detection technology, and current solutions are given. The target detection algorithm based on multiscale feature map fusion achieves good results on weak targets, multiple targets, cluttered backgrounds, occlusion, and other detection difficulties. Experiments are performed on the PASCAL VOC 2007 and 2012 data sets. The mean average precision (mAP) values of the proposed model are 78.9% and 76.7%, which are 1.4 and 0.9 percentage points higher than those of the classical SSD algorithm, respectively. In addition, when detecting small-scale targets, the proposed method improves mAP by 8.3 percentage points over the classical SSD model. Conclusion The multiscale feature map fusion target detection algorithm proposed in this study uses a convolutional neural network to extract convolutional features instead of the traditional manual feature extraction process, thereby expanding semantic information in a top-down manner and constructing feature maps with strong semantics. The model can be applied to new scene images in demanding visual tasks. Replacing traditional manual features with convolutional features avoids the feature selection problem of traditional detection methods.
The deep convolutional features have strong expressive ability. The multiscale feature map fusion detection model is obtained through iterative training on the basis of the SSD network and performs well on small-scale target detection tasks. While realizing end-to-end training of the detection algorithm, the model also improves robustness to various complex scenes and the accuracy of target detection, thereby achieving accurate detection. This study provides a general and concise way to address the problem of small-scale target detection.
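The final aggregation step named in the abstract, non-maximum suppression, can be sketched in pure Python. The greedy IoU-based variant below is the standard formulation; the boxes, scores, and 0.5 overlap threshold are illustrative values, not taken from the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)          # highest-scoring remaining box is kept
        keep.append(i)
        # discard every remaining box that overlaps it too heavily
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # kept == [0, 2]: box 1 is suppressed by box 0
```

The second box overlaps the first with IoU 0.81, so only the higher-scoring duplicate survives, while the distant third box is untouched; this is how the aggregated predictions from the fused feature maps collapse into the final detections.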