A survey of scale transformation applications in object detection

申奉璨, 张萍, 罗金, 刘松阳, 冯世杰(电子科技大学光电科学与工程学院, 成都 610054)

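Abstract
Object detection attempts to mark object instances that appear in natural images with given labels, and it has been widely used in fields such as autonomous driving and security surveillance. With the spread of deep learning, general object detection frameworks based on convolutional neural networks have achieved far better detection results than other methods. However, owing to the inherent limitations of convolutional neural networks, general object detection still faces challenges such as scale, illumination, and occlusion. This paper aims to give a comprehensive survey of scale-oriented object detection strategies in convolutional neural network architectures. First, it introduces the development of general object detection and the main datasets used, covering the two categories of general object detection frameworks and their evolution; it details the history and structural innovations of two-stage detection algorithms based on region proposals, as well as three different schools of detection algorithms based on single-pass regression. Second, it briefly classifies the optimization approaches to the scale problem, which affects detection performance; these include multi-feature fusion strategies, convolution deformations targeting the receptive field, and the design of training strategies. Finally, it reports the detection accuracy of different detection frameworks on common datasets for objects of different sizes, together with possible future directions for handling scale transformation.
Keywords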
Scale changing in general object detection: a survey

Shen Fengcan, Zhang Ping, Luo Jin, Liu Songyang, Feng Shijie (School of Optoelectronic Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China)

Abstract
General object detection is one of the most important research topics in computer vision. The task attempts to locate and label object instances that appear in a natural image using a given set of labels. The technique is widely used in practical applications such as autonomous driving and security monitoring. With the development and spread of deep learning, acquiring the semantic information of images has become easier; as a result, general object detection frameworks based on convolutional neural networks (CNNs) have obtained better results than other object detection methods. Because the large-scale datasets for this task are relatively better than those designed for other vision tasks and the metrics are well defined, the task has evolved rapidly among CNN-based computer vision tasks. However, general object detection still faces many problems, such as scale changes, illumination changes, and occlusion, owing to the limitations of the CNN structure. Because the features extracted by CNNs are sensitive to scale, multiscale detection is valuable but challenging in CNN-based object detection. Research on scale transformation also has reference value for other small-target or pixel-level tasks, such as semantic segmentation and pose detection in images. This study aims to provide a comprehensive overview of object detection strategies that address scale in CNN architectures, that is, how to locate and classify targets of different sizes robustly. First, we introduce the development of general object detection and the main datasets used. Then, we introduce the two categories of general object detection frameworks. The first category, two-stage strategies, obtains region proposals and then selects among them by classification confidence; it mostly takes region-based convolutional neural networks (RCNN) as the baseline.
As the RCNN structure developed, all of its stages were converted into specific convolution layers, forming an end-to-end structure. In addition, several tricks were designed for the baseline to solve specific problems, improving its robustness across all kinds of object regions. The other category, one-stage strategies, obtains the region location and category in a single regression; it starts with a structure named “you only look once” (YOLO), which regresses the object information for every block into which the image is divided. The baseline then became convolutional and end-to-end, using deep and effective features. It gained further popularity after focal loss was proposed, because focal loss addresses the imbalance between positive and negative samples that single-pass regression can cause. In addition, other methods that detect objects via point localization, borrowing from pose estimation tasks, also obtain satisfactory results in general object detection. We then give a simple classification of the optimization ideas for scale problems; these include multi-feature fusion strategies, convolution deformations for receptive fields, and training strategy designs. Multi-feature fusion strategies are used for object classes that do not always appear at a small scale. Multi-feature fusion obtains semantic information from different image scales and fuses it to reach the most suitable scale, so it can also effectively identify objects of one class at different sizes. Widely used structures fall into two groups: those based on single-shot detection and those based on feature pyramid networks. Some structures use a skip-layer fusion design. The receptive field of a feature is the region of the image or of a lower-level feature map to which that feature corresponds. Designs that target the receptive field can handle targets that always appear small in the image.
The receptive field of an ordinary convolution equals the size of its kernel, so special convolution kernels have been designed. Dilated kernels are the most common deformed kernels; they are used with specially designed pooling layers to obtain dense high-level features. Some researchers have designed an offset layer that automatically learns the most useful deformation for the convolution kernel. Training strategies can also be designed for small targets. A dataset that includes only small objects can be constructed, and images of different sizes can be fed to the network in a set order during training. Resampling images is also a common strategy. We report the detection accuracy of different detection frameworks on common datasets for targets of different sizes. The results are obtained on the Microsoft Common Objects in Context (MS COCO) dataset. We use average precision (AP) to measure detection performance, and the results cover small, medium, and large targets as well as different intersection-over-union (IoU) thresholds, which shows how scale changes influence detection. Finally, this study outlines possible future directions for scale transformation, including how to obtain robust features and detection modules and how to design training datasets.
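The evaluation described above relies on two quantities: the intersection-over-union between a predicted box and a ground-truth box, and the size bucket (small, medium, large) of the target. The following is a minimal sketch of both, not the evaluation code used in the survey. The function names `iou`, `is_match`, and `size_bucket` are hypothetical, boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples in pixels, and the 32² and 96² pixel area boundaries follow the MS COCO convention. Note that COCO measures the area of the segmentation mask, whereas this sketch uses box area as a simplification.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    width = max(0.0, box[2] - box[0])
    height = max(0.0, box[3] - box[1])
    return width * height


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = box_area(a) + box_area(b) - intersection
    return intersection / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """A prediction counts as correct when its IoU reaches the threshold."""
    return iou(pred, truth) >= threshold


def size_bucket(area: float) -> str:
    """Size bucket by pixel area, using the COCO boundaries 32^2 and 96^2."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


if __name__ == "__main__":
    truth = (0.0, 0.0, 10.0, 10.0)
    pred = (5.0, 0.0, 15.0, 10.0)
    print(iou(pred, truth))              # overlap 50 over union 150
    print(is_match(pred, truth, 0.5))
    print(size_bucket(box_area(truth)))
```

Raising the threshold in `is_match` makes localization stricter, which is why AP is reported at several IoU thresholds; reporting AP separately per size bucket is what exposes the gap between small and large targets.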
Keywords
