Video object detection using fusion of SSD and spatiotemporal features

Yu Wanqing, Yu Jing, Bai Manyan, Xiao Chuangbai (Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China)

Abstract
Objective Video object detection aims to locate moving objects in an image sequence and to assign each object a category label. It is hampered by problems such as object blur and multi-object occlusion. Most existing video object detection methods build on still-image object detection and improve the accuracy of moving-object detection by enforcing spatiotemporal consistency; however, because moving objects are often occluded or blurred, the robustness of current video object detection remains limited. To address this, this paper proposes a video object detection model that fuses the single shot multibox detector (SSD) with spatiotemporal features. Method Under the single-stage SSD detection framework, an optical flow network estimates the optical flow fields between the current frame and its adjacent frames, and the features of multiple adjacent frames are combined to motion-compensate the features of the current frame. A feature pyramid network extracts multiscale features for detecting objects of different sizes, and, finally, high- and low-level features are fused to enhance the semantic information of the low-level features. Result Experimental results show that the proposed model achieves a mean average precision (mAP) of 72.0% on the ImageNet VID (ImageNet for video object detection) dataset, an improvement of 24.5%, 3.6%, and 2.5% over the TCN (temporal convolutional networks) model, the TPN+LSTM (tubelet proposal network and long short-term memory network) model, and the SSD+siamese network model, respectively. Ablation experiments on different network structures further verify the effectiveness of the proposed model. Conclusion The proposed model exploits the temporal and spatial correlation inherent in video and improves the accuracy of video object detection through spatiotemporal feature fusion, substantially reducing missed and false detections in video object detection.
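
The motion compensation described above, warping adjacent-frame features onto the current frame along the estimated optical flow, can be pictured with a short PyTorch sketch. This is a minimal illustration rather than the paper's implementation: the function name, tensor layout, and the use of bilinear sampling via `grid_sample` are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_features(feat_adj: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp an adjacent frame's feature map onto the current frame.

    feat_adj: (N, C, H, W) feature map extracted from an adjacent frame.
    flow:     (N, 2, H, W) optical flow from the current frame to that
              adjacent frame, in pixels, resized to the feature resolution.
    """
    _, _, h, w = feat_adj.shape
    # Base grid holding the (x, y) coordinate of every feature position.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat_adj.dtype, device=feat_adj.device),
        torch.arange(w, dtype=feat_adj.dtype, device=feat_adj.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow                               # displaced coordinates
    # grid_sample expects sampling locations normalized to [-1, 1].
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)               # (N, H, W, 2)
    # Bilinear sampling realizes the motion compensation.
    return F.grid_sample(feat_adj, grid, mode="bilinear", align_corners=True)
```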

Abstract
Objective Object detection is a fundamental task in computer vision that supports subsequent object tracking, semantic segmentation, and behavior recognition. Recent years have witnessed substantial progress in still-image object detection based on deep convolutional neural networks (DCNNs). The task of still-image object detection is to determine the category and position of each object in an image. Video object detection aims to locate moving objects in sequential images and assign a category label to each object. Its accuracy suffers from degenerated object appearances in videos, such as motion blur, multi-object occlusion, and rare poses. Still-image object detection methods achieve excellent results, but applying them directly to video object detection is challenging. Exploiting the temporal and spatial information in videos, most existing video object detection methods improve the accuracy of moving-object detection by considering spatiotemporal consistency on top of still-image object detection.

Method In this paper, we propose a video object detection method that fuses the single shot multibox detector (SSD) with spatiotemporal features. Under the SSD framework, the temporal and spatial information of the video is brought into detection through an optical flow network and a feature pyramid network. On the one hand, a network combining the 101-layer residual network (ResNet101) with four extra convolutional layers extracts the feature map of each video frame. An optical flow network estimates the optical flow fields between the current frame and multiple adjacent frames so that the feature of the current frame can be enhanced. The feature maps of the adjacent frames are compensated toward the current frame according to the optical flow fields, and the compensated feature maps, together with the feature map of the current frame, are aggregated with adaptive weights. The adaptive weights indicate the importance of each compensated feature map to the current frame; cosine similarity is used to measure the similarity between a compensated feature map and the feature map extracted from the current frame. If a compensated feature map is close to the feature map of the current frame, it is assigned a larger weight; otherwise, it is assigned a smaller weight. An embedding network consisting of three convolutional layers is applied to the compensated feature maps and the current feature map to produce the embedding feature maps from which the adaptive weights are computed. On the other hand, the feature pyramid network extracts multiscale feature maps for detecting objects of different sizes: the low- and high-level feature maps detect smaller and larger objects, respectively. To address the small-object detection problem of the original SSD network, the low-level feature map is combined with the high-level feature map via an upsampling operation and a 1×1 convolutional layer to enhance its semantic information. The upsampling operation extends the high-level feature map to the same resolution as the low-level feature map, and the 1×1 convolutional layer reduces the channel dimension of the low-level feature map to be consistent with that of the high-level feature map.
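
The adaptive weighting just described can be sketched as follows: each compensated feature map is scored against the current frame by per-position cosine similarity computed on embedding features, and the scores are normalized before aggregation. This is a minimal sketch, assuming softmax normalization and a stand-in `embed_net` for the paper's three-layer embedding network.

```python
import torch
import torch.nn.functional as F

def aggregate_features(feat_cur, warped_feats, embed_net):
    """feat_cur: (N, C, H, W) current-frame features; warped_feats: list of
    (N, C, H, W) maps warped from adjacent frames; embed_net: embedding
    sub-network (a stand-in for the paper's three convolutional layers)."""
    e_cur = embed_net(feat_cur)
    candidates = [feat_cur] + list(warped_feats)
    scores = []
    for feat in candidates:
        # Per-position cosine similarity over the channel dimension.
        sim = F.cosine_similarity(embed_net(feat), e_cur, dim=1)  # (N, H, W)
        scores.append(sim)
    # Normalize across candidates so the weights sum to 1 at every position
    # (softmax normalization is an assumption of this sketch).
    weights = torch.softmax(torch.stack(scores, dim=0), dim=0)    # (K, N, H, W)
    # Weighted sum of the candidate feature maps.
    return sum(w.unsqueeze(1) * f for w, f in zip(weights, candidates))
```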
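
The high/low-level fusion step at the end of the paragraph above (upsample the high-level map, match channels with a 1×1 convolution, then merge) might look like the following sketch. Channel counts are illustrative, and the element-wise addition is one common merge choice rather than the paper's confirmed operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Fuse a low-level map with an upsampled high-level map (sketch)."""

    def __init__(self, low_channels: int, high_channels: int):
        super().__init__()
        # 1x1 conv reduces the low-level channels to match the high-level map.
        self.reduce = nn.Conv2d(low_channels, high_channels, kernel_size=1)

    def forward(self, feat_low, feat_high):
        # Upsample the coarse high-level map to the low-level resolution.
        up = F.interpolate(feat_high, size=feat_low.shape[-2:], mode="nearest")
        # Merge; element-wise addition shown here as one common choice.
        return self.reduce(feat_low) + up

# Illustrative usage with made-up channel counts and resolutions:
fuse = TopDownFusion(low_channels=512, high_channels=256)
out = fuse(torch.randn(1, 512, 64, 64), torch.randn(1, 256, 32, 32))
```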
Then, the multiscale feature maps are input into the detection network to predict bounding boxes, and non-maximum suppression filters the redundant bounding boxes to obtain the final detections.

Result Experimental results show that the mean average precision (mAP) of the proposed method on the ImageNet VID (ImageNet for video object detection) dataset reaches 72.0%, which is 24.5%, 3.6%, and 2.5% higher than those of the temporal convolutional network (TCN), the method combining the tubelet proposal network with a long short-term memory network (TPN+LSTM), and the method combining SSD with a siamese network, respectively. In addition, an ablation experiment is conducted with four network structures, namely, the 16-layer visual geometry group network (VGG16), ResNet101, ResNet101 combined with the feature pyramid network, and ResNet101 combined with spatiotemporal fusion. The structure combining ResNet101 with spatiotemporal fusion improves the mAP score by 11.8%, 7.0%, and 1.2% compared with the first three structures. For further analysis, the mAP scores for slow, medium, and fast objects are reported in addition to the standard mAP. Compared with the structure combining ResNet101 with the feature pyramid network, our method with optical flow improves the mAP of slow, medium, and fast objects by 0.6%, 1.9%, and 2.3%, respectively. These results show that the proposed method improves the accuracy of video object detection, especially for fast-moving objects.

Conclusion The proposed method exploits the temporal and spatial correlation of video through spatiotemporal fusion to improve the accuracy of video object detection. The optical flow network allows the feature map of the current frame to be compensated with the feature maps of multiple adjacent frames, and temporal feature fusion reduces both false negatives and false positives. In addition, the multiscale feature maps produced by the feature pyramid network enable the detection of objects of different sizes, and multiscale feature-map fusion enhances the semantic information of the low-level feature map, improving its ability to detect small objects.
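
For the box-filtering step named in the Method section, a minimal class-wise non-maximum suppression sketch using `torchvision.ops.nms` is given below; the score and IoU thresholds are illustrative defaults, not the paper's settings.

```python
import torch
from torchvision.ops import nms

def filter_boxes(boxes, scores, labels, score_thresh=0.5, iou_thresh=0.45):
    """boxes: (M, 4) as (x1, y1, x2, y2); scores, labels: (M,) tensors."""
    # Drop low-confidence predictions first.
    keep_mask = scores > score_thresh
    boxes, scores, labels = boxes[keep_mask], scores[keep_mask], labels[keep_mask]
    kept = []
    for cls in labels.unique():
        idx = (labels == cls).nonzero(as_tuple=True)[0]
        # Suppress overlapping boxes of the same class.
        kept.append(idx[nms(boxes[idx], scores[idx], iou_thresh)])
    kept = torch.cat(kept) if kept else torch.empty(0, dtype=torch.long)
    return boxes[kept], scores[kept], labels[kept]
```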