SSD与时空特征融合的视频目标检测
Video object detection using fusion of SSD and spatiotemporal features
2021, Vol. 26, No. 3, Pages 542-555
Received: 2020-02-11
Revised: 2020-06-23
Accepted: 2020-06-30
Published in print: 2021-03-16
DOI: 10.11834/jig.200020
Objective
Video object detection aims to locate moving objects in sequential images and assign a specified category label to each object. Video object detection is hampered by problems such as object blur and multi-object occlusion. Most existing video object detection methods build on still-image object detection and improve the accuracy of moving object detection by considering spatiotemporal consistency; however, because moving objects are subject to occlusion, blur, and other degradations, the robustness of current video object detection remains limited. To address this, this paper proposes a video object detection model that fuses the single shot multibox detector (SSD) with spatiotemporal features.
Method
Under the framework of the single-stage SSD detector, an optical flow network estimates the optical flow fields between the current frame and its neighboring frames, and the features of multiple neighboring frames are combined to motion-compensate the features of the current frame. A feature pyramid network then extracts multiscale features for detecting objects of different sizes, and finally high- and low-level feature fusion enhances the semantic information of the low-level features.
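As a rough illustration of the motion-compensation step, the PyTorch-style sketch below warps an adjacent frame's feature map to the current frame along an estimated flow field using bilinear sampling; the function name, tensor layout, and flow convention are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def warp_to_current(adj_feat, flow):
    """Warp an adjacent-frame feature map toward the current frame.

    adj_feat: (N, C, H, W) features of an adjacent frame.
    flow:     (N, 2, H, W) estimated flow from the current frame to that frame,
              in pixel units (dx, dy); this layout is an assumption for the sketch.
    """
    n, _, h, w = adj_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(adj_feat.device)   # (2, H, W), (x, y) order
    coords = base.unsqueeze(0) + flow                          # where to sample adj_feat
    # grid_sample expects sampling positions normalized to [-1, 1]
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (N, H, W, 2)
    return F.grid_sample(adj_feat, grid, mode="bilinear", align_corners=True)
```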
Result
Experimental results show that the proposed model achieves a mean average precision (mAP) of 72.0% on the ImageNet VID (ImageNet for video object detection) dataset, which is 24.5%, 3.6%, and 2.5% higher than the TCN (temporal convolutional network) model, the TPN+LSTM (tubelet proposal network and long short-term memory network) model, and the SSD+Siamese network model, respectively. Ablation experiments on networks with different structures further verify the effectiveness of the proposed model.
Conclusion
The proposed model exploits the temporal and spatial correlations inherent in video and improves the accuracy of video object detection through spatiotemporal feature fusion, effectively alleviating missed and false detections in video object detection.
Objective
Object detection is a fundamental task in computer vision that supports subsequent object tracking, semantic segmentation, and behavior recognition. Recent years have witnessed substantial progress in still-image object detection based on deep convolutional neural networks (DCNNs). The task of still-image object detection is to determine the category and position of each object in an image. Video object detection aims to locate a moving object in sequential images and assign a specific category label to each object. The accuracy of video object detection suffers from degenerated object appearances in videos, such as motion blur, multi-object occlusion, and rare poses. Still-image object detection methods achieve excellent results, but directly applying them to video object detection is challenging. Exploiting the temporal and spatial information in videos, most existing video object detection methods build on still-image object detection and improve the accuracy of moving object detection by considering spatiotemporal consistency.
Method
In this paper, we propose a video object detection method that fuses the single shot multibox detector (SSD) with spatiotemporal features. Under the SSD framework, the temporal and spatial information of the video is applied to video object detection through an optical flow network and a feature pyramid network. On the one hand, a network combining the 101-layer residual network (ResNet101) with four extra convolutional layers is used for feature extraction to produce a feature map for each frame of the video. An optical flow network estimates the optical flow fields between the current frame and multiple adjacent frames to enhance the features of the current frame. The feature maps of the adjacent frames are compensated to the current frame according to the optical flow fields. The compensated feature maps, as well as the feature map of the current frame, are aggregated according to adaptive weights, which indicate the importance of each compensated feature map to the current frame. Here, the cosine similarity metric is utilized to measure the similarity between a compensated feature map and the feature map extracted from the current frame: if the compensated feature map is close to the feature map of the current frame, it is assigned a larger weight; otherwise, it is assigned a smaller weight. Moreover, an embedding network consisting of three convolutional layers is applied to the compensated feature maps and the current feature map to produce embedding feature maps, which are used to compute the adaptive weights.
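The adaptive aggregation described above can be sketched as follows. The three convolutional layers of the embedding network, the channel widths, and the softmax normalization of the cosine-similarity weights are plausible assumptions made for illustration rather than a reproduction of the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in for the three-layer embedding network mentioned in the text
# (channel widths are illustrative assumptions).
embed_net = nn.Sequential(
    nn.Conv2d(256, 128, 1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 1),
)

def aggregate(cur_feat, compensated_feats):
    """Fuse the current-frame feature map with the motion-compensated ones.

    Each candidate map gets a per-position weight from the cosine similarity
    between its embedding and the current frame's embedding; the weights are
    normalized over all candidates (softmax is an assumed normalization).
    """
    candidates = [cur_feat] + list(compensated_feats)
    cur_emb = embed_net(cur_feat)
    sims = [F.cosine_similarity(embed_net(f), cur_emb, dim=1) for f in candidates]
    weights = torch.softmax(torch.stack(sims), dim=0)           # (K, N, H, W)
    return sum(w.unsqueeze(1) * f for w, f in zip(weights, candidates))

# Example with random maps standing in for the current and two compensated frames
cur = torch.randn(1, 256, 38, 38)
fused = aggregate(cur, [torch.randn(1, 256, 38, 38) for _ in range(2)])
```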
On the other hand, the feature pyramid network extracts multiscale feature maps that are used to detect objects of different sizes: the low- and high-level feature maps are used to detect smaller and larger objects, respectively. To address the problem of small object detection in the original SSD network, the low-level feature map is combined with the high-level feature map to enhance its semantic information via an upsampling operation and a 1×1 convolutional layer. The upsampling operation extends the high-level feature map to the same resolution as the low-level feature map, and the 1×1 convolutional layer reduces the channel dimensions of the low-level feature map to be consistent with those of the high-level feature map. Then, the multiscale feature maps are input into the detection network to predict bounding boxes, and non-maximum suppression is carried out to filter redundant bounding boxes and obtain the final detections.
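The high/low-level fusion can be sketched in a few lines: the high-level map is upsampled to the low-level resolution, a 1×1 convolution aligns the channel dimensions of the low-level map, and the two maps are merged. The concrete channel counts and the element-wise sum are assumptions made for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    def __init__(self, low_channels, high_channels):
        super().__init__()
        # 1x1 convolution that projects the low-level map to the high-level channel width
        self.lateral = nn.Conv2d(low_channels, high_channels, kernel_size=1)

    def forward(self, low_feat, high_feat):
        # Upsample the semantically strong high-level map to the low-level resolution
        up = F.interpolate(high_feat, size=low_feat.shape[-2:], mode="nearest")
        # Merge: high-level semantics injected into the fine-resolution map
        return self.lateral(low_feat) + up

# Example: fuse a 38x38 low-level map (512 ch) with a 19x19 high-level map (256 ch)
low = torch.randn(1, 512, 38, 38)
high = torch.randn(1, 256, 19, 19)
fused = TopDownFusion(512, 256)(low, high)   # -> (1, 256, 38, 38)
```

Projecting the low-level map rather than the high-level one follows the description above, where the 1×1 convolution reduces the low-level channel dimensions to match those of the high-level map.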
Result
Experimental results show that the mean average precision (mAP) score of the proposed method on the ImageNet VID (ImageNet for video object detection) dataset reaches 72.0%, which is 24.5%, 3.6%, and 2.5% higher than those of the temporal convolutional network, the method combining the tubelet proposal network with a long short-term memory network, and the method combining SSD with a Siamese network, respectively. In addition, an ablation experiment is conducted with four network structures, namely, the 16-layer visual geometry group (VGG16) network, the ResNet101 network, the network combining ResNet101 with the feature pyramid network, and the network combining ResNet101 with spatiotemporal fusion. The network structure combining ResNet101 with spatiotemporal fusion improves the mAP score by 11.8%, 7.0%, and 1.2% compared with the first three network structures. For further analysis, the mAP scores of slow, medium, and fast objects are reported in addition to the standard mAP score. Our method combined with optical flow improves the mAP scores of slow, medium, and fast objects by 0.6%, 1.9%, and 2.3%, respectively, compared with the network structure combining ResNet101 with the feature pyramid network. Experimental results show that the proposed method can improve the accuracy of video object detection, especially the performance of fast object detection.
Conclusion
The proposed method exploits the temporal and spatial correlations of the video through spatiotemporal fusion to improve the accuracy of video object detection. Using the optical flow network allows the feature map of the current frame to be compensated according to the feature maps of multiple adjacent frames, and false negatives and false positives are reduced through this temporal feature fusion. In addition, the multiscale feature maps produced by the feature pyramid network can detect objects of different sizes, and multiscale feature map fusion enhances the semantic information of the low-level feature map, which improves its ability to detect small objects.
Dosovitskiy A, Fischer P, Ilg E, Häusser P, Hazirbas C, Golkov V, van der Smagt P, Cremers D and Brox T. 2015. FlowNet: learning optical flow with convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2758-2766[DOI: 10.1109/ICCV.2015.316]
Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 580-587[DOI: 10.1109/CVPR.2014.81]
Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1440-1448[DOI: 10.1109/ICCV.2015.169]
He K M, Zhang X Y, Ren S Q and Sun J. 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9): 1904-1916[DOI: 10.1109/TPAMI.2015.2389824]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778[DOI: 10.1109/CVPR.2016.90]
Kang K, Li H S, Xiao T, Ouyang W L, Yan J J, Liu X G and Wang X G. 2017. Object detection in videos with tubelet proposal networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 889-897[DOI: 10.1109/CVPR.2017.101]
Kang K, Ouyang W L, Li H S and Wang X G. 2016. Object detection from video tubelets with convolutional neural networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 817-825[DOI: 10.1109/CVPR.2016.95]
Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: NIPS: 1097-1105
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 936-944[DOI: 10.1109/CVPR.2017.106]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot multibox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 21-37[DOI: 10.1007/978-3-319-46448-0_2]
Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 779-788[DOI: 10.1109/CVPR.2016.91]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149[DOI: 10.1109/TPAMI.2016.2577031]
Uijlings J R R, van de Sande K E A, Gevers T and Smeulders A W M. 2013. Selective search for object recognition. International Journal of Computer Vision, 104(2): 154-171[DOI: 10.1007/s11263-013-0620-5]
Xiao F Y and Lee Y J. 2018. Video object detection with an aligned spatial-temporal memory//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 494-510[DOI: 10.1007/978-3-030-01237-3_30]
Zhang S F, Wen L Y, Bian X, Lei Z and Li S Z. 2018a. Single-shot refinement neural network for object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4203-4212[DOI: 10.1109/CVPR.2018.00442]
Zhang Z S, Qiao S Y, Xie C H, Shen W, Wang B and Yuille A L. 2018b. Single-shot object detection with enriched semantics//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5813-5821[DOI: 10.1109/CVPR.2018.00609]
Zhao B J, Zhao B Y, Tang L B, Han Y Q and Wang W Z. 2018. Deep spatial-temporal joint feature representation for video object detection. Sensors, 18(3): #774[DOI: 10.3390/s18030774]
Zhu X Z, Wang Y J, Dai J F, Yuan L and Wei X C. 2017. Flow-guided feature aggregation for video object detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 408-417[DOI: 10.1109/ICCV.2017.52]