SSD与时空特征融合的视频目标检测
Video object detection using fusion of SSD and spatiotemporal features
2021, Vol. 26, No. 3, Pages 542-555
Received: 2020-02-11
Revised: 2020-06-23
Accepted: 2020-06-30
Published in print: 2021-03-16
DOI: 10.11834/jig.200020
Objective
Video object detection aims to locate moving objects in sequential images and assign a specified category label to each object. Video object detection is hampered by problems such as object blur and multi-object occlusion. Most existing video object detection methods build on still-image object detection and improve the accuracy of moving object detection by considering spatiotemporal consistency; however, because moving objects are subject to occlusion, blur, and other degradations, the robustness of current video object detection remains limited. To address this, this paper proposes a video object detection model that fuses the single shot multibox detector (SSD) with spatiotemporal features.
Method
Under the framework of the single-stage SSD detector, an optical flow network estimates the optical flow fields between the current frame and its neighboring frames, and the features of multiple neighboring frames are combined to motion-compensate the features of the current frame. A feature pyramid network then extracts multiscale features for detecting objects of different sizes, and finally high- and low-level feature fusion enhances the semantic information of the low-level features.
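As a rough illustration of the motion-compensation step, the PyTorch-style sketch below warps an adjacent frame's feature map to the current frame along an estimated flow field using bilinear sampling; the function name, tensor layout, and flow convention are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def warp_to_current(adj_feat, flow):
    """Warp an adjacent-frame feature map toward the current frame.

    adj_feat: (N, C, H, W) features of an adjacent frame.
    flow:     (N, 2, H, W) estimated flow from the current frame to that frame,
              in pixel units (dx, dy); this layout is an assumption for the sketch.
    """
    n, _, h, w = adj_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(adj_feat.device)   # (2, H, W), (x, y) order
    coords = base.unsqueeze(0) + flow                          # where to sample adj_feat
    # grid_sample expects sampling positions normalized to [-1, 1]
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (N, H, W, 2)
    return F.grid_sample(adj_feat, grid, mode="bilinear", align_corners=True)
```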
Result
Experimental results show that the proposed model achieves a mean average precision (mAP) of 72.0% on the ImageNet VID (ImageNet for video object detection) dataset, which is 24.5%, 3.6%, and 2.5% higher than the TCN (temporal convolutional network) model, the TPN+LSTM (tubelet proposal network and long short-term memory network) model, and the SSD+Siamese network model, respectively. Ablation experiments on networks with different structures further verify the effectiveness of the proposed model.
Conclusion
The proposed model exploits the temporal and spatial correlations inherent in video and improves the accuracy of video object detection through spatiotemporal feature fusion, effectively alleviating missed and false detections in video object detection.
Objective
Object detection is a fundamental task in computer vision that supports subsequent object tracking, semantic segmentation, and behavior recognition. Recent years have witnessed substantial progress in still-image object detection based on deep convolutional neural networks (DCNNs). The task of still-image object detection is to determine the category and position of each object in an image. Video object detection aims to locate a moving object in sequential images and assign a specific category label to each object. The accuracy of video object detection suffers from degenerated object appearances in videos, such as motion blur, multi-object occlusion, and rare poses. Still-image object detection methods achieve excellent results, but directly applying them to video object detection is challenging. Exploiting the temporal and spatial information in videos, most existing video object detection methods build on still-image object detection and improve the accuracy of moving object detection by considering spatiotemporal consistency.
Method
In this paper, we propose a video object detection method that fuses the single shot multibox detector (SSD) with spatiotemporal features. Under the SSD framework, the temporal and spatial information of the video is applied to video object detection through an optical flow network and a feature pyramid network. On the one hand, a network combining the 101-layer residual network (ResNet101) with four extra convolutional layers is used for feature extraction to produce a feature map for each frame of the video. An optical flow network estimates the optical flow fields between the current frame and multiple adjacent frames to enhance the features of the current frame. The feature maps of the adjacent frames are compensated to the current frame according to the optical flow fields. The compensated feature maps, as well as the feature map of the current frame, are aggregated according to adaptive weights, which indicate the importance of each compensated feature map to the current frame. Here, the cosine similarity metric is utilized to measure the similarity between a compensated feature map and the feature map extracted from the current frame: if the compensated feature map is close to the feature map of the current frame, it is assigned a larger weight; otherwise, it is assigned a smaller weight. Moreover, an embedding network consisting of three convolutional layers is applied to the compensated feature maps and the current feature map to produce embedding feature maps, which are used to compute the adaptive weights.
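The adaptive aggregation described above can be sketched as follows. The three convolutional layers of the embedding network, the channel widths, and the softmax normalization of the cosine-similarity weights are plausible assumptions made for illustration rather than a reproduction of the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in for the three-layer embedding network mentioned in the text
# (channel widths are illustrative assumptions).
embed_net = nn.Sequential(
    nn.Conv2d(256, 128, 1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 256, 1),
)

def aggregate(cur_feat, compensated_feats):
    """Fuse the current-frame feature map with the motion-compensated ones.

    Each candidate map gets a per-position weight from the cosine similarity
    between its embedding and the current frame's embedding; the weights are
    normalized over all candidates (softmax is an assumed normalization).
    """
    candidates = [cur_feat] + list(compensated_feats)
    cur_emb = embed_net(cur_feat)
    sims = [F.cosine_similarity(embed_net(f), cur_emb, dim=1) for f in candidates]
    weights = torch.softmax(torch.stack(sims), dim=0)           # (K, N, H, W)
    return sum(w.unsqueeze(1) * f for w, f in zip(weights, candidates))

# Example with random maps standing in for the current and two compensated frames
cur = torch.randn(1, 256, 38, 38)
fused = aggregate(cur, [torch.randn(1, 256, 38, 38) for _ in range(2)])
```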
On the other hand, the feature pyramid network extracts multiscale feature maps that are used to detect objects of different sizes: the low- and high-level feature maps are used to detect smaller and larger objects, respectively. To address the problem of small object detection in the original SSD network, the low-level feature map is combined with the high-level feature map to enhance its semantic information via an upsampling operation and a 1×1 convolutional layer. The upsampling operation extends the high-level feature map to the same resolution as the low-level feature map, and the 1×1 convolutional layer reduces the channel dimensions of the low-level feature map to be consistent with those of the high-level feature map. Then, the multiscale feature maps are input into the detection network to predict bounding boxes, and non-maximum suppression is carried out to filter redundant bounding boxes and obtain the final detections.
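The high/low-level fusion can be sketched in a few lines: the high-level map is upsampled to the low-level resolution, a 1×1 convolution aligns the channel dimensions of the low-level map, and the two maps are merged. The concrete channel counts and the element-wise sum are assumptions made for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    def __init__(self, low_channels, high_channels):
        super().__init__()
        # 1x1 convolution that projects the low-level map to the high-level channel width
        self.lateral = nn.Conv2d(low_channels, high_channels, kernel_size=1)

    def forward(self, low_feat, high_feat):
        # Upsample the semantically strong high-level map to the low-level resolution
        up = F.interpolate(high_feat, size=low_feat.shape[-2:], mode="nearest")
        # Merge: high-level semantics injected into the fine-resolution map
        return self.lateral(low_feat) + up

# Example: fuse a 38x38 low-level map (512 ch) with a 19x19 high-level map (256 ch)
low = torch.randn(1, 512, 38, 38)
high = torch.randn(1, 256, 19, 19)
fused = TopDownFusion(512, 256)(low, high)   # -> (1, 256, 38, 38)
```

Projecting the low-level map rather than the high-level one follows the description above, where the 1×1 convolution reduces the low-level channel dimensions to match those of the high-level map.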
Result
Experimental results show that the mean average precision (mAP) score of the proposed method on the ImageNet VID (ImageNet for video object detection) dataset reaches 72.0%, which is 24.5%, 3.6%, and 2.5% higher than those of the temporal convolutional network, the method combining the tubelet proposal network with a long short-term memory network, and the method combining SSD with a Siamese network, respectively. In addition, an ablation experiment is conducted with four network structures, namely, the 16-layer visual geometry group (VGG16) network, the ResNet101 network, the network combining ResNet101 with the feature pyramid network, and the network combining ResNet101 with spatiotemporal fusion. The network structure combining ResNet101 with spatiotemporal fusion improves the mAP score by 11.8%, 7.0%, and 1.2% compared with the first three network structures. For further analysis, the mAP scores of slow, medium, and fast objects are reported in addition to the standard mAP score. Our method combined with optical flow improves the mAP scores of slow, medium, and fast objects by 0.6%, 1.9%, and 2.3%, respectively, compared with the network structure combining ResNet101 with the feature pyramid network. Experimental results show that the proposed method can improve the accuracy of video object detection, especially the performance of fast object detection.
Conclusion
The proposed method exploits the temporal and spatial correlations of the video through spatiotemporal fusion to improve the accuracy of video object detection. Using the optical flow network allows the feature map of the current frame to be compensated according to the feature maps of multiple adjacent frames, and false negatives and false positives are reduced through this temporal feature fusion. In addition, the multiscale feature maps produced by the feature pyramid network can detect objects of different sizes, and multiscale feature map fusion enhances the semantic information of the low-level feature map, which improves its ability to detect small objects.
Dosovitskiy A, Fischer P, Ilg E, Häusser P, Hazirbas C, Golkov V, van der Smagt P, Cremers D and Brox T. 2015. FlowNet: learning optical flow with convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2758-2766[DOI: 10.1109/ICCV.2015.316]
Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 580-587[DOI: 10.1109/CVPR.2014.81]
Girshick R. 2015. Fast R-CNN//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 1440-1448[DOI: 10.1109/ICCV.2015.169]
He K M, Zhang X Y, Ren S Q and Sun J. 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9): 1904-1916[DOI: 10.1109/TPAMI.2015.2389824]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778[DOI: 10.1109/CVPR.2016.90]
Kang K, Li H S, Xiao T, Ouyang W L, Yan J J, Liu X G and Wang X G. 2017. Object detection in videos with tubelet proposal networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 889-897[DOI: 10.1109/CVPR.2017.101]
Kang K, Ouyang W L, Li H S and Wang X G. 2016. Object detection from video tubelets with convolutional neural networks//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 817-825[DOI: 10.1109/CVPR.2016.95]
Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: NIPS: 1097-1105
Lin T Y, Dollár P, Girshick R, He K M, Hariharan B and Belongie S. 2017. Feature pyramid networks for object detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 936-944[DOI: 10.1109/CVPR.2017.106]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot multibox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, The Netherlands: Springer: 21-37[DOI: 10.1007/978-3-319-46448-0_2]
Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 779-788[DOI: 10.1109/CVPR.2016.91]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149[DOI: 10.1109/TPAMI.2016.2577031]
Uijlings J R R, van de Sande K E A, Gevers T and Smeulders A W M. 2013. Selective search for object recognition. International Journal of Computer Vision, 104(2): 154-171[DOI: 10.1007/s11263-013-0620-5]
Xiao F Y and Lee Y J. 2018. Video object detection with an aligned spatial-temporal memory//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 494-510[DOI: 10.1007/978-3-030-01237-3_30]
Zhang S F, Wen L Y, Bian X, Lei Z and Li S Z. 2018a. Single-shot refinement neural network for object detection//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4203-4212[DOI: 10.1109/CVPR.2018.00442]
Zhang Z S, Qiao S Y, Xie C H, Shen W, Wang B and Yuille A L. 2018b. Single-shot object detection with enriched semantics//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5813-5821[DOI: 10.1109/CVPR.2018.00609]
Zhao B J, Zhao B Y, Tang L B, Han Y Q and Wang W Z. 2018. Deep spatial-temporal joint feature representation for video object detection. Sensors, 18(3): #774[DOI: 10.3390/s18030774]
Zhu X Z, Wang Y J, Dai J F, Yuan L and Wei X C. 2017. Flow-guided feature aggregation for video object detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 408-417[DOI: 10.1109/ICCV.2017.52]