Temporal action detection based on feature pyramid hierarchies
Vol. 26, Issue 7, Pages 1637-1647 (2021)
Received: 24 August 2020
Revised: 28 December 2020
Accepted: 04 January 2021
Published: 16 July 2021
DOI: 10.11834/jig.200495
Objective
Temporal action detection is one of the most important tasks in video understanding. It requires simultaneously classifying and regressing the action segments in a video, and a video often contains action segments of different temporal lengths; detecting short-duration segments is especially difficult. To address the detection of short-duration action segments, this paper builds a 3D feature pyramid hierarchy to enhance the network's ability to detect action segments of different durations and proposes a new two-stage network in which a proposal network is followed by a classifier.
Method
The network takes consecutive RGB frames as input and produces feature maps of different resolutions and abstraction levels through the feature pyramid structure. These multilevel feature maps mainly come into play in the last two stages of the network: 1) in the proposal stage, combined with the anchor mechanism, anchor segments of different temporal lengths are given correspondingly sized receptive fields, so the initial prediction of the anchor segments is more accurate; 2) in the region-of-interest pooling stage, each proposal segment is mapped to the feature map of the corresponding level for prediction, which balances the demands of classification and regression on the abstraction and resolution of the feature maps.
Result
The model is evaluated on the THUMOS Challenge 2014 dataset. Compared with other typical methods that do not use optical flow features, our model exceeds them by more than 3% at various intersection-over-union thresholds, and in the per-class comparison the detection accuracy on short-duration action segments is generally improved. In the ablation study, at an intersection-over-union threshold of 0.5, the network with the feature pyramid structure outperforms the model with an ordinary feature extraction network by 1.8%.
Conclusion
The proposed two-stage temporal action detection model based on a 3D feature pyramid feature extraction structure can effectively improve the detection accuracy of short-duration action segments.
Objective
Temporal action localization is one of the most important tasks in video understanding and has great application prospects in practice. With the rise of various online video applications, the number of short videos on the Internet has increased sharply, and many of them contain different human behaviors. A model that can automatically locate and classify human action segments in videos is therefore needed to detect and distinguish human behavior in short videos quickly and efficiently. Moreover, public security departments need real-time human behavior detection systems to help monitor and provide early warning of public safety incidents. In the task of temporal action localization, the human action segments in a video must be classified and regressed simultaneously, and accurately locating the boundaries of action segments is more difficult than classifying known segments. A video often contains action segments of different temporal lengths, and detecting segments with a short duration is especially difficult because a short-duration segment is easily ignored by the detection model or regarded as part of a nearby, longer-duration segment. Existing methods have made various attempts to improve the detection accuracy on human action segments of different durations. In this paper, a 3D feature pyramid hierarchy is proposed to enhance the network's ability to detect action segments of different temporal durations.
Method
A new two-stage network named the 3D feature pyramid convolutional network (3D-FPCN), in which a proposal network is followed by a classifier, is proposed. In 3D-FPCN, features are extracted by a purpose-built 3D feature pyramid extraction network with a bottom-up pathway and a top-down pathway. The bottom-up pathway encodes the temporal and spatial characteristics of consecutive input frames through a series of 3D convolutional layers to obtain highly abstract feature maps. The top-down pathway uses a series of deconvolutional layers and lateral connections to fuse highly abstract features with high-resolution features and obtain the lower-level pyramid feature maps. Through this feature pyramid extraction network, multilevel feature maps with different abstraction levels and resolutions are obtained: highly abstract feature maps are used for the classification and regression of long-duration human action segments, and high-resolution feature maps are used for short-duration segments, which effectively improves the network's detection of human action segments of different durations.
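To make the pyramid structure concrete, the following is a minimal PyTorch-style sketch of a 3D feature pyramid with a bottom-up pathway and a top-down pathway with lateral connections. It is illustrative only: the layer counts, channel widths, and the use of trilinear upsampling in place of deconvolution are assumptions, not the exact 3D-FPCN design.

```python
# Illustrative sketch of a 3D feature pyramid; layer counts, channel widths,
# and trilinear upsampling (instead of deconvolution) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid3D(nn.Module):
    def __init__(self, in_channels=3, widths=(64, 128, 256), out_channels=256):
        super().__init__()
        # Bottom-up pathway: each stage halves the temporal and spatial
        # resolution with a strided 3D convolution and raises the abstraction.
        blocks, prev = [], in_channels
        for w in widths:
            blocks.append(nn.Sequential(
                nn.Conv3d(prev, w, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            prev = w
        self.bottom_up = nn.ModuleList(blocks)
        # Lateral 1x1x1 convolutions project each stage to a common width.
        self.lateral = nn.ModuleList(
            [nn.Conv3d(w, out_channels, kernel_size=1) for w in widths])
        # 3x3x3 convolutions smooth each fused pyramid level.
        self.smooth = nn.ModuleList(
            [nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in widths])

    def forward(self, x):
        # x: (batch, 3, T, H, W) clip of consecutive RGB frames.
        stages = []
        for block in self.bottom_up:
            x = block(x)
            stages.append(x)
        # Top-down pathway: start from the most abstract map, upsample it, and
        # fuse it with the next higher-resolution stage via a lateral connection.
        pyramid = [self.lateral[-1](stages[-1])]
        for i in range(len(stages) - 2, -1, -1):
            up = F.interpolate(pyramid[0], size=stages[i].shape[2:],
                               mode='trilinear', align_corners=False)
            pyramid.insert(0, self.lateral[i](stages[i]) + up)
        # pyramid[0] is the highest-resolution level; pyramid[-1] the most abstract.
        return [s(p) for s, p in zip(self.smooth, pyramid)]

# Example: a 16-frame 112x112 RGB clip yields three pyramid levels.
levels = FeaturePyramid3D()(torch.randn(1, 3, 16, 112, 112))
print([tuple(l.shape) for l in levels])
```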
The whole network takes RGB frames as input and generates feature maps of different resolutions and abstraction degrees via the feature pyramid structure. These multilevel feature maps mainly play a role in the last two stages of the network. First, the anchor mechanism is used in the proposal stage, so anchor segments of different temporal lengths have correspondingly sized receptive fields, which is equivalent to a receptive field calibration. Second, in the region-of-interest pooling stage, different proposal segments are mapped to the feature maps of the corresponding levels for prediction, which makes the pooled features more targeted and balances the demands of classification and regression on the abstraction and resolution of the feature maps.
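The abstract does not give the exact rule used to map a proposal segment to a pyramid level, so the short sketch below shows only one plausible assignment: the logarithmic length heuristic, the base_length of 16 frames, and the level count are all assumptions for illustration.

```python
# Hypothetical mapping from a proposal's temporal length to a pyramid level;
# the threshold, level count, and log2 heuristic are assumptions, not the
# paper's rule.
import math

def assign_pyramid_level(proposal_length, base_length=16, num_levels=3):
    """Map a proposal's temporal length (in frames) to a pyramid level.

    Level 0 is the highest-resolution map (short proposals); the last level
    is the most abstract map (long proposals).
    """
    level = int(math.floor(math.log2(max(proposal_length, 1) / base_length)))
    return min(max(level, 0), num_levels - 1)

# A 12-frame proposal pools features from the high-resolution level, while a
# 128-frame proposal uses the most abstract level.
print(assign_pyramid_level(12))   # -> 0
print(assign_pyramid_level(128))  # -> 2
```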
Result
Our model is evaluated on the THUMOS'14 dataset. Compared with other classic methods that do not use optical flow features, our network surpasses most of them. Specifically, when the intersection-over-union threshold is set to 0.5, the mean average precision (mAP) of 3D-FPCN reaches 37.4%. Compared with the classic two-stage region convolutional 3D network (R-C3D), the mAP of our method is increased by 8.5 percentage points. Comparing the detection precision on different classes of human action segments at an intersection-over-union threshold of 0.5 shows that the detection results of 3D-FPCN on short-duration action segments are greatly improved relative to other methods. For example, the detection accuracy of 3D-FPCN on basketball dunk and cliff diving is 10% higher than that of R-C3D, a comparable two-stage method, and its detection accuracy on pole vault is approximately 40% higher than that of the multi-stage segment convolutional neural network (SCNN). This finding demonstrates the improvement of our model in detecting short-duration human action segments. An ablation study of the feature pyramid extraction network is also conducted to explore the effect of this structure on the model. When the feature pyramid structure is removed from the network, the detection accuracy drops by approximately 2% at an intersection-over-union threshold of 0.5. When the multilevel feature maps generated by the feature pyramid structure are used only in the first stage of the network, that is, the proposal generation stage, the detection accuracy is only 0.2% higher than that of the model with the feature pyramid structure removed. This finding shows that the feature pyramid hierarchy can effectively enhance the detection of actions with different durations and that it works mainly in the second stage of the network, that is, the region-of-interest pooling stage.
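For reference, the intersection-over-union between two temporal segments used in this evaluation protocol can be computed as in the sketch below; this is the standard definition, not code taken from the paper.

```python
# Standard temporal IoU between two (start, end) segments, as used when
# comparing detections against a threshold such as 0.5.
def temporal_iou(seg_a, seg_b):
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    inter = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - inter
    return inter / union if union > 0 else 0.0

# A predicted segment [10.0, 22.0] vs. a ground-truth segment [12.0, 24.0]:
print(temporal_iou((10.0, 22.0), (12.0, 24.0)))  # 10 / 14 ≈ 0.714
```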
Conclusion
A two-stage temporal action localization network, 3D-FPCN, is proposed based on a 3D feature pyramid feature extraction network. The network takes continuous RGB frames as input and can quickly and effectively detect human action segments in short videos. The superiority of the model is demonstrated through a number of experiments, and the mechanism of the 3D feature pyramid structure within the model is discussed and explored. The 3D feature pyramid structure effectively improves the model's ability to detect short-duration human action segments, but the overall mAP of the model remains low. In future work, the model will be improved, and different feature inputs will be introduced to study temporal action localization further. We hope that our work can inspire other researchers and promote the development of the field.
References
Buch S, Escorcia V, Ghanem B and Niebles J C. 2017a. End-to-end, single-stream temporal action detection in untrimmed videos//Proceedings of the British Machine Vision Conference. London, UK: BMVA Press: #7 [DOI: 10.5244/c.31.93]
Buch S, Escorcia V, Shen C Q, Ghanem B and Niebles J C. 2017b. SST: single-stream temporal action proposals//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2911-2920 [DOI: 10.1109/CVPR.2017.675]
Carreira J and Zisserman A. 2017. Quo vadis, action recognition? A new model and the kinetics dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4724-4733 [DOI: 10.1109/CVPR.2017.502]
Chao Y W, Vijayanarasimhan S, Seybold B, Ross D A, Deng J and Sukthankar R. 2018. Rethinking the faster R-CNN architecture for temporal action localization//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1130-1139 [DOI: 10.1109/CVPR.2018.00124]
Gaidon A, Harchaoui Z and Schmid C. 2011. Actom sequence models for efficient action detection//Proceedings of 2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado, USA: IEEE: 3201-3208 [DOI: 10.1109/CVPR.2011.5995646]
Gao J Y, Yang Z H and Nevatia R. 2017a. Cascaded boundary regression for temporal action detection//Proceedings of the British Machine Vision Conference. London, UK: BMVA Press: 52.1-52.11 [DOI: 10.5244/c.31.52]
Gao J Y, Yang Z H, Sun C, Chen K and Nevatia R. 2017b. TURN TAP: temporal unit regression network for temporal action proposals//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 3628-3636 [DOI: 10.1109/ICCV.2017.392]
Jiang Y G, Liu J G, Zamir A R, Toderici G, Laptev I, Shah M and Sukthankar R. 2014. THUMOS challenge: action recognition with a large number of classes [EB/OL]. [2020-10-29]. http://crcv.ucf.edu/THUMOS14/
Lin T W, Liu X, Li X, Ding E R and Wen S L. 2019. BMN: boundary-matching network for temporal action proposal generation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 3889-3898 [DOI: 10.1109/ICCV.2019.00399]
Long F C, Yao T, Qiu Z F, Tian X M, Luo J B and Mei T. 2019. Gaussian temporal awareness networks for action localization//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 344-353 [DOI: 10.1109/CVPR.2019.00043]
Qiu H N, Zheng Y B, Ye H, Lu Y, Wang F and He L. 2018. Precise temporal action localization by evolving temporal proposals//Proceedings of the 2018 ACM International Conference on Multimedia Retrieval. Yokohama, Japan: ACM: 388-396 [DOI: 10.1145/3206025.3206029]
Shou Z, Chan J, Zareian A, Miyazawa K and Chang S F. 2017. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5734-5743 [DOI: 10.1109/CVPR.2017.155]
Shou Z, Wang D A and Chang S F. 2016. Temporal action localization in untrimmed videos via multi-stage CNNs//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1049-1058 [DOI: 10.1109/CVPR.2016.119]
Tran D, Bourdev L, Fergus R, Torresani L and Paluri M. 2015. Learning spatiotemporal features with 3D convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 4489-4497 [DOI: 10.1109/ICCV.2015.510]
Xu H J, Das A and Saenko K. 2017. R-C3D: region convolutional 3D network for temporal activity detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5783-5792 [DOI: 10.1109/ICCV.2017.617]
Yuan Z H, Stroud J C, Lu T and Deng J. 2017. Temporal action localization by structured maximal sums//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3684-3692 [DOI: 10.1109/CVPR.2017.342]
Zeng R H, Huang W B, Gan C, Tan M K, Rong Y, Zhao P L and Huang J Z. 2019. Graph convolutional networks for temporal action localization//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 7094-7103 [DOI: 10.1109/ICCV.2019.00719]
Zhao Y, Xiong Y J, Wang L M, Wu Z R, Tang X O and Lin D H. 2017. Temporal action detection with structured segment networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2914-2923 [DOI: 10.1109/ICCV.2017.317]