Temporal action detection based on feature pyramid hierarchies
Vol. 26, Issue 7, Pages 1637-1647 (2021)
Received: 24 August 2020
Revised: 28 December 2020
Accepted: 04 January 2021
Published: 16 July 2021
DOI: 10.11834/jig.200495
Objective
Temporal action detection is one of the most important tasks in video understanding. It requires simultaneously classifying and regressing the action segments in a video, and a video often contains action segments of different temporal lengths; detecting short-duration segments is especially difficult. To address the detection of short-duration action segments, this paper builds a 3D feature pyramid hierarchy to enhance the network's ability to detect action segments of different durations and proposes a new two-stage network in which a proposal network is followed by a classifier.
Method
The network takes consecutive RGB frames as input and produces feature maps of different resolutions and abstraction levels through the feature pyramid structure. These multilevel feature maps mainly come into play in the last two stages of the network: 1) in the proposal stage, combined with the anchor mechanism, anchor segments of different temporal lengths are given correspondingly sized receptive fields, so the initial prediction of the anchor segments is more accurate; 2) in the region-of-interest pooling stage, each proposal segment is mapped to the feature map of the corresponding level for prediction, which balances the demands of classification and regression on the abstraction and resolution of the feature maps.
Result
The model is evaluated on the THUMOS Challenge 2014 dataset. Compared with other typical methods that do not use optical flow features, our model exceeds them by more than 3% at various intersection-over-union thresholds, and in the per-class comparison the detection accuracy on short-duration action segments is generally improved. In the ablation study, at an intersection-over-union threshold of 0.5, the network with the feature pyramid structure outperforms the model with an ordinary feature extraction network by 1.8%.
Conclusion
The proposed two-stage temporal action detection model based on a 3D feature pyramid feature extraction structure can effectively improve the detection accuracy of short-duration action segments.
Objective
Temporal action localization is one of the most important tasks in video understanding and has great application prospects in practice. With the rise of various online video applications, the number of short videos on the Internet has increased sharply, and many of them contain different human behaviors. A model that can automatically locate and classify human action segments in videos is therefore needed to detect and distinguish human behavior in short videos quickly and efficiently. Moreover, public security departments need real-time human behavior detection systems to help monitor and provide early warning of public safety incidents. In the task of temporal action localization, the human action segments in a video must be classified and regressed simultaneously, and accurately locating the boundaries of action segments is more difficult than classifying known segments. A video often contains action segments of different temporal lengths, and detecting segments with a short duration is especially difficult because a short-duration segment is easily ignored by the detection model or regarded as part of a nearby, longer-duration segment. Existing methods have made various attempts to improve the detection accuracy on human action segments of different durations. In this paper, a 3D feature pyramid hierarchy is proposed to enhance the network's ability to detect action segments of different temporal durations.
Method
A new two-stage network named the 3D feature pyramid convolutional network (3D-FPCN), in which a proposal network is followed by a classifier, is proposed. In 3D-FPCN, features are extracted by a purpose-built 3D feature pyramid extraction network with a bottom-up pathway and a top-down pathway. The bottom-up pathway encodes the temporal and spatial characteristics of consecutive input frames through a series of 3D convolutional layers to obtain highly abstract feature maps. The top-down pathway uses a series of deconvolutional layers and lateral connections to fuse highly abstract features with high-resolution features and obtain the lower-level pyramid feature maps. Through this feature pyramid extraction network, multilevel feature maps with different abstraction levels and resolutions are obtained: highly abstract feature maps are used for the classification and regression of long-duration human action segments, and high-resolution feature maps are used for short-duration segments, which effectively improves the network's detection of human action segments of different durations.
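To make the pyramid structure concrete, the following is a minimal PyTorch-style sketch of a 3D feature pyramid with a bottom-up pathway and a top-down pathway with lateral connections. It is illustrative only: the layer counts, channel widths, and the use of trilinear upsampling in place of deconvolution are assumptions, not the exact 3D-FPCN design.

```python
# Illustrative sketch of a 3D feature pyramid; layer counts, channel widths,
# and trilinear upsampling (instead of deconvolution) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid3D(nn.Module):
    def __init__(self, in_channels=3, widths=(64, 128, 256), out_channels=256):
        super().__init__()
        # Bottom-up pathway: each stage halves the temporal and spatial
        # resolution with a strided 3D convolution and raises the abstraction.
        blocks, prev = [], in_channels
        for w in widths:
            blocks.append(nn.Sequential(
                nn.Conv3d(prev, w, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            prev = w
        self.bottom_up = nn.ModuleList(blocks)
        # Lateral 1x1x1 convolutions project each stage to a common width.
        self.lateral = nn.ModuleList(
            [nn.Conv3d(w, out_channels, kernel_size=1) for w in widths])
        # 3x3x3 convolutions smooth each fused pyramid level.
        self.smooth = nn.ModuleList(
            [nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in widths])

    def forward(self, x):
        # x: (batch, 3, T, H, W) clip of consecutive RGB frames.
        stages = []
        for block in self.bottom_up:
            x = block(x)
            stages.append(x)
        # Top-down pathway: start from the most abstract map, upsample it, and
        # fuse it with the next higher-resolution stage via a lateral connection.
        pyramid = [self.lateral[-1](stages[-1])]
        for i in range(len(stages) - 2, -1, -1):
            up = F.interpolate(pyramid[0], size=stages[i].shape[2:],
                               mode='trilinear', align_corners=False)
            pyramid.insert(0, self.lateral[i](stages[i]) + up)
        # pyramid[0] is the highest-resolution level; pyramid[-1] the most abstract.
        return [s(p) for s, p in zip(self.smooth, pyramid)]

# Example: a 16-frame 112x112 RGB clip yields three pyramid levels.
levels = FeaturePyramid3D()(torch.randn(1, 3, 16, 112, 112))
print([tuple(l.shape) for l in levels])
```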
The whole network takes RGB frames as input and generates feature maps of different resolutions and abstraction degrees via the feature pyramid structure. These multilevel feature maps mainly play a role in the last two stages of the network. First, the anchor mechanism is used in the proposal stage, so anchor segments of different temporal lengths have correspondingly sized receptive fields, which is equivalent to a receptive field calibration. Second, in the region-of-interest pooling stage, different proposal segments are mapped to the feature maps of the corresponding levels for prediction, which makes the pooled features more targeted and balances the demands of classification and regression on the abstraction and resolution of the feature maps.
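The abstract does not give the exact rule used to map a proposal segment to a pyramid level, so the short sketch below shows only one plausible assignment: the logarithmic length heuristic, the base_length of 16 frames, and the level count are all assumptions for illustration.

```python
# Hypothetical mapping from a proposal's temporal length to a pyramid level;
# the threshold, level count, and log2 heuristic are assumptions, not the
# paper's rule.
import math

def assign_pyramid_level(proposal_length, base_length=16, num_levels=3):
    """Map a proposal's temporal length (in frames) to a pyramid level.

    Level 0 is the highest-resolution map (short proposals); the last level
    is the most abstract map (long proposals).
    """
    level = int(math.floor(math.log2(max(proposal_length, 1) / base_length)))
    return min(max(level, 0), num_levels - 1)

# A 12-frame proposal pools features from the high-resolution level, while a
# 128-frame proposal uses the most abstract level.
print(assign_pyramid_level(12))   # -> 0
print(assign_pyramid_level(128))  # -> 2
```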
Result
Our model is evaluated on the THUMOS'14 dataset. Compared with other classic methods that do not use optical flow features, our network surpasses most of them. Specifically, when the intersection-over-union threshold is set to 0.5, the mean average precision (mAP) of 3D-FPCN reaches 37.4%. Compared with the classic two-stage region convolutional 3D network (R-C3D), the mAP of our method is increased by 8.5 percentage points. Comparing the detection precision on different classes of human action segments at an intersection-over-union threshold of 0.5 shows that the detection results of 3D-FPCN on short-duration action segments are greatly improved relative to other methods. For example, the detection accuracy of 3D-FPCN on basketball dunk and cliff diving is 10% higher than that of R-C3D, a comparable two-stage method, and its detection accuracy on pole vault is approximately 40% higher than that of the multi-stage segment convolutional neural network (SCNN). This finding demonstrates the improvement of our model in detecting short-duration human action segments. An ablation study of the feature pyramid extraction network is also conducted to explore the effect of this structure on the model. When the feature pyramid structure is removed from the network, the detection accuracy drops by approximately 2% at an intersection-over-union threshold of 0.5. When the multilevel feature maps generated by the feature pyramid structure are used only in the first stage of the network, that is, the proposal generation stage, the detection accuracy is only 0.2% higher than that of the model with the feature pyramid structure removed. This finding shows that the feature pyramid hierarchy can effectively enhance the detection of actions with different durations and that it works mainly in the second stage of the network, that is, the region-of-interest pooling stage.
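For reference, the intersection-over-union between two temporal segments used in this evaluation protocol can be computed as in the sketch below; this is the standard definition, not code taken from the paper.

```python
# Standard temporal IoU between two (start, end) segments, as used when
# comparing detections against a threshold such as 0.5.
def temporal_iou(seg_a, seg_b):
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    inter = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - inter
    return inter / union if union > 0 else 0.0

# A predicted segment [10.0, 22.0] vs. a ground-truth segment [12.0, 24.0]:
print(temporal_iou((10.0, 22.0), (12.0, 24.0)))  # 10 / 14 ≈ 0.714
```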
Conclusion
A two-stage temporal action localization network, 3D-FPCN, is proposed based on a 3D feature pyramid feature extraction network. The network takes continuous RGB frames as input and can quickly and effectively detect human action segments in short videos. The superiority of the model is demonstrated through a number of experiments, and the mechanism of the 3D feature pyramid structure within the model is discussed and explored. The 3D feature pyramid structure effectively improves the model's ability to detect short-duration human action segments, but the overall mAP of the model remains low. In future work, the model will be improved, and different feature inputs will be introduced to study temporal action localization further. We hope that our work can inspire other researchers and promote the development of the field.
References
Buch S, Escorcia V, Ghanem B and Niebles J C. 2017a. End-to-end, single-stream temporal action detection in untrimmed videos//Proceedings of the British Machine Vision Conference. London, UK: BMVA Press: #7 [DOI: 10.5244/c.31.93]
Buch S, Escorcia V, Shen C Q, Ghanem B and Niebles J C. 2017b. SST: single-stream temporal action proposals//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2911-2920 [DOI: 10.1109/CVPR.2017.675]
Carreira J and Zisserman A. 2017. Quo vadis, action recognition? A new model and the kinetics dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4724-4733 [DOI: 10.1109/CVPR.2017.502]
Chao Y W, Vijayanarasimhan S, Seybold B, Ross D A, Deng J and Sukthankar R. 2018. Rethinking the faster R-CNN architecture for temporal action localization//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1130-1139 [DOI: 10.1109/CVPR.2018.00124]
Gaidon A, Harchaoui Z and Schmid C. 2011. Actom sequence models for efficient action detection//Proceedings of 2011 IEEE Conference on Computer Vision and Pattern Recognition. Colorado, USA: IEEE: 3201-3208 [DOI: 10.1109/CVPR.2011.5995646]
Gao J Y, Yang Z H and Nevatia R. 2017a. Cascaded boundary regression for temporal action detection//Proceedings of the British Machine Vision Conference. London, UK: BMVA Press: 52.1-52.11 [DOI: 10.5244/c.31.52]
Gao J Y, Yang Z H, Sun C, Chen K and Nevatia R. 2017b. TURN TAP: temporal unit regression network for temporal action proposals//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 3628-3636 [DOI: 10.1109/ICCV.2017.392]
Jiang Y G, Liu J G, Zamir A R, Toderici G, Laptev I, Shah M and Sukthankar R. 2014. THUMOS challenge: action recognition with a large number of classes [EB/OL]. [2020-10-29]. http://crcv.ucf.edu/THUMOS14/
Lin T W, Liu X, Li X, Ding E R and Wen S L. 2019. BMN: boundary-matching network for temporal action proposal generation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 3889-3898 [DOI: 10.1109/ICCV.2019.00399]
Long F C, Yao T, Qiu Z F, Tian X M, Luo J B and Mei T. 2019. Gaussian temporal awareness networks for action localization//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 344-353 [DOI: 10.1109/CVPR.2019.00043]
Qiu H N, Zheng Y B, Ye H, Lu Y, Wang F and He L. 2018. Precise temporal action localization by evolving temporal proposals//Proceedings of the 2018 ACM International Conference on Multimedia Retrieval. Yokohama, Japan: ACM: 388-396 [DOI: 10.1145/3206025.3206029]
Shou Z, Chan J, Zareian A, Miyazawa K and Chang S F. 2017. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 5734-5743 [DOI: 10.1109/CVPR.2017.155]
Shou Z, Wang D A and Chang S F. 2016. Temporal action localization in untrimmed videos via multi-stage CNNs//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1049-1058 [DOI: 10.1109/CVPR.2016.119]
Tran D, Bourdev L, Fergus R, Torresani L and Paluri M. 2015. Learning spatiotemporal features with 3D convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 4489-4497 [DOI: 10.1109/ICCV.2015.510]
Xu H J, Das A and Saenko K. 2017. R-C3D: region convolutional 3D network for temporal activity detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5783-5792 [DOI: 10.1109/ICCV.2017.617]
Yuan Z H, Stroud J C, Lu T and Deng J. 2017. Temporal action localization by structured maximal sums//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3684-3692 [DOI: 10.1109/CVPR.2017.342]
Zeng R H, Huang W B, Gan C, Tan M K, Rong Y, Zhao P L and Huang J Z. 2019. Graph convolutional networks for temporal action localization//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 7094-7103 [DOI: 10.1109/ICCV.2019.00719]
Zhao Y, Xiong Y J, Wang L M, Wu Z R, Tang X O and Lin D H. 2017. Temporal action detection with structured segment networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2914-2923 [DOI: 10.1109/ICCV.2017.317]