Class-aware network with global temporal relations for video action detection
2022, Vol. 27, No. 12, Pages: 3566-3580
Received: 2021-11-17
Revised: 2022-01-05
Accepted: 2022-01-12
Published in print: 2022-12-16
DOI: 10.11834/jig.211096
Objective
Video action detection is a key problem in video understanding. The task aims to locate the start and end times of action instances in a video and to predict their action classes. Two aspects are crucial for detection: recognizing action patterns and building temporal relations within the video. Mainstream methods typically pursue a single universal detector for all action classes, ignoring the large differences in action patterns across classes, which limits detection accuracy. In addition, modeling temporal relations inside the video is essential for accuracy; graph convolution is commonly used for global temporal modeling, but it is computationally expensive. To address these shortcomings, this paper proposes a class-wise detection scheme for action instances and, with the help of gated recurrent units (GRUs), builds global temporal relations within the video at a low computational cost.
Method
For action pattern recognition, the video is first coarsely classified; a multi-branch class-wise detection mechanism then handles each action class specifically. Action boundaries are located by recognizing boundary patterns in local video features, and the probability that an anchor contains a complete action is estimated by recognizing action patterns. For temporal modeling, a concise and effective temporal relation module is built, in which gated recurrent units establish global temporal relations between the current moment and both past and future moments. These designs are integrated into a class-aware video action detection method with global temporal relations, as sketched below.
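To make the class-wise routing concrete, the following is a minimal PyTorch sketch of the multi-branch idea: a coarse video-level label selects a class-specific branch, while a universal branch adds complementary boundary and completeness predictions. Module names, layer sizes, and the averaging fusion are illustrative assumptions rather than the exact architecture used in the paper.

    # Minimal sketch of a class-aware, multi-branch detection head (assumed design).
    import torch
    import torch.nn as nn

    class ActionBranch(nn.Module):
        """One detection branch specialized for a single (coarse) action class."""
        def __init__(self, feat_dim: int):
            super().__init__()
            # Per-frame start/end boundary probabilities and an actionness score
            # share a small temporal convolution stack.
            self.conv = nn.Sequential(
                nn.Conv1d(feat_dim, 256, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(256, 3, kernel_size=1), nn.Sigmoid(),
            )

        def forward(self, x):              # x: (batch, feat_dim, T)
            out = self.conv(x)             # (batch, 3, T): start, end, actionness
            return out[:, 0], out[:, 1], out[:, 2]

    class ClassAwareHead(nn.Module):
        """Routes features to the branch of the predicted coarse class,
        plus a universal branch that supplies complementary predictions."""
        def __init__(self, feat_dim: int, num_classes: int):
            super().__init__()
            self.branches = nn.ModuleList([ActionBranch(feat_dim) for _ in range(num_classes)])
            self.universal = ActionBranch(feat_dim)

        def forward(self, x, class_idx: int):
            s_c, e_c, a_c = self.branches[class_idx](x)   # class-specific branch
            s_u, e_u, a_u = self.universal(x)             # universal branch
            # Fuse the two branches; simple averaging is assumed here.
            return (s_c + s_u) / 2, (e_c + e_u) / 2, (a_c + a_u) / 2

    # Usage: the coarse label would come from a separate video-level classifier.
    features = torch.randn(1, 400, 100)                   # (batch, feat_dim, T)
    head = ClassAwareHead(feat_dim=400, num_classes=20)
    start, end, actionness = head(features, class_idx=3)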
Result
To verify the effectiveness of the method, experiments were conducted on two public benchmarks with several kinds of video features, and the results were compared with other state-of-the-art methods. On ActivityNet-1.3, the method reaches an average mAP (mean average precision) of 35.58% with two-stream features, outperforming other existing methods; on THUMOS-14, it achieves the best performance under several kinds of features. The results show that the class-aware, class-wise detection strategy and the GRU-based temporal modeling effectively improve video action detection accuracy. Moreover, the proposed temporal relation module has a lower computational cost than mainstream models that rely on graph convolution, and it generalizes to other detectors to some extent.
Conclusion
A class-aware video action detection model with global temporal relations is proposed. It performs finer, class-wise action detection and, with a GRU-based temporal relation module, improves the accuracy of video action detection.
Objective
Video-based action understanding has attracted growing attention given the huge number of internet videos. As a significant task in video understanding, temporal action detection (TAD) aims at locating the boundaries of each action instance and classifying its class label in untrimmed videos. Inspired by the success of object detection, a two-stage pipeline dominates the field of TAD: the first stage generates candidate action segments (proposals), which are then labelled with certain classes in the second stage. Overall, the performance of TAD largely depends on two aspects: recognizing action patterns and exploring temporal relations. 1) Current methods usually try to recognize the start and end patterns to locate action boundaries, and the patterns between boundaries contribute to predicting the confidence score of each segment. 2) Richer temporal relations are vital for accurate detection because information in a video is closely related temporally, and a broader receptive field helps the model understand the context and semantic relations of the whole video. However, existing methods have limitations in both aspects. In terms of pattern recognition, almost all methods force the model to cater to all kinds of actions (class-agnostic), which means that a universal pattern has to be summarized to locate every action's start, end, and actionness. This is challenging because patterns vary dramatically across action classes. As for temporal relations, graph convolutional networks have recently prevailed for modeling temporal relations in video, but they are computationally costly.
Method
We develop a class-aware network (CAN) with global temporal relations to tackle these two problems. There are two crucial designs in CAN. 1) Different action classes should be treated differently, so that the model can recognize the patterns of various classes unambiguously. A class-aware mechanism (CAM) is therefore embedded into the detection pipeline. It includes several action branches and a universal branch: each action branch takes charge of one specific class, and the universal branch supplies complementary information for more accurate detection. After a video-level classifier produces a sketchy, general action label for the raw video, the corresponding action branch in CAM is activated to generate predictions. 2) A gated recurrent unit (GRU)-assisted ternary basenet (TB) is designed to explore temporal relations more effectively. Considering that the whole video feature sequence is accessible in the offline TAD task, by changing the input order of the features, the GRU can not only memorize past features but also gather future information. In TB, the temporal features from both directions are combined simultaneously, so the receptive field of the model is not restricted locally but is extended bidirectionally to the past and the future; thus, global temporal relations over the whole video are built in.
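The sketch below illustrates this bidirectional use of GRUs in PyTorch: one GRU reads the feature sequence forward (past to current), another reads it in reverse (future to current), and the two outputs are fused per time step. Hidden sizes and the concatenation-plus-projection fusion are illustrative assumptions, not the exact ternary basenet configuration.

    # Minimal sketch of a GRU-based global temporal relation module (assumed design).
    import torch
    import torch.nn as nn

    class GlobalTemporalRelation(nn.Module):
        def __init__(self, feat_dim: int, hidden: int = 256):
            super().__init__()
            self.forward_gru = nn.GRU(feat_dim, hidden, batch_first=True)
            self.backward_gru = nn.GRU(feat_dim, hidden, batch_first=True)
            self.fuse = nn.Conv1d(2 * hidden, feat_dim, kernel_size=1)

        def forward(self, x):                      # x: (batch, T, feat_dim)
            fwd, _ = self.forward_gru(x)            # accumulates past context
            bwd, _ = self.backward_gru(x.flip(1))   # reversed input -> future context
            bwd = bwd.flip(1)                       # realign to the original time order
            fused = torch.cat([fwd, bwd], dim=-1)   # (batch, T, 2 * hidden)
            return self.fuse(fused.transpose(1, 2)).transpose(1, 2)

    # Usage with a dummy snippet-level feature sequence.
    feats = torch.randn(2, 100, 400)                # (batch, T, feat_dim)
    module = GlobalTemporalRelation(feat_dim=400)
    out = module(feats)                             # (2, 100, 400), with global temporal context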
Result
Our experiments are carried out on two benchmarks: ActivityNet-1.3 and THUMOS-14. 1) THUMOS-14 consists of 200 temporally annotated videos in the validation set and 213 videos in the testing set, covering 20 action categories. 2) ActivityNet-1.3 contains 19 994 temporally annotated videos with 200 action classes; furthermore, the hierarchical structure of all classes is available in the annotations. A comparative analysis has been conducted as well. 1) On THUMOS-14, CAN improves the average mean average precision (mAP) to 54.90%. 2) On ActivityNet-1.3, the average mAP of CAN is 35.58%, higher than both its baseline (33.85%) and the previous best result of 35.52%. Additionally, ablation experiments demonstrate the effectiveness of our method: both the class-aware mechanism and TB contribute to detection accuracy, and TB builds global temporal relations effectively with a lower computational cost than the graph model designed in G-TAD (sub-graph localization for temporal action detection).
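For reference, the sketch below shows how an "average mAP" figure of this kind is typically computed: average precision is evaluated at several temporal IoU (tIoU) thresholds and then averaged. The threshold range, the greedy matching rule, and the simplified per-class handling are assumptions about common practice, not the paper's exact evaluation code.

    # Toy sketch of average-mAP computation over tIoU thresholds (assumed evaluation protocol).
    import numpy as np

    def tiou(seg, gt):
        """Temporal IoU between one predicted segment and an array of ground-truth segments."""
        inter = np.maximum(0.0, np.minimum(seg[1], gt[:, 1]) - np.maximum(seg[0], gt[:, 0]))
        union = (seg[1] - seg[0]) + (gt[:, 1] - gt[:, 0]) - inter
        return inter / np.maximum(union, 1e-8)

    def average_precision(preds, gts, threshold):
        """preds: list of (start, end, score); gts: array of (start, end)."""
        if not preds:
            return 0.0
        preds = sorted(preds, key=lambda p: p[2], reverse=True)
        matched = np.zeros(len(gts), dtype=bool)
        tp = np.zeros(len(preds))
        for i, (s, e, _) in enumerate(preds):
            ious = tiou(np.array([s, e]), gts)
            j = int(np.argmax(ious)) if len(gts) else -1
            if j >= 0 and ious[j] >= threshold and not matched[j]:
                tp[i], matched[j] = 1.0, True
        cum_tp = np.cumsum(tp)
        recall = cum_tp / max(len(gts), 1)
        precision = cum_tp / (np.arange(len(preds)) + 1)
        # Approximate AP: integrate precision over recall increments.
        return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]) + recall[0] * precision[0])

    # Average mAP over tIoU thresholds (per-class averaging omitted for brevity).
    thresholds = np.arange(0.5, 1.0, 0.05)   # ActivityNet-style; THUMOS-14 commonly uses 0.3-0.7
    preds = [(1.0, 5.0, 0.9), (6.0, 9.0, 0.6)]
    gts = np.array([[1.2, 4.8], [6.5, 9.5]])
    avg_map = np.mean([average_precision(preds, gts, t) for t in thresholds])
    print(f"average mAP (toy example): {avg_map:.3f}")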
Conclusion
Our research highlights two key aspects of the temporal action detection task: 1) recognizing action patterns and 2) exploring temporal relations. The class-aware mechanism (CAM) is designed to detect action segments of different classes rationally and accurately. Moreover, TB provides an effective way to explore temporal relations at the frame level. These two designs are integrated into one framework named the class-aware network (CAN) with global temporal relations, which achieves leading results on the two benchmarks.
Alwassel H, Giancola S and Ghanem B. 2021. TSP: temporally-sensitive pretraining of video encoders for localization tasks//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision Workshops. Montreal, Canada: IEEE: 3166-3176 [DOI: 10.1109/iccvw54120.2021.00356]
Bai Y R, Wang Y Y, Tong Y H, Yang Y, Liu Q Y and Liu J H. 2020. Boundary content graph neural network for temporal action proposal generation//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 121-137 [DOI: 10.1007/978-3-030-58604-1_8]
Bodla N, Singh B, Chellappa R and Davis L S. 2017. Soft-NMS—improving object detection with one line of code//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5562-5570 [DOI: 10.1109/iccv.2017.593]
Carreira J and Zisserman A. 2017. Quo vadis, action recognition? A new model and the kinetics dataset//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 4724-4733 [DOI: 10.1109/CVPR.2017.502]
Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing. Doha, Qatar: ACL: 1724-1734 [DOI: 10.3115/v1/D14-1179]
Heilbron F C, Escorcia V, Ghanem B and Niebles J C. 2015. ActivityNet: a large-scale video benchmark for human activity understanding//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 961-970 [DOI: 10.1109/CVPR.2015.7298698]
Idrees H, Zamir A R, Jiang Y G, Gorban A, Laptev I, Sukthankar R and Shah M. 2017. The THUMOS challenge on action recognition for videos "in the wild". Computer Vision and Image Understanding, 155: 1-23 [DOI: 10.1016/j.cviu.2016.10.018]
Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, Suleyman M and Zisserman A. 2017. The kinetics human action video dataset [EB/OL]. [2021-11-17]. https://arxiv.org/pdf/1705.06950.pdf
Kipf T N and Welling M. 2016. Semi-supervised classification with graph convolutional networks//Proceedings of the 5th International Conference on Learning Representations. Toulon, France: ICLR
Lin C M, Li J, Wang Y B, Tai Y, Luo D H, Cui Z P, Wang C J, Li J L, Huang F Y and Ji R R. 2020. Fast learning of temporal action proposal via dense boundary generator//Proceedings of the 34th AAAI Conference on Artificial Intelligence. New York, USA: AAAI: 11499-11506 [DOI: 10.1609/aaai.v34i07.6815]
Lin C M, Xu C M, Luo D H, Wang Y B, Tai Y, Wang C J, Li J L, Huang F Y and Fu Y W. 2021. Learning salient boundary feature for anchor-free temporal action localization//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 3319-3328 [DOI: 10.1109/CVPR46437.2021.00333]
Lin T W. 2019. Temporal Convolutional Network Based Temporal Action Detection. Shanghai: Shanghai Jiao Tong University (林天威. 2019. 基于时序卷积网络的视频动作检测算法. 上海: 上海交通大学) [DOI: 10.27307/d.cnki.gsjtu.2019.002563]
Lin T W, Liu X, Li X, Ding E R and Wen S L. 2019. BMN: boundary-matching network for temporal action proposal generation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 3888-3897 [DOI: 10.1109/iccv.2019.00399]
Lin T W, Zhao X, Su H S, Wang C J and Yang M. 2018. BSN: boundary sensitive network for temporal action proposal generation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-21 [DOI: 10.1007/978-3-030-01225-0_1]
Liu S M, Zhao X, Su H S and Hu Z L. 2020. TSI: temporal scale invariant network for action proposal generation//Proceedings of the 15th Asian Conference on Computer Vision. Kyoto, Japan: Springer: 530-546 [DOI: 10.1007/978-3-030-69541-5_32]
Qin X, Zhao H B, Lin G C, Zeng H, Xu S C and Li X. 2021. PcmNet: position-sensitive context modeling network for temporal action localization [EB/OL]. [2021-03-09]. https://arxiv.org/pdf/2103.05270.pdf
Qing Z W, Su H S, Gan W H, Wang D L, Wu W, Wang X, Qiao Y, Yan J J, Gao C X and Sang N. 2021. Temporal context aggregation network for temporal action proposal refinement//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 485-494 [DOI: 10.1109/CVPR46437.2021.00055]
Su H S, Gan W H, Wu W, Qiao Y and Yan J J. 2020. BSN++: complementary boundary regressor with scale-balanced relation modeling for temporal action proposal generation//Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 2602-2610
Tan J, Tang J Q, Wang L M and Wu G S. 2021. Relaxed transformer decoders for direct action proposal generation//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 13506-13515 [DOI: 10.1109/ICCV48922.2021.01327]
Wang L M, Xiong Y J, Lin D H and Van Gool L. 2017. UntrimmedNets for weakly supervised action recognition and detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6402-6411 [DOI: 10.1109/CVPR.2017.678]
Wang L M, Xiong Y J, Wang Z, Qiao Y, Lin D H, Tang X O and Van Gool L. 2019. Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(11): 2740-2755 [DOI: 10.1109/TPAMI.2018.2868668]
Xiong C X, Guo D and Liu X L. 2020. Temporal proposal optimization for temporal action detection. Journal of Image and Graphics, 25(7): 1447-1458 (熊成鑫, 郭丹, 刘学亮. 2020. 时域候选优化的时序动作检测. 中国图象图形学报, 25(7): 1447-1458) [DOI: 10.11834/jig.190440]
Xiong Y J, Wang L M, Wang Z, Zhang B W, Song H, Li W, Lin D H, Qiao Y, Van Gool L and Tang X O. 2016. CUHK and ETHZ and SIAT submission to ActivityNet challenge 2016 [EB/OL]. [2021-08-02]. https://arxiv.org/pdf/1608.00797.pdf
Xu M M, Zhao C, Rojas D S, Thabet A and Ghanem B. 2020. G-TAD: sub-graph localization for temporal action detection//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10153-10162 [DOI: 10.1109/CVPR42600.2020.01017]
Zeng R H, Huang W B, Gan C, Tan M K, Rong Y, Zhao P L and Huang J Z. 2019. Graph convolutional networks for temporal action localization//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 7093-7102 [DOI: 10.1109/ICCV.2019.00719]
Zhao P S, Xie L X, Ju C, Zhang Y, Wang Y F and Tian Q. 2020. Bottom-up temporal action localization with mutual regularization//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 539-565 [DOI: 10.1007/978-3-030-58598-3_32]
Zhao Y, Xiong Y J, Wang L M, Wu Z R, Tang X O and Lin D H. 2017a. Temporal action detection with structured segment networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2933-2942 [DOI: 10.1109/ICCV.2017.317]
Zhao Y, Zhang B W, Wu Z R, Yang S, Zhou L, Yan S J, Wang L M, Xiong Y J, Lin D H, Qiao Y and Tang X O. 2017b. CUHK and ETHZ and SIAT submission to ActivityNet challenge 2017 [EB/OL]. [2021-10-08]. https://arxiv.org/pdf/1710.08011.pdf