Temporal proposal optimization for temporal action detection
2020, Vol. 25, No. 7, Pages: 1447-1458
Received: 2019-08-29
Revised: 2019-12-14
Accepted: 2019-12-21
Published in print: 2020-07-16
DOI: 10.11834/jig.190440
Objective
Temporal action detection is a popular topic in computer vision. It aims to detect the specific intervals in which actions occur in a video and to determine the categories of those actions, and it has far-reaching practical significance in daily life. Quickly localizing actions in long videos and performing temporal action detection remain challenging. To this end, this paper focuses on localizing and optimizing the candidate set of temporal regions in which actions occur, and proposes TPO (temporal proposal optimization), a temporal action detection method based on temporal proposal optimization.
Method
A convolutional neural network (CNN) and a bidirectional long short-term memory network (BLSTM) are adopted to capture the local temporal correlations and the global temporal information of a video. Connectionist temporal classification (CTC) optimization is then introduced to evaluate the boundary probability and the actionness probability score at each temporal position. Finally, the two probability score curves are fused to optimize and rank the temporal proposals, which yields the final temporal action detection.
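The paper itself contains no code; the following PyTorch sketch only illustrates the kind of backbone this description implies, with a temporal CNN branch for local cues and a BLSTM branch supervised by a CTC loss for global cues. The class name `TPOBackbone`, all layer sizes, the class count, and the loss wiring are our assumptions, not the authors' implementation.

```python
# Minimal sketch (assumption, not the authors' code): a temporal CNN branch
# for local boundary/actionness cues and a BLSTM branch trained with CTC for
# global sequence-level supervision. Layer sizes are illustrative.
import torch
import torch.nn as nn

class TPOBackbone(nn.Module):
    def __init__(self, feat_dim=400, hidden=256, num_classes=201):
        # num_classes = 200 ActivityNet classes + 1 CTC blank (assumption)
        super().__init__()
        # Local branch: temporal convolutions over per-snippet features,
        # predicting starting / ending / actionness probabilities per step.
        self.local = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, 3, kernel_size=1),  # start, end, actionness
            nn.Sigmoid(),
        )
        # Global branch: BLSTM over the whole feature sequence.
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.global_head = nn.Linear(2 * hidden, num_classes)
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, x):                          # x: (B, T, feat_dim)
        local = self.local(x.transpose(1, 2))      # (B, 3, T) probabilities
        h, _ = self.blstm(x)                       # (B, T, 2*hidden)
        logits = self.global_head(h)               # (B, T, num_classes)
        return local, logits

    def ctc_loss(self, logits, targets, in_lens, tgt_lens):
        # nn.CTCLoss expects (T, B, C) log-probabilities.
        log_probs = logits.log_softmax(-1).transpose(0, 1)
        return self.ctc(log_probs, targets, in_lens, tgt_lens)
```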
Result
Experiments on the ActivityNet v1.3 dataset show that TPO reaches 74.66, 66.32, and 30.5 on the evaluation metrics AR@100 (average recall at 100 proposals), AUC (area under the curve), and average mAP (mean average precision), respectively, and its mAP@IoU (mAP at a given intersection-over-union threshold) reaches 30.73 and 8.22 at thresholds of 0.75 and 0.95. Compared with SSN (structured segment network), TCN (temporal context network), Prop-SSAD (single shot action detector for proposal), CTAP (complementary temporal action proposal), and BSN (boundary sensitive network), TPO improves on all of these performance metrics.
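For readers unfamiliar with these metrics, here is a minimal sketch of the temporal IoU on which AR@100, AUC, and mAP@IoU are all built; the helper names and the simplified top-k matching are ours, not the benchmark's official evaluation toolkit.

```python
# Sketch of the temporal IoU underlying AR@100 / AUC / mAP@IoU; names and
# the simplified matching rule are our assumptions, not the official code.
import numpy as np

def tiou(prop, gt):
    """Temporal IoU between two [start, end] intervals."""
    inter = max(0.0, min(prop[1], gt[1]) - max(prop[0], gt[0]))
    union = (prop[1] - prop[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(proposals, gts, k=100, thresh=0.5):
    """Fraction of ground-truth segments covered by the top-k proposals."""
    top = proposals[:k]          # assumed already sorted by confidence
    hit = sum(1 for gt in gts if any(tiou(p, gt) >= thresh for p in top))
    return hit / len(gts) if gts else 0.0

# AR@100 averages recall_at_k(k=100) over the tIoU thresholds below;
# AUC is the area under the AR curve as k varies.
thresholds = np.arange(0.5, 1.0, 0.05)
```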
Conclusion
The proposed model takes both the global and the local temporal information of a video into account, which makes the boundaries of the predicted candidate action regions more accurate and flexible. The experiments also verify that accurate proposals can effectively improve the precision of temporal action detection.
Objective
With the ubiquity of electronic equipment such as cellphones and cameras, massive video data of people's activities and behaviors in daily life are recorded, stored, and transmitted. Increasing numbers of video-based applications, such as video surveillance, have attracted the attention of researchers. However, real-world videos are typically long and untrimmed. Long untrimmed videos in publicly available datasets for temporal action detection usually contain several ambiguous frames and a large number of background frames, which makes accurately locating action proposals and recognizing action labels difficult. Similar to object proposal generation in the object detection task, the task of temporal action detection can be decomposed into two phases: the first determines the specific durations (starting and ending timestamps) of actions, and the second identifies the category of each action instance. The development of single-action classification in trimmed videos has been extremely successful, whereas the performance of temporal action proposal generation remains unsatisfactory. The candidate proposal generation phase involves time-consuming model training, and high-quality proposals contribute directly to the performance of action detection. The study of temporal proposal generation can therefore effectively and efficiently locate video content and facilitate video understanding in untrimmed videos. In this work, we focus on the optimization of temporal action proposals for action detection.
Method
We aim to improve the performance of action detection by optimizing temporal action proposals, that is, by accurately localizing the boundaries of actions in long untrimmed videos. We propose a temporal proposal optimization (TPO) model for the detection of candidate action proposals. TPO leverages the advantages of convolutional neural networks (CNNs) and bidirectional long short-term memory (BLSTM) to simultaneously capture local and global temporal cues. In the proposed TPO model, we introduce connectionist temporal classification (CTC) optimization, which excels at parsing global feature-level classification labels. The global actionness probability calculated by BLSTM and CTC corrects several inexact temporal cues in the local CNN actionness probability. Thus, a probability fusion strategy based on local and global actionness probabilities promotes the accuracy of the temporal boundaries of actions in videos and results in promising temporal action detection performance.
Specifically, TPO is composed of three modules: a local actionness evaluation module (LAEM), a global actionness evaluation module (GAEM), and a post-processing module (PPM). The extracted features are fed into LAEM and GAEM, which generate the local and global actionness probabilities along the temporal dimension, respectively. LAEM is a temporal CNN-based module that outputs three sequences: starting probabilities, ending probabilities, and local actionness probabilities; the crossing of the starting and ending probability curves builds the candidate temporal proposals. GAEM predicts the global actionness probabilities with the help of BLSTM and CTC losses and serves as an auxiliary to LAEM. The local and global actionness probabilities are then fed into PPM to obtain a fused actionness probability curve.
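A minimal sketch of these two steps under stated assumptions: the text says the two curves are fused but does not pin the rule down here, so the weighted average (`alpha`) and the threshold-based start/end pairing (a BSN-style rule) below are illustrative, not the paper's exact procedure.

```python
# Sketch of the PPM fusion step and of candidate generation from the
# start/end curves. Both rules are assumptions chosen for illustration.
import numpy as np

def fuse_actionness(local_prob, global_prob, alpha=0.5):
    """Weighted fusion of the per-timestep local (CNN) and global
    (BLSTM+CTC) actionness probability curves."""
    local_prob = np.asarray(local_prob, dtype=float)
    global_prob = np.asarray(global_prob, dtype=float)
    return alpha * local_prob + (1.0 - alpha) * global_prob

def candidate_proposals(start_prob, end_prob, thresh=0.5):
    """Pair confident start locations with later confident end locations
    (a BSN-style pairing rule standing in for the curve crossing)."""
    starts = [i for i, p in enumerate(start_prob) if p >= thresh]
    ends = [j for j, p in enumerate(end_prob) if p >= thresh]
    return [(s, e) for s in starts for e in ends if e > s]
```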
Subsequently, we sample the fused actionness probability curve through linear interpolation to extract proposal-level features, which are fed into a multilayer perceptron to obtain a confidence score. We use the confidence score to rank the candidate proposals and adopt soft-NMS (soft non-maximum suppression) to remove redundant proposals. Finally, we apply an existing classification model to our generated proposals to evaluate the detection performance of TPO.
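To make the post-processing concrete, the sketch below shows the interpolation-based sampling and Gaussian soft-NMS in the spirit of Bodla et al. (2017), which the paper adopts; the decay form, `sigma`, and helper names are illustrative assumptions, and the learned MLP confidence is replaced by whatever score the caller supplies.

```python
# Sketch of PPM post-processing: linear-interpolation sampling of the fused
# curve as a proposal-level feature, then Gaussian soft-NMS over intervals.
import numpy as np

def _tiou(a, b):
    """Temporal IoU between two [start, end] intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def proposal_feature(curve, start, end, n_points=16):
    """Sample the fused actionness curve inside [start, end] by linear
    interpolation, giving a fixed-length proposal-level feature (the paper
    feeds such features to a multilayer perceptron for the final score)."""
    xs = np.linspace(start, end, n_points)
    return np.interp(xs, np.arange(len(curve)), curve)

def soft_nms(proposals, scores, sigma=0.5, min_score=1e-3):
    """Gaussian soft-NMS: decay the scores of overlapping proposals
    instead of discarding them outright (Bodla et al., 2017)."""
    proposals = [tuple(p) for p in proposals]
    scores = list(scores)
    keep = []
    while proposals:
        best = int(np.argmax(scores))
        p, s = proposals.pop(best), scores.pop(best)
        keep.append((p, s))
        scores = [sc * np.exp(-_tiou(p, q) ** 2 / sigma)
                  for sc, q in zip(scores, proposals)]
        kept = [(q, sc) for q, sc in zip(proposals, scores) if sc > min_score]
        proposals = [q for q, _ in kept]
        scores = [sc for _, sc in kept]
    return keep
```

As a crude stand-in for the learned confidence, one could rank by `proposal_feature(curve, s, e).mean()`; in the actual model this score comes from the MLP described above.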
Result
We validate the proposed model on two evaluations: action proposal generation and action detection. Experimental results indicate that TPO outperforms other state-of-the-art methods on the ActivityNet v1.3 dataset. For proposal generation, we compare our model with SSN (structured segment network), TCN (temporal context network), Prop-SSAD (single shot action detector for proposal), CTAP (complementary temporal action proposal), and BSN (boundary sensitive network). The proposed TPO model performs best, achieving an average recall at 100 proposals (AR@100) of 74.66 and an area under the curve (AUC) of 66.32. For the temporal action detection task, we report the quantitative evaluation metric mean average precision at a given temporal intersection over union (mAP@IoU). Compared with existing methods, including SCC (semantic context cascade), CDC (convolutional-de-convolutional), SSN, and BSN, TPO achieves the best mAPs of 30.73 and 8.22 under tIoUs of 0.75 and 0.95, respectively, and obtains the best average mAP of 30.5. Notably, the mAP value decreases as the tIoU value increases. The tIoU metric reflects the overlap between the generated proposals and the ground truth, where a high tIoU value imposes strict constraints on candidate proposals. Thus, the fact that TPO achieves the best mAP performance under high tIoU values (0.75 and 0.95) validates its detection performance: TPO generates accurate proposals of action instances with high overlap with the ground truth and thereby improves detection performance.
Conclusion
In this paper, we propose a novel model called TPO for temporal proposal generation, which achieves promising performance on ActivityNet v1.3 in resolving the action detection problem. Experimental results demonstrate the effectiveness of TPO: it generates temporal proposals with precise boundaries while maintaining flexible temporal durations, thereby covering sequential actions in videos with variable-length intervals.
Bodla N, Singh B, Chellappa R and Davis L S. 2017. Soft-NMS: improving object detection with one line of code//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5562-5570 [DOI: 10.1109/ICCV.2017.593]
Cao S Y, Liu Y H and Li X Z. 2017. Vehicle detection method based on Fast R-CNN. Journal of Image and Graphics, 22(5): 671-677 [DOI: 10.11834/jig.160600]
Cui R P, Liu H and Zhang C H. 2017. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1610-1618 [DOI: 10.1109/CVPR.2017.175]
Dai X Y, Singh B, Zhang G Y, Davis L S and Chen Y Q. 2017. Temporal context network for activity localization in videos//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5727-5736 [DOI: 10.1109/ICCV.2017.610]
Gao J Y, Chen K and Nevatia R. 2018. CTAP: complementary temporal action proposal generation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 70-85 [DOI: 10.1007/978-3-030-01216-8_5]
Heilbron F C, Barrios W, Escorcia V and Ghanem B. 2017. SCC: semantic context cascade for efficient action detection//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3175-3184 [DOI: 10.1109/CVPR.2017.338]
Heilbron F C, Escorcia V, Ghanem B and Niebles J C. 2015. ActivityNet: a large-scale video benchmark for human activity understanding//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 961-970 [DOI: 10.1109/CVPR.2015.7298698]
Huang S, Wang W Q, He S F and Lau R W H. 2018. Egocentric temporal action proposals. IEEE Transactions on Image Processing, 27(2): 764-777 [DOI: 10.1109/TIP.2017.2772904]
Lin T W, Zhao X and Shou Z. 2017. Temporal convolution based action proposal: submission to ActivityNet 2017 [EB/OL]. [2019-08-14]. https://arxiv.org/pdf/1707.06750.pdf
Lin T W, Zhao X, Su H S, Wang C J and Yang M. 2018. BSN: boundary sensitive network for temporal action proposal generation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 3-21 [DOI: 10.1007/978-3-030-01225-0_1]
Luo H L, Lai Z Y and Kong F S. 2017. Action recognition in videos based on action segmentation and manifold metric learning. Journal of Image and Graphics, 22(8): 1106-1119 [DOI: 10.11834/jig.170032]
Luo H L, Tong K and Kong F S. 2019. The progress of human action recognition in videos based on deep learning: a review. Acta Electronica Sinica, 47(5): 1162-1173 [DOI: 10.3969/j.issn.0372-2112.2019.05.025]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149 [DOI: 10.1109/TPAMI.2016.2577031]
Shou Z, Chan J, Zareian A, Miyazawa K and Chang S F. 2017. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1417-1426 [DOI: 10.1109/CVPR.2017.155]
Shou Z, Wang D G and Chang S F. 2016. Temporal action localization in untrimmed videos via multi-stage CNNs//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1049-1058 [DOI: 10.1109/CVPR.2016.119]
Simonyan K and Zisserman A. 2014. Two-stream convolutional networks for action recognition in videos//Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge: MIT Press: 568-576
Singh B, Marks T K, Jones M, Tuzel O and Shao M. 2016. A multi-stream bi-directional recurrent neural network for fine-grained action detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1961-1970 [DOI: 10.1109/CVPR.2016.216]
Song S J, Lan C L, Xing J L, Zeng W J and Liu J Y. 2018. Spatio-temporal attention-based LSTM networks for 3D action recognition and detection. IEEE Transactions on Image Processing, 27(7): 3459-3471 [DOI: 10.1109/TIP.2018.2818328]
Soomro K, Idrees H and Shah M. 2019. Online localization and prediction of actions and interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2): 459-472 [DOI: 10.1109/TPAMI.2018.2797266]
Tran D, Bourdev L, Fergus R, Torresani L and Paluri M. 2015. Learning spatiotemporal features with 3D convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 4489-4497 [DOI: 10.1109/ICCV.2015.510]
Tran D, Wang H, Torresani L, Ray J, LeCun Y and Paluri M. 2018. A closer look at spatiotemporal convolutions for action recognition//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 6450-6459 [DOI: 10.1109/CVPR.2018.00675]
Tu Z G, Li H Y, Zhang D J, Dauwels J, Li B X and Yuan J S. 2019. Action-stage emphasized spatiotemporal VLAD for video action recognition. IEEE Transactions on Image Processing, 28(6): 2799-2812 [DOI: 10.1109/TIP.2018.2890749]
Xiong Y J, Wang L M, Wang Z, Zhang B W, Song H, Li W, Lin D H, Qiao Y, van Gool L and Tang X O. 2016. CUHK and ETHZ and SIAT submission to ActivityNet challenge 2016 [EB/OL]. [2019-08-14]. https://arxiv.org/pdf/1608.00797.pdf
Xu B H, Ye H, Zheng Y B, Wang H, Luwang T Y and Jiang Y G. 2019. Dense dilated network for video action recognition. IEEE Transactions on Image Processing, 28(10): 4941-4953 [DOI: 10.1109/TIP.2019.2917283]
Xu H J, Das A and Saenko K. 2017. R-C3D: region convolutional 3D network for temporal activity detection//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 5794-5803 [DOI: 10.1109/ICCV.2017.617]
Yang Y, Zhou J, Ai J B, Hanjalic A, Shen H T and Ji Y L. 2018. Video captioning by adversarial LSTM. IEEE Transactions on Image Processing, 27(11): 5600-5611 [DOI: 10.1109/TIP.2018.2855422]
Zhao Y, Xiong Y J, Wang L M, Wu Z R, Tang X O and Lin D H. 2017. Temporal action detection with structured segment networks//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 2933-2942 [DOI: 10.1109/ICCV.2017.317]
Zhu H Y, Vial R, Lu S J, Peng X, Fu H Z, Tian Y H and Cao X B. 2018. YoTube: searching action proposal via recurrent and static regression networks. IEEE Transactions on Image Processing, 27(6): 2609-2622 [DOI: 10.1109/TIP.2018.2806279]