Multi-feature fusion behavior recognition model
2020, Vol. 25, No. 12, pp. 2541-2552
Received: 2019-12-07; Revised: 2020-04-03; Accepted: 2020-04-10; Published in print: 2020-12-16
DOI: 10.11834/jig.190637
Objective
Video behavior recognition and understanding is a fundamental technology in many applications such as intelligent surveillance, human-computer interaction, and virtual reality. Owing to the complexity of the spatio-temporal structure of videos and the diversity of video content, behavior recognition still faces the difficulties of efficiently extracting the time-domain representation of a video and of efficiently extracting video features and modeling them along the time axis. To address these difficulties, a multi-feature fusion behavior recognition model is proposed.
Method
First, the high-frequency and low-frequency information in the video is extracted, and the two-frame and three-frame fusion algorithms proposed in this paper are used to compress the original data; this retains the vast majority of the information in the original video, augments the original dataset, and better expresses the original behavior information. Second, a two-way feature extraction network is designed: one path feeds the fused data into the network in the forward order to extract detailed features, while the other path feeds the fused data in the reverse order to extract overall features, and the two sets of features are then combined by weighted fusion; each feature extraction path uses the general video descriptor, the 3D ConvNets (3D convolutional neural networks) structure. Next, a BiConvLSTM (bidirectional convolutional long short-term memory) network further extracts local information from the fused features and models them along the time axis, which handles the relatively long intervals between some behaviors in a video sequence. Finally, Softmax maximizes the likelihood function to classify the behavior actions.
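The two-frame and three-frame fusion steps can be pictured with a minimal sketch. The exact fusion rule is not spelled out in this abstract, so the snippet below simply blends adjacent frames with assumed weights; the function names and weight values are illustrative rather than the authors' implementation.

```python
# Hypothetical sketch of temporal frame fusion: consecutive frames are blended
# with assumed weights to compress the clip along the time axis.
import numpy as np

def two_frame_fusion(frames: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Fuse adjacent frame pairs; frames has shape (T, H, W, C)."""
    t = frames.shape[0] // 2 * 2                      # drop a trailing odd frame
    pairs = frames[:t].reshape(t // 2, 2, *frames.shape[1:])
    return w * pairs[:, 0] + (1.0 - w) * pairs[:, 1]  # shape (T/2, H, W, C)

def three_frame_fusion(frames: np.ndarray,
                       weights=(0.25, 0.5, 0.25)) -> np.ndarray:
    """Fuse adjacent frame triplets; frames has shape (T, H, W, C)."""
    t = frames.shape[0] // 3 * 3
    triples = frames[:t].reshape(t // 3, 3, *frames.shape[1:])
    w = np.asarray(weights).reshape(1, 3, 1, 1, 1)
    return (triples * w).sum(axis=1)                  # shape (T/3, H, W, C)

# Example: a 16-frame clip is compressed to 8 or 5 fused frames.
clip = np.random.rand(16, 112, 112, 3).astype(np.float32)
print(two_frame_fusion(clip).shape, three_frame_fusion(clip).shape)
```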
Result
To verify the effectiveness of the proposed algorithm, an overall test and analysis were carried out with five-fold cross-validation on the public behavior recognition datasets UCF101 and HMDB51, followed by comparative statistics for each category of behavior. The results show that the proposed algorithm achieves average accuracies of 96.47% and 80.03% on the two datasets, respectively.
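The five-fold protocol can be sketched as follows; train_and_evaluate is a hypothetical placeholder that trains the model on one split and returns its test accuracy, and the stratified shuffling is an assumption rather than the paper's stated procedure.

```python
# Minimal sketch of five-fold cross-validation over a labelled video collection.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(video_ids, labels, train_and_evaluate, n_splits=5, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_accuracies = []
    for fold, (train_idx, test_idx) in enumerate(skf.split(video_ids, labels)):
        acc = train_and_evaluate(train_idx, test_idx)   # user-supplied callback
        fold_accuracies.append(acc)
        print(f"fold {fold}: accuracy = {acc:.4f}")
    return float(np.mean(fold_accuracies))              # averaged over the folds
```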
Conclusion
Compared with current mainstream behavior recognition models, the proposed multi-feature model achieves the highest recognition accuracy and is universal, compact, simple, and efficient.
Objective
With the rapid development of Internet technology and the increasing popularity of video shooting equipment (e.g., digital cameras and smartphones), online video services have shown explosive growth, and short videos have become an indispensable source of information in people's daily production and life. Enabling machines to understand these videos has therefore become critical. Videos contain rich amounts of hidden information, as they can store far more than traditional media such as images and text, and they are complex in their space-time structure, content, temporal relevance, and event integrity. Given such complexities, behavior recognition research currently faces challenges in extracting the time-domain representation and the features of videos. To address these difficulties, this study proposes a behavior recognition model based on multi-feature fusion.
Method
The proposed model is mainly composed of three parts, namely, the time-domain fusion, two-way feature extraction, and feature modeling modules. The two- and three-frame fusion algorithms are first adopted to compress the original data by extracting high- and low-frequency information from the videos. This approach not only retains most of the information contained in the videos but also enhances the original dataset to facilitate the expression of the original behavior information. Second, a two-way feature extraction network is designed: detailed features are extracted by feeding the fused data into the network in the forward order, whereas overall features are extracted by feeding the data in the reverse order; each path uses the common video descriptor, the 3D ConvNets (3D convolutional neural networks) structure, and the two sets of features are then combined by weighted fusion. Afterward, a BiConvLSTM (bidirectional convolutional long short-term memory) network is used to further extract the local information of the fused features and to build a model along the time axis, which addresses the relatively long behavior intervals in some video sequences. Softmax is finally applied to maximize the likelihood function and classify the behavioral actions.
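A hedged PyTorch sketch of the two-way feature extraction stage described above: the same fused clip is passed once in forward temporal order and once reversed through a small C3D-style stand-in for the 3D ConvNets backbone, and the two feature maps are combined with a learnable fusion weight. The layer sizes and the scalar weight are assumptions, and the BiConvLSTM stage is omitted here.

```python
# Toy two-way 3D-convolutional feature extraction with weighted fusion.
import torch
import torch.nn as nn

class SmallC3D(nn.Module):
    """Small 3D-convolutional backbone standing in for the C3D descriptor."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_ch, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, feat_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(2, 2, 2)),
        )

    def forward(self, x):                # x: (N, C, T, H, W)
        return self.features(x)

class TwoWayFusion(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        self.forward_path = SmallC3D()   # sees the clip in forward order
        self.reverse_path = SmallC3D()   # sees the clip reversed in time
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable fusion weight
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, num_classes))

    def forward(self, clip):             # clip: (N, C, T, H, W)
        f_fwd = self.forward_path(clip)
        f_rev = self.reverse_path(torch.flip(clip, dims=[2]))
        fused = self.alpha * f_fwd + (1.0 - self.alpha) * f_rev
        return self.classifier(fused)    # logits; softmax is applied in the loss

logits = TwoWayFusion()(torch.randn(2, 3, 8, 112, 112))
print(logits.shape)                      # torch.Size([2, 101])
```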
Result
To verify its effectiveness, the proposed algorithm was tested and analyzed on the public datasets UCF101 and HMDB51. Results of a five-fold cross-validation show that this algorithm has average accuracies of 96.47% and 80.03% for these datasets, respectively. Comparative statistics for each type of behavior show that the classification accuracy of the proposed algorithm is approximately equal in almost all categories.
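The per-category statistics referred to above amount to computing accuracy separately for each action class; a small helper along these lines is shown below (the class names are illustrative UCF101 categories).

```python
# Report per-class accuracy from ground-truth and predicted labels.
import numpy as np

def per_class_accuracy(y_true, y_pred, class_names):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    report = {}
    for c, name in enumerate(class_names):
        mask = y_true == c
        if mask.any():                              # skip classes with no samples
            report[name] = float((y_pred[mask] == c).mean())
    return report

print(per_class_accuracy([0, 0, 1, 1, 2], [0, 1, 1, 1, 2],
                         ["ApplyEyeMakeup", "Basketball", "Biking"]))
```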
Conclusion
Compared with the available mainstream behavior recognition models, the proposed multi-feature model achieves higher recognition accuracy and is more universal, compact, simple, and efficient. The accuracy of this model is mainly improved via the two- and three-frame fusions in the time domain, which facilitate video information analysis and behavior information expression. The spatio-temporal features of videos are efficiently extracted by the two-way feature extraction network, and the BiConvLSTM network is then applied to further extract features and establish the temporal relationship.
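For reference, a bidirectional ConvLSTM of the kind named above can be sketched as two ConvLSTM passes over the feature sequence, one forward and one reversed, with the outputs concatenated. The cell below is a generic minimal implementation under assumed kernel size and hidden width, not the authors' exact network.

```python
# Generic bidirectional ConvLSTM sketch over a sequence of feature maps.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)    # update the cell state
        h = o * torch.tanh(c)            # emit the hidden feature map
        return h, c

class BiConvLSTM(nn.Module):
    """Run a ConvLSTM forward and backward in time and concatenate the outputs."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.fwd, self.bwd = ConvLSTMCell(in_ch, hid_ch), ConvLSTMCell(in_ch, hid_ch)
        self.hid_ch = hid_ch

    def _run(self, cell, seq):           # seq: (N, T, C, H, W)
        n, t, _, hgt, wid = seq.shape
        h = seq.new_zeros(n, self.hid_ch, hgt, wid)
        c = torch.zeros_like(h)
        outs = []
        for step in range(t):
            h, c = cell(seq[:, step], (h, c))
            outs.append(h)
        return torch.stack(outs, dim=1)

    def forward(self, seq):
        out_f = self._run(self.fwd, seq)
        out_b = torch.flip(self._run(self.bwd, torch.flip(seq, dims=[1])), dims=[1])
        return torch.cat([out_f, out_b], dim=2)   # (N, T, 2 * hid_ch, H, W)

feats = torch.randn(2, 4, 64, 28, 28)              # e.g. fused 3D-CNN feature maps
print(BiConvLSTM(64, 32)(feats).shape)             # torch.Size([2, 4, 64, 28, 28])
```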