面向群体行为识别的非局部网络模型
Nonlocal based deep model for group activity recognition
2019年24卷第10期 页码:1728-1737
收稿:2019-01-29
修回:2019-04-12
录用:2019-04-19
纸质出版:2019-10-16
DOI: 10.11834/jig.180695
目的
视频行为识别一直广受计算机视觉领域研究者的关注,主要包括个体行为识别与群体行为识别。群体行为识别以人群动作作为研究对象,对其行为进行有效表示及分类,在智能监控、运动分析以及视频检索等领域有重要的应用价值。现有的算法大多以多层递归神经网络(RNN)模型作为基础,构建出可表征个体与所属群体之间关系的群体行为特征,但是未能充分考虑个体之间的相互影响,致使识别精度较低。为此,提出一种基于非局部卷积神经网络的群体行为识别模型,充分利用个体间上下文信息,有效提升了群体行为识别准确率。
方法
所提模型采用一种自底向上的方式,同时对个体行为与群体行为进行分层识别。首先,从原始视频中沿个体运动轨迹提取个体附近的图像区块;随后,使用非局部卷积神经网络(CNN)提取包含个体间影响关系的静态特征;接着,将提取到的个体静态特征输入多层长短期记忆(LSTM)时序模型,得到个体动态特征,并通过个体特征聚合得到群体行为特征;最后,利用个体、群体行为特征同时完成个体行为与群体行为的识别。
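非局部CNN的核心,是在提取个体特征时计算个体之间的相似度,从而把其他个体的影响聚合进每个个体的特征。下面给出该非局部操作的一个最小NumPy示意(采用嵌入高斯形式的相似度;w_theta等权重名称与各维度均为示意性假设,并非论文的实际实现):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, w_theta, w_phi, w_g, w_out):
    # x: (N, D) 个体静态特征,每行对应场景中的一个人
    # 相似度取嵌入高斯形式: f(x_i, x_j) = softmax(theta(x_i) · phi(x_j))
    theta = x @ w_theta            # (N, d) query 嵌入
    phi = x @ w_phi                # (N, d) key 嵌入
    g = x @ w_g                    # (N, d) value 嵌入
    attn = softmax(theta @ phi.T)  # (N, N) 个体 j 对个体 i 的影响权重
    y = attn @ g                   # (N, d) 聚合个体间上下文
    return x + y @ w_out           # (N, D) 残差连接,保留原始特征
```

实际模型中该操作嵌入在CNN内部,各权重矩阵通过端到端训练学习得到。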
结果
本文在国际通用的Volleyball Dataset上进行实验。实验结果表明,所提模型在未进行群体精细划分的条件下取得了77.6%的准确率,在群体精细划分的条件下取得了83.5%的准确率。
结论
首次提出了面向群体行为识别的非局部卷积网络,并据此构建了一种非局部群体行为识别模型。所提模型通过考虑个体之间的相互影响,结合个体上下文信息,可从训练数据中学习到更具判别性的群体行为特征。该特征既包含个体间上下文信息,也保留了群体内层次结构信息,更有利于最终的群体行为分类。
Objective
Human action recognition, which comprises single-person action recognition and group activity recognition, has received considerable research attention. Group activity recognition builds on single-person action recognition and focuses on the group of people in the scene. It has various applications, including video surveillance, sports analytics, and video retrieval. In group activity recognition, the hierarchical structure between the group and its individuals is significant, and the main challenge is to build discriminative representations of group activity on top of this hierarchical structure. To overcome this difficulty, researchers have proposed numerous methods. In the early years, hand-crafted features were designed as representations of individual- and group-level activities. Recently, deep learning has been widely used in group activity recognition; typically, hierarchical frameworks based on recurrent neural networks (RNNs) have been adopted to represent the relationships between individuals and their corresponding group and have achieved promising performance. Despite this progress, these methods ignore the relationships and interactions among individuals, which limits recognition accuracy. Group activity is jointly defined by each individual action and the contextual information among individuals, so extracting individual features in isolation results in the loss of this contextual information. To address this problem, we propose a novel model for group activity recognition based on the non-local network.
Method
The proposed model uses a bottom-up approach to represent and recognize individual actions and group activities in a hierarchical manner. First, multi-person tracklets are constructed from detections and trajectories, and static features are extracted from these tracklets by a non-local convolutional neural network (NCNN). Inside the NCNN module, pairwise similarities between individuals are computed to capture the non-local context among them. The extracted features are then fed into a hierarchical temporal model (HTM) based on long short-term memory (LSTM). The HTM is composed of an individual-level LSTM and a group-level LSTM and thus models group dynamics hierarchically: dynamic features of individuals are extracted, and features of group activities are generated by aggregating the individual features. Finally, group activities and individual actions are classified using the output of the HTM. The entire framework can be trained end to end.
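The two-level temporal structure described above can be sketched in a few lines of NumPy. The sketch below is illustrative only: the single-layer LSTM cells, the per-step max-pooling used as aggregation, and all parameter names and shapes are assumptions, not the paper's exact design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,); gates ordered i, f, o, g."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c = f * c + i * g
    return np.tanh(c) * o, c

def hierarchical_temporal_model(seqs, person_params, group_params, H):
    """seqs: (N, T, D) per-person feature sequences.
    A person-level LSTM runs on each individual; at every step the person
    hidden states are max-pooled across people and fed to a group-level LSTM.
    Returns (person hiddens (N, H), group hidden (H,))."""
    N, T, D = seqs.shape
    hp, cp = np.zeros((N, H)), np.zeros((N, H))
    hg, cg = np.zeros(H), np.zeros(H)
    for t in range(T):
        for n in range(N):  # individual-level dynamics
            hp[n], cp[n] = lstm_step(seqs[n, t], hp[n], cp[n], *person_params)
        pooled = hp.max(axis=0)  # aggregate individuals into a group input
        hg, cg = lstm_step(pooled, hg, cg, *group_params)  # group-level dynamics
    return hp, hg
```

The returned person-level and group-level hidden states would then feed the individual-action and group-activity classifiers, respectively.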
Result
We evaluate our model on the widely used Volleyball Dataset under two settings, namely, fine division and non-fine division. The fine-division setting treats the group as a combination of subgroups, each composed of several individuals, so the structure of the group is "group-subgroup-individuals"; we aggregate the individual features within each subgroup and then concatenate the subgroup features. The non-fine-division setting involves no subgroups; we aggregate all individual features to generate the group features. Experimental results show that the proposed method achieves 83.5% accuracy in the fine-division setting and 77.6% in the non-fine-division setting. Examples of recognition results and relationships within the group are visualized.
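The two settings differ only in how individual features are aggregated into a group feature. A small sketch (max-pooling is assumed here as the aggregation operator; the paper's exact choice may differ):

```python
import numpy as np

def group_feature(person_feats, subgroups=None):
    """person_feats: (N, H) per-person dynamic features.
    Non-fine division (subgroups=None): pool over all N people -> (H,).
    Fine division: pool within each subgroup, then concatenate
    -> (len(subgroups) * H,). subgroups is a list of index lists,
    e.g. the two volleyball teams."""
    if subgroups is None:
        return person_feats.max(axis=0)
    return np.concatenate([person_feats[idx].max(axis=0) for idx in subgroups])
```

For 12 players split into two teams of six, `group_feature(feats, [list(range(6)), list(range(6, 12))])` yields a 2H-dimensional feature that preserves the "group-subgroup-individuals" structure, while `group_feature(feats)` collapses everyone into a single H-dimensional vector.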
Conclusion
This study proposes a novel neural network for group activity recognition and constructs a unified framework based on the NCNN and a hierarchical LSTM network. We take the relationships among individuals into account through a non-local network and exploit the contextual information within the group. When extracting individual features, the method learns more discriminative representations that combine the influence of each individual; thus, contextual information from non-local regions is embedded into the extracted features. Experimental results confirm the effectiveness of our non-local model, indicating that the contextual information between individuals and the hierarchical structure of the group facilitate group activity recognition.