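Abstract: Objective Video-based action recognition has long attracted wide attention from computer vision researchers, and it comprises individual action recognition and group activity recognition. Group activity recognition builds on individual action recognition; it takes the group as the object of study and aims to represent and classify its activity effectively, with important applications in intelligent surveillance, sports analysis, and video retrieval. Most existing algorithms are built on multi-layer RNN models and construct group activity features that characterize the relationship between individuals and the group they belong to, but they do not fully consider the mutual influence among individuals, which leads to low recognition accuracy. To this end, this paper proposes a group activity recognition model based on a nonlocal convolutional neural network, which makes full use of the contextual information among individuals and effectively improves group activity recognition accuracy. Method The proposed model recognizes individual actions and group activities hierarchically and simultaneously in a bottom-up manner. First, image patches around each individual are extracted from the raw video along that person's motion trajectory. A nonlocal CNN then extracts static features that encode the influence relationships among individuals. These individual static features are fed into a multi-layer LSTM temporal model to obtain individual dynamic features, which are aggregated into group activity features. Finally, the individual and group activity features are used to recognize individual actions and group activities at the same time. Result Experiments are conducted on the internationally used The Volleyball Dataset. The results show that the proposed model achieves 77.6% accuracy without fine-grained group division and 83.5% accuracy with fine-grained group division. Conclusion This paper is the first to propose a nonlocal convolutional network for group activity recognition and, on this basis, builds a group activity recognition model based on the nonlocal network. By considering the mutual influence among individuals and incorporating individual contextual information, the proposed model learns more discriminative group activity features from the training data. The results also verify that group activity features which both contain inter-individual contextual information and preserve the hierarchical structure within the group are more beneficial to the final group activity classification.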
Nonlocal network based deep model for group activity recognition
Li Ding, Ma Jing, Yang Menglin, Zhang Wensheng (School of Automation, Harbin University of Science and Technology)
Abstract: Objective Human action recognition has received considerable attention from computer vision researchers, and it comprises single-person action recognition and group activity recognition. Group activity recognition builds on single-person action recognition and focuses on the group of people in the scene; it supports many applications, e.g. video surveillance, sports analytics, and video retrieval. In group activity recognition, the hierarchical structure between the group and its individuals is important, and the main challenge is to build more discriminative representations of group activity based on this structure. Researchers have proposed numerous methods to address this challenge. Early work designed hand-crafted features to represent individual and group-level activities. More recently, deep learning has been widely used in group activity recognition. Typically, hierarchical RNN-based frameworks have been adopted to represent the relationships between individuals and their corresponding group, and they have achieved promising performance. However, these methods ignore the relationships and interactions between individuals, which limits recognition accuracy. A group activity is defined jointly by each individual action and the contextual information between individuals, so extracting individual features in isolation loses this contextual information. To address this problem, we propose a novel model for group activity recognition based on the nonlocal network. Method The proposed model uses a bottom-up approach to represent and recognize individual actions and group activities in a hierarchical manner. First, multi-person tracklets are constructed from detections and trajectories, and static features are extracted from these tracklets by a nonlocal convolutional neural network (NCNN).
The extracted features are then fed into a hierarchical temporal model (HTM) based on LSTM. Within the HTM, dynamic features of individuals are extracted, and group activity features are generated by aggregating the individual features. Finally, group activities and individual actions are classified from the output of the HTM. Result We evaluate our model on the widely used The Volleyball Dataset under two settings, fine-division and non-fine-division. In the fine-division setting, the group is treated as a combination of subgroups, each composed of several individuals, so the structure of the group is ‘group-subgroup-individuals’; we aggregate the individual features within each subgroup and then concatenate the subgroup features. In the non-fine-division setting, no subgroups are involved, and we aggregate all individual features to generate the group activity feature. Experimental results show that the proposed method achieves 83.5% accuracy in the fine-division setting and 77.6% accuracy in the non-fine-division setting. Conclusion This study proposes a novel neural network for group activity recognition and constructs a unified framework based on the nonlocal neural network and a hierarchical LSTM network. We use the nonlocal network to take the relationships between individuals into account and to exploit the contextual information within the group. When extracting individual features, the method learns more discriminative features that incorporate the influence of the other individuals. The experimental results confirm the effectiveness of our nonlocal model and indicate that both the contextual information between individuals and the hierarchical structure of the group facilitate group activity recognition.
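
The two ideas the abstract relies on, nonlocal mixing of individual features and group-level aggregation with or without subgroups, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`nonlocal_features`, `group_feature`), the dot-product similarity without learned embeddings, the residual connection, and the use of max pooling as the aggregation operator are all assumptions made for illustration, and the LSTM stage between the two steps is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nonlocal_features(feats):
    """Add to each individual's feature a weighted sum of all individuals' features.

    Assumption: similarity is a plain dot product and the output is
    residual (input + context); the paper's NCNN may differ.
    """
    out = []
    for xi in feats:
        # Similarity of individual i to every individual j (including itself).
        weights = softmax([sum(a * b for a, b in zip(xi, xj)) for xj in feats])
        # Context vector: attention-weighted average over all individuals.
        context = [
            sum(w * xj[d] for w, xj in zip(weights, feats))
            for d in range(len(xi))
        ]
        out.append([a + c for a, c in zip(xi, context)])
    return out


def max_pool(feats):
    """Element-wise max over a list of equal-length feature vectors."""
    return [max(column) for column in zip(*feats)]


def group_feature(feats, subgroups=None):
    """Aggregate individual features into one group feature.

    subgroups=None  -> non-fine-division: pool over all individuals.
    subgroups=[...] -> fine-division: pool within each subgroup,
                       then concatenate the subgroup features.
    Max pooling is an assumed choice of aggregation operator.
    """
    if subgroups is None:
        return max_pool(feats)
    pooled = []
    for member_indices in subgroups:
        pooled.extend(max_pool([feats[i] for i in member_indices]))
    return pooled


# Hypothetical toy input: four individuals with 2-dimensional features.
players = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
mixed = nonlocal_features(players)
flat_group = group_feature(mixed)                    # non-fine-division
fine_group = group_feature(mixed, [[0, 1], [2, 3]])  # fine-division
```

With two subgroups, the fine-division feature is twice as long as the non-fine-division one, because the subgroup features are concatenated rather than pooled together; this is how the 'group-subgroup-individuals' structure is preserved in the representation.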