Multi-person interaction action recognition based on spatio-temporal graph convolution
2021, Vol. 26, No. 7, Pages: 1681-1691
Print publication date: 2021-07-16
Accepted: 2021-01-06
DOI: 10.11834/jig.200510
Keyang Cheng, Jinxia Wu, Wenshan Wang, Lan Rong, Yongzhao Zhan. Multi-person interaction action recognition based on spatio-temporal graph convolution[J]. Journal of Image and Graphics, 2021, 26(7): 1681-1691.
Objective
The recognition of multi-person interaction behavior has wide applications in real life. At present, research on human activity analysis focuses mainly on classifying video clips of simple actions performed by individual persons, while the problem of understanding complex human activities involving relationships among multiple people remains insufficiently addressed. In multi-person behavior recognition, body information is more abundant and the description of two-person action features is more complex, so recognition methods easily become complicated and recognition accuracy tends to be low. When the recognized object changes from a single person to multiple people, we need to pay attention not only to the action information of each person but also to the interaction information between different subjects, which existing methods cannot extract well. To solve this problem effectively, we propose a multi-person interaction behavior recognition algorithm based on skeleton graph convolution.
Method
The advantage of this method is that it fully exploits the spatial and temporal dependence information between human joints. We design interaction connections between skeletons to discover the potential relationships between different individuals and different key points; capturing this additional interaction information improves the accuracy of action recognition. Considering the characteristics of multi-person interaction behavior, this study proposes a skeleton-based spatio-temporal graph convolution model. In the spatial dimension, single-person and multi-person connections are designed separately. The single-person connection is designed within each frame: apart from the physical links between body joints, potential correlations are added between joints without a physical connection, such as the left and right hands of one person. The interaction connection between the two people within each frame is built by using the Euclidean distance to measure the correlation between interaction nodes and to determine which points of the two persons are related. In this way, the key-point connections between the two persons within a frame add the new interaction edges needed as a bridge to describe the two persons' interactive actions, while avoiding noisy connections and keeping the underlying graph sparse.
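As a concrete illustration, the following minimal Python/NumPy sketch (our illustration under stated assumptions, not the authors' released code) builds such inter-person edges by thresholding pairwise Euclidean distances; the joint-array shapes and the threshold value are assumptions:

    import numpy as np

    def interaction_edges(p1, p2, threshold=0.3):
        """p1, p2: (J, 3) arrays of 3D joint coordinates for person 1 and person 2.
        Returns a (J, J) binary matrix whose entry (i, j) links joint i of
        person 1 to joint j of person 2 when their Euclidean distance is small."""
        # Pairwise Euclidean distances between every joint of p1 and every joint of p2
        dists = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)
        # Thresholding keeps only nearby joint pairs, so the interaction graph stays sparse
        return (dists < threshold).astype(np.float32)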
In the time dimension, we segment the action sequence: every three frames of action form a processing unit, and the joints of the three adjacent frames are connected so that more adjacent joints enlarge the receptive field and help the model learn the change information in the time domain.
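Assuming a per-frame adjacency matrix is already available, one compact (hypothetical) way to realize such a three-frame unit is to tile the spatial graph over the frames and add links between corresponding joints of adjacent frames:

    import numpy as np

    def temporal_unit_adjacency(A_spatial, frames=3):
        """A_spatial: (N, N) adjacency of one frame (body + interaction edges).
        Returns the (frames*N, frames*N) adjacency of one processing unit."""
        N = A_spatial.shape[0]
        A = np.kron(np.eye(frames), A_spatial)               # copy the spatial graph per frame
        inter = np.eye(frames, k=1) + np.eye(frames, k=-1)   # adjacent-frame connectivity pattern
        A += np.kron(inter, np.eye(N))                       # link joint i at frame t to joint i at t +/- 1
        return A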
Through this modeling design in the time and space dimensions, we obtain a complex action skeleton graph. We then use a generalized graph convolution model to extract and summarize the two-person action features, approximating the spectral graph convolution with high-order fast Chebyshev polynomials to obtain high-level feature maps.
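The approximation referred to here is the standard Chebyshev one, y = sum_{k=0}^{K-1} theta_k T_k(L~) x with the rescaled Laplacian L~ = 2L/lambda_max - I and the recursion T_k(L~)x = 2 L~ T_{k-1}(L~)x - T_{k-2}(L~)x. A minimal PyTorch sketch, where the shapes, names, and K >= 2 are our assumptions:

    import torch

    def cheb_conv(x, L_tilde, theta):
        """x: (N, C_in) node features; L_tilde: (N, N) rescaled graph Laplacian;
        theta: (K, C_in, C_out) learnable Chebyshev coefficients, K >= 2."""
        Tx_prev, Tx = x, L_tilde @ x                         # T_0 x = x and T_1 x = L~ x
        out = Tx_prev @ theta[0] + Tx @ theta[1]
        for k in range(2, theta.shape[0]):
            Tx_prev, Tx = Tx, 2 * (L_tilde @ Tx) - Tx_prev   # Chebyshev recursion
            out = out + Tx @ theta[k]
        return out                                           # (N, C_out) high-level features

Because the recursion uses only matrix products with L~, the filter is K-localized on the skeleton graph and avoids an explicit eigendecomposition.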
Meanwhile, to enhance the extraction of time-domain information, we apply a sliced recurrent neural network (RNN) to video action recognition to strengthen the characterization of two-person actions. The input sequence is divided into multiple equal-length subsequences, and a separate RNN extracts features from each subsequence; because every subsequence can be computed at the same time, this overcomes the limitation of standard RNNs, which cannot be parallelized. Through the information transfer between layers, the local information of the subsequences is integrated by the higher-level network, which summarizes information from local to global and enables the network to capture the dependency information of the entire action sequence. The information lost at the slice boundaries is compensated for by taking three frames of action as one processing unit.
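The module below sketches the sliced-RNN idea as we read it (a hypothetical re-implementation, not the authors' code): slices are folded into the batch axis so that one shared low-level GRU encodes all of them in parallel, and a high-level GRU then summarizes the slice representations from local to global:

    import torch
    import torch.nn as nn

    class SlicedRNN(nn.Module):
        def __init__(self, in_dim, hidden, n_slices=8):
            super().__init__()
            self.n_slices = n_slices
            self.low = nn.GRU(in_dim, hidden, batch_first=True)   # shared encoder, one slice at a time
            self.high = nn.GRU(hidden, hidden, batch_first=True)  # integrates the slice summaries

        def forward(self, x):              # x: (B, T, in_dim), T divisible by n_slices
            B, T, D = x.shape
            s = T // self.n_slices
            # Fold the slices into the batch axis so they are processed in parallel
            xs = x.reshape(B * self.n_slices, s, D)
            _, h = self.low(xs)            # h: (1, B*n_slices, hidden), last state per slice
            h = h.squeeze(0).reshape(B, self.n_slices, -1)
            _, g = self.high(h)            # aggregate local slice features globally
            return g.squeeze(0)            # (B, hidden) sequence-level representation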
Result
This study validates the proposed algorithm on two datasets, UT-Interaction and SBU, and compares it with other state-of-the-art interaction recognition methods. The UT-Interaction dataset contains six classes of actions and the SBU interaction dataset contains eight; 10-fold and 5-fold cross-validation, respectively, are used for evaluation. On the UT-Interaction dataset, compared with H-LSTCM (hierarchical long short-term concurrent memory) and other methods, the performance improves by 0.7% over the second-best algorithm. On the SBU dataset, compared with GCNConv (semi-supervised classification with graph convolutional networks), RotClips+MTCNN (rotating clips + multi-task convolutional neural network), and SGC (simplifying graph convolution), the accuracy improves by 5.2%, 1.03%, and 1.2%, respectively. In addition, fusion experiments on the SBU dataset verify the effectiveness of the various connections and of the sliced RNN. The method effectively extracts the additional interaction information and performs well in recognizing interactive actions.
Conclusion
The proposed interaction recognition method based on fused spatio-temporal graph convolution achieves high accuracy in recognizing interactive actions and is generally applicable to recognizing behaviors in which objects interact with each other.
Keywords: action recognition; interaction information; spatio-temporal modeling; graph convolution; sliced recurrent neural network (RNN)
Aliakbarian M S, Saleh F S, Salzmann M, Fernando B, Petersson L and Andersson L. 2017. Encouraging LSTMs to anticipate actions very early//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 280-289 [DOI: 10.1109/ICCV.2017.39]
Cao Z, Simon T, Wei S E and Sheikh Y. 2017. Realtime multi-person 2D pose estimation using part affinity fields//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 1302-1310 [DOI: 10.1109/CVPR.2017.143]
Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Darrell T and Saenko K. 2015. Long-term recurrent convolutional networks for visual recognition and description//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 2625-2634 [DOI: 10.1109/CVPR.2015.7298878]
Du Y, Wang W and Wang L. 2015. Hierarchical recurrent neural network for skeleton based action recognition//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1110-1118 [DOI: 10.1109/CVPR.2015.7298714]
Gao X, Hu W, Tang J X, Liu J Y and Guo Z M. 2019. Optimized skeleton-based action recognition via sparsified graph regression//Proceedings of the 27th ACM International Conference on Multimedia. Nice, France: ACM: 601-610 [DOI: 10.1145/3343031.3351170]
Ji Y L, Ye G and Cheng H. 2014. Interactive body part contrast mining for human interaction recognition//Proceedings of 2014 IEEE International Conference on Multimedia and Expo Workshops. Chengdu, China: IEEE: 1-6 [DOI: 10.1109/ICMEW.2014.6890714]
Ke Q, Bennamoun M, An S, Boussaid F and Sohel F. 2016. Human interaction prediction using deep temporal features//Proceedings of European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 403-414 [DOI: 10.1007/978-3-319-48881-3_28]
Ke Q H, Bennamoun M, An S J, Sohel F and Boussaid F. 2018. Learning clip representations for skeleton-based 3D action recognition. IEEE Transactions on Image Processing, 27(6): 2842-2855[DOI: 10.1109/TIP.2018.2812099]
Kipf T N and Welling M. 2017. Semi-supervised classification with graph convolutional networks[EB/OL]. [2021-01-06]. https://arxiv.org/pdf/1609.02907.pdf
Li B, Cheng Z H, Xu Z H, Ye W, Lukasiewicz T and Zhang S K. 2019. Long text analysis using sliced recurrent neural networks with breaking point information enrichment//Proceedings of ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Brighton, UK: IEEE: 7550-7554 [DOI: 10.1109/ICASSP.2019.8683812]
Liu J, Shahroudy A, Xu D and Wang G. 2016. Spatio-temporal LSTM with trust gates for 3D human action recognition//Proceedings of European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 816-833 [DOI: 10.1007/978-3-319-46487-9_50]
Liu M Y, Liu H and Chen C. 2017. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68(3): 346-362[DOI: 10.1016/j.patcog.2017.02.030]
Raptis M and Sigal L. 2013. Poselet key-framing: a model for human activity recognition//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE: 2650-2657 [DOI: 10.1109/CVPR.2013.342]
Ryoo M S, Chen C C, Aggarwal J K and Roy-Chowdhury A. 2010. An overview of contest on semantic description of human activities (SDHA) 2010//Proceedings of International Conference on Pattern Recognition. Istanbul, Turkey: Springer: 270-285 [DOI: 10.1007/978-3-642-17711-8_28]
Shi L, Zhang Y F, Cheng J and Lu H Q. 2019. Two-stream adaptive graph convolutional networks for skeleton-based action recognition//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 12018-12027 [DOI: 10.1109/CVPR.2019.01230]
Shu X B, Tang J H, Qi G J, Liu W and Yang J. 2021. Hierarchical long short-term concurrent memory for human interaction recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(3): 1110-1118[DOI: 10.1109/TPAMI.2019.2942030]
Song S J, Lan C L, Xing J L, Zeng W J and Liu J Y. 2017. An end-to-end spatio-temporal attention model for human action recognition from skeleton data//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI Press: 4263-4270
Tang Y S, Tian Y, Lu J W, Li P Y and Zhou J. 2018. Deep progressive reinforcement learning for skeleton-based action recognition//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 5323-5332 [DOI: 10.1109/CVPR.2018.00558]
Van Gemeren C, Tan R T, Poppe R and Veltkamp R C. 2014. Dyadic interaction detection from pose and flow//Proceedings of International Workshop on Human Behavior Understanding. Zurich, Switzerland: Springer: 101-115 [DOI: 10.1007/978-3-319-11839-0_9]
Wang S G, Sun A M, Zhao W T and Hui X L. 2015. Single and interactive human behavior recognition algorithm based on spatio-temporal interest point. Journal of Jilin University (Engineering and Technology Edition), 45(1): 304-308 (in Chinese) [DOI: 10.13229/j.cnki.jdxbgxb201501044]
Wang X H and Deng H M. 2020. A multi-feature representation of skeleton sequences for human interaction recognition. Electronics, 9(1): #187[DOI: 10.3390/electronics9010187]
Wu F, Zhang T Y, de Souza A H Jr, Fifty C, Yu T and Weinberger K Q. 2019. Simplifying graph convolutional networks[EB/OL]. [2021-01-06]. https://arxiv.org/pdf/1902.07153.pdf
Yan S J, Xiong Y J and Lin D H. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition//Proceedings of the 32nd AAAI Conference on Artificial Intelligence. New Orleans, USA: AAAI Press: 7444-7452
Yu T H, Kim T K and Cipolla R. 2010. Real-time action recognition by spatiotemporal semantic and structural forest//Proceedings of the British Machine Vision Conference. Aberystwyth, UK: BMVA Press: #52 [DOI: 10.5244/C.24.52]
Yun K, Honorio J, Chattopadhyay D, Berg T L and Samaras D. 2012. Two-person interaction detection using body-pose features and multiple instance learning//Proceedings of 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. Providence, USA: IEEE: 28-35 [DOI: 10.1109/CVPRW.2012.6239234]