Human action recognition in videos utilizing key semantic region extraction and concatenation
2020, Vol. 25, No. 12, Pages: 2517-2529
Received: 2020-02-18; Revised: 2020-03-13; Accepted: 2020-03-20; Published in print: 2020-12-16
DOI: 10.11834/jig.200049
Objective
Human action recognition in videos plays a positive role in advancing intelligence in fields such as smart security, human-robot collaboration, and assistance for the elderly and the disabled, and it has broad application prospects. However, existing recognition methods still fall short in effectively exploiting the spatiotemporal features of human action, and recognition accuracy remains to be improved. To this end, this paper proposes a method that uses a deep learning network to extract the key semantic information of human action in the spatial domain and then concatenates and analyzes it in the time domain, thereby accurately recognizing human action in videos.
Method
According to the video image content, repetitive and redundant human action information is removed, and the key frames that best express the changes of human action are extracted. A deep learning network is designed and constructed to analyze the semantic information of images and extract the key semantic regions that express important semantic information, thereby effectively describing the spatial information of human action. A Siamese neural network is then used to compute the correlation of key semantic regions between video frames, and regions with similar semantic information are concatenated into key semantic region chains. The deep learning features of the key semantic region chains are computed and fused into features that express the human action in the video, and a classifier is trained to perform human action recognition.
Result
The proposed method is validated on the challenging UCF (University of Central Florida) 50 human action recognition dataset, where it achieves an accuracy of 94.3%, a significant improvement over existing methods. Ablation experiments show that the proposed computation of key semantic regions and of the interframe correlation between key semantic regions effectively improves the accuracy of human action recognition.
Conclusion
Experimental results show that the proposed method can effectively exploit the spatiotemporal information of human action in videos and significantly improve the accuracy of human action recognition.
Objective
Human action recognition in videos aims to identify action categories by analyzing human action-related information and utilizing spatial and temporal cues. Research on human action recognition is crucial to the development of intelligent security, pedestrian monitoring, and clinical nursing; hence, this topic has become increasingly popular among researchers. The key to improving the accuracy of human action recognition lies in how to construct distinctive features that effectively describe human action categories. Existing human action recognition methods fall into three categories: extracting visual features using deep learning networks, manually constructing image visual descriptors, and combining manual construction with deep learning networks. The methods that use deep learning networks normally apply convolution and pooling to small neighboring regions, thereby ignoring the connections among regions. By contrast, manually constructed descriptors are often highly specific to particular human actions, adapt poorly to others, and thus have limited application scenarios. Therefore, some researchers combine the idea of handcrafted features with deep learning computation. However, existing methods still fall short in effectively utilizing the spatial and temporal information of human action, and the accuracy of human action recognition still needs to be improved. Considering these problems, we investigate how to design and construct discriminative human action features and propose a new human action recognition method in which the key semantic information of human action is extracted in the spatial domain using a deep learning network and then concatenated and analyzed in the time domain.
Method
Human action videos usually record more than 24 frames per second; however, human poses do not change at this speed. In the computation of human action characteristics in videos, changes between consecutive frames are usually minimal, and most human action information contained in the video is similar or repeated. To avoid redundant computation, we calculate the key frames of videos according to the amplitude of interframe changes in image content. Frames with repetitive content or only slight changes are eliminated to avoid redundant calculation in the subsequent semantic information analysis and extraction. The calculated key frames contain evident changes of the human body and human-related background and thus reveal sufficient human action information in videos for recognition.
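The key frame criterion can be illustrated with a short sketch. The following is a minimal Python example, assuming the interframe change amplitude is measured as the mean absolute grayscale difference against a fixed threshold; the exact measure and threshold used in the paper may differ.

import cv2
import numpy as np

def extract_key_frames(video_path, diff_threshold=12.0):
    # Keep a frame as a key frame whenever the mean absolute grayscale
    # difference to the previous frame exceeds the threshold
    # (illustrative criterion; the amplitude measure is an assumption).
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev_gray is None:
            key_frames.append(frame)          # always keep the first frame
        elif np.mean(np.abs(gray - prev_gray)) > diff_threshold:
            key_frames.append(frame)          # content changed enough
        prev_gray = gray
    cap.release()
    return key_frames

Frames whose change amplitude stays below the threshold are treated as repetitive and are not passed to the subsequent semantic analysis.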
Then, to analyze and describe the spatial information of human action effectively, we design and construct a deep learning network to analyze the semantic information of images and extract the key semantic regions that express important semantic information. The constructed network, denoted as Net1, is trained by transfer learning and uses successive convolutional layers to mine the semantic information of images. The output of Net1 provides image regions, which contain various kinds of foreground semantic information, and region scores, which represent the probability that a region contains foreground information. In addition, a non-maximum suppression algorithm is used to eliminate regions that overlap excessively.
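The overlap-elimination step can be sketched with the standard greedy IoU-based non-maximum suppression; the procedure below is generic and only illustrates this step, with iou_threshold as an assumed parameter.

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) region scores.
    # Greedily keep the highest-scoring box, drop boxes overlapping it too much.
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]
    return keep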
Afterward, the key semantic regions are classified into person and non-person regions, and the position and proportion of the person regions are used to distinguish the main person from the secondary persons. Moreover, object regions that have no relationship with the main person are eliminated, and only foreground regions that reveal human action-related semantic information are retained.
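The choice of the main person can be illustrated with a simple heuristic; the scoring below (region area weighted by closeness to the image center) is an assumption made for illustration and is not necessarily the exact rule combining position and proportion used in the paper.

def select_main_person(person_boxes, image_shape):
    # Score each person box by its area divided by its distance to the
    # image center; the top-scoring box is treated as the main person.
    h, w = image_shape[:2]
    cx, cy = w / 2.0, h / 2.0
    best, best_score = None, -1.0
    for (x1, y1, x2, y2) in person_boxes:
        area = (x2 - x1) * (y2 - y1)
        bx, by = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        dist = ((bx - cx) ** 2 + (by - cy) ** 2) ** 0.5
        score = area / (1.0 + dist)            # larger and more central wins
        if score > best_score:
            best, best_score = (x1, y1, x2, y2), score
    return best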
Next, a Siamese network is constructed to calculate the correlation of key semantic regions among frames and to concatenate key semantic regions in the temporal domain. The proposed Siamese network, denoted as Net2, has two inputs and one output; it deeply mines and measures the similarity between the two input image regions, and its output value expresses this similarity. The constructed Net2 concatenates the key semantic regions into semantic region chains, which ensure the temporal consistency of semantic information and express human action changes in the time domain more effectively.
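A two-input, one-output Siamese similarity network of this kind can be sketched as follows; the small convolutional encoder and the cosine-based similarity head are illustrative assumptions rather than the exact structure of Net2.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionSiamese(nn.Module):
    # A shared encoder embeds each region crop; the output is a
    # similarity score mapped to [0, 1].
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128),
        )

    def forward(self, region_a, region_b):
        fa = F.normalize(self.encoder(region_a), dim=1)
        fb = F.normalize(self.encoder(region_b), dim=1)
        cosine = (fa * fb).sum(dim=1)          # cosine similarity in [-1, 1]
        return (cosine + 1.0) / 2.0            # map to [0, 1]

Region pairs whose similarity exceeds a chosen threshold would be linked into the same key semantic region chain.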
Moreover, we tailor the feature map of Net1 using interpolation and scaling to obtain feature submaps of uniform size; thus, each semantic region chain corresponds to a feature matrix chain.
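Cropping a region's feature submap and rescaling it to a uniform size can be sketched as below; the output size and the bilinear interpolation mode are assumptions made for illustration.

import torch.nn.functional as F

def region_feature_submap(feature_map, box, image_size, out_size=(7, 7)):
    # feature_map: (N, C, H, W) tensor from Net1; box: [x1, y1, x2, y2]
    # in image coordinates. Project the box onto the feature map, crop
    # the corresponding cells, and bilinearly rescale the crop to out_size.
    _, _, fh, fw = feature_map.shape
    img_h, img_w = image_size
    x1, y1, x2, y2 = box
    fx1, fy1 = int(x1 / img_w * fw), int(y1 / img_h * fh)
    fx2 = max(int(round(x2 / img_w * fw)), fx1 + 1)   # at least one cell wide
    fy2 = max(int(round(y2 / img_h * fh)), fy1 + 1)   # at least one cell high
    crop = feature_map[:, :, fy1:fy2, fx1:fx2]
    return F.interpolate(crop, size=out_size, mode='bilinear', align_corners=False)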
Given that the length of each feature matrix chain differs, maximum fusion is used to fuse each feature matrix chain into a single fused matrix, which reveals one kind of video semantic information. We stack the fused matrices from all feature matrix chains together and then design and train a classifier, which consists of two fully connected layers and a support vector machine. The output of the classifier is the final human action recognition result for the video.
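Maximum fusion of a variable-length feature matrix chain and the layout of the classifier (two fully connected layers followed by a support vector machine) can be sketched as follows; the layer dimensions and the use of scikit-learn's LinearSVC are illustrative assumptions.

import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

def max_fuse(feature_matrix_chain):
    # Element-wise maximum over a variable-length list of equally sized
    # feature matrices (tensors), producing one fused matrix per chain.
    return torch.stack(feature_matrix_chain, dim=0).max(dim=0).values

class TwoFCHead(nn.Module):
    # Two fully connected layers applied to the stacked fused matrices;
    # their output is the feature vector handed to the SVM.
    def __init__(self, in_dim, hidden_dim=1024, out_dim=256):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim), nn.ReLU(),
        )

    def forward(self, stacked_fused):
        return self.fc(torch.flatten(stacked_fused, start_dim=1))

# Final stage (illustrative): fit a linear SVM on the FC features.
# svm = LinearSVC().fit(train_features, train_labels)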
Result
The UCF (University of Central Florida) 50 dataset, a publicly available and challenging human action recognition dataset, is used to verify the performance of the proposed method. On this dataset, the average human action recognition accuracy of the proposed method is 94.3%, which is higher than that of state-of-the-art methods, such as the method based on optical flow motion expression (76.9%), the method based on a two-stream convolutional neural network (88.0%), and the method based on SURF (speeded-up robust features) descriptors with Fisher encoding (91.7%). In addition, the proposed crucial algorithms, the semantic region chain computation and the key semantic region correlation calculation, are verified through a control experiment. The results reveal that the two algorithms effectively improve the accuracy of human action recognition.
Conclusion
The proposed human action recognition method, which uses key semantic region extraction and concatenation, can effectively improve the accuracy of human action recognition in videos.