Action recognition for intelligent monitoring
2019, Vol. 24, No. 2, pp. 282-290
Received: 2018-07-12; Revised: 2018-08-26; Published in print: 2019-02-16
DOI: 10.11834/jig.180392
Objective
To further improve the accuracy and time efficiency of action recognition in intelligent surveillance scenarios, this paper proposes LC-YOLO (LSTM and CNN based on YOLO), a human action recognition algorithm built on YOLO (you only look once: unified, real-time object detection) combined with LSTM (long short-term memory) and CNN (convolutional neural network).
Method
Exploiting the real-time nature of YOLO object detection, the algorithm first detects specific actions in the surveillance video as they occur, obtaining the target size, position, and other information before extracting deep features; it then removes the noise data contributed by unrelated regions of the image, as sketched below; finally, an LSTM models the temporal sequence and makes the final action decision for the sequence of actions in the surveillance video.
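As an illustration of the noise-removal step, the following is a minimal sketch (not the authors' released code) of cropping the YOLO-predicted box out of a frame so that feature extraction sees only the target region. The helper name crop_target and the normalized (x, y, w, h, c) box layout, with (x, y) the box center, are assumptions made here for illustration.

```python
# Hypothetical helper: keep only the YOLO-detected target region and
# discard unrelated background, which the paper treats as noise data.
import numpy as np
import cv2

def crop_target(frame: np.ndarray, box) -> np.ndarray:
    x, y, w, h, _c = box                 # assumed (x, y, w, h, c), in [0, 1]
    H, W = frame.shape[:2]
    x0 = max(int((x - w / 2) * W), 0)    # box center -> pixel corners
    y0 = max(int((y - h / 2) * H), 0)
    x1 = min(int((x + w / 2) * W), W)
    y1 = min(int((y + h / 2) * H), H)
    roi = frame[y0:y1, x0:x1]            # target region only
    return cv2.resize(roi, (224, 224))   # VGGNet-16 input resolution
```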
Result
Experiments on the public action recognition datasets KTH and MSR show that the average recognition rate of each action reaches 96.6% and the average recognition time is 215 ms, indicating that the proposed method performs well for action recognition in intelligent surveillance.
Conclusion
This paper proposes an action recognition algorithm; experimental results show that it effectively improves the real-time performance and accuracy of action recognition, adapts well to intelligent surveillance with demanding real-time requirements and complex scenes, and has broad application prospects.
Objective
Mainstream action recognition methods still face two main challenges: the extraction of target features, and the speed and real-time performance of the overall recognition pipeline. At present, most state-of-the-art methods use a CNN (convolutional neural network) to extract deep features. However, CNNs are computationally expensive, and most regions in a video stream do not contain the target, so extracting features from entire frames is wasteful. Classical target detection approaches, such as the optical flow method, are not real-time, are unstable and susceptible to external conditions such as illumination, camera angle, and distance, and add computation that reduces time efficiency. Therefore, a human action recognition algorithm called LC-YOLO (LSTM and CNN based on YOLO), which combines YOLO (you only look once: unified, real-time object detection) with LSTM (long short-term memory) and CNN, is proposed to improve the accuracy and time efficiency of action recognition in intelligent surveillance scenarios.
Method
The LC-YOLO algorithm consists of three parts: target detection, feature extraction, and action recognition. YOLO target detection is added as an aid to the mainstream CNN+LSTM pipeline. The speed and real-time nature of YOLO are exploited to detect specific actions in surveillance video as they occur; the target size, location, and other information are obtained; features are extracted; and noise data from unrelated areas of the image are efficiently removed. Combined with LSTM modeling of the time series, a final action label is assigned to the sequence of actions in the surveillance video. Overall, the proposed model is an end-to-end deep neural network that takes the raw video action sequence as input and returns the action category.
The recognition of a single action by LC-YOLO proceeds as follows. 1) YOLO extracts the position and confidence information (x, y, w, h, c). Running at 45 frames/s, it can detect specific action frames in surveillance video in real time; trained on a large number of samples, YOLO reaches an action detection accuracy above 90%. 2) On the basis of the detection, the image content within the target box is retained and the noise interference from the remaining background is removed, so that complete and accurate target features can be extracted. A 4096-dimensional deep feature vector is extracted with a VGGNet-16 model and passed to the recognition module together with the target size and position information (x, y, w, h, c) predicted by YOLO. 3) In contrast to a standard RNN, the LSTM architecture uses memory cells to store and output information; with LSTM units as the recognition module, the temporal relationships among multiple target actions are captured, and the action category of the entire sequence is output.
Compared with previous work, the contributions of the proposed algorithm are as follows. 1) Instead of motion foreground extraction, R-CNN, or other target detection methods, the faster and more efficient YOLO algorithm is used. 2) Once the target area is locked, the target size and position information are obtained and the interference from unrelated areas of the frame is removed, which lets the CNN extract deep features effectively and improves both the accuracy of feature extraction and the overall time efficiency of action recognition. A minimal model sketch is given below.
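To make the three-stage pipeline concrete, here is a minimal PyTorch sketch of how the described model could be assembled; it is an illustration under stated assumptions rather than the authors' implementation. The class name LCYOLO, the 512-unit LSTM, and the truncation of torchvision's VGG-16 at its first 4096-dimensional fully connected layer are assumptions, and YOLO detection plus cropping are assumed to happen upstream.

```python
# Sketch of the LC-YOLO recognition model: per-frame VGG-16 features
# concatenated with YOLO's (x, y, w, h, c), fed to an LSTM classifier.
import torch
import torch.nn as nn
from torchvision import models

class LCYOLO(nn.Module):
    def __init__(self, num_actions: int):
        super().__init__()
        vgg = models.vgg16(weights=None)      # VGGNet-16 backbone
        self.features = vgg.features
        self.avgpool = vgg.avgpool
        # keep VGG-16 through its first fully connected layer: 4096-d output
        self.fc = nn.Sequential(*list(vgg.classifier.children())[:2])
        # 4096-d appearance feature + 5-d (x, y, w, h, c) box vector
        self.lstm = nn.LSTM(4096 + 5, 512, batch_first=True)  # 512 is assumed
        self.classifier = nn.Linear(512, num_actions)

    def forward(self, crops: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # crops: (T, 3, 224, 224) target regions cropped via YOLO boxes
        # boxes: (T, 5) per-frame (x, y, w, h, c) predictions
        f = self.features(crops)                  # (T, 512, 7, 7)
        f = torch.flatten(self.avgpool(f), 1)     # (T, 25088)
        f = self.fc(f)                            # (T, 4096) deep feature
        seq = torch.cat([f, boxes], dim=1)[None]  # (1, T, 4101) sequence
        out, _ = self.lstm(seq)
        return self.classifier(out[:, -1])        # clip-level action logits

# Example: one 16-frame clip over 6 classes (KTH defines 6 actions).
model = LCYOLO(num_actions=6)
logits = model(torch.randn(16, 3, 224, 224), torch.rand(16, 5))
```

The last LSTM output is used here as the clip summary; pooling over all time steps would be an equally plausible reading of the paper's description.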
Result
Experiments on the public action recognition datasets KTH and MSR show that the average recognition rate of each action reaches 96.6% and the average recognition time is 215 ms; the proposed method is therefore effective for action recognition in intelligent surveillance.
Conclusion
This study presents a human action recognition algorithm called LC-YOLO, which combines YOLO with LSTM and CNN. The speed and real-time nature of YOLO target detection are exploited to detect specific actions in surveillance video as they occur; the target size, location, and other information are obtained; features are extracted; and the noise data of unrelated regions in the image are efficiently removed, which reduces the computational complexity of feature extraction and the time complexity of action recognition. Experimental results on the public action recognition datasets KTH and MSR show that the method adapts well to intelligent surveillance with demanding real-time requirements and complex scenes and has broad application prospects.