Action recognition for intelligent monitoring
2019, Vol. 24, No. 2, pp. 282-290
Received: 2018-07-12; Revised: 2018-08-26; Published in print: 2019-02-16
DOI: 10.11834/jig.180392
Objective
To further improve the accuracy and time efficiency of action recognition in intelligent surveillance scenarios, this paper proposes LC-YOLO (LSTM and CNN based on YOLO), a human action recognition algorithm built on YOLO (you only look once: unified, real-time object detection) combined with LSTM (long short-term memory) and CNN (convolutional neural network).
Method
Exploiting the real-time nature of YOLO object detection, the algorithm first detects specific actions in the surveillance video as they occur, obtaining the target size, position, and other information before extracting deep features; it then removes the noise data contributed by unrelated regions of the image, as sketched below; finally, an LSTM models the temporal sequence and makes the final action decision for the sequence of actions in the surveillance video.
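As an illustration of the noise-removal step, the following is a minimal sketch (not the authors' released code) of cropping the YOLO-predicted box out of a frame so that feature extraction sees only the target region. The helper name crop_target and the normalized (x, y, w, h, c) box layout, with (x, y) the box center, are assumptions made here for illustration.

```python
# Hypothetical helper: keep only the YOLO-detected target region and
# discard unrelated background, which the paper treats as noise data.
import numpy as np
import cv2

def crop_target(frame: np.ndarray, box) -> np.ndarray:
    x, y, w, h, _c = box                 # assumed (x, y, w, h, c), in [0, 1]
    H, W = frame.shape[:2]
    x0 = max(int((x - w / 2) * W), 0)    # box center -> pixel corners
    y0 = max(int((y - h / 2) * H), 0)
    x1 = min(int((x + w / 2) * W), W)
    y1 = min(int((y + h / 2) * H), H)
    roi = frame[y0:y1, x0:x1]            # target region only
    return cv2.resize(roi, (224, 224))   # VGGNet-16 input resolution
```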
Result
Experiments on the public action recognition datasets KTH and MSR show that the average recognition rate of each action reaches 96.6% and the average recognition time is 215 ms, indicating that the proposed method performs well for action recognition in intelligent surveillance.
Conclusion
This paper proposes an action recognition algorithm; experimental results show that it effectively improves the real-time performance and accuracy of action recognition, adapts well to intelligent surveillance with demanding real-time requirements and complex scenes, and has broad application prospects.
Objective
Mainstream action recognition methods still face two main challenges: the extraction of target features, and the speed and real-time performance of the overall recognition pipeline. At present, most state-of-the-art methods use a CNN (convolutional neural network) to extract deep features. However, CNNs are computationally expensive, and most regions in a video stream do not contain the target, so extracting features from entire frames is wasteful. Classical target detection approaches, such as the optical flow method, are not real-time, are unstable and susceptible to external conditions such as illumination, camera angle, and distance, and add computation that reduces time efficiency. Therefore, a human action recognition algorithm called LC-YOLO (LSTM and CNN based on YOLO), which combines YOLO (you only look once: unified, real-time object detection) with LSTM (long short-term memory) and CNN, is proposed to improve the accuracy and time efficiency of action recognition in intelligent surveillance scenarios.
Method
The LC-YOLO algorithm consists of three parts: target detection, feature extraction, and action recognition. YOLO target detection is added as an aid to the mainstream CNN+LSTM pipeline. The speed and real-time nature of YOLO are exploited to detect specific actions in surveillance video as they occur; the target size, location, and other information are obtained; features are extracted; and noise data from unrelated areas of the image are efficiently removed. Combined with LSTM modeling of the time series, a final action label is assigned to the sequence of actions in the surveillance video. Overall, the proposed model is an end-to-end deep neural network that takes the raw video action sequence as input and returns the action category.
The recognition of a single action by LC-YOLO proceeds as follows. 1) YOLO extracts the position and confidence information (x, y, w, h, c). Running at 45 frames/s, it can detect specific action frames in surveillance video in real time; trained on a large number of samples, YOLO reaches an action detection accuracy above 90%. 2) On the basis of the detection, the image content within the target box is retained and the noise interference from the remaining background is removed, so that complete and accurate target features can be extracted. A 4096-dimensional deep feature vector is extracted with a VGGNet-16 model and passed to the recognition module together with the target size and position information (x, y, w, h, c) predicted by YOLO. 3) In contrast to a standard RNN, the LSTM architecture uses memory cells to store and output information; with LSTM units as the recognition module, the temporal relationships among multiple target actions are captured, and the action category of the entire sequence is output.
Compared with previous work, the contributions of the proposed algorithm are as follows. 1) Instead of motion foreground extraction, R-CNN, or other target detection methods, the faster and more efficient YOLO algorithm is used. 2) Once the target area is locked, the target size and position information are obtained and the interference from unrelated areas of the frame is removed, which lets the CNN extract deep features effectively and improves both the accuracy of feature extraction and the overall time efficiency of action recognition. A minimal model sketch is given below.
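To make the three-stage pipeline concrete, here is a minimal PyTorch sketch of how the described model could be assembled; it is an illustration under stated assumptions rather than the authors' implementation. The class name LCYOLO, the 512-unit LSTM, and the truncation of torchvision's VGG-16 at its first 4096-dimensional fully connected layer are assumptions, and YOLO detection plus cropping are assumed to happen upstream.

```python
# Sketch of the LC-YOLO recognition model: per-frame VGG-16 features
# concatenated with YOLO's (x, y, w, h, c), fed to an LSTM classifier.
import torch
import torch.nn as nn
from torchvision import models

class LCYOLO(nn.Module):
    def __init__(self, num_actions: int):
        super().__init__()
        vgg = models.vgg16(weights=None)      # VGGNet-16 backbone
        self.features = vgg.features
        self.avgpool = vgg.avgpool
        # keep VGG-16 through its first fully connected layer: 4096-d output
        self.fc = nn.Sequential(*list(vgg.classifier.children())[:2])
        # 4096-d appearance feature + 5-d (x, y, w, h, c) box vector
        self.lstm = nn.LSTM(4096 + 5, 512, batch_first=True)  # 512 is assumed
        self.classifier = nn.Linear(512, num_actions)

    def forward(self, crops: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # crops: (T, 3, 224, 224) target regions cropped via YOLO boxes
        # boxes: (T, 5) per-frame (x, y, w, h, c) predictions
        f = self.features(crops)                  # (T, 512, 7, 7)
        f = torch.flatten(self.avgpool(f), 1)     # (T, 25088)
        f = self.fc(f)                            # (T, 4096) deep feature
        seq = torch.cat([f, boxes], dim=1)[None]  # (1, T, 4101) sequence
        out, _ = self.lstm(seq)
        return self.classifier(out[:, -1])        # clip-level action logits

# Example: one 16-frame clip over 6 classes (KTH defines 6 actions).
model = LCYOLO(num_actions=6)
logits = model(torch.randn(16, 3, 224, 224), torch.rand(16, 5))
```

The last LSTM output is used here as the clip summary; pooling over all time steps would be an equally plausible reading of the paper's description.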
Result
Experiments on the public action recognition datasets KTH and MSR show that the average recognition rate of each action reaches 96.6% and the average recognition time is 215 ms; the proposed method is therefore effective for action recognition in intelligent surveillance.
Conclusion
This study presents a human action recognition algorithm called LC-YOLO, which combines YOLO with LSTM and CNN. The speed and real-time nature of YOLO target detection are exploited to detect specific actions in surveillance video as they occur; the target size, location, and other information are obtained; features are extracted; and the noise data of unrelated regions in the image are efficiently removed, which reduces the computational complexity of feature extraction and the time complexity of action recognition. Experimental results on the public action recognition datasets KTH and MSR show that the method adapts well to intelligent surveillance with demanding real-time requirements and complex scenes and has broad application prospects.