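Objective To further improve the accuracy and time efficiency of action recognition in intelligent surveillance scenarios, a human action recognition algorithm, LC-YOLO, is proposed, which is based on YOLO and combines LSTM and CNN. Method Exploiting the real-time performance of YOLO object detection, specific actions in the surveillance video are first detected on the fly; once the target size, position and related information are obtained, deep features are extracted and the noise data from irrelevant image regions are removed; finally, an LSTM models the time series and produces the final classification of the action sequence in the surveillance video. Result Experiments on the public action recognition datasets KTH and MSR show that the average recognition rate over all actions reaches 96.6% and the average recognition time is 215 ms, indicating that the proposed method performs well for action recognition in intelligent surveillance. Conclusion This paper proposes an action recognition algorithm. The experimental results show that it effectively improves both the real-time performance and the accuracy of action recognition, and that it adapts well to, and has broad application prospects in, intelligent surveillance with demanding real-time requirements and complex scenes.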
Objective Mainstream action recognition methods still face two main challenges: the first is the extraction of target features, and the second is the speed and real-time performance of the overall recognition process. At present, most state-of-the-art methods use a CNN to extract deep features. However, a CNN is computationally expensive, and most regions of a video frame do not contain the target, so extracting features from the entire image incurs unnecessary cost. Target detection approaches such as the optical flow method are neither real-time nor stable; they are sensitive to external conditions such as illumination, camera angle and distance, and they increase the amount of computation and reduce time efficiency. Therefore, to further improve the accuracy and time efficiency of action recognition in intelligent surveillance scenarios, a human action recognition algorithm, LC-YOLO (LSTM and CNN based on YOLO), is proposed, which is based on YOLO and combines LSTM and CNN. Method The LC-YOLO algorithm mainly consists of three parts: target detection, feature extraction and action recognition. YOLO target detection is added as an auxiliary stage to the mainstream CNN+LSTM framework. First, exploiting the speed and real-time nature of YOLO target detection, specific actions in the surveillance video are detected in real time; the target size, location and other information are obtained, features are then extracted, and the noise data from unrelated areas of the image are efficiently removed. Finally, an LSTM models the time series and produces the final recognition result for the sequence of actions in the surveillance video. Overall, the proposed model is an end-to-end deep neural network that takes a raw video action sequence as input and returns the action category.
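The detection-guided cropping described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the `(x, y, w, h)` pixel box convention, the nested-list frame representation, and the helper names `crop_to_box`, `box_descriptor` and `frame_descriptor` are all hypothetical, and the CNN features are passed in as a plain list rather than computed.

```python
from typing import List, Sequence, Tuple

# Assumed box convention: (x, y, w, h) in pixels, origin at the top-left corner.
Box = Tuple[int, int, int, int]


def crop_to_box(frame: Sequence[Sequence[float]], box: Box) -> List[List[float]]:
    """Keep only the pixels inside the detected box, clipped to the frame."""
    x, y, w, h = box
    height = len(frame)
    width = len(frame[0]) if height else 0
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(width, x + w), min(height, y + h)
    return [list(row[x0:x1]) for row in frame[y0:y1]]


def box_descriptor(box: Box, frame_w: int, frame_h: int) -> List[float]:
    """Normalize position and size so they are independent of resolution."""
    x, y, w, h = box
    return [x / frame_w, y / frame_h, w / frame_w, h / frame_h]


def frame_descriptor(features: Sequence[float], box: Box,
                     frame_w: int, frame_h: int) -> List[float]:
    """Concatenate appearance features with the box position and size."""
    return list(features) + box_descriptor(box, frame_w, frame_h)


if __name__ == "__main__":
    # Toy 4x6 "frame" whose pixel value encodes its row and column.
    frame = [[r * 10 + c for c in range(6)] for r in range(4)]
    patch = crop_to_box(frame, (1, 1, 3, 2))
    print(patch)
    # Placeholder features standing in for the output of a CNN.
    print(frame_descriptor([0.5, 0.25], (1, 1, 3, 2), 6, 4))
```

Only the cropped patch would be handed to the feature extractor in such a design, which is where the saving over whole-frame feature extraction comes from.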
The process by which LC-YOLO recognizes a single action can be described as follows: 1) When a frame containing a specific action is detected, YOLO is first used to extract position and confidence information. YOLO runs at 45 fps, which is fast enough for real-time detection in surveillance video, and when trained on a large amount of data its action detection accuracy can exceed 90%. 2) On the basis of target detection, the image content within the target region is acquired and retained, and the interfering noise data from the remaining background are removed, so that complete and accurate target features can be extracted. A 4096-dimensional deep feature vector is extracted by the VGGNet-16 model and passed to the recognition module together with the target size and position information predicted by YOLO. 3) An LSTM unit serves as the recognition module. Unlike a standard RNN, the LSTM architecture uses memory cells to store and output information, which allows it to better capture the temporal relationships among multiple target actions. Finally, the action category of the entire action sequence is output.
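The role of the memory cell in step 3 can be made concrete with a small sketch of a standard LSTM forward pass in pure Python. It is illustrative only: the tiny layer sizes, the random initialization, and the names `init_params`, `lstm_step` and `classify_sequence` are assumptions rather than details from the paper, and no training is shown. Each input vector stands in for one per-frame descriptor (deep features plus box information).

```python
import math
import random


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def init_params(input_size, hidden_size, seed=0):
    """Small random weights for the input, forget, output and candidate gates."""
    rng = random.Random(seed)
    params = {}
    for gate in "ifog":
        params["W" + gate] = [[rng.uniform(-0.1, 0.1)
                               for _ in range(input_size + hidden_size)]
                              for _ in range(hidden_size)]
        params["b" + gate] = [0.0] * hidden_size
    return params


def lstm_step(params, x, h, c):
    """One time step: gates read [x, h]; the memory cell c carries state."""
    v = list(x) + list(h)
    i = [sigmoid(z) for z in affine(params["Wi"], params["bi"], v)]
    f = [sigmoid(z) for z in affine(params["Wf"], params["bf"], v)]
    o = [sigmoid(z) for z in affine(params["Wo"], params["bo"], v)]
    g = [math.tanh(z) for z in affine(params["Wg"], params["bg"], v)]
    c_new = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c, i, g)]
    h_new = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_new)]
    return h_new, c_new


def classify_sequence(params, W_out, b_out, sequence, hidden_size):
    """Run the LSTM over all frames, then pick the highest-scoring class."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
    scores = affine(W_out, b_out, h)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(1)
    params = init_params(input_size=6, hidden_size=3)
    W_out = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
    frames = [[rng.random() for _ in range(6)] for _ in range(5)]
    print(classify_sequence(params, W_out, [0.0] * 4, frames, hidden_size=3))
```

The forget gate `f` decides how much of the previous cell state survives each step, which is what lets information from early frames influence the label assigned to the whole sequence.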
Compared with previous work, the proposed algorithm makes two main contributions: 1) the YOLO algorithm is used in place of motion foreground extraction, R-CNN and other target detection methods, which is faster and more efficient; 2) once the target area is located, the target size and position information are obtained and the interfering information from unrelated areas of the image can be removed, so that the CNN is used more effectively to extract deep features, improving both the accuracy of feature extraction and the overall time efficiency of action recognition. Result Experiments on the public action recognition datasets KTH and MSR show that the average recognition rate over all actions reaches 96.6% and the average recognition time is 215 ms, indicating that the proposed method performs well for action recognition in intelligent surveillance. Conclusion This paper proposes a human action recognition algorithm, LC-YOLO, which is based on YOLO and combines LSTM and CNN. Exploiting the speed and real-time nature of YOLO target detection, specific actions in the surveillance video are detected in real time; the target size, location and other information are obtained, features are then extracted, and the noise data from unrelated regions of the image are efficiently removed, which further reduces the computational complexity of feature extraction and the time complexity of action recognition. The experimental results on the public action recognition datasets KTH and MSR show that the algorithm adapts well to, and has broad application prospects in, intelligent surveillance with demanding real-time requirements and complex scenes.