Published: 2019-02-16
DOI: 10.11834/jig.180392
2019 | Volume 24 | Number 2

ChinaMM 2018

Action recognition for intelligent monitoring

Ma Yuxi, Tan Li, Dong Xu, Yu Chongchong

Beijing Key Laboratory of Big Data Technology for Food Safety, College of Computer & Information Engineering, Beijing Technology & Business University, Beijing 100048, China

Received: 2018-07-12; Revised: 2018-08-26

Supported by: National Natural Science Foundation of China (61702020); Natural Science Foundation of Beijing, China (4172013)

First author: Ma Yuxi, born in 1995, male, master's degree; main research interests: computer vision and artificial intelligence. E-mail: 202755332@qq.com;
Dong Xu, male, master's degree; main research interest: data mining. E-mail: dongxu609@126.com;
Yu Chongchong, female, Ph.D., professor; main research interests: pattern recognition and machine learning. E-mail: yucc@btbu.edu.cn.

CLC number: TP311

Document code: A

Article ID: 1006-8961(2019)02-0282-09

Abstract

Objective Mainstream action recognition methods still face two main challenges: the extraction of target features, and the speed and real-time performance of the overall recognition process. Most state-of-the-art methods use a CNN (convolutional neural network) to extract deep features. However, a CNN is computationally expensive, and most regions of a video frame do not contain the target, so extracting features from the entire image is costly. Target detection approaches such as motion foreground extraction and the optical flow method are neither real-time nor stable; they are sensitive to external conditions such as illumination, camera angle, and distance, and they add computation and reduce time efficiency. Therefore, a human action recognition algorithm called LC-YOLO (LSTM and CNN based on YOLO) is proposed. It builds on YOLO (you only look once: unified, real-time object detection) and combines it with LSTM (long short-term memory) and a CNN to improve the accuracy and time efficiency of action recognition in intelligent surveillance scenarios. Method LC-YOLO consists of three parts: target detection, feature extraction, and action recognition. YOLO target detection is added as an auxiliary step to the mainstream CNN+LSTM framework. Exploiting the speed of YOLO, specific actions in surveillance video are detected in real time; the target size, location, and related information are obtained; noise from unrelated image regions is removed; and deep features are extracted. An LSTM then models the time series and makes the final decision for the action sequence. Overall, the proposed model is an end-to-end deep neural network that takes the raw video action sequence as input and returns the action category.
Recognition of a single action proceeds as follows. 1) When a specific action frame occurs, YOLO extracts the position and confidence information (x, y, w, h, c); at 45 frame/s it can detect in real time on surveillance video, and after training on a large amount of data its action detection accuracy can exceed 90%. 2) Based on the detection, the image content inside the target region is retained and the background noise is removed, so that complete and accurate target features can be extracted. A 4 096-dimensional deep feature vector is extracted with a VGGNet-16 model and passed to the recognition module together with the target size and position information (x, y, w, h, c) predicted by YOLO. 3) LSTM units serve as the recognition module. Unlike a standard RNN, the LSTM architecture uses memory cells to store and output information, which lets it better capture the temporal relationship among multiple target actions, and it outputs the action category of the whole sequence. Compared with previous work, the contributions are as follows. 1) YOLO replaces motion foreground extraction, R-CNN, and other target detection methods, and is faster and more efficient. 2) Once the target region is located, its size and position are known and interference from unrelated regions can be removed, so the CNN extracts deep features more efficiently, which improves the accuracy of feature extraction and the overall time efficiency of action recognition.
Result Experiments on the public action recognition datasets KTH and MSR show that the average recognition rate over all actions reaches 96.6% and the average recognition time is 215 ms, indicating that the method performs well for action recognition in intelligent monitoring. Conclusion This study presents a human action recognition algorithm, LC-YOLO, which builds on YOLO and combines LSTM and a CNN. By detecting specific actions in real time and removing noise from unrelated image regions before feature extraction, it reduces the computational cost of feature extraction and the time cost of action recognition. The experimental results on KTH and MSR show that the algorithm improves both the real-time performance and the accuracy of action recognition, and that it adapts well to intelligent monitoring with strict real-time requirements and complex scenes.

Key words

action recognition; target detection; deep learning; convolutional neural network; recurrent neural network

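0 Introduction

Vision-based action recognition analyzes and recognizes human actions in video images and responds promptly to specific actions, such as punching or running, so that monitoring staff can deal with them. It is a key technology in intelligent monitoring systems and is widely used in security surveillance, intelligent human-computer interaction, virtual reality, and other fields^[1].

In recent years, many researchers have studied vision-based action recognition from different angles, and it has become an active topic in computer vision. Ng et al.^[2] modeled video with LSTM, connecting the outputs of the underlying CNN as the input at the next time step, and obtained a recognition rate of 82.6% on the UCF101 dataset. Ullah et al.^[3] proposed a network that combines a CNN with a deep bidirectional LSTM (DB-LSTM) to process video data; the DB-LSTM learns the sequential information among frame features, and long action sequences are handled by analyzing features at specific time intervals. Donahue et al.^[4] proposed long-term recurrent convolutional networks (LRCN), which combine a CNN and LSTM to extract features from video: each frame is passed through the CNN, and the CNN outputs are then fed to the LSTM in temporal order, so that the video is represented in both the spatial and temporal dimensions. LRCN obtained an average recognition rate of 82.92% on UCF101.

However, these mainstream methods still face two main challenges: 1) the extraction of target features; 2) the speed and real-time performance of the overall recognition process. Most current methods extract deep features with a CNN. A CNN is computationally expensive in itself, and most regions of a video frame do not contain the target, so extracting features from the whole image is costly. Target detection approaches such as motion foreground extraction and the optical flow method are neither real-time nor stable; they are easily affected by external conditions such as illumination, camera angle, and distance, and they add computation and reduce time efficiency^[5].

To address these shortcomings, this paper proposes LC-YOLO, a human action recognition algorithm based on YOLO and combined with LSTM and a CNN. Exploiting the speed and real-time performance of YOLO target detection, it first detects specific actions in surveillance video, obtains the target size, position, and related information, and only then extracts features. Noise from unrelated image regions is thereby removed, which further reduces the computational cost of feature extraction and the time cost of action recognition. Experiments on the public action recognition datasets KTH and MSR show that the method recognizes actions effectively in intelligent monitoring scenarios.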

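1 Related work

This section reviews recent research on vision-based action recognition in more detail, organized around the two key problems identified above: target feature extraction, and the speed and real-time performance of recognition.

Traditional action recognition methods rely mainly on hand-crafted features and focus on designing powerful feature descriptors, such as the histogram of oriented gradients (HOG), the histogram of optical flow (HOF), and motion history images (MHI)^[6]. Li et al.^[7] represented a series of human postures by a representative bag of 3D points (BOPs) extracted from the video, built an action graph with the BOPs as nodes, and recognized actions by computing the probability of each path on the graph. Deep learning can automatically extract multi-layer feature representations hidden in the data and has strong representational power. Building on this advantage, Chéron et al.^[8] captured motion information from single-frame deep features and optical flow data and designed a multi-resolution convolutional neural network for action classification. Fan et al.^[9] extracted the moving foreground of the target with a Gaussian mixture model, built a sample library for the target actions in the training set, defined the action categories as prior knowledge, and trained a deep network model for action recognition. Tu et al.^[10] proposed the MSR-CNN algorithm, which improves target detection by extracting features from motion salient regions (MSR) and recognizes actions accurately with less training data.

Researchers have also made various attempts to improve the speed and real-time performance of action recognition. Karpathy et al.^[11] extracted video features with a multi-resolution convolutional neural network. The input video is divided into two independent streams, a low-resolution stream and an original-resolution stream. Both streams alternate convolutional, normalization, and pooling layers, and they are finally merged into two fully connected layers for subsequent recognition; processing the streams in parallel improves recognition speed. Zhou et al.^[12] proposed a real-time action recognition algorithm that improves speed by analyzing only a subset of key frames, and that maintains accuracy by using a hidden Markov model (HMM) to analyze the temporal relationship among the detected key frames.

The LC-YOLO algorithm proposed in this paper adds YOLO target detection as an auxiliary step to the mainstream CNN+LSTM framework. Compared with previous work, its main contributions are the following two points:

1) YOLO is used in place of motion foreground extraction, R-CNN, and other target detection methods, and is faster and more efficient.

2) Once the target region is located, the target size and position are known and interference from unrelated image regions can be removed, so the CNN extracts deep features more efficiently, which improves the accuracy of feature extraction and the overall time efficiency of action recognition.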

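2 Algorithm overview

The LC-YOLO algorithm consists of three parts: target detection, feature extraction, and action recognition. Its overall structure and flow are shown in Fig. 1.

Fig. 1 Overall flow diagram of the LC-YOLO algorithm

First, an action dataset matching the user-defined action categories is selected for training. After training, the YOLO model performs fast, real-time target detection on every frame of the video stream and marks the target region in the frame. A conventional CNN model then extracts features of the target, and finally the feature vectors of the consecutive action sequence are fed to an LSTM for the final action decision. Overall, the proposed model is an end-to-end deep neural network that takes the raw video action sequence as input and returns the action category.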

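2.1 YOLO target detection

Target detection extracts the moving foreground or the object of interest from a video or image. YOLO is a regression-based deep learning detector^[13] that integrates region prediction and class prediction in a single neural network. At test time the whole image is fed to the model once, so the prediction uses the global information of the image, and only one network evaluation is needed. It is therefore many times faster than traditional detection algorithms such as optical flow and background subtraction, and than deep learning detectors such as R-CNN and Fast R-CNN. It detects and recognizes targets at 45 frame/s with relatively high accuracy, which makes it better suited to real application environments.

The YOLO network has 24 convolutional layers and 2 fully connected layers. The convolutional layers extract image features, and the fully connected layers predict the target positions and class probabilities. The network borrows from the GoogLeNet classification architecture, but instead of inception modules it uses a 1×1 convolutional layer (to integrate cross-channel information) followed by a 3×3 convolutional layer as a simple replacement. The algorithm proceeds as follows:

1) Given an input image, divide it into an $S$×$S$ grid;

2) For each grid cell, predict $B$ bounding boxes, including the confidence that each box contains a target and the probabilities of the box region over the classes;

3) The previous step yields $S \times S \times B$ candidate windows. Windows with low scores are removed by thresholding, and redundant windows are then removed by non-maximum suppression (NMS).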
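Step 3 can be illustrated with a small NumPy sketch. It is not the authors' implementation: the corner format `(x1, y1, x2, y2)`, the function names, and the two threshold values are assumptions chosen for illustration.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def filter_and_nms(boxes, scores, score_thresh=0.2, iou_thresh=0.5):
    """Drop low-score windows, then greedily suppress overlapping ones.

    Returns the indices of the kept windows, highest score first.
    Both thresholds are illustrative values, not taken from the paper.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    idx = np.flatnonzero(scores >= score_thresh)   # thresholding
    idx = idx[np.argsort(-scores[idx])]            # sort by descending score
    keep = []
    while idx.size > 0:
        best, rest = idx[0], idx[1:]
        keep.append(int(best))
        if rest.size == 0:
            break
        # Keep only windows that do not overlap the current best too much.
        idx = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30], [0, 0, 5, 5]]
scores = [0.9, 0.8, 0.7, 0.1]
kept = filter_and_nms(boxes, scores)
```

In this toy input the second window overlaps the first with an IoU of about 0.68 and is suppressed, and the fourth falls below the score threshold, so windows 0 and 2 remain.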

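2.2 LSTM architecture

A recurrent neural network (RNN) models time series naturally. It takes the hidden-layer state from previous time steps as part of the input at the current step, so information is preserved along the time dimension; that is, the current output depends on the current input and on the states of previous steps. Let the input sequence be $\mathit{\boldsymbol{x}}{\rm{ = }}\left\{ {{\mathit{\boldsymbol{x}}_1}, {\mathit{\boldsymbol{x}}_2}, \cdots , {\mathit{\boldsymbol{x}}_t}, \cdots , {\mathit{\boldsymbol{x}}_T}} \right\}$, where the video stream has $T$ frames and $t$ denotes the $t$-th frame. Then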

$ {\mathit{\boldsymbol{h}}_t} = {\sigma _{\rm{h}}}\left( {{\mathit{\boldsymbol{W}}_{x{\rm{h}}}}{\mathit{\boldsymbol{x}}_t} + {\mathit{\boldsymbol{W}}_{{\rm{hh}}}}{\mathit{\boldsymbol{h}}_{t - 1}} + {\mathit{\boldsymbol{b}}_{\rm{h}}}} \right) $

(1)

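where ${\mathit{\boldsymbol{h}}_t}$ is the output of the hidden layer at time $t$, ${\mathit{\boldsymbol{W}}_{x{\rm{h}}}}$ is the weight matrix from the input layer to the hidden layer, ${\mathit{\boldsymbol{W}}_{{\rm{hh}}}}$ is the weight matrix from the hidden layer to the hidden layer, ${\mathit{\boldsymbol{b}}_{\rm{h}}}$ is the bias of the hidden layer, and ${\sigma _{\rm{h}}}$ is the activation function of the hidden layer. The final output is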

$ {\mathit{\boldsymbol{y}}_t} = {\sigma _y}\left( {{\mathit{\boldsymbol{W}}_{{\rm{ho}}}}{\mathit{\boldsymbol{h}}_t} + {\mathit{\boldsymbol{b}}_{\rm{o}}}} \right) $

(2)

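where ${\mathit{\boldsymbol{y}}_t}$ is the predicted label at step $t$ of the sequence, ${{\mathit{\boldsymbol{W}}_{{\rm{ho}}}}}$ is the weight matrix from the hidden layer to the output layer, ${{\mathit{\boldsymbol{b}}_{\rm{o}}}}$ is the output bias, and ${\sigma _y}$ is the output activation function.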
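Eqs. (1) and (2) translate almost line for line into NumPy. The sketch below is only illustrative: the paper does not name the activation functions, so taking $\sigma_{\rm h}$ as tanh and $\sigma_y$ as softmax is an assumption, as are the function names and the zero initial hidden state.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, W_xh, W_hh, W_ho, b_h, b_o):
    """Run a plain RNN over a sequence xs of shape (T, n_in).

    Assumes sigma_h = tanh and sigma_y = softmax (not stated in the paper).
    Returns the per-step outputs y_t, shape (T, n_out), and the last hidden state.
    """
    h = np.zeros(W_hh.shape[0])                     # assumed initial state h_0 = 0
    ys = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)    # Eq. (1)
        ys.append(softmax(W_ho @ h + b_o))          # Eq. (2)
    return np.stack(ys), h

rng = np.random.default_rng(0)
T, n_in, n_hid, n_out = 5, 4, 3, 6
ys, h_T = rnn_forward(
    rng.normal(size=(T, n_in)),
    rng.normal(size=(n_hid, n_in)),
    rng.normal(size=(n_hid, n_hid)),
    rng.normal(size=(n_out, n_hid)),
    np.zeros(n_hid),
    np.zeros(n_out),
)
```

Each row of `ys` is a probability vector over the `n_out` classes, and the hidden state `h` is the only quantity carried from one frame to the next.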

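The main problem of a conventional RNN is that it can only model short sequences. During backpropagation the error either blows up or decays over time, so distant context becomes inaccessible; as the network is unrolled over more steps the error gradient vanishes faster, which is known as the vanishing gradient problem. To solve it, LSTM introduces three gates that maintain the state. An LSTM unit receives the output of the previous time step, the current state, and the current input, updates the state through the input gate, forget gate, and output gate, and emits the final result^[14]. As shown in Fig. 2, the three gates are the input gate ${\mathit{\boldsymbol{i}}_t}$, the forget gate ${\mathit{\boldsymbol{f}}_t}$, and the output gate ${\mathit{\boldsymbol{o}}_t}$. ${\mathit{\boldsymbol{i}}_t}$ and ${\mathit{\boldsymbol{o}}_t}$ control the flow of information into and out of the network, and ${\mathit{\boldsymbol{f}}_t}$ controls the influence of the preceding sequence. The equations are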

$ \left\{ \begin{array}{l} {\mathit{\boldsymbol{i}}_t} = \sigma \left( {{\mathit{\boldsymbol{W}}_{x{\rm{i}}}}{\mathit{\boldsymbol{x}}_t} + {\mathit{\boldsymbol{W}}_{{\rm{hi}}}}{\mathit{\boldsymbol{h}}_{t - 1}} + {\mathit{\boldsymbol{W}}_{{\rm{ci}}}}{\mathit{\boldsymbol{c}}_{t - 1}} + {\mathit{\boldsymbol{b}}_{\rm{i}}}} \right)\\ {\mathit{\boldsymbol{f}}_t} = \sigma \left( {{\mathit{\boldsymbol{W}}_{x{\rm{f}}}}{\mathit{\boldsymbol{x}}_t} + {\mathit{\boldsymbol{W}}_{{\rm{hf}}}}{\mathit{\boldsymbol{h}}_{t - 1}} + {\mathit{\boldsymbol{W}}_{{\rm{cf}}}}{\mathit{\boldsymbol{c}}_{t - 1}} + {\mathit{\boldsymbol{b}}_{\rm{f}}}} \right)\\ {\mathit{\boldsymbol{o}}_t} = \sigma \left( {{\mathit{\boldsymbol{W}}_{x{\rm{o}}}}{\mathit{\boldsymbol{x}}_t} + {\mathit{\boldsymbol{W}}_{{\rm{ho}}}}{\mathit{\boldsymbol{h}}_{t - 1}} + {\mathit{\boldsymbol{W}}_{{\rm{co}}}}{\mathit{\boldsymbol{c}}_{t - 1}} + {\mathit{\boldsymbol{b}}_{\rm{o}}}} \right)\\ {\mathit{\boldsymbol{c}}_t} = {\mathit{\boldsymbol{f}}_t} \odot {\mathit{\boldsymbol{c}}_{t - 1}} + {\mathit{\boldsymbol{i}}_t} \odot \tanh \left( {{\mathit{\boldsymbol{W}}_{x{\rm{c}}}}{\mathit{\boldsymbol{x}}_t} + {\mathit{\boldsymbol{W}}_{{\rm{hc}}}}{\mathit{\boldsymbol{h}}_{t - 1}} + {\mathit{\boldsymbol{b}}_{\rm{c}}}} \right)\\ {\mathit{\boldsymbol{h}}_t} = {\mathit{\boldsymbol{o}}_t} \odot \tanh {\mathit{\boldsymbol{c}}_t} \end{array} \right. $

(3)

Fig. 2 Structure diagram of LSTM

where ${\mathit{\boldsymbol{c}}_t}$ is the memory cell at time $t$, ${\mathit{\boldsymbol{h}}_t}$ is the output of the hidden layer, $\mathit{\boldsymbol{b}}$ denotes the biases, and $\mathit{\boldsymbol{W}} = \left\{ {{\mathit{\boldsymbol{W}}_{x{\rm{i}}}}, {\mathit{\boldsymbol{W}}_{x{\rm{o}}}}, {\mathit{\boldsymbol{W}}_{x{\rm{f}}}}, {\mathit{\boldsymbol{W}}_{x{\rm{c}}}}, {\mathit{\boldsymbol{W}}_{{\rm{ci}}}}, {\mathit{\boldsymbol{W}}_{{\rm{co}}}}, {\mathit{\boldsymbol{W}}_{{\rm{cf}}}}, {\mathit{\boldsymbol{W}}_{{\rm{hi}}}}, {\mathit{\boldsymbol{W}}_{{\rm{ho}}}}, {\mathit{\boldsymbol{W}}_{{\rm{hf}}}}, {\mathit{\boldsymbol{W}}_{{\rm{hc}}}}} \right\}$ denotes the weight parameters, which are learned jointly over the time series by backpropagation.
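One time step of Eq. (3) can be written out as follows. This is a sketch that follows the equations as printed, including the terms in which each gate sees the previous cell state ${\mathit{\boldsymbol{c}}_{t - 1}}$. Storing the parameters in a dictionary and using full matrices for ${\mathit{\boldsymbol{W}}_{{\rm{ci}}}}$, ${\mathit{\boldsymbol{W}}_{{\rm{cf}}}}$, ${\mathit{\boldsymbol{W}}_{{\rm{co}}}}$ are illustrative choices, and `zero_params` is a hypothetical helper that exists only to make the example checkable by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Eq. (3); p maps names such as 'W_xi' to arrays."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] @ c_prev + p["b_i"])
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["W_cf"] @ c_prev + p["b_f"])
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] @ c_prev + p["b_o"])
    # The forget gate scales the old cell state; the input gate scales the candidate.
    c_t = f_t * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def zero_params(n_in, n_hid):
    """Hypothetical helper: all-zero parameters, so every gate evaluates to 0.5."""
    p = {}
    for g in "ifoc":
        p[f"W_x{g}"] = np.zeros((n_hid, n_in))
        p[f"W_h{g}"] = np.zeros((n_hid, n_hid))
        p[f"b_{g}"] = np.zeros(n_hid)
    for g in "ifo":
        p[f"W_c{g}"] = np.zeros((n_hid, n_hid))
    return p

p = zero_params(n_in=3, n_hid=2)
h_t, c_t = lstm_step(np.ones(3), np.zeros(2), np.ones(2), p)
```

With all-zero parameters each gate equals 0.5 and the candidate is tanh(0) = 0, so the new cell state is half the old one. This makes the role of the forget gate easy to see: it decides how much of ${\mathit{\boldsymbol{c}}_{t - 1}}$ survives into ${\mathit{\boldsymbol{c}}_t}$.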

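2.3 LC-YOLO algorithm

LC-YOLO adds YOLO target detection as an auxiliary step to the mainstream CNN+LSTM framework, combining the speed and real-time performance of YOLO with the ability of LSTM to handle long time series. The regression-based YOLO detector quickly finds frames that contain a specific action, and a CNN then extracts deep features of the target. The LSTM avoids vanishing gradients and, by modeling the time series, classifies consecutive action frames accurately. Together these steps improve both the accuracy and the time efficiency of action recognition. The framework of LC-YOLO is shown in Fig. 3.

Fig. 3 Overall frame diagram of LC-YOLO

In Fig. 3, ($x$, $y$) are the center coordinates of the bounding box of the detected target, ($w$, $h$) are the width and height of the box, and $c$ is the confidence of the detection; ${\mathit{\boldsymbol{X}}_t}$ is the extracted deep feature vector.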

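2.3.1 Model construction

Both the YOLO detection model and the CNN feature extraction model are trained on images of the different actions cropped from video, with the target position and size annotated. The fully connected layers of YOLO regress the feature representation to region predictions, which are encoded as a vector of size $S$×$S$×($B$×$5+C$). The image is divided into $S$×$S$ cells, each cell has $B$ predicted bounding boxes, and each box is described by 5 parameters: the center coordinates ($x$, $y$), the width and height ($w$, $h$), and the confidence $c$.

Once the target size, position, and related information have been detected, a conventional CNN model can be used to extract deep features. The CNN takes a video frame as input and produces a feature map of the image; VGGNet-16 is chosen here as the model for general feature learning. The front layers of a CNN (close to the input image) extract basic features such as texture and color, and the features become more abstract and task-specific toward the back of the network. The convolutional weights are therefore first learned on the 1 000-class ImageNet data so that the network acquires a general understanding of many kinds of visual objects. The network is then fine-tuned on the KTH action training set: the parameters of the last few layers are retrained, and the whole network is then trained with a small learning rate so that it also learns action features well.

Finally, the sequence of feature vectors is fed to the LSTM model. A forward pass is used in computing the derivatives of the objective function with respect to the weights, and the network is trained by gradient descent using backpropagation through time (BPTT) and real-time recurrent learning (RTRL), so as to maximize the product, over all training samples, of the probabilities of the action sequences given their feature sequences.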
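The size of this prediction vector is easy to check numerically. The sketch below uses the values $S$=7, $B$=2, $C$=20 that the training setup in Section 3.2 reports; the per-cell memory layout (box parameters first, class scores last) is an assumption made only for illustration and may differ from the layout used in a given YOLO implementation.

```python
import numpy as np

def yolo_output_size(S, B, C):
    """Length of the flat prediction: S*S cells, each with B boxes of 5 values plus C class scores."""
    return S * S * (B * 5 + C)

def split_prediction(flat, S, B, C):
    """Split a flat prediction into box parameters and class scores.

    Assumed layout per cell: B * 5 box values (x, y, w, h, c), then C class scores.
    """
    grid = np.asarray(flat).reshape(S, S, B * 5 + C)
    boxes = grid[..., : B * 5].reshape(S, S, B, 5)
    class_scores = grid[..., B * 5 :]
    return boxes, class_scores

S, B, C = 7, 2, 20
n = yolo_output_size(S, B, C)          # 7 * 7 * (2 * 5 + 20) = 1470
boxes, class_scores = split_prediction(np.zeros(n), S, B, C)
n_windows = S * S * B                  # 98 candidate windows before thresholding and NMS
```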

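2.3.2 Recognition process

The recognition of a single action by LC-YOLO is shown in Fig. 4 and can be described as follows:

Fig. 4 Single action recognition process

1) When a specific action frame occurs, YOLO first extracts the position and confidence information $(x, y, w, h, c)$. At 45 frame/s it can detect in real time on surveillance video, and after training on a large amount of data its action detection accuracy can exceed 90%.

2) Based on the detection, the image content within the target region is retained and the noise from the remaining background is removed, so that complete and accurate target features are extracted; Fig. 4 shows a visualization of the convolutional features. A VGGNet-16 model extracts a 4 096-dimensional deep feature vector, which is passed to the recognition module together with the target size and position information $(x, y, w, h, c)$ predicted by YOLO.

3) LSTM units serve as the recognition module. Unlike a standard RNN, the LSTM architecture uses memory cells to store and output information, which lets it better discover the temporal relationship among multiple target actions. It finally outputs the action category of the whole action sequence.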
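Step 2 can be sketched as cropping the detected region and building one descriptor per frame. The paper does not say how the 4 096-dimensional feature and $(x, y, w, h, c)$ are combined, so plain concatenation into a 4 101-dimensional vector is an assumption here. `fake_vgg_fc` is a hypothetical stand-in for the VGGNet-16 feature extractor, and pixel units for the box are also assumed.

```python
import numpy as np

def crop_target(frame, x, y, w, h):
    """Crop the box with center (x, y) and size (w, h), in pixels, clipped to the frame."""
    H, W = frame.shape[:2]
    x1, x2 = int(max(0, round(x - w / 2))), int(min(W, round(x + w / 2)))
    y1, y2 = int(max(0, round(y - h / 2))), int(min(H, round(y + h / 2)))
    return frame[y1:y2, x1:x2]

def frame_descriptor(frame, detection, extract_features):
    """Build the per-frame LSTM input from one detection (x, y, w, h, c).

    Concatenating the CNN feature with the box parameters is an assumption.
    """
    x, y, w, h, c = detection
    patch = crop_target(frame, x, y, w, h)       # background outside the box is discarded
    feat = extract_features(patch)               # expected shape: (4096,)
    return np.concatenate([feat, np.array([x, y, w, h, c], dtype=feat.dtype)])

def fake_vgg_fc(patch):
    """Hypothetical placeholder for a VGGNet-16 fully connected feature; returns zeros."""
    return np.zeros(4096, dtype=np.float32)

frame = np.zeros((240, 320, 3), dtype=np.uint8)  # one 320x240 frame
det = (160.0, 120.0, 100.0, 200.0, 0.9)          # illustrative detection
patch = crop_target(frame, *det[:4])
desc = frame_descriptor(frame, det, fake_vgg_fc)
```

Stacking one such descriptor per frame gives the sequence that the LSTM in step 3 consumes.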

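3 Experimental results and analysis

The experiments were run on a Dell PowerEdge R430 server with the following configuration. Operating system: Ubuntu 14.04; CPU: Intel(R) Core i3 3220; memory: 64 GB; GPU: NVIDIA Tesla K40m×2; GPU memory: 12 GB×2.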

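3.1 Datasets

The training samples come from the Microsoft MSR Action Dataset^[16] (MSR for short) and the public action recognition dataset KTH^[15], as shown in Fig. 5.

Fig. 5 Sample frames from the datasets ((a) MSR; (b) KTH)

1) The MSR dataset contains 16 video sequences with 63 actions in total: 14 hand clapping, 24 hand waving, and 25 boxing, performed by 10 subjects. Each sequence contains several types of action, some sequences contain actions performed by different people, and there are both indoor and outdoor scenes. All sequences were captured against cluttered and moving backgrounds. Each video is 320×240 pixels at 15 frame/s, with a length between 32 s and 76 s.

2) The KTH dataset contains 6 human actions: jogging, walking, running, boxing, hand waving, and hand clapping, each performed by 25 different people in 4 scenarios, for a total of 2 391 video sequences. All sequences are down-sampled to a resolution of 160×120 pixels, with an average length of 4 s. The data are split into one training set (8 people), one validation set (8 people), and one test set (9 people).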

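3.2 Training strategy

The whole training process of LC-YOLO is carried out in the Caffe framework. The detection model follows the YOLO architecture with $S$=7, $B$=2, $C$=20. Feature extraction uses a VGGNet model pre-trained on ImageNet and fine-tuned on the action dataset, with a batch size of 256 and a momentum of 0.9. After fine-tuning, the features of the different layers are cached for later use.

The extracted convolutional features are then pooled. The fully connected layer yields a fixed-size 4 096-dimensional vector, which is fed to the LSTM units, and a Softmax classifier connected after the LSTM is trained to classify the actions; its output size is the number of action categories. The recurrent network is trained by gradient descent with backpropagation through time (BPTT) and real-time recurrent learning (RTRL), with a batch size of 64, a momentum of 0.9, and a base learning rate of 0.01. The learning rate is multiplied by 0.1 every 20 000 iterations, and the model begins to converge after 50 000 training iterations.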
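The learning-rate schedule described above is a step decay. As a minimal sketch, with `step_lr` being an illustrative name rather than anything from the paper, it can be written as follows; it mirrors the form of Caffe's `step` learning-rate policy, base_lr × gamma^floor(iter / stepsize).

```python
def step_lr(iteration, base_lr=0.01, gamma=0.1, step_size=20000):
    """Learning rate after `iteration` steps: multiplied by gamma every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)

# Learning rate at a few points of a 50 000-iteration run.
schedule = {it: step_lr(it) for it in (0, 19999, 20000, 40000, 50000)}
```

Under this schedule the rate is 0.01 for the first 20 000 iterations, 0.001 for the next 20 000, and 0.0001 from iteration 40 000 onward, so the reported convergence at around 50 000 iterations falls in the third stage.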

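3.3 Experimental results

The effect of YOLO detection was illustrated with the recognition process in Section 2.3.2. Fig. 6 shows the YOLO detection accuracy for the 6 actions in the KTH dataset.

Fig. 6 Detection accuracy for each action in the KTH dataset

Fig. 6 shows that the YOLO detection accuracy lies between 85% and 95% for all actions, corresponding to a false detection rate of 5% to 15%. For a few actions, such as boxing and hand waving, the motion is large and distinctive, so the false detection rate is below 10% and the detection accuracy approaches the final recognition accuracy. YOLO action detection therefore stays at a high level, and the subsequent LSTM modeling of the time series can further rule out false detections.

Fig. 7 gives the confusion matrices of the proposed method on the two datasets; rows are the true classes and columns are the classes assigned by the algorithm. The confusion among actions is low on both datasets, and boxing and hand waving are rarely confused with other actions. Because some actions are similar and the different scenes introduce some interference, slight confusion remains between hand waving and hand clapping, and among walking, jogging, and running. The method localizes precisely on the region of interest around the human body detected by YOLO, which allows the action to be pre-determined and removes superfluous background noise.

Fig. 7 Confusion matrices for action recognition on the two datasets ((a) MSR; (b) KTH)

Tables 1 and 2 compare the recognition rates of the proposed algorithm with algorithms from the literature on the MSR and KTH datasets, respectively. Table 1 shows that, for the 3 actions in the MSR dataset, the recognition rates of the proposed algorithm for boxing and hand clapping are higher than those of the other 4 algorithms, by 0.75 and 1.17 percentage points over the best of them. Table 2 shows that, for the 6 actions in the KTH dataset, boxing has the highest recognition rate, while jogging has the lowest because its range of motion is small and it is easily confused with running and walking. Compared with the other methods, boxing, running, and jogging improve by 0.42, 1.96, and 1.25 percentage points, respectively, and the recognition rates of the other 3 actions are roughly on par with the methods in the literature.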

Table 1 Comparisons with other algorithms in the MSR dataset

Algorithm	Boxing/%	Hand waving/%	Hand clapping/%	Mean time/ms
Ref. [17]	98.50	97.25	97.33	500
Ref. [18]	98.00	98.75	95.00	650
Ref. [19]	95.75	97.33	94.25	420
Ref. [20]	96.33	93.50	96.75	550
Ours	99.25	97.50	98.50	210
Note: bold type indicates the best result.

Table 2 Comparisons with other algorithms in the KTH dataset

Algorithm	Boxing/%	Hand waving/%	Hand clapping/%	Running/%	Walking/%	Jogging/%	Mean time/ms
Ref. [17]	98.33	98.82	98.25	92.69	93.50	81.33	550
Ref. [18]	98.26	98.50	94.55	95.54	94.33	91.25	630
Ref. [19]	93.30	98.62	93.33	92.33	98.50	85.50	460
Ref. [20]	95.76	91.56	95.50	94.35	99.50	89.67	570
Ours	98.75	98.33	96.50	97.50	96.25	92.50	220
Note: bold type indicates the best result.
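The figures quoted in the text can be cross-checked against the two tables with a few lines of Python. The numbers below are copied from Tables 1 and 2; the helper names are illustrative. The check computes the per-row mean recognition rate and, for each action, the margin of the proposed method over the best competing method.

```python
# Recognition rates (%) per action and mean time (ms), copied from Tables 1 and 2.
MSR = {  # columns: boxing, hand waving, hand clapping
    "[17]": ([98.50, 97.25, 97.33], 500),
    "[18]": ([98.00, 98.75, 95.00], 650),
    "[19]": ([95.75, 97.33, 94.25], 420),
    "[20]": ([96.33, 93.50, 96.75], 550),
    "ours": ([99.25, 97.50, 98.50], 210),
}
KTH = {  # columns: boxing, hand waving, hand clapping, running, walking, jogging
    "[17]": ([98.33, 98.82, 98.25, 92.69, 93.50, 81.33], 550),
    "[18]": ([98.26, 98.50, 94.55, 95.54, 94.33, 91.25], 630),
    "[19]": ([93.30, 98.62, 93.33, 92.33, 98.50, 85.50], 460),
    "[20]": ([95.76, 91.56, 95.50, 94.35, 99.50, 89.67], 570),
    "ours": ([98.75, 98.33, 96.50, 97.50, 96.25, 92.50], 220),
}

def mean_rate(table, name):
    """Mean recognition rate over all actions for one algorithm."""
    rates, _ = table[name]
    return sum(rates) / len(rates)

def margin_over_best_other(table, column):
    """Rate of 'ours' minus the best competing rate for one action, in percentage points."""
    ours = table["ours"][0][column]
    best_other = max(rates[column] for name, (rates, _) in table.items() if name != "ours")
    return round(ours - best_other, 2)

kth_mean = mean_rate(KTH, "ours")
msr_margins = [margin_over_best_other(MSR, j) for j in range(3)]
kth_margins = [margin_over_best_other(KTH, j) for j in range(6)]
mean_time = (MSR["ours"][1] + KTH["ours"][1]) / 2
```

The mean of the proposed method's KTH row is about 96.64%, which is consistent with the 96.6% average quoted in the abstract and conclusion, and the mean of the two timings is 215 ms; how the authors actually aggregated these numbers is not stated. The positive margins reproduce the gains quoted in the text, while a negative margin marks an action on which another method scores higher.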

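The average recognition time of LC-YOLO is 210 ms on MSR and 220 ms on KTH, markedly shorter than that of the other algorithms. Real-time target and action detection not only speeds up recognition as a whole but also makes the training of the network parameters more effective, ultimately yielding a deep learning model with better performance and higher recognition accuracy. This demonstrates the effectiveness of the proposed algorithm.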

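4 Conclusion

This paper proposed LC-YOLO, a human action recognition algorithm based on YOLO and combined with LSTM and a CNN. It exploits the speed and real-time performance of YOLO target detection to detect specific actions in intelligent video surveillance, removes noise from unrelated image regions, and uses LSTM to model long time series. Actions in surveillance video can thus be detected and recognized quickly, with reduced computational and time cost. In experiments on the public action recognition datasets KTH and MSR, the average recognition rate over all actions reached 96.6%, which shows the effectiveness of the method. It can be applied in security settings such as intelligent monitoring, where real-time requirements are strict and scenes are complex.

Because the network structure of LC-YOLO is complex and several models must be trained (detection, feature extraction, and LSTM), training is time-consuming and difficult. Optimizing the training strategy and adapting the model to more application scenarios are the directions of future work.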

References

[1] Stavropoulos G, Giakoumis D, Moustakas K, et al. Automatic action recognition for assistive robots to support MCI patients at home[C]//Proceedings of the 10th International Conference on Pervasive Technologies Related To Assistive Environments. Island of Rhodes, Greece: ACM, 2017: 366-371.[DOI: 10.1145/3056540.3076185]

[2] Ng J Y H, Hausknecht M, Vijayanarasimhan S, et al. Beyond short snippets: Deep networks for video classification[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE, 2015: 4694-4702.[DOI: 10.1109/CVPR.2015.7299101]

[3] Ullah A, Ahmad J, Muhammad K, et al. Action recognition in video sequences using deep Bi-directional LSTM with CNN features[J]. IEEE Access, 2017: 6. [DOI:10.1109/ACCESS.2017.2778011]

[4] Donahue J, Hendricks L A, Rohrbach M, et al. Long-term recurrent convolutional networks for visual recognition and description[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(4): 677–691. [DOI:10.1109/TPAMI.2016.2599174]

[5] Yu G, Li T. Recognition of human continuous action with 3D CNN[C]//Proceedings of the 11th International Conference on Computer Vision Systems. Shenzhen, China: Springer, Cham, 2017: 314-322.[DOI: 10.1007/978-3-319-68345-4_28]

[6] Mahjoub A B, Atri M. Human action recognition using RGB data[C]//The 11th International Design & Test Symposium. Hammamet, Tunisia: IEEE, 2016: 83-87.[DOI: 10.1109/IDT.2016.7843019]

[7] Li W Q, Zhang Z Y, Liu Z C. Action recognition based on a bag of 3D points[C]//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. San Francisco, CA, USA: IEEE, 2010: 9-14.[DOI: 10.1109/CVPRW.2010.5543273]

[8] Chéron G, Laptev I, Schmid C. P-CNN: pose-based CNN features for action recognition[C]//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015: 3218-3226.[DOI: 10.1109/ICCV.2015.368]

[9] Fan H, Xu J, Deng Y, et al. Behavior recognition of human based on deep learning[J]. Geomatics and Information Science of Wuhan University, 2016, 41(4): 492–497. [樊恒, 徐俊, 邓勇, 等. 基于深度学习的人体行为识别[J]. 武汉大学学报:信息科学版, 2016, 41(4): 492–497. ] [DOI:10.13203/j.whugis20140110]

[10] Tu Z G, Cao J, Li Y K, et al. MSR-CNN: Applying motion salient region based descriptors for action recognition[C]//Proceedings of the 23rd International Conference on Pattern Recognition. Cancun, Mexico: IEEE, 2016: 3524-3529.[DOI: 10.1109/ICPR.2016.7900180]

[11] Karpathy A, Toderici G, Shetty S, et al. Large-scale video classification with convolutional neural networks[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014: 1725-1732.[DOI: 10.1109/CVPR.2014.223]

[12] Zhou L, Nagahashi H. Real-time action recognition based on key frame detection[C]//Proceedings of the 9th International Conference on Machine Learning and Computing. Singapore: ACM, 2017: 272-277.[DOI: 10.1145/3055635.3056569]

[13] Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection[C]//Proceedings of 2016 IEEE Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 779-788.[DOI: 10.1109/CVPR.2016.91]

[14] Greff K, Srivastava R K, Koutník J, et al. LSTM:a search space odyssey[J]. IEEE Transactions on Neural Networks and Learning Systems, 2017, 28(10): 2222–2232. [DOI:10.1109/TNNLS.2016.2582924]

[15] Schuldt C, Laptev I, Caputo B. Recognizing human actions: a local SVM approach[C]//Proceedings of the 17th International Conference on Pattern Recognition. Cambridge, UK: IEEE, 2004: 32-36.[DOI: 10.1109/ICPR.2004.1334462]

[16] Li J J, Mao X, Wu X Y, et al. Human action recognition based on tensor shape descriptor[J]. IET Computer Vision, 2016, 10(8): 905–911. [DOI:10.1049/iet-cvi.2016.0048]

[17] Ijjina E P, Chalavadi K M. Human action recognition in RGB-D videos using motion sequence information and deep learning[J]. Pattern Recognition, 2017, 72: 504–516. [DOI:10.1016/j.patcog.2017.07.013]

[18] Liu J, Wang G, Duan L Y, et al. Skeleton-based human action recognition with global context-aware attention LSTM networks[J]. IEEE Transactions on Image Processing, 2018, 27(4): 1586–1599. [DOI:10.1109/TIP.2017.2785279]

[19] Megrhi S, Jmal M, Souidene W, et al. Spatio-temporal action localization and detection for human action recognition in big dataset[J]. Journal of Visual Communication and Image Representation, 2016, 41: 375–390. [DOI:10.1016/j.jvcir.2016.10.016]

[20] Sargano A B, Wang X F, Angelov P, et al. Human action recognition using transfer learning with deep representations[C]//Proceedings of 2017 International Joint Conference on Neural Networks. Anchorage, AK, USA: IEEE, 2017: 463-469.[DOI: 10.1109/IJCNN.2017.7965890]