Video facial expression recognition combined with sliding window dynamic time warping and CNN

Hu Min1, Zhang Keke1, Wang Xiaohua1, Ren Fuji1,2(1.Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, School of Computer and Information, Hefei University of Technology, Hefei 230009, China;2.Graduate School of Advanced Technology & Science, University of Tokushima, Tokushima 7708502, Japan)

Abstract
Objective Facial expression is an effective means of communicating inner feelings and intentions. With the rapid development of artificial intelligence, facial expression recognition has become a critical part of human-computer interaction, and research on it bears important theoretical significance and practical application value. A video sequence contains more emotional information than a static expression image, and because expressing emotion is a dynamic process, facial expression recognition based on video sequences has gradually become a field of interest in computer vision: the information obtained from a single frame is not as rich as that from a sequence, and its accuracy is lower. Two mainstream approaches to video facial expression recognition currently exist. The first analyzes expressions from the motion or movement direction of facial action units; its advantage is that it requires no selection of video frames and instead extracts dynamic features directly from the video, but the recognition process is complex and the recognition rate is low. The second identifies the expression category from facial expression images; despite its high recognition rate, it requires the original video sequence to be processed in advance. In a complete expression sequence, the frames with obvious expressions play a key role in feature extraction and recognition, but the sequence also contains neutral expressions, which may interfere with the training of the model parameters and affect the output. Noticeable expression frames therefore have to be selected manually from the original video sequence, which creates extra work and affects the accuracy of the experiment. This study proposes a modified dynamic time warping method, called sliding window dynamic time warping (SWDTW), to automatically select the frames with distinct facial expressions in a video sequence. The method reduces redundant input information, improves adaptability during the experiment, and lessens noise in feature extraction and expression recognition. Moreover, in video facial expression recognition, identification results are greatly influenced by environmental lighting, and traditional feature extraction requires excessive manual intervention. This study therefore also proposes a facial expression recognition method based on a deep convolutional neural network, a network type that combines the traditional artificial neural network with deep learning technology and has achieved considerable success in image processing. A convolutional neural network has two main characteristics: local connections between neurons and weight sharing among neurons in the same layer. These characteristics reduce the complexity of the model and the number of parameters to be trained, and the resulting structure achieves several degrees of invariance to translation, scale, and deformation.

Method First, frontal face frames are extracted from the expression video after a series of normalization steps; the histogram of oriented gradients (HOG) feature of each frame is used to compute a cost matrix, and a sliding window mechanism is added on the cost matrix. The average distances of all sliding windows are then calculated, and the globally optimal expression sequence is obtained by intercepting the sequence corresponding to the minimum average distance.
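The abstract leaves the alignment details open (in particular, what the video frames are compared against and how long the window is), so the following is only a minimal sketch of the selection step under stated assumptions: HOG features of the video frames are compared against an assumed template sequence, a fixed-length window slides over the video, and the window with the smallest average DTW alignment cost is returned. The names `video_hog`, `template_hog`, and `win_len` are illustrative, not from the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist


def dtw_distance(cost):
    """Classic DTW on a precomputed pairwise cost matrix; returns the
    accumulated alignment cost normalized by the combined sequence length."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[n, m] / (n + m)


def swdtw_select(video_hog, template_hog, win_len):
    """Slide a fixed-length window over the video frames, align each windowed
    sub-sequence to the (assumed) template with DTW, and return the frame range
    whose average alignment distance is smallest.

    video_hog    -- (T, D) array, one HOG descriptor per frame
    template_hog -- (K, D) array, HOG descriptors of a reference sequence
    win_len      -- number of frames to keep (hypothetical parameter)
    """
    num_frames = len(video_hog)
    best_start, best_dist = 0, np.inf
    for start in range(num_frames - win_len + 1):
        window = video_hog[start:start + win_len]
        cost = cdist(window, template_hog, metric="euclidean")  # pairwise HOG distances
        dist = dtw_distance(cost)
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_start + win_len  # selected expression segment
```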
Next, theoretical analysis and experimental verification are performed to determine the structure and parameters of the convolutional neural network. AlexNet is selected as the reference because it won the image classification competition of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012. In view of the characteristics of facial expression images, this study adjusts the original AlexNet to better suit facial expression recognition and to improve the network's use of overall information by removing two convolutional layers and adding a pooling layer. During convolution, the ReLU activation function replaces the traditional sigmoid and tanh activations, which increases the training speed of the model and mitigates the vanishing gradient problem; dropout is also introduced to address over-fitting. Finally, two fully connected layers classify the facial expressions. The regularized facial expression video sequence is processed by the deep convolutional neural network for unsupervised learning and facial expression classification; the classification probabilities of the images in the sequence are summed per expression category, and the category with the largest sum gives the final identification result for the video sequence.

Result Five cross-validation experiments are conducted on the CK+ and MMI databases. The method performs better than randomly selected video sequences and manual feature extraction in terms of recognition and generalization: the average recognition accuracies on CK+ and MMI are 92.54% and 74.67%, which are 19.86% and 22.24% higher, respectively, than those of randomly selected video sequences. In comparison with other methods, SWDTW also achieves better recognition performance.

Conclusion The proposed method exhibits good performance and adaptability during the preprocessing, feature extraction, and recognition stages of a facial expression system. SWDTW effectively achieves the selection of the expression sequence, and the designed convolutional neural network improves the robustness of video-based facial expression classification.
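The abstract fixes only the broad design of the network (an AlexNet-derived model with two convolutional layers removed, an extra pooling layer, ReLU activations, dropout, and two fully connected layers); the filter counts, kernel sizes, input resolution, and the seven-class output below are assumptions, not the authors' published configuration. A minimal PyTorch sketch of such a trimmed architecture:

```python
import torch
import torch.nn as nn


class ExpressionNet(nn.Module):
    """AlexNet-style network trimmed for facial expression images: three
    convolutional blocks (two fewer than AlexNet's five), each followed by
    max pooling, then dropout and two fully connected layers.
    Channel widths, 64x64 grayscale input, and num_classes=7 are assumptions."""

    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),                              # dropout against over-fitting
            nn.Linear(256 * 8 * 8, 1024), nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(1024, num_classes),                 # two fully connected layers
        )

    def forward(self, x):                                 # x: (N, 1, 64, 64) face crops
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```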
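The video-level decision described above sums the per-frame classification probabilities and takes the category with the largest total. A sketch of that aggregation step, assuming the hypothetical `ExpressionNet` sketched earlier and a tensor of SWDTW-selected frames:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def classify_video(model, frames):
    """Sum per-frame softmax probabilities over the selected sequence and
    return the arg-max class as the expression label of the whole video.

    frames -- (T, 1, 64, 64) tensor of the SWDTW-selected face frames
    """
    model.eval()
    probs = F.softmax(model(frames), dim=1)   # (T, num_classes) per-frame probabilities
    return probs.sum(dim=0).argmax().item()   # category with the largest probability sum
```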
