Video facial expression recognition combining sliding window dynamic time warping and CNN
2018, Vol. 23, No. 8, pp. 1144-1153
Received: 2017-08-15; Revised: 2018-03-07; Published in print: 2018-08-16
DOI: 10.11834/jig.170454
Objective
Compared with static expression images, video sequences contain richer emotional information. The subsequences with obvious expressions play a key role in feature extraction and recognition, but the neutral expressions that also appear in a video may interfere with the training of the model parameters and affect the final decision. To reduce the error introduced by this interference, this paper improves the dynamic time warping algorithm and proposes a sliding window dynamic time warping (SWDTW) algorithm that automatically selects the image subsequence with obvious expressions from a video. In addition, to address the strong influence of environmental illumination on face images and the excessive manual intervention in traditional feature extraction, a face video sequence processing method based on a deep convolutional neural network is constructed.
Method
First, the frontal face frames in the expression video are extracted, the cost matrix is computed from histogram of oriented gradients features, a sliding window mechanism is added on the cost matrix, and the average distance of every sliding window is calculated. The globally optimal expression subsequence is then selected as the window with the minimum average distance. Finally, a deep convolutional neural network performs unsupervised learning and facial expression classification on the warped facial expression image sequence, and the per-frame classification probabilities are summed to obtain the expression category of the video sequence.
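As an illustration of the video-level decision rule just described (summing per-frame class probabilities and taking the maximum), a minimal sketch follows; the `frame_probs` input and the example numbers are hypothetical:

```python
import numpy as np

def video_label(frame_probs):
    """Video-level decision: sum the per-frame softmax probabilities
    produced by the CNN and pick the class with the largest total.

    frame_probs: (num_frames, num_classes) array of per-frame probabilities.
    """
    return int(np.asarray(frame_probs).sum(axis=0).argmax())

# Hypothetical example: three frames, three classes; class 1 has the
# largest summed probability (1.5), so the video is labeled 1.
print(video_label([[0.2, 0.5, 0.3],
                   [0.1, 0.6, 0.3],
                   [0.4, 0.4, 0.2]]))   # -> 1
```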
Result
Five rounds of cross-validation on the CK+ and MMI databases yield average recognition rates of 92.54% and 74.67%, respectively, which are 19.86% and 22.24% higher than those obtained with randomly selected video sequences. The method also compares favorably with several state-of-the-art video expression recognition methods.
Conclusion
The proposed SWDTW not only selects expression subsequences effectively but also strengthens the robustness of the convolutional neural network in video facial expression classification, improving the adaptability and recognition rate of video facial expression analysis.
Objective
Facial expression is an effective means of communicating inner feelings and intentions. With the rapid development of artificial intelligence, facial expression recognition has become a critical part of human-computer interaction, and research on facial expression recognition technology bears important theoretical significance and practical application value. A video sequence contains more emotional information than a static expression image. In recent years, facial expression recognition based on video sequences has gradually become a field of interest in computer vision because expressing emotions is a dynamic process: the information obtained from a single frame is not as rich as that from a video sequence, and the accuracy of the former is lower than that of the latter. Two mainstream approaches to video facial expression recognition are currently available. The first analyzes facial expressions through the motion or movement direction of facial action units; its advantage is that it requires no selection of video frames, because the dynamic features of the video are extracted directly. However, the recognition process is complex, and the recognition rate is low. The second identifies the expression category from facial expression images; despite its high recognition rate, it requires processing the original video sequence in advance. In a complete expression sequence, frames with obvious expressions play a key role in feature extraction and recognition, but a video sequence also contains neutral expressions, which may interfere with the training of the model parameters and affect the output. The frames with noticeable expressions must therefore be selected manually from the original video sequence, which generates extra work and affects the accuracy of the experiment. This study proposes a modified dynamic time warping method, called sliding window dynamic time warping (SWDTW), to automatically select the frames with distinct facial expressions in a video sequence. The method reduces redundant input information, improves adaptability during experiments, and lessens noise in feature extraction and expression recognition. In video facial expression recognition, identification results are also greatly influenced by environmental lighting, and traditional feature extraction requires excessive manual intervention. This study therefore proposes a facial expression recognition method based on a deep convolutional neural network, which combines the traditional artificial neural network with deep learning technology and has achieved considerable success in image processing. A convolutional neural network has two main characteristics: local connections between neurons, and weight sharing among the neurons of the same layer. These characteristics reduce the complexity of the model and the number of parameters to be trained, and the network structure achieves a degree of invariance to translation, scale, and deformation.
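To make the parameter-saving claim concrete, a back-of-the-envelope comparison follows; the layer sizes are hypothetical and chosen only for illustration:

```python
# Back-of-the-envelope comparison for a 64x64 grayscale input
# (hypothetical sizes, chosen only to illustrate the effect of
# local connections and weight sharing).

in_h = in_w = 64          # input image size
hidden = 4096             # neurons in a fully connected hidden layer

# Fully connected: every neuron is wired to every input pixel.
fc_params = in_h * in_w * hidden        # 64 * 64 * 4096 = 16,777,216 weights

# Convolutional: 32 filters of size 5x5 shared across all positions.
k, out_channels = 5, 32
conv_params = k * k * 1 * out_channels  # 5 * 5 * 1 * 32 = 800 weights

print(fc_params, conv_params)           # 16777216 vs. 800
```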
Method
First, after a series of normalization steps, the frontal face frames are extracted from the expression video, the histogram of oriented gradients (HOG) feature of each frame is used to compute the cost matrix, and a sliding window mechanism is added on the cost matrix. Second, the average distances of all sliding windows are calculated, and the globally optimal expression subsequence is obtained by intercepting the window with the minimum average distance.
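A minimal sketch of this selection step is given below; it assumes the cost matrix stores Euclidean distances between per-frame HOG descriptors of the input video and those of a template sequence, and the HOG parameters shown are common defaults rather than the paper's settings:

```python
import numpy as np
from skimage.feature import hog

def swdtw_select(frames, template_feats, win_len):
    """Sketch of the sliding-window selection on a DTW-style cost matrix.

    frames:         list of equally sized grayscale face images (the video)
    template_feats: (M, D) HOG features of an assumed template sequence
    win_len:        number of frames to extract
    """
    # Per-frame HOG descriptors (common default parameters, not the paper's).
    feats = np.array([hog(f, orientations=9, pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2)) for f in frames])

    # Cost matrix: Euclidean distance between every video frame and
    # every template frame.
    cost = np.linalg.norm(feats[:, None, :] - template_feats[None, :, :],
                          axis=2)

    # Average cost inside each window of win_len consecutive frames.
    avg = np.array([cost[s:s + win_len].mean()
                    for s in range(len(frames) - win_len + 1)])

    # The window with the minimum average distance is the selected sequence.
    start = int(avg.argmin())
    return frames[start:start + win_len]
```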
Finally, theoretical analysis and experimental verification are performed to determine the structure and parameters of the convolutional neural network. The AlexNet network, winner of the image classification competition of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, is selected as the reference. In view of the characteristics of facial expression images, this study adjusts the original AlexNet network to better suit facial expression recognition and improves the network's use of global information by removing two convolutional layers and adding a pooling layer. In the convolutional layers, the ReLU activation function replaces the traditional sigmoid and tanh activations, which increases the training speed of the model and alleviates the vanishing gradient problem; dropout is also introduced to counter over-fitting. Finally, two fully connected layers classify the facial expressions. The warped facial expression video sequence is fed to the deep convolutional neural network for unsupervised learning and facial expression classification; the per-frame classification probabilities for each expression category are accumulated, and the final identification result of the video sequence is obtained.
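The following PyTorch-style sketch illustrates such a trimmed AlexNet variant; the channel widths, kernel sizes, input resolution (assumed 227×227 grayscale), and number of classes are assumptions for illustration, not the paper's published configuration:

```python
import torch
import torch.nn as nn

class ExpressionNet(nn.Module):
    """Illustrative AlexNet-style network trimmed as described above:
    fewer convolutional layers, an added pooling stage, ReLU activations,
    dropout, and two fully connected layers. Input is assumed to be a
    1 x 227 x 227 grayscale face image; all sizes are hypothetical."""

    def __init__(self, num_classes=7):          # 7 expression classes (assumed)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(192, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),           # the added pooling layer
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),                     # dropout against over-fitting
            nn.Linear(256 * 6 * 6, 2048), nn.ReLU(),
            nn.Linear(2048, num_classes),        # second fully connected layer
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```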
Result
Five rounds of cross-validation are conducted on the CK+ and MMI databases. The proposed method outperforms both randomly selected video sequences and manual feature extraction in recognition and generalization: the average recognition accuracies on CK+ and MMI are 92.54% and 74.67%, respectively, which are 19.86% and 22.24% higher than those of randomly selected video sequences. In comparison with other methods, SWDTW also achieves better recognition performance.
Conclusion
The proposed method exhibits good performance and adaptability in the preprocessing, feature extraction, and recognition stages of a facial expression recognition system. SWDTW effectively selects the expression subsequence, and the designed convolutional neural network improves the robustness of video-based facial expression classification.