采用Transformer网络的视频序列表情识别
Video sequence-based human facial expression recognition using Transformer networks
2022, Vol. 27, No. 10, pp. 3022-3030
Print publication date: 2022-10-16
Accepted: 2021-07-06
DOI: 10.11834/jig.210248
陈港, 张石清, 赵小明. 采用Transformer网络的视频序列表情识别[J]. 中国图象图形学报, 2022,27(10):3022-3030.
Gang Chen, Shiqing Zhang, Xiaoming Zhao. Video sequence-based human facial expression recognition using Transformer networks[J]. Journal of Image and Graphics, 2022,27(10):3022-3030.
Objective
Compared with static facial expression image recognition, the facial expression intensity varies considerably across the frames of a video sequence and many frames contain only a neutral expression, yet existing models cannot assign an appropriate weight to each frame of the sequence. To make full use of the spatio-temporal information in video sequences and of the differing contributions of individual frames to video expression recognition, this paper proposes a Transformer-based method for video sequence expression recognition.
Method
First, a video sequence is divided into short clips with a fixed number of frames, and a deep residual network learns high-level facial expression features from each frame of a clip, producing a fixed-dimensional spatial feature representation of the clip. Then, a suitably designed long short-term memory network (LSTM) and a Transformer model learn high-level temporal features and attention features, respectively, from the clip's spatial feature sequence; the two are concatenated and fed into a fully connected layer, which outputs the expression classification scores of the clip. Finally, the classification scores of all clips of a video are max-pooled to obtain the final expression classification of the video.
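The pipeline described above can be expressed as a minimal PyTorch sketch. This is an illustration only, not the authors' released code: the ResNet-50 backbone, single-layer LSTM, two-layer Transformer encoder, mean pooling of the Transformer output, layer sizes and the six expression classes are all assumptions made for the example.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class ClipExpressionModel(nn.Module):
    """Clip-level sketch: per-frame ResNet features -> LSTM and Transformer branches -> FC scores."""

    def __init__(self, num_classes=6, feat_dim=2048, hidden=512, n_heads=8, n_layers=2):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")                # deep residual network (ResNet-50 assumed)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # drop the classification head
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)     # temporal-feature branch
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=n_layers)  # frame-attention branch
        self.fc = nn.Linear(hidden + feat_dim, num_classes)         # classifier on the concatenated features

    def forward(self, clip):                          # clip: (B, T, 3, H, W), T = fixed clip length
        b, t = clip.shape[:2]
        x = self.cnn(clip.flatten(0, 1))              # (B*T, feat_dim, 1, 1) per-frame spatial features
        x = x.flatten(1).view(b, t, -1)               # (B, T, feat_dim) clip spatial feature sequence
        temporal = self.lstm(x)[0][:, -1]             # (B, hidden)   high-level temporal feature
        attention = self.transformer(x).mean(dim=1)   # (B, feat_dim) frame-attention feature
        return self.fc(torch.cat([temporal, attention], dim=1))  # (B, num_classes) clip scores
```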
Result
Experimental results on the public BAUM-1s (Bahcesehir University multimodal) and RML (Ryerson Multimedia Lab) video emotion datasets show that the proposed method achieves recognition accuracies of 60.72% and 75.44%, respectively, outperforming the compared methods.
Conclusion
The proposed method is trained in an end-to-end manner and can effectively improve the performance of video sequence expression recognition.
Objective
Facial expression is one of the key information carriers in interpersonal communication and cannot be ignored. Progress in facial expression recognition benefits human-computer interaction systems such as intelligent healthcare, interactive robots and attention monitoring. Facial expression recognition can be divided into two categories: recognition from static images and recognition from dynamic video sequences. In the current "short video" era, video carries far more facial expression information than static images. A video sequence is composed of multiple static frames whose facial expression intensities differ considerably. Video sequence-based facial expression recognition therefore has to exploit the spatial information of each frame, the temporal information across the sequence, and the importance of each frame to the expression of the whole video. Early hand-crafted features, such as Gabor representations and local binary patterns (LBP), give the trained model limited generalization ability. Deep learning has since produced a series of deep neural networks for extracting facial expression features, with the convolutional neural network (CNN) and the long short-term memory (LSTM) network as representative examples. However, the importance of each frame of a video sequence still needs to be taken into account for video expression recognition. To make full use of the spatio-temporal information in video sequences and of the differing contributions of individual frames to video expression recognition, an end-to-end CNN + LSTM + Transformer method for video sequence expression recognition is proposed.
Method
First, a video sequence is divided into short clips with a fixed number of frames, and a deep residual network learns high-level facial expression features from each frame of a clip, yielding a fixed-dimensional spatial feature sequence for the clip. Next, a suitably designed LSTM and a Transformer model learn high-level temporal features and frame attention features, respectively, from this spatial feature sequence; the two feature sets are concatenated and fed into a fully connected layer that outputs the expression classification scores of the clip. Finally, the classification scores of all clips of a video are max-pooled to obtain the final expression prediction for the video. The method thus exploits both spatial and temporal features, and the Transformer-derived frame attention accounts for the differing facial expression intensities of individual frames, which improves the recognition rate. In addition, the model is trained end-to-end with the cross-entropy loss function, which helps it learn more effective facial expression features. For training the CNN + LSTM + Transformer model, the batch size is set to 4, the learning rate to 5×10⁻⁵, and the maximum number of epochs to 80.
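A hedged sketch of this end-to-end training setup follows, using the hyper-parameters stated above (batch size 4, learning rate 5×10⁻⁵, at most 80 epochs) and the cross-entropy loss. The optimizer choice, clip length, input resolution and placeholder data are assumptions, and ClipExpressionModel refers to the illustrative model sketched earlier, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors only so the sketch runs as-is; real training uses frame
# sequences cropped from the BAUM-1s / RML videos.
dummy_clips = torch.randn(8, 16, 3, 224, 224)        # 8 clips of 16 frames (clip length is an assumption)
dummy_labels = torch.randint(0, 6, (8,))
train_loader = DataLoader(TensorDataset(dummy_clips, dummy_labels), batch_size=4, shuffle=True)

model = ClipExpressionModel(num_classes=6)            # illustrative clip model from the earlier sketch
criterion = nn.CrossEntropyLoss()                     # cross-entropy loss, as stated in the text
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)  # optimizer choice is an assumption

for epoch in range(80):                               # maximum of 80 epochs, as stated
    for clips, labels in train_loader:                # batch size of 4, as stated
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)        # clip-level expression classification loss
        loss.backward()
        optimizer.step()                              # end-to-end update of CNN + LSTM + Transformer
```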
Result
The frame attention features learned by the Transformer model prove more important than the temporal features learned by the LSTM. By combining the CNN + LSTM + Transformer model, the method achieves recognition accuracies of 60.72% and 75.44% on the BAUM-1s (Bahcesehir University multimodal) and RML (Ryerson Multimedia Lab) datasets, respectively. This indicates that the three kinds of features learned by the CNN, the LSTM and the Transformer are complementary to a certain degree, and that combining them effectively improves video expression recognition. Furthermore, these accuracies outperform those of the compared methods on both datasets.
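The video-level accuracies reported above are obtained by max-pooling the clip-level scores, as described in the Method section. A minimal sketch of that aggregation step is given below, reusing the illustrative ClipExpressionModel from the earlier sketch; the function name and interface are assumptions.

```python
import torch


def predict_video(model, clips):
    """Video-level prediction by max pooling clip-level expression scores.

    clips: (N, T, 3, H, W) tensor holding the N fixed-length clips of one video.
    """
    model.eval()
    with torch.no_grad():
        clip_scores = model(clips)                    # (N, num_classes) clip-level scores
        video_scores = clip_scores.max(dim=0).values  # max pooling over the video's clips
    return video_scores.argmax().item()               # predicted expression class index
```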
Conclusion
This work develops an end-to-end CNN + LSTM + Transformer method for video sequence-based expression recognition. It integrates the CNN, LSTM and Transformer models to learn high-level spatial features, temporal features and video frame attention features. Experimental results on the BAUM-1s and RML datasets demonstrate that the proposed method effectively improves the performance of video sequence-based expression recognition.
视频序列; 人脸表情识别; 时空维度; 深度残差网络; 长短时记忆网络(LSTM); 端到端; Transformer
video sequence; facial expression recognition; spatial-temporal dimension; deep residual network; long short-term memory network (LSTM); end-to-end; Transformer
Ahad M A R, Tan J K, Kim H and Ishikawa S. 2012. Motion history image: its variants and applications. Machine Vision and Applications, 23(2): 255-281 [DOI: 10.1007/s00138-010-0298-4]
Anandan P. 1989. A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision, 2(3): 283-310 [DOI: 10.1007/BF00158167]
Bahdanau D, Cho K and Bengio Y. 2016. Neural machine translation by jointly learning to align and translate [EB/OL]. [2021-05-21]. https://arxiv.org/pdf/1409.0473.pdf
Brown T B, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal A, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D M, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I and Amodei D. 2020. Language models are few-shot learners [EB/OL]. [2020-07-22]. https://arxiv.org/pdf/2005.14165.pdf
Cornejo J Y R and Pedrini H. 2019. Audio-visual emotion recognition using a hybrid deep convolutional neural network based on census transform//Proceedings of 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC). Bari, Italy: IEEE: 3396-3402 [DOI: 10.1109/SMC.2019.8914193]
De Silva L C and Ng P C. 2000. Bimodal emotion recognition//Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition. Grenoble, France: IEEE: 332-335 [DOI: 10.1109/AFGR.2000.840655]
Deng J, Dong W, Socher R, Li L J, Li K and Li F F. 2009. ImageNet: a large-scale hierarchical image database//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, USA: IEEE: 248-255 [DOI: 10.1109/CVPR.2009.5206848]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16×16 words: transformers for image recognition at scale [EB/OL]. [2021-06-03]. https://arxiv.org/pdf/2010.11929.pdf
Fan Y, Lu X J, Li D and Liu Y L. 2016. Video-based emotion recognition using CNN-RNN and C3D hybrid networks//Proceedings of the 18th ACM International Conference on Multimodal Interaction. Tokyo, Japan: Association for Computing Machinery: 445-450 [DOI: 10.1145/2993148.2997632]
Fu X F, Fu X J, Li J J and Yu Z S. 2015. Facial expression recognition using multi-scale spatiotemporal local orientational pattern histogram projection in video sequences. Journal of Computer-Aided Design and Computer Graphics, 27(6): 1060-1066
付晓峰, 付晓鹃, 李建军, 余正生. 2015. 视频序列中基于多尺度时空局部方向角模式直方图映射的表情识别. 计算机辅助设计与图形学学报, 27(6): 1060-1066
Hu M, Wang H W, Wang X H, Yang J and Wang R G. 2019. Video facial emotion recognition based on local enhanced motion history image and CNN-CTSLSTM networks. Journal of Visual Communication and Image Representation, 59: 176-185 [DOI: 10.1016/j.jvcir.2018.12.039]
Kansizoglou I, Bampis L and Gasteratos A. 2022. An active learning paradigm for online audio-visual emotion recognition. IEEE Transactions on Affective Computing, 13(2): 756-768 [DOI: 10.1109/TAFFC.2019.2961089]
Li J, Jin K, Zhou D L, Kubota N and Ju Z J. 2020. Attention mechanism-based CNN for facial expression recognition. Neurocomputing, 411: 340-350 [DOI: 10.1016/j.neucom.2020.06.014]
Li S and Deng W H. 2020. Deep facial expression recognition: a survey. Journal of Image and Graphics, 25(11): 20-34
李珊, 邓伟洪. 2020. 深度人脸表情识别研究进展. 中国图象图形学报, 25(11): 20-34 [DOI: 10.11834/jig.200233]
Li Y, Zeng J B, Shan S G and Chen X L. 2019. Occlusion aware facial expression recognition using CNN with attention mechanism. IEEE Transactions on Image Processing, 28(5): 2439-2450 [DOI: 10.1109/TIP.2018.2886767]
Li Y D, Hao Z B and Lei H. 2016. Survey of convolutional neural network. Journal of Computer Applications, 36(9): 2508-2515, 2565
李彦冬, 郝宗波, 雷航. 2016. 卷积神经网络研究综述. 计算机应用, 36(9): 2508-2515, 2565 [DOI:10.11772/j.issn.1001-9081.2016.09.2508]
Liu S S, Tian Y T and Wan C. 2011. Facial expression recognition method based on gabor multi-orientation features fusion and block histogram. Acta Automatica Sinica, 37(12): 1455-1463
刘帅师, 田彦涛, 万川. 2011. 基于Gabor多方向特征融合与分块直方图的人脸表情识别方法. 自动化学报, 37(12): 1455-1463 [DOI: 10.3724/SP.J.1004.2011.01455]
Ma Y X, Hao Y X, Chen M, Chen J C, Lu P and Košir A. 2019. Audio-visual emotion fusion (AVEF): a deep efficient weighted approach. Information Fusion, 46: 184-192 [DOI: 10.1016/j.inffus.2018.06.003]
Sun B, Wei Q L, Li L D, Xu Q H, He J and Yu L J. 2016. LSTM for dynamic emotion and group emotion recognition in the wild//Proceedings of the 18th ACM International Conference on Multimodal Interaction. Tokyo, Japan: Association for Computing Machinery: 451-457 [DOI: 10.1145/2993148.2997640]
Wang S M, Shuai H and Liu Q S. 2020. Facial expression recognition based on deep facial landmark features. Journal of Image and Graphics, 25(4): 813-823
王善敏, 帅惠, 刘青山. 2020. 关键点深度特征驱动人脸表情识别. 中国图象图形学报, 25(4): 813-823 [DOI: 10.11834/jig.190331]
Wang X H, Pan L J, Peng M Z, Hu M, Jin C H and Ren F J. 2020. Video emotion recognition based on hierarchical attention model. Journal of Computer-Aided Design and Computer Graphics, 32(1): 27-35
王晓华, 潘丽娟, 彭穆子, 胡敏, 金春花, 任福继. 2020. 基于层级注意力模型的视频序列表情识别. 计算机辅助设计与图形学学报, 32(1): 27-35 [DOI: 10.3724/SP.J.1089.2020.17719]
Wang Y J, Guan L and Venetsanopoulos A N. 2012. Kernel cross-modal factor analysis for information fusion with application to bimodal emotion recognition. IEEE Transactions on Multimedia, 14(3): 597-607 [DOI: 10.1109/TMM.2012.2189550]
Zhalehpour S, Onder O, Akhtar Z and Erdem C E. 2017. BAUM-1: a spontaneous audio-visual face database of affective and mental states. IEEE Transactions on Affective Computing, 8(3): 300-313 [DOI: 10.1109/TAFFC.2016.2553038]
Zhang K P, Zhang Z P, Li Z F and Qiao Y. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10): 1499-1503 [DOI: 10.1109/LSP.2016.2603342]
Zhang S Q, Zhang S L, Huang T J, Gao W and Tian Q. 2018. Learning affective features with a hybrid deep model for audio-visual emotion recognition. IEEE Transactions on Circuits and Systems for Video Technology, 28(10): 3030-3043 [DOI: 10.1109/TCSVT.2017.2719043]