Human facial expression recognition based on fused spatio-temporal features
2022, Vol. 27, No. 7: 2185-2198
Received: 2020-12-29; Revised: 2021-03-22; Accepted: 2021-03-29; Published in print: 2022-07-16
DOI: 10.11834/jig.200782
Objective
Human facial expression recognition is one of the core problems of computer vision. On the one hand, the generation of an expression corresponds to a continuous dynamic process of facial muscle movement; on the other hand, the peak-expression frame within that process usually contains the complete information needed to recognize the expression. Most existing facial expression recognition algorithms are based either on expression video sequences or on a single peak-expression image. We therefore propose a deep neural network that fuses temporal and spatial features to analyze and understand the expression information in video sequences and thereby improve recognition performance.
Method
The network contains two feature extraction modules that learn, respectively, the static "spatial features" of the expression from a single peak-expression image and its dynamic "temporal features" from the video sequence. First, we propose a triplet-based deep metric fusion technique: by adopting different thresholds in the triplet loss function, multiple distinct expression feature representations are learned from a single peak-expression image and combined into a robust and more discriminative "spatial feature". Second, to make effective use of prior knowledge about key facial components and accurately extract the motion of facial expressions in the time domain, we propose a convolutional neural network based on facial landmark trajectories, which learns the dynamic "temporal features" of the expression by analyzing the landmark trajectories in the video sequence. Finally, we propose a fine-tuning fusion strategy that achieves the optimal fusion of the temporal and spatial features.
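For reference, the branches here build on the standard triplet loss (as in FaceNet): for an anchor sample $x^a$, a positive sample $x^p$ of the same expression class, a negative sample $x^n$ of a different class, an embedding $f$, and a margin threshold $\alpha$,

$$L_{\text{triplet}} = \max\left(\lVert f(x^a) - f(x^p) \rVert_2^2 - \lVert f(x^a) - f(x^n) \rVert_2^2 + \alpha,\ 0\right).$$

Intuitively, training branches with several different values of $\alpha$, as described above, yields embeddings that enforce inter-class separation at different scales.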
Result
On three widely used video-based facial expression datasets, CK+ (the extended Cohn-Kanade dataset), MMI (the MMI facial expression database), and Oulu-CASIA (the Oulu-CASIA NIR&VIS facial expression database), the method achieves recognition accuracies of 98.46%, 82.96%, and 87.12%, respectively, approaching or surpassing the best performance of comparable expression recognition methods.
Conclusion
The proposed network, which fuses spatio-temporal features, robustly analyzes and understands both the spatial and the temporal expression information in video sequences and effectively improves facial expression recognition performance.
Objective
Human facial expression recognition (FER) is one of the core problems of computer vision, with applications such as human-computer interaction, medical care, and intelligent driving. FER research faces two main challenges: expression feature extraction and classification. Traditional methods design facial expression features by hand, whereas deep learning based methods learn semantic expression features automatically; by integrating feature extraction and expression classification into a single training process, deep learning based FER currently offers strong generalization ability and good recognition accuracy. Most existing FER algorithms are based either on expression video sequences or on a single peak-expression image. However, the generation of an expression corresponds to a continuous dynamic process of facial muscle movement, and the peak frame of that motion usually contains the complete information needed to identify the expression. We therefore propose a spatio-temporal feature based deep neural network that analyzes and understands the expression information in video sequences to improve recognition performance.
Method
Our network learns the static "spatial features" of the expression from the peak frame and its dynamic "temporal features" from the video sequence. First, we present a deep metric fusion (DMF) sub-network based on triplet loss learning. It is composed of two modules: a deep convolutional neural network (DCNN) module and an N-metric module. The DCNN module is a general convolutional neural network (CNN) that extracts common detailed facial features; we adopt the VGG16 (Visual Geometry Group 16-layer net)-face network structure and take the output of its final 4 096-dimensional fully connected layer as the base CNN feature. The N-metric module contains multiple fully connected layer branches that all share this CNN feature: the output of the final fully connected layer of the DCNN module is the input of every branch. Each branch is a fully connected layer of fixed dimension associated with a different margin threshold and is supervised by its own triplet loss function, so that the branches learn feature embeddings representing different expression semantics. The branch outputs are concatenated and fused through two further fully connected layers, each with 256 hidden units, yielding a more robust and discriminative spatial feature representation.
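To make the idea concrete, the following PyTorch-style sketch shows one way such an N-metric module could be wired; the branch count, margin values, and layer sizes are illustrative assumptions, not the authors' exact configuration.

# A minimal sketch (assumed configuration): several embedding branches share
# one CNN feature, and each branch is trained with a triplet loss that uses
# a different margin threshold.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NMetricModule(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=256, margins=(0.2, 0.5, 1.0)):
        super().__init__()
        self.margins = margins
        # One fully connected branch per margin threshold.
        self.branches = nn.ModuleList(
            nn.Linear(feat_dim, embed_dim) for _ in margins)
        # Two fully connected fusion layers over the concatenated branch outputs.
        self.fuse = nn.Sequential(
            nn.Linear(embed_dim * len(margins), 256), nn.ReLU(),
            nn.Linear(256, 256))

    def forward(self, cnn_feat):
        # Per-branch normalized embeddings plus the fused spatial feature.
        embeds = [F.normalize(b(cnn_feat), dim=1) for b in self.branches]
        return embeds, self.fuse(torch.cat(embeds, dim=1))

def multi_margin_triplet_loss(embeds, margins, a, p, n):
    # Sum of per-branch triplet losses; a/p/n index anchors, positives, negatives.
    return sum(F.triplet_margin_loss(e[a], e[p], e[n], margin=m)
               for e, m in zip(embeds, margins))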
Second, facial expression changes are in essence motion: local changes integrate into the overall change of the facial expression. Existing methods extract dynamic expression features from consecutive frames in the time domain either by manual design or by deep learning. Hand-crafted features, however, are limited in capturing the temporal characteristics of facial image sequences, and image-sequence based deep neural networks make insufficient use of prior knowledge about the key components of the face when learning temporal expression features. Our landmark trajectory CNN (LTCNN) sub-network therefore analyzes the facial landmark trajectories in the video sequence and learns the dynamic "temporal features" of the expression sequence, extracting accurate motion characteristics of facial expressions in the time domain. It consists of four convolutional layers and two fully connected layers, and its input is an image-like feature map constructed from the trajectories of the facial landmarks in the video.
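As a rough illustration of this sub-network under assumed shapes (the landmark count, sequence length, and channel widths below are not taken from the paper), the trajectory feature map and the four-convolution, two-fully-connected network could be assembled as follows.

# A hypothetical sketch of the LTCNN idea: stack per-frame facial landmark
# coordinates into an image-like map and feed it to a small CNN
# (4 conv layers + 2 fully connected layers). All shapes are assumptions.
import torch
import torch.nn as nn

def trajectory_feature_map(landmarks):
    # landmarks: (T frames, K landmarks, 2) -> map of shape (2, T, K),
    # treating the x/y coordinates as two channels of a T-by-K "image".
    return landmarks.permute(2, 0, 1)

class LTCNN(nn.Module):
    def __init__(self, num_classes=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Sequential(
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, x):  # x: (B, 2, T, K)
        return self.fc(self.conv(x).flatten(1))

# Example: 16 frames of 68 landmarks -> one temporal feature map.
fmap = trajectory_feature_map(torch.randn(16, 68, 2)).unsqueeze(0)
logits = LTCNN()(fmap)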
Third, a fine-tuning based fusion strategy combines the features learned by the two sub-network modules to achieve the optimal fusion of temporal and spatial features. We train the DMF and LTCNN sub-networks separately, combine them through feature fusion, and then fine-tune the combined network end to end, reusing the hyper-parameters from the DMF sub-network optimization for the fine-tuning stage.
Result
Our FER algorithm is tested and verified on three public facial expression databases: the extended Cohn-Kanade dataset (CK+), the MMI facial expression database (MMI), and the Oulu-CASIA NIR&VIS facial expression database (Oulu-CASIA). It achieves recognition accuracies of 98.46%, 82.96%, and 87.12% on CK+, MMI, and Oulu-CASIA, respectively.
Conclusion
Our deep network integrates temporal and spatial features to realize video sequence based FER. Its two sub-modules learn the "spatial features" of the facial expression at the peak frame and the "temporal features" of facial expression motion, and a fusion strategy based on overall fine-tuning then achieves a better fusion of the temporal and spatial features. The proposed FER method has potential for further development.