Facial expression recognition based on deep facial landmark features
2020, Vol. 25, No. 4: 813-823
Received: 2019-07-11; Revised: 2019-10-08; Accepted: 2019-10-15; Published in print: 2020-04-16
DOI: 10.11834/jig.190331

Objective
Facial landmark detection and facial expression recognition are two closely related tasks. Existing work that combines them simply couples the two tasks directly and ignores their intrinsic connection. To address this problem, a deep multitask framework is proposed that recognizes facial expressions with the help of landmark features.
Method
Referring to the inception structure, a deep network is designed to detect facial landmarks and recognize facial expressions simultaneously. Under the supervision of both tasks, the network pays more attention to information near the landmarks, so that features around the facial organs obtain larger response values. To further reduce the influence of noise from other facial regions on expression recognition, a location attention map is generated from the detected landmarks; it further increases the weights of features around the facial organs and suppresses feature responses at the face boundary. Complex expressions deform parts of the face and make landmark detection harder. To alleviate this problem, an intermediate supervision layer is introduced: an expression recognition task with a small weight is added to the first-stage landmark detection network, which both improves landmark detection on samples with complex expressions and drives the network to extract more expression-related features.
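The overall objective implied by this design can be written compactly; the notation below is ours, since the abstract gives neither the paper's symbols nor the exact value of the small weight $\lambda$:

$$L_{\mathrm{total}} = L_{\mathrm{lm}} + L_{\mathrm{exp}} + \lambda\, L_{\mathrm{exp}}^{(1)}, \qquad 0 < \lambda \ll 1,$$

where $L_{\mathrm{lm}}$ is the landmark localization loss, $L_{\mathrm{exp}}$ the expression recognition loss of the main branch, and $L_{\mathrm{exp}}^{(1)}$ the intermediate expression loss attached to the first-stage landmark network.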
Result
The proposed method is compared with classic methods on three public datasets: CK+ (Cohn-Kanade dataset), Oulu (Oulu-CASIA NIR & VIS facial expression database), and MMI (MMI facial expression database). It achieves the highest recognition accuracy on CK+, and on Oulu and MMI it improves recognition accuracy over the previously best-performing methods by 0.14% and 0.54%, respectively.
Conclusion
The experimental results demonstrate the effectiveness of introducing landmark information: the multitask convolutional neural network achieves higher expression recognition accuracy than a traditional single-task convolutional neural network. Introducing the attention model further improves the expression recognition rate of the multitask network.
Objective
Automatic facial expression recognition (FER) aims to identify human emotions automatically from facial images. Many methods have been proposed over the past 20 years, and previous works can generally be divided into two categories: image-based methods and video-based methods. In this study, we propose a new image-based FER method guided by facial landmarks. A facial expression is ultimately a manifestation of facial muscle movement and consists of various facial action units (AUs) distributed among the facial organs; meanwhile, the purpose of facial landmark detection is to localize the position and shape of the face and facial organs. Thus, a strong relationship exists between facial expression and facial landmark detection. Based on this observation, some works combine facial expression recognition and facial landmark localization with different strategies, and most of them extract geometric features or only attend to texture information around the landmarks to recognize the expression. Although these methods achieve good results, they still have issues: they assist FER by using given facial landmarks as prior information, but the internal connection between the two tasks is ignored. To solve this problem, a deep multitask framework is proposed in this study.
Method
A multitask network is designed to recognize facial expressions and locate facial landmarks simultaneously, because both tasks attend to features around the facial organs, including the eyebrows, eyes, nose, and mouth (points around the external contour are discarded). However, obtaining ground-truth facial landmarks is not easy in practice, especially on some FER benchmarks. We therefore first use a stacked hourglass network to detect the landmark points, because the stacked hourglass network achieves excellent performance in face alignment, as demonstrated in the 2nd Facial Landmark Localization Competition held in conjunction with CVPR (IEEE Conference on Computer Vision and Pattern Recognition) 2017.
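For readers unfamiliar with this step, hourglass networks output one heat map per landmark, and coordinates are commonly decoded by taking the peak of each channel. A minimal PyTorch sketch under that assumption (the function name and tensor shapes are ours, not from the paper):

import torch

def decode_landmarks(heatmaps: torch.Tensor) -> torch.Tensor:
    # heatmaps: (N, K, H, W), one channel per landmark point.
    # Returns an (N, K, 2) tensor of (x, y) peak positions in pixels.
    n, k, h, w = heatmaps.shape
    flat = heatmaps.view(n, k, -1)
    idx = flat.argmax(dim=-1)                      # peak index per channel
    ys = torch.div(idx, w, rounding_mode='floor')  # row of the peak
    xs = idx % w                                   # column of the peak
    return torch.stack([xs, ys], dim=-1).float()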
The designed network has two branches, corresponding to the two tasks. Considering the relationship between the tasks, the branches share the first two convolution layers. The landmark localization branch is simple, containing three convolution layers and a fully connected layer, because it merely assists expression recognition in feature selection. The expression recognition branch is more complicated: an inception module is introduced, and convolution kernels of different sizes are applied to capture multiscale features.
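To make the two-branch layout concrete, here is a minimal PyTorch sketch; the channel counts, the number of landmarks, the grayscale input, and the reduced inception block (only two parallel kernel sizes) are our assumptions, since the abstract does not publish exact layer configurations:

import torch
import torch.nn as nn

class MiniInception(nn.Module):
    # Reduced inception block: parallel 3x3 and 5x5 convolutions.
    def __init__(self, cin, cout):
        super().__init__()
        self.b3 = nn.Conv2d(cin, cout // 2, 3, padding=1)
        self.b5 = nn.Conv2d(cin, cout // 2, 5, padding=2)

    def forward(self, x):
        return torch.relu(torch.cat([self.b3(x), self.b5(x)], dim=1))

class MultiTaskFER(nn.Module):
    def __init__(self, n_landmarks=49, n_expressions=7):  # placeholder sizes
        super().__init__()
        # Two convolution layers shared by both branches.
        self.shared = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # Simple landmark branch: three convolutions + one fully connected layer.
        self.lm_branch = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 16, n_landmarks * 2))
        # Expression branch with multiscale, inception-style convolutions.
        self.exp_branch = nn.Sequential(
            MiniInception(64, 128), nn.MaxPool2d(2),
            MiniInception(128, 128), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, n_expressions))

    def forward(self, x):
        f = self.shared(x)
        return self.lm_branch(f), self.exp_branch(f)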
The two tasks are optimized together with a unified loss to learn the network parameters, in which a distance loss supervises facial landmark localization and a cross-entropy loss supervises facial expression recognition. Although features around the facial landmarks obtain strong responses under the supervision of the two tasks, other regions still contain noise; for example, part of the collar retained in the cropped face image harms expression recognition. To address this issue, location attention maps are created from the landmarks predicted by the localization branch. A location attention map is a weight matrix of the same size as the corresponding feature maps and indicates the importance of each position. Inspired by the stacked hourglass network, a series of heat maps is first generated by taking the coordinates of each point as the mean of a Gaussian distribution with an appropriate variance; a max-pooling operation then merges these heat maps into the location attention map.
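A minimal sketch of this construction follows; the function name, the variance value, and the per-pixel max merge are our reading of the description above, not code released with the paper:

import torch

def location_attention_map(landmarks, h, w, sigma=2.0):
    # landmarks: (K, 2) tensor of (x, y) positions on the feature-map grid.
    # Returns an (h, w) weight matrix: one Gaussian heat map per landmark,
    # merged by taking the per-pixel maximum across landmarks.
    ys = torch.arange(h, dtype=torch.float32).view(h, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, w)
    maps = []
    for x0, y0 in landmarks.tolist():
        d2 = (xs - x0) ** 2 + (ys - y0) ** 2   # squared distance to the point
        maps.append(torch.exp(-d2 / (2 * sigma ** 2)))
    return torch.stack(maps).max(dim=0).values

The resulting map can be multiplied element-wise onto the expression-branch feature maps to up-weight features around the facial organs and suppress responses elsewhere.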
The generated attention maps depend on the accuracy of landmark localization, since they use the key points detected by the first branch; valid features may therefore be filtered out when the detected landmarks deviate substantially. A small offset can be absorbed by enlarging the variance of the Gaussian distribution, but this does not help when the predicted landmarks deviate greatly from the ground truth. To solve this problem, intermediate supervision is introduced into the landmark localization stage by adding the expression recognition task with a small weight. The final loss consists of three parts: the intermediate supervision loss, the facial landmark localization loss of the first branch, and the facial expression recognition loss of the second branch.
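Assembled into one training objective, a hedged sketch follows; the MSE form of the distance loss and the weight 0.1 on the intermediate term are illustrative, since the abstract only states that the weight is small:

import torch.nn.functional as F

def total_loss(pred_lm, gt_lm, exp_logits, inter_logits, labels, lam=0.1):
    # Unified objective: landmark distance loss (branch 1), expression
    # cross-entropy (branch 2), and a small-weight intermediate expression
    # loss attached to the landmark localization stage.
    l_lm = F.mse_loss(pred_lm, gt_lm)
    l_exp = F.cross_entropy(exp_logits, labels)
    l_inter = F.cross_entropy(inter_logits, labels)
    return l_lm + l_exp + lam * l_inter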
Result
To validate the effectiveness of the proposed method, ablation studies are conducted on three popular databases: CK+ (Cohn-Kanade dataset), Oulu (Oulu-CASIA NIR & VIS facial expression database), and MMI (MMI facial expression database). We also compare the multitask network with a single-task network to evaluate the importance of introducing landmark localization into facial expression recognition. The experimental results demonstrate that the proposed multitask network outperforms traditional convolutional networks, improving recognition accuracy on the three databases by 0.93%, 1.71%, and 2.92%, respectively. The results also show that the generated location attention map is effective, improving recognition accuracy by a further 0.14%, 2.43%, and 1.82%, respectively, on the three databases. Finally, performance on the three databases peaks when intermediate supervision is added: recognition accuracy on the Oulu and MMI databases increases by another 0.14% and 0.54%, respectively. Intermediate supervision has minimal effect on the CK+ database because its samples are simple and the predicted landmarks do not deviate significantly.
Conclusion
A multitask network is designed to recognize facial expressions and localize facial landmarks simultaneously, and the experimental results demonstrate that the relationship between the two tasks is useful for facial expression recognition. The proposed location attention map improves recognition accuracy, revealing that features distributed around the facial organs are powerful cues for expression recognition. Meanwhile, the introduced intermediate supervision improves facial landmark localization so that the generated location attention map can filter out noise accurately.
References
Cootes T F, Taylor C J, Cooper D H and Graham J. 1995. Active shape models-their training and application. Computer Vision and Image Understanding, 61(1): 38-59 [DOI: 10.1006/cviu.1995.1004]
Dalal N and Triggs B. 2005. Histograms of oriented gradients for human detection//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). San Diego, CA, USA: IEEE: 886-893 [DOI: 10.1109/CVPR.2005.177]
Devries T, Biswaranjan K and Taylor G W. 2014. Multi-task learning of facial landmarks and expression//Proceedings of 2014 Canadian Conference on Computer and Robot Vision. Montreal, QC, Canada: IEEE: 98-103 [DOI: 10.1109/CRV.2014.21]
Ding H, Zhou S K and Chellappa R. 2017. FaceNet2ExpNet: regularizing a deep face recognition net for expression recognition//Proceedings of the 12th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2017). Washington, DC, USA: IEEE: 118-126 [DOI: 10.1109/FG.2017.23]
Ekman P and Friesen W V. 1971. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2): 124-129 [DOI: 10.1037/h0030377]
Ekman P and Friesen W V. 1976. Measuring facial movement. Environmental Psychology and Nonverbal Behavior, 1(1): 56-75 [DOI: 10.1007/BF01115465]
Freund Y and Schapire R E. 1996. Experiments with a new boosting algorithm//Proceedings of the 13th International Conference on Machine Learning. Bari, Italy: ACM: 148-156
Hasani B and Mahoor M H. 2017. Facial expression recognition using enhanced deep 3D convolutional neural networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, HI, USA: IEEE: 30-40 [DOI: 10.1109/CVPRW.2017.282]
Jung H, Lee S, Yim J, Park S and Kim J. 2015. Joint fine-tuning in deep neural networks for facial expression recognition//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2983-2991 [DOI: 10.1109/ICCV.2015.341]
Khorrami P, Paine T L and Huang T S. 2015. Do deep neural networks learn facial action units when doing expression recognition?//Proceedings of 2015 IEEE International Conference on Computer Vision Workshop (ICCVW). Santiago, Chile: IEEE: 19-27 [DOI: 10.1109/ICCVW.2015.12]
Krizhevsky A, Sutskever I and Hinton G E. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6): 84-90 [DOI: 10.1145/3065386]
Kuo C M, Lai S H and Sarkis M. 2018. A compact deep learning model for robust facial expression recognition//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Salt Lake City, UT, USA: IEEE: 2202-2208 [DOI: 10.1109/CVPRW.2018.00286]
Liu M Y, Li S X, Shan S G and Chen X L. 2013. AU-aware deep networks for facial expression recognition//Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). Shanghai, China: IEEE: 1-6 [DOI: 10.1109/FG.2013.6553734]
Liu P, Han S Z, Meng Z B and Tong Y. 2014. Facial expression recognition via a boosted deep belief network//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE: 1805-1812 [DOI: 10.1109/CVPR.2014.233]
Lowe D G. 1999. Object recognition from local scale-invariant features//Proceedings of the 7th IEEE International Conference on Computer Vision. Kerkyra, Greece: IEEE: 1150-1157 [DOI: 10.1109/ICCV.1999.790410]
Lucey P, Cohn J F, Kanade T, Saragih J, Ambadar Z and Matthews I. 2010. The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. San Francisco, CA, USA: IEEE: 94-101 [DOI: 10.1109/CVPRW.2010.5543262]
Mollahosseini A, Chan D and Mahoor M H. 2016. Going deeper in facial expression recognition using deep neural networks//Proceedings of 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). Lake Placid, NY, USA: IEEE: 1-10 [DOI: 10.1109/WACV.2016.7477450]
Munasinghe M I N P. 2018. Facial expression recognition using facial landmarks and random forest classifier//Proceedings of the 17th IEEE/ACIS International Conference on Computer and Information Science (ICIS). Singapore: IEEE: 423-427 [DOI: 10.1109/ICIS.2018.8466510]
Ojala T, Pietikäinen M and Mäenpää T. 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7): 971-987 [DOI: 10.1109/TPAMI.2002.1017623]
Ouellet S. 2014. Real-time emotion recognition for gaming using deep convolutional network features [2019-06-26]. https://arxiv.org/pdf/1408.3750.pdf
Özbey N and Topal C. 2018. Expression recognition with appearance-based features of facial landmarks//Proceedings of the 26th Signal Processing and Communications Applications Conference (SIU). Izmir, Turkey: IEEE: 1-4 [DOI: 10.1109/SIU.2018.8404541]
Pantic M, Valstar M, Rademaker R and Maat L. 2005. Web-based database for facial expression analysis//Proceedings of 2005 IEEE International Conference on Multimedia and Expo. Amsterdam, the Netherlands: IEEE: 317-321 [DOI: 10.1109/ICME.2005.1521424]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition [2019-06-26]. https://arxiv.org/pdf/1409.1556.pdf
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA: IEEE: 1-9 [DOI: 10.1109/CVPR.2015.7298594]
Szegedy C, Ioffe S, Vanhoucke V and Alemi A A. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, CA, USA: AAAI: 4278-4284
Valstar M and Pantic M. 2010. Induced disgust, happiness and surprise: an addition to the MMI facial expression database//Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC) Workshops. Paris, France: [s.n.]: 65-70
Wang X and Liu X G. 2015. Learning the discriminate patches from the key landmarks for facial expression recognition//Proceedings of 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity). Chengdu, China: IEEE: 345-348 [DOI: 10.1109/SmartCity.2015.95]
Yang H Y and Yin L J. 2017. CNN based 3D facial expression recognition using masking and landmark features//Proceedings of 2017 International Conference on Affective Computing and Intelligent Interaction. San Antonio, TX, USA: IEEE: 556-560 [DOI: 10.1109/ACII.2017.8273654]
Yang H Y, Ciftci U and Yin L J. 2018. Facial expression recognition by de-expression residue learning//Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE: 2168-2177 [DOI: 10.1109/CVPR.2018.00231]
Yang J, Liu Q S and Zhang K H. 2017. Stacked hourglass network for robust facial landmark localisation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Honolulu, HI, USA: IEEE: 79-87 [DOI: 10.1109/CVPRW.2017.253]
Yu Z B, Liu G C, Liu Q S and Deng J K. 2018. Spatio-temporal convolutional features with nested LSTM for facial expression recognition. Neurocomputing, 317: 50-57 [DOI: 10.1016/j.neucom.2018.07.028]
Zhang K P, Zhang Z P, Li Z F and Qiao Y. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10): 1499-1503 [DOI: 10.1109/LSP.2016.2603342]
Zhao G Y, Huang X H, Taini M, Li S Z and Pietikäinen M. 2011. Facial expression recognition from near-infrared videos. Image and Vision Computing, 29(9): 607-619 [DOI: 10.1016/j.imavis.2011.07.002]
Zhao X Y, Liang X D, Liu L Q, Li T, Han Y G, Vasconcelos N and Yan S C. 2016. Peak-piloted deep network for facial expression recognition//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 425-442 [DOI: 10.1007/978-3-319-46475-6_27]