Progress and challenges in facial action unit detection
2020, Vol. 25, No. 11, pp. 2293-2305
Received: 2020-07-01; Revised: 2020-09-23; Accepted: 2020-09-30; Published in print: 2020-11-16
DOI: 10.11834/jig.200343
From the perspective of facial anatomy, the facial action coding system defines a set of facial action units (AUs) that precisely characterize changes in facial expression. Each AU describes the appearance change produced by the movement of a group of facial muscles, and combinations of AUs can express any facial expression. AU detection is a multilabel classification problem whose challenges include insufficient annotated data, head-pose interference, individual differences, and class imbalance among AUs. To summarize recent progress in AU detection, this paper systematically reviews representative methods proposed since 2016, categorized by input modality into image-based, video-based, and other-modality methods, and discusses the weakly supervised AU detection methods introduced under each modality to reduce dependence on labeled data. For static images, we further introduce AU detection methods based on local feature learning, AU relation modeling, multitask learning, and weakly supervised learning. For video, we mainly introduce AU detection methods based on temporal features and on self-supervised AU feature learning. Finally, we compare and summarize the advantages and disadvantages of the representative methods, and on that basis we summarize and discuss the challenges and future trends of facial AU detection.
The anatomically based facial action coding system defines a unique set of atomic, nonoverlapping facial muscle actions called action units (AUs), which can accurately characterize facial expressions. AUs correspond to muscular activities that produce momentary changes in facial appearance, and combinations of AUs can represent any facial expression. As a multilabel classification problem, AU detection suffers from insufficient AU annotations, various head poses, individual differences, and imbalance among different AUs. To facilitate the development of AU detection methods, this article systematically summarizes representative methods that have been proposed since 2016.
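To make the multilabel formulation concrete, the following minimal PyTorch sketch casts AU detection as one independent sigmoid output per AU and counters class imbalance with a weighted binary cross-entropy loss; the feature dimension, the number of AUs, and the positive-class weights are illustrative assumptions rather than values from any surveyed method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLabelAUHead(nn.Module):
    """Minimal multilabel AU classifier: one independent logit per AU."""
    def __init__(self, feat_dim=512, num_aus=12):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_aus)

    def forward(self, feats):
        return self.fc(feats)  # raw logits; sigmoid is applied inside the loss

# Hypothetical backbone features and binary AU labels for a batch of 8 faces.
logits = MultiLabelAUHead()(torch.randn(8, 512))
labels = torch.randint(0, 2, (8, 12)).float()
# Up-weight positives of rare AUs (the 5:1 ratio is an assumed example).
pos_weight = torch.full((12,), 5.0)
loss = F.binary_cross_entropy_with_logits(logits, labels, pos_weight=pos_weight)
```

Weighting the positive term of each AU by its negative-to-positive ratio is a common, simple remedy for the imbalance noted above.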
According to the input data, AU detection methods are categorized into image-based, video-based, and other-modality approaches. We also discuss how AU detection methods can deal with partial supervision, given the large scale of unlabeled data.
Image-based methods include approaches that learn local facial representations, exploit AU relations, and employ multitask or weakly supervised learning. Handcrafted or automatically learned local facial representations can capture the local deformations caused by active AUs; however, handcrafted representations cannot adapt their local regions to different AUs, while learned representations suffer from insufficient training data.
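As a rough illustration of local representation learning, the sketch below crops fixed-size patches around assumed AU-relevant locations (the landmark coordinates, patch size, and choice of regions are hypothetical) and learns one small convolutional branch per region; because the crops are fixed, it also exhibits the non-adaptive-region limitation just mentioned.

```python
import torch
import torch.nn as nn

class RegionBranch(nn.Module):
    """One small convolutional branch per facial region (e.g., brow, mouth)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))

    def forward(self, patch):
        return self.net(patch)

def crop_patch(img, center, size=24):
    """Crop a fixed square patch around an (x, y) landmark location."""
    x, y = center
    return img[:, :, y - size // 2:y + size // 2, x - size // 2:x + size // 2]

img = torch.randn(8, 3, 128, 128)                    # batch of aligned face crops
centers = [(40, 40), (88, 40), (64, 96)]             # assumed brow/brow/mouth landmarks
branches = nn.ModuleList(RegionBranch() for _ in centers)
feats = torch.cat([b(crop_patch(img, c)) for b, c in zip(branches, centers)], dim=1)
au_logits = nn.Linear(64 * len(centers), 12)(feats)  # fuse region features into 12 AU logits
```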
Approaches that exploit AU relations utilize the prior knowledge that some AUs appear together or exclusively at the same time. Such methods adopt either Bayesian networks or graph neural networks to model AU relations inferred manually from the annotations of specific datasets; however, these inflexible methods fail in cross-dataset evaluation.
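A minimal sketch of relation modeling with a graph neural network over per-AU features, assuming a hand-specified adjacency matrix that encodes prior AU relations; the single relation shown (between the first two AU nodes) is only an example of such prior knowledge.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AURelationGCN(nn.Module):
    """Single graph-convolution step over per-AU node features."""
    def __init__(self, dim=64):
        super().__init__()
        self.weight = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, node_feats, adj):
        # node_feats: (batch, num_aus, dim); adj: (num_aus, num_aus), row-normalized.
        h = F.relu(self.weight(torch.matmul(adj, node_feats)))
        return self.classifier(h).squeeze(-1)        # one logit per AU

num_aus = 12
adj = torch.eye(num_aus)
adj[0, 1] = adj[1, 0] = 1.0                          # assumed co-occurrence relation
adj = adj / adj.sum(dim=1, keepdim=True)             # row-normalize the prior graph
logits = AURelationGCN()(torch.randn(4, num_aus, 64), adj)
```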
Multitask AU detection methods are inspired by the observation that facial shape, as represented by facial landmarks, is helpful for AU detection, while the facial deformations caused by active AUs in turn affect the distribution of landmark locations. In addition to detecting facial AUs, such methods typically estimate facial landmarks or recognize facial expressions in a multitask manner. Other facial emotion analysis tasks, such as emotional dimension estimation, can also be incorporated into the multitask learning setting.
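The multitask idea reduces to a shared backbone with one head per task; the generic sketch below jointly trains AU detection and landmark regression (the layer sizes and the 0.5 loss weight are assumptions, and the pattern is not the architecture of any particular paper).

```python
import torch
import torch.nn as nn

class MultiTaskFaceNet(nn.Module):
    """Shared backbone with AU-detection and landmark-regression heads."""
    def __init__(self, num_aus=12, num_landmarks=68):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.au_head = nn.Linear(32, num_aus)             # per-AU logits
        self.lmk_head = nn.Linear(32, num_landmarks * 2)  # (x, y) per landmark

    def forward(self, x):
        h = self.backbone(x)
        return self.au_head(h), self.lmk_head(h)

net = MultiTaskFaceNet()
au_logits, lmk = net(torch.randn(2, 3, 128, 128))
# Joint objective: weighted sum of the two task losses (0.5 is an assumed weight).
loss = nn.BCEWithLogitsLoss()(au_logits, torch.rand(2, 12).round()) \
       + 0.5 * nn.MSELoss()(lmk, torch.randn(2, 136))
```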
Video-based methods are categorized into temporal representation learning and self-supervised learning strategies. Temporal representation learning methods commonly adopt long short-term memory (LSTM) networks or 3D convolutional neural networks (3D-CNNs) to model temporal information; other temporal approaches utilize the optical flow between frames to detect facial AUs.
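A typical CNN-plus-LSTM arrangement for temporal representation learning can be sketched as follows, with frame-wise convolutional features fed to a recurrent layer that outputs per-frame AU logits (layer sizes and clip length are illustrative).

```python
import torch
import torch.nn as nn

class CNNLSTMAUDetector(nn.Module):
    """Frame-wise CNN features fed to an LSTM for temporal AU modeling."""
    def __init__(self, num_aus=12, feat_dim=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_aus)

    def forward(self, clip):                      # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out)                     # per-frame AU logits

logits = CNNLSTMAUDetector()(torch.randn(2, 8, 3, 112, 112))  # shape (2, 8, 12)
```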
Several self-supervised approaches have recently exploited the prior knowledge that facial actions, that is, the movements of facial muscles between video frames, can serve as a self-supervisory signal. Such video-based weakly supervised AU detection methods are well motivated and explainable, and they can effectively alleviate the problem of insufficient AU annotations. However, these methods rely on massive amounts of unlabeled video data in the training phase and cannot perform AU detection in an end-to-end manner.
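Much simplified relative to the published self-supervised formulations, the idea can be sketched as a pretext task in which a network predicts the change between two unlabeled frames of the same face, so that the learned code captures facial motion and can later be reused for AU detection; every architectural detail below is an assumption for illustration.

```python
import torch
import torch.nn as nn

class MotionPretext(nn.Module):
    """Simplified self-supervised pretext: predict the change between two
    frames of the same face, so the learned code captures facial motion."""
    def __init__(self, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.motion = nn.Linear(2 * dim, dim)         # motion code from a frame pair
        self.decoder = nn.Linear(dim, 3 * 32 * 32)    # coarse frame-change map

    def forward(self, f1, f2):
        code = self.motion(torch.cat([self.encoder(f1), self.encoder(f2)], dim=1))
        return code, self.decoder(code).view(-1, 3, 32, 32)

f1, f2 = torch.randn(2, 4, 3, 32, 32)                 # adjacent frames, no labels
model = MotionPretext()
code, pred_change = model(f1, f2)
loss = nn.MSELoss()(pred_change, f2 - f1)             # supervised by the frames themselves
# After pretraining, `code` (or the encoder) is reused for AU detection.
```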
We also review methods that exploit point clouds or thermal images for AU detection, which can alleviate the influence of head pose or illumination. Finally, we compare representative methods and analyze their advantages and drawbacks, and on this basis we summarize and discuss the challenges and potential directions of AU detection. We conclude that methods capable of utilizing weakly annotated or unlabeled data are an important research direction for future investigation; such methods should be carefully designed according to the prior knowledge of AUs to alleviate the demand for large amounts of labeled data.