Combining ViT with contrastive learning for facial expression recognition

Cui Xinyu1, He Chong1, Zhao Hongke1, Wang Meili1,2,3 (1. College of Information Engineering, Northwest A&F University, Yangling 712100, China; 2. Key Laboratory of Agricultural Internet of Things, Ministry of Agriculture (Northwest A&F University), Yangling 712100, China; 3. Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service (Northwest A&F University), Yangling 712100, China)

Abstract
Objective Facial expression recognition is one of the important tasks in computer vision, yet its accuracy in real-world environments remains low. To address the low recognition accuracy caused by occlusion, pose variation, and illumination changes, this paper proposes a facial expression recognition method based on self-supervised contrastive learning, which improves recognition accuracy under occlusion and other challenging conditions. Method The method comprises two stages: contrastive learning pretraining and model fine-tuning. In the pretraining stage, the data augmentation strategy and the number of positive/negative sample-pair comparisons in contrastive learning are improved; the Transformer-based vision Transformer (ViT) network is chosen as the backbone, and the model is trained on the ImageNet dataset to strengthen its feature extraction ability. In the fine-tuning stage, the pretrained model is fine-tuned on the target facial expression recognition dataset to obtain the recognition results. Result Experiments compared the proposed method with 13 methods on four datasets. On the RAF-DB (real-world affective faces database) dataset, recognition accuracy improved by 0.48% over the Face2Exp (combating data biases for facial expression recognition) model; on the FERPlus (facial expression recognition plus) dataset, accuracy improved by 0.35% over the KTN (knowledgeable teacher network) model; on the AffectNet-8 dataset, accuracy improved by 0.40% over the SCN (self-cure network) model; on the AffectNet-7 dataset, accuracy was 0.26% lower than that of the DACL (deep attentive center loss) model. These results demonstrate the effectiveness of the proposed method. Conclusion The proposed facial expression recognition model combines the strengths of contrastive learning and ViT, improving the robustness of facial expression recognition under occlusion and related conditions and making recognition results more accurate.
Combining ViT with contrastive learning for facial expression recognition

Cui Xinyu1, He Chong1, Zhao Hongke1, Wang Meili1,2,3(1.College of Information Engineering, Northwest A&F University, Yangling 712100, China;2.Key Laboratory of Agricultural Internet of Things, Ministry of Agriculture(Northwest A&F University), Yangling 712100, China;3.Shaanxi Key Laboratory of Agricultural Information Perception and Intelligent Service(Northwest A&F University), Yangling 712100, China)

Abstract
Objective Facial expression is one of the important factors in human communication that helps convey the intentions of others. The task of facial expression recognition is to output the expression category corresponding to a given face image. Facial expression recognition has broad applications in areas such as security monitoring, education, and human-computer interaction. Currently, facial expression recognition under uncontrolled conditions suffers from low accuracy due to factors such as pose variations, occlusions, and lighting differences. Addressing these issues will remarkably advance the development of facial expression recognition in real-world scenarios and is highly relevant to the field of artificial intelligence. Self-supervised learning applies specific data augmentations to input data and generates pseudo labels for training or pretraining models. It leverages large amounts of unlabeled data and extracts the prior knowledge distribution of the images themselves to improve the performance of downstream tasks. Contrastive learning is a form of self-supervised learning that, by increasing the difficulty of the pretext task, learns the intrinsically consistent features shared by similar images under pose and lighting changes. This paper proposes an unsupervised contrastive learning-based facial expression classification method to address the low accuracy caused by occlusion, pose variation, and lighting changes in facial expression recognition. Method To address occlusion in facial expression recognition datasets under real-world conditions, a self-supervised contrastive learning method based on negative samples is employed. The method consists of two stages: contrastive learning pretraining and model fine-tuning.
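As a concrete illustration only (not the paper's implementation), negative-sample contrastive pretraining of this kind is commonly trained with an InfoNCE-style loss: a query embedding is pulled toward the embedding of another view of the same image and pushed away from embeddings of other images. A minimal NumPy sketch, with all names chosen here for illustration:

```python
import numpy as np

def info_nce_loss(q, k_pos, k_neg, tau=0.2):
    """InfoNCE contrastive loss for a single query embedding.

    q      : (d,)   query embedding (one augmented view of an image)
    k_pos  : (d,)   positive key (another view of the same image)
    k_neg  : (n, d) negative keys (embeddings of other images,
                    e.g. drawn from a MoCo-style dictionary/queue)
    tau    : temperature scaling the similarities
    """
    # l2-normalize so dot products are cosine similarities
    q = q / np.linalg.norm(q)
    k_pos = k_pos / np.linalg.norm(k_pos)
    k_neg = k_neg / np.linalg.norm(k_neg, axis=1, keepdims=True)

    # one positive logit followed by n negative logits
    logits = np.concatenate(([q @ k_pos], k_neg @ q)) / tau

    # cross-entropy with the positive at index 0
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])
```

Minimizing this loss over many (query, positive, negatives) triples is what "reducing the distance between images of the same type and increasing the distance between images of different classes" amounts to in practice.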
First, in the pretraining stage of contrastive learning, an unsupervised contrastive loss is introduced to reduce the distance between images of the same class and increase the distance between images of different classes, improving discrimination in the presence of intra-class diversity and inter-class similarity in facial expression images. The method adds positive sample pairs for contrastive learning between original images and occlusion-augmented images, enhancing the robustness of the model to image occlusion and illumination changes. Additionally, a dictionary mechanism is applied to MoCo v3 to overcome insufficient memory during training. The recognition model is pretrained on the ImageNet dataset. Next, the model is fine-tuned on the facial expression recognition dataset to improve classification accuracy for facial expression recognition tasks. This approach effectively enhances the performance of facial expression recognition in the presence of occlusions. Moreover, the Transformer-based vision Transformer (ViT) network is employed as the backbone to enhance the model's feature extraction capability. Result Experiments were conducted on four datasets to compare the proposed method with 13 recent methods. On the RAF-DB dataset, recognition accuracy increased by 0.48% over the Face2Exp model; on the FERPlus dataset, accuracy increased by 0.35% over the knowledgeable teacher network (KTN) model; on the AffectNet-8 dataset, accuracy increased by 0.40% over the self-cure network (SCN) model; on the AffectNet-7 dataset, accuracy was slightly lower, by 0.26%, than that of the deep attentive center loss (DACL) model. These results demonstrate the effectiveness of the proposed method.
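The occlusion-augmented positive pairs described in the method above can be approximated by random patch erasing: the original image and a copy with a random rectangle blanked out form one positive pair. A minimal NumPy sketch (the function name, defaults, and zero-fill choice are illustrative assumptions, not the paper's exact augmentation):

```python
import numpy as np

def random_occlusion(img, max_frac=0.3, rng=None):
    """Return a copy of `img` with one random rectangular patch zeroed out.

    The occluded copy paired with the original image serves as a
    positive pair for contrastive learning.
    img      : (H, W, C) float array
    max_frac : maximum patch side length as a fraction of the image side
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape[:2]
    # sample patch size, then a top-left corner that keeps it in bounds
    ph = rng.integers(1, max(2, int(h * max_frac)))
    pw = rng.integers(1, max(2, int(w * max_frac)))
    y = rng.integers(0, h - ph + 1)
    x = rng.integers(0, w - pw + 1)
    out = img.copy()
    out[y:y + ph, x:x + pw] = 0.0     # simulate an occluder
    return out
```

Training on such pairs encourages the encoder to map an occluded face and its unoccluded original to nearby embeddings, which is the stated source of robustness to occlusion.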
Conclusion A self-supervised contrastive learning-based method for facial expression recognition is proposed to address the challenges of occlusion, pose variation, and illumination changes under uncontrolled conditions. The method consists of two stages: pretraining and fine-tuning. The contribution of this paper lies in integrating ViT into the contrastive learning framework, which enables a large amount of unlabeled, noise-occluded data to be used to learn the distribution characteristics of facial expression data. The proposed method achieves promising accuracy on facial expression recognition datasets, including RAF-DB, FERPlus, AffectNet-7, and AffectNet-8. By leveraging the contrastive learning framework and advanced feature extraction networks, this work advances the application of deep learning methods in everyday visual tasks.
