Objective Facial expression recognition is one of the important tasks in computer vision, yet its accuracy in real-world environments remains low. To address the low recognition accuracy caused by occlusion, pose variation, and illumination changes, a facial expression recognition method based on self-supervised contrastive learning is proposed, which improves recognition accuracy under occlusion and other challenging conditions. Method The method consists of two stages: contrastive learning pretraining and model fine-tuning. In the pretraining stage, the data augmentation scheme and the number of positive/negative pair comparisons in contrastive learning are improved; the Transformer-based Vision Transformer (ViT) network is selected as the backbone, and the model is trained on the ImageNet dataset to strengthen its feature extraction ability. In the fine-tuning stage, the pretrained model is fine-tuned on the target facial expression recognition dataset to obtain recognition results. Result Experiments on four datasets compare the proposed method with 13 state-of-the-art methods. On the RAF-DB dataset, recognition accuracy improves by 0.48% over the Face2Exp model; on FERPlus, by 0.35% over the KTN model; on AffectNet-8, by 0.40% over the SCN model; on AffectNet-7, accuracy is 0.26% lower than the DACL model. These results demonstrate the effectiveness of the proposed method. Conclusion The proposed facial expression recognition model combines the advantages of contrastive learning and the ViT model, improving robustness under occlusion and similar conditions and yielding more accurate recognition results.
Combining ViT with Contrastive Learning for Facial Expression Recognition
Cui Xinyu, He Chong, Zhao Hongke, Wang Meili (College of Information Engineering, Northwest A&F University)
Objective Facial expression is one of the important factors in human communication, helping people understand the intentions of others. The task of facial expression recognition is to output the expression category corresponding to a given face image. It has broad applications in areas such as security monitoring, education, and human-computer interaction. Currently, facial expression recognition under uncontrolled conditions suffers from low accuracy due to factors such as pose variations, occlusions, and lighting differences. Addressing these issues will significantly advance facial expression recognition in real-world scenarios and is of great significance to the field of artificial intelligence. Self-supervised learning applies specific data augmentations to the input data and generates pseudo-labels for training or pretraining models; it leverages a large amount of unlabeled data and extracts the prior distribution of the images themselves, aiming to improve the performance of downstream tasks. Contrastive learning is a form of self-supervised learning that, by increasing the difficulty of the pretext task, can further learn the intrinsically consistent features shared by similar images under changes in pose and lighting. This paper proposes a self-supervised contrastive learning-based facial expression classification method to address the low accuracy caused by occlusion, pose variation, and lighting changes in facial expression recognition. Method To handle occlusions in facial expression recognition datasets collected under real-world conditions, a negative-sample-based self-supervised contrastive learning method is employed. The method consists of two stages: contrastive learning pretraining and model fine-tuning.
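The augmentation-driven positive pairs mentioned above can be sketched as follows. This is a minimal illustrative random-erasing-style occlusion on a plain 2-D pixel grid, not the paper's actual augmentation pipeline; the function names `random_occlusion` and `make_positive_pair` and all parameters are hypothetical.

```python
import random

def random_occlusion(image, max_frac=0.4, seed=None):
    """Zero out one random rectangular patch to simulate occlusion.

    image: 2-D list of grayscale pixel values. Returns a new grid of the
    same shape, leaving the input untouched.
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    ph = rng.randint(1, max(1, int(h * max_frac)))   # patch height
    pw = rng.randint(1, max(1, int(w * max_frac)))   # patch width
    top = rng.randint(0, h - ph)
    left = rng.randint(0, w - pw)
    out = [row[:] for row in image]                  # copy, do not mutate
    for r in range(top, top + ph):
        for c in range(left, left + pw):
            out[r][c] = 0.0
    return out

def make_positive_pair(image):
    """A positive pair: the original image and an occluded view of it."""
    return image, random_occlusion(image)
```

In a contrastive setup, the two elements of such a pair are encoded and pulled together in feature space, while views of other images act as negatives.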
First, in the contrastive learning pretraining stage, an unsupervised contrastive loss is introduced to reduce the distance between images of the same class and increase the distance between images of different classes, improving the model's ability to discriminate facial expression images despite intra-class diversity and inter-class similarity. The method adds positive sample pairs between original images and occlusion-augmented images for contrastive learning, enhancing the model's robustness to occlusion and illumination changes. Additionally, a dictionary mechanism is applied to MoCo v3 to overcome insufficient memory during training. The recognition model is pretrained on the ImageNet dataset. Next, the model is fine-tuned on the facial expression recognition dataset to improve classification accuracy for facial expression recognition tasks. This approach effectively enhances recognition performance in the presence of occlusions. Moreover, the Transformer-based Vision Transformer (ViT) network is employed as the backbone to strengthen the model's feature extraction capability. Result Experiments were conducted on four datasets to compare the proposed method against 13 recent methods. On the RAF-DB dataset, recognition accuracy increased by 0.48% over the Face2Exp model; on FERPlus, by 0.35% over the KTN model; on AffectNet-8, by 0.40% over the SCN model; on AffectNet-7, accuracy was 0.26% lower than the DACL model. These results demonstrate the effectiveness of the proposed method.
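The contrastive objective described above can be sketched with an InfoNCE-style loss for a single anchor. Note that the actual loss in MoCo v3 operates on batched feature embeddings with a momentum encoder; the scalar-similarity interface and the name `info_nce_loss` here are simplifications for illustration only.

```python
import math

def info_nce_loss(pos_sim, neg_sims, temperature=0.07):
    """InfoNCE-style contrastive loss for a single anchor.

    pos_sim:  similarity between the anchor and its positive view.
    neg_sims: similarities between the anchor and negative samples.
    A lower loss means the positive pair is pulled together relative
    to the negatives.
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denom - logits[0]
```

Because the positive logit appears in both numerator and denominator, the loss is always positive when negatives are present, and it shrinks as the positive similarity grows relative to the negatives.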
Conclusion A self-supervised contrastive learning-based method for facial expression recognition is proposed to address the challenges of occlusion, pose variation, and illumination changes in uncontrolled conditions. The method consists of two stages: pretraining and fine-tuning. The contribution of this paper lies in integrating ViT into the contrastive learning framework, which enables the use of a large amount of unlabeled and noise-occluded data to learn the distribution characteristics of facial expression data. The proposed method achieves promising accuracy on facial expression recognition datasets, including RAF-DB, FERPlus, AffectNet-7, and AffectNet-8. By leveraging the contrastive learning framework and advanced feature extraction networks, this work enhances the application of deep learning methods to everyday visual tasks.