Lu Lidan1,2, Xia Haiying1, Tan Yumei1, Song Shuxiang1 (1. Guangxi Normal University; 2. Nanning Institute of Technology)
Objective In complex natural scenes, facial expression recognition suffers from partial occlusions such as glasses, hand movements and hairstyles, and these occluded regions reduce the model's ability to discriminate emotions. We therefore propose an attention-guided local feature joint learning method for facial expression recognition. Method The method consists of a global feature extraction module, a global feature enhancement module and a local feature joint learning module. First, the global feature extraction module extracts mid-level global features. Second, the global feature enhancement module suppresses the redundant features introduced by the face recognition pre-trained model and enhances the semantic information of the feature maps most relevant to emotion in the global face image. Finally, the local feature joint learning module uses a mixed attention mechanism to learn fine-grained salient features of different local facial regions, constrained by a joint loss. Result Experiments were conducted on two in-the-wild datasets, RAF-DB and FERPlus. On RAF-DB the recognition accuracy is 89.24%, a 0.84% improvement over MA-Net; on FERPlus the accuracy is 90.04%, comparable to FER-VT. These results demonstrate that the method is robust. Conclusion By first enhancing global features and then refining local features, the proposed method effectively reduces the interference caused by partial occlusion.
Attention-Guided Local Feature Joint Learning for Facial Expression Recognition
(Guangxi Normal University; Nanning Institute of Technology)
Objective In face-to-face communication, people convey their inner emotions in a variety of ways, such as conversational tone, body movements and facial expressions. Among these, facial expression is the most direct means of observing human emotion: people convey their thoughts and feelings through facial expressions, and also read others' attitudes and inner states from them. Facial expression recognition is therefore an important research direction in affective computing, with applications in many fields such as fatigue driving detection, human-computer interaction, analysis of students' listening state and intelligent medical services. However, in complex natural scenes, facial expression recognition suffers from direct occlusions such as masks, sunglasses, gestures, hairstyles or beards, as well as indirect interference such as varying illumination, complex backgrounds and pose variation. These factors make it difficult to extract discriminative features and pose great challenges to facial expression recognition in natural scenes, leading to poor recognition results. To reduce the interference of occlusion and pose variation, we propose an attention-guided local feature joint learning method for facial expression recognition. Method Our method is composed of a global feature extraction module, a global feature enhancement module and a local feature joint learning module. First, we use ResNet-50 as the backbone network and initialize its parameters on the MS-Celeb-1M face recognition dataset. We believe that the rich information in the face recognition model can complement the contextual information needed for facial expression recognition, especially mid-level features such as the eyes, nose and mouth.
The global feature extraction module, which consists of a 2D convolutional layer and three bottleneck residual convolutional blocks, therefore extracts the global features of the middle layer. Second, most facial expression cues are concentrated in key local regions such as the eyes, nose and mouth, which makes it possible to recognize expression categories correctly from local key information alone, without the whole face. Face recognition, by contrast, relies on holistic facial information, so the face recognition pre-trained model introduces features that are unimportant for expression recognition. We therefore use a global feature enhancement module to suppress the redundant features (e.g., features in the nose region) brought in by the face recognition pre-trained model and to enhance the semantic information of the global face image that is most relevant to emotion. This module is implemented with the ECA (efficient channel attention) mechanism, which, through cross-channel interactions among high-level semantic channel features, strengthens the channel features that contribute to classification and weakens those that are detrimental to it. Finally, we divide the output features of the global feature enhancement module uniformly along the spatial dimensions into 4 non-overlapping local regions. In most face images, this division places the eye and mouth regions in separate sub-image blocks, splitting the global facial expression analysis problem into multiple local sub-problems. The local feature joint learning module then learns fine-grained salient features of the different local facial regions through a mixed attention mechanism and aggregates information from complementary contexts, thus reducing the negative effects of occlusion and pose variation.
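The two operations above, channel-wise ECA gating followed by the 4-way spatial split, can be sketched in NumPy as follows. This is a minimal illustrative sketch, not the authors' implementation: the 1D convolution weights are random stand-ins for parameters that would be learned during training, and the adaptive kernel-size rule follows the standard ECA formulation.

```python
import numpy as np

def eca_attention(x, gamma=2, b=1):
    """Efficient channel attention over a (C, H, W) feature map.

    The 1D-conv kernel size k is chosen adaptively from the channel
    count C, as in the standard ECA formulation.
    """
    c = x.shape[0]
    t = int(abs((np.log2(c) + b) / gamma))
    k = t if t % 2 else t + 1          # nearest odd kernel size
    # squeeze: global average pooling over the spatial dimensions
    y = x.mean(axis=(1, 2))            # (C,)
    # cross-channel interaction via a 1D convolution
    # (random weights here; learned in the real model)
    w = np.random.randn(k)
    pad = k // 2
    y_pad = np.pad(y, pad, mode="edge")
    conv = np.array([np.dot(y_pad[i:i + k], w) for i in range(c)])
    # excitation: sigmoid gate, then rescale each channel
    gate = 1.0 / (1.0 + np.exp(-conv))  # values in (0, 1)
    return x * gate[:, None, None]

def split_into_quadrants(x):
    """Divide a (C, H, W) map into 4 non-overlapping local regions."""
    _, h, w = x.shape
    return [x[:, :h // 2, :w // 2], x[:, :h // 2, w // 2:],
            x[:, h // 2:, :w // 2], x[:, h // 2:, w // 2:]]
```

Each of the 4 region maps would then be fed to its own mixed-attention branch and classifier; only the channel gating and the spatial partitioning are shown here.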
Since our method jointly trains 4 classifiers for local feature learning, a decision-level fusion strategy is used for the final prediction: the output probabilities of the 4 classifiers are summed, and the category with the maximum summed probability is taken as the predicted category. Result Experiments were conducted on 2 in-the-wild expression datasets, RAF-DB (Real-world Affective Faces Database) and FERPlus (Face Expression Recognition Plus). Ablation experiments show that our method gains 1.89% and 2.47% over the base model on the 2 datasets, respectively. On RAF-DB, the recognition accuracy is 89.24%, a 0.84% improvement over MA-Net; on FERPlus, the recognition accuracy is 90.04%, comparable to FER-VT. These results indicate that our method is robust. We also evaluated the model trained on RAF-DB on the FED-RO dataset, which contains real occlusions, and achieved an accuracy of 67.60%. To demonstrate the effectiveness of the method more intuitively, we use Grad-CAM++ to visualize the attention heatmaps of the proposed model. The visualization of the local feature joint learning module shows that it directs the overall model to focus on the features in each local image block that are useful for classification. Conclusion In general, the proposed method is guided by the attention mechanism: it first enhances the global features and then learns the salient features of the local regions, and this global-enhancement-then-local-refinement learning order effectively reduces the interference of partial occlusion. Experiments on two in-the-wild datasets and an occlusion test set show that the model is simple, effective and robust.
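The decision-level fusion rule described above (sum the 4 probability vectors, then take the argmax) can be sketched in a few lines. The classifier outputs below are hypothetical numbers for illustration only.

```python
import numpy as np

def fuse_predictions(probs_per_classifier):
    """Decision-level fusion: sum the class-probability vectors of the
    local-branch classifiers and take the argmax as the final label."""
    summed = np.sum(probs_per_classifier, axis=0)
    return int(np.argmax(summed))

# Hypothetical outputs of the 4 local classifiers over 3 classes
p = [np.array([0.1, 0.6, 0.3]), np.array([0.5, 0.2, 0.3]),
     np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.7, 0.2])]
print(fuse_predictions(p))  # summed = [0.9, 2.0, 1.1] -> class 1
```

Summing probabilities before the argmax lets branches that see unoccluded regions outvote a branch whose region is occluded, which is the intent of the fusion strategy.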