Attention-guided three-stream convolutional neural network for microexpression recognition

Zhao Minghua; Dong Shuangshuang; Hu Jing; Du Shuangli; Shi Cheng; Li Peng; Shi Zhenghao

Published: 2024-01-16
DOI: 10.11834/jig.230053
2024 | Volume 29 | Number 1

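Zhao Minghua¹,², Dong Shuangshuang¹, Hu Jing¹, Du Shuangli¹, Shi Cheng¹, Li Peng¹, Shi Zhenghao¹ (1. School of Computer Science and Engineering, Xi'an University of Technology, Xi'an 710048, China; 2. Shaanxi Key Laboratory of Network Computing and Security Technology, Xi'an 710048, China)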

Abstract

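Objective Microexpression recognition has important application value in fields such as psychological counseling, lie detection, and intention analysis. However, because microexpressions involve small movements and last only a short time, recognition performance still has considerable room for improvement. To advance microexpression recognition, this paper proposes an attention-guided three-stream convolutional neural network (ATSCNN). Method First, the onset frame and apex frame of every microexpression sequence are preprocessed. Then, the TV-L1 (total variation-L1) energy functional is used to extract the optical flow between the two frames. Next, in the feature extraction stage, to counter the overfitting caused by the limited number of samples, three identical shallow convolutional neural networks separately extract features from the three optical flow inputs, and a convolutional block attention module is introduced to focus on important information and suppress irrelevant information, improving recognition performance. Finally, the extracted features are fed into a fully connected layer for classification. In addition, the whole architecture uses the SELU (scaled exponential linear unit) activation function to speed up convergence. Result Under LOSO (leave-one-subject-out) cross-validation on the combined microexpression dataset, the unweighted average recall (UAR) and unweighted F1-score (UF1) reached 0.7351 and 0.7205, respectively. Compared with Dual-Inception, the best-performing of the compared methods, UAR and UF1 improved by 0.0607 and 0.0683, respectively. The experimental results confirm the feasibility of the proposed method. Conclusion The proposed microexpression recognition network effectively alleviates overfitting while achieving state-of-the-art recognition results on small-scale microexpression datasets.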
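The abstract reports UAR and UF1 without defining them. The following is a minimal sketch of how the two scores are commonly computed, assuming the usual definitions (recall and F1 calculated per class and then averaged with equal weight, using predictions pooled over all LOSO folds). The function name `uar_uf1` and the class labels are illustrative and do not come from the paper.

```python
def uar_uf1(y_true, y_pred, classes):
    """Return (UAR, UF1): per-class recall and F1, averaged with equal class weight."""
    recalls, f1_scores = [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        # Guard against empty classes to avoid division by zero.
        recalls.append(tp / (tp + fn) if (tp + fn) else 0.0)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    n = len(classes)
    return sum(recalls) / n, sum(f1_scores) / n


# Toy example with the three emotion classes named in the abstract.
labels = ["negative", "positive", "surprise"]
truth = ["negative", "negative", "positive", "surprise"]
preds = ["negative", "positive", "positive", "surprise"]
uar, uf1 = uar_uf1(truth, preds, labels)
```

Because every class contributes equally, a rare class counts as much as a frequent one, which is why these scores are preferred over plain accuracy on imbalanced microexpression data.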

Keywords

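microexpression recognition; optical flow; three-stream convolutional neural network; convolutional block attention module (CBAM); SELU activation function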

Attention-guided three-stream convolutional neural network for microexpression recognition

Zhao Minghua¹,², Dong Shuangshuang¹, Hu Jing¹, Du Shuangli¹, Shi Cheng¹, Li Peng¹, Shi Zhenghao¹ (1. School of Computer Science and Engineering, Xi'an University of Technology, Xi'an 710048, China; 2. Shaanxi Key Laboratory of Network Computing and Security Technology, Xi'an 710048, China)

Abstract

Objective Microexpression recognition has considerable application value in fields such as psychological counseling, lie detection, and intention analysis. Unlike macro-expressions, which are produced consciously, microexpressions are produced unconsciously and often occur in high-stakes situations. They are characterized by small movement amplitudes and short durations, and they usually affect only local facial areas. These characteristics make microexpression recognition difficult. Traditional methods used in early research are mainly based on local binary patterns or on optical flow. The former effectively extract the texture features of microexpressions, whereas the latter compute pixel changes in the temporal domain and the relationship between adjacent frames, providing rich, key input information. Although traditional methods based on texture features and optical flow features made good progress in early microexpression recognition, they are often costly and leave room for improvement in recognition accuracy and robustness. Later, with the development of machine learning, microexpression recognition based on deep learning gradually became the mainstream of research in this field. These methods use neural networks to extract features from input image sequences after a series of preprocessing operations (facial cropping and alignment, and grayscale conversion) and classify them to obtain the final recognition result. Deep learning has substantially improved recognition performance. However, given the characteristics of microexpressions, recognition accuracy can still be improved considerably, and the limited scale of existing microexpression datasets also restricts recognition performance.
To solve these problems, this paper proposes an attention-guided three-stream convolutional neural network (ATSCNN) for microexpression recognition.

Method First, because the motion changes between adjacent frames of a microexpression are very subtle, only the two key frames (onset frame and apex frame) are preprocessed, with operations such as facial alignment and cropping, to obtain single-channel grayscale images with a resolution of 128 × 128 pixels. This reduces redundant information and computation while preserving the important features of the microexpression, and it reduces the influence of nonfacial areas on recognition. Then, because optical flow captures representative motion features between the two frames, offers a higher signal-to-noise ratio than the raw data, and provides rich, critical input features for the network, the total variation-L1 (TV-L1) energy functional is used to extract optical flow features between the two frames: the horizontal component of the optical flow, the vertical component of the optical flow, and the optical strain. Next, in the feature extraction stage, to overcome the overfitting caused by the limited sample size, three identical four-layer convolutional neural networks extract features from the horizontal component, the vertical component, and the optical strain, respectively, thereby improving network performance. The input channel numbers of the four convolutional layers are 1, 3, 5, and 8, and the output channel numbers are 3, 5, 8, and 16.
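The abstract names optical strain as the third input stream but does not give its formula. The sketch below assumes the definition commonly used in microexpression work, in which the strain tensor is built from the spatial derivatives of the flow field and its magnitude is taken per pixel. It assumes the horizontal and vertical flow components `u` and `v` have already been computed (for example with a TV-L1 solver); the function name `optical_strain` and the use of NumPy finite differences are illustrative choices, not the authors' implementation.

```python
import numpy as np


def optical_strain(u, v):
    """Per-pixel optical strain magnitude from horizontal (u) and vertical (v) flow."""
    # np.gradient returns the derivative along axis 0 (rows, y) first,
    # then along axis 1 (columns, x).
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    e_xx = du_dx                   # normal strain along x
    e_yy = dv_dy                   # normal strain along y
    e_xy = 0.5 * (du_dy + dv_dx)   # shear strain (symmetric, so e_yx == e_xy)
    return np.sqrt(e_xx ** 2 + e_yy ** 2 + 2.0 * e_xy ** 2)


# Synthetic 128 x 128 flow field: pure horizontal stretching, u(x, y) = x.
u = np.tile(np.arange(128, dtype=float), (128, 1))
v = np.zeros((128, 128))
strain = optical_strain(u, v)

# Each of the three maps would feed one single-channel stream of the network.
streams = [u[None, ...], v[None, ...], strain[None, ...]]
```

A rigid shift of the whole face gives constant `u` and `v` and therefore zero strain, so this map highlights local deformation and ignores global motion.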
Afterward, because the image sequences in the microexpression datasets used in this paper inevitably contain some redundant information other than the face, a convolutional block attention module (CBAM), in which channel attention and spatial attention are connected in series, is added after the shallow convolutional neural network in each stream. The module focuses on the important information in the input and suppresses irrelevant information along both the channel and spatial dimensions, thereby enhancing the network's ability to obtain effective features and improving recognition performance. Finally, the extracted features are fed into a fully connected layer to classify the microexpression emotion as negative, positive, or surprise. In addition, the entire architecture uses the scaled exponential linear unit (SELU) activation function to avoid the potential dying-neuron and vanishing-gradient problems of the commonly used rectified linear unit (ReLU) and to speed up convergence.

Result Experiments were conducted on the combined microexpression dataset using the leave-one-subject-out (LOSO) cross-validation strategy. In this strategy, each subject in turn serves as the test set, and the samples of all remaining subjects are used for training. This validation method makes full use of the samples, reflects generalization to unseen subjects, and is the most commonly used in current microexpression recognition research. The unweighted average recall (UAR) and unweighted F1-score (UF1) reached 0.7351 and 0.7205, respectively. Compared with the Dual-Inception model, which performed best among the compared methods, UAR and UF1 increased by 0.0607 and 0.0683, respectively.
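The LOSO protocol described above can be sketched in a few lines. This is a minimal illustration under the assumption that each sample carries a subject identifier; the function name `loso_splits` and the subject IDs are hypothetical, and the training and evaluation steps are left out.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        # All samples of the held-out subject form the test set ...
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        # ... and the samples of every other subject form the training set.
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, test_idx


# Six samples from three subjects give three folds.
subjects = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(loso_splits(subjects))
```

Every sample is tested exactly once, and no subject appears in both the training and test sets of the same fold, which keeps identity-specific appearance from leaking into the evaluation.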
To further verify the effectiveness of the proposed ATSCNN architecture, several ablation experiments were also conducted on the combined dataset, and the results confirmed the feasibility of the method.

Conclusion The proposed microexpression recognition network effectively alleviates overfitting, focuses on the important information in microexpressions, and achieves state-of-the-art (SOTA) recognition performance on small-scale microexpression datasets under LOSO cross-validation, outperforming the other mainstream models compared. The ablation experiments make the proposed method more convincing. In conclusion, the proposed method markedly improves the effectiveness of microexpression recognition.

Keywords

microexpression recognition; optical flow; three-stream convolutional neural network; convolutional block attention module (CBAM); SELU activation function
