赵明华, 董爽爽, 胡静, 都双丽, 石程, 李鹏, 石争浩(西安理工大学)
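Objective In recent years, micro-expression recognition has shown significant application value in fields such as psychological counseling, lie detection, and intention analysis. However, because micro-expressions involve small motion amplitudes and short durations, recognition performance still has considerable room for improvement. To further advance micro-expression recognition, this paper proposes an attention-guided three-stream convolutional neural network (ATSCNN). Method First, the onset frame and apex frame of every micro-expression sequence are preprocessed. Then, the TV-L1 energy functional is used to extract the optical flow between the two frames. Next, in the feature extraction stage, to overcome the overfitting caused by the limited sample size, three identical shallow convolutional neural networks extract features from the three optical flow inputs separately, and a convolutional block attention module is introduced to focus on important information while suppressing irrelevant information, improving recognition performance. Finally, the extracted features are fed into a fully connected layer for classification. In addition, the entire architecture uses the SELU activation function to speed up convergence. Result Under leave-one-subject-out (LOSO) cross-validation on the combined micro-expression dataset, the unweighted average recall (UAR) and unweighted F1-score (UF1) reach 0.7351 and 0.7205, respectively. Compared with Dual-Inception, the best-performing of the compared methods, UAR and UF1 improve by 0.0607 and 0.0683, respectively. The experimental results confirm the feasibility of the proposed method. Conclusion The proposed micro-expression recognition network effectively alleviates overfitting while achieving advanced recognition performance on small-scale micro-expression datasets.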
Micro-expression Recognition Based on Three-Stream Convolutional Neural Network and Attention Mechanism
Zhao Minghua, Dong Shuangshuang, Hu Jing, Du Shuangli, Shi Cheng, Li Peng, Shi Zhenghao (Xi'an University of Technology)
Objective Micro-expression recognition has significant application value in fields such as psychological counseling, lie detection, and intention analysis. Unlike macro-expressions, which are produced consciously, micro-expressions are produced unconsciously and often occur in high-stakes scenarios. They are characterized by small motion amplitudes and short durations, and they usually affect only local facial areas; these properties make them difficult to recognize. Early research relied mainly on traditional methods based on local binary patterns or on optical flow. The former effectively extract the texture features of micro-expressions, while the latter compute pixel changes in the temporal domain and the relationship between adjacent frames, providing rich and critical input information. Although these traditional methods made good progress in early micro-expression recognition, they are often costly and leave room for improvement in recognition accuracy and robustness. With the development of machine learning, deep-learning-based micro-expression recognition has gradually become the mainstream of research in this field. Such methods typically apply a series of preprocessing operations (facial cropping and alignment, grayscale conversion, etc.) to the input image sequence, use a neural network to extract features, and classify them to obtain the final recognition result. Deep learning has significantly improved the recognition performance of micro-expressions. Nevertheless, given the characteristics of micro-expressions themselves, recognition accuracy still has considerable room for improvement, and the limited scale of existing micro-expression datasets further restricts recognition performance.
To address these problems, this paper proposes an attention-guided three-stream convolutional neural network (ATSCNN) for micro-expression recognition. Method First, because the motion changes between adjacent frames of a micro-expression are very subtle, only the two key frames (the onset frame and the apex frame) are preprocessed, which reduces redundant information and computation in the image sequence while preserving the important features of the micro-expression. Facial alignment and cropping yield single-channel grayscale images with a resolution of 128 × 128 pixels and reduce the influence of non-facial areas on recognition. Second, optical flow captures representative motion features between the two frames, offers a higher signal-to-noise ratio than the raw data, and provides rich and critical input features for the network. This paper therefore uses the TV-L1 energy functional to extract three optical flow features between the two frames: the horizontal component of the optical flow, the vertical component of the optical flow, and the optical strain. Third, in the feature extraction stage, to overcome the overfitting caused by the limited sample size, three identical four-layer convolutional neural networks extract features from the horizontal component, the vertical component, and the optical strain, respectively, thereby improving network performance. The input channel numbers of the four convolutional layers are 1, 3, 5, and 8, and the output channel numbers are 3, 5, 8, and 16.
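The optical strain input mentioned above can be derived from the two flow components. The sketch below is illustrative only: it assumes the commonly used definition of optical strain magnitude as the norm of the symmetric flow-gradient tensor, which the abstract does not spell out, and it approximates derivatives with simple finite differences. The function names `_diff` and `optical_strain` are hypothetical, and the flow fields `u` and `v` are assumed to come from a TV-L1 solver that is not shown.

```python
import math


def _diff(field, row, col, axis):
    """Finite difference of a 2-D list at (row, col).

    axis=0 differentiates along y (rows), axis=1 along x (columns).
    Central differences are used in the interior, one-sided at borders.
    """
    rows, cols = len(field), len(field[0])
    if axis == 0:
        lo, hi = max(row - 1, 0), min(row + 1, rows - 1)
        return (field[hi][col] - field[lo][col]) / (hi - lo) if hi > lo else 0.0
    lo, hi = max(col - 1, 0), min(col + 1, cols - 1)
    return (field[row][hi] - field[row][lo]) / (hi - lo) if hi > lo else 0.0


def optical_strain(u, v):
    """Per-pixel optical strain magnitude from horizontal (u) and vertical (v) flow.

    Assumed definition (not stated in the abstract):
        e_xx = du/dx, e_yy = dv/dy, e_xy = e_yx = 0.5 * (du/dy + dv/dx)
        strain = sqrt(e_xx^2 + e_yy^2 + e_xy^2 + e_yx^2)
    """
    rows, cols = len(u), len(u[0])
    strain = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            e_xx = _diff(u, r, c, axis=1)  # normal strain along x
            e_yy = _diff(v, r, c, axis=0)  # normal strain along y
            e_xy = 0.5 * (_diff(u, r, c, axis=0) + _diff(v, r, c, axis=1))  # shear
            strain[r][c] = math.sqrt(e_xx ** 2 + e_yy ** 2 + 2.0 * e_xy ** 2)
    return strain


# Toy 3 x 3 flow field: horizontal stretching (u grows with x), no vertical motion.
u = [[float(c) for c in range(3)] for _ in range(3)]
v = [[0.0] * 3 for _ in range(3)]
strain_map = optical_strain(u, v)
```

Under this definition, rigid motion such as a uniform translation or a pure rotation produces zero strain, so the strain map responds only to local facial deformation. An array library would be the usual choice for 128 × 128 inputs; plain lists are used here only to keep the sketch dependency-free.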
Afterwards, because the image sequences in the micro-expression datasets used in this paper inevitably contain some redundant information beyond the face, a convolutional block attention module, in which channel attention and spatial attention are connected in series, is added after the shallow convolutional neural network in each stream. The module focuses on the important information in the input and suppresses irrelevant information along both the channel and spatial dimensions, which enhances the network's ability to extract effective features and improves recognition performance. Finally, the extracted features are fed into a fully connected layer to classify the micro-expression emotion as negative, positive, or surprise. In addition, the entire architecture uses the SELU activation function to overcome the potential dying-neuron and vanishing-gradient problems of the commonly used ReLU activation function and to speed up convergence. Result Experiments were conducted on the combined micro-expression dataset using the leave-one-subject-out (LOSO) cross-validation strategy, in which each subject in turn serves as the test set and the samples of all remaining subjects are used for training. This validation method makes full use of the samples, gives some indication of generalization ability, and is the most commonly used validation method in current micro-expression recognition research. The proposed method achieved an unweighted average recall (UAR) of 0.7351 and an unweighted F1-score (UF1) of 0.7205. Compared with Dual-Inception, the best-performing of the compared methods, UAR and UF1 increased by 0.0607 and 0.0683, respectively.
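The LOSO protocol and the two metrics described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes that predictions from all folds are pooled before UAR and UF1 are computed once, which the abstract does not state, and the names `loso_splits`, `uar_uf1`, `evaluate_loso`, and the `fit_predict` callback are hypothetical.

```python
from collections import defaultdict


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    by_subject = defaultdict(list)
    for idx, subject in enumerate(subject_ids):
        by_subject[subject].append(idx)
    for subject in sorted(by_subject):
        test_idx = by_subject[subject]
        held_out = set(test_idx)
        train_idx = [i for i in range(len(subject_ids)) if i not in held_out]
        yield subject, train_idx, test_idx


def uar_uf1(y_true, y_pred, classes):
    """Return (UAR, UF1): per-class recall and F1, averaged with equal class weight."""
    recalls, f1_scores = [], []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    return sum(recalls) / len(classes), sum(f1_scores) / len(classes)


def evaluate_loso(subject_ids, labels, fit_predict, classes):
    """Pool held-out predictions from every fold, then score once (assumed protocol).

    `fit_predict(train_idx, test_idx)` is a hypothetical callback that trains on
    the training indices and returns one predicted label per test index.
    """
    pooled_true, pooled_pred = [], []
    for _, train_idx, test_idx in loso_splits(subject_ids):
        pooled_true.extend(labels[i] for i in test_idx)
        pooled_pred.extend(fit_predict(train_idx, test_idx))
    return uar_uf1(pooled_true, pooled_pred, classes)


# Toy data: three subjects, two samples each, and fixed stand-in predictions.
CLASSES = ["negative", "positive", "surprise"]
subjects = ["s1", "s1", "s2", "s2", "s3", "s3"]
labels = ["negative", "positive", "negative", "surprise", "positive", "surprise"]
stand_in = ["negative", "positive", "negative", "negative", "positive", "surprise"]

uar, uf1 = evaluate_loso(
    subjects, labels, lambda train, test: [stand_in[i] for i in test], CLASSES
)
```

Because both metrics average over classes rather than over samples, a minority class such as surprise counts as much as the majority class, which matters on small, imbalanced micro-expression datasets.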
To further verify the effectiveness of the proposed ATSCNN architecture, several ablation experiments were also conducted on the combined dataset, and the results confirmed the feasibility of the method. Conclusion The proposed micro-expression recognition network effectively alleviates overfitting, focuses on the important information in micro-expressions, and, under LOSO cross-validation, achieves state-of-the-art recognition performance on small-scale micro-expression datasets compared with other mainstream models. The ablation results lend further support to the proposed method. In conclusion, the proposed method significantly improves micro-expression recognition performance.