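Abstract: Objective Facial expression recognition and facial landmark detection are two closely related tasks; landmark positions and the dynamic changes of the surrounding texture reflect facial [...] the internal connection between them. To solve the above problem and to use the intermediate features of landmark detection for expression recognition, a multi-task deep framework is proposed. Method First, [...] In addition, an intermediate supervision layer is introduced so that the network learns more expression-related features, which also helps improve landmark detection under large expressions. Result On three public datasets (CK+, Oulu-CASIA and MMI), the proposed method is compared with classic methods from recent years. Its recognition accuracy on the CK+ dataset reaches the highest value to date, and its recognition accuracy on the Oulu-CASIA and MMI datasets improves on the current best methods by 0.14% and 0.54%, respectively. Conclusion The experimental results [...]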
Abstract: Objective Facial landmark localization and facial expression recognition are highly correlated. The coordinates of the same landmark differ across expressions, and an expression can in turn be recognized from a combination of facial landmarks, such as the points around the eyes and mouth. Building on this, researchers have combined the two tasks with different strategies. For facial expression recognition, they extract geometric features from given landmarks or extract only texture information; for landmark localization, facial expression is taken into account when building a 3D Face Morphable Model. However, most of these studies combine the two tasks directly and ignore the internal connection between them. To solve this problem, a multi-task deep framework is proposed. Method First, a deep framework with an inception structure is designed to detect landmarks and recognize expressions at the same time. Under the supervision of both tasks, the model pays more attention to the information around facial landmarks, so that features around the facial organs produce larger responses (landmarks on the outer contour are discarded). However, irrelevant facial areas contain noise that lowers the accuracy of facial expression recognition. To alleviate this problem and make full use of the landmarks detected in the first stage, a location attention map is generated to filter out noise in the facial edge region. A series of heat maps is generated from the detected landmarks, one map per point: each map is a Gaussian distribution whose mean is the coordinate of the point and whose variance is chosen appropriately. Max-pooling along the channel dimension then merges these heat maps into one. The resulting location attention map can be regarded as a weight matrix in which points around the facial organs receive larger values and all other points receive smaller ones.
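The heat-map construction described above can be sketched in a few lines. This is only an illustrative sketch in plain Python, not the authors' implementation: the grid size, the landmark coordinates, the value of sigma, and the function names gaussian_heatmap and location_attention_map are all assumptions introduced here. Each landmark yields one unnormalised Gaussian heat map, and the maps are merged by an element-wise maximum across the channel (per-landmark) axis.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma):
    """One heat map: an unnormalised 2D Gaussian centred on landmark (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def location_attention_map(height, width, landmarks, sigma=2.0):
    """Merge the per-landmark heat maps with a max over the channel axis."""
    heatmaps = [gaussian_heatmap(height, width, cx, cy, sigma)
                for cx, cy in landmarks]
    return [
        [max(h[y][x] for h in heatmaps) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical landmarks (x, y) on a 16x16 grid; sigma=2.0 is an assumed value.
landmarks = [(4, 5), (11, 5), (8, 11)]
attention = location_attention_map(16, 16, landmarks, sigma=2.0)
```

In this sketch the weight equals 1 at each landmark and decays smoothly with distance, so positions far from every landmark, such as the corners of the grid, receive weights close to zero.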
Finally, the location attention map is multiplied with the corresponding feature maps, and the processed features are used for further feature extraction and classification. Another problem is that facial expressions deform local facial areas (for example, the mouth opens wide and the eyes narrow during laughter), which makes facial landmark localization more difficult. To alleviate this problem, intermediate supervision is introduced into landmark localization by adding a facial expression recognition task with a small weight, which is similar to adding facial expression information when building a 3D Face Morphable Model. On the one hand, the results of landmark localization under complicated expressions can be improved to some extent; on the other hand, the network can extract more expression-related information. Result To verify the effectiveness of the proposed method, its results are compared with several classic methods on three popular databases: CK+, Oulu-CASIA and MMI. Recognition accuracy on the CK+ database is the highest reported to date, and recognition accuracy on the Oulu-CASIA and MMI databases improves on the current best methods by 0.14% and 0.54%, respectively. In addition, an ablation study is conducted to show that each proposed module is effective. First, the added facial landmark detection task is shown to be effective: the multi-task network achieves higher recognition accuracy than the single-task network under the same settings. Similarly, the multi-task network combined with the location attention map outperforms the multi-task network alone. Finally, the proposed intermediate supervision further improves recognition accuracy over the combination of the multi-task network and the location attention map. Conclusion Facial expression recognition is combined with facial landmark localization in view of the correlation between them. A multi-task network is designed to recognize expressions and localize landmarks at the same time, which avoids using detected landmarks directly. The generated location attention map is multiplied with the corresponding feature maps so that irrelevant information is filtered out. Intermediate supervision helps improve the results of facial landmark localization under complicated expressions.
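One way to read the training objective implied above is as a weighted sum of three terms: the final expression loss, the landmark loss, and a small-weight auxiliary expression loss attached to the landmark stage as intermediate supervision. The sketch below is a plain-Python illustration under stated assumptions, not the authors' formulation: the abstract does not give the loss functions or the weights, so the use of softmax cross-entropy and mean squared error, the weights 1.0 and 0.1, and the names cross_entropy, landmark_mse and joint_loss are all hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def landmark_mse(pred, target):
    """Mean squared error over flattened landmark coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(final_logits, aux_logits, label, pred_pts, true_pts,
               landmark_weight=1.0, aux_weight=0.1):
    """Expression loss + landmark loss + small-weight auxiliary expression loss.

    The default weights are assumed values; the abstract only says "small".
    """
    return (cross_entropy(final_logits, label)
            + landmark_weight * landmark_mse(pred_pts, true_pts)
            + aux_weight * cross_entropy(aux_logits, label))


# Hypothetical single sample with three expression classes and two landmarks.
loss = joint_loss(
    final_logits=[2.0, 0.5, -1.0],
    aux_logits=[0.0, 0.0, 0.0],
    label=0,
    pred_pts=[0.5, 0.5, 0.25, 0.75],
    true_pts=[0.5, 0.5, 0.5, 0.5],
)
```

Keeping aux_weight small lets the expression signal shape the intermediate features without dominating the landmark objective; setting it to zero recovers a plain two-task loss.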