Published: 2019-05-16
DOI: 10.11834/jig.180462
2019 | Volume 24 | Number 5

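Image Analysis and Recognition

Spontaneous expression classification on small data samples with deep transfer networks

Fu Xiaofeng, Wu Jun, Niu Li

School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018

Received: 2018-07-23; Revised: 2018-11-24

Funding: National Natural Science Foundation of China (61672199, 61572161); Science and Technology Program of Zhejiang Province, 2018 Key Research and Development Program (2018C01030); Natural Science Foundation of Zhejiang Province (Y1110232)

First author: Fu Xiaofeng, born in 1981, female, associate professor, Ph.D. Her main research interests are computer vision, image processing and pattern recognition. E-mail: fuxiaofeng@hdu.edu.cn;
Niu Li, male, master's student. His main research interests are computer vision, image processing and pattern recognition. E-mail: niulihdu@qq.com.

CLC number: TP301.6

Document code: A

Article ID: 1006-8961(2019)05-0753-09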

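Abstract

Objective Compared with traditional expressions, spontaneous expressions better reveal a person's true emotions and have great application potential in fields such as national security and medical care. Because spontaneous expressions are hard to induce and samples are hard to collect, the available data are scarce. To classify spontaneous expressions, this paper draws on neural network learning methods, which are applied in a growing number of scenarios, and proposes a classification method based on deep transfer networks. Method To preserve the features of the original spontaneous expression images, no data augmentation is used even though the datasets are small, and three-dimensional optical flow feature images are used as comparison samples. The samples are fed into different transfer network models for training, and trained networks of the same structure are then combined into an isomorphic network that outputs the final result, thereby classifying the spontaneous expression. Result Experiments show that the proposed method classifies spontaneous expressions well on different databases. On the public spontaneous expression databases CASME, CASMEⅡ and CAS(ME)², the average test accuracy reaches 94.3%, 97.3% and 97.2%, respectively, 7% higher than the best previously reported result. Conclusion This paper applies transfer learning to spontaneous expression classification, compares different network models and different sample types, and achieves the best average accuracy reported so far for spontaneous expression classification.

Keywords

spontaneous expression; transfer learning; classification; neural network; isomorphic network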

Classification of small spontaneous expression database based on deep transfer learning network

Fu Xiaofeng, Wu Jun, Niu Li

School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China

Supported by: National Natural Science Foundation of China(61672199, 61572161)

Abstract

Objective Expression is important in human-computer interaction. As a special kind of expression, spontaneous expression has a shorter duration and weaker intensity than traditional expression. Spontaneous expressions can reveal a person's true emotions and have immense potential in security applications and medical diagnosis. Identifying the categories of spontaneous expression can therefore make human-computer interaction smoother and fundamentally change the relationship between people and computers. Because spontaneous expressions are difficult to induce and collect, spontaneous expression datasets are too small to train a new deep neural network; each database contains only on the order of ten thousand samples. Convolutional neural networks show excellent performance and are widely used in many scenarios; for instance, they classify spontaneous expressions more accurately than traditional feature extraction methods. Method This study proposes a method based on different deep transfer network models for classifying spontaneous expressions. To preserve the characteristics of the original spontaneous expressions, we do not use data augmentation, which also reduces the risk of convergence problems. Training samples consisting of three-dimensional images composed of optical flow and grayscale images are compared with the original RGB images. The three-dimensional image contains spatial information and temporal displacement information. We compare three network models on the different samples. The first model is based on AlexNet and changes only the number of output-layer neurons, which is set to the number of spontaneous expression categories. 
Then, the network is fine-tuned to obtain the best training and testing results by fixing the parameters of different layers in turn. The second model is based on InceptionV3. Two fully connected layers, with 512 neurons and as many neurons as there are spontaneous expression categories, respectively, are added after its output, so only the parameters of these two layers need to be fine-tuned. In InceptionV3, network depth increases while the number of parameters decreases because 3×3 convolution kernels replace the 7×7 convolution kernel. The third model is based on Inception-ResNet-v2. As with the first model, we change only the number of output-layer neurons. Finally, an isomorphic network model is proposed to classify spontaneous expressions. The model is composed of two transfer learning networks of the same type that are trained on different samples, and it takes the maximum of their outputs as the final output. The isomorphic network makes decisions with high accuracy: when both constituent networks give the same output, that output is very likely to be correct, and when the outputs differ, we take, from the perspective of probability, the maximum as the predicted value. Result Experimental results indicate that the proposed method exhibits excellent classification performance on different samples. The single-network outputs show that the features extracted from RGB images are as effective as those extracted from the three-dimensional optical flow images. This result indicates that the spatiotemporal features extracted by the optical flow method can be replaced by features extracted by the deep neural network. It also shows that, to a certain degree, features extracted by the neural network can compensate for missing information, such as the temporal features absent from RGB images or the color features absent from OF+ images. 
The high average accuracy of a single network indicates that it has good testing performance on each dataset. Networks with high complexity perform well because the spontaneous expression samples can train the deep transfer learning network effectively. The proposed models achieve state-of-the-art performance, with an average accuracy of over 96%. Analysis of the isomorphic network shows that its performance is in some cases no better than that of a single network: a single network already classifies spontaneous expressions with high confidence, so the isomorphic network cannot easily improve the average accuracy. The Titan Xp used for this research was donated by the NVIDIA Corporation. Conclusion Compared with traditional expressions, spontaneous expressions change subtly and their features are difficult to extract. In this study, different transfer learning networks are applied to classify spontaneous expressions, and the testing accuracies of different networks trained on different kinds of samples are compared. Experimental results show that, compared with traditional methods, deep learning has clear advantages in spontaneous expression feature extraction. The findings also indicate that deep networks can extract complete features from spontaneous expressions and are robust across databases, as shown by the good testing results. In the future, we will extract spontaneous expressions directly from videos and classify them with high accuracy by removing distracting occurrences, such as blinking.

Key words

spontaneous expression; transfer learning; classification; neural networks; isomorphic network

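0 Introduction

Spontaneous expressions, which have great application potential in fields such as national security and medical care, share with traditional expressions^[1] the six basic categories^[2] of happiness, sadness, fear, disgust, anger and surprise.

Compared with traditional expressions, which can be deceptive^[3], spontaneous expressions are brief, involuntary and hard to control by willpower^[4-5]. Current research on spontaneous expressions faces the following problems^[6]: 1) spontaneous expressions are low in intensity, short in duration and hard to distinguish; 2) the conditions for eliciting them are demanding, so few spontaneous expression databases exist; 3) they are easily affected by objective factors such as illumination and by subjective factors such as unnecessary movements of the subject.

Mohammadi et al.^[7] used a sparse representation that decomposes a spontaneous expression into a sparse expression component and the subject's neutral-face component, analyzed spontaneous expressions through different action units (AUs), and improved AU detection accuracy. Li et al.^[8] extracted unsupervised image features through nonnegative spectral analysis and redundancy control, and proposed a robust structured nonnegative matrix factorization algorithm^[9] that uses a block-diagonal structure to remove interference when extracting image features; these features perform well in both feature representation and image classification.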

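In recent years, with the rise of artificial intelligence and the rapid development of deep learning, deep learning has increasingly outperformed traditional learning methods on image tasks. The main developments are as follows. AlexNet^[10] was proposed in 2012 and achieved the best result in that year's ImageNet recognition challenge. It consists of five convolutional layers and three fully connected layers; each layer matters, and removing any one of them lowers the recognition rate. However, the model has many parameters to train, which makes retraining on a small database difficult and prone to overfitting. InceptionV3^[11] is a larger network proposed by Google, with a more complex structure than AlexNet. In InceptionV3, large convolution kernels are replaced by small ones, for example a 7×7 kernel is replaced by three 3×3 kernels; the network becomes deeper, yet it has fewer parameters and classifies better. In 2016, Google proposed Inception-ResNet-v2^[12], which adds residual modules and batch normalization. This addresses the performance degradation that appears as networks grow deeper and reduces internal covariate shift, speeding up convergence while keeping the depth.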
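The parameter saving from replacing one 7×7 kernel with three 3×3 kernels can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that every layer has the same number of input and output channels (the hypothetical `channels` argument) and ignores biases; it is not taken from the InceptionV3 implementation.

```python
def conv_weights(kernel_size: int, channels: int) -> int:
    """Weights of one square convolution with equal in/out channels, no bias."""
    return kernel_size * kernel_size * channels * channels


def stacked_3x3_weights(num_layers: int, channels: int) -> int:
    """Weights of `num_layers` consecutive 3x3 convolutions."""
    return num_layers * conv_weights(3, channels)


def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


channels = 64  # hypothetical channel count, chosen only for the example
print(conv_weights(7, channels))         # 200704 weights for one 7x7 layer
print(stacked_3x3_weights(3, channels))  # 110592 weights for three 3x3 layers
print(receptive_field(3))                # 7, the same receptive field as 7x7
```

Under these assumptions the stacked version needs 27/49 of the weights of the single 7×7 layer while covering the same 7×7 receptive field, at the cost of two extra layers of depth.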

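Deep learning is also widely used in spontaneous expression classification. Peng et al.^[13] proposed a heterogeneous convolutional network for recognizing spontaneous expressions, composed of transfer-learned VGG and ResNet networks; it classifies spontaneous expressions better than methods such as LBP (local binary pattern) and SVM (support vector machine). Takalkar et al.^[14] used deep learning to extract features of spontaneous facial expressions from small databases, retrained and tuned a small neural network of their own design, and, with database mixing and data augmentation, obtained good classification results.

Existing spontaneous expression databases are all small, several orders of magnitude smaller than ImageNet, currently the largest image recognition database. Retraining the network models mentioned above on one such database is therefore costly and unlikely to match the results obtained by training on large-scale data. Methods exist that first learn the structure and strength of the convolution filters on a small database^[15] and then train the network model, but they are highly task-specific, and performance drops sharply when they are transferred to a different kind of data.

Therefore, in the absence of a spontaneous expression database as large as ImageNet, transfer learning makes it possible to exploit parameters that have already been trained to an optimum. This paper applies transfer learning to AlexNet, InceptionV3 and Inception-ResNet-v2, using RGB samples extracted from different small datasets. The corresponding three-dimensional optical flow feature images (optical flow+, OF+) are then fed, as samples, into neural network models of the same structure for comparison, and local parameter adjustment and partial structural changes are used to obtain the best test results.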

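1 Method overview

The basic workflow of the proposed method is shown in Fig. 1. The data used for training and testing are RGB and OF+ spontaneous expression samples. Each sample type is used for transfer learning on the different networks until the best result is reached, and networks of the same structure are then combined into a new network, the isomorphic network. The method is described in detail below in terms of sample preparation, network structure and parameter fine-tuning.

Fig. 1 Deep transfer learning flow chart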

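1.1 RGB spontaneous facial expression samples

The ASM algorithm^[13] is used to extract 68 facial landmarks from each frame of the spontaneous expression clips in the database. The two inner eye corners and the nose tip are three relatively stable points in a frontal view and are robust to interference, so they are used for an affine transformation of the face region, which extracts the face region in each frame and corrects its offset.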
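Three non-collinear point pairs determine a 2-D affine transformation exactly, which is why the two inner eye corners and the nose tip suffice for alignment. The following is a minimal sketch of that computation; the function names and the example coordinates are hypothetical, and the paper does not state which solver or template positions were used.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def _det3(m) -> float:
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def _solve3(m, rhs):
    """Solve a 3x3 linear system with Cramer's rule."""
    det = _det3(m)
    if abs(det) < 1e-12:
        raise ValueError("landmarks are collinear; the affine transform is undefined")
    solution = []
    for col in range(3):
        replaced = [list(row) for row in m]
        for row in range(3):
            replaced[row][col] = rhs[row]
        solution.append(_det3(replaced) / det)
    return solution


def affine_from_landmarks(src: Sequence[Point], dst: Sequence[Point]):
    """Return ((a, b, c), (d, e, f)) with x' = a*x + b*y + c and y' = d*x + e*y + f."""
    m = [[x, y, 1.0] for x, y in src]
    row_x = _solve3(m, [p[0] for p in dst])
    row_y = _solve3(m, [p[1] for p in dst])
    return tuple(row_x), tuple(row_y)


def apply_affine(transform, point: Point) -> Point:
    """Map one point through the affine transform."""
    (a, b, c), (d, e, f) = transform
    x, y = point
    return a * x + b * y + c, d * x + e * y + f


# Hypothetical landmark positions: left inner eye corner, right inner eye corner, nose tip.
detected = [(30.0, 40.0), (70.0, 42.0), (52.0, 80.0)]
template = [(35.0, 40.0), (65.0, 40.0), (50.0, 75.0)]
transform = affine_from_landmarks(detected, template)
```

Applying `transform` to every pixel coordinate of a frame (or passing the six coefficients to an image-warping routine) would bring the face into the template position.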

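Each expression clip is stored separately, and one image is chosen at random from each clip as the reference image. To match the OF+ samples, the final RGB samples do not include this image. An arbitrary image is used rather than one at a specific position because, whether the onset frame, the offset frame or the apex frame is chosen, comparing the other images with it still conveys the magnitude of the change in the spontaneous expression.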

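1.2 OF+ spontaneous facial expression samples

Optical flow^[16] is widely used for motion detection, so the facial motion caused by a spontaneous expression is also easy to detect with it. The images in each folder are compared, in temporal order, with the corresponding reference image to extract optical flow features. The displacement of the current image relative to the reference image is stored separately for the X and Y directions, and these two components are finally combined with the grayscale difference image of the color image to form a three-dimensional matrix. This is the OF+ spontaneous expression sample, which retains the basic information of the original image and adds the deformation information contained in the temporal sequence. The composition of an OF+ sample is shown in Fig. 2.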
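The assembly of one OF+ sample can be sketched as stacking three planes of equal size. The sketch assumes the X and Y flow planes have already been computed by some dense optical flow routine; the channel order and the use of an absolute difference for the grayscale plane are assumptions, since the text does not specify them. Plain nested lists are used to keep the example dependency-free; an array library would be the usual choice in practice.

```python
from typing import List, Tuple

Grid = List[List[float]]


def gray_difference(current: Grid, reference: Grid) -> Grid:
    """Absolute per-pixel difference between two grayscale images."""
    return [[abs(c - r) for c, r in zip(row_c, row_r)]
            for row_c, row_r in zip(current, reference)]


def build_of_plus(flow_x: Grid, flow_y: Grid, diff_gray: Grid) -> List[List[Tuple]]:
    """Stack the three H x W planes into one H x W x 3 sample."""
    height, width = len(diff_gray), len(diff_gray[0])
    for plane in (flow_x, flow_y):
        if len(plane) != height or any(len(row) != width for row in plane):
            raise ValueError("all planes must share the same H x W shape")
    return [[(flow_x[r][c], flow_y[r][c], diff_gray[r][c]) for c in range(width)]
            for r in range(height)]
```

Each pixel of the result then carries the horizontal displacement, the vertical displacement and the intensity change relative to the reference image.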

Fig. 2 Samples of OF+ spontaneous expression

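1.3 Transfer model reconstruction and the isomorphic network

The spontaneous expression datasets are too small to train an optimal model from scratch, so a transfer learning strategy is adopted. Transfer learning saves much of the time needed to train parameters and makes it easier to extend the network to new applications.

Different transfer operations are applied to the different pretrained network models. For AlexNet, the fully connected layers contain a very large number of trainable parameters, so the structure is left unchanged and only the number of output neurons is adjusted to the number of spontaneous expression categories in each database; this network is called N-Ⅰ. For InceptionV3, which is larger and deeper, two fully connected layers are added, one with 1 024 inputs and 512 outputs and one with 512 inputs and as many outputs as there are spontaneous expression categories, and the parameters of the added neurons are trained; this network is called N-Ⅱ. For Inception-ResNet-v2, which already includes batch normalization (BN) and residual modules, local fine-tuning alone gives good performance; this network is called N-Ⅲ.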
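To give a sense of how little needs to be trained in N-Ⅱ, the size of the two added fully connected layers can be computed directly from the dimensions given above (1 024 to 512, then 512 to the number of categories). The sketch assumes each layer has one bias per output neuron, which the text does not state, and takes the category counts from Table 1.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer (bias per output assumed)."""
    return n_in * n_out + n_out


def n2_head_params(num_classes: int) -> int:
    """Trainable parameters of the two added layers: 1024 -> 512 -> num_classes."""
    return dense_params(1024, 512) + dense_params(512, num_classes)


# Category counts as listed in Table 1.
CLASSES = {"CASME": 8, "CASME II": 7, "CAS(ME)^2": 10}

for name, num_classes in CLASSES.items():
    print(name, n2_head_params(num_classes))
```

In all three cases the added head has roughly half a million parameters, small compared with retraining the whole backbone.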

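Finally, networks of the same structure are combined into an isomorphic network (constructed as shown in Fig. 1) to predict the spontaneous expression category. Unlike a heterogeneous network^[9], which is composed of different network models, the isomorphic network uses networks of the same structure trained on the different sample types of one database, so the classification strengths of the different sample types can be combined. Compared with a heterogeneous network, the isomorphic network is more confident about expressions whose category is already clear, and for uncertain expressions it draws on discriminative and comparable features extracted from the different sample types, which reduces the probability of misclassification to some extent. To merge the outputs of the single networks trained on different samples into the output of the isomorphic network, the maximum predicted value for each spontaneous expression category is taken as the output of the isomorphic network: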

$ F^i = \max\left( F_1^i, F_2^i \right) $ (1)

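where $F$, $F_1$ and $F_2$ denote the classification confidence of the isomorphic network and of its two constituent networks, respectively, and $i$ indexes the expression category.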
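Eq. (1) amounts to an element-wise maximum over the per-class confidences of the two constituent networks, followed by picking the class with the largest fused value. A minimal sketch, with hypothetical function names and made-up confidence values:

```python
from typing import List, Sequence


def fuse_max(conf_a: Sequence[float], conf_b: Sequence[float]) -> List[float]:
    """Element-wise maximum of two per-class confidence vectors, as in Eq. (1)."""
    if len(conf_a) != len(conf_b):
        raise ValueError("both networks must score the same set of classes")
    return [max(a, b) for a, b in zip(conf_a, conf_b)]


def predict(conf_a: Sequence[float], conf_b: Sequence[float]) -> int:
    """Index of the class with the highest fused confidence."""
    fused = fuse_max(conf_a, conf_b)
    return max(range(len(fused)), key=fused.__getitem__)


# Made-up confidences for three classes from the RGB and OF+ networks.
rgb_conf = [0.10, 0.60, 0.30]
of_plus_conf = [0.05, 0.15, 0.80]
print(fuse_max(rgb_conf, of_plus_conf))  # [0.1, 0.6, 0.8]
print(predict(rgb_conf, of_plus_conf))   # 2
```

In this example the two networks disagree, and the fused prediction follows the more confident one.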

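2 Experiments

2.1 Databases

Spontaneous expression databases^[17] such as SMIC (spontaneous micro-expression dataset), CASME, CASMEⅡ and CAS(ME)² (Chinese Academy of Sciences macro-expression and micro-expression) are widely used for detecting spontaneous expressions and determining their categories.

To evaluate how well the proposed method classifies spontaneous expressions, the different network models are trained and tested on three public databases, CASME, CASMEⅡ and CAS(ME)² (see Table 1).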

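Table 1 Database of spontaneous expression

Database	Subjects	Clips	Images	Expression categories
CASME	35	189	4 220	8 (contempt, disgust, fear, happiness, repression, sadness, surprise, tense)
CASMEⅡ	35	227	17 124	7 (disgust, fear, happiness, others, repression, sadness, surprise)
CAS(ME)²	22	342	11 156	10 (anger, confusion, disgust, fear, happiness, helplessness, pain, sadness, surprise, sympathy)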

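2.2 Experimental details

As the construction of the database^[6] shows, spontaneous expression categories are hard to tell apart not only for artificial intelligence; even experts need to judge very carefully and confirm with the subject before the category can be determined. Take CAS(ME)² as an example. If its samples are grouped by the video originally used to elicit them, they fall into three classes: anger, disgust and happiness. When they are divided more finely into 10 classes, an expression in the anger class under the 3-class scheme may belong to the happiness class under the 10-class scheme, that is, a subject shows a positive expression while watching a negative eliciting video. This makes spontaneous expression classification even harder.

Because the databases contain few samples and many categories, the number of images per category is very unevenly distributed. CASME, for example, contains 35 subjects, and all of its images can be divided into seven categories apart from the neutral expression; the fear category has only 60 images while the surprise category has 396, more than six times as many.

Data augmentation is widely used in neural network training to enlarge small sample sets. In our experiments, however, applying data augmentation to the small-sample data had little effect on the classification results and made convergence harder. As shown in Fig. 3, a transfer network based on Inception-ResNet-v2 was trained on the RGB samples of CAS(ME)² and on the same samples augmented by cropping and by cropping combined with flipping. The average accuracies were 96.1%, 95.8% and 96.3%, respectively: the number of samples increased, but the results did not improve appreciably.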

Fig. 3 Comparison of CAS(ME)² data samples with different data augmentation methods

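Therefore, without data augmentation, all samples are split into a training set and a test set at a ratio of 8:2, with no validation set. In addition, because images of the same subject showing the same type of spontaneous expression are highly similar and would distort the test results, all images of one subject for one expression type are placed either in the training set or in the test set, never in both.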
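One way to realize such a split is to treat every (subject, expression) pair as an indivisible group and assign whole groups to one side. The sketch below is a simple greedy version under that assumption; the function name, the sample tuple layout and the fixed seed are hypothetical, and the paper does not describe its exact procedure.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str, str]  # (subject_id, expression_label, image_path)


def grouped_split(samples: List[Sample], train_ratio: float = 0.8, seed: int = 0):
    """Split samples roughly train_ratio : (1 - train_ratio) without dividing any
    (subject, expression) group between the training and test sets."""
    groups: Dict[Tuple[str, str], List[Sample]] = defaultdict(list)
    for sample in samples:
        groups[(sample[0], sample[1])].append(sample)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    target = train_ratio * len(samples)
    train: List[Sample] = []
    test: List[Sample] = []
    for key in keys:
        # Fill the training set until it reaches the target size;
        # every remaining group goes to the test set.
        bucket = train if len(train) < target else test
        bucket.extend(groups[key])
    return train, test
```

Because groups differ in size, the resulting ratio is only approximately 8:2; the guarantee is that no subject's images of a given expression appear on both sides.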

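The experiments show that, for AlexNet, changing the structure of the fully connected layers and retraining does not improve performance, and fine-tuning only the fully connected layers works worse than fine-tuning the last convolutional layer together with the fully connected layers. For InceptionV3, the reconstructed model outperforms the model that outputs the classification result directly. In the isomorphic network, taking the maximum gives a slightly higher average classification accuracy than taking the mean. Table 2 lists the best test results for each network and sample type, that is, the outputs of the network models described in Section 1.3.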

Table 2 Average accuracy comparison of different networks on different databases

Network type			Database			Sample type
N-Ⅰ	N-Ⅱ	N-Ⅲ	D-Ⅰ	D-Ⅱ	D-Ⅲ	T-Ⅰ	T-Ⅱ	T-Ⅲ
√			√			0.411	0.330	0.322
	√		√			0.471	0.429	0.448
		√	√			0.955	0.946	0.943
√				√		0.336	0.350	0.324
	√			√		0.790	0.772	0.808
		√		√		0.970	0.969	0.973
√					√	0.236	0.184	0.223
	√				√	0.818	0.816	0.812
		√			√	0.961	0.764	0.972
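Note: N-Ⅰ, N-Ⅱ and N-Ⅲ denote the networks built on AlexNet, InceptionV3 and Inception-ResNet-v2, respectively; D-Ⅰ, D-Ⅱ and D-Ⅲ denote CASME (8 classes), CASMEⅡ (7 classes) and CAS(ME)² (10 classes), respectively; T-Ⅰ, T-Ⅱ and T-Ⅲ denote samples in RGB form, OF+ form, and RGB combined with OF+, respectively.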

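2.3 Network performance and analysis of experimental results

Table 2 shows that training the same network model on different sample types gives broadly similar test results, which indicates that the neural network extracts image features sufficiently well and that the spatiotemporal features extracted by optical flow can be replaced. In addition, for N-Ⅱ and N-Ⅲ the results on D-Ⅰ, which has less data and more categories, are consistently worse than those on D-Ⅱ, which has more data and fewer categories. This suggests that the number of spontaneous expression samples and the high similarity between categories affect the experimental result, that is, the classification accuracy.

Among the single networks, N-Ⅲ performs best: its BN and residual modules normalize the input as a whole while preserving its features, and also retain more image features for later layers. N-Ⅰ performs poorly because its fully connected layers have too many parameters, which are hard to train to an optimum during fine-tuning. The isomorphic network, although more complex than the corresponding single networks, does not noticeably improve the average classification accuracy; the difference is about 1%. This indicates that a single network is already trained well enough on its own samples: the predicted values show that the confidence for each image's category is close or equal to 1 (for example, at the output of the Softmax classifier).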

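Therefore, confusion matrices are used to describe and analyze the results of the best-performing network (the single N-Ⅲ network with RGB input) on each database. Figs. 4 to 6 show the confusion matrices of N-Ⅲ for the different databases, from which the classification accuracy of each expression in a database, the expressions it is mistaken for, and the misclassification rates can be read. The easily confused expressions show that similar expressions are misclassified with high probability. For example, in Fig. 4, 23% of repression expressions are misclassified as tense, and in Fig. 6, 31% of fear expressions are misclassified as anger. This indicates that the more expression categories there are, the more similar they become and the harder it is for the network to tell them apart.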
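The percentages quoted from the confusion matrices are row-normalized rates: each row corresponds to a true category and gives the fraction of its samples assigned to each predicted category. A minimal sketch of that computation follows; the function name and the toy labels in the usage example are made up and do not reproduce the paper's data.

```python
from typing import Dict, List, Sequence


def confusion_matrix(true_labels: Sequence[str], predicted: Sequence[str],
                     classes: Sequence[str]) -> List[List[float]]:
    """Row-normalized confusion matrix: entry [i][j] is the fraction of samples
    of true class i that were predicted as class j."""
    index: Dict[str, int] = {name: k for k, name in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy example: one of four "fear" samples is mistaken for "anger".
classes = ["fear", "anger"]
true_labels = ["fear", "fear", "fear", "fear", "anger", "anger"]
predicted = ["fear", "fear", "fear", "anger", "anger", "anger"]
print(confusion_matrix(true_labels, predicted, classes))  # [[0.75, 0.25], [0.0, 1.0]]
```

The diagonal entries are the per-category accuracies, and the off-diagonal entries in a row show which categories that expression is mistaken for.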

Fig. 4 Confusion matrix of CASME data samples

Fig. 5 Confusion matrix of CASMEⅡ data samples

Fig. 6 Confusion matrix of CAS(ME)² data samples

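2.4 Network advantages and comparative experiments

To show that the N-Ⅲ network in this paper classifies spontaneous expressions best, it is compared with VGG, ResNet and HCN, the heterogeneous network built from the two; the results are given in Table 3.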

Table 3 Comparison of discrimination accuracy of different networks

Method	Disgust	Fear	Happiness	Sadness	Surprise	Average
CNN^[14]	0.90	0.39	0.48	0.03	0.65	0.78
VGG^[18]	0.75	0.76	0.98	0.91	0.94	0.83
ResNet^[19]	0.55	0.74	0.75	0.89	0.82	0.79
HCN^[13]	0.79	0.76	0.97	0.94	0.94	0.87
This paper (N-Ⅲ)	0.97	0.63	0.98	0.93	0.91	0.96
Note: bold font indicates the best result.

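Overall, recognition accuracy for spontaneous expressions rises across the networks. Residual modules and BN are added successively from VGG to ResNet and then to N-Ⅲ, and N-Ⅲ reaches the highest average accuracy in Table 3, which suggests that these components help improve recognition accuracy. For individual categories, most networks recognize happiness with high accuracy: as a positive expression, it produces more obvious facial changes than other expressions, and it has relatively many samples, so it is recognized well. In contrast, fear, which has few samples, is recognized with lower accuracy and is easily confused with other spontaneous expressions.

Moreover, although N-Ⅲ does not achieve the best result in every category, it is the best overall so far: except for fear, the recognition confidence for every category is close to 1 (above 0.9 in Table 3).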

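3 Conclusion

This paper uses deep transfer learning to achieve the best result to date on spontaneous expression classification, and compares different sample types, such as OF+ samples, as well as isomorphic network models, showing that deep learning is well suited to spontaneous expression classification.

Because more than 90% of information is conveyed by a subject's expressions and language, future work will focus on combining the spatiotemporal information in video with natural language processing (NLP) to learn spontaneous expression features, and on removing the interference of different facial shapes, so as to improve the accuracy of spontaneous expression spotting and classification.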

References

[1] Gavrilescu M. Proposed architecture of a fully integrated modular neural network-based automatic facial emotion recognition system based on facial action coding system[C]//Proceedings of the 10th International Conference on Communications. Bucharest, Romania: IEEE, 2014: 1-6.[DOI: 10.1109/ICComm.2014.6866754]

[2] Petrantonakis P C, Hadjileontiadis L J. An emotion elicitation metric for the valence/arousal and six basic emotions affective models: a comparative study[C]//Proceedings of the 10th IEEE International Conference on Information Technology and Applications in Biomedicine. Corfu, Greece: IEEE, 2010: 1-4.[DOI: 10.1109/ITAB.2010.5687675]

[3] Xue Y L, Mao X, Guo Y, et al. The research advance of facial expression recognition in human computer interaction[J]. Journal of Image and Graphics, 2009, 14(5): 764–772. [薛雨丽, 毛峡, 郭叶, 等. 人机交互中的人脸表情识别研究进展[J]. 中国图象图形学报, 2009, 14(5): 764–772. ] [DOI:10.11834/jig.20090503]

[4] Michael N, Dilsizian M, Metaxas D, et al. Motion profiles for deception detection using visual cues[C]//Proceedings of the 11th European Conference on Computer Vision. Heraklion, Crete, Greece: Springer, 2010: 462-475.[DOI: 10.1007/978-3-642-15567-3_34]

[5] Ekman P. Emotions revealed: recognizing faces and feelings to improve communication and emotional life[M]. Broché: Holt McDougal, 2007.

[6] Yan W J, Wu Q, Liu Y J, et al. CASME database: a dataset of spontaneous micro-expressions collected from neutralized faces[C]//Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. Shanghai, China: IEEE, 2013: 1-7.[DOI: 10.1109/FG.2013.6553799]

[7] Mohammadi M R, Fatemizadeh E, Mahoor M H. Intensity estimation of spontaneous facial action units based on their sparsity properties[J]. IEEE Transactions on Cybernetics, 2016, 46(3): 817–826. [DOI:10.1109/TCYB.2015.2416317]

[8] Li Z C, Tang J H. Unsupervised feature selection via nonnegative spectral analysis and redundancy control[J]. IEEE Transactions on Image Processing, 2015, 24(12): 5343–5355. [DOI:10.1109/TIP.2015.2479560]

[9] Li Z C, Tang J H, He X F. Robust structured nonnegative matrix factorization for image representation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(5): 1947–1960. [DOI:10.1109/TNNLS.2017.2691725]

[10] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, Nevada: Curran Associates Inc., 2012: 1097-1105.[DOI: 10.1145/3065386]

[11] Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of 2016 Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 2818-2826.[DOI: 10.1109/CVPR.2016.308]

[12] Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, inception-resnet and the impact of residual connections on learning[C]//Proceedings of 2017 Association for the Advance of Artificial Intelligence. San Francisco, California, USA: AAAI, 2017: 4-12.

[13] Peng X L, Li L, Feng X Y, et al. Spontaneous facial expression recognition by heterogeneous convolutional networks[C]//Proceedings of 2017 International Conference on the Frontiers and Advances in Data Science. Xi'an, China: IEEE, 2017: 70-73.[DOI: 10.1109/FADS.2017.8253196]

[14] Takalkar M A, Xu M. Image based facial micro-expression recognition using deep learning on small datasets[C]//Proceedings of 2017 International Conference on Digital Image Computing: Techniques and Applications. Sydney, NSW, Australia: IEEE, 2017.[DOI: 10.1109/DICTA.2017.8227443]

[15] Keshari R, Vatsa M, Singh R, et al. Learning structure and strength of CNN filters for small sample size training[C]//Proceedings of 2018 Computer Vision and Pattern Recognition. Salt Lake City, UT, USA: IEEE, 2018: 9349-9358.[DOI: 10.1109/CVPR.2018.00974]

[16] Zhang S F, Zhang W S, Ding H, et al. Background modeling and object detecting based on optical flow velocity field[J]. Journal of Image and Graphics, 2011, 16(2): 236–243. [张水发, 张文生, 丁欢, 等. 融合光流速度与背景建模的目标检测方法[J]. 中国图象图形学报, 2011, 16(2): 236–243. ] [DOI:10.11834/jig.20110220]

[17] Liong S T, See J, Wong K S, et al. Automatic apex frame spotting in micro-expression database[C]//Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition. Kuala Lumpur, Malaysia: IEEE, 2015: 665-669.[DOI: 10.1109/ACPR.2015.7486586]

[18] Peng X L, Xia Z Q, Li L, et al. Towards facial expression recognition in the wild: a new database and deep recognition system[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops. Las Vegas, NV, USA: IEEE, 2016: 1544-1550.[DOI: 10.1109/CVPRW.2016.192]

[19] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016: 770-778.[DOI: 10.1109/CVPR.2016.90]