2018 | Volume 23 | Number 5 Image Analysis and Recognition

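1. Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, School of Computer and Information, Hefei University of Technology, Hefei 230009, China;
2. Graduate School of Advanced Technology & Science, University of Tokushima, Tokushima 7708502, Japan
 Received: 2017-07-21; Revised: 2017-11-30 Funding: National Natural Science Foundation of China (61672202, 61432004, 61502141); National Natural Science Foundation of China-Shenzhen Joint Fund Key Project (U1613217) First author: Ren Fuji (1959-), male, professor, doctoral supervisor, received his Ph.D. in electronics and information science from Hokkaido University, Japan, in 1991; his main research interests include affective computing, natural language processing, and artificial intelligence. E-mail: ren2fuji@gmail.com. CLC number: TP391 Document code: A Article ID: 1006-8961(2018)05-0688-10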


Dual-modality video emotion recognition based on facial expression and BVP physiological signal
Ren Fuji1,2, Yu Manli1, Hu Min1, Li Yanqiu1
1. School of Computer and Information of Hefei University of Technology, Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, Hefei 230009, China;
2. University of Tokushima, Graduate School of Advanced Technology & Science, Tokushima 7708502, Japan
Supported by: National Natural Science Foundation of China(61672202, 61432004, 61502141); National Natural Science Foundation of China-Shenzhen Joint Fund Key Projects(U1613217)

# Abstract

Objective With the continuous development of artificial intelligence, researchers and scholars from other fields have become increasingly interested in giving computers the capability to understand the emotions conveyed by human beings and to interact with them naturally. Emotion recognition has therefore gradually become a key research topic for achieving harmonious human-computer interaction. The performance of video emotion recognition algorithms depends critically on the quality of the extracted emotion information. Previous research showed that facial expression is the most direct channel for conveying emotional information, so current works usually rely on facial expressions alone to complete emotion recognition. Feature extraction methods for facial expression images mostly operate on gray images. However, when color images are converted into gray images, the latent physiological signals carried by the color information of facial videos, which are discriminative for emotion recognition, are lost. To overcome this problem, this study introduces a novel dual-modality video emotion recognition method based on decision-level fusion, which combines facial expressions with blood volume pulse (BVP) physiological signals that can be extracted from facial videos.

Method First, the video is preprocessed (face detection and normalization) to obtain a sequence of video frames that contain only the face image. The LBP-TOP feature is an effective local texture descriptor, whereas the HOG-TOP feature is a gradient-based local shape descriptor that compensates for the image edge and direction information that LBP-TOP lacks. Thus, we extract the LBP-TOP and HOG-TOP features from the video frames and fuse the two facial expression features. We then use video color amplification technology to process the original video and extract the BVP physiological signal from the processed video, from which the emotional features of the physiological signal are extracted. Afterward, the two kinds of features are fed into BP classifiers to train the classification models. Finally, the fuzzy integral is used to fuse the posterior probabilities obtained by the two classifiers and produce the final emotion recognition result.

Result Because the commonly used video emotion databases cannot satisfy the requirements for extracting the BVP signal, we conduct experimental verification on a self-built facial expression video database. Each group of experiments was cross-validated, and the final results were averaged to increase the credibility of the experiment. The average recognition rates of the single modalities, i.e., facial expression and physiological signal, are 80% and 63.75%, respectively, whereas the recognition rate after fusing the two modalities reaches 83.33%, which is higher than that of either single modality. This finding indicates that decision fusion of facial expression and BVP physiological signal is effective for emotion recognition. The recognition rates of other fusion methods, namely, the D-S evidence theory and the maximum value rule, are 71% and 80%, respectively, which are lower than that of the fuzzy integral method. In addition, the recognition rate of our method is 2% and 2.5% higher than those of two existing video emotion recognition methods.

Conclusion The dual-modality spatiotemporal feature fusion method proposed in this study characterizes the emotion information contained in facial videos from two aspects, i.e., the facial expression and the physiological signals. The experimental results show that the algorithm makes full use of the emotion information in the video and effectively improves the classification performance of video emotion recognition, as verified by comparison with similar video emotion recognition algorithms. In addition, the fuzzy integral is used to fuse the two modalities at the decision level. Because it takes the reliability of the different classifiers into account during fusion, the influence of unreliable decision information on the fused decision is effectively reduced. The comparison with the D-S evidence theory and the maximum value rule shows that the proposed fusion method attains higher recognition accuracy.

# Key words

facial expression; physiological signal; video color amplification technology; fuzzy integral; dual-modality

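# 1.1.2 Histogram of oriented gradients on three orthogonal planes (HOG-TOP)

HOG was first used for pedestrian detection. Its basic idea is that the appearance and shape of an object can be well represented by the distribution of gradients or edge directions. Let the gray value at any pixel $(x, y)$ of the image be $H(x, y)$; its gradients in the horizontal and vertical directions are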

 $\left\{ \begin{array}{l} {G_x}\left( {x,y} \right) = H\left( {x + 1,y} \right) - H\left( {x - 1,y} \right)\\ {G_y}\left( {x,y} \right) = H\left( {x,y + 1} \right) - H\left( {x,y - 1} \right) \end{array} \right.$ (1)

 $G\left( {x,y} \right) = \sqrt {{G_x}{{\left( {x,y} \right)}^2} + {G_y}{{\left( {x,y} \right)}^2}}$ (2)

 $\theta = \arctan \left( {\frac{{{G_y}\left( {x,y} \right)}}{{{G_x}\left( {x,y} \right)}}} \right)$ (3)

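The range of $\mathit{\theta }$ is [0°, 360°]. This range is divided evenly into $n$ orientations, which form the $n$ bins of the histogram, and the gradient magnitude at each pixel is added to the bin that corresponds to its $\mathit{\theta }$, which yields the histogram of oriented gradients of the image. Similar to LBP-TOP, HOG-TOP extends HOG from the 2D plane to 3D space, i.e., it computes gradients on three orthogonal planes. For a center pixel $(x, y, t)$, the gradient magnitudes and gradient directions on the three orthogonal planes are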

 $\left\{ \begin{array}{l} {G_{xy}}\left( {x,y,t} \right) = \sqrt {{G_x}{{\left( {x,y,t} \right)}^2} + {G_y}{{\left( {x,y,t} \right)}^2}} \\ {G_{xt}}\left( {x,y,t} \right) = \sqrt {{G_x}{{\left( {x,y,t} \right)}^2} + {G_t}{{\left( {x,y,t} \right)}^2}} \\ {G_{yt}}\left( {x,y,t} \right) = \sqrt {{G_y}{{\left( {x,y,t} \right)}^2} + {G_t}{{\left( {x,y,t} \right)}^2}} \end{array} \right.$ (4)

 $\left\{ \begin{array}{l} {\theta _{xy}}\left( {x,y,t} \right) = \arctan \left( {\frac{{{G_y}\left( {x,y,t} \right)}}{{{G_x}\left( {x,y,t} \right)}}} \right)\\ {\theta _{xt}}\left( {x,y,t} \right) = \arctan \left( {\frac{{{G_t}\left( {x,y,t} \right)}}{{{G_x}\left( {x,y,t} \right)}}} \right)\\ {\theta _{yt}}\left( {x,y,t} \right) = \arctan \left( {\frac{{{G_t}\left( {x,y,t} \right)}}{{{G_y}\left( {x,y,t} \right)}}} \right) \end{array} \right.$ (5)
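The gradient and histogram computations of Eqs. (1)-(5) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `hog_top`, the `(T, H, W)` array layout, the default of 8 bins, and the use of `arctan2` (to resolve the quadrant so that angles cover the full 0° to 360° range) are all assumptions made for illustration, and block division and normalization are omitted.

```python
import numpy as np


def hog_top(volume, n_bins=8):
    """Hypothetical HOG-TOP sketch for a gray-level video block shaped (T, H, W)."""
    v = np.asarray(volume, dtype=float)
    # Central differences f(p + 1) - f(p - 1) as in Eq. (1), kept on interior voxels only.
    gt = (v[2:, :, :] - v[:-2, :, :])[:, 1:-1, 1:-1]
    gy = (v[:, 2:, :] - v[:, :-2, :])[1:-1, :, 1:-1]
    gx = (v[:, :, 2:] - v[:, :, :-2])[1:-1, 1:-1, :]

    hists = []
    # One histogram per orthogonal plane: XY, XT, YT.
    for a, b in ((gx, gy), (gx, gt), (gy, gt)):
        mag = np.hypot(a, b)  # gradient magnitude, Eq. (4)
        # Assumption: arctan2 instead of the plain arctan of Eq. (5), mapped to [0, 360).
        ang = np.degrees(np.arctan2(b, a)) % 360.0
        # Each voxel votes into its orientation bin with its gradient magnitude.
        h, _ = np.histogram(ang, bins=n_bins, range=(0.0, 360.0), weights=mag)
        hists.append(h)
    return np.concatenate(hists)  # length 3 * n_bins
```

Under these assumptions, a block whose intensity increases only along $x$ puts all of its XY and XT votes into the first orientation bin and leaves the YT histogram empty.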

 ${\mu _z} = \frac{1}{N}\sum\limits_{i = 1}^N {{z_i}}$ (8)

 ${\sigma _z} = \sqrt {\frac{1}{N}\sum\limits_{i = 1}^N {{{\left( {{z_i} - {\mu _z}} \right)}^2}} }$ (9)

 ${\delta _z} = \frac{1}{{N - 1}}\sum\limits_{i = 1}^{N - 1} {\left| {{z_{i + 1}} - {z_i}} \right|}$ (10)

 ${\zeta _z} = \frac{{{\sigma _z}}}{{{\delta _z}}}$ (11)

 ${\gamma _z} = \frac{1}{{N - 2}}\sum\limits_{i = 1}^{N - 2} {\left| {{z_{i + 2}} - {z_i}} \right|}$ (12)

 ${\xi _z} = \frac{{{\gamma _z}}}{{{\delta _z}}}$ (13)
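As a rough illustration of Eqs. (8)-(13), the following sketch computes the six statistics for a sampled signal. It assumes that $z_1, \ldots, z_N$ are the samples of a one-dimensional signal; the function name `signal_statistics` and the dictionary keys are hypothetical.

```python
import numpy as np


def signal_statistics(z):
    """Time-domain statistics of a 1-D signal following Eqs. (8)-(13)."""
    z = np.asarray(z, dtype=float)
    mu = z.mean()                             # Eq. (8): mean
    sigma = np.sqrt(np.mean((z - mu) ** 2))   # Eq. (9): standard deviation with 1/N
    delta = np.mean(np.abs(z[1:] - z[:-1]))   # Eq. (10): mean absolute first difference
    gamma = np.mean(np.abs(z[2:] - z[:-2]))   # Eq. (12): mean absolute second difference
    # Eqs. (11) and (13) divide by delta, so a constant signal (delta == 0) is not handled here.
    zeta = sigma / delta                      # Eq. (11)
    xi = gamma / delta                        # Eq. (13)
    return {"mu": mu, "sigma": sigma, "delta": delta,
            "zeta": zeta, "gamma": gamma, "xi": xi}
```

For the samples `[1, 2, 4, 7]`, the first differences are 1, 2, 3 and the lag-two differences are 3, 5, so $\delta_z = 2$, $\gamma_z = 4$, and $\xi_z = 2$.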

 ${H_{sub}} = - \sum\limits_{i = 1}^N {{f_i} \cdot {{\log }_2}\left( {{f_i}} \right)}$ (14)

 ${f_i} = \frac{{{F_i}}}{{\sum\limits_{j = 1}^N {{F_j}} }};\;\;\;\;i = 1, \cdots ,N$ (15)
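Eqs. (14) and (15) amount to a Shannon entropy over normalized components. The sketch below assumes that the $F_i$ are non-negative values (for example, spectral components of a sub-band) with a positive sum; the function name `subband_entropy` is hypothetical, and zero components are skipped under the usual convention $0 \cdot \log_2 0 = 0$.

```python
import numpy as np


def subband_entropy(F):
    """Entropy H_sub of non-negative components F_i, following Eqs. (14)-(15)."""
    F = np.asarray(F, dtype=float)
    f = F / F.sum()          # Eq. (15): normalize so the f_i sum to 1
    nz = f[f > 0]            # convention: terms with f_i == 0 contribute nothing
    return float(-np.sum(nz * np.log2(nz)))  # Eq. (14)
```

Four equal components give an entropy of 2 bits, while a single non-zero component gives 0, so the value measures how evenly the energy is spread.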

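Table 1 The selected emotional features

| Feature | Description |
| --- | --- |
| Y_diff_1 | Standard deviation of the first-order difference of the BVP signal |
| Y_power_ratio | Ratio of low-frequency to high-frequency energy |
| Y_entroy_ratio | Ratio of low-frequency to high-frequency entropy |
| Y_entroy_mean | Mean entropy of the BVP signal |
| Y_power_mean | Mean energy of the BVP signal |
| F0 | Frequency at the peak of the power spectral density within 0.8-2 Hz |
| G_mean | Mean of the G signal |
| G_std | Standard deviation of the G signal |
| G_diff_1 | Standard deviation of the first-order difference of the G signal |
| G_diff_2 | Standard deviation of the second-order difference of the G signal |
| NN_50 | Number of differences between adjacent points of the G signal that exceed 50 ms |
| NN_ratio | Ratio of NN_50 to the number of peaks of the BVP signal |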
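As an example of how one entry of Table 1 might be computed, the sketch below estimates F0, the frequency at the peak of the power spectral density within 0.8-2 Hz. The periodogram estimator, the function name `peak_frequency`, and the sampling rate parameter `fs` are assumptions; the estimator actually used for the table is not specified here.

```python
import numpy as np


def peak_frequency(signal, fs, f_lo=0.8, f_hi=2.0):
    """Frequency (Hz) of the largest periodogram value inside [f_lo, f_hi]."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                                    # remove the DC component
    psd = np.abs(np.fft.rfft(x)) ** 2 / (fs * x.size)   # simple periodogram (assumed estimator)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)            # keep only the 0.8-2 Hz band
    return float(freqs[band][np.argmax(psd[band])])
```

With a 10 s signal sampled at 30 frames per second, the frequency resolution is 0.1 Hz, and a 1.2 Hz pulse component (72 beats per minute) is returned even when a stronger component lies outside the band.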

# 2 Dual-modality emotion recognition method fusing facial expression and BVP physiological signal

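1) Perform face detection and normalization on every frame of the test video;

2) Cluster the face video with $k$-means clustering and use the resulting $k$ face images to represent the whole video;

3) Divide each clustered image evenly into non-overlapping rectangular sub-blocks. Taking three adjacent images at a time in temporal order, compute the LBP-TOP and HOG-TOP features of each sub-block, concatenate the LBP-TOP features of all sub-blocks and, separately, the HOG-TOP features of all sub-blocks, and finally join the concatenated LBP-TOP and HOG-TOP features to obtain the final expression feature;

4) Amplify and denoise the face video following the method in Section 1.2 to obtain the preprocessed video, then extract the BVP signal and compute its emotional features;

5) Process all videos in the training set following steps 1) to 4) to obtain the expression features and BVP physiological signal features of the training samples;

6) Feed the expression features and the BVP physiological signal features of the training samples into BP neural networks separately to train the BP classifiers;

7) Use the trained BP classifiers to obtain the probabilities that the test video belongs to each emotion category;

8) Fuse the decision information of the two modalities with the fuzzy integral to obtain the final classification result.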
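Step 8) can be sketched for the two-classifier case as follows. This section does not state which fuzzy integral is used or how the fuzzy densities are chosen, so the Sugeno form below, the function name `sugeno_fuse`, and the density parameters `g_expr` and `g_bvp` (values in [0, 1] expressing how much each classifier is trusted) are assumptions for illustration only.

```python
import numpy as np


def sugeno_fuse(p_expr, p_bvp, g_expr, g_bvp):
    """Fuse two posterior vectors with a Sugeno fuzzy integral (two sources, assumed form)."""
    h = np.stack([np.asarray(p_expr, dtype=float),
                  np.asarray(p_bvp, dtype=float)])        # shape (2, n_classes)
    g = np.array([g_expr, g_bvp], dtype=float)            # hypothetical fuzzy densities
    order = np.argsort(-h, axis=0)                        # per class: strongest source first
    h_sorted = np.take_along_axis(h, order, axis=0)
    g_top = g[order[0]]                                   # density of the strongest source
    # With two sources the full set has measure 1, so for each class
    # S = max(min(h_(1), g_(1)), min(h_(2), 1)) = max(min(h_(1), g_(1)), h_(2)).
    scores = np.maximum(np.minimum(h_sorted[0], g_top), h_sorted[1])
    return scores, int(np.argmax(scores))
```

In this form a confident vote from a classifier with a low density is capped at that density, which is one way the influence of an unreliable decision can be limited during fusion.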

# 3.2 Experimental results and analysis

Table 2 The experimental results on expression mono-modality

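| Category | Test samples | Mean number correctly recognized | Mean recognition rate/% | Standard deviation |
| --- | --- | --- | --- | --- |
| Happy | 20 | 18 | 90 | 0.0316 |
| Fear | 20 | 15.2 | 76 | 0.0374 |
| Sad | 20 | 16.6 | 83 | 0.0246 |
| Angry | 20 | 14.2 | 71 | 0.0374 |
| Total | 80 | 64 | 80 | 0.0079 |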

Table 3 The experimental results on BVP signal mono-modality

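| Category | Test samples | Mean number correctly recognized | Mean recognition rate/% | Standard deviation |
| --- | --- | --- | --- | --- |
| Happy | 20 | 14.6 | 73 | 0.0245 |
| Fear | 20 | 12.8 | 64 | 0.0200 |
| Sad | 20 | 10.6 | 53 | 0.0245 |
| Angry | 20 | 13 | 65 | 0.0316 |
| Total | 80 | 51 | 63.75 | 0.0137 |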

Table 4 Comparison of recognition performance of different feature extraction methods based on expression

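| Feature | Mean recognition rate/% | Mean recognition time per frame/ms |
| --- | --- | --- |
| VLBP | 73.25 | 391.09 |
| LBP-TOP | 76.75 | 197.48 |
| HOG-TOP | 77.5 | 98.03 |
| LBP-TOP+HOG-TOP | 80 | 235.11 |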

Table 5 Comparison of recognition rates of different fusion methods

| Fusion method | Mean recognition rate |
| --- | --- |
| D-S evidence theory | 0.71 |
| Maximum value rule | 0.80 |
| Fuzzy integral | 0.8325 |

Table 6 Recognition rate comparison between other methods and our method

| Method | Mean recognition rate | Mean recognition time per frame/ms |
| --- | --- | --- |
| Fan[7] | 0.8075 | 247 |
| Zhao[5] | 0.8125 | 298 |
| Proposed | 0.8325 | 326 |

