Bimodal emotion recognition with composite spatio-temporal features

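Wang Xiaohua1, Hou Dengyong1, Hu Min1, Ren Fuji1,2 (1. School of Computer and Information, Hefei University of Technology, Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, Hefei 230009, China; 2. Graduate School of Advanced Technology & Science, University of Tokushima, Tokushima 7708502, Japan)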

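Abstract
Objective When the volume local binary pattern (VLBP) is applied to feature extraction from video frames, it yields high-dimensional features and is poorly robust to illumination changes and noise. To address these problems, this paper proposes a new feature description algorithm, the temporal-spatial local ternary pattern moment (TSLTPM). Because TSLTPM describes only texture, the paper further fuses 3D histogram of oriented gradients (3DHOG) features to strengthen the description of emotional videos. Method First, the emotional videos are preprocessed to obtain expression and posture sequences. Next, TSLTPM and 3DHOG features are extracted from the expression and posture sequences separately, and the minimum Euclidean distance between the features of the test sequence and those of the labeled emotion training set is computed and used as independent evidence to construct the basic probability assignment (BPA). Finally, the emotion recognition result is obtained with the D-S evidence combination rule. Result In experiments on the FABO database, the expression and posture single modalities achieve average recognition rates of 83.06% and 94.78%, respectively. For expression, these rates are 9.27%, 12.89%, 1.87%, and 1.13% higher than those of VLBP (volume local binary pattern), LBP-TOP (local binary patterns from three orthogonal planes), TSLTPM, and 3DHOG, respectively; for posture, they are 24.61%, 27.55%, 1.18%, and 0.98% higher than those of VLBP, LBP-TOP, TSLTPM, and 3DHOG, respectively. After the two modalities are fused, the average recognition rate reaches 96.86%, which demonstrates the effectiveness of fusing expression and posture for emotion recognition. Conclusion The proposed TSLTPM feature extends VLBP into a spatio-temporal ternary pattern, which effectively reduces dimensionality and lessens the influence of illumination and noise on recognition. Combined with 3DHOG, it forms a composite spatio-temporal feature that effectively improves the classification of emotional videos, and comparative experiments against typical feature extraction algorithms also show the effectiveness of the proposed algorithm. In addition, comparisons with other methods verify the superiority of the proposed fusion method.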
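The decision-level fusion step can be illustrated with a minimal Python sketch of Dempster's rule of combination. The abstract does not state how the minimum Euclidean distances are mapped to a basic probability assignment, so the helper `bpa_from_distances` (mass proportional to `exp(-distance)`) is purely an assumption for illustration, as are the function names, the emotion labels, and the distance values; none of this is the authors' implementation.

```python
import math
from itertools import product


def bpa_from_distances(distances):
    """Hypothetical mapping from per-class minimum distances to a BPA.

    Assumption: mass is proportional to exp(-distance), so a smaller
    distance gives a larger mass. The source does not specify this mapping.
    """
    weights = {label: math.exp(-d) for label, d in distances.items()}
    total = sum(weights.values())
    return {frozenset([label]): w / total for label, w in weights.items()}


def combine_dempster(m1, m2):
    """Combine two BPAs with Dempster's rule.

    Each BPA maps a frozenset of hypotheses (a focal element) to its mass.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + mass_a * mass_b
        else:
            # Product mass that would fall on the empty set.
            conflict += mass_a * mass_b
    if 1.0 - conflict < 1e-12:
        raise ValueError("evidence is totally conflicting; the rule is undefined")
    # Renormalise by the non-conflicting mass.
    return {h: m / (1.0 - conflict) for h, m in combined.items()}


# Illustrative distances only (not results from the paper).
face = bpa_from_distances({"anger": 0.5, "happiness": 1.2})
body = bpa_from_distances({"anger": 0.4, "happiness": 1.5})
fused = combine_dempster(face, body)
predicted = max(fused, key=fused.get)
```

Representing focal elements as frozensets lets the same function handle evidence that assigns mass to a set of emotions (for example, the whole frame of discernment to express ignorance), not only to single classes.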
Keywords
Dual-modality emotion recognition based on composite spatio-temporal features

Wang Xiaohua1, Hou Dengyong1, Hu Min1, Ren Fuji1,2(1.School of Computer and Information of Hefei University of Technology, Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine, Hefei 230009, China;2.University of Tokushima, Graduate School of Advanced Technology & Science, Tokushima 7708502, Japan)

Abstract
Objective When the volume local binary pattern (VLBP) is applied to feature extraction from video frames, it suffers from high feature dimensionality and weak robustness to illumination and noise. This study proposes a new feature description algorithm, the temporal-spatial local ternary pattern moment (TSLTPM). The algorithm introduces ternary patterns and extends them to the temporal-spatial domain to describe the variation of pixel values among adjacent frames. The texture feature is represented by the energy values of the ternary pattern matrices, which are calculated from the gray-level co-occurrence matrix. Because TSLTPM describes only texture and lacks image edge and direction information, it cannot fully characterize emotional videos. The 3D histograms of oriented gradients (3DHOG) feature is therefore fused with it to enhance the description of emotion, and combining the two features yields composite spatio-temporal features. Method First, the emotional videos are preprocessed, and five frames are obtained by K-means clustering to form the expression and body posture emotion sequences. Second, TSLTPM and 3DHOG features are extracted from the expression and posture sequences, and the minimum Euclidean distance between the features of the test sequence and those of the labeled emotion training set is calculated. This distance is used as independent evidence to construct the basic probability assignment (BPA) function. Finally, the BPAs are fused according to the combination rule of D-S evidence theory to obtain the emotion recognition result. Result Experiments on a bimodal expression and body posture emotion database show that the composite spatio-temporal features achieve good recognition performance. Average recognition rates of 83.06% and 94.78% are obtained for the single modalities of facial expression and posture, respectively. For the expression modality, the average recognition rate is 9.27%, 12.89%, 1.87%, and 1.13% higher than those of VLBP, LBP-TOP, TSLTPM, and 3DHOG, respectively. For the posture modality, it is 24.61%, 27.55%, 1.18%, and 0.98% higher than those of VLBP, LBP-TOP, TSLTPM, and 3DHOG, respectively. After the two modalities are fused, the average recognition rate reaches 96.86%, which is higher than that of either single modality. This result confirms the effectiveness of fusing expression and posture for emotion recognition. Conclusion The proposed TSLTPM feature extends VLBP, which is effective in describing the local features of video images, into a temporal-spatial local ternary pattern. The proposed feature has low dimensionality and improves robustness to illumination and noise. The composite spatio-temporal features that combine 3DHOG and TSLTPM describe the useful information in emotional videos more fully and improve the classification performance on such videos. Comparisons with other typical feature extraction algorithms also demonstrate the effectiveness of the proposed algorithm. The proposed algorithm is shown to be suitable for recognizing emotion in videos with static backgrounds, and the superiority of the fusion method in this study is verified.
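For readers unfamiliar with ternary patterns, the following Python sketch shows the standard single-frame local ternary pattern (LTP) coding of a 3x3 neighbourhood. It is not the authors' TSLTPM, which additionally extends the pattern across adjacent frames and summarizes it with energy values from the gray-level co-occurrence matrix; the function name, the threshold value, the neighbour ordering, and the sample patch are all illustrative assumptions.

```python
import numpy as np


def local_ternary_pattern(patch, threshold):
    """Code the centre pixel of a 3x3 patch with a standard LTP.

    Returns the ternary vector plus the usual split into an "upper" and a
    "lower" binary code. Assumes threshold > 0.
    """
    patch = np.asarray(patch, dtype=np.int64)
    if patch.shape != (3, 3):
        raise ValueError("patch must be 3x3")
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    center = patch[1, 1]
    # Eight neighbours, clockwise from the top-left pixel (assumed ordering).
    rows = [0, 0, 0, 1, 2, 2, 2, 1]
    cols = [0, 1, 2, 2, 2, 1, 0, 0]
    neighbours = patch[rows, cols]
    ternary = np.zeros(8, dtype=np.int64)
    ternary[neighbours >= center + threshold] = 1    # clearly brighter
    ternary[neighbours <= center - threshold] = -1   # clearly darker
    weights = 2 ** np.arange(8)
    upper = int(np.sum((ternary == 1) * weights))
    lower = int(np.sum((ternary == -1) * weights))
    return ternary, upper, lower


# Illustrative grey levels only.
patch = [[54, 60, 70],
         [40, 55, 62],
         [50, 57, 48]]
ternary, upper, lower = local_ternary_pattern(patch, threshold=5)
```

Neighbours within the threshold of the centre value are coded 0, so small intensity fluctuations do not flip the code; this tolerance band is what gives ternary patterns their robustness to noise relative to binary patterns.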
Keywords
