复合时空特征的双模态情感识别
Dual-modality emotion recognition based on composite spatio-temporal features
2017, Vol. 22, No. 1, pp. 39-48
Online publication: 2016-12-29
Print publication: 2017
DOI: 10.11834/jig.20170105
To address the high feature dimensionality and poor robustness to illumination and noise that arise when the volume local binary pattern is applied to video-frame feature extraction, a new feature descriptor, the temporal-spatial local ternary pattern moment (TSLTPM), is proposed. Since TSLTPM describes only texture, the 3D histogram of oriented gradients (3DHOG) feature is further fused with it to enrich the description of emotional videos. First, the emotional videos are preprocessed to obtain facial-expression and body-posture sequences; then TSLTPM and 3DHOG features are extracted from the expression and posture sequences, and the minimum Euclidean distance between the features of a test sequence and those of the labeled emotion training set is computed and used as independent evidence to construct basic probability assignments; finally, the D-S evidence combination rule yields the emotion recognition result. In experiments on the FABO database, the single expression and posture modalities achieve average recognition rates of 83.06% and 94.78%, respectively. On expressions, these rates are 9.27%, 12.89%, 1.87%, and 1.13% higher than those of VLBP (volume local binary pattern), LBP-TOP (local binary pattern on three orthogonal planes), TSLTPM, and 3DHOG, respectively; on postures, they are 24.61%, 27.55%, 1.18%, and 0.98% higher. After fusing the two modalities, the average recognition rate reaches 96.86%, demonstrating the effectiveness of fusing expression and posture for emotion recognition. The proposed TSLTPM extends VLBP into a spatio-temporal ternary pattern, effectively reducing dimensionality and the influence of illumination and noise; combined with 3DHOG, it forms a composite spatio-temporal feature that markedly improves the classification of emotional videos. Comparisons with typical feature extraction algorithms confirm the effectiveness of the proposed algorithm, and comparisons with other fusion methods verify the superiority of the proposed fusion approach.
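The ternary coding that underlies TSLTPM can be illustrated with a minimal sketch. The snippet below shows plain 2D local ternary coding for a single 3x3 patch, which is the building block the abstract describes; the paper's TSLTPM additionally extends this coding across adjacent frames and summarizes it with co-occurrence-based energy moments, which are not reproduced here. The threshold value and patch are illustrative assumptions.

```python
import numpy as np

def local_ternary_code(patch, t=5):
    # Local ternary coding of one 3x3 patch: each of the 8 neighbors is
    # coded +1 / 0 / -1 relative to the center pixel within tolerance t.
    # The dead zone of width 2t is what makes ternary patterns less
    # sensitive to noise than a plain binary (LBP-style) threshold.
    c = patch[1, 1]
    neighbors = np.delete(patch.flatten(), 4)  # drop the center, keep 8 neighbors
    code = np.zeros(8, dtype=int)
    code[neighbors >= c + t] = 1
    code[neighbors <= c - t] = -1
    return code

# Illustrative patch: center value 50, tolerance 5, so values in [45, 55]
# are coded 0 regardless of small noise around the center.
patch = np.array([[52, 60, 48],
                  [55, 50, 50],
                  [40, 57, 63]])
print(local_ternary_code(patch, t=5))
```

In a spatio-temporal extension such as the paper's, the same three-valued comparison is applied to neighbors drawn from the preceding and following frames as well, so the code captures pixel variation over time.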
The volume local binary pattern (VLBP), when applied to feature extraction from video frames, suffers from high feature dimensionality and weak robustness to illumination changes and noise. This study proposes a new feature descriptor, the temporal-spatial local ternary pattern moment (TSLTPM). The algorithm introduces a ternary pattern and extends it to the temporal-spatial domain to describe the variation of pixel values across adjacent frames; the texture feature is represented by the energy values of the ternary pattern matrices, computed from the gray-level co-occurrence matrix. Because TSLTPM describes only texture, it lacks edge and orientation information and therefore cannot fully characterize emotional videos. The 3D histogram of oriented gradients (3DHOG) feature is thus fused with TSLTPM, and the two are combined into a composite spatio-temporal feature. First, the emotional videos are preprocessed: five key frames are selected by K-means clustering and used as the facial-expression and body-posture emotion sequences. Second, TSLTPM and 3DHOG features are extracted from the expression and posture sequences, and the minimum Euclidean distance between the features of a test sequence and those of the labeled emotion training set is computed; these distances serve as independent evidence for constructing basic probability assignments (BPAs). Finally, the recognition result is obtained by combining the BPAs under the rules of D-S evidence theory. Experimental results on the bimodal facial-expression and body-posture emotion database (FABO) show that the composite spatio-temporal feature achieves good recognition performance. Average recognition rates of 83.06% and 94.78% are obtained for the single expression and gesture modalities, respectively. On expressions, the proposed feature outperforms VLBP, LBP-TOP, TSLTPM, and 3DHOG by 9.27%, 12.89%, 1.87%, and 1.13%, respectively; on gestures, it outperforms them by 24.61%, 27.55%, 1.18%, and 0.98%, respectively. After fusing the two modalities, the average recognition rate reaches 96.86%, higher than that of either single modality, which confirms the effectiveness of fusing expression and gesture for emotion recognition. The proposed TSLTPM extends VLBP, which effectively describes local features of video images, into a temporal-spatial local ternary pattern; it has low dimensionality and improved robustness to illumination and noise. The composite spatio-temporal feature combining TSLTPM and 3DHOG fully describes the effective information in emotional videos and enhances their classification performance. Comparisons with typical feature extraction algorithms demonstrate the effectiveness of the proposed algorithm, which proves well suited to recognizing emotion in videos with static backgrounds, and comparisons with other fusion methods verify the superiority of the proposed fusion approach.
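The decision-level fusion described above, turning per-class minimum Euclidean distances into BPAs and combining them with Dempster's rule, can be sketched as follows. The exponential distance-to-mass mapping and the restriction to singleton hypotheses (one mass per emotion class, no compound sets) are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def bpa_from_distances(dists):
    # Map per-class minimum Euclidean distances to a basic probability
    # assignment: smaller distance -> larger mass. The exp(-d) mapping is
    # one common choice; the paper's exact normalization may differ.
    sim = np.exp(-np.asarray(dists, dtype=float))
    return sim / sim.sum()

def dempster_combine(m1, m2):
    # Dempster's rule for BPAs over singleton hypotheses only:
    # masses on the same class multiply (B ∩ C = {class i}), while the
    # total conflicting mass K is renormalized away.
    joint = np.outer(m1, m2)
    agree = np.diag(joint)
    K = joint.sum() - agree.sum()
    return agree / (1.0 - K)

# Hypothetical 3-class example: the expression modality mildly prefers
# class 0 and the gesture modality strongly prefers class 0, so the
# fused evidence is more confident than either modality alone.
m_expr = bpa_from_distances([0.2, 0.5, 0.9])
m_gest = bpa_from_distances([0.1, 0.8, 1.2])
m_fused = dempster_combine(m_expr, m_gest)
print(m_fused.argmax())  # index of the winning emotion class
```

Combining the two modalities this way rewards classes on which both sources of evidence agree, which is consistent with the reported gain of the fused model (96.86%) over either single modality.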