MED: multimodal emotion dataset in the wild
2020, Vol. 25, No. 11, Pages: 2349-2360
Received: 2020-05-28
Revised: 2020-09-07
Accepted: 2020-09-14
Published in print: 2020-11-16
DOI: 10.11834/jig.200215
Objective
Research on emotion recognition has long aimed to help systems respond to user needs in a more appropriate way during human-computer interaction. However, its performance in real-world applications remains poor, mainly because large-scale multimodal datasets that resemble real application environments are lacking. Existing in-the-wild multimodal emotion datasets are scarce, cover a limited number of subjects, and use only a single language.
Method
To meet the data requirements of deep learning algorithms, this paper collects, annotates, and prepares for public release a new video dataset captured under natural conditions, the multimodal emotion dataset (MED). Collectors first manually extracted video clips from movies, TV series, and variety shows; annotators then labeled the extracted clips, yielding 1 839 video clips. Valid video frames are obtained from these clips through person detection, face detection, and related operations. The dataset covers seven basic emotions and three modalities: facial expression, body posture, and emotional speech.
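The frame-filtering step described above (extract frames from each clip, then keep only frames in which a person is detected) could look roughly like the following sketch. It uses OpenCV's built-in HOG pedestrian detector as a stand-in for the paper's detector (the paper's pipeline relies on YOLOv3 per the reference list), and the file paths and frame step are hypothetical.

```python
import os
import cv2  # OpenCV; default HOG people detector used here instead of the paper's YOLOv3

# Default HOG + linear SVM people detector shipped with OpenCV.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def extract_valid_frames(clip_path, out_dir, frame_step=5):
    """Save only the frames of a clip in which at least one person is detected."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(clip_path)
    idx = kept = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:
            rects, _ = hog.detectMultiScale(frame, winStride=(8, 8))
            if len(rects) > 0:  # frame contains a person -> keep as a valid frame
                cv2.imwrite(os.path.join(out_dir, f"{kept:05d}.jpg"), frame)
                kept += 1
        idx += 1
    cap.release()
    return kept

# Hypothetical usage on one manually cut clip.
# extract_valid_frames("clips/anger_0001.mp4", "frames/anger_0001")
```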
Result
To provide benchmarks for emotion recognition, the experimental section evaluates the MED dataset with machine learning and deep learning methods. A comparison experiment with the CK+ dataset first shows that algorithms developed on data collected in laboratory environments are difficult to apply in practice. Baseline experiments are then conducted for each modality, and the corresponding baselines are reported. Finally, multimodal fusion improves accuracy by 4.03% over unimodal facial expression recognition.
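One common and minimal way to combine per-modality predictions, as in the fusion result quoted above, is late fusion: averaging the class probabilities that the face, posture, and speech models output for a clip. The sketch below only illustrates that idea; it is not the exact fusion scheme used in the paper, and the probability values are made up.

```python
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]

def late_fusion(prob_face, prob_pose, prob_speech, weights=(0.5, 0.25, 0.25)):
    """Weighted average of per-modality class probabilities for one clip."""
    probs = np.stack([prob_face, prob_pose, prob_speech])   # shape (3, 7)
    fused = np.average(probs, axis=0, weights=weights)      # shape (7,)
    return EMOTIONS[int(np.argmax(fused))], fused

# Made-up per-modality softmax outputs for a single clip.
face = np.array([0.10, 0.05, 0.05, 0.40, 0.20, 0.10, 0.10])
pose = np.array([0.05, 0.05, 0.05, 0.55, 0.15, 0.05, 0.10])
speech = np.array([0.10, 0.05, 0.10, 0.35, 0.25, 0.05, 0.10])
label, fused = late_fusion(face, pose, speech)
print(label, fused.round(3))
```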
Conclusion
The multimodal emotion dataset MED expands the existing multimodal databases collected in real environments, advancing research on cross-cultural (cross-lingual) emotion recognition and on perception analysis of different emotion assessments, and improving the performance of automatic affective computing systems in real-world applications.
Objective
Emotion recognition, or affective computing, is crucial in various human-computer interactions, including interaction with artificial intelligence (AI) assistants such as home robots, Google Assistant, Alexa, and even self-driving cars. AI assistants or other forms of technology can also be used to identify a person's emotional or cognitive state to help people live a happy, healthy, and productive life and even to assist with mental health treatment. Adding emotion recognition to human-machine systems can help the computer recognize the emotions and intentions of users when they speak and give an appropriate response. To date, computers capture and interpret user emotions and intentions inaccurately, mainly because the datasets used to develop intelligent systems differ from real conditions and because data are rarely collected in actual application environments, which reduces system robustness. The simple datasets collected in laboratory environments, which rely on contrived methods of emotion induction, are typically characterized by a plain background and uniform, strong illumination, and the resulting emotion displays are exaggerated and unnatural. User age, gender, and ethnicity, as well as the complexity of the application environment and the diversity of camera angles encountered in actual use, are problems that must be addressed when developing a system. Therefore, systems developed in laboratory environments are difficult to apply in the real world.
Method
Creating a dataset from the real environment can resolve the inconsistency between the datasets used during development and the real-world application. Wild datasets, especially multimodal emotion datasets containing dynamic information, are limited. Therefore, this paper collects and annotates a new multimodal emotion dataset (MED) in the real environment. First, five collectors watched videos from various data sources with different content, such as TV series, movies, talk shows, and live broadcasts, and extracted over 2 500 video clips containing emotional states. Second, the frames of each video are extracted and saved in a folder to preserve the video sequence. A pedestrian detection model is used to obtain valid video frames because only some frames contain valid person or face information; frames without people are considered invalid and are discarded. The resulting frames containing person information can be used to investigate postural emotional cues, such as limb movements. Posture can be used to assess the emotional state of a person when the face is occluded or the person moves over a large range, whereas facial expressions account for a large proportion of emotional judgment. Third, two methods are used for face detection. Finally, annotators manually labeled the video sequences of detected people and faces, even though the collectors had already cut the videos according to emotional state. Given that humans show deviations in emotional judgment and each person has a different sensitivity to emotion, a crowdsourcing method was used for annotation; crowdsourcing has been used in the collection of many datasets, such as ImageNet and RAF. Fifteen taggers with professional training in emotional information independently tagged all the video clips. A total of 1 839 video clips covering seven types of emotion were obtained after annotation.
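A minimal sketch of how the 15 independent annotations per clip could be aggregated into a single label by majority vote is given below, with clips lacking a clear majority set aside for review. The agreement threshold and the inline example votes are assumptions for illustration, not details from the paper.

```python
from collections import Counter

def aggregate_labels(annotations, min_votes=8):
    """Majority-vote a list of per-annotator emotion labels for one clip.

    Returns the winning label, or None if no label reaches min_votes
    (such clips would be re-checked or dropped).
    """
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= min_votes else None

# Hypothetical annotations for one clip from the 15 trained taggers.
clip_votes = ["happiness"] * 9 + ["surprise"] * 4 + ["neutral"] * 2
print(aggregate_labels(clip_votes))  # -> happiness
```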
Result
Different divisions of the dataset are presented in the study. The dataset is split into training and validation sets at a ratio of 0.65:0.35, following the division of acted facial expressions in the wild (AFEW). The amount of data for each type of emotion in the AFEW and MED datasets is then compared and presented as a graph; MED contains more samples of each emotion category than AFEW. The paper evaluates the dataset with a large number of deep learning and machine learning algorithms and provides baselines for each modality. First, classic machine learning methods, such as local binary patterns (LBP), histogram of oriented gradients (HOG), and Gabor wavelets, are applied to obtain the baseline on the CK+ dataset. When the same methods are applied to the MED dataset, accuracy drops by more than 50%. Data collected in the real environment are complicated, and algorithms developed on laboratory datasets are unsuitable for the real environment; hence, creating a dataset in the real environment is necessary. The comparison of the AFEW and MED datasets verifies that the MED data are reasonable and effective. The baseline of facial expression recognition and the baselines of the two other modalities are also provided. The results indicate that the other modalities can serve as an auxiliary means of comprehensively assessing emotions, especially when the face is occluded or face information is unavailable. Finally, the accuracy of emotion recognition improves by 4.03% through multimodal fusion.
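As a rough illustration of the classic baselines mentioned above, the sketch below extracts a uniform LBP histogram from each (already cropped, grayscale) face image and trains a linear SVM, using scikit-image and scikit-learn. The parameter values and the omitted data-loading step are assumptions, not the paper's exact settings.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

P, R = 8, 1          # LBP neighbourhood: 8 sampling points, radius 1
N_BINS = P + 2       # "uniform" LBP yields P + 2 distinct codes

def lbp_histogram(gray_face):
    """Uniform LBP code histogram of a grayscale face crop."""
    codes = local_binary_pattern(gray_face, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=N_BINS, range=(0, N_BINS), density=True)
    return hist

# X_train / X_test: lists of grayscale face crops (NumPy arrays); y_*: emotion labels.
# Loading them from the dataset folders is omitted here.
def run_lbp_svm_baseline(X_train, y_train, X_test, y_test):
    f_train = np.array([lbp_histogram(img) for img in X_train])
    f_test = np.array([lbp_histogram(img) for img in X_test])
    clf = SVC(kernel="linear", C=1.0).fit(f_train, y_train)
    return accuracy_score(y_test, clf.predict(f_test))
```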
Conclusion
MED is a multimodal real-world dataset that expands the existing multimodal datasets. Researchers can develop deep learning algorithms by combining MED with other datasets to form a large multimodal database that contains multiple languages and ethnicities, promote cross-cultural emotion recognition and perception analysis of different emotion evaluations, and improve the performance of automatic affective computing systems in real applications.
Ahonen T, Hadid A and Pietikäinen M. 2004. Face recognition with local binary patterns//Proceedings of the 8th European Conference on Computer Vision. Prague: Springer: 469-481[DOI:10.1007/978-3-540-24670-1_36]
Benitez-Quiroz C F, Srinivasan R and Martinez A M. 2016. EmotioNet: an accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE: 5562-5570[DOI:10.1109/CVPR.2016.600]
Ben-Younes H, Cadene R, Thome N and Cord M. 2019. BLOCK: bilinear superdiagonal fusion for visual question answering and visual relationship detection//Proceedings of the AAAI Conference on Artificial Intelligence. Honolulu, USA: AAAI: 8102-8109[DOI:10.1609/aaai.v33i01.33018102]
Burkhardt F, Paeschke A, Rolfes M, Sendlmeier W and Weiss B. 2005. A database of german emotional speech//Proceedings of 2005-Eurospeech, 9th European Conference on Speech Communication and Technology. Lisbon, Portugal: ISCA: 1517-1520
Cho K, Van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H and Bengio Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation//Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha: Association for Computational Linguistics: 1724-1734[DOI:10.3115/v1/D14-1179]
Cortes C and Vapnik V N. 1995. Support-vector networks. Machine Learning, 20(3):273-297[DOI:10.1023/A:1022627411411]
Dalal N and Triggs B. 2005. Histograms of oriented gradients for human detection//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Diego: IEEE: 886-893[DOI:10.1109/CVPR.2005.177]
Dhall A, Goecke R, Lucey S and Gedeon T. 2012. Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia, 19(3):34-41[DOI:10.1109/MMUL.2012.26]
Du S C, Tao Y and Martinez A M. 2014. Compound facial expressions of emotion. Proceedings of the National Academy of Sciences of the United States of America, 111(15): E1454-E1462[DOI:10.1073/pnas.1322355111]
Eyben F, Weninger F, Gross F and Schuller B. 2013. Recent developments in openSMILE, the Munich open-source multimedia feature extractor//Proceedings of the 21st ACM International Conference on Multimedia. New York: ACM: 835-838[DOI:10.1145/2502081.2502224]
Goodfellow I J, Erhan D, Carrier P L, Courville A, Mirza M, Hamner B, Cukierski W, Tang Y C, Thaler D, Lee D H, Zhou Y B and Ramaiah C. 2013. Challenges in representation learning: a report on three machine learning contests//Proceedings of the 20th International Conference on Neural Information Processing. Daegu, Korea: Springer: 117-124[DOI:10.1007/978-3-642-42051-1_16]
Gunes H and Piccardi M. 2006. A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior//Proceedings of the 18th International Conference on Pattern Recognition. Hong Kong: IEEE: 1148-1153[DOI:10.1109/ICPR.2006.39]
Hara K, Kataoka H and Satoh Y. 2018. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE: 6546-6555[DOI:10.1109/CVPR.2018.00685]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE: 770-778[DOI:10.1109/CVPR.2016.90]
Hu J, Shen L, Albanie S, Sun G and Wu E H. 2020. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(8):2011-2023[DOI:10.1109/TPAMI.2019.2913372]
Hu P, Cai D Q, Wang S D, Yao A B and Chen Y R. 2017. Learning supervised scoring ensemble for emotion recognition in the wild//Proceedings of the 19th ACM International Conference on Multimodal Interaction. New York: ACM: 553-560[DOI:10.1145/3136755.3143009]
Huang G, Liu Z, Van Der Maaten L and Weinberger K Q. 2017. Densely connected convolutional networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 4700-4708[DOI:10.1109/CVPR.2017.243]
Kazemi V and Sullivan J. 2014. One millisecond face alignment with an ensemble of regression trees//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE: 1867-1874[DOI:10.1109/CVPR.2014.241]
Levi G and Hassner T. 2015. Emotion recognition in the wild via convolutional neural networks and mapped binary patterns//Proceedings of 2015 ACM on International Conference on Multimodal Interaction. New York: ACM: 503-510[DOI:10.1145/2818346.2830587]
Li S, Deng W H and Du J P. 2017a. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE: 2852-2861[DOI:10.1109/CVPR.2017.277]
Li Y, Tao J H, Chao L L, Bao W and Liu Y Z. 2017b. CHEAVD:a Chinese natural emotional audio-visual database. Journal of Ambient Intelligence and Humanized Computing, 8(6):913-924[DOI:10.1007/s12652-016-0406-z]
Liu C H, Tang T H, Lv K and Wang M H. 2018. Multi-feature based emotion recognition for video clips//Proceedings of the 20th ACM International Conference on Multimodal Interaction. New York: ACM: 630-634[DOI:10.1145/3242969.3264989]
Liu C J and Wechsler H. 2002. Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing, 11(4):467-476[DOI:10.1109/TIP.2002.999679]
Lucey P, Cohn J F, Kanade T, Saragih J, Ambadar Z and Matthews I. 2010. The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops. San Francisco: IEEE: 94-101[DOI:10.1109/CVPRW.2010.5543262]
Lyons M, Akamatsu S, Kamachi M and Gyoba J. 1998. Coding facial expressions with Gabor wavelets//Proceedings of the 3rd IEEE International Conference on Automatic Face and Gesture Recognition. Nara: IEEE: 200-205[DOI:10.1109/AFGR.1998.670949]
McGilloway S, Cowie R, Douglas-Cowie E, Gielen S, Westerdijk M and Stroeve S. 2000. Approaching automatic recognition of emotion from voice: a rough benchmark//Proceedings of 2000 ISCA Workshop on Speech and Emotion: A Conceptual Framework for Research. Belfast: ISCA: 207-212
Mollahosseini A, Hasani B and Mahoor M H. 2019. AffectNet:a database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1):18-31[DOI:10.1109/TAFFC.2017.2740923]
Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement[EB/OL].[2020-09-15]. https://arxiv.org/pdf/1804.02767.pdf
Rothe R, Timofte R and Van Gool L. 2015. DEX: deep expectation of apparent age from a single image//Proceedings of 2015 IEEE International Conference on Computer Vision Workshop (ICCVW). Santiago: IEEE: 252-257[DOI:10.1109/ICCVW.2015.41]
Simonyan K and Zisserman A. 2014a. Very deep convolutional networks for large-scale image recognition[EB/OL].[2020-05-15]. http://arxiv.org/pdf/1409.1556v6.pdf
Simonyan K and Zisserman A. 2014b. Two-stream convolutional networks for action recognition in videos//Proceedings of the 27th International Conference on Neural Information Processing Systems. Cambridge: ACM: 568-576
Tao J H, Liu F Z, Zhang M and Jia H B. 2008. Design of speech corpus for mandarin text to speech//Blizzard Challenge Workshop. Brisbane: Interspeech: 1-4
Valstar M F and Pantic M. 2010. Induced disgust, happiness and surprise: an addition to the MMI facial expression database//Proceedings of the 3rd International Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect. Paris, France: [s.n.]: 65-70
Yan J W, Zheng W M, Cui Z, Tang C G, Zhang T and Zong Y. 2018. Multicue fusion for emotion recognition in the wild. Neurocomputing, 309:27-35[DOI:10.1016/j.neucom.2018.03.068]
Yin F, Lu X J, Li D and Liu Y L. 2016. Video-based emotion recognition using CNN-RNN and C3D hybrid networks//Proceedings of the 18th ACM International Conference on Multimodal Interaction. New York: ACM: 445-450[DOI:10.1145/2993148.2997632]
You D, Hamsici O C and Martinez A M. 2011. Kernel optimization in discriminant analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):631-638[DOI:10.1109/TPAMI.2010.173]
Zhang K P, Zhang Z P, Li Z F and Qiao Y. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499-1503[DOI:10.1109/LSP.2016.2603342]
Zhu Y, Lan Z Z, Newsam S and Hauptmann A. 2018. Hidden two-stream convolutional networks for action recognition//Proceedings of the 14th Asian Conference on Computer Vision. Perth: Springer: 363-378[DOI:10.1007/978-3-030-20893-6_23]