
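Tao Jianhua1, Fan Cunhang2, Lian Zheng3, Lyu Zhao2, Shen Ying4, Liang Shan3 (1. Department of Automation, Tsinghua University; 2. School of Computer Science and Technology, Anhui University; 3. Institute of Automation, Chinese Academy of Sciences; 4. Tongji University)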

Abstract
The development of multimodal sentiment recognition and understanding

(1. Tsinghua University; 2. Anhui University; 3. Institute of Automation, Chinese Academy of Sciences; 4. Tongji University)

Affective computing is an important branch of artificial intelligence that aims to build computational systems able to automatically perceive, recognize, understand, and respond to human emotions. It sits at the intersection of several disciplines, including computer science, neuroscience, psychology, and social science. Deep emotional understanding and interaction enable computers to better understand and respond to human emotional needs and to provide personalized interaction and feedback based on emotional states, thereby improving the human-computer interaction experience. The technology has a wide range of applications in areas such as intelligent assistants, virtual reality, and smart healthcare. Relying on a single modality, such as the speech signal or video alone, does not match the way humans perceive emotions, and recognition accuracy drops rapidly in the presence of interference. Multimodal emotion understanding and interaction technologies aim to jointly model information from audio, video, and physiological signals to achieve more accurate emotion understanding. This is a fundamental technology and an important prerequisite for natural, human-like, and personalized human-computer interaction, and it holds significant value for the era of intelligence and digitalization. To fully exploit the complementary nature of different modalities, multimodal fusion for sentiment recognition has received increasing attention from researchers. This paper reviews the current state of research on multimodal affective computing along three dimensions: an overview of multimodal sentiment recognition, multimodal sentiment understanding, and the detection and assessment of emotional disorders such as depression. The overview of emotion recognition covers academic definitions, mainstream datasets, and international competitions.
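As a minimal sketch of the complementarity argument above, the following Python example shows decision-level (late) fusion, one simple way to combine modalities. It is an illustration under stated assumptions, not a method taken from this paper: the label set, the function names `late_fusion` and `predict`, the equal weights, and all probability values are hypothetical.

```python
from typing import Dict, List

# Hypothetical label set, chosen only for illustration.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def late_fusion(modality_probs: Dict[str, List[float]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-modality class probabilities."""
    total = sum(weights[m] for m in modality_probs)
    if total <= 0:
        raise ValueError("modality weights must sum to a positive value")
    fused = [0.0] * len(EMOTIONS)
    for modality, probs in modality_probs.items():
        if len(probs) != len(EMOTIONS):
            raise ValueError(f"{modality}: expected {len(EMOTIONS)} probabilities")
        for i, p in enumerate(probs):
            fused[i] += weights[modality] * p / total
    return fused


def predict(probs: List[float]) -> str:
    """Return the label with the highest probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return EMOTIONS[best]


# Made-up scores: the audio channel is degraded by noise and is nearly
# uninformative, while video and physiological signals agree.
example = {
    "audio": [0.30, 0.25, 0.25, 0.20],
    "video": [0.05, 0.80, 0.10, 0.05],
    "physio": [0.10, 0.60, 0.20, 0.10],
}
weights = {"audio": 1.0, "video": 1.0, "physio": 1.0}  # assumed equal weights

fused = late_fusion(example, weights)
print("audio only:", predict(example["audio"]))
print("fused:     ", predict(fused))
```

With these made-up numbers, the noisy audio channel alone favors a different label than the fused estimate, which is dominated by the two modalities that agree. Feature-level or model-level fusion would need a learned model and is beyond the scope of this sketch.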
In recent years, large language models (LLMs) have demonstrated excellent modeling capabilities and achieved great success in natural language processing thanks to their strong language understanding and reasoning abilities. LLMs have attracted widespread attention for their ability to handle a variety of complex tasks from prompts alone, in few-shot or zero-shot settings. Through methods such as self-supervised learning and contrastive learning, large pre-trained models can learn more expressive multimodal representations that capture the correlations between different modalities and emotional information. Multimodal sentiment recognition and understanding are discussed in terms of emotion feature extraction, multimodal fusion, and the representations and models used for sentiment recognition in the context of large pre-trained models. With the rapid development of society, people face increasing pressure, which can lead to depression, anxiety, and other negative emotions. Those who remain in a prolonged state of depression and anxiety are more likely to develop mental illnesses. Depression is a common and serious condition, with symptoms including low mood, poor sleep quality, loss of appetite, fatigue, and difficulty concentrating. Depression not only harms individuals and families but also causes significant economic losses to society. The discussion of emotional disorder detection starts from specific applications, takes depression as the most common emotional disorder, and analyzes the latest developments and trends from the perspectives of assessment and intervention. This paper also provides a detailed comparison of the state of affective computing research in China and abroad, and offers an outlook on future development trends. We believe that scalable emotion feature design and transfer learning methods based on large-scale models will be the main directions of future development.
The main challenge in multimodal emotion recognition is data scarcity: there is not enough data available to build and explore complex models, which makes it difficult to train robust deep neural network models. To address this issue, it is necessary to construct large-scale multimodal emotion databases and to explore transfer learning methods based on large models. Transferring knowledge learned from unsupervised tasks or other tasks to emotion recognition can alleviate the problem of limited data resources. Because emotions are inherently fuzzy, representing ambiguous emotional states with explicit discrete or dimensional labels has limitations. Enhancing the interpretability of predictions to improve the reliability of recognition results is also an important direction for future research. The role of multimodal affective computing in addressing emotional disorders such as depression and anxiety is increasingly prominent. Future research can be conducted in the following three areas. First, multimodal emotional disorder datasets need to be constructed, which can provide a solid foundation for the automatic recognition of emotional disorders. However, this field still needs to address challenges such as data privacy and ethics. In addition, issues such as designing targeted interview questions, ensuring patient safety during data collection, and augmenting samples algorithmically are still worth exploring. Second, more effective algorithms need to be developed. Emotional disorders fall within the psychological domain, but they can also affect patients' physiological features, such as voice and body movements. This psychological-physiological correlation is worth studying in depth. Improving the accuracy of algorithms for multimodal emotional disorder recognition is therefore a pressing research issue. Third, intelligent psychological intervention systems need to be designed and implemented.
Questions such as how to effectively simulate the counseling process of a psychologist, how to promptly receive users' emotional feedback, and how to generate empathetic conversations require further study.