马利庄1, 吴飞2, 毛启容3, 王鹏杰4, 陈玉珑1(1.上海交通大学, 上海 200240;2.浙江大学, 杭州 310058;3.江苏大学, 镇江 212013;4.大连民族大学, 大连 116600)
Visual recognition technologies for complex scenarios analysis
Ma Lizhuang1, Wu Fei2, Mao Qirong3, Wang Pengjie4, Chen Yulong1 (1. Shanghai Jiao Tong University, Shanghai 200240, China; 2. Zhejiang University, Hangzhou 310058, China; 3. Jiangsu University, Zhenjiang 212013, China; 4. Dalian Nationalities University, Dalian 116600, China)
Public security and social governance are essential to national development. Preventing large-scale riots in communities and various urban crimes, and governing society across space and time during events such as coronavirus disease 2019 (COVID-19), call for highly accurate human identity verification, highly efficient human behavior analysis, and crowd flow tracking and tracing. The core of the challenge is to use computer vision to extract visual information in complex scenarios and to fully express, identify, and understand the relationship between human behavior and scenes, so as to raise the level of social administration and governance. Visual recognition technologies for complex scenarios can improve the efficiency of social intelligence and accelerate intelligent social governance. Human recognition faces three main challenges: 1) diverse attacks, such as mask and occlusion attacks, threaten the security of human identity recognition; 2) large spans of time and space reduce the accuracy of cross-age face recognition, especially for retrieval at the scale of tens of millions; 3) complex and changeable scenarios require a highly robust system that adapts to diverse environments. It is therefore necessary to develop remote human identity verification technologies with a high degree of security, together with accurate face recognition, human behavior analysis, and scene semantic recognition. The analysis of individual behavior and of group interaction trends is a key component of human-centered visual understanding in complex scenarios. Specifically, individual behavior analysis mainly includes video-based pedestrian re-identification and video-based action recognition, while group interaction recognition is mainly studied through video question answering and video dialogue.
Video networks record image information about individuals and groups from multiple cameras. Multi-camera research on human behavior covers group segmentation, group tracking, group behavior analysis, and abnormal behavior detection. However, the individual behavior and group interaction recorded by multiple cameras in real scenes are extremely complex, and improving multi-camera, multi-target behavior recognition through integrated modeling of real scene structure, individual behavior, and group interaction remains a great challenge. Recognition of individual and group behavior in video networks depends mainly on the captured visual information about scenes, individuals, and groups. Nonetheless, individual behavior analysis and group interaction recognition in complex scenarios also require human knowledge and prior knowledge that visual information alone does not provide. Specifically, crowdsourced data have improved the performance of visual computing tasks such as visual question answering, visual dialogue, and vision-and-language navigation. The knowledge embedded in crowdsourced data can be used to build data-driven machine learning models that make comprehensive use of knowledge and priors in individual behavior analysis and group interaction recognition, establishing a new approach to visual computing that is both data-driven and knowledge-guided. In addition, facial expression can be regarded as the micro-motion of the human face, just as speech is the voice of language. Speech emotion recognition can capture and understand human emotions, which helps to support human-machine collaborative learning. It is therefore important to study human visual recognition technology in greater depth. Current research focuses on facial expression recognition, speech emotion recognition, expression synthesis, and speech emotion synthesis.
We review research on real-time human identification in complex scenarios, the analysis and understanding of individual behavior and group interaction, visual and speech emotion recognition and synthesis, and machine learning that makes comprehensive use of knowledge and priors. We also discuss research and application scenarios for visual capabilities in complex scenarios, summarize the current state of the field, and predict frontier technologies and development trends. Human visual recognition technology will harness visual capabilities to recognize the relationships among humans, behavior, and scenes. There remains room for further improvement in standard data construction, model computing resources, and model robustness and interpretability.