王生进1,2, 豆朝鹏1,2, 樊懿轩1,2, 李亚利1,2 (1. 清华大学电子工程系, 北京 100084;
2. 北京信息科学与技术国家研究中心, 北京 100084)
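Person re-identification (Person ReID) uses computer vision to recognize a specific pedestrian who appears in the video of one camera when that pedestrian reappears at other times in cameras at different locations, or to retrieve a specific pedestrian from an image or video gallery. Person ReID research is driven by strong practical demand, has potential applications in public safety, new retail, and human-computer interaction, and is of significant theoretical value to machine learning and computer vision. Pedestrian images vary in complex ways in pose, viewpoint, illumination, and imaging quality, and are also subject to a certain degree of occlusion, so person ReID faces great technical challenges. In recent years, academia and industry have invested enormous manpower and resources in this problem and have made some progress: mean average precision (mAP) has improved considerably on multiple datasets, and some methods have begun to be deployed in practice. Nevertheless, current person ReID research still focuses mainly on clothing appearance features and lacks explicit multi-view observation and description of pedestrian appearance, which does not fully match the mechanism of human observation. This paper aims to break the existing setting of the person ReID task and form a comprehensive observational description of pedestrians. To advance person ReID research, and building on earlier person ReID work, we propose the concept of portrait interpretation calculation (ReID2.0). Portrait interpretation calculation observes and describes the static attributes and motion-like states of a portrait from multiple views through four aspects: appearance, posture, emotion, and intention. We construct a new benchmark dataset, Portrait250K, which contains 250 000 portraits with eight kinds of manually annotated labels corresponding to eight subtasks, and we propose a new evaluation metric. The proposed portrait interpretation calculation forms a comprehensive observational description of pedestrians from multi-view appearance information and provides a reference for further research on person ReID 2.0 and human-like agents.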
ReID2.0: from person ReID to portrait interpretation
Wang Shengjin1,2, Dou Zhaopeng1,2, Fan Yixuan1,2, Li Yali1,2 (1. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China;
2. Beijing National Research Center of Information Science and Technology, Beijing 100084, China)
Person re-identification (Person ReID) has attracted growing attention in computer vision. It identifies a target pedestrian in images and recognizes that pedestrian's reappearance at other times and places. Person ReID can also be used to retrieve a specific pedestrian from image or video databases. Person ReID research is driven by strong practical demand and has potential applications in public safety, new retail, and human-computer interaction. Conventional cooperative face recognition, as used in forensics, is one of the most powerful technical means of identity verification. However, it is constrained by strict requirements on imaging angle and distance, which has led to the development of semi-cooperative face recognition. In practice, public surveillance involves a large number of non-cooperative scenarios, in which the monitored subjects do not need to cooperate with the camera or even be aware that they are being filmed; in some extreme cases, suspects may deliberately cover their key biometric features. To track subjects over a wide range of space and time, public security surveillance urgently calls for person ReID. Supported by person ReID technology, a person seen only from behind can be linked to images in which the face is visible, so that facial features can then be analyzed. What distinguishes the person ReID task is that the recognition target is non-cooperative. Pedestrian images vary in complex ways in posture, viewing angle, illumination, and imaging quality, and are subject to a certain degree of occlusion. The key challenges concern learning: how to represent image features across time, and how to map raw image data to discriminative features.
In addition, data collection and labeling are more difficult for person ReID than for face recognition, and existing person ReID datasets are far less rich than face recognition datasets, a gap that needs to be bridged. Feature extractors commonly overfit severely. Cross-dataset heterogeneity remains a major challenge for models. Breakthroughs in person ReID call for interdisciplinary research. Rank-1 accuracy and mean average precision (mAP) have improved greatly on multiple datasets, and some methods have begun to be applied in practice. However, current person ReID research focuses mainly on clothing appearance and lacks explicit multi-view observation and description of pedestrian appearance, which is inconsistent with the mechanism of human observation. Human comprehensive perception can form an observational description of a target from multi-view appearance information. For example, when we meet a familiar friend on the street, we recognize them quickly and subconsciously even if we cannot see the face clearly. Beyond clothing, we also perceive richer contextual information, including gender, age, body shape, posture, facial expression, and mental state. This paper aims to break the existing setting of the person ReID task and form a comprehensive observational description of pedestrians. To advance person ReID research further, we propose portrait interpretation calculation (ReID2.0), building on prior person ReID research. It observes and describes the attributes and motion-like states of a portrait in four aspects: 1) appearance, 2) posture, 3) emotion, and 4) intention.
Here, appearance information describes the facial appearance and biological characteristics; posture information describes the static and sequential body shape characteristics of the human body; emotion information covers the facial expression and emotional state of a pedestrian; and intention information covers the behavioral description and intention prediction of a pedestrian. These four types of information are based on multi-view observation and perception of pedestrians and together construct, to a certain extent, a human-centered representation. Because labeling is difficult, no existing dataset meets the description requirements of these four aspects. We construct a benchmark dataset, Portrait250K, for portrait interpretation calculation. Portrait250K consists of 250 000 portraits from 51 movies and TV series from various countries. Each portrait has eight manually annotated labels corresponding to eight subtasks. The distributions of images and labels reflect real-world characteristics, such as a) long-tailed or unbalanced distributions, b) diverse occlusions, c) truncations, d) lighting, e) clothing, f) makeup, and g) changeable background scenes. To advance research on portrait interpretation calculation based on Portrait250K, we design a metric for each subtask and develop an integrated evaluation metric, portrait interpretation quality (PIQ), which balances the weights of the subtasks. Furthermore, we design a baseline method based on the multi-task learning paradigm. It focuses on multi-task representation learning, introduces a scheme named feature space separation, and proposes a simple loss.
The proposed portrait interpretation calculation forms a comprehensive observational description of pedestrians, which provides a reference for further research on person re-identification and human-like agents.
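
The abstract measures progress in person ReID by Rank-1 accuracy and mAP. As a minimal sketch of how these two retrieval metrics are commonly computed (not the paper's evaluation code, and not the PIQ metric, whose definition is not given here), the Python below takes, for each query, the gallery ranking expressed as a list of booleans marking identity matches. The function names and the input layout are illustrative assumptions. The sketch also assumes that any protocol-specific filtering, such as removing same-camera gallery images, has already been applied.

```python
from typing import List, Sequence, Tuple


def average_precision(ranked_matches: Sequence[bool]) -> float:
    """Average precision for one query.

    ranked_matches[i] is True when the gallery item at rank i + 1
    has the same identity as the query (hypothetical input layout).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            # Precision at this rank, accumulated only at matching positions.
            precision_sum += hits / rank
    # A query with no match in the gallery contributes 0 in this sketch.
    return precision_sum / hits if hits else 0.0


def rank1_and_map(all_ranked_matches: List[Sequence[bool]]) -> Tuple[float, float]:
    """Return (Rank-1 accuracy, mAP) over a list of per-query rankings."""
    n = len(all_ranked_matches)
    if n == 0:
        return 0.0, 0.0
    # Rank-1: fraction of queries whose top-ranked gallery item is a match.
    rank1 = sum(1 for m in all_ranked_matches if len(m) > 0 and m[0]) / n
    # mAP: mean of the per-query average precision values.
    mean_ap = sum(average_precision(m) for m in all_ranked_matches) / n
    return rank1, mean_ap


if __name__ == "__main__":
    # Two toy queries: matches at ranks 1 and 3, and a match at rank 2.
    queries = [[True, False, True], [False, True, False]]
    rank1, mean_ap = rank1_and_map(queries)
    print(f"Rank-1 = {rank1:.3f}, mAP = {mean_ap:.3f}")
```

Rank-1 only rewards a correct top result, whereas mAP reflects how highly all correct gallery items are ranked, which is why the two are usually reported together.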