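Abstract
Human pose estimation is a fundamental task in computer vision that aims to obtain the spatial coordinates of human body joints from a given image. It is widely used in action recognition, semantic segmentation, human-computer interaction, person re-identification, and other applications. With the rise of deep convolutional neural networks (DCNNs), human pose estimation has made significant progress in recent years. Despite these results, it remains a challenging task, especially in the presence of complex poses, variations in keypoint scale, and occlusion. To summarize recent developments in occluded human pose estimation, this paper systematically reviews representative methods published since 2018. According to the training data, model structure, and output of the neural network, the methods are divided into three categories: preprocessing based on data augmentation, structural design based on feature discrimination, and result optimization based on human body priors. Data-augmentation-based methods generate occluded data to enlarge the training set; feature-discrimination-based methods use attention mechanisms and similar techniques to suppress interfering features; methods based on human body structure priors use those priors to refine occluded poses. In addition, to better evaluate how methods perform under occlusion, we re-annotated the MSCOCO val2017 dataset. Finally, we compare and summarize the methods, clarifying their strengths and weaknesses when facing occlusion. On this basis, we also summarize and discuss why human pose estimation is difficult under occlusion and outline future development trends.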
A Comprehensive Review of Progress in Deep Learning-Based Occluded Human Pose Estimation

Xu Linhao, Zhao Lin1, Sun Xinxin1, Yan Kedong1, Li Guangyu1,2 (1. Nanjing University of Science and Technology; 2. China)

Human pose estimation (HPE) is a prominent research area in computer vision whose primary goal is to accurately localize annotated keypoints of the human body, such as the wrists and eyes. This fundamental task underpins numerous downstream applications, including human action recognition, human-computer interaction, pedestrian re-identification, video surveillance, and animation generation. Thanks to the powerful nonlinear mapping capabilities of convolutional neural networks (CNNs), HPE has advanced notably in recent years. Despite this progress, it remains a challenging task, particularly in the presence of complex postures, variations in keypoint scale, and occlusion. Notably, current heatmap-based methods suffer severe performance degradation under occlusion, and diverse human postures, complex backgrounds, and varied occluding objects can all contribute to this degradation. To survey recent advances in occlusion-aware human pose estimation, this paper examines both the difficulties of predicting occluded keypoints and the reasons behind them. The first identified challenge is the lack of annotated occluded data. Annotating occluded data is inherently more complex and demanding. Most prevalent HPE datasets focus on visible keypoints, and only a limited portion address and annotate occlusion scenarios. This shortage of annotated occluded data during training significantly compromises the model's robustness when body keypoints are partially or completely obstructed.
Feature confusion is a further challenge, notable in top-down human pose estimation methods, which crop the target person's region from the image using detected bounding boxes and predict keypoints within that crop. Under occlusion, these boxes may contain people other than the target person, which interferes with accurate keypoint prediction. The problem is particularly acute because the high feature similarity between the target person and the interfering individuals makes it difficult for the model to tell their features apart. As a result, keypoint accuracy suffers, which calls for strategies that address feature confusion in occluded scenes. Inference also becomes particularly difficult under substantial occlusion. Large occluded areas remove the contextual and structural information needed to predict occluded keypoints accurately. Without this context, the model cannot draw on adjacent keypoints to infer occluded ones, which can lead to missing keypoints or anomalous pose estimates.
Beyond analyzing these challenges, this paper systematically reviews representative methods published since 2018. Based on the training data, model structure, and output of the neural network, it categorizes the methods into three types: preprocessing based on data augmentation, structural design based on feature discrimination, and result optimization based on human body priors.
Preprocessing methods based on data augmentation generate occluded data to enlarge the training set, aiming to compensate for the lack of annotated occluded data and to alleviate performance degradation under occlusion. The key is to use synthetic methods to introduce occluding elements that simulate the occlusion seen in real-world settings. The model is thereby exposed to a more diverse set of occluded samples during training, which improves its robustness in complex environments and its generalization in practical applications.
Feature-discrimination-based methods use attention mechanisms and similar techniques to reduce interfering features. By strengthening features associated with the target person and suppressing those of non-target individuals, they mitigate the interference caused by feature confusion and help the model distinguish the target person's keypoint features from those of interfering individuals.
Methods based on human body structure priors optimize occluded poses by leveraging prior knowledge of human body structure.
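As a rough illustration of the occlusion-synthesis idea in the data-augmentation category, the sketch below paints a random rectangle onto an image and flags the keypoints it covers. It is a minimal toy under stated assumptions, not the procedure of any surveyed method: the function name `paste_random_occluder`, the list-of-rows image layout, the `(x, y, visible)` keypoint tuples, and the `max_frac` parameter are all hypothetical choices made for this example.

```python
import random


def paste_random_occluder(image, keypoints, max_frac=0.5, fill=0, rng=None):
    """Paint a random rectangle onto `image` and flag the keypoints it covers.

    image: list of rows, each a list of pixel values (H x W). Assumed layout.
    keypoints: list of (x, y, visible) tuples in pixel coordinates. Assumed layout.
    Returns (new_image, new_keypoints, box); the inputs are not modified.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])

    # Sample the occluder size, then a top-left corner that keeps it in bounds.
    box_w = rng.randint(1, max(1, int(width * max_frac)))
    box_h = rng.randint(1, max(1, int(height * max_frac)))
    x0 = rng.randint(0, width - box_w)
    y0 = rng.randint(0, height - box_h)
    x1, y1 = x0 + box_w, y0 + box_h

    # Copy the image and overwrite the occluded region with the fill value.
    new_image = [row[:] for row in image]
    for y in range(y0, y1):
        for x in range(x0, x1):
            new_image[y][x] = fill

    # Keep every keypoint location as a training target, but mark the ones
    # hidden by the occluder as not visible.
    new_keypoints = []
    for x, y, visible in keypoints:
        covered = x0 <= x < x1 and y0 <= y < y1
        new_keypoints.append((x, y, visible and not covered))
    return new_image, new_keypoints, (x0, y0, x1, y1)
```

The keypoint coordinates are deliberately left unchanged: the model is still asked to localize joints that the synthetic occluder hides, which is the point of training on occluded samples. A realistic pipeline would paste object or body-part crops rather than a flat rectangle.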
Human body structure priors provide valuable information about the structure of the human body. They serve as constraints that make the model more robust during inference. By incorporating them, the model gains a more informed understanding of the expected configuration of body parts, even under occlusion, so that its estimated poses adhere more closely to anatomically plausible configurations.
A comparative analysis highlights the strengths and limitations of each method in handling occlusion. Furthermore, the paper discusses the inherent challenges of occluded pose estimation and offers insights into future research directions.
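To make the notion of a structural prior acting as a constraint more concrete, the following sketch clamps each bone of a predicted pose to a maximum plausible length. It is only a hand-written toy: the joint names, the `BONES` table, the pixel limits, and the function `apply_bone_length_prior` are hypothetical, and the surveyed methods may encode priors quite differently.

```python
import math

# Hypothetical skeleton: (parent, child, maximum plausible bone length in pixels).
BONES = [
    ("shoulder", "elbow", 60.0),
    ("elbow", "wrist", 55.0),
]


def apply_bone_length_prior(pose, bones=BONES):
    """Pull each child joint back toward its parent when the bone is too long.

    pose: dict mapping joint name -> (x, y). Returns a new dict.
    Bones are processed in order, so parents should be listed before children.
    """
    refined = dict(pose)
    for parent, child, max_len in bones:
        px, py = refined[parent]
        cx, cy = refined[child]
        length = math.hypot(cx - px, cy - py)
        if length > max_len:
            # Shrink the bone along its current direction to the allowed length.
            scale = max_len / length
            refined[child] = (px + (cx - px) * scale, py + (cy - py) * scale)
    return refined
```

For example, a wrist predicted far from its elbow because of an occluder would be pulled back along the forearm direction, while joints that already satisfy the limits are left untouched.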