A review of rigid object pose estimation methods from a single image

Yang Buyi, Du Xiaoping, Fang Yuqiang, Li Peiyang, Wang Yang (Space Engineering University, Beijing 101416, China)

Abstract
Rigid object pose estimation, one of the key research directions in computer vision, aims to recover multiple degrees of freedom of a 3D object in a scene, such as positional translation and orientational rotation, and is increasingly applied in fields such as industrial robotic-arm manipulation, on-orbit servicing in space, autonomous driving, and augmented reality. This paper presents an overall review of the process of rigid object pose estimation from a single image, the classification of existing methods, and their remaining problems. By summarizing, classifying, and comparing the various methods that achieve multi-degree-of-freedom pose estimation from a single image of a rigid object, we focus on the general pose estimation pipeline, the evolution and categorization of estimation methods, commonly used datasets and evaluation criteria, and the current state of research and its prospects. At present, multi-degree-of-freedom rigid object pose estimation methods perform well mainly in single, specific application scenarios, and no method generalizes across composite scenes; moreover, the accuracy and efficiency of existing methods degrade significantly under varied lighting conditions, in cluttered and occluded scenes, and for rotationally symmetric objects or objects with high inter-class similarity. Considering these remaining problems and the boosting effect of current deep learning technology, we forecast the development trends of this field from six aspects: scene-level multi-object reasoning, self-supervised learning methods, front-end detection networks, lightweight and efficient network design, multi-information-fusion pose estimation frameworks, and image data representation spaces.
Keywords
Review of rigid object pose estimation from a single image

Yang Buyi, Du Xiaoping, Fang Yuqiang, Li Peiyang, Wang Yang (Space Engineering University, Beijing 101416, China)

Abstract
Rigid object pose estimation, one of the most fundamental and challenging problems in computer vision, has attracted considerable attention in recent years. Researchers seek methods that recover multiple degrees of freedom (DOFs) of rigid objects in a 3D scene, such as positional translation and orientational rotation, and that detect object instances from a large number of predefined categories in natural images. Meanwhile, advances in computer vision have brought considerable progress to the rigid object pose estimation task, which plays an important role in an increasing number of applications, e.g., robotic manipulation, on-orbit servicing in space, autonomous driving, and augmented reality. This work extensively reviews the literature on the development of rigid object pose estimation, spanning over a quarter century (from the 1990s to 2019). However, no dedicated review of rigid object pose estimation from a single image is currently available; most relevant studies focus on optimizing and improving a single class of methods and only briefly summarize related work in the field. To give domestic and overseas researchers a more comprehensive understanding of the rigid object pose estimation process, we systematically review its method classification and existing problems from a computer vision perspective. In this study, we summarize the multi-DOF pose estimation methods that operate on a single image of a rigid object, as proposed by major research institutions worldwide, and classify them by comparing their key intermediate representations. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to considerable breakthroughs in generic object pose estimation. Accordingly, this paper reviews more than two decades of object pose estimation at two levels: the traditional pose estimation period (e.g., feature-based, template matching-based, and 3D coordinate-based methods) and the deep learning-based pose estimation period (e.g., improved traditional methods and direct and indirect estimation methods). We discuss these methods according to their technical processes, focusing on crucial aspects such as the general pose estimation pipeline, methodological evolution and classification, commonly used datasets and evaluation criteria, and domestic and overseas research status and prospects. For each type of pose estimation method, we first identify the representation space of the image features used in the reviewed articles and use it to determine the specific category of the method. Second, we examine the estimation process to determine how image features are extracted, e.g., by handcrafted design or by a convolutional neural network. Third, we determine how the feature representation spaces are matched and summarize the matching process; finally, we identify the pose optimization method used in each article. With this procedure, all pose estimation methods to date can be finely classified. At present, multi-DOF rigid object pose estimation methods are effective mostly in single, specific application scenarios, and no universal method is available for composite scenes. When existing methods encounter varied lighting conditions, highly cluttered and occluded scenes, rotationally symmetric objects, or objects with high inter-class similarity, their estimation accuracy and efficiency drop significantly.
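
As a concrete illustration of the indirect estimation pipeline mentioned above (an intermediate representation is predicted first, and the pose is then recovered by matching it against the 3D model), the following minimal Python sketch assumes that a hypothetical keypoint-prediction network has already produced 2D keypoints for an object with a known CAD model; the 6-DOF pose is then recovered with OpenCV's PnP solver. This is an illustrative sketch of the general process, not a specific method from the reviewed literature.

```python
import numpy as np
import cv2

def estimate_pose_pnp(keypoints_3d, keypoints_2d, K):
    """Recover a 6-DOF rigid object pose from 2D-3D keypoint correspondences.

    keypoints_3d : (N, 3) points on the object's CAD model (object frame)
    keypoints_2d : (N, 2) pixel locations predicted by a keypoint network
    K            : (3, 3) camera intrinsic matrix
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        keypoints_3d.astype(np.float32),
        keypoints_2d.astype(np.float32),
        K.astype(np.float32),
        distCoeffs=None,
        flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP failed to recover a pose")
    R, _ = cv2.Rodrigues(rvec)      # rotation vector -> 3x3 rotation matrix
    return R, tvec.reshape(3)       # object-to-camera rotation and translation
```

In this decomposition, the keypoint set plays the role of the intermediate representation, the 2D-3D correspondence step is the matching stage, and the RANSAC-based PnP solution is the pose optimization stage.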
Although a given type of method and its improved variants can achieve considerable gains in accuracy, the results decline significantly when the method is applied to other scenarios or new datasets; when applied to highly occluded, complex scenes, its accuracy is frequently halved. Moreover, many pose estimation methods, particularly those based on deep learning, rely excessively on specialized datasets. After training, a neural network exhibits strong learning and reasoning capabilities on similar data, but when new datasets are introduced, its parameters must be retrained or fine-tuned on them. Consequently, a method that relies on a neural network framework for rigid object pose estimation requires large training datasets covering multiple scenarios to become practical, and even then its accuracy is generally not optimal. By contrast, manually designed methods can reach state-of-the-art single-class accuracy under specific single-scenario conditions, but their ability to transfer to new applications is insufficient. When confronted with such problems, researchers typically choose between two solutions. The first is to apply deep learning, using its powerful feature abstraction and data representation capabilities to improve the overall usability of the estimation method and to optimize its accuracy. The other is to improve handcrafted pose estimation methods by designing intermediate representations with greater representational capability, thereby improving applicability while preserving accuracy. This history helps readers build a complete knowledge hierarchy and identify future directions in this rapidly developing field. Considering the existing problems together with the boosting effect of current deep learning technologies, we introduce six aspects to be considered, namely, scene-level multi-object reasoning, self-supervised learning methods, front-end detection networks, lightweight and efficient network designs, multi-information-fusion pose estimation frameworks, and image data representation spaces, and we discuss each of them from the perspective of development trends in multi-DOF rigid object pose estimation. Multi-DOF pose estimation from a single image of a rigid object, based on computer vision technology, has high research value in many fields; however, further research is needed to address the limitations of current methods and their application scenarios.
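
To make the accuracy comparisons above concrete, the following sketch shows the average distance (ADD) metric, one of the evaluation criteria commonly used on rigid object pose datasets such as LINEMOD: a predicted pose is usually accepted when the mean distance between model points transformed by the estimated and ground-truth poses is below 10% of the object diameter. The function names and array shapes here are illustrative assumptions rather than definitions from the reviewed papers.

```python
import numpy as np

def add_metric(R_est, t_est, R_gt, t_gt, model_points):
    """Average distance (ADD) between model points under the estimated
    and ground-truth poses; model_points has shape (N, 3)."""
    pts_est = model_points @ R_est.T + t_est
    pts_gt = model_points @ R_gt.T + t_gt
    return np.linalg.norm(pts_est - pts_gt, axis=1).mean()

def pose_is_correct(add_value, model_diameter, threshold=0.1):
    """Common acceptance rule: ADD below 10% of the object's diameter."""
    return add_value < threshold * model_diameter
```

For rotationally symmetric objects, one of the difficult cases noted above, the symmetric variant ADD-S, which matches each transformed model point to its closest counterpart instead of the corresponding one, is typically used instead.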
Keywords
