
Review of rigid object pose estimation from a single image

Yang Buyi, Du Xiaoping, Fang Yuqiang, Li Peiyang, Wang Yang (Space Engineering University, Beijing 101416, China)

Abstract

Rigid object pose estimation, one of the most fundamental and challenging problems in computer vision, has elicited considerable attention in recent years. Researchers seek methods that recover multiple degrees of freedom (DOFs) for rigid objects in a 3D scene, such as position translation and azimuth rotation, and that detect object instances from a large number of predefined categories in natural images. At the same time, advances in computer vision have brought considerable progress to the rigid object pose estimation task, which underpins a growing number of applications, e.g., robotic manipulation, on-orbit servicing in space, autonomous driving, and augmented reality. This work extensively reviews the papers that trace the development history of rigid object pose estimation, spanning over a quarter century (from the 1990s to 2019). No review of rigid object pose estimation from a single image exists at present: most relevant studies focus only on optimizing and improving a single class of pose estimation method and then briefly summarize related work in the field. To give domestic and overseas researchers a more comprehensive understanding of the rigid body pose estimation process, we systematically review the classification of existing methods and the open problems from the perspective of computer vision. In this study, we summarize each multi-DOF pose estimation method that uses a single image of a rigid body target, drawing on work from major research institutions worldwide, and we classify the various pose estimation methods by comparing their key intermediate representations. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to considerable breakthroughs in the field of generic object pose estimation.
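The "multiple DOFs" recovered by these methods are, in the rigid case, a rotation and a translation. A minimal pure-Python sketch (the function names here are illustrative, not from the paper) shows how an azimuth rotation plus a position translation moves a 3D model point into the scene frame:

```python
import math

def rot_z(yaw):
    """3x3 rotation matrix for an azimuth (yaw) rotation about the z-axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0],
            [s,  c, 0.0],
            [0.0, 0.0, 1.0]]

def apply_pose(R, t, p):
    """Transform a 3D model point p into the scene frame: R @ p + t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

# A 90-degree azimuth rotation plus a 1-unit translation along x:
R = rot_z(math.pi / 2)
t = [1.0, 0.0, 0.0]
print(apply_pose(R, t, [1.0, 0.0, 0.0]))  # the x-axis point maps to ~[1, 1, 0]
```

A full 6-DOF pose generalizes this to an arbitrary 3D rotation; single-image methods differ mainly in how they infer `R` and `t` from pixels.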
This paper provides an extensive review of 20 years of object pose estimation techniques at two levels: the traditional pose estimation period (e.g., feature-based, template matching-based, and 3D coordinate-based methods) and the deep learning-based pose estimation period (e.g., improved traditional methods and direct and indirect estimation methods). We discuss these methods along their technical pipelines, focusing on crucial aspects such as the general process of pose estimation, methodology evolution and classification, commonly used datasets and evaluation criteria, and the overseas and domestic research status and prospects. For each type of pose estimation method, we first identify the representation space of the image features reported in each article and use it to determine the specific class of the method. We then analyze the estimation process to determine how image features are extracted, e.g., by handcrafted design or by a convolutional neural network. In the third step, we determine how the feature representation space is matched and summarize the matching process; finally, we identify the pose optimization method used in each article. In this way, all existing pose estimation methods can be finely classified. At present, multi-DOF rigid object pose estimation methods are mostly effective only in a single specific application scenario; no universal method is available for composite scenes. When existing methods encounter varying lighting conditions, highly cluttered scenes, objects with rotational symmetry, or similar targets across classes, estimation accuracy and efficiency drop significantly. Although a given method and its improved versions can achieve considerable accuracy gains, the results decline significantly when the method is applied to other scenarios or new datasets; when applied to highly occluded complex scenes, its accuracy is frequently halved.
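One of the evaluation criteria commonly used in this literature is the average 3D distance between model points transformed by the ground-truth pose and by the estimated pose (often called the ADD metric). A minimal pure-Python sketch under that assumption (the identity ground-truth pose and the unit cube below are illustrative, not from the paper):

```python
import math

def transform(R, t, p):
    """Apply a rigid pose (R, t) to a 3D point p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def add_metric(points, R_gt, t_gt, R_est, t_est):
    """Average Euclidean distance between model points under the two poses."""
    total = 0.0
    for p in points:
        a = transform(R_gt, t_gt, p)
        b = transform(R_est, t_est, p)
        total += math.dist(a, b)
    return total / len(points)

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cube = [[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
# An estimate offset by 0.02 model units along x yields an ADD score of ~0.02.
print(add_metric(cube, I3, [0.0, 0.0, 0.0], I3, [0.02, 0.0, 0.0]))
```

A pose is then typically judged correct when this score falls below a threshold tied to the object's size (e.g., a fraction of its diameter); symmetric objects use a nearest-point variant of the same idea.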
Moreover, the various types of pose estimation methods rely excessively on specialized datasets, particularly the methods based on deep learning. After training, a neural network exhibits strong learning and reasoning capabilities on similar datasets, but when a new dataset is introduced, the network parameters must be retrained or fine-tuned on it. Consequently, such methods depend on the neural network framework itself to achieve rigid body pose estimation; making them practical requires large training datasets covering multiple scenarios, yet their accuracy is generally not optimal. By contrast, manually designed methods can reach state-of-the-art single-class accuracy under specific single-scenario conditions, but their ability to transfer to new applications is insufficient. When encountering such problems, researchers typically choose between two solutions. The first is to apply deep learning, using its powerful feature abstraction and data representation capabilities to improve the overall usability of the estimation method and optimize its accuracy. The other is to improve the handcrafted pose estimation method: a researcher can design an intermediate representation with greater descriptive capability to improve a method's applicability while preserving accuracy. This history helps readers build a complete knowledge hierarchy and identify future directions in this rapidly developing field. By combining the existing problems with the boosting effects of current deep learning technologies, we introduce six aspects to be considered, namely, scene-level multi-objective inference, self-supervised learning methods, front-end detection networks, lightweight and efficient network designs, multi-information fusion pose estimation frameworks, and image data representation spaces.
We examine all the above aspects from the perspective of development trends in multi-DOF rigid object pose estimation. Multi-DOF pose estimation from a single image of a rigid object based on computer vision has high research value in many fields; however, further research is needed to address the limitations of current technical methods and application scenarios.