Review of rigid object pose estimation from a single image
Vol. 26, Issue 2, Pages: 334-354 (2021)
Published: 16 February 2021
Accepted: 20 May 2020
DOI: 10.11834/jig.200037
Buyi Yang, Xiaoping Du, Yuqiang Fang, Peiyang Li, Yang Wang. Review of rigid object pose estimation from a single image[J]. Journal of Image and Graphics, 26(2): 334-354 (2021)
Rigid object pose estimation, a key research direction in computer vision, aims to determine the multiple degrees of freedom of a 3D object in a scene, such as translational position and rotational orientation, and is increasingly applied in fields such as industrial robotic-arm manipulation, on-orbit servicing in space, autonomous driving, and augmented reality. This paper presents an overall review of the process, method taxonomy, and open problems of rigid object pose estimation from a single image. The various methods that achieve multi-DOF pose estimation from a single image of a rigid object are summarized, classified, and compared, with emphasis on the general pose estimation pipeline, the evolution and division of estimation methods, commonly used datasets and evaluation criteria, and the research status and outlook. At present, multi-DOF rigid object pose estimation methods work well mainly in single, specific application scenarios; no method generalizes to composite scenes, and the accuracy and efficiency of existing methods degrade markedly under varied illumination, cluttered and occluded scenes, rotational symmetry, and inter-class object similarity. Considering these open problems and the impetus of current deep learning technology, development trends in this field are predicted and discussed from six aspects: scene-level multi-object reasoning, self-supervised learning methods, front-end detection networks, lightweight and efficient network design, multi-information fusion pose estimation frameworks, and image data representation spaces.
Rigid object pose estimation, one of the most fundamental and challenging problems in computer vision, has attracted considerable attention in recent years. Researchers are searching for methods that recover the multiple degrees of freedom (DOFs) of rigid objects in a 3D scene, such as translation and rotation, and that detect object instances from a large number of predefined categories in natural images. Meanwhile, advances in computer vision technology have brought considerable progress to the rigid object pose estimation task, which is important in an increasing number of applications, e.g., robotic manipulation, on-orbit servicing in space, autonomous driving, and augmented reality.
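As background, the pose estimated throughout this review can be written in the standard pinhole-camera formulation: a rigid transform with six DOFs, i.e., a rotation $R \in SO(3)$ and a translation $t \in \mathbb{R}^3$, under which a 3D model point $X$ projects to the image point $x$ (tildes denote homogeneous coordinates, $s$ is a scale factor) as

s\,\tilde{x} = K\,[R \mid t]\,\tilde{X}, \qquad R \in SO(3),\; t \in \mathbb{R}^{3},

where $K$ is the calibrated camera intrinsic matrix; single-image pose estimation amounts to recovering $(R, t)$ from such 2D observations alone.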
This work extensively reviews papers on the development of rigid object pose estimation spanning roughly a quarter century (from the 1990s to 2019). However, no review dedicated to rigid object pose estimation from a single image exists at present; most relevant studies focus only on optimizing and improving a single class of methods and then briefly summarize related work in the field. To provide local and overseas researchers with a more comprehensive understanding of the rigid object pose estimation process, we systematically review its classification and existing problems from a computer vision perspective. In this study, we summarize the multi-DOF pose estimation methods, each operating on a single image of a rigid object, proposed by major research institutions worldwide, and we classify the various methods by comparing their key intermediate representations. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to considerable breakthroughs in generic object pose estimation. This paper accordingly reviews 20 years of object pose estimation techniques at two levels: the traditional pose estimation period (e.g., feature-based, template matching-based, and 3D coordinate-based methods) and the deep learning-based pose estimation period (e.g., improved traditional methods as well as direct and indirect estimation methods).
Finally, we discuss these methods along each relevant technical process, focusing on crucial aspects such as the general pose estimation pipeline, the evolution and classification of methodologies, commonly used datasets and evaluation criteria, and domestic and overseas research status and prospects.
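Among the evaluation criteria covered below, the average distance (ADD) metric introduced with the LineMOD benchmark (Hinterstoisser et al., 2012) is representative: an estimated pose $(\hat{R}, \hat{t})$ is compared with the ground truth $(R, t)$ over the $m$ points $x$ of the object model $\mathcal{M}$,

\mathrm{ADD} = \frac{1}{m} \sum_{x \in \mathcal{M}} \left\| (Rx + t) - (\hat{R}x + \hat{t}) \right\|,

and the estimate is commonly accepted when this distance is below 10% of the object diameter.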
For each type of pose estimation method, we first identify the representation space of the image features used in the reviewed articles and use it to determine the specific classification of the method. Second, we examine the estimation process to determine how the image features are extracted, e.g., by handcrafted design or by a convolutional neural network. Third, we determine how the feature representation spaces are matched and summarize the matching process, and we finally identify the pose optimization method used in each article. With these steps, all pose estimation methods to date can be finely classified.
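As an illustration of this pipeline in the traditional feature-based family, the following minimal Python/OpenCV sketch chains feature extraction (SIFT; Lowe, 2004), descriptor matching, and robust pose optimization via PnP with RANSAC (Fischler and Bolles, 1981); model_descriptors and model_points_3d are hypothetical inputs built offline from the object model, not part of any reviewed method.

import cv2
import numpy as np

def estimate_pose(image, model_descriptors, model_points_3d, K):
    # 1) Feature extraction: detect keypoints and compute local descriptors.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)

    # 2) Matching: associate image descriptors with model descriptors whose
    #    3D coordinates on the object are known; keep unambiguous matches
    #    via Lowe's ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(descriptors, model_descriptors, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    pts_2d = np.float32([keypoints[m.queryIdx].pt for m in good])
    pts_3d = np.float32([model_points_3d[m.trainIdx] for m in good])

    # 3) Pose optimization: PnP inside a RANSAC loop rejects mismatches
    #    and recovers the 6-DOF pose under the camera intrinsics K.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None)
    R, _ = cv2.Rodrigues(rvec)  # axis-angle vector to rotation matrix
    return (R, tvec) if ok else None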
At present, multi-DOF rigid object pose estimation methods are effective mostly in a single, specific application scenario, and no universal method is available for composite scenes. When existing methods encounter varied lighting conditions, highly cluttered or occluded scenes, rotationally symmetric objects, or objects with high inter-class similarity, their estimation accuracy and efficiency drop significantly. Although a given type of method and its improved versions can achieve considerable accuracy gains, the results decline significantly when the method is applied to other scenarios or new datasets; in highly occluded complex scenes, its accuracy is frequently halved. Moreover, the various types of pose estimation methods rely excessively on specialized datasets, particularly the methods based on deep learning. After training, a neural network exhibits strong learning and reasoning capabilities on similar datasets, but when a new dataset is introduced, the network parameters require a new training set for learning and fine-tuning. Consequently, a method that relies on a neural network framework for rigid object pose estimation needs large training datasets covering multiple scenarios; this makes the method more practical, but its accuracy is generally not optimal. By contrast, the accuracy of the most advanced single-class estimation can be achieved with manually designed methods under certain single-scenario conditions, but their ability to migrate to other applications is insufficient.
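To make this dataset dependence concrete, here is a minimal fine-tuning sketch, assuming a PyTorch setup, a direct-regression head that outputs a 3D translation plus a unit quaternion, and a hypothetical new_loader yielding image-pose pairs from the new dataset; it illustrates the adaptation step described above rather than any specific method from the literature.

import torch
import torch.nn as nn
from torchvision import models

# Adapt an ImageNet-pretrained backbone to direct pose regression, then
# fine-tune it on a new dataset (pose = [tx, ty, tz, qw, qx, qy, qz]).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 7)  # translation (3) + quaternion (4)

optimizer = torch.optim.Adam([
    # A small learning rate keeps the pretrained backbone features intact,
    # while the freshly initialized head is allowed to learn faster.
    {"params": (p for n, p in model.named_parameters() if not n.startswith("fc")), "lr": 1e-5},
    {"params": model.fc.parameters(), "lr": 1e-3},
])

def pose_loss(pred, target):
    t_err = (pred[:, :3] - target[:, :3]).norm(dim=1)    # translation error
    q = nn.functional.normalize(pred[:, 3:], dim=1)      # predicted unit quaternion
    q_err = 1.0 - (q * target[:, 3:]).sum(dim=1).abs()   # sign-invariant rotation error
    return (t_err + q_err).mean()

model.train()
for images, poses in new_loader:  # hypothetical DataLoader for the new dataset
    optimizer.zero_grad()
    loss = pose_loss(model(images), poses)
    loss.backward()
    optimizer.step()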
When encountering such problems, researchers typically choose between two solutions. The first is to apply deep learning, using its powerful feature abstraction and data representation capabilities to improve the overall usability of an estimation method, optimize its accuracy, and enhance its effect. The other is to improve a handcrafted pose estimation method: a researcher can design an intermediate representation with greater descriptive capability to improve the applicability of the method while maintaining its accuracy. Tracing this history helps readers build a complete knowledge hierarchy and find future directions in this rapidly developing field. By combining the existing problems with the boosting effect of current deep learning technologies, we introduce six aspects to be considered, namely, scene-level multi-object inference, self-supervised learning methods, front-end detection networks, lightweight and efficient network designs, multi-information fusion pose estimation frameworks, and image data representation spaces. We examine all of these aspects from the perspective of development trends in multi-DOF rigid object pose estimation. Multi-DOF pose estimation from a single image of a rigid object based on computer vision technology has high research value in many fields; however, further research is necessary to address the limitations of current technical methods and application scenarios.
Keywords: computer vision; single image; rigid object; pose estimation; deep learning
Alexandrov S V, Patten T and Vincze M. 2019. Leveraging symmetries to improve object detection and pose estimation from range data//Proceedings of the 12th International Conference on Computer Vision Systems. Thessaloniki, Greece: Springer: 397-407[DOI: 10.1007/978-3-030-34995-0_36]
Aubry M, Maturana D, Efros A A, Russell B C and Sivic J. 2014. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 3762-3769[DOI: 10.1109/CVPR.2014.487]
Brachmann E, Krull A, Michel F, Gumhold S, Shotton J and Rother C. 2014. Learning 6D object pose estimation using 3D object coordinates//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 536-551[DOI: 10.1007/978-3-319-10605-2_35]
Brachmann E, Michel F, Krull A, Ying Yang M, Gumhold S and Rother C. 2016. Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 3364-3372[DOI: 10.1109/CVPR.2016.366]
Cai H P, Werner T and Matas J. 2013. Fast detection of multiple textureless 3-D objects//Proceedings of the 9th International Conference on Computer Vision Systems. Saint Petersburg, Russia: Springer: 103-112[DOI: 10.1007/978-3-642-39402-7_11]
Calli B, Singh A, Walsman A, Srinivasa S, Abbeel P and Dollar A M. 2015. The YCB object and model set: towards common benchmarks for manipulation research//Proceedings of 2015 International Conference on Advanced Robotics. Istanbul, Turkey: IEEE: 510-517[DOI: 10.1109/ICAR.2015.7251504]
Chen B, Parra A, Cao J, Li N and Chin T J. 2020. End-to-end learnable geometric vision by backpropagating PnP optimization//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 8100-8109[DOI: 10.1109/CVPR42600.2020.00812]
Collet A, Martinez M and Srinivasa S S. 2011. The MOPED framework: object recognition and pose estimation for manipulation. The International Journal of Robotics Research, 30(10): 1284-1306[DOI: 10.1177/0278364911401765]
Corona E, Kundu K and Fidler S. 2018. Pose estimation for objects with rotational symmetry//Proceedings of 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Madrid, Spain: IEEE: 7215-7222[DOI: 10.1109/IROS.2018.8594282]
Dalal N and Triggs B. 2005. Histograms of oriented gradients for human detection//Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). San Diego, USA: IEEE: 886-893[DOI: 10.1109/CVPR.2005.177]
Do T T, Cai M, Pham T and Reid I. 2018. Deep-6DPose: recovering 6D object pose from a single RGB image[EB/OL].[2020-01-30]. https://arxiv.org/pdf/1802.10367.pdf
Dosovitskiy A, Fischer P, Ilg E, Häusser P, Hazirbas C, Golkov V, Van Der Smagt P, Cremers D and Brox T. 2015. FlowNet: learning optical flow with convolutional networks//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2758-2766[DOI: 10.1109/ICCV.2015.316]
Doumanoglou A, Kouskouridas R, Malassiotis S and Kim T K. 2016. Recovering 6D object pose and predicting next-best-view in the crowd//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 3583-3592[DOI: 10.1109/CVPR.2016.390]
Drost B, Ulrich M, Navab N and Ilic S. 2010. Model globally, match locally: efficient and robust 3D object recognition//Proceedings of 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, USA: IEEE: 998-1005[DOI: 10.1109/CVPR.2010.5540108]
Fischler M A and Bolles R C. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6): 381-395[DOI: 10.1145/358669.358692]
Gall J, Yao A, Razavi N, Van Gool L and Lempitsky V. 2011. Hough forests for object detection, tracking, and action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(11): 2188-2202[DOI: 10.1109/TPAMI.2011.70]
Girshick R, Donahue J, Darrell T and Malik J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Columbus, USA: IEEE: 580-587[DOI: 10.1109/CVPR.2014.81]
Gupta S, Arbeláez P, Girshick R and Malik J. 2015. Aligning 3D models to RGB-D images of cluttered scenes//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 4731-4740[DOI: 10.1109/CVPR.2015.7299105]
He K M, Gkioxari G, Dollár P and Girshick R. 2017. Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2): 386-397[DOI: 10.1109/TPAMI.2018.2844175]
He Y H, Lin J, Liu Z J, Wang H R, Li L J and Han S. 2018. AMC: AutoML for model compression and acceleration on mobile devices//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 784-800[DOI: 10.1007/978-3-030-01234-2_48]
Hinterstoisser S, Cagniart C, Ilic S, Sturm P, Navab N, Fua P and Lepetit V. 2011a. Gradient response maps for real-time detection of textureless objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5): 876-888[DOI: 10.1109/TPAMI.2011.206]
Hinterstoisser S, Holzer S, Cagniart C, Ilic S, Konolige K, Navab N and Lepetit V. 2011b. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes//Proceedings of 2011 International Conference on Computer Vision. Barcelona, Spain: IEEE: 858-865[DOI: 10.1109/ICCV.2011.6126326]
Hinterstoisser S, Lepetit V, Ilic S, Holzer S, Bradski G, Konolige K and Navab N. 2012. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes//Proceedings of the 11th Asian Conference on Computer Vision. Daejeon, South Korea: Springer: 548-562[DOI: 10.1007/978-3-642-37331-2_42]
Hinterstoisser S, Lepetit V, Rajkumar N and Konolige K. 2016. Going further with point pair features//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 834-848[DOI: 10.1007/978-3-319-46487-9_51]
Hodan T, Haluza P, Obdržálek Š, Matas J, Lourakis M and Zabulis X. 2017. T-LESS: an RGB-D dataset for 6D pose estimation of texture-less objects//Proceedings of 2017 IEEE Winter Conference on Applications of Computer Vision. Santa Rosa, USA: IEEE: 880-888[DOI: 10.1109/WACV.2017.103]
Hodaň T, Michel F, Brachmann E, Kehl W, Glent Buch A, Kraft D, Drost B, Vidal J, Ihrke S, Zabulis X, Sahin C, Manhardt F, Tombari F, Kim T K, Matas J and Rother C. 2018. BOP: benchmark for 6D object pose estimation//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 19-35[DOI: 10.1007/978-3-030-01249-6_2]
Hodaň T, Zabulis X, Lourakis M, Obdržálek Š and Matas J. 2015. Detection and fine 3D pose estimation of texture-less objects in RGB-D images//Proceedings of 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Hamburg, Germany: IEEE: 4421-4428[DOI: 10.1109/IROS.2015.7354005]
Hu Y L, Hugonot J, Fua P and Salzmann M. 2019. Segmentation-driven 6D object pose estimation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, USA: IEEE: 3385-3394[DOI: 10.1109/CVPR.2019.00350]
Kehl W, Manhardt F, Tombari F, Ilic S and Navab N. 2017. SSD-6D: making RGB-based 3D detection and 6D pose estimation great again//Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). Venice, Italy: IEEE: 1521-1529[DOI: 10.1109/ICCV.2017.169]
Kehl W, Tombari F, Navab N, Ilic S and Lepetit V. 2016. Hashmod: a hashing method for scalable 3D object detection[EB/OL].[2020-01-30]. https://arxiv.org/pdf/1607.06062.pdf
Kendall A, Grimes M and Cipolla R. 2015. PoseNet: a convolutional network for real-time 6-DOF camera relocalization//Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE: 2938-2946[DOI: 10.1109/ICCV.2015.336]
Krizhevsky A, Sutskever I and Hinton G E. 2012. ImageNet classification with deep convolutional neural networks//Proceedings of the 25th International Conference on Neural Information Processing Systems. Lake Tahoe, USA: ACM: 1097-1105
Krull A, Brachmann E, Michel F, Ying Yang M, Gumhold S and Rother C. 2015. Learning analysis-by-synthesis for 6D pose estimation in RGB-D images//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 954-962[DOI: 10.1109/ICCV.2015.115]
LeCun Y, Bengio Y and Hinton G. 2015. Deep learning. Nature, 521(7553): 436-444[DOI: 10.1038/nature14539]
Li Y, Wang G, Ji X Y, Xiang Y and Fox D. 2018. DeepIM: deep iterative matching for 6D pose estimation. International Journal of Computer Vision, 128(3): 657-678[DOI: 10.1007/s11263-019-01250-9]
Li Z, Wang G and Ji X. 2019. CDPN: coordinates-based disentangled pose network for real-time RGB-based 6-DOF object pose estimation//Proceedings of 2019 IEEE International Conference on Computer Vision. Seoul, South Korea: IEEE: 7678-7687[DOI: 10.1109/ICCV.2019.00777]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C Y and Berg A C. 2016. SSD: single shot multibox detector//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 21-37[DOI: 10.1007/978-3-319-46448-0_2]
Lowe D G. 1999. Object recognition from local scale-invariant features//Proceedings of the 7th IEEE International Conference on Computer Vision. Kerkyra, Greece: IEEE: 1150-1157[DOI: 10.1109/ICCV.1999.790410]
Lowe D G. 2001. Local feature view clustering for 3D object recognition//Proceedings of 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Kauai, USA: IEEE: 682[DOI: 10.1109/CVPR.2001.990541]
Lowe D G. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2): 91-110[DOI: 10.1023/B:VISI.0000029664.99615.94]
Manhardt F, Kehl W and Gaidon A. 2019. ROI-10D: monocular lifting of 2D detection to 6D pose and metric shape//Proceedings of 2019 IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 2069-2078[DOI: 10.1109/CVPR.2019.00217]
Manhardt F, Kehl W, Navab N and Tombari F. 2018. Deep model-based 6D pose refinement in RGB//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 800-815[DOI: 10.1007/978-3-030-01264-9_49]
Michel F, Kirillov A, Brachmann E, Krull A, Gumhold S, Savchynskyy B and Rother C. 2017. Global hypothesis generation for 6D object pose estimation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 462-471[DOI: 10.1109/CVPR.2017.20]
Mousavian A, Anguelov D, Flynn J and Košecká J. 2017. 3D bounding box estimation using deep learning and geometry//Proceedings of 2017 Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 7074-7082[DOI: 10.1109/CVPR.2017.597]
Pavlakos G, Zhou X W, Chan A, Derpanis K G and Daniilidis K. 2017. 6-DoF object pose from semantic keypoints//Proceedings of 2017 IEEE International Conference on Robotics and Automation. Singapore: IEEE: 2011-2018[DOI: 10.1109/ICRA.2017.7989233]
Peng S D, Liu Y, Huang Q X, Zhou X W and Bao H J. 2019. PVNet: pixel-wise voting network for 6DoF pose estimation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4561-4570[DOI: 10.1109/CVPR.2019.00469]
Periyasamy A S, Schwarz M and Behnke S. 2019. Refining 6D object pose predictions using abstract render-and-compare[EB/OL].[2020-01-30]. https://arxiv.xilesou.top/pdf/1910.03412.pdf
Pham H, Guan M Y, Zoph B, Le Q V and Dean J. 2018. Efficient neural architecture search via parameter sharing[EB/OL].[2020-01-30]. https://arxiv.org/pdf/1802.03268.pdf
Poirson P, Ammirato P, Fu C Y, Liu W, Košecká J and Berg A C. 2016. Fast single shot detection and pose estimation//Proceedings of the 4th International Conference on 3D Vision. Stanford, USA: IEEE: 676-684[DOI: 10.1109/3DV.2016.78]
Rad M and Lepetit V. 2017. BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 3828-3836[DOI: 10.1109/ICCV.2017.413]
Redmon J, Divvala S, Girshick R and Farhadi A. 2016. You only look once: unified, real-time object detection//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 779-788[DOI: 10.1109/CVPR.2016.91]
Ren S Q, He K M, Girshick R and Sun J. 2017. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6): 1137-1149[DOI: 10.1109/TPAMI.2016.2577031]
Rios-Cabrera R and Tuytelaars T. 2013. Discriminatively trained templates for 3D object detection: a real time scalable approach//Proceedings of 2013 IEEE International Conference on Computer Vision. Sydney, Australia: IEEE: 2048-2055[DOI: 10.1109/ICCV.2013.256]
Rothganger F, Lazebnik S, Schmid C and Ponce J. 2006. 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. International Journal of Computer Vision, 66(3): 231-259[DOI: 10.1007/s11263-005-3674-1]
Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S A, Huang Z H, Karpathy A, Khosla A and Bernstein M. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3): 211-252[DOI: 10.1007/s11263-015-0816-y]
Savarese S and Li F F. 2007. 3D generic object categorization, localization and pose estimation//Proceedings of the 11th IEEE International Conference on Computer Vision. Rio de Janeiro, Brazil: IEEE: 1-8[DOI: 10.1109/ICCV.2007.4408987]
Schwarz M, Schulz H and Behnke S. 2015. RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features//Proceedings of 2015 IEEE International Conference on Robotics and Automation (ICRA). Seattle, USA: IEEE: 1329-1335[DOI: 10.1109/ICRA.2015.7139363]
Shelhamer E, Long J and Darrell T. 2017. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4): 640-651[DOI: 10.1109/TPAMI.2016.2572683]
Shotton J, Glocker B, Zach C, Izadi S, Criminisi A and Fitzgibbon A. 2013. Scene coordinate regression forests for camera relocalization in RGB-D images//Proceedings of 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, USA: IEEE: 2930-2937[DOI: 10.1109/CVPR.2013.377]
Simonyan K and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition[EB/OL].[2020-01-30]. https://arxiv.org/pdf/1409.1556.pdf
Song C, Song J and Huang Q. 2020. HybridPose: 6D object pose estimation under hybrid representations//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 431-440[DOI: 10.1109/CVPR42600.2020.00051]
Su H, Qi C R, Li Y Y and Guibas L J. 2015. Render for CNN: viewpoint estimation in images using CNNs trained with rendered 3D model views//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2686-2694[DOI: 10.1109/ICCV.2015.308]
Sundermeyer M, Marton Z C, Durner M, Brucker M and Triebel R. 2018. Implicit 3D orientation learning for 6D object detection from RGB images//Proceedings of the 15th European Conference on Computer Vision. Munich, Germany: Springer: 699-715
Szegedy C, Ioffe S, Vanhoucke V and Alemi A A. 2017. Inception-V4, inception-resnet and the impact of residual connections on learning//Proceedings of the 31st AAAI Conference on Artificial Intelligence. San Francisco, USA: AAAI: 4278-4284
Szegedy C, Liu W, Jia Y Q, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A. 2015. Going deeper with convolutions//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1-9[DOI: 10.1109/CVPR.2015.7298594]
Tejani A, Tang D H, Kouskouridas R and Kim T K. 2014. Latent-class Hough forests for 3D object detection and pose estimation//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 462-477[DOI: 10.1007/978-3-319-10599-4_30]
Tekin B, Sinha S N and Fua P. 2018. Real-time seamless single shot 6D object pose prediction//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 292-301[DOI: 10.1109/CVPR.2018.00038]
Tu Z W and Bai X. 2009. Auto-context and its application to high-level vision tasks and 3D brain image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(10): 1744-1757[DOI: 10.1109/TPAMI.2009.186]
Tulsiani S and Malik J. 2015. Viewpoints and keypoints//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1510-1519[DOI: 10.1109/CVPR.2015.7298758]
Umeyama S. 1991. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4): 376-380[DOI: 10.1109/34.88573]
Vidal J, Lin C Y and Martí R. 2018. 6D pose estimation using an improved method based on point pair features//Proceedings of the 4th International Conference on Control, Automation and Robotics (ICCAR). Auckland, New Zealand: IEEE: 405-409[DOI: 10.1109/ICCAR.2018.8384709]
Wagner D, Reitmayr G, Mulloni A, Drummond T and Schmalstieg D. 2008. Pose tracking from natural features on mobile phones//Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality. Cambridge, UK: IEEE: 125-134[DOI: 10.1109/ISMAR.2008.4637338]
Wang H, Sridhar S, Huang J W, Valentin J, Song S R and Guibas L J. 2019. Normalized object coordinate space for category-level 6D object pose and size estimation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 2642-2651[DOI: 10.1109/CVPR.2019.00275]
Wohlhart P and Lepetit V. 2015. Learning descriptors for object recognition and 3D pose estimation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3109-3118[DOI: 10.1109/CVPR.2015.7298930]
Xiang Y, Schmidt T, Narayanan V and Fox D. 2017. PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes[EB/OL].[2020-01-30]. https://arxiv.org/pdf/1711.00199.pdf
Xu B and Chen Z Z. 2018. Multi-level fusion based 3D object detection from monocular images//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 2345-2353[DOI: 10.1109/CVPR.2018.00249]
Zakharov S, Shugurov I and Ilic S. 2019. DPOD: 6D pose object detector and refiner//Proceedings of 2019 IEEE International Conference on Computer Vision. Seoul, South Korea: IEEE: 1941-1950[DOI: 10.1109/ICCV.2019.00203]