6D pose estimation based on mask location and hourglass network
2022, Vol. 27, No. 2: 642-652
Print publication date: 2022-02-16
Accepted: 2021-02-04
DOI: 10.11834/jig.200525
Dongdong Li, Herong Zheng, Fuchang Liu, Xiang Pan. 6D pose estimation based on mask location and hourglass network[J]. Journal of Image and Graphics, 2022,27(2):642-652.
Objective
6D pose estimation is a core problem in 3D object detection and reconstruction. Many objects have smooth, textureless surfaces from which features are hard to extract, so traditional pose estimation methods often fail on them, and many algorithms rely on post-processing to raise pose accuracy at the cost of speed. To achieve a fast, single-shot solution, a 6D object pose estimation algorithm based on mask location and heat maps is proposed in this paper. In the prediction stage, masks are first employed to locate objects, which reduces the error caused by occlusion. To accelerate mask generation, the you only look once v3 (YOLOv3) network is used as the backbone. The presented algorithm requires no post-processing: the neural network directly predicts key point locations at high speed.
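To make the prediction flow concrete before the detailed Method, a minimal sketch is given below. `segment_mask`, `predict_heatmaps`, and `recover_pose` are hypothetical stand-ins for the three stages, not the authors' API, and the heat maps are assumed to be at image resolution.

```python
import numpy as np

def predict_6d_pose(image, model_points_3d, camera_matrix,
                    segment_mask, predict_heatmaps, recover_pose):
    """Hypothetical single-shot flow: mask -> key point heat maps -> PnP."""
    mask = segment_mask(image)                # stage 1: locate the object and
    masked = image * mask[..., None]          # suppress occluding clutter
    heatmaps = predict_heatmaps(masked)       # stage 2: K heat maps, one per key point
    keypoints_2d = np.array(
        [np.unravel_index(np.argmax(h), h.shape)[::-1] for h in heatmaps],
        dtype=np.float64)                     # peak of each heat map = (x, y)
    return recover_pose(model_points_3d, keypoints_2d, camera_matrix)  # stage 3
```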
Method
Our algorithm consists of the following steps. First, in the object detection stage, a segmentation network is used to generate masks, which keeps other, occluding objects from contaminating the heat maps and degrading key point prediction. To speed this stage up, YOLOv3 is used as the network backbone. On top of the original detection head, the segmentation network adds a branch in which deconvolution layers raise the resolution of the feature maps, a stack of 1×1, 3×3, and 1×1 convolution layers follows each deconvolution, and the resulting features are fused to produce the object target and mask map, trained with the mean squared error as the regression loss.
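To illustrate, here is a minimal PyTorch sketch of one such upsampling step in the mask branch. The channel widths, the addition-based fusion, and the single-object mask head are assumptions for illustration; the abstract only specifies deconvolution, the 1×1/3×3/1×1 stack, feature fusion, and the MSE regression loss.

```python
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    """Illustrative upsampling branch on top of a YOLOv3 backbone feature map.

    Layer widths and the number of stages are assumptions, not the paper's
    exact configuration.
    """
    def __init__(self, in_ch=256, mid_ch=128):
        super().__init__()
        # Deconvolution doubles the spatial resolution of the feature map.
        self.deconv = nn.ConvTranspose2d(in_ch, mid_ch, kernel_size=4,
                                         stride=2, padding=1)
        # 1x1 -> 3x3 -> 1x1 stack refines the upsampled features.
        self.refine = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
        )
        # Lateral 1x1 conv so a higher-resolution skip feature can be fused.
        self.lateral = nn.Conv2d(mid_ch, mid_ch, 1)
        self.mask_head = nn.Conv2d(mid_ch, 1, 1)   # per-pixel mask logits

    def forward(self, deep_feat, skip_feat):
        # skip_feat is assumed to match the upsampled shape.
        x = self.refine(self.deconv(deep_feat))
        x = x + self.lateral(skip_feat)            # fuse resolutions
        return torch.sigmoid(self.mask_head(x))    # mask in [0, 1]

# The mask output is regressed against the ground-truth mask with MSE:
# loss = nn.functional.mse_loss(pred_mask, gt_mask)
```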
Second, an hourglass network predicts the key points of each object; unlike a plain residual network, it avoids the drop in key point accuracy caused by the loss of local features. The hourglass adopts an encoder-decoder form: in the encoding stage, downsampling reduces the scale while residual modules extract features, and in the decoding stage, upsampling restores the scale. Every scale level passes through a residual module, which extracts features without changing the data size. To prevent the feature map from losing local information as the scale is enlarged, a multiscale feature constraint is proposed: before each downsampling, the flow splits into two branches so that the original-scale information is retained through a skip layer consisting of a single 1×1 convolution, and the two branches are stitched together again at the same scale after the corresponding upsampling. Features from four different resolutions are thus spliced into the upsampling path, and the initial feature map is combined with the upsampled feature map. Rather than upsampling directly to the input resolution and regressing the heat map there, the hourglass network is used with relay supervision, which constrains the final heat map produced by the residual network.
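A compact sketch of one hourglass level with the skip branch described above follows. The residual block and channel counts are simplified stand-ins; the design follows Newell et al.'s stacked hourglass, with the 1×1 skip convolution standing in for the multiscale constraint.

```python
import torch
import torch.nn as nn

def residual(ch):
    # Simplified stand-in for the residual module used in the paper.
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                         nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

class Hourglass(nn.Module):
    """One hourglass with `depth` levels of down/up sampling."""
    def __init__(self, ch, depth):
        super().__init__()
        self.skip = nn.Conv2d(ch, ch, 1)   # 1x1 skip branch: multiscale constraint
        self.down = nn.MaxPool2d(2)
        self.res_in = residual(ch)
        self.inner = Hourglass(ch, depth - 1) if depth > 1 else residual(ch)
        self.res_out = residual(ch)
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.fuse = nn.Conv2d(2 * ch, ch, 1)  # mix stitched features

    def forward(self, x):
        s = self.skip(x)                   # retain original-scale information
        y = self.res_out(self.inner(self.res_in(self.down(x))))
        return self.fuse(torch.cat([s, self.up(y)], dim=1))  # stitch at same scale

# Relay (intermediate) supervision: a heat map head after each stacked
# hourglass is trained against the ground truth, not only the final output.
```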
Finally, the 6D pose of the object is recovered with the perspective-$n$-point (PnP) algorithm.
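The PnP step maps the heat map peaks and the corresponding 3D key points on the object model to a rotation and translation. A sketch using OpenCV's solver (the reference list cites EPnP, Lepetit et al., 2009):

```python
import numpy as np
import cv2

def recover_pose(points_3d, points_2d, camera_matrix):
    """Recover R, t from N >= 4 2D-3D key point correspondences."""
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(points_3d, dtype=np.float64),   # (N, 3) model key points
        np.asarray(points_2d, dtype=np.float64),   # (N, 2) heat map peaks
        camera_matrix, None,                       # no lens distortion assumed
        flags=cv2.SOLVEPNP_EPNP)                   # EPnP, as cited
    R, _ = cv2.Rodrigues(rvec)                     # axis-angle -> 3x3 rotation
    return R, tvec
```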
Result
In the experiments, the challenging Linemod dataset is used to evaluate our algorithm. Linemod contains 15 models and is difficult because of the complexity of its object scenes. The proposed method is compared with state-of-the-art methods in terms of the 3D average distance (ADD) metric and the 2D projection metric. Results show that our ADD accuracy reaches 82.7%, 10% higher than that of existing heat map methods such as Betapose, and our 2D projection accuracy reaches 98.9%, a 4% improvement over mainstream algorithms. On symmetric objects, Betapose selects feature points with explicit knowledge of the object's symmetry to improve pose accuracy; in contrast, our algorithm extracts feature points with SIFT and uses no symmetry knowledge, yet still scores higher on symmetric objects than Betapose. The 10% gain in ADD accuracy over Betapose comes with only a slight drop in computational efficiency, from 17 to 15 frames/s. Finally, ablation experiments illustrate the effects of the hourglass and mask modules: accuracy drops by 5.4% if the hourglass module is removed and by 2.3% if the mask module is removed. All experimental results show that the proposed modules are key to the overall pose estimation performance.
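For reference, the two reported metrics can be computed as follows. This is a sketch under the standard Linemod conventions (a pose counts as correct if the ADD is below 10% of the model diameter, or if the mean 2D projection error is below 5 pixels); `model_pts` is assumed to be a point set sampled from the object mesh.

```python
import numpy as np

def add_error(model_pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean 3D distance between model points under the two poses."""
    gt = model_pts @ R_gt.T + t_gt.reshape(1, 3)
    pred = model_pts @ R_pred.T + t_pred.reshape(1, 3)
    return np.linalg.norm(gt - pred, axis=1).mean()

def projection_error(model_pts, K, R_gt, t_gt, R_pred, t_pred):
    """Mean pixel distance between 2D projections under the two poses."""
    def project(R, t):
        cam = model_pts @ R.T + t.reshape(1, 3)    # points in camera frame
        uv = cam @ K.T                             # apply intrinsics
        return uv[:, :2] / uv[:, 2:3]              # perspective divide
    return np.linalg.norm(project(R_gt, t_gt) - project(R_pred, t_pred),
                          axis=1).mean()

# correct_add  = add_error(...)        < 0.1 * model_diameter
# correct_proj = projection_error(...) < 5.0   # pixels
```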
Conclusion
A mask segmentation and key point detection network is proposed in this paper. It avoids a large amount of post-processing, maintains the speed of the algorithm, and improves its pose estimation accuracy. The experimental results demonstrate that our method is efficient and outperforms other recent convolutional neural network (CNN) based approaches, while its detection speed remains on par with existing methods.
pose estimation; object segmentation; key point location; hourglass network; feature fusion
Bao Z Q, Xing Y, Lyu S Q and Huang Q D. 2020. Improved YOLO V2 6D object pose estimation algorithm. Computer Engineering and Applications, 57(9): 148-153 [DOI: 10.3778/j.issn.1002-8331.2001-0367]
Brachmann E, Krull A, Michel F, Gumhold S, Shotton J and Rother C. 2014. Learning 6D object pose estimation using 3D object coordinates//Proceedings of the 13th European Conference on Computer Vision. Zurich, Switzerland: Springer: 536-551 [DOI: 10.1007/978-3-319-10605-2_35]
Cao Q X and Zhang H R. 2017. Combined holistic and local patches for recovering 6D object pose//Proceedings of 2017 IEEE International Conference on Computer Vision Workshops. Venice, Italy: IEEE: 2219-2227 [DOI: 10.1109/ICCVW.2017.259]
Choi C and Christensen H I. 2016. RGB-D object pose estimation in unstructured environments. Robotics and Autonomous Systems, 75: 595-613 [DOI: 10.1016/j.robot.2015.09.020]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Hinterstoisser S, Lepetit V, Ilic S, Holzer S, Konolige K, Bradski G and Navab N. 2012. Technical demonstration on model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes//Proceedings of ECCV 2012. Florence, Italy: Springer: 593-596 [DOI: 10.1007/978-3-642-33885-4_60]
Kehl W, Milletari F, Tombari F, Ilic S and Navab N. 2016. Deep learning of local RGB-D patches for 3D object detection and 6D pose estimation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 205-220 [DOI: 10.1007/978-3-319-46487-9_13]
Kehl W, Manhardt F, Tombari F, Ilic S and Navab N. 2017. SSD-6D: making RGB-based 3D detection and 6D pose estimation great again//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 1530-1538 [DOI: 10.1109/ICCV.2017.169]
Kendall A, Grimes M and Cipolla R. 2015. PoseNet: a convolutional network for real-time 6-DOF camera relocalization//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2938-2946 [DOI: 10.1109/ICCV.2015.336]
Lepetit V, Moreno-Noguer F and Fua P. 2009. EPnP: an accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2): 155-166 [DOI: 10.1007/s11263-008-0152-6]
Li Y, Gu L and Kanade T. 2011. Robustly aligning a shape model and its application to car alignment of unknown pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9): 1860-1876 [DOI: 10.1109/TPAMI.2011.40]
Lowe D G. 1999. Object recognition from local scale-invariant features//Proceedings of the 7th IEEE International Conference on Computer Vision. Kerkyra, Greece: IEEE: 1150-1157 [DOI: 10.1109/ICCV.1999.790410]
Michel F, Kirillov A, Brachmann E, Krull A, Gumhold S, Savchynskyy B and Rother C. 2017. Global hypothesis generation for 6D object pose estimation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 115-124 [DOI: 10.1109/CVPR.2017.20]
Newell A, Yang K Y and Deng J. 2016. Stacked hourglass networks for human pose estimation//Proceedings of the 14th European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 483-499 [DOI: 10.1007/978-3-319-46484-8_29]
Park K, Patten T and Vincze M. 2019. Pix2Pose: pixel-wise coordinate regression of objects for 6D pose estimation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 7667-7676 [DOI: 10.1109/ICCV.2019.00776]
Pavlakos G, Zhou X W, Chan A, Derpanis K G and Daniilidis K. 2017. 6-DoF object pose from semantic keypoints//Proceedings of 2017 IEEE International Conference on Robotics and Automation. Singapore: IEEE: 2011-2018 [DOI: 10.1109/ICRA.2017.7989233]
Peng S D, Liu Y, Huang Q X, Zhou X W and Bao H J. 2019. PVNet: pixel-wise voting network for 6DoF pose estimation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4556-4565 [DOI: 10.1109/CVPR.2019.00469]
Rad M and Lepetit V. 2017. BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 3848-3856 [DOI: 10.1109/ICCV.2017.413]
Ramnath K, Sinha S N, Szeliski R and Hsiao E. 2014. Car make and model recognition using 3D curve alignment//Proceedings of 2014 IEEE Winter Conference on Applications of Computer Vision. Steamboat Springs, USA: IEEE: 285-292 [DOI: 10.1109/WACV.2014.6836087]
Redmon J and Farhadi A. 2017. YOLO9000: better, faster, stronger//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 6517-6525 [DOI: 10.1109/CVPR.2017.690]
Redmon J and Farhadi A. 2018. YOLOv3: an incremental improvement [EB/OL]. [2020-08-07]. https://arxiv.org/pdf/1804.02767.pdf
Tekin B, Sinha S N and Fua P. 2018. Real-time seamless single shot 6D object pose prediction//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 292-301 [DOI: 10.1109/CVPR.2018.00038]
Wagner D, Reitmayr G, Mulloni A, Drummond T and Schmalstieg D. 2008. Pose tracking from natural features on mobile phones//Proceedings of the 7th IEEE/ACM International Symposium on Mixed and Augmented Reality. Cambridge, UK: IEEE: 125-134 [DOI: 10.1109/ISMAR.2008.4637338]
Wang C, Xu D F, Zhu Y K, Martín-Martín R, Lu C W, Li F F and Savarese S. 2019. DenseFusion: 6D object pose estimation by iterative dense fusion//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 3338-3347 [DOI: 10.1109/CVPR.2019.00346]
Xiang Y, Schmidt T, Narayanan V and Fox D. 2018. PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes [EB/OL]. [2020-08-07]. https://arxiv.org/pdf/1711.00199.pdf
Yang B Y, Du X P, Fang Y Q, Li P Y and Wang Y. 2021. Review of rigid object pose estimation from a single image. Journal of Image and Graphics, 26(2): 334-354 [DOI: 10.11834/jig.200037]
Zhang X, Jiang Z G and Zhang H P. 2019. Real-time 6D pose estimation from a single RGB image. Image and Vision Computing, 89: 1-11 [DOI: 10.1016/j.imavis.2019.06.013]
Zhao Z L, Peng G, Wang H Y, Fang H S, Li C K and Lu C W. 2018. Estimating 6D pose from localizing designated surface keypoints [EB/OL]. [2020-08-07]. https://arxiv.org/pdf/1812.01387v1.pdf