Indoor scene reconstruction method combining semantic segmentation and model matching
2023, Vol. 28, No. 10, Pages 3149-3162
Print publication date: 2023-10-16
DOI: 10.11834/jig.220518
Ning Xiaojuan, Gong Liang, Han Yi, Ma Ting, Shi Zhenghao, Jin Haiyan and Wang Yinghui. 2023. Indoor scene reconstruction method combining semantic segmentation and model matching. Journal of Image and Graphics, 28(10): 3149-3162 [DOI: 10.11834/jig.220518]
Objective
Indoor point cloud scenes suffer from incomplete and noisy data caused by densely packed objects, complex structures, and heavy occlusion, which severely limits indoor scene reconstruction and undermines its accuracy. To recover complete scenes from unordered point clouds more reliably, this paper proposes an indoor scene reconstruction method based on semantic segmentation.
Method
First, the raw data are down-sampled by voxel filtering, 3D scale-invariant feature transform (3D SIFT) feature points of the scene are computed, and the down-sampled result is fused with these feature points to obtain an optimized down-sampled scene. Next, the random sample consensus (RANSAC) algorithm extracts planar features from the fusion-sampled scene; these features are fed into the PointNet network for training so that coplanar points share the same local features, yielding per-point confidence scores over all classes in the dataset. On this basis, a projection-based region-growing optimization method is proposed to aggregate the points belonging to the same object in the semantic segmentation result, producing a finer segmentation. Finally, the segmented scene objects are divided into internal and external environment elements, which are reconstructed by model matching and by plane fitting, respectively.
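As an illustration only, the sketch below shows how such a fusion-sampling step could look in Python with the Open3D library (our tooling choice; the paper does not specify an implementation). Open3D ships no 3D SIFT detector, so ISS keypoints stand in for the paper's 3D SIFT feature points, and the voxel size and all other parameters are illustrative assumptions.

```python
import numpy as np
import open3d as o3d

def fusion_downsample(pcd: o3d.geometry.PointCloud,
                      voxel_size: float = 0.05) -> o3d.geometry.PointCloud:
    """Voxel filtering fused with salient keypoints (ISS in place of 3D SIFT)."""
    # Step 1: voxel-grid down-sampling removes redundancy and dampens noise.
    coarse = pcd.voxel_down_sample(voxel_size=voxel_size)
    # Step 2: detect salient keypoints on the ORIGINAL cloud, so that feature
    # points averaged away by the voxel grid can be recovered.
    keypoints = o3d.geometry.keypoint.compute_iss_keypoints(pcd)
    # Step 3: fuse both point sets and drop exact duplicates.
    fused = o3d.geometry.PointCloud()
    fused.points = o3d.utility.Vector3dVector(
        np.vstack([np.asarray(coarse.points), np.asarray(keypoints.points)]))
    fused.remove_duplicated_points()
    return fused
```

The design point is that the keypoint detector runs on the original cloud, not the filtered one, which is why key points lost to voxel averaging can be restored in the fused output.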
Result
Experiments on the Stanford large-scale 3D indoor space dataset (S3DIS) show that the proposed fusion sampling improves both the efficiency and the quality of the subsequent steps to varying degrees: after sampling, the plane extraction algorithm runs in only 15% of the original time, and the semantic segmentation method outperforms PointNet by 2.3% in overall accuracy (OA) and 4.2% in mean intersection over union (mIoU).
Conclusion
The proposed method improves computational efficiency while preserving key points, achieves a clear gain in segmentation accuracy, and produces high-quality reconstruction results.
Objective
Virtual reality has attracted wide attention in domains such as intelligent robotics, computer vision, and artificial intelligence, where 3D reconstruction of diverse scenes plays a central role, and indoor scene reconstruction in particular has been developing rapidly in computer vision and robotics. The key task of 3D reconstruction is to transform the point cloud data of an indoor scene into a lightweight 3D scene model based on the spatial, geometric, semantic, and other features of the point cloud. However, reconstructing a high-quality 3D indoor scene directly remains challenging because of the complex structure, heavy occlusion, and variability of indoor scenes. Existing scene reconstruction methods fall mainly into model matching-based, machine learning-based, and deep learning-based approaches. Model matching-based methods depend on the selection of feature points during matching. Machine learning-based methods focus on scene segmentation, after which targets can be detected and replaced through partial matching; however, when indoor objects are severely incomplete, narrow and cluttered indoor scenes remain difficult to handle. Deep learning-based methods require reliable, high-quality training data, and domain-specific data acquisition is often costly for new scenes. To address these problems, we develop a semantic segmentation-based point cloud indoor scene reconstruction method that converts point cloud data into a high-quality 3D scene model efficiently and accurately. The proposed method consists of three steps: fusion sampling, semantic and instance segmentation, and scene reconstruction.
Method
We present a semantic segmentation-based indoor scene reconstruction method. First, a down-sampling method is developed that combines 3D scale-invariant feature transform (3D SIFT) feature point extraction with voxel filtering. Guided by the local features of the scene, voxel filtering down-samples the point cloud and removes noisy outliers. The local feature points of the scene are then obtained by 3D SIFT and used to compensate for key points that voxel filtering may discard; combining them with the voxel-filtered result yields the optimized sampling output. This effectively mitigates the loss of critical points under a single voxel filter and offers an efficient data representation for the subsequent semantic segmentation of indoor scenes. Second, we introduce a plane feature-enhanced multi-level semantic segmentation method built on PointNet. Planar features are extracted from the sampled scene with the random sample consensus (RANSAC) algorithm, the plane-feature-augmented data are assembled into the training and testing sets, and PointNet then performs end-to-end scene semantic segmentation. A projection-based region-growing optimization method further refines the segmentation of objects in the indoor scene; it strengthens PointNet's local feature representation and improves the accuracy of scene semantic segmentation to a certain extent. Finally, a 3D scene model is reconstructed by model matching for internal environment objects and by plane fitting for external environment objects. A model library of scene objects is built according to the semantic segmentation of the scene. To handle the complex structure of each internal object, similarities between the object and the models in the library are computed. A heuristic-search model matching scheme uses the semantic labels and local features of scene elements as indexes for coarse retrieval from the model library, and the best-matching model is aligned to and substituted for the object in the scene, completing the reconstruction of internal objects. External environment objects are reconstructed by plane fitting: after the axis-aligned bounding box (AABB) of each such object is computed, a plane model is generated to complete their reconstruction.
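To make the plane-feature enhancement concrete, the minimal sketch below peels planes off the sampled cloud with RANSAC and records a per-point plane id that could be appended to the XYZ(+RGB) channels fed to the segmentation network. It assumes Open3D's segment_plane as the RANSAC implementation; the thresholds, the stopping rule, and the helper name are illustrative, not the paper's settings.

```python
import numpy as np
import open3d as o3d

def extract_plane_labels(pcd: o3d.geometry.PointCloud,
                         dist_thresh: float = 0.02,
                         min_inliers: int = 500,
                         max_planes: int = 20) -> np.ndarray:
    """Iteratively peel RANSAC planes; return a per-point plane id (-1 = none)."""
    labels = np.full(len(pcd.points), -1, dtype=np.int32)
    remaining = pcd
    remaining_idx = np.arange(len(pcd.points))  # map local -> original indices
    for plane_id in range(max_planes):
        if len(remaining.points) < min_inliers:
            break
        # Fit one dominant plane ax + by + cz + d = 0 to the leftover points.
        _, inliers = remaining.segment_plane(distance_threshold=dist_thresh,
                                             ransac_n=3,
                                             num_iterations=1000)
        if len(inliers) < min_inliers:
            break
        labels[remaining_idx[inliers]] = plane_id
        # Remove the inliers and continue on the residual cloud.
        keep = np.setdiff1d(np.arange(len(remaining.points)), inliers)
        remaining = remaining.select_by_index(keep.tolist())
        remaining_idx = remaining_idx[keep]
    return labels  # one extra input channel per point for the network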
Result
To evaluate the proposed method, experiments on down-sampling, semantic segmentation, and instance segmentation are carried out on the Stanford large-scale 3D indoor space dataset (S3DIS). The results demonstrate that the proposed fusion of plane feature enhancement and voxel filtering yields better plane extraction than the non-sampled data: the running time of the plane extraction algorithm is reduced by 85% after down-sampling, and still by about 62% when the cost of down-sampling itself is included. The plane feature-enhanced semantic segmentation method is trained on Areas 1-5 of S3DIS and tested on Area 6; its overall accuracy (OA) reaches 84.02% and its mean intersection over union (mIoU) reaches 60.65%, improvements of 2.3% and 4.2%, respectively, over PointNet.
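For reference, OA and mIoU follow their standard definitions over a point-wise confusion matrix; the helper below (not code from the paper) shows the computation.

```python
import numpy as np

def oa_and_miou(conf: np.ndarray) -> tuple:
    """conf[i, j] = number of points of true class i predicted as class j."""
    oa = np.trace(conf) / conf.sum()                  # overall accuracy
    tp = np.diag(conf).astype(np.float64)             # true positives per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp  # TP + FP + FN
    iou = tp / np.maximum(union, 1)                   # guard empty classes
    return oa, iou.mean()                             # (OA, mIoU)
```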
Conclusion
Experimental results on the S3DIS dataset demonstrate that the proposed method handles semantic segmentation of large-scale indoor scenes well. Fusing voxel filtering with 3D SIFT enables better extraction of planar features, and the Area-6 experiments confirm that semantic segmentation performance improves as well. The proposed scene reconstruction method obtains more refined and accurate reconstruction results to some extent. Future work will focus on the completion of small objects with complex structures, such as tables, chairs, and bookshelves, for which the accuracy of segmentation and reconstruction still shows little improvement. Deep learning-based methods could be used to capture the features of such small objects and improve semantic segmentation accuracy. Reconstruction methods for large and complex indoor scenes, especially the modeling of scene objects, also remain to be developed.
Keywords: point cloud; indoor scene; semantic segmentation; instance segmentation; 3D reconstruction
Armeni I, Sener O, Zamir A R, Jiang H, Brilakis I, Fischer M and Savarese S. 2016. 3D semantic parsing of large-scale indoor spaces//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 1534-1543 [DOI: 10.1109/CVPR.2016.170]
Avetisyan A, Dai A and Niessner M. 2019. End-to-end CAD model retrieval and 9DoF alignment in 3D scans//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 2551-2560 [DOI: 10.1109/ICCV.2019.00264]
Besl P J and McKay N D. 1992. Method for registration of 3-D shapes//Proceedings of SPIE 1611, Sensor Fusion IV: Control Paradigms and Data Structures. Boston, USA: SPIE: 586-606 [DOI: 10.1117/12.57955]
Charles R Q, Su H, Mo K C and Guibas L J. 2017. PointNet: deep learning on point sets for 3D classification and segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 77-85 [DOI: 10.1109/CVPR.2017.16]
Dai A, Ritchie D, Bokeloh M, Reed S, Sturm J and Nießner M. 2018. ScanComplete: large-scale scene completion and semantic segmentation for 3D scans//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 4578-4587 [DOI: 10.1109/CVPR.2018.00481]
Fischler M A and Bolles R C. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6): 381-395 [DOI: 10.1145/358669.358692]
Gao B F, Hu K Q and Guo S X. 2014. Collision detection algorithm based on AABB for minimally invasive surgery//Proceedings of 2014 IEEE International Conference on Mechatronics and Automation. Tianjin, China: IEEE: 315-320 [DOI: 10.1109/ICMA.2014.6885715]
Hu F Q, Huang Y and Li H. 2020. A review of 3D reconstruction methods applied in buildings. Intelligent Building and Smart City, (5): 10-14 [DOI: 10.13655/j.cnki.ibci.2020.05.005]
Hu Q Y, Yang B, Xie L H, Rosa S, Guo Y L, Wang Z H, Trigoni N and Markham A. 2020. RandLA-Net: efficient semantic segmentation of large-scale point clouds//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11108-11117 [DOI: 10.1109/CVPR42600.2020.01112]
Li Y Y, Dai A, Guibas L and Nießner M. 2015. Database-assisted object retrieval for real-time 3D reconstruction. Computer Graphics Forum, 34(2): 435-446 [DOI: 10.1111/cgf.12573]
Liu H M, Tang X C and Shen S H. 2020. Depth-map completion for large indoor scene reconstruction. Pattern Recognition, 99: #107112 [DOI: 10.1016/j.patcog.2019.107112]
Long X X, Cheng X J, Zhu H, Zhang P J, Liu H M, Li J, Zheng L T, Hu Q Y, Liu H, Cao X, Yang R G, Wu Y H, Zhang G F, Liu Y B, Xu K, Guo Y L and Chen B Q. 2021. Recent progress in 3D vision. Journal of Image and Graphics, 26(6): 1389-1428 [DOI: 10.11834/jig.210043]
Maturana D and Scherer S. 2015. VoxNet: a 3D convolutional neural network for real-time object recognition//Proceedings of 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems. Hamburg, Germany: IEEE: 922-928 [DOI: 10.1109/IROS.2015.7353481]
Meng H Y, Gao L, Lai Y K and Manocha D. 2019. VV-Net: voxel VAE net with group convolutions for point cloud segmentation//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea (South): IEEE: 8500-8508 [DOI: 10.1109/ICCV.2019.00859]
Miknis M, Davies R, Plassmann P and Ware A. 2015. Near real-time point cloud processing using the PCL//Proceedings of 2015 International Conference on Systems, Signals and Image Processing. London, UK: IEEE: 153-156 [DOI: 10.1109/IWSSIP.2015.7314200]
Nan L L, Xie K and Sharf A. 2012. A search-classify approach for cluttered indoor scene understanding. ACM Transactions on Graphics, 31(6): #137 [DOI: 10.1145/2366145.2366156]
Nie W Z, Wang Y, Song D and Li W H. 2020. 3D model retrieval based on a 3D shape knowledge graph. IEEE Access, 8: 142632-142641 [DOI: 10.1109/ACCESS.2020.3013595]
Park Y S, Yun Y I and Choi J S. 2009. A new shape descriptor using sliced image histogram for 3D model retrieval. IEEE Transactions on Consumer Electronics, 55(1): 240-247 [DOI: 10.1109/TCE.2009.4814441]
Qi C R, Yi L, Su H and Guibas L J. 2017. PointNet++: deep hierarchical feature learning on point sets in a metric space//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 5105-5114
Qiu S, Anwar S and Barnes N. 2021. Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 1757-1767 [DOI: 10.1109/CVPR46437.2021.00180]
Rister B, Horowitz M A and Rubin D L. 2017. Volumetric image registration from invariant keypoints. IEEE Transactions on Image Processing, 26(10): 4900-4910 [DOI: 10.1109/TIP.2017.2722689]
Rusu R B, Blodow N and Beetz M. 2009. Fast point feature histograms (FPFH) for 3D registration//Proceedings of 2009 IEEE International Conference on Robotics and Automation. Kobe, Japan: IEEE: 3212-3217 [DOI: 10.1109/ROBOT.2009.5152473]
Wang C, Hou S W, Wen C L, Gong Z, Li Q, Sun X T and Li J. 2018. Semantic line framework-based indoor building modeling using backpacked laser scanning point cloud. ISPRS Journal of Photogrammetry and Remote Sensing, 143: 150-166 [DOI: 10.1016/j.isprsjprs.2018.03.025]
Wang P S, Liu Y, Guo Y X, Sun C Y and Tong X. 2017. O-CNN: octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics, 36(4): #72 [DOI: 10.1145/3072959.3073608]
Wu Z R, Song S R, Khosla A, Yu F, Zhang L G, Tang X O and Xiao J X. 2015. 3D ShapeNets: a deep representation for volumetric shapes//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1912-1920 [DOI: 10.1109/CVPR.2015.7298801]
Yang M and Chen B Q. 2022. A survey of indoor scene generation algorithms. Journal of Integration Technology, 11(1): 40-51 [DOI: 10.12146/j.issn.2095-3135.20210928001]
Zhang L Q and Zhang L. 2018. Deep learning-based classification and reconstruction of residential scenes from large-scale point clouds. IEEE Transactions on Geoscience and Remote Sensing, 56(4): 1887-1897 [DOI: 10.1109/TGRS.2017.2769120]
Zhao H S, Jiang L, Fu C W and Jia J Y. 2019. PointWeb: enhancing local neighborhood features for point cloud processing//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 5565-5573 [DOI: 10.1109/CVPR.2019.00571]