多模态大模型驱动的三维视觉理解技术前沿进展
Recent Progress in Large Multi-modal Model based 3D Vision Understanding
2024年,页码: 1-47
网络出版日期: 2024-12-30
DOI: 10.11834/jig.240588
冯明涛,沈军豪,武子杰,等. 多模态大模型驱动的三维视觉理解技术前沿进展[J]. 中国图象图形学报,2024: 1-47 [DOI: 10.11834/jig.240588]
Feng Mingtao, Shen Junhao, Wu Zijie, et al. Recent Progress in Large Multi-modal Model based 3D Vision Understanding[J]. Journal of Image and Graphics, 2024: 1-47 [DOI: 10.11834/jig.240588]
三维(3D)视觉感知和理解在机器人导航、自动驾驶以及智能人机交互等众多领域有着广泛应用,是计算机视觉领域中备受瞩目的研究方向。随着多模态大模型的发展,其与3D视觉数据的融合取得了快速进展,为理解3D物理世界并与之交互提供了前所未有的能力,并展现出上下文学习、逐步推理、开放词汇能力和丰富世界知识等独特优势。本文首先介绍了从点云到3D高斯泼溅的3D视觉数据基本表示;随后梳理了主流多模态大模型的发展脉络;对联合多模态大模型的3D视觉数据表征方法做了详细的归纳总结;梳理了基于多模态大模型的3D理解任务,如3D生成与重建、3D目标检测、3D语义分割、3D场景描述、语言引导的3D目标定位和3D场景问答等,以及多模态大模型对机器人具身智能系统空间理解能力的提升;最后梳理了核心数据集,并对未来前景进行了深入讨论,以期促进该领域的深入研究与广泛应用。本文的全面分析揭示了该领域的重大进展,强调了利用多模态大模型进行3D视觉理解的潜力和必要性。本综述的目标是为未来的研究绘制一条路线,探索并扩展多模态大模型理解复杂3D世界并与之交互的能力,为空间智能领域的进一步发展铺平道路。
Three-dimensional (3D) visual perception and understanding are fundamental to numerous applications, including robotic navigation, autonomous driving, and intelligent human-computer interaction. As one of the most prominent research directions in computer vision, 3D vision has experienced rapid advancement, particularly with the rise of multimodal large models (MLLMs). The integration of MLLMs with 3D visual data has unlocked unprecedented capabilities for understanding and interacting with the physical 3D world. These models bring distinct advantages, such as contextual learning, step-by-step reasoning, open-vocabulary support, and rich world knowledge, making them transformative tools in the field of 3D vision. This paper provides a comprehensive overview of the latest progress in 3D vision understanding driven by MLLMs. It begins by addressing the foundational representations of 3D visual data. From point clouds to 3D Gaussian splatting, the review systematically examines mainstream data representation methods, which form the backbone for intelligent processing and analysis of 3D visual information. These representations enable the integration of semantic, spatial, and structural information, serving as a critical basis for downstream tasks in 3D vision. Following this, the paper traces the evolution of MLLMs, starting from the development of large language models (LLMs) and their extension into multimodal systems. It highlights the emergence of vision-language models (VLMs) that synergize text and visual data, offering significant potential for advancing 3D vision. The combination of multimodal pre-trained priors and 3D data representations has opened new avenues for cross-modal understanding and interaction. The ability of MLLMs to align multimodal features while reasoning across modalities has proven crucial for overcoming traditional limitations in 3D vision, such as sparse data, occlusion, and noise. The review then focuses on the methods for representing 3D visual data using MLLMs, providing an in-depth synthesis of current strategies. It discusses how MLLMs leverage pre-trained knowledge to interpret 3D information, facilitate multimodal feature alignment, and enhance the contextual understanding of complex 3D scenes. For instance, MLLMs enable the integration of 3D data with semantic priors, improving the efficiency and accuracy of tasks such as 3D object recognition, scene reconstruction, and spatial reasoning. In the context of specific 3D vision tasks, this paper explores various applications of MLLMs, including 3D generation and reconstruction, 3D object detection, semantic segmentation, scene description, language-guided 3D object localization, and 3D scene question answering. These tasks demonstrate the transformative impact of MLLMs on 3D vision, showcasing their ability to elevate task performance by incorporating multimodal capabilities. For example, 3D generation tasks benefit from the contextual knowledge of MLLMs, enabling the creation of semantically coherent and visually accurate 3D content. Similarly, 3D object detection and segmentation tasks leverage the reasoning capabilities of MLLMs to identify and classify objects in complex scenes more effectively. The role of MLLMs extends beyond traditional 3D vision tasks to applications in embodied robotic intelligence systems. Examples include robotic 3D grasping and 3D visual navigation, where MLLMs facilitate spatial understanding and decision-making in dynamic environments. 
By integrating multimodal reasoning with 3D perception, MLLMs enable robots to perform language-guided manipulations, navigate cluttered spaces, and interact seamlessly with the physical world. The paper also provides a detailed examination of datasets that support research in this domain. It reviews fundamental 3D datasets, such as those designed for point cloud analysis, voxel-based representations, and mesh modeling, alongside multimodal 3D vision-language datasets. These datasets are analyzed in terms of their scale, diversity, and application scope, offering a robust foundation for training and evaluating MLLMs in various 3D vision tasks. The lack of diverse and representative datasets, however, remains a significant challenge, limiting the generalizability and robustness of existing models. Building on these foundations, the paper addresses key challenges and future research priorities in MLLM-driven 3D vision understanding. Major challenges include the unification of 3D representation formats, improving the spatial reasoning capabilities of MLLMs, enhancing the generalization of MLLMs for diverse 3D data processing, and addressing practical deployment mechanisms for multimodal models in real-world 3D vision tasks. Additionally, the paper highlights the need for efficient model miniaturization for edge-side applications, such as autonomous robots and drones, and explores the potential of cloud-edge collaborative frameworks to enhance 3D vision tasks in resource-constrained environments. The future research directions proposed in this paper aim to address these challenges and further advance the field. Key priorities include developing scalable and efficient 3D data representations, designing domain-adaptive MLLMs, and creating comprehensive multimodal benchmarks that reflect real-world complexities. Moreover, the paper emphasizes the importance of fostering innovation in multimodal reasoning frameworks, enabling MLLMs to interpret, generate, and interact with 3D data seamlessly. Exploring the integration of real-time data streams with multimodal pre-training strategies could also provide valuable insights into dynamic 3D environments. By offering a comprehensive analysis of the progress, challenges, and future directions in this field, the paper underscores the transformative potential of MLLMs for 3D vision understanding. It highlights the necessity of leveraging these models to bridge the gap between perception and reasoning, enabling systems to interact effectively with the complex 3D world. The findings emphasize the importance of addressing existing limitations in 3D vision and pave the way for the continued evolution of artificial intelligence in spatially complex and multimodal environments. This review aims to inspire further exploration and expansion of MLLMs in 3D vision understanding, providing a roadmap for future research in this domain. Through the synthesis of state-of-the-art developments, the paper lays a foundation for advancing spatial intelligence, fostering deeper integration between AI systems and the physical world. Ultimately, the analysis highlights the potential of MLLMs to revolutionize 3D vision tasks, promoting their broad application across industries and accelerating the progress of intelligent systems in complex, multimodal scenarios.
三维视觉;多模态大模型;三维视觉表征;三维视觉生成;三维重建;机器人三维视觉;三维场景理解
3D vision; large multimodal model; 3D visual representation; 3D vision generation; 3D reconstruction; robot 3D vision; 3D scene understanding
Abdelreheem A, Olszewski K, Lee H Y, Wonka P and Achlioptas P. 2024. ScanEnts3D: exploiting phrase-to-3D-object correspondences for improved visio-linguistic models in 3D scenes//Proceedings of 2024 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 3512-3522 [DOI: 10.1109/WACV57701.2024.00349http://dx.doi.org/10.1109/WACV57701.2024.00349]
Achlioptas P, Abdelreheem A, Xia F, Elhoseiny M and Guibas L. 2020. ReferIt3D: neural listeners for fine-grained 3D object identification in real-world scenes//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 422-440 [DOI: 10.1007/978-3-030-58452-8_25http://dx.doi.org/10.1007/978-3-030-58452-8_25]
Achlioptas P, Huang I, Sung M, Tulyakov S and Guibas L. 2023. ShapeTalk: a language dataset and framework for 3D shape edits and deformations//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 12685-12694 [DOI: 10.1109/CVPR52729.2023.01220http://dx.doi.org/10.1109/CVPR52729.2023.01220]
Afham M, Dissanayake I, Dissanayake D, Dharmasiri A, Thilakarathna K and Rodrigo R. 2022. Crosspoint: self-supervised cross-modal contrastive learning for 3D point cloud understanding//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 9892-9902 [DOI: 10.1109/CVPR52688.2022.00967http://dx.doi.org/10.1109/CVPR52688.2022.00967]
Alayrac J B, Donahue J, Luc P, Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M, Ring R, Rutherford E, Cabi S, Han T, Gong Z, Samangooei S, Monteiro M, Menick J, Borgeaud S, Brock A, Nematzadeh A, Sharifzadeh S, Binkowski M, Barreira R, Vinyals O, Zisserman A and Simonyan K. 2022. Flamingo: a visual language model for few-shot learning//Proceedings of the 35th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 23716-23736
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick C L and Parikh D. 2015. Vqa: visual question answering//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 2425-2433 [DOI: 10.1109/ICCV.2015.279http://dx.doi.org/10.1109/ICCV.2015.279]
Arandjelovic R and Zisserman A. 2017. Look, listen and learn//Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 609-617 [DOI: 10.1109/ICCV.2017.73http://dx.doi.org/10.1109/ICCV.2017.73]
Armeni I, He Z Y, Gwak J, Zamir A R, Fischer M, Malik J and Savarese S. 2019. 3D scene graph: a structure for unified semantics, 3D space, and camera//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea: IEEE: 5663-5672 [DOI: 10.1109/ICCV.2019.00576http://dx.doi.org/10.1109/ICCV.2019.00576]
Azuma D, Miyanishi T, Kurita S and Kawanabe M. 2022. ScanQA: 3D question answering for spatial scene understanding//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 19107-19117 [DOI: 10.1109/CVPR52688.2022.01854http://dx.doi.org/10.1109/CVPR52688.2022.01854]
Bahmani S, Skorokhodov I, Rong V, Wetzstein G, Guibas L, Wonka P, Tulyakov S, Park J J, Tagliasacchi A and Lindell D B. 2023. 4D-fy: text-to-4d generation using hybrid score distillation sampling [EB/OL]. [2023-11-29]. https://arxiv.org/pdf/2311.17984.pdfhttps://arxiv.org/pdf/2311.17984.pdf
Bai H, Lyu Y, Jiang L, Li S, Lu H, Lin X and Wang L. 2023. CompoNeRF: text-guided multi-object compositional NeRF with editable 3D scene layout [EB/OL]. [2023-03-24]. https://arxiv.org/pdf/2303.13843.pdfhttps://arxiv.org/pdf/2303.13843.pdf
Bakr E M, Ayman M, Ahmed M, Slim H and Elhoseiny M. 2023. CoT3DRef: chain-of-thoughts data-efficient 3D visual grounding [EB/OL]. [2023-10-10]. https://arxiv.org/pdf/2310.06214.pdfhttps://arxiv.org/pdf/2310.06214.pdf
Bangalath H, Maaz M, Khattak M U, Khan S H and Shahbaz Khan F. 2022. Bridging the gap between object and image-level representations for open-vocabulary detection//Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 33781-33794
Bao H, Dong L, Piao S and Wei F. 2021. Beit: Bert pre-training of image transformers [EB/OL]. [2021-06-15]. https://arxiv.org/pdf/2106.08254.pdfhttps://arxiv.org/pdf/2106.08254.pdf
Bar-Hillel Y. 1960. The present status of automatic translation of languages. Advances in Computers,1: 91-163 [DOI: 10.7551/mitpress/5779.003.0009http://dx.doi.org/10.7551/mitpress/5779.003.0009]
Barron J T, Mildenhall B, Tancik M, Hedman P, Martin-Brualla R and Srinivasan P P. 2021. Mip-NeRF: a multiscale representation for anti-aliasing neural radiance fields//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 5835-5844 [DOI: 10.1109/ICCV48922.2021.00580http://dx.doi.org/10.1109/ICCV48922.2021.00580]
Bartsch A and Farimani A B. 2024. LLM-Craft: robotic crafting of elasto-plastic objects with large language models [EB/OL]. [2024-06-12]. https://arxiv.org/pdf/2406.08648.pdfhttps://arxiv.org/pdf/2406.08648.pdf
Boudjoghra M E A, Dai A, Lahoud J, Cholakkal H, Anwer R M, Khan S and Khan F S. 2024. Open-YOLO3D: towards fast and accurate open-vocabulary 3D instance segmentation [EB/OL]. [2024-06-04]. https://arxiv.org/pdf/2406.02548.pdfhttps://arxiv.org/pdf/2406.02548.pdf
Breyer M, Chung J J, Ott L, Siegwart R and Nieto J. 2021. Volumetric grasping network: real-time 6 DOF grasp detection in clutter [EB/OL]. [2021-01-04]. https://arxiv.org/pdf/2101.01132.pdfhttps://arxiv.org/pdf/2101.01132.pdf
Brown P F, Della Pietra V J, Desouza P V, Lai J C and Mercer R L. 1992. Class-based n-gram models of natural language. Computational Linguistics,18(4): 467-480 [DOI: 10.5555/176313.176316http://dx.doi.org/10.5555/176313.176316]
Brown T, Mann B, Ryder N, Subbiah M, Kaplan J D, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D M, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I and Amodei D. 2020. Language models are few-shot learners//Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc.: 1877-1901
Caesar H, Bankiti V, Lang A H, Vora S, Liong V E, Xu Q, Krishnan A, Pan Y, Baldan G and Beijbom O. 2020. nuScenes: a multimodal dataset for autonomous driving//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 11618-11628 [DOI: 10.1109/CVPR42600.2020.01164http://dx.doi.org/10.1109/CVPR42600.2020.01164]
Cai D, Zhao L, Zhang J, Sheng L and Xu D. 2022. 3DJCG: a unified framework for joint dense captioning and visual grounding on 3D point clouds//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 16443-16452 [DOI: 10.1109/CVPR52688.2022.01597http://dx.doi.org/10.1109/CVPR52688.2022.01597]
Cai J, He Y, Yuan W, Zhu S, Dong Z, Bo L and Chen Q. 2024. Ov9d: category-levelopen-vocabulary 9D object pose and size estimation [EB/OL]. [2024-03-19]. https://arxiv.org/pdf/2403.12396.pdfhttps://arxiv.org/pdf/2403.12396.pdf
Cao A and Johnson J. 2023. Hexplane: a fast representation for dynamic scenes//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE:130-141 [DOI: 10.1109/CVPR52729.2023.00021http://dx.doi.org/10.1109/CVPR52729.2023.00021]
Cao Y, Cao Y P, Han K, Shan Y and Wong K Y K. 2024. Dreamavatar: text-and-shape guided 3d human avatar generation via diffusion models//Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 958-968 [DOI: 10.1109/CVPR52733.2024.00097http://dx.doi.org/10.1109/CVPR52733.2024.00097]
Cao Y, Yihan Z, Xu H and Xu D. 2024. Coda: collaborative novel box discovery and cross-modal alignment for open-vocabulary 3D object detection//Proceedings of the 37th Conference on Advances in Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 71862-71873
Chang A X, Funkhouser T, Guibas L, Hanrahan P, Huang Q, Li Z, Savarese S, Savva M, Song S, Su H, Xiao J, Yi L and Yu F. 2015. ShapeNet: an information-rich 3D model repository [EB/OL]. [2015-12-09]. https://arxiv.org/pdf/1512.03012.pdfhttps://arxiv.org/pdf/1512.03012.pdf
Chang A, Dai A, Funkhouser T, Halber M, Nießner M, Savva M, Song S, Zeng A and Zhang Y. 2017. Matterport3d: Learning from rgb-d data in indoor environments [EB/OL]. [2017-09-18]. https://arxiv.org/pdf/1709.06158.pdfhttps://arxiv.org/pdf/1709.06158.pdf
Chang Y, Ballotta L and Carlone L. 2023. D-Lite: navigation-oriented compression of 3D scene graphs for multi-robot collaboration. IEEE Robotics and Automation Letters,8(11): 7527-7534 [DOI: 10.1109/LRA.2023.3320011http://dx.doi.org/10.1109/LRA.2023.3320011]
Chatzipantazis E, Pertigkiozoglou S, Dobriban E and Daniilidis K. 2022. Se (3)-equivariant attention networks for shape reconstruction in function space [EB/OL]. [2022-04-05]. https://arxiv.org/pdf/2204.02394.pdfhttps://arxiv.org/pdf/2204.02394.pdf
Chen A, Xu Z, Zhao F, Zhang X, Xiang F, Yu J and Su H. 2021. MvsNeRF: fast generalizable radiance field reconstruction from multi-view stereo//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 14104-14113 [DOI: 10.1109/ICCV48922.2021.01386http://dx.doi.org/10.1109/ICCV48922.2021.01386]
Chen B, Xu Z, Kirmani S, Ichter B, Sadigh D, Guibas L and Xia F. 2024. SpatialVLM: endowing vision-language models with spatial reasoning capabilities [EB/OL]. [2024-01-23]. https://arxiv.org/pdf/2401.12168.pdfhttps://arxiv.org/pdf/2401.12168.pdf
Chen D Z, Chang A X and Nießner M. 2020. Scanrefer: 3D object localization in RGB-D scans using natural language//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 202-221 [DOI: 10.1007/978-3-030-58565-5_13http://dx.doi.org/10.1007/978-3-030-58565-5_13]
Chen L, Wang X, Lu J, Lin S, Wang C and He G. 2024. CLIP-driven open-vocabulary 3D scene graph generation via cross-modality contrastive learning//Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 27863-27873
Chen R, Chen Y, Jiao N and Jia K. 2023. Fantasia3d: disentangling geometry and appearance for high-quality text-to-3d content creation//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 22189-22199 [DOI: 10.1109/ICCV51070.2023.02033http://dx.doi.org/10.1109/ICCV51070.2023.02033]
Chen R, Liu Y, Kong L, Zhu X, Ma Y, Li Y, Hou Y, Qiao Y and Wang W. 2023. CLIP2Scene: towards label-efficient 3D scene understanding by CLIP//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 7020-7030 [DOI: 10.1109/CVPR52729.2023.00678http://dx.doi.org/10.1109/CVPR52729.2023.00678]
Chen S, Chen X, Zhang C, Li M, Yu G, Fei H, Zhu H, Fan J and Chen T. 2023. LL3DA: visual interactive instruction tuning for Omni-3D understanding, reasoning, and planning [EB/OL]. [2023-11-30]. https://arxiv.org/pdf/2311.18651.pdfhttps://arxiv.org/pdf/2311.18651.pdf
Chen S, Zhu H, Chen X, Lei Y, Yu G and Chen T. 2023. End-to-end 3D dense captioning with vote2cap-detr//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 11124-11133 [DOI: 10.1109/CVPR52729.2023.01070http://dx.doi.org/10.1109/CVPR52729.2023.01070]
Chen S, Zhu H, Li M, Chen X, Guo P, Lei Y, Yu G, Li T and Chen T. 2023. Vote2cap-detr++: decoupling localization and describing for end-to-end 3D dense captioning [EB/OL]. [2023-09-06]. https://arxiv.org/pdf/2309.02999.pdfhttps://arxiv.org/pdf/2309.02999.pdf
Chen S, Zhu X, Liu W, He X, Liu J. 2021. Global-local propagation network for RGB-D semantic segmentation [EB/OL]. [2021-01-26]. https://arxiv.org/pdf/2101.10801.pdfhttps://arxiv.org/pdf/2101.10801.pdf
Chen T, Yu C, Li J, Zhang J, Zhu L, Ji D, Zhang Y, Zang Y, Li Z and Sun L. 2024. Reasoning3D -- grounding and reasoning in 3D: fine-grained zero-shot open-vocabulary 3D reasoning part segmentation via large vision-language models [EB/OL]. [2024-05-29]. https://arxiv.org/pdf/2405.19326.pdfhttps://arxiv.org/pdf/2405.19326.pdf
Chen X, Wang X, Changpinyo S, Piergiovanni A J, Padlewski P, Salz D, Goodman S, Grycner A, Mustafa B, Beyer L, Kolesnikov A, Puigcerver J, Ding N, Rong K, Akbari H, Mishra G, Xue L, Thapliyal A, Bradbury J, Kuo W, Seyedhosseini M, Jia C, Ayan B K, Riquelme C, Steiner A, Angelova A, Zhai X, Houlsby N and Soricut R. 2022. PaLI: a jointly-scaled multilingual language-image model [EB/OL]. [2022-09-14]. https://arxiv.org/pdf/2209.06794.pdfhttps://arxiv.org/pdf/2209.06794.pdf
Chen Y, Yang S, Huang H, Wang T, Lyu R, Xu R, Lin D and Peng J. 2024. Grounded 3D-LLM with referent tokens [EB/OL]. [2024-05-16]. https://arxiv.org/pdf/2405.10370.pdfhttps://arxiv.org/pdf/2405.10370.pdf
Chen Z, Gholami A, Nießner M and Chang A X. 2021. Scan2Cap: context-aware dense captioning in RGB-D scans//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 3192-3202 [DOI: 10.1109/CVPR46437.2021.00321http://dx.doi.org/10.1109/CVPR46437.2021.00321]
Cheng T, Song L, Ge Y, Liu W, Wang X and Shan Y. 2024. Yolo-world: real-time open-vocabulary object detection [EB/OL]. [2024-01-30]. https://arxiv.org/pdf/2401.17270.pdfhttps://arxiv.org/pdf/2401.17270.pdf
Chibane J and Pons-Moll G. 2020. Neural unsigned distance fields for implicit function learning//Proceedings of the 34th Conference on Advances in Neural Information Processing Systems. Red Hook, USA: Curran Associates Inc.: 21638-21652
Chibane J, Alldieck T and Pons-Moll G. 2020. Implicit functions in feature space for 3d shape reconstruction and completion//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 6968-6979 [DOI: 10.1109/CVPR42600.2020.00700http://dx.doi.org/10.1109/CVPR42600.2020.00700]
Chou G, Bahat Y and Heide F. 2023. Diffusion-sdf: conditional generative modeling of signed distance functions//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 2262-2272 [DOI: 10.1109/ICCV51070.2023.00215http://dx.doi.org/10.1109/ICCV51070.2023.00215]
Choy C B, Xu D, Gwak J, Chen K and Savarese S. 2016. 3D-R2N2: a unified approach for single and multi-view 3d object reconstruction//Proceedings of 2016 European Conference on Computer Vision. Amsterdam, the Netherlands: Springer: 628-644 [DOI: 10.1007/978-3-319-46484-8_38http://dx.doi.org/10.1007/978-3-319-46484-8_38]
Chu T, Zhang P, Dong X, Zang Y, Liu Q and Wang J. 2024. Unified scene representation and reconstruction for 3D large language models [EB/OL]. [2024-04-19]. https://arxiv.org/pdf/2404.13044.pdfhttps://arxiv.org/pdf/2404.13044.pdf
Chu Y, Xu J, Zhou X, Yang Q, Zhang S, Yan Z, Yan Z, Zhou C and Zhou J. 2023. Qwen-Audio: advancing universal audio understanding via unified large-scale audio-language models [EB/OL]. [2023-11-14]. https://arxiv.org/pdf/2311.07919.pdfhttps://arxiv.org/pdf/2311.07919.pdf
Chung J, Oh J and Lee K M. 2023. Depth-regularized optimization for 3D gaussian splatting in few-shot images [EB/OL]. [2023-11-22]. https://arxiv.org/pdf/2311.13398.pdfhttps://arxiv.org/pdf/2311.13398.pdf
Cong P, Zhu X, Qiao F, Ren Y, Peng X, Hou Y, Xu L, Yang R, Manocha D and Ma Y. 2022. Stcrowd: A multimodal dataset for pedestrian perception in crowded scenes//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 19576-19585 [DOI: 10.1109/CVPR52688.2022.01899http://dx.doi.org/10.1109/CVPR52688.2022.01899]
Dai A, Chang A X, Savva M, Halber M, Funkhouser T and Nießner M. 2017. ScanNet: richly-annotated 3D reconstructions of indoor scenes//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2432-2443 [DOI: 10.1109/CVPR.2017.261http://dx.doi.org/10.1109/CVPR.2017.261]
Dai Z, Asgharivaskasi A, Duong T, Lin S, Tzes M E, Pappas G and Atanasov N. 2023. Optimal scene graph planning with large language model guidance [EB/OL]. [2023-09-17]. https://arxiv.org/pdf/2309.09182.pdfhttps://arxiv.org/pdf/2309.09182.pdf
Das A, Datta S, Gkioxari G, Lee S, Parikh D and Batra D. 2018. Embodied question answering//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 1-10 [DOI: 10.1109/CVPR.2018.00008http://dx.doi.org/10.1109/CVPR.2018.00008]
Deitke M, Liu R, Wallingford M, Ngo H, Michel O, Kusupati A, Fan A, Laforte C, Voleti V, Gadre S Y, VanderBilt E, Kembhavi A, Vondrick C, Gkioxari G, Ehsani K, Schmidt L and Farhadi A. 2024. Objaverse-XL: a universe of 10M+ 3D objects//Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 35799-35813
Deitke M, Schwenk D, Salvador J, Weihs L, Michel O, VanderBilt E, Schmidt L, Ehsanit K and Farhadi A. 2023. Objaverse: a universe of annotated 3D objects//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 13142-13153 [DOI: 10.1109/CVPR52729.2023.01263http://dx.doi.org/10.1109/CVPR52729.2023.01263]
Deng X, Zhang W, Ding Q and Zhang X. 2023. Pointvector: a vector representation in point cloud analysis//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 9455-9465 [DOI: 10.1109/CVPR52729.2023.00912http://dx.doi.org/10.1109/CVPR52729.2023.00912]
Devlin J, Chang M W, Lee K and Toutanova K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding [EB/OL]. [2018-10-11]. https://arxiv.org/pdf/1810.04805.pdfhttps://arxiv.org/pdf/1810.04805.pdf
Ding R, Yang J, Xue C, Zhang W, Bai S and Qi X. 2023. PLA: language-driven open-vocabulary 3D scene understanding//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 7010-7019 [DOI: 10.1109/CVPR52729.2023.00677http://dx.doi.org/10.1109/CVPR52729.2023.00677]
Driess D, Xia F, Sajjadi M S, Lynch C, Chowdhery A, Ichter B, Wahid A, Tompson J, Vuong Q, Yu T, Huang W, Chebotar Y, Sermanet P, Duckworth D, Levine S, Vanhoucke V, Hausman K, Toussaint M, Greff K, Zeng A, Mordatch I and Florence P. 2023. PaLM-E: An Embodied Multimodal Language Model [EB/OL]. [2023-03-06]. https://arxiv.org/pdf/2303.03378.pdfhttps://arxiv.org/pdf/2303.03378.pdf
Etchegaray D, Huang Z, Harada T and Luo Y. 2024. Find n'Propagate: open-vocabulary 3D object detection in urban environments [EB/OL]. [2024-03-20]. https://arxiv.org/pdf/2403.13556.pdfhttps://arxiv.org/pdf/2403.13556.pdf
Fei B, Li Y, Yang W, Ma L and He Y. 2024. Towards unified representation of multi-modal pre-training for 3D understanding via differentiable rendering [EB/OL]. [2024-04-21]. https://arxiv.org/pdf/2404.13619.pdfhttps://arxiv.org/pdf/2404.13619.pdf
Fei J, Ahmed M, Ding J, Bakr E M and Elhoseiny M. 2024. Kestrel: point grounding multimodal LLM for part-aware 3D vision-language understanding [EB/OL]. [2024-05-29]. https://arxiv.org/pdf/2405.18937.pdfhttps://arxiv.org/pdf/2405.18937.pdf
Feng C, Hsu J, Liu W and Wu J. 2024. Naturally supervised 3D visual grounding with language-regularized concept learners [EB/OL]. [2024-04-30]. https://arxiv.org/pdf/2404.19696.pdfhttps://arxiv.org/pdf/2404.19696.pdf
Feng M, Hou H, Zhang L, Guo Y, Yu H, Wang Y and Mian A. 2023. Exploring hierarchical spatial layout cues for 3D point cloud based scene graph prediction. IEEE Transactions on Multimedia. [DOI: 10.1109/TMM.2023.3277736http://dx.doi.org/10.1109/TMM.2023.3277736]
Feng Q, Liu Y, Lai Y K, Yang J and Li K. 2022. FoF: learning fourier occupancy field for monocular real-time human reconstruction//Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 7397-7409
Feng Y, Feng Y, You H, Zhao X and Gao Y. 2019. MeshNet: mesh neural network for 3d shape representation//Proceedings of the 33th AAAI Conference on Artificial Intelligence. Honolulu, USA: AAAI: 8279-8286 [DOI: 10.1609/aaai.v33i01.33018279http://dx.doi.org/10.1609/aaai.v33i01.33018279]
Fine S, Singer Y and Tishby N. 1998. The hierarchical hidden Markov model: analysis and applications. Machine Learning,32: 41–62 [DOI: 10.1023/A:1007469218079http://dx.doi.org/10.1023/A:1007469218079]
Fu H, Cai B, Gao L, Zhang L X, Wang J, Li C, Xun Z, Sun C, Jia R, Zhao B and Zhang H. 2021. 3D-FRONT: 3D furnished rooms with layouts and semantics//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 10913-10922 [DOI: 10.1109/ICCV48922.2021.01075http://dx.doi.org/10.1109/ICCV48922.2021.01075]
Fu H, Jia R, Gao L, Gong M, Zhao B, Maybank S and Tao D. 2021. 3D-Future: 3D furniture shape with texture. International Journal of Computer Vision,129(12): 3313-3337 [DOI: 10.1007/s11263-021-01534-zhttp://dx.doi.org/10.1007/s11263-021-01534-z]
Fu R, Liu J, Chen X, Nie Y and Xiong W. 2024. Scene-LLM: extending language model for 3D visual understanding and reasoning [EB/OL]. [2024-03-18]. https://arxiv.org/pdf/2403.11401.pdfhttps://arxiv.org/pdf/2403.11401.pdf
Gao G, Liu W, Chen A, Geiger A and Schölkopf B. 2024. Graphdreamer: compositional 3D scene synthesis from scene graphs [EB/OL]. [2023-11-30]. https://arxiv.org/pdf/2312.00093.pdfhttps://arxiv.org/pdf/2312.00093.pdf
Gao J, Shen T, Wang Z, Chen W, Yin K, Li D, Litany O, Gojcic Z and Fidler S. 2022. Get3d: A generative model of high quality 3d textured shapes learned from images//Proceedings of the 35th Conference on Advances in Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 31841-31854
Ge J, Luo H, Qian S, Gan Y, Fu J and Zhang S. 2023. Chain of thought prompt tuning in vision language models [EB/OL]. [2023-04-16]. https://arxiv.org/pdf/2304.07919.pdfhttps://arxiv.org/pdf/2304.07919.pdf
Gemini Team Google. 2023. Gemini: a family of highly capable multimodal models [EB/OL]. [2023-12-19]. https://arxiv.org/pdf/2312.11805.pdfhttps://arxiv.org/pdf/2312.11805.pdf
Girdhar R, El-Nouby A, Liu Z, Singh M, Alwala K V, Joulin A and Misra I. 2023. ImageBind: one embedding space to bind them all//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 15180-15190 [DOI: 10.1109/CVPR52729.2023.01457http://dx.doi.org/10.1109/CVPR52729.2023.01457]
Girish S, Gupta K and Shrivastava A. 2023. Eagles: efficient accelerated 3D gaussians with lightweight encodings [EB/OL]. [2023-12-07]. https://arxiv.org/pdf/2312.04564.pdfhttps://arxiv.org/pdf/2312.04564.pdf
Goyal A, Xu J, Guo Y, Blukis V, Chao YW and Fox D. 2023. Rvt: Robotic view transformer for 3d object manipulation//Proceedings of the 7th Conference on Robot Learning. Atlanta, USA: PMLR: 694-710 [DOI: 10.48550/arXiv.2306.14896http://dx.doi.org/10.48550/arXiv.2306.14896]
Gu J, Trevithick A, Lin K E, Susskind J M, Theobalt C, Liu L and Ramamoorthi R. 2023. NerfDiff: single-image view synthesis with nerf-guided distillation from 3D-aware diffusion//Proceedings of the 40th International Conference on Machine Learning. Honolulu, USA: PMLR: 11808-11826 [DOI: 10.5555/3618408.3618881http://dx.doi.org/10.5555/3618408.3618881]
Gu Q, Kuwajerwala A, Morin S, Jatavallabhula K M, Sen B, Agarwal A, Rivera C, Paul W, Ellis K, Chellappa R, Gan C, Melo C M D, Tenenbaum J B, Torralba A, Shkurti F and Paull L. 2023. ConceptGraphs: open-vocabulary 3D scene graphs for perception and planning [EB/OL]. [2023-09-28]. https://arxiv.org/pdf/2309.16650.pdfhttps://arxiv.org/pdf/2309.16650.pdf
Guo M H, Cai J X, Liu Z N, Mu T J, Martin R R and Hu S M. 2021. Pct: point cloud transformer. Computational Visual Media,7(2): 187-199 [DOI: 10.1007/s41095-021-0229-5http://dx.doi.org/10.1007/s41095-021-0229-5]
Guo Z, Tang Y, Zhang R, Wang D, Wang Z, Zhao B and Li X. 2023. ViewRefer: grasp the multi-view knowledge for 3D visual grounding//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 15326-15337 [DOI: 10.1109/ICCV51070.2023.01410http://dx.doi.org/10.1109/ICCV51070.2023.01410]
Guo Z, Zhang R, Zhu X, Tang Y, Ma X, Han J, Chen K, Gao P, Li X, Li H and Heng P A. 2023. Point-Bind & Point-LLM: aligning point cloud with multi-modality for 3D understanding, generation, and instruction following [EB/OL]. [2023-09-01]. https://arxiv.org/pdf/2309.00615.pdfhttps://arxiv.org/pdf/2309.00615.pdf
Gupta A, Xiong W, Nie Y, Jones I and Oğuz B. 2023. 3DGen: triplane latent diffusion for textured mesh generation [EB/OL]. [2023-03-09]. https://arxiv.org/pdf/2303.05371.pdfhttps://arxiv.org/pdf/2303.05371.pdf
Han D, McInroe T, Jelley A, Albrecht S V, Bell P and Storkey A. 2024. LLM-Personalize: aligning LLM planners with human preferences via reinforced self-training for housekeeping robots [EB/OL]. [2024-04-22]. https://arxiv.org/pdf/2404.14285.pdfhttps://arxiv.org/pdf/2404.14285.pdf
Han J, Gong K, Zhang Y, Wang J, Zhang K, Lin D, Qiao Y, Gao P and Yue X. 2024. OneLLM: one framework to align all modalities with language [EB/OL]. [2023-09-01]. https://arxiv.org/pdf/2309.00615.pdfhttps://arxiv.org/pdf/2309.00615.pdf
Han X, Tang Y, Wang Z, Li X. 2024. Mamba3d: Enhancing local features for 3d point cloud analysis via state space model [EB/OL]. [2024-04-23]. https://arxiv.org/pdf/2404.14966.pdfhttps://arxiv.org/pdf/2404.14966.pdf
Han Z, Wang X, Vong C M, Liu Y S, Zwicker M and Chen C L. 2019. 3DViewGraph: learning global features for 3D shapes from a graph of unordered views with attention [EB/OL]. [2019-05-17]. https://arxiv.org/pdf/1905.07503.pdfhttps://arxiv.org/pdf/1905.07503.pdf
Hanocka R, Hertz A, Fish N, Giryes R, Fleishman S and Cohen-Or D. 2019. Meshcnn: a network with an edge. ACM Transactions on Graphics,38(4): 1-12 [DOI: 10.1145/3306346.3322959http://dx.doi.org/10.1145/3306346.3322959]
He Q, Peng J, Jiang Z, Wu K, Ji X, Zhang J, Wang Y, Wang C, Chen M and Wu Y. 2024. UniM-OV3D: uni-modality open-vocabulary 3D scene understanding with fine-grained feature representation [EB/OL]. [2024-01-21]. https://arxiv.org/pdf/2401.11395.pdfhttps://arxiv.org/pdf/2401.11395.pdf
Hegde D, Valanarasu J M J and Patel V. 2023. CLIP goes 3D: leveraging prompt tuning for language grounded 3D recognition//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision Workshops. Paris, France: IEEE: 2020-2030 [DOI: 10.1109/ICCVW60793.2023.00217http://dx.doi.org/10.1109/ICCVW60793.2023.00217]
Hess G, Tonderski A, Petersson C, Åström K and Svensson L. 2024. LidarCLIP or: how I learned to talk to point clouds//Proceedings of 2024 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 7423-7432 [DOI: 10.1109/WACV57701.2024.00727http://dx.doi.org/10.1109/WACV57701.2024.00727]
Hochreiter S and Schmidhuber J. 1997. Long short-term memory. Neural Computation,9(8): 1735-1780 [DOI: 10.1162/neco.1997.9.8.1735http://dx.doi.org/10.1162/neco.1997.9.8.1735]
Honerkamp D, Buchner M, Despinoy F, Welschehold T and Valada A. 2024. Language-grounded dynamic scene graphs for interactive object search with mobile manipulation [EB/OL]. [2024-03-13]. https://arxiv.org/pdf/2403.08605.pdfhttps://arxiv.org/pdf/2403.08605.pdf
Hong F, Chen Z, Lan Y, Pan L and Liu Z. 2022. Eva3d: Compositional 3human generation fromd 2d image collections [EB/OL]. [2022-10-10]. https://arxiv.org/pdf/2210.04888https://arxiv.org/pdf/2210.04888
Hong F, Zhang M, Pan L, Cai Z, Yang L and Liu Z. 2022. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars. ACM Transactions on Graphics,41(4): 1-19 [DOI: 10.1145/3528223.3530094http://dx.doi.org/10.1145/3528223.3530094]
Hong S, Yavartanoo M, Neshatavar R and Lee K M. 2023. ACL-SPC: adaptive closed-loop system for self-supervised point cloud completion //Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 9435-9444 [DOI: 10.1109/CVPR52729.2023.00910http://dx.doi.org/10.1109/CVPR52729.2023.00910]
Hong Y, Lin C, Du Y, Chen Z, Tenenbaum J B and Gan C. 2023. 3D concept learning and reasoning from multi-view images//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 9202-9212 [DOI: 10.1109/CVPR52729.2023.00888http://dx.doi.org/10.1109/CVPR52729.2023.00888]
Hong Y, Zhen H, Chen P, Zheng S, Du Y, Chen Z and Gan C. 2023. 3D-LLM: injecting the 3D world into large language models//Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 20482-20494
Hong Y, Zheng Z, Chen P, Wang Y, Li J and Gan C. 2024. Multiply: a multisensory object-centric embodied large language model in 3D world [EB/OL]. [2024-01-16]. https://arxiv.org/pdf/2401.08577.pdfhttps://arxiv.org/pdf/2401.08577.pdf
Huang C, Mees O, Zeng A and Burgard W. 2023. Visual language maps for robot navigation//Proceedings of 2023 IEEE/CVF International Conference on Robotics and Automation. London, UK: IEEE: 10608-10615 [DOI: 10.1109/ICRA48891.2023.10160969http://dx.doi.org/10.1109/ICRA48891.2023.10160969]
Huang J, Yong S, Ma X, Linghu X, Li P, Wang Y, Li Q, Zhu S, Jia B and Huang S. 2023. An embodied generalist agent in 3D world [EB/OL]. [2023-11-18]. https://arxiv.org/pdf/2311.12871.pdfhttps://arxiv.org/pdf/2311.12871.pdf
Huang J, Zhang H, Zhao M and Wu Z. 2024. IVLMap: instance-aware visual language grounding for consumer robot navigation [EB/OL]. [2024-03-28]. https://arxiv.org/pdf/2403.19336.pdfhttps://arxiv.org/pdf/2403.19336.pdf
Huang K, Yang J, Wang J, He S, Wang Z, He H, Zhang Q and Lu G. 2024. Granular3D: delving into multi-granularity 3D scene graph prediction. Pattern Recognition,153: #110562 [DOI: 10.1016/j.patcog.2024.110562http://dx.doi.org/10.1016/j.patcog.2024.110562]
Huang R, Pan X, Zheng H, Jiang H, Xie Z, Wu C, Song S and Huang G. 2024. Joint representation learning for text and 3D point cloud. Pattern Recognition,147: #110086 [DOI: 10.1016/j.patcog.2023.110086http://dx.doi.org/10.1016/j.patcog.2023.110086]
Huang S, Wang Z, Li P, Jia B, Liu T, Zhu Y, Liang W and Zhu S C. 2023. Diffusion-based generation, optimization, and planning in 3D scenes//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 16750-16761 [DOI: 10.1109/CVPR52729.2023.01607http://dx.doi.org/10.1109/CVPR52729.2023.01607]
Huang T, Dong B, Yang Y, Huang X, Lau R W, Ouyang W and Zuo W. 2023. CLIP2Point: transfer clip to point cloud classification with image-depth pre-training//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 22100-22110 [DOI: 10.1109/ICCV51070.2023.02025http://dx.doi.org/10.1109/ICCV51070.2023.02025]
Huang W, Wang C, Zhang R, Li Y, Wu J and Fei-Fei L. 2023. Voxposer: composable 3D value maps for robotic manipulation with language models [EB/OL]. [2023-07-12]. https://arxiv.org/pdf/2307.05973.pdfhttps://arxiv.org/pdf/2307.05973.pdf
Huang X, Huang Z, Li S, Qu W, He T, Hou Y, Zuo Y and Ouyang W. 2024. Frozen CLIP transformer is an efficient point cloud encoder//Proceedings of the 38th AAAI Conference on Artificial Intelligence. Vancouver, USA: AAAI: 2382-2390 [DOI: 10.1609/aaai.v38i3.28013http://dx.doi.org/10.1609/aaai.v38i3.28013]
Huang X, Shao R, Zhang Q, Zhang H, Feng Y, Liu Y and Wang Q. 2024. Humannorm: Learning normal diffusion model for high-quality and realistic 3d human generation//Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 4568-4577 [DOI: 10.1109/CVPR52733.2024.00437http://dx.doi.org/10.1109/CVPR52733.2024.00437]
Huang Y, Du C, Xue Z, Chen X, Zhao H and Huang L. 2021. What makes multi-modal learning better than single (provably)//Proceedings of the 35th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 10944-10956
Hughes N, Chang Y and Carlone L. 2022. Hydra: a real-time spatial perception system for 3D scene graph construction and optimization [EB/OL]. [2022-01-31]. https://arxiv.org/pdf/2201.13360.pdfhttps://arxiv.org/pdf/2201.13360.pdf
Jain A, Mildenhall B, Barron J T, Abbeel P and Poole B. 2022. Zero-shot text-guided object generation with dream fields//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 857-866 [DOI: 10.1109/CVPR52688.2022.00094http://dx.doi.org/10.1109/CVPR52688.2022.00094]
Ji J, Wang H, Wu C, Ma Y, Sun X and Ji R. 2023. JM3D & JM3D-LLM: elevating 3D representation with joint multi-modal cues [EB/OL]. [2023-10-14]. https://arxiv.org/pdf/2310.09503.pdfhttps://arxiv.org/pdf/2310.09503.pdf
Jia B, Chen Y, Yu H, Wang Y, Niu X, Liu T, Li Q and Huang S. 2024. SceneVerse: scaling 3D vision-language learning for grounded scene understanding [EB/OL]. [2024-01-17]. https://arxiv.org/pdf/2401.09340.pdfhttps://arxiv.org/pdf/2401.09340.pdf
Jiang L, Shi S and Schiele B. 2024. Open-vocabulary 3D semantic segmentation with foundation models//Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 21284-21294
Jiang R, Wang C, Zhang J, Chai M, He M, Chen D and Liao J. 2023. Avatarcraft: transforming text into neural human avatars with parameterized shape and pose control//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 14325-14336 [DOI: 10.1109/ICCV51070.2023.01322http://dx.doi.org/10.1109/ICCV51070.2023.01322]
Jin Z, Hayat M, Yang Y, Guo Y and Lei Y. 2023. Context-aware alignment and mutual masking for 3D-language pre-training//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 10984-10994 [DOI: 10.1109/CVPR52729.2023.01057http://dx.doi.org/10.1109/CVPR52729.2023.01057]
Kerbl B, Kopanas G, Leimkühler T and Drettakis G. 2023. 3D gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics,42(4): 1-14 [DOI: 10.1145/3592433http://dx.doi.org/10.1145/3592433]
Khurana M, Peri N, Ramanan D and Hays J. 2024. Shelf-supervised multi-modal pre-training for 3D object detection [EB/OL]. [2024-06-14]. https://arxiv.org/pdf/2406.10115.pdfhttps://arxiv.org/pdf/2406.10115.pdf
Kim B, Kwon P, Lee K, Lee M, Han S, Kim D and Joo H. 2023. Chupa: Carving 3d clothed humans from skinned shape priors using 2d diffusion probabilistic models//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 15919-15930 [DOI: 10.1109/ICCV51070.2023.01463http://dx.doi.org/10.1109/ICCV51070.2023.01463]
Kim M, Lee J and Kim J. 2023. GMR-Net: gcn-based mesh refinement framework for elliptic pde problems. Engineering with Computers,39(5): 1778-1790 [DOI: 10.1007/s00366-023-01811-0http://dx.doi.org/10.1007/s00366-023-01811-0]
Ko H K, Park G, Jeon H, Jo J, Kim J and Seo J. 2023. Large-scale text-to-image generation models for visual artists’ creative works [EB/OL]. [2023-02-16]. https://arxiv.org/pdf/2210.08477.pdfhttps://arxiv.org/pdf/2210.08477.pdf
Koch S, Vaskevicius N, Colosi M, Hermosilla P and Ropinski T. 2024. Open3dsg: open-vocabulary 3D scene graphs from point clouds with queryable objects and open-set relationships [EB/OL]. [2024-02-19]. https://arxiv.org/pdf/2402.12259.pdfhttps://arxiv.org/pdf/2402.12259.pdf
Koh J Y, Fried D and Salakhutdinov R R. 2024. Generating images with multimodal language models//Proceedings of the 37th Conference on Advances in Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 21487-21506
Kumra S, Joshi S and Sahin F. 2020. Antipodal robotic grasping using generative residual convolutional neural network//Proceedings of 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems. Las Vegas, USA: IEEE: 9626-9633 [DOI: 10.1109/IROS45743.2020.9340777http://dx.doi.org/10.1109/IROS45743.2020.9340777]
Lee J C, Rho D, Sun X, Ko J H and Park E. 2023. Compact 3D gaussian representation for radiance field [EB/OL]. [2023-11-22]. https://arxiv.org/pdf/2311.13681.pdfhttps://arxiv.org/pdf/2311.13681.pdf
Lei Y J, Xu K, Guo Y L, Yang X, Wu Y W, Hu W, Yang J Q and Wang H Y. 2024. Comprehensive survey on 3D visual-language understanding techniques. Journal of Image and Graphics,29(6): 1747-1764
雷印杰,徐凯,郭裕兰,杨鑫,武玉伟,胡玮,杨佳琪,汪汉云.2024.“三维视觉—语言”推理技术的前沿研究与最新趋势.中国图象图形学报,29(6): 1747-1764 [DOI: 10.11834/jig.240385http://dx.doi.org/10.11834/jig.240385]
Leng Z, Birdal T, Liang X and Tombari F. 2024. HyperSDFusion: bridging hierarchical structures in language and geometry for enhanced 3D text2shape generation [EB/OL]. [2024-03-01]. https://arxiv.org/pdf/2403.00372.pdfhttps://arxiv.org/pdf/2403.00372.pdf
Li B, Weinberger K Q, Belongie S, Koltun V and Ranftl R. 2022. Language-driven semantic segmentation [EB/OL]. [2022-01-10]. https://arxiv.org/pdf/2201.03546.pdfhttps://arxiv.org/pdf/2201.03546.pdf
Li J, Li D, Savarese S and Hoi S. 2023. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models//Proceedings of the 40th International Conference on Machine Learning. Honolulu, USA: PMLR: 19730-19742 [DOI: 10.5555/3618408.3619222http://dx.doi.org/10.5555/3618408.3619222]
Li J, Li D, Xiong C and Hoi S. 2022. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation//Proceedings of the 39th International Conference on Machine Learning. Baltimore, USA: PMLR: 12888-12900 [DOI:10.48550/arXiv.2201.12086http://dx.doi.org/10.48550/arXiv.2201.12086]
Li J, Liu Z, Li L, Lin J, Yao J and Tu J. 2023. Multi-view convolutional vision transformer for 3D object recognition. Journal of Visual Communication and Image Representation,95: #103906 [DOI: 10.1016/j.jvcir.2023.103906http://dx.doi.org/10.1016/j.jvcir.2023.103906]
Li J, Zhang J, Bai X, Zheng J, Ning X, Zhou J and Gu L. 2024. Dngaussian: optimizing sparse-view 3D gaussian radiance fields with global-local depth normalization [EB/OL]. [2024-03-11]. https://arxiv.org/pdf/2403.06912.pdfhttps://arxiv.org/pdf/2403.06912.pdf
Li K, He Y, Wang Y, Li Y, Wang W, Luo P, Wang Y, Wang L and Qiao Y. 2023. Videochat: chat-centric video understanding [EB/OL]. [2023-05-10]. https://arxiv.org/pdf/2305.06355.pdfhttps://arxiv.org/pdf/2305.06355.pdf
Li K, Wang J, Yang L, Lu C and Dai B. 2024. Semgrasp: semantic grasp generation via language aligned discretization [EB/OL]. [2024-04-04]. https://arxiv.org/pdf/2404.03590.pdfhttps://arxiv.org/pdf/2404.03590.pdf
Li M, Chen X, Zhang C, Chen S, Zhu H, Yin F, Yu G and Chen T. 2023. M3DBench: Let's instruct large models with multi-modal 3D prompts [EB/OL]. [2023-12-17]. https://arxiv.org/pdf/2312.10763.pdfhttps://arxiv.org/pdf/2312.10763.pdf
Li P, He F, Fan B and Song Y. 2023. TPNet: a novel mesh analysis method via topology preservation and perception enhancement. Computer Aided Geometric Design,104: #102219 [DOI: 10.1016/j.cagd.2023.102219http://dx.doi.org/10.1016/j.cagd.2023.102219]
Li T, Slavcheva M, Zollhoefer M, Green S, Lassner C, Kim C, Schmidt T, Lovegrove S, Goesele M, Newcombe R and Lv Z. 2022. Neural 3D video synthesis from multi-view video//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 5511-5521 [DOI: 10.1109/CVPR52688.2022.00544http://dx.doi.org/10.1109/CVPR52688.2022.00544]
Li X, Ding J, Chen Z and Elhoseiny M. 2023. Uni3DL: unified model for 3D and language understanding [EB/OL]. [2023-12-05]. https://arxiv.org/pdf/2312.03026.pdfhttps://arxiv.org/pdf/2312.03026.pdf
Li Y, Yang W and Fei B. 2024. 3DMambaComplete: exploring structured state space model for point cloud completion [EB/OL]. [2024-02-16]. https://arxiv.org/pdf/2402.10739.pdfhttps://arxiv.org/pdf/2402.10739.pdf
Li Z, Zhang C, Wang X, Ren R, Xu Y, Ma R and Liu X. 2024. 3DMIT: 3D multi-modal instruction tuning for scene understanding [EB/OL]. [2024-01-06]. https://arxiv.org/pdf/2401.03201.pdfhttps://arxiv.org/pdf/2401.03201.pdf
Liang D, Zhou X, Wang X, Zhu X, Xu W, Zou Z and Bai X. 2024. Pointmamba: a simple state space model for point cloud analysis [EB/OL]. [2024-02-16]. https://arxiv.org/pdf/2402.10739.pdfhttps://arxiv.org/pdf/2402.10739.pdf
Liang F, Wu B, Dai X, Li K, Zhao Y, Zhang H, Zhang P, Vajda P and Marculescu D. 2023. Open-vocabulary semantic segmentation with mask-adapted CLIP//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 7061-7070 [DOI: 10.1109/CVPR52729.2023.00682http://dx.doi.org/10.1109/CVPR52729.2023.00682]
Liang J, Huang W, Xia F, Xu P, Hausman K, Ichter B, Florence P and Zeng A. 2023. Code as policies: language model programs for embodied control//Proceedings of 2023 IEEE International Conference on Robotics and Automation. London, UK: IEEE: 9493-9500 [DOI: 10.1109/ICRA48891.2023.10160591http://dx.doi.org/10.1109/ICRA48891.2023.10160591]
Lin C H, Gao J, Tang L, Takikawa T, Zeng X, Huang X, Kreis K, Fidler S, Liu M and Lin T Y. 2023. Magic3d: high-resolution text-to-3d content creation//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 300-309 [DOI: 10.1109/CVPR52729.2023.00037http://dx.doi.org/10.1109/CVPR52729.2023.00037]
Lin K E, Lin Y C, Lai W S, Lin T Y, Shih Y C and Ramamoorthi R. 2023. Vision transformer for nerf-based view synthesis from a single input image//Proceedings of 2023 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 806-815 [DOI: 10.1109/WACV56688.2023.00087http://dx.doi.org/10.1109/WACV56688.2023.00087]
Liu D, Huang X, Hou Y, Wang Z, Yin Z, Gong Y, Gao P and Ouyang W. 2024. Uni3D-LLM: unifying point cloud perception, generation and editing with large language models [EB/OL]. [2024-01-09]. https://arxiv.org/pdf/2402.03327.pdfhttps://arxiv.org/pdf/2402.03327.pdf
Liu H, Li C, Wu Q and Lee Y J. 2024. Visual instruction tuning//Proceedings of the 37th Conference on Advances in Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 34892-34916
Liu M, Qin D X, Han Y B, Chen X and Wang Y N. 2024. A multi-scale dynamic visual network for surgical robots scene segmentation. Journal of Image and Graphics (刘敏,秦敦璇,韩雨斌,陈祥,王耀南.2024. 多尺度动态视觉网络的手术机器人场景分割.中国图象图形学报[DOI: 10.11834/jig.240385]
Liu M, Shi R, Kuang K, Zhu Y, Li X, Han S, Cai H, Porikli F and Su H. 2024. OpenShape: scaling up 3D shape representation towards open-world understanding//Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 44860-44879
Liu R, Wang X, Wang W and Yang Y. 2023. Bird's-eye-view scene graph for vision-language navigation//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 10934-10946 [DOI: 10.1109/ICCV51070.2023.01007http://dx.doi.org/10.1109/ICCV51070.2023.01007]
Liu Y T, Wang L, Yang J, Chen W, Meng X, Yang B and Gao L. 2023. Neudf: leaning neural unsigned distance fields with volume rendering. IEEE Transactions on Pattern Analysis and Machine Intelligence,46(4): 2364-2377 [DOI: 10.1109/TPAMI.2023.3335353http://dx.doi.org/10.1109/TPAMI.2023.3335353]
Liu Z, Hu J, Hui K H, Qi X, Cohen-Or D and Fu C W. 2023. EXIM: a hybrid explicit-implicit representation for text-guided 3D shape generation. ACM Transactions on Graphics,42(6): 1-12 [DOI: 10.1145/3618312http://dx.doi.org/10.1145/3618312]
Long X, Lin C, Liu L, Liu Y, Wang P, Theobalt C, Komura T and Wang W. 2023. Neuraludf: learning unsigned distance fields for multi-view reconstruction of surfaces with arbitrary topologies//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 20834-20843 [DOI: 10.1109/CVPR52729.2023.01996http://dx.doi.org/10.1109/CVPR52729.2023.01996]
Loo J, Wu Z and Hsu D. 2024. Open scene graphs for open world object-goal navigation [EB/OL]. [2024-07-02]. https://arxiv.org/pdf/2407.02473.pdfhttps://arxiv.org/pdf/2407.02473.pdf
Loper M, Mahmood N, Romero J, Pons-Moll G and Black M J. 2015. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics,34(6): 1-16 [DOI: 10.1145/2816795.2818013http://dx.doi.org/10.1145/2816795.2818013]
Lu P, Peng B, Cheng H, Galley M, Chang K W, Wu Y N, Zhu S and Gao J. 2024. Chameleon: plug-and-play compositional reasoning with large language models//Proceedings of the 37th Conference on Advances in Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 43447-43478
Lu S, Chang H, Jing E P, Boularias A and Bekris K. 2023. OVIR-3D: open-vocabulary 3D instance retrieval without training on 3D data//Proceedings of the 7th Conference on Robot Learning. Atlanta, USA: PMLR: 1610-1620 [DOI: 10.48550/arXiv.2311.02873http://dx.doi.org/10.48550/arXiv.2311.02873]
Lu T, Yu M, Xu L, Xiangli Y, Wang L, Lin D and Dai B. 2023. Scaffold-gs: structured 3D gaussians for view-adaptive rendering [EB/OL]. [2023-11-30]. https://arxiv.org/pdf/2312.00109.pdfhttps://arxiv.org/pdf/2312.00109.pdf
Lu Y, Xu C, Wei X, Xie X, Tomizuka M, Keutzer K and Zhang S. 2023. Open-vocabulary point-cloud object detection without 3D annotation//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 1190-1199 [DOI: 10.1109/CVPR52729.2023.00121http://dx.doi.org/10.1109/CVPR52729.2023.00121]
Lu Y, Zhang J, Li S, Fang T, McKinnon D, Tsin Y, Quan L, Cao X and Yao Y. 2023. Direct2.5: diverse text-to-3d generation via multi-view 2.5D diffusion [EB/OL]. [2023-11-27]. https://arxiv.org/pdf/2311.15980.pdfhttps://arxiv.org/pdf/2311.15980.pdf
Lundell J, Verdoja F and Kyrki V. 2020. Beyond top-grasps through scene completion//Proceedings of 2020 IEEE/CVF International Conference on Robotics and Automation. Paris, France: IEEE: 545-551 [DOI: 10.1109/ICRA40945.2020.9197320http://dx.doi.org/10.1109/ICRA40945.2020.9197320]
Luo H, Bao J, Wu Y, He X and Li T. 2023. SegCLIP: patch aggregation with learnable centers for open-vocabulary semantic segmentation//Proceedings of the 40th International Conference on Machine Learning. Honolulu, USA: PMLR: 23033-23044 [DOI: 10.5555/3618408.3619364http://dx.doi.org/10.5555/3618408.3619364]
Luo J, Fu J, Kong X, Gao C, Ren H, Shen H, Xia H and Liu S. 2022. 3D-SPS: single-stage 3D visual grounding via referred point progressive selection//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 16433-16442 [DOI: 10.1109/CVPR52688.2022.01596http://dx.doi.org/10.1109/CVPR52688.2022.01596]
Luo T, Rockwell C, Lee H and Johnson J. 2024. Scalable 3D captioning with pretrained models//Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, USA: IEEE: 75307-75337 [DOI: 10.5555/3666122.3669413http://dx.doi.org/10.5555/3666122.3669413]
Lyu R, Wang T, Lin J, Yang S, Mao X, Chen Y, Xu R, Huang H, Zhu C, Lin D and Pang J. 2024. MMScan: a multi-modal 3D scene dataset with hierarchical grounded language annotations [EB/OL]. [2024-06-13]. https://arxiv.org/pdf/2406.09401.pdfhttps://arxiv.org/pdf/2406.09401.pdf
Lyu Z, Wang J, An Y, Zhang Y, Lin D and Dai B. 2023. Controllable mesh generation through sparse latent point diffusion models//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 271-280 [DOI: 10.1109/CVPR52729.2023.00034http://dx.doi.org/10.1109/CVPR52729.2023.00034]
Ma C, Jiang Y, Wen X, Yuan Z and Qi X. 2024. Codet: co-occurrence guided region-word alignment for open-vocabulary object detection//Proceedings of the 37th Conference on Advances in Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 71078-71094
Ma X, Yong S, Zheng Z, Li Q, Liang Y, Zhu S C and Huang S. 2022. SQA3D: situated question answering in 3D scenes [EB/OL]. [2022-10-14]. https://arxiv.org/pdf/2210.07474.pdf
Man Y, Gui L Y and Wang Y X. 2024. Situational awareness matters in 3D vision language reasoning [EB/OL]. [2024-06-11]. https://arxiv.org/pdf/2406.07544.pdf
Mao J, Xu W, Yang Y, Wang J, Huang Z and Yuille A. 2014. Deep captioning with multimodal recurrent neural networks (m-RNN) [EB/OL]. [2014-12-20]. https://arxiv.org/pdf/1412.6632.pdf
Maturana D and Scherer S. 2015. VoxNet: a 3D convolutional neural network for real-time object recognition//Proceedings of 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems. Hamburg, Germany: IEEE: 922-928 [DOI: 10.1109/IROS.2015.7353481]
Mescheder L, Oechsle M, Niemeyer M, Nowozin S and Geiger A. 2019. Occupancy networks: learning 3D reconstruction in function space//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 4455-4465 [DOI: 10.1109/CVPR.2019.00459]
Metzer G, Richardson E, Patashnik O, Giryes R and Cohen-Or D. 2023. Latent-NeRF for shape-guided generation of 3D shapes and textures//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 12663-12673 [DOI: 10.1109/CVPR52729.2023.01218]
Michel O, Bar-On R, Liu R, Benaim S and Hanocka R. 2022. Text2Mesh: text-driven neural stylization for meshes//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 13482-13492 [DOI: 10.1109/CVPR52688.2022.01313]
Mikami Y, Melnik A, Miura J and Hautamäki V. 2024. Natural language as policies: reasoning for coordinate-level embodied control with LLMs [EB/OL]. [2024-03-20]. https://arxiv.org/pdf/2403.13801.pdf
Mikolov T, Chen K, Corrado G and Dean J. 2013. Efficient estimation of word representations in vector space [EB/OL]. [2013-01-16]. https://arxiv.org/pdf/1301.3781.pdf
Mildenhall B, Srinivasan P P, Tancik M, Barron J T, Ramamoorthi R and Ng R. 2020. NeRF: representing scenes as neural radiance fields for view synthesis//Proceedings of 2020 European Conference on Computer Vision. Glasgow, UK: Springer: 405-421 [DOI: 10.1007/978-3-030-58452-8_24]
Minderer M, Gritsenko A and Houlsby N. 2024. Scaling open-vocabulary object detection//Proceedings of the 37th Conference on Advances in Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 72983-73007
Mittal P, Cheng Y C, Singh M and Tulsiani S. 2022. AutoSDF: shape priors for 3D completion, reconstruction and generation//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 306-315 [DOI: 10.1109/CVPR52688.2022.00040]
Mo K, Zhu S, Chang A X, Yi L, Tripathi S, Guibas L J and Su H. 2019. PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 909-918 [DOI: 10.1109/CVPR.2019.00100]
Morrison D, Corke P and Leitner J. 2018. Closing the loop for robotic grasping: a real-time, generative grasp synthesis approach [EB/OL]. [2018-04-14]. https://arxiv.org/pdf/1804.05172.pdf
Mu Y, Zhang Q, Hu M, Wang W, Ding M, Jin J, Wang B, Dai J, Qiao Y and Luo P. 2024. EmbodiedGPT: vision-language pre-training via embodied chain of thought//Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 25081-25094 [DOI: 10.5555/3666122.3667212]
Navaneet K L, Meibodi K P, Koohpayegani S A and Pirsiavash H. 2023. Compact3D: compressing Gaussian splat radiance field models with vector quantization [EB/OL]. [2023-11-30]. https://arxiv.org/pdf/2311.18159.pdf
OpenAI. 2023. GPT-4 technical report [EB/OL]. [2023-03-15]. https://arxiv.org/pdf/2303.08774.pdf
Osher S, Fedkiw R and Piechor K. 2004. Level set methods and dynamic implicit surfaces. Applied Mechanics Reviews,57(3): #B15 [DOI: 10.1016/s0898-1221(03)90179-9]
Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano P, Leike J and Lowe R. 2022. Training language models to follow instructions with human feedback//Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 27730-27744 [DOI: 10.5555/3600270.3602281]
Pang Y, Wang W, Tay F E, Liu W, Tian Y and Yuan L. 2022. Masked autoencoders for point cloud self-supervised learning//Proceedings of 2022 European Conference on Computer Vision. Tel Aviv, Israel: Springer: 604-621 [DOI: 10.1007/978-3-031-20086-1_35]
Parelli M, Delitzas A, Hars N, Vlassis G, Anagnostidis S, Bachmann G and Hofmann T. 2023. CLIP-guided vision-language pre-training for question answering in 3D scenes//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Vancouver, Canada: IEEE: 5607-5612 [DOI: 10.1109/CVPRW59228.2023.00593]
Park J J, Florence P, Straub J, Newcombe R and Lovegrove S. 2019. DeepSDF: learning continuous signed distance functions for shape representation//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach, USA: IEEE: 165-174 [DOI: 10.1109/CVPR.2019.00025]
Patel B, Dorbala V S and Bedi A S. 2024. Embodied question answering via multi-LLM systems [EB/OL]. [2024-06-16]. https://arxiv.org/pdf/2406.10918.pdf
Peng H, Li B, Zhang B, Chen X, Chen T and Zhu H. 2024. Multi-view vision fusion network: can 2D pre-trained model boost 3D point cloud data-scarce learning? IEEE Transactions on Circuits and Systems for Video Technology,34(7): 5951-5962 [DOI: 10.1109/TCSVT.2023.3343495]
Peng S, Genova K, Jiang C, Tagliasacchi A, Pollefeys M and Funkhouser T. 2023. OpenScene: 3D scene understanding with open vocabularies//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 815-824 [DOI: 10.1109/CVPR52729.2023.00085]
Peng S, Niemeyer M, Mescheder L, Pollefeys M and Geiger A. 2020. Convolutional occupancy networks//Proceedings of 2020 European Conference on Computer Vision. Glasgow, UK: Springer: 523-540 [DOI: 10.1007/978-3-030-58580-8_31]
Piekenbrinck J, Hermans A, Vaskevicius N, Linder T and Leibe B. 2024. RGB-D Cube R-CNN: 3D Object Detection with Selective Modality Dropout//Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 1997-2006
Po R and Wetzstein G. 2024. Compositional 3D scene generation using locally conditioned diffusion//Proceedings of 2024 International Conference on 3D Vision. Davos, Switzerland: IEEE: 651-663 [DOI: 10.1109/3DV62453.2024.00026]
Qi C R, Su H, Mo K and Guibas L J. 2017. PointNet: deep learning on point sets for 3D classification and segmentation//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 77-85 [DOI: 10.1109/CVPR.2017.16]
Qi C R, Su H, Nießner M, Dai A, Yan M and Guibas L J. 2016. Volumetric and multi-view CNNs for object classification on 3D data//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 5648-5656 [DOI: 10.1109/CVPR.2016.609]
Qi C R, Yi L, Su H and Guibas L J. 2017. PointNet++: deep hierarchical feature learning on point sets in a metric space//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 5105-5114
Qi C, Yin J, Zhang Z and Tang J. 2023. Dynamic scene graph generation of point clouds with structural representation learning. Tsinghua Science and Technology,29(1): 232-243 [DOI: 10.26599/TST.2023.9010002]
Qi Z, Dong R, Zhang S, Geng H, Han C, Ge Z, Yi L and Ma K. 2024. ShapeLLM: universal 3D object understanding for embodied interaction [EB/OL]. [2024-02-27]. https://arxiv.org/pdf/2402.17766.pdf
Qi Z, Fang Y, Sun Z, Wu X, Wu T, Wang J, Lin D and Zhao H. 2023. GPT4Point: a unified framework for point-language understanding and generation [EB/OL]. [2023-12-05]. https://arxiv.org/pdf/2312.02980.pdf
Qian G, Li Y, Peng H, Mai J, Hammoud H, Elhoseiny M and Ghanem B. 2022. PointNeXt: revisiting PointNet++ with improved training and scaling strategies//Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: [s.n.]: 23192-23204
Radford A, Kim J W, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G and Sutskever I. 2021. Learning transferable visual models from natural language supervision//Proceedings of the 38th International Conference on Machine Learning. Vienna, Austria: PMLR: 8748-8763 [DOI: 10.48550/arXiv.2103.00020]
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W and Liu P J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research,21(140): 5485-5551 [DOI: 10.5555/3455716.3455856]
Rajpal A, Cheema N, Illgner-Fehns K, Slusallek P and Jaiswal S. 2023. High-resolution synthetic RGB-D datasets for monocular depth estimation//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. Vancouver, Canada: IEEE: 1188-1198 [DOI: 10.1109/CVPRW59228.2023.00126]
Ramakrishnan S K, Gokaslan A, Wijmans E, Maksymets O, Clegg A, Turner J, Undersander E, Galuba W, Westbury A, Chang A X, Savva M, Zhao Y and Batra D. 2021. Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI [EB/OL]. [2021-09-16]. https://arxiv.org/pdf/2109.08238.pdf
Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M and Sutskever I. 2021. Zero-shot text-to-image generation//Proceedings of the 38th International Conference on Machine Learning. Vienna, Austria: PMLR: 8821-8831 [DOI: 10.48550/arXiv.2102.12092]
Rana K, Haviland J, Garg S, Abou-Chakra J, Reid I and Suenderhauf N. 2023. SayPlan: grounding large language models using 3D scene graphs for scalable robot task planning//Proceedings of the 7th Conference on Robot Learning. Atlanta, USA: PMLR: 23-72 [DOI: 10.48550/arXiv.2307.06135]
Rombach R, Blattmann A, Lorenz D, Esser P and Ommer B. 2022. High-resolution image synthesis with latent diffusion models//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 10674-10685 [DOI: 10.1109/CVPR52688.2022.01042]
Rubenstein P K, Asawaroengchai C, Nguyen D D, Bapna A, Borsos Z, Quitry F D C, Chen P, Badawy D E, Han W, Kharitonov E, Muckenhirn H, Padfield D, Qin J, Rozenberg D, Sainath T, Schalkwyk J, Sharifi M, Ramanovich M T, Tagliasacchi M, Tudor A, Velimirović M, Vincent D, Yu J, Wang Y, Zayats V, Zeghidour N, Zhang Y, Zhang Z, Zilka L and Frank C. 2023. AudioPaLM: a large language model that can speak and listen [EB/OL]. [2023-06-22]. https://arxiv.org/pdf/2306.12925.pdf
Sanghi A, Chu H, Lambourne J G, Wang Y, Cheng C Y, Fumero M and Malekshan K R. 2022. CLIP-Forge: towards zero-shot text-to-shape generation//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 18582-18592 [DOI: 10.1109/CVPR52688.2022.01805]
Schuhmann C, Beaumont R, Vencu R, Gordon C, Wightman R, Cherti M, Coombes T, Katta A, Mullis C, Wortsman M, Schramowski P, Kundurthy S, Crowson K, Schmidt L, Kaczmarczyk R and Jitsev J. 2022. LAION-5B: an open large-scale dataset for training next generation image-text models//Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 25278-25294
Schumann R, Zhu W, Feng W, Fu T J, Riezler S and Wang W Y. 2024. VELMA: verbalization embodiment of LLM agents for vision and language navigation in street view//Proceedings of the 38th AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI: 18924-18933 [DOI: 10.1609/aaai.v38i17.29858]
Schuster M and Paliwal K K. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing,45(11): 2673-2681 [DOI: 10.1109/78.650093]
Sella E, Fiebelman G, Hedman P and Averbuch-Elor H. 2023. Vox-E: text-guided voxel editing of 3D objects//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 430-440 [DOI: 10.1109/ICCV51070.2023.00046]
Shao S, Pei Z, Wu X, Liu Z, Chen W and Li Z. 2023. Iebins: Iterative elastic bins for monocular depth estimation//Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 53025-53037
Shen Y, Song K, Tan X, Li D, Lu W and Zhuang Y. 2023. HuggingGPT: solving AI tasks with ChatGPT and its friends in Hugging Face [EB/OL]. [2023-03-30]. https://arxiv.org/pdf/2303.17580.pdf
Shi C and Yang S. 2023. EdaDet: open-vocabulary object detection using early dense alignment//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 15678-15688 [DOI: 10.1109/ICCV51070.2023.01441]
Shi S, Guo C, Jiang L, Wang Z, Shi J, Wang X and Li H. 2020. PV-RCNN: point-voxel feature set abstraction for 3D object detection//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 10526-10535 [DOI: 10.1109/CVPR42600.2020.01054]
Shi S, Jiang L, Deng J, Wang Z, Guo C, Shi J, Wang X and Li H. 2023. PV-RCNN++: point-voxel feature set abstraction with local vector representation for 3D object detection. International Journal of Computer Vision,131(2): 531-551 [DOI: 10.1007/s11263-022-01710-9]
Shim J, Kang C and Joo K. 2023. Diffusion-based signed distance fields for 3D shape generation//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 20887-20897 [DOI: 10.1109/CVPR52729.2023.02001]
Singer U, Sheynin S, Polyak A, Ashual O, Makarov I, Kokkinos F, Goyal N, Vedaldi A, Parikh D, Johnson J and Taigman Y. 2023. Text-to-4D dynamic scene generation [EB/OL]. [2023-01-26]. https://arxiv.org/pdf/2301.11280.pdf
Singh K P, Salvador J, Weihs L and Kembhavi A. 2023. Scene graph contrastive learning for embodied navigation//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 10850-10860 [DOI: 10.1109/ICCV51070.2023.00999]
Singh S, Pavlakos G and Stamoulis D. 2024. Evaluating zero-shot GPT-4V performance on 3D visual question answering benchmarks [EB/OL]. [2024-05-29]. https://arxiv.org/pdf/2405.18831.pdf
Singh V V, Sheshappanavar S V and Kambhamettu C. 2021. MeshNet++: a network with a face//Proceedings of the 29th ACM International Conference on Multimedia. [s.l.]: ACM: 4883-4891 [DOI: 10.1145/3474085.3475468]
Slaney J and Thiébaux S. 2001. Blocks world revisited. Artificial Intelligence,125(1-2): 119-153 [DOI: 10.1016/S0004-3702(00)00079-5]
Song C H, Wu J, Washington C, Sadler B M, Chao W L and Su Y. 2023. LLM-Planner: few-shot grounded planning for embodied agents with large language models//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 2986-2997 [DOI: 10.1109/ICCV51070.2023.00280]
Stechly K, Valmeekam K and Kambhampati S. 2024. Chain of thoughtlessness: an analysis of CoT in planning [EB/OL]. [2024-05-08]. https://arxiv.org/pdf/2405.04776.pdf
Su H, Maji S, Kalogerakis E and Learned-Miller E. 2015. Multi-view convolutional neural networks for 3D shape recognition//Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago, Chile: IEEE: 945-953 [DOI: 10.1109/ICCV.2015.114]
Sun C, Wu X, Sun J, Sun C, Xu M and Ge Q. 2023. Saliency-induced moving object detection for robust RGB-D vision navigation under complex dynamic environments. IEEE Transactions on Intelligent Transportation Systems,24(10): 10716-10734 [DOI: 10.1109/TITS.2023.3275279]
Sun Q, Li Y, Liu Z, Huang X, Liu F, Liu X, Ouyang W and Shao J. 2023. UniG3D: a unified 3D object generation dataset [EB/OL]. [2023-06-19]. https://arxiv.org/pdf/2306.10730.pdf
Tai H, He Q, Zhang J, Qian Y, Zhang Z, Hu X, Wang Y and Liu Y. 2024. Open-vocabulary SAM3D: understand any 3D scene [EB/OL]. [2024-05-24]. https://arxiv.org/pdf/2405.15580.pdf
Takmaz A, Fedele E, Sumner R W, Pollefeys M, Tombari F and Engelmann F. 2023. OpenMask3D: open-vocabulary 3D instance segmentation [EB/OL]. [2023-06-23]. https://arxiv.org/pdf/2306.13631.pdf
Tang J, Zhou H, Chen X, Hu T, Ding E, Wang J and Zeng G. 2023. Delicate textured mesh recovery from NeRF via adaptive surface refinement//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 17739-17749 [DOI: 10.1109/ICCV51070.2023.01626]
Tang Y, Han X, Li X, Yu Q, Hao Y, Hu L and Chen M. 2024. MiniGPT-3D: efficiently aligning 3D point clouds with large language models using 2D priors [EB/OL]. [2024-05-02]. https://arxiv.org/pdf/2405.01413.pdf
Tang Z, Yang Z, Khademi M, Liu Y, Zhu C and Bansal M. 2023. CoDi-2: in-context interleaved and interactive any-to-any generation [EB/OL]. [2023-11-30]. https://arxiv.org/pdf/2311.18775.pdf
Thirunavukarasu A J, Ting D S J, Elangovan K, Gutierrez L, Tan T F and Ting D S W. 2023. Large language models in medicine. Nature Medicine,29(8): 1930-1940 [DOI: 10.1038/s41591-023-02448-8]
Ton T, Hong J W, Eom S, Shim J Y, Kim J and Yoo C D. 2024. Zero-shot dual-path integration framework for open-vocabulary 3D instance segmentation//Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 7598-7607
Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M A, Lacroix T, Rozière B, Goyal N, Hambro E, Azhar F, Rodriguez A, Joulin A, Grave E and Lample G. 2023. LLaMA: open and efficient foundation language models [EB/OL]. [2023-02-27]. https://arxiv.org/pdf/2302.13971.pdf
Tsalicoglou C, Manhardt F, Tonioni A, Niemeyer M and Tombari F. 2024. TextMesh: generation of realistic 3D meshes from text prompts//Proceedings of 2024 International Conference on 3D Vision. Davos, Switzerland: IEEE: 1554-1563 [DOI: 10.1109/3DV62453.2024.00154]
Turki H, Zhang J Y, Ferroni F and Ramanan D. 2023. SUDS: scalable urban dynamic scenes//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 12375-12385 [DOI: 10.1109/CVPR52729.2023.01191]
Tychola K A, Tsimperidis I and Papakostas G A. 2022. On 3D reconstruction using RGB-D cameras. Digital,2(3): 401-421 [DOI: 10.3390/digital2030022]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Vemprala S H, Bonatti R, Bucker A and Kapoor A. 2024. ChatGPT for robotics: design principles and model abilities. IEEE Access,12: 55682-55696 [DOI: 10.1109/ACCESS.2024.3387941]
Vinyals O, Toshev A, Bengio S and Erhan D. 2015. Show and tell: a neural image caption generator//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3156-3164 [DOI: 10.1109/CVPR.2015.7298935]
Wald J, Avetisyan A, Navab N, Tombari F and Nießner M. 2019. RIO: 3D object instance re-localization in changing indoor environments//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision. Seoul, Korea: IEEE: 7657-7666 [DOI: 10.1109/ICCV.2019.00775]
Wald J, Dhamo H, Navab N and Tombari F. 2020. Learning 3D semantic scene graphs from 3D indoor reconstructions//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 3960-3969 [DOI: 10.1109/CVPR42600.2020.00402]
Wang A, Yin Z, Hu Y, Mao Y and Hui P. 2024. Exploring the potential of large language models in artistic creation: collaboration and reflection on creative programming [EB/OL]. [2024-02-15]. https://arxiv.org/pdf/2402.09750.pdf
Wang J, Chakraborty R and Yu S X. 2022. Transformer for 3D point clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence,44(8): 4419-4431 [DOI: 10.1109/TPAMI.2021.3070341]
Wang J, Wu Z, Li Y, Jiang H, Shu P, Shi E, Hu H, Ma C, Liu Y, Wang X, Yao Y, Liu X, Zhao H, Liu Z, Dai H, Zhao L, Ge B, Li X, Liu T and Zhang S. 2024. Large language models for robotics: opportunities, challenges, and perspectives [EB/OL]. [2024-01-09]. https://arxiv.org/pdf/2401.04334.pdf
Wang L, Li Y, Huang J and Lazebnik S. 2018. Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence,41(2): 394-407 [DOI: 10.1109/TPAMI.2018.2797921]
Wang M, Hu L, Bai Y, Yao X, Hu J and Zhang S. 2024. AMNet: a new RGB-D instance segmentation network based on attention and multi-modality. The Visual Computer,40(2): 1311-1325 [DOI: 10.1007/s00371-023-02850-w]
Wang P S. 2023. OctFormer: octree-based transformers for 3D point clouds. ACM Transactions on Graphics,42(4): 1-11 [DOI: 10.1145/3592131]
Wang T, Mao X, Zhu C, Xu R, Lyu R, Li P, Chen X, Zhang W, Chen K, Xue T, Liu X, Lu C, Lin D and Pang J. 2023. EmbodiedScan: a holistic multi-modal 3D perception suite towards embodied AI [EB/OL]. [2023-12-26]. https://arxiv.org/pdf/2312.16170v1.pdf
Wang X, Zhuang B and Wu Q. 2024. ModaVerse: efficiently transforming modalities with LLMs [EB/OL]. [2024-01-12]. https://arxiv.org/pdf/2401.06395.pdf
Wang Z, Cheng B, Zhao L, Xu D, Tang Y and Sheng L. 2023. VL-SAT: visual-linguistic semantics assisted training for 3D semantic scene graph prediction in point cloud//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 21560-21569 [DOI: 10.1109/CVPR52729.2023.02065]
Wang Z, Lu C, Wang Y, Bao F, Li C, Su H and Zhu J. 2024. Prolificdreamer: high-fidelity and diverse text-to-3D generation with variational score distillation//Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 8406-8441
Wang Z, Yu J, Yu A W, Dai Z, Tsvetkov Y and Cao Y. 2021. SimVLM: simple visual language model pretraining with weak supervision [EB/OL]. [2021-08-24]. https://arxiv.org/pdf/2108.10904.pdf
Wei J, Wang H, Feng J, Lin G and Yap K H. 2023. TAPS3D: text-guided 3D textured shape generation from pseudo supervision//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 16805-16815 [DOI: 10.1109/CVPR52729.2023.01612]
Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi E, Le Q and Zhou D. 2022. Chain-of-thought prompting elicits reasoning in large language models//Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 24824-24837
Wei X, Yu R and Sun J. 2020. View-GCN: view-based graph convolutional network for 3D shape analysis//Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 1847-1856 [DOI: 10.1109/CVPR42600.2020.00192]
Wei X, Yu R and Sun J. 2023. Learning view-based graph convolutional network for multi-view 3D shape analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence,45(6): 7525-7541 [DOI: 10.1109/TPAMI.2022.3221785]
Wen B, Yang W, Kautz J and Birchfield S. 2024. FoundationPose: unified 6D pose estimation and tracking of novel objects [EB/OL]. [2024-03-19]. https://arxiv.org/pdf/2403.12396.pdf
Weng C Y, Curless B, Srinivasan P P, Barron J T and Kemelmacher-Shlizerman I. 2022. HumanNeRF: free-viewpoint rendering of moving people from monocular video//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 16189-16199 [DOI: 10.1109/CVPR52688.2022.01573]
Werby A, Huang C, Büchner M, Valada A and Burgard W. 2024. Hierarchical open-vocabulary 3D scene graphs for language-grounded robot navigation [EB/OL]. [2024-03-26]. https://arxiv.org/pdf/2403.17846.pdf
Wu H, Wen C, Shi S, Li X and Wang C. 2023. Virtual sparse convolution for multimodal 3D object detection//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 21653-21662 [DOI: 10.1109/CVPR52729.2023.02074]
Wu Q, Liu X, Chen Y, Li K, Zheng C, Cai J and Zheng J. 2022. Object-compositional neural implicit surfaces//Proceedings of the 17th European Conference on Computer Vision. Tel Aviv, Israel: Springer: 197-213 [DOI: 10.1007/978-3-031-19812-0_12]
Wu S C, Wald J, Tateno K, Navab N and Tombari F. 2021. SceneGraphFusion: incremental 3D scene graph prediction from RGB-D sequences//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 7511-7521 [DOI: 10.1109/CVPR46437.2021.00743]
Wu T Y, Huang S Y and Wang Y C F. 2024. DOrA: 3D visual grounding with order-aware referring [EB/OL]. [2024-03-25]. https://arxiv.org/pdf/2403.16539v1.pdf
Wu T, Zhang J, Fu X, Wang Y, Ren J, Pan L, Wu W, Yang L, Wang J, Qian C, Lin D and Liu Z. 2023. OmniObject3D: large-vocabulary 3D object dataset for realistic perception, reconstruction and generation//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 803-814 [DOI: 10.1109/CVPR52729.2023.00084]
Wu Y, Cheng X, Zhang R, Cheng Z and Zhang J. 2023. EDA: explicit text-decoupling and dense alignment for 3D visual grounding//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 19231-19242 [DOI: 10.1109/CVPR52729.2023.01843]
Wu Z, Song S, Khosla A, Yu F, Zhang L, Tang X and Xiao J. 2015. 3D ShapeNets: a deep representation for volumetric shapes//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 1912-1920 [DOI: 10.1109/CVPR.2015.7298801]
Xian W, Huang J B, Kopf J and Kim C. 2021. Space-time neural irradiance fields for free-viewpoint video//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 9416-9426 [DOI: 10.1109/CVPR46437.2021.00930]
Xiao Z, Jing L, Wu S, Zhu A Z, Ji J, Jiang C M, Hung W, Funkhouser T, Kuo W, Angelova A, Zhou Y and Sheng S. 2024. 3D open-vocabulary panoptic segmentation with 2D-3D vision-language distillation [EB/OL]. [2024-01-04]. https://arxiv.org/pdf/2401.02402.pdf
Xie L, Xu G, Cai D and He X. 2023. X-View: non-egocentric multi-view 3D object detector. IEEE Transactions on Image Processing,32: 1488-1497 [DOI: 10.1109/TIP.2023.3245337]
Xie T, Zong Z, Qiu Y, Li X, Feng Y, Yang Y and Jiang C. 2023. PhysGaussian: physics-integrated 3D Gaussians for generative dynamics [EB/OL]. [2023-11-20]. https://arxiv.org/pdf/2311.12198.pdf
Xing Y, Wang J, Chen X and Zeng G. 2019. Coupling two-stream RGB-D semantic segmentation network by idempotent mappings//Proceedings of 2019 IEEE International Conference on Image Processing. Taipei, Taiwan: IEEE: 1850-1854 [DOI: 10.1109/ICIP.2019.8803146]
Xiong H, Muttukuru S, Upadhyay R, Chari P and Kadambi A. 2023. SparseGS: real-time 360° sparse view synthesis using Gaussian splatting [EB/OL]. [2023-11-30]. https://arxiv.org/pdf/2312.00206.pdf
Xu C, Wu B, Hou J, Tsai S, Li R, Wang J, Zhan W, He Z, Vajda P, Keutzer K and Tomizuka M. 2023. NeRF-Det: learning geometry-aware volumetric representation for multi-view 3D object detection//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 23263-23273 [DOI: 10.1109/ICCV51070.2023.02131]
Xu D, Liang H, Bhatt N P, Hu H, Liang H, Plataniotis K N and Wang Z. 2024. Comp4D: compositional LLM-guided 4D scene generation [EB/OL]. [2024-03-25]. https://arxiv.org/pdf/2403.16993.pdf
Xu H, He K, Plummer B A, Sigal L, Sclaroff S and Saenko K. 2019. Multilevel language and vision integration for text-to-clip retrieval//Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Honolulu, USA: AAAI: 9062-9069 [DOI: 10.1609/aaai.v33i01.33019062]
Xu J, Wang X, Cheng W, Cao Y P, Shan Y, Qie X and Gao S. 2023. Dream3D: zero-shot text-to-3D synthesis using 3D shape prior and text-to-image diffusion models//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 20908-20918 [DOI: 10.1109/CVPR52729.2023.02003]
Xu R, Wang X, Wang T, Chen Y, Pang J and Lin D. 2023. PointLLM: empowering large language models to understand point clouds [EB/OL]. [2023-08-31]. https://arxiv.org/pdf/2308.16911.pdf
Xu X, Chen L, Cai C, Zhan H, Yan Q, Ji P, Yuan J, Huang H and Xu Y. 2023. Dynamic voxel grid optimization for high-fidelity RGB-D supervised surface reconstruction [EB/OL]. [2023-04-12]. https://arxiv.org/pdf/2304.06178.pdf
Xu Y, Peng S, Yang C, Shen Y and Zhou B. 2022. 3D-aware image synthesis via learning structural and textural representations//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 18409-18418 [DOI: 10.1109/CVPR52688.2022.01788]
Xue L, Gao M, Xing C, Martín-Martín R, Wu J, Xiong C, Xu R, Niebles J C and Savarese S. 2023. ULIP: learning a unified representation of language, images, and point clouds for 3D understanding//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 1179-1189 [DOI: 10.1109/CVPR52729.2023.00120]
Xue L, Yu N, Zhang S, Panagopoulou A, Li J, Martín-Martín R, Wu J, Xiong C, Xu R, Niebles J C and Savarese S. 2023. ULIP-2: towards scalable multimodal pre-training for 3D understanding [EB/OL]. [2023-05-14]. https://arxiv.org/pdf/2305.08275.pdf
Yadav K, Ramrakhya R, Ramakrishnan S K, Gervet T, Turner J, Gokaslan A, Maestre N, Chang A X, Batra D, Savva M, Clegg A W and Chaplot D S. 2023. Habitat-Matterport 3D semantics dataset//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 4927-4936 [DOI: 10.1109/CVPR52729.2023.00477]
Yan X, Yuan Z, Du Y, Liao Y, Guo Y, Cui S and Li Z. 2023. Comprehensive visual question answering on point clouds through compositional scene manipulation [EB/OL]. [2023-05-22]. https://arxiv.org/pdf/2112.11691.pdf
Yang H, Sun Y, Sundaramoorthi G and Yezzi A. 2023. Stabilizing the optimization of neural signed distance functions and finer shape representation//Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 13993-14004
Yang J, Chen X, Madaan N, Iyengar M, Qian S, Fouhey D F and Chai J. 2024. 3D-GRAND: a million-scale dataset for 3D-LLMs with better grounding and less hallucination [EB/OL]. [2024-06-07]. https://arxiv.org/pdf/2406.05132.pdf
Yang J, Chen X, Qian S, Madaan N, Iyengar M, Fouhey D F and Chai J. 2023. LLM-Grounder: open-vocabulary 3D visual grounding with large language model as an agent [EB/OL]. [2023-09-21]. https://arxiv.org/pdf/2309.12311.pdf
Yang J, Ding R, Deng W, Wang Z and Qi X. 2023. RegionPLC: regional point-language contrastive learning for open-world 3D scene understanding [EB/OL]. [2023-04-03]. https://arxiv.org/pdf/2304.00962.pdf
Yang S, Liu J, Zhang R, Pan M, Guo Z, Li X, Chen Z, Gao P, Guo Y and Zhang S. 2023. LiDAR-LLM: exploring the potential of large language models for 3D LiDAR understanding [EB/OL]. [2023-12-21]. https://arxiv.org/pdf/2312.14074.pdf
Yang Y, Lu J, Zhao Z, Luo Z, Yu J J, Sanchez V and Zheng F. 2024. LLplace: the 3D indoor scene layout generation and editing via large language model [EB/OL]. [2024-06-06]. https://arxiv.org/pdf/2406.03866.pdf
Yang Z, Li L, Wang J, Lin K, Azarnasab E, Ahmed F, Liu Z, Liu C, Zeng M and Wang L. 2023. MM-REACT: prompting ChatGPT for multimodal reasoning and action [EB/OL]. [2023-03-20]. https://arxiv.org/pdf/2303.11381.pdf
Yang Z, Yang H, Pan Z, Zhu X and Zhang L. 2023. Real-time photorealistic dynamic scene representation and rendering with 4D Gaussian splatting [EB/OL]. [2023-10-16]. https://arxiv.org/pdf/2310.10642.pdf
Yao L, Han J, Liang X, Xu D, Zhang W, Li Z and Xu H. 2023. DetCLIPv2: scalable open-vocabulary object detection pre-training via word-region alignment//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 23497-23506 [DOI: 10.1109/CVPR52729.2023.02250]
Yao L, Pi R, Han J, Liang X, Xu H, Zhang W, Li Z and Xu D. 2024. DetCLIPv3: towards versatile generative open-vocabulary object detection [EB/OL]. [2024-04-14]. https://arxiv.org/pdf/2404.09216.pdf
Ye J, Wang N and Wang X. 2023. FeatureNeRF: learning generalizable NeRFs by distilling foundation models//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 8928-8939 [DOI: 10.1109/ICCV51070.2023.00823]
Yenamandra T, Tewari A, Yang N, Bernard F, Theobalt C and Cremers D. 2024. FIRe: fast inverse rendering using directional and signed distance functions//Proceedings of 2024 IEEE/CVF Winter Conference on Applications of Computer Vision. Waikoloa, USA: IEEE: 3077-3087 [DOI: 10.1109/WACV57701.2024.00305]
Yin F, Chen X, Zhang C, Jiang B, Zhao Z, Fan J, Yu G, Li T and Chen T. 2023. ShapeGPT: 3D shape generation with a unified multi-modal language model [EB/OL]. [2023-11-29]. https://arxiv.org/pdf/2311.17618.pdf
Yin S, Fu C, Zhao S, Li K, Sun X, Xu T and Chen E. 2023. A survey on multimodal large language models [EB/OL]. [2023-06-23]. https://arxiv.org/pdf/2306.13549.pdf
Yin Z, Wang J, Cao J, Shi Z, Liu D, Li M, Sheng L, Bai L, Huang X, Wang Z, Shao J and Ouyang W. 2024. LAMM: language-assisted multi-modal instruction-tuning dataset, framework and benchmark//Proceedings of the 37th Conference on Advances in Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 26650-26685
Yu A, Ye V, Tancik M and Kanazawa A. 2021. PixelNeRF: neural radiance fields from one or few images//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 4576-4585 [DOI: 10.1109/CVPR46437.2021.00455]
Yu H, Qin Z, Hou J, Saleh M, Li D, Busam B and Ilic S. 2023. Rotation-invariant transformer for point cloud matching//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 5384-5393 [DOI: 10.1109/CVPR52729.2023.00521]
Yu J, Wang Z, Vasudevan V, Yeung L, Seyedhosseini M and Wu Y. 2022. CoCa: contrastive captioners are image-text foundation models [EB/OL]. [2022-05-04]. https://arxiv.org/pdf/2205.01917.pdf
Yu Q, He J, Deng X, Shen X and Chen L C. 2024. Convolutions die hard: open-vocabulary segmentation with single frozen convolutional clip//Proceedings of the 37th Conference on Advances in Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 32215-32234
Yu T, Meng J and Yuan J. 2018. Multi-view harmonized bilinear network for 3D object recognition//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA: IEEE: 186-194 [DOI: 10.1109/CVPR.2018.00027]
Yu X, Tang L, Rao Y, Huang T, Zhou J and Lu J. 2022. Point-BERT: pre-training 3D point cloud transformers with masked point modeling//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 19291-19300 [DOI: 10.1109/CVPR52688.2022.01871]
Yu Y, Ko H, Choi J and Kim G. 2017. End-to-end concept word detection for video captioning, retrieval and question answering//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 3261-3269 [DOI: 10.1109/CVPR.2017.347]
Yu Z, Chen A, Huang B, Sattler T and Geiger A. 2023. Mip-Splatting: alias-free 3D Gaussian splatting [EB/OL]. [2023-11-27]. https://arxiv.org/pdf/2311.16493.pdf
Yuan Z, Lan H, Zou Q and Zhao J. 2024. 3D-PreMise: can large language models generate 3D shapes with sharp features and parametric control? [EB/OL]. [2024-01-12]. https://arxiv.org/pdf/2401.06437.pdf
Yuan Z, Ren J, Feng C M, Zhao H, Cui S and Li Z. 2023. Visual programming for zero-shot open-vocabulary 3D visual grounding [EB/OL]. [2023-11-26]. https://arxiv.org/pdf/2311.15383.pdf
Yuan Z, Yan X, Liao Y, Guo Y, Li G, Cui S and Li Z. 2022. X-Trans2Cap: cross-modal knowledge transfer using transformer for 3D dense captioning//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 8553-8563 [DOI: 10.1109/CVPR52688.2022.00837]
Yuksekgonul M, Bianchi F, Kalluri P, Jurafsky D and Zou J. 2023. When and why vision-language models behave like bags-of-words and what to do about it? [EB/OL]. [2023-03-23]. https://arxiv.org/pdf/2210.01936.pdf
Zareian A, Rosa K D, Hu D H and Chang S F. 2021. Open-vocabulary object detection using captions//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville, USA: IEEE: 14388-14397 [DOI: 10.1109/CVPR46437.2021.01416]
Zeng Y, Jiang Y, Zhu S, Lu Y, Lin Y, Zhu H, Hu W, Cao X and Yao Y. 2024. STAG4D: spatial-temporal anchored generative 4D Gaussians [EB/OL]. [2024-03-22]. https://arxiv.org/pdf/2403.14939.pdf
Zhan J, Dai J, Ye J, Zhou Y, Zhang D, Liu Z, Zhang X, Yuan R, Zhang G, Li L, Yan H, Fu J, Gui T, Sun T, Jiang Y and Qiu X. 2024. AnyGPT: unified multimodal LLM with discrete sequence modeling [EB/OL]. [2024-02-19]. https://arxiv.org/pdf/2402.12226.pdf
Zhang C, Zhou Y and Zhang L. 2024. Vosh: voxel-mesh hybrid representation for real-time view synthesis [EB/OL]. [2024-03-11]. https://arxiv.org/pdf/2403.06505.pdf
Zhang D, Li C, Zhang R, Xie S, Xue W, Xie X and Zhang S. 2024. FM-OV3D: foundation model-based cross-modal knowledge blending for open-vocabulary 3D detection//Proceedings of the 38th AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI: 16723-16731 [DOI: 10.1609/aaai.v38i15.29612]
Zhang J, Dong R and Ma K. 2023. CLIP-FO3D: learning free open-world 3D scene representations from 2D dense CLIP//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision Workshops. Paris, France: IEEE: 2040-2051 [DOI: 10.1109/ICCVW60793.2023.00219]
Zhang J, Huang J, Cai B, Fu H, Gong M, Wang C, Wang J, Luo H, Jia R, Zhao B and Tang X. 2022. Digging into radiance grid for real-time view synthesis with detail preservation//Proceedings of 2022 European Conference on Computer Vision. Tel Aviv, Israel: Springer: 724-740 [DOI: 10.1007/978-3-031-19784-0_42]
Zhang J, Li X, Wan Z, Wang C and Liao J. 2024. Text2NeRF: text-driven 3D scene generation with neural radiance fields [EB/OL]. [2024-01-31]. https://arxiv.org/pdf/2305.11588.pdf
Zhang J, Zhan F, Xu M, Lu S and Xing E. 2024. FreGS: 3D Gaussian splatting with progressive frequency regularization [EB/OL]. [2024-03-11]. https://arxiv.org/pdf/2403.06908.pdf
Zhang K, Riegler G, Snavely N and Koltun V. 2020. NeRF++: analyzing and improving neural radiance fields [EB/OL]. [2020-10-15]. https://arxiv.org/pdf/2010.07492.pdf
Zhang L, Rao A and Agrawala M. 2023. Adding conditional control to text-to-image diffusion models//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 3836-3847 [DOI: 10.1109/ICCV51070.2023.00355]
Zhang L, Wang Z, Zhang Q, Qiu Q, Pang A, Jiang H, Yang W, Xu L and Yu J. 2024. CLAY: a controllable large-scale generative model for creating high-quality 3D assets. ACM Transactions on Graphics,43(4): 1-20 [DOI: 10.1145/3658146]
Zhang M, Feng Q, Su Z, Wen C, Xue Z and Li K. 2024. Joint2Human: high-quality 3D human generation via compact spherical embedding of 3D joints//Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 1429-1438 [DOI: 10.1109/CVPR52733.2024.00142]
Zhang Q, Wang C, Siarohin A, Zhuang P, Xu Y, Yang C, Lin D, Zhou B, Tulyakov S and Lee H Y. 2024. SceneWiz3D: towards text-guided 3D scene composition [EB/OL]. [2023-12-13]. https://arxiv.org/pdf/2312.08885.pdf
Zhang R, Guo Z, Zhang W, Li K, Miao X, Cui B, Qiao Y, Gao P and Li H. 2022. PointCLIP: point cloud understanding by CLIP//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans, USA: IEEE: 8542-8552 [DOI: 10.1109/CVPR52688.2022.00836]
Zhang R, Wang L, Guo Z, Wang Y, Gao P, Li H and Shi J. 2023. Parameter is not all you need: starting from non-parametric networks for 3D point cloud analysis [EB/OL]. [2023-03-14]. https://arxiv.org/pdf/2303.08134.pdf
Zhang Y, Gong Z and Chang A X. 2023. Multi3DRefer: grounding text description to multiple 3D objects//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 15179-15179 [DOI: 10.1109/ICCV51070.2023.01397]
Zhang Y, Luo H and Lei Y. 2024. Towards CLIP-driven language-free 3D visual grounding via 2D-3D relational enhancement and consistency//Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA: IEEE: 13063-13072
Zhang Y, Zhu Z and Du D. 2023. OccFormer: dual-path transformer for vision-based 3D semantic occupancy prediction//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 9399-9409 [DOI: 10.1109/ICCV51070.2023.00865]
Zhang Z, Cao S and Wang Y X. 2024. TAMM: TriAdapter multi-modal learning for 3D shape understanding [EB/OL]. [2024-02-28]. https://arxiv.org/pdf/2405.01413.pdf
Zhao Z, Liu W, Chen X, Zeng X, Wang R, Cheng P, Fu B, Chen T, Yu G and Gao S. 2024. Michelangelo: conditional 3D shape generation based on shape-image-text aligned latent representation//Proceedings of the 37th Conference on Advances in Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 73969-73982
Zhen H, Qiu X, Chen P, Yang J, Yan X, Du Y, Hong Y and Gan C. 2024. 3D-VLA: a 3D vision-language-action generative world model [EB/OL]. [2024-03-14]. https://arxiv.org/pdf/2403.09631.pdf
Zheng D, Huang S, Zhao L, Zhong Y and Wang L. 2023. Towards learning a generalist model for embodied navigation [EB/OL]. [2023-12-04]. https://arxiv.org/pdf/2312.02010.pdf
Zheng H and Gao W. 2024. End-to-end RGB-D image compression via exploiting channel-modality redundancy//Proceedings of the 38th AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI: 7562-7570 [DOI: 10.1609/aaai.v38i7.28588]
Zheng J, Zhang J, Li J, Tang R, Gao S and Zhou Z. 2020. Structured3D: a large photo-realistic dataset for structured 3D modeling//Proceedings of the 16th European Conference on Computer Vision. Glasgow, UK: Springer: 519-535 [DOI: 10.1007/978-3-030-58545-7_30]
Zhou C, Zhang Y, Chen J and Huang D. 2023. OcTr: octree-based transformer for 3D object detection//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver, Canada: IEEE: 5166-5175 [DOI: 10.1109/CVPR52729.2023.00500]
Zhou H, Qi L, Wan Z, Huang H and Yang X. 2020. RGB-D co-attention network for semantic segmentation//Proceedings of the 17th Asian Conference on Computer Vision. Kyoto, Japan: Springer: 519-536 [DOI: 10.1007/978-3-030-69525-5_31]
Zhou X, Ran X, Xiong Y, He J, Lin Z, Wang Y, Sun D and Yang M H. 2024. GALA3D: towards text-to-3D complex scene generation via layout-guided generative Gaussian splatting [EB/OL]. [2024-02-11]. https://arxiv.org/pdf/2402.07207.pdf
Zhu C, Wang T, Zhang W, Chen K and Liu X. 2024. Empowering 3D visual grounding with reasoning capabilities [EB/OL]. [2024-07-01]. https://arxiv.org/pdf/2407.01525.pdf
Zhu C, Zhang W, Wang T, Liu X and Chen K. 2023. Object2Scene: putting objects in context for open-vocabulary 3D detection [EB/OL]. [2023-09-18]. https://arxiv.org/pdf/2309.09456.pdf
Zhu D, Chen J, Shen X, Li X and Elhoseiny M. 2023. MiniGPT-4: enhancing vision-language understanding with advanced large language models [EB/OL]. [2023-04-20]. https://arxiv.org/pdf/2304.10592.pdf
Zhu X, Zhang R, He B, Guo Z, Zeng Z, Qin Z, Zhang S and Gao P. 2023. PointCLIP V2: prompting CLIP and GPT for powerful 3D open-world learning//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 2639-2650 [DOI: 10.1109/ICCV51070.2023.00249]
Zhu Z, Fan Z, Jiang Y and Wang Z. 2023. FSGS: real-time few-shot view synthesis using Gaussian splatting [EB/OL]. [2023-12-01]. https://arxiv.org/pdf/2312.00451.pdf
Zhu Z, Ma X, Chen Y, Deng Z, Huang S and Li Q. 2023. 3D-VisTA: pre-trained transformer for 3D vision and text alignment//Proceedings of 2023 IEEE/CVF International Conference on Computer Vision. Paris, France: IEEE: 2899-2909 [DOI: 10.1109/ICCV51070.2023.00272]