Frontier research and latest trends in 3D vision-language reasoning technology
Comprehensive survey on 3D visual-language understanding techniques
2024, Vol. 29, No. 6: 1747-1764
Received: 2024-01-19
Revised: 2024-03-04
Published in print: 2024-06-16
DOI: 10.11834/jig.240029
The core idea of 3D visual reasoning is to understand the relationships among visual entities in point cloud scenes. Nonprofessional users have difficulty conveying their intentions to computers, which limits the popularization and adoption of this technology. To address this, researchers use natural language as the semantic background and query condition to reflect user intentions, and let it interact with point cloud information to accomplish the corresponding tasks. This paradigm, called 3D vision-language reasoning, is widely applied in fields such as autonomous driving, robot navigation, and human-computer interaction, and has become a prominent research direction in computer vision. Over the past few years, 3D vision-language reasoning technology has developed rapidly and flourished, but a comprehensive summary of the latest research progress is still lacking. This paper focuses on the two most representative lines of work, anchor box prediction and content generation, and systematically summarizes the latest advances in the field. First, this paper summarizes the problem definition and existing challenges of 3D vision-language reasoning, and outlines some common backbone networks. Second, it further subdivides the two classes of 3D vision-language reasoning techniques according to the downstream scenarios they target, and discusses the advantages and disadvantages of each method in depth. Next, it compares and analyzes the performance of the various methods on different benchmark datasets. Finally, it looks ahead to the future prospects of 3D vision-language reasoning technology, in the hope of promoting deeper research and wider application in this field.
The core of 3D visual reasoning is to understand the relationships among different visual entities in point cloud scenes. Traditional 3D visual reasoning typically requires users to possess professional expertise; nonprofessional users have difficulty conveying their intentions to computers, which hinders the popularization and advancement of this technology. Users now expect a more convenient way to convey their intentions to the computer, exchange information, and obtain personalized results. To address this issue, researchers utilize natural language as a semantic background or query criterion to reflect user intentions. They then accomplish various tasks by having this natural language interact with 3D point clouds. Through multimodal interaction, often employing techniques such as Transformers or graph neural networks, current approaches can not only locate the entities mentioned by users (e.g., visual grounding and open-vocabulary recognition) but also generate user-required content (e.g., dense captioning, visual question answering, and scene generation). Specifically, 3D visual grounding is intended to locate desired objects or regions in a 3D point cloud scene based on an object-related linguistic query. Open-vocabulary 3D recognition aims to identify and localize 3D objects of novel classes defined by an unbounded (open) vocabulary at inference, generalizing beyond the limited number of base classes labeled during the training phase. 3D dense captioning aims to identify all possible instances within a 3D point cloud scene and generate a corresponding natural language description for each instance. The goal of 3D visual question answering is to comprehend an entire 3D scene and provide an appropriate answer. Text-guided scene generation aims to synthesize a realistic 3D scene composed of a complex background and multiple objects from natural language descriptions.
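As a minimal illustration of the grounding task described above, many approaches can be viewed as cross-modal matching: a 3D detector proposes candidate objects with features, a language encoder embeds the query into the same space, and the best-scoring candidate's box is returned. The sketch below is a toy example with made-up features and names, not the method of any specific surveyed work.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ground_query(object_feats, object_boxes, query_feat):
    """Score every detected object against the language query and return
    the box of the best match (grounding as cross-modal matching)."""
    scores = [cosine(f, query_feat) for f in object_feats]
    return object_boxes[int(np.argmax(scores))], scores

# Toy shared 3D-language embedding space with three detected objects.
object_feats = np.array([
    [0.9, 0.1, 0.0],   # features of a detected chair
    [0.1, 0.9, 0.1],   # features of a detected table
    [0.0, 0.1, 0.9],   # features of a detected lamp
])
object_boxes = ["chair_box", "table_box", "lamp_box"]

query_feat = np.array([0.1, 1.0, 0.0])  # embedding of a query about the table
best_box, scores = ground_query(object_feats, object_boxes, query_feat)
print(best_box)  # -> table_box
```

Real systems learn the joint embedding space end to end; the matching-and-argmax structure, however, is common to many two-stage grounding pipelines.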
The aforementioned paradigm, known as 3D visual-language understanding, has gained significant traction in recent years in various fields, such as autonomous driving, robot navigation, and human-computer interaction. Consequently, it has become a highly anticipated research direction within the computer vision domain. Over the past three years, 3D visual-language understanding technology has rapidly developed and showcased a blossoming trend. However, comprehensive summaries of the latest research progress remain lacking. Systematically summarizing recent studies, comprehensively evaluating the performance of different approaches, and prospectively pointing out future research directions are therefore necessary, and this situation motivates this survey to fill the gap. For this purpose, this study focuses on the two most representative lines of 3D visual-language understanding technology, anchor box prediction and content generation, and systematically summarizes their latest research advancements. First, the study provides an overview of the problem definition and existing challenges in 3D visual-language understanding, and it also outlines some common backbones used in this area. The challenges in 3D visual-language understanding include 3D-language alignment and complex scene understanding. Meanwhile, the common backbones involve a priori rules, multilayer perceptrons, graph neural networks, and Transformer architectures. Subsequently, the study further subdivides the two types of 3D visual-language understanding techniques, anchor box prediction and content generation, according to the downstream scenarios they target, and it thoroughly explores the advantages and disadvantages of each method. Furthermore, the study compares and analyzes the performance of various methods on different benchmark datasets.
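Among the backbones mentioned above, the multilayer-perceptron family (in the spirit of PointNet) obtains order-invariant point cloud features by applying one shared MLP to every point and aggregating with a symmetric max-pool. A minimal numpy sketch of this idea, with randomly initialized (untrained) weights:

```python
import numpy as np

def encode_point_cloud(points, W1, b1, W2, b2):
    """PointNet-style encoder sketch: a shared two-layer MLP applied to
    each point independently, followed by a symmetric max-pool so the
    global feature does not depend on point order."""
    h = np.maximum(points @ W1 + b1, 0.0)   # per-point hidden layer (ReLU)
    per_point = h @ W2 + b2                 # per-point output features
    return per_point.max(axis=0)            # symmetric aggregation

rng = np.random.default_rng(0)
points = rng.standard_normal((128, 3))      # a toy cloud of 128 xyz points
W1, b1 = rng.standard_normal((3, 16)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((16, 32)), rng.standard_normal(32)

feat = encode_point_cloud(points, W1, b1, W2, b2)
shuffled = encode_point_cloud(points[rng.permutation(128)], W1, b1, W2, b2)
print(np.allclose(feat, shuffled))  # True: permutation invariant
```

The symmetric pooling is what makes the encoder well defined on an unordered point set, which is why MLP backbones of this form appear throughout the surveyed grounding and captioning pipelines.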
Finally, the study concludes by looking ahead to the future prospects of 3D visual-language understanding technology, which can promote profound research and widespread application in this field. The major contributions of this study can be summarized as follows: 1) Systematic survey of 3D visual-language understanding. To the best of our knowledge, this survey is the first to thoroughly discuss the recent advances in 3D visual-language understanding. We categorize algorithms into different taxonomies from the perspective of downstream scenarios to provide readers with a clear comprehension of our article. 2) Comprehensive performance evaluation and analysis. We compare existing 3D visual-language understanding approaches on several publicly available datasets. Our in-depth analysis can help researchers select a baseline suitable for their specific applications and also offers valuable insights into the modification of existing methods. 3) Insightful discussion of future prospects. Based on the systematic survey and comprehensive performance comparison, some promising future research directions are discussed, including large-scale 3D foundation models, the computational efficiency of 3D modeling, and the incorporation of additional modalities.
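For the benchmark comparisons mentioned above, grounding methods are commonly scored by Acc@IoU: the fraction of predictions whose 3D intersection-over-union with the ground-truth box meets a threshold (typically 0.25 or 0.5). A self-contained sketch for axis-aligned boxes (real benchmarks may also handle oriented boxes):

```python
import numpy as np

def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as (min_xyz, max_xyz)."""
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # overlap volume, 0 if disjoint
    vol_a = np.prod(box_a[1] - box_a[0])
    vol_b = np.prod(box_b[1] - box_b[0])
    return inter / (vol_a + vol_b - inter)

def acc_at_iou(preds, gts, thresh=0.5):
    """Fraction of predicted boxes matching ground truth at the IoU threshold."""
    hits = sum(iou_3d(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

unit = (np.zeros(3), np.ones(3))                               # unit cube
shifted = (np.array([0.5, 0.0, 0.0]), np.array([1.5, 1.0, 1.0]))
print(iou_3d(unit, shifted))                 # 0.5 / 1.5 = 1/3
print(acc_at_iou([unit, shifted], [unit, unit]))  # 0.5: only the exact match passes
```

Dense captioning and question answering instead rely on language metrics (BLEU, METEOR, ROUGE, CIDEr, SPICE), often combined with the same IoU matching to pair generated captions with ground-truth instances.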