
Wang Li'an1, Miao Peihan1, Su Wei2, Li Xi2, Ji Naye3, Jiang Yanbing3 (1. School of Software Technology, Zhejiang University, Ningbo 315048, China; 2. College of Computer Science and Technology, Zhejiang University, Hangzhou 310007, China; 3. College of Media Engineering, Communication University of Zhejiang, Hangzhou 310018, China)

Abstract
Referring expression comprehension (REC) is a multimodal task combining vision and language: it aims to understand an input referring expression and localize the target object it describes in an image, and has attracted attention from both the computer vision and natural language processing communities. REC builds a bridge between human language and the visual content of the physical world, and can be widely applied in artificial intelligence systems such as visual understanding systems and dialogue systems. The key to solving the task is to fully understand the semantics of complex referring expressions, then use this semantic information to perform relational reasoning and object filtering over images containing multiple objects, and finally localize the unique target object in the image. This paper surveys the REC task from a computer vision perspective. We first introduce the general processing pipeline of the task. We then categorize and summarize existing REC methods: according to the granularity of the visual representation, they are divided into methods based on region-level convolutional representations, grid-level convolutional representations, and image-patch representations, and are further subdivided according to how the visual-textual feature fusion module is modeled. In addition, we introduce the mainstream datasets and evaluation metrics for the task. Finally, we discuss the challenges facing existing methods from three aspects, namely model inference speed, model interpretability, and the model's ability to reason over expressions, and offer a comprehensive outlook on the development of REC. We hope that this summary of existing research and future trends provides researchers in related fields with a comprehensive reference and directions for exploration.
Multimodal referring expression comprehension based on image and text: a review

Wang Li'an1, Miao Peihan1, Su Wei2, Li Xi2, Ji Naye3, Jiang Yanbing3(1.School of Software Technology, Zhejiang University, Ningbo 315048, China;2.College of Computer Science and Technology, Zhejiang University, Hangzhou 310007, China;3.College of Media Engineering, Communication University of Zhejiang, Hangzhou 310018, China)

As a subtask of visual grounding (VG), referring expression comprehension (REC) aims to locate, in a given image, the object described by an input referring expression. REC facilitates interaction among humans, machines, and the physical world, and can serve visual understanding and dialogue systems in domains such as navigation, autonomous driving, robotics, and early education. It also benefits related tasks, including 1) image retrieval, 2) image captioning, and 3) visual question answering. Over the past two decades, object detection in computer vision has advanced dramatically, but detectors can only locate objects belonging to a predefined, fixed set of categories. Locating the object specified by a free-form referring expression is more challenging, since it requires reasoning over multiple objects. The general REC pipeline comprises three modules: linguistic feature extraction, visual feature extraction, and visual-linguistic fusion. The fusion module is the most important of the three, as it realizes the interaction between linguistic and visual features and the screening of candidate objects. Much current research also focuses on the design of the visual feature extraction module, which is to some extent the basic module of an REC model: visual input carries richer information than text but also more redundant interference, so the quality of object localization depends on extracting effective visual features. We divide existing REC methods into three categories.
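The three-module pipeline described above (linguistic feature extraction, visual feature extraction, visual-linguistic fusion) can be sketched as a toy program. Everything here is hypothetical for illustration: the vocabulary, the two-dimensional "embeddings", and the precomputed proposal features stand in for what a real model would obtain from a language encoder and a visual backbone; fusion is reduced to cosine-similarity scoring over candidate regions.

```python
import math

# Hypothetical bag-of-words "linguistic feature extraction";
# a real REC model would use an LSTM- or Transformer-based encoder.
VOCAB = {"red": [1.0, 0.0], "ball": [0.0, 1.0], "blue": [-1.0, 0.0]}

def extract_linguistic_features(expression):
    """Average the word embeddings of the referring expression."""
    vecs = [VOCAB[w] for w in expression.split() if w in VOCAB]
    n = max(len(vecs), 1)
    return [sum(v[i] for v in vecs) / n for i in range(2)]

def extract_visual_features(proposals):
    """Each proposal carries a precomputed toy feature vector here;
    a real model would run a CNN or detector over the image."""
    return [p["feature"] for p in proposals]

def fuse_and_select(text_feat, proposals):
    """Visual-linguistic fusion: score each candidate region against
    the expression feature and return the best-matching box."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return num / den if den else 0.0
    scores = [cos(text_feat, f) for f in extract_visual_features(proposals)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return proposals[best]["box"]

# Two candidate objects; the expression should pick the "red ball".
proposals = [
    {"box": (10, 10, 50, 50), "feature": [0.9, 0.8]},   # red ball
    {"box": (60, 10, 90, 40), "feature": [-0.9, 0.1]},  # blue cube
]
print(fuse_and_select(extract_linguistic_features("red ball"), proposals))
# → (10, 10, 50, 50)
```

The sketch mirrors the survey's point that fusion is the decisive step: the two extractors only produce features, while `fuse_and_select` performs the cross-modal interaction and candidate screening.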
1) Region-level convolutional visual representation methods. According to how the visual-linguistic fusion module is modeled, these can be divided into five subcategories: (1) early fusion, (2) attention-based fusion, (3) expression-decomposition fusion, (4) graph-network fusion, and (5) Transformer-based fusion. Because object proposals must be generated for the input image in advance, these methods suffer from high computational cost and low speed, and the performance of the REC model is also limited by the quality of the proposals. 2) Grid-level convolutional visual representation methods. Their multimodal fusion modules fall into two categories: (1) filter-based fusion and (2) Transformer-based fusion. Since no object proposals need to be generated, model inference can be accelerated by a factor of ten or more. 3) Image-patch visual representation methods. The two categories above rely on pretrained object detection or convolutional networks as visual feature extractors, and the resulting features may not match the visual elements the REC task requires. More recent research therefore integrates the visual feature extraction module with the visual-linguistic fusion module, taking the pixels of image patches as input: instead of a pretrained convolutional neural network (CNN) extractor, visual features are generated under the direct guidance of the text input, making them better suited to the requirements of the REC task. We also introduce the task on the basis of four popular datasets and their evaluation metrics. Furthermore, three kinds of challenges in REC remain to be resolved: 1) the model's reasoning speed, 2) the interpretability of the model, and 3) the model's ability to reason over expressions.
Finally, we analyze and predict future research directions for REC from two aspects, model design and domain development, including its extension to the video and 3D domains.