The Potential and Prospects of the Segment Anything Model (SAM): A Survey

Wang Miao1, Huang Zhizhong1, He Huiguang2, Lu Huchuan3, Shan Hongming4, Zhang Junping4 (1. Fudan University; 2. Institute of Automation, Chinese Academy of Sciences; 3. Dalian University of Technology; 4. School of Computer Science and Technology, Fudan University)

Abstract
With the emergence of foundational large models such as contrastive language-image pre-training (CLIP), chat generative pre-trained transformer (ChatGPT), and generative pre-trained transformer 4 (GPT-4), research on artificial general intelligence (AGI) has advanced rapidly. AGI aims to endow artificial intelligence systems with stronger capabilities, enabling them to learn autonomously, evolve continuously, solve diverse problems, and handle different tasks, and thereby find broad application across many domains. After training on large-scale datasets, such foundation models can successfully address a wide variety of downstream tasks. Against this background, the segment anything model (SAM) proposed by Meta in 2023 achieved a major breakthrough, delivering such strong performance in image segmentation that it has even been called the "terminator" of the field. One reason is the segment anything 1 billion (SA-1B) image segmentation dataset, collected in three stages through SAM's data-engine approach and containing 11 million images and over 1 billion masks while ensuring both the quality and diversity of the masks; this in turn drove further breakthroughs in segmentation. Soon after SAM was open-sourced, researchers proposed a series of improved methods and applications. To give a comprehensive picture of SAM's development, strengths, and limitations, this paper reviews and organizes the research progress on SAM. First, it briefly introduces SAM's background and core framework, covering the underlying model, the data engine, and the dataset. On this basis, it reviews in detail the current methods for improving SAM along two key directions: increasing inference speed and improving prediction accuracy. It then examines SAM's wide-ranging applications in image processing tasks, video-related tasks, and other areas, detailing the model's strong performance across various tasks and data types and highlighting its versatility and development potential in multiple domains. Finally, it analyzes and discusses SAM's future research directions and prospective applications.
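Of the two improvement directions summarized above, faster inference is often pursued in follow-up work by distilling SAM's heavy image encoder into a lightweight student encoder that mimics its embeddings. The following toy NumPy sketch illustrates only that idea; the linear "encoders", dimensions, and learning rate are invented stand-ins, not SAM's real components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy encoder distillation: train a small student to reproduce the
# frozen teacher's embeddings (all shapes are hypothetical).
D_IN, D_OUT = 32, 16
W_teacher = rng.normal(size=(D_IN, D_OUT))         # frozen "heavy" encoder
W_student = rng.normal(size=(D_IN, D_OUT)) * 0.01  # lightweight student

X = rng.normal(size=(256, D_IN))                   # surrogate image features
target = X @ W_teacher                             # teacher embeddings

lr = 0.1
for _ in range(1000):
    pred = X @ W_student
    grad = X.T @ (pred - target) / len(X)          # d(MSE)/dW_student
    W_student -= lr * grad

mse = float(np.mean((X @ W_student - target) ** 2))
print(mse < 1e-3)
```

After training, the student reproduces the teacher's outputs on this data at a fraction of the (here, notional) cost, which is the essence of distillation-based speed-ups.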
Keywords
The potential and prospects of the segment anything model: a survey

(1. School of Computer Science, Fudan University; 2. Institute of Automation, Chinese Academy of Sciences; 3. School of Information and Communication Engineering, Dalian University of Technology; 4. School of Computer Science and Technology, Fudan University)

Abstract
With the emergence of foundational large-scale models such as CLIP, ChatGPT, and GPT-4, the field of artificial general intelligence (AGI) has experienced significant growth. AGI aims to imbue systems with the ability to perform a wide range of tasks, enabling them to learn autonomously and evolve. This broad applicability spans various domains and is intended to address diverse problems and accomplish numerous downstream tasks. After training on massive datasets, these models can handle a multitude of downstream tasks. In this context, Meta's segment anything model (SAM), released in 2023, made substantial progress and introduced the largest image segmentation dataset to date, SA-1B, which includes over 11 million images and more than one billion masks. One reason for this is that SA-1B was collected through SAM's data-engine approach in three stages, ensuring both the quality and the diversity of these masks and contributing significantly to breakthroughs in the segmentation domain. This development has had a profound impact on advancing foundation models in the field of computer vision. This paper provides a comprehensive understanding of the SAM framework through a detailed review and analysis of relevant research. First, it delves into three aspects of SAM's background and basic framework. The first aspect is SAM's tasks, including traditional image segmentation and prompt-guided interactive image segmentation. The second aspect is SAM's model architecture, encompassing the image encoder, prompt encoder, and mask decoder. The third aspect is data, including the data engine for collecting the dataset and the SA-1B dataset itself. Building upon this foundation, the paper then organizes and analyzes methods for improving the SAM model from two perspectives.
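Among the components just listed, the prompt encoder maps sparse prompts such as clicked points into embeddings. As a minimal, self-contained illustration in the spirit of SAM's random-Fourier-feature positional encoding, the NumPy sketch below encodes two point prompts; the feature dimension and image size are toy values, not SAM's real configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random Fourier-feature positional encoding for point prompts,
# in the style of SAM's prompt encoder (toy dimensions).
NUM_FEATS = 64                       # embedding dim = 2 * NUM_FEATS
B = rng.normal(size=(2, NUM_FEATS))  # fixed random Gaussian projection

def encode_points(points_xy, image_size):
    """Map pixel coordinates to sin/cos positional embeddings."""
    coords = points_xy / np.asarray(image_size, dtype=float)  # -> [0, 1]
    coords = 2.0 * coords - 1.0                               # -> [-1, 1]
    proj = 2.0 * np.pi * coords @ B                           # (n, NUM_FEATS)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

points = np.array([[320.0, 240.0], [10.0, 470.0]])  # two click prompts
emb = encode_points(points, image_size=(640, 480))
print(emb.shape)  # (2, 128)
```

In SAM itself these point embeddings (plus learned labels for foreground/background) are what the mask decoder attends to alongside the image embedding.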
The first perspective is enhancing inference speed: improved inference speed reduces SAM's deployment costs, making it easier to run on less powerful devices. The second perspective is enhancing prediction accuracy: SAM itself lacks specific semantic information, leading to suboptimal segmentation results in complex scenarios, so much research focuses on improving its prediction accuracy. Subsequently, the paper thoroughly reviews and analyzes the current applications of the SAM model across various tasks and data types, divided into three parts. The first part covers applications in image processing-related tasks, including style transfer, object detection, object counting, image editing, complex image segmentation, and medical image segmentation. Notably, applying SAM directly to medical image segmentation may not yield satisfactory results, suggesting the need for further adaptation in specific scenario tasks. The second part encompasses applications in video-related tasks, including video super-resolution, video object tracking, and audio-visual scene segmentation. The third part explores applications in other directions, such as point cloud segmentation, 3D reconstruction, controllable image caption generation, and data annotation. Through this three-part organization of SAM's applications, the paper summarizes the advantages and limitations of applying SAM to various downstream tasks. These analyses can help researchers better apply and improve SAM, thereby enhancing its robustness and generalization capabilities. Finally, the paper proposes several valuable future research directions for SAM. These include: 1) Modularization: although SAM has already demonstrated excellent performance in some tasks, its efficiency and flexibility still need to be improved.
With the continuous expansion of SAM's application domains, many applications require SAM to acquire new knowledge, so the model needs domain adaptation and continual learning capabilities. Drawing inspiration from large language models, new modular structures can be added to SAM to enhance these capabilities. 2) Weakly supervised semantic segmentation: this setting typically requires retraining classification models and generating pseudo-labels, which involves time-consuming and intricate steps. Recent studies use SAM as a base model in this domain, capitalizing on its strong generalization to obtain satisfactory results without fine-tuning. However, although SAM produces relatively clear results in many unambiguous scenes, it struggles to generate accurate segmentation masks in semantically ambiguous scenarios because the model itself contains no semantic information. To address this difficulty, one can consider using more diverse weak labels for SAM and incorporating additional post-processing modules to improve its segmentation accuracy and its performance in weakly supervised semantic segmentation. Exploring SAM as a foundation model in this setting may therefore yield promising results. 3) Multi-modal fusion for image segmentation: at present, SAM's prompt input mainly takes four forms: points, bounding boxes, segmentation masks, and text prompts. However, with the continuous expansion of SAM's application areas, new requirements for prompt input forms have emerged. SAM currently focuses on 2D visual tasks, with potential future application to 3D visual tasks.
Considering different input modalities for SAM prompts, such as introducing time-series prompts, could address SAM's limitations in video processing tasks and further improve its performance on various video downstream tasks. 4) Efficient fine-tuning of SAM: although SAM has been widely used in various domains, its performance in some specific application scenarios still falls short of other state-of-the-art models in those domains. Studies have shown that fine-tuning SAM on domain-specific datasets improves its performance. However, owing to the large size of the SAM model, the fine-tuning process is costly, so performing it efficiently becomes an important issue. Given SAM's substantial parameter count, incorporating new modules into the model, freezing SAM's core during training, and training only the newly added modules can significantly reduce the training cost, facilitating further research on SAM's application in various downstream tasks. 5) Leveraging Gestalt psychology's holistic cognitive perspective to enhance SAM's adversarial robustness: SAM's vulnerability to attacks may stem from overfitting to local cues; introducing holistic cognition can prevent such overfitting and resist attacks involving noise. By consolidating and summarizing research on SAM, this paper aims to support SAM's further development and application, driving the advancement of foundation models in the field of computer vision.
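The parameter-efficient strategy sketched in point 4), freezing the pretrained core and training only newly added modules, can be illustrated with a toy NumPy example. The "core", the residual adapter, and the regression task below are hypothetical stand-ins chosen only to show that updates touch the adapter alone.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy parameter-efficient fine-tuning: frozen core + trainable adapter.
D = 8
W_core = rng.normal(size=(D, D))   # frozen pretrained weights (never updated)
A = np.zeros((D, D))               # small trainable adapter, added residually

X = rng.normal(size=(128, D))
Y = X @ (W_core + 0.3 * rng.normal(size=(D, D)))  # shifted target domain

lr = 0.05
for _ in range(2000):
    pred = X @ (W_core + A)                 # core output + adapter correction
    grad_A = X.T @ (pred - Y) / len(X)      # gradient w.r.t. the adapter only
    A -= lr * grad_A                        # W_core stays frozen

frozen, trainable = W_core.size, A.size
mse = float(np.mean((X @ (W_core + A) - Y) ** 2))
print(trainable, frozen, mse < 1e-4)
```

In a real setting the frozen core would hold the vast majority of parameters, so restricting gradients to the adapter is what makes fine-tuning affordable.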
Keywords
