Abstract: Scene relighting, a specialized area within computational imaging, aims to modify the illumination properties of a given image to produce a realistic and visually cohesive representation of the scene under specified lighting conditions. This process is of paramount importance in applications requiring lifelike rendering, such as the metaverse, virtual reality (VR), digital imaging, facial recognition, and studio entertainment. The ability to simulate and manipulate lighting in images and scenes has garnered significant interest in academic and industrial domains due to its potential to enhance user experiences and streamline workflows in these fields. However, traditional methods for scene relighting remain labor-intensive and time-consuming, requiring manual intervention to achieve precise and realistic results. These conventional approaches often involve extracting the foreground image with high accuracy, followed by manual adjustments to illumination conditions, shadow details, edge consistency, and other scene attributes to ensure compatibility with the target lighting. Although effective, these methods are not scalable for large-scale or dynamic applications, necessitating the development of more efficient and automated solutions. The advent of machine learning technologies has revolutionized the field of scene relighting, enabling the automation of complex relighting tasks and significantly improving the quality and efficiency of the results. By integrating computational imaging models, illumination models, 3D modeling, and deep learning techniques, researchers have achieved remarkable progress in generating high-quality relit images. These advancements have reduced the reliance on manual effort and opened new possibilities for applications in dynamic and complex scenarios. Despite these achievements, comprehensive reviews of scene relighting methodologies remain scarce, particularly those that systematically analyze recent developments and provide a holistic perspective on the field. To address this gap, this paper presents a methodical and discerning review of contemporary scene relighting methods, focusing on two pivotal aspects: the employed algorithms and the prevalent datasets. The paper begins by introducing the scene relighting task, its practical applications, and the key challenges associated with it. One of the primary challenges lies in efficiently extracting high-quality texture and geometric information from input images, especially under unknown or ambiguous illumination conditions. Moreover, the scarcity of high-quality, relightable datasets poses a significant obstacle to the training and evaluation of machine learning models. Other challenges include ensuring temporal coherence in dynamic scenarios, such as video-based relighting, and accurately representing complex materials like reflective or translucent surfaces. These challenges underscore the need for innovative approaches that can overcome these limitations and expand the capabilities of scene relighting technologies. To provide a comprehensive understanding of the field, the paper categorizes existing scene relighting methods into three main groups based on their procedural workflow: illumination decoupling, intrinsic decomposition, and rerendering. Each of these steps plays a critical role in the overall relighting process. The first step, illumination decoupling, involves extracting environmental illumination information from the input image and representing it using an appropriate model. 
This step is crucial for providing accurate lighting data for subsequent processes and improving the efficiency and accuracy of intrinsic decomposition. The second step, intrinsic decomposition, focuses on separating the intrinsic properties of the scene, such as geometric and texture attributes, from extrinsic factors like lighting. This step ensures that the structural integrity of the scene is preserved while adapting to new illumination conditions. The final step, rerendering, involves generating the target scene under specified lighting conditions by applying the surface attributes obtained during intrinsic decomposition. This process ensures that the illumination in the rendered image aligns with the given lighting parameters, resulting in a cohesive and realistic final output. In addition to reviewing scene relighting methods, the paper provides an overview of commonly employed datasets and acquisition devices used in this domain. High-quality datasets are essential for training and evaluating scene relighting algorithms, as they provide the ground truth data needed to assess model performance. These datasets often include images captured under various lighting conditions, along with corresponding 3D models, surface normals, or reflectance maps. Acquisition devices, such as light stage systems and multicamera setups, play a vital role in capturing detailed information about the scene’s geometry, texture, and illumination. However, the limited variety and scope of available datasets remain a significant barrier to the development of more generalized and robust relighting models. Despite the progress made in recent years, contemporary scene relighting techniques still face several limitations. For instance, existing methods often struggle to handle extreme lighting scenarios, such as extremely low or high illumination levels or complex shadow patterns. Ensuring temporal coherence in dynamic scenarios, such as video relighting, remains a significant challenge, as inconsistencies in illumination across frames can disrupt the realism of the output. In addition, accurately modeling and rendering complex materials, such as translucent, reflective, or anisotropic surfaces, is an ongoing area of research. The scarcity of diverse and high-quality datasets further exacerbates these challenges, limiting the generalizability of current models. To address these deficiencies, the paper outlines several promising directions for future research in scene relighting. These include developing robust methods for illumination decoupling under extreme lighting conditions, enhancing temporal coherence in dynamic scenarios, creating more sophisticated models for representing and rendering intricate materials, and expanding the variety and scope of available datasets. By tackling these challenges, future advancements in scene relighting are expected to further enhance the quality, efficiency, and applicability of this technology across various domains. In summary, this paper provides a comprehensive review of recent advancements in scene relighting methodologies, highlighting the key challenges, current approaches, and future directions in the field. By synthesizing and analyzing the employed methods and prevalent datasets, the paper aims to support ongoing research and innovation in scene relighting, paving the way for more efficient and realistic solutions in this rapidly evolving domain. 
The insights and recommendations presented in this study are expected to contribute to the development of future relighting technologies, with profound implications for applications in the metaverse, VR, digital imaging, and beyond.
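To make the three-stage workflow reviewed above concrete, the following sketch rerenders a scene under a new directional light with a simple Lambertian shading model. It assumes the illumination decoupling and intrinsic decomposition stages have already produced per-pixel albedo and unit normals; all names are illustrative and do not correspond to any single surveyed method.

```python
import numpy as np

def lambertian_relight(albedo: np.ndarray, normals: np.ndarray,
                       light_dir: np.ndarray,
                       light_color=(1.0, 1.0, 1.0)) -> np.ndarray:
    """Rerender a scene under a new directional light (illustrative only).

    albedo:  (H, W, 3) reflectance from intrinsic decomposition.
    normals: (H, W, 3) unit surface normals from intrinsic decomposition.
    """
    l = light_dir / np.linalg.norm(light_dir)        # unit light direction
    shading = np.clip(normals @ l, 0.0, None)        # Lambertian n.l term, (H, W)
    return albedo * shading[..., None] * np.asarray(light_color)

# Toy usage: a flat, camera-facing surface lit from the upper left.
h, w = 4, 4
albedo = np.full((h, w, 3), 0.8)
normals = np.tile(np.array([0.0, 0.0, 1.0]), (h, w, 1))
relit = lambertian_relight(albedo, normals, light_dir=np.array([-1.0, 1.0, 1.0]))
```

Real systems replace this Lambertian term with learned or physically based shading, but the division of labor among the three stages remains the same.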
Abstract: High-resolution (HR) visual perception plays a pivotal role in numerous computer vision tasks, including scene understanding, object recognition, and image analysis, as it enables the extraction of finer details that are critical for accurate and meaningful interpretations of visual data. Applications such as autonomous driving, medical imaging, remote sensing, and surveillance heavily rely on the clarity and richness of high-resolution images. However, in real-world scenarios, achieving high-quality image capture is often constrained by various practical factors. Limitations in shooting conditions, such as low lighting, motion, and adverse weather, can introduce noise, blurriness, and distortion to images. Furthermore, the cost and limitations of imaging equipment, including optical system components, sensor sensitivity, and circuit noise, often result in degraded image quality, characterized by low resolution and loss of detail. These challenges pose significant hurdles for downstream applications, necessitating methods that can enhance and recover the resolution and quality of degraded images. One promising solution to address these challenges is image super-resolution (SR), a technique designed to reconstruct HR images from low-resolution (LR) inputs. SR holds immense research value, as it not only enhances visual quality but also boosts the performance of various downstream tasks, particularly in resource-constrained scenarios where high-end imaging devices are unavailable. By extracting and utilizing the latent information present in LR images, SR techniques aim to bridge the gap between degraded and high-quality visuals, making it an active area of investigation within the fields of computer vision and image processing. This paper provides a comprehensive review of the recent advancements in image SR reconstruction, focusing on progress made in addressing real-world challenges. It systematically explores the major research contributions domestically and internationally over recent years, categorizing them into several key areas. These include problem formulation and degradation modeling, commonly used datasets and evaluation metrics in SR research, traditional SR reconstruction methods, and modern approaches based on supervised and unsupervised learning. Through this structured analysis, the paper aims to offer valuable insights into the evolution of SR techniques, highlight persistent challenges, and provide an outlook on future development trends in this field. The first section of the paper addresses problem formulation and degradation modeling, which serve as the foundational step in SR research. Degradation modeling involves understanding the various processes that contribute to image degradation, such as noise, compression, and optical distortions. Researchers have proposed various approaches to address this, ranging from controlled synthetic degradation modeling for benchmarking to advanced data-driven methods that learn degradation patterns directly from real-world datasets. The second section discusses datasets and evaluation metrics, which are critical for the development and validation of SR methods. Synthetic datasets, such as Set5, Set14, and DIV2K, are widely used for controlled experiments because they provide paired LR-HR image pairs generated using predefined degradation models. However, these datasets often fail to represent the complexities of real-world scenarios. 
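Before turning to real-world datasets, it helps to state the degradation model from this first section explicitly. In the formulation used throughout much of the SR literature (notation varies across papers), the LR observation $y$ is a blurred, downsampled, and noise-corrupted version of the HR image $x$:

```latex
y = (x \otimes k)\downarrow_{s} + n
```

Here $k$ is a blur kernel, $\otimes$ denotes convolution, $\downarrow_{s}$ is downsampling by scale factor $s$, and $n$ is additive noise; real-world degradations compound these terms in unknown and spatially varying ways, which is precisely what the real-world datasets below are meant to capture.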
In response to this limitation, real-world datasets, such as RealSR, DRealSR, and City100, are increasingly being utilized to train and evaluate SR models under practical conditions. The paper also examines commonly used evaluation metrics, including quantitative measures like peak signal-to-noise ratio and structural similarity index measure, as well as perceptual-based metrics like learned perceptual image patch similarity. These metrics help assess the fidelity and visual quality of reconstructed images, although they often have limitations in capturing subjective human perception. The third section delves into traditional SR reconstruction methods, which laid the foundation for modern SR research. Early methods primarily relied on interpolation techniques, such as bicubic interpolation, which are simple and computationally efficient but cannot recover high-frequency details. In contrast, reconstruction-based methods use prior knowledge about image structures to restore missing details. Techniques like sparse representation and dictionary learning have shown promise in this area, but their reliance on handcrafted features and assumptions about image properties limits their performance in complex real-world scenarios. The fourth section explores supervised learning-based SR reconstruction, which has gained immense popularity with the advent of deep learning. Supervised SR methods rely on paired LR-HR datasets to train neural networks that learn the mapping between LR and HR images. However, supervised SR methods face challenges in handling unknown degradation. To address this, researchers have proposed degradation-aware networks and domain adaptation techniques that aim to bridge the gap between synthetic and real-world data, ensuring improved generalization to practical scenarios. The fifth section discusses unsupervised learning-based SR reconstruction, which is particularly valuable in scenarios where paired LR-HR data are unavailable. Unsupervised methods leverage techniques such as generative adversarial networks (GANs) and cycle-consistent loss functions to learn mappings between LR and HR domains without explicit supervision. For example, CycleGAN-based models align the distributions of LR and HR images, enabling effective SR reconstruction. Self-supervised learning approaches, which utilize auxiliary tasks or pretext objectives, have also shown promise in leveraging unpaired data for SR tasks. Finally, the paper concludes with a detailed discussion of the challenges and future directions in real-world image SR research. Key challenges include addressing the diverse and unknown degradation patterns encountered in practical applications, improving the efficiency and scalability of SR models for deployment on resource-constrained devices, and enhancing the perceptual quality of reconstructed images. The integration of prior knowledge, such as scene semantics or geometric information, is identified as a promising direction for improving SR performance. Furthermore, the paper highlights the potential of emerging techniques such as zero-shot learning, physics-informed models, and hybrid approaches that combine traditional and deep learning-based methods. In summary, this paper provides a thorough analysis of the state of the art in image SR reconstruction, offering valuable insights into the progress, challenges, and future opportunities in the field. 
By bridging the gap between theory and application, the findings of this paper aim to advance the development of robust SR methods that can address the complexities of real-world scenarios.
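As a reference point for the fidelity metrics surveyed in the second section, the sketch below computes PSNR directly from its definition; SSIM and LPIPS involve more machinery and are typically taken from libraries such as scikit-image or the lpips package.

```python
import numpy as np

def psnr(reference: np.ndarray, reconstructed: np.ndarray,
         max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between an HR ground truth and an
    SR reconstruction of the same shape and dynamic range; higher is better."""
    mse = np.mean((reference.astype(np.float64)
                   - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                       # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```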
Abstract: In the field of computer vision, event cameras, a revolutionary class of neuromorphic visual sensors, have emerged as a transformative alternative to traditional frame-based cameras. While conventional cameras are constrained by fundamental limitations, such as fixed exposure times and limited dynamic range, often resulting in motion blur, rolling shutter distortion, and compromised performance in challenging lighting conditions, event cameras operate on an entirely different paradigm. These innovative sensors specifically detect and respond to dynamic changes in scene luminance, generating asynchronous events (spike signals) with remarkable microsecond-level temporal resolution. This bio-inspired design, which evolved from silicon retinas developed in the 1990s through several generations of technological advancement (including DVS, ATIS, and DAVIS architectures), enables event cameras to achieve exceptional dynamic range characteristics, effectively overcoming the inherent limitations of traditional imaging systems. The unique operating principle of event cameras, which mimics the human visual system’s ability to respond to changes rather than capture complete frames, represents a fundamental shift in visual sensing technology. The high temporal resolution and ultra-low latency characteristics of event cameras enable them to effectively complement the missing intra-frame/inter-frame information during traditional camera capture. In tasks such as motion deblurring, video frame interpolation, and rolling shutter correction, researchers have used an evolving array of techniques, from early physics-based models to modern deep learning approaches, which fully utilize the visual texture and geometric motion information in event streams to enhance reconstruction quality. Representative works include the event-based double integration (EDI) model, the event-fused video frame interpolation method Time Lens, and the rolling shutter correction framework EvUnroll. As research progresses, researchers have gradually extended single-task methods to joint tasks addressing multiple video degradations simultaneously. These joint tasks include combinations of deblurring and frame interpolation (EVDI), deblurring and rolling shutter correction, and rolling shutter correction with frame interpolation (SelfUnroll). Recent studies have achieved the joint processing of three tasks, demonstrated by neural network-based image re-exposure frameworks and the lightweight network UniINR. These methods effectively address the ill-posed problems caused by coupled degradations by fully using the temporal information provided by event streams, significantly improving video enhancement results. Beyond temporal compensation, event streams can also enhance image and video quality through spatial compensation. To achieve reconstruction of clear high-resolution image sequences from single blurry low-resolution images, researchers leverage the spatiotemporal correlation characteristics between event streams and images in super-resolution tasks. As demonstrated by the EHDR method and HDRev-Net, the high dynamic range characteristics (120 dB) of event cameras have also been used to improve imaging under extreme lighting conditions. Recent research has successfully combined events’ high dynamic range and temporal resolution characteristics, as exemplified by the Self-EHDRI framework, thereby achieving significant progress in handling mixed degradation problems, such as motion blur and rolling shutter effects. 
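The EDI model mentioned above makes the event-image coupling explicit. In simplified notation, a blurry frame $B$ captured over exposure $T$ is the temporal average of latent sharp frames $L(t)$, each tied to a reference frame $L(f)$ through the integrated event signal $e(s)$ and the contrast threshold $c$:

```latex
B = \frac{1}{T} \int_{f-\frac{T}{2}}^{f+\frac{T}{2}} L(t)\,\mathrm{d}t,
\qquad
L(t) = L(f)\,\exp\!\left( c \int_{f}^{t} e(s)\,\mathrm{d}s \right)
```

Solving this pair for $L(f)$ recovers a sharp frame from a single blurry image plus its event stream, which is the basis for many of the joint deblurring and interpolation methods discussed next.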
These advances in event-based enhancement techniques have yielded impressive results through model- and learning-based approaches. While physical models have established explicit mathematical relationships between events and image formation processes, deep learning methods have demonstrated remarkable ability in learning complex mappings from event data to enhanced images, particularly in challenging scenarios involving rapid motion, extreme lighting variations, and complex dynamic scenes. Research trends in event-based vision enhancement reveal several significant developments and promising directions. Over the years, the field has evolved from single-task solutions to sophisticated multitask frameworks capable of addressing multiple image degradation problems simultaneously. This advancement is complemented by a growing interest in self-supervised and unsupervised learning approaches, which significantly reduce dependence on paired training data, a crucial development given the unique nature of event data. Furthermore, the integration of event cameras with traditional vision systems has become increasingly sophisticated, leading to hybrid solutions that effectively combine the advantages of both sensing modalities. The field has also witnessed substantial hardware progress, with major technology companies, including iniVation, Prophesee, Samsung, and Sony, introducing advanced commercial event cameras. This technology has catalyzed widespread adoption across various applications, from industrial automation and robotics to intelligent monitoring systems and autonomous vehicles. Current challenges in event-based vision span multiple technological aspects that require innovative solutions. At the hardware level, event cameras face ongoing challenges in achieving higher spatial resolution while maintaining temporal precision, managing sensor noise, and optimizing power consumption. Among these, a critical challenge lies in the efficient processing of massive event data streams, especially in high-speed scenarios where millions of events per second must be processed in real time. This necessitates sophisticated data management strategies and highly efficient processing algorithms. Real-time processing requirements pose additional challenges, as algorithms must carefully balance computational complexity with processing speed to maintain temporal accuracy. Furthermore, the field continues to grapple with the need for standardized evaluation metrics and robust benchmarks that enable the fair comparison of different approaches and spur further technological advancement. Looking forward, several promising research directions have emerged that could address these challenges and further advance the field: the development of next-generation event sensors with enhanced resolution and improved noise characteristics, the exploration of specialized neural architectures optimized for event processing, and the integration of event-based vision in emerging applications, such as augmented reality and autonomous systems. The unique advantages of event cameras in temporal resolution and dynamic range, combined with ongoing advancements in hardware capabilities and algorithmic innovations, position them as crucial components of next-generation visual processing systems, particularly in challenging dynamic scenarios in which conventional cameras face fundamental limitations. 
This survey examines the theoretical principles and technical approaches of neuromorphic spike vision imaging methods, represented by event cameras, in video enhancement tasks. In particular, this work summarizes and reviews the latest domestic and international developments in video enhancement algorithms that integrate neuromorphic visual spikes. It also analyzes and discusses the bottlenecks and challenges faced in this field, such as low data processing efficiency, poor performance under low-light conditions, and insufficient spatial resolution.
Abstract: Object detection is a fundamental task in computer vision, employing deep neural networks to identify and localize objects in images. Under closed-set conditions, where training and test data share similar distributions and the set of categories remains fixed, object detection systems have achieved remarkable success. These systems now play a pivotal role in applications such as autonomous driving, medical imaging, and facial recognition. However, the shift from closed-set to open-environment scenarios has introduced complex challenges, reflecting the unpredictability and diversity of real-world conditions. These include changes in data distribution (domain shift), the emergence of new categories, and the presence of noise, all of which significantly affect the robustness and accuracy of object detection models. Furthermore, the integration of object detection systems into real-world applications often necessitates balancing performance with resource efficiency, posing additional challenges in achieving scalability, interpretability, and low-latency processing for time-critical scenarios like video analytics and disaster response systems. This paper systematically investigates the challenges of object detection in open environments, focusing on four key areas: handling out-of-distribution (OOD) data, detecting objects of unknown categories, improving model robustness, and enabling incremental learning. First, addressing OOD challenges requires robust domain adaptation and domain generalization methods. The inability of traditional object detectors to generalize beyond their training domain often leads to degraded performance when deployed in diverse real-world settings. Techniques such as intermediate domain generation, adversarial learning, and contrastive learning have emerged as promising approaches to mitigate domain shift. These methods enhance generalization by enabling models to learn invariant features across domains or simulate unseen domains during training. Furthermore, unsupervised and semi-supervised learning paradigms extend these capabilities by leveraging unlabeled data to adapt detectors to new conditions. The second challenge pertains to detecting objects of unknown categories, a scenario common in real-world environments where new object categories may appear post-training. Traditional detectors, limited by their closed-set assumptions, struggle with this open-world requirement. Approaches addressing this issue include distinguishing known from unknown objects through uncertainty estimation and synthesizing pseudo-labels for unknown categories. Furthermore, leveraging auxiliary information such as attributes or visual-textual alignment enables detectors to infer relationships between known and unknown objects, improving their ability to identify and classify novel categories. Expanding these techniques to include cross-modal fusion strategies and leveraging contextual priors can further enhance performance in open-world scenarios. Robustness is the third critical focus area, particularly in defending against adversarial attacks and environmental noise. In open environments, object detection models must maintain reliability despite attempts to compromise their predictions through adversarial perturbations or natural disruptions such as occlusions or poor lighting. Techniques such as adversarial training, noise suppression modules, and the integration of domain-specific knowledge have shown promise in enhancing model resilience. 
The paper reviews advancements in defense mechanisms and adaptive adversarial training frameworks that ensure robustness without compromising performance on clean data. The exploration of novel architectures, such as transformer-based detectors, also holds potential for building inherently robust systems capable of learning global and local context simultaneously. Incremental learning represents the fourth challenge, addressing the need for models to adapt continually to new tasks or categories without forgetting previously learned knowledge. Traditional training processes often overwrite prior knowledge when exposed to new data, a phenomenon known as catastrophic forgetting. Solutions to this issue include knowledge distillation, pseudo-labeling, and data replay strategies. These approaches allow detectors to balance learning new information while preserving performance on previously encountered tasks or categories. The integration of large-scale pretrained models and generative techniques for creating synthetic data has further advanced the field by providing scalable and flexible solutions. Moreover, optimizing these methods to operate under constrained computational environments remains a key area for future research. This paper provides a comprehensive review of the methodologies and frameworks developed to address these challenges, assessing their strengths and limitations. Through detailed analysis, we identify key opportunities for advancing object detection technology in open environments. Future research directions include: 1) constructing diverse and comprehensive datasets that better reflect the complexity of real-world scenarios; 2) exploring the use of multimodal inputs, such as combining visual data with textual descriptions, to enhance contextual understanding; 3) developing lightweight, real-time adaptive mechanisms to defend against adversarial attacks; and 4) optimizing incremental learning algorithms to reduce computational costs while preserving accuracy across tasks. In addition, fostering collaboration between academia and industry is critical to address these challenges effectively, accelerating the translation of research breakthroughs into practical applications. By synthesizing insights from state-of-the-art methods and identifying critical gaps in current research, this work contributes a systematic perspective on the evolving landscape of object detection in open environments. This perspective aims to inspire innovative solutions that enhance the robustness, adaptability, and scalability of object detection systems. Ultimately, the advancements discussed here will empower object detection technologies to address the demands of dynamic real-world applications, fostering their adoption in diverse fields such as public safety, industrial automation, and healthcare, while paving the way for interdisciplinary innovations in robotics, augmented reality, and smart cities.
Keywords: object detection; open environment; deep learning; robustness; out-of-category detection; incremental learning; data distribution
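As one concrete instance of the knowledge distillation strategy cited above for incremental detection, the sketch below shows the standard soft-label distillation term; this is the generic recipe rather than the loss of any single surveyed detector.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Pull the student toward the frozen teacher's (old-task) outputs to
    curb catastrophic forgetting; new-task losses are added separately."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence scaled by T^2, following the standard distillation recipe.
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2
```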
Abstract: Person reidentification (re-id) aims to recognize target pedestrians across nonoverlapping camera views. It is a key area of focus in computer vision due to its significant research value and widespread application prospects in security surveillance. In recent years, the performance of re-id techniques has seen rapid growth, with state-of-the-art (SOTA) methods outperforming humans. Furthermore, researchers have paid increasing attention to re-id in challenging uncontrolled environments, including visible-infrared, occluded, cloth-changing, low-resolution, and aerial person re-id. Despite these advancements, the performance of re-id models remains below the desired level for practical applications for two major reasons. First, existing re-id models are trained on closed datasets with single scenarios and sufficient labeled pedestrians. This approach falls short in real-world settings characterized by diverse scenarios, varying conditions across cameras, and the high cost of obtaining labeled data, leading to inadequate performance, robustness, and generalization for actual use. Second, the expensive nature of annotation limits the scale of re-id datasets, making them significantly smaller compared to datasets for other vision tasks, such as face recognition, object recognition, and segmentation. This limitation may cause re-id models to overfit to their training images, undermining their generalizability. Consequently, reaching universal person re-id remains a significant challenge. Recently, the field of large-scale pretraining models has attracted significant attention and rapid development due to their critical role in enhancing person re-id techniques. In this paper, we present an overview of the applications of large-scale pretraining techniques for person re-id. First, we introduce the background of large-scale pretraining models. Self-supervised pretraining techniques have achieved great success in natural language processing (NLP). In particular, the Transformer architecture has excelled in extracting robust NLP features, with GPT and BERT emerging as pioneering models using the Transformer to generate useful outputs for subsequent tasks. GPT-3 has demonstrated that large-scale pretraining models can rival the performance of SOTA supervised models without annotations. With the successful application of GPT-3, many researchers have attempted to apply self-supervised pretraining techniques to vision tasks, and some pioneering research has been conducted for vision-language cross-modal tasks. ViLBERT marked the beginning of learning the relationships between vision and language. The CLIP model shows great generalization ability for zero-shot vision tasks. Furthermore, MAE adopts masked modeling techniques to train a pretrained model with good generalization ability. These advancements highlight that large-scale pretraining techniques, leveraging vast amounts of unsupervised data, not only elevate the baseline models’ performance and generalization abilities but also hold great promise for re-id by reducing the need for expensive labeled data gathering. Moreover, the information from large-scale pretraining models can be utilized to improve the performance of re-id models. Given that self-supervised pretraining techniques can promote re-id models, some researchers have made pioneering efforts. 
Here, we introduce the existing research on large-scale pretraining re-id models, organizing the literature into three types, namely, self-supervised pretraining re-id methods, large-scale pretraining model-based re-id methods, and prompt learning-based re-id methods. We discuss the above large-scale pretraining technique-based methods and the effects and performances of SOTA methods on various benchmarks. Self-supervised pretraining re-id methods employ self-supervised pretraining techniques and large-scale unsupervised pedestrian benchmarks to train a robust pretraining model, addressing the scarcity and high cost of labeled pedestrian data. Some researchers have constructed weakly supervised/unsupervised benchmarks for studying self-supervised pretraining re-id techniques. SYSU-30K is the first large-scale weakly supervised re-id dataset, constructed from over 30 million images and 30 000 IDs from 1 000 downloaded videos. The challenges of SYSU-30K include low resolution, view changes, occlusion, and changing illumination. LUPerson is the first large-scale unsupervised person benchmark, containing more than 4.2 million unsupervised pedestrian images from 46 000 scenes and covering the challenges of illumination variations, changing resolution, and occlusion. We adopt tracking for the LUPerson dataset and construct the weakly supervised dataset LUPerson-NL, which contains more than 10 million pedestrians and 430 000 noisy identities. With the emergence of large-scale unsupervised datasets, some researchers have applied self-supervised techniques to re-id. Some studies have utilized a contrastive learning framework to learn robust re-id models from unsupervised pedestrians. The MoCo framework and catastrophic forgetting score are utilized to improve the generalization ability of re-id models. Furthermore, some studies have employed the prior knowledge of pedestrians to improve the performance of self-supervised pretraining techniques. The local structure, view information, and color information are employed to incorporate prior knowledge into pretraining re-id methods. Large-scale pretraining model-based re-id methods employ the knowledge of multimodal large-scale models and use the interaction between vision and language to improve the performance of re-id models. Given that the CLIP model has shown superior performance for zero-shot vision tasks, most of the related studies have utilized it to learn a discriminative and robust re-id model. Llama2 is also adopted to promote re-id tasks. Prompt learning-based re-id methods introduce prompt learning techniques to learn a robust re-id model. First, prompt learning re-id methods utilize the relationships between text descriptions and visual features to learn a more discriminative and robust model. We focus on employing prompts to make the model adaptive to different environments, such that we can obtain a universal re-id model that can cope with changing environments. Experimental results show that self-supervised techniques, large-scale pretraining models, and prompt learning methods can significantly improve the performance and generalization ability of re-id models. We can achieve a more universal re-id model for unseen scenarios. Finally, we conclude the overview of the current literature, analyze the limitations of the existing literature, and discuss potential directions for future research. In conclusion, large-scale pretraining techniques are essential for universal re-id. 
Although existing research is pioneering yet nascent, with a somewhat weak connection between re-id and large-scale pretraining models, the integration of pedestrian priors and large-scale model knowledge to achieve universal re-id warrants concerted exploration and promotion from academia and industry.
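To illustrate the retrieval step shared by the CLIP-style re-id pipelines discussed above, the following sketch ranks a gallery by cosine similarity in a joint embedding space; the encoders that produce the features (e.g., a prompt-conditioned image/text backbone) are assumed and not shown.

```python
import numpy as np

def rank_gallery(query_feat: np.ndarray, gallery_feats: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted by cosine similarity to the query.

    query_feat:    (D,) embedding of the query pedestrian.
    gallery_feats: (N, D) embeddings of gallery pedestrians.
    """
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    similarity = g @ q                    # cosine similarity per gallery image
    return np.argsort(-similarity)        # most similar first
```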
Abstract: Surface defect inspection is a key aspect of industrial automation, primarily focused on detecting and evaluating defects or anomalies in products or materials. This technology mainly aims to ensure product quality, reduce production costs, and improve production efficiency by obtaining relevant information, such as the coordinates, categories, sizes, and contours of defects. In industrial settings, defects are generally seen as missing parts, flaws, damages, errors, or anomalies compared to normal samples. Defect information annotation can be categorized based on the level of annotation granularity in the dataset: image-level annotation, instance-level annotation, and pixel-level annotation, which correspond to classification, detection, and segmentation tasks in computer vision. In recent years, with the rapid advancements in machine vision, big data, and sensor technologies, the automation of surface defect inspection has become feasible, shifting away from the traditional reliance on human labor. The use of machine vision systems to replace manual inspections for surface defects in industrial products has rapidly gained popularity and become the norm. However, in industrial settings, the detailed annotation of defect data is time-consuming and labor-intensive; thus, a large amount of unannotated historical data often exists in these settings. The challenge of surface defect inspection with incomplete annotations focuses on how to effectively utilize this rich unannotated data or leverage a small amount of annotated data to improve inspection accuracy. Automated visual inspection systems play a crucial role in modern industrial manufacturing, gradually replacing manual quality control processes and supporting the development of smart manufacturing systems. While researchers have explored various deep learning-based surface defect inspection technologies, comprehensive reviews specifically addressing methods for handling incomplete annotations in the field of industrial product surface defect detection are still lacking. This paper aims to fill this gap by systematically reviewing the research background, foundational concepts, commonly used datasets, and related technologies in the field of surface defect inspection with incomplete annotations. Additionally, based on specific industrial contexts, commonly used datasets in fields such as metal surfaces, fabrics, building materials, and 3C electronic products have been collected and organized. The paper meticulously categorizes various surface defect inspection techniques by examining them from two key perspectives: label strategies and task strategies, highlighting the characteristics, advantages, and disadvantages of each method. Based on the types of available labels, methods for surface defect inspection with incomplete annotations are classified into three main categories: unsupervised learning, semi-supervised learning, and weakly supervised learning. Unsupervised learning methods mainly include strategies such as clustering, positive sample modeling, and template matching. Semi-supervised learning methods primarily include techniques such as image reconstruction, pseudo-label generation, and data augmentation. Weakly supervised learning methods mainly involve the use of class activation maps and similar techniques. Based on different task strategies, surface defect inspection tasks with incomplete annotations can be categorized into domain adaptation-based methods, few-shot learning-based methods, and large model-based methods. 
Domain adaptation methods primarily include approaches such as distribution alignment, learning domain-invariant features, and dynamically adjusting hyperparameters. Few-shot learning methods primarily include techniques such as meta-learning, metric learning, and graph neural networks. Large model-based methods mainly involve the use of large models such as the segment anything model (SAM), the contrastive language-image pre-training model (CLIP), large language models (LLMs), and vision-language models (VLMs). Subsequently, this paper compares the performance of cutting-edge unsupervised, semi-supervised, and weakly supervised algorithms across multiple datasets. Finally, the paper discusses and predicts future research trends in surface defect inspection with incomplete annotations, including the detection of small, weak targets, efficient utilization of unannotated defect images, and the application of language models and vision-language models in defect inspection. Surface defect inspection tasks with incomplete annotations are commonly encountered in industry, holding substantial research and application value, and deserve increased attention and promotion from the industrial and academic communities.
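As an illustration of the class activation maps that underpin many of the weakly supervised methods above, the following generic sketch localizes defects from an image-level label by re-weighting the final convolutional features with the classifier weights of the "defective" class; it follows the original CAM formulation rather than any specific surveyed system.

```python
import numpy as np

def class_activation_map(feature_maps: np.ndarray,
                         class_weights: np.ndarray) -> np.ndarray:
    """Coarse defect localization from an image-level label.

    feature_maps:  (C, H, W) activations of the last conv layer.
    class_weights: (C,) linear-classifier weights of the target class.
    """
    cam = np.tensordot(class_weights, feature_maps, axes=1)   # (H, W)
    cam = np.maximum(cam, 0.0)                                # keep positive evidence
    return cam / cam.max() if cam.max() > 0 else cam          # normalize to [0, 1]
```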
Abstract: The rapid development of information technology has led to the explosive growth of multimedia data, including text, images, and videos. The vast amount of data and the powerful computational capabilities have significantly driven advances in deep learning, allowing models to achieve exceptional performance in single-modal tasks, such as object detection and semantic segmentation, as well as multimodal tasks, such as cross-modal retrieval and question answering. High-performance deep learning models are typically developed in static learning scenarios in which they can access all training data simultaneously and update the model using the entire dataset. However, in real-world applications, deep models always learn from task streams that dynamically receive new data over time. In these scenarios, deep models must simultaneously retain old task data and maintain performance on old and new tasks through repeated joint training. This approach of unrestricted joint training and continuous dataset expansion to maintain high performance incurs substantial time and financial costs. Furthermore, the prolonged runtime of high-power devices during joint training leads to significant carbon emissions, thereby contributing to environmental pollution. Such costs are even higher in multimodal scenarios due to the vast amounts of multimodal data and the large number of model parameters. This learning strategy, which is misaligned with the sustainability goals of artificial intelligence (AI), is impractical for real-world applications. Therefore, deep models must be capable of adapting to new tasks in dynamic environments, commonly known as incremental learning scenarios. However, developing a high-performance incremental learning model in incremental learning scenarios remains a challenging task, because the model must update its knowledge base each time it receives new data, without access to previously learned data. Due to the unavailability of previously learned data, models face the problem of catastrophic forgetting (CF) in which they tend to forget previously acquired knowledge when learning new tasks. CF degrades model performance, especially in scenarios with high privacy and security requirements, such as personal data in medical image processing. Therefore, incremental learning methods must possess the ability to acquire new knowledge (plasticity) while simultaneously retaining previously learned knowledge (stability). Enhancing stability reduces plasticity, while increasing plasticity causes instability in old tasks. These conflicting demands form the stability–plasticity dilemma, and resolving it is the primary challenge for researchers specializing in the field of incremental learning. This paper categorizes incremental learning methods into four perspectives: regularization-, replay-, and architecture-based methods, as well as methods based on fine-tuning pre-trained models. Regularization-based incremental learning methods mitigate catastrophic forgetting by adding regularization terms that optimize model parameters, adjusting the update of key parameters related to previous tasks while learning new tasks. Such an approach can be divided into two types: parameter and output regularization. The former adds regularization terms to the model’s parameters, while the latter applies them to the model’s outputs. Replay-based incremental learning methods transfer key knowledge from representative old samples to mitigate catastrophic forgetting. 
Replay-based methods are further divided into generation- and experience-based replay, wherein the former generates old representative samples using generative models, while the latter retains actual old representative samples for training. Architecture-based methods mitigate CF by maintaining task-specific model parameters. These methods are subdivided into neuron-expansion and parameter-isolation approaches. Neuron-expansion approaches allocate new model parameters for each task by expanding the network, while parameter-isolation approaches freeze key parameters of old tasks to ensure that learning new tasks does not interfere with previous ones. Methods based on fine-tuning pre-trained models apply fine-tuning strategies to large models to learn new tasks directly, while leveraging their generalization ability to maintain performance on old tasks. Such methods can be divided into prompt- and representation-based fine-tuning approaches. The former introduces prompts to fine-tune the pre-trained model, thereby balancing old and new knowledge. In comparison, the latter directly leverages the model’s generalization ability to build classifiers, utilizing high-quality feature representations to handle new tasks without significant changes to the model architecture. This paper also provides a mathematical definition of incremental learning scenarios and the objective function for optimizing incremental learning models. In particular, it summarizes six key evaluation metrics used in the incremental learning field: average accuracy (AA), average incremental accuracy (AIA), average forgetting rate (AF), forward transfer rate (FTR), backward transfer rate (BTR), and learning plasticity (LP). Based on the task stream pattern in incremental scenarios, the need for task identification, and the number of classification heads, this paper divides incremental learning into three subfields: task-incremental learning (TIL), domain-incremental learning (DIL), and class-incremental learning (CIL), along with detailed explanations and mathematical definitions. This paper summarizes the latest research progress in single-modal fields, including semantic segmentation, image generation, and large language models, as well as multimodal fields like vision-language and vision-audio multimodal incremental learning, based on the application scenarios of incremental learning. Furthermore, this paper surveys the publication status of papers in leading English journals and major conferences in the field to compare the research levels of incremental learning between domestic and international scholars. The comparison analyzes research investments and progress in incremental learning, both domestically and internationally, based on the total number of published papers and the average citation count per paper. Finally, this article anticipates three future directions for the development of incremental learning: large multimodal model incremental learning, incremental learning based on novel deep network architectures, and continual forgetting of knowledge for AI security.
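For reference, two of the metrics listed above are commonly defined as follows, where $a_{t,i}$ is the accuracy on task $i$ after training on task $t$ and $T$ is the number of tasks; exact definitions vary slightly across papers:

```latex
\mathrm{AA} = \frac{1}{T} \sum_{i=1}^{T} a_{T,i},
\qquad
\mathrm{AF} = \frac{1}{T-1} \sum_{i=1}^{T-1}
\left( \max_{t \in \{i,\dots,T-1\}} a_{t,i} - a_{T,i} \right)
```

AA measures final performance across all tasks, while AF measures how far each old task has dropped from its best observed accuracy; AIA averages AA over all incremental steps.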
Liu Yebin, Su Hao, Gao Lin, Yi Li, Wang He, Liao Yiyi, Shi Boxin, Cao Yanpei, Hong Fangzhou, Dong Hao, Zhang Juyong, Wang Xintao, Xu Huazhe, Yang Jiaolong, Kang Bingyi, Chu Mengyu, Sun He, Chen Wenzheng, Ma Yuexin, Zhang Hongwen, Guo Yulan, Zhou Xiaowei, Zhang Guofeng, Han Xiaoguang, Dai Yuchao, Chen Baoquan
Abstract: As an interdisciplinary field that integrates computer vision, graphics, artificial intelligence (AI), and optical imaging, three-dimensional (3D) vision serves as the foundational cornerstone for building embodied general intelligence and the metaverse ecosystem. In 2024, differentiable representation technologies, exemplified by neural radiance fields (NeRF) and Gaussian splatting, continued to evolve and mature, gradually transcending the boundaries of traditional 3D reconstruction. From microscopic cellular structures to macroscopic celestial bodies, and from static scenes to dynamic human bodies, significant improvements in accuracy have been achieved across this spectrum. Propelled by advancements in generative AI technologies and the scaling laws of large models, the field of 3D vision has witnessed a paradigm shift from optimization to generalizable feedforward generation, marking important progress and breakthroughs in the direction of controllable digital content generation. Embodied intelligence remains a focal point of interest, with researchers increasingly recognizing the capture and generation of 3D virtual simulation data and 3D human motion data as central elements of training embodied intelligence. As the concepts of world models and spatial intelligence become hot topics among technology researchers, modeling the physical world, understanding spatial relationships, and predicting future states have emerged as crucial research directions, all of which rely on the support of 3D vision technologies. Furthermore, through nontraditional visual sensors and novel reconstruction algorithms, innovations in computational imaging technology have exceeded the physical limitations and performance bottlenecks of traditional 3D reconstruction. These technological breakthroughs are propelling 3D vision into a new era of intelligent, large-scale learning that encompasses perception, modeling, generation, and interaction. Specifically, the development trends in 3D vision are primarily manifested in the following aspects: 1) Controllable and physics-aware generation of the visual elements of AI-generated content (AIGC). With the rapid development of AIGC technology, visual content generation has evolved from simple two-dimensional (2D) image creation toward more controllable and physics-aware approaches. This trend requires the combination of physical prior knowledge and multidimensional control parameters, such as 3D viewpoints, lighting conditions, and 3D character motion, to achieve higher-quality content generation. 3D vision technology plays a pivotal role in this process, providing essential spatial-temporal and physical constraints for AIGC. 2) 4D spatial intelligence: bridging virtual and physical worlds. 4D (3D space + time dimension) spatial intelligence has emerged as a core technology connecting virtual worlds (e.g., the metaverse) and physical realities (e.g., embodied intelligent robots). This technology focuses on establishing a digital mapping of dynamic physical environments. By leveraging 3D vision and multimodal large model technologies, AI systems can construct 4D spatial models to understand spatial relationships, predict motion trajectories, and simulate future evolutions. In turn, intelligent agents can interactively learn within physical or virtual 4D environments to acquire intelligence. 3) Data-driven embodied intelligence: 3D virtual simulation and human motion capture. 
The advancement of embodied intelligence relies heavily on high-quality 3D virtual simulation data and the capture/generation of human 3D motion data. These datasets fuel the training of embodied intelligent robots to achieve sophisticated behavioral control. Through high-precision 3D vision technologies, robots can better understand and simulate human actions, resulting in their enhanced intelligence in complex tasks. 4) Differentiable 3D representation and integration with large model technologies. From microscopic cellular structures to indoor environments, human/animal modeling, autonomous driving/city modeling, and even astronomical black hole reconstruction, novel 3D representations, such as NeRF and 3D Gaussian splatting (3DGS), drive performance improvements in scene generation and reconstruction across scales. Their efficiency and flexibility have opened up new possibilities for 3D vision applications. Furthermore, by integrating large-scale 3D data with transformer-based architectures and advanced generative methods, such as diffusion models, fundamental 3D vision tasks are being unified into efficient end-to-end frameworks, resulting in the scaled-up learning of core 3D vision paradigms. The breakthroughs in 3D vision in 2024 have injected new momentum into technological development, with future trends focused on several key directions. Spatiotemporal-consistent world models that integrate 4D spacetime and physical laws provide dynamic prediction and interaction support for complex scenarios, such as autonomous driving and embodied intelligence. It is expected that the deep integration of generative AI and 3D content generation technologies will overcome data bottlenecks, enabling the automated creation of high-fidelity and controllable 3D content. Enhanced cross-modal generalization capabilities will also strengthen the fusion of vision, language, and motion modalities, thus improving the adaptability and robustness of robotic strategies. Physics-driven dynamic reconstruction, combined with physical engines, will achieve high-precision modeling and interactive editing of dynamic scenes, thus advancing digital twins and virtual reality. Meanwhile, 3D imaging technologies will accelerate scientific exploration in astrophysics and cell biology. Efficient real-time processing and lightweight solutions will also boost the reconstruction and rendering efficiency of large-scale dynamic scenes, promoting edge device applications. Furthermore, ethical and privacy protection will emerge as critical concerns, balancing innovation and security through encryption technologies and regulatory frameworks that govern 3D data acquisition and generation. In summary, this article reveals that, in the future, 3D vision will evolve toward an era of more intelligent and universal “spatial intelligence” in which technological breakthroughs will reshape human-computer interaction, scientific exploration, and industrial ecosystems. Therefore, to foster academic exchange, this article analyzes cutting-edge trends in 3D vision and highlights the top ten research breakthroughs of the year, thus offering insights and references for both academia and industry.
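The rendering efficiency attributed to 3DGS above comes from rasterizing depth-sorted Gaussians with front-to-back alpha blending; per pixel, the rendered color $C$ accumulates each overlapping Gaussian's color $c_i$ and opacity $\alpha_i$, attenuated by the transmittance left over from the Gaussians in front of it:

```latex
C = \sum_{i=1}^{N} c_i \, \alpha_i \prod_{j=1}^{i-1} \left( 1 - \alpha_j \right)
```

Because this sum is fully differentiable in the Gaussians' parameters, the representation can be optimized directly from posed images, which is what enables the cross-scale reconstruction results described above.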
Abstract: Three-dimensional (3D) visual perception and understanding are fundamental to numerous applications, including robotic navigation, autonomous driving, and intelligent human-computer interaction. As one of the most prominent research directions in computer vision, 3D vision has experienced rapid advancement, particularly with the rise of multimodal large language models (MLLMs). The integration of MLLMs with 3D visual data has unlocked unprecedented capabilities for understanding and interacting with the physical 3D world. These models bring distinct advantages, such as contextual learning, step-by-step reasoning, open-vocabulary support, and rich world knowledge, making them transformative tools in the field of 3D vision. This paper provides a comprehensive overview of the latest progress in 3D vision understanding driven by MLLMs. It begins by addressing the foundational representations of 3D visual data. From point clouds to 3D Gaussian splatting, the review systematically examines mainstream data representation methods, which form the backbone for intelligent processing and analysis of 3D visual information. These representations enable the integration of semantic, spatial, and structural information, serving as a critical basis for downstream tasks in 3D vision. Following this, the paper traces the evolution of MLLMs, starting from the development of large language models and their extension into multimodal systems. It highlights the emergence of vision-language models that synergize text and visual data, offering significant potential for advancing 3D vision. The combination of multimodal pretrained priors and 3D data representations has opened new avenues for cross-modal understanding and interaction. The ability of MLLMs to align multimodal features while reasoning across modalities has proven crucial for overcoming traditional limitations in 3D vision, such as sparse data, occlusion, and noise. The review then focuses on the methods for representing 3D visual data using MLLMs, providing an in-depth synthesis of current strategies. It discusses how MLLMs leverage pretrained knowledge to interpret 3D information, facilitate multimodal feature alignment, and enhance the contextual understanding of complex 3D scenes. For instance, MLLMs enable the integration of 3D data with semantic priors, improving the efficiency and accuracy of tasks such as 3D object recognition, scene reconstruction, and spatial reasoning. In the context of specific 3D vision tasks, this paper explores various applications of MLLMs, including 3D generation and reconstruction, 3D object detection, semantic segmentation, scene description, language-guided 3D object localization, and 3D scene question answering. These tasks demonstrate the transformative influence of MLLMs on 3D vision, showcasing their ability to elevate task performance by incorporating multimodal capabilities. For example, 3D generation tasks benefit from the contextual knowledge of MLLMs, enabling the creation of semantically coherent and visually accurate 3D content. Similarly, 3D object detection and segmentation tasks leverage the reasoning capabilities of MLLMs to identify and classify objects in complex scenes more effectively. The role of MLLMs extends beyond traditional 3D vision tasks to applications in embodied robotic intelligence systems. Examples include robotic 3D grasping and 3D visual navigation, where MLLMs facilitate spatial understanding and decision making in dynamic environments. 
By integrating multimodal reasoning with 3D perception, MLLMs enable robots to perform language-guided manipulations, navigate cluttered spaces, and interact seamlessly with the physical world. The paper also provides a detailed examination of datasets that support research in this domain. It reviews fundamental 3D datasets, such as those designed for point cloud analysis, voxel-based representations, and mesh modeling, alongside multimodal 3D vision-language datasets. These datasets are analyzed in terms of their scale, diversity, and application scope, offering a robust foundation for training and evaluating MLLMs in various 3D vision tasks. However, the lack of diverse and representative datasets remains a significant challenge, limiting the generalizability and robustness of existing models. Building on these foundations, the paper addresses key challenges and future research priorities in MLLM-driven 3D vision understanding. Major challenges include unifying 3D representation formats, improving the spatial reasoning capabilities of MLLMs, enhancing the generalization of MLLMs for diverse 3D data processing, and addressing practical deployment mechanisms for multimodal models in real-world 3D vision tasks. In addition, the paper highlights the need for efficient model miniaturization for edge-side applications, such as autonomous robots and drones, and explores the potential of cloud-edge collaborative frameworks to enhance 3D vision tasks in resource-constrained environments. The future research directions proposed in this paper aim to address these challenges and further advance the field. Key priorities include developing scalable and efficient 3D data representations, designing domain-adaptive MLLMs, and creating comprehensive multimodal benchmarks that reflect real-world complexities. Moreover, the paper emphasizes the importance of fostering innovation in multimodal reasoning frameworks, enabling MLLMs to interpret, generate, and interact with 3D data seamlessly. Exploring the integration of real-time data streams with multimodal pretraining strategies can also provide valuable insights into dynamic 3D environments. By offering a comprehensive analysis of the progress, challenges, and future directions in this field, the paper underscores the transformative potential of MLLMs for 3D vision understanding. It highlights the necessity of leveraging these models to bridge the gap between perception and reasoning, enabling systems to interact effectively with the complex 3D world. Findings emphasize the importance of addressing existing limitations in 3D vision and pave the way for the continued evolution of artificial intelligence in spatially complex and multimodal environments. This review aims to inspire further exploration and expansion of MLLMs in 3D vision understanding, providing a roadmap for future research in this domain. Through the synthesis of state-of-the-art developments, the paper lays a foundation for advancing spatial intelligence, fostering deeper integration between AI systems and the physical world. Ultimately, the analysis highlights the potential of MLLMs to revolutionize 3D vision tasks, promoting their broad application across industries and accelerating the progress of intelligent systems in complex, multimodal scenarios.
关键词:3D vision;multimodal large model;3D visual representation;3D vision generation;3D reconstruction;robot 3D vision;3D scene understanding
摘要:Simultaneous localization and mapping (SLAM) has undergone a profound evolution, transitioning through various stages and methodologies, each of which has contributed significantly to advancements in accuracy, robustness, and applicability across diverse scenarios. This paper provides a comprehensive exploration of the historical development and current trends in SLAM, with a particular focus on the progression from manual feature extraction to the adoption of modern deep learning and 3D graphics-based approaches. In the early stages of SLAM development, the process relied heavily on manual feature extraction, where visual features were carefully selected and extracted by human operators to facilitate localization and mapping tasks. Although this method proved effective in relatively simple environments, it was highly susceptible to the complexities of more dynamic scenes and variations in illumination. The dependency on human intervention for feature selection not only limited the scalability of these systems but also constrained their robustness in dynamic or unpredictable environments. The introduction of visual SLAM marked a pivotal advancement in the field. By leveraging the rapid progress in computer vision technologies, such as improved feature matching algorithms and visual odometry, visual SLAM systems significantly enhanced both the robustness and accuracy of SLAM. These innovations enabled SLAM systems to perform more reliably across a wide array of real-world environments, thereby paving the way for more automated and efficient approaches to SLAM. The integration of deep learning into SLAM methodologies represents a paradigm shift in scene understanding and reconstruction. One of the most notable advancements in this area is the emergence of neural radiance fields (NeRF)-based SLAM methods. NeRF-based approaches can model dense depth and color information with unprecedented accuracy, providing a more detailed and precise understanding of the environment. However, these methods are not without their challenges, particularly in terms of computational efficiency and real-time performance. Such challenges are especially pertinent in applications where rapid data processing and immediate response are crucial. In response to the limitations of existing SLAM methods, 3D Gaussian splatting (3DGS) technology has emerged as a promising alternative. SLAM methods based on 3DGS offer significant improvements in rendering speed and high-fidelity scene reconstruction. This technology enhances the speed and quality of spatial data processing, making it particularly well-suited for applications such as augmented reality and autonomous navigation systems. The inherent robustness of 3DGS-based SLAM methods makes them especially effective in large-scale environments and scenarios characterized by dynamic changes. The categorization of SLAM methodologies can be further refined by considering the types of sensory inputs and the specific application needs they address. Initially, SLAM methods utilizing simple RGB and RGB-D sensors were primarily focused on capturing visual and depth information, which proved effective in controlled environments, particularly indoors. These methods excel in scenarios with ample lighting, well-defined textures, and minimal occlusions. However, in more challenging conditions — such as outdoor environments, low-texture or textureless surfaces, and areas with significant lighting variations — RGB and RGB-D SLAM methods often face substantial limitations. 
For instance, in outdoor environments, changes in lighting, shadows, or exposure to direct sunlight can severely degrade the performance of these methods, leading to inaccuracies in localization and mapping. Similarly, in textureless regions like white walls or glass surfaces, RGB-D sensors struggle to capture sufficient visual features, resulting in failed or unreliable reconstructions. These limitations underscore the need for more robust SLAM solutions that can operate effectively across a broader range of conditions. Multimodal SLAM approaches have been developed to address these challenges by integrating data from multiple sensors, such as LiDAR, thermal cameras, and inertial measurement units (IMUs). By combining visual data with other sensory inputs, multimodal SLAM systems can overcome the weaknesses inherent in RGB and RGB-D-based methods. For example, LiDAR can provide accurate depth measurements even in low-light or textureless environments, whereas thermal cameras can detect heat signatures, aiding in environments where visual data are insufficient or unreliable. Moreover, IMUs contribute to maintaining accurate localization in scenarios with rapid motion or poor visual conditions by providing supplementary motion and orientation data. In scenarios where precise geometric reconstruction and higher-level environmental understanding are required, recent advancements have combined semantic information with SLAM methods based on 3DGS technology. By incorporating semantic cues, these methods not only enhance the accuracy of scene reconstruction and localization but also provide crucial perceptual information for downstream tasks such as robotic navigation, augmented reality, and embodied intelligence. This integration allows SLAM systems to interpret and adapt to complex, dynamic environments more effectively, making them suitable for a wide range of real-world applications, from indoor mapping to outdoor navigation in varied lighting and textural conditions. The nuanced understanding of when to deploy RGB or RGB-D SLAM versus multimodal SLAM is critical for optimizing the performance and applicability of these systems. Although RGB and RGB-D methods are efficient and effective in controlled, well-lit indoor environments, multimodal SLAM approaches are indispensable for applications in outdoor, textureless, or dynamically changing environments where robustness and adaptability are paramount. Despite the significant advancements in SLAM technology, current 3DGS-based methods still face several challenges. These include issues related to scalability in large-scale scenes, adaptability to dynamic environments, and the optimization required for real-time performance. Addressing these challenges is a key focus of ongoing research, which aims to integrate deep learning techniques with traditional geometric methods. Such integration is expected to further enhance the overall performance and versatility of SLAM systems. Moreover, the establishment of unified evaluation benchmarks is crucial for standardizing performance metrics across different SLAM methodologies. Such benchmarks will facilitate greater transparency and comparability in research outcomes, thereby driving further innovation in the field. In conclusion, the evolution of SLAM methodologies — from manual feature extraction to deep learning and 3D graphics-based approaches — has significantly advanced the capabilities of SLAM systems. 
By examining the historical developments, current methodologies, and future research directions, this paper provides researchers and engineers with comprehensive insights into the complexities and opportunities associated with advancing SLAM technology. Continued innovation and interdisciplinary collaboration will be essential in driving further advancements, enabling SLAM systems to fulfill their potential across a wide range of practical applications.
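As background for the NeRF-based SLAM methods reviewed above, which optimize dense color and depth through differentiable rendering, the discrete volume rendering approximation from the original NeRF formulation can be written as follows (this is generic background notation, not the formulation of any single SLAM system discussed here):

\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)

where \sigma_i and \mathbf{c}_i are the density and color predicted at the i-th sample along camera ray \mathbf{r}, and \delta_i is the spacing between adjacent samples; replacing \mathbf{c}_i with the sample depth t_i yields the dense depth estimate that NeRF-based SLAM systems align against sensor observations.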
关键词:simultaneous localization and mapping (SLAM);neural radiance fields (NeRF);Gaussian splatting (GS);RGB(-D);multimodal;semantic information
摘要:The 3D real scene forms the spatial foundation and provides a unified spatial positioning framework and analysis basis for digital China. According to content and hierarchy, 3D real scene can be categorized into terrain, city, and component levels. The city-level 3D real scene is mainly composed of 3D mesh models derived from oblique photography, LiDAR (light detection and ranging) point clouds, and texture images, which are semantically processed and integrated with real-time perception data. Urban 3D mesh models are primarily composed of vertices, edges, triangular faces, and texture images. Compared to point clouds, 3D mesh models not only display more detailed information about objects but also allow for easy control of the level of detail by adjusting the parameters of the 3D mesh model. The focus of 3D real scene is on the digital mapping of production and living spaces, which can assist in the fine-grained management of cities and serve intelligent urban planning and construction. Interpreting 3D mesh models of urban scenes, such as semantic and instance segmentation, is a crucial step in constructing city-level 3D real scene. Currently, the semantic and instance segmentation of urban 3D mesh models primarily involves manually drawing object outlines and cutting out each individual object from the 3D mesh model along object boundaries, followed by assigning semantic information. However, urban 3D mesh models are typically represented in a tile-based format, and cross-tile cutting can easily lead to issues such as fragmentation, seams, and discontinuities in the model. In recent years, deep learning technology has seen rapid development. Deep neural networks, due to their ability to learn discriminative high-level semantic features from given datasets, have been widely applied to the interpretation of image data and three-dimensional data (such as point clouds and 3D meshes). In addition, with the continuous improvement in the performance of graphics processing units and the expansion of annotated datasets, the accuracy of deep neural networks in interpreting 2D images and 3D data has significantly improved. Despite significant progress in the interpretation of 3D mesh models, most of these studies have focused on small-scale, toy, and simulated 3D mesh models. Research on deep neural networks for interpreting complex urban 3D mesh models is still in its early stages and faces many challenges, primarily in the following three aspects. 1) Urban 3D mesh models are often irregular and may contain holes or be nonwatertight, making it difficult to apply traditional deep neural networks directly to these models to extract highly discriminative features. 2) The efficiency of multiscale feature extraction is low. Traditional 3D mesh simplification methods (such as quadric error metrics), which are used to generate hierarchical 3D meshes, rely on greedy strategies that are difficult to parallelize (see the formulation below). When processing large-scale urban 3D mesh models, these methods inevitably increase the computational burden. 3) Compared to benchmark datasets for images and point clouds, publicly available benchmark datasets for urban 3D mesh models are scarce. The structures of buildings, roads, and vegetation in urban scenes are complex and varied, so annotating urban 3D mesh models not only requires specialized knowledge but also consumes a significant amount of time and human resources.
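To make the parallelization difficulty noted in challenge 2) concrete, recall how the quadric error metric of Garland and Heckbert drives simplification. Each vertex v accumulates a quadric from the planes of its incident faces, and contracting an edge (v_1, v_2) to a new homogeneous position \bar{v} costs

\Delta(\bar{v}) = \bar{v}^{\top}\left(Q_{v_1} + Q_{v_2}\right)\bar{v}, \qquad Q_v = \sum_{p \in \mathrm{planes}(v)} p\,p^{\top}

where p = (a, b, c, d)^{\top} encodes an incident face plane ax + by + cz + d = 0 with a^2 + b^2 + c^2 = 1. The simplifier repeatedly contracts the lowest-cost edge, and it is precisely this greedy, globally ordered sequence of contractions that resists parallelization on large urban meshes.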
Compared to the intelligent interpretation of images and point clouds, the application of deep neural networks to the interpretation of urban 3D mesh models started later but has still seen rapid development. However, few review articles currently explore and systematically summarize how different deep neural network architectures achieve the interpretation of urban 3D mesh models. Therefore, this paper aims to systematically review and summarize existing deep neural network methods for interpreting urban 3D mesh models and highlight the open challenges currently faced by researchers, providing a reference for future research. For this purpose, we initially survey the vast literature and categorize the intelligent interpretation methods for urban 3D mesh models into three classes, according to the types of representations used in processing urban 3D mesh models. 1) Methods based on multiview images project 3D mesh models into 2D images from multiple viewpoints and use well-established 2D image deep learning methods to learn discriminative semantic features from the projected images. Subsequently, the semantic features learned from the projected images are mapped back to the 3D mesh models. 2) Methods based on center-of-gravity (COG) point cloud representation convert each face of the urban 3D mesh model into its COG point, thereby abstracting the entire 3D mesh model into a COG point cloud (a minimal sketch of this conversion follows this abstract). Intelligent interpretation algorithms designed for point clouds are then used to process these COG point clouds. Unlike traditional point clouds, COG point clouds can inherit rich texture and geometric information from the urban 3D mesh model. 3) Methods based on 3D mesh elements define learnable operations (such as convolution and pooling) directly on the 3D mesh elements (vertices, edges, triangular faces). This approach allows for the direct learning and extraction of rich high-level semantic features from the urban 3D mesh model, thereby avoiding the information loss that can result from preprocessing steps such as multiview image projection and COG point cloud abstraction. Subsequently, we conduct a detailed comparison of the three categories of methods and outline their current challenges. Furthermore, we summarize commonly used datasets for the intelligent interpretation of urban 3D mesh models and compare the interpretation performance of different methods on these datasets. Finally, based on the systematic survey and comprehensive performance comparison, we discuss promising future research directions from aspects such as dataset creation, 3D large model construction, and application scenarios.
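As referenced in category 2) above, the center-of-gravity abstraction can be summarized in a few lines. The sketch below is a minimal illustrative conversion, with an assumed array layout and optional per-face colors standing in for the richer texture and geometric attributes that actual methods inherit from the mesh:

    import numpy as np

    def mesh_to_cog_point_cloud(vertices, faces, face_colors=None):
        """Abstract a triangular mesh into a center-of-gravity (COG) point cloud.
        vertices: (V, 3) float array of vertex coordinates.
        faces: (F, 3) int array of vertex indices per triangular face.
        face_colors: optional (F, 3) per-face texture colors to inherit.
        """
        tri = vertices[faces]                    # (F, 3, 3) corner coordinates
        cogs = tri.mean(axis=1)                  # centroid (COG) of each face
        # Per-face normals preserve local geometric information.
        normals = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
        normals /= np.linalg.norm(normals, axis=1, keepdims=True) + 1e-12
        feats = [cogs, normals]
        if face_colors is not None:
            feats.append(face_colors)            # inherit texture information
        return np.concatenate(feats, axis=1)     # one feature row per face

The resulting per-face feature rows can then be fed to any point cloud interpretation network, and predicted labels map back to mesh faces one-to-one.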
关键词:digital China;three-dimensional real scene;deep learning;scene interpretation;urban three-dimensional mesh
摘要:In today’s era of rapid automation and technological advancement, unmanned systems are increasingly becoming a key area of strategic competition among major global powers. These new domains and capabilities of unmanned systems are not only key to supporting national security and strategic interests but also serve as the core force driving future technological innovation and application development. Unmanned systems are reshaping the boundaries of national security and redefining the connotations of strategic advantages. As a key component of unmanned systems, unmanned mobile visual technology is demonstrating its immense potential in assisting humans to gain a deeper understanding of the physical world. The advancement of this technology not only equips unmanned systems with richer and more precise perceptual capabilities but also offers humans new perspectives to observe, analyze, and ultimately master the complex and dynamic physical environment. In the early stages of unmanned mobile visual technology development, researchers mainly relied on traditional learning methods for image processing. These methods focused on manual feature extraction, which depended heavily on the experience and knowledge of domain experts. For instance, feature descriptors such as the scale-invariant feature transform (SIFT) and the histogram of oriented gradients (HOG) played crucial roles in tasks such as image matching and target detection (a minimal sketch of these descriptors follows this abstract). Although traditional visual analysis methods still hold value in specific situations, their dependence on manual feature extraction and professional knowledge limits efficiency and accuracy. With the advent of deep neural network technology, unmanned mobile visual technology has ushered in revolutionary progress. Deep neural networks, through automatic feature extraction and hierarchical structures, can learn feature representations ranging from simple to complex, allowing them not only to capture local image features but also to understand and interpret higher-level semantic information. Thus, these networks notably enhance the fitting and discriminative capabilities of models, offering advantages that traditional methods cannot match. Consequently, deep neural networks have become the benchmark for unmanned mobile visual technology. However, in practical applications, unmanned systems often encounter complex, diverse, and dynamically changing application scenarios, which present considerable challenges for the deployment and effectiveness of deep learning models. First, the complexity and dynamics of the imaging environment present notable problems for unmanned systems. Drastic changes in lighting, unpredictable weather conditions, and interference from other moving objects can degrade image quality, thereby affecting subsequent processing and analysis. Second, the high-speed maneuverability and camouflage strategies of imaging targets add another layer of difficulty for unmanned mobile visual systems. The rapid movement of targets complicates stable tracking, while camouflage and concealment make detection notably more difficult. These factors collectively reduce the accuracy of scene reconstruction, interpretation, and target identification in deep neural network-based unmanned mobile visual models. Furthermore, the diversity of imaging tasks introduces additional challenges. Different tasks often require tailored visual processing strategies, and the system must possess sufficient flexibility and adaptability to effectively handle different tasks.
However, current deep neural network models are often tailored for specific tasks, limiting their adaptability across diverse applications. The uncertainty and unpredictability of environmental factors impose demanding requirements on unmanned mobile visual systems. These systems need to offer precise perception and in-depth analysis to provide decision support, enabling automated systems to respond quickly and accurately to environmental changes, thus improving overall efficiency and reliability. In response to the visual challenges of unmanned systems in complex, dynamic environments, this article delves into the current state of development of unmanned mobile visual technology in addressing these challenges, focusing on five key technical areas: image enhancement, 3D reconstruction, scene segmentation, object detection, and anomaly detection. Image enhancement, being the first step, is crucial for improving the quality of visual data. This process improves the contrast, clarity, and color of images, providing highly reliable input for subsequent analysis and processing, which enhances the performance of unmanned systems under various environmental conditions. 3D reconstruction technology facilitates the recovery of three-dimensional structures from two-dimensional images, enabling unmanned systems to gain a more comprehensive understanding of the environment and making them better suited for tasks in complex settings. Scene segmentation involves partitioning an image into semantically meaningful regions or objects, providing a basis for precise environmental perception and target recognition. Object detection is central to unmanned mobile visual technology, enabling the system to locate and identify specific targets within images or video streams. In contrast, anomaly detection focuses on identifying anomalies or events in the scene, allowing unmanned systems to identify and respond to potential threats in a timely manner. This article provides an in-depth exploration of the research ideas, current status, and the advantages and disadvantages of typical algorithms for these key technologies, while also analyzing their performance in practical applications. The integration and collaboration of these technologies have substantially enhanced the visual perception capabilities of unmanned systems in dynamic and complex scenes, enabling them to perform tasks more intelligently and autonomously. Although some progress has been made in unmanned mobile visual technology, numerous problems are still encountered in its practical application within complex dynamic scenes. This review aims to provide a comprehensive perspective, systematically examining and analyzing the latest research advancements in unmanned mobile visual technology for such scenes. This paper explores the advantages and limitations of the above key tasks in practical applications. In addition, this paper discusses the gaps and challenges in current research and proposes possible future research directions. Through in-depth exploration of these research directions, unmanned mobile visual technology will continue to advance, offering more robust and flexible solutions to address the challenges posed by complex dynamic scenes. This progress will lay a solid foundation for the long-term development and practical application of unmanned systems in the fields of automation and intelligence.
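As referenced earlier in this abstract, the classical hand-crafted descriptors that preceded deep models remain a useful baseline. A minimal OpenCV sketch is given below; the file name is a hypothetical placeholder, and SIFT requires a recent OpenCV build in which it resides in the main module:

    import cv2

    # Hypothetical input frame from an unmanned platform.
    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

    # SIFT: scale-invariant keypoints with 128-D descriptors,
    # classically used for image matching.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)

    # HOG: dense gradient-orientation features, classically used
    # for target (e.g., pedestrian) detection.
    hog = cv2.HOGDescriptor()                      # default 64x128 window
    hog_features = hog.compute(cv2.resize(img, (64, 128)))

Both descriptors are fixed by design rather than learned, which is exactly the dependence on expert knowledge that deep networks later removed.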
关键词:unmanned mobile vision;complex dynamic scenes;image enhancement;3D reconstruction;scene segmentation;object detection;anomaly detection
摘要:Large-scale image and video datasets are indispensable for the development of computer vision algorithms, mainly because they provide the necessary resources to train and evaluate various models. Constructing such datasets for different computer vision tasks is a crucial but complex undertaking because it involves considerable challenges in data collection, annotation, and preservation of data diversity. Traditionally, acquiring large, high-quality image and video datasets has been a resource-intensive task, requiring manual labeling, data collection in real-world settings, and the use of specialized hardware for capturing high-quality images and videos. As deep learning methods increasingly rely on large-scale labeled data, the need for innovative data generation techniques has become more prominent. In recent years, generative models, such as generative adversarial networks (GANs) and diffusion models, have emerged as powerful tools for generating synthetic datasets. These models can create diverse, controllable, and highly realistic image and video data, offering an effective alternative or supplement to traditional data collection methods. By using these techniques, vast amounts of data can be generated to represent various scenarios and conditions, which are essential for training robust computer vision models. Generative models provide a flexible solution that can generate data without the need for real-world data acquisition, unlike traditional data collection, which is often constrained by geographic, financial, and logistical limitations. This review begins by introducing the significance and background of image and video data generation in computer vision. Image and video data play a critical role in the development and training of computer vision algorithms, as large-scale, diverse datasets are essential for building robust models. Moving on, the review categorizes the key data generation techniques into three broad approaches: traditional data augmentation methods, 3D rendering-based generation methods, and deep generative models. First, traditional data augmentation techniques, including geometric transformations, color adjustments, and cropping, are commonly used to improve model generalization by expanding existing datasets (a minimal augmentation pipeline is sketched at the end of this abstract). Although these methods are relatively simple and computationally inexpensive, their ability to generate diverse and realistic datasets is limited. In comparison, 3D rendering technologies, such as virtual engines and neural radiance fields (NeRF), enable the creation of highly realistic synthetic data by simulating real-world environments. These technologies have the advantage of generating diverse datasets by adjusting environmental factors, such as lighting, object interactions, and camera angles. Furthermore, deep generative models, such as GANs and diffusion models, have shown remarkable effectiveness in generating high-quality synthetic data. On the one hand, GANs work by training two neural networks in a competitive manner in which a generator creates synthetic data, while a discriminator evaluates its realism. Over time, as the generator improves its output, it creates increasingly realistic data. Diffusion models, on the other hand, iteratively refine noisy data into clear and realistic images or videos, enabling the generation of diverse, high-quality datasets. Next, the review discusses the diverse applications of these generative models across a wide range of computer vision tasks.
These tasks include image enhancement, object detection, tracking, pose or action recognition, biometric identification, crowd behavior analysis, and more recently, emerging fields like autonomous driving and embodied artificial intelligence. In particular, synthetic data have been instrumental in training models for tasks that are challenging to address using only real-world data. For example, in biometric identification, synthetic data can generate a wide variety of samples for fingerprints, faces, irises, and palmprints, thus providing more diverse training examples and reducing reliance on real biometric data, which are often difficult to acquire. Similarly, in autonomous driving, synthetic data can generate various driving scenarios, including different road conditions, weather patterns, and traffic behaviors, thereby training autonomous vehicle models in safe and controlled environments. In recent years, synthetic data have also been proven invaluable in fields like pose and action recognition, where diverse datasets are essential for accurately detecting human actions across different settings and contexts. However, despite the considerable progress made in image and video data generation, several challenges remain. One of the primary issues is ensuring the realism and diversity of generated data, which is crucial for training models that can generalize well to real-world scenarios. Furthermore, despite significant advances in generative models, there remains a lack of research on how to effectively evaluate the quality of synthetic data and use feedback mechanisms to guide the generation process. In addition, ethical considerations surrounding the use of synthetic data, especially in sensitive applications, such as biometric recognition, must be carefully addressed. Among others, the use of synthetic data raises concerns regarding privacy, consent, and potential misuse, which must be handled responsibly. Looking ahead, as generative models continue to evolve, they are expected to produce even more realistic and diverse datasets, thus offering new possibilities for training computer vision models. The future of image and video data generation holds great promise, with advancements in generative technologies poised to drive further innovation in computer vision, AI, and many other fields.
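As referenced above, traditional augmentation expands a dataset with label-preserving transformations. A minimal torchvision pipeline might look as follows; the specific operations and parameter values are illustrative choices, not recommendations drawn from any surveyed work:

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomResizedCrop(224),         # cropping + rescaling
        transforms.RandomHorizontalFlip(p=0.5),    # geometric transformation
        transforms.RandomRotation(degrees=15),     # geometric transformation
        transforms.ColorJitter(brightness=0.4,     # color adjustment
                               contrast=0.4,
                               saturation=0.4),
        transforms.ToTensor(),
    ])

Applying this pipeline to each training image yields a different randomized variant every epoch, enlarging the effective dataset without collecting new data, which is also why its diversity remains bounded by the source images, as noted above.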
关键词:computer vision;data generation and application;conventional data generation;3D rendering;deep generative model;image enhancement;individual analysis;biometric recognition;crowd analysis;autonomous driving;video generation;embodied artificial intelligence
摘要:The growth of data and model size has led to the emergence of new generative models, such as large language models and diffusion models, which produce increasingly high-quality and diverse results. These large generative models are driving the rapid development of artificial intelligence (AI)-generated content (AIGC). This article focuses on the core needs of the creative industry and reviews the technological and industrial developments in the AIGC 2D/3D field from 2023 to 2024. First, this article summarizes the development background of generative technologies and their market value. Second, regarding technical development in the AIGC 2D/3D field, the pace of technological evolution has been remarkably fast. The focus has transitioned from generative adversarial networks to denoising diffusion models and Transformer structures. These new structures have stronger expressive power, greater diversity, and more flexible control capabilities. In the AIGC 2D review section, the article categorizes image and video generation technologies into three types: “high-quality generation foundation”, “controllable generation technology”, and “editable generation technology”. AIGC 2D technology began around 2014 with models such as generative adversarial networks (GANs) and variational autoencoders (VAEs), which made breakthroughs in image generation. Researchers initially focused on generating high-quality 2D images, improving resolution, and enhancing details. By 2018, the focus shifted to better image quality, with StyleGAN advancing realism and diversity, thereby driving AIGC’s application in image processing and artistic creation. In 2020, diffusion models revolutionized 2D image generation, improving image quality and offering better control, such as text-to-image generation. By 2022, AIGC 2D technology evolved and expanded into advertising, content generation, and entertainment, with multimodal models such as CLIP and DALL-E enabling image creation from text descriptions. These innovations lowered the barrier for creators and improved generation quality. By 2024, AIGC 2D technology advanced to video generation, further expanding its applications. In the AIGC 3D review section, the article categorizes technologies based on “input data types”, “output data types”, and “generation methods”. AIGC 3D development began in 2022, shifting from combining 2D generative models with implicit 3D representations to 3D diffusion models and mixed/explicit 3D data. Early research focused on traditional 3D models, such as voxels, meshes, and point clouds. With the success of GANs and VAEs in 2D generation, these techniques were adapted to 3D. In 2023, breakthroughs in denoising diffusion models led to their application in 3D generation, showing potential for creating geometric structures and textures. However, the limitations of 2D diffusion models raised concerns about their ability to handle 3D data. By 2024, generative 3D technology matured, integrating with real-world applications and evolving from a 2D-based approach to a 3D-driven one. The use of multimodal models and natural language processing allowed users to generate complex 3D data from simple text or 2D images. This marked a shift from research to practical applications, with technological advancements and market applications progressing in tandem. Next, the article summarizes the current technical challenges and industry application issues faced by both types of technologies.
The main emphasis for the future development of AIGC 2D/3D technologies will be how to provide new technologies that better meet industry creation standards and needs. For AIGC 2D, precision in controlling the structure, details, and style of generated images or videos to meet user expectations remains a bottleneck. This is particularly challenging in video generation, where camera movements, object motion, and scene composition are harder to control. Moreover, the lack of flexibility for fine-tuning and editing generated content remains an issue, and achieving a high match between generated results and multimodal inputs continues to be a challenge. Furthermore, content quality, including realism, accuracy, and aesthetic appeal, needs significant improvement, especially in areas like hand and eye details and object motion continuity. AIGC 3D faces three key challenges: the gap between generated results and industry standards, differences in application requirements across industries, and the need to balance the divergent needs of industry experts and the general public. Industry-specific standards for 3D data, such as geometry features and texture resolution, demand high-quality outputs from AIGC 3D. Different industries require different technical adaptations of AIGC 3D technology. Furthermore, the differing needs of professional creators, who seek high precision, and general users, who prioritize ease of use and affordability, add complexity to AIGC 3D’s real-world applications. Finally, the article provides a summary of the past 20 years, showing how the creative industry has experienced a “spiral upward development” driven by technological progress. It also offers some thoughts and perspectives on the future trends of technological development.
关键词:artificial intelligence generated content (AIGC);AIGC 2D;AIGC 3D;survey;creative industry application;large language model (LLM)
摘要:Medical imaging, a crucial tool for medical practice, utilizes various imaging techniques to capture the internal structure and function of the human body. Common types of medical images include magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), plain X-rays, ultrasound, and optical imaging. The information obtained from these images varies due to differences in imaging principles. For example, MRI uses a strong magnetic field and radio waves to obtain images of the inside of the body and provides good information about soft tissues. CT uses X-rays and computerized processing to create images of cross-sections of internal body structures and is primarily used to image high-electron-density tissues (e.g., bone); however, its capability to contrast soft tissues is somewhat limited. PET uses tracers labeled with radioisotopes to observe biological processes and functional activities within the body and thus image specific biological functions. At the same time, depending on differences in imaging parameters and tracers, medical images of the same imaging type may also be divided into different subtypes, such as T1- and T2-weighted MRI sequences, FDG-PET, and A-PET. These medical imaging techniques provide visual information about the anatomical, physiological, and pathological states of the human body and play an important role in disease diagnosis, treatment, and prognosis prediction. Medical images of the same type or subtype are referred to as a single modality, and medical images that span different types or subtypes are referred to as multiple modalities. Given that various types or subtypes of medical images reflect different information about the patient’s body, multiple types/subtypes of medical images are often acquired to obtain more comprehensive information and improve diagnostic accuracy. However, multimodal image data acquisition faces difficulties such as long acquisition time, high cost, and a possible increase in radiation dose. Therefore, generative techniques should be used for cross-modal medical image synthesis, that is, using medical images of one or some modalities to generate medical images of another or some other modalities. Although cross-modal medical image synthesis can facilitate multimodal image diagnosis, some technical challenges exist. For example, some information that can be captured in the target modality does not exist in the source modality due to the different imaging principles of various imaging modalities. In this case, synthesized images of the target modality lack certain information, creating significant disparities in diagnostic performance between synthesized and real images, which can lead to clinical failure. At the same time, privacy and ethical issues also contribute to the high cost of acquiring high-quality multimodal medical image data and the problem of missing data in cross-modal medical image synthesis. In addition, differences in resolution, contrast, and image quality between different modalities affect the consistency of image generation models during the generation process. Addressing these inconsistencies in data across different modalities poses a significant challenge for cross-modal medical image synthesis.
The computational complexity and generalization ability of the model also need to be considered, as cross-modal medical image synthesis often requires complex models and substantial computational resources, which may limit the usefulness and scalability of cross-modal medical image synthesis methods. Beyond the training data that the model has already seen, whether the model can perform well on new or otherwise different datasets should also be considered. Most researchers start from the model itself and improve the quality of the synthesized images by improving the representation ability of the model or designing task-specific constraints. These developed cross-modal medical image synthesis techniques have been applied to image acquisition, reconstruction, alignment, segmentation, detection, and diagnosis, bringing new ideas and methods to many problems. This paper focuses on cross-modal image synthesis techniques and their applications in the field of medical imaging. We introduce existing cross-modal medical image synthesis techniques from three aspects: traditional synthesis methods, deep learning-based synthesis methods, and task-driven synthesis methods. Traditional synthesis methods usually divide an image into multiple small blocks and encode each block into a representation vector. A mapping is established between the paired block representation vectors of different modalities, and the corresponding target-modality block is then generated based on the encoding of the source-modality block. The random forest-based approach treats image synthesis as a regression problem, treating the value of the target-modality block, or its central region, as the dependent variable of the source-modality block; this relationship is learned through a regression model (a minimal sketch of this formulation follows this abstract). Dictionary learning-based methods assume that there exists a dictionary for each modality, that each image block can be obtained from a sparse representation of the elements in the dictionary, and that the image blocks corresponding to different modalities share the same dictionary encoding. Compared with traditional methods, deep learning-based cross-modal image synthesis methods can directly use large-scale parametric models to build mappings from source-modality images to target-modality images in an end-to-end manner. These methods automatically extract the representation features of an image or an image block in a data-driven manner without manually designing the representation features. Given their ease of implementation and superior performance, deep learning-based techniques have become the leading methods in cross-modal image synthesis. In this paper, we introduce them in four categories: simple CNN-based approaches, encoder-decoder network-based approaches, generative adversarial network-based approaches, and diffusion model-based approaches. Task-oriented cross-modal image synthesis methods recognize that each synthesis task carries a specific bias and introduce task-related designs on top of generalized techniques to encode this bias. In this manner, the synthesized image preserves more task-relevant information, achieving performance gains on the specific task. Such synthesis methods are presented in three categories: task-oriented biases, biases formed through network models, and image synthesis embedded in task models. Finally, we present the application scenarios of cross-modal medical image synthesis techniques and their use in typical advantageous tasks.
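To make the random forest formulation above concrete, the sketch below regresses target-modality intensities from flattened source-modality patches. It is a simplified 2D, single-image illustration with random placeholder data; real pipelines operate on registered 3D volumes with far richer features:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def extract_patches(image, size=5):
        """Collect flattened size x size patches and their center values."""
        r = size // 2
        X, y = [], []
        for i in range(r, image.shape[0] - r):
            for j in range(r, image.shape[1] - r):
                X.append(image[i - r:i + r + 1, j - r:j + r + 1].ravel())
                y.append(image[i, j])
        return np.array(X), np.array(y)

    # Registered training pair of the two modalities (placeholders).
    source = np.random.rand(64, 64)
    target = np.random.rand(64, 64)
    X_src, _ = extract_patches(source)   # source-modality patches (inputs)
    _, y_tgt = extract_patches(target)   # target-modality centers (outputs)

    # Each target center value is the dependent variable of a source patch.
    model = RandomForestRegressor(n_estimators=100).fit(X_src, y_tgt)

At test time, the trained forest predicts each target-modality pixel from the corresponding source-modality patch, and overlapping predictions are typically averaged.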
摘要:Cross-media analysis and reasoning are indispensable in numerous areas, notably Web content management and service provision. These processes involve analyzing and synthesizing information across various media forms, such as text, images, videos, and audio, to generate insights and make informed decisions. Despite their critical importance, current methodologies predominantly employ black-box architectures, which model complex semantics through end-to-end processes. Although effective, these black-box models present significant limitations, particularly in their inability to provide human-understandable explanations of their operational mechanisms or decision-making processes. This lack of transparency results in insufficient interpretability, traceability, and generalization capabilities, particularly when these models are tasked with handling intricate, cross-domain, heterogeneous, and multisource data. Significant advancements in cross-media analysis tasks have been achieved through the deployment of large pretrained models, such as large language models, multimodal large language models, and large multimodal models. These models utilize substantial data volumes and advanced architectural designs to achieve superior performance levels. However, their inherent black-box nature poses substantial challenges. The opacity of these models’ internal workings makes it difficult to understand how they arrive at specific conclusions or decisions. In addition, the timeliness constraints associated with their training data mean that these models may not be able to incorporate the latest information or adapt quickly to new developments. Consequently, their widespread application is often limited by these critical shortcomings. Conversely, knowledge graph technology offers a compelling alternative, which can be attributed to its structured, semantic, and scalable characteristics. Knowledge graphs organize information into interconnected nodes and edges, representing entities and their relationships in a way that is human-readable and machine-interpretable. This technology facilitates transparent, accurate, and traceable reasoning processes, significantly enhancing the interpretability, traceability, and generalization abilities of cross-media analysis and reasoning systems. By clearly defining the relationships and hierarchies within the data, knowledge graphs provide a clear and understandable pathway that explains the decision-making process. This explicitness enables users to follow and comprehend the reasoning process. To advance the research and application of cross-media analysis and reasoning, this article delves into the combination of knowledge graph technology and cross-media learning. The article is meticulously structured around three pivotal areas: cross-media knowledge graph construction, cross-media knowledge representation, and generalized reasoning with knowledge graphs. Specifically, the construction of cross-media knowledge graphs involves creating a seamless integration framework for data from various domains, which is often diverse and multisourced. Constructing these integrated graphs enables the representation of complex relationships and interactions across different media types in a unified and coherent manner, thereby facilitating a more comprehensive analysis. The next critical area is knowledge representation learning on these constructed graphs.
This learning process empowers effective cross-media analysis and reasoning by enabling systems to understand and utilize the rich, structured information embedded within the knowledge graphs. As a result, this capability facilitates more accurate and insightful analysis, as the system can leverage the explicit relationships and hierarchies defined within the graph to draw more precise conclusions. Knowledge representation learning involves sophisticated techniques to encode and utilize the structured data in knowledge graphs, making it possible for machines to interpret and reason with complex, interrelated data points (a minimal embedding example follows this abstract). As cross-media data continue to grow rapidly, extending the capabilities of existing techniques beyond the confines of the training data domain becomes increasingly important. To achieve generalizable reasoning on cross-media data, knowledge- and data-driven methods have been proposed. These methods aim to harness the strengths of knowledge graphs and machine learning to effectively handle new, unseen data. Knowledge-driven methods rely on the structured information within knowledge graphs to guide the reasoning process, whereas data-driven methods leverage large datasets and machine learning algorithms to learn patterns and make predictions. Combining these approaches enables the creation of robust systems capable of adapting to new data and evolving analytical requirements. This article provides an in-depth discussion of knowledge-driven cross-media analysis and reasoning, summarizing existing solutions that employ knowledge graphs to enhance transparency, accuracy, and traceability. It also identifies current challenges in cross-media knowledge graph research, such as integrating new data types and maintaining graph accuracy over time. The integration of new data types presents significant challenges, as it requires the development of methods to seamlessly incorporate diverse forms of information while maintaining the integrity and coherence of the knowledge graph. Meanwhile, outdated or incorrect information can lead to flawed reasoning and decision-making processes. Furthermore, the article explores the emerging opportunities and challenges posed by the rapid development of large pretrained models. It highlights potential future research directions, such as combining the strengths of knowledge graphs and large pretrained models to create more robust and interpretable cross-media analysis systems. Integrating the explicit, structured information from knowledge graphs with the advanced capabilities of large pretrained models can potentially lead to systems that are highly effective and easily interpretable. This hybrid approach leverages the strengths of both domains, combining the transparency and structure of knowledge graphs with the powerful pattern recognition and predictive abilities of large pretrained models. By addressing these areas, the research aims to foster the development of more transparent, accurate, and generalizable cross-media analysis and reasoning technologies. This comprehensive approach not only enhances current methodologies but also lays the groundwork for future advancements in the field. It paves the way for the development of more reliable and understandable systems in cross-media analysis and reasoning, ultimately contributing to the creation of more sophisticated and user-friendly tools for Web content management and services.
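For readers unfamiliar with knowledge representation learning, translation-based embeddings are a classical illustration of how graph structure becomes machine-usable vectors. The sketch below shows a TransE-style scoring function; TransE is cited here purely as a well-known textbook example, not as the method of any specific system discussed in this abstract:

    import torch

    def transe_score(h, r, t, p=1):
        """TransE plausibility score for a triple (head, relation, tail):
        embeddings of true facts should satisfy h + r ≈ t, so a smaller
        distance ||h + r - t|| means a more plausible triple."""
        return -torch.norm(h + r - t, p=p, dim=-1)

    # Toy randomly initialized embeddings for an illustrative triple,
    # e.g., (Paris, capital_of, France); real systems learn these
    # vectors with a margin-based ranking loss over the whole graph.
    dim = 50
    h, r, t = (torch.randn(dim) for _ in range(3))
    print(transe_score(h, r, t))

Once trained, such embeddings let a system rank candidate tails for a query (h, r, ?), which is one simple mechanism behind the generalizable, traceable reasoning described above.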
摘要:With the rapid advancement of artificial intelligence, the emergence of large language models (LLMs) and multimodal large language models (MLLMs) has profoundly affected optical character recognition (OCR), bringing about a paradigm shift in traditional OCR methods. This study systematically reviews recent developments in OCR and multimodal learning, emphasizing the latest applications and advancements of large OCR models in multimodal and multitask unified modeling. First, the scope of large OCR models is defined, and the models are categorized primarily into OCR multimodal large language models (OCR-MLLMs) and Omni-OCR models. OCR-MLLMs employ pretrained LLMs and supervised fine-tuning datasets, such as QA-based tasks, to learn from vast multimodal data across various OCR scenarios, leading to specialized multimodal OCR models that can perform diverse recognition and comprehension tasks. Conversely, the Omni-OCR model unifies multiple tasks within a general architecture and uses large-scale parameterization to learn generalized OCR capabilities from extensive multitask datasets. Specifically, this review covers four key aspects. 1) OCR-enhancing MLLMs: Early models, such as LLaVA and MiniGPT-4, exhibit basic OCR capabilities but lag behind specialized OCR systems. Researchers have improved OCR performance by introducing specialized OCR datasets, as exemplified by the Qwen-VL and LLaVA-1.5 models. Another critical direction involves enhancing models’ ability to process high-resolution images. Approaches such as Monkey and InternLM-XComposer2-4KHD implement subimage cropping strategies (a minimal sketch appears later in this abstract), and Qwen2-VL adopts a ViT architecture with 2D rotational positional encoding to enable image encoding at arbitrary resolutions while mitigating semantic fragmentation caused by cropping. 2) MLLMs for Document Understanding: MLLMs for document understanding can be categorized into OCR-free and OCR-dependent approaches. OCR-free methods eliminate traditional OCR preprocessing, directly processing document images for end-to-end understanding. Notable breakthroughs include the generation of synthetic dialogue training data via LLMs or MLLMs. For instance, TextSquare combines OCR annotations with images to construct extensive cross-domain visual question-answering datasets. Moreover, advancements in cross-dataset integration and training-task designs have emerged. Examples include mPLUG-DocOwl, which consolidates diverse instructional datasets (e.g., documents, tables, charts, webpages, and natural images); Fox, which develops datasets for region-based OCR, translation, summarization, layout analysis, and dialogue; and DOGE, which constructs multigranularity document parsing datasets, where full-page parsing tasks enhance models’ comprehensive perception of document content. Furthermore, innovations in visual encoding architectures tailored to document characteristics have been introduced. Examples include UReader’s shape-adaptive cropping module, TextMonkey’s shifted window attention with token resampling, and Vary’s dual vision vocabularies. OCR-dependent approaches enhance accuracy by integrating OCR outputs into model architectures. Examples include LayoutLLM, which incorporates LayoutLMv3’s OCR features; DocLLM, which embeds layout information into attention mechanisms; and DocLayLLM, which achieves efficient multimodal extension of LLMs by inserting 2D positional tokens and leveraging chain-of-thought pretraining with annealing techniques.
3) Specialized Multimodal Large Models for OCR Tasks: Specialized multimodal large models focus on specific OCR-related tasks, including chart analysis, table parsing, and multipage document understanding, thereby addressing the limitations of general-purpose models. In chart understanding: MMCA and ChartLlama utilize GPT-4-generated instruction data; ChartAssistant-S adopts chart-to-table pretraining to enhance visual representation; TinyChart employs code generation and visual token aggregation strategies; ChartMoE, with its mixture-of-experts architecture, mitigates catastrophic forgetting. In table parsing: Table-LLaVA curates extensive pretraining and fine-tuning datasets to improve performance; TabPedia integrates dual-resolution encoders and meditative tokens for adaptive multimodal feature fusion. Other models specialize in scientific table parsing through a two-stage fine-tuning approach. In document retrieval and multipage understanding, advancements fall into two primary categories. The first is retrieval-augmented architecture design, where CREAM offers a hierarchical retrieval strategy that refines results progressively from coarse to fine granularity, and PDF-WuKong develops an end-to-end sparse sampling method that directly filters relevant content within the model itself. The second is multimodal indexing paradigms, in which DSE and ColPali leverage multimodal large models to encode document images and queries directly, eliminating the need for OCR-based document parsing. This approach preserves layout and visual information, thereby avoiding OCR-induced errors. 4) Omni-OCR Model: The Omni-OCR model, which was introduced before LLMs, was developed across four primary architectures. Document understanding pretrained models (e.g., LayoutLM series) integrate text, spatial, and visual features to support tasks such as key information extraction and Q&A. Pix2Seq models (e.g., Donut and UDOP) perform multitask processing without relying on OCR information; they directly use image inputs combined with prompt mechanisms. OmniParser jointly models detection, recognition, and information extraction through a staged decoding process. Document parsing models (e.g., Nougat, KOSMOS-2.5, and GOT) transform documents into structured formats, such as HTML or Markdown. Furthermore, pixel-level unified models (e.g., DocRes and UPOCR) employ encoder-decoder architectures with task prompts to facilitate multitask training and knowledge transfer for pixel-level tasks. Despite these remarkable advancements, current large OCR models still face key challenges, including performance gaps in complex layout parsing, handwriting recognition, and historical manuscript analysis compared with specialized traditional models; inefficiencies related to high-resolution inputs, inference latency, and parameter redundancy, which hinder practical deployment; and insufficient fine-grained perception and logical reasoning in complex scenarios. Future developments should focus on four key areas: fine-grained feature extraction and document structural modeling for enhanced comprehension, large-scale self-supervised pretraining to improve generalizability, chain-of-thought reasoning for multimodal logic analysis and emerging tasks (e.g., mathematical reasoning and multilingual processing), and model lightweighting via visual token compression and dynamic computation allocation to balance accuracy and efficiency.
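Returning to the subimage cropping strategy referenced in aspect 1), the idea can be reduced to tiling a high-resolution page into encoder-sized crops plus a global thumbnail. The sketch below is a deliberately simple illustration; the tile size is an arbitrary assumption, and production models such as Monkey use more elaborate adaptive schemes:

    from PIL import Image

    def crop_for_encoder(image: Image.Image, tile: int = 448):
        """Split a high-resolution document image into tile-sized crops
        plus a downsampled global view for a fixed-resolution encoder."""
        w, h = image.size
        crops = []
        for top in range(0, h, tile):
            for left in range(0, w, tile):
                crops.append(image.crop((left, top,
                                         min(left + tile, w),
                                         min(top + tile, h))))
        # The global thumbnail preserves page-level layout context
        # that individual crops lose.
        crops.append(image.resize((tile, tile)))
        return crops

Each crop is encoded independently and the resulting token sequences are concatenated, which is also the source of the cross-crop semantic fragmentation that resolution-flexible encoders such as Qwen2-VL's aim to avoid.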
In practical applications, large OCR models exhibit remarkable potential in intelligent document processing, automated testing, digital education, historical document restoration, and oracle bone script decipherment. Since the early 21st century, OCR technology has been undergoing a paradigm shift from character recognition to semantic understanding. With the emergence of MLLMs, OCR systems are evolving from text transcription tools into intelligent document understanding platforms, continuously expanding their capabilities to cross-modal reasoning, dynamic interaction, and intelligent decision support. As a foundational AI-driven technology, large OCR models are expected to play a pivotal role in driving digital transformation across industries, thus accelerating the intelligent evolution of document processing, text-image understanding, cultural heritage preservation, finance, and education. They can provide robust technical support for knowledge management and innovation at all levels of society.
关键词:large language model(LLM);multimodal large language model(MLLM);optical character recognition(OCR);document processing;document understanding
摘要:With the rise of deep learning technology, artificial intelligence has progressed from shallow machine learning to deep learning and from small-scale data learning to big data learning. In recent years, with the continuous growth of data and computing resources, the scale of deep learning models has kept increasing, and large-scale pretrained models (large models) have begun to emerge. In general, any model that is trained on large-scale extensive data (usually via large-scale self-supervised training) and can be adapted (e.g., through fine-tuning) to a wide range of downstream tasks can be referred to as a large model. Meanwhile, a typical large model that is trained by utilizing multimodal information, such as text, images, videos, and audio, and aims to complete diverse multimodal application tasks is known as a multimodal large model. In the past two years, domestic and foreign multimodal large models, such as ChatGPT, Llama, and Qwen, have achieved remarkable success. As multimodal large models continuously evolve, research on their security has become a focus in the field of artificial intelligence. Given the superior processing abilities of these models in various multimodal tasks, their security issues may have harmful consequences. Large models are built with deep neural networks as their core, so they encounter security risks similar to those faced by deep neural networks. In addition, given the unique complexity of large models and their wide range of applications, they face unique security risks. In this study, we systematically summarize the security risks associated with multimodal large models, including adversarial attacks, jailbreak attacks, backdoor attacks, copyright theft, hallucination phenomena, generalization issues, and bias problems. In adversarial attacks, attackers construct small yet deceptive adversarial examples that cause large models to make misjudgments when fed these adversarially perturbed inputs. Jailbreak attacks exploit the complex structure of large models to bypass or destroy the original security constraints and defense measures, enabling the models to perform unauthorized operations or even leak sensitive data in their outputs. Backdoor attacks involve implanting hidden triggers during the training phase of large models, causing the models to exhibit attacker-intended behaviors under specific conditions. Meanwhile, unauthorized thieves may distribute or use large models for commercial purposes without the consent of the model owners, causing losses to the copyright owners of the models. The hallucination phenomenon refers to inconsistency between a large model's output and its input. The generalization problem refers to the inability of large models to deal with new data distributions or styles. The bias of large models on sensitive attributes, such as gender, race, skin color, and age, may lead to ethical problems, which may in turn produce severe consequences. After presenting these security risks, we introduce the corresponding solutions to each of them. By presenting the progress of research on the security risks of multimodal large models and the corresponding solutions, this study aims to provide a unique perspective for understanding and addressing the unique security challenges of multimodal large models, promote the development of security technologies for multimodal large models, and guide the future direction of related security technology development.
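As a concrete illustration of the adversarial attack described above, the following is a minimal sketch of the classic fast gradient sign method (FGSM), one textbook way such perturbations are generated; it is not a method proposed in the reviewed work, and the model, input, and label are stand-ins.

```python
# Minimal FGSM sketch: a small, sign-of-gradient perturbation bounded by
# epsilon is added to the input to push the model toward a misjudgment.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Generate an adversarial example within an L-infinity ball of radius epsilon."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # One signed-gradient step, then clamp back to the valid pixel range.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)            # stand-in input image
y = torch.tensor([3])                   # stand-in ground-truth label
x_adv = fgsm_attack(model, x, y)
print((x_adv - x).abs().max())          # perturbation stays within epsilon
```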
In conclusion, multimodal large models demonstrate excellent performance in many tasks and applications and provide valuable assistance in people's work and daily lives. Through this research on the security technologies of multimodal large models, we hope to help ensure the safety and reliability of these models, thereby safeguarding their use in everyday work and life.
关键词:multimodal large model;large model security;adversarial example(AE);jailbreak attack;backdoor attack;copyright theft;model hallucination;model bias
摘要:Deep learning has revolutionized the field of computer vision over the past two decades, bringing unprecedented advancements in accuracy and speed. These developments are vividly reflected in fundamental tasks such as image classification and object detection, where deep learning models have consistently outperformed traditional machine learning techniques. The superior performance of these models has led to their widespread adoption across various critical applications, including facial recognition, pedestrian detection, and remote sensing for earth observation. As a result, deep learning-based computer vision technologies are increasingly becoming indispensable for the continuous evolution and enhancement of intelligent vision systems. Despite these remarkable achievements, the robustness and reliability of deep learning models have come under scrutiny due to their vulnerability to adversarial attacks. Researchers have discovered that by introducing carefully designed perturbations—subtle modifications that may be imperceptible to the human eye—the decision-making processes of these models can be significantly disrupted. These adversarial attacks are not only theoretical constructs. They have practical implications that can potentially undermine the trustworthiness of deep learning systems deployed in real-world scenarios. One of the most concerning developments in this area is the emergence of physical adversarial attacks. Different from their digital counterparts, physical adversarial attacks involve perturbations that can be applied in the real world using common objects or natural phenomena encountered in daily life. For instance, a strategically placed sticker on a road sign might cause an autonomous vehicle’s vision system to misinterpret the sign, leading to potentially dangerous consequences. These attacks are particularly worrisome because they can deceive not only deep learning models but also human observers, thereby posing a more realistic and severe threat to the integrity of computer vision systems. In light of the growing significance of physical adversarial attacks, this paper aims to provide a comprehensive review of the state of the art in this field. By analyzing 114 selected papers, we seek to offer a detailed summary of the methods used to design physical adversarial attacks, focusing on the general designing process that researchers follow. This process can be broadly divided into three stages: the mathematical modeling of physical adversarial attacks, the design of performance optimization processes, and the development of implementation and evaluation schemes. In the first stage, mathematical modeling, researchers aim to define the problem and establish a framework for generating adversarial examples in the physical world. This involves understanding the underlying principles that make these attacks effective and exploring how physical characteristics, such as texture, lighting, and perspective, can be manipulated to create adversarial examples. Within this stage, we categorize existing attacks into three main types based on their application forms: 2D adversarial examples, 3D adversarial examples, and adversarial light and shadow projection. Typically, 2D adversarial examples involve altering the surface of an object, such as applying a printed pattern or sticker, to fool a computer vision model. 
These attacks are often used in scenarios like natural image recognition and facial recognition, where the goal is to create perturbations that are inconspicuous in real-world settings but highly disruptive to machine learning algorithms. Taking the concept further, 3D adversarial examples consider the three-dimensional structure of objects. For example, modifying the shape or surface of a physical object can create adversarial examples that remain effective from multiple angles and under varying lighting conditions. Adversarial light and shadow projection represents another innovative approach, where the manipulation of light sources or shadows creates perturbations. These attacks are often more challenging to detect and defend against because they do not require any physical alteration of the object itself. Instead, they exploit the way light interacts with surfaces to generate adversarial effects. This method has shown potential in indoor and outdoor scenarios. We also introduce their applications in five major scenarios: natural image recognition, facial image recognition, autonomous driving, pedestrian detection, and remote sensing. In the design of performance optimization processes, existing adversarial attacks mainly face two core problems: reality bias and the high degree of freedom of observation. We introduce the solutions and key technologies proposed for these core problems in the existing literature. In the design of implementation and evaluation schemes, we introduce the platforms and indicators used in existing works to evaluate the interference performance of physical adversarial examples. Finally, we discuss highly promising research directions in physical adversarial attacks, particularly in the context of intelligent systems based on large models and embodied intelligence. This area of exploration can reveal critical insights into how these sophisticated systems, which combine extensive data processing capabilities with interactive and adaptive behaviors, can be compromised by physical adversarial attacks. In addition, studying physical adversarial attacks on hierarchical detection systems that integrate data from multiple sources and platforms holds significant potential. Understanding the vulnerabilities of such complex, layered systems can lead to more robust and resilient designs. Finally, the prospects of advancing defense technology against physical adversarial attacks are crucial. Developing comprehensive and effective defense mechanisms will be essential for ensuring the security and reliability of intelligent systems in real-world applications. We hope to provide meaningful insights for the design of high-quality physical adversarial example generation methods and research on reliable deep learning models. The review homepage is available at https://github.com/Arknightpzb/Survey-of-Physical-adversarial-attack.
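The reality-bias problem above is commonly tackled by optimizing the perturbation over random physical transformations (expectation over transformation, EOT). The following is a schematic sketch under that assumption; the classifier, scene images, and the simple random-placement transform are all stand-ins for the richer perspective, lighting, and printing transforms used in practice.

```python
# Schematic EOT loop for a 2D adversarial patch: the patch is optimized under
# randomized placement so that it stays adversarial across varied conditions.
import torch
import torch.nn.functional as F

def random_place(img, patch):
    """Paste the patch at a random location (a crude stand-in for the
    physical transforms applied during real EOT optimization)."""
    _, _, H, W = img.shape
    ph, pw = patch.shape[-2:]
    top = torch.randint(0, H - ph + 1, (1,)).item()
    left = torch.randint(0, W - pw + 1, (1,)).item()
    out = img.clone()
    out[:, :, top:top + ph, left:left + pw] = patch
    return out

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 10))
patch = torch.rand(1, 3, 16, 16, requires_grad=True)  # the "physical" patch
opt = torch.optim.Adam([patch], lr=0.01)
target = torch.tensor([0])                            # attacker-chosen class

for step in range(100):
    img = torch.rand(1, 3, 64, 64)                    # stand-in scene image
    loss = F.cross_entropy(model(random_place(img, patch.clamp(0, 1))), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```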
关键词:physical adversarial attacks;general designing process;practicality of adversarial examples;deep learning;computer vision
摘要:Emotion recognition is an essential research direction in artificial intelligence (AI), focusing on modeling the relationship between emotional expressions and measurable features. It enables computers to recognize and understand human emotions, thereby playing a crucial role in various domains of human-computer interaction. From intelligent assistants to mental health monitoring and social robotics, emotion-aware systems are becoming increasingly pervasive, making affective computing a cornerstone of future intelligent technologies. Psychological studies have long confirmed that human emotions are typically expressed through a combination of behavioral cues, most notably facial expressions, speech prosody, and language content. These multimodal signals are often intertwined, with each modality providing complementary information that reflects the inner emotional state of an individual. In addition to these observable behaviors, physiological signals such as the electroencephalogram and electrocardiogram also vary with emotional changes. However, collecting physiological signals often requires contact-based or invasive sensors, limiting their practical use in daily environments. By contrast, facial expressions, voice, and language occur naturally, are easily accessible, and can be captured unobtrusively by cameras and microphones. These advantages make non-invasive emotion recognition approaches particularly attractive for real-world applications, where user comfort and ease of deployment are critical. The field of non-invasive emotion recognition has made significant progress in recent years, supported by advancements in deep learning, multimodal fusion techniques, and the increasing availability of emotion-labeled datasets. At the theoretical level, emotions are typically represented through two main models: discrete and dimensional. The discrete model classifies affect into a set of basic categories such as happiness, sadness, anger, fear, and surprise. This model is intuitive and aligns well with how humans often perceive emotion. Alternatively, the dimensional model maps emotions onto a continuous space, commonly defined by valence (positive to negative), arousal (calm to excited), and sometimes dominance (submissive to dominant). This model allows for more nuanced and dynamic descriptions of emotional states. Both models form the basis for designing annotation protocols and developing machine learning algorithms for emotion detection. A fundamental driver of research in this field is the availability of multimodal emotional datasets. Numerous public databases support facial expression recognition, speech emotion analysis, and affective text mining. For example, facial datasets like AffectNet and the Real-world Affective Faces Database (RAF-DB) contain millions of annotated images representing a broad spectrum of expressions, while audio datasets like the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) offer rich emotional speech recordings. Language datasets, including those developed for sentiment analysis challenges, provide labeled corpora for detecting emotion in both written and spoken text.
Multimodal datasets like the CMU Multimodal Opinion-level Sentiment Intensity (CMU-MOSI) and CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) corpora integrate synchronized recordings of facial expressions, voice, and transcribed language, supporting more comprehensive studies on multimodal fusion and cross-modal emotion reasoning. Technological progress in this area has been fueled by innovations in machine learning and signal processing. Facial expression recognition has evolved from handcrafted feature extraction to end-to-end deep learning frameworks using convolutional neural networks (CNNs), attention mechanisms, and, more recently, vision transformers. Temporal modeling techniques have also been used to capture dynamic facial cues, including micro-expressions that may reveal concealed emotions. In speech emotion recognition, acoustic features such as pitch, intensity, Mel-frequency cepstral coefficients (MFCCs), and spectral entropy are used to infer emotion. Recurrent neural networks (RNNs), convolutional layers, and transformers have also shown strong performance, especially when trained on large emotional corpora. For language-based emotion analysis, both traditional natural language processing (NLP) techniques and modern pre-trained language models like bidirectional encoder representations from Transformers (BERT) and the generative pretrained Transformer (GPT) have proven effective in extracting emotional content from text, including subtle and context-dependent cues. A major trend in the field is the integration of multiple modalities to improve recognition accuracy and robustness. Multimodal emotion recognition systems aim to leverage complementary information from facial, vocal, and linguistic cues. To achieve this, various fusion strategies have been developed: early fusion, which combines raw features; late fusion, which merges decision outputs; and hybrid fusion methods, which integrate features at intermediate stages. Cross-modal attention, contrastive learning, and reinforcement learning have been used to address issues such as modality imbalance, missing data, and asynchronous signals. As a result, multimodal approaches consistently outperform single-modality systems, especially in real-world conditions where one modality might be unreliable. The applications of emotion recognition are also expanding into several domains. In healthcare, non-invasive emotion recognition is used for monitoring mental health, detecting depression, and supporting emotional therapy. In education, it enables intelligent tutoring systems to gauge student engagement and adapt teaching strategies. In intelligent driving, it helps assess driver fatigue and stress to enhance road safety. Customer service chatbots and entertainment platforms are also increasingly equipped with emotion-aware capabilities to personalize interactions. The ability to recognize and adapt to human emotions not only improves user satisfaction but also fosters more natural and effective communication between humans and machines. Despite these advances, the field of emotion recognition continues to face numerous challenges. Cultural and individual differences in emotional expression can hinder the generalizability of models trained on specific datasets. Emotions are often ambiguous and context-dependent, making accurate annotation and recognition difficult. Real-time deployment requires models that are accurate, lightweight, and efficient.
Moreover, ethical concerns surrounding emotion recognition, especially regarding privacy, consent, and the potential misuse of affective data, are increasingly coming into focus. Thus, emotion-aware systems must be developed and deployed responsibly, with transparency and fairness. Looking ahead, several promising directions are emerging. Transfer learning and domain adaptation techniques can help models generalize better across diverse user groups and environments. Furthermore, self-supervised and unsupervised learning approaches may alleviate the dependency on large annotated datasets. Researchers are also exploring explainable AI methods to increase the interpretability and trustworthiness of emotion recognition systems. Additionally, the fusion of modalities through generative techniques and cross-modal alignment is opening new avenues for robust and flexible emotion understanding, especially in noisy or incomplete input scenarios. In conclusion, this review presents a comprehensive analysis of non-invasive emotion recognition based on facial expressions, speech, and language. By summarizing theoretical foundations, available resources, technical advancements, practical applications, and ongoing challenges, it aims to provide researchers and practitioners with a clear understanding of the current landscape and future potential of this rapidly evolving field.
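As a small illustration of the speech features and decision-level fusion described in this review, the following sketch extracts mean-pooled MFCCs with librosa and averages per-modality class posteriors; the three posterior vectors are stand-ins for real classifier outputs, and the weighting is an assumption.

```python
# MFCC extraction for the speech branch plus a simple late (decision-level)
# fusion of per-modality class probabilities.
import numpy as np
import librosa

def speech_features(path: str) -> np.ndarray:
    """Mean-pooled MFCCs, a common hand-crafted speech-emotion descriptor."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, frames)
    return mfcc.mean(axis=1)

def late_fusion(p_face, p_speech, p_text, weights=(1/3, 1/3, 1/3)):
    """Weighted average of per-modality posteriors (decision-level fusion)."""
    fused = weights[0] * p_face + weights[1] * p_speech + weights[2] * p_text
    return fused / fused.sum()

# Stand-in posteriors over {happy, sad, angry, neutral} from three classifiers:
p_face = np.array([0.6, 0.1, 0.1, 0.2])
p_speech = np.array([0.4, 0.3, 0.1, 0.2])
p_text = np.array([0.5, 0.2, 0.1, 0.2])
print(late_fusion(p_face, p_speech, p_text).argmax())  # fused prediction
```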
摘要:The development of emotionally intelligent digital humans and robotic technologies represents a significant advancement in contemporary research, focusing on the creation of systems capable of understanding and responding to human emotions in a nuanced manner. This paper systematically analyzes the current research status and advancements in four key areas: brain-cognition-driven emotional mechanisms, the integration and interpretation of multimodal emotional intelligence models, personalized emotional representation and dynamic computation, and the regulation of interactive emotional content generation. The brain-cognition-driven emotional mechanisms highlight the critical need to understand emotional characteristics and dynamic regulatory processes across various brain regions. Recent advances in neuroimaging technologies, such as functional magnetic resonance imaging and electroencephalography, have yielded deeper insights into the activation patterns of these areas in response to different emotional stimuli. For instance, the amygdala’s association with fear responses contrasts with the prefrontal cortex’s role in emotional regulation and cognitive control, emphasizing the importance of identifying these unique functions for the development of accurate emotion recognition systems capable of real-time emotional state analysis. The integration and interpretation of multimodal emotional intelligence models are also essential for enhancing emotion recognition capabilities. The ability to synthesize information from diverse sources, including audio, video, text, and physiological signals, provides a more robust understanding of emotional expressions. By analyzing vocal tone alongside facial expressions and contextual text data, models can achieve superior accuracy in identifying a spectrum of emotions such as happiness, sadness, or anger. This paper delves into methodologies for aligning and fusing cross-modal emotional data, showcasing how techniques, like deep learning and Transformer architectures, address challenges related to differences in modal features and temporal synchronization. For example, ensuring that emotional cues from video and audio data are accurately aligned in time can significantly enhance the overall recognition process. The discussion further explores the application of large models, particularly their capabilities in transfer and self-supervised learning, which enable these systems to adapt to new emotional contexts with minimal additional training. Such adaptability not only improves the naturalness of emotional expressions but also addresses critical privacy concerns associated with processing emotional data. In addition to these foundational elements, the paper emphasizes personalized emotional representation and dynamic computation. By capturing individual emotional traits, such as those related to gender, age, cultural background, and personality type, models can create more accurate emotional profiles tailored to specific users. This individualized approach is particularly relevant in areas such as mental health support, where a nuanced understanding of users’ emotional landscape can significantly enhance intervention effectiveness. The integration of social relationships and environmental stimuli into emotional analysis is also discussed, highlighting how contextual factors influence emotional responses and lead to more appropriate system reactions. 
Hierarchical knowledge-guided technologies are highlighted as enabling systems to respond to complex emotional scenarios, fostering more nuanced and context-aware interactions. Moreover, adaptive dynamic modeling techniques for emotional states introduce temporal dimensions into emotion processing, allowing real-time adjustments that ensure responses remain relevant and sensitive to user needs. The regulation of interactive emotional content generation is another critical aspect of this review, aiming to develop intelligent systems that can understand and produce multimodal emotional content. Key components include constructing emotional spaces for precise emotion representation, which involves defining discrete categories and continuous dimensions of emotional expression. This dual approach enhances the ability of systems to capture a wide range of emotional nuances, facilitating more accurate and relatable interactions. Furthermore, the paper examines controllable interaction technologies in emotional generation, particularly advancements in generative adversarial networks and diffusion models, which allow for the introduction of emotional conditions that guide content generation. This capability enhances the flexibility and relevance of emotional responses, enabling digital humans to adjust their expressions based on users’ emotional states for more engaging interactions. Utilizing multimodal reasoning is a crucial element of the discussion, as it leverages the inferential capabilities of large multimodal models to effectively align and generate cross-modal emotional information. This enriches generated content and ensures resonance with users across visual, auditory, and textual levels. The paper also addresses strategies for minimizing computational resources while maintaining content quality, essential for deploying these advanced systems in real-world applications where efficiency is paramount. In conclusion, emotionally intelligent digital humans signify a transformative advancement in human-computer interaction, with the potential to significantly enhance user engagement and satisfaction. By integrating high-fidelity digital reconstruction, controllable emotional expression, and intelligent interaction capabilities, these systems can facilitate more natural and effective user engagement. Future developments in this field are likely to focus on enhancing the realism of digital human interactions and improving the adaptability of emotional expressions based on user feedback. As technology continues to progress, the potential applications of emotionally intelligent digital humans and robotics will expand across various domains, including healthcare, where they can provide companionship and emotional support; education, where they can personalize learning experiences; and entertainment, where they can create immersive environments. Ultimately, these advancements promise enriched user experiences and deeper emotional connections, paving the way for a future where emotionally intelligent systems become integral to daily life.
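To illustrate the dual discrete-continuous emotional space mentioned above, the following minimal sketch places a few discrete categories at rough, conventional valence-arousal coordinates (in the spirit of Russell's circumplex; the numbers are illustrative, not values from the surveyed work) and maps a continuous estimate to its nearest category.

```python
# Toy emotional space: discrete labels anchored in continuous
# valence-arousal coordinates, with nearest-anchor lookup.
import numpy as np

VA_ANCHORS = {  # (valence, arousal); approximate, illustrative placements
    "happiness": ( 0.8,  0.5),
    "sadness":   (-0.7, -0.4),
    "anger":     (-0.6,  0.7),
    "fear":      (-0.7,  0.6),
    "calm":      ( 0.4, -0.6),
}

def nearest_category(valence: float, arousal: float) -> str:
    """Map a continuous (valence, arousal) estimate to the closest discrete label."""
    p = np.array([valence, arousal])
    return min(VA_ANCHORS, key=lambda k: np.linalg.norm(p - np.array(VA_ANCHORS[k])))

print(nearest_category(0.7, 0.4))   # -> "happiness"
```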
关键词:affective computing;digital human;robotics;multimodal emotion large model;emotional mechanism
摘要:Medical image segmentation is a crucial component of clinical medical image analysis, aiming to accurately identify and delineate anatomical structures or regions of interest, such as lesions, within medical images. This process provides objective and quantitative support for decision-making in disease diagnosis, treatment planning, and postoperative evaluation. In recent years, the rapid expansion of available annotated datasets has driven the swift development of deep learning-based medical image segmentation methods. These methods have demonstrated superior accuracy and robustness compared to traditional segmentation techniques, establishing themselves as the mainstream technology in the field. Extensive research has been dedicated to improving the structural designs of segmentation models and further enhancing segmentation accuracy, leading to the development of various segmentation approaches. Current deep learning-based medical image segmentation methods can be classified into three main structural categories: convolutional neural networks (CNNs), Vision Transformers, and Vision Mamba. As a representative neural network architecture, CNNs effectively capture spatial features in images due to their unique local receptive fields and weight-sharing mechanisms. These characteristics make CNNs particularly suitable for image analysis and processing tasks. Since 2015, CNN-based methods, such as U-Net, have dominated the field of medical image segmentation, consistently achieving state-of-the-art performance across various downstream segmentation tasks. Many studies have focused on modifying and innovating the U-Net structure to further improve segmentation accuracy, resulting in several derived segmentation methods. Despite these advancements, CNN-based methods are still limited by the inherent constraints of convolutional operators, particularly the local nature of their receptive fields. These limitations restrict the capability of the model to capture global contextual dependencies, which are crucial for handling complex medical images and performing fine-grained segmentation. While techniques such as attention mechanisms and specialized convolutions have been introduced to address these challenges and help the model focus on global information, their effectiveness remains limited. Since 2020, researchers have begun introducing Transformer architectures, originally developed for natural language processing (NLP), into computer vision tasks, including medical image segmentation. Vision Transformers use self-attention mechanisms to effectively model global dependencies, drastically improving the quality of semantic feature extraction and facilitating the segmentation of complex medical images. Transformer-based methods for medical image segmentation mainly include hybrid approaches that combine Transformer with CNN, as well as pure Transformer methods, each displaying unique advantages and disadvantages. Hybrid approaches capitalize on the strengths of CNN in local feature extraction while leveraging the capabilities of Transformer in modeling global context. This combination enhances segmentation accuracy while maintaining computational efficiency. However, these hybrid methods remain dependent on CNN structures, which may limit their performance in complex scenarios. In contrast, pure Transformer methods excel at capturing long-range dependencies and multiscale features, leading to remarkable improvements in segmentation accuracy and generalization. 
However, pure Transformer architectures typically require substantial computational resources and high-quality training data, posing challenges, especially in the medical field, where large-scale annotated datasets are often difficult to obtain. Despite the notable advantages of Transformer architectures in capturing long-range dependencies and global contextual information, their computational complexity grows quadratically with the length of the input sequence, which limits their applicability in resource-constrained environments. Researchers are developing new methods that can model global dependencies with linear time complexity to overcome this challenge. For instance, Mamba introduces a novel selective state-space model that uses a selection mechanism, hardware-aware algorithms, and a streamlined architecture to reduce computational complexity while maintaining efficient performance in long-sequence modeling. Since 2024, numerous studies have started applying the Mamba structure to medical image segmentation tasks, achieving promising results and positioning it as a potential alternative to traditional Transformer structures. The hybrid method, combining Mamba with CNN, enhances segmentation accuracy and robustness by leveraging the feature extraction capabilities of CNN alongside the capability of Mamba to handle long-range dependencies. However, this approach may introduce additional computational complexity. In contrast, pure Mamba methods excel in tasks that require global contextual information but still face limitations in capturing spatial features within images and may demand higher computational resources during training. Overall, this paper systematically reviews and analyzes the development trajectory, advantages, and limitations of deep learning-based medical image segmentation methods from a structural perspective for the first time. First, all surveyed methods are categorized into three structural classes. Then, a brief overview of the structural evolution of different segmentation methods is provided, and their structural characteristics, strengths, and weaknesses are analyzed. Subsequently, the major challenges and opportunities encountered in the field of medical image segmentation are analyzed from multiple perspectives, including algorithm structure, learning methods, and task paradigms. Finally, an in-depth analysis and discussion of future development directions and application prospects are conducted.
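For readers unfamiliar with the U-Net family that anchors the CNN-based category above, the following is a deliberately tiny PyTorch sketch of its encoder-decoder structure with a single skip connection; real medical segmentation networks are substantially deeper and add normalization, augmentation, and task-specific training.

```python
# Minimal U-Net-style encoder-decoder with one skip connection.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = conv_block(64, 32)            # 64 = 32 upsampled + 32 skip
        self.head = nn.Conv2d(32, n_classes, 1)  # per-pixel class logits

    def forward(self, x):
        s1 = self.enc1(x)                        # skip-connection source
        s2 = self.enc2(self.pool(s1))
        d = self.up(s2)
        d = self.dec(torch.cat([d, s1], dim=1))  # U-Net skip concatenation
        return self.head(d)

logits = TinyUNet()(torch.rand(1, 1, 128, 128))
print(logits.shape)                              # (1, 2, 128, 128)
```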
摘要:Artificial intelligence (AI) plays a crucial role in the field of medical imaging and has garnered significant attention. Medical imaging modalities widely used in clinical practice include magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), X-ray, and ultrasound, which provide complementary information. AI technologies excel in mining image information and characterizing advanced features, driving constant innovation in core algorithms for applications such as disease detection, diagnosis, and treatment. This study systematically examines the current status, primary methods, and advancements of medical imaging AI, providing a thorough analysis and summary of its performance in real medical settings. The review begins by analyzing the key AI algorithms used in medical imaging, encompassing mapping, detection, segmentation, and classification models, and detailing their applications and progress in the field. Because considerable research has been applied only sporadically within specific areas of medical imaging, without substantially enhancing the overall clinical workflow, this review emphasizes the concepts of full-stack and full-spectrum innovation as a way of introducing disruptive improvements to clinical workflows. The “full-stack” approach is dedicated to the development of medical imaging AI covering the entire process of pre-imaging, imaging, post-imaging, and functional assessment to improve imaging quality and diagnostic accuracy. In the pre-imaging phase, AI can intelligently handle positioning procedures, localization adjustments, and dose modulation. During imaging, AI reconstruction technology aids in generating fast and low-dose medical images. Post-imaging, AI-based quality control prevents image quality degradation, whereas in functional evaluation, AI-based detection and segmentation help identify abnormalities. In addition, AI-based classification supports disease diagnosis and treatment decisions, whereas AI-based registration technologies facilitate follow-up and disease progression monitoring. This review focuses on recent advancements in AI-based reconstruction for fast MRI, low-dose CT, and fast PET scenarios, with the goal of improving image quality, accelerating scanning processes, reducing noise and artifacts while preserving the detailed structure of the lesion, and amplifying lesion contrast. Notably, functional assessment is critical for the full course of disease management by aiding at-risk identification, diagnosis, molecular subtyping, treatment planning, and prognostic evaluation. We anticipate that AI technologies can be integrated into the existing clinical workflow to enable full-stack analysis of a specific disease, improving patient outcomes and alleviating radiologists’ workloads. The “full-spectrum” approach offers a different perspective by encompassing multiple imaging modalities supported by various imaging devices to accurately diagnose diseases independently or in combination using complementary modalities. It also broadens the application of AI to diverse diseases or body parts with the aim of diagnosing multiple conditions in a single scan for assessing structural and functional abnormalities throughout the body. The full-spectrum approach aims to improve healthcare by providing comprehensive diagnostic capabilities using advanced AI technology, enabling doctors to better understand a wide range of diseases and provide personalized medical diagnoses for different patient profiles.
Drawing inspiration from the concepts of full stack and full spectrum, this review outlines several AI applications in real-world healthcare settings, focusing on one-stop diagnostics and management strategies for stroke, lung cancer, and cardiovascular diseases. Stroke management initiatives encompass solutions for hemorrhagic and ischemic strokes, risk factor identification and management, and intelligent diagnostic protocols. The paper further explores AI approaches to lung cancer prevention and treatment, spanning early screening, target reconstruction, quantitative characteristic analysis, risk assessment, three-dimensional preoperative planning, follow-up evaluations, and structured report generation. Moreover, the review elaborates on the comprehensive cardiovascular AI process involving precise coronary artery imaging techniques, intelligent early screening for calcification scoring, three-dimensional analysis to aid diagnosis, and exploration of other cardiac conditions. Notably, a series of AI-based software tools has been developed to broaden the scope of AI interventions in the existing clinical workflow. In the context of the growing development of precision medicine, AI shows great potential in integrating multiple data streams spanning radiomics, pathomics, and genomics into a powerful diagnostic or predictive system, which is expected to accelerate the achievement of management goals that are truly tailored to patients. The emergence and rapid development of generative AI technologies and large language models will lead to a series of innovative applications of generative AI, including scenarios such as interactive report interpretation, medical and health consulting, and smart operating rooms. The paper concludes by summarizing the current research status and outlining future development directions for medical imaging AI while providing a thorough review and analysis of the pertinent literature, serving as a valuable reference for future research endeavors. We anticipate multidisciplinary cross-collaboration to promote high-quality clinical research, technological innovation, and software development, and to expand new application scenarios for medical AI. By leveraging full-stack and full-spectrum thinking and the customized assembly of AI technologies, medical imaging AI can progressively extend its reach across the full spectrum of clinical applications, imaging modalities, and disease types.
关键词:medical imaging;artificial intelligence(AI);deep learning;full stack;full spectrum;medical scenario
摘要:Hyperspectral imaging technology, an intricate integration of digital imaging and advanced spectral separation techniques, represents a remarkable leap in the field of remote sensing and data acquisition. This technology acquires images across an extensive range of continuous narrow spectral channels to construct a 3D data cube. This cube seamlessly amalgamates spatial and spectral information to produce a hyperspectral image (HSI). HSIs are not merely repositories of rich spectral information; they also comprehensively integrate spatial and radiometric information. This unique combination provides robust and indispensable support for intricate classification, precise identification, and in-depth analysis of complex targets, which are often encountered in diverse real-world scenarios. Consequently, HSIs possess substantial application value in many civilian and military domains. In the aviation and aerospace sectors, they play a pivotal role in terrain mapping, environmental monitoring from high altitudes, and detection of potential hazards. In biomedical research, hyperspectral imaging is being increasingly utilized for noninvasive disease diagnosis, tissue characterization, and the study of physiological processes at a cellular level. In mineral exploration, it enables the identification of various minerals and the assessment of their abundance on the basis of their distinct spectral signatures. In precision agriculture, HSIs are employed to monitor crop health, nutrient deficiencies, and water stress, thus optimizing agricultural practices for increased yields. Presently, hyperspectral imaging and processing technologies are still internationally regarded as highly promising and transformative areas of development. However, HSIs are characterized by deep coupling and diverse forms of information redundancy across the spatial, spectral, and temporal domains. These inherent characteristics pose unique, formidable challenges, the most notable of which are the curse of dimensionality and the issue of mixed pixels. The curse of dimensionality refers to the exponential increase in computational complexity and data volume with the rise in the number of spectral bands, and it often leads to overfitting and substantial degradation of classification performance. Mixed pixels, on the other hand, occur when a single pixel in the image encompasses multiple land cover types; as a result, the pixel’s true nature becomes difficult to classify accurately. These issues render traditional pattern recognition and digital image processing methods unsuitable for direct application to HSI classification. Despite the remarkable progress achieved in multispectral and hyperspectral imaging and their associated processing techniques, numerous challenges persist, and the full potential of these approaches remains to be fully harnessed. Drawing on the latest global developments and the team’s more than three decades of research practice, this study undertakes an in-depth exploration and comprehensive review of the research progress and future trends in HSI classification from a novel perspective. It systematically categorizes multispectral and HSI classification methods into four distinct types, as follows: 1) Traditional methods, which typically involve a two-step process of feature extraction followed by the application of conventional classifiers, such as maximum likelihood classifiers. These methods rely on hand-crafted features that are designed based on prior knowledge and assumptions about the data.
2) Conventional learning methods, which integrate feature extraction with conventional learning-based classifiers. These classifiers, such as support vector machines and multilayer perceptrons, can learn from the data to some extent but still require manual intervention in feature mining. 3) Deep learning methods, which leverage the power of neural networks for automated feature extraction and classification. These methods, such as convolutional and recurrent neural networks, can automatically learn hierarchical features from the data without the need for explicit feature mining. 4) Data and knowledge fusion methods, which combine deep learning with knowledge-based and feature-fusion techniques. These methods aim to incorporate prior knowledge and multiple sources of information into the deep learning framework to enhance classification performance. Among the aforementioned methods, conventional learning methods, deep learning methods, and data and knowledge fusion methods are collectively denoted as intelligent classification methods and constitute the focal point of this work. This article is the first review worldwide that specifically and systematically provides an overview of the intelligent classification of hyperspectral images. The paper begins with a detailed review of the background and development history of HSI classification, tracing its evolution from the early days of remote sensing to the current state-of-the-art techniques. It also provides an in-depth introduction to representative hyperspectral satellites and datasets, which are the foundation for research and validation in this field. Subsequently, it accentuates two core directions, feature mining (including feature extraction, feature or band selection, and feature combination) and classifiers, and discusses HSI feature mining, traditional classification, conventional learning classification, and deep learning classification methods. For each category, it presents examples of representative models, methods, and their real-world applications. Then, it presents existing challenges and problems in the field and examines future work directions. The effective fusion of multiscale, multiresolution, multifeature, and multiclassifier approaches is a crucial avenue for enhancing the accuracy of HSI classification. Moreover, small-sample learning, zero-shot transfer learning, and lightweight, limited-precision neural networks tailored for onboard HSI applications merit further attention. The research findings indicate that the four-category classification of HSI methods proposed in this study accurately mirrors the historical development, current research focus, and future trends of the technology. In particular, data and knowledge fusion-based classification (i.e., the fourth category) offers profound insights into cutting-edge research on HSI classification and has substantial guiding value for future studies and practical applications.
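A minimal sketch of the "conventional learning" recipe in the taxonomy above, pairing feature extraction (here PCA, which also eases the curse of dimensionality by compressing the spectral axis) with a learned classifier (an SVM); the data cube and labels are synthetic stand-ins for a real hyperspectral scene.

```python
# PCA along the spectral axis + per-pixel SVM classification.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

H, W, B = 50, 50, 200                          # height, width, spectral bands
cube = np.random.rand(H, W, B)                 # stand-in hyperspectral cube
labels = np.random.randint(0, 4, size=(H, W))  # stand-in ground-truth map

X = cube.reshape(-1, B)                        # one spectrum per pixel
pca = PCA(n_components=15)                     # 200 bands -> 15 features
X_red = pca.fit_transform(X)

# Train on a small labeled subset, then predict the full scene.
idx = np.random.choice(H * W, size=500, replace=False)
clf = SVC(kernel="rbf").fit(X_red[idx], labels.reshape(-1)[idx])
pred_map = clf.predict(X_red).reshape(H, W)
print(pred_map.shape)
```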
摘要:Asteroids are critical celestial bodies in our solar system, holding essential information about the early stages of planetary formation and evolution. These small rocky bodies are considered remnants from the early solar system, providing potential insights into the origins of life and water on Earth. Since the 1990s, asteroid exploration missions have steadily increased, becoming a central focus in deep space exploration. These missions aim not only to study asteroids as ancient objects but also to explore their potential as resources for future space exploration. The origins of asteroid exploration can be traced back to the 1970s. As interest in these bodies grew, space agencies such as the National Aeronautics and Space Administration (NASA) and the European Space Agency (ESA) began conducting successful unmanned missions. NASA’s Galileo spacecraft, for example, made the first flyby of asteroid 951 Gaspra in 1991, marking a key milestone and laying the groundwork for subsequent missions. The Dawn spacecraft, launched later, conducted detailed observations of the asteroids Vesta and Ceres, providing substantial data on asteroid composition and origin. Furthermore, China’s Chang’e-2 mission performed a successful flyby of asteroid 4179 Toutatis in 2012, demonstrating the feasibility of asteroid exploration. In the coming years, a sample-return mission targeting Earth’s quasi-satellites and main-belt comets is planned. These early missions not only proved the feasibility of exploring asteroids but also provided valuable technical experience that has been integral in advancing scientific understanding and future exploration strategies. Close-range asteroid exploration involves spacecraft capturing image data from various distances, which is essential for studying the asteroid’s surface features and physical properties. However, acquiring high-quality images presents significant challenges. The asteroid’s surface is often irregular and complex, with a wide range of topographical features such as craters, boulders, and ridges. Moreover, dynamic lighting conditions and the constantly changing attitude of the spacecraft introduce further complications, making the captured images highly unique and diverse. Traditional image processing techniques often struggle to adapt to this variability. To overcome these challenges, integrating intelligent image processing technologies is crucial. AI-driven automation can enhance the spacecraft’s ability to perceive and analyze the environment in real time, thereby improving the overall scientific outcomes and increasing the success rates of asteroid exploration missions. One of the primary objectives of asteroid exploration is the intelligent analysis of surface images to identify key features, such as surface objects and obstacles, and to predict the scientific value of surface deposits. This ability to analyze surface data is critical for hazard avoidance and for selecting safe landing and sampling locations. The uniqueness of asteroid surface morphology, coupled with the scarcity of relevant datasets, poses a major challenge. Recent research has increasingly focused on combining deep neural networks with techniques such as transfer learning and few-shot learning. These approaches are particularly useful when large datasets are unavailable, allowing models to generalize from smaller sets of data.
The generalization capability of large pretrained models, which have been trained on extensive image datasets, offers new possibilities for improving recognition accuracy and performance in asteroid exploration. Another significant aspect of asteroid exploration is the intelligent perception and reconstruction of the asteroid’s 3D topography. Detailed 3D models are essential for making decisions about landing, attachment, and sample collection. Typically, generating these models requires the spacecraft to orbit the asteroid multiple times, which is one of the most time-consuming phases of close-range exploration. In addition, asteroid surface images often have a large dynamic range and high texture similarity, complicating the restoration of the 3D surface and its use in accurate localization. As a result, considerable research has been conducted on efficient methods for gathering image data, integrating this information, and reconstructing accurate 3D models of asteroid surfaces. Recent advancements in implicit 3D representations and generative models have shown promise in overcoming these challenges. These approaches enable more precise mapping and facilitate the use of 3D topographic data, significantly enhancing spacecraft navigation and surface analysis capabilities. The inversion of asteroid physical properties, such as surface material composition, weathering processes, and weak gravitational fields, is another key objective of exploration. Inverting these properties requires combining image data with multisource data, such as spectral and orbital data, which allow for a more direct method of studying the asteroid’s physical characteristics. However, the wide variety in asteroid composition, structure, and surface conditions complicates this task. Because premission knowledge about the target asteroid is limited, traditional inversion models often need to be tailored specifically to each new mission. To address this, researchers have developed generalized models that can be adapted to various types of asteroids, enabling more accurate predictions of their physical properties. Such advancements are critical for understanding the formation and evolution of asteroids, as well as their potential for resource extraction. The integration of artificial intelligence (AI) and advanced image processing techniques has significant potential for enhancing asteroid exploration. As AI and machine learning, particularly deep learning, evolve, they are increasingly applied to challenges such as asteroid surface analysis, 3D topography reconstruction, and physical property inversion. The future of asteroid exploration will focus on improving the adaptability of AI algorithms to handle diverse data types and conditions. Developing robust and generalizable models capable of understanding complex asteroid surfaces, integrating multisource data, and accurately modeling 3D topographies will be key to mission success. As technology advances, on-board, real-time processing of asteroid images using intelligent algorithms will likely become feasible, enabling spacecraft to make immediate decisions about landing and sampling, thereby enhancing mission efficiency and safety. In conclusion, the field of image intelligence in asteroid exploration is evolving rapidly. The use of machine learning, especially deep learning and large pretrained models, is improving the accuracy of surface analysis, topography reconstruction, and physical property inversion.
These advancements will play a crucial role in future deep space missions, offering essential insights into the origins of life and water on Earth. As asteroid exploration develops, intelligent image processing will be vital to maximize the scientific return of these missions, ensuring their safety and efficiency and advancing humanity’s understanding of our solar system.
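The transfer-learning recipe mentioned above can be sketched as follows: a backbone pretrained on large everyday-image corpora is frozen, and only a small classification head is trained on scarce asteroid-surface data. The class names and mini-batch below are hypothetical placeholders, not from any actual mission dataset.

```python
# Freeze a pretrained backbone; train only a new head on scarce data.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                     # freeze pretrained features
model.fc = nn.Linear(model.fc.in_features, 3)   # e.g., crater/boulder/plain

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative update step on a stand-in mini-batch.
x = torch.rand(8, 3, 224, 224)                  # few labeled surface patches
y = torch.randint(0, 3, (8,))
opt.zero_grad()
loss_fn(model(x), y).backward()                 # gradients reach only the head
opt.step()
```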
摘要:Mounted on satellites or other spaceborne platforms, spaceborne synthetic aperture radar (SAR) utilizes transmitted signals to capture detailed surface information, irrespective of the time of day or weather conditions. The combination of all-weather, all-time operational capacity and deep penetration through the atmosphere makes spaceborne SAR an indispensable tool for a broad spectrum of applications, including military reconnaissance, environmental monitoring, disaster management, and resource management. The spaceborne SAR signal and data chain spans echo acquisition, imaging processing, and image application. This paper focuses on this data chain and comprehensively analyzes the current state of development, cutting-edge trends, and pressing issues in the field of spaceborne SAR. It begins by reviewing the current state of spaceborne SAR systems, with an emphasis on the development of SAR platforms and the datasets they produce. A comparative analysis of the key parameters of domestic and international spaceborne SAR systems is presented, highlighting the technological capabilities, operational constraints, and strategic goals behind these systems. Notable parameters include the system’s frequency bands, spatial resolution, polarization modes, and swath width, all of which influence the potential applications and limitations of the data generated. Recent advances in spaceborne SAR imaging technology have significantly expanded the capabilities of these systems. One of the most prominent innovations is multidimensional observation, which allows SAR systems to acquire data from multiple viewpoints or angles within a single acquisition cycle. It involves several key innovations, including multiband imaging, multipolarity, and double/multiple-base (bistatic/multistatic) cooperative detection, all of which contribute to enhanced imaging and a more comprehensive understanding of the observed area. Each of these components offers unique advantages that complement each other in providing more detailed, accurate, and varied insights into the Earth’s surface. Multiband SAR refers to the use of different frequency bands in a single SAR system or through the combination of different systems operating in different bands. The use of multiple frequency bands allows for the capture of complementary information about the target area. Multipolarity refers to the use of multiple polarization modes in SAR systems, where the transmitted and received radar waves are polarized in different orientations. This approach significantly enhances the ability to distinguish between different surface types and materials based on their interaction with polarized electromagnetic waves. Double/multiple-base cooperative detection refers to the integration of data from multiple SAR platforms or sensor locations operating cooperatively to observe the same area from different angles or baselines. By leveraging multiple sensors operating in different locations or with different baselines, multibase cooperative detection enhances the depth and precision of SAR observations, providing richer datasets for change detection and surface movement analysis. Another critical advancement is the development of high-resolution wide-swath imaging, which involves multichannel techniques, varied pulse repetition frequency (PRF), and multiple-input multiple-output (MIMO) SAR. The multichannel technique involves the use of multiple receiving channels within the SAR system, allowing for simultaneous reception of signals from different parts of the radar beam.
By utilizing multiple channels, SAR systems can cover a larger area with greater detail, as the signals from the various channels are processed in parallel. Varied PRF is a technique used to adjust the interval between successive radar pulses based on the specific operational requirements of the SAR system. By dynamically changing the pulse repetition rate, the system can optimize the trade-off between resolution and coverage, depending on the target’s distance from the radar and the desired imaging resolution. MIMO SAR represents a groundbreaking innovation in radar technology that employs multiple transmitting and receiving antennas simultaneously. By using a combination of multiple input and output signals, MIMO SAR enhances the radar system’s ability to gather detailed information from a large area while maintaining high resolution. This technique allows for the simultaneous acquisition of data from different angles, which improves both the swath width and the image resolution. The increasing volume, complexity, and diversity of data produced by spaceborne SAR systems have created a demand for advanced processing and analytical techniques capable of handling large datasets and extracting meaningful insights efficiently. Traditional image processing methods, which often rely on manual intervention and domain expertise, have limitations in terms of speed, scalability, and adaptability. By contrast, intelligent processing leveraging machine learning (ML) and deep learning (DL) has revolutionized SAR data analysis, enabling automated, accurate, and scalable solutions for various applications, including classification, target detection, change detection, and anomaly detection. These intelligent techniques enhance SAR systems by improving their data interpretation capabilities, reducing the reliance on manual processes, and enabling real-time data analysis. Common ML methods include support vector machines, Markov random fields, dictionary learning, decision trees, and unsupervised clustering. Compared with traditional image processing techniques, ML methods offer significant advantages in rapidly extracting relevant information from large volumes of data; these advantages include fewer hyperparameters, high processing efficiency, and strong adaptability, making ML methods particularly suitable for tasks that involve complex data analysis and real-time decision making. DL is an advanced artificial intelligence approach characterized by its ability to learn effective features from large datasets in a hierarchical manner, significantly reducing the complexity and error associated with manual feature extraction. Common DL architectures include convolutional neural networks, deep belief networks, stacked autoencoders, and transformer networks. DL-based image data processing methods, through the stacking of multiple layers of neural networks, can automatically extract more abstract and higher-level target features directly from raw data, thereby enhancing the overall accuracy of prediction and recognition tasks. Over the past decades, significant progress has been made in spaceborne SAR technology, with notable developments in several key areas. These include the advancement of constellation-based SAR and lightweight SAR systems, high-resolution and wide-swath imaging, multipolarization and arbitrary frequency band imaging, intelligent data processing, and the application of interferometric SAR (InSAR) and differential InSAR for complex scene analysis.
Despite this progress, several challenges persist, including the control of attitude and orbit errors, the transmission and storage of massive data volumes, target interpretation in complex scenes, and susceptibility to electromagnetic interference and external noise. These issues continue to pose significant obstacles to the further advancement and operational deployment of spaceborne SAR systems and call for ongoing research and technological innovation.
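As quantitative background for the high-resolution wide-swath techniques reviewed above, the classic single-channel SAR timing constraint can be stated compactly. This is the standard textbook relation, included for orientation only: here $W_s$ denotes the unambiguous slant-range swath, $v$ the platform velocity, $L_a$ the azimuth antenna length, and $\delta_a \approx L_a/2$ the achievable azimuth resolution.

```latex
% Range ambiguity bounds the swath from above; azimuth (Doppler)
% ambiguity bounds the PRF from below:
\[
  W_s \le \frac{c}{2\,\mathrm{PRF}}, \qquad
  \mathrm{PRF} \ge \frac{2v}{L_a} = \frac{v}{\delta_a}.
\]
% Eliminating the PRF yields the swath/resolution trade-off that
% multichannel, varied-PRF, and MIMO architectures are designed to relax:
\[
  \frac{W_s}{\delta_a} \le \frac{c}{2v}.
\]
```

For a typical low-Earth-orbit velocity of about 7.5 km/s, the bound caps $W_s/\delta_a$ at roughly $2\times10^4$, so a 1 m azimuth resolution limits a single-channel system to a swath on the order of 20 km, which is precisely the limitation the techniques above are meant to overcome.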
摘要:Social group detection in public spaces has become a popular research topic in computer vision and social behavior analysis in recent years. This research analyzes surveillance video, drawing on social interaction cues, spatiotemporal relationships, and computer vision techniques to identify and understand human social behavior patterns and thereby detect pedestrian groups that are actively interacting. With the advance of urbanization, the demand for public safety and social behavior monitoring has increased significantly. Social group detection is crucial in various research fields, including trajectory prediction, group anomaly detection, human-computer interaction, and intelligent surveillance. Exploring and identifying human social behavior patterns, especially recognizing interacting pedestrian groups, provides strong support for applications such as predicting pedestrian behavior, detecting abnormal group activities, and enhancing public safety, which in turn propels further developments in computer vision and artificial intelligence. Despite some progress in this field, accurately modeling group interaction remains challenging. First, social group behaviors are diverse and complex, lacking formal rules and clear social interpretations, which makes social group detection a difficult task. Second, most existing methods rely on high-quality annotated data, which entails significant cost and technical difficulty, especially in real-world scenarios where the number and behavior patterns of social groups change constantly; as a result, existing datasets often fail to cover all possible scenarios. Furthermore, social group detection faces challenges related to lightweight network design and few-shot learning: effectively training models and improving detection accuracy in the absence of large-scale annotated data remains a key open issue. This paper reviews the existing research on social behavior understanding and social group detection. It divides methods for social group detection in public spaces into two main categories: heuristic-rule-based methods and learning-based methods. In heuristic-rule-based methods, researchers define a series of rules that determine the interaction features of pedestrian groups, for example by using distance, speed change, relative direction, and visual descriptors to define the occurrence of social behavior, or by extracting high-level semantic representations of human interaction patterns, such as social force models or social interaction fields. However, the applicability of these methods is often limited by the accuracy and complexity of the rules. Conversely, learning-based methods leverage machine learning or deep learning techniques to automatically learn the patterns of social behaviors from large datasets, thereby achieving higher detection accuracy. In recent years, deep learning methods, particularly convolutional, recurrent, and graph neural networks, have been widely applied in social group detection and have achieved significant results. In terms of evaluation, this paper summarizes common evaluation metrics, datasets, and detection performance. Common evaluation metrics include accuracy, recall, and F1 score, which effectively reflect the performance of social group detection algorithms.
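Before turning to the reported numbers, a minimal sketch of how such group-level metrics are typically computed may help. It assumes the common two-thirds membership-overlap matching criterion used in the F-formation literature; the groups in the sample frame are invented for illustration.

```python
# Minimal sketch of group-level precision/recall/F1 under the assumed
# two-thirds membership-overlap criterion. Groups are frozensets of
# pedestrian IDs; the example data are illustrative only.

def group_match(pred, gt, tol=2 / 3):
    """A predicted group matches a ground-truth group when their overlap
    covers at least `tol` of the members of each group."""
    overlap = len(pred & gt)
    return overlap >= tol * len(pred) and overlap >= tol * len(gt)

def group_prf(predicted, ground_truth):
    # Precision counts matched predictions; recall counts matched
    # ground-truth groups (a simplified, per-frame formulation).
    matched_pred = sum(any(group_match(p, g) for g in ground_truth)
                       for p in predicted)
    matched_gt = sum(any(group_match(p, g) for p in predicted)
                     for g in ground_truth)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gt / len(ground_truth) if ground_truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative frame: IDs 1-2-3 walk together, 4-5 walk together.
ground_truth = [frozenset({1, 2, 3}), frozenset({4, 5})]
predicted = [frozenset({1, 2}), frozenset({4, 5})]
print(group_prf(predicted, ground_truth))  # -> (1.0, 1.0, 1.0) here
```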
For example, mainstream deep learning frameworks for group detection typically achieve average precision and recall above 85%, whereas state-of-the-art behavior-pattern-based methods reach F1 scores of up to 89%, highlighting the potential of behavior patterns for recognizing social behavior and detecting interacting groups. Performance also differs markedly across datasets: detection accuracy reaches 87% on the Crowd-Tracking dataset but only around 83% on the ETH dataset. These differences suggest that the diversity and complexity of datasets affect detection results to some extent and that current datasets cannot sufficiently cover all possible social behavior patterns. Lastly, the paper discusses the challenges and limitations of social group detection and outlines future research directions. Current challenges center on processing large-scale video data efficiently, achieving accurate group detection in data-scarce environments, and meeting real-time and computational-efficiency requirements. Furthermore, social group detection research must integrate cross-disciplinary knowledge, for example by incorporating behavioral patterns and social psychology theories to enhance detection accuracy. Future research could expand in the following directions. First, researchers could explore implicit social patterns, particularly the detection of covert criminal groups. Second, interpretable fundamental social patterns could be extracted to enhance the accuracy and robustness of social group detection. Third, unsupervised and self-supervised learning could be used to address few-shot learning, thereby improving the applicability of algorithms in real-world scenarios. In conclusion, social group detection is not only a technical challenge but also an interdisciplinary research issue with vast application potential. With continuous advancements in algorithms and hardware, social group detection technologies will play an increasingly important role in public safety, smart cities, and human-computer interaction.
关键词:social group detection;understanding interaction behavior;interaction detection;F-formation;behavioral pattern
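To ground the heuristic-rule-based family surveyed above, the following minimal sketch links pedestrians that are spatially close and moving in similar directions, then reads groups off as connected components of the resulting graph. The distance and heading thresholds, like the sample frame, are illustrative assumptions rather than values from any published method.

```python
# Minimal sketch of a heuristic-rule-based social grouper: proximity +
# similar heading -> link; groups = connected components. All thresholds
# and data below are illustrative assumptions.
import numpy as np

def detect_groups(positions, velocities, d_max=1.5, angle_max=np.pi / 6):
    """positions, velocities: (N, 2) arrays; returns a list of ID sets."""
    n = len(positions)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.linalg.norm(positions[i] - positions[j])
            vi, vj = velocities[i], velocities[j]
            cos = vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj) + 1e-9)
            angle = np.arccos(np.clip(cos, -1.0, 1.0))
            if dist <= d_max and angle <= angle_max:  # proximity + heading rule
                adj[i].append(j)
                adj[j].append(i)
    # Connected components via depth-first search.
    seen, groups = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.add(u)
            stack.extend(adj[u])
        if len(comp) > 1:  # singletons are not social groups
            groups.append(comp)
    return groups

# Illustrative frame: IDs 0-1 walk side by side; ID 2 heads the other way.
pos = np.array([[0.0, 0.0], [0.8, 0.1], [5.0, 5.0]])
vel = np.array([[1.0, 0.0], [1.0, 0.05], [-1.0, 0.0]])
print(detect_groups(pos, vel))  # -> [{0, 1}]
```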
摘要:With the rapid advancement of virtual reality (VR) technology and the cultural tourism industry, the digitalization of cultural heritage and digital cultural tourism visualization services have garnered increasing attention. This paper surveys research progress in cultural heritage digitalization and digital cultural tourism visualization service technologies. It first analyzes the current status and challenges of artificial intelligence (AI) and deep learning technologies in the collection, storage, integration, and sharing of cultural heritage digital resources. In recent years, AI-driven technologies have significantly transformed the methods of cultural heritage digitization and preservation, enabling more efficient and accurate approaches to data collection and management. Subsequently, this paper explores how digital technologies facilitate the transition of cultural heritage protection from traditional methods to more digital and intelligent approaches. In particular, the integration of high-precision 3D scanning, VR, and augmented reality (AR) technologies into digital cultural tourism services has opened up new possibilities for immersive experiences and personalized recommendations. These technologies allow users to interact with cultural heritage in innovative ways, to visit sites virtually, and to explore objects that would otherwise be inaccessible. However, these advancements also pose challenges in ensuring the authenticity, integrity, and accessibility of digital reproductions, as well as in addressing the technical limitations of current tools. This paper further examines the applications of cultural heritage digitalization and visualization technologies in the development of integrated cultural tourism service platforms. These platforms leverage AI, machine learning, and large models to provide personalized recommendations, optimize resource allocation, and enhance user experiences. By integrating cultural heritage data with tourism services, they create a seamless connection between cultural preservation and tourism. The ability to offer tailored cultural experiences to visitors based on their interests, behaviors, and locations has become an essential feature of modern cultural tourism services. Additionally, by generating new opportunities for public engagement and revenue, these platforms contribute to the sustainable development of cultural heritage preservation and the tourism industry. AI plays a crucial role in empowering integrated cultural tourism service platforms. In particular, machine learning algorithms are increasingly used to analyze user preferences and historical data, thereby improving the accuracy and efficiency of cultural heritage management and protection. For instance, AI algorithms can predict potential risks to cultural heritage sites and suggest preventive measures to mitigate damage. Moreover, AI enables smarter and more efficient data management, facilitating the long-term preservation and sharing of cultural heritage resources. These technologies drive the ongoing digital transformation of cultural tourism, opening up new possibilities for cultural dissemination and public engagement. Finally, this paper discusses the potential of AI and big data technologies to drive the continued evolution of cultural heritage digitalization and cultural tourism visualization services.
As AI and big data technologies continue to develop, their applications in the cultural heritage field are expected to expand further. For example, predictive analytics could be used to forecast the deterioration of cultural artifacts and historical sites, enabling more proactive preservation strategies. Furthermore, the ongoing advancement of AR and VR technologies is expected to enhance the immersive experience for users, enabling them to engage with cultural heritage in even more dynamic and interactive ways. The integration of AI, big data, and immersive technologies is paving the way for a more personalized and sustainable future for digital cultural tourism services, in which users can enjoy a deeper connection to culture and heritage. In conclusion, digital cultural tourism services have become a vital force in the integration of the cultural and tourism industries. In the future, integrated cultural tourism service platforms are expected to drive the intelligent analysis of cultural heritage data, predictive protection, and multiscenario applications, further advancing the personalized, diverse, and sustainable development of digital cultural heritage and cultural tourism visualization technologies. As AI and big data technologies continue to evolve, their roles in shaping the future of cultural heritage protection and dissemination will become even more significant, ensuring that cultural heritage is preserved and made accessible for generations to come.
关键词:cultural heritage;digital technology;integration of culture and tourism;artificial intelligence (AI);visualization services
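As a hedged illustration of the personalized recommendation idea discussed in the abstract above, the following minimal content-based sketch ranks heritage items by cosine similarity between a visitor's interest profile and hand-assigned tag vectors. The tags, items, and profile are entirely hypothetical (the real site names serve only as labels) and do not reflect any actual platform's algorithm.

```python
# Minimal content-based recommendation sketch for a cultural tourism
# platform; all tags, item vectors, and the visitor profile below are
# hypothetical illustrations.
import numpy as np

TAGS = ["architecture", "painting", "archaeology", "folklore", "vr_tour"]

# Hypothetical heritage items described as binary tag vectors.
ITEMS = {
    "Mogao Grottoes (VR tour)": np.array([0, 1, 1, 0, 1], dtype=float),
    "Forbidden City":           np.array([1, 1, 0, 0, 0], dtype=float),
    "Terracotta Army":          np.array([0, 0, 1, 1, 0], dtype=float),
}

def recommend(profile, items, k=2):
    """Rank items by cosine similarity between the visitor's interest
    profile and each item's tag vector; return the top k names."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    ranked = sorted(items, key=lambda name: cosine(profile, items[name]),
                    reverse=True)
    return ranked[:k]

# Visitor interested in painting and immersive (VR) experiences.
profile = np.array([0, 1, 0, 0, 1], dtype=float)
print(recommend(profile, ITEMS))  # -> ['Mogao Grottoes (VR tour)', 'Forbidden City']
```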