Latest Issue

    Vol. 31, Issue 6 (2026)
    • Preface

      Vol. 31, Issue 6, Pages: 1-2(2026)
      Preface
        
      Updated: 2026-06-18

      Image Processing and Perception

    • Chen Tong, Lu Ming, Shi Junqi, Cong Wuyang, Ding Dandan, Jia Chuanmin, Liu Jiaying, Liu Dong, Song Li, Ma Siwei, Yang You, Liu Wenyu, Cao Xun, Ma Zhan
      Vol. 31, Issue 6, Pages: 1595-1618(2026) DOI: 10.11834/jig.250627
      A review and frontier perspectives on end-to-end learned image and video coding
      Abstract: Over the past three decades, image and video coding technologies and their associated international standards have served as the foundational compression engines underpinning core internet-scale multimedia services, ranging from on-demand streaming and live broadcasting to real-time video conferencing and social media sharing. Traditional approaches have predominantly followed a rule-driven, modular paradigm, in which carefully engineered components (such as intra and inter prediction, block-based transforms such as the DCT and DWT, scalar quantization, entropy coding, and in-loop filtering) are jointly optimized under classical rate-distortion (R-D) theory. This methodology, refined through successive generations of standards, including JPEG, H.26x, MPEG, and AVS, has achieved remarkable efficiency and interoperability through the coordinated efforts of standardization bodies. However, over the past decade, the rapid advancement of deep learning has catalyzed a paradigm shift, resulting in the emergence of end-to-end learned image and video compression. End-to-end trainable systems, empowered by expressive neural architectures, large-scale public datasets, and mature training ecosystems, have demonstrated R-D performance that consistently surpasses conventional codecs on benchmark datasets. These learned systems are primarily built upon variational autoencoders (VAEs), which replace the handcrafted rules of traditional pipelines with a unified differentiable framework. In this architecture, an encoder uses analysis transforms to map image data into compact latent representations, while a decoder applies synthesis transforms to reconstruct the image. Unlike the linear transforms in traditional coding, these transforms leverage powerful neural networks, evolving from early convolutional neural networks (CNNs) to advanced architectures incorporating attention mechanisms, Transformers, and Mamba-based state-space models.
A critical challenge in this framework lies in the non-differentiable nature of quantization. Methods, such as additive uniform noise, are used during training to approximate quantization errors while maintaining differentiability to enable end-to-end optimization. Meanwhile, straight-through estimators are utilized to pass gradients directly through quantization layers. Although uniform quantization remains standard, recent advancements have explored vector quantization and non-uniform quantization strategies to further refine feature representation. The core of compression efficiency in these systems lies in entropy modeling, which estimates the probability distribution of latent variables to minimize the bitrate. This field has significantly evolved from early factorized models that assumed statistical independence among latents. The introduction of the hyperprior structure, which utilizes auxiliary latent variables to model the spatial distribution parameters of the primary latents, marked a significant milestone in capturing dependencies. Subsequent innovations introduced autoregressive modeling, which predicts current features based on causal contexts in spatial or channel dimensions, further enhancing probability estimation accuracy. Recently, hierarchical autoregressive models have been developed to capture global and local contexts in a coarse-to-fine manner, pushing the boundaries of feature compactness and coding efficiency. Furthermore, the optimization objectives have expanded beyond pixel-level fidelity metrics, such as MSE and MS-SSIM, to include perceptual metrics and adversarial losses, allowing for a trade-off between signal distortion and perceptual quality. Beyond theoretical performance, the transition of learned coding from academic exploration to industrial application requires addressing practical dimensions, such as variable rate control, hardware efficiency, and robustness. 
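The interplay between quantization and entropy modeling described above can be illustrated with a minimal Python sketch. It assumes a conditional Gaussian entropy model in which each latent has a mean `mu` and scale `sigma` (the kind of parameters a hyperprior might predict), uses additive uniform noise as the training-time stand-in for rounding, and weights distortion with a hypothetical multiplier `lam`. All function names are illustrative and are not taken from any specific codec or from the reviewed paper.

```python
import math
import random


def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def quantize(y, training, rng):
    """Training: add U(-0.5, 0.5) noise (differentiable proxy). Inference: round."""
    if training:
        return y + rng.uniform(-0.5, 0.5)
    return float(round(y))


def bits_for_symbol(y_hat, mu, sigma):
    """Estimated code length: probability mass of the unit-width bin around y_hat."""
    p = gaussian_cdf(y_hat + 0.5, mu, sigma) - gaussian_cdf(y_hat - 0.5, mu, sigma)
    return -math.log2(max(p, 1e-12))  # clamp to avoid log(0)


def rd_loss(rate_bits, lam, distortion):
    """Rate-distortion Lagrangian R + lam * D (lam is an assumed trade-off weight)."""
    return rate_bits + lam * distortion


if __name__ == "__main__":
    rng = random.Random(0)
    latents = [0.2, 2.7, -1.4]
    params = [(0.0, 1.0), (3.0, 0.5), (-1.0, 1.0)]  # assumed (mu, sigma) per latent
    y_hat = [quantize(y, training=False, rng=rng) for y in latents]
    rate = sum(bits_for_symbol(q, mu, s) for q, (mu, s) in zip(y_hat, params))
    print("quantized:", y_hat, "estimated bits:", round(rate, 3))
```

The sketch shows why better probability estimates lower the bitrate: a latent whose predicted mean is close to its quantized value falls in a high-probability bin and therefore costs fewer bits.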
Researchers have developed mechanisms involving multi-scale decomposition and feature modulation, where quality factors or maps scale latent variables or intermediate features, to support variable bitrates within a single model. Rate control algorithms have also advanced, utilizing iterative search strategies or deep modeling of the rate-parameter relationship to meet specific bandwidth constraints. Model quantization techniques, including quantization-aware training and post-training quantization, are utilized to convert floating-point models into fixed-point integer operations, ensuring cross-platform consistency, reducing computational overhead, and facilitating deployment on commodity hardware. Furthermore, robust coding frameworks are designed to defend against adversarial attacks and transmission errors through training regularization and input preprocessing, addressing the vulnerability of neural networks to perturbations. These technical advancements have culminated in the integration of learned compression into formal international standards. Two landmark standards have recently emerged: JPEG AI, ratified by ITU-T as T.840.1, and IEEE 1857.11-2024. Although both standards adopt VAE-based architectures, they differ in design philosophy. JPEG AI utilizes a multi-branch network design and emphasizes subjective quality and machine-task compatibility, optimizing for perception-oriented metrics, such as MS-SSIM and VMAF. By contrast, IEEE 1857.11 focuses on objective gains in PSNR and MS-SSIM, offering tiered complexity profiles (base, main, and high) to adapt to different computational capabilities. Both standards have established rigorous training and evaluation protocols, including the use of specific datasets, such as Kodak and dedicated robustness benchmarks, to ensure a fair comparison and reproducibility. The principles of learned image coding have naturally extended to video coding, although with unique challenges in temporal modeling. 
The evolution of neural video coding can be categorized into three developmental phases. The first phase involved hybrid approaches that replaced specific modules, such as intra prediction, with learned networks while retaining the traditional motion-compensated residual coding framework. The second phase progressed toward conditional inter-frame coding, utilizing learned optical flow networks for motion estimation and warping to generate temporal contexts. The third and most recent phase marks a shift toward unified probabilistic frameworks that entirely eliminate explicit motion estimation. These systems leverage hierarchical spatial-temporal priors to perform joint intra and inter prediction within a single model, achieving performance that rivals or exceeds the latest H.266/VVC standard while approaching real-time processing speeds on GPUs. Future directions indicate that the field is converging toward two major trends: task-aware coding and generative integration. Task-aware coding aims to support human vision and machine perception from a single bitstream, aligning with the biological principle of “compression as intelligence”, where compact representations facilitate diverse downstream cognitive tasks. Furthermore, the integration of generative models, such as diffusion and large multimodal models, is enabling ultra-low-bitrate reconstruction with high semantic fidelity, fundamentally altering the rate-distortion-perception trade-off. This report synthesizes these technical, practical, and standardization advances to provide a comprehensive perspective. Finally, the future of intelligent compression lies in establishing a new foundation for multimodal, task-agnostic, and semantically aware visual communication.  
      Keywords: neural image compression; variational autoencoder (VAE); rate-distortion (R-D); practicality; standardization
      Updated: 2026-06-18
    • Research progress of 3D point cloud coding and communication

      Yuan Hui, Ding Dandan, Zhang Wei, Gao Wei, Xu Yiling, Liu Qi, Su Honglei, Liu Hao, Ma Zhan, Yang You, Liu Wenyu
      Vol. 31, Issue 6, Pages: 1619-1670(2026) DOI: 10.11834/jig.250625
      Research progress of 3D point cloud coding and communication
      Abstract: 3D point clouds constitute the quintessential digital representation of the physical world and serve as the foundational data format for the emerging era of spatial computing and digital transformation, propelled by advances in high-precision sensing, high-performance computing, and next-generation communication networks. These data are indispensable across a wide range of frontier applications, from the precise environmental mapping required for autonomous driving and the preservation of cultural heritage to the immersive interactive experiences of virtual reality and the complex simulations found in digital twins and deep space exploration. A point cloud is a massive collection of unordered points, each encoding precise geometric coordinates along with a rich set of attributes, such as color, reflectance, and surface normals, which allow for a high-fidelity reconstruction of 3D scenes that far exceeds the capabilities of traditional 2D imagery. However, this comprehensive spatial representation comes at the cost of an enormous data volume. For instance, a single frame of a large-scale scene can contain millions of points with high-precision geometry and attribute information, which poses severe challenges to existing storage capabilities and transmission bandwidth. Consequently, high-efficiency 3D point cloud coding and communication systems have become a critical research focus for industry and academia, enabling the scalable deployment of 3D visual services. This study provides a comprehensive review of the state-of-the-art developments in the field, systematically analyzing advancements from industrial standardization efforts to cutting-edge academic research. First, from an industrial perspective, this study elucidates the technical evolution and architectural differences between international and domestic standards.
The Moving Picture Experts Group (MPEG) has led international standardization through the development of two distinct frameworks: video-based point cloud compression (V-PCC) and geometry-based point cloud compression (G-PCC). V-PCC is designed to leverage the mature and highly optimized hardware ecosystem of existing video codecs by projecting 3D data onto 2D planes, generating occupancy, geometry, and attribute maps that can be compressed using standards such as High Efficiency Video Coding (HEVC), making it an ideal solution for dense and dynamic content, such as telepresence and volumetric video. Meanwhile, G-PCC addresses the unique characteristics of sparse data typical of LiDAR sensors by using native 3D coding structures, including octrees and prediction trees, to model geometry directly in the spatial domain without projection, while using sophisticated techniques, such as the region-adaptive hierarchical transform (RAHT), to compress attributes. Alongside these international developments, the Audio Video Coding Standard Workgroup of China (AVS) has made significant strides in establishing domestic standards that compete on the global stage by introducing novel optimizations specifically tailored for LiDAR scenarios, such as spherical-coordinate prediction trees that align with the scanning mechanisms of sensors and semantic-aware coding strategies that prioritize objects of interest. Second, this study comprehensively surveys academic research progress across four dimensions of the communication system: compression coding, sampling and enhancement, quality assessment, and transmission control. In the realm of compression coding, this study highlights the paradigm shift from traditional signal processing to deep learning architectures.
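The octree structure mentioned above for native 3D geometry coding can be sketched in a few lines of Python. The sketch serializes a voxelized point cloud into one 8-bit occupancy code per node in breadth-first order, which is the symbol stream that an entropy coder would then compress. The child bit ordering (x, y, z from most to least significant bit) and the function names are assumptions made for illustration. They do not reproduce the syntax of the G-PCC specification.

```python
from collections import deque


def child_index(point, origin, half):
    """Index (0..7) of the child octant containing the point (assumed bit order x, y, z)."""
    x, y, z = point
    ox, oy, oz = origin
    return (int(x >= ox + half) << 2) | (int(y >= oy + half) << 1) | int(z >= oz + half)


def encode_octree(points, origin, size, max_depth):
    """Return breadth-first occupancy codes for a cube of edge `size` at `origin`."""
    codes = []
    queue = deque([(points, origin, size, 0)])
    while queue:
        pts, org, sz, depth = queue.popleft()
        if depth == max_depth:
            continue  # leaf level reached: nothing further to signal
        half = sz / 2
        buckets = {}
        for p in pts:
            buckets.setdefault(child_index(p, org, half), []).append(p)
        # One bit per occupied child octant.
        codes.append(sum(1 << i for i in buckets))
        for i in sorted(buckets):
            child_origin = (
                org[0] + half * ((i >> 2) & 1),
                org[1] + half * ((i >> 1) & 1),
                org[2] + half * (i & 1),
            )
            queue.append((buckets[i], child_origin, half, depth + 1))
    return codes


if __name__ == "__main__":
    cloud = [(0, 0, 0), (3, 3, 3)]
    print(encode_octree(cloud, origin=(0, 0, 0), size=4, max_depth=2))
```

Only occupied octants are subdivided, so empty space costs nothing beyond a zero bit in the parent code. This is one reason the structure suits sparse LiDAR data.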
In geometry compression, this study traces the evolution from simple octree-based entropy coding to sophisticated methods using voxel-based autoencoders, sparse convolutions, and Transformers, which effectively capture long-range spatial dependencies and local geometric details to achieve superior rate-distortion performance. In attribute compression, which often consumes the majority of the bitstream, this study examines the evolution from region-adaptive hierarchical transforms and lifting schemes to learnable transforms and joint geometry-attribute coding frameworks that exploit the correlation between spatial structures and color information to minimize redundancy. Regarding sampling and enhancement, this study analyzes strategies for addressing the sparse, unstructured, and non-uniform nature of point clouds. It reviews up-sampling algorithms that reconstruct dense surfaces from sparse inputs and quality enhancement techniques, such as graph signal processing and neural network-based filtering, that mitigate compression artifacts and restore visual fidelity. In the domain of quality assessment (QA), this study traces the development of metrics from simple geometric measures, such as point-to-point (p2p) and point-to-plane (p2pl) distances, to complex perceptual evaluations. It emphasizes the recent surge in projection-based and multi-modal fusion methods and highlights the critical advancement of no-reference (NR) quality assessment using deep learning, an area where domestic researchers have shown significant leadership by developing models that accurately predict human visual perception without requiring pristine reference data.
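The two geometric measures named above can be made concrete with a small Python sketch. For each point of a degraded cloud it finds the nearest reference point by brute force, then accumulates the squared error vector length (point-to-point) and the squared projection of that error onto the reference normal (point-to-plane). The function names are hypothetical, only one direction of the comparison is computed, and the symmetric pooling and PSNR conversion used in standard evaluation tools are deliberately left out.

```python
def nearest_index(point, cloud):
    """Index of the point in `cloud` closest to `point` (brute-force search)."""
    return min(
        range(len(cloud)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, cloud[i])),
    )


def p2p_and_p2pl(degraded, reference, ref_normals):
    """Mean squared point-to-point and point-to-plane errors, degraded -> reference."""
    p2p_sum = 0.0
    p2pl_sum = 0.0
    for p in degraded:
        j = nearest_index(p, reference)
        error = [a - b for a, b in zip(p, reference[j])]
        # Point-to-point: full squared length of the error vector.
        p2p_sum += sum(c * c for c in error)
        # Point-to-plane: squared component of the error along the unit normal.
        along_normal = sum(c * n for c, n in zip(error, ref_normals[j]))
        p2pl_sum += along_normal * along_normal
    count = len(degraded)
    return p2p_sum / count, p2pl_sum / count


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    normals = [(0.0, 0.0, 1.0)] * 3  # flat patch in the z = 0 plane
    degraded = [(0.1, 0.0, 0.2), (1.0, 0.0, 0.0), (0.0, 1.0, -0.2)]
    print(p2p_and_p2pl(degraded, reference, normals))
```

In the example the first degraded point slides partly along the surface, so its tangential offset counts toward the point-to-point error but not toward the point-to-plane error.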
In transmission control, this study discusses mechanisms to ensure reliable delivery over bandwidth-constrained and error-prone networks, including fine-grained rate control algorithms that dynamically adjust quantization parameters based on content analysis, and joint source-channel coding (JSCC) schemes that optimize the trade-off between compression efficiency and error resilience. Furthermore, this study explores emerging research on semantic communication, which proposes a shift from signal-level fidelity to semantic-level understanding, optimizing transmission for machine vision tasks rather than mere reconstruction. In addition, this study critically analyzes the disparity between domestic and international research landscapes. Although the international community, led by MPEG, maintains a mature standardization ecosystem with comprehensive software libraries and datasets, Chinese researchers have established a significant innovative advantage in AI-driven compression algorithms, particularly for dynamic and LiDAR point clouds, as well as in blind quality assessment methodologies. Finally, this study offers forward-looking predictions on future development trends, highlighting several key directions: improving attribute compression efficiency to match geometry coding gains; minimizing the computational complexity of deep learning models for real-time edge deployment; developing unified frameworks that integrate sampling, compression, and enhancement; and evolving toward semantic-aware communication architectures that support the next generation of intelligent systems. This study aims to provide a comprehensive and in-depth technical reference for researchers and engineers, fostering further innovation in the rapidly evolving field of 3D point cloud coding and communication. The methods discussed above are collected at https://github.com/3DPCC/Point-Cloud-Coding-and-Transmission.
      Keywords: point cloud compression; point cloud processing; coding standard; point cloud sampling; attribute quality enhancement; quality assessment; rate control; joint source-channel coding; 3D point cloud
      Updated: 2026-06-18
    • Yang Jian, Chen Jie, Xu Huaping, Wang Xiaoliang, You Ya’nan, Feng Xiao
      Vol. 31, Issue 6, Pages: 1671-1688(2026) DOI: 10.11834/jig.250648
      A review of joint target detection and recognition techniques for microwave and optical remote sensing imagery
      Abstract: With the rapid development of earth observation technology, rapid and accurate detection and recognition of specific targets from massive volumes of remote sensing images has become a critical task in environmental monitoring, disaster assessment, national defense security, and other related fields. Optical images and microwave images are the most common types of remote sensing images. Joint target detection and recognition using optical and microwave images exploits their complementary advantages and effectively overcomes the limitations of obtaining target information from a single type of sensor. It has important theoretical value and promising application prospects in breaking through the performance bottleneck of single-type remote sensing and improving the interpretation ability for targets in complex environments. This work reviews the progress of research on joint target detection and recognition using microwave and optical remote sensing images. First, the characteristics of the two types of remote sensing images and the common processing procedures are described. Early approaches are usually characterized by a multistage process involving feature engineering and fusion at the feature or decision level by using classifiers, such as support vector machines. With the widespread application of deep learning in various fields, end-to-end intelligent interpretation dominated by deep learning paradigms is now widely adopted in joint target detection and recognition. The rise of convolutional neural networks has revolutionized the field. These models can learn hierarchical and discriminative feature representations automatically from raw pixel data. Despite the notable strides brought by deep learning, the joint interpretation of microwave and optical remote sensing images remains a formidable challenge. Second, a thorough analysis of the main challenges currently faced in this field is conducted.
The foremost issue is the heterogeneous modality gap caused by the discrepancy in imaging mechanisms and the resulting feature representations. Optical sensors usually record the sunlight reflected by objects on land and sea surfaces. Microwave images include active and passive microwave remote sensing images. Active microwave remote sensing images are usually obtained by synthetic aperture radar, which captures the microwave-band backscatter of objects on land and sea surfaces, whereas passive microwave remote sensing images record the microwave radiation emitted by objects. A target may be visually distinct in an optical image but appear as a collection of a few bright scatterers in a microwave image. Another issue is the imbalance between the optical and microwave image datasets available for model training. Optical image datasets are more numerous than microwave image datasets, and their resolution is usually higher. The scarcity of high-quality, large-scale, pixel-to-pixel co-registered, and accurately annotated datasets for multimodal training is a persistent difficulty for joint target detection and recognition. Furthermore, spatiotemporal asynchrony poses a substantial practical barrier. Microwave and optical remote sensing images are rarely acquired simultaneously or from the same viewpoint, leading to temporal changes, such as target movement, and geometric misalignments that require data association and sophisticated registration techniques. Another challenge is small or weak target detection and recognition in complex backgrounds. Small or weak targets occupy only a few pixels and are likely to be submerged in background clutter, such as high sea-state wave reflections in microwave images or dense urban textures in optical images.
Based on this assessment of challenges, the main technological approaches in two primary application domains, namely the marine and terrestrial fields, are analyzed. In the marine field, the detection and recognition of ships are crucial for combating illegal activities and for naval defense. Four key research routes are analyzed for ships: feature fusion methods, the most common route, which use various network architectures that extract features from each modality separately at different network depths before fusing them; knowledge-driven methods, which incorporate a priori knowledge or physical models, for example by using microwave images to detect potential targets in all-weather conditions and high-resolution optical images for fine-grained identification, or by using physical scattering models to regularize a deep learning network; fine-grained detection and recognition methods for complex scenes, which focus on the robust detection and recognition of small or weak targets and arbitrarily oriented targets, especially in cluttered, high-sea-state environments; and indirect detection and recognition methods based on ship wakes, which exploit the distinctive V-shaped wakes left by moving vessels that can be captured by microwave and optical sensors, additionally allow ship speed and heading to be estimated, and are particularly valuable for the detection and recognition of stealthy or low-observable targets. In the terrestrial field, the targets of concern are mainly aircraft, vehicles, and infrastructure. Three major technological avenues are discussed for land targets. The first relates to feature fusion methods, which face different challenges from those in the marine field because land clutter presents a more heterogeneous background than the ocean.
The second avenue is knowledge transfer and distillation methods, which can use optical images with abundant labeled data to train a teacher network and distill its learned knowledge into a student network that operates on microwave images or fused data. The third avenue is small or weak target detection and recognition in complex scenes, which employs more complex datasets or more advanced methods to address the challenge of small or weak targets on land. In addition, commonly used performance evaluation metrics and some available public datasets are presented. Future development trends of joint target detection and recognition techniques for remote sensing images are also discussed.
      Keywords: remote sensing; optical imagery; synthetic aperture radar (SAR); target detection; target recognition; information fusion; deep learning; knowledge-driven
      Updated: 2026-06-18
    • Biometrics disciplinary development report (2021—2025)

      Feng Jianjiang, Jia Wei, Li Qi, Cui Zhe, Zhao Cairong, Lei Zhen, Wang Caiyong, Kang Wenxiong, Yu Shiqi, Fei Lunke, Li Xiaobai, Ye Mang, Wei Jianze, Cao Shiwen, Sun Shibo, Xie Tianming, Zheng Weishi, Yang Hongyu, Huang Junduan, Huang Di, Sun Zhenan
      Vol. 31, Issue 6, Pages: 1689-1740(2026) DOI: 10.11834/jig.260069
      Biometrics disciplinary development report (2021—2025)
      Abstract: Biometric recognition has become a foundational enabling technology of modern digital society, serving as core infrastructure for identity authentication, access control, and human-centered intelligence. By exploiting intrinsic physiological and behavioral characteristics, biometrics offers advantages over traditional knowledge- or token-based authentication mechanisms, including stronger security, better usability, and greater resistance to impersonation. Over the past decade, and especially since 2021, rapid advances in deep learning, sensing technologies, and large-scale data resources have profoundly reshaped the landscape of biometric research and application. Biometric systems have evolved from task-specific pipelines based on handcrafted features into data-driven, end-to-end intelligent systems capable of operating in complex, unconstrained, and large-scale real-world environments. This report provides a comprehensive review of the development of biometric recognition technologies from 2021 to 2025, covering both methodological advances and emerging application paradigms. We focus on major biometric modalities, including face, iris, fingerprint, palmprint, finger and palm vein, body, gait, and person re-identification, while also highlighting the increasingly critical roles of security, privacy protection, and trustworthiness in biometric systems. Face recognition remains the most widely deployed biometric modality because of its non-contact nature, low acquisition cost, and broad social acceptance. Recent progress has been driven by innovations in network architectures, large-scale training data, and discriminative loss functions. Convolutional neural networks and vision transformers have substantially improved representation capacity, while margin-based and quality-aware losses have enhanced intra-class compactness and inter-class separability.
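How a margin-based loss tightens classes can be illustrated with a short Python sketch. The abstract does not name a specific loss, so the additive angular margin used here (in the style of ArcFace) is an illustrative choice, and `scale` and `margin` are assumed hyperparameters. The sketch also omits the handling needed when the angle plus the margin exceeds pi.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vector))
    return [c / norm for c in vector]


def arc_margin_logits(embedding, class_weights, target, scale=64.0, margin=0.5):
    """Cosine logits with an additive angular margin applied to the target class."""
    e = normalize(embedding)
    logits = []
    for k, weight in enumerate(class_weights):
        cosine = sum(a * b for a, b in zip(e, normalize(weight)))
        cosine = max(-1.0, min(1.0, cosine))  # guard acos against rounding error
        if k == target:
            # Penalize the true class by enlarging its angle to the embedding.
            cosine = math.cos(math.acos(cosine) + margin)
        logits.append(scale * cosine)
    return logits


def cross_entropy(logits, target):
    """Softmax cross-entropy computed with the log-sum-exp trick."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0]]  # one assumed center per identity
    sample = [1.0, 1.0]                 # embedding halfway between both centers
    with_margin = arc_margin_logits(sample, weights, target=0, scale=10.0)
    without = arc_margin_logits(sample, weights, target=0, scale=10.0, margin=0.0)
    print(cross_entropy(with_margin, 0), cross_entropy(without, 0))
```

Because the margin lowers the target logit, an embedding must lie well inside its own class region before the loss becomes small, which is what pushes classes to be compact and well separated.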
At the same time, face detection and alignment have advanced toward robust performance under extreme conditions such as low resolution, severe illumination variation, occlusion, and large pose changes. Beyond recognition, face generation and synthesis have emerged as both enabling technologies and security challenges. Generative adversarial networks and diffusion models now support high-fidelity, controllable, and even 3D-aware face generation, facilitating data augmentation, virtual humans, and animation while simultaneously introducing new threats in the form of deepfakes and identity spoofing. Iris recognition continues to be regarded as one of the most secure biometric modalities because of the uniqueness and stability of iris texture. In recent years, research has shifted from controlled laboratory settings toward less constrained and mobile scenarios. Advances in iris acquisition include visible-light imaging, mobile-device capture, and near-eye sensing in VR/AR environments. Deep-learning-based iris segmentation and localization methods have greatly improved robustness to noise, occlusion, and cross-domain variation. In parallel, iris feature representation has evolved from handcrafted binary codes toward deep embeddings and hybrid models that preserve compatibility with traditional matching schemes. Synthetic iris data generation has gained attention as a way to address data scarcity and privacy constraints, although its impact on large-scale recognition performance and its potential security implications remain open research questions. Fingerprint recognition, one of the most mature biometric technologies, has undergone a significant transformation with the adoption of deep learning. Modern fingerprint systems address challenges such as low-quality latent prints, partial fingerprints, distortion, and cross-sensor variability through deep enhancement, minutiae extraction, dense descriptors, and multi-stage matching frameworks. 
The emergence of new fingerprint modalities, including contactless fingerprints, 3D fingerprints, and optical coherence tomography (OCT)-based internal fingerprints, has expanded the representational space of fingerprint biometrics and improved robustness to distortion and spoofing. At the same time, large-scale real and synthetic fingerprint datasets have enabled more systematic training and evaluation of deep models. Palmprint and palm-vein recognition have experienced rapid growth in both research and large-scale deployment. Advances in deep learning have addressed long-standing challenges in region-of-interest extraction, alignment, and feature robustness, enabling commercial applications such as contactless palm-payment systems. Multimodal fusion of palmprint and vein patterns has further improved recognition accuracy and security. In addition, progress in palm image synthesis and 3D palm modeling has opened new directions for data augmentation and unconstrained recognition. Body-based biometrics, including person re-identification and gait recognition, are playing an increasingly important role in surveillance and public-safety scenarios where close-range or cooperative acquisition is not feasible. Recent work has focused on cross-view, cross-domain, and long-term recognition under clothing changes, occlusion, and low-resolution conditions. Deep spatiotemporal modeling, attention mechanisms, and sequence-level representations have significantly improved robustness, while also raising concerns about privacy, fairness, and ethical deployment. Alongside performance improvements, security and privacy have become central themes in biometric research. Biometric systems are increasingly exposed to sophisticated attacks, including presentation attacks, adversarial examples, template inversion, and deepfake-based impersonation. 
Therefore, substantial effort has been devoted to spoof detection, adversarial defense, secure template protection, and cancellable biometrics. Privacy-preserving techniques such as template transformation, encryption, and federated or decentralized learning are gaining importance as regulatory requirements and public awareness continue to intensify. Ensuring that biometric systems are not only accurate but also trustworthy, explainable, and compliant with data-protection regulations is now a core research objective. Beyond identity authentication, biometrics is expanding toward broader human-centered applications, including human-computer interaction, healthcare monitoring, behavioral analysis, and immersive virtual environments. This shift reflects a transition from “who you are” to “how you are,” in which biometric signals contribute to comprehensive perception and intelligent interaction rather than mere identification. In summary, the period from 2021 to 2025 has witnessed rapid and multifaceted progress in biometric recognition. Advances in deep learning, sensing, and data generation have substantially improved accuracy, robustness, and scalability across modalities, while also introducing new challenges related to security, privacy, and societal impact. By systematically reviewing recent developments across core biometric technologies and security frameworks, this report aims to provide a holistic perspective on the current state of the field and to outline promising directions for future research and deployment. We hope this work will serve as a valuable reference for researchers, engineers, and policymakers seeking to understand and shape the next generation of biometric systems.  
      Keywords: biometrics; face recognition; iris recognition; fingerprint and palmprint recognition; finger and palm vein recognition; person re-identification; gait recognition; spoof detection
      Updated: 2026-06-18
    • Deep learning for LiDAR point cloud processing: a survey

      Ao Sheng, Wen Chenglu, Li Wen, Liu Dunqiang, Xing Leyuan, Li Mingzhe, Guo Yulan, Wang Cheng
      Vol. 31, Issue 6, Pages: 1741-1762(2026) DOI: 10.11834/jig.250664
      Deep learning for LiDAR point cloud processing: a survey
      Abstract: LiDAR, as a high-precision 3D sensing technology, has become a cornerstone in a wide array of intelligent systems such as autonomous vehicles, robotics, and augmented reality. Compared with traditional RGB or RGB-D sensors, LiDAR offers superior performance in long-range depth estimation, illumination invariance, and structural scene understanding, particularly under challenging environmental conditions. With the surge in AI-driven perception, the intelligent processing of LiDAR point clouds has rapidly emerged as a key research frontier. This paper provides a systematic survey of recent developments in LiDAR-based perception technologies, focusing on four representative tasks: 3D object detection, LiDAR localization, human motion capture, and language-guided spatial reasoning. In the domain of 3D object detection, deep learning models have evolved along three primary lines: point-based, voxel-based, and multiview approaches. Each presents unique trade-offs between geometric fidelity and computational efficiency. More recently, advanced architectures incorporating attention mechanisms, BEV representations, and multisensor fusion strategies have achieved significant improvements in accuracy and robustness. Label-efficient learning paradigms, such as semisupervised, self-supervised, and domain adaptive learning, have also gained traction to mitigate the high cost of annotated 3D data. LiDAR-based localization has progressed in absolute and relative positioning tasks. Absolute localization methods rely on map-based retrieval or direct pose regression using neural networks, often enhanced by feature descriptors or Transformer architectures. Relative localization, or LiDAR odometry, estimates frame-to-frame motion and is fundamental to LiDAR SLAM systems. Research has expanded into geometry-aware learning, differentiable pose estimation, and multitemporal consistency.
Domestic studies have demonstrated strong progress in real-time, lightweight localization solutions through efficient model design and self-supervised learning frameworks, especially for edge deployment. Human motion capture using LiDAR addresses the challenge of estimating dynamic human poses in sparse, noisy point clouds. The field has evolved from single-frame pose regression to more advanced spatiotemporal modeling techniques that integrate SMPL body priors, inverse kinematic solvers, and Transformer-based temporal encoders. Multimodal fusion with IMU and vision sensors has further enhanced robustness in occluded or long-range scenes. The construction of large-scale datasets and task-specific benchmarks has greatly supported research and practical applications in surveillance, animation, and sports analytics. Language-driven LiDAR reasoning represents a novel, rapidly developing task that combines natural language understanding with spatial localization. Models are designed to infer 3D positions or regions in a point cloud scene based on descriptive language inputs. Pioneering frameworks such as Text2Pos and Text2Loc adopt contrastive learning or coarse-to-fine alignment strategies, whereas newer approaches integrate scene graphs, multimodal Transformers, or large language models to enhance semantic comprehension. This direction supports applications in human-robot interaction, navigation, and open-world instruction following. Comparative analysis of international and domestic research reveals complementary emphases: International efforts are characterized by systematic theoretical modeling, dataset construction, and general-purpose frameworks, whereas domestic work emphasizes computational efficiency, real-world deployment, and task-specific performance. Notably, Chinese research has made significant strides in lightweight model design, regression-based localization, and motion capture under sparse LiDAR input. 
Looking forward, the future of intelligent LiDAR processing lies in three major trajectories: 1) algorithmic fusion, involving unified representation spaces across point clouds, images, and language, and enabling cross-modal semantic reasoning; 2) task expansion, pushing LiDAR perception beyond detection and mapping toward richer interaction, behavior understanding, and cognitive reasoning; and 3) system optimization, balancing accuracy, generalization, and efficiency for deployment in real-time, resource-constrained environments. Research in neural architecture search, unsupervised pretraining, and end-to-end multitask learning will be instrumental in meeting these goals. In conclusion, LiDAR intelligent processing is rapidly evolving into a comprehensive, interdisciplinary research field, integrating geometric computation, deep learning, and cross-modal cognition. With ongoing advancements in algorithms, data, and hardware systems, LiDAR will continue to be central to building safe, interpretable, and generalizable 3D intelligent perception systems. The algorithms, datasets, and evaluation metrics mentioned in this paper are summarized at https://github.com/aosheng1996/DL4LiDAR.  
      Keywords: LiDAR; 3D object detection; LiDAR localization; human motion capture; language-driven LiDAR reasoning
      Updated: 2026-06-18
    • Active speaker detection in videos: a survey

      Zhang Yuanhang, Yang Shuang, Shan Shiguang
      Vol. 31, Issue 6, Pages: 1763-1794(2026) DOI: 10.11834/jig.260107
      Abstract: Active speaker detection (ASD) aims to identify speakers and their active speech intervals within video sequences by leveraging both audio and visual modalities. ASD is a foundational technology for applications such as media content analysis, human-computer interaction, intelligent meeting systems, and audio-visual speech recognition. Despite substantial progress driven by deep learning since 2015, real-world deployment still faces major challenges, including visual occlusion, acoustic interference, overlapping speech, and dynamic camera motion. To address these developments and challenges, this survey reviews ASD research over the past 25 years and categorizes existing methods into vision-based and audio-visual approaches. Vision-based methods infer speech activity solely from visual cues such as lip contours, facial motion, and body gestures, making them useful when audio is unavailable or heavily corrupted. Although immune to acoustic degradation, they inherently struggle to distinguish true speech from non-speech mouth movements and remain highly sensitive to low image resolution, non-frontal head poses, and occlusion. Audio-visual methods, which dominate current research, exploit the complementary strengths of auditory and visual signals. This survey further divides them into three major paradigms. 1) Matching-based methods identify speakers by learning cross-modal correspondences, typically with limited manual annotation. This paradigm includes two routes: synchronization-based and identity-based association. Synchronization-based methods estimate short-term temporal alignment between lip motion and audio, often using contrastive learning to project audio and visual features into a shared embedding space. Although these methods benefit from self-supervised learning, they require tight audio-visual synchronization and can fail under desynchronization or in dubbed videos. Identity-based association methods instead emphasize long-term consistency.
They typically cluster speaker embeddings and facial feature sequences separately, then associate voices with faces using co-occurrence statistics or cross-modal face-voice matching networks. This route is more robust to dubbing, off-screen speech, and poor visual quality, such as in egocentric videos, but depends heavily on the accuracy of intermediate clustering. 2) Fusion-based classification methods formulate ASD as a fully supervised binary classification task—speaking versus non-speaking—for each candidate face at each time step. Their pipelines generally include four stages: feature extraction, feature fusion, temporal modeling, and final activity detection. Modern systems use large-scale pre-trained acoustic encoders and deep visual backbones for feature extraction. Dynamic fusion strategies, including cross-attention, gating networks, and uncertainty-aware adaptive fusion, have largely replaced simple static concatenation. Temporal modeling has evolved from local short-term processing to recurrent neural networks (RNNs), and more recently to global spatiotemporal reasoning with Transformers and graph neural networks (GNNs). By explicitly modeling interactions among candidate speakers and the broader scene context, fusion-based classification methods achieve state-of-the-art performance on most benchmarks. However, they require densely annotated data and are vulnerable to domain shift. 3) Hybrid methods combine the complementary strengths of matching and classification paradigms to handle more complex scenarios. By integrating short-term speech behavior, via synchronization or classification, with long-term identity verification through speaker profiles, hybrid systems can suppress interference from non-target speakers, overlapping voices, and off-screen narrators, thereby improving robustness in real-world environments. 
Beyond this methodological taxonomy, this survey also summarizes benchmark datasets and evaluation metrics commonly used in the ASD community. It traces the evolution of datasets from small, highly constrained laboratory recordings with limited participants to large-scale, in-the-wild benchmarks containing thousands of hours of video from movies, vlogs, egocentric wearable cameras, and surveillance footage. Common evaluation metrics, such as mean average precision (mAP), are also discussed. Finally, this survey highlights emerging technical trends and several persistent open problems. Although current state-of-the-art models achieve near-perfect performance on some benchmarks, they still show limited cross-dataset generalization, particularly across languages, domains, and extreme conditions. Moreover, existing systems still lack deep semantic understanding of conversational dynamics, including turn-taking, interruptions, and nonverbal social cues. Future work should therefore focus on building more inclusive datasets, developing data-efficient learning strategies, integrating large language models and vision-language models (LLMs/VLMs) for higher-level semantic reasoning, and designing lightweight architectures for edge deployment. Additional resources are available at https://github.com/VIPL-Audio-Visual-Speech-Understanding/Active-Speaker-Detection.  
      Keywords: active speaker detection (ASD); audio-visual information; multi-modal; deep learning; survey
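The abstract above notes that synchronization-based methods often use contrastive learning to project audio and visual features into a shared embedding space. As a rough illustration only, the following sketch computes an InfoNCE-style loss with NumPy. The function name `info_nce_loss`, the temperature value, and the convention that row i of each array belongs to the same clip are assumptions made for this example and are not taken from the survey.

```python
import numpy as np


def info_nce_loss(audio_emb, visual_emb, temperature=0.07):
    """Audio-to-visual InfoNCE loss (illustrative sketch).

    Row i of audio_emb and row i of visual_emb are assumed to come from
    the same, synchronized clip; every other row acts as a negative.
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)

    # Pairwise similarities, sharpened by the temperature.
    logits = (a @ v.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The matching (diagonal) pair should receive the highest probability.
    return float(-np.mean(np.diag(log_prob)))
```

A low loss means each audio embedding is closest to the visual embedding of its own clip; shifting one modality against the other, as happens in desynchronized or dubbed video, drives the loss up.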
      Updated: 2026-06-18
    • Research progress on object re-identification in special scenarios

      Han Qing, Li Longfei, Min Weidong
      Vol. 31, Issue 6, Pages: 1795-1822(2026) DOI: 10.11834/jig.260117
      Abstract: Object re-identification (Re-ID) is a pivotal technology for retrieving and associating specific targets across non-overlapping cameras and different time spans, and it serves as a core component of intelligent surveillance, smart transportation, and smart-city sensing systems. Over the past decade, Re-ID research has evolved dramatically, moving from hand-crafted features in its early stage to deep-learning-based automatic representation learning. With the recent emergence of large-scale vision-language models, the field is now advancing toward more generalizable and versatile solutions. This paper provides a systematic review of the foundations of object Re-ID, including its developmental trajectory, mainstream datasets, and evaluation metrics. It then focuses on six specialized directions that address challenges arising in complex real-world deployment: unsupervised object Re-ID, multi-spectral object Re-ID, cross-modality person Re-ID, occluded person Re-ID, clothes-changing person Re-ID, and group person Re-ID. For each direction, we summarize research progress and development trends, analyze representative state-of-the-art methods, and compare their performance. We also briefly review the current status of animal Re-ID, which is of substantial ecological importance. First, to address the high cost of manual annotation in practice, unsupervised object Re-ID aims to reduce or eliminate dependence on labeled data. This survey reviews two major streams in this area: domain-adaptation-based methods and fully unsupervised methods. Both typically rely on iterative optimization strategies in which clustering is used to generate pseudo-labels for unlabeled data. Current research focuses on mitigating label noise and learning more discriminative feature representations within unsupervised frameworks.
Second, to overcome the limitations of visible-light imagery in adverse conditions such as nighttime or heavy fog, multi-spectral object Re-ID integrates complementary cues from visible, near-infrared, and thermal infrared spectra. Current work mainly addresses the distribution gaps among spectral bands in order to exploit their complementarity and achieve effective cross-spectral alignment and aggregation. Third, we present an in-depth review of cross-modality person Re-ID, which performs retrieval using heterogeneous data beyond visible-light images. Based on the modalities involved, related studies can be divided into three sub-tasks. Visible-thermal Re-ID addresses the modality gap between day and night surveillance; mainstream methods focus on either style alignment through image generation or modality-invariant feature learning. Text-to-image Re-ID uses natural-language descriptions as queries when query images are unavailable, and current work emphasizes fine-grained semantic-visual interaction mechanisms. Sketch-to-photo Re-ID enables retrieval from intuitive visual sketches; its key challenge is the large stylistic variation inherent in hand-drawn sketches, and existing approaches therefore focus on improving style generalization. Fourth, because occlusion is pervasive in real-world scenarios, occluded person Re-ID has become an important research topic. We categorize mainstream methods into two groups. Visibility-aware methods improve matching by exploiting information from unoccluded regions, typically using human structural priors to locate visible parts or enabling the model to focus adaptively on visible discriminative cues. Feature-reconstruction-based methods instead seek to restore or hallucinate missing pedestrian information in occluded regions so that the resulting features are more discriminative. 
Fifth, to address the significant intra-class variation caused by clothing changes in long-term Re-ID, this paper reviews clothes-changing person Re-ID. Existing methods fall into two broad categories: those that extract clothing-agnostic features, such as facial characteristics, gait, and body shape, and those that explicitly disentangle clothing information from identity information. Sixth, because pedestrians often appear in groups in real scenes, we review group person Re-ID, which seeks to associate group images across cameras. In addition to the challenges found in single-person Re-ID, this task must handle variations in group membership, mutual occlusion, and changes in group layout. Current methods mainly use graph neural networks to model topological relationships among members or transformer-based architectures to learn higher-order statistical features and uncertainty representations for robust group matching. In addition, we provide a brief introduction to animal Re-ID, a field of growing importance for ecological conservation. This section outlines the technological evolution from traditional marker-based methods to modern deep-learning approaches and highlights the key shift toward non-invasive monitoring. Finally, the paper summarizes the development and trends of object Re-ID and discusses future directions, including the construction of large-scale datasets, research on large foundation models for general object Re-ID, and lightweight deployment strategies. These directions are essential for improving practical applicability and accelerating broad real-world adoption.  
      Keywords: object Re-ID in special scenarios; deep learning; unsupervised object Re-ID; multi-spectral object Re-ID; cross-modal person Re-ID; occluded person Re-ID; clothes-changing person Re-ID; group person Re-ID; animal Re-ID
      Updated: 2026-06-18
    • Yang Jingxiang, Zeng Jian’an, Diao Wenxiu, Xiao Liang
      Vol. 31, Issue 6, Pages: 1823-1846(2026) DOI: 10.11834/jig.260110
      Overview of compressive spectral imaging reconstruction driven by physical model and generative prior
      Abstract: A hyperspectral image (HSI) contains rich spatial-spectral information, which gives it high discriminative capability and a wide range of applications, such as remote sensing, geological examination, and medical diagnosis. Traditional spectral imaging technologies include whisk broom, push broom, and staring imaging. They suffer from bulky equipment, long acquisition times, and limited spatiotemporal resolution, which hinders their use in dynamic scenes and on moving platforms. Compressive spectral imaging has recently attracted considerable research interest. Coded aperture snapshot spectral imaging (CASSI) can capture a compressed measurement of a 3D HSI within a single exposure, and its high efficiency has made it a hot topic in computational imaging. One key technology in CASSI is HSI reconstruction, which aims to restore the latent HSI with high quality from the compressed measurement. Over the past decades, a number of HSI reconstruction algorithms have been proposed. In this overview, we comprehensively review recent advancements in spectral imaging and reconstruction methods. First, we analyze the physical process of compressive spectral imaging and formulate a spatial-spectral degradation model. Then, we model CASSI reconstruction as an ill-posed inverse problem that requires priors as regularization to constrain the solution space. From the perspective of the prior, we divide current HSI reconstruction techniques into four categories: 1) model-driven methods based on handcrafted priors, 2) data-driven methods based on deep learning networks, 3) model-data joint driven methods based on deep priors, and 4) methods based on the recently proposed generative diffusion prior. Through this structured analysis, the overview aims to offer insights into the core ideas, design paradigms, and evolution of the different methods; highlight persistent challenges; and provide an outlook on future development trends.
Model-driven methods rely on handcrafted priors, and various priors, such as total variation and sparsity, have been proposed as regularizers for the HSI reconstruction problem. They are mathematically interpretable and generalize to different imaging systems as long as the degradation model is accurate. However, handcrafted priors may be too simplistic to fully capture the complex spatial-spectral characteristics of HSIs. The iterative optimization of the reconstruction model is also computationally expensive for real-time applications, and its hyperparameters are difficult to tune. Data-driven methods use deep learning networks to learn the mapping between measurements and HSIs. Different networks (e.g., convolutional networks and Transformers) have been designed to exploit spatial-spectral features for HSI reconstruction. In general, a high-fidelity HSI can be inferred efficiently once complex data-driven features have been learned. However, such networks are black boxes with limited interpretability. Moreover, they may fail catastrophically when the spatial-spectral degradation encountered during inference is unknown or was not seen in training. Model-data joint driven methods combine the strengths of model-driven and data-driven methods. They originate from the traditional HSI reconstruction model but replace the handcrafted prior with an implicit deep prior. A classic optimization algorithm is used to solve the reconstruction model, and its iterations are unrolled into a deep network in which each iteration becomes one unfolding stage and the handcrafted prior is replaced with a learnable denoiser that acts as a deep proximal operator. The unrolled network is trained in an end-to-end manner. Because the network is designed under the guidance of imaging physics, it has higher interpretability and is more robust to varying degradations than purely data-driven methods.
By learning a deep prior, these methods achieve higher reconstruction quality than methods based on handcrafted priors. However, these networks can be regarded as discriminative models trained with regression losses. They tend to produce deterministic results that are effectively an average over the plausible ground truths, leading to blurry output and hindering the reconstruction of fine-grained image structures. Diffusion models can generate diverse and highly realistic content; a generative diffusion prior may therefore address this limitation, and it has demonstrated potential in HSI reconstruction. Furthermore, we select 12 mainstream HSI reconstruction methods and compare their performance on widely used datasets. Finally, on the basis of the experimental results, we discuss the shortcomings of existing work and outline future trends. Open problems include the difficulty of representing complex spatial-spectral features, limited generative capability and content distortion, and the disconnect among compressive imaging, HSI reconstruction, and downstream tasks. The objective of this review is to provide a comprehensive introduction to spectral imaging and reconstruction and to offer insights for future work. The experimental code and data can be found at https://doi.org/10.57760/sciencedb.j00240.00063 and https://github.com/DDXNJUST/Computational-Imaging.
      Keywords: compressive spectral imaging; computational reconstruction; imaging model; deep learning; model and data driven
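The abstract above refers to a spatial-spectral degradation model for CASSI without stating it. As a minimal sketch, assuming the commonly used single-disperser formulation (each spectral band is modulated by the coded aperture, shifted along one spatial axis by the disperser, and summed on the detector), the forward process can be written as below. The function name `cassi_forward`, the `step` parameter, and the choice of shift direction are assumptions for this example and may differ from the model used in the paper.

```python
import numpy as np


def cassi_forward(cube, mask, step=1):
    """Simulate a single-disperser CASSI snapshot (illustrative sketch).

    cube: (H, W, K) hyperspectral image with K spectral bands.
    mask: (H, W) coded aperture pattern, typically binary.
    step: assumed dispersion shift, in pixels, between adjacent bands.
    Returns a 2D measurement of shape (H, W + (K - 1) * step).
    """
    height, width, bands = cube.shape
    measurement = np.zeros((height, width + (bands - 1) * step))
    for k in range(bands):
        # Modulate band k with the mask, then shift it by k * step columns
        # before it is integrated on the detector.
        start = k * step
        measurement[:, start:start + width] += mask * cube[:, :, k]
    return measurement
```

Because K bands collapse into one 2D image, many different cubes produce the same measurement, which is why reconstruction is ill-posed and needs a prior.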
      Updated: 2026-06-18
    • Liu Yuhan, Ma Yapeng, Yang Jianwei, Wang Ziming, Aliha Aersi, Cao Huazhao, Wen Zixiao, Hu Shengran, Zhou Guangyao, Hu Yuxin
      Vol. 31, Issue 6, Pages: 1847-1874(2026) DOI: 10.11834/jig.260109
      Infrared dim and small moving target detection in recent years: an overview
      Abstract: Infrared search and tracking systems have emerged as pivotal technologies in the military and civilian domains due to their inherent advantages of all-weather and all-day imaging, flexible deployment, and high concealment capability. In modern defense and surveillance applications, such systems are crucial for early warning, security monitoring, missile guidance, and anti-access/area denial operations. Despite these advantages, the detection of weak and small moving aerial targets remains one of the most formidable challenges in infrared imaging. Factors such as limited sensor sensitivity, long imaging distances, complex and variable backgrounds, low signal-to-noise ratios, and sub-pixel target sizes exacerbate the difficulty of accurately identifying such targets. These targets frequently exhibit low contrast against a cluttered background, high maneuverability, and diverse motion patterns, making conventional detection approaches insufficient. This work presents a comprehensive overview of recent advances in infrared dim and small target detection, categorizing methods into three principal frameworks: traditional detection, low-rank sparse decomposition (LRSD), and deep learning frameworks. Traditional approaches leverage spatial, temporal, and transform-domain features to enhance target saliency, using techniques such as top-hat filtering, local contrast measures, gradient and derivative analyses, multi-scale strategies, and visual saliency models. Although they are effective in relatively simple imaging scenarios, these methods exhibit limitations in complex or dynamic backgrounds, often yielding high false alarm rates. Recent improvements incorporate adaptive windowing, entropy-based measures, and trajectory-consistent temporal modeling to mitigate the aforementioned issues. LRSD frameworks represent a significant evolution in detection methodology.
By modeling infrared imagery as the superposition of a low-rank background and sparse target components, these methods convert target detection into an optimization problem that is solvable via robust principal component analysis and tensor decomposition. Innovations in this domain include the development of the infrared patch image and infrared patch tensor models, weighted Lp and Schatten norms, nonnegative constraints, and total variation regularization, all of which contribute to enhanced background suppression, reduced false alarms, and improved detection robustness. Recent studies have further utilized spatiotemporal correlations and hierarchical subspace learning to increase efficiency and adaptivity, making these methods suitable for more challenging operational scenarios. Deep learning frameworks constitute the most recent and rapidly expanding paradigm for infrared weak target detection. End-to-end neural networks, including convolutional neural networks, attention-based modules, and transformer architectures, have been employed to learn discriminative features from infrared imagery automatically. Despite the inherent challenges posed by low-resolution, textureless targets with minimal appearance cues, the integration of contextual information, feature attention mechanisms, and specialized loss functions that address missed detections and false alarms has resulted in substantial performance gains. Hybrid approaches that combine handcrafted features with learned representations further improve detection reliability. Moreover, recent attention-guided architectures and multi-scale processing strategies help preserve small target information that is typically lost in conventional pooling operations, enabling high-fidelity detection in complex backgrounds. In addition to algorithmic innovations, this study reviews publicly available datasets, performance evaluation metrics, and comparative experimental studies across different detection methods.
Results indicate that although deep learning approaches offer superior detection performance in diverse and cluttered scenarios, LRSD-based methods remain competitive due to their interpretability and robustness in low signal-to-noise regimes. Although more susceptible to false alarms in complex environments, traditional detection frameworks continue to provide lightweight solutions for real-time applications where computational resources are constrained. This study also identifies several critical research trends and future directions. First, multimodal data fusion, which incorporates visible spectrum, infrared, and radar data, has emerged as a promising strategy for enhancing target detectability and reducing environmental ambiguity. Second, real-time performance and computational efficiency remain key challenges in LRSD and deep learning frameworks, motivating research into accelerated optimization algorithms and lightweight network architectures. Third, unsupervised and self-supervised learning approaches have begun to demonstrate potential in mitigating the scarcity of annotated infrared datasets, enabling scalable deployment in operational settings. Lastly, the integration of trajectory prediction, motion modeling, and adaptive feature extraction is increasingly recognized as essential for robust detection under rapid target maneuvers and evolving environmental conditions. In summary, this overview consolidates and analyzes the latest advances in infrared dim and small moving target detection, encompassing methodological evolution, experimental validation, and future research trajectories. By systematically classifying detection frameworks, highlighting key innovations, and identifying ongoing challenges, this work aims to provide researchers, engineers, and system designers with a comprehensive reference that informs the development of next-generation infrared search and tracking systems.
The insights presented herein not only advance the theoretical understanding of weak target detection but also offer practical guidance for improving operational capabilities in defense, surveillance, and aerospace applications worldwide. The convergence of traditional, LRSD, and deep learning methodologies, along with emerging trends in data fusion and adaptive modeling, signals a new era of highly reliable, automated, and context-aware infrared target detection systems that are set to meet increasingly stringent performance requirements. Related resources are available at https://github.com/Yoooohan/Collection-for-Infrared-dim-and-small-target-detection-methods and https://doi.org/10.57760/sciencedb.j00240.00068.
      Keywords: infrared search and track (IRST) system; infrared imaging detection; aerial moving target; infrared dim and small target detection; traditional detection method; low-rank and sparse decomposition (LRSD); deep learning
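The abstract above describes LRSD as splitting infrared data into a low-rank background plus sparse targets, solvable via robust principal component analysis. As a hedged sketch only, the code below applies a generic inexact augmented Lagrangian RPCA solver to a matrix `D`, which here stands in for a patch image. The function names, the default weight `lam = 1 / sqrt(max(m, n))`, and the penalty growth factor of 1.5 are conventional choices assumed for illustration; they are not the specific models or settings compared in the paper.

```python
import numpy as np


def soft_threshold(x, tau):
    """Elementwise shrinkage: the proximal operator of tau * |x|."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def rpca_ialm(D, lam=None, max_iter=500, tol=1e-7):
    """Split D into low-rank L (background) and sparse S (targets).

    Approximately solves min ||L||_* + lam * ||S||_1  s.t.  D = L + S
    with an inexact augmented Lagrangian scheme (illustrative sketch).
    """
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))  # conventional default weight
    norm_d = np.linalg.norm(D)
    mu = 1.25 / np.linalg.norm(D, 2)  # initial penalty parameter
    L = np.zeros_like(D)
    S = np.zeros_like(D)
    Y = np.zeros_like(D)  # Lagrange multiplier
    for _ in range(max_iter):
        # Low-rank update: singular value thresholding.
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * soft_threshold(sig, 1.0 / mu)) @ Vt
        # Sparse update: elementwise shrinkage.
        S = soft_threshold(D - L + Y / mu, lam / mu)
        # Dual ascent on the constraint D = L + S.
        residual = D - L - S
        Y = Y + mu * residual
        mu *= 1.5
        if np.linalg.norm(residual) <= tol * norm_d:
            break
    return L, S
```

In a detection setting, candidate targets would then be read off as the large-magnitude entries of `S`, usually followed by a threshold; the weighted norms and total variation terms mentioned in the abstract modify the two update steps rather than the overall structure.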
      Updated: 2026-06-18
    • Mamba-based image restoration: a comprehensive survey

      Jin Wushuai, Li Jie, Gao Xinbo
      Vol. 31, Issue 6, Pages: 1875-1896(2026) DOI: 10.11834/jig.260111
      Abstract: Image restoration aims to recover high-quality images from observations that are degraded by various factors, such as sensor noise, optical blur, spatial downsampling, atmospheric scattering, and insufficient illumination. As a fundamental low-level task in computer vision, it directly underpins the performance of downstream high-level tasks, including object detection, semantic segmentation, and scene understanding. It also exhibits significant practical value in medical imaging, remote sensing interpretation, surveillance, and consumer photography. The field has witnessed a clear paradigm evolution driven by the inherent tension between global modeling capability and computational efficiency. Convolutional neural networks (CNNs), exemplified by enhanced deep super-resolution networks and residual channel attention networks, offer efficient local feature extraction with translational equivariance but suffer from limited receptive fields that hinder long-range dependency modeling. Vision Transformers, represented by SwinIR, Restormer, and HAT, capture the global context through self-attention but incur quadratic computational complexity that becomes prohibitive for high-resolution input. The selective state-space model called Mamba, which was introduced in late 2023, resolves this fundamental contradiction by extending classical linear time-invariant state-space models (SSMs) into linear time-variant systems through an input-dependent selection mechanism, achieving a global receptive field with linear computational complexity.
Following the pioneering cross-scan module of VMamba and its systematic application to image restoration via MambaIR in early 2024, Mamba-based image restoration has rapidly emerged as an active research frontier, with substantial contributions published at top-tier venues, including the European Conference on Computer Vision, Institute of Electrical and Electronics Engineers (IEEE)/Computer Vision Foundation (CVF) Conference on Computer Vision and Pattern Recognition, IEEE/CVF International Conference on Computer Vision, Annual Conference on Neural Information Processing Systems, ACM International Conference on Multimedia, and AAAI Conference on Artificial Intelligence, and prestigious journals, such as IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Multimedia, and IEEE Transactions on Circuits and Systems for Video Technology. This study presents a systematic and comprehensive Chinese-language survey that is dedicated to Mamba-based image restoration methods. From a technical perspective, we identify and analyze the core challenge of adapting the inherently 1D Mamba to 2D image data. We provide an in-depth examination of representative 2D scanning strategies, including cross-scan, omnidirectional selective scan, nested S-shaped scan, and Hilbert scan, and systematically compare their performance-efficiency trade-offs across five dimensions: directional coverage, locality preservation, path continuity, information loss, and computational overhead. Our analysis reveals a clear evolutionary trajectory from brute-force multidirectional coverage toward quality-oriented single-pass designs that achieve comparable or superior information modeling with minimal computational cost. In addition to scanning strategies, we discuss multiple solutions that address the causality limitation of Mamba, which conflicts with the nondirectional prior that is inherent in image restoration.
These solutions include multidirectional scanning for implicit noncausal modeling, bidirectional scanning for forward-backward information fusion, the attentive state equation mechanism that fundamentally enables noncausal modeling by introducing global query capability into the SSM output equation, and cross-window interaction schemes. Building upon these analyses, we propose a unified analytical framework that is organized around four core design axes: scanning strategy, noncausal information injection, local modeling compensation, and prior knowledge fusion. This framework serves as an interpretive coordinate system for understanding intrinsic connections, complementary relationships, and design trade-offs among different methods. From a methodological perspective, we systematically review existing studies organized by task type and covering general image restoration, super-resolution, denoising, deblurring, deraining and dehazing, low-light enhancement, remote sensing and hyperspectral processing, and video restoration. Across these tasks, we identify six recurring architectural paradigms: pure Mamba backbone, CNN/Transformer-Mamba hybrid, U-Net with embedded Mamba, frequency-domain enhanced Mamba, lightweight Mamba, and diffusion model fused with Mamba. We analyze the applicable scenarios, technical characteristics, and representative instantiations for each paradigm, providing researchers with a structured map for architectural design decisions. From an evaluation perspective, we compile commonly used benchmark datasets for each subtask and establish a multidimensional evaluation system. 
This system encompasses full-reference metrics, such as peak signal-to-noise ratio, structural similarity index, and learned perceptual image patch similarity for pixel-level and perceptual quality assessment; no-reference metrics, such as the naturalness image quality evaluator and kernel inception distance, for real-world scenarios that lack ground truth; and model efficiency metrics, including parameters, floating-point operations, graphics processing unit memory, and inference time, which are particularly relevant given the linear complexity advantage of Mamba. We further provide task-specific metric selection recommendations to guide standardized and fair evaluation practices. Finally, we identify and discuss several core open challenges, including the absence of theoretical guidance for scanning strategy design, the immature hardware acceleration ecosystem that prevents the theoretical complexity advantage of Mamba from fully translating into practical speedup, insufficient generalization from synthetic to real-world degradations, difficulties in lightweight model design for edge deployment, and the lack of interpretability and visualization tools that are comparable to Transformer attention maps. We also outline promising future research directions, including native noncausal SSM variants, video and 3D restoration that utilize the inherent sequential modeling strength of Mamba, and the construction of unified evaluation benchmarks. This work aims to provide researchers with a thorough and in-depth reference, facilitating the continued advancement of Mamba-based image restoration techniques toward academic impact and real-world deployment.  
      Keywords: image restoration; Mamba; state space model (SSM); selective state space model; deep learning; image super-resolution; image denoising
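The cross-scan strategy named in the abstract above can be illustrated with a minimal NumPy sketch. It assumes the commonly described four-direction formulation (row-major, column-major, and the reverse of each). The function names `cross_scan` and `cross_merge` are hypothetical and are not taken from any surveyed codebase, and the selective state-space scan that would run on each sequence is omitted.

```python
import numpy as np


def cross_scan(feature_map: np.ndarray) -> np.ndarray:
    """Unfold an (H, W, C) feature map into four 1D token sequences.

    Returns an array of shape (4, H*W, C):
      0: row-major order
      1: column-major order
      2: reverse of sequence 0
      3: reverse of sequence 1
    """
    h, w, c = feature_map.shape
    row_major = feature_map.reshape(h * w, c)
    col_major = feature_map.transpose(1, 0, 2).reshape(h * w, c)
    return np.stack([row_major, col_major, row_major[::-1], col_major[::-1]])


def cross_merge(sequences: np.ndarray, h: int, w: int) -> np.ndarray:
    """Undo each scan order and sum the four sequences back onto the grid."""
    c = sequences.shape[-1]
    # Re-reverse the backward scans so they align with the forward ones.
    row = sequences[0] + sequences[2][::-1]
    col = sequences[1] + sequences[3][::-1]
    # Convert the column-major sequence back to row-major order.
    col_as_row = col.reshape(w, h, c).transpose(1, 0, 2).reshape(h * w, c)
    return (row + col_as_row).reshape(h, w, c)


# Toy 2 x 3 grid with one channel; values are the row-major pixel indices.
grid = np.arange(6).reshape(2, 3, 1)
scans = cross_scan(grid)
print(scans[..., 0])
```

Each pixel appears once in every sequence, so merging the unprocessed scans returns four times the input. In an actual model, a 1D sequence operator would transform each sequence between the scan and merge steps.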
      Updated: 2026-06-18
    • Image Engineering in China: 2025

      Zhang Yujin
      Vol. 31, Issue 6, Pages: 1897-1910(2026) DOI: 10.11834/jig.260166
      Image Engineering in China: 2025
      Abstract: This is the 31st installment in the annual survey series of bibliographies on image engineering in China. This statistical analysis aims to capture the latest developments in image engineering in China, provide a targeted literature search aid for readers working in related areas, and offer useful guidance to journal editors and prospective authors. Specifically, considering the wide distribution of related publications in China, all references (755) on image engineering research and techniques were carefully selected from the research papers (2 917 in total) published in all issues (154) of a set of 15 Chinese journals. These 15 journals are considered important because their papers on image engineering are of higher quality and relatively concentrated. The selected references are first classified into five categories (image processing, image analysis, image understanding, technique application, and survey) and then into 23 specialized classes according to their main content (the same as in the last 20 years). Analysis and discussion of the classification statistics by journal and by category are also presented. The 2025 statistics show that, from a research perspective, image analysis has received the most attention, with image segmentation and primitive detection and object detection and recognition as key research focuses; within image understanding, studies on spatiotemporal techniques and behavior understanding have developed for more than a decade and have emerged as a significant domain. From an application perspective, remote sensing, radar, sonar, surveying and mapping, as well as medical and health-related fields, remain the most active areas; new image technology development and application expansion have progressed rapidly, yielding a series of achievements. 
Overall, China's image engineering research continued to deepen and broaden in 2025, maintaining its rapid development. The comprehensive 31-year statistical data also provide readers with more complete and reliable insights into the development trends of various research directions.  
      Keywords: image engineering; image processing (IP); image analysis (IA); image understanding (IU); technique application (TA); literature survey; literature statistics; literature classification; bibliometrics
      Updated: 2026-06-18

      Embodied Intelligence and Brain-Inspired Intelligence

    • Vision-language-action models: current developments and frontier advances

      He You, Lu Huchuan, Wang Dong, Li Shaohui, Li Zhi, Liu Yang, Zhao Jie, Ruan Shulan
      Vol. 31, Issue 6, Pages: 1911-1941(2026) DOI: 10.11834/jig.260042
      Vision-language-action models: current developments and frontier advances
      摘要:Vision-language-action (VLA) models represent a paradigm shift in multimodal artificial intelligence (AI) by unifying visual perception, linguistic comprehension, and motor control into a cohesive computational framework. Traditional robotic control frequently relies on decoupled modules for sensing, planning, and execution. However, such modules frequently fail to generalize across unstructured environments or complex semantic instructions. VLA models address these limitations by co-embedding multimodal input into a unified representation space, leveraging the expansive knowledge within large language models and vision-language models to facilitate zero-shot task execution and robust physical interaction. This study provides a systematic review of the VLA landscape, analyzing technical architectures, training methodologies, and empirical evaluation frameworks. By synthesizing the transition from modular robotics to end-to-end generative controllers, the survey elucidates how large-scale pretraining on diverse Internet data can be effectively transferred to downstream physical tasks. The internal mechanisms of VLA systems center on cross-modal alignment and sequence modeling. By discretizing robotic actions into tokens, researchers regard embodied control as a generative task where the model predicts subsequent motor commands based on high-dimensional visual observations and textual goals. This formulation allows agents to capture temporal dependencies between environmental states and linguistic intents through the self-attention mechanisms of Transformer-based backbones. The survey examines how causal modeling enables robots to anticipate the consequences of their actions, while language-conditioned task formulations allow for the interpretation of diverse natural language instructions without task-specific fine-tuning. 
We specifically analyze the technical implementation of action tokenization, discussing how continuous joint velocities or end effector poses are mapped into discrete vocabularies that the model can process alongside linguistic tokens. This integration ensures that the reasoning capability of the language component directly informs the low-level motor output. Recent research has introduced several optimization strategies to enhance the operational capability of VLA models. Embodied chain-of-thought reasoning improves long-horizon planning by generating intermediate symbolic or natural language subgoals, increasing success rate and system interpretability. To facilitate real-time deployment on edge hardware, studies have focused on efficiency via model quantization, knowledge distillation, and architectural innovations, such as state-space models and mixture of experts. Furthermore, the integration of reinforcement learning allows pretrained VLA models to adapt to specific physical dynamics through environmental interaction, mitigating the limitations of static imitation learning. Cross-action learning techniques further extend these capabilities by enabling skill transfer across heterogeneous robotic platforms and varied degrees of freedom, effectively creating a shared representation for diverse robotic morphologies. This review also explores the role of auxiliary objectives, such as future image prediction or contrastive alignment, in stabilizing the learning process and improving the visual grounding of linguistic concepts. The data ecosystem for VLA development is categorized into three primary domains. Simulation environments provide scalable platforms for automated data generation by using synthetic supervision and physics-based domain randomization. These platforms enable the collection of millions of trajectories without the risk of hardware damage. However, they necessitate sophisticated techniques to bridge the gap to reality. 
Real robot repositories, including the Open X-Embodiment dataset, offer high-fidelity demonstrations across diverse hardware but face scaling constraints due to the labor-intensive nature of teleoperation. Human video datasets serve as a massive passive learning source for understanding world physics, object affordances, and task hierarchies without explicit action labels. This review evaluates these resources based on their support for manipulation, navigation, and mobile manipulation tasks, providing a comparative analysis of benchmarks, such as robotics Transformer-1 and virtual manipulator, alongside simulation-to-real evaluation frameworks. We also discuss the importance of data diversity, noting that performance correlates strongly with the variety of objects, environments, and camera perspectives present during the pretraining phase. Despite recent progress, significant bottlenecks remain in achieving general-purpose embodied intelligence. Data scarcity for high-quality VLA triplets restricts model scaling compared with pure text or image domains. The simulation-to-real gap, which is driven by discrepancies in physical friction and sensor noise, continues to hinder the direct transfer of simulated policies to physical platforms. In addition, cross-robot adaptability and covariate shift pose ongoing challenges to maintaining performance across different kinematics and long-duration tasks. Safety constraints and the lack of transparency in neural controllers also complicate human-robot collaboration. Current models frequently struggle with fine-grained manipulation that requires high-frequency tactile feedback. This limitation arises from the predominantly vision-centric nature of current datasets. Furthermore, the computational cost of running large-scale vision Transformers at frequencies required for stable control remains a barrier for low-latency applications. 
This study concludes by identifying future research directions, emphasizing the need for improved physical common sense and data-efficient adaptation to move toward reliable and autonomous embodied agents. We argue that future VLA systems must move beyond simple imitation to include active exploration and self-correction mechanisms. The integration of multimodal feedback, including haptic and auditory signals, is identified as a necessary step for achieving human-level dexterity. Moreover, the development of standardized evaluation protocols that account for success rate and safety metrics will be essential for the field to progress. By addressing these open problems, the robotics community can transition from specialized task-specific agents to general-purpose robots that are capable of assisting in diverse domestic and industrial environments. This roadmap highlights the convergence of generative AI and physical robotics as the primary path toward artificial general intelligence in the physical world.  
      Keywords: vision-language-action (VLA) model; embodied intelligence; embodied chain-of-thought; multimodal reasoning; robot control
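The action tokenization described in the abstract above, in which continuous joint velocities or end effector poses are mapped into a discrete vocabulary, can be sketched with uniform binning. This is only one possible scheme, shown under stated assumptions: the bin count `NUM_BINS`, the normalized action range `LOW` to `HIGH`, and the function names `tokenize` and `detokenize` are all hypothetical and are not drawn from any specific VLA model.

```python
import numpy as np

NUM_BINS = 256          # assumed vocabulary size per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized range of each action dimension


def tokenize(action: np.ndarray) -> np.ndarray:
    """Map continuous action values to integer bin indices in [0, NUM_BINS - 1]."""
    clipped = np.clip(action, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)          # now in [0, 1]
    bins = (scaled * NUM_BINS).astype(np.int64)
    # The upper edge (scaled == 1.0) would land in bin NUM_BINS; fold it back.
    return np.minimum(bins, NUM_BINS - 1)


def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Map bin indices back to the continuous value at each bin center."""
    centers = (tokens.astype(np.float64) + 0.5) / NUM_BINS
    return LOW + centers * (HIGH - LOW)


# Toy 4-dimensional action, e.g. normalized end effector deltas (illustrative values).
action = np.array([-1.0, 0.0, 0.37, 1.0])
tokens = tokenize(action)
recovered = detokenize(tokens)
print(tokens, recovered)
```

With uniform bins, the round-trip error is bounded by half a bin width, here (HIGH - LOW) / (2 * NUM_BINS). A finer vocabulary reduces this quantization error at the cost of a larger output space for the model.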
      Updated: 2026-06-18
    • Dexterous robotic hands: enabling general-purpose manipulation

      Liang Shutong, Xie Dongjin, Li Dong, Zhang Hui, Jia Xiaofeng, Wang Fei-Yue, Li Yidong, Li Lingxi
      Vol. 31, Issue 6, Pages: 1942-1970(2026) DOI: 10.11834/jig.260100
      Dexterous robotic hands: enabling general-purpose manipulation
      摘要:In recent years, the rapid advancement of foundation models, including large language models, vision-language models, and world models, has introduced a paradigm shift that enables humanoid robots to transition from laboratory demonstrations to open-world applications, such as household services, industrial manufacturing, and medical assistance. As the primary end effector responsible for high-dimensional and fine-grained physical interaction, multi-fingered dexterous hands represent one of the most challenging and emblematic platforms in embodied intelligence due to their high degrees of freedom, strongly nonlinear contact dynamics, and tightly coupled multimodal feedback mechanisms. The emergence of vision-language-action (VLA) models and large-scale foundation architectures, the breakthrough application of diffusion models and flow matching in continuous control policy generation, hybrid reinforcement-imitation learning frameworks, and advances in high-resolution tactile sensing, variable-stiffness mechanisms, and rigid-soft hybrid materials collectively drive a fundamental transition in dexterous hands—from a paradigm of “rigid high-precision” mechanical determinism toward an integrated, perception-learning-execution-centered closed-loop intelligent system. This work presents a comprehensive review of robotic dexterous hands across four dimensions: mechanical structures, intelligence capability grading, data resources, and benchmarking methodologies. From a historical perspective, we first systematically trace the evolution of mechanical architectures and hardware paradigms, summarizing representative technical routes, including fully actuated multi-finger designs, underactuated compliant mechanisms, tendon-driven systems, soft robotic hands, and rigid-soft hybrid structures. 
Our analysis indicates that the evolution of dexterous hand mechanisms is not merely an accumulation of degrees of freedom, but rather a gradual shift toward engineering-oriented paradigms that are characterized by underactuated coupling, material compliance, and hybrid structural design. By embedding adaptive coordination mechanisms into a mechanical body through passive responses, these approaches effectively reduce actuation and control dimensionality while physically enhancing robustness against object diversity and contact uncertainty. Building upon this foundation, we propose a systematic five-level taxonomy of dexterous intelligence (H1–H5) centered on the evolution of perceptual capability. H1 (perception-free) is characterized by open-loop program execution and teleoperation, wherein the system lacks environmental modeling and policy generation capabilities. H2 (single-modal perception) introduces either vision or tactile feedback to enable perception-driven grasping and basic stability regulation. H3 (multimodal perception) integrates vision, tactile, and force sensing through deep multimodal collaboration, supporting complex fine manipulation tasks, such as precision assembly, deformable object manipulation, and tool use. During this stage, systematic methodologies emerge across three technical directions: hierarchical task planning, multimodal servo control, and data-driven policy learning. H4 (open perception) centers on VLA models and addresses perceptual generalization, long-horizon task planning, and deep multimodal fusion to enable language-guided open-world task understanding and zero-shot manipulation. H5 (dynamic perception) envisions an autonomous, evolving general manipulation capability that is supported by deep multimodal dynamic perception and real-time coordination mechanisms, representing a historical leap from robots as “tools” to embodied “symbiotic agents”. 
This taxonomy provides a unified reference framework for evaluating the technological transition of dexterous hands from repetitive execution to open-world task planning, and ultimately, toward autonomous evolution. Furthermore, we systematically review key data resources and evaluation benchmarks that support dexterous intelligence from two complementary dimensions: real-world interaction and high-fidelity simulation. At the data level, real-world datasets offer ecological validity but suffer from high collection costs, limited scalability, and safety risks. Synthetic datasets and simulation platforms enable large-scale and diverse data generation at controllable costs but remain constrained by simplified contact models and the simulation-to-reality gap. We outline the evolution of synthetic datasets from static grasp poses to dynamic manipulation sequences and analyze representative resources in terms of their contributions to grasp generation, cross-hand generalization, articulated object manipulation, and long-horizon modeling. We further summarize the technological progression of simulation platforms from basic physical validation to high-fidelity interaction and cross-domain transfer. In terms of evaluation, we categorize performance metrics into outcome-oriented and process-oriented dimensions, including task success rate, grasp cycle time, target pose error, normalized task error, contact region error, stability and drop rate, and efficiency and robustness. Benchmark tasks are organized into five families: stable grasping and transport, re-grasping and contact transition, in-hand manipulation and reorientation, constrained operation and assembly, and tool use and functional manipulation. Combined, these constructs form a systematic 2D evaluation spectrum that spans contact complexity and temporal depth, emphasizing reproducible and diagnostically meaningful standards for assessing generalization capability and deployment readiness. 
Finally, we summarize the core challenges and future directions toward the general-purpose deployment of dexterous hands. From a data perspective, the scarcity of real interaction data and the persistent simulation-to-reality gap remain fundamental bottlenecks for effective policy transfer. From a modeling perspective, efficient and robust multimodal joint representations, 3D foundation model construction, and interpretable decision-making mechanisms have yet to converge into a unified theoretical framework, while inherent tensions persist between model scale and real-time inference requirements. From a hardware perspective, long-standing engineering trade-offs exist between high degrees of freedom and low cost, reliability, and lightweight design, and between precision force-tactile control and structural simplicity. In the future, a deep integration of perception, decision-making, and execution; incorporation of physical common sense and causal reasoning through world models and embodied foundation models; generative AI-driven data-efficient learning and simulation credibility enhancement; biomimetic variable-stiffness mechanisms and endogenous tactile sensing through soft-hard codesign; and long-term real-world deployment with closed-loop optimization in high-value scenarios, such as intelligent manufacturing, domestic service, and specialized operations, will be critical pathways. Such effort will drive dexterous hands from laboratory prototypes toward reliable real-world applications, ultimately achieving the objective of general embodied intelligence that is capable of perceiving, reasoning, and manipulating “like a human hand”. This work provides a unified capability framework and systematic reference for understanding and tracking the frontier of robotic dexterous hands, offering theoretical guidance and practical insights for future research in hardware paradigm evolution, intelligence capability transition, and data and benchmarking system construction.  
      Keywords: dexterous hand; embodied AI; humanoid robot; multimodal tactile perception; vision-language-action model; task generalization
      Updated: 2026-06-18
    • Intelligent driving foundation model

      Hu Jianfang, Huang Linjiang, Zhai Wei, Yan Ruisong, Li Chenglin, Zheng Weishi, He Ran, Zha Zhengjun, Xiong Hongkai
      Vol. 31, Issue 6, Pages: 1971-1988(2026) DOI: 10.11834/jig.260085
      Intelligent driving foundation model
      Abstract: The intelligent driving foundation model integrates vision, language, and action through multimodal learning, driving the evolution of autonomous systems from the traditional “perception-planning-control” architecture toward an end-to-end unified paradigm. By leveraging the capabilities of unified representation, generative reasoning, and few-shot generalization, this model significantly enhances system robustness and decision-making intelligence. From the perspective of research, the intelligent driving foundation model incorporates achievements from multiple disciplines: the latest advances in visual computing, natural language processing, reinforcement learning, cognitive science, computer graphics, and virtual simulation are comprehensively applied within this system. At the industry level, leading global automotive and technology companies also regard large models as the technological cornerstone of the next generation of intelligent driving systems. This work systematically reviews the latest progress in the intelligent driving foundation model at the international and domestic levels, including decision planning, environmental perception, visual question answering, and data generation. Specifically, the decision planning section is dedicated to achieving direct mapping from perception input to planning output through a unified large model architecture, while maintaining explainability and generalization capabilities. On the one hand, to address the shortcomings of traditional end-to-end driving models in terms of interpretability, generalization ability, and safety verification, researchers have proposed a series of approaches that combine language models with visual models, achieving “model thinking readability” through natural language. 
On the other hand, large language models also provide a unified language-level interface for the collaborative optimization of multi-vehicle planning, allowing different vehicle agents to engage in semantic communication, share intentions and policies, and form a group intelligence similar to “implicit collaboration” among human drivers. The section on perception and motion prediction covers not only detecting surrounding objects, but also understanding environmental semantics, inferring the behavioral intentions of other traffic participants, and performing multi-target trajectory prediction in dynamic scenarios. Traditional perception systems rely on dense labeling and geometric reconstruction models, which often experience performance degradation in long-tail scenarios (e.g., extreme weather, emergencies). To address these issues, the academic community has recently introduced large language models and multimodal fusion mechanisms, incorporating semantic reasoning into visual perception to achieve “semantic-enhanced visual understanding”. This perception-semantic integration design significantly enhances the depth of understanding of complex environments by autonomous driving systems. Predictive capability can also be enhanced by introducing language reasoning. Some methods use language prompts as semantic guidance, combining visual and motion features for future trajectory prediction, referred to as “language prompt guided prediction”. To enable the model to explain and communicate with human drivers in natural language, researchers further introduced visual question answering into intelligent driving foundation models. With this approach, driving models can not only answer questions, such as “why is the vehicle slowing down” and “can we change lanes”, but also adjust policies on the basis of semantic questions, achieving explainable and intervenable driving intelligence. 
Retrieval-augmented generation and chain-of-thought techniques are applied as effective means to enhance question-answering capability in autonomous driving systems. This report discusses related methods in the visual question-answering section. Data are key driving factors for enhancing the capabilities of autonomous driving systems. High-quality, diverse, and semantically consistent driving data directly determine the generalization and safety performance of the model. Traditional data collection and annotation methods are extremely costly. For example, the annotation cost for urban-level autonomous driving can reach $3–$5 (USD) per frame, with less than 5% coverage of long-tail scenarios. To address this issue, research focus has shifted to automatic annotation, self-supervised learning, and generative data synthesis. The objective of these methods is to reduce dependence on manual annotation, synthesize in virtual space rare samples that are difficult to capture in the real world, and form a closed-loop data engine, enabling the coevolution of models and data. The data generation section, which is organized by data source, explores how automatic annotation, generative data synthesis, world models, and integrated virtual and real simulation methods solve the problems of high cost and insufficient coverage of long-tail scenarios in autonomous driving data. On these bases, this work conducts a horizontal comparison and further analyzes China’s strengths and limitations in terms of data resources, computational infrastructure, algorithmic innovation, and standardization. International research is leading in the theoretical depth and integration of multimodal fusion, particularly demonstrating considerable innovative potential in unified architecture, generative world models, and collective intelligence. 
Meanwhile, China exhibits significant advantages in engineering applications, real-time optimization, and scenario adaptation, particularly with unique practical experience in data closed-loop systems, automatic annotation, and computational optimization. Regarding future development trends, intelligent driving technology faces challenges in real-time performance, safety, and personalization. This report recommends strengthening fundamental research and public infrastructure, establishing a unified open-source data platform to promote the sharing and collaboration of multimodal data, building trustworthy artificial intelligence (AI) evaluation systems, advancing personalized driving and human-AI alignment, and fostering an autonomous and controllable innovation ecosystem. Intelligent driving foundation models have become crucial enablers for the high-quality development of China’s automotive industry and a new frontier in applied AI. The algorithms and open-source code mentioned are summarized at https://github.com/Ruisong-Yan/Intelligent-Driving-Foundation-Model and can also be accessed via https://doi.org/10.57760/sciencedb.j00240.00121.  
      Keywords: intelligent driving; foundation model; multimodal learning; world model; end-to-end (E2E); interpretability
      Updated: 2026-06-18
    • Zhao Yao, Li Jia, Jin Yi, Wei Yunchao, Zhao Yifan, Zhang Hui, Wang Xu, Qu Mengxue, Zeng Yuqiao, Wang Wenzhuang
      Vol. 31, Issue 6, Pages: 1989-2016(2026) DOI: 10.11834/jig.250644
      Survey on intelligent generation of traffic data towards advanced smart driving: models, systems, and evaluation
      摘要:As higher-level smart driving continues to advance, it increasingly relies on multimodal perception, predictive modeling, and intelligent decision-making. However, the acquisition of real-world traffic data faces substantial challenges, especially under extreme weather conditions, rare or long-tail scenarios, and privacy-sensitive contexts. These challenges manifest as high data collection costs, insufficient scenario coverage, and labor-intensive labeling processes, which collectively hinder the ability to support the large-scale training, validation, and deployment of smart driving systems. Consequently, the efficient generation of traffic data that simultaneously exhibits realism, controllability, and diversity has emerged as a crucial research problem, with significant implications for safety and system reliability in extreme or unforeseen driving conditions. To address these pressing challenges, this paper presents a comprehensive survey of intelligent traffic data generation techniques specifically targeting high-level smart driving applications. The survey aims to provide a structured overview of the state-of-the-art methods, identify the core technical bottlenecks, and outline best practices for translating research into engineering applications. This paper organizes the discussion along three interrelated dimensions—models, systems, and evaluation—forming a holistic perspective that links generative algorithms with practical deployment considerations and quantitative assessment protocols. This paper begins by introducing a model-system-evaluation workflow that clarifies the central technical challenges faced in traffic data generation. 
Key issues include data scarcity, which limits the capacity to model rare events; cross-modal alignment, which ensures consistent mapping between visual, spatial, and textual modalities; conditional controllability, which enables flexible generation under user-specified constraints; scene consistency, which preserves realistic spatial and temporal correlations; and closed-loop validation, which evaluates generated data in the context of perception-prediction-control feedback loops. These challenges not only reflect fundamental research questions but also directly affect the robustness and generalizability of smart driving systems. Following the problem definition, this paper systematically reviews representative generative techniques that have been applied in this domain. Our survey covers several prominent methodological families, including diffusion models, generative adversarial networks, neural radiance fields and 3D Gaussian splatting, world models, and multimodal foundation models. For each category, this paper discusses the underlying principles, highlights recent advances, and examines their suitability for generating high-fidelity, controllable traffic data. This paper pays special attention to the integration of spatial and temporal priors, multiagent interactions, and semantic guidance, which are critical for ensuring that synthetic data faithfully reflect real-world driving dynamics. This paper further categorizes applications into three principal domains—intelligent cockpits, single-vehicle autonomy, and V2X-based cooperative perception. In intelligent cockpits, generated data can support driver assistance systems and human-machine interface evaluation and enable the study of driver behavior and risk perception under controlled yet realistic conditions. 
For single-vehicle smart driving, synthetic data facilitate model training for perception, prediction, and planning modules, especially in scenarios that are rare or safety critical, such as pedestrian crossing at night or severe weather conditions. In multivehicle cooperative perception leveraging V2X communication, generated datasets facilitate systematic exploration of sensor fusion strategies, information sharing protocols, and distributed decision-making under varying network and environmental conditions. Across these domains, this paper identifies key technical considerations, including sensor modality alignment, occlusion handling, and fidelity of dynamic interactions among agents. In addition to algorithmic techniques and application scenarios, evaluation and benchmarking constitute an essential component of intelligent traffic data generation. This paper proposes a multilevel evaluation framework that spans perception-prediction-control closed-loop metrics, physical consistency of sensors, and scenario diversity measures. This framework not only assesses the visual realism of synthetic data but also quantifies their utility in downstream smart driving tasks and ensures that generated datasets contribute meaningfully to system reliability and safety validation. Moreover, this paper discusses engineering practices for constructing scalable data engines that balance realism, diversity, and controllability, and offers practical insights into data augmentation strategies, hybrid synthetic-real data pipelines, and scenario generation workflows. In summary, this survey provides a structured, comprehensive reference for researchers and practitioners working on traffic data generation for advanced smart driving. By integrating perspectives from models, systems, and evaluation, this work highlights the progress made and the remaining challenges in generating high-quality, controllable, and task-relevant traffic data. 
Such insights will support the development of robust data ecosystems, inform the establishment of standardized evaluation protocols, and ultimately accelerate the safe deployment of high-level smart driving technologies in real-world environments.  
      Keywords: model-system-evaluation; high-level smart driving; intelligent data generation; intelligent cockpit; vehicle-centric smart driving; multi-vehicle cooperative perception; multi-level evaluation
      Updated: 2026-06-18
    • Mu Yao, Zhao Hao, Hu Ruizhen, Zhang Li, Li Hongyang, Yang Jiaolong, Wang Jingbo, Han Lei, Su Yongfeng, Xu Kai, Yang Yi, Li Jiang, Dai Ruoli, Chen Baoquan, Liu Yebin, Yi Li
      Vol. 31, Issue 6, Pages: 2017-2025(2026) DOI: 10.11834/jig.260059
      Frontiers and prospects of embodied AI: evolution of data, models, and systems
      Abstract: As a critical, rapidly evolving domain within artificial intelligence (AI), embodied AI represents the convergence of computer vision, natural language processing, and robotics, and aims to create intelligent agents capable of perceiving, reasoning, and acting within the physical world. However, despite the transformative success of large language models (LLMs) in the digital realm, embodied AI faces unprecedented, multifaceted challenges that hinder the direct replication of the “large-scale pre-training plus scaling law” paradigm. These challenges include extreme data heterogeneity across different robot morphologies, strong physical constraints that demand safety and precision, and the prohibitively expensive interaction costs associated with collecting real-world robotic data. Consequently, simply scaling up model parameters without addressing these domain-specific hurdles has proven insufficient for achieving general-purpose robotic intelligence. This paper comprehensively reviews the frontier technical evolution of embodied AI and offers a systematic analysis across four critical dimensions, namely, data, models, systems, and evaluation, to chart a path toward more robust and generalized embodied agents. In terms of data, this paper proposes a “Data Pyramid” structure designed to maximize data efficiency and transferability. This hierarchical framework advocates three layers: at the bottom, massive, low-cost simulation data and Internet-scale video datasets build broad physical commonsense and visual representations; in the middle, human interaction data (such as ego-centric videos and teleoperation logs) facilitate behavioral mapping and intent understanding; and at the top, a small amount of high-quality real-world robot data is applied strategically for fine-tuning and final skill deployment, an approach that helps bridge the reality gap. 
Regarding models, the paper critically discusses the current state of mainstream vision-language-action models and highlights that while they excel at semantic understanding, they encounter significant scaling bottlenecks in continuous control and fine-grained manipulation. To overcome this limitation, this paper identifies “World Models” as a pivotal new direction for embodied pretraining. By learning to simulate environmental dynamics, predict future states, and understand causal relationships without explicit supervision, world models promise to endow agents with deeper physical intuition and superior generalization capabilities in unseen environments. In terms of systems, this paper observes a paradigm shift, where the architecture is evolving from monolithic, single end-to-end models toward an operating system-like “Hierarchical Architecture.” This evolution decouples high-level semantic planning, which is powered by the reasoning capabilities of LLMs, from low-level motion control, which ensures precise execution and hardware compliance. This modular approach not only improves system robustness but also facilitates easier debugging and component upgrades. Finally, the paper examines the critical issues within current evaluation systems and specifically focuses on the challenges of authenticity in simulation benchmarks and the lack of reproducibility in real-world experiments. The field suffers from fragmented metrics that fail to capture the complexity of open-world interaction. In conclusion, this paper provides a forward-looking perspective on the inevitable integration of locomotion and manipulation (moving beyond stationary arms to mobile manipulators) and anticipates the arrival of the “ImageNet moment” for embodied AI, where standardized datasets and benchmarks will catalyze a Cambrian explosion of robotic capabilities, a development that ultimately bridges the gap between digital intelligence and physical reality.  
      Keywords: Embodied AI; Data Pyramid; World Models; VLA Models; Hierarchical Control Architecture; Embodied Evaluation
      Updated: 2026-06-18
    • Zhang Teng, Wang Cong, Miao Feng, Yang Yuchao
      Vol. 31, Issue 6, Pages: 2026-2044(2026) DOI: 10.11834/jig.260090
      Review of neuromorphic materials and devices for computing-in-memory applications
      Abstract: The advent of the big data era and the proliferation of artificial intelligence have exposed the fundamental limitations of the traditional von Neumann computing architecture. The physical separation between the central processing unit and memory units necessitates frequent data shuttling, leading to the severe “memory wall” bottleneck that is characterized by high latency and excessive energy consumption. By contrast, the biological brain exhibits remarkable computational efficiency, performing complex cognitive tasks, such as pattern recognition, associative memory, and autonomous learning, with an exceptionally low power budget of approximately 20 W. This biological efficiency intrinsically arises from the brain’s massive parallelism, event-driven processing, and dense colocation of memory and computation. Consequently, neuromorphic computing, specifically the development of computing-in-memory (CIM) architectures, has emerged as a pivotal frontier in post-Moore’s law electronics. This review surveys the materials, devices, and system-level strategies that are currently driving the development of neuromorphic hardware, aiming to bridge the gap between fundamental material physics and brain-inspired intelligence. To realize this paradigm shift, extensive research has focused on the development of novel materials that can physically embody neuronal and synaptic dynamics at the hardware level. We first systematically analyze the state-of-the-art material systems that are utilized for these purposes, beginning with advanced silicon-based field-effect transistors, where optimization in a floating gate allows for a highly compact implementation of analog weight updates. Moving beyond traditional silicon, we explore emerging nonvolatile memory technologies. 
Memristive materials are highlighted for their highly scalable metal-insulator-metal structures, where the mechanisms of ion migration and conductive filament formation provide the nonvolatile multilevel states that are necessary for high-density synaptic arrays. Similarly, phase-change materials utilize the reversible transition between amorphous and crystalline states via localized Joule heating, offering distinct advantages in array scalability and multi-bit storage despite challenges related to resistance drift. Furthermore, we investigate ferroelectric materials, which utilize the polarization reversal of ferroelectric domains to provide highly linear and symmetric weight updates with sub-nanosecond switching speeds and ultralow switching energy. The integration of spintronic and magnetic materials is also discussed, leveraging electron spin dynamics and spin transfer torque to achieve exceptional endurance. Optoelectronic materials are examined for their capability to couple optical and electrical signals, enabling ultrafast signal transmission that mimics the high-dimensional connectivity of biological networks. Crucially, a central theme of this review is the paradigm shift from suppressing material non-idealities to actively utilizing intrinsic device dynamics. Physical complexities, such as volatile relaxation, stochastic switching, and nonlinear current-voltage characteristics, are increasingly utilized as rich computational resources for implementing short-term memory, reservoir computing, and probabilistic learning rules. Building upon these fundamental material properties, this review elaborates on the engineering of artificial synapses and neurons, which are the two foundational pillars of neural networks. The artificial synapse, which acts as the primary locus of learning and memory, is designed to regulate connection weights in response to external stimuli. 
We analyze the precise hardware implementation of essential bio-plasticity rules, including long-term potentiation, long-term depression, and spike timing-dependent plasticity, across various device architectures that range from two-terminal electrical memristors to complex multi-physics synapses. Complementary to synaptic memory, the artificial neuron serves as the nonlinear processing engine. We review the physical realization of neuronal models, particularly the leaky integrate-and-fire model, by using threshold switching mechanisms found in Mott insulators or volatile diffusive memristors. These devices dynamically mimic the accumulation of membrane potential and the subsequent generation of action potentials, enabling highly efficient event-driven processing wherein power is consumed strictly during spiking events. Beyond the engineering of individual devices, the realization of practical neuromorphic intelligence necessitates robust system integration and hardware-software codesign. This review extensively discusses the topology of crossbar arrays, where synaptic devices are located at the cross points. This architecture enables highly parallelized vector-matrix multiplication in a single computational time step via the direct application of Ohm’s law and Kirchhoff’s current law, drastically accelerating neural network inference and training. The transition to analog neuromorphic computing introduces significant challenges, notably the “reality gap” caused by device-to-device variability, cycle-to-cycle noise, and limited operational endurance. To mitigate these issues and ensure the reliable execution of complex cognitive tasks, researchers are increasingly focusing on comprehensive hardware-algorithm codesign strategies. These strategies involve the development of robust network architectures and adaptive learning algorithms that are inherently resilient to underlying hardware imperfections. 
Establishing a closed-loop optimization framework that integrates device-level physical traits with system-level computational models makes it possible to significantly enhance the overall reliability and fault tolerance of neuromorphic hardware. Such synergistic approaches are indispensable for translating the theoretical advantages of analog computing into tangible performance gains. Moreover, the pursuit of these advanced computing architectures is a deeply interdisciplinary endeavor that unites the latest breakthroughs in materials science, solid-state physics, and computational neuroscience. As the field continues to mature, the convergence of these diverse scientific domains will be critical in overcoming the remaining technological barriers and establishing universally accepted testing standards. Finally, this review showcases the practical deployment of these integrated materials and devices in transformative applications. From highly energy-efficient edge computing nodes for the Internet of Things to adaptive brain-computer interfaces that are capable of real-time neurological signal decoding, and autonomous sensory-motor systems in advanced robotics, the potential of neuromorphic hardware is vast. In conclusion, the continued advancement of neuromorphic materials and devices represents a profound shift from rigid, clock-driven logic to adaptive, biologically inspired intelligence. By harnessing the rich underlying physics of emerging materials, researchers are replicating the functional building blocks of the biological brain, paving the way for a future wherein CIM architectures will enable sophisticated artificial intelligence systems that are autonomous, ultraefficient, and deeply integrated into the physical world.  
      Keywords: neuromorphic computing; Computing-in-Memory; artificial synapse; artificial neuron; Emerging Electronic Materials; Brain-inspired Chips
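The crossbar vector-matrix multiplication mentioned in the abstract above can be illustrated with a minimal sketch of an ideal array. Everything here is an assumption made for illustration and is not taken from the reviewed paper: the function names, the conductance range `g_min`/`g_max`, and the differential-pair scheme for signed weights. Device variability, noise, and wire resistance are ignored.

```python
def crossbar_vmm(conductances, voltages):
    """Column currents of an ideal crossbar (hypothetical helper).

    conductances: list of rows, one conductance (siemens) per cross point.
    voltages: one input voltage (volts) per row.
    """
    currents = [0.0] * len(conductances[0])
    for g_row, v in zip(conductances, voltages):
        for j, g in enumerate(g_row):
            # Ohm's law gives the device current g * v; Kirchhoff's
            # current law sums the device currents on column j.
            currents[j] += g * v
    return currents


def to_differential_pair(weights, g_min=1e-6, g_max=1e-4):
    """Map signed weights onto two non-negative conductance arrays.

    Assumed scheme: w is proportional to (g_plus - g_minus).
    Requires at least one non-zero weight.
    """
    w_max = max(abs(w) for row in weights for w in row)
    scale = (g_max - g_min) / w_max
    g_plus = [[g_min + scale * max(w, 0.0) for w in row] for row in weights]
    g_minus = [[g_min + scale * max(-w, 0.0) for w in row] for row in weights]
    return g_plus, g_minus, scale


def signed_vmm(weights, inputs):
    """Compute inputs @ weights by subtracting two crossbar read-outs."""
    g_plus, g_minus, scale = to_differential_pair(weights)
    i_plus = crossbar_vmm(g_plus, inputs)
    i_minus = crossbar_vmm(g_minus, inputs)
    return [(p - m) / scale for p, m in zip(i_plus, i_minus)]
```

In a physical array all column currents settle at once, which is why the operation is described as taking a single time step; the nested loop above only reproduces the arithmetic.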
      Updated: 2026-06-18
    • Zheng Yajing, Zhao Rui, Zhu Lin, Liu Yujia, Huang Tiejun
      Vol. 31, Issue 6, Pages: 2045-2069(2026) DOI: 10.11834/jig.260128
      SpikeCV: a survey on the hierarchical modeling of continuous-time spike representations and systematic progress in neuromorphic vision
      Abstract: With the rapid development of neuromorphic vision sensors, spike cameras have emerged as a promising paradigm for continuous-time visual perception. In contrast with conventional frame-based cameras that sample scenes at fixed frame rates, spike cameras encode luminance variations as asynchronous binary spike streams triggered by intensity accumulation at each pixel. This sensing mechanism changes the manner in which visual information is acquired and represented. Instead of producing discrete image frames, spike cameras generate continuous spike events that directly reflect temporal changes in scene intensity. Consequently, spike cameras provide several unique advantages, including extremely high temporal resolution, wide dynamic range, low motion blur, and sparse event-driven representations. These properties make spike cameras particularly suitable for challenging visual environments, such as high-speed motion analysis, extreme illumination conditions, and subtle temporal change detection, where traditional imaging systems frequently encounter limitations. Spike vision also introduces new challenges for visual computing. The statistical characteristics and data structures of spike streams differ significantly from those of conventional images or videos. Spike cameras produce sparse binary spike sequences in continuous time rather than dense intensity frames. Therefore, many established computer vision algorithms cannot be directly applied to spike data without modifications. Effective processing of spike streams requires new modeling strategies that explicitly consider the temporal dynamics and sparsity of signals. This survey reviews recent progress in spike vision research from the perspective of the hierarchical modeling of continuous-time spike representations. 
Existing methods are organized into several levels that reflect the progressive expansion of spike vision capabilities, ranging from signal modeling and reconstruction to semantic perception and system deployment. At the lowest level, the physically consistent modeling of spike generation and sensor noise provides a foundation for understanding spike data statistics. Studies in this direction analyze pixel triggering mechanisms, noise characteristics, and spike accumulation processes, forming the basis for reliable signal processing and algorithm design. Low-level visual reconstruction methods aim to recover stable visual signals from spike streams. Representative tasks include intensity reconstruction, high dynamic range imaging, motion deblurring, super-resolution, and low-light enhancement. These approaches convert spike sequences into interpretable intensity representations while preserving the temporal information contained in the spike data. The next level focuses on spatiotemporal modeling. The continuous-time nature of spike streams enables joint modeling of spatial structure and temporal motion. Research in this area addresses problems, such as optical flow estimation, motion segmentation, and dynamic scene analysis. Compared with frame-based methods, spike-based models provide improved temporal fidelity in fast motion scenarios. At the semantic perception level, spike representations are increasingly applied to tasks, such as object detection, recognition, and multi-object tracking. Continuous spike streams are integrated with deep neural networks, Transformer architectures, or spiking neural networks to perform higher-level visual reasoning. These methods utilize the temporal sparsity of spike data while maintaining low-latency processing. Spike cameras have also been introduced into 3D scene modeling. Recent studies combine spike streams with neural implicit representations to reconstruct static or dynamic 3D scenes. 
Continuous spike measurements provide detailed temporal information that can benefit dynamic scene reconstruction and neural rendering. System-level considerations play an important role in practical spike vision deployment. The evaluation of spike-based methods involves not only accuracy but also system metrics, such as latency, throughput, and energy consumption. These metrics become particularly important in real-time perception systems and edge computing scenarios. Progress in spike vision has also been supported by the development of datasets, simulation tools, and open-source platforms. Collecting spike camera data can be costly, and thus, spike simulators are widely used in algorithm development and validation. Simulation methods attempt to reproduce sensor physics and temporal spike generation processes. Public datasets and benchmarking protocols further support reproducible research. The SpikeCV platform provides a unified open-source framework for spike vision research. It integrates datasets, algorithm implementations, hardware interfaces, and evaluation tools, allowing researchers to prototype and evaluate spike-based algorithms rapidly. The platform has helped facilitate collaborative development and reproducible experiments within the community. Research activity in spike vision has grown rapidly in recent years. Publications, open-source resources, and benchmark datasets have increased steadily. Two international competitions organized in 2025 attracted wide participation from academic institutions and industry teams. These competitions encouraged standardized task definitions and stimulated methodological progress in the field. However, continuous-time representation learning for spike streams is still an active research area. Large-scale self-supervised learning for spike data remains largely unexplored. Multimodal fusion with complementary sensors introduces additional challenges in temporal alignment and noise modeling. 
System-level optimization that involves latency, throughput, and energy consumption also requires further investigation. Hardware-algorithm codesign with neuromorphic processors may provide new opportunities for efficient spike-based computation. Spike vision represents an emerging direction that reconsiders visual perception from a continuous-time perspective. Advances in sensor technology, representation learning, and system integration are gradually forming a new framework for visual computing. Continued progress in spike-based sensing, modeling, and deployment may enable high-speed, energy-efficient visual intelligence for future perception systems. Overall, this work provides a structured overview of spike vision research from the perspectives of sensing principles, representation modeling, perception algorithms, and system-level infrastructure. By organizing existing studies within a hierarchical framework of continuous-time spike representations, this study highlights how spike vision methods evolve from low-level signal reconstruction to high-level semantic perception and 3D scene understanding. The discussion of datasets, simulators, evaluation protocols, and open-source platforms further reflects the growing research ecosystem that surrounds spike-based vision. Through this synthesis, we aim to clarify the relationships among different research directions, summarize the current development status of the field, and provide a reference for future work on continuous-time visual computing and neuromorphic perception systems.  
      Keywords: SpikeCV; spike vision; continuous-time representation; neuromorphic vision; high-speed motion; spatiotemporal modeling; open-source ecosystem
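The sensing mechanism described in the abstract above, binary spikes triggered by intensity accumulation at each pixel, can be sketched for a single pixel as follows. This is an illustrative toy model, not the SpikeCV API: the function names, the discrete time step, the subtract-on-fire reset, and the rate-based intensity estimate are all assumptions, and the model allows at most one spike per step (input intensity no larger than the threshold).

```python
def encode_spikes(intensities, threshold=1.0):
    """Turn a per-step intensity sequence into a binary spike train.

    Assumed model: the pixel accumulates intensity and, once the
    accumulator reaches `threshold`, emits a spike and subtracts the
    threshold so that leftover charge carries over.
    """
    accumulator = 0.0
    spikes = []
    for value in intensities:
        accumulator += value
        if accumulator >= threshold:
            spikes.append(1)
            accumulator -= threshold
        else:
            spikes.append(0)
    return spikes


def estimate_intensity(spikes, threshold=1.0):
    """Recover the mean intensity over a window from the firing rate."""
    return threshold * sum(spikes) / len(spikes)
```

Brighter pixels fire more often, so counting spikes over a window gives a simple intensity estimate; shorter windows track fast motion better at the cost of a noisier estimate, which is one reason the reconstruction methods surveyed go beyond plain counting.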
      Updated: 2026-06-18
    • Yang Shuangming, Shen Jiangrong, Li Youjun, Huang Zigang, Chen Badong
      Vol. 31, Issue 6, Pages: 2070-2102(2026) DOI: 10.11834/jig.260022
      Recent advances and future prospects in brain-inspired artificial intelligence research
      Abstract: Brain-inspired artificial intelligence (AI) seeks to emulate the structural organization, functional mechanisms, and adaptive learning principles of the human brain, with the objective of developing intelligent systems that exhibit high energy efficiency, strong generalization capability, continual learning, and robust adaptability to complex environments. Despite the remarkable success of contemporary AI approaches dominated by deep learning, these methods remain fundamentally constrained by high computational cost, excessive energy consumption, limited interpretability, and weak adaptability to nonstationary or resource-constrained environments. Such limitations hinder their deployment in real-world scenarios that require long-term autonomy, online learning, and efficient decision-making under uncertainties. By contrast, the human brain represents an unparalleled natural intelligent system that is capable of performing perception, cognition, learning, memory, and decision-making in parallel while consuming only tens of watts of power. This exceptional efficiency arises from the synergistic interaction of neuronal dynamics, synaptic plasticity, neural circuit organization, and neuromodulatory regulation across multiple spatial and temporal scales. These characteristics provide profound inspiration for rethinking the computational paradigms, learning mechanisms, and hardware architectures that underlie AI. Consequently, brain-inspired AI has emerged as a frontier research direction at the intersection of neuroscience, AI, electronic engineering, and materials science, aiming to bridge the gap between biological intelligence and engineered systems. This study presents a comprehensive review of recent advances in brain-inspired AI from international and domestic perspectives. From the viewpoint of brain structural inspiration, we systematically analyze multilevel modeling approaches that span neuronal models, neural circuits, and neuromodulatory systems. 
The evolution of spiking neuron models and spiking neural networks is reviewed, with particular emphasis on the trade-offs among biological plausibility, computational efficiency, and scalability. Key mechanisms, such as dendritic nonlinearity, neuronal heterogeneity, synaptic dynamics, and event-driven information processing, are discussed in terms of their contributions to efficient temporal encoding, sparse computation, and robustness. Furthermore, neuromodulatory mechanisms inspired by dopamine, acetylcholine, and other neurotransmitters are examined, highlighting their roles in regulating learning rates, credit assignment, and context-dependent adaptation, and their potential to support continual learning and mitigate catastrophic forgetting in artificial systems. From the perspective of brain functional inspiration, this study reviews algorithmic developments across core intelligent functions, including perception, attention, memory, learning, reasoning, decision-making, and control. Brain-inspired principles that underlie convolutional architectures, recurrent neural networks, attention mechanisms, reinforcement learning, and meta-learning are analyzed, revealing how biological concepts, such as hierarchical processing, temporal recurrence, selective attention, reward-driven learning, and multi-timescale adaptation, have shaped modern AI algorithms. Special attention is given to recent trends toward integrating multiple cognitive functions into unified architectures, enabling closer coupling among perception, cognition, and action, and facilitating more flexible and generalizable intelligence. From the hardware and system implementation perspective, this study summarizes state-of-the-art neuromorphic computing systems that are designed to support brain-inspired algorithms. 
Emerging computing paradigms based on near-memory and in-memory computing architectures are reviewed, emphasizing their capability to overcome the von Neumann bottleneck and achieve significant gains in energy efficiency and parallelism. Advances in neuromorphic devices, including memristors, ferroelectric devices, electrochemical memory elements, and novel transistor structures, are discussed with respect to their capability to emulate synaptic plasticity, neuronal dynamics, and memory-computation colocation at the physical level. The importance of algorithm-hardware codesign is highlighted, because the efficient deployment of brain-inspired intelligence critically depends on the co-optimization of learning rules, network architectures, device characteristics, and system-level communication mechanisms. In addition, this study provides a systematic comparison of international and domestic research efforts in brain-inspired AI. While international research has demonstrated strong advantages in fundamental neural modeling, large-scale neuromorphic platforms, and interdisciplinary collaboration, domestic research has made rapid progress in application-oriented system integration, independently developed neuromorphic hardware, and large-scale brain-inspired computing platforms. The complementary strengths and existing gaps between these research trajectories are analyzed, offering insights into future collaborative and competitive development strategies. Finally, future directions and challenges in brain-inspired AI are discussed. Key trends include deeper algorithm-hardware codesign, tighter integration between neuroscience and AI, and the development of scalable, interpretable, and energy-efficient intelligent systems for real-world applications. Emphasis is placed on advancing theoretical foundations, improving system-level robustness and adaptability, and accelerating the translation of brain-inspired technologies into industrial and societal applications. 
Several strategic directions are outlined to promote leapfrog advances in brain-inspired AI, particularly in the context of China’s long-term scientific and technological development.  
      Keywords: brain-inspired; neuromorphic intelligence; brain structure; brain function; neuromorphic systems
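The abstract above refers to spiking neuron models and event-driven processing without fixing a particular model. As a minimal sketch, the leaky integrate-and-fire (LIF) neuron is used here as one common choice; that choice, the function name, the forward-Euler update, and all parameter values are assumptions for illustration and are not taken from the reviewed paper.

```python
def simulate_lif(input_current, dt=1.0, tau=10.0, v_threshold=1.0, v_reset=0.0):
    """Simulate a leaky integrate-and-fire neuron (hypothetical helper).

    The membrane potential v follows dv/dt = -v / tau + I(t), integrated
    with a forward-Euler step of size dt. When v reaches v_threshold the
    neuron emits a spike and v returns to v_reset.
    Returns the list of spike step indices and the potential trace.
    """
    v = v_reset
    spike_steps = []
    trace = []
    for step, current in enumerate(input_current):
        v += dt * (-v / tau + current)  # leak plus integration
        if v >= v_threshold:
            spike_steps.append(step)    # output exists only at spike events
            v = v_reset
        trace.append(v)
    return spike_steps, trace
```

With a weak constant input the leak balances the drive below threshold and the neuron stays silent, while a stronger input produces a regular spike train; this thresholded, sparse output is what makes the computation event-driven.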
      Updated: 2026-06-18

      Generative Artificial Intelligence and Content Security

    • Shu Yan, Zhao Fangmin, Chen Zeyu, Zhao Tianqi, Wang Yizhu, Li Kunchi, Zhou Yu, Wang Dahan, Peng Liangrui, Gao Liangcai, Yin Xucheng
      Vol. 31, Issue 6, Pages: 2103-2124(2026) DOI: 10.11834/jig.260047
      Survey on visual text generation techniques
      Abstract: Visual text image generation and editing are important research directions at the intersection of computer vision and natural language processing, aiming to achieve seamless text removal, precise text editing, and intelligent text generation in images. In contrast with general image generation tasks, visual text possesses the dual attributes of contextual semantic information and visual graphical features, imposing higher requirements for models’ multimodal representation capability and generation precision in terms of glyph structure, stroke details, color texture, and layout composition. With the rapid development of generative adversarial networks (GANs), diffusion models, flow matching, and multimodal large models (e.g., CLIP and Flamingo), this field has achieved significant breakthroughs in technical paradigms and practical application scenarios over the past decade. This survey systematically reviews research progress in three core tasks: visual text removal (VTR), visual text editing (VTE), and visual text generation (VTG), which together constitute the technical system of visual text image processing. The three tasks are mutually complementary and form a closed loop of visual text processing: VTR provides a clean image background, VTE enables flexible modification of existing text information, and VTG actively creates text content in images, jointly supporting the full-chain application of visual text technology. In VTR, three major paradigms, namely, knowledge transfer, multitask learning, and progressive learning, have continuously advanced the synergistic optimization of text detection and background inpainting capabilities. 
Knowledge transfer paradigms leverage pretrained image inpainting models (e.g., contextual residual aggregation networks) to enhance background restoration quality, effectively addressing the problem of texture inconsistency caused by the independent training of detection and inpainting modules. Multitask learning frameworks integrate text detection and inpainting into a unified model architecture, sharing feature extraction backbones to reduce cumulative errors between upstream and downstream tasks, and improving end-to-end processing efficiency. By contrast, progressive learning approaches adopt a coarse-to-fine strategy. First, text regions based on semantic features are roughly located and eliminated. Then, background details are iteratively refined by fusing local texture and global structure information, achieving thorough text elimination while maximally preserving the integrity and consistency of the original background texture, lighting, and spatial structure. In VTE, technical evolution shifts from GAN-based stepwise processing (including text detection, removal, and regeneration) to end-to-end conditional generation models, considerably improving the efficiency and fidelity of editing results. Early stepwise methods suffer from evident seams between edited text and the background due to the separation of each module. Meanwhile, modern end-to-end models focus on the precise extraction and cross-domain transfer of text style features (e.g., font, size, color, and transparency), stroke features (e.g., thickness, smoothness, texture, and wear degree), and semantic features. By introducing attention mechanisms to focus on the correlation between text and background regions, and style embedding modules to encode scene-specific visual attributes, these models realize a unified modeling of style preservation and content replacement. 
Such modeling enables seamless editing effects wherein the edited text is consistent with the surrounding scene in terms of visual appearance, layout logic, and lighting conditions, effectively avoiding disharmony of the edited content. In VTG, research has undergone a fundamental transformation from early graphics-based rendering synthesis (relying on predefined font libraries and layout rules) to data-driven neural generation, breaking through the limitations of fixed styles and single scenes in traditional methods. Recent advances in character-aware encoding (which captures fine-grained glyph features), glyph-conditioned control (which regulates text shape and layout), and multimodal alignment mechanisms (which align text content with image context) have significantly improved the performance of text generation. Specifically, these improvements are reflected in three aspects: text spelling accuracy (reducing missing characters, distortions, and errors), scene coherence (matching the lighting, perspective, texture, and noise of the target image), and multilingual generalization (supporting complex scripts, such as Chinese, Arabic, Sanskrit, and other languages with irregular stroke structures). These advancements have made neural text generation more adaptable to real-world application scenarios, such as advertising design, poster design, scene customization, and intelligent annotation. This survey further analyzes the core challenges that confront the field and restrict the large-scale practical application of visual text image technologies. First, accurate rendering of multilingual complex characters remains difficult due to the diverse glyph structures, stroke variations, and semantic associations across languages. For example, Chinese calligraphy with freehand brushwork and Arabic cursive script pose considerable challenges to model feature extraction. 
Second, models lack strong generalization capability across diverse scenes (e.g., indoor, outdoor, low-light, and motion-blurred environments) and text styles (handwritten, printed, artistic fonts, and worn ancient text), frequently exhibiting a sharp performance decline when facing unseen scenarios. Third, achieving precise alignment between generated content and human intention requires more effective interaction mechanisms to capture fine-grained user requirements, because current models experience difficulty in accurately understanding ambiguous editing instructions. Finally, the high computational cost of current models (especially diffusion and large multimodal models) hinders their deployment in real-time interaction scenarios, such as mobile devices and edge computing platforms. In the future, with the continuous enhancement of multimodal large model capabilities, the ongoing optimization of diffusion model architectures (e.g., efficiency-oriented lightweight designs, distillation strategies, and fast sampling algorithms), and the refinement of high-quality benchmark datasets (covering more diverse languages, scenes, and text styles), visual text image generation and editing technologies will play increasingly important roles in multiple fields. In intelligent media creation, they can assist designers in quickly generating and editing text elements in posters, videos, and animations. In information visualization, they can dynamically generate scene-adaptive text labels for data charts. In cultural heritage preservation, they can restore blurred or damaged text in ancient artifacts and manuscripts without damaging the original cultural relics. In accessible reading, they can generate large-font, high-contrast accessible text for visually impaired users to improve reading experience. 
Ultimately, these technologies will become key enablers for advancing human-computer interaction and visual intelligence in the digital era, bridging the gap between text information and visual scenes.  
      Keywords: visual text removal (VTR); visual text editing (VTE); visual text generation (VTG); diffusion models; multimodal learning; image generation
      Updated: 2026-06-18
    • Review of Intelligent Digital Human Content Generation Technology

      Yang Hang, Liu Na, Meng Lei, Mao Qirong, Li Manyi, Li Xiang, Wang Chengjie, Zhu Junwei, Wang Pengjie
      Vol. 31, Issue 6, Pages: 2125-2143(2026) DOI: 10.11834/jig.260074
      Abstract: Intelligent digital humans have evolved rapidly with advances in computer graphics, computer vision, speech synthesis, and multimodal generative modeling. Having begun as virtual avatars focused on visual representation, digital humans are now developing toward dynamic motion modeling, emotion-aware interaction, and real-time deployment. This review systematically surveys recent research progress in intelligent digital human content generation, organized around three core technical directions: video-to-digital human generation, 3D human motion synthesis and editing, and emotion-driven digital human generation. In addition, practical considerations for real-time on-device deployment are discussed. Video-to-digital human generation serves as the foundational stage of digital human construction. Its objective is to reconstruct animatable 3D human avatars from monocular, multi-view, or in-the-wild video input. Early approaches primarily relied on implicit neural representations, such as neural radiance fields, frequently combined with parametric body models, such as the skinned multi-person linear model (SMPL). Although implicit methods provide continuous and high-fidelity geometric representation, their rendering efficiency limits real-time applicability. Recent studies have shifted toward explicit or hybrid representations, particularly 3D Gaussian splatting, which significantly improves rendering speed while maintaining visual quality. Extensions that incorporate SMPL or SMPL-X priors further enhance geometric stability and animatability. In multi-view settings, stronger geometric constraints improve reconstruction accuracy, while open-scene scenarios introduce additional challenges, such as occlusion handling, multi-person interaction, and background interference. Despite notable progress, maintaining temporal consistency and geometric robustness in complex environments remains an open problem. 
In addition to geometric reconstruction, 3D human motion synthesis and editing enable digital humans to exhibit realistic dynamic behaviors. Compared with static modeling, motion generation requires accurate modeling of high-dimensional temporal distributions under kinematic and physical constraints. Early approaches based on statistical models or variational autoencoders improved representation capacity but frequently suffered from limited motion diversity. In recent years, diffusion models have become the dominant paradigm for motion generation due to their strong capability to model complex multimodal distributions. Representative frameworks demonstrate improved motion realism, diversity, and semantic alignment with textual or conditional input. Latent diffusion strategies further enhance efficiency by performing denoising processes in compact latent spaces. Beyond unconditional generation, condition-driven and fine-grained motion editing have attracted increasing attention. Text-guided editing frameworks allow local modification of specific joints or temporal segments while preserving overall motion style. Skeleton-aware and physics-guided diffusion models introduce structural constraints to improve anatomical plausibility and reduce artifacts, such as foot sliding. Moreover, research has gradually expanded toward multi-person interaction modeling and long-sequence coherence, addressing challenges in action composition, interaction synchronization, and environment-aware motion planning. Nevertheless, balancing physical consistency, computational efficiency, and controllability remains a critical challenge in practical applications. Emotion-driven digital human generation further enhances interactivity and human-likeness. This direction includes facial expression synthesis, emotional speech synthesis, and multi-turn empathetic interaction modeling. 
In facial animation, research has progressed from parameterized 3D morphable models to implicit neural rendering and, more recently, Gaussian-based explicit representations that achieve improved fidelity and real-time performance. In emotional speech synthesis, end-to-end neural architectures, non-autoregressive frameworks, and neural codec language models enable expressive, zero-shot, and fine-grained controllable speech generation. Meanwhile, emotion modeling in interactive dialogue systems has evolved from passive emotion recognition toward empathetic response generation, incorporating graph-based contextual modeling and large language model fine-tuning strategies. Although current systems can generate recognizable emotional expressions, challenges remain in maintaining emotional consistency over long interactions, decoupling emotion from underlying controllable factors, and ensuring cross-modal alignment in the speech, facial motion, and semantic contexts. To support real-world deployment, real-time on-device digital human systems have also gained attention. Lightweight models, reduced-resolution rendering, intermediate parameter representations, and efficient inference frameworks are commonly adopted under constrained computational resources. In practical applications, a collaborative architecture is frequently employed, where large language models handle semantic reasoning in the cloud while speech synthesis and avatar rendering are executed locally. This edge-cloud collaboration balances interaction latency and generation quality, facilitating scalable deployment in mobile and desktop environments. In addition to reviewing representative models and technical routes, this work summarizes commonly used datasets and evaluation metrics across subfields, including text-to-motion benchmarks, emotional speech corpora, and multi-view reconstruction datasets. 
Performance is typically evaluated from multiple perspectives, such as perceptual realism, geometric accuracy, semantic alignment, motion stability, and subjective human assessment. To facilitate reproducible research and provide a centralized resource for the community, we have curated all surveyed datasets, benchmark links, and a structured list of representative models in a public GitHub repository, available at: https://github.com/blue-cola-bc/Overview-of-Intelligent-Digital-Humans.git. Despite rapid progress, unified benchmarks for long-term interactive digital humans are still lacking. Overall, intelligent digital human technology is advancing along a progressive pathway from geometric reconstruction to motion generation and, finally, to emotion-aware interaction. Future research is expected to focus on unified multimodal generative frameworks, improved long-term consistency modeling, physics-aware motion control, and efficient real-time deployment. By systematically organizing recent developments and open challenges, this review aims to provide a structured understanding of current progress and potential research directions in intelligent digital human content generation.  
      Keywords: digital human technology; 3D human motion synthesis and editing; digital human generation; diffusion model; video-to-digital human generation; multimodal emotional interaction
      Updated: 2026-06-18
    • Cross-modal 3D Generation: principles, methods, and recent advances

      Chen Zhineng, Yuan Zhaoquan, Yang Xiaoshan, Cao Yixin, Li Liang, Wu Xiao, Bao Bingkun
      Vol. 31, Issue 6, Pages: 2144-2181(2026) DOI: 10.11834/jig.250655
      Abstract: Cross-modal 3D generation aims to automate the synthesis of high-fidelity 3D geometry and texture from 1D or 2D modalities, such as text and images, to bridge the semantic gap between the virtual and physical worlds. In recent years, cross-modal 3D generation has become a pivotal technology in multimedia analysis, natural language processing, computer vision, and computer graphics, especially in emerging industries such as virtual reality, augmented reality, the metaverse, digital twin manufacturing, and autonomous robotics, where the demand for high-quality 3D content is growing rapidly. The advancement of deep learning and pretrained multimodal large models has significantly improved the performance of 3D content generation and its applications. In particular, the emergence of advanced techniques, including implicit neural representations, 2D-to-3D knowledge distillation, and large-scale feed-forward reconstruction frameworks, has led to a qualitative leap in generation efficiency and visual realism. However, a comprehensive review connecting data representations, semantic alignment mechanisms, and model architectures is required to clarify the complex technical routes in this rapidly evolving field. This paper therefore provides a systematic and timely review of the principles, methods, and recent advances of cross-modal 3D generation. First, this paper presents a systematic analysis of 3D data representations, categorized into explicit, implicit, and hybrid classes. Specifically, explicit representations offer compatibility with graphics engines and efficient geometry processing, whereas implicit representations enable continuous topology handling and photorealistic view synthesis. Second, this paper dissects the core mechanisms of semantic alignment and model architectures, and introduces typical datasets for these tasks. 
This paper classifies alignment strategies into contrastive and generative approaches, while the architectures cover adversarial, variational, diffusion-based, and autoregressive models. From the perspective of generation tasks, existing methods can be divided into three major categories: text-to-3D generation, image-to-3D generation, and 3D scene generation. Specifically, text-to-3D generation has evolved from optimization-based methods to feed-forward native 3D generation, focusing on resolving multiview inconsistency and enhancing geometry-texture decoupling. Image-to-3D generation has transitioned from single-view reconstruction using 2D priors to multiview consistent generation and direct regression models that infer 3D structures in seconds. 3D scene generation extends these capabilities to complex spatial layouts, utilizing video priors and procedural generation to handle large-scale environments. In addition, this paper summarizes mainstream datasets and analyzes how the shift from large-scale synthetic repositories to real-world scanned collections affects model generalization. Finally, this review highlights the remaining challenges in the community, including semantic ambiguity, physical inconsistency, and high computational cost, and offers an outlook on future directions. It argues that cross-modal 3D generation is accelerating toward the era of “world models”, characterized by 4D spatiotemporal understanding, physical extractability, unified multimodal architectures, and 3D-native foundation models. By surveying the field from data representation to model architectures, this paper aims to provide a knowledge framework for future research and to promote the application and development of cross-modal 3D content generation in tasks related to world understanding and creation. 
The datasets, algorithms, and evaluation metrics mentioned are linked at: https://github.com/L-Matilda/Cross-modal-3D-Generation.  
      Keywords: cross-modal 3D generation; text-to-3D generation; image-to-3D generation; 3D scene generation; semantic alignment
      Updated: 2026-06-18
    • Frontiers and prospects of 3D reconstruction and generation

      Han Xiaoguang, Xiu Yuliang, Xu Zhen, Lian Zhouhui, Peng Sida, Yao Yao, Chen Anpei, Huang Jingwei, Zhang Bang, Xu Lan, Xu Feng, Zhang Guofeng, Xu Weiwei, Yu Jingyi, Liu Ligang, Chen Baoquan, Liu Yebin, Zhou Xiaowei
      Vol. 31, Issue 6, Pages: 2182-2197(2026) DOI: 10.11834/jig.260070
      Abstract: The field of 3D vision is undergoing a profound paradigm shift, moving from a traditional core of “perception and reconstruction”, which emphasizes the faithful recovery of geometry from observation, to a new, integrated stage characterized by “reconstruction-generation-interaction”. This work provides a systematic and critical review of frontier advances in 3D reconstruction and generation technologies, covering key directions that include 3D reconstruction paradigms, 3D object and scene generation, and the evolution of 3D digital humans. Specifically, it analyzes the fundamental differences in principle between optimization-based and feed-forward reconstruction methods; evaluates the current status and bottlenecks of object-level generation, computer-aided design (CAD) generation, and embodied artificial intelligence (AI) scene generation; and compares the performance and future viability of 2D versus 3D digital human technologies in the context of real-time rendering and complex spatial interactions. The analysis indicates that two distinct technical paradigms have emerged in 3D reconstruction, each with its own strengths and limitations. Optimization-based methods, exemplified by classical pipelines such as COLMAP and by modern neural representations such as neural radiance fields and 3D Gaussian splatting, excel at high-precision geometric recovery. By defining 3D representations as optimizable structures and iteratively minimizing the photometric error between rendered results and ground truth, these methods can achieve submillimeter-level accuracy on standard benchmarks, such as the Technical University of Denmark (DTU) dataset. However, they suffer from significant computational redundancy, frequently requiring hundreds or thousands of iterations, and exhibit poor robustness when dealing with ill-posed problems, such as sparse-view input or textureless regions. 
By contrast, feed-forward reconstruction methods, exemplified by architectures such as VGGT, DUSt3R, and Fast3R, mark a shift toward data-driven direct prediction. These models use neural networks to predict 3D geometry from input images in a single forward pass, offering rapid inference and superior generalization in underdetermined conditions. However, they currently lack the fine-grained detail of optimization methods, often producing “blurry” geometry due to the domain gap between synthetic training data and real-world scenarios. The review identifies two mainstream trends: the deep fusion of the two paradigms, such as using feed-forward models to initialize optimization processes (e.g., InstantSplat), and the injection of multimodal semantic information. Notably, optimization methods excel at fusing pixel-aligned modalities, such as depth and normals, while feed-forward methods are more adept at integrating abstract semantic signals, such as text and audio. In the field of 3D generation, the technical focus has shifted rapidly from pure visual quality to structural controllability and part-level manipulation. Although “3D native” large models have succeeded in generating high-resolution meshes, significant challenges remain in bridging the gap between AI generation and industrial production standards. A critical bottleneck lies in CAD generation. Although recent approaches have attempted to model CAD design as a sequence learning problem or as graph-based B-rep generation, they frequently face the “dirty geometry” challenge. Generated models often exhibit non-watertight surfaces, irregular topology, and defective assembly relations, failing to meet the strict constraints required for manufacturing, physics simulation, or downstream engineering tasks. Furthermore, this study highlights the intersection of 3D generation and embodied AI. 
For robotic simulation and training, the demand is not merely for static visual assets but for interactive scenes in which objects possess articulated parts, physical properties, and realistic but messy layouts. Current generative methods struggle to produce such complex, unordered scenes that reflect the reality of the physical world, yet such scenes are essential for training robust embodied agents. With regard to digital human technology, the field is witnessing fierce competition between 2D and 3D approaches. 2D generation methods, which are driven by advanced video diffusion models (e.g., Sora, Animate Anyone), have made tremendous progress, delivering hyperrealistic visuals and rapid iteration speed that challenge the necessity of traditional 3D pipelines. These 2D models leverage massive video datasets to learn motion and appearance, frequently bypassing the need for explicit geometric modeling. However, the analysis argues that 3D technology remains irreplaceable in handling complex spatial interactions. 3D digital humans (e.g., Gaussian avatars) provide the precise geometric consistency, explicit depth information, and collision data that are necessary for immersive virtual reality/augmented reality experiences and autonomous system interactions, capabilities that 2D video generation currently lacks. This study suggests a future trend in which 2D and 3D technologies converge, with massive 2D video data being used to supervise and refine 3D representations, moving toward “multimodal interactive avatars” that are capable of understanding and generating speech, gesture, and expression in real-time environments. This study reveals that the 3D field is experiencing a fundamental paradigm shift from “observation-driven reconstruction” to “data-driven generation”. 
Future developments will likely focus on three strategic directions: 1) the deep fusion of feed-forward and optimization-based methods to solve the inherent trade-off between robustness and precision; 2) the evolution of 3D generation toward industrial usability and editability, ensuring that assets are physically valid and topologically sound rather than merely visually plausible; and 3) the deep coupling of 3D technology with scenarios, such as embodied AI and digital humans, where generation serves functional interaction needs. Ultimately, 3D reconstruction and generation are no longer isolated visual problems but have evolved into foundational capabilities that support virtual-real fusion and intelligent decision-making.  
      Keywords: 3D reconstruction; 3D generation; digital human; spatial intelligence; embodied AI
      Updated: 2026-06-18
    • Survey on large model-empowered visualization and visual analytics

      Wang Yunhai, Cao Nan, Chen Siming, Li Chenhui, Zeng Wei, Tao Jun, Zeng Qiong, Wang Changbo, Zhang Jiawan
      Vol. 31, Issue 6, Pages: 2198-2221(2026) DOI: 10.11834/jig.260046
      Abstract: Data visualization, as a fundamental technology supporting human cognition and data-driven scientific reasoning, is undergoing a profound shift driven by the rapid development of large-scale models and is attracting widespread attention from academia and industry. For decades, traditional visualization approaches have mainly relied on manually or automatically designed visual encoding rules, graphical grammars, and human-computer interaction techniques. Through explicit visual mappings, graphical composition, and interface operations, these methods enable users to observe patterns, understand structures, and communicate insights from data. However, with the continuous growth of data size, task complexity, and decision-making contexts, statistical visual mappings and parameter-driven interaction are increasingly insufficient to support modern analytical needs. Users now demand more than visual presentation alone; they require systems capable of semantic understanding, task-driven reasoning, and cross-modal information integration. The emergence of large models, particularly large language models, vision-language foundation models, and intelligent agents, has introduced unprecedented technological momentum to the visualization field. These models are not only transforming how visualizations are generated and interacted with but also reshaping the theoretical foundations, analytical workflows, and narrative practices of data visualization. Their strong capabilities in semantic representation, abstraction, and reasoning enable visualization systems to move beyond surface-level depiction toward deeper support for analytical intent and knowledge construction. From a theoretical perspective, visualization research has long been grounded in the Grammar of Graphics and declarative specification languages, which provide highly abstract and formalized descriptions of graphical structures, data mappings, and interaction logic. 
The introduction of large models does not replace this foundation; instead, it reinforces its importance. Declarative grammars increasingly function as an interpretable “intermediate language” between natural-language intent and executable visual representations, enabling large models to translate high-level analytical goals reliably into controlled, consistent visualization specifications. This mediation is critical for ensuring the interpretability, reproducibility, and controllability of automatically generated visualizations. Moreover, the semantic reasoning capabilities of large models allow them to infer the underlying intent and logic behind visual layouts, color encodings, and spatial structures, which pushes visualization research from low-level perceptual encoding toward semantic-driven visual understanding and intent modeling. At the same time, techniques such as differentiable rendering, neural implicit representations, and Gaussian splatting provide new frameworks for high-fidelity scientific rendering, continuous data representation, and optimization in parameterized spaces. These methods enable visualization systems to become differentiable, learnable, and optimizable, and allow tighter integration between visual representation, computational models, and analytical objectives. As a result, visualization is increasingly situated within richer signal and representation spaces, supporting adaptive, expressive, and controllable visual analysis. Beyond theoretical advances, large models are fundamentally changing how users collaborate with visualization systems. Traditionally, effective visualization required users to master visualization grammars, tool-specific operations, and data processing workflows. By contrast, large-model-driven systems allow users to express analytical goals, design preferences, and data-related questions directly through natural language. 
The system can then interpret user intent, generate appropriate visualizations, restructure data views, or optimize visual encodings accordingly. This fusion of model-based semantic knowledge with formal visualization representations forms the foundation of a new generation of intelligent visualization systems. At the level of visual analytics, large models and agents are driving a transition from human-centered interaction toward a hybrid collaborative paradigm involving humans, agents, and knowledge. Early machine-learning-for-visualization research primarily focused on addressing isolated subtasks, such as view recommendation, feature detection, or layout optimization. By contrast, visualization frameworks based on large language models (LLMs) leverage unified semantic understanding across tasks and modalities to support end-to-end analytical pipelines. These systems can assist with visualization generation, pattern discovery, and reasoning-based explanation, and enable more holistic support for complex analytical processes. As human-machine collaboration evolves from simple question answering toward knowledge generation, large models increasingly function as proactive cognitive collaborators rather than passive assistants. They can produce trend interpretations, support hypothesis testing, and summarize underlying mechanisms while continuously modeling user behavior, analytical state, and task context. This approach enables a shift from reactive assistance to mixed-initiative collaboration, where systems actively guide users through complex, dynamic, and uncertain analytical environments, augmenting human reasoning and sensemaking capabilities. In the domain of narrative visualization, large models mark the beginning of a new era of intelligent content generation. Traditional narrative visualization requires expertise in design, storytelling, writing, and programming, which makes producing high-quality narratives costly and time consuming. 
Large models significantly lower this barrier by automatically identifying data themes, extracting salient trends, constructing narrative structures, and integrating visual, textual, and multimodal elements into coherent, stylistically consistent narratives. This capability substantially improves efficiency and accessibility and enables end-to-end automation from data to story in applications such as data journalism, educational visualization, science communication, and business reporting. With the maturation of multimodal generation technologies, narrative processes can further incorporate natural-language interaction, dynamically adapt narrative perspectives, and tailor content to audiences with different backgrounds and goals. In visualization evaluation, large models demonstrate promising potential due to their understanding of visual aesthetics, layout quality, perceptual principles, and readability. They can automatically assess visualization quality, detect misleading encodings, propose design improvements, and generate multiple design alternatives for comparison. However, model-based visual judgment is not inherently equivalent to human perception, a situation raising critical research challenges. Ensuring alignment between model assessments and human visual cognition, mitigating hallucinations and biases, and improving transparency, reliability, and interpretability remain central issues in large-model-driven visualization research. To organize the rapidly evolving landscape of large-model-powered data visualization systematically, this paper presents a comprehensive survey from four perspectives: visualization fundamentals, visual analytics, narrative visualization, and visualization evaluation. This paper reviews representative research advances in each area, analyzes emerging technologies enabled by large models, and discusses key challenges and future research directions. 
By providing a structured knowledge map and theoretical framework, this survey aims to support future innovation in intelligent visualization systems and contribute to the development of data visualization as a critical bridge between human intelligence and artificial intelligence in the era of large models.  
      Keywords: fundamental theories of visualization; visualization interaction; visual analytics; visual storytelling; large language models
      Updated: 2026-06-18
    • Advances in video and image security research in the era of large models

      Sang Nong, Huang Kaiqi, Zhao Yao, Gao Changxin, Kao Yueying, Tan Chuangchuang, Wang Xiang, Wu Meiqi, Yin Wenti
      Vol. 31, Issue 6, Pages: 2222-2259(2026) DOI: 10.11834/jig.250656
      Abstract: With the rapid advancement of multimodal large models and generative artificial intelligence (AI), the paradigms of image and video acquisition, understanding, and generation are undergoing profound transformations. In recent years, new-generation AI systems represented by vision-language pretraining models and diffusion-based generative models have achieved remarkable progress in semantic alignment, cross-modal understanding, and high-fidelity content generation. By leveraging large-scale data and powerful representation learning capabilities, these models have significantly enhanced the performance and flexibility of visual intelligence systems, promoting their widespread adoption in intelligent security, content creation, industrial inspection, and public governance. At the same time, the increasing capability and deployment of visual intelligence systems have exposed a series of security risks and governance challenges that are increasingly prominent and cannot be disregarded. From the perspective of image and video understanding, existing visual models are frequently required to operate in complex and open-world environments that are characterized by dynamic scenes, background clutter, illumination variation, viewpoint changes, and long-tailed event distributions. In such scenarios, the cost of obtaining large-scale, fine-grained annotations is prohibitively high, leading many practical systems to rely on limited supervision or weak labels. Although large pretrained models exhibit strong generalization capability, they still suffer from misclassification, semantic bias, and insufficient robustness when faced with domain shift, distribution mismatch, and unseen abnormal patterns. These limitations are particularly evident in safety-critical applications, where incorrect predictions or unstable behavior may result in serious consequences. 
Therefore, improving the reliability, robustness, and interpretability of image and video understanding systems has become a central topic in visual security research. In this context, anomaly detection has emerged as a core task for understanding security, because it aims to identify rare, unexpected, or abnormal events from complex visual data. Existing anomaly detection methods can be broadly categorized into fully supervised, semi-supervised, weakly supervised, and unsupervised paradigms in accordance with the availability and granularity of annotations. Fully supervised approaches rely on precise frame-level or pixel-level labels and typically achieve strong performance under controlled conditions, but their scalability and generalization capability are limited in real-world scenarios. Semi-supervised and unsupervised methods, which assume access only to normal samples during training, attempt to model normal patterns through reconstruction, prediction, or one-class learning, and detect anomalies as deviations from learned normality. Weakly supervised approaches, which are frequently formulated under the multiple instance learning framework, achieve a balance between annotation cost and detection performance, but still face challenges in accurate temporal localization and the semantic interpretation of anomalies. With the emergence of vision-language large models, recent studies have begun exploring new paradigms for anomaly detection and visual understanding security. By leveraging pretrained cross-modal representations and natural language supervision, vision-language models enable zero-shot and few-shot anomaly detection, reducing reliance on task-specific annotations. Open-vocabulary anomaly recognition further allows models to detect and describe abnormal events beyond a fixed set of predefined categories, improving flexibility in open-world environments. 
In addition, explainable anomaly detection methods based on cross-modal alignment and attention mechanisms provide semantic-level interpretations for detected anomalies, enhancing transparency and trustworthiness. These advances indicate a clear trend toward more general, scalable, and interpretable frameworks for understanding security. From the perspective of image and video generation, recent progress in generative adversarial networks (GANs) and diffusion models (DMs) has considerably improved the realism and controllability of synthesized visual content. GAN-based methods introduce adversarial learning mechanisms to produce visually plausible samples, while DMs further enhance generation quality and training stability through iterative denoising processes. Building on these foundations, modern text-to-image and text-to-video generation systems integrate large vision-language models to achieve fine-grained semantic control, enabling the generation of complex scenes that closely resemble real-world data. These developments have brought significant benefits to creative industries and visual content production, but they have also amplified security risks associated with the misuse of generative technologies. High-quality synthetic images and videos can be maliciously exploited for deepfake generation, false information dissemination, identity impersonation, and privacy infringement, posing direct threats to social trust and public security. Consequently, generation security has become an essential component of image and video security research. Existing deepfake detection methods have evolved alongside generative models and can be roughly divided into several categories, including approaches based on visual artifacts, frequency-domain characteristics, temporal consistency, and semantic coherence. 
While early methods focused on detecting low-level inconsistencies introduced by generation algorithms, recent approaches increasingly emphasize higher-level semantic and temporal modeling to cope with the rapid improvement of generative quality. In addition to algorithmic research, security issues associated with image and video generation have also attracted growing attention in policy regulation and engineering practice. Detection systems are being integrated into real-world platforms to support content moderation, authenticity verification, and risk assessment. Meanwhile, regulatory frameworks and technical guidelines are gradually being established to govern the responsible use of generative models. These efforts highlight the necessity of combining technical solutions with governance mechanisms to address the challenges posed by generative visual technologies. Finally, despite substantial progress, image and video security in the era of large models still faces several open challenges. For understanding security, improving robustness under complex environmental changes, achieving precise temporal and spatial localization of anomalies, and enhancing semantic interpretability remain key research problems. For generation security, developing generalizable deepfake detection methods that can adapt to rapidly evolving generative models remains an open issue. Moreover, balancing model capability, usability, and security constraints requires further exploration. By systematically reviewing existing research from the perspectives of understanding security and generation security, this study aims to provide a structured overview of the current landscape and offer insights into future research directions for image and video security in the era of large models.
      Keywords: multimodal large model; generative AI; image and video security; anomaly detection; deepfake detection
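The abstract above notes that weakly supervised anomaly detection is frequently formulated under multiple instance learning (MIL). As a rough illustration only, the sketch below shows one common MIL-style ranking objective: each video is a "bag" of per-segment anomaly scores, and the highest-scoring segment of an abnormal video is pushed above the highest-scoring segment of a normal video. The function name, the margin value, and the example scores are assumptions made for illustration and are not taken from the reviewed paper.

```python
def mil_ranking_loss(abnormal_bag, normal_bag, margin=1.0):
    """Hinge-style MIL ranking loss between two bags of segment scores.

    abnormal_bag: scores for segments of a video labeled abnormal
                  (only the video-level label is known).
    normal_bag:   scores for segments of a video labeled normal.
    The loss is zero once the top abnormal score exceeds the top
    normal score by at least `margin`.
    """
    return max(0.0, margin - max(abnormal_bag) + max(normal_bag))


# Hypothetical segment scores in [0, 1] (illustrative values only).
well_separated = mil_ranking_loss([0.1, 0.9, 0.2], [0.1, 0.2, 0.05])
poorly_separated = mil_ranking_loss([0.3, 0.2], [0.4, 0.1])

# A better-separated pair of bags yields a smaller loss.
print(well_separated < poorly_separated)
```

Because only the maximum score of each bag enters the loss, no frame-level annotation is needed, which is the trade-off between annotation cost and localization accuracy that the abstract describes.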
      Updated: 2026-06-18
    • Review of development and application of deepfake face detection technology

      Li Weibin, Feng Yuting, Hou Biao, Jiao Licheng
      Vol. 31, Issue 6, Pages: 2260-2278(2026) DOI: 10.11834/jig.250556
      Review of development and application of deepfake face detection technology
      Abstract: With the rapid maturation and democratization of deep learning technologies, deepfake generation has evolved from a niche technical curiosity into a pervasive global phenomenon. As these tools become increasingly accessible, forged images and videos are proliferating across the Internet at an unprecedented scale. Among these, deepfake images and videos targeting human faces, particularly those involving identity manipulation of public figures and celebrities, are frequently exploited for malicious purposes. These uses range from defamation and nonconsensual pornography to the manipulation of public opinion and the dissemination of disinformation, and thereby pose severe threats to digital trust, individual reputation, and social stability. Consequently, the development of robust deepfake face detection (DFD) technology has emerged as a critical research frontier and a strategic priority for the academic community and the industrial sector. This review provides a systematic, comprehensive survey of the current landscape of DFD, with a specific focus on the prevalent manipulation technique of face swapping. The paper first scrutinizes the evolution of underlying generation mechanisms, detailing typical forgery methods including variational autoencoders, generative adversarial networks, and the emerging diffusion models. According to the attributes they change, face forgery methods can be classified into four common forms: facial transfer, facial swapping, facial reenactment, and facial editing. Subsequently, the review proposes a structured taxonomy of detection methodologies and categorizes them into three primary streams based on model architecture: CNN-based methods, Transformer-based methods, and new paradigms. CNN-based methods currently represent the mainstream approach. 
This paper analyzes their diverse structural characteristics and further subdivides them into six distinct subcategories based on their feature extraction strategies, such as basic CNN architecture and frequency domain analysis. It also analyzes the advantages, disadvantages, and applicable scenarios of each structure. The review highlights Transformer-based methods for their recent rapid advancement and discusses them in two branches. The first consists solely of the Transformer structure, where adding modules and enhancing the structure can improve detection accuracy and computational efficiency. The second, the CNN-Transformer hybrid architecture, makes full use of the local receptive field of the CNN architecture and the global modeling capability of the Transformer. The new paradigms section explores recent innovations, specifically self-supervised/unsupervised learning and the integration of large language models (LLMs). The paper elucidates how self-supervised/unsupervised learning mitigates the dependency on labeled data and effectively avoids the overfitting bias caused by specific datasets or forgery types. Furthermore, this paper examines how integrating LLMs introduces semantic understanding and text features and significantly enhances the generalization capabilities and the interpretability of detection decisions. In the domain of data infrastructure, the paper traces the evolution of deepfake datasets and contrasts classic datasets, which rely solely on visual data (images or videos), with the new generation of multimodal datasets that have emerged over the past two years. These modern datasets, enriched with human annotations and descriptive text prompts, are pivotal for training more context-aware detection models and thereby fundamentally redefine the objective from simplistic perceptual categorization to a more demanding exercise in cognitive reasoning and contextual justification. 
Training with multimodal datasets can thus improve the interpretability of detection methods. Evaluation is also a critical component of this survey. The paper provides a comprehensive summary of evaluation metrics, distinguishing between classification performance, generalization performance, and practical deployment metrics. Because DFD is a binary classification task, common classification evaluation metrics can be used. However, the real and fake categories of the data samples are generally imbalanced in DFD, so some metrics, such as accuracy, have certain limitations. Therefore, this paper advocates adopting more reliable metrics. Regarding generalization, the paper details protocols for cross-forgery and cross-dataset evaluation, which are essential for measuring robustness against unseen attacks. Generalization is an indispensable indicator during model evaluation, especially in the current era where forgery techniques are diverse and rapidly evolving. Additionally, considering application requirements such as real-time monitoring on edge devices, the review summarizes efficiency metrics, including inference speed, memory footprint, and parameter count, which matter during model deployment. Moreover, from an application perspective, DFD technology can be used to prevent the spread of false information and to stop criminals from using deepfake face technology to steal people’s information, property, and other assets. Thus, this review synthesizes four major deployment scenarios: verification of media content authenticity, protection of digital identity security, judicial evidence collection/forensics, and the automated supervision of social network content. In parallel, this work reviews the developing legal landscape, specifically domestic regulations governing deep synthesis services, and highlights the intersection of technology and law. 
Finally, based on a dialectical analysis of the primary contradictions in the field (such as the failure of detection features due to the evolution of generation technology and insufficient data-driven and generalization capabilities), this paper identifies critical future trends: the deeper integration of detection systems with LLMs, the pursuit of explainable, highly generalizable detection models to enhance trust, model lightweighting for edge device deployment, and the continuous refinement of industry regulations to govern the ethical use of synthetic media. In summary, DFD technology, as a core technology for safeguarding truth and trust in the digital age, is currently in a stage of rapid evolution and multidisciplinary integration, and it merits researchers’ continued time and effort. A GitHub repository summarizing the relevant content is available at: https://github.com/yttttkskr/2025-deepfake-detection.
      Keywords: deepfake face; deepfake face detection (DFD); convolutional neural network (CNN); Transformer; self-supervised/unsupervised learning; large language models (LLMs)
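The abstract above observes that accuracy has limitations when real and fake samples are imbalanced and advocates more reliable metrics, without naming them here. As a minimal sketch under that assumption, the example below compares accuracy with two commonly used alternatives, balanced accuracy and ROC AUC, on a hypothetical test set of 95 real and 5 fake samples. The data, the function names, and the choice of these two metrics are illustrative assumptions, not the paper's protocol.

```python
def accuracy(labels, preds):
    """Fraction of samples whose predicted class matches the label."""
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


def balanced_accuracy(labels, preds):
    """Mean of per-class recall over the classes real (0) and fake (1)."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        recalls.append(sum(int(preds[i] == cls) for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def roc_auc(labels, scores):
    """Probability that a random fake scores above a random real sample.

    Ties count as half a win (the rank-statistic form of ROC AUC).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced test set: 95 real (0) and 5 fake (1) samples.
labels = [0] * 95 + [1] * 5

# A degenerate detector that calls every sample "real".
always_real_preds = [0] * 100
always_real_scores = [0.0] * 100

acc = accuracy(labels, always_real_preds)
bal_acc = balanced_accuracy(labels, always_real_preds)
auc = roc_auc(labels, always_real_scores)
print(acc, bal_acc, auc)
```

The detector misses every fake yet still reaches high accuracy, while balanced accuracy and AUC stay at chance level, which is the limitation the abstract points to.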
      Updated: 2026-06-18