Latest Issue

    Vol. 31, No. 3, 2026

      Review

    • Multiview stereo 3D reconstruction research: a survey AI Digest

      Multiview 3D reconstruction is one of the key problems in computer vision and graphics, with wide applications in virtual reality, augmented reality, autonomous driving, and cultural relic restoration. Its core goal is to recover the geometric structure of a 3D scene from images or videos captured from multiple viewpoints, enabling high-precision 3D modeling of objects and scenes. The authors classify existing multiview 3D reconstruction methods into four categories along two dimensions, image projection versus geometry reasoning and global versus local modeling, and review the representative models, latest research progress, applicability, and limitations of each category.
      Yuan Zhenlong, Li Zehao, Chen Kehua, Mao Tianlu, Jiang Hao, Wang Zhaoqi
      Vol. 31, Issue 3, Pages: 657-685(2026) DOI: 10.11834/jig.250348
      Multiview stereo 3D reconstruction research: a survey
      摘要:Multiview stereo (MVS) 3D reconstruction aims to recover the three-dimensional geometric structure of a scene from multiple two-dimensional images or videos, enabling precise modeling of objects and environments. This task is fundamental to applications such as virtual reality, autonomous driving, smart city construction, and cultural heritage preservation. Traditional MVS methods rely on geometric constraints such as homography mappings and epipolar geometry to align multiview images and infer depth. However, these approaches are limited by their dependence on hand-crafted features and struggle with low-texture regions or dynamic scenes. The advent of deep learning has driven significant advancements, enabling end-to-end networks to automatically learn image-depth relationships and achieve higher accuracy and robustness. Despite these breakthroughs, challenges such as computational efficiency, generalization to complex scenes, and real-time performance remain unresolved. To systematically analyze the current state of MVS 3D reconstruction, this work categorizes existing methods into image projection-driven and geometry reasoning-driven paradigms, further subdividing them on the basis of local modeling and global modeling strategies. The paper also explores widely used datasets, evaluation metrics, and technical innovations within each category while identifying key challenges and proposing future research directions. First, this study investigates commonly used datasets and evaluation indicators in MVS 3D reconstruction. Datasets serve as the foundation for training and benchmarking algorithms and can be classified into three categories: synthetic datasets, real-scene datasets, and hybrid datasets. Synthetic datasets, such as DTU and Tanks and Temples, provide high-resolution, precisely annotated data for algorithm training and validation. Real-scene datasets, including ETH3D and LLFF, capture realistic environments with varying textures and lighting conditions, enabling evaluation under practical constraints. Hybrid datasets, such as nerf-synthesis and Synthetic Soccer NeRF, combine synthetic and real-world elements to address domain adaptation challenges. Evaluation metrics are critical for quantifying algorithm performance and are primarily divided into geometric accuracy metrics and rendering quality metrics. Geometric accuracy metrics, such as chamfer distance (CD) and F-score, measure the spatial fidelity between reconstructed and ground-truth models, with lower CD values and higher F-score values indicating better performance. Rendering quality metrics, including peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and learned perceptual image patch similarity (LPIPS), are used to assess the visual consistency of synthesized views, with higher PSNR and SSIM scores and lower LPIPS scores reflecting superior perceptual quality. These metrics collectively provide a comprehensive framework for evaluating MVS methods across diverse scenarios. Second, the paper classifies MVS 3D reconstruction methods on the basis of their technical frameworks and innovations. Methods can be broadly categorized into image projection-driven approaches and geometry reasoning-driven approaches. Image projection-driven methods, such as MVSNet and its variants (e.g., CasMVSNet, R-MVSNet), focus on explicit cost volume construction and global optimization to infer depth maps. 
These methods leverage feature pyramids and attention mechanisms to enhance robustness in low-texture regions and complex geometries. Geometry reasoning-driven methods, including NeRF, Plenoxels, and 3D Gaussian splatting, adopt implicit or explicit representations to model scenes. NeRF and its derivatives (e.g., mip-NeRF, Ref-NeRF) use neural radiance fields to achieve photorealistic view synthesis, while 3DGS introduces Gaussian point clouds for efficient rendering and dynamic scene modeling. Within these categories, methods can further be divided into local modeling and global modeling subcategories. Local modeling approaches, such as PatchMatch and Gipuma, emphasize pixel-level matching and plane-induced homography, while global modeling techniques prioritize holistic scene consistency through volumetric or hierarchical representations. Innovations in recent works include sparse-to-dense depth estimation, adaptive sampling strategies, and hybrid architectures that combine explicit structures (e.g., octrees) with implicit neural networks to balance efficiency and detail preservation. Third, the paper identifies unresolved challenges and outlines future research directions. Data-related challenges include the difficulty of collecting high-quality multi-view datasets for dynamic scenes and the need for scalable annotation tools to reduce human labor. From a methodological perspective, existing techniques face limitations in computational efficiency (e.g., cubic memory complexity in NeRF) and generalization to unseen environments (e.g., natural vegetation or translucent objects). Training paradigms also need to be improved; supervised methods depend on expensive 3D annotations, while unsupervised and self-supervised approaches often sacrifice reconstruction quality for reduced data dependency. To address these issues, future research should focus on the following: 1) lightweight and real-time optimization, which involves developing hardware-aware architectures (e.g., edge computing frameworks) and dynamic resolution adjustment to reduce GPU memory consumption and inference latency; 2) dynamic scene modeling, integrating temporal consistency constraints and physics-based priors to handle motion, occlusions, and deformable objects in real-world applications; 3) cross-modal fusion, leveraging multimodal inputs (e.g., LiDAR, inertial sensors) to enhance robustness in low-texture or adverse lighting conditions, and 4) semantic-aware reconstruction, incorporating semantic segmentation and instance-level reasoning to enable editable 3D models for virtual production and interactive design. The convergence of MVS with emerging technologies such as multimodal large models and metaverse platforms will further expand its applications. For example, real-time 3D reconstruction in autonomous vehicles requires not only geometric precision but also semantic understanding for obstacle detection. Similarly, metaverse environments demand high-fidelity, scalable reconstructions of large-scale urban scenes. To bridge the gap between academic advancements and industrial deployment, future efforts must prioritize domain adaptation, computational efficiency, and cross-modal learning. By addressing these challenges, MVS 3D reconstruction will play a pivotal role in shaping next-generation immersive technologies, robotics, and digital twins. Additionally, the integration of multitask learning and meta-learning could further enhance model generalization by jointly optimizing geometric and semantic tasks. 
For instance, meta-learning frameworks could adapt to novel object categories with minimal samples, while multitask architectures might simultaneously reconstruct geometry and infer surface properties (e.g., material reflectance). Furthermore, the development of 3D foundation models that are pretrained on vast heterogeneous datasets could unlock universal reconstruction capabilities across diverse domains. Ethical considerations, such as privacy in public space reconstruction and bias in dataset curation, must be addressed to ensure responsible deployment. Addressing these multidimensional challenges will enable MVS to evolve from a research-oriented technique to an indispensable tool for real-world 3D perception.  
      Keywords: multiview stereo (MVS); 3D reconstruction; 3D vision; neural radiance field (NeRF); 3D Gaussian splatting (3DGS)
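As a concrete illustration of the geometric accuracy metrics discussed in this abstract, the following is a minimal NumPy/SciPy sketch of chamfer distance and F-score between two point clouds; the function names and the distance threshold are illustrative assumptions, not code from any of the surveyed methods.

```python
# Minimal sketch of the geometric accuracy metrics mentioned above (chamfer
# distance and F-score) for two point clouds, using NumPy and SciPy only.
# Function names and the distance threshold are illustrative assumptions.
import numpy as np
from scipy.spatial import cKDTree


def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric chamfer distance between (N, 3) and (M, 3) point sets."""
    d_pred_to_gt, _ = cKDTree(gt).query(pred)   # nearest-gt distance per predicted point
    d_gt_to_pred, _ = cKDTree(pred).query(gt)   # nearest-pred distance per gt point
    return d_pred_to_gt.mean() + d_gt_to_pred.mean()


def f_score(pred: np.ndarray, gt: np.ndarray, threshold: float = 0.01) -> float:
    """F-score at a distance threshold: harmonic mean of precision and recall."""
    d_pred_to_gt, _ = cKDTree(gt).query(pred)
    d_gt_to_pred, _ = cKDTree(pred).query(gt)
    precision = (d_pred_to_gt < threshold).mean()  # fraction of predicted points near gt
    recall = (d_gt_to_pred < threshold).mean()     # fraction of gt points covered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.random((2048, 3))
    pred = gt + rng.normal(scale=0.005, size=gt.shape)  # noisy reconstruction
    print("CD:", chamfer_distance(pred, gt), "F-score:", f_score(pred, gt))
```

Lower chamfer distance and higher F-score indicate better reconstruction, matching the convention stated in the abstract.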
    • Review of deep learning-based 3D point cloud upsampling algorithms AI Digest

      With the spread of 3D scanning technology, point cloud data are widely used in fields such as autonomous driving, but they suffer from problems such as uneven density. The authors review deep learning-based point cloud upsampling techniques and establish a hierarchical classification framework, providing a new approach to improving point cloud data quality.
      Zhan Jie, Xu Fan, Shi Yizhou, Ma Kaiguang
      Vol. 31, Issue 3, Pages: 686-718(2026) DOI: 10.11834/jig.250314
      Review of deep learning-based 3D point cloud upsampling algorithms
      摘要:The widespread adoption of 3D scanning technologies and LiDAR systems has established point clouds as a critical data modality for mission-critical applications, including autonomous driving, robotic environmental perception, and digital archiving of cultural heritage. However, limitations in acquisition devices, combined with challenges such as environmental occlusion and surface material reflectivity, often result in sparse density, non-uniform distribution, and reduced geometric fidelity of raw point cloud data. These deficiencies significantly impair performance in downstream tasks, such as 3D object detection and semantic segmentation. Point cloud upsampling mitigates these challenges by employing advanced algorithms to transform low-resolution inputs into high-density, uniformly distributed 3D point sets, thereby restoring fine-grained geometric structures of object surfaces. Advancements in deep learning have significantly enhanced the quality and practical utility of point cloud data in this domain. This survey systematically reviews advancements in deep learning-based point cloud upsampling methods from 2018 to mid-2025. We categorize existing deep learning approaches into two primary paradigms on the basis of their training methodology: supervised and unsupervised learning. Supervised methods rely on high-quality paired training datasets to enable end-to-end learning of dense point cloud representations. Within this supervised domain, we analyze four key methodological paradigms on the basis of their operational principles. The three-stage upsampling paradigm, pioneered by PU-Net, decomposes the task into sequential phases: feature extraction using architectures like PointNet, GCNs, or Transformers; feature expansion via techniques such as duplication/folding, PixelShuffle-inspired mechanisms such as NodeShuffle, or novel kernel-to-displacement approaches; and coordinate reconstruction through direct regression or offset prediction, often enhanced by geometric correction modules. This paradigm is straightforward and widely adopted but struggles with arbitrary upsampling ratios and limitations in coordinate prediction accuracy. The geometry expansion-coordinate optimization paradigm, exemplified by methods such as Grad-PU, TP-NoDe, and ND-PUFlow, decouples the process. It first increases the point count in the geometric space via midpoint interpolation or density-aware noise addition to achieve any desired upsampling ratio, producing an intermediate point set; a subsequent coordinate optimization network refines these points to align precisely with the target surface. This approach supports arbitrary ratios but requires careful handling of potential artifacts, such as abnormal points or surface shrinkage. Generative models, leveraging frameworks such as generative adversarial networks, normalizing flows, and diffusion models, focus on learning the underlying distribution of dense point clouds, offering greater flexibility in generating high-fidelity results, though often at the cost of increased computational complexity and intricate hyperparameter tuning. 
Surface reconstruction-based methods, such as PUGeo-Net, Neural Points, PU-SDF, and APU-LDI, conceptualize upsampling as a two-step process, first involving the reconstruction of a continuous surface representation explicitly through parametric surfaces or voxel grids or implicitly via neural fields, occupancy fields, signed distance field, or unsigned distance field from the sparse input and then resampling this surface to generate the dense output. This strategy typically ensures superior geometric consistency and surface fidelity. By contrast, unsupervised methodologies reduce reliance on large-scale, meticulously labeled datasets by exploiting the intrinsic geometric structure of the input point cloud to derive supervision signals or by designing sophisticated self-supervised learning tasks. We categorize these strategies into four classes on the basis of their core mechanisms for generating training guidance: utilizing sparse point cloud for supervision, exemplified by L2G-AE, SSAS, and SPU-PMD; transforming sparse point clouds for supervision, represented by SSPU-Net, PPU, and SPU-IMR; constructing training data via downsampling sparse inputs, such as SPU-Net and ZSPU; and unsupervised point cloud upsampling based on the twin-network structure, including UPU-SNet and S3U-PVNet. This survey details key benchmark datasets essential for training and evaluation, including PU-GAN, the larger-scale PU1K, PUgeo (notable for including ground-truth normals), and real-world scanned datasets such as ScanObjectNN and KITTI, highlighting their characteristics and relevance. Furthermore, it explains standard evaluation metrics used to rigorously quantify performance, such as Chamfer distance, Hausdorff distance, and point-to-surface distance, emphasizing their roles in assessing reconstruction accuracy, surface proximity, and distribution uniformity. Comparative experimental results across these benchmarks are presented to illustrate the quantitative performance of representative state-of-the-art techniques. Several compelling research directions deserve attention. A key challenge is the development of adaptive arbitrary upsampling strategies that dynamically adjust the upsampling ratio on the basis of localized point density and geometric complexity, enabling precise enhancement where needed. Additionally, improving adaptability to complex topological structures, such as intricate edges, thin surfaces, and fine details, is critical. This task may be achieved through advanced feature extraction techniques inspired by point cloud classification and segmentation or by integrating multimodal data fusion with complementary sources such as RGB images or depth sequences. Addressing computational efficiency and real-time performance constraints is essential for practical deployment in latency-sensitive scenarios, such as autonomous navigation, requiring innovations in network architecture optimization, model simplification, and lightweight design. Finally, advancing robust unsupervised and self-supervised learning frameworks promises to significantly reduce reliance on costly labeled data, leveraging inherent geometric properties and novel pretext tasks to improve model generalization across diverse and challenging real-world environments.  
      Keywords: deep learning; 3D vision; point cloud upsampling; generative models; surface reconstruction; unsupervised learning
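To make the three-stage paradigm described in this abstract (feature extraction, feature expansion, coordinate reconstruction) more concrete, here is a minimal PyTorch sketch of a PU-Net-style pipeline with duplication-based expansion; the layer sizes and module choices are simplified assumptions for illustration, not the architecture of any specific surveyed method.

```python
# Minimal PyTorch sketch of the three-stage upsampling paradigm described above:
# per-point feature extraction -> feature expansion (duplication) -> coordinate
# reconstruction. Layer sizes and duplication-based expansion are simplified
# assumptions, not the exact design of PU-Net or any other surveyed method.
import torch
import torch.nn as nn


class ThreeStageUpsampler(nn.Module):
    def __init__(self, ratio: int = 4, feat_dim: int = 64):
        super().__init__()
        self.ratio = ratio
        # Stage 1: shared MLP over points (PointNet-style feature extraction).
        self.encoder = nn.Sequential(
            nn.Conv1d(3, feat_dim, 1), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, 1), nn.ReLU(),
        )
        # Stage 3: regress a 3D offset for each expanded feature.
        self.decoder = nn.Sequential(
            nn.Conv1d(feat_dim + 3, feat_dim, 1), nn.ReLU(),
            nn.Conv1d(feat_dim, 3, 1),
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (B, N, 3) sparse input -> (B, N * ratio, 3) dense output
        feats = self.encoder(xyz.transpose(1, 2))            # (B, C, N)
        # Stage 2: duplicate each per-point feature `ratio` times.
        feats = feats.repeat_interleave(self.ratio, dim=2)   # (B, C, N * ratio)
        seeds = xyz.transpose(1, 2).repeat_interleave(self.ratio, dim=2)
        offsets = self.decoder(torch.cat([feats, seeds], dim=1))
        return (seeds + offsets).transpose(1, 2)


if __name__ == "__main__":
    sparse = torch.rand(2, 256, 3)
    dense = ThreeStageUpsampler(ratio=4)(sparse)
    print(dense.shape)  # torch.Size([2, 1024, 3])
```

The fixed duplication factor also illustrates why this paradigm struggles with arbitrary upsampling ratios, which the geometry expansion-coordinate optimization paradigm in the abstract is designed to address.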
    • Trajectory planning for autonomous driving: a survey AI Digest

      Autonomous driving technology is transformative for the transportation sector. The authors survey state-of-the-art results in trajectory planning for autonomous driving from three perspectives, traditional algorithms, reinforcement learning, and imitation learning, providing research directions and references for future researchers.
      Li Chengxiang, Hu Haixiao, Guo Dabo, Cheng Rongmin, Wu Hongkun, Yuan Chang
      Vol. 31, Issue 3, Pages: 719-744(2026) DOI: 10.11834/jig.240774
      Trajectory planning for autonomous driving: a survey
      摘要:Autonomous driving technology represents a cornerstone of innovation within the transportation sector, with the potential to dramatically enhance the safety and efficiency of transportation systems on a global scale. This technological advancement has received substantial policy backing from numerous governments worldwide, while the automotive industry has significantly ramped up its commitment to research and development efforts, concurrently initiating processes aimed at establishing standardized protocols for autonomous vehicles. Autonomous driving is highly complex, encompassing a vast array of interdisciplinary fields and technological nuances. This work summarizes the most recent and cutting-edge findings in autonomous driving trajectory planning, including traditional algorithmic approaches, reinforcement learning methodologies, and imitation learning methodologies, to systematically collate, dissect, and synthesize the relevant academic literature, providing a comprehensive overview of the current state and future directions of this field. First, this paper provides an in-depth examination of the two most widely recognized frameworks for autonomous driving: the modular framework and the end-to-end framework. Additionally, it acknowledges the existence and relevance of other, less commonly utilized but no less classical frameworks, offering a nuanced analysis of their respective strengths and weaknesses. This analysis focuses on the practical implementation of trajectory planning within these frameworks, with a particular emphasis on identifying potential areas for future enhancement and optimization within each framework’s architecture. Afterward, the paper presents a detailed survey of traditional planning algorithms that are currently employed in autonomous driving trajectory planning. The survey is meticulously organized, categorizing and summarizing the state-of-the-art traditional algorithms at the forefront of trajectory planning research. The algorithms under consideration include graph-based search methodologies, rule-based local planning approaches, intelligent optimization techniques, static and dynamic sampling methods, artificial potential field-based algorithms, and various model control strategies, among others. The paper provides a thorough synthesis of the application domains, advantages, and limitations of these algorithms while also highlighting the innovative optimization strategies and emerging trends that have been proposed by the academic community. Building upon this foundation, the paper introduces the concept of hybrid traditional algorithms, which combine local planning algorithms with global planning algorithms or integrate multiple local and global algorithms to create new and potentially more effective trajectory planning methods. The paper engages in speculative analysis regarding the future trends and directions of these hybrid traditional algorithms and discusses how such algorithms can address the deficiencies of traditional approaches. Moreover, the paper scrutinizes the impact and effectiveness of these hybrid algorithms on the path planning capabilities of autonomous driving systems. The paper concludes this section by delineating the problems and limitations that are associated with hybrid traditional trajectory algorithms. Evidently, individual traditional trajectory planning algorithms and their hybrid counterparts have numerous issues and shortcomings that need to be considered carefully. 
In response to these challenges, the paper focuses on the integration of novel reinforcement learning algorithms as a means to overcome the limitations of traditional trajectory planning methods. These reinforcement learning algorithms, which include value-based, policy-based, and actor-critic-based approaches, are classified, explored, and summarized. This paper discusses the operational principles, application domains, advantages, and disadvantages of each reinforcement learning algorithm in detail, supported by the research findings and scholarly contributions in the field. The paper’s focus then shifts to the critical aspects of reinforcement learning in the context of autonomous driving, such as efficiency, safety, and function optimization. It addresses the difficulties encountered by existing reinforcement learning techniques and presents innovative solutions proposed by scholars to tackle core issues such as the security and sample efficiency of reinforcement learning algorithms. These solutions include the establishment of virtual simulation driving platforms, the application of imitation learning algorithms, lifelong learning techniques, hierarchical learning methods, and multi-agent reinforcement learning algorithms. Also, this paper introduces a reinforcement learning approach based on the world model and summarizes the latest progress, along with the latest research progress of VLM and VLA and the role of reinforcement learning. This paper also provides a comprehensive evaluation of popular virtual simulation platforms, assessing their suitability for research in autonomous driving on the basis of 12 metrics: scenario building, vehicle dynamics, sensor models, traffic participants, weather and time simulation, data logging, API interfaces, realism and performance, scalability, community support, gym interface compatibility, and support for multi-agent reinforcement learning. The paper synthesizes the research applications and distinctive features of these platforms, offering a valuable resource for researchers selecting appropriate simulation environments for their work. Furthermore, the paper studies the application of imitation learning techniques in autonomous driving, including inverse reinforcement learning, behavioral cloning, and generative imitation learning, along with optimization strategies proposed by the academic community. It explores how these techniques can refine reinforcement learning policy networks, enhancing the sample efficiency and optimality of the learning strategies. This paper also introduces the VLA method based on the diffusion model, which is currently popular. On the basis of traditional imitation learning, more novel directions such as online imitation learning, self-imitation learning, state-only imitation learning, multi-agent imitation learning, generative imitation learning, and meta-imitation learning are also expounded. In the final sections of the paper, the authors critically analyze the current limitations and future development trends of reinforcement learning algorithms and imitation learning algorithms in autonomous driving. Finally, the future research directions and challenges are elaborated, including the VLA framework based on pre-trained VLM, the diffusion model combined with reinforcement learning, and the reinforcement learning based on the world model. 
In this regard, the paper analyzes the underlying causes and identifies the main pain points of these three research directions: sim-to-real transfer for world models, data requirements for VLA architectures, and the computing power bottleneck of diffusion models. Related work is reviewed to provide ideas and directions for resolving these pain points. In conclusion, this paper offers a comprehensive overview of trajectory planning algorithms in autonomous driving, synthesizing the development trends and challenges of traditional algorithms, reinforcement learning methods, and imitation learning algorithms. It aims to provide future researchers with a solid foundation for understanding the current state of the art, as well as a roadmap for future research directions in the field of autonomous driving technology. This paper also emphasizes the need for interdisciplinary collaboration and the integration of diverse methodologies to advance the state of autonomous driving systems and ensure their safe, efficient, and widespread adoption in the transportation sector.
      Keywords: autonomous driving; trajectory planning; traditional planning algorithms; reinforcement learning (RL); imitation learning (IL); vision-language-action (VLA) model; diffusion model (DM); world model
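As one concrete example from the traditional algorithm families surveyed above, the following is a minimal NumPy sketch of an artificial potential field step, combining an attractive force toward the goal with repulsive forces from nearby obstacles; the gain values, influence radius, and step size are illustrative assumptions rather than parameters from any cited planner.

```python
# Minimal sketch of the artificial potential field idea listed among the
# traditional planners: the vehicle is pulled toward the goal and pushed away
# from obstacles inside an influence radius. Gains and radius are assumptions.
import numpy as np


def potential_field_step(pos, goal, obstacles, k_att=1.0, k_rep=0.5,
                         influence=2.0, step_size=0.1):
    """Return the next 2D position after one gradient-descent step."""
    force = k_att * (goal - pos)                      # attractive force toward the goal
    for obs in obstacles:
        diff = pos - obs
        dist = np.linalg.norm(diff)
        if 1e-6 < dist < influence:
            # Repulsive force grows rapidly as the obstacle gets closer.
            force += k_rep * (1.0 / dist - 1.0 / influence) * diff / dist**3
    direction = force / (np.linalg.norm(force) + 1e-9)
    return pos + step_size * direction


if __name__ == "__main__":
    pos = np.array([0.0, 0.0])
    goal = np.array([10.0, 10.0])
    obstacles = [np.array([5.0, 5.0])]
    for _ in range(200):
        pos = potential_field_step(pos, goal, obstacles)
    print("final position:", pos)
```

The well-known local-minimum problem of such fields is one reason the survey highlights hybrid combinations of local and global planners.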

      Image Processing and Coding

    • Dynamically distribution-aware quantization for diffusion models AI Digest

      Diffusion models have broad application prospects in image generation and related fields, but existing models suffer from large storage overhead and long inference latency. The authors design a post-training quantization method that dynamically senses the distribution range; through quantization parameter selection, fine-tuning, and error suppression during calibration, it significantly improves the performance of quantized diffusion models at low activation bit-widths, offering an effective solution to the performance gap of quantized models.
      Zhan Ruiyi, Fan Yi, Zhou Lina, Xie Yubao, Chen Jiaxin, Yang Hongyu, Huang Di, Wang Yunhong
      Vol. 31, Issue 3, Pages: 745-754(2026) DOI: 10.11834/jig.250319
      Dynamically distribution-aware quantization for diffusion models
      摘要:Objective: In recent years, the diffusion model has become a promising alternative to conventional generative models, including generative adversarial networks and variational autoencoders, owing to the high quality and diversity of its generated images, as well as its stable training process. It has a wide range of applications such as super-resolution, graph generation, and image restoration. Generally, the generation process of diffusion models involves gradually adding Gaussian noise to image data and then iteratively removing the noise step by step through a noise estimation network. As this process typically takes hundreds or even thousands of steps to find sampling trajectories for denoising, the diffusion model usually requires tremendous storage overhead and inference time cost. The high computational complexity of diffusion models is mainly attributed to the following two issues. First, generating a single image requires hundreds or even thousands of denoising steps, which involve repeatedly executing the estimation network. Second, the estimation network alone involves significant computational cost due to its high network complexity. Although many approaches have been proposed to reduce the number of estimation steps, balancing the number of steps with the quality of generated images remains an unsolved problem. In this paper, we address the second issue, i.e., accelerating the U-Net-based noise estimation network. Quantization is a promising solution that converts floating-point weights and activations to low-bit-width integers. Typically, quantizing a full-precision model to 8-bit can theoretically accelerate the inference process by about 2.2 times, while further reducing to 4-bit achieves an additional 59% improvement. However, directly applying quantization methods designed for general-purpose models to diffusion models often yields poor performance, as diffusion models use the same network to denoise inputs with different distributions at various timesteps, a property neglected by most existing approaches. Some works incorporate multi-timestep calibration into the quantization process or focus on the temporal characteristics within the estimation network to mitigate the impact of multi-timestep distributions. Despite these advancements, a significant performance drop persists when models are quantized to bit-widths lower than 8-bit using post-training techniques. To overcome the above drawbacks, we analyze quantization errors within the estimation network across different timesteps. Our analysis reveals that the reconstruction granularities employed in quantization are often inappropriate for diffusion models, leading to pronounced discrepancies among quantized modules. Moreover, we identify substantial quantization errors in specific modules characterized by a wide range of activation distributions across timesteps, leading to diminished performance when quantizing to lower bit-widths. Method: In this study, we propose a novel method called temporal distribution-aware quantization for diffusion models to deal with the uneven distribution of internal quantization errors within the network and the accumulation of external quantization errors over multiple sampling timesteps. We evaluate the significance of different quantized modules by analyzing relative quantization errors and select quantization parameters for fine-tuning on the basis of the significance score. Meanwhile, the fine-tuning strength is dynamically set according to the range of inputs.
Furthermore, to reduce the accumulation of quantization errors across multiple sampling timesteps in diffusion models, we calculate the mean squared error between the output of the full-precision model and that of the quantized model at each sampling timestep. Result: Our method outperforms the state-of-the-art compared approaches, achieving an improvement of 0.34 (9.40 vs. 9.06) in inception score (IS) and a reduction of 1.96 (4.61 vs. 6.57) in Fréchet inception distance (FID) on CIFAR-10. The LDM quantized by our method outperforms TFMQ-DM with an FID reduction of 0.34 on the LSUN-Church benchmark. In the meantime, the computational cost of our method remains consistent with that of baseline methods. The results demonstrate improved performance across various datasets and distinct models. Additionally, our method can be incorporated into existing quantization methods as a plug-and-play module. Conclusion: In this study, we propose a novel temporal distribution-aware quantization method for diffusion models, which reduces the accumulation of quantization errors arising from the substantial distribution variations across distinct layers and blocks at different timesteps. Our method is a plug-and-play method that is applicable to existing quantization approaches. Extensive experiments on various benchmarks clearly demonstrate the effectiveness of our method.
      Keywords: image generation; diffusion model (DM); model compression; inference acceleration; model quantization; post-training quantization
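To illustrate the basic operation that post-training quantization builds on, converting floating-point weights or activations to low-bit integers with a scale and zero point, here is a minimal NumPy sketch; the asymmetric min-max calibration shown is a generic baseline, not the temporal distribution-aware scheme proposed in the paper.

```python
# Minimal sketch of uniform asymmetric quantization with min-max calibration,
# the basic operation that post-training quantization methods build on. This is
# a generic baseline, not the distribution-aware method of the paper.
import numpy as np


def calibrate(x: np.ndarray, n_bits: int = 8):
    """Derive scale and zero point from the observed value range."""
    qmin, qmax = 0, 2 ** n_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, n_bits=8):
    q = np.round(x / scale + zero_point)
    return np.clip(q, 0, 2 ** n_bits - 1).astype(np.uint8)


def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale


if __name__ == "__main__":
    activations = np.random.randn(1024).astype(np.float32)
    for bits in (8, 4):
        s, z = calibrate(activations, bits)
        err = np.abs(dequantize(quantize(activations, s, z, bits), s, z)
                     - activations).mean()
        print(f"{bits}-bit mean abs quantization error: {err:.4f}")
```

Running the sketch shows the error growing sharply from 8-bit to 4-bit, which is the regime where the abstract reports that generic calibration breaks down for diffusion models.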
    • Backdoor defense mechanism for computer vision models based on feature blocking AI Digest

      This article reports progress in the security of computer vision models: the authors propose a lightweight backdoor defense mechanism based on feature blocking, offering an efficient solution to the computational resource consumption, model parameter damage, and limited deployment flexibility faced by existing model defense schemes.
      Tong Songsong, Yang Kuiwu, Wang Wen, Wei Jianghong, He Haofeng
      Vol. 31, Issue 3, Pages: 755-768(2026) DOI: 10.11834/jig.250260
      Backdoor defense mechanism for computer vision models based on feature blocking
      摘要:Objective: Backdoor attacks pose a critical threat to computer vision systems by establishing strong associations between trigger patterns and target labels, enabling adversaries to manipulate model decisions during inference. Current defense strategies predominantly rely on computationally intensive full-model fine-tuning or architectural modifications, which introduce significant drawbacks, including prohibitive computational overhead, irreversible model alterations, and inflexible deployment frameworks. These limitations hinder practical applications in real-world scenarios, particularly for latency-sensitive systems or models that require temporary defense mechanisms. To address these challenges, this study introduces a new lightweight defense mechanism for image classification models, which integrates a modular feature blocking component into the model to block the normal transmission of backdoor features during inference. The proposed method eliminates the need for prior knowledge of attack patterns and avoids structural modifications to the host model, instead leveraging a plug-and-play module to simultaneously block backdoor features and allow unimpeded transmission of benign features while maintaining baseline model performance. Method: This work centers on the cascaded feature blocking module (CFBM), a compact neural component that integrates a multilayer architecture optimized through targeted fine-tuning. The CFBM combines four synergistic layers: 1) cross-channel spatial filters (mainly composed of 1 × 1 convolutional layers), which are used for cross-channel interaction to destroy spatially triggered artifacts; 2) an instance statistical calibrator (mainly composed of an instance normalization layer), used to stabilize the feature distribution and prevent statistical offsets caused by backdoors; 3) a dynamic channel suppressor (mainly composed of a channel attention module), which generates channel weight vectors through a squeeze-and-excitation mechanism to suppress the abnormal activation of contaminated channels; and 4) random feature masks (mainly composed of dropout layers), which enhance generalization by probabilistically discarding low-confidence activation patterns. A lightweight fine-tuning strategy is implemented to ensure minimal computational overhead while restoring the normal transfer of benign features through the module. In this strategy, the semantic integrity of the pretrained features is maintained by fixing all the original model weights, while model-specific optimization refines the CFBM within five to eight epochs by using only 5% of the clean training data (for example, 3 000 CIFAR-10 images) through a loss function that inhibits backdoor features and minimizes classification interference. Seamless integration with existing models is achieved through PyTorch's hook mechanism, enabling zero-intrusion embedding via register_forward_hook without altering the original architecture, real-time activation/deactivation (within 5 ms via hook.remove), and lossless removal that fully restores baseline model accuracy and inference speed post-defense. Result: Comprehensive evaluations across three benchmark datasets (MNIST, CIFAR-10, and MINI-ImageNet) demonstrate robustness against diverse attack paradigms, including BadNets, Blended, WaNet, BppAttack, and WaveAttack.
Experimental results reveal an average 90.0% reduction in attack success rate (ASR), exemplified by BadNets on CIFAR-10 dropping from 97.2% to 10.9%, while limiting clean-sample accuracy degradation to less than 3%. Compared with conventional defenses, the CFBM requires less than 1% of the host model's parameters (e.g., 0.23 M vs. ResNet-18's 11.7 M) and reduces training costs by 94% through efficient fine-tuning on 5% of the clean data. The module's dynamic deployment capability ensures operational flexibility, supporting instantaneous activation/deactivation during inference with lossless performance recovery post-removal. Cross-architecture validation on ResNet-18 and VGG-11 confirms universal applicability, achieving ASR reductions of 90.0% and 88.9%, respectively, without architecture-specific adjustments. Heatmap comparisons visually demonstrate the model's restored attention to benign image features after CFBM integration, confirming effective backdoor feature suppression. Ablation studies further validate the indispensability of each CFBM layer: removing dropout increases ASR by 23.5%, and omitting instance normalization expands the accuracy loss to 7.8%. All experiments were conducted on an NVIDIA RTX A6000 GPU under Linux Ubuntu using Python 3.8.19 and PyTorch 1.8.0 + cu111. Attack implementations and baseline comparisons utilized the standardized BackdoorBox toolkit to ensure reproducibility. Dataset configurations followed established protocols: MNIST (60 000/10 000 grayscale images at 32 × 32 resolution), CIFAR-10 (50 000/10 000 RGB images at 3 × 32 × 32 resolution), and MINI-ImageNet (48 000/12 000 images across 100 classes), with a 5% poisoning ratio simulating realistic attack scenarios. Conclusion: Through three pioneering contributions, this study establishes a paradigm shift in backdoor defense: 1) the first runtime-activatable and reversible defense solution, 2) a non-destructive framework compatible with cross-architecture pretrained models, and 3) empirical validation of cross-domain robustness against evolving attack vectors. CFBM's plug-and-play operation and minimal computational footprint (1% of host resources) address critical barriers to practical deployment, particularly in resource-constrained environments such as edge devices or latency-sensitive applications. Through its lightweight modular design and fine-tuning mechanism, our method effectively resolves the computational cost and flexibility bottlenecks of traditional model defense methods. Its plug-and-play operation and lossless removal provide an efficient solution for the secure deployment of models in real-world scenarios.
      Keywords: model security; image classification; backdoor defense; feature blocking; lightweight; dynamic toggling
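The plug-and-play integration described in this abstract relies on PyTorch forward hooks; the sketch below shows how a defense module can be attached to an intermediate layer via register_forward_hook and detached again with hook.remove(). The toy FeatureFilter module here is a placeholder assumption, not the paper's CFBM.

```python
# Sketch of the hook-based plug-and-play deployment described above: a small
# module is attached to an intermediate layer with register_forward_hook so its
# output is filtered during inference, and detached with hook.remove(). The
# FeatureFilter below is a toy placeholder, not the paper's CFBM.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class FeatureFilter(nn.Module):
    """Toy stand-in for a feature-blocking module (1x1 conv + dropout)."""
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)
        self.drop = nn.Dropout2d(p=0.1)

    def forward(self, x):
        return self.drop(self.mix(x))


model = resnet18(num_classes=10).eval()
filt = FeatureFilter(channels=256).eval()  # layer3 of resnet18 outputs 256 channels


def hook_fn(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output.
    return filt(output)


handle = model.layer3.register_forward_hook(hook_fn)  # activate the defense
with torch.no_grad():
    defended_logits = model(torch.randn(1, 3, 32, 32))

handle.remove()                                        # deactivate; model restored
with torch.no_grad():
    original_logits = model(torch.randn(1, 3, 32, 32))
print(defended_logits.shape, original_logits.shape)
```

Because the hook leaves the host network's weights untouched, removing it restores the original behavior exactly, which mirrors the lossless-removal property claimed in the abstract.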

      Image Analysis and Recognition

    • Micro-expression recognition method based on multi-optical flow fusion and KAN AI Digest

      Micro-expression recognition sees a new breakthrough: the authors propose a recognition method that fuses multiple optical flows with KAN, effectively handling problems such as illumination changes, markedly improving recognition performance, and opening a new direction for research in this field.
      Chang Heyou, Yang Jiazheng, Gao Guangwei, Zhang Jian, Zheng Hao
      Vol. 31, Issue 3, Pages: 769-782(2026) DOI: 10.11834/jig.240572
      Micro-expression recognition method based on multi-optical flow fusion and KAN
      摘要:Objective: Micro-expressions are rapid, involuntary facial muscle movements that occur when individuals attempt to suppress or conceal their genuine emotions. These fleeting facial cues typically last less than half a second and are often imperceptible to the naked eye. Despite their subtlety, micro-expressions convey significant emotional information and have shown considerable potential in various real-world applications, including psychological evaluation, medical diagnosis, law enforcement, and deception detection. However, accurately recognizing micro-expressions remains a highly challenging task due to their low intensity, brief duration, and sensitivity to environmental conditions such as lighting variations and inconsistent facial expression strength. Most existing micro-expression recognition methods rely on a single optical flow algorithm to capture motion information. This dependence often leads to suboptimal performance in real-world scenarios because individual flow models struggle to fully capture the subtle and diverse motion patterns of micro-expressions. Aiming to overcome these limitations, this paper proposes a novel approach called the multiple optical flow feature fusion network (MOFFFN). This method leverages the advantages of multiple optical flow types combined with advanced attention mechanisms to improve the recognition performance of micro-expressions. Method: A novel micro-expression recognition framework based on MOFFFN, specifically designed to capture the subtle spatiotemporal variations inherent in micro-expressions, is proposed. This framework comprises three key components: the optical flow fusion module (OFFM), the mobile residual KAN CBAM block net (MRKCBN), and the attention pooling self-attention block (APSB). First, the OFFM integrates multiple types of optical flow features, namely total variation L1 (TVL1), dense inverse search (DIS), and principal component analysis (PCA) optical flow, each capturing different motion characteristics across pixels. To retain directional cues, the horizontal and vertical components are extracted from each optical flow type. These components are then fused to construct three composite flow images: global fusion, horizontal fusion, and vertical fusion. This multi-flow fusion strategy enables the model to capture comprehensive motion information and enhances robustness to variations in lighting and expression intensity, thereby improving recognition performance. Second, to extract discriminative features from the fused flow images, a novel MRKCBN architecture that integrates the Kolmogorov-Arnold network (KAN) into a lightweight residual backbone is introduced. KAN has recently attracted attention for its strong generalization capabilities and interpretability, particularly in scientific computing and image analysis. Within MRKCBN, KAN modules are embedded in the channel and spatial attention submodules of the classic convolutional block attention module (CBAM), where the original MLP and convolution layers are replaced by KANLinear and KAN_Conv2d, respectively. These KAN-based components enhance adaptability by leveraging B-spline approximations, group convolutions, and dropout, thereby improving the network's ability to model fine-grained spatiotemporal variations in facial micro-expressions. Finally, the APSB module is designed to fuse the multisource features extracted from the TVL1, DIS, and PCA flows.
Unlike conventional self-attention mechanisms that operate on a single feature sequence, the APSB module learns an effective fusion strategy across diverse optical flow representations. This fusion process runs in parallel across multiple attention heads, capturing feature relationships at different dimensions and semantic levels. By integrating an attention pooling layer, the APSB module dynamically learns the importance weights of different facial regions, emphasizing key regions to enhance recognition performance and robustness. Collectively, these modules synergistically contribute to state-of-the-art performance in micro-expression recognition, particularly in challenging scenarios characterized by subtle motions and limited data availability. Result: To evaluate the effectiveness and generalizability of MOFFFN, experiments are conducted on three publicly available benchmark datasets, CASME II, SAMM, and SMIC-HS, as well as on a composite dataset (CD) that combines all three. The widely used leave-one-subject-out cross-validation protocol is adopted to ensure robust, subject-independent evaluation. The proposed method achieves unweighted average recall scores of 91.79%, 85.69%, 86.56%, and 85.03% on CASME II, SAMM, SMIC-HS, and CD, respectively. Furthermore, the corresponding unweighted F1-scores (UF1) are 92.95%, 89.10%, 91.78%, and 87.63%. These results consistently outperform existing state-of-the-art methods, demonstrating the superior performance of the MOFFFN framework across multiple datasets with different characteristics and challenges. In addition to the quantitative results, qualitative visualizations of attention maps further confirm that the proposed model effectively highlights the most discriminative facial regions and captures the most relevant motion patterns. Conclusion: This paper introduces a novel micro-expression recognition method that fuses multiple optical flow features while leveraging the representational power of KAN and advanced attention mechanisms to improve recognition performance. By integrating diverse motion representations, learning discriminative features through KAN, and employing attention-guided fusion, the method effectively addresses the challenges posed by the subtle and complex nature of micro-expressions. The incorporation of KAN within the attention modules, replacing traditional MLP and convolutional layers, notably improves the network's capability to model subtle spatiotemporal facial dynamics, enabling the extraction of discriminative and interpretable features from complex micro-expression patterns. In addition, the APSB facilitates cross-stream feature fusion, learning to emphasize critical facial regions across different motion representations. These innovations jointly yield substantial performance gains on benchmark datasets, particularly under the challenging conditions of subtle motion and limited data. The proposed framework demonstrates strong generalization and robustness, making it a promising approach for real-world micro-expression recognition applications. The code is publicly available at https://github.com/useless12138/mofffn.
      Keywords: micro-expression recognition; optical flow; feature fusion; Kolmogorov-Arnold network (KAN); self-attention mechanism
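As a sketch of the multi-flow input preparation described in this abstract, the snippet below computes TVL1 and DIS optical flow between an onset and an apex frame with OpenCV and splits each flow into horizontal and vertical components before fusion. It assumes opencv-contrib-python for the TVL1 implementation and omits the PCA flow; it is not the OFFM itself.

```python
# Sketch of the multi-flow input preparation described above: TVL1 and DIS
# optical flow are computed between an onset and an apex frame, and each flow
# is split into horizontal and vertical components before fusion. Assumes
# opencv-contrib-python for TVL1; the PCA flow branch is omitted here.
import cv2
import numpy as np


def flow_components(onset_gray: np.ndarray, apex_gray: np.ndarray):
    """Return a dict of (horizontal, vertical) components per flow type."""
    tvl1 = cv2.optflow.createOptFlow_DualTVL1()
    dis = cv2.DISOpticalFlow_create(cv2.DISOPTICAL_FLOW_PRESET_MEDIUM)
    flows = {
        "tvl1": tvl1.calc(onset_gray, apex_gray, None),  # (H, W, 2) float32
        "dis": dis.calc(onset_gray, apex_gray, None),
    }
    return {name: (f[..., 0], f[..., 1]) for name, f in flows.items()}


if __name__ == "__main__":
    onset = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
    apex = np.roll(onset, shift=2, axis=1)  # synthetic horizontal motion
    comps = flow_components(onset, apex)
    for name, (u, v) in comps.items():
        print(name, "mean |u|:", np.abs(u).mean(), "mean |v|:", np.abs(v).mean())
```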
    • Three-dimensional surface defect detection of insulators by integrating geometric optimization and interactive annotation AI Digest

      Surface defect detection for power insulators sees a new breakthrough: the authors develop an automated detection method based on 3D geometric information that uses structured-light scanning and optimization algorithms to localize defects precisely and analyze them quantitatively, providing strong technical support for assessing the safety and reliability of power equipment.
      Li Runqiao, Wang Pengfei, Zuo Wei, Zhu Linhai, Chen Shuangmin, Xin Shiqing, Tu Changhe
      Vol. 31, Issue 3, Pages: 783-796(2026) DOI: 10.11834/jig.250355
      Three-dimensional surface defect detection of insulators by integrating geometric optimization and interactive annotation
      摘要:Objective: Power insulators are critical elements in high-voltage transmission and distribution networks, whose structural integrity directly influences operational safety, reliability, and service life. Even seemingly minor surface defects, such as cracks, pits, bubbles, or chips, can significantly reduce dielectric strength, increase the likelihood of partial discharge or flashover, and, in severe cases, trigger cascading equipment failures or widespread outages. As modern power systems operate under increasing demands for capacity and stability, the timely, precise identification of such defects has become essential for preventive maintenance and quality assurance. However, existing inspection practices remain fundamentally limited: manual visual inspection is labor-intensive, subjective, and prone to oversight, while conventional two-dimensional image-based analysis suffers from background clutter, occlusion, and insufficient geometric detail to capture the true three-dimensional nature of surface anomalies. These shortcomings make accurate localization and quantitative measurement of defects on curved, complex surfaces, such as porcelain and composite insulators, particularly challenging. Even when integrated with advanced deep learning techniques, image-based approaches require extensive, carefully annotated datasets and still fail to capture subtle geometric anomalies with high reliability. This study addresses these fundamental limitations by developing an automated three-dimensional defect detection framework that exploits the intrinsic rotational symmetry of insulators as a structural prior, enabling precise defect localization and accurate area quantification while maintaining computational efficiency suitable for industrial deployment. Method: We propose a comprehensive three-dimensional (3D) defect detection methodology that integrates computational geometry, surface parameterization, interactive annotation, and constrained optimization into a coherent pipeline. The approach fundamentally departs from conventional texture-based methods by working directly on high-resolution 3D geometry, thus achieving inherent robustness against lighting conditions, background interference, and surface material variations. The methodology unfolds through four interconnected stages. First, high-fidelity 3D models are acquired via structured-light scanning along controlled trajectories to ensure complete coverage and minimize occlusions. The acquired geometry undergoes preprocessing where the mesh is radially split into two halves along a plane containing the rotation axis, followed by UV parameterization using mean value coordinates. This parameterization maps the curved surface onto standardized 2D rectangular domains while preserving local geometry and minimizing distortion, facilitating intuitive visualization and direct annotation. Normal vectors at each vertex are then encoded into RGB color space to generate clear normal maps, where surface irregularities and defects typically exhibit distinctive patterns that enable rapid identification by inspectors. Second, the interactive annotation phase allows users to enclose defect regions by using axis-aligned rectangular selections in the UV-mapped normal images, striking a balance between annotation speed and spatial localization precision. This design also enables coarse annotation to be performed by minimally trained personnel without sacrificing downstream detection accuracy.
The annotated 2D regions are subsequently projected back to 3D space to identify corresponding vertices and faces in the original mesh. Third, an optimization framework guided by rotational symmetry principles is applied to refine the detection. The geometry of a perfect surface of revolution is mathematically formulated via a mixed product condition: for any surface point P with normal vector N, rotation axis direction A, and vector V connecting P to any point on the axis, the coplanarity constraint (N × A) · V = 0 must hold. This condition is satisfied for all points on an ideal symmetric surface but deviates systematically in defect regions. A combined energy function incorporating Laplacian regularization terms for surface smoothness and rotational symmetry constraints is constructed and minimized using the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm, chosen for its scalability and effectiveness in large-scale mesh processing. The optimization process reconstructs an idealized, defect-free model that maintains rotational symmetry while preserving boundary continuity through fixed boundary vertices. Fourth, defect identification is performed through systematic geometric comparison between the original and optimized models, with triangles marked as defective when all three vertex distances exceed a threshold proportional to the model's bounding box diagonal (0.35% in our experiments). This spatial consistency criterion enables detection of diverse anomalies, including concavities, protrusions, and surface roughness, without making assumptions about defect shape or orientation. Result: The proposed method was comprehensively validated on a dataset of 27 high-resolution 3D scans encompassing porcelain and composite insulators with diverse real-world defects such as cracks, bubbles, pits, surface wear, and impact damage. Quantitative evaluation demonstrates exceptional accuracy in defect area measurements, achieving an average relative error of less than 0.2‰ compared with ground truth obtained through meticulous manual face-by-face annotation. The method performs consistently across different defect categories: for defects with clear boundaries such as chips and major cracks, area calculation errors remain minimal, while for defects with ambiguous boundaries such as minor surface wear, errors increase slightly but remain within acceptable engineering tolerances. Computational efficiency analysis reveals that the complete processing pipeline executes in approximately 19.62 s for typical insulator models containing around 250 000 vertices, with UV parameterization and normal mapping taking 2.45 s, and defect region optimization requiring 17.17 s on average. This processing speed demonstrates that the proposed method is feasible for production-line quality control and field inspection scenarios. The UV parameterization consistently produces low-distortion mappings that accurately preserve correspondence between 2D annotations and 3D geometry across diverse insulator geometries. Most importantly, symmetry-based optimization significantly improves boundary precision relative to raw manual selections, enabling detection of subtle geometric anomalies that would be systematically missed by conventional approaches.
Qualitative evaluation confirms robust performance across different defect types, sizes, and spatial distributions, with the method successfully handling complex scenarios such as multiple overlapping defects and defects near geometric discontinuities. Conclusion: This research presents a robust, efficient, and accurate 3D defect detection methodology for electrical insulators that successfully unites computational geometry, interactive visualization, and physics-inspired optimization principles. The approach fundamentally addresses key limitations of existing methods by leveraging high-fidelity 3D geometry rather than texture or intensity data, achieving superior robustness against environmental variations, including lighting conditions, background clutter, and surface material heterogeneity. The framework demonstrates exceptional performance in defect localization accuracy and area quantification precision while maintaining practical computational requirements suitable for real-world industrial inspection environments. By combining human-guided coarse annotation with automated, physics-based refinement, the method minimizes the need for exhaustive manual labeling while retaining expert oversight for ambiguous cases. Its reliance on geometric principles rather than purely statistical learning significantly reduces dependency on large training datasets, making it particularly effective in low-data manufacturing inspection regimes where annotated datasets are scarce or prohibitively expensive to obtain. Beyond its immediate application to insulator quality assessment, the proposed framework offers a generalizable paradigm for inspecting rotationally symmetric objects across diverse manufacturing domains, including precision-machined components, turbine blades, pipe joints, ceramic products, and aerospace engine parts. The modular design facilitates integration into existing inspection workflows with minimal system reconfiguration, while the emphasis on geometric consistency provides a principled foundation for extending the approach to objects with more complex symmetries or manufacturing tolerances. Future research directions include automating the initial defect localization phase through deep learning-based analysis of normal maps, extending the framework to quantify defect depth and volume in addition to area, and adapting the method for real-time inspection scenarios in high-throughput manufacturing environments.
      Keywords: electrical insulator; defect detection; normal map; rotational symmetry; intrinsic rotational symmetry
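The rotational symmetry condition (N × A) · V = 0 quoted in this abstract can be evaluated per vertex as a simple residual; the NumPy sketch below does exactly that for a toy cylinder with one dented vertex. The toy geometry, axis, and threshold are illustrative assumptions, not the paper's full optimization pipeline.

```python
# Sketch of the mixed product condition quoted above: for a surface of
# revolution, (N x A) . V = 0 at every point, so large residuals flag candidate
# defect vertices. The toy cylinder, axis, and threshold are assumptions.
import numpy as np

# Toy cylinder of radius 1 around the z-axis.
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
z = np.linspace(0, 1, 20)
T, Z = np.meshgrid(theta, z)
points = np.stack([np.cos(T).ravel(), np.sin(T).ravel(), Z.ravel()], axis=1)
normals = np.stack([np.cos(T).ravel(), np.sin(T).ravel(),
                    np.zeros(T.size)], axis=1)        # outward radial normals

# Dent one vertex: push it inward and tilt its normal away from radial.
points[0] *= np.array([0.8, 0.8, 1.0])
normals[0] = np.array([0.5, 0.5, 0.7]) / np.linalg.norm([0.5, 0.5, 0.7])

axis_dir = np.array([0.0, 0.0, 1.0])     # rotation axis direction A
axis_point = np.array([0.0, 0.0, 0.0])   # any point on the axis

V = points - axis_point                   # vector from the axis point to each surface point
residual = np.abs(np.einsum("ij,ij->i", np.cross(normals, axis_dir), V))

threshold = 0.05                          # assumed tolerance
defect_idx = np.nonzero(residual > threshold)[0]
print("max residual:", residual.max(), "flagged vertices:", defect_idx)
```

For the undamaged vertices the residual is zero up to numerical error, while the perturbed vertex is flagged, which is the same signal the paper's symmetry-constrained optimization exploits at mesh scale.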

      Image Understanding and Computer Vision

    • Selective attention-based network for infrared small target detection AI Digest

      Infrared small target detection sees a new breakthrough: the authors construct SANet, a detection network based on selective attention, offering an innovative solution to the challenges of improving detection accuracy and reducing the false alarm rate.
      Zhang Yingmei, Bao Wangtao, Xiao Qin, Yang Yong, Wan Weiguo, Luo Yitao, Zou Xueting, Zhang Lei
      Vol. 31, Issue 3, Pages: 797-810(2026) DOI: 10.11834/jig.250313
Selective attention-based network for infrared small target detection
      摘要:Objective: Infrared small target detection (IRSTD) is critical for numerous real-world applications, including maritime surveillance, military search and rescue, early warning systems, and precision-guided strikes. These systems depend on the precise identification of dim, small targets within cluttered infrared backgrounds. Compared with conventional targets in natural scene images, infrared small targets present several significant challenges: 1) extremely limited spatial resolution, often covering only a few pixels; 2) a low signal-to-clutter ratio and weak contrast, particularly in dynamic or noisy backgrounds such as sea surfaces and clouds; and 3) structurally complex and diverse backgrounds that frequently induce false alarms. These factors may cause the target to be obscured by the complex background, leading to false positives or missed detections, thereby making IRSTD a persistently difficult task. Existing IRSTD methods are mainly divided into two categories: traditional methods and deep learning-based methods. Traditional methods are generally classified into three subcategories: filtering-based methods, low-rank representation-based methods, and local contrast-based methods. Although these methods yield satisfactory results in specific scenarios, they largely depend on hand-crafted feature extraction and manually tuned parameters, limiting their ability to capture the diversity and variability of targets in complex scenes. Consequently, their detection performance and robustness are often inadequate in practical applications. In recent years, deep learning-based methods have gained widespread adoption in IRSTD owing to their powerful feature learning capabilities and have contributed to notable advancements in the field. However, an information bottleneck remains in target feature extraction. Moreover, the static skip connection structures commonly used in mainstream deep neural networks struggle to adaptively distinguish real targets from pseudo-target regions in infrared images, further constraining model performance in complex backgrounds. To address these limitations, this paper proposes a novel selective attention-based network (SANet). Method: The proposed network builds upon the classical U-Net encoder-decoder framework and incorporates two key modules: 1) a dual-path semantic-aware module (DSM) and 2) a selective attention fusion module (SAFM). These enhancements are intended to mitigate the information bottleneck inherent in previous U-Net-based approaches and to address issues such as semantic misalignment and background noise interference caused by static skip connections. The DSM improves the model's sensitivity to dim targets by enhancing its ability to preserve fine-grained spatial details while modeling contextual semantics. Specifically, this module integrates standard convolution (to extract local fine-scale features) with pinwheel-shaped convolution, a dilation-inspired structure that expands the receptive field in multiple directions without a significant computational burden. This combination enables the network to jointly capture local spatial continuity and global contextual information, which is essential for suppressing background clutter and enhancing target saliency. A classical spatial and channel attention mechanism is introduced following the DSM to further improve feature representation.
This mechanism performs fine-grained recalibration by adaptively weighting spatial locations and feature channels, enabling the network to focus more effectively on regions with high target probability while suppressing irrelevant background signals. Additionally, SANet addresses a major limitation of the original U-Net architecture: the static and uniform nature of skip connections. Traditional skip connections uniformly transfer encoder features to the decoder without considering spatial-semantic correlations, often introducing background noise and degrading detection performance. The SAFM is proposed to overcome this issue. This module introduces a globally learnable scalar parameter to perform spatially aware, weighted fusion of multiscale features, facilitating more discriminative information transfer. Through this mechanism, the network adaptively filters the most relevant and representative features, thereby enhancing its response to real targets while reducing false alarms.ResultExtensive experiments were conducted on three widely used IRSTD datasets——NUAA-SIRST, IRSTD-1K, and NUDT-SIRST——to verify the effectiveness of the proposed SANet. SANet was comprehensively compared with a variety of current mainstream methods, including five traditional methods (FKRW, TLLCM, NWMTH, NOLC, and PSTNN) and nine deep learning-based methods (ACM, RDIAN, AGPCNet, DNANet, UIUNet, RPCANet, MSHNet, SCTransNet, and L2SKNet). Four standard evaluation metrics were employed to objectively assess detection performance: intersection over union (IoU), normalized IoU, probability of detection (Pd), and false alarm rate (Fa). The experimental results demonstrate that SANet consistently outperforms existing state-of-the-art methods across all evaluation metrics. Specifically, for the IoU metric, SANet achieves improvements of 1.93%, 4.32%, and 2.21% over the second-best methods on the NUAA-SIRST, IRSTD-1K, and NUDT-SIRST datasets, respectively, demonstrating strong generalization capability and robustness across diverse scenarios.ConclusionThis paper introduces SANet, an IRSTD network grounded in a selective attention mechanism. Built upon the U-Net architecture, the proposed method incorporates a dual-path semantic-aware module and a selective attention fusion module, which enhance the model’s capacity for faint target perception, key feature representation, and background suppression. Experimental results on multiple public IRSTD datasets demonstrate that SANet outperforms existing mainstream approaches in terms of detection accuracy, robustness, and generalization, highlighting its practical relevance and broad applicability. Future work will concentrate on model lightweighting and inference optimization to fulfill deployment requirements on edge devices. Additionally, we aim to investigate the model’s adaptability under complex meteorological conditions (e.g., rain, fog, thermal disturbances) to broaden its utility across diverse infrared imaging scenarios. The source code is available on https://gitcode.com/m0_61988291/SANet.  
      关键词:infrared small target detection (IRSTD);dual-path semantic-aware module (DSM);pinwheel-shaped convolution;selective attention fusion module (SAFM);spatial dynamic weight mechanism   
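      As a rough illustration of the selective attention fusion idea summarized in the abstract above, the following PyTorch sketch (layer choices, channel sizes, and the gating form are assumptions rather than the authors' released SANet code) fuses an encoder skip feature with a decoder feature through a per-pixel gate scaled by a globally learnable scalar:

```python
# Illustrative sketch (not the authors' code) of a skip-connection fusion block in the
# spirit of the selective attention fusion module (SAFM): a spatial gate computed from
# the encoder/decoder features plus a globally learnable scalar decides how much of the
# encoder feature is passed to the decoder. Channel sizes and layer choices are assumptions.
import torch
import torch.nn as nn

class SelectiveSkipFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv produces a per-pixel gate from the concatenated features.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Globally learnable scalar controlling the overall strength of the skip path.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        gate = self.gate(torch.cat([enc_feat, dec_feat], dim=1))   # (B, C, H, W) in [0, 1]
        fused = dec_feat + self.alpha * gate * enc_feat            # spatially weighted skip
        return fused

if __name__ == "__main__":
    block = SelectiveSkipFusion(channels=32)
    enc = torch.randn(1, 32, 64, 64)
    dec = torch.randn(1, 32, 64, 64)
    print(block(enc, dec).shape)  # torch.Size([1, 32, 64, 64])
```

      With the scalar near zero the block degrades to a plain decoder path, so the network can learn how much encoder detail each scale should receive instead of forwarding it uniformly.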
    • Image relighting via depth-light direction joint modeling AI导读

      Relighting is widely used in fields such as the metaverse, but existing methods suffer from limited expressive capacity and are prone to artifacts. The authors propose a relighting method based on joint depth-light direction modeling: by extracting depth and related cues and designing a neural renderer, it effectively addresses these issues and offers a new solution for the relighting task.
      Li Hongzhen, Yang Zhulun, Ding Xin, Liu Qiong, Yang You, Li Wei
      Vol. 31, Issue 3, Pages: 811-825(2026) DOI: 10.11834/jig.250032
      Image relighting via depth-light direction joint modeling
      摘要:ObjectiveThe photometric consistency between virtual objects and real-world scenes directly impacts user immersion in applications such as the metaverse and augmented reality. In the film industry and artistic creation, lighting plays a vital role in expressing visual aesthetics and emotional experiences. However, the arrangement of lighting has always been a technical and labor-intensive task. Image relighting, which enables the control of lighting in images, can heavily reduce the workload of professionals in related fields. Existing relighting models based on the Lambertian diffuse reflection model and depth map path tracing perform image relighting by constructing reflectance field models or leveraging computer graphics principles, but these approaches are limited by the expressive capacity of the models and the difficulty of data acquisition. The latest deep learning networks, including convolutional neural networks (CNNs), generative adversarial networks, and diffusion models, have demonstrated exceptional performance in image generation tasks and have been introduced to image relighting. Benefiting from the implicit modeling capabilities of neural networks, end-to-end learning-based relighting methods achieve remarkable results in handling complex surface reflection phenomena. However, they perform poorly in addressing cast shadows caused by pixel-to-pixel interactions. The challenges of using deep learning methods for relighting with cast shadows are due to two primary issues. First, neural networks struggle to model cast shadows without prior knowledge or shadow features matching the target lighting condition. Second, most deep learning methods rely on convolution operations, which excel at capturing local features and fusing multichannel features but are less effective at modeling the long-distance dependencies between pixels that are inherent to cast shadows. This study addresses the lack of expressiveness in modeling cast shadows, the difficulty faced by CNN in handling long-distance dependencies, and the inability of traditional reflection models to represent the complex surface reflection properties of objects. Inspired by previous approaches, including those that use computer graphics principles to model relighting, deep learning-based neural rendering methods, and attention mechanisms from natural language processing for capturing long-distance dependencies, this study analyzes past methods and resolves the issue of cast shadow generation in image relighting through a relighting method via depth-light direction joint modeling.MethodThe core features of this method are the analysis of the generation mechanism of cast shadows and the development of an explicit mathematical model, depth-light direction joint modeling algorithm, for their representation, and the design of an advanced neural renderer based on the characteristics of attention mechanisms and CNNs. First, the learning-based method is used to estimate the depth of the image, and an explicit model is constructed using the depth map to predict the feature prior where cast shadows are generated. Given that cast shadows arise from occlusion, this study designed an algorithm leveraging vector inner products to determine cast shadow regions. 
Additionally, considering that the scene geometry estimated by monocular methods is ill-posed, leading to inaccurate cast shadow predictions, and that explicit models describing surface reflection phenomena are limited in their ability to express complex surface reflections, the proposed neural renderer employs a two-stage, cascaded U-Net-like architecture for neural rendering. The design preserves the U-shaped structure of U-Net while incorporating a multihead attention mechanism to capture long-distance dependencies between pixels, effectively representing cast shadows in conjunction with the depth-light direction joint modeling algorithm. In the second stage, a traditional convolutional U-Net is employed to merge multichannel features such as shadows, textures, brightness, and specular highlights, thereby achieving neural rendering. To ensure effective supervision of the generated images, this study adopts common relighting loss functions and constructs an image pyramid to supervise the generated images at multiple scales. A synthetic dataset called the human stage (HS) dataset was created using the rendering software Blender to evaluate the effectiveness of the proposed method. Unlike previous datasets that used flat surfaces as shadow receivers, the human stage dataset employs horizontal ground and vertical walls as shadow receivers, better reflecting the relighting method’s performance in handling cast shadows. Furthermore, training and testing were conducted on the real scene relighting (RSR) dataset to validate the generalization of the proposed method on real-world data. For the real dataset validation, depth and normal were obtained using publicly available pretrained models. Considering the method’s applicability across different scenarios, a subset of the RSR dataset was used, excluding objects with colored lighting or metallic materials.ResultThis paper conducts comparative experiments on the real-world dataset and the synthetic dataset against four representative relighting methods, evaluating qualitatively and quantitatively. The qualitative evaluation of relighting images focuses on the following aspects: maintaining the texture of the input image while adjusting brightness according to the target lighting; generating clear cast shadows under the target lighting; and restoring specular highlights experimental results demonstrate that the proposed method achieves the best peak signal-to-noise ratio (PSNR), structure similarity index measure (SSIM), learned perceptual image patch similarity (LPIPS), and mean perceptual score (MPS) on the RSR dataset, with improvements of 5.45% and 2.58% in peak signal-to-noise ratio and the comprehensive metric MPS, respectively, compared with the second-best model. On the synthetic dataset, the method achieves the best LPIPS metric and produces subjectively more intuitive results aligned with human perception. Additionally, ablation studies on the cast shadow algorithm, loss function, and network architecture validate that the proposed algorithm, loss function, and network design effectively enhance the quality of relighting images.ConclusionThe proposed method effectively addresses the cast shadow generation problem in image relighting and has been validated on synthetic and real-world datasets to produce relighted images with accurate cast shadows. 
This research introduces a method for predicting shadows through explicit mathematical modeling, proposes leveraging the complementary strengths of attention mechanisms and convolution operations in neural rendering, and provides improvement directions for future research. For instance, in scenarios involving colored lighting and metallic objects, future work should refine the neural renderer architecture and optimize the predefined surface reflectance model on the basis of the linear nature of lighting and reflection models for metallic objects, enhancing relighting performance in more complex lighting scenarios and diverse object types.
      关键词:image relighting;shadow generation;attention mechanism;neural rendering;deep learning   
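      To make the geometric intuition behind the depth-light direction shadow prior concrete, here is a minimal NumPy sketch, assuming the depth map can be treated as a height field and the light direction as a 3D vector pointing toward the light; the paper's actual inner-product formulation and camera model are more elaborate:

```python
# Minimal sketch (assumptions, not the paper's exact algorithm) of predicting cast-shadow
# regions from a depth/height map and a light direction: a pixel is flagged as shadowed if,
# marching from it toward the light, some surface point rises above the ray to the light.
# The paper formulates the occlusion test with vector inner products; this height-field
# ray march plays the same role as an explicit geometric shadow prior.
import numpy as np

def cast_shadow_mask(height: np.ndarray, light_dir: np.ndarray, steps: int = 64) -> np.ndarray:
    """height: (H, W) height field; light_dir: 3D vector pointing *toward* the light."""
    h, w = height.shape
    d = light_dir / np.linalg.norm(light_dir)
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    shadow = np.zeros((h, w), dtype=bool)
    for t in range(1, steps + 1):
        # Sample position along the ray toward the light (image-plane offset + height gain).
        sx = np.clip(np.round(xs + t * d[0]).astype(int), 0, w - 1)
        sy = np.clip(np.round(ys + t * d[1]).astype(int), 0, h - 1)
        ray_height = height + t * d[2]           # height of the ray at this step
        shadow |= height[sy, sx] > ray_height    # occluder above the ray -> in shadow
    return shadow

if __name__ == "__main__":
    ground = np.zeros((64, 64))
    ground[20:30, 20:30] = 5.0                   # a box standing on a flat floor
    mask = cast_shadow_mask(ground, np.array([1.0, 0.0, 0.5]))
    print(mask.sum(), "pixels in shadow")
```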
    • Image-text decoupling method for compositional zero-shot learning AI导读

      A new advance in compositional zero-shot recognition: the researchers propose a dual-modality decoupling mechanism that uses a text-side graph neural network and a vision-side cross-attention mechanism to substantially improve the modeling of attribute and object concepts and to strengthen recognition of unseen compositions.
      Tian Yi, Qian Yixin, Huang Qingbao, Chen Jiayue, Zhong Lei, Wu Xianrui
      Vol. 31, Issue 3, Pages: 826-839(2026) DOI: 10.11834/jig.250189
      Image-text decoupling method for compositional zero-shot learning
      摘要:ObjectiveCompositional zero-shot learning (CZSL) is a subtask of zero-shot learning (ZSL) in the field of computer vision, aiming to learn attribute and object concepts from seen composition images and transfer them to unseen compositions. For example, learning the attribute “cute” and the object “cat” from the composition images of “cute dog” and “nimble cat” to recognize the unseen composition of “cute cat”. Existing methods are unable to decouple attributes and objects in compositional images and have not fully utilized the decoupling role of text labels in attribute and object information. In previous research on compositions, attributes and objects were often treated with the assumption that the object is the main subject, with attributes attached to the object, lacking recognition of the equal importance of attributes and objects. Although some studies recognize the importance of decoupling in the compositional zero-shot recognition task, they still focus on decoupling at the superficial level or fail to fully decouple attributes and objects.MethodDecoupling the conceptual features of attributes and objects from images is both a prerequisite and a challenge for enabling the model to recognize unseen compositions. By simulating the interactions of attribute and object elements within compositions and between compositions and considering the differences in textual and visual modalities, we have designed a text decoupler based on graph neural networks and a visual decoupler based on cross-attention contrastive learning. These components are integrated into contrastive language-image pre-training (CLIP) to achieve the decoupling of attribute and object elements at the linguistic and visual levels. More specifically, we constructed a compositional graph network by using the text labels in the data and modeled the relationships between attributes, objects, and compositions within the graph by using graph convolutional networks. This approach allows the model to understand the corresponding relationships of the same labels in different composition texts through the compositional graph, further assisting the model in decoupling at the visual level. This representation at the text level helps the model understand the relationships between attributes and objects within the same composition, as well as the inherent connections between the same attributes and objects across different compositions. At the image level, for the composition image to be recognized, we search for an image in the dataset that shares the same attribute and another image that shares the same object. The image with the same attribute is then passed into the cross-attention module along with the composition image, allowing the model to learn the attribute through the attention layer. Applying the same operation on the object side, attention-level contrastive learning with visually similar images improves recognition of visual elements. In this way, we achieve the decoupling of attribute and object information in the image at the visual level. These two components are inserted into the final layers of CLIP’s text encoder and image encoder, enhancing CLIP’s decoupling ability while fully utilizing its powerful image-text representation and retrieval recognition capabilities.ResultEvaluations on three well-known CZSL benchmark datasets, namely, MIT-States, UT-Zappos, and C-GQA, demonstrate that the proposed method significantly improves model performance. 
On the MIT-States dataset, compared with the second-best performing model, the proposed method achieved a 3.3% increase in area under curve (AUC), a 2.4% increase in HM, a 5.3% improvement in the recognition accuracy for seen compositions, and a 1.0% improvement for unseen compositions in the closed-world setting. In the open-world setting, AUC increased by 0.9%, HM improved by 0.7%, seen composition recognition accuracy improved by 3.2%, and unseen composition recognition accuracy improved by 1.0%. On the UT-Zappos dataset, in the closed-world setting, compared with the second-best performing model, the proposed method achieved a 0.5% increase in AUC, a 1.3% improvement in the recognition accuracy for seen compositions, a 0.1% increase for unseen compositions, and the second-best performance on HM. In the open-world setting, AUC increased by 2.5%, HM improved by 0.9%, seen composition recognition accuracy achieved the same performance as the state-of-the-art method, and unseen composition recognition accuracy improved by 2.6%. Our model achieved the second-best performance on the C-GQA dataset. Additionally, ablation studies on the text-visual decoupling module and its contextual components on the MIT-States dataset confirm the effectiveness of the proposed method. Ablation experiments on the MIT-States dataset show that removing the visual decoupling module resulted in a 9.8% decrease in AUC, an 8.9% decrease in HM, an 11.0% decrease in recognition accuracy for seen compositions, and a 9.0% decrease in recognition accuracy for unseen compositions. Removing the text decoupling module resulted in a 2.5% decrease in AUC, a 0.9% decrease in HM, a 3.7% decrease in recognition accuracy for seen compositions, and a 0.5% increase in recognition accuracy for unseen compositions. This paper further conducts ablation experiments on the contextual relationships within the image decoupling component and the text decoupling component. For the text decoupling component, we eliminate the contextual relationships by retaining only the trainable tokens related to the text prompt and removing the compositional graph information modeled by GCN. For the image decoupling component, we eliminate the contextual relationships by removing the images with matching attributes and objects, replacing them with the training images themselves. This paper conducts qualitative experimental analysis, including retrieving text through unseen images, retrieving images through text, and retrieving visual elements.ConclusionThe proposed text-visual decoupling module enhances the model’s ability to learn attributes and objects in compositions, significantly improving the model’s performance in compositional zero-shot recognition tasks. Ablation experiments demonstrate that models for the compositional zero-shot learning task should fully utilize the contextual information from both images and texts in the training samples, providing a direction for future research.  
      关键词:zero-shot learning (ZSL);compositional zero-shot learning (CZSL);decoupling;graph convolutional network (GCN);cross-attention;contrastive language-image pre-training (CLIP)
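      The vision-side decoupling can be pictured as ordinary cross-attention between the composition image and an image sharing one of its primitives. The sketch below is a hedged reading of that idea (token shapes, dimensions, and the residual design are assumptions, not the released code):

```python
# Rough sketch (my own reading of the abstract, not the released code) of the vision-side
# decoupler: tokens of the composition image attend to tokens of a second image that shares
# the same attribute, so the attended output emphasizes the shared attribute concept.
# The same block, fed an image sharing the object, would produce an object-focused feature.
import torch
import torch.nn as nn

class CrossAttentionDecoupler(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, comp_tokens: torch.Tensor, partner_tokens: torch.Tensor) -> torch.Tensor:
        # comp_tokens: (B, N, D) patch tokens of the composition image (queries)
        # partner_tokens: (B, M, D) patch tokens of the same-attribute (or same-object) image
        shared, _ = self.attn(comp_tokens, partner_tokens, partner_tokens)
        return self.norm(comp_tokens + shared)   # residual keeps composition context

if __name__ == "__main__":
    dec = CrossAttentionDecoupler()
    comp = torch.randn(2, 49, 512)
    same_attr = torch.randn(2, 49, 512)
    attr_feat = dec(comp, same_attr)
    print(attr_feat.shape)  # torch.Size([2, 49, 512])
```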
    • This entry reports progress in vector graphics style transfer: the authors propose a vector graphics style transfer method guided by a cross-modal style scoring model (CLIPVGStyler), offering an effective solution to the mismatch between the style transfer results of existing methods and human visual perception.
      Zheng Shuyang, Chen Jiazhou, Zhu Xinding, Li Kaiyong
      Vol. 31, Issue 3, Pages: 840-849(2026) DOI: 10.11834/jig.250310
      Vector graphics style transfer guided by cross-modal style evaluation models
      摘要:ObjectiveStyle transfer is a popular research problem in computer graphics, aimed at applying one style to another content to give it a new visual style. Since the introduction of the Gram matrix-based method for representing style, many deep learning-based methods have emerged in the field of style transfer, which have achieved good results through convolutional neural networks. However, these studies are limited to raster images and difficult to directly apply to style transfer in vector images mainly because the basic elements of vector images are different from bitmap images and cannot be convolved like pixels. Iteratively updating network parameters through reverse traditional propagation is also impossible. Vector graphics describe images using geometric shapes, which have the advantages of scale invariance, small data volume, and flexible editing. They are widely used in fields such as illustration and web design. Therefore, the style transfer of vector graphics is also one of the research hotspots in computer graphics. The existing methods face serious limitations: they typically only extract specific low-level features (such as strokes or textures) from reference style images and cannot capture the overall and multifaceted visual perception of style by humans (including color, texture, strokes, and thematic elements). This can lead to inconsistent perceptual results, such as unnatural color shifts or distorted content. The dependence on style images also limits control and expressiveness.MethodCLIPVGStyler proposes a contrastive language-Image pre-training (CLIP)-based cross-modal style scoring model to supervise vector graphics parameter optimization. The core innovation lies in using CLIP’s zero-shot style discrimination capability as a perceptual loss function. The method takes an input vector graphic defining the content and a textual style description. Optionally, a small cache (4~16 images) representing a novel style (StyleCache) can be provided for few-shot style adaptation. The differentiable rasterizer called DiffVG is used to render the current vector graphic state during optimization. The loss function consists of two key components. The style loss measures the dissimilarity between the current rendered image and the target style. It primarily uses the cosine distance between the CLIP encodings of current vector graphic and style. For few-shot adaptation, an additional loss term (1–S_cache) is added, where S_cache is the average cosine similarity between the current vector graphic and the CLIP encodings of the StyleCache images. The content loss ensures content preservation and is a weighted combination (α = 0.25) of the pixel-level L2 distance between current vector graphic and content image and the cosine distance between their CLIP encodings. Crucially, the rendered current vector graphic undergoes robust image augmentation (multiscale random cropping with white padding and random perspective transformations) before CLIP encoding to enhance geometric invariance and prevent overfitting. Negative text prompts (“bad image”, “low quality”, “blurry image”) are also incorporated to improve output quality. The combined loss (L_style’ + L_content) is minimized iteratively using gradient descent to update the vector graphic’s path parameters (shapes, colors). 
An early stopping mechanism halts optimization if the loss plateaus for 10 consecutive iterations, typically converging within 50–150 steps.ResultComprehensive experiments demonstrate the effectiveness and efficiency of CLIPVGStyler. Quantitatively, using CLIP similarity scores as a semantic-alignment metric, CLIPVGStyler achieved the highest score (0.256) compared with baselines (StyleCLIPDraw: 0.241, VectorPainter: 0.253, VectorNST: 0.239) when evaluating the semantic fidelity of the output to the textual style prompt T_style. A large-scale user study (360 participants) employing a 10-point scale for blind A/B testing further confirmed the perceptual superiority of CLIPVGStyler. It received the highest average rating (7.5), significantly outperforming StyleCLIPDraw (3.2) and VectorNST (3.86), and slightly exceeding the well-regarded but inefficient VectorPainter (7.14). Qualitatively, CLIPVGStyler generated results with more pronounced and holistic style transfer across diverse styles (e.g., watercolor, One Piece anime, ink wash, sketch), successfully capturing complex stylistic elements such as color palettes and thematic coherence without the distortions prevalent in other methods (e.g., unnatural color bleeding in VectorPainter, excessive contour distortion in VectorNST, or chaotic composition in StyleCLIPDraw). Crucially, it maintained superior content integrity compared with alternatives when using text-based style control. The proposed few-shot style adaptation mechanism effectively enhanced the transfer of novel or rare styles unseen by CLIP during pre-training. Ablation studies confirmed the critical role of the primary style loss based on text (S_Text), which alone handled common styles effectively. The S_Cache component proved essential for boosting novel style perception and provided marginal refinement for common styles but was ineffective when used alone. While S_Text could operate independently, S_Cache required the foundation of S_Text for optimal results. Performance-wise, CLIPVGStyler (avg. 19 min 10 s) demonstrated a substantial speed advantage over the highly effective but impractical VectorPainter (842 min 32 s) and was faster than VectorNST (24 min 14 s), approaching the speed of the lower-quality StyleCLIPDraw (14 min 27 s).ConclusionCLIPVGStyler presents a groundbreaking approach to vector graphics style transfer by introducing a CLIP-based cross-modal style scoring model. This model leverages CLIP’s inherent human-aligned style recognition capabilities to supervise the optimization of vector graphic parameters, leading to results that are significantly more perceptually coherent than prior work. Key contributions include the novel formulation of style loss using CLIP encodings of textual style descriptions (T_style), enabling precise, high-level control over the desired artistic expression. The method introduces an efficient few-shot adaptation mechanism (StyleCache) that allows the system to handle novel or niche styles without any additional training, compensating for CLIP’s limitations on unseen data. Extensive experimental validation confirms that CLIPVGStyler generates vector graphics with superior style transfer quality, achieving higher semantic alignment (CLIP score) and user preference ratings than state-of-the-art methods such as StyleCLIPDraw, VectorPainter, and VectorNST. It accomplishes this task with high computational efficiency, particularly compared with the high-quality but prohibitively slow VectorPainter. 
Limitations include potential minor artifacts upon close inspection and the possibility of inadvertently transferring non-style features. Overall, CLIPVGStyler establishes a new paradigm for perceptually guided, efficient, and controllable vector graphics style transfer.  
      关键词:vector graphics;style transfer;cross-modal;few-shot;contrastive language-image pre-training (CLIP)
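      The losses described in this abstract can be sketched against precomputed CLIP embeddings, so no particular CLIP package API is assumed; the StyleCache term and the α = 0.25 content weighting follow the abstract, while the exact way the terms are combined is a guess:

```python
# Hedged sketch of the CLIPVGStyler-style loss terms, written against *precomputed* CLIP
# embeddings so no specific CLIP library API is assumed. `render_emb` is the CLIP embedding
# of the rasterized vector graphic, `style_text_emb` that of the style prompt, `cache_embs`
# those of optional StyleCache images; the content term uses the rendered and content images
# plus their embeddings. alpha = 0.25 follows the abstract; the exact combination may differ.
import torch
import torch.nn.functional as F

def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return 1.0 - F.cosine_similarity(a, b, dim=-1).mean()

def style_loss(render_emb, style_text_emb, cache_embs=None):
    loss = cosine_distance(render_emb, style_text_emb)            # text-driven style term
    if cache_embs is not None:                                    # few-shot StyleCache term: 1 - S_cache
        s_cache = F.cosine_similarity(render_emb.unsqueeze(1),
                                      cache_embs.unsqueeze(0), dim=-1).mean()
        loss = loss + (1.0 - s_cache)
    return loss

def content_loss(render, content_img, render_emb, content_emb, alpha: float = 0.25):
    pixel_l2 = F.mse_loss(render, content_img)                    # pixel-level fidelity
    semantic = cosine_distance(render_emb, content_emb)           # CLIP-space fidelity
    return alpha * pixel_l2 + (1.0 - alpha) * semantic            # weighting scheme assumed

if __name__ == "__main__":
    r_emb, t_emb, c_emb = torch.randn(1, 512), torch.randn(1, 512), torch.randn(1, 512)
    cache = torch.randn(8, 512)
    render, content = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
    total = style_loss(r_emb, t_emb, cache) + content_loss(render, content, r_emb, c_emb)
    print(float(total))
```

      In the actual pipeline these embeddings would come from CLIP encodings of the DiffVG-rendered, augmented image, and the combined loss would drive gradient updates of the vector path parameters.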
    • A new advance in compositional zero-shot learning: the authors propose a cross-modal recognition method that combines explicit semantic disentanglement with a soft prompt mechanism, effectively resolving the confusion between state and object features and the lack of semantic alignment, markedly improving model generalization and opening a new direction for research in this area.
      Liu Jie, Tao Chongben, Shen Zhongwei, Luo Xizhao, Cao Feng
      Vol. 31, Issue 3, Pages: 850-861(2026) DOI: 10.11834/jig.250265
      Cross-modal compositional zero-shot recognition via disentanglement and soft prompt
      摘要:ObjectiveCompositional zero-shot learning (CZSL) aims to recognize novel combinations of states (attributes) and objects that are not explicitly observed during training. This task presents a fundamental challenge for real-world applications, where the combinatorial explosion of state-object pairs makes it impractical to annotate all possible combinations. Despite significant progress in recent years, two core challenges remain unsolved. First, existing models often struggle to effectively disentangle visual representations of states and objects. Standard deep learning architectures tend to encode these semantic components into entangled latent spaces, thereby limiting the ability to generalize to novel compositions. Second, insufficient semantic alignment between visual and textual modalities remains a persistent issue. Because visual embeddings and linguistic representations are heterogeneous, cross-modal alignment is often weak, hindering the transfer of knowledge from seen to unseen compositions. Addressing these limitations is essential for improving generalization in CZSL and enabling more scalable, adaptable recognition systems in open-world scenarios.MethodTo tackle the above challenges, we propose a unified cross-modal compositional learning framework that explicitly incorporates semantic disentanglement, adaptive soft prompting, adversarial regularization, and cross-modal relation fusion. Our method comprises three key modules, each designed to address one of the aforementioned limitations. First, we introduce a learnable soft prompt module operating in the language modality. This module constructs dynamic prompts based on learnable vectors, forming soft semantic templates tailored to individual states and objects. These prompts enhance the model’s capability to generate fine-grained, composition-specific textual representations, which in turn improves alignment with corresponding visual features. The soft prompts are trained end-to-end with the rest of the network, allowing the model to adaptively refine its semantic interpretation of input concepts. Second, we integrate a variational autoencoder (VAE)-based visual module to explicitly disentangle visual features into semantically relevant and irrelevant latent variables. The semantically relevant variable encodes compositional information, while the irrelevant one captures style, background, and other non-semantic aspects. An adversarial discriminator is introduced to minimize mutual information between these two latent subspaces and thus enforce this separation. This adversarial training encourages orthogonality between disentangled features, resulting in more structured and interpretable visual embeddings. Third, to bridge the modality gap and more effectively model compositional relations, we design a cross-modal relation fusion module that fuses the disentangled visual embeddings with adaptive soft prompts via a relation-aware attention mechanism. It learns to align and reason over state-object semantics jointly across modalities, thereby improving the model’s ability to match visual inputs with corresponding textual descriptions, especially for unseen compositions. This multilevel interaction across semantic modules enables the framework to capture both intramodal structure and intermodal alignment. Results We conduct comprehensive experiments on three widely used CZSL benchmark datasets: MIT-States, UT-Zappos, and C-GQA. 
Each dataset presents distinct challenges: MIT-States offers a diverse and complex compositional space with numerous fine-grained object and attribute variations, UT-Zappos focuses on fine-grained footwear attributes, and C-GQA contains large-scale compositional queries based on visual question-answering tasks. Across all datasets, our model consistently achieves state-of-the-art performance. On MIT-States, our approach attains 54.2% accuracy on unseen compositions, outperforming the best previous method by 1.3%. The Harmonic mean(H) and area under curve(AUC) metrics also show noticeable improvements, reflecting better generalization and balance between seen and unseen combinations. On UT-Zappos, our method achieves significant gains in fine-grained settings, demonstrating its robustness in scenarios with subtle attribute distinctions. For the more challenging C-GQA dataset, the model achieves new best results across multiple evaluation metrics, confirming its scalability to large and complex visual domains. Ablation studies are conducted to assess the contribution of each component. When the soft prompt module is removed, performance drops considerably, confirming its importance for enhancing textual encoding. Similarly, removing the VAE-based disentanglement or adversarial discriminator results in decreased generalization, underscoring their critical role in structuring the visual representation space. We also perform a sensitivity analysis on the adversarial loss weight, showing that moderate regularization (e.g., 0.5) provides the best trade-off between disentanglement and reconstruction. Furthermore, visualizations using dimensionality reduction techniques such as t-SNE reveal that our semantic disentanglement module effectively separates semantically meaningful clusters for different compositions, while irrelevant features remain dispersed, verifying the structural clarity of learned representations.ConclusionIn summary, we propose a comprehensive and interpretable framework for CZSL that directly addresses two longstanding limitations: visual-semantic entanglement and cross-modal misalignment. Through the integration of adaptive soft prompt learning, VAE-based visual disentanglement, adversarial independence constraints, and relation-aware cross-modal fusion, our method achieves superior generalization to unseen compositions. The model not only yields substantial performance improvements over existing baselines but also offers clear interpretability and modularity. Each component is functionally and structurally validated through ablation and visualization, highlighting the robustness and transparency of the overall framework. These contributions provide a solid foundation for future research directions, such as extending the method to dynamic scenes with temporal dependencies, multi-object compositions, or context-aware compositional reasoning in complex environments. Ultimately, our approach bridges important gaps in compositional recognition and opens new avenues for robust, flexible, and scalable visual understanding systems in open-world and real-world settings.  
      关键词:compositional zero-shot learning (CZSL);semantic disentanglement;cross-modal alignment;soft prompting;adversarial training
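      A minimal sketch of the VAE-style split with an adversarial independence constraint, assuming a FactorVAE-like discriminator over (z_s, z_n) pairs; the paper's actual encoder, prior, and mutual-information estimator may differ:

```python
# Schematic sketch (design choices are mine, based on the abstract) of splitting a visual
# feature into a semantic latent z_s and a nuisance latent z_n, with a discriminator that
# tells jointly sampled pairs (z_s, z_n) from shuffled pairs; training the encoder to fool
# it pushes the two subspaces toward independence (a density-ratio proxy for mutual information).
import torch
import torch.nn as nn

class SplitEncoder(nn.Module):
    def __init__(self, in_dim=512, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, 4 * z_dim))

    def forward(self, x):
        mu_s, logvar_s, mu_n, logvar_n = self.net(x).chunk(4, dim=-1)
        z_s = mu_s + torch.randn_like(mu_s) * (0.5 * logvar_s).exp()   # reparameterization
        z_n = mu_n + torch.randn_like(mu_n) * (0.5 * logvar_n).exp()
        return z_s, z_n

class PairDiscriminator(nn.Module):
    def __init__(self, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * z_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, z_s, z_n):
        return self.net(torch.cat([z_s, z_n], dim=-1))  # logit: "jointly sampled" vs "shuffled"

if __name__ == "__main__":
    enc, disc = SplitEncoder(), PairDiscriminator()
    feats = torch.randn(16, 512)
    z_s, z_n = enc(feats)
    joint = disc(z_s, z_n)
    shuffled = disc(z_s, z_n[torch.randperm(z_n.size(0))])
    # Discriminator maximizes separation of the two; the encoder minimizes it (adversarial).
    d_loss = nn.functional.binary_cross_entropy_with_logits(joint, torch.ones_like(joint)) + \
             nn.functional.binary_cross_entropy_with_logits(shuffled, torch.zeros_like(shuffled))
    print(float(d_loss))
```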
    • The authors propose a new method for multimodal-guided dress image generation, building a structured style enhancement learning framework that effectively addresses redundancy and conflicts in multi-angle textual annotations, limited cross-region style transfer, and the difficulty of coordinated semantic-style control, offering a new solution for high-quality dress image generation.
      Ma Jiani, Liu Li, Fu Xiaodong, Liu Lijun, Peng Wei
      Vol. 31, Issue 3, Pages: 862-879(2026) DOI: 10.11834/jig.250338
      Structured style enhancement learning for multimodal guided dress image generation
      摘要:ObjectiveMultimodal image generation aims to produce high-quality images by jointly encoding and conditionally controlling diverse modalities such as text, sketches, and reference style images, while ensuring semantic consistency. With the digital transformation of the fashion industry, applications such as virtual try-on, digital garments, and online shopping are placing higher demands on the realism and controllability of generated images. Different clothing categories vary considerably in their structural characteristics and visual attributes. Among clothing categories, dresses, as a key category in the fashion industry, present greater challenges due to their coverage of both upper and lower body regions and their wide variety of styles, increasing complexity in structure modeling, texture synthesis, and cross-region style transition. Currently, existing approaches for dress image generation can be broadly classified into three categories: generative adversarial network (GAN) methods, autoregressive models, and diffusion methods. Early works predominantly rely on GANs, which leverage adversarial training between generators and discriminators to capture local textures and structural details. However, these models often suffer from training instability and mode collapse. Autoregressive models first discretize multimodal inputs into token sequences and progressively generate image regions within a latent token space. This approach enables effective fusion of multimodal conditions. However, their strictly sequential generation process makes it difficult to model long-range dependencies, leading to limited information flow and reduced scalability in complex image synthesis tasks. More recently, diffusion models have emerged as a new paradigm due to their stepwise denoising process, which enables stable training and stronger conditional control. These models iteratively reconstruct clean images from noisy representations, making them well suited for incorporating complex multimodal guidance. Despite recent progress in semantic consistency and style generation, dress image synthesis remains a challenging task due to its rich attribute variations and stylistic diversity. First, multi-angle textual annotations of dresses often contain redundancy and conflict and lack contextual coherence, making it difficult to provide deep semantic guidance for visual synthesis. Second, although sketches offer explicit structural cues, they fall short in conveying consistent cross-region styles, leading to local-global style mismatches. Moreover, effective dress generation requires fine-grained coordination between semantics and style, yet current guidance mechanisms lack a unified and efficient control strategy. In this paper, we propose a structured style enhancement learning method for dress image generation.MethodFirst, a dynamic attribute template generation strategy is designed on the basis of the structural and stylistic characteristics of dresses. It enables the intelligent extraction and reconstruction of seven key dress attributes, thereby constructing structured textual prompts that mitigate redundancy and semantic conflicts in multiview descriptions. Second, a semantic inversion fusion mechanism is proposed, wherein visual features of dress images are projected into pseudo-token embeddings through a text-inversion process and subsequently fused with the structured prompts to form semantically enriched textual representations. 
Third, a cross-domain image feature alignment module is developed, in which a skip-cross attention mechanism is introduced to selectively integrate sketch structures and style references, facilitating consistent style transfer across spatial regions. Finally, a dual-branch conditional fusion framework is constructed, wherein the enhanced textual and style representations are hierarchically injected into a latent diffusion model. This feature enables fine-grained control over semantic content and visual style during the dress image generation process.ResultComparative experiments were conducted on the dress subset of the DressCode Multimodal dataset against five state-of-the-art approaches to evaluate the effectiveness of the proposed method. Quantitative results show that, compared with the second-best method, the proposed method achieves improvements of 2.131 in Fréchet inception distance and 0.193 in learned perceptual image patch similarity, indicating that the introduced structured style-enhanced learning mechanism provides clearer, richer, and more consistent multimodal guidance during training. The Contrastive Language-Image Pre-training score is improved by 17.57%, demonstrating that the unified semantic template focusing on attribute information effectively eliminates ambiguity and redundancy in natural language descriptions, enhances semantic coherence, and provides the model with clearer generation targets. The integration of visual features further enriches the textual semantics, resulting in better alignment between the generated dress images and their textual descriptions. An 8.29% increase in texture score indicates the model’s strong ability to maintain style consistency and detailed fidelity even under complex sketch structures by effectively sharing and transferring style information across regions. The sketch distance score remains within a low and comparable range to existing methods, reflecting stable generation performance. Qualitative experimental results demonstrate that the proposed method is highly capable of recognizing and reconstructing key dress attributes, particularly color, material, and pattern. Furthermore, it achieves satisfactory performance in style transfer across regions, overall style consistency, image quality, and visual realism. These outcomes effectively fulfill the dual constraints of semantics and style required for high-quality dress image generation. The results of user studies further demonstrate that the proposed method achieved the highest preference rates across three evaluation dimensions: overall visual quality of the generated dress images, consistency between the images and the input text, and cross-domain style consistency within the generated results. This finding indicates that the proposed approach can effectively enhance the perceptual quality of dress image generation. Moreover, compared with the relatively close performance reflected by automatic evaluation metrics, user preferences reveal more pronounced differences among competing methods in terms of generation quality and style fidelity. 
These findings highlight the necessity of subjective evaluation in perception-driven tasks such as dress image generation, as it provides valuable insights into the perceptual differences that may not be fully captured by objective metrics.ConclusionTo address the challenges in multimodal-guided dress image generation, including redundant and conflicting multiview textual annotations, limited cross-region style transfer capability, and the difficulty of fine-grained coordination between semantics and style, this paper proposes a structured style enhancement learning method. The method effectively captures the deep correlations between semantic content and structural style, ensuring multimodal consistency while achieving high-quality dress image generation.  
      关键词:dress image generation;structured text prompt;text inversion semantic fusion;cross-domain image feature alignment;diffusion model   
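      A toy sketch of what a structured attribute-template prompt could look like; the seven slot names below are hypothetical placeholders (the abstract explicitly names only color, material, and pattern), and the paper's dynamic template generation strategy is richer than this:

```python
# Toy sketch of a structured attribute-template prompt, in the spirit of the "dynamic
# attribute template" described above. The seven slot names are assumptions for illustration;
# missing slots are simply dropped, so conflicting or redundant free-text phrases from
# multiview descriptions never enter the prompt.
from typing import Dict, Optional

ATTRIBUTE_SLOTS = ["color", "material", "pattern", "neckline", "sleeve", "length", "silhouette"]

def build_dress_prompt(attributes: Dict[str, Optional[str]]) -> str:
    parts = []
    for slot in ATTRIBUTE_SLOTS:            # fixed slot order gives a stable, conflict-free prompt
        value = attributes.get(slot)
        if value:
            parts.append(f"{slot}: {value}")
    return "a dress, " + ", ".join(parts) if parts else "a dress"

if __name__ == "__main__":
    print(build_dress_prompt({
        "color": "navy blue",
        "material": "satin",
        "pattern": "floral print",
        "sleeve": "short sleeves",
        "length": "midi",
    }))
```

      In the full method such a structured prompt would then be fused with pseudo-token embeddings obtained by text inversion before being injected into the latent diffusion model.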

      Computer Graphics

    • A new advance in real-time rendering: the authors propose a real-time neural super-sampling method based on a frame-recurrent structure, effectively improving image quality and real-time performance and offering an innovative solution to the challenge of high-resolution, high-refresh-rate rendering.
      Li Lin, Xue Haowen, Zhu Jichun, Zhao Yang, Liu Xiaoping
      Vol. 31, Issue 3, Pages: 880-895(2026) DOI: 10.11834/jig.250296
      Real-time neural super-sampling rendering based on frame-recurrent structure
      摘要:ObjectiveThe complexity of modern real-time rendering algorithms, such as ray tracing, is often closely related to the number of pixels on the target screen. However, with the increasing resolution and refresh rates of external displays and head-mounted displays, as well as the proposal of graphics algorithms that achieve more realistic visual effects, the computational cost of real-time rendering programs has increased. As a result, many users must balance rendering resolution, real-time performance, and rendering quality, even with the support of new parallel computing hardware. Although many video super-resolution methods now produce excellent reconstruction results, these methods cannot be directly applied to real-time rendering super-resolution. First, most of these methods require a network submodule to estimate the optical flow map between two adjacent frames, which inevitably increases the algorithm’s time complexity and reconstruction error. In real-time rendering, the algorithm can directly obtain almost complete motion vector maps from the modern graphics rendering pipeline, effectively avoiding the time overhead and cumulative errors of generating optical flow maps. Second, the latest video super-resolution algorithms emphasize the importance of bidirectional propagation of video frame features for reconstruction quality. However, in real-time rendering, the algorithm cannot obtain feature information of future frames, making bidirectional propagation of frame features impossible. Lastly, real-time rendering pipelines often generate images that record scene geometry information, and failing to effectively utilize this data limits the final reconstruction results.MethodIn response to the above issues, this method proposes a real-time neural super-sampling method based on a frame cyclic structure. First, it can make full use of the low-resolution scene geometry data generated in the real-time rendering pipeline to enhance the three-dimensional spatial information perception of the super-sampling network. Second, the frame cyclic framework is integrated into the super-sampling method. Temporal stability is achieved by introducing the features of the previous frame’s reconstruction results to improve the current frame’s reconstruction. Finally, a reweighting network and an attention network are embedded in the feature extraction module to enhance the effectiveness of the extracted features. Additionally, this method presents a real-time rendering pipeline oriented to neural super-sampling, which can deploy the super-sampling network onto the graphics computing pipeline and integrate it with the real-time rendering pipeline. To further ensure the real-time performance of the real-time rendering pipeline, we performed lightweight processing on the proposed real-time neural supersampling method at different levels and compared the changes in various image evaluation metrics and real-time performance of the network before and after lightweighting in subsequent experiments. No public and universal dataset in the field of neural supersampling is available. Thus, we use Unity3D for the collection of the dataset. For low-resolution data, we set the output resolution to 320 × 180 pixels and disable anti-aliasing. 
For high-resolution data, we set the output resolution to 5 120 × 2 880 pixels with 8 × MSAA anti-aliasing enabled,then downsample it to 1 280 × 720 pixels and apply a sharpening operation with a 3 × 3 convolution kernel to the downsampled result to further improve the quality of the test data.ResultCompared with NSRR, a benchmark method that also achieves good real-time performance, the proposed method shows a slight improvement in speed while increasing the peak signal-to-noise ratio by an average of 0.4 dB. Meanwhile, the structural similarity index is increased by an average of 0.015, and the learned perceptual image patch similarity is decreased by an average of 0.028. After being deployed to the real-time rendering pipeline, the method maintains real-time performance through lightweight pruning, and its performance remains superior to that of the non-real-time deployed NSRR in most scenarios. Subsequent experiments also demonstrate that the lightweight network can operate on the graphics computing pipeline and meet real-time requirements at various resolutions, as validated by experimental data. Additionally, through ablation experiments on the three submodules of the network, this method validates the effectiveness of each submodule in reconstructing the current frame. Moreover, the image evaluation metrics show a minimal decline when the scene normals are sufficiently simple or the camera movement amplitude is too large.ConclusionFirst, compared with previous super-resolution and super-sampling methods, the proposed neural super-sampling network model not only achieves better performance but also demonstrates superior real-time performance. Second, the constructed neural super-sampling rendering pipeline can operate on the graphics pipeline and meet real-time requirements at different resolutions. In the future, the regions of rendered images prone to artifacts can be further optimized by starting from the generation principles of rendered images. For instance, for dynamic shadows and specular reflections, ordinary motion vector maps cannot capture their changing trends as objects and light sources move. Thus, the rendering pipeline may need to be customized further to output vector maps that describe such changes to address this issue.  
      关键词:real-time rendering;frame-recurrent neural network;super-sampling;super-resolution (SR);convolutional neural network (CNN)
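      The frame-recurrent step can be illustrated by warping the previous reconstruction with the motion vectors the rendering pipeline already produces. This PyTorch sketch assumes motion vectors given as per-pixel offsets in normalized screen coordinates, which is a simplification of real engine conventions:

```python
# Hedged sketch of the frame-recurrent step: the previous reconstruction is warped into the
# current frame with pipeline-provided motion vectors and would then be concatenated with the
# current frame's features. Motion vectors are assumed to be per-pixel offsets in normalized
# [-1, 1] screen coordinates; actual engine conventions differ.
import torch
import torch.nn.functional as F

def warp_previous(prev: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
    """prev: (B, C, H, W) previous reconstruction; motion: (B, 2, H, W) offsets in NDC units."""
    b, _, h, w = prev.shape
    ys = torch.linspace(-1.0, 1.0, h, device=prev.device)
    xs = torch.linspace(-1.0, 1.0, w, device=prev.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    base = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)   # (B, H, W, 2)
    grid = base - motion.permute(0, 2, 3, 1)   # sample where each pixel came from last frame
    return F.grid_sample(prev, grid, mode="bilinear", align_corners=True)

if __name__ == "__main__":
    prev_frame = torch.rand(1, 3, 180, 320)
    motion_vec = torch.zeros(1, 2, 180, 320)   # zero motion -> warp is (nearly) the identity
    warped = warp_previous(prev_frame, motion_vec)
    print(torch.allclose(warped, prev_frame, atol=1e-5))
```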
    • A new advance in inverse rendering: the authors build a fast, end-to-end inverse rendering pipeline from a 3D Gaussian splatting representation to physically renderable meshes and material maps, providing an efficient and robust new paradigm for applying the technique in real-time interactive and industrial-grade scenarios.
      Liu Zheng, Tang Shengjun, Yao Mengmeng, Li You, Guo Renzhong
      Vol. 31, Issue 3, Pages: 896-911(2026) DOI: 10.11834/jig.250257
      Fast physically-based inverse rendering framework fusing 3D Gaussian splatting geometry enhancement and efficient material encoding
      摘要:ObjectiveInverse rendering aims to recover the underlying three-dimensional geometry, surface material properties, and illumination parameters of a scene from a collection of two-dimensional images. This capability can be applied in a wide range of tasks——from accurate scene reconstruction in autonomous driving and robotic navigation, to photorealistic asset creation for virtual reality and game engines, and even post-production workflows in film and visual effects. Traditional inverse rendering methods, including multiview stereo, depth-based reconstruction, and neural radiance field (NeRF) approaches, typically face a trade-off between reconstruction accuracy, material fidelity, and computational cost. 3D Gaussian splatting (3DGS) has emerged as a highly efficient and flexible rendering approach, representing scenes as collections of anisotropic Gaussian primitives rather than dense volumetric grids or explicit meshes. However, when applied directly to inverse rendering, 3DGS-based techniques can suffer from misaligned normal directions on reconstructed surfaces and spurious artifacts in recovered material maps, limiting their practical utility.MethodIn this work, we propose a novel optimization framework that integrates multiview geometric consistency constraints with a weighted pool sampling strategy to address these limitations. Our approach comprises two complementary modules: a geometry enhancement pipeline that redefines surface normals through compressed Gaussian primitive analysis and enforces consistency across multiple camera views, and a material-lighting decomposition pipeline that accelerates direct illumination computation via a dynamic weighted sampling pool while employing multiple importance sampling for indirect lighting. By iteratively coupling these two modules, we achieve state-of-the-art performance in geometric accuracy and photorealistic material recovery, all without sacrificing the rendering efficiency of 3DGS representation. The first step in creating the geometry enhancement module is to observe that raw 3DGS reconstructions define normal vectors solely on the basis of the local Gaussian covariance axes, which can diverge from true surface orientations when primitives exhibit near-spherical shapes or when sampling is sparse. To remedy this issue, we propose a shape-driven normal redefinition strategy; we compress each Gaussian primitive along its shortest axis and define the normal direction as pointing along that shortest axis toward the camera view direction. This shape-driven normal initialization is then refined by incorporating a multiview reprojection consistency term, in which we measure the discrepancy between its projections under all available camera poses and enforce alignment through a robust optimization objective for each 3D point on the reconstructed surface. This constraint not only corrects directional drift in the normals but also suppresses outlier primitives that would otherwise introduce noise or spurious geometry. For material and lighting recovery, direct illumination estimation remains a bottleneck in inverse rendering, especially for large-scale or complex outdoor scenes with high-resolution textures. We therefore design a weighted pool sampling strategy that maintains a small, adaptive set of representative light rays——each weighted by its contribution to the final pixel intensity——and updates this pool dynamically during optimization. 
By reusing and re-weighting these samples across iterations, our method rapidly approximates the direct lighting integral with minimal memory overhead. To capture indirect illumination——including soft shadows, inter-reflection, and global light transport——we incorporate a multiple importance sampling framework that blends light-source sampling, cosine-weighted hemisphere sampling for diffuse components, and GGX microfacet sampling for specular highlights. The combined strategy yields accurate shading and material estimates while maintaining a throughput that is competitive with forward-only 3DGS rendering.ResultWe validate our approach on three widely adopted public benchmarks: the TensoIR dataset, the DTU multiview stereo collection, and the TNT scene suite. On TensoIR, our geometry module reduces the average angular error of surface normals by 19.6% compared with the next best 3DGS-based inverse rendering method, according to standard evaluation protocols. On DTU and TNT, we achieve mesh reconstruction quality——measured by Chamfer distance and F1 score——that matches or exceeds the top-performing baselines while boosting overall scene processing speed by approximately 50%. For material recovery and novel view synthesis on TensoIR, material PSNR increased by 2.8% and synthesis PSNR increased by 0.08 dB over the second-best approach. Compared with NeRF methods, our full pipeline delivers an end-to-end speedup of roughly 60%, demonstrating that high-fidelity inverse rendering can be both accurate and efficient. To isolate the contributions of each design choice, we conduct an extensive ablation study.ConclusionDespite these advances, certain limitations remain. Our current implementation assumes static scenes and fixed lighting during capture, excluding dynamic object interactions or time-varying illumination. Highly reflective or translucent materials can still challenge the weighted pool sampler’s coverage, resulting in subtle bias in specular lobes. Furthermore, while our method scales well to moderately complex scenes, extremely large-scale outdoor environments may require hierarchical partitioning or out-of-core streaming for real-time performance. We thus envision future extensions that incorporate temporal coherence constraints for video-based inverse rendering, adaptive sampling strategies for specialized material classes (e.g., subsurface scattering), and integration with custom GPU kernels to push toward interactive rates. In summary, we present a comprehensive inverse rendering pipeline built on the 3DGS foundation, enriched by multiview geometric consistency and a novel weighted sampling approach for lighting decomposition. Our method bridges the gap between accuracy and efficiency, delivering superior geometry, material, and illumination recovery on benchmark datasets while maintaining high rendering throughput. This work paves the way for practical, scalable inverse rendering solutions in applications ranging from autonomous perception to immersive content creation.  
      关键词:inverse rendering;3D Gaussian splatting (3DGS);3D reconstruction;re-projection error;weighted reservoir sampling (WRS)
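      The shape-driven normal redefinition reduces, in the simplest reading, to taking each Gaussian's shortest principal axis and flipping it toward the camera. A small NumPy illustration under that assumption (not the authors' implementation, which additionally enforces multiview reprojection consistency):

```python
# Small numpy sketch (an illustration of the idea, not the authors' implementation) of the
# shape-driven normal: each Gaussian's normal is taken along its shortest principal axis and
# flipped so that it points toward the camera.
import numpy as np

def gaussian_normal(rotation: np.ndarray, scales: np.ndarray,
                    center: np.ndarray, cam_pos: np.ndarray) -> np.ndarray:
    """rotation: (3, 3) with principal axes as columns; scales: (3,) per-axis extents."""
    axis = rotation[:, np.argmin(scales)]        # shortest axis of the (compressed) Gaussian
    to_cam = cam_pos - center
    if np.dot(axis, to_cam) < 0:                 # flip so the normal faces the camera
        axis = -axis
    return axis / np.linalg.norm(axis)

if __name__ == "__main__":
    R = np.eye(3)                                # axis-aligned Gaussian
    normal = gaussian_normal(R, np.array([0.4, 0.5, 0.01]),
                             center=np.zeros(3), cam_pos=np.array([0.0, 0.0, -2.0]))
    print(normal)                                # [ 0.  0. -1.] -> shortest axis, facing camera
```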

      Remote Sensing Image Processing

    • A new advance in object detection for remote sensing images: the authors propose the CGR-OBB scheme, which effectively resolves the discontinuity of oriented bounding box representations, markedly improving detection accuracy and providing strong support for high-precision detection.
      Yao Rui, Li Yuman, Guo Haofan, Hu Wentao, Tian Xiangrui
      Vol. 31, Issue 3, Pages: 912-926(2026) DOI: 10.11834/jig.250172
      Continuous geometric parameter representation for arbitrarily-oriented objects in remote sensing images
      摘要:ObjectiveDetecting arbitrarily oriented and densely arranged targets such as ships and aircraft in remote sensing images presents significant challenges. Traditional horizontal bounding boxes (HBBs) contain excessive background noise, substantially lowering the intersection over union (IoU) metric and compromising localization accuracy. Current rotated object detectors suffer from discontinuous representations of oriented bounding boxes (OBBs), causing abrupt jumps in regression targets that limit detection precision. Current approaches to address OBB discontinuity typically follow three directions. First, improved loss functions can mitigate gradient conflicts in angle regression but ultimately fail to eliminate parameter jumps at periodic boundaries. Second, angle encoding methods effectively handle rotational discontinuity yet remain inadequate for aspect ratio variations. Third, and most fundamentally, novel OBB representations introduce continuous parameters through encode-decode modules, resolving the discontinuity issue at the representation level itself. However, current OBB representations have limitations, including incomplete continuity constraints, decoding ambiguity for special-shaped targets (square/extreme-aspect-ratio), and irregular parameter variations that restrict accuracy. To address these issues, this paper proposes a continuous geometric representation of OBBs (CGR-OBB) with rigorous continuity metrics to achieve high-precision detection.MethodCGR-OBB constructs a nine-dimensional continuous representation space for OBBs on the basis of geometric continuity constraints, incorporating three key components: position parameters, area factors, and the aspect ratio parameter. The construction of these parameters strictly adheres to continuity metrics covering object rotation/aspect ratio continuity, loss rotation/aspect ratio continuity, and decoding completeness/robustness. First, the position parameters are formed by combining the distance offsets of target corners relative to the midpoints of HBB edges with the center point, width, and height of the HBB, enabling rapid OBB localization. Subsequently, the area factors are computed using the upper-right region areas obtained from the OBBs’ partitioning of their circumscribed HBBs. These factors are then employed to determine the OBBs’ vertex positions, effectively avoiding decoding ambiguity. Finally, the aspect ratio parameter of the OBB is applied to constrain its geometric shape, ensuring that the decoded OBB tightly encloses the target object. For high-aspect-ratio target detection, CGR-OBB enhances the detector’s modeling capability by optimizing area factor definitions and performing logarithmic transformation on aspect ratio regression targets. CGR-OBB is implemented as a plug-and-play module with an encoder-decoder architecture. This modular component can be seamlessly integrated into mainstream detection frameworks such as Faster R-CNN and RetinaNet. During training, ground-truth bounding box annotations are transformed by the encoder module into a nine-dimensional continuous parameter representation. These parameters are then standardized according to regression parameter definitions and jointly computed with the detector’s regression targets for loss calculation. During testing, the regression targets are decoded into bounding box parameters through the decoder module. 
The decoded bounding boxes are then quantitatively evaluated against ground-truth annotations, with simultaneous visualization of detection results on the input images. Result: Experimental results demonstrate CGR-OBB's effectiveness through comprehensive validation. First, continuity tests show that baseline detectors equipped with CGR-OBB exhibit reduced loss jumps and faster convergence, proving better training stability and efficiency. Second, unified benchmark tests on four detectors confirm accuracy improvements: on HRSC2016, mAP@75 increases by 34.9% for Rotated Faster R-CNN, 1.8% for Oriented R-CNN, 9.6% for RoI Transformer, and 1.0% for ReDet; on DOTA, the improvements are 2.68% for Rotated Faster R-CNN, 0.66% for Oriented R-CNN, 3.96% for RoI Transformer, and 4.31% for ReDet. Third, compared with other recent OBB representation methods (DHRec and COBB), CGR-OBB shows greater improvements across mAP@50, mAP@75, and mAP@50:95, averaging 2.08% over DHRec and 1.41% over COBB for RoI Transformer, and 1.69% over DHRec and 1.79% over COBB for ReDet. Finally, in DOTA's mAP@50 tests against mmrotate's top detectors (including Gliding Vertex and R3Det + KFIoU), CGR-OBB achieves the best average accuracy while ranking first in eight out of 15 categories (including extreme-aspect-ratio and square targets) and second in five others, demonstrating excellent multiscale target detection capability. Conclusion: The proposed CGR-OBB fundamentally addresses OBB discontinuity at the representation level, delivering significant improvements in detection accuracy. It constructs a continuous representation space composed of nine parameters, including position parameters, area factors, and the aspect ratio parameter. These parameters vary continuously and predictably during regression, effectively eliminating sudden jumps in regression targets. The encode-decode structure ensures a strictly reversible mapping between the OBB and its continuous representation, with the area factors resolving decoding ambiguities and the aspect ratio parameter enabling fine adjustment for high-precision detection. Its robust performance is particularly evident in challenging scenarios with dense target arrangements and complex backgrounds, where it effectively reduces missed detections and false alarms for rotated object detectors. Furthermore, the method combines high precision with easy integration: as a plug-and-play module, it can be seamlessly incorporated into various mainstream detection frameworks, demonstrating strong practical value for rotated object detection applications. Future work will focus on further enhancing performance under lower IoU thresholds to achieve comprehensive improvement across both high- and low-precision metrics.
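The discontinuity that motivates CGR-OBB can be seen directly in the common five-parameter OBB encoding. The following minimal sketch (assuming a center/size/angle parameterization and numpy only; it is not the paper's nine-parameter encoder, and all names are illustrative) shows two almost identical boxes whose regression targets nonetheless differ sharply at the angular boundary:

```python
# Illustrative sketch only: it reproduces the regression-target jump at the angular
# boundary of the common (cx, cy, w, h, theta) OBB parameterization, i.e. the kind
# of discontinuity CGR-OBB is designed to remove. It does NOT implement the paper's
# nine-parameter encoder; all names and values here are hypothetical.
import numpy as np

def obb_corners(cx, cy, w, h, theta):
    """Return the 4 corner points of an oriented box (theta in radians)."""
    c, s = np.cos(theta), np.sin(theta)
    half = np.array([[w, h], [w, -h], [-w, -h], [-w, h]]) / 2.0
    rot = np.array([[c, -s], [s, c]])
    return half @ rot.T + np.array([cx, cy])

# Two physically almost identical boxes on opposite sides of the +/-90 degree boundary
box_a = (0.0, 0.0, 40.0, 10.0, np.deg2rad(89.5))
box_b = (0.0, 0.0, 40.0, 10.0, np.deg2rad(-89.5))

# Compare the corner point sets (a rectangle has 180-degree symmetry, so corners permute)
ca, cb = obb_corners(*box_a), obb_corners(*box_b)
d = np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=-1)
print(d.min(axis=1).max())            # small compared with the 40 x 10 box: nearly the same region

# ... yet the raw angle regression target jumps by roughly 179 degrees,
# which is the abrupt target change a continuous representation avoids.
print(np.rad2deg(box_a[4] - box_b[4]))   # ~179.0
```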
Keywords: oriented object detection; continuous representation; oriented bounding box (OBB); remote sensing image; deep learning
• New progress has been reported in cross-domain hyperspectral image classification: the authors construct the LRCT-CDFSL framework, offering a new solution to the distortion of ground-object distributions and class boundaries caused by spatial feature extraction in existing methods.
      Yang Lixia, Bao Yajun, Zhang Rui, Yang Shuyuan
      Vol. 31, Issue 3, Pages: 927-943(2026) DOI: 10.11834/jig.250020
      Lightweight Res-3D-CNN with Transformer embedding for cross-domain hyperspectral image classification
Abstract: Objective: Benefiting from their high-resolution spectral information and large-scale spatial information, hyperspectral images (HSIs) have demonstrated exceptional capabilities in numerous remote sensing applications. Over the past few decades, hyperspectral image classification (HIC) has attracted considerable research attention. Many machine learning-based HIC methods have been proposed; however, these approaches require a sufficient number of labeled samples to achieve ideal classification accuracy. Unfortunately, the high cost and effort associated with labeling HSIs often result in a scarcity of labeled data for many newly acquired HSIs. Therefore, researchers have introduced the cross-domain HIC (CD-HIC) mechanism, which uses a hyperspectral image with sufficient labeled samples (source domain) to assist in classifying a hyperspectral image with limited labeled samples (target domain). However, cross-domain classification is one of the major challenges in HIC owing to the feature and class distribution differences between the source and target domains. Cross-domain few-shot learning (CDFSL) methods, which integrate domain adaptation with few-shot learning, have been widely applied to the CD-HIC problem. Owing to the difficulty of spectral sequence encoding and the spectral similarity between classes, most existing CDFSL methods use convolutional neural networks (CNNs) or other strong spatial feature extractors to obtain spatial information, thereby improving classification accuracy. However, extracting spatial features often leads to distortion in the distribution of ground objects and their class boundaries. To address this issue, a lightweight Res-3D-CNN with embedded Transformer layers (LRCT) is designed for feature extraction in CD-HIC. LRCT effectively captures long-term dependencies of the spectrum while simultaneously extracting spatial information, thereby notably improving the performance of spectral feature-based methods. Method: In this study, a simple and effective deep learning network is proposed for feature extraction from HSIs. In CNNs, convolution (Conv) captures high-frequency image features by employing a weight-sharing mechanism within local receptive fields. In contrast, Transformers model long-range dependencies between features using self-attention mechanisms and adaptively focus on key areas. Moreover, the Transformer exhibits low-pass filtering characteristics, primarily capturing the low-frequency global information of images. Considering the complementary characteristics of Conv and Transformer, the Transformer layer is embedded into Res-3D-CNN to establish a lightweight dual-stream feature extraction network that performs feature extraction on the source and target domains. Furthermore, the CDFSL method is adopted to learn general information from the extracted features of the source class data, which helps predict the target class data with only very few or no labeled samples. This approach enables outstanding classification performance in the subsequent CD-HIC.
The LRCT-CDFSL method comprises four main components: 1) data preprocessing, which uses a mapping layer to combine the dimensions of the original HSIs; 2) feature learning based on LRCT, which employs the deep neural network to enhance the representation capability for HSI data; 3) few-shot learning, which enhances intra-class compactness and inter-class separability by calculating the Euclidean distance between labeled and unlabeled samples, thereby effectively adapting to scenes with limited samples; and 4) domain adaptation, which enables the feature extractor to generate highly generalized features, thereby improving the generalization capability of the model across different domains. Result: Using Chikusei data as the source domain and the Indian Pines, Salinas, and Pavia University data as the target domains, extensive experiments are conducted to validate the performance of the proposed LRCT-CDFSL model and ensure a fair comparison with advanced methods. The effectiveness of LRCT-CDFSL is tested with the Indian Pines, Salinas, and Pavia University datasets serving in turn as the target domain. The experimental results demonstrate that the LRCT-CDFSL method achieves faster and more accurate classification than existing methods. When only five labeled samples per class are available, the LRCT-CDFSL method achieves overall accuracy (OA) scores of 71.01%, 92.06%, and 84.14% on the respective target domain datasets. Compared with current mainstream cross-domain few-shot HIC methods, LRCT-CDFSL shows superior classification performance across the target domain datasets; specifically, it improves OA by 7.57%, 3.35%, and 2.77%, respectively, and reduces training time by 36%, 37%, and 30%, respectively. Conclusion: A deep transfer learning network, LRCT, is introduced by embedding a Transformer layer into a residual three-dimensional CNN (Res-3D-CNN). In Res-3D-CNN, a 1 × 1 convolution kernel is incorporated to adjust the number of channels in the feature map, reduce the model parameters, and accelerate training. In addition, 1 × 1 convolutional kernels are introduced after the n × n convolutional kernels to expand the number of channels of the feature map, mitigating the reduction of the feature map caused by convolution. The network also performs well in few-shot learning and domain adaptation by using convolutional kernels of different scales for feature extraction. Meanwhile, the Transformer layer is used to extract the local-global semantic information of HSIs. Experimental results reveal that LRCT effectively captures spatial-spectral features by combining local and global information, fully representing the local-global semantic information of HSIs and enabling strong classification performance in the subsequent CD-HIC task. However, within the LRCT-CDFSL framework, the metric-based few-shot learning method, which emphasizes the relationships between samples, has not yet been fully explored. To further enhance performance in CD-HIC tasks, future studies may explore the integration of cross-attention learning techniques into the LRCT-CDFSL architecture, which is expected to improve the generalization capability and adaptability of the model across diverse domains.
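The few-shot component above hinges on Euclidean distances between labeled (support) and unlabeled (query) features. A minimal, generic sketch of that distance-based step is given below, assuming numpy-only toy features and a nearest-prototype rule; it illustrates the idea rather than the authors' LRCT-CDFSL implementation:

```python
# A minimal, generic sketch of the Euclidean-distance few-shot step described in the
# abstract (labeled support samples vs. unlabeled query samples). It is not the
# authors' LRCT-CDFSL code; feature dimensions and names are assumptions.
import numpy as np

def prototype_classify(support_feats, support_labels, query_feats):
    """Nearest-prototype classification with Euclidean distance.

    support_feats : (N_s, D) features of the few labeled samples
    support_labels: (N_s,)   integer class labels
    query_feats   : (N_q, D) features of unlabeled samples
    """
    classes = np.unique(support_labels)
    # one prototype per class = mean of its labeled features
    protos = np.stack([support_feats[support_labels == c].mean(axis=0)
                       for c in classes])                               # (C, D)
    # squared Euclidean distance from every query to every prototype
    d2 = ((query_feats[:, None, :] - protos[None, :, :]) ** 2).sum(-1)  # (N_q, C)
    # smaller distance -> higher probability (softmax over negative distances)
    logits = -d2
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return classes[d2.argmin(axis=1)], probs

# toy usage: 3 classes, 5 labeled samples each ("5-shot"), 64-D features
rng = np.random.default_rng(0)
sup = rng.normal(size=(15, 64)); lab = np.repeat(np.arange(3), 5)
qry = rng.normal(size=(10, 64))
pred, prob = prototype_classify(sup, lab, qry)
print(pred.shape, prob.shape)   # (10,) (10, 3)
```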
Keywords: hyperspectral image classification (HIC); cross-domain classification; few-shot learning (FSL); domain adaptation; residual 3-dimensional convolutional neural network (Res-3D-CNN); Transformer
• New progress has been reported in hyperspectral image super-resolution: the authors construct a dual-domain information fusion network that effectively addresses the spectral distortion and blurred spatial details found in existing fusion methods.
      Yang Huaiyuan, Yang Yong, Huang Shuying, Liu Ziyang, Zhang Long
      Vol. 31, Issue 3, Pages: 944-957(2026) DOI: 10.11834/jig.250325
      Dual-domain information fusion network for hyperspectral image super-resolution
Abstract: Objective: Hyperspectral image super-resolution (HISR) has become a crucial task in the field of remote sensing, aiming to reconstruct high-resolution hyperspectral images (HR-HSI) that possess fine spatial details and rich spectral information. Given the inherent limitations of remote sensing imaging sensors, directly acquiring images with simultaneously high spatial and spectral resolutions remains technically challenging. Typically, only high-resolution multispectral images (HR-MSI) with limited spectral bands or low-resolution hyperspectral images (LR-HSI) with abundant spectral information are obtainable. However, relying solely on either data source is insufficient to meet the demands of practical applications such as land cover classification, urban analysis, and environmental monitoring. Therefore, fusing HR-MSI and LR-HSI to generate high-quality HR-HSI is a promising direction. Although existing methods have made some progress, most current deep learning-based HISR approaches perform feature fusion primarily in the spatial domain. Such methods often overlook the inherent modality differences between HR-MSI and LR-HSI, leading to spectral distortion and spatial structure blurring in the reconstructed images. Although some recent studies have introduced frequency-domain information, existing frequency-domain fusion methods typically treat frequency features as auxiliary enhancements without deeply modeling their internal structure. To overcome these limitations, this paper proposes a novel HISR network that jointly models the interactions between HR-MSI and LR-HSI across both the spatial and frequency domains. The proposed method aims to mitigate modality differences, enhance feature complementarity, and effectively exploit multiscale and long-range dependency features to improve the accuracy, fidelity, and robustness of the reconstructed images. Method: To address the limitations of existing HISR techniques, we propose a novel dual-domain information fusion network (DDIF-Net), which simultaneously models spatial and frequency information in a unified architecture. The network first extracts multiscale features from the HR-MSI through a convolutional encoder path. Meanwhile, the LR-HSI is upsampled to the resolution of the HR-MSI and combined with it to produce shallow fused features. These features are fed into a U-shaped dual-domain fusion branch composed of a series of frequency-spatial feature fusion modules (FSFFMs). Each FSFFM integrates a frequency-feature injection block, which decomposes input features into amplitude and phase components via the Fourier transform. The amplitude components, representing spectral intensity, are enhanced through injection and modulation mechanisms, while the phase components, containing structural information, are fused and refined. The inverse Fourier transform then reconstructs the frequency-enhanced feature maps. To further strengthen the spatial structure, each FSFFM is followed by a spatial feature enhancement block that leverages attention mechanisms to model long-range dependencies and reinforce spatial coherence. In the decoder path, a multiscale feature fusion module (MFFM) aggregates features from multiple scales and progressively refines them to reconstruct the HR-HSI. The network is trained with an L1 loss function, ensuring spectral accuracy and spatial consistency.
The architecture is fully convolutional and optimized using the adaptive moment estimation optimizer with a two-stage learning rate decay strategy. Result: We evaluate DDIF-Net on three benchmark hyperspectral remote sensing datasets: Pavia Center, Botswana, and Chikusei. Compared with eight representative deep learning-based models (DARN, HSRNet, HyperPNN, SSFCNN, SSRNet, HyperDSNet, HyperRefiner, and FPFNet), our method consistently achieves superior performance. On the Pavia Center dataset, DDIF-Net improves the peak signal-to-noise ratio (PSNR) by 0.23 dB and reduces the spectral angle mapper (SAM) by 0.24 compared with the second-best method. On the Botswana dataset, our approach gains a PSNR improvement of 0.41 dB and reduces SAM by 0.12. On the Chikusei dataset, we observe a PSNR improvement of 0.37 dB and a SAM reduction of 0.01. Visual inspection of the reconstructed images further confirms our method's ability to preserve fine-grained texture and maintain spectral integrity, particularly in urban regions with complex spatial layouts. Moreover, we conduct ablation studies to demonstrate the contribution of each module within DDIF-Net. Removing the frequency-domain module or the multiscale fusion strategy results in significant performance drops, indicating their crucial role in the network's effectiveness. Finally, classification experiments using the k-means algorithm in the ENVI (Environment for Visualizing Images) software show that the fused HR-HSI generated by DDIF-Net yields the most accurate classification results among all methods, further validating the quality of its spatial-spectral fusion. Conclusion: This paper presents DDIF-Net, a dual-domain fusion network designed for hyperspectral image super-resolution, which integrates spatial and frequency information through a novel architecture composed of FSFFM and MFFM modules. The proposed network explicitly models amplitude and phase characteristics to bridge the modality gap between LR-HSI and HR-MSI, while enhancing spatial details via long-range attention mechanisms. Extensive experiments on multiple datasets confirm that DDIF-Net outperforms state-of-the-art methods in both objective metrics and visual quality, achieving significant gains in PSNR, SAM, and classification accuracy. The ablation studies validate the importance of dual-domain modeling and multiscale integration, highlighting the robustness and generalizability of our approach. In future work, we aim to optimize the network's computational efficiency and extend it to larger-scale satellite datasets, such as Sentinel-2 or the Gaofen series, to further explore its applicability in real-world remote sensing scenarios.
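The amplitude/phase handling described for the FSFFM can be illustrated with a small numpy sketch: FFT a feature map, blend its amplitude spectrum with that of a second branch, keep the phase, and invert. The blending weight and toy inputs below are assumptions for illustration; this is not the DDIF-Net module itself:

```python
# Minimal sketch of the amplitude/phase operation the FSFFM description relies on:
# FFT a feature map, separate amplitude and phase, blend the amplitude with that of
# a second branch, and reconstruct with the inverse FFT. The blending weight `alpha`
# and the two toy feature maps are assumptions; this is not the actual DDIF-Net code.
import numpy as np

def fuse_amplitude(feat_msi, feat_hsi, alpha=0.5):
    """Blend the spectral-intensity (amplitude) components of two feature maps
    while keeping the structural (phase) component of the first one."""
    F_msi = np.fft.fft2(feat_msi)
    F_hsi = np.fft.fft2(feat_hsi)
    amp_msi, phase_msi = np.abs(F_msi), np.angle(F_msi)
    amp_hsi = np.abs(F_hsi)
    # amplitude injection: convex blend of the two amplitude spectra
    amp_fused = (1.0 - alpha) * amp_msi + alpha * amp_hsi
    # recombine the blended amplitude with the original phase, then invert
    F_fused = amp_fused * np.exp(1j * phase_msi)
    return np.real(np.fft.ifft2(F_fused))

rng = np.random.default_rng(0)
a, b = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
out = fuse_amplitude(a, b, alpha=0.3)
print(out.shape)                              # (64, 64)
print(np.allclose(fuse_amplitude(a, a), a))   # identity when both inputs match
```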
Keywords: image fusion; remote sensing; deep learning; frequency domain information; image super-resolution
• Autonomous driving is expanding from urban roads to off-road environments, where accurately extracting traversable areas and building traversability maps is essential. The authors propose a sparse-kernel Bayesian prediction method that fuses multisource data and incorporates semantic information, effectively completing the missing information in sparse elevation maps and improving the accuracy of traversable area extraction.
      Zhong Ziwei, Shan Yunxiao, Zhou Xundao
      Vol. 31, Issue 3, Pages: 958-972(2026) DOI: 10.11834/jig.250193
      Bayesian inference-based terrain elevation map completion method for off-road environments
Abstract: Objective: Current research on autonomous vehicles primarily focuses on structured environments, such as self-driving on urban roads and highways, unmanned transport vehicles in warehouse logistics, and autonomous guidance robots in hospitals and banks. These scenarios are characterized by uniform environmental information and well-defined road boundaries, enabling autonomous vehicles to efficiently acquire road information, generate precise trajectory plans, and complete tasks safely and reliably. However, with continuous technological advancement, autonomous driving is gradually expanding into unstructured scenarios, including agricultural plant protection, field rescue, military reconnaissance, and field exploration. These environments often feature complex terrain, dynamic and variable conditions, and sparse perceptual information, and they lack clear boundaries between traversable and non-traversable areas. Moreover, different terrain characteristics have varying impacts on vehicle traversal, raising the requirements on the autonomous navigation capabilities of unmanned vehicles. Therefore, in autonomous navigation tasks, accurately extracting traversable areas in complex off-road environments and constructing traversability maps are crucial for the safe operation of autonomous vehicles. To extract traversable regions, autonomous vehicles need to use sensors such as cameras or LiDAR to acquire real-time environmental information, including obstacle locations and terrain features. These data are then processed to extract the traversable areas. Cameras can capture rich semantic information in unstructured environments, but their performance degrades under lighting variations. By contrast, LiDAR provides accurate three-dimensional positional information of the environment and is less affected by lighting conditions. However, LiDAR point clouds lack strong semantic information, making it difficult to clearly interpret terrain features, and the sparse nature of LiDAR together with environmental occlusions leads to widespread information gaps in existing elevation maps. To address the missing information in sparse elevation maps and improve the accuracy of the completed data, this paper proposes an elevation map completion method that combines multisource data fusion with a sparse-kernel Bayesian prediction approach incorporating semantic information. This method enhances the elevation map and enables the extraction of safer traversable areas. Method: The first step fuses monocular depth estimation data from cameras to fill gaps caused by LiDAR's near-range blind spots. The second step applies traversability condition judgments to obtain binary semantics (traversable and non-traversable), assigning semantic information to the elevation map; statistical probability judgments are then used to determine the semantics of locations with missing elevation information. The third step employs Bayesian sparse kernel inference, combined with the semantics of the missing locations, to predict and complete the elevation information at those locations. This process yields an accurate and refined elevation map, improving the precision of traversable area extraction. Result: To validate the effectiveness of the proposed multisource data fusion and semantic-aware sparse-kernel Bayesian prediction method for elevation map completion, multiple experiments were designed to test the efficacy of each module individually.
First, the real off-road datasets RELLIS-3D and TartanDrive2.0 were used as data sources, and a series of experiments was conducted. Comparative experiments against other completion methods were designed to verify the effectiveness of the proposed Bayesian inference-based elevation map completion method for off-road environments. The effectiveness, accuracy, and completeness of the different methods were evaluated using metrics such as error, accuracy rate, and completion rate. On the RELLIS-3D dataset, the information missing rate was reduced from 25.42% to 1.56%, with a mean error of 0.0455 m and an accuracy rate of 94.37%. On the TartanDrive2.0 dataset, the information missing rate was reduced from 65.16% to 25.15%, with a mean error of 0.103 m and an accuracy rate of 93.28%. Comparisons with existing methods across various off-road scenarios demonstrate the generality and effectiveness of the proposed completion method. To address the limitations of LiDAR-only completion methods, such as insufficient robustness and the inability to complete nearby grids, the proposed algorithm adopts a multisource data fusion approach that incorporates monocular depth estimation to achieve effective completion of nearby grids. A comparative experiment was designed to validate the accuracy and effectiveness of the monocular depth estimation network proposed in this paper. Experimental results demonstrate that this depth estimation algorithm achieves the smallest error, confirming its effectiveness. Conclusion: The proposed elevation map completion method, which integrates multisource data fusion and Bayesian sparse kernel prediction with binary semantics, not only obtains continuously varying elevation information in complex environments but also provides rich and accurate elevation data for cost computation in traversability cost maps, thereby achieving effective completion of elevation maps. To validate the method, this paper conducts accuracy verification and performance evaluation of each algorithmic module on real off-road environment datasets. Experimental results demonstrate that the proposed multisource data fusion module achieves the best performance in terms of both accuracy and completion rate. By systematically addressing issues such as elevation data sparsity, the proposed method effectively and accurately completes the missing information in sparse elevation maps, significantly improving the precision of traversable area extraction.
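To make the third step more concrete, the sketch below shows a generic sparse-kernel Bayesian estimate for a single missing grid cell: nearby observed elevations are weighted by a compactly supported kernel and combined with a weak prior to give a posterior mean and a confidence proxy. The kernel form, length scale, and prior are assumptions for illustration, and the binary semantic conditioning used in the paper is omitted:

```python
# A generic sketch of sparse-kernel Bayesian inference for filling one missing
# elevation cell from nearby observed cells. The compactly supported kernel,
# length scale `ell`, and prior pseudo-observation are assumptions for
# illustration; the paper additionally conditions on binary (traversable /
# non-traversable) semantics, which is omitted here.
import numpy as np

def sparse_kernel(d, ell):
    """Compactly supported kernel: positive only within radius `ell` (hence 'sparse')."""
    r = np.clip(d / ell, 0.0, 1.0)
    return (1.0 - r) ** 2 * (1.0 - r > 0)      # simple quadratic taper to zero

def complete_cell(query_xy, obs_xy, obs_z, ell=1.0, prior_z=0.0, prior_w=1e-3):
    """Kernel-weighted Bayesian estimate of elevation at `query_xy`.

    obs_xy: (N, 2) positions of observed grid cells
    obs_z : (N,)   observed elevations
    Returns (posterior mean, total kernel weight as a confidence proxy).
    """
    d = np.linalg.norm(obs_xy - np.asarray(query_xy), axis=1)
    w = sparse_kernel(d, ell)
    weight_sum = w.sum() + prior_w            # pseudo-count from a weak prior
    mean = (w @ obs_z + prior_w * prior_z) / weight_sum
    return mean, w.sum()

rng = np.random.default_rng(0)
pts = rng.uniform(0, 5, size=(200, 2))
z = 0.2 * pts[:, 0] + rng.normal(scale=0.02, size=200)   # gently sloping toy terrain
mu, conf = complete_cell((2.5, 2.5), pts, z, ell=0.8)
print(round(mu, 3), round(conf, 2))   # elevation near 0.5, with nonzero kernel support
```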
Keywords: elevation map; semantic sparse kernel Bayesian; completion method; off-road environment; traversable area identification