Latest Issue

    Vol. 31, Issue 2, 2026

      Review

    • Reviews research progress in augmented reality: the authors construct a panoramic framework of AR interaction technologies along the two dimensions of input and output, offering approaches to core challenges such as high multimodal-fusion complexity and privacy and security risks.
      Wang Yiwen, Zhao Xi
      Vol. 31, Issue 2, Pages: 349-373(2026) DOI: 10.11834/jig.250197
      Input-output modalities in augmented reality: a two-dimensional survey of human-computer interaction
      Abstract: Augmented reality (AR) technology has transcended the limitations of traditional 2D interaction by relying on 3D spatial perception and multimodal interaction mechanisms. By seamlessly integrating digital content with the physical environment, AR systems offer users immersive experiences. In AR environments, users can navigate physical spaces while interacting with virtual objects through intuitive and natural behaviors. Such a demand has positioned AR technology at the forefront of human-computer interaction (HCI) research, attracting considerable academic and industrial attention across multiple fields. The rapid increase in consumer-grade AR devices, including Microsoft HoloLens, Meta Quest Pro, and Apple Vision Pro, has accelerated the technology's adoption across various scenarios. This widespread deployment requires robust interaction frameworks that support AR's unique operational demands. Different interaction modalities present diverse characteristics. For instance, head/face-based interaction is suitable for scenarios in which users' hands are occupied, though it may offer limited precision. By contrast, hand-based manipulation offers greater flexibility and is well suited to tasks requiring precise control. Each interaction modality relies on specific hardware configurations. Applications employing bare-hand interaction must consider camera placement to ensure complete hand tracking for accurate interaction. Meanwhile, applications utilizing gaze interaction require careful consideration of eye-tracking apparatus installation. Thus, selecting appropriate interaction methods on the basis of task requirements while designing the corresponding hardware support constitutes one of the key challenges hindering the widespread adoption of AR technologies. Unlike traditional computing environments, AR interactions occur in dynamic 3D spaces and engage multiple human sensory channels simultaneously. Consequently, effective AR systems must integrate diverse input modalities (e.g., hand gestures, voice commands, and eye tracking) with appropriate output feedback systems (e.g., haptic responses, spatialized audio, and olfactory cues). However, the existing research landscape reveals significant gaps in addressing these complex requirements. Current studies predominantly focus on isolated input modalities, such as gesture recognition algorithms or eye-tracking precision, or on narrow application-specific implementations (e.g., museum guides or classroom applications). While these investigations provide valuable domain-specific insights, they fail to establish comprehensive frameworks that integrate multimodal information. More critically, the majority of AR HCI literature emphasizes input mechanisms while neglecting nonvisual feedback channels. This limitation results in a fragmented understanding of cross-sensory interaction paradigms, leaving critical questions regarding multimodal synchronization, sensory bandwidth constraints, and perceptual congruence unanswered. We propose a dual-dimensional analytical framework based on input and output modalities for AR interaction technologies. Through a comprehensive review of existing AR interaction paradigms, including their underlying principles, recent developments, core applications, and functional characteristics, we construct an integrated conceptual map to support future research.
The input modalities are classified into speech recognition and gesture-based inputs, with the latter further subdivided into eye-, head motion-, and body movement-based interaction technologies on the basis of anatomical engagement. Output modalities are classified in accordance with the human sensory channels into vision, hearing, touch, smell, and taste. Our analysis reveals four fundamental challenges in contemporary AR interaction systems: 1) multimodal fusion complexity: real-time sensor fusion imposes significant computational demands. For instance, integrating high-frequency gesture tracking with voice recognition often exceeds the capabilities of current edge computing systems, resulting in perceptible visual latency. 2) Privacy-security concerns: AR devices equipped with cameras and microphones raise concerns about surveillance and data exposure. Meanwhile, neural interface technologies introduce new challenges surrounding the protection of neural data and cognitive privacy. 3) Input-output asymmetry: a disproportionate development focus on input systems has created feedback channels incapable of matching the richness of user actions (e.g., advanced gesture recognition paired with rudimentary vibration feedback). 4) Interdisciplinary fragmentation: gaps between materials science (flexible electronics), neuroscience (multisensory perception), and computer engineering (rendering-pipeline optimization) hinder comprehensive system-level optimization and innovation. Addressing these challenges requires deep integration across traditionally siloed disciplines. Materials science must advance the development of flexible, biocompatible sensors; cognitive neuroscience should quantify thresholds for effective multisensory integration; computer science needs to create adaptive algorithms for latency compensation; and computer ethics must establish robust frameworks for governing neural data. Only through such synergistic efforts can AR evolve beyond its current role as an interface enhancement tool into a transformative platform for human-machine cognitive collaboration. In summary, the proposed framework offers dual value to AR researchers and developers, particularly newcomers. On the one hand, it organizes fragmented technologies into a coherent structure, accelerating learning and helping readers navigate complex technical domains while identifying key research questions. On the other hand, through critical analysis of technical bottlenecks, such as cross-modal latency and limited sensory bandwidth, it highlights underexplored areas and inspires innovative solutions, avoiding redundant effort and low-level repetition of existing work.
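For readers who want a concrete handle on the two-dimensional framework, the sketch below encodes the input and output taxonomy described in this abstract as a plain Python data structure; the category names come from the abstract, while the nesting and the helper function are purely illustrative.

```python
# Minimal sketch of the survey's two-dimensional AR interaction taxonomy.
# Category names follow the abstract; the dictionary layout itself is illustrative.
AR_INTERACTION_TAXONOMY = {
    "input": {
        "speech": ["voice commands"],
        "gesture": {                      # subdivided by anatomical engagement
            "eye": ["gaze pointing", "eye tracking"],
            "head": ["head motion"],
            "body": ["hand gestures", "body movement"],
        },
    },
    "output": {                           # organized by human sensory channel
        "vision": ["visual overlays"],
        "hearing": ["spatialized audio"],
        "touch": ["haptic feedback"],
        "smell": ["olfactory cues"],
        "taste": ["gustatory cues"],
    },
}

def list_modalities(side: str) -> list[str]:
    """Flatten one dimension ("input" or "output") into a list of modality names."""
    def walk(node):
        if isinstance(node, dict):
            for child in node.values():
                yield from walk(child)
        else:
            yield from node
    return list(walk(AR_INTERACTION_TAXONOMY[side]))

if __name__ == "__main__":
    print(list_modalities("output"))  # ['visual overlays', 'spatialized audio', ...]
```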
      Keywords: augmented reality (AR); human-computer interaction; multi-modality; input modality; response
    • This article outlines the importance of defect detection in modern manufacturing, notes the limitations of traditional 2D inspection methods, highlights the advantages and necessity of 3D defect detection, points out the shortcomings of existing surveys, and explains the significance of a comprehensive review of 3D defect detection techniques and of its discussion of future research directions.
      Wang Congrong, Chen Yajun, Liu Shanhui, Chen Dongle
      Vol. 31, Issue 2, Pages: 374-390(2026) DOI: 10.11834/jig.250111
      Review of the progress in 3D defect detection methods
      Abstract: With the rapid development of modern manufacturing, consumers and producers have increasingly higher demands for product quality and safety. However, various defects such as dents, cracks, and bubbles inevitably occur during production, which may directly affect product performance. Therefore, early detection and accurate identification of defects are crucial for ensuring product quality and improving production efficiency. Traditional 2D defect detection methods only provide surface information of objects, making them far less effective at detecting deep or internal defects. Additionally, factors such as lighting, texture, and shadows can significantly influence the stability and robustness of 2D images in complex environments. In recent years, 3D defect detection technology has gradually become a research hotspot. By utilizing 3D data, such as point clouds and depth maps, it provides more comprehensive and accurate information, capturing the geometric shape, spatial distribution, and depth of objects, thereby enhancing detection accuracy and reliability. However, there is currently no systematic and comprehensive review article that addresses 3D defect detection technologies. This study aims to fill this gap by providing a thorough review of existing 3D defect detection methods and exploring future research directions. This paper first introduces the background of defect detection, highlighting the limitations of 2D defect detection technologies in detecting complex 3D objects and deep defects, and emphasizes the advantages of 3D defect detection technologies. Traditional 2D methods, which rely on image features such as texture analysis, edge detection, and morphological processing, have achieved certain success in surface defect detection. However, they struggle with complex 3D objects and internal defects because of their reliance on 2D images. By contrast, 3D defect detection technologies leverage point clouds and depth maps to provide a more comprehensive analysis of surface and internal structures, making them more suitable for modern industrial applications. Subsequently, this paper systematically summarizes the current research status of 3D defect detection from two perspectives: traditional methods and deep learning methods. Traditional 3D defect detection methods are mainly divided into three categories: methods based on local features in point clouds, methods based on point cloud registration, and methods based on point cloud segmentation. Methods based on local features in point clouds detect defects by extracting local features such as depth, area, and slope, combined with threshold settings. These methods exhibit robustness and computational efficiency but make limited use of global contextual information. For example, they can effectively detect common industrial defects like dents and protrusions but may miss defects that require a global understanding of the object's structure. Methods based on point cloud registration align target point clouds with model data and identify defects by comparing differences. Their performance depends on the accuracy and efficiency of the registration algorithm. While these methods excel in utilizing global information and shape consistency, they face challenges in computational complexity and error accumulation. Methods based on point cloud segmentation detect defects by segmenting point clouds and extracting regional features. They excel in local feature extraction and fine-grained localization but may lose global information.
These methods are computationally efficient in the preprocessing stage but may suffer from segmentation errors and parameter sensitivity. Deep learning-based 3D defect detection methods are divided into point cloud-based methods and multimodal fusion-based methods. Point cloud-based methods automatically learn point cloud features through neural networks, making them suitable for complex surfaces. However, they involve intricate data processing and require significant computational resources. For instance, these methods can handle irregular shapes and textures but may struggle with real-time applications owing to their computational demands. Multimodal fusion-based methods combine information from multiple modalities such as RGB images and point cloud data, significantly improving detection accuracy and robustness. Specific methods include teacher-student network-based methods, memory bank-based methods, reconstruction-based methods, and methods leveraging contrastive language-image pretraining (CLIP). Teacher-student network-based methods achieve efficient feature learning through knowledge distillation or reverse distillation, offering advantages such as model lightweighting and high detection sensitivity. However, their performance heavily depends on the quality of the teacher network and the precision of feature matching. Memory bank-based methods enable rapid detection by storing multimodal features but incur high storage costs. Reconstruction-based methods analyze defect locations through reconstruction errors but exhibit low sensitivity to minor defects. These methods leverage the spatial and texture features of 3D data but require complex model training and high-quality reconstruction networks. CLIP-based methods leverage cross-modal characteristics for zero-shot detection, offering strong adaptability but requiring high computational costs. They are promising for their ability to generalize across different defect types but may need task-specific optimization to improve performance. Moreover, this paper summarizes commonly used public datasets in the field of 3D defect detection, including the MVTec 3D Anomaly Detection (MVTec 3D-AD), Eyecandies, Real 3D-AD, Play-Doh-based PD-REAL, Anomaly ShapeNet, and Multi-pose Anomaly Detection datasets. These datasets provide various samples and annotations, enabling researchers to train, validate, and evaluate their methods effectively. This paper also introduces evaluation metrics such as the image-level area under the receiver operating characteristic curve (I-AUROC), the pixel-level area under the receiver operating characteristic curve (P-AUROC), and the area under the per-region overlap curve (AUPRO), along with their definitions and calculation methods. I-AUROC measures a model's ability to distinguish between normal and defective images at the image level, while P-AUROC evaluates pixel-level defect localization. AUPRO focuses on the overlap between predicted and actual defect regions, providing a more detailed assessment of detection performance. Finally, this study analyzes the current challenges in 3D defect detection, including data quality, large-scale data processing, real-time performance, and the integration of domain knowledge with algorithms. Data quality remains a critical issue, given that noisy or incomplete data can significantly affect detection accuracy. Large-scale data processing is another challenge, especially in industrial applications where real-time performance is essential.
The integration of domain knowledge with advanced algorithms can enhance detection performance but requires interdisciplinary collaboration. This study further explores future research directions, such as cross-modal information fusion, the application of augmented reality and virtual reality technologies, and the optimization of real-time performance and efficiency. Cross-modal fusion can improve detection accuracy by combining complementary information from different data sources. Augmented reality and virtual reality technologies can enhance defect visualization and interaction, providing intuitive solutions for industrial applications. Real-time performance optimization is crucial for deploying 3D defect detection systems in production lines, where efficiency and speed are paramount. In conclusion, 3D defect detection is a critical and challenging task with broad applications in various fields. This paper provides a comprehensive review of traditional and deep learning methods in 3D defect detection, summarizes existing datasets and evaluation metrics, and discusses current challenges and future directions. By addressing these challenges and exploring new research avenues, this study aims to inspire further advancements in 3D defect detection technologies, ultimately contributing to improved product quality and manufacturing efficiency.  
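As a concrete illustration of the image- and pixel-level metrics named in this abstract, the following hedged sketch computes I-AUROC and P-AUROC with scikit-learn on synthetic scores; AUPRO, which integrates per-region overlap over false-positive rates, is omitted here for brevity.

```python
# Illustrative computation of the image- and pixel-level AUROC metrics (I-AUROC,
# P-AUROC) using scikit-learn; labels, scores, and masks below are synthetic toys.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Image level: one anomaly score per image, label 1 = defective.
image_labels = np.array([0, 0, 1, 1, 0, 1])
image_scores = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9])
i_auroc = roc_auc_score(image_labels, image_scores)

# Pixel level: per-pixel anomaly maps are flattened against ground-truth defect masks.
gt_masks = rng.integers(0, 2, size=(6, 64, 64))                 # toy ground-truth masks
score_maps = gt_masks * 0.7 + rng.random((6, 64, 64)) * 0.3     # toy anomaly score maps
p_auroc = roc_auc_score(gt_masks.ravel(), score_maps.ravel())

print(f"I-AUROC: {i_auroc:.3f}  P-AUROC: {p_auroc:.3f}")
```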
      Keywords: defect detection; computer vision; 3D vision; deep learning; multimodal; review
    • Reviews research progress on ultrasound tongue imaging in linguistics and related fields: deep learning has been used to tackle key difficulties in ultrasound tongue imaging processing, opening new directions for speech engineering applications.
      Zhang Jinxi, Zhang Kehong, Yu Hongzhi
      Vol. 31, Issue 2, Pages: 391-432(2026) DOI: 10.11834/jig.250015
      Advances in deep learning techniques for ultrasound tongue imaging processing
      摘要:Ultrasound imaging devices enable dynamic visualization and recording of articulatory physiology, specifically the tongue, during speech production. The processing and analysis of ultrasound tongue imaging (UTI) data hold significant importance for practical linguistics, experimental phonetic research, and speech engineering applications. Within these domains, the dynamic analysis of tongue movement trajectories and the quantitative analysis of lingual posture based on tongue contour extraction constitute critical technical challenges in practical implementation. However, UTI analysis faces substantial difficulties due to inherent noise characteristics, such as image blurring and indistinct tongue boundaries. These challenges manifest primarily as obstacles in accurately extracting key lingual positional features and reliably tracking dynamic tongue contours throughout speech sequences. Deep learning (DL) techniques, leveraging their powerful capacity for hierarchical feature learning and representation, have demonstrated considerable success in advancing UTI processing and analysis. Significant progress has been achieved in several core areas, including the automated extraction of tongue contours from noisy ultrasound frames, the construction of sophisticated models representing complex tongue motion dynamics, and the facilitation of deeper integration into advanced speech engineering applications. This comprehensive overview details the fundamental principles of medical ultrasound imaging and its specialized adaptation for lingual visualization (i.e., UTI). It systematically examines the persistent challenges encountered in UTI data processing. Furthermore, it concisely outlines the methodological foundations of DL, encompassing key architectures like convolutional neural networks, recurrent neural networks, and their variants (e.g., long short-term memory networks), which are particularly suited for spatial and spatiotemporal data analyses. The core of this review is that it synthesizes and critically evaluates the modeling mechanisms and specific applications of DL techniques across various tasks within UTI processing and analysis. Key application areas include, but are not limited to the following: ultrasound-video synchronization: precise temporal alignment of ultrasound image sequences with concurrently recorded video (e.g., lip movement) and acoustic signals for multimodal analysis. Biometric identification: exploiting unique individual tongue movement patterns derived from UTI for speaker recognition or verification systems. Multimodal learning: integrating UTI data with complementary modalities such as audio signals, electromagnetic articulography, or visual speech information (video) to build robust and comprehensive models of speech production using fusion techniques. Silent speech interfaces (SSIs): enabling speech recognition or synthesis solely from articulatory movements captured by UTI, which is particularly valuable in noisy environments or for individuals with voice impairments. Acoustic-articulatory inversion modeling: predicting the underlying articulatory configurations (represented by UTI data) from the acoustic speech signal or synthesizing speech from articulatory data. Language learning and speech rehabilitation: providing visual biofeedback of tongue positioning and movement to second language learners aiming to master new phonemes or to individuals undergoing therapy for speech sound disorders (e.g., apraxia and dysarthria). 
Innovative interaction and artistic expression: exploring novel human-computer interaction paradigms and artistic performances utilizing real-time tongue movement tracking. For each application domain, this review systematically delineates the underlying theoretical frameworks, specific technical methodologies employed (detailing network architectures, input representations, loss functions, and training protocols), and established performance evaluation metrics used to assess the efficacy of the proposed DL solutions. Common metrics include contour extraction accuracy (e.g., Dice coefficient and Hausdorff distance), tracking error, recognition accuracy, synthesis quality (e.g., Mel cepstral distortion), and inference speed. Finally, this review critically examines the current state of the field and outlines compelling research prospects and future trends for DL in UTI processing and analysis. These aspects include the following: development of more efficient and lightweight models: enabling real-time processing on mobile or embedded devices for practical applications like wearable SSI or biofeedback tools. Enhanced robustness and generalization: creating models that perform reliably across diverse speakers, accents, imaging devices, and recording conditions, potentially through advanced domain adaptation, data augmentation, or self-supervised/semisupervised learning paradigms. Explainable artificial intelligence (AI) for UTI: moving beyond black-box predictions to develop interpretable DL models that provide insights into how articulatory features are learned and utilized, fostering trust and deep linguistic understanding. Integration with advanced speech synthesis (vocoding): combining high-quality articulatory-to-acoustic inversion models with state-of-the-art neural vocoders for natural-sounding speech synthesis from UTI. Large-scale, multispeaker UTI datasets: facilitating the training of generalizable and powerful models through collaborative efforts to create and share comprehensive, annotated datasets. Exploration of self-supervised and unsupervised learning: reducing the heavy reliance on manually annotated UTI data by leveraging the inherent structure within unlabeled ultrasound sequences. Cross-lingual and low-resource adaptation: developing techniques to effectively apply models trained on data-rich languages to under-resourced languages with limited UTI data. Combining UTI with other biosignals: investigating fusion with neural signals (e.g., EEG) or other physiological data for next-generation brain-computer interfaces or advanced speech rehabilitation monitoring. Future progress hinges on sustained interdisciplinary collaboration between linguists, speech scientists, computer vision researchers, machine learning experts, and engineers. Beyond technical advancements, the ethical deployment and accessibility of UTI-DL systems warrant significant attention. Ensuring data privacy in biometric applications, addressing algorithmic bias across diverse populations, and developing cost-effective solutions for clinical and educational settings are crucial for responsible innovation. Furthermore, establishing standardized evaluation benchmarks and open-source frameworks will accelerate reproducibility and community-driven progress. 
The continued integration and innovation in computer science, particularly in AI and machine learning, alongside advancements in ultrasound imaging technology, are anticipated to significantly propel scientific research and disciplinary development within experimental phonetics and linguistics. DL stands as a pivotal catalyst, unlocking the full potential of UTI as a powerful tool for understanding speech production, developing novel technologies, and enhancing human communication and well-being.  
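The contour-extraction metrics mentioned in this review (Dice coefficient and Hausdorff distance) can be computed as in the following toy sketch; the masks and contour points are synthetic, and SciPy's directed_hausdorff is used for the distance.

```python
# Toy illustration of two contour-extraction metrics cited in the review
# (Dice coefficient and Hausdorff distance); masks and contour points are synthetic.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice overlap between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

pred_mask = np.zeros((64, 64), bool); pred_mask[20:40, 10:50] = True
gt_mask   = np.zeros((64, 64), bool); gt_mask[22:42, 12:52] = True
print("Dice:", round(dice(pred_mask, gt_mask), 3))

# Symmetric Hausdorff distance between predicted and reference contour point sets.
pred_pts = np.argwhere(pred_mask)
gt_pts   = np.argwhere(gt_mask)
hd = max(directed_hausdorff(pred_pts, gt_pts)[0],
         directed_hausdorff(gt_pts, pred_pts)[0])
print("Hausdorff distance (pixels):", round(hd, 2))
```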
      Keywords: deep learning (DL); ultrasound tongue imaging (UTI); articulatory physiology of the tongue; speech engineering; speech recognition; speech synthesis

      Dataset

    • Presents research progress in military target detection: the authors build a synthetic dataset for tank detection from the drone (UAV) perspective, verify the effectiveness of synthetic data in alleviating the scarcity of combat imagery, and thereby support the operational use of weapon platforms.
      Wu Zhenglong, Lian Xinan, Zhao Zhongshi, Meng Qi
      Vol. 31, Issue 2, Pages: 433-447(2026) DOI: 10.11834/jig.250080
      ST1.0: a synthetic dataset for military tank detection in drone imagery
      摘要:ObjectiveThe continuous evolution of computer simulation technology has made synthetic data an effective solution for data-scarce domains. However, in military target detection—where battlefield samples are exceptionally sparse—the cross-domain generalization of models trained on synthetic data remains empirically unverified. This research aims to explore the mechanisms underlying the cross-domain generalization of synthetic data, with a particular emphasis on tank target detection from unmanned aerial vehicle (UAV) perspectives. A key contribution of this study is the development of the synthetic tank (ST) 1.0 dataset, which comprises three distinct subsets (ST1, ST2, and ST3), each containing 1 600 images. These subsets are differentiated by their data sources and exhibit varying levels of simulation precision and scenario realism. This tiered design enables the deconstruction of the “domain gap” to quantify the effect of different types of synthetic data on model generalization. Ultimately, this research seeks to provide quantitative and qualitative insights into how synthetic data can be leveraged to train object detectors for practical deployment in modern weapon systems.MethodTo rigorously evaluate the cross-domain generalization of models trained on synthetic data, this study proposes a multilayered validation framework and a structured experimental design. The three-level validation framework systematically assesses performance across domains: 1) synthetic domain: models are tested on reserved synthetic subsets (e.g., models trained on ST1 are evaluated on the ST1 test set), establishing a performance benchmark in a controlled “ideal” setting to quantify basic synthetic-domain capabilities. 2) Real domain: performance is evaluated using public real-world military tank datasets comprising 41 images, with 11 sourced from ImageNet and 30 from Open Images V7. This step quantifies the initial synthetic-to-real domain discrepancy, elucidating the challenges in transitioning from controlled synthetic environments to real-world scenarios. 3) Combat domain: models are evaluated on a proprietary dataset consisting of UAV-acquired imagery collected from combat areas. This dataset contains 1 000 images depicting complex combat scenarios and is designed to serve as a rigorous benchmark for assessing model generalization and battlefield adaptability. To mitigate architecture-specific biases, we evaluate three representative heterogeneous object detectors: cascade R-CNN is a two-stage anchor-based model that enhances proposal accuracy through cascaded regression and classification. YOLOv10 is a single-stage detector optimized for latency-sensitive applications while maintaining competitive precision. RT-DETR is a real-time Transformer-based architecture that leverages global self-attention to model long-range dependencies. Furthermore, we incorporate a contemporary state-of-the-art detector, YOLOv12, as an additional performance benchmark. The experimental design is structured around three research inquiries: 1) synthetic domain discrepancy: this inquiry analyzes how model performance varies when trained on ST1, ST2, or ST3 and evaluated using the three-level validation framework, directly assessing the influence of different synthetic domains on generalization. 
2) Mixed synthetic strategies: this aspect investigates the effects of training on mixed datasets, aiming to determine whether combining samples from multiple synthetic domains can enhance model robustness and generalization capabilities. 3) Data augmentation efficacy: this inquiry examines the effectiveness of using data from one synthetic domain to augment training on another. This approach systematically assesses whether cross-domain data augmentation can further improve generalization and mitigate domain-specific biases.ResultExperimental investigations provided insights into the intricate relationship between synthetic data and real-world model performance. First, across all evaluated architectures, model performance—as measured by average precision —declined significantly when moving from controlled synthetic environments to real-world scenarios and further to the highly unpredictable combat domain. This trend clearly highlights the pronounced effect of the domain gap. Second, among the three detectors, RT-DETR consistently exhibited greater cross-domain robustness, with performance degrading more gradually in real-world and combat domains in comparison with the other detectors. Third, the generalization capability of models can be substantially enhanced through the incorporation of high-quality synthetic data. Such data are characterized by three critical attributes: photorealistic fidelity, adaptability to target-domain attributes, and close alignment with the feature distribution of real-world data. These qualities collectively ensure that synthetic data not only visually resemble real data but also effectively capture the underlying characteristics and variability of the target domain, thereby facilitating improved model robustness and transferability across diverse operational scenarios. Fourth, model performance can be further improved by incorporating high-quality synthetic data through data mixing or augmentation strategies. These methods effectively reduce overfitting on dataset-specific characteristics by introducing data distributions that closely mirror those of the target domain.ConclusionResearch has demonstrated that synthetic data can partially alleviate the scarcity of real-world data in the military domain, providing critical support for the practical deployment of weapon systems. However, synthetic data continue to exhibit inherent limitations in cross-domain generalization. Future research should prioritize the collaborative construction of multisource heterogeneous datasets and the deep integration of domain generalization techniques while broadening the research scope to encompass a wider array of military target categories and model architectures. Furthermore, investigating the security boundaries of intelligent data fusion in decision-making processes is essential to establish more scientific and reliable methodologies for constructing military object detection datasets.  
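A schematic of the three-level validation protocol described in this abstract might look like the following sketch; load_detector, load_test_set, and evaluate_ap are hypothetical placeholders standing in for the training and evaluation machinery, not functions from the paper.

```python
# Schematic of the three-level validation protocol: a model trained on one synthetic
# subset (ST1/ST2/ST3) is scored on synthetic, real, and combat test sets.
# `load_detector`, `load_test_set`, and `evaluate_ap` are hypothetical placeholders.
from itertools import product

TRAIN_SUBSETS = ["ST1", "ST2", "ST3"]                # synthetic training domains
TEST_DOMAINS  = ["synthetic", "real", "combat"]      # three-level validation

def cross_domain_matrix(load_detector, load_test_set, evaluate_ap):
    """Return {(train_subset, test_domain): AP} for every combination."""
    results = {}
    for subset, domain in product(TRAIN_SUBSETS, TEST_DOMAINS):
        detector = load_detector(train_subset=subset)    # e.g. Cascade R-CNN / YOLOv10 / RT-DETR
        test_set = load_test_set(domain=domain)
        results[(subset, domain)] = evaluate_ap(detector, test_set)
    return results
```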
      Keywords: military tank detection; unmanned aerial vehicle (UAV); synthetic data; single-source domain generalization; operational application

      Image Processing and Coding

    • New progress in low-light image enhancement: the researchers propose a low-light image enhancement network (MCCNet) that combines multiple dilated convolutions with coordinate grouping, offering an effective solution to the uneven brightness, loss of detail, and color distortion found in images captured under low-light conditions.
      Sun Wanqian, Peng Chunyan, Zhang Xiaojuan
      Vol. 31, Issue 2, Pages: 448-464(2026) DOI: 10.11834/jig.250177
      Low-light image enhancement via a multiscale dilated convolution with coordinate grouping enhancement network
      摘要:ObjectiveImages captured under low-light conditions often suffer from various visual quality degradation issues, including uneven illumination, significant loss of structural and textural details, and severe color distortions. These issues not only compromise human visual perception but also pose serious challenges for high-level vision tasks such as object detection and recognition. While numerous low-light image enhancement (LLIE) techniques have been developed to address these problems, most existing methods primarily concentrate on improving brightness or contrast. As a result, they often overlook the critical aspect of accurate color restoration, leading to undesired artifacts such as color casts, overenhancement, or structural inconsistencies. Conventional approaches, such as Retinex-based models, enhance image brightness by decomposing images into reflectance and illumination components. However, this strategy tends to produce color distortions, particularly in extremely low-light scenarios in which the lack of information causes the reflectance component to be inaccurately estimated. Deep learning-based methods have emerged as a promising alternative given their ability to learn complex mappings between low- and normal-light domains. Nonetheless, many of these models are characterized by large network sizes, high computational costs, and suboptimal generalization to diverse illumination conditions, making them minimally suitable for practical deployment on resource-constrained platforms.MethodTo overcome the aforementioned challenges, we propose a novel LLIE framework named multiscale dilated convolution with coordinate grouping enhancement network (MCCNet). MCCNet is designed with a clear emphasis on brightness restoration and color correction, ensuring visually pleasing and natural-looking outputs. The architecture comprises two independently designed yet interactively trained branches: a color conversion branch and a detail enhancement branch. The color conversion branch operates in a hue-value-intensity (HUI) color space, which is more perceptually aligned with human vision than traditional RGB representations. We introduce a color adaptation factor that is dynamically learned on the basis of pixel-wise weight perception to effectively decouple and modulate the color and brightness components of images. This decoupling enables targeted enhancement of color fidelity without introducing additional artifacts or distortion. In the enhancement branch, we design a global grouped coordinate attention (GGCA) module to improve feature expressiveness and encourage effective cross-branch information exchange. GGCA selectively emphasizes meaningful features while suppressing irrelevant or noisy signals, which is especially beneficial for images with high levels of darkness and low signal-to-noise ratios. Furthermore, we propose a multiscale dilated fusion attention (MDFA) module that aggregates features across multiple dilation rates. This module enables the network to capture global context and fine-grained local structure, leading to precise brightness recovery and edge preservation. By fusing multiscale features, MDFA mitigates the typical blurring and oversmoothing problems that occur in conventional convolutional enhancement networks. The two branches of MCCNet work collaboratively to generate high-quality enhanced images. The color conversion branch ensures accurate chromatic consistency, while the enhancement branch restores luminance and structure details. 
This dual-branch strategy effectively addresses the common trade-off between brightness enhancement and color fidelity that limits many existing LLIE methods.ResultTo thoroughly evaluate the performance of the proposed MCCNet, we conduct comprehensive experiments on multiple public benchmarks, including LOLv1, LOLv2, and five unpaired real-world low-light datasets. In addition, we test the model on two extremely low-light subsets to assess its robustness under severe lighting degradation. MCCNet is compared against 15 state-of-the-art LLIE methods, including traditional models like RetinexNet and zero-reference deep curve estimation(Zero-DCE), as well as recent deep learning approaches such as enlightening generative adversarial networks for low-light image enhancement(EnlightGAN), unsupervised retinex network(URetinex-Net), generative diffusion prior for unified image restoration and enhancement(GDP), lightening diffusion for low-light image enhancement(LightenDiffusion), and CIDNet. Quantitative results demonstrate that MCCNet achieves superior performance in terms of peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM). Specifically, MCCNet improves PSNR by 1.57 dB compared with the best-performing baseline on the LOLv1 dataset. Our method also achieves the highest SSIM score, indicating better structural preservation and perceptual quality. Qualitative comparisons reveal that MCCNet produces images with more natural lighting, reduced noise, and vivid, realistic colors, even in complex scenes with significant shadows or saturation challenges. In terms of computational complexity, MCCNet contains only 1.98 million parameters, and FLOPs are limited to 8.06 G. This dramatic reduction in model size and computational load makes MCCNet highly suitable for deployment on edge devices, mobile platforms, or real-time applications where efficiency is critical.ConclusionThe proposed color space-based MCCNet effectively restores brightness and color, addresses color cast and artifacts, and excels at lighting adjustment and noise suppression, achieving high-quality visual enhancement. Moreover, the method fully considers the practical demands of computational complexity and model size. The MCCNet model contains only 1.98 million parameters and requires 8.06 G FLOPs. While achieving state-of-the-art performance, it significantly reduces model size and computational cost. In particular, compared with classic image enhancement methods and recent advanced algorithms, MCCNet reduces model parameters and computational load by approximately 96% and 46% on average, respectively, demonstrating an efficient and lightweight design.  
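As a rough illustration of the multiscale dilated fusion idea, the sketch below combines parallel dilated convolutions with a lightweight channel attention; it is an assumption-based approximation of such a block, not the authors' MDFA implementation.

```python
# Minimal sketch (not the authors' code) of a multiscale dilated fusion block in the
# spirit of MDFA: parallel dilated convolutions are aggregated with channel attention.
import torch
import torch.nn as nn

class MultiScaleDilatedFusion(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations
        )
        # Lightweight channel attention over the concatenated multiscale features.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels * len(dilations), channels * len(dilations), 1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        feats = feats * self.attn(feats)          # reweight each scale
        return x + self.fuse(feats)               # residual fusion

if __name__ == "__main__":
    y = MultiScaleDilatedFusion(32)(torch.randn(1, 32, 64, 64))
    print(y.shape)  # torch.Size([1, 32, 64, 64])
```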
      Keywords: low-light image enhancement; HVI color space; attention mechanism; coordinate grouping; dilated convolution
    • This article introduces the background and importance of blind source separation (BSS) and of its application to image processing, blind image separation (BIS), and points out the limitations of statistically constrained algorithms in complex image separation tasks.
      Gong Jiaxin, Xu Jindong, Sun Haoqin
      Vol. 31, Issue 2, Pages: 465-478(2026) DOI: 10.11834/jig.250230
      Dual-channel blind image separation based on a wavelet suppression interactive diffusion model
      摘要:ObjectiveBlind image separation (BIS) is the application of blind source separation in the domain of image processing. It refers to the inverse problem of simultaneously estimating and restoring multiple independent source images from a single observation image under conditions of unknown mixing mode and without prior knowledge of the source images. Typical applications include bad weather (rain, snow) removal and reflection/shadow layer separation. Traditional methods, represented by independent component analysis and its improved algorithms, rely on strong assumptions such as “statistical independence of source signals” and “non-Gaussianity,” which often fail for real images owing to strong feature correlation and nonlinear mixing. Although sparse coding and low-rank decomposition, which emerged later, introduced prior constraints, they remained difficult for characterizing complex textures and structures. In recent years, deep learning frameworks have gradually dominated: convolutional neural networks (CNNs) achieved efficient feature extraction with local receptive fields. Generative adversarial networks (GANs) circumvented explicit priors through adversarial training, and variants such as CycleGAN, Attention-GAN, and Transformer-GAN have been proposed successively, achieving significant progress on synthetic datasets. However, in complex real scenes, existing methods based on CNNs and GANs still exhibit insufficient performance in processing such mixed images. The reasons are as follows: 1) uncertainty in source feature distribution (features within the same source category vary significantly in shape, transparency, and scale, resulting in modeling difficulties); 2) complex image mixing (nonlinearity, crosstalk between channels, making the “reverse mapping” nonunique); and 3) irregular noise interference (sensor noise, compression artifacts, motion blur coupled with the source signal, further blurring the separable boundary). These factors make it difficult for models to characterize the complex and variable feature distribution of source images in real scenes. Consequently, under strong noise, nonlinear mixing, and highly coupled texture details, problems, including source image separation estimation bias, texture distortion, and artifact residue, arise. These problems negatively impact the effectiveness of image restoration. To address these challenges, this study proposes a novel dual-channel diffusion separation model (DCDSM). It leverages the powerful generative ability of diffusion models to handle complex mixed images effectively.MethodDCDSM consists of a dual-branch structure based on a conditional diffusion model. This model exploits the diffusion process to learn the feature distribution of the source images, enabling the reconstruction of the feature structure for initial separation. During the reverse denoising process of the dual-branch diffusion, the design of the wavelet suppression module (WSM) is grounded in the characteristic of mutual coupling noise between the two source images. The structure of the interactive dual-branch separation network enhances the separation of detailed information within mixed images. WSM is composed of two independent wavelet frequency-domain feature extraction networks (WFENs). A WFEN employs a two-dimensional discrete wavelet transform to process high- and low-frequency sub-bands in the time-frequency domain to obtain suppression information. 
In the time domain, an encoder-decoder structure is utilized to capture global contextual features and reconstruct local details, thereby enhancing the texture and edge information in the low-frequency sub-bands. In the frequency domain, a window-based frequency channel attention mechanism is introduced to process the high-frequency sub-bands. Finally, a two-dimensional wavelet inverse transform is employed to integrate the outputs from both branches to obtain the suppression information. Simultaneously, the noise output from the intermediate process of the other branch is subtracted pixel-wise from the suppression information to decouple the two source images, thereby further improving the model’s performance in image separation tasks.ResultDCDSM is validated through the construction of synthetic datasets from diverse application scenarios, encompassing rain removal, snow removal, and the simulation of complex mixtures. The experimental results are as follows: 1) in the tasks of rain and snow image restoration, DCDSM achieves quantitative metrics (PSNR/SSIM) of 35.002 3 dB/0.954 9 and 29.810 8 dB/0.924 3, respectively, demonstrating an average improvement of 1.257 0 dB/0.927 2 dB (PSNR) and 0.026 2/0.028 9 (SSIM) over the current state-of-the-art methods. 2) For the dual-blind separation of complex mixed images, the restored dual-source images exhibit significantly better texture fidelity and detail integrity compared with those from other methods, with PSNR and SSIM metrics reaching 25.004 9 dB and 0.799 7, respectively, surpassing those of the comparative methods by 4.124 9 dB and 0.092 6, respectively. Ablation experiments validate the effectiveness of the proposed modules and the interpretability and rationality of the selected hyperparameters.ConclusionThe experiments of DCDSM on rain, snow, and complex mixed datasets verify the effectiveness of the method in the task of dual-channel BIS. Experimental results show that the proposed method achieves the best subjective and objective indicators compared with other methods and solves the problems of residual rain/snow lines and blurred texture edge details in real complex separation scenes.  
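The wavelet sub-band manipulation underlying the WSM design can be illustrated with PyWavelets as below; the fixed damping factor stands in for the learned suppression information and is purely illustrative.

```python
# Toy sketch of the wavelet decomposition step behind the WSM idea: split an image into
# low/high-frequency sub-bands, attenuate the high-frequency ones, and reconstruct.
# The constant damping factor is a stand-in for the learned suppression, not the network.
import numpy as np
import pywt

image = np.random.rand(128, 128)                    # stand-in for a mixed observation

cA, (cH, cV, cD) = pywt.dwt2(image, "haar")         # 2D discrete wavelet transform
damp = 0.5                                          # placeholder for learned suppression
suppressed = pywt.idwt2((cA, (cH * damp, cV * damp, cD * damp)), "haar")

print(image.shape, suppressed.shape)                # (128, 128) (128, 128)
```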
      Keywords: blind image separation (BIS); diffusion model (DM); wavelet transform; Fourier transform; image restoration

      Image Analysis and Recognition

    • Proposes a frequency-domain-guided lightweight RGB-D semantic segmentation network that combines Transformer and CNN structures. The frequency-domain guidance improves cross-modal feature fusion and model compactness, markedly strengthening semantic alignment and information fusion efficiency in RGB-D semantic segmentation and offering a new approach to designing lightweight multimodal perception networks.
      Jia Di, Zhao Chen, Zhang Huaxiu, Song Huilun
      Vol. 31, Issue 2, Pages: 479-498(2026) DOI: 10.11834/jig.250212
      Lightweight RGB-D semantic segmentation network incorporating frequency-domain guidance
      摘要:ObjectiveSemantic segmentation has emerged as a fundamental task in computer vision, aiming to assign a semantic label to each pixel for fine-grained scene understanding. With the increasing demand for intelligent systems such as autonomous driving, indoor robotics, and augmented reality, the ability to accurately parse complex visual scenes has become crucial. While RGB images provide abundant texture and color information, they often fail under challenging conditions such as poor illumination, occlusion, or cluttered environments. Depth maps capture geometric and structural cues that complement RGB images, making RGB-D semantic segmentation an attractive solution. However, integrating these two modalities is difficult because of their intrinsic representation differences. Mainstream approaches often adopt dual-encoder architectures, which separately extract modality-specific features before fusion. Although effective in certain cases, these models suffer from high computational overhead, redundant parameters, and suboptimal feature alignment. The challenge, therefore, lies in designing a framework that achieves a strong balance between segmentation accuracy and model efficiency. To address these limitations, this study introduces a frequency-guided lightweight RGB-D semantic segmentation network, which incorporates frequency-domain modeling into multiple key components. By doing so, the proposed method not only enhances cross-modal semantic alignment but also ensures adaptability for deployment in resource-constrained scenarios.MethodThe proposed framework is constructed around three novel modules, each designed to tackle a specific limitation of conventional RGB-D segmentation. First, we introduce a frequency-guided prompt adapter (FPA). Spectral energy distributions are extracted and transformed into dynamic prompt vectors by applying a Fourier transform to the input features. These prompts are fused with query features to encourage consistent semantic alignment across network layers. Unlike static fusion methods, FPA adaptively strengthens cross-layer information propagation and enhances semantic coherence between RGB and depth modalities. Second, we propose a spectrum-guided dynamic convolution (SDC) module, which explicitly leverages frequency-domain decomposition. Input features are divided into low-, mid-, and high-frequency bands, corresponding to global structures, boundary details, and fine-grained textures, respectively. These frequency-specific features are then integrated with spatial-domain features through a gating mechanism. This design simultaneously addresses local detail preservation and global context modeling, enabling robust multiscale representation learning. Third, we design a multiscale frequency-aware agent attention (MFAA) module. Traditional attention mechanisms, though powerful, incur high computational cost when applied to dense feature maps. MFAA alleviates this issue by dynamically generating compact agent vectors from spectral features of the previous stage. These vectors serve as efficient proxies to capture global dependencies while preserving contextual consistency across scales. In this way, MFAA reduces redundancy while strengthening long-range semantic modeling. The overall architecture adopts an encoder-decoder structure, in which Transformer and CNN branches are jointly employed in the encoder, and the frequency-guided modules bridge the gap between local precision and global reasoning. 
The decoder reconstructs pixel-level predictions while preserving cross-modal semantic alignment.ResultExtensive experiments were conducted on multiple benchmark datasets to validate the effectiveness of the proposed framework. On the NYU Depth V2 and SUN-RGBD datasets, the network achieved 57.6% and 52.8% mIoU, respectively, using less than half the parameters of most state-of-the-art models. These results highlighted its superior efficiency-accuracy tradeoff. Beyond RGB-D tasks, we evaluated the network on the KITTI-360 dataset for RGB-L (RGB + LiDAR) semantic segmentation, where it demonstrated competitive performance with a mean IoU of 66.3%, surpassing recent baselines such as DFormer while maintaining a lightweight design. Generalization ability was tested on five RGB-D salient object detection datasets: NJU2K, NLPR, SIP, STERE, and DES. Across all four standard metrics (F-measure, E-measure, S-measure, and MAE), the proposed network consistently outperformed existing methods, achieving leading scores even under challenging conditions such as occlusion, reflective surfaces, or fine-grained object boundaries. Ablation studies on NYU Depth V2 confirmed the independent contributions and synergistic effects of each module: FPA improved cross-layer semantic propagation, SDC enhanced frequency-specific feature modeling, and MFAA strengthened multiscale global interactions. Importantly, the model achieved 46 frame/s inference speed on a single NVIDIA RTX 3090 GPU, making it suitable for real-time or resource-limited applications.ConclusionThis paper presents a novel frequency-guided lightweight RGB-D perception framework that systematically integrates frequency-domain modeling into prompt learning, dynamic convolution, and agent attention. The proposed design addresses key challenges of RGB-D semantic segmentation, including modality misalignment, computational inefficiency, and insufficient global reasoning. By harmonizing spectral and spatial cues, the network achieves high segmentation accuracy with significantly reduced parameters and computational costs. Experimental results across diverse datasets and tasks demonstrate not only its state-of-the-art performance but also its strong generalization to unseen modalities and scenarios. The contributions of this work are threefold: 1) an FPA that promotes consistent semantic alignment across layers, 2) an SDC module that enhances multiscale modeling through explicit frequency decomposition, and 3) an MFAA mechanism that efficiently captures global context at low cost. Collectively, these innovations form a compact yet powerful framework for RGB-D perception. Beyond segmentation, the proposed methodology provides insights for designing lightweight multimodal neural networks that achieve an excellent trade-off between accuracy and efficiency, offering a reference for future research on deployable perception models.  
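To make the frequency-guided prompt idea concrete, the following minimal sketch pools the spectral energy of a feature map into a prompt vector and injects it back into the spatial features; the module name, projection, and fusion rule are assumptions for illustration, not the paper's FPA code.

```python
# Minimal sketch (assumptions, not the paper's code) of a frequency-guided prompt in
# the spirit of FPA: per-channel spectral energy is projected into a prompt vector
# that is added back onto the spatial features.
import torch
import torch.nn as nn

class FrequencyPrompt(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(channels, channels)

    def forward(self, x):                            # x: (B, C, H, W)
        spectrum = torch.fft.fft2(x, norm="ortho")
        energy = spectrum.abs().mean(dim=(-2, -1))   # (B, C) spectral energy per channel
        prompt = self.proj(energy)                   # dynamic prompt vector
        return x + prompt[:, :, None, None]          # broadcast over spatial positions

if __name__ == "__main__":
    out = FrequencyPrompt(64)(torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```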
      Keywords: RGB-D image; RGB-D semantic segmentation; frequency-domain modeling; prompt tuning; dynamic convolution; agent attention
    • Proposes a defect detection method for locomotive coil springs based on an improved RT-DETR. Structural re-parameterization, a deformable attention mechanism, and cross-scale feature fusion effectively improve detection accuracy and efficiency, providing strong support for safe locomotive operation.
      Peng Zhenrui, Pei Zhibiao
      Vol. 31, Issue 2, Pages: 499-511(2026) DOI: 10.11834/jig.250240
      Defect detection method for locomotive coil springs integrating features and deformable attention
      摘要:ObjectiveAs critical components of a vibration reduction system, the primary and secondary coil springs of locomotives are vital to safe train operations. With the increasing demand for railway transportation, these springs are often subjected to alternating loads, environmental corrosion, and fatigue stress, making them highly susceptible to surface cracks. If not detected and addressed in time, such defects may lead to spring fractures, locomotive malfunctions, or even serious safety accidents. At present, the inspection of locomotive coil springs primarily relies on magnetic particle testing combined with manual visual inspection. While this approach can detect most apparent defects, fine or concealed cracks are often missed owing to dust interference, strong magnetic fields, and fluorescent reflection in the testing environment. Moreover, manual methods suffer from low efficiency and lack of consistency, failing to meet the demands of intelligent and highly reliable modern locomotive maintenance. Therefore, this study proposes a surface defect detection method for locomotive coil springs based on an improved real-time detection transformer (RT-DETR) model, aiming to leverage deep learning and computer vision technologies to enhance the automation and robustness of defect recognition, thereby ensuring spring operation safety and supporting intelligent maintenance.MethodThe RT-DETR framework was extensively enhanced in three key aspects to improve the detection capability for small surface defects under complex industrial conditions while ensuring real-time performance and model efficiency. First, in the backbone network, the original basic block was replaced with a re-parameterized partial convolution module (Rep-Pconv), constructed by applying structural reparameterization to partial convolution. This structure improves feature representation during training via a multibranch architecture and simplifies the inference phase through branch fusion, thus reducing computational cost while maintaining detection accuracy. Second, in the attention-based intrascale feature interaction module, the conventional multihead self-attention was replaced with a deformable attention(DA) mechanism. This modification allows the model to dynamically perceive spatial offsets and focus effectively on salient local regions, enhancing its ability to detect fine-grained defects in cluttered backgrounds. Third, a P2 detection layer was added to improve sensitivity to small targets. A lightweight cross-scale feature fusion module was also designed by incorporating scale-sequence feature fusion and a triple feature encoder, enabling efficient integration of shallow texture and deep semantic features across multiple scales.ResultComprehensive experiments were conducted on a self-constructed dataset of locomotive spring surface defects to validate the effectiveness and superiority of the proposed method. Results show that compared with the original RT-DETR, the improved model reduced the parameter count by approximately 54% and improves the mean average precision to 97.2%, representing a 2.8% increase. Additionally, precision and recall were improved by 0.8% and 1.2%, respectively, demonstrating enhanced performance in false positive suppression and missed defect reduction. 
Furthermore, comparative evaluations against mainstream lightweight object detection algorithms including YOLOv5s, YOLOv8s, YOLO11s, and YOLO12s revealed that the proposed method outperformed all baselines across key performance metrics.ConclusionThe improved RT-DETR-based defect detection algorithm proposed in this paper introduces Rep-Pconv, deformable attention mechanisms, and a lightweight cross-scale feature fusion module, effectively addressing the challenges posed by complex industrial environments. Experimental results confirm that the model not only significantly reduces computational complexity but also achieves superior detection performance on the custom locomotive spring defect dataset. The method exhibits strong advantages in detection accuracy, efficiency, and robustness, offering a reliable technical solution for intelligent defect detection. It lays a solid foundation for the future development of automated and intelligent defect identification systems in practical industrial applications.  
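The partial convolution that Rep-Pconv builds on can be sketched as below: only a fraction of the channels is convolved and the remainder passes through unchanged. The training-time multibranch structure and its inference-time branch fusion are omitted; this is an illustrative approximation, not the authors' module.

```python
# Sketch of a partial convolution (PConv-style) block of the kind Rep-Pconv builds on:
# only a subset of channels is convolved, the rest pass through untouched.
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    def __init__(self, channels: int, ratio: float = 0.25):
        super().__init__()
        self.conv_ch = max(1, int(channels * ratio))     # channels actually convolved
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1)

    def forward(self, x):
        head, tail = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat([self.conv(head), tail], dim=1)

if __name__ == "__main__":
    print(PartialConv(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```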
      Keywords: locomotive coil spring; defect detection; Real-Time Detection Transformer (RT-DETR); partial convolution (Pconv); structural re-parameterization (Rep); deformable attention mechanism; feature fusion
    • New progress in 3D single-object tracking: the authors propose a Siamese, motion-centric tracking method that effectively handles textureless and incomplete LiDAR point clouds, significantly improving tracking accuracy and opening a new direction for research in this area.
      Yang Yuxiang, Deng Yingqi, Gu Hongjie, Dong Zhekang, Zhang Jing
      Vol. 31, Issue 2, Pages: 512-524(2026) DOI: 10.11834/jig.250112
      SiamMo: Siamese motion-centric 3D object tracking
      Abstract: Objective: 3D single-object tracking (3D SOT) is of paramount importance in a wide array of applications, including autonomous driving, robotics, and intelligent security. The fundamental goal of 3D SOT is to localize a specific target across a sequence of point clouds, with the only given information being its initial status. Existing matching-based 3D SOT methods generally utilize certain forms of Siamese networks for feature extraction. After transforming the cropped target template and search area embeddings to the same feature space with a shared encoder, these methods enhance target-specific features with various appearance matching techniques, such as cosine similarity and cross attention. Although the Siamese matching-based paradigm has become a popular design in existing models, appearance matching has long suffered from issues with textureless and incomplete LiDAR point clouds. Beyond this paradigm, a motion-centric tracker, M2-Track, offers a new perspective for 3D SOT. It takes point clouds from two successive frames without cropping as input, explicitly modeling the relative target motion in a single-stream architecture and largely overcoming these challenges. However, because it fuses adjacent point clouds and processes them in a single-stream architecture, it lacks explicit target information from adjacent frames for accurate localization. To compensate for this deficiency, M2-Track requires additional segmentation and box refinement, which makes the training objective complex and results in cumulative errors. To this end, this study proposes a novel Siamese motion-centric tracking approach, dubbed SiamMo. Method: SiamMo adopts a simple single-stage tracking pipeline of Siamese feature extraction and motion modeling. To learn good features for point clouds of varying density, we first divide the nonuniform points into regular voxels, which exhibit reduced sensitivity to point count variations, thus mitigating the varying sparsity issue to some extent. Afterward, we present a top-down convolutional network based on a Siamese architecture to encode voxelized point clouds of successive frames into the same feature space. The network first uses sparse convolution to extract features in 3D space and then adopts dense convolution to extract features in 2D space in a bird's eye view. In contrast with the single-stream architecture of M2-Track, the Siamese architecture decouples feature extraction from temporal fusion, which enables it to extract more abundant and representative latent features while reducing information interference among successive frames. Subsequently, we design a spatiotemporal feature aggregation (STFA) module that integrates the encoded features at multiple scales for motion modeling. Intuitively, effective motion modeling necessitates rich representations at multiple scales, and integrating these multifaceted representations is expected to significantly enhance the network's capability to accurately localize targets with various motion patterns. Moreover, we introduce a box-aware feature encoding (BFE) module that injects explicit box priors into the motion features for prediction. It first encodes the bounding box size parameters of the object in the initial frame into a box-aware encoding with a multilayer perceptron (MLP). Then, BFE adds the box-aware encoding to the output feature from the STFA module. Finally, the combined feature is fed into an MLP to regress the relative target motion.
Despite being conceptually simple, our BFE can boost tracking performance, with negligible computation. In a nutshell, using neither segmentation nor box refinement, SiamMo achieves precise localization by directly inferring the relative target motion in a single-stage manner.ResultWe compare our model with several state-of-the-art tracking methods, including Siamese matching-based and motion-centric trackers on three public datasets, namely, Kitti, NuScenes, and WOD. The quantitative evaluation metrics comprise Precision and Success. Experimental results show that our model outperforms all other methods on Kitti, NuScenes, and WOD datasets. On the Kitti dataset, compared with the second-ranked method, our method increases the average success indicator by 4.7% and the average precision indicator by 4.9%. On the NuScenes dataset, the average success indicator is increased by 14.2%, and the average precision indicator is increased by 11.5%. On the WOD dataset, the average success indicator is increased by 2.9%, and the average precision indicator is increased by 5.4%. Our method also demonstrates strong robustness to sparsity and distractors on the difficult test subsets of the Kitti and NuScenes datasets. In addition, we report the efficiency of SiamMo, which is lightweight with only 0.82 GFLOPs and 14.6 M parameters. We record the average running time of all test frames for the Car category on the Kitti dataset to evaluate the computational efficiency of our method, which achieves 108 frame/second, including 4.2 ms for pre/processing point clouds and 5.0 ms for network forward propagation on a single NVIDIA 4090 GPU. Ablation experiments conducted on the Kitti and NuScenes datasets further verify the effectiveness of the Siamese architecture, STFA module, and BFE module. In Kitti’s Car category tracking task, replacing the single-stream architecture with the Siamese architecture results in a 3.8% increase in the average success and precision. When STFA leverages features across all scales, Kitti’s Car success and precision are improved by 2.2% and 2.6%, respectively, as opposed to relying solely on single-scale features. When the BFE module is added, Kitti’s Car success and precision are improved by 2.9% and 3.6%, respectively.ConclusionIn this paper, we propose a novel and simple Siamese motion-centric tracking approach, which avoids the vulnerable appearance matching process and does not require additional presegmentation and box refinement by adopting Siamese architecture and models target motion in a simple single-stage pipeline. Comprehensive experiments demonstrate that our model surpasses state-of-the-art methods on three challenging benchmarks while demonstrating excellent robustness and maintaining a high inference speed.  
      Keywords: 3D single-object tracking; LiDAR point clouds; Siamese network; motion estimation; deep learning
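To make the box-aware feature encoding step concrete, the following is a minimal PyTorch-style sketch under stated assumptions (module names, feature dimensions, and the 4-parameter motion output are illustrative choices, not the authors' released code): an MLP encodes the initial-frame box size, the encoding is added to the aggregated motion feature, and a second MLP regresses the relative target motion.

```python
import torch
import torch.nn as nn

class BoxAwareFeatureEncoding(nn.Module):
    """Hypothetical sketch of a box-aware feature encoding (BFE) head:
    inject the initial box size as a prior, then regress the relative
    target motion (dx, dy, dz, dtheta) in a single stage."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Encode the (w, l, h) box size of the target in the initial frame.
        self.box_mlp = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim),
        )
        # Regress the relative target motion between successive frames.
        self.motion_mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, 4),
        )

    def forward(self, motion_feat: torch.Tensor, box_size: torch.Tensor) -> torch.Tensor:
        # motion_feat: (B, feat_dim) output of the spatiotemporal aggregation
        # box_size:    (B, 3) width/length/height of the first-frame box
        box_encoding = self.box_mlp(box_size)
        fused = motion_feat + box_encoding   # additive box prior
        return self.motion_mlp(fused)        # (B, 4) relative motion

# Toy usage with random tensors
bfe = BoxAwareFeatureEncoding()
motion = bfe(torch.randn(2, 256), torch.tensor([[1.8, 4.5, 1.6], [0.6, 0.8, 1.7]]))
print(motion.shape)  # torch.Size([2, 4])
```

The abstract reports that this prior injection adds negligible computation, which is consistent with the two small MLPs used in the sketch.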
    • A new advance in emotion recognition in conversations: researchers propose the DECANet model, which decouples emotion dependencies and introduces a cross-modal awareness mechanism to overcome the difficulties of dynamic emotion evolution and multimodal fusion, markedly improving recognition accuracy and generalization and providing strong support for emotion recognition in complex interaction scenarios.
      Deng Tiansheng, Cai Guoyong, Dong Kai, Wang Shunjie
      Vol. 31, Issue 2, Pages: 525-540(2026) DOI: 10.11834/jig.250309
      Decoupled emotion dependencies and cross-modal awareness for emotion recognition in conversations
      摘要:ObjectiveEmotion recognition in conversations (ERC) aims to identify the emotional state of each utterance within dialogues. Unlike traditional emotion recognition, which typically classify emotions in isolated utterances, ERC must account for the dynamic evolution of emotions shaped by contextual cues throughout the dialogue. Two key emotional dependencies are crucial: intra-speaker dependencies, representing emotional continuity or variation within the same speaker, and inter-speaker dependencies, reflecting emotional influence between speakers. Properly modeling both dependencies is essential for tracking emotional flow and transitions over time. However, existing methods often conflate these structures, failing to distinguish their unique characteristics, thereby limiting their performance. Concurrently, ERC relies heavily on multimodal fusion of textual, acoustic, and visual cues to capture the full emotional context. Yet, current fusion strategies frequently suffer from semantic misalignment, noisy or conflicting modality signals, and inadequate discrimination of relevant information. These shortcomings result in suboptimal multimodal representations and impair recognition accuracy. Together, these challenges point to the need for a unified framework that can simultaneously disentangle emotional dependencies and enable robust cross-modal semantic integration.MethodA decoupled emotion dependencies and cross-modal awareness network, named DECANet, is proposed for ERC to address the aforementioned challenges. The core idea of DECANet lies in the structural disentanglement and dynamic integration of emotional dependencies. Specifically, two distinct subgraphs are constructed to separately model intra-speaker and inter-speaker emotional relationships. The intra-speaker subgraph tracks emotional continuity or fluctuation within the same speaker’s dialogue turn, while the inter-speaker subgraph captures emotion shifts triggered by interaction among different speakers. Each subgraph is processed using graph attention networks augmented with learnable relation-type embeddings. These embeddings encode various temporal and emotional relations, enabling more precise and context-sensitive propagation of affective cues across the graph structure. A heuristic dynamic interaction strategy is introduced to bridge these two emotional structures while preserving their independence. Drawing inspiration from Siamese architecture, the model concatenates original node features with the outputs of both subgraphs to preserve foundational semantic information and ensure representation alignment. Element-wise difference operations are used to quantify semantic divergence between representations, whereas element-wise multiplication captures their semantic consistency. These composite features are then passed through a gating mechanism, which adaptively learns fusion weights on the basis of the current conversational context. This selective integration strategy enhances emotional discrimination by emphasizing the more informative dependency pathway under different conditions. A context-aware self-attention mechanism is developed to further improve cross-modal semantic alignment. This component performs iterative refinement of audio and visual modality representations through sequential integration of contextual cues from other modalities. 
For example, audio features are initially refined via context-aware interaction with textual representations, followed by a second-stage fusion with visual features, conditioned on the updated audio-text representation. This sequential alignment strengthens inter-modal cohesion and reduces modality gaps. Additionally, speaker embeddings and positional embeddings are incorporated to capture structural cues inherent in multiturn, multispeaker dialogues. These embeddings help encode speaker-specific emotional tendencies and temporal structure within the conversation, facilitating context-aware emotion recognition. A semantic consistency-driven feature selection mechanism is employed to address the issue of semantic noise and inconsistencies often introduced during multimodal fusion. This mechanism selectively preserves only those multimodal representations that maintain high semantic similarity with their corresponding unimodal features. By filtering out semantically deviating or redundant signals, this process helps maintain the integrity of emotion-related information and enhances the robustness of the final emotion representations. Finally, the entire model is implemented under the supervision of multimodal and unimodal objectives, ensuring that the learned features retain discriminative capacity while maintaining robustness across modality configurations.ResultDECANet was evaluated on two widely-used ERC benchmarks, IEMOCAP and MELD, consistently demonstrating competitive performance across multiple evaluation metrics. On IEMOCAP, it achieved improvements of 1.74% in accuracy and 1.77% in weighted F1 score. On MELD, it attained gains of 0.63% in accuracy and 0.52% in weighted F1. These results demonstrate the effectiveness and generalizability of DECANet across diverse conversational scenarios. Comprehensive ablation studies further validate the functional contribution of each proposed component. The removal of either the intra-speaker or inter-speaker subgraph leads to notable performance degradation, underscoring the complementary roles of the two emotional dependency structures. Comparative experiments with single-graph modeling, parallel fusion, and hierarchical fusion strategies reveal that the proposed heuristic dynamic interaction strategy consistently achieves superior results by enabling adaptive and context-sensitive subgraph integration. Furthermore, the cross-modal context-aware self-attention mechanism is essential for effective multimodal fusion. Replacing it with conventional cross-attention or removing speaker and positional embeddings significantly reduces model performance, highlighting the importance of fine-grained inter-modal alignment and contextual sensitivity. In addition, the semantic consistency-based feature selection module effectively filters out semantically inconsistent signals, thereby enhancing the robustness and discriminative quality of the resulting multimodal representations.ConclusionDECANet effectively disentangles and models intra- and inter-speaker emotional dependencies in multispeaker conversations. It further achieves fine-grained semantic alignment across modalities through an iterative and selective fusion mechanism, enhancing the robustness and accuracy of multimodal emotion recognition. Extensive experiments verify that DECANet offers a generalizable and interpretable solution for understanding emotions in complex dialogue scenarios.  
      Keywords: emotion recognition in conversations (ERC); graph attention network (GAT); emotion dependency graph disentanglement and fusion; multimodal fusion; cross-modal interaction mechanism
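A minimal sketch of the heuristic dynamic interaction described above, assuming per-utterance node features of equal dimension (layer sizes and the exact gate input are illustrative assumptions, not the released DECANet code): the original features are combined with both subgraph outputs together with their element-wise difference and product, and a learned gate adaptively weights the intra- and inter-speaker pathways.

```python
import torch
import torch.nn as nn

class GatedSubgraphFusion(nn.Module):
    """Illustrative sketch (not the released DECANet code) of fusing
    intra-speaker and inter-speaker subgraph outputs with a learned gate."""

    def __init__(self, dim: int = 128):
        super().__init__()
        # Gate input: [original | intra | inter | intra - inter | intra * inter]
        self.gate = nn.Sequential(nn.Linear(5 * dim, dim), nn.Sigmoid())

    def forward(self, x, h_intra, h_inter):
        # x, h_intra, h_inter: (N, dim) utterance-node features
        diff = h_intra - h_inter   # semantic divergence between pathways
        prod = h_intra * h_inter   # semantic consistency between pathways
        g = self.gate(torch.cat([x, h_intra, h_inter, diff, prod], dim=-1))
        # Gated, per-node combination of the two dependency pathways.
        return g * h_intra + (1.0 - g) * h_inter

fusion = GatedSubgraphFusion()
out = fusion(torch.randn(6, 128), torch.randn(6, 128), torch.randn(6, 128))
print(out.shape)  # torch.Size([6, 128])
```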
    • A new advance in multimodal object detection: the authors propose a non-spatial-registration decision fusion method for multimodal object detection, which effectively addresses the difficulty of detecting objects from spatially unregistered images captured by cameras of different modalities in real applications and markedly improves detection accuracy and robustness.
      Zhang Rong, Yao Liang, Zhang Yixin, Wang Yijun, Zhang Chuanyi, Liu Fan
      Vol. 31, Issue 2, Pages: 541-555(2026) DOI: 10.11834/jig.250326
      Non-spatial registration decision fusion for multimodal object detection
      摘要:ObjectiveMultimodal object detection significantly improves detection accuracy and robustness in complex environments by integrating complementary information from multiple sensors such as visible light and infrared imaging, especially in harsh conditions such as low light, occlusion, smoke, or haze. However, a key limitation of current multimodal detection research is the high dependence on spatial registration data. Most existing methods assume that multimodal image pairs are aligned at the spatial level, which is usually true in manually processed datasets but is often difficult to guarantee in real-world scenarios. In practical deployment, such as autonomous driving, drone monitoring, and disaster rescue, multimodal sensors are typically installed on independent platforms with different perspectives, resolutions, and directions. Therefore, accurate registration under real-time conditions is difficult to achieve, frequently requiring expensive hardware, manual calibration, or computationally expensive preprocessing. These limitations seriously affect the flexibility, scalability, and real-time performance of multimodal detection systems. To address this issue, this study defines a new research task: multimodal object detection under nonspatial registration conditions. This task explicitly eliminates the dependence on image spatial alignment, taking into consideration the conditions faced by multimodal perception systems in practical applications. The core goal is to realize accurate and robust target fusion of nonspatially registered multimodal image inputs without the need for explicit image spatial registration.MethodTo support this new task, this study proposes a decision-level fusion strategy for nonregistered multimodal object detection. This method can be directly operated on the basis of the output results of a single modal detector, without the need for image- or feature-level fusion, and does not rely on image registration. It mainly includes two contributions: the data and algorithm layers. At the data level, we have constructed a benchmark dataset that can simulate nonregistered inputs. Specifically, multi-spectral multi-scene and multi- resolution fusion dataset(M3FD), FLIR thermal dataset(FLIR), and low-light visible-infrared paired dataset(LLVIP), three widely used and originally registered multimodal datasets of infrared and visible light images, were selected. The differences in multimodal viewing angles and the inconsistent target numbers in reality were simulated by cropping image pairs. This process achieves data augmentation without introducing additional annotation information, providing a realistic and efficient testing platform for evaluating nonregistration detection methods without increasing annotation costs. At the algorithmic level, we propose a nonspatial registration decision fusion method based on graph structure. First, a weighted directed graph is constructed on the basis of the detection results of different modal detectors to obtain a structured representation of diverse modal targets. Each node in the diagram represents a detection target, and the edge weights between nodes are established, corresponding to the relative spatial positions and Euclidean distances between target nodes. 
Next, the target matching between different modalities is transformed into a graph structure matching problem for different modalities, utilizing the relative positional relationships between targets in the graph structure to achieve adaptive matching of cross-modal targets. Finally, for successfully matched target pairs, a Bayesian strategy based on probability ensemble and a weighted average method are used for confidence fusion and detection box fusion. At the same time, a modal transfer strategy was designed to attain efficient complementarity of multimodal information between two modalities for matching failed targets to compensate for missed detections and improve recall rates.ResultWe conducted extensive experiments on the constructed dataset and evaluated the performance of the method in nonregistered and registered scenarios. On the nonspatially registered M3FD, FLIR, and LLVIP datasets, our method achieved the lowest missed detection rates of 12.45%, 12.83%, and 2.67%, respectively. Compared with single-modal detectors, the proposed method reduced the maximum missed detection rate by 10.03%, fully demonstrating that even under nonspatial registration conditions, it can effectively utilize modal complementary information. On the M3FD, FLIR, and LLVIP datasets of spatial registration, this method also performed well, with AP50 reaching 87.0%, 84.4%, and 44.1%, respectively. In comparison with existing multimodal fusion methods such as MS-DETR and DAMSDet, it improved the detection accuracy by up to 6.8%. Experimental results show that the proposed method has good generalization ability on different datasets, can adapt to complex nonregistration situations, and exhibits strong adaptability and universality.ConclusionThis article proposes a multimodal object detection decision fusion strategy suitable for nonspatial registration environments, which is in line with common real-world perception conditions. This method is based on graph matching and decision-level fusion and does not rely on image-level registration to achieve effective correlation and integration of multimodal information. Compared with existing methods, the method proposed in this paper significantly improves detection accuracy and recall in nonregistered scenarios and is applicable to registered data, providing a more universal solution for multimodal perception. The designed modal transfer mechanism further enhances the robustness of detection, allowing the complementary advantages between various modalities to be fully utilized. The code and dataset will be available at https://github.com/1e12Leon/ProbDet.  
      Keywords: object detection; multimodal object detection; directed graph; decision fusion; non-spatial registration
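For matched cross-modal target pairs, the abstract describes confidence fusion via a Bayesian probability ensemble and box fusion via weighted averaging. The sketch below illustrates one plausible form of that decision-level step for a single matched pair; the exact ensemble rule and weighting used in the paper are not specified here, so this is an assumption for illustration.

```python
import numpy as np

def fuse_matched_detections(box_a, conf_a, box_b, conf_b):
    """Illustrative decision-level fusion for one matched cross-modal pair
    (assumed sketch, not the authors' code): a Bayesian probability ensemble
    for the confidence and a confidence-weighted average for the box."""
    # Bayesian ensemble of two independent confidence estimates.
    fused_conf = (conf_a * conf_b) / (conf_a * conf_b + (1 - conf_a) * (1 - conf_b))
    # Confidence-weighted average of the two boxes (x1, y1, x2, y2).
    w_a = conf_a / (conf_a + conf_b)
    fused_box = w_a * np.asarray(box_a, float) + (1 - w_a) * np.asarray(box_b, float)
    return fused_box, fused_conf

# Visible-light and infrared detections of the same target after graph matching
box, conf = fuse_matched_detections([100, 80, 180, 200], 0.72, [104, 78, 186, 205], 0.85)
print(box.round(1), round(conf, 3))
```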

      Image Understanding and Computer Vision

    • New progress in video captioning: the authors propose a method that combines trajectory-based spatiotemporal perception with adaptive semantic focusing, markedly improving the accuracy and semantic richness of video descriptions and offering a new approach to video content understanding and expression.
      Xiao Jingfu, Jiang Wenhui, Fang Yuming, Fang Chengyang, Zhao Xiaowei
      Vol. 31, Issue 2, Pages: 556-572(2026) DOI: 10.11834/jig.250266
      Video captioning with trajectory-based spatiotemporal perceiving and adaptive semantic focusing
      摘要:ObjectiveThe video captioning task aims to automatically generate natural language sentences that accurately convey the semantic content of videos, serving as a crucial intersection between visual understanding and natural language processing. It has shown broad potential in practical applications such as automated surveillance, assistive technologies for the visually impaired, and video summarization. While most existing approaches adopt an encoder-decoder architecture and have achieved notable progress in visual representation and language generation, they still face two key challenges. First, the video encoder often struggles to effectively model object-level motion and event information, which are essential for expressing action-related semantics. Real-world videos frequently involve complex dynamic processes, such as rapid motion and object deformation, making it difficult for models to capture spatiotemporal continuity. Second, the decoder encounters difficulties in cross-modal semantic association during training, as the model often fails to consistently focus on visual regions that are highly relevant to the current word being generated. The mapping between visual and textual modalities is inherently complex and unstable, particularly when multimodal alignment is weak. These limitations severely undermine the quality and accuracy of generated captions in dynamic scenes. To address these issues, this study proposes a novel video captioning method that integrates trajectory-aware spatiotemporal modeling with adaptive semantic focusing, aiming to enhance the model’s ability to capture object-level dynamics and improve multimodal semantic alignment.MethodTo overcome the identified challenges, this study introduces a video description approach that leverages video point trajectories as a core constraint to enhance semantic continuity and multimodal alignment. The methodology comprises two innovative components designed to address the limitations of existing approaches. First, a visual feature trajectory aggregation method is developed to improve the modeling of target semantic continuity. This method explicitly captures the spatiotemporal semantics of target objects by aggregating visual features along point trajectories, which represent the temporal positions and movements of objects within a video. Through an average pooling operation, features along these trajectories are integrated to generate smooth and stable trajectory representations that preserve spatial appearance and temporal continuity. This approach surpasses traditional methods relying on frame-level feature extraction or independent sequence modeling, which often fail to maintain semantic consistency in scenarios involving occlusions or multiobject interactions. By focusing on continuous representations of individual targets, the trajectory aggregation method enables the model to track objects across frames, capture dynamic associations, and produce descriptions with enhanced temporal coherence and robustness, particularly in complex scenes with frequent motion or visual obstructions. Second, an unsupervised adaptive key trajectory focusing learning strategy is proposed to mitigate semantic misalignment between visual and textual modalities. This method exploits the dynamic spatiotemporal information embedded in dense video point trajectories to uncover latent semantic associations between visual and textual elements. 
By analyzing the distribution of attention weights during decoding, it adaptively identifies key trajectories——those most relevant to the generated description——and distinguishes them from nonkey trajectories that contribute less to semantic content. A novel unsupervised key trajectory focusing loss guides the model to prioritize semantically significant regions while suppressing interference from irrelevant background information. Unlike conventional attention mechanisms that either struggle to align words with appropriate visual regions or rely on labor-intensive manual annotations, this approach is annotation-free and avoids dependence on fixed geometric regions, such as rectangles. Instead, it allows semantic alignment regions to flexibly adapt to the dynamic boundaries and complex shapes of target objects, ensuring precise and contextually relevant attention. The adaptive key trajectory selection process further adjusts dynamically to changes in attention weights as each word is generated, thereby enhancing the flexibility and accuracy of semantic alignment. Together, these methods provide a comprehensive solution that strengthens the model’s ability to generate coherent and accurate descriptions by addressing challenges in temporal semantic continuity and multimodal alignment.ResultThe proposed approach was evaluated on two widely used benchmark datasets, Microsoft research video to text(MSR-VTT) and Microsoft research video description corpus(MSVD), to demonstrate its effectiveness in generating accurate and semantically rich video descriptions. On the MSR-VTT dataset, the model achieved a consensus-based image description evaluation(CIDEr) score of 61.2, surpassing several leading methods such as IcoCap (60.2) and OmniViD (56.6), and attained a metric for evaluation of translation with explicit ordering(METEOR) score of 32.0. These results indicate that our method——particularly the introduced unsupervised focusing loss——effectively enhances the semantic association between fine-grained spatiotemporal video features and words. Moreover, the model outperformed approaches that rely on large-scale vision-language pretraining, including MELTR, VL-Prompt, and OmniViD, with CIDEr improvements ranging from +4.6 to +11.2, highlighting the practical advantages of the proposed trajectory-guided attention mechanism. On the MSVD dataset, which features more fine-grained and diverse video content, our model consistently outperformed all baselines across every evaluation metric. In particular, it achieved a BLEU@1 score of 88.6, a BLEU@4 score of 66.0, a METEOR score of 42.5, a ROUGE-L score of 80.1, and an outstanding CIDEr score of 130.1. These scores represent a 7.6-point gain over OmniViD and a 1.6-point gain over VL-Prompt on CIDEr, underscoring the robustness and effectiveness of our model in handling dynamic video scenarios. Beyond quantitative metrics, qualitative analyses reveal that the generated descriptions exhibit strong temporal coherence, reduced redundancy, and improved alignment with video semantics. Attention visualizations confirm that the model successfully attends to semantically relevant moving objects while filtering out background clutter. 
This characteristic mitigates issues of attention dispersion——such as incorrectly identifying background elements as primary subjects——demonstrating the interpretability and precision of the proposed attention mechanism.ConclusionThis paper presents a robust and interpretable video description framework that addresses two long-standing challenges: insufficient modeling of semantic continuity and inadequate cross-modal semantic alignment. By introducing a trajectory-based visual representation approach together with an unsupervised adaptive key trajectory focusing strategy, the proposed method strengthens the temporal and semantic integrity of video representations while guiding the model toward precise alignment with textual outputs. Requiring no additional annotations and being fully end-to-end trainable, the approach demonstrates strong generalization across multiple datasets and proves particularly effective in complex scenarios involving occlusions, background clutter, and multiobject dynamics. The substantial gains in quantitative performance——achieving CIDEr scores of 61.2 and 130.1 on MSR-VTT and MSVD, respectively——along with qualitative improvements in description quality, underscore the potential of integrating motion-aware semantics and attention guidance into video-language models. Overall, the proposed method provides a scalable and efficient solution for video content description, with broad implications for applications such as automated surveillance, assistive technologies, video summarization, and content retrieval.  
      Keywords: video captioning; multimodal semantic alignment; spatiotemporal point trajectory; feature aggregation; adaptive semantic focusing
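A minimal sketch of trajectory-based feature aggregation as described above, assuming per-frame feature maps and point trajectories given in normalized image coordinates (the sampling scheme and tensor shapes are illustrative assumptions): features are bilinearly sampled along each trajectory and average-pooled over time into one trajectory descriptor.

```python
import torch
import torch.nn.functional as F

def aggregate_trajectory_features(feat_maps: torch.Tensor, trajs: torch.Tensor) -> torch.Tensor:
    """Sketch (assumed, for illustration) of trajectory feature aggregation.

    feat_maps: (T, C, H, W) frame-level feature maps
    trajs:     (N, T, 2) trajectory coordinates in [-1, 1], grid_sample convention
    returns:   (N, C) trajectory descriptors (average pooling along time)
    """
    # grid_sample expects a grid of shape (T, N, 1, 2): one point per trajectory per frame.
    grid = trajs.permute(1, 0, 2).unsqueeze(2)                     # (T, N, 1, 2)
    sampled = F.grid_sample(feat_maps, grid, align_corners=True)   # (T, C, N, 1)
    sampled = sampled.squeeze(-1).permute(2, 1, 0)                 # (N, C, T)
    return sampled.mean(dim=-1)                                    # average over time

feats = aggregate_trajectory_features(torch.randn(8, 64, 28, 28), torch.rand(16, 8, 2) * 2 - 1)
print(feats.shape)  # torch.Size([16, 64])
```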
    • New progress in infrared and visible image fusion: the researchers propose an innovative IVIF framework that combines an atmospheric scattering model with frequency-domain feature components, effectively improving the quality of fused images and offering a new approach to image fusion in complex scenes.
      Song Chengcheng, Hu Jiwei, Jin Qiwen
      Vol. 31, Issue 2, Pages: 573-588(2026) DOI: 10.11834/jig.250159
      High-quality fusion of infrared and visible images by integrating frequency information with atmospheric scattering models
      摘要:ObjectiveIn the domain of image processing, infrared and visible image fusion (IVIF) holds a pivotal position. Its main aim is to synthesize a single image that can optimally integrate the complementary aspects of infrared and visible light images. Visible light images are rich in texture details and color information, while infrared images offer valuable thermal radiation characteristics. By fusing them, we can create a comprehensive and useful visual representation that is applicable in numerous scenarios such as surveillance, autonomous driving, and remote sensing. However, real-world atmospheric conditions pose significant challenges. During the transmission of light from the source to the sensor, it undergoes various alterations due to atmospheric scattering and energy attenuation. Such alterations lead to issues including light energy loss, scattering effects that blur images, and inaccurate color contrast. In complex outdoor or dynamic environments where the atmosphere has a strong influence, these problems become pronounced, resulting in a significant degradation of image quality and making it difficult to obtain accurate and useful fusion results. Hence, developing a method that can effectively overcome these atmospheric-induced limitations is crucial to enhance the overall quality and practicality of fused images.MethodTo address these challenges, we propose an innovative IVIF framework. This framework combines the physical model of atmospheric scattering with frequency-domain feature components. First, within the atmospheric scattering model, we focus on accurately estimating and predicting two critical parameters: transmission map and atmospheric light. The transmission map reflects how light propagates through the atmosphere and gets attenuated, while the atmospheric light represents the ambient light conditions in the scene. By precisely determining these parameters, we can deeply understand the impact of the atmosphere on images and use this knowledge to enhance visible light images. This enhancement helps in reducing the negative effects of energy loss and scattering. Additionally, to counteract the artifacts and texture loss that might occur as a result of applying the scattering model, we integrate a Fourier transform and a spatial-channel attention mechanism. The Fourier transform allows us to analyze the images in the frequency domain, where we can selectively amplify the amplitude and phase features. Meanwhile, the spatial-channel attention mechanism helps in focusing on the most relevant regions and features in the spatial and channel dimensions, ensuring that the texture fidelity of the fused images is improved and the fine details are well preserved.ResultWe conducted extensive experiments on four publicly available datasets, namely, RoadScene, TNO, M3FD, and VT5000, along with an extreme foggy scene dataset (AWMM-100K). The outcomes of these experiments were quite compelling and provided solid evidence of the superiority of our proposed method. In the qualitative comparison, our method outperformed the existing state-of-the-art fusion techniques, especially when dealing with complex scenes. The fused images generated by our approach boasted clear object contours, which is of great significance for subsequent object recognition and detection tasks. Moreover, these fused images had a high visual quality, presenting a visually appealing and information-rich appearance. 
They were able to vividly showcase the details that mattered, making them conducive to in-depth analysis and practical applications. Quantitatively, on the RoadScene and TNO datasets, we carried out a meticulous comparison with other deep learning methods in terms of several key metrics that are crucial for evaluating image quality and fusion performance. Specifically, in the aspects of information entropy, standard deviation, spatial frequency, mean gradient, peak signal-to-noise ratio, and correlation coefficient, our method demonstrated remarkable advantages. Detailed calculations and comparisons revealed that compared with other deep learning methods, our method showed increased mean values in these six important indicators by 7.44%, 44.22%, 97.89%, 91.01%, 59.15% and 83.68%. Such significant improvements in these metrics clearly indicated that our method had a strong ability to enhance image quality from multiple dimensions and was highly capable of supporting high-level visual tasks such as object detection. Notably, in the extreme foggy scenes of the AWMM-100K dataset, which are particularly challenging because of the heavy influence of fog, our method shone brightly. It effectively mitigated the adverse effects of fog, significantly reducing the blurriness that is commonly associated with foggy conditions. By doing so, it enhanced the clarity of the fused images and managed to produce higher-quality fused images that were more suitable for further analysis and application in various fields. Overall, the experimental results comprehensively demonstrated the effectiveness and superiority of our proposed IVIF framework in different scenarios and laid a solid foundation for its potential real-world applications.ConclusionOur study has introduced a novel approach to IVIF by integrating traditional atmospheric scattering models with deep learning techniques. Through the combination of Fourier transforms and spatial-channel attention mechanisms, we have achieved enhanced image fusion accuracy and improved preservation of critical image details. This approach offers a promising solution for real-world applications in the field of computer vision. The ability to mitigate the effects of atmospheric scattering, along with advanced frequency-domain processing, enables our framework to deliver superior fusion results. Consequently, it is highly suitable for various high-level tasks such as object recognition, segmentation, autonomous driving, remote sensing, and security surveillance in dynamic environments. The potential of our proposed approach for practical use across diverse fields is truly significant. It sets a new benchmark for future research in this area, inspiring researchers to conduct further investigations and improvements in IVIF. This breakthrough will likely lead to the development of more effective and efficient methods for dealing with complex atmospheric conditions and attaining enhanced visual representations, thus advancing the entire domain.  
      Keywords: image fusion; infrared and visible images; atmospheric scattering enhancement; fusion algorithm; deep learning
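The atmospheric scattering model referred to above is commonly written as I(x) = J(x)t(x) + A(1 - t(x)), where t is the transmission map and A the atmospheric light. Assuming t and A have already been estimated (in the paper this is done by learned modules), the degraded visible image can be enhanced by inverting the model, as in this minimal NumPy sketch.

```python
import numpy as np

def recover_radiance(obs: np.ndarray, transmission: np.ndarray, airlight: np.ndarray,
                     t_min: float = 0.1) -> np.ndarray:
    """Invert the standard scattering model I = J * t + A * (1 - t) to enhance a
    degraded visible image before fusion (illustrative sketch only; the paper
    estimates t and A with learned modules).

    obs:          (H, W, 3) observed visible image in [0, 1]
    transmission: (H, W) estimated transmission map in (0, 1]
    airlight:     (3,) estimated global atmospheric light
    """
    t = np.clip(transmission, t_min, 1.0)[..., None]   # avoid division blow-up
    radiance = (obs - airlight) / t + airlight          # J = (I - A) / t + A
    return np.clip(radiance, 0.0, 1.0)

img = np.random.rand(4, 4, 3)
out = recover_radiance(img, np.full((4, 4), 0.6), np.array([0.8, 0.8, 0.8]))
print(out.shape)  # (4, 4, 3)
```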
    • The authors propose LBSM, a lightweight real-time stereo matching network framework that abandons computationally expensive designs in favor of 2D convolutions and a lightweight channel attention mechanism, reducing computational overhead while improving accuracy and offering a new option for deploying binocular stereo vision systems on edge computing devices.
      Wu Zhong, Zhu Hong, Lin Guangfeng, He Lili
      Vol. 31, Issue 2, Pages: 589-608(2026) DOI: 10.11834/jig.250081
      Lightweight binocular stereo matching network for edge computing devices
      摘要:ObjectiveComputer binocular stereo vision draws inspiration from the fundamental principle of human binocular perception of object distance. It has significant application potential in numerous fields, including intelligent manufacturing, autonomous driving, robotic visual navigation, aerospace, geographic remote sensing, smart healthcare, and virtual/augmented reality, and is being increasingly widely adopted. In recent years, stereo matching methods based on deep learning have leveraged the powerful learning capabilities of deep neural networks to learn implicit rules of stereo matching from a large number of training samples. These kinds of methods integrate the entire stereo matching process, including feature extraction, cost volume construction, cost aggregation, disparity regression, and disparity refinement, into an end-to-end deep neural network. By leveraging the powerful learning capabilities of deep neural networks, these methods effectively address challenges posed by various complex factors. Currently, the accuracy of deep learning-based stereo matching networks has significantly surpassed that of traditional methods, leading to significant advancements and a substantial improvement in the accuracy of stereo matching. However, as a pixel-level dense matching task, stereo matching inherently involves high computational complexity. Deep learning-based stereo matching models typically require significant computational resources to achieve good disparity accuracy. Existing state-of-the-art stereo matching networks often demand tens to hundreds of giga multiply-accumulate operations to process a pair of stereo images with a resolution of 540 × 960. In many real-world application scenarios, owing to constraints such as hardware costs and power consumption, stereo matching models often need to be deployed on low-power edge devices. This issue imposes stringent requirements on the models: They must not only achieve good disparity accuracy but also exhibit extremely low computational overhead. Present advanced stereo matching networks generally have high computational costs, making their deployment expensive and unsuitable for such practical scenarios. To address this issue, this study proposes a lightweight binocular stereo matching framework named LBSM, which is designed for low-power edge computing devices.MethodThe construction and aggregation of cost volume are two critical steps that significantly influence the overall accuracy and efficiency of the process. Thus, in this study, the popular architecture of “4D cost volume + 3D convolution” with high computational overhead is abandoned. First, the 4D cost volume is directly reshaped into a 3D cost volume by fusing the channel and disparity dimensions, and only 2D convolutions are used for cost aggregation. In this way, the entire cost aggregation process only requires the use of 2D convolution, which significantly reduces computational overhead while minimizing information loss. Second, a lightweight channel attention mechanism is introduced for cost aggregation, avoiding the stacking of many convolutional layers and making the cost aggregation process efficient. Third, a two-stage network following the “coarse-to-fine” architecture is adopted, in which a spatially adaptive disparity upsampling strategy replaces bilinear interpolation, enabling adaptive propagation of low-resolution disparity estimates with minimal computational cost. 
This upsampling strategy not only significantly enhances the accuracy of lightweight models with minimal computational overhead but also enables us to reduce the number of “coarse-to-fine” iterations. Finally, on the basis of the above methods, the proposed LBSM is constructed without any 3D convolutions. Through straightforward configurations, LBSM can yield a series of models that are capable of achieving real-time stereo matching on low-power edge computing devices.ResultAblation and comparative experiments on the large-scale Scene Flow dataset demonstrate that LBSM achieves high disparity accuracy with lower computational overhead. Compared with existing lightweight stereo matching models and even some nonlightweight models, the proposed network framework exhibits clear advantages in accuracy and real-time performance. On the real-world road scene datasets KITTI 2012 and KITTI 2015, the proposed method achieves error rates as low as 2.41% and 2.52%, respectively, significantly outperforming mainstream lightweight stereo matching models such as ADCPNet and P3SNet. Additionally, deployment and testing on embedded AI hardware (i.e., NVIDIA Jetson TX2) platforms show that LBSM achieves much better accuracy with faster processing speed (4.59—20.29 frames per second).ConclusionBenefiting from the aforementioned strategies, LBSM relies solely on 2D convolutional layers and exhibits very low computational overhead. It can perform binocular stereo matching tasks in real time on low-power edge computing devices, achieving a better trade-off between computational cost and accuracy than existing models and thus endowing it with significant application potential in real-world scenarios with limited computational resources.  
      Keywords: binocular stereo vision; fully convolutional network; spatial adaptivity; lightweight model; embedded intelligent device
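The core cost-volume simplification described above can be illustrated with a short PyTorch sketch (channel, disparity, and layer sizes are arbitrary illustrative values, not the paper's configuration): the 4D cost volume is reshaped into a 3D volume by merging the channel and disparity dimensions, after which only 2D convolutions are needed for aggregation.

```python
import torch
import torch.nn as nn

def reshape_cost_volume(cost4d: torch.Tensor) -> torch.Tensor:
    """Fold a 4D cost volume (B, C, D, H, W) into a 3D volume (B, C*D, H, W)
    by merging the channel and disparity dimensions, so subsequent cost
    aggregation can use plain 2D convolutions (illustrative sketch)."""
    b, c, d, h, w = cost4d.shape
    return cost4d.reshape(b, c * d, h, w)

# A toy 2D aggregation head over the reshaped volume (assumed layer sizes).
cost4d = torch.randn(1, 8, 24, 68, 120)   # 8 feature channels, 24 disparities
cost3d = reshape_cost_volume(cost4d)      # (1, 192, 68, 120)
agg = nn.Sequential(
    nn.Conv2d(192, 96, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(96, 24, 3, padding=1),      # one aggregated score per disparity
)
print(agg(cost3d).shape)  # torch.Size([1, 24, 68, 120])
```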
    • This work reports progress in high-precision structured-light 3D imaging: the authors propose DiffPhase, a multifrequency phase unwrapping method based on a conditional diffusion model, providing an effective solution to noise contamination and large jumps in phase maps.
      Li Yan, Huang Ji, Zou Qin, Li Qingquan
      Vol. 31, Issue 2, Pages: 609-627(2026) DOI: 10.11834/jig.250133
      Multifrequency phase unwrapping and 3D imaging based on a conditional diffusion model
      摘要:ObjectivePhase unwrapping is a signal processing technique for recovering actual continuous phase values from phase information constrained by a periodic range and affected by noise. It represents a fundamental step in high-precision structured light 3D imaging, playing a pivotal role in ensuring the accuracy and robustness of the reconstruction process. This technique finds broad application in fields such as optical interferometry, synthetic aperture radar, and magnetic resonance imaging. Owing to the limitations of the measuring principle, the phases captured in real-world scenarios are “wrapped” into multiple cycles. When the variation in the measured phase exceeds 2π, the excess phase is “wrapped back” to 0, known as the wrapping phenomenon. Phase unwrapping techniques aim to restore the wrapped phase data, which span multiple cycles, to their original true phase. This restoration allows for the accurate retrieval of spatial information needed for precise analysis and computation. Because of equipment errors and environmental interference, the measured phase is often contaminated by noise in practical scenarios, sometimes even exhibiting noncontinuous phase jumps, which poses a significant challenge to accurate phase unwrapping and 3D imaging. If the unwrapping method cannot effectively model the phase globally, local jumps can lead to error propagation and result in significant unwrapping errors. Diffusion models based on denoising mechanisms have demonstrated impressive performance in high-quality image generation and shown potential in phase unwrapping. By iteratively denoising noisy images, diffusion models can effectively model large-scale phase relationships and mitigate the impact of local errors on the overall result. However, as generative models primarily designed for generating natural images, existing diffusion models focus on balancing clarity and diversity, making it challenging to ensure the precision of geometric models. Moreover, phase unwrapping is a rigorous mathematical problem governed by periodic and physical constraints, which diffusion models are not inherently equipped to handle. Therefore, the direct application of diffusion models to phase unwrapping and 3D imaging remains problematic, which requires accurate recovery of continuous phase values under strict physical constraints. Existing phase unwrapping methods also rely solely on a single-frequency wrapped phase, which struggles to simultaneously recover large-scale information and fine details of the true phase, thus affecting the high-precision reconstruction of 3D scenes. Obtaining wrapped phase at additional frequencies incurs minimal cost in many scenarios. Low-frequency wrapped phases capture large phase jumps but lose local details, while high-frequency wrapped phases preserve details but struggle with large jumps. To address the above issues, this study proposes DiffPhase, a multifrequency phase unwrapping method based on a conditional diffusion model. The proposed method can take any number of different frequency wrapped phases as input and uses multifrequency phase features as conditions. Leveraging the powerful generative capabilities and global modeling ability of the diffusion model, DiffPhase can precisely restore the phase values. In combination with 3D imaging, DiffPhase enables accurate reconstruction of the absolute phase and scene depth information.MethodWe model phase unwrapping as a conditionally guided image generation task. 
By incorporating multiscale feature conditioning and diffusion-based generative mechanisms, DiffPhase enhances robustness against noise and occlusions. DiffPhase consists of a feature extraction module and a generation module. In the feature extraction module, we design a multiscale feature extraction module aligned with the diffusion network architecture to extract hierarchical semantic information from the input wrapped phase. The features extracted are progressively integrated into each stage of the diffusion process by leveraging a cross-scale cross-attention mechanism based on multihead attention, thereby enhancing the local accuracy and global consistency of the generated results. The generation process is guided by the conditions provided by the feature extraction part, progressively denoising the initial Gaussian noise and generating the predicted true phase. Then, the denoising process is performed step by step on the reduced features to learn the reverse noise distribution. The interaction between feature extraction and generation parts is realized through the cross-attention layer in the diffusion model. A two-stage training strategy is adopted to improve training stability and generalization capability. The feature extraction module is first pretrained to learn structural priors, followed by end-to-end joint optimization to enhance phase prediction performance. To optimize computational efficiency, DiffPhase reduces the high-dimensional input of the diffusion model using UNet structure to obtain a compact low-dimensional representation. By introducing the stepwise denoising mechanism of diffusion models, the method can effectively handle phase unwrapping in strong noise environments and mitigate error propagation in complex scenarios. Additionally, an adaptive multifrequency input mechanism is introduced, enabling the network to flexibly process an arbitrary number of wrapped phase inputs at different frequencies. By integrating low-frequency global contours with high-frequency local details, this approach effectively suppresses error propagation and enhances the accuracy and robustness of the unwrapping results.ResultExperiments are conducted to compare the proposed method with eight state-of-the-art deep learning and traditional approaches on two simulated datasets and two real datasets. For the two simulated methods, three wrapped phases with different frequencies are generated for each absolute phase to simulate multifrequency data collected in real scenarios. All training data are randomly corrupted with noise ranging from 0 dB to 30 dB to train the model’s denoising ability. The test sets are corrupted with noise at different levels to evaluate the model’s performance under various noise conditions. The NYU Depth V2 and Middlebury Stereo datasets are also considered as real data for comparative testing. On the RME-multi and MoGR-multi simulation datasets, the normalized root-mean-square errors of phase unwrapping are 0.23% and 0.24%, respectively. On the NYU-phase and MS-phase real datasets, the normalized root-mean-square errors of phase unwrapping are 4.69% and 7.50%, respectively. The comparison experiments demonstrate that DiffPhase, using only a single-frequency phase, outperforms other methods in phase unwrapping performance. With additional frequency wrapped phases, DiffPhase achieves a stable improvement in accuracy. 
In real-world 3D scene reconstruction, DiffPhase demonstrates high depth reconstruction accuracy under varying levels of noise interference, exhibiting superior precision and robustness, particularly in regions with fine edges and complex structures. Furthermore, occlusion recovery experiments are conducted on the MoGR-multi dataset, which verifies the superior image restoration ability and accurate phase unwrapping capability of DiffPhase in complex occlusion scenarios.ConclusionIn this paper, a multifrequency phase unwrapping method based on a conditional diffusion model is proposed. Experimental results show that DiffPhase outperforms several state-of-the-art methods and achieves accurate and robust unwrapping results in high-noise and complex phase scenarios, effectively improving the accuracy of 3D reconstruction.  
      Keywords: phase unwrapping; diffusion model; 3D modeling; deep learning; computer vision
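To make the wrapping phenomenon and the multifrequency inputs concrete, the following NumPy sketch shows how wrapped phases at several fringe frequencies can be simulated from an absolute phase, with optional additive noise (the frequencies, noise model, and SNR values are assumptions for illustration; the paper's simulated datasets follow their own protocol).

```python
import numpy as np

def wrap_phase(absolute_phase, frequency=1.0, noise_db=None):
    """Wrap an absolute phase into (-pi, pi] at a given fringe frequency and
    optionally add Gaussian noise at a target SNR (illustrative sketch of how
    multifrequency wrapped-phase inputs can be simulated from a ground truth)."""
    phi = frequency * absolute_phase
    if noise_db is not None:
        signal_power = np.mean(phi ** 2)
        noise_power = signal_power / (10 ** (noise_db / 10))
        phi = phi + np.random.normal(0.0, np.sqrt(noise_power), phi.shape)
    # Wrapping back into the principal interval: the measurement only sees this.
    return np.angle(np.exp(1j * phi))

truth = np.linspace(0, 20 * np.pi, 512)   # absolute phase ramp
wrapped = {f: wrap_phase(truth, f, noise_db=20) for f in (1.0, 4.0, 16.0)}
print({f: w.shape for f, w in wrapped.items()})
```

The low-frequency input captures the large-scale ramp with few wraps, while the high-frequency inputs preserve fine detail but wrap many times, which is the complementarity the multifrequency model exploits.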
    • The authors propose a 3D hand pose estimation network that integrates hand and object features; through multilevel cross-modal interaction optimization, it effectively improves hand pose estimation in occluded scenes and opens a new direction for human-computer interaction research in complex interaction scenarios.
      Jia Di, Wang Jianchun, Han Xuefeng, Zhang Fan, Wang Xiao
      Vol. 31, Issue 2, Pages: 628-641(2026) DOI: 10.11834/jig.250270
      3D hand pose estimation network integrating hand and object features
      摘要:ObjectiveThe estimation of hand poses is a crucial technology in the field of human-computer interaction, playing an important role in many complex interactive scenarios such as virtual reality, augmented reality, and robotics. However, owing to the high flexibility of hand movements, the complexity of self-occlusion, and the interaction between hands and objects, hand pose estimation remains a challenging task, especially in real-world scenarios. Traditional methods often struggle with issues such as insufficient multiscale feature fusion, channel loss, and occlusion interference. Most of the existing approaches tend to rely on either single-feature extraction strategies or static attention mechanisms, which frequently fail to achieve a balance between accuracy and robustness, particularly in occluded environments. As a result, hand pose estimation in these challenging settings often suffers from low accuracy. To address these challenges, this study proposes a novel 3D hand pose estimation network that integrates hand-object features, named hand-object collaborative enhancement network (HOCEN). The proposed method aims to enhance the robustness and accuracy of hand pose estimation in occlusion-prone environments by optimizing multilevel cross-modal interactions between hands and objects.MethodFirst, the network introduces a dual-stream feature pyramid network (DS-FPN), aiming to capture local details and global semantic dependencies through bidirectional cross-scale information aggregation. This approach alleviates the common channel loss problem in traditional feature pyramid network (FPN), especially when dealing with multiscale features. Traditional FPNs usually fuse features through an upward path, which typically leads to the loss of fine details such as fingertip and joint textures, specifically in complex occlusion scenarios. By contrast, DS-FPN utilizes upward and downward information flows, enabling more excellent feature refinement. The downward flow enhances local details, while the upward flow injects global posture and semantic information. After these steps are completed, a second downward flow process is adopted to integrate the fine local details with global information, thereby forming an optimized feature fusion path. This path can enhance the low-level details and high-level global context of hand features, and this bidirectional interaction mechanism exists in the DS-FPN, significantly improving the accuracy and efficiency of hand feature extraction, especially in complex occlusion scenarios. Second, this study introduces a dynamic adjustment module (DAM) based on an external attention mechanism. This module dynamically adjusts the attention weights of the extracted hand features, ensuring that important features are emphasized while suppressing noise interference. The dynamic feature adjustment ensures that the hand features can effectively adapt to complex environments, thus achieving accurate feature extraction. The attention mechanism guided by external memory adapts to different conditions, optimizes feature calibration, and improves the overall robustness. Third, a hand feature enhancement module is constructed, aiming to enhance the feature alignment between hands and objects. This dual-stream collaborative attention mechanism combines the geometric constraints and semantic complementarity of hands and objects, thereby improving the cross-modal feature alignment between them. 
The HFE module fuses hand and object pose features through a multihead attention mechanism. This fusion enables the spatial information between the hands and objects to achieve precise alignment, which is crucial for tasks involving interactions between hands and objects. Finally, a hierarchical feature decoder is adopted, which is used to reconstruct 3D hand pose parameters. This decoder extracts the optimized features from the previous layers and accurately predicts a 3D hand pose on the basis of the estimated joint positions and pose parameters, thus reconstructing the 3D model of the hand.ResultExtensive experimental validation is performed on the publicly available dexterous-YCB(Dex-YCB) and hand-object 3D(HO3D) datasets to evaluate the proposed method’s effectiveness. Experimental results demonstrate that HOCEN outperforms existing state-of-the-art models in hand pose estimation accuracy, particularly in complex interaction and occlusion scenarios. On the Dex-YCB dataset, the proposed model achieves a mean per-joint position error (MPJPE) of 12.4 mm and a Procrustes-aligned MPJPE of 5.4 mm, which is superior to advanced models such as semantic graph convolutional network(SemGCN) and harmonious feature learning(HFL). Similarly, on the HO3D dataset, the model achieves very low joint and mesh errors of 9.2 mm and 9.1 mm, respectively, proving the model’s accuracy in challenging conditions. Ablation studies further validate the individual contributions and synergistic effects of each module, with the DS-FPN module improving multiscale feature fusion, DAM enhancing feature robustness, and the HFE module strengthening hand-object interaction alignment.ConclusionThis study proposes a 3D hand pose estimation deep network based on the fusion of dynamic hand-object interaction features. It effectively improves the hand pose estimation in occluded environments and solves key problems such as multiscale feature fusion, noise interference, and interference between hand and object features. Experimental results show that this method has strong robustness and generalization ability in complex interaction scenarios. Dynamic feature calibration and hand-object collaboration strategies provide a novel solution for solving the problem of hand pose estimation in occluded situations.  
      Keywords: gesture estimation; dual-stream pyramid feature fusion; dynamic attention mechanism; dynamic feature adjustment; hand-object synergistic enhancement
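A minimal sketch of a dynamic adjustment module built on external attention, i.e., attention against small learnable memory units shared across samples (dimensions, the residual connection, and the double-normalization details are assumptions for illustration, not the paper's implementation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalAttentionAdjust(nn.Module):
    """Sketch of a dynamic adjustment module driven by external attention:
    hand-feature tokens attend to learnable external memories, and the
    recalibrated features are added back residually (assumed design)."""

    def __init__(self, dim: int = 256, mem_size: int = 64):
        super().__init__()
        self.mk = nn.Linear(dim, mem_size, bias=False)   # external key memory
        self.mv = nn.Linear(mem_size, dim, bias=False)   # external value memory

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens of hand features
        attn = self.mk(x)                                        # (B, N, mem_size)
        attn = F.softmax(attn, dim=1)                            # normalize over tokens
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)    # double normalization
        out = self.mv(attn)                                      # (B, N, dim)
        return x + out                                           # residual recalibration

dam = ExternalAttentionAdjust()
print(dam(torch.randn(2, 49, 256)).shape)  # torch.Size([2, 49, 256])
```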

      Remote Sensing Image Processing

    • New progress in building extraction from remote sensing imagery: the authors propose a prompt-free segmentation-classification joint model (SAM-Classifier) that transfers a general-purpose vision model to remote sensing scenes, enabling automated and efficient building extraction and offering a new solution to the accuracy bottleneck of traditional methods.
      Chen Xiuxiu, Jin Yongsheng, Ye Jiansheng, Fang Lei
      Vol. 31, Issue 2, Pages: 642-656(2026) DOI: 10.11834/jig.250258
      From general segmentation to specialized building extraction——research on the optimization strategies of SAM in high-resolution remote sensing images
      摘要:ObjectiveWith the rapid advancement of remote sensing technology, high-resolution imagery has become an essential tool for urban planning, environmental monitoring, and disaster management, thereby highlighting the need for accurate and efficient building extraction methods. Approaches based on deep learning have made substantial advancements in this field. Unlike traditional handcrafted feature engineering, convolutional neural networks (CNNs) autonomously learn building-specific features from remote sensing imagery. The U-Net encoder-decoder architecture provides a solution path for the fusion of shallow and deep features. Based on the U-Net architecture, network structures such as multilevel parallelism, multipath fusion, and multitask learning have been proposed to achieve effective multiscale fusion. Transformer architectures, leveraging self-attention mechanisms to capture long-range dependencies between global and local features, have also been adopted as alternatives to CNNs for multiscale feature extraction. However, substantial intraclass feature variations within buildings and complex surrounding environments in remote sensing imagery necessitate large-scale annotated building mask datasets for training deep learning models. Acquiring such labeled data is labor intensive, time consuming, and complicated by the expertise required for remote sensing interpretation. For mitigating annotation costs, a common strategy involves transferring feature extraction networks pretrained on large-scale natural image datasets (e.g., ImageNet) to remote sensing tasks, leveraging their robust feature representations. The segment anything model (SAM), a foundational segmentation model trained on 11 million images and 1.1 billion mask annotations, exhibits exceptional generalization capabilities across diverse objects and scenes. Its promptable design enables zero-shot segmentation for arbitrary tasks, demonstrating immense potential in remote sensing applications. Nonetheless, SAM faces the following main challenges in remote sensing tasks: 1) poor semantic generalization, which stems from its insufficient generalization ability to handle the diversity of remote sensing imagery given that the model is trained on natural images; 2) strong prompt dependence, which requires manual input of points, boxes, or masks, limiting its automation capability in large-scale real-world applications. To enable SAM to be directly used for specific semantic segmentation tasks, scholars have made various attempts, including fine-tuning the SAM decoder and adding prompt branch networks. However, these approaches have high model training difficulties and require high-quality training datasets. Some scholars have also tried to simply combine SAM with object detection and language models, using the points, boxes, masks, or prompt feature information generated by these models as SAM’s prompt input to achieve automatic semantic segmentation. These methods simply concatenate the two models, and the SAM segmentation results are dependent on the accuracy of the prompts generated by the former, with poor controllability of the extraction results and rough object edge segmentation. They are suitable for tasks that do not require fine object edge segmentation, such as street scene segmentation, but are not suitable for building segmentation. 
Method: This study systematically investigated the performance of SAM in building extraction under three prompting strategies: point, box, and mask prompts. In addition, a prompt-free SAM-classification joint model (SAM-Classifier) was proposed to overcome SAM's limitations in semantic understanding and prompt dependency. Experiments were conducted on two widely used benchmark datasets, the WHU Building and Massachusetts Buildings datasets, to evaluate SAM's building extraction performance under the various prompting strategies. SAM was also benchmarked against the building extraction capability of the Sense Earth 3.0 platform, a state-of-the-art solution developed by SenseTime. Result: The box prompt achieves the best extraction accuracy and generalization. On the WHU Building dataset it reaches an F1-score of 0.945, a precision of 0.955, and a recall of 0.936, compared with 0.767, 0.940, and 0.648 for the point prompt and 0.705, 0.943, and 0.563 for the mask prompt. By delineating the approximate bounds of building targets, the box prompt markedly improves small-object recognition in complex scenes and reduces interference from cluttered backgrounds. The batch input mode of the point prompt is prone to under-segmentation, whereas looping over single objects eliminates location ambiguity and substantially improves building recognition accuracy. Prompt-guided SAM thus attains high building segmentation accuracy, but its reliance on prior information and manual intervention limits its practical applicability. SAM-Classifier, which combines SAM's segmentation capability with a lightweight classification network, removes the need for manual prompting while achieving considerable performance. On the Massachusetts Buildings dataset it reaches an F1-score of 0.717, a precision of 0.728, and a recall of 0.706, compared with 0.746, 0.695, and 0.804 for Sense Earth, making it competitive with existing methods in the literature. These findings underscore SAM's potential as a foundational model for remote sensing tasks while emphasizing the need for task-specific adaptation to overcome its inherent limitations. In practice, box prompts are recommended for time-sensitive interactive scenarios such as disaster response, where rapid boundary delineation is critical, whereas SAM-Classifier offers a viable solution for large-scale automated mapping. This research not only benchmarks SAM's capabilities in building extraction but also provides a methodological framework for adapting general-purpose segmentation models to specialized extraction workflows. Future directions include exploring few-shot learning to reduce annotation dependency, enhancing SAM's multimodal integration, and developing hybrid models that combine SAM's segmentation strength with domain-specific prior knowledge.
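As a rough illustration of the prompt-free two-stage idea (SAM proposes class-agnostic regions, a lightweight classifier keeps the building ones), the following sketch combines SAM's automatic mask generator with a hypothetical binary building classifier; the classifier file, patch size, and threshold are assumptions and do not reproduce the SAM-Classifier architecture or training described in the paper.

# Minimal sketch of a prompt-free two-stage pipeline: SAM's automatic mask
# generator proposes regions, and a hypothetical lightweight binary classifier
# keeps only the regions it labels as "building".
import numpy as np
import cv2
import torch
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("tile.png"), cv2.COLOR_BGR2RGB)
proposals = mask_generator.generate(image)  # dicts with "segmentation", "bbox", ...

# Hypothetical lightweight binary CNN exported with TorchScript (not the paper's model).
classifier = torch.jit.load("building_classifier.pt").eval()

building_mask = np.zeros(image.shape[:2], dtype=np.uint8)
for prop in proposals:
    x, y, w, h = map(int, prop["bbox"])  # proposal box in XYWH pixel coordinates
    if w == 0 or h == 0:
        continue
    patch = cv2.resize(image[y:y + h, x:x + w], (64, 64))
    inp = torch.from_numpy(patch).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        prob_building = torch.sigmoid(classifier(inp)).item()
    if prob_building > 0.5:  # keep proposals classified as buildings
        building_mask |= prop["segmentation"].astype(np.uint8)

Filtering proposals with a small classifier removes the manual prompting step at the cost of depending on the proposal stage's recall, which mirrors the trade-off between the prompt-guided and prompt-free settings discussed above.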
By addressing these challenges, a fully autonomous, high-precision building extraction model could be developed, empowering smart urban development, sustainable environmental governance, and resilient disaster preparedness in the era of pervasive remote sensing. This comprehensive investigation advances the understanding of SAM's strengths, such as its zero-shot adaptability and boundary precision, and its limitations, including semantic ambiguity and prompt reliance, while offering actionable insights for researchers and practitioners. The proposed two-stage framework exemplifies how a foundational large-scale segmentation model can be tailored to meet the rigorous demands of remote sensing, setting a precedent for future innovations in AI-driven geospatial analysis. Conclusion: In this study, we evaluated three prompting strategies for applying SAM to building extraction and proposed the prompt-free SAM-classification joint model SAM-Classifier, which integrates SAM with a lightweight binary classification model for extracting buildings from high-resolution images. The experimental results show that the box prompt achieves the highest extraction accuracy and is well suited for interactive scenarios. Moreover, the extraction performance of the proposed SAM-Classifier is comparable to that of several state-of-the-art building extraction approaches.
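For reference, the reported precision, recall, and F1-score can be computed from binary masks as sketched below; a pixel-wise protocol is assumed here, and the paper's exact evaluation settings may differ.

# Sketch of precision, recall, and F1-score from a predicted building mask
# and a ground-truth mask, assuming a pixel-wise evaluation protocol.
import numpy as np

def precision_recall_f1(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """pred and gt are binary (0/1) masks of identical shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1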
      关键词:image segmentation;high-resolution imagery;building extraction;segment anything model (SAM);prompt segmentation;optimization strategy