Lang Jinwei, Li Yaxin, Liu Shuai, Kang Xiaodong, Cheng Junqiang
DOI:10.11834/jig.260217
Abstract: With the proliferation of multi-source sensors and the rapid progress of multimodal data fusion and intelligent interpretation techniques, aerial remote sensing is evolving from traditional single-modal perception toward multimodal perception and understanding. This progress provides significant potential for intelligent monitoring and decision-making in precision agriculture, urban environmental monitoring, ecological protection, and natural disaster assessment. As Earth observation capabilities advance, multi-source remote sensing data (including optical imagery, synthetic aperture radar (SAR), and hyperspectral imagery) are increasingly characterized by high spatial resolution, multi-temporal coverage, and multi-dimensional richness, offering a solid foundation for fine-grained land-cover identification and dynamic change analysis. Nevertheless, traditional remote sensing interpretation methods, which rely primarily on single-source data and single-task learning models, struggle to cope with the complexities of real-world scenarios, where object scales vary dramatically, semantic layers are rich and entangled, and spatiotemporal heterogeneity is strong. Such methods cannot adequately balance holistic scene semantics with local fine-grained details, limiting their capacity for high-level semantic understanding and comprehensive decision-making. Compared with satellite remote sensing, aerial platforms, typically unmanned aerial vehicles (UAVs) and low-altitude aircraft, offer higher spatial resolution and greater observational flexibility, enabling tasks such as fine urban modeling, small-object recognition, and emergency monitoring. However, these very advantages introduce severe scale variations, highly complex backgrounds, and inconsistent imaging conditions, which make it extremely difficult for conventional models to capture both global scene semantics and subtle local textures simultaneously.
Recent advances in deep learning and artificial intelligence have transformed remote sensing interpretation from manual inspection toward automation and intelligence, achieving notable results in scene classification, object detection, and semantic segmentation. Yet single-task and single-modality paradigms remain inadequate for fully exploiting complementary information in heterogeneous multi-source data, falling short of supporting the high-level semantic comprehension and integrated decision-making required for complex applications. The emergence of multimodal large models provides a transformative pathway for intelligent aerial remote sensing interpretation. By integrating visual encoders with large language models through instruction tuning and cross-modal alignment, these models can jointly perform image understanding and linguistic reasoning, achieving breakthroughs in visual-language fusion, cross-modal reasoning, and task-instruction-guided analysis. Within the remote sensing community, researchers have begun to construct large-scale image-text datasets to train vision-language models and employ linguistic guidance to enhance the comprehension of complex land-cover semantics, marking a critical shift from purely visual modeling toward semantically augmented modeling and laying the groundwork for the subsequent development of multimodal large models. A systematic review reveals that the evolution of remote sensing large models has followed a clear paradigm progression: from early approaches that relied solely on single visual modalities, through vision-language models that introduced textual semantics but were still confined to specific tasks, to the current stage in which unified multimodal large models enable cross-task collaboration and complex reasoning. Despite these promising advances, applying multimodal large models to aerial remote sensing still encounters a series of formidable bottlenecks. 
The substantial disparities in spatiotemporal references and radiometric properties among heterogeneous data sources such as optical, SAR, and hyperspectral imagery render cross-modal alignment extremely difficult, often giving rise to semantic misalignment and information loss. The fine-grained spatial structures and multi-scale objects inherent in aerial imagery impose stringent demands on spatial cognition and multi-scale reasoning, yet general-purpose large models frequently lack sufficient domain-specific spatial priors to accurately capture geometric and structural relationships. High-resolution scenes impose massive computational and storage burdens, making real-time inference on edge devices highly challenging and failing to meet the stringent timeliness and generalization requirements of time-sensitive tasks such as disaster response and UAV-based inspection. Moreover, the black-box nature of model decision-making leads to insufficient interpretability and trustworthiness, hindering deployment in safety-critical scenarios, while data privacy concerns have become increasingly prominent within distributed collaborative learning frameworks. In response to the outlined challenges, this paper systematically reviews recent advances in multimodal data fusion and large model technologies for aerial remote sensing, tracing the shift from single-source to multimodal integration. Driven by large models, remote sensing interpretation has gradually evolved from low-level perception to cross-modal reasoning and semantic understanding. Nevertheless, substantial challenges still remain in multimodal alignment, spatial cognition, task reliability, and practical deployment in real-world scenarios. 
We analyze cross-sensor fusion, highlighting that disparities in geometry, radiometry, and acquisition timing among optical, SAR, and LiDAR data severely impede cross-modal semantic alignment; despite advances in vision-language pretraining, high resolution and complex backgrounds often degrade alignment accuracy, necessitating fine-grained multimodal representations through geometric correction, radiometric normalization, and spatiotemporal consistency modeling. For multi-scale spatial reasoning, we emphasize that complex tasks require holistic understanding of structural relations and spatial distributions; existing models partially strengthen spatial reasoning via region-based interaction and spatial question answering, but a unified framework for global spatial relation modeling is needed to elevate task awareness and multilevel reasoning. Generative AI for few-shot interpretation shows potential to alleviate annotation scarcity and support time-critical missions such as disaster response. Regarding trustworthiness and interpretability, hallucination in large vision-language models is exacerbated in remote sensing by peculiar data distributions and insufficient instruction data, thus requiring domain knowledge, task constraints, and dedicated evaluation systems to enforce reasoning consistency and reliable high-stakes decisions. For efficient edge deployment and privacy, high-resolution large-format data impose heavy computational and storage burdens; lightweight networks continue to emerge, yet trade-offs among real-time performance, stability, and complex scene adaptability remain, making model compression, inference optimization, and hardware-software co-design critical, while federated learning offers a privacy-preserving mechanism for multi-source data integration. 
In summary, addressing these challenges is pivotal not only for advancing theoretical understanding of cross-modal semantic reasoning and multi-scale spatial cognition but also for enabling the practical deployment of intelligent aerial remote sensing systems in high-stakes, real-world applications.
Hu Yumei, Wang Xiaohua, Deng Bao, Zhao Yangyang, Zhao Yiyang
DOI:10.11834/jig.260236
Abstract: The rational and effective use of multi-source information expands the spatial and temporal coverage of measurements, fully exploits the target feature information captured by different sensors, greatly reduces information ambiguity, and improves detection performance. Multimodal target data come from sensors that differ in type, characteristics, and perception method. Airborne multi-sensor target tracking systems, for example, often involve sensors with different working attributes, such as visible-light and infrared imagers, optical sensors, microwave radar, and lidar, each presenting measurement information about the target from a different perspective. The point cloud obtained by lidar supports high-precision ranging and, because different objects have different reflectivities, provides more accurate spatial information. However, it is susceptible to noise generated by complex environments and lacks semantic information. Compared with point clouds, target images provide rich surface texture and contextual semantic information that more accurately restore the appearance and structure of objects, but they lack depth information. With the rapid development of sensor networks, information perception methods, and big data processing technologies, perceived information is increasingly multi-domain, multimodal, ambiguous, and incompletely associated. Multimodal information fusion has therefore received increasing attention for its human-like automatic perception and powerful comprehensive reasoning capabilities. The paper first introduces the significance and necessity of multimodal information fusion.
At the fusion architecture level, it then outlines the evolution from early fusion through late fusion to hybrid fusion, revealing the trade-off between information retention and computational efficiency in each architecture. It also elaborates on three cutting-edge multimodal fusion methods: attention-based interaction modeling, semantic alignment based on contrastive learning, and generative fusion with large language models as the carrier. Next, it introduces typical multimodal datasets, including COCO, LAION-400M, Visual Genome, MS COCO Captions, Conceptual Captions, Flickr30k, and aviation-related datasets, together with their corresponding application fields. It further explores typical military applications of multimodal fusion, including battlefield situation awareness, multi-source intelligence analysis, and unmanned system collaboration. In addition, it points out that, with the maturity of contrastive learning and multimodal pre-training, high-quality single-modal representations are no longer a bottleneck. The research focus has shifted from how to map heterogeneous modalities into a unified space to how to design, on top of aligned representations, interaction mechanisms that can capture complex, dynamic, and even contradictory relationships between modalities. Based on fusion architecture, fusion models, and computational costs, it proposes three development directions for multimodal information fusion. The choice of multimodal fusion architecture directly affects the degree of information retention, computational efficiency, and model interpretability. Based on the stage at which fusion occurs, existing methods can be categorized into early fusion, late fusion, and hybrid fusion.
Early fusion, also known as data-level fusion, integrates multi-source information before, or at the shallowest level of, modality-specific feature extraction. It preserves the richest original information and multimodal interaction details. Typical implementations include multimodal concatenation and multi-view encoding. In a vision-language model, for example, early fusion concatenates image patch embeddings and text word embeddings into a unified sequence, which is then fed into a joint encoder. The advantage of this architecture is that multimodal associations are established at a shallow level, which helps capture fine-grained modality interactions. However, early fusion suffers from a severe curse of dimensionality: when the number of modalities increases or the feature dimension of each modality is too high, the joint representation space grows exponentially. In addition, because modalities have different noise characteristics, the noise of some modalities may be amplified during early fusion, degrading representation quality. Late fusion adopts the opposite strategy: each modality independently performs feature extraction and task prediction, and the predictions are combined only at the final decision layer, for example by ensembling, weighted averaging, or a meta-learner. In this "divide and conquer" design, the modality branches can be trained in parallel, which makes training efficient; overfitting or noise in one modality is less likely to affect the others; and when some modalities are missing, the remaining branches can still operate normally, giving strong system robustness. However, because late fusion requires training an independent model for each modality, its total consumption of computational resources may be higher.
In addition, independent per-modality models cannot capture low-level interactions between modalities, and simple fusion at the decision level struggles to model them. Data from different modalities may also face alignment issues in time or space, such as the synchronization of video frames and audio signals. Hybrid fusion introduces cross-modal interaction in the middle layers of the network while maintaining modality-specific processing paths, aiming to combine the strengths of both. A typical design uses modality-independent encoders at the bottom layers, introduces a cross-attention module in the middle layers to achieve feature-level interaction, and separates the paths again at the top layers to preserve modality-specific information. A further evolutionary direction is dynamic fusion, which allows the network to decide autonomously where and with what intensity to fuse information from the various modalities. For example, a gating-based fusion network for visual images and LiDAR dynamically adjusts the weight of each modality based on the input data: it places more trust in visual images under strong lighting and gives higher weight to LiDAR point clouds under low-light conditions. There is no single "optimal" fusion architecture; the choice depends on task characteristics and resource constraints. Generally speaking, when there are fine-grained, location-related interactions between modalities (such as image-text alignment), early fusion is better; when each modality can complete predictions independently and some modalities are prone to being missing, late fusion is more robust; dynamic fusion strikes a good balance between performance and robustness. Looking back at the development of multimodal fusion, the bottleneck is shifting from "representation" to "interaction".
With the maturity of contrastive learning and multimodal pre-training, high-quality single-modality representation is no longer a bottleneck. The research focus has shifted from how to map heterogeneous modalities into a unified space to how to design, on the basis of representation alignment, interaction mechanisms that can capture complex, dynamic, and even contradictory relationships between modalities. The fusion architecture is evolving from "static design" to "dynamic adaptation". In the real world, the correlation and reliability of modalities change dynamically with the environment, and the limitations of fixed fusion strategies are becoming increasingly evident. In military confrontations, for example, electronic jamming may render specific sensors inoperative. It is therefore necessary to dynamically adjust the activation state of modalities and the fusion weights based on input content and task context to enhance the robustness of the fusion system. Causal fusion models hold promise for overcoming the limitations of current correlation-based learning. Most existing methods learn statistical correlations between modalities rather than causal relationships. This leads to two issues: first, the model is prone to learning spurious correlations; second, it struggles to generalize to environments outside the training distribution. Introducing causal inference tools can therefore enhance the fusion model's adversarial robustness and environmental transferability. In addition, new multimodal fusion mechanisms need to be designed. Redundant information in each modality wastes computation, and the accompanying noise may degrade fusion performance. How to use information-theoretic techniques to identify and retain complementary information while suppressing redundancy and noise is a new development direction for multimodal fusion technology.
Keywords: multimodal information fusion; multimodal large language model; attention mechanism; contrastive learning; generative fusion
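The contrast between early and late fusion described in the abstract above can be illustrated with a minimal sketch. It is not taken from the paper: the function names `early_fusion` and `late_fusion`, the toy embeddings, and the use of plain weighted averaging at the decision level are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def early_fusion(image_tokens: List[Vector], text_tokens: List[Vector]) -> List[Vector]:
    """Concatenate both token sequences so a single joint encoder sees them together."""
    dims = {len(tok) for tok in image_tokens + text_tokens}
    if len(dims) != 1:
        raise ValueError("all token embeddings must share one dimension")
    return image_tokens + text_tokens


def late_fusion(branch_probs: Sequence[Optional[Vector]],
                weights: Sequence[float]) -> Vector:
    """Weighted average of per-modality class probabilities.

    A branch given as None is treated as a missing modality and skipped,
    so the remaining branches still produce a prediction.
    """
    present = [(p, w) for p, w in zip(branch_probs, weights) if p is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    total = sum(w for _, w in present)
    n_classes = len(present[0][0])
    return [sum(w * p[c] for p, w in present) / total for c in range(n_classes)]


if __name__ == "__main__":
    # Hypothetical embeddings: two image patches and one word, embedding size 2.
    image_tokens = [[0.1, 0.2], [0.3, 0.4]]
    text_tokens = [[0.5, 0.6]]
    print(len(early_fusion(image_tokens, text_tokens)))  # joint sequence length

    # Hypothetical class probabilities from two independent branches.
    camera = [0.8, 0.2]
    lidar = [0.4, 0.6]
    print(late_fusion([camera, lidar], weights=[1.0, 1.0]))
    print(late_fusion([camera, None], weights=[1.0, 1.0]))  # LiDAR branch missing
```

In the early-fusion case the joint sequence grows with every added modality, which is where the dimensionality cost comes from; in the late-fusion case a missing branch only removes one term from the average, which is the robustness property the abstract mentions.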
Abstract: Objective: Neoadjuvant therapy (NAT) has been established as a standard preoperative treatment paradigm for locally advanced breast cancer, with the objectives of reducing tumor burden, enhancing surgical resectability, and maximizing the probability of breast-conserving surgery. Therapeutic efficacy is conventionally assessed by pathological complete response (pCR) versus non-pCR status, determined through postoperative histopathological examination. However, the inherent latency of pathological assessment substantially limits its utility for timely clinical decision-making and adaptive treatment modification. Furthermore, breast tumors characteristically exhibit complex morphological architectures and heterogeneous enhancement kinetics on magnetic resonance imaging (MRI), particularly dynamic contrast-enhanced MRI (DCE-MRI), which poses considerable challenges for precise tumor delineation and quantitative response evaluation. Consequently, developing a robust and effective computational framework capable of simultaneously achieving accurate tumor segmentation and extracting clinically discriminative features for early NAT response prediction represents a critical unmet need with substantial clinical significance. Method: To address the aforementioned challenges, this study proposes a novel radiomics-guided semantic segmentation framework based on the classical U-Net architecture. The proposed model is designed to enhance feature representation, improve boundary delineation, and facilitate the extraction of discriminative features for downstream NAT response prediction. First, a discrete wavelet transform (DWT)-based pooling module is introduced to replace conventional downsampling operations in the encoder. Unlike standard pooling methods that often result in the loss of high-frequency information, the DWT-based pooling decomposes feature maps into multiple frequency sub-bands, including low-frequency approximation components and high-frequency detail components.
This multi-frequency decomposition enables the model to explicitly preserve edge information, fine structural details, and texture variations, which are essential for accurate tumor boundary delineation. As a result, the proposed module improves the model’s ability to maintain spatial consistency and enhances its sensitivity to subtle structural changes. Second, a Radiomics-Augmented Transformer (RAT) module is incorporated to strengthen the interaction between deep semantic features and radiomics-inspired representations. Conventional convolutional neural networks primarily capture local spatial patterns, whereas radiomics features describe global statistical and texture characteristics that are closely related to tumor heterogeneity and pathological properties. By integrating these complementary sources of information, the proposed model achieves a more comprehensive representation of tumor characteristics. Specifically, the RAT module employs a channel-wise selection mechanism to adaptively fuse low-level structural features with high-level semantic and texture representations. This mechanism allows the network to dynamically emphasize informative channels while suppressing redundant or irrelevant features, thereby improving feature discriminability and robustness. Third, a Spatial Cross-Attention (SCA) mechanism is introduced in the decoding stage to refine tumor contour representation and enhance intra-tumoral heterogeneity modeling. By capturing long-range spatial dependencies and enabling interactions between features at different spatial locations, the SCA module effectively integrates global contextual information with local details. This design improves segmentation consistency, especially in regions with ambiguous boundaries or complex structures, and enhances the overall quality of the reconstructed feature maps. 
The overall architecture follows an encoder–decoder paradigm, in which hierarchical feature representations are progressively extracted and refined through the integration of the aforementioned modules. The encoder focuses on multi-scale feature extraction with enhanced detail preservation, while the decoder reconstructs high-resolution segmentation maps with improved boundary accuracy. Beyond segmentation, the deep features learned by the model are further utilized for NAT response prediction, enabling a unified framework that bridges image segmentation and clinical outcome analysis. Notably, the proposed method supports prediction based on a single randomly selected time point from DCE-MRI, thereby reducing the dependency on fully aligned temporal sequences and improving its applicability in real-world clinical scenarios. Result: Comprehensive experiments were conducted on three publicly available breast MRI benchmark datasets to rigorously evaluate the effectiveness and cross-dataset generalization capability of the proposed method. Experimental results demonstrate that the proposed model consistently achieves state-of-the-art segmentation performance on both the BreastDM and ISPY1 datasets, with Dice similarity coefficients of 88.68% and 92.87%, respectively, indicating excellent spatial overlap with ground truth annotations. Regarding boundary precision, the proposed method yields substantial improvements in the 95th percentile Hausdorff Distance (HD95), with relative reductions of 19.7% and 7.2% over competing state-of-the-art approaches. These results underscore the superiority of the proposed framework in capturing fine-grained boundary details while preserving global structural integrity. Systematic ablation studies further validate the individual and synergistic contributions of each component within the proposed framework.
Notably, the Radiomics-Augmented Transformer (RAT) module plays a pivotal role in enhancing feature representation and boundary precision. When evaluated independently, the inclusion of the RAT module results in an additional 3.4% improvement in HD95, demonstrating its strong contribution to refining segmentation results and improving the modeling of tumor heterogeneity. Moreover, the combination of DWT-based pooling and SCA further enhances the overall performance, indicating the complementary nature of these modules. Furthermore, the deep features extracted by the proposed segmentation model demonstrate strong discriminative power for downstream NAT response prediction. Experimental results confirm that incorporating these learned features yields a substantial improvement in prediction accuracy over baseline approaches. Crucially, the demonstrated ability to achieve robust prediction based on a single randomly selected DCE-MRI time point highlights the flexibility and clinical practicality of the proposed method, substantially reducing the dependency on complete temporal imaging sequences and streamlining the data acquisition workflow in routine clinical practice. Conclusion: This study presents a novel radiomics-guided semantic segmentation framework that integrates multi-frequency feature decomposition, transformer-based radiomics-enhanced feature learning, and spatial cross-attention mechanisms within a unified end-to-end architecture. The proposed model achieves consistently superior performance in breast tumor segmentation, particularly in boundary delineation accuracy and fine-detail preservation, while simultaneously yielding highly discriminative features for NAT response prediction. By enabling reliable prediction from limited single-time-point imaging data, the method substantially reduces the dependency on complex multi-temporal imaging protocols and significantly enhances clinical applicability.
Overall, the proposed framework provides a promising and practical computational tool for supporting personalized treatment planning and facilitating evidence-based clinical decision-making in breast cancer management.
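The idea behind DWT-based pooling in the abstract above, halving resolution while keeping the high-frequency detail that ordinary pooling discards, can be sketched with a one-level 2D Haar transform. This is an illustrative assumption: the abstract does not state which wavelet the authors use or how the sub-bands are recombined, and the name `haar_dwt_pool` and the sub-band names are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]


def haar_dwt_pool(fmap: Grid) -> Tuple[Grid, Grid, Grid, Grid]:
    """One-level orthonormal 2D Haar transform of a single-channel feature map.

    Returns four half-resolution sub-bands:
      approx      - low-frequency approximation (2x the 2x2 block mean)
      row_detail  - top-row minus bottom-row differences
      col_detail  - left-column minus right-column differences
      diag_detail - diagonal differences
    """
    rows, cols = len(fmap), len(fmap[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")
    approx, row_detail, col_detail, diag_detail = [], [], [], []
    for i in range(0, rows, 2):
        a_row, r_row, c_row, d_row = [], [], [], []
        for j in range(0, cols, 2):
            a, b = fmap[i][j], fmap[i][j + 1]          # top-left, top-right
            c, d = fmap[i + 1][j], fmap[i + 1][j + 1]  # bottom-left, bottom-right
            a_row.append((a + b + c + d) / 2)
            r_row.append((a + b - c - d) / 2)
            c_row.append((a - b + c - d) / 2)
            d_row.append((a - b - c + d) / 2)
        approx.append(a_row)
        row_detail.append(r_row)
        col_detail.append(c_row)
        diag_detail.append(d_row)
    return approx, row_detail, col_detail, diag_detail


if __name__ == "__main__":
    # Toy 4x4 feature map with a vertical edge between columns 1 and 2.
    fmap = [[0.0, 0.0, 1.0, 1.0],
            [0.0, 0.0, 1.0, 1.0],
            [0.0, 1.0, 1.0, 1.0],
            [0.0, 0.0, 1.0, 1.0]]
    for name, band in zip(("approx", "row", "col", "diag"), haar_dwt_pool(fmap)):
        print(name, band)
```

Because the transform is orthonormal, the four sub-bands together carry exactly the same energy as the input, whereas average pooling keeps only a scaled copy of the approximation band. A network could, for instance, stack the four bands as channels before the next encoder stage; whether the paper does so is not stated in the abstract.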
Abstract: Objective: Human action recognition plays a critical role in intelligent surveillance, human–computer interaction, and assisted healthcare. Multimodal action recognition based on skeleton sequences and RGB videos has attracted increasing attention in recent years due to its ability to integrate complementary structural motion information and visual appearance cues. Skeleton data provide relatively stable structural priors of the human body and are less sensitive to background clutter and appearance variations, while RGB videos contain rich texture, context, and human–object interaction information that cannot be fully captured by skeletons alone. However, existing methods still face several challenges. First, most skeleton-based approaches built on graph convolutional networks rely on dense natural body topology, which is effective for modeling local motion patterns among adjacent joints but insufficient for capturing long-range dependencies among semantically related yet spatially distant joints. In many actions, discriminative cues depend not only on local movements but also on global coordination across distant body parts. Second, RGB-based methods are highly susceptible to background clutter, viewpoint variations, scale changes, and irrelevant regions, making it difficult to focus on action-relevant areas, especially in complex scenes. Third, many multimodal fusion methods adopt simple feature concatenation or modality-symmetric fusion strategies, which fail to fully exploit the complementary information across modalities and do not adequately reflect the structural stability of the skeleton modality. Method: To address these issues, this paper proposes a multimodal action recognition network based on dense–sparse skeleton joint representation-guided RGB image ROI localization. First, a dense–sparse skeleton joint representation framework is constructed.
The dense skeleton branch preserves the complete physical topology of the human body and is responsible for modeling local motion patterns among adjacent joints. Because it retains full joint connectivity, this branch is effective for describing local posture transitions and fine-grained motion dynamics. In contrast, the sparse skeleton branch retains only key joints, including the head, hands, elbows, knees, feet, and a trunk center. By compressing topological paths, it effectively enhances long-range dependency modeling across distant body parts. Compared with the dense branch, the sparse branch focuses more on global coordination relationships and the overall action structure. Through jointly learning dense and sparse skeleton representations, the model achieves unified modeling of local motion dynamics and global coordination relationships. In this way, the proposed framework improves the spatiotemporal representation capability of the skeleton modality by combining local structural continuity with long-range semantic interaction. Second, a Cross-Modal Attention-Based Coarse-to-Fine Skeleton-Guided RGB ROI Localization Strategy is proposed to enhance the discriminative capability of the RGB modality. Specifically, a two-stage coarse-to-fine guided ROI localization mechanism is designed. In the first stage, sparse skeleton features are used to perform coarse-grained ROI localization, where the overall action structure serves as a prior to guide the RGB branch to focus on the human body and major motion regions while suppressing large-scale background noise. This stage mainly emphasizes the action subject and broad movement-related regions, enabling the visual branch to quickly concentrate on the main action area. In the second stage, dense skeleton features are employed to conduct fine-grained ROI localization on the enhanced visual features, further emphasizing discriminative local regions such as hands, elbows, knees, and human–object interaction areas. 
Since dense skeleton features preserve richer local topology, they provide more precise structural guidance for highlighting subtle but important regions for action discrimination. The two stages therefore serve different but complementary purposes: the first provides global localization, while the second refines local details. Through this progressive process, the RGB modality is guided from coarse subject-level focus to fine-grained action-related region enhancement, which helps reduce the influence of irrelevant background information. In addition, the proposed ROI localization is performed in feature space rather than by explicitly cropping raw RGB frames, making the visual enhancement process more stable and flexible. Finally, a skeleton-dominant cross-modal gated fusion module is designed to achieve effective multimodal integration. Unlike conventional modality-symmetric fusion strategies, the proposed method treats skeleton features as the dominant representation and generates gating weights to adaptively reweight RGB features along the channel dimension. The rationale is that skeleton data provide more stable and task-relevant structural cues, whereas RGB information, although rich in appearance semantics, is also more vulnerable to environmental noise. Under this mechanism, RGB information is incorporated as a complementary modality conditioned on skeleton semantics, enabling the model to selectively utilize visual appearance cues that are beneficial for distinguishing fine-grained actions while suppressing irrelevant noise. In this way, the complementary relationship between skeleton and RGB modalities can be exploited more effectively. In addition, the entire framework is trained in a joint optimization manner with both primary and auxiliary supervision. The fusion branch is supervised by the main classification loss, while the skeleton and RGB branches are equipped with auxiliary losses to enhance feature discriminability and improve training stability. 
This training strategy helps optimize both fused representation learning and branch-specific representation quality. Result: Extensive experiments are conducted on three public benchmarks, including NTU-RGB+D 60, NTU-RGB+D 120, and UAV-Human. Experimental results show that the proposed method achieves 94.7% and 98.3% accuracy under the X-Sub and X-View protocols on NTU-RGB+D 60, and 92.82% and 93.91% under the X-Sub and X-Set protocols on NTU-RGB+D 120. On the UAV-Human dataset, it achieves 53.60% and 76.90% under the CSv1 and CSv2 protocols, respectively. Compared with existing skeleton-based and multimodal methods, the proposed approach demonstrates superior performance and stronger robustness under various evaluation settings. The results on the NTU datasets indicate that the proposed framework can effectively exploit complementary structural and visual information in standard benchmark scenarios. More importantly, its performance on UAV-Human further demonstrates its robustness in low-altitude UAV-view scenes, where the task is more challenging because of viewpoint changes, target scale variation, occlusion, and complex backgrounds. These results verify that the proposed framework is effective not only in relatively controlled indoor environments but also in more difficult open-view conditions. Ablation studies further validate the effectiveness of each component. The dense–sparse skeleton joint representation significantly outperforms single dense skeleton modeling, showing that combining dense topology with sparse key-joint topology is beneficial for capturing both local motion patterns and long-range dependency relationships. The coarse-to-fine guided ROI localization consistently achieves better performance than single-stage guidance, confirming that progressive visual guidance from global action-subject localization to local detail enhancement is more effective for action-related region modeling.
Moreover, the proposed skeleton-dominant cross-modal gated fusion module outperforms commonly used fusion strategies, including feature concatenation, matrix multiplication, and cross-attention. This comparison indicates that using the skeleton modality to adaptively regulate RGB contribution is more suitable for multimodal action recognition than treating both modalities equally. Category-level analysis shows that the RGB complementary branch provides more significant improvements for fine-grained and easily confused actions, further demonstrating the effectiveness of exploiting modality complementarity. In particular, actions involving subtle hand motion, object interaction, or visually similar body configurations benefit more from the proposed framework, because RGB information supplements details that skeleton data alone cannot adequately represent. Conclusion: The proposed method enhances skeleton modeling by jointly capturing local motion patterns and global coordination relationships through dense–sparse skeleton joint representation, improves the focus of the RGB modality on action-relevant regions via coarse-to-fine guided ROI localization, and achieves effective multimodal integration through a skeleton-dominant cross-modal gated fusion module. By integrating these three components into a unified framework, the model is able to make fuller use of the complementary advantages of skeleton sequences and RGB videos. Experimental results demonstrate that the proposed framework not only achieves superior performance on standard indoor benchmarks but also maintains strong robustness and generalization ability in complex UAV-view scenarios, providing an effective solution for multimodal human action recognition. Overall, the study shows that strengthening skeleton representation, progressively guiding visual attention, and designing a skeleton-dominant fusion mechanism are all important for improving multimodal action recognition performance in complex scenes.
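The skeleton-dominant gating described in the Method part of this abstract can be illustrated with a minimal Python sketch. It assumes pooled per-channel descriptors for each modality and a single hypothetical linear layer (weight, bias) that maps skeleton features to sigmoid gates. The additive combination of skeleton features and gated RGB features is also an assumption, since the abstract does not state how the reweighted RGB features are merged.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def skeleton_gated_fusion(skel, rgb, weight, bias):
    """Reweight RGB channels with gates computed from skeleton features only.

    skel, rgb: lists of C floats (pooled channel descriptors, hypothetical).
    weight:    C x C list of lists, bias: list of C floats
               (hypothetical learned parameters of the gating layer).
    Returns (fused, gates).
    """
    channels = len(skel)
    gates = []
    for c in range(channels):
        # The gate for channel c depends on the skeleton descriptor alone,
        # which is what makes the fusion skeleton-dominant (asymmetric).
        pre = bias[c] + sum(weight[c][k] * skel[k] for k in range(channels))
        gates.append(sigmoid(pre))
    # Assumed merge rule: skeleton features plus gated RGB features.
    fused = [skel[c] + gates[c] * rgb[c] for c in range(channels)]
    return fused, gates
```

With all-zero gating parameters every gate equals 0.5, so each RGB channel contributes at half strength; learned parameters would instead raise or lower individual channels depending on the skeleton input.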
Abstract: Objective: Accurate nuclear segmentation in digital pathology images is a fundamental yet challenging task in quantitative pathological analysis, playing a critical role in cancer diagnosis, grading, prognosis assessment, and treatment evaluation. Manual annotation of nuclei is labor-intensive, time-consuming, and suffers from poor reproducibility, necessitating automated segmentation methods. Data-driven deep learning approaches, particularly those based on convolutional neural networks and Transformer architectures, have achieved considerable success in this domain. These methods typically require large-scale annotated training datasets and can effectively handle semantic reasoning of nuclei. However, due to their inherent reliance on pixel-wise classification, they often struggle to accurately segment individual nuclei in challenging scenarios such as overlapping, clustered, or tightly packed structures, where spatial geometric information is crucial. For example, in histopathology images of breast or prostate tissues, nuclei frequently form dense clusters with ambiguous boundaries, making individual instance separation particularly difficult. In contrast, deep diffusion generative models offer a fundamentally different learning paradigm. Through an iterative forward noising and reverse denoising process, these models can implicitly capture the underlying spatial and geometric constraints of nuclear structures without requiring explicit shape priors. This capability makes them particularly promising for medical image segmentation tasks where structural integrity and boundary delineation are paramount. Nevertheless, directly applying diffusion models to nucleus segmentation faces two major challenges: the difficulty of incorporating discriminative priors to guide the generation process, and the high computational overhead associated with numerous denoising steps. 
To address these issues, this paper proposes a novel dual-condition diffusion model that synergistically integrates discriminative segmentation and generative diffusion for high-precision pathological image nucleus segmentation. Method: The proposed framework consists of three collaboratively designed components. First, based on the denoising diffusion probabilistic model, the framework constructs a dual-condition diffusion network that reformulates the nucleus segmentation task as a conditional mask generation problem. Unlike standard diffusion models that predict noise at each timestep, this network directly predicts refined segmentation masks, enabling more efficient and stable training. Specifically, the network takes the initial coarse mask produced by the discriminative segmenter as input and iteratively refines it through a series of denoising steps. Second, to address the limitation that diffusion models lack explicit spatial awareness, the model leverages attention mechanisms to design spatial and semantic complementary priors. These priors capture both local texture information and global contextual relationships of nuclear structures. The spatial prior focuses on edge and boundary information, while the semantic prior encodes category-level consistency across the entire nucleus region. They are embedded into each time step of the diffusion learning process, serving as conditional guidance that directs the diffusion process to focus on individual nucleus regions rather than being distracted by background clutter or ambiguous boundaries. This design ensures that the generative refinement remains grounded in meaningful anatomical constraints throughout all denoising steps. Third, recognizing the computational inefficiency of traditional diffusion models, the model introduces a random latent embedding strategy. 
Specifically, Gaussian-sampled latent embeddings are injected into each layer of the diffusion refiner, enabling the model to learn denoising mappings with larger step sizes. This strategy significantly reduces both the number of training timesteps and the number of sampling timesteps required during inference, achieving a favorable trade-off between segmentation accuracy and computational efficiency without compromising generation quality. The three components are trained jointly in an end-to-end manner, allowing the discriminative segmenter to provide high-quality spatial initializations and the diffusion refiner to leverage complementary priors for structural optimization. Result: Extensive experiments are conducted on publicly available multi-organ and multi-disease histopathology datasets, including breast, colon, kidney, lung, prostate, and stomach tissues, all with pixel-level nucleus annotations. The proposed model is compared against a comprehensive set of state-of-the-art methods, including approaches based on convolutional neural networks, Transformer-based architectures, and recent diffusion-based segmentation models. Quantitative evaluation employs two widely accepted metrics: the Dice similarity coefficient and the mean Intersection over Union. Experimental results consistently demonstrate that the proposed model outperforms all existing data-driven and diffusion-based deep segmentation methods across multiple organs, surpassing the second-best model by 1.03 percentage points and the most recent model by 2.17 percentage points in mean Intersection over Union. Consistent improvements are observed on all tested datasets, confirming the robustness and generalizability of the approach across different tissue types. Ablation studies are conducted to systematically validate the contribution of each core module. 
Removing the discriminative segmenter, the diffusion refiner, or the complementary prior embedding each leads to noticeable performance degradation, confirming that all three components play essential and non-redundant roles. By coupling discriminative segmentation with generative learning, that is, by integrating discriminative initialization, generative refinement, and spatial-semantic guidance, the model achieves a substantial improvement of 3.68 percentage points in mean Intersection over Union on multi-organ image data compared to the baseline. These gains are both statistically significant and practically meaningful for clinical applications. Qualitative results further illustrate that the model produces segmentation masks with sharper boundaries, fewer isolated false positives, and better separation of touching and overlapping nuclei compared to competing methods. Conclusion: This paper addresses the challenging problem of segmenting complex overlapping and clustered nucleus structures in digital pathology images by proposing a dual-condition diffusion model that integrates discriminative segmentation and generative diffusion in a complementary manner. The framework introduces three key innovations: a dual-condition diffusion network formulating segmentation as conditional mask generation, spatial and semantic complementary priors embedded via attention mechanisms to guide the diffusion process, and a random latent embedding strategy that reduces computational overhead. Extensive experimental results demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches. Ablation studies confirm the necessity and synergy of the three core components. The model achieves a favorable balance between segmentation accuracy and computational efficiency, making it suitable for potential clinical deployment. 
While the current work focuses on nucleus segmentation, the proposed paradigm may also be applicable to other medical image segmentation tasks requiring precise structural delineation under challenging conditions. Despite the promising results, the proposed method has one limitation. The iterative denoising process incurs higher computational cost compared to single-stage discriminative methods. Future work will focus on model compression techniques, such as knowledge distillation and pruning, as well as more efficient sampling strategies to reduce inference time while maintaining segmentation accuracy.
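The two evaluation metrics named in the Result part of this abstract, the Dice similarity coefficient and Intersection over Union, can be computed for a single pair of binary masks as in the following sketch. The function name and the convention of returning 1.0 when both masks are empty are assumptions made for illustration. Mean Intersection over Union would average such values over classes or images; the exact averaging used in the paper is not stated in the abstract.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flattened binary masks.

    pred, target: equal-length lists of 0/1 values (1 = nucleus pixel).
    Returns (dice, iou).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_area = sum(pred)
    target_area = sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty; treating this as a perfect match is a
        # convention chosen for this sketch.
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou
```

For the same pair of masks the two scores are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU.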
Abstract: Objective: Synthetic Aperture Radar (SAR) to optical image translation (SAR-to-Optical, S2O) is essential for all-day, all-weather Earth observation. Optical images provide rich texture information and intuitive visual representations, which are important for a wide range of downstream tasks such as land-cover classification and urban monitoring. However, optical sensors are highly vulnerable to clouds, fog, and illumination changes, limiting continuous data acquisition. In contrast, SAR sensors operate in the microwave spectrum and can acquire images in all weather conditions, day and night. Consequently, S2O translation can significantly enhance the interpretability of SAR data and facilitate multimodal remote sensing applications. Despite recent progress in cross-modal image translation, achieving high-quality S2O translation remains challenging. First, SAR images inevitably contain multiplicative speckle noise that is strongly coupled with structural information. Together with the substantial semantic gap caused by the fundamentally distinct imaging mechanisms of SAR and optical sensors, this leads to insufficient cross-modal feature fusion and representation capability. Second, although diffusion models have recently demonstrated impressive performance in image generation tasks, their high computational overhead and inefficient conditional fusion mechanisms limit their applicability for real-time processing on resource-limited remote sensing platforms. To address these issues, we propose a Spatial-Frequency Strongly-Sparse Guided Diffusion Model (SFSG-Diff) for efficient and high-quality SAR-to-optical image translation. Method: The proposed SFSG-Diff framework follows the standard diffusion generation paradigm, consisting of a forward diffusion process and a reverse denoising process. In the forward process, Gaussian noise is progressively added to a clean optical image until the data distribution approaches pure noise. 
During the reverse process, a conditional denoising network iteratively removes the noise to reconstruct the target optical image. To improve both generation quality and computational efficiency, SFSG-Diff introduces spatial-frequency guided feature extraction and a strongly sparse conditional fusion mechanism. To effectively suppress SAR speckle noise and enhance structural representation, a Multi-scale Spatial-Frequency Denoising Encoder (MDE) is designed to extract robust conditional features from the input SAR image. The encoder adopts a dual-branch architecture composed of a spatial feature extraction branch and a frequency feature extraction branch. The spatial branch focuses on capturing local textures and contextual information through multi-scale convolutional operations, while the frequency branch transforms SAR images into the frequency domain to capture global structural information that is less sensitive to speckle noise. Since speckle noise is mainly concentrated in high-frequency components, whereas meaningful structural information is typically distributed in low and mid-frequency regions, the spatial-frequency joint representation enables effective separation of noise and structural features. The outputs of both branches are fused across multiple scales to produce robust conditional representations that guide the diffusion model during image generation. To further improve conditional guidance efficiency, we introduce a lightweight Strong-Sparse Semantic Fusion (SIF) pattern. Instead of densely integrating conditional features into all feature channels, the SIF module performs channel-wise selection and adaptive attention-based modulation to identify and fuse the most informative feature components. This sparse guidance mechanism not only reduces computational overhead but also improves cross-modal semantic alignment by emphasizing the most relevant structural cues. 
Furthermore, to enhance training stability and generation quality, we adopt a two-stage training strategy with a joint loss that combines a simplified mean squared error loss, a perceptual loss, a focal frequency loss, and an adversarial loss. Result: To evaluate the performance of the proposed framework, extensive experiments are conducted on three publicly available datasets: SEN1-2, QXS-SAROPT, and WHU-OPT-SAR. The experimental results demonstrate that SFSG-Diff consistently outperforms several state-of-the-art GAN-based and diffusion-based image translation models across multiple evaluation metrics. On the SEN1-2 dataset, the proposed SFSG-Diff achieves a PSNR improvement of 0.77 dB over the best competing method, while the Structural Similarity Index (SSIM) improves by 17.8%. In addition, the perceptual quality metrics show substantial improvements, with LPIPS and Fréchet Inception Distance (FID) reduced by 8.0% and 17.6%, respectively. These results indicate that the proposed model not only improves reconstruction fidelity but also produces images with higher perceptual realism. Similar performance gains are observed on the QXS-SAROPT and WHU-OPT-SAR datasets, which include more complex urban structures and higher spatial resolution imagery. The proposed SFSG-Diff demonstrates strong robustness across diverse scenes and maintains consistent advantages in both structural similarity and perceptual quality. Visual comparisons further confirm the superiority of SFSG-Diff. Compared with existing methods, SFSG-Diff generates optical images with clearer structural boundaries, more realistic textures, and fewer noise artifacts. In particular, the model exhibits improved performance in challenging regions such as dense urban areas and vegetation-rich scenes. 
In addition to generation quality, the computational efficiency of the proposed framework is also evaluated. Owing to the lightweight SIF pattern and the two-stage training strategy, SFSG-Diff achieves significantly faster inference compared with conventional diffusion models. The average inference time per image is approximately 0.21 seconds, which represents a 69.1% reduction in computation time compared with representative diffusion-based baselines. Conclusion: We present SFSG-Diff, a novel framework for one-step S2O translation, which addresses key challenges in cross-modal image generation by leveraging spatial-frequency joint feature modeling and lightweight Strong-Sparse conditional guidance. The MDE effectively suppresses speckle noise and enhances structural feature representation, while the SIF pattern improves cross-modal alignment and reduces computational complexity. Extensive experiments on multiple benchmark datasets demonstrate that the proposed method significantly outperforms existing approaches in both quantitative metrics and visual quality. Moreover, the framework achieves substantial improvements in inference efficiency, making it more suitable for practical remote sensing applications with limited computational resources. Overall, the proposed SFSG-Diff framework provides an effective solution for robust and efficient SAR-to-optical image translation and offers new insights into the integration of spatial-frequency modeling and sparse conditional guidance within diffusion-based generative models.
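The forward process described in the Method part of this abstract, in which Gaussian noise is progressively added to a clean optical image, has a well-known closed form in denoising diffusion probabilistic models: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the cumulative product of (1 - beta_s) up to step t and eps is standard Gaussian noise. The sketch below illustrates this generic formulation only. The linear beta schedule, its endpoints, and the function names are assumptions, not details taken from SFSG-Diff.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products abar_t of (1 - beta_t) for a linear beta schedule.

    The schedule endpoints are common defaults, assumed here for illustration.
    """
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(x0, alpha_bar, noise=None, rng=random):
    """Sample x_t from x_0 in one shot using the closed-form forward process.

    x0:        list of pixel values of the clean image (flattened).
    alpha_bar: cumulative signal-retention factor abar_t for the chosen step.
    noise:     optional list of Gaussian samples; drawn from rng if omitted.
    """
    if noise is None:
        noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
```

As abar_t shrinks toward zero over the schedule, the sample is dominated by the noise term, which matches the statement that the data distribution approaches pure noise.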
JI Jiaxin, GUO Xingge, YANG Fazhan, WANG Jiang, ZHAO Peipei, XIAO Tao
DOI:10.11834/jig.260189
Abstract: Objective: Visual saliency prediction aims to simulate the human visual attention mechanism by estimating the spatial probability distribution of gaze points in complex scenes. This task is fundamental to various downstream applications, including autonomous driving, robot navigation, and image compression. While Convolutional Neural Networks (CNNs) have long been the backbone of saliency modeling due to their proficiency in local feature extraction and multi-scale representation, they are inherently limited by their local receptive fields. This restriction often leads to "fixation fragmentation" and background false alarms in cluttered environments. Conversely, Transformer-based models have introduced global dependency modeling via self-attention; however, their quadratic computational complexity becomes a significant bottleneck when processing high-resolution saliency maps. Recently, State Space Models (SSMs), specifically Mamba, have emerged as a promising alternative, offering linear-time complexity while maintaining long-range context. Nevertheless, standard Mamba architectures are designed for 1D sequences, which inherently disrupts the 2D spatial topological continuity of images. Furthermore, during long-range state propagation, background noise is often indiscriminately amplified alongside salient features. Additionally, the continuous bilinear upsampling typically employed in decoding stages causes a "structural over-smoothing" effect, which diminishes the peak energy of salient regions and blurs boundaries. To address these multi-faceted challenges, this study proposes a novel visual saliency prediction network named Spatio-Spectral Uncertainty-Gated Mamba Network (S²UG-Mamba). 
The goal is to achieve an optimal synergy between global context modeling, noise suppression, and high-frequency structural restoration within a linear computational framework. Method: The proposed S²UG-Mamba follows a "backbone extraction, encoder enhancement, multi-scale decoding, frequency modulation, and prediction" pipeline. We employ an ImageNet-pretrained ConvNeXt-Tiny as the backbone to extract multi-level features. In the encoder enhancement phase, we introduce the Uncertainty-Aware State Space Enhancement Module (UA-SSM). To preserve 2D spatial continuity, UA-SSM utilizes a bidirectional orthogonal serpentine scanning strategy, which unfolds features in both horizontal and vertical directions to capture omnidirectional contextual interactions. To mitigate the propagation of background noise, we design a dual-view uncertainty proxy within UA-SSM. This proxy consists of a spatial variance component (capturing local texture fluctuations in 3×3 windows) and a channel variance component (measuring semantic activation disagreements). These proxies are fused to generate a pixel-level confidence gate G, which dynamically modulates the Mamba input sequence, thereby suppressing responses in high-uncertainty background regions. In the decoding stage, to counteract the smoothing effects of upsampling, we propose the Semantic-Guided Dynamic Frequency Modulation (SDFM) module. Unlike conventional static filters, SDFM employs a joint logical routing mechanism that combines deep semantic priors from the encoder with local content features from the decoder. This routing mechanism dynamically synthesizes a set of learnable complex frequency filters. The feature maps are transformed into the frequency domain via a real Fast Fourier Transform (rFFT) after StarReLU activation, multiplied by the adaptive filters, and then mapped back to the spatial domain using an inverse rFFT. 
This residual frequency compensation explicitly restores high-frequency structural details and tightens the energy concentration of fixation peaks. The entire network is optimized using a joint loss function of Kullback-Leibler (KL) divergence and Linear Correlation Coefficient (CC). Result: Extensive experiments were conducted on five benchmark datasets: SALICON, MIT300, MIT1003, CAT2000, and TORONTO. On the SALICON test set (LSUN'17), S²UG-Mamba achieved state-of-the-art performance, reducing the KL divergence to 0.176 while reaching a CC of 0.918 and an NSS of 2.007. On the challenging MIT300 blind test set, our model yielded a CC of 0.829 and a KL of 0.370, outperforming the advanced GSGNet by 2.2% and 9.8%, respectively. To evaluate cross-domain generalization, we performed zero-shot testing on the MIT1003 dataset using weights trained solely on SALICON; S²UG-Mamba attained an NSS of 2.386 and a CC of 0.6849, demonstrating robust learning of universal saliency priors. Ablation studies confirmed the efficacy of each component: the introduction of UA-SSM improved the NSS from 1.9627 to 2.0152, while SDFM significantly reduced the KL divergence. The complexity and efficiency analysis demonstrates that our model achieves favorable performance in terms of computational cost and GPU memory consumption. Meanwhile, it maintains comparable parameter scale and inference speed with state-of-the-art methods, fully validating the efficient modeling capability of the Mamba architecture. Qualitative visualizations further illustrated that S²UG-Mamba effectively suppresses non-salient background clutter in densely textured scenes and produces sharper, more compact salient boundaries compared to existing methods. Conclusion: This study presents S²UG-Mamba, a linear-complexity network that synergistically integrates uncertainty-aware state space modeling and semantic-guided frequency modulation for visual saliency prediction. 
By employing orthogonal scanning and a dual-view uncertainty gating mechanism, the encoder efficiently captures long-range dependencies while adaptively suppressing noise. Simultaneously, the SDFM leverages high-level semantic guidance to perform adaptive frequency filtering, effectively restoring structural details lost during spatial upsampling. The experimental results across multiple benchmarks validate that S²UG-Mamba provides a superior balance between prediction accuracy, cross-domain robustness, and computational efficiency. Future research will explore hardware-friendly parallel 2D scanning algorithms and low-annotation learning paradigms to further enhance the model’s practical deployment potential.
Keywords: saliency prediction; State Space Model; Mamba; uncertainty-aware; frequency domain enhancement
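The serpentine scanning mentioned in the Method part of this abstract can be pictured with a small Python sketch that flattens a 2D grid into 1D sequences while keeping consecutive elements spatially adjacent. This is one plausible reading of "bidirectional orthogonal serpentine scanning": a row-wise and a column-wise serpentine path, each also traversed in reverse. The function names and the exact ordering are assumptions, since the abstract does not specify them.

```python
def serpentine_rows(grid):
    """Row-wise serpentine: left-to-right on even rows, right-to-left on odd rows."""
    sequence = []
    for row_index, row in enumerate(grid):
        if row_index % 2 == 0:
            sequence.extend(row)
        else:
            sequence.extend(reversed(row))
    return sequence


def serpentine_cols(grid):
    """Column-wise serpentine: the same path applied to the transposed grid."""
    transposed = [list(column) for column in zip(*grid)]
    return serpentine_rows(transposed)


def orthogonal_serpentine_scans(grid):
    """Return four 1D orderings: two orthogonal paths, each forward and reversed."""
    horizontal = serpentine_rows(grid)
    vertical = serpentine_cols(grid)
    return [horizontal, horizontal[::-1], vertical, vertical[::-1]]
```

Unlike a plain raster scan, which jumps from the end of one row to the start of the next, each step along a serpentine path moves to a neighbouring cell, which is why such a path preserves more of the 2D spatial continuity.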
Ding Chen, Zhang Jingbo, Hao Xiaofeng, Zheng Sirui, Yan Song
DOI:10.11834/jig.260214
Abstract: Objective: Hyperspectral images (HSIs) capture reflectance values across hundreds of contiguous spectral bands at each spatial location, providing rich spectral–spatial information that enables precise discrimination of materials with subtle spectral differences. This high spectral resolution makes HSIs particularly valuable for environmental monitoring, land cover mapping, precision agriculture, and change detection applications. Change detection (CD) in multitemporal HSIs aims to identify meaningful surface alterations by comparing images of the same scene acquired at different times. Accurate HSI-CD plays a critical role in ecological monitoring, urban expansion analysis, and disaster assessment. However, the task faces significant challenges, including high dimensionality, spectral redundancy, spatial heterogeneity, and the difficulty of effectively integrating spatial, spectral, and temporal features. Traditional methods, such as change vector analysis (CVA), principal component analysis (PCA), and multivariate alteration detection (MAD), rely on hand-crafted features and empirical thresholding. Although computationally efficient, these approaches are sensitive to noise, illumination variations, and atmospheric effects. Deep learning methods, particularly convolutional neural networks (CNNs), have demonstrated superior performance by automatically learning hierarchical features. Most CNN-based frameworks adopt dual-branch Siamese architectures, yet they still struggle to capture long-range dependencies and explicit temporal dynamics. Method: To address these limitations, a novel Time-Difference-Guided Network (TDG-Net) is proposed. The method employs a Siamese architecture with Vision Mamba as the feature extraction backbone to efficiently capture long-range spatial dependencies with linear computational complexity. Two core components are integrated: the Time Storage Module (TSM) and the temporal difference guidance strategy. 
Bi-temporal HSIs are first fed into the dual-branch Vision Mamba backbone to extract multi-level spatial–spectral features. The TSM performs sequential temporal modeling on these features using a simplified long short-term memory (sLSTM). To reduce computational cost, spectral compression via 1×1 convolution and spatial down-sampling via max pooling are applied before feeding features into the sLSTM. The sLSTM generates explicit temporal difference representations. The temporal difference guidance strategy then converts these low-resolution difference features into spatial attention weights through bilinear up-sampling and ReLU activation. These weights are fed back to the dual-branch network via residual connections, adaptively emphasizing change-relevant regions and suppressing unchanged areas at each hierarchical level. Finally, the enhanced multi-level features are fused and passed to the classification head to produce the binary change detection map. To mitigate severe class imbalance, Dice Loss is employed instead of conventional cross-entropy loss, directly optimizing the overlap between predicted and ground-truth change regions. Compared with recent Mamba-based methods, the core distinction of TDG-Net lies in its explicit modeling and hierarchical guidance of temporal differences across multiple feature levels via a lightweight TSM and residual feedback, rather than relying primarily on high-level implicit fusion. Result: Comprehensive experiments were conducted on three widely used benchmark hyperspectral datasets: River, Farmland, and Hermiston. All experiments were implemented on an NVIDIA RTX 3090 GPU using TensorFlow-GPU 2.5.0. Performance was evaluated using Overall Accuracy (OA) and Kappa coefficient, with results averaged over 10 independent runs. Ablation studies confirmed the significant contributions of the TSM and temporal difference guidance strategy. 
Comparative experiments against state-of-the-art methods demonstrated that TDG-Net consistently outperformed all competitors. On the River dataset, TDG-Net achieved an OA of 96.54% and a Kappa of 78.51%. On the Hermiston dataset, it reached an OA of 98.21% and a Kappa of 91.79%. On the Farmland dataset, it attained an OA of 95.87% and a Kappa of 90.11%. Additional analysis on model complexity shows that TDG-Net maintains competitive parameter count, FLOPs, and inference time while achieving superior accuracy. Conclusion: The proposed TDG-Net effectively addresses the key limitations of existing hyperspectral image change detection methods by explicitly modeling temporal differences at multiple hierarchical levels and guiding feature learning with adaptive attention. By integrating a lightweight Time Storage Module based on simplified LSTM and a temporal difference guidance strategy within a Vision Mamba backbone, the method captures rich spatio-spectral-temporal dynamics while effectively suppressing pseudo-changes. The introduction of Dice Loss further alleviates the severe class imbalance problem. Extensive experiments on three benchmark datasets demonstrate that TDG-Net achieves state-of-the-art performance in both quantitative metrics and visual quality, exhibiting strong robustness and generalization in complex scenarios. Compared with existing Mamba-based methods such as SAVDGN and CDMamba, which primarily rely on implicit high-level fusion, TDG-Net explicitly models and hierarchically guides temporal differences across multiple feature levels through a lightweight TSM and residual feedback mechanism, achieving more fine-grained change representation.
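The two metrics reported in the Result part of this abstract, Overall Accuracy (OA) and the Kappa coefficient, can be computed from a binary change map as in the sketch below. It uses the standard definition of Cohen's kappa, (OA - p_e) / (1 - p_e), where p_e is the agreement expected by chance. The function name and the handling of the degenerate case p_e = 1 are assumptions made for illustration.

```python
def oa_and_kappa(pred, truth):
    """Overall Accuracy and Cohen's kappa for binary change detection.

    pred, truth: equal-length lists of 0/1 labels (1 = changed pixel).
    Returns (oa, kappa) as fractions in [0, 1] (kappa may be negative).
    """
    if len(pred) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(truth)
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    oa = (tp + tn) / n
    # Chance agreement: product of predicted and true class frequencies,
    # summed over the two classes.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if p_e == 1.0:
        # Only one class is present in both maps; kappa is undefined, and
        # returning 1.0 is a convention chosen for this sketch.
        return oa, 1.0
    kappa = (oa - p_e) / (1.0 - p_e)
    return oa, kappa
```

Because changed pixels are usually rare, a map that predicts "unchanged" everywhere can still reach a high OA, whereas its kappa stays near zero. This is why Kappa values in the abstract are noticeably lower than the corresponding OA values.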
Abstract: Objective: Aerial RGB-IR object detection has received increasing attention in remote sensing because visible and infrared images provide complementary information under complex imaging conditions. Visible images preserve rich texture, color, and structural details, whereas infrared images are less sensitive to low illumination and can highlight thermal targets at night or in low-light scenes. However, the contribution of each modality changes with imaging factors, including illumination, exposure status, texture clarity, target-background contrast, background clutter, and artificial light interference. Existing methods have improved detection performance through cross-modal alignment, feature interaction, attention reweighting, and multi-scale fusion. Nevertheless, most of them still focus on feature-level fusion and lack explicit modeling of modality reliability. In other words, they do not sufficiently estimate how reliable each modality is under the current imaging condition, or how much each modality should contribute to detection. Some language-guided and condition-aware methods introduce semantic cues into multimodal detection. However, they usually rely on coarse scene descriptions, category-level prompts, or additional large-model branches during inference. These strategies are insufficient for characterizing modality quality attributes that are directly related to detection reliability. They may also increase the deployment burden on computation-constrained aerial platforms. To address these issues, this paper proposes a modality reliability modeling method for aerial RGB-IR object detection. The method transfers modality quality perception from a vision-language model to the detector during training and enables adaptive multimodal fusion without additional large-model inference cost. Methods: The proposed method consists of structured modality quality description, semantic prior distillation, and reliability-aware adaptive fusion. 
First, a modality quality attribute description dataset is constructed for UAV-oriented aerial scenes. It provides structured supervision for modality reliability learning. Instead of using only category labels or coarse scene tags, the annotation scheme explicitly describes key imaging factors that affect detection performance in both modalities. For the RGB modality, the attributes include illumination condition, exposure status, texture clarity, artificial light interference, and background clutter. For the infrared modality, the attributes include target-background contrast, boundary clarity, and background cleanliness. Second, a vision-language model is used to encode the modality quality descriptions and generate semantic priors related to RGB and infrared reliability. These priors are used only during training. By combining semantic distillation with attribute supervision, the detector is guided to learn detection-oriented reliability representations. Thus, modality quality perception is internalized into the visual detection network. Third, a global-local adaptive fusion mechanism is designed based on the learned reliability representations. Global scene reliability captures the overall effectiveness of RGB and infrared cues under the current imaging condition. Local spatial reliability further adjusts the modality contribution at different spatial positions. Therefore, the detector can dynamically fuse RGB and infrared features according to both scene-level and region-level reliability. Since semantic priors are used only during training, the proposed framework does not require additional text prompts, language branches, or large-model participation during inference. Results: Experiments are conducted on DroneVehicle and VEDAI, two public aerial RGB-IR object detection datasets. On DroneVehicle, the proposed method achieves 79.7% mAP@0.5 and 53.7% mAP@0.5:0.95. On VEDAI, it achieves 67.1% mAP@0.5 and 30.1% mAP@0.5:0.95. 
The method also shows stronger robustness in challenging scenarios, especially under nighttime, low-light, and complex interference conditions. Ablation studies verify the effectiveness of modality quality attribute modeling, semantic prior distillation, and joint global-local modality reliability modeling. In addition, the proposed method maintains good inference efficiency because no extra large-model branch is introduced during testing. Conclusion: This paper presents a modality reliability modeling method for aerial RGB-IR object detection. The method uses a vision-language model only during training to encode modality quality descriptions and provide semantic supervision. Through semantic distillation and attribute supervision, the detector learns reliability-aware representations. By jointly modeling global scene reliability and local spatial reliability, the detector can adaptively adjust the contributions of visible and infrared modalities under varying imaging conditions. Experimental results on DroneVehicle and VEDAI demonstrate that the proposed method improves detection accuracy and robustness, especially in nighttime, low-light, and cluttered scenes.
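The global-local fusion described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the scene-level and position-level reliabilities are combined, so the sketch assumes they are added as unnormalised scores and passed through a two-way softmax. The function names (`softmax2`, `fuse`) and the single-channel H x W feature maps are hypothetical simplifications, not the authors' implementation.

```python
import math


def softmax2(a, b):
    """Turn two unnormalised scores into two weights that sum to 1."""
    m = max(a, b)  # subtract the maximum for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


def fuse(rgb, ir, global_scores, local_rgb, local_ir):
    """Fuse two H x W feature maps using global and local reliability.

    global_scores: (s_rgb, s_ir), one scene-level score per modality.
    local_rgb, local_ir: H x W maps of per-position scores.
    Assumption: global and local scores are added before the softmax.
    """
    g_rgb, g_ir = global_scores
    fused = []
    for y in range(len(rgb)):
        row = []
        for x in range(len(rgb[0])):
            w_rgb, w_ir = softmax2(g_rgb + local_rgb[y][x],
                                   g_ir + local_ir[y][x])
            row.append(w_rgb * rgb[y][x] + w_ir * ir[y][x])
        fused.append(row)
    return fused
```

With equal scores the result is the plain average of the two modalities. A strongly negative RGB global score (for example, a night scene) pushes the fused features toward the infrared map, and a high local RGB score restores the RGB contribution at that position only.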
Zhang Benshuo, Ke Feng, An Zhiyong, Yu Xiaoning, Han Zhongwei, Zhao Feng
DOI:10.11834/jig.260158
Abstract: Wildfires occur frequently worldwide and pose severe threats to ecosystem stability, regional climate systems, and human life and property. In recent years, against the backdrop of global warming, adverse meteorological conditions such as extreme heat, prolonged drought, and anomalous wind fields have become more frequent, further increasing the risk of wildfires. In particular, under the combined influence of complex terrain, hot and dry winds, and abrupt meteorological changes, fires can rapidly evolve from weak initial thermal anomalies into large-scale disasters, causing forest resource loss, increased carbon emissions, ecological degradation, and severe socioeconomic impacts. Therefore, achieving rapid and accurate detection of early wildfire events over large spatial extents has become an important research objective in remote-sensing-based disaster monitoring and a key issue for improving early warning and emergency response capabilities. Compared with ground-based monitoring and airborne observations, satellite remote sensing provides broad spatial coverage, high observation efficiency, and continuous monitoring capability over large regions. Among available platforms, geostationary meteorological satellites are particularly valuable for near-real-time wildfire monitoring because they can repeatedly observe the same region at short temporal intervals. Compared with polar-orbiting satellites, Himawari-8/9 provides continuous observations with a temporal resolution of 10 min and 16 multispectral channels, making it an important data source for operational wildfire monitoring. Benefiting from this high-frequency temporal sampling capability, Himawari-8/9 can capture thermal anomaly responses at the moment of fire ignition. Furthermore, it continuously characterizes the dynamic evolution of fire pixels from emergence to intensification and expansion. This capability is especially important for early wildfire detection. 
Many early fire pixels are extremely small, exhibit weak thermal anomaly signals, and are not visually salient in a single image. Their identification often depends on variation trends and cumulative characteristics across continuous time series. Existing fire detection methods based on Himawari-8/9 data mainly include traditional threshold-based methods, spatial contextual approaches, and, more recently, deep-learning-based methods. Traditional multi-temporal threshold methods and contextual algorithms usually rely on manually designed decision rules derived from brightness temperature, band differences, and neighborhood background statistics. These methods are physically interpretable, easy to implement, and computationally efficient, and therefore perform well in some typical scenarios. However, they remain prone to false alarms and missed detections. This is especially true under complex surface backgrounds, cloud contamination, smoke interference, and extremely weak thermal anomalies in the early stage of wildfire development. For example, bare land, high-albedo surfaces, urban heat sources, cloud edges, and terrain shadows may produce radiometric responses similar to those of fire pixels, thereby weakening the discriminative capability of rule-based methods. In addition, fixed thresholds or semi-empirical rules often lack sufficient generalization across different regions, seasons, and land-cover types. With the rapid development of deep learning, convolutional neural networks, recurrent neural networks, and Transformer-based models have gradually been introduced into wildfire detection tasks. Compared with traditional methods, these approaches can automatically learn more discriminative spatial and temporal features in a data-driven manner, thereby reducing dependence on handcrafted rules and improving adaptability in complex scenes. 
Nevertheless, they still exhibit clear limitations for early wildfire detection based on long time-series Himawari-8/9 imagery. Convolutional neural networks mainly emphasize local spatial representations and remain limited in modeling long-term dynamic dependencies during wildfire evolution. Recurrent neural networks and their variants possess sequence modeling capability, but when handling long sequences at high temporal resolution, they are prone to gradient vanishing and long-term memory decay. This issue becomes more pronounced in scenarios involving 24 h of continuous observation and 144 temporal steps. Transformers can model global dependencies and long-range feature interactions. However, their computational complexity generally increases quadratically with sequence length. This imposes considerable efficiency pressure on near-real-time geostationary satellite monitoring. Therefore, early wildfire detection from long time-series Himawari-8/9 imagery still requires an efficient framework that can simultaneously capture weak spatial thermal anomalies and model long-range temporal evolution. To address these challenges, this study proposes Fire-Mamba, an early wildfire detection framework that integrates multi-scale spatial representation with long-sequence dynamic modeling. Fire-Mamba is designed for early wildfire targets in Himawari-8/9 imagery. These targets are typically extremely small, weak in thermal radiation variation, and highly susceptible to complex background interference. The framework aims to improve the accuracy and stability of near-real-time wildfire detection while maintaining manageable computational cost. Its design is motivated by two observations. First, early wildfire pixels usually manifest as subtle local thermal anomalies and lack salient shape or texture cues. Second, genuine fire pixels often cannot be reliably distinguished from transient background disturbances using a single image. 
Instead, their identification depends on variation trends across continuous observations. Specifically, the B07, B14, and B03 bands are selected to construct multispectral temporal inputs, and day–night cloud masking is performed using multi-band threshold rules to reduce interference from clouds and complex backgrounds. According to the 10 min observation interval of Himawari-8/9, a continuous sequence of length 144 is constructed to represent a 24 h temporal window, thereby providing sufficient support for modeling wildfire dynamics. Compared with short-window strategies, such a long time-series input can reflect not only the instantaneous thermal anomaly state of candidate fire pixels, but also their continuous intensification, spatial expansion, and temporal persistence. In the spatial feature extraction stage, a Multi-Scale Contextual Thermal Anomaly-aware Module (MCTAM) is designed to enhance the representation of weak fire-induced thermal anomalies under complex backgrounds. MCTAM employs a 3 × 3 depthwise separable convolution to extract local thermal-gradient features induced by weak fire signals. This enables the capture of fine-grained local brightness-temperature variations and spatial discontinuities around candidate fire pixels. Meanwhile, a 7 × 7 convolution branch is introduced to capture broader background contextual information. This branch suppresses pseudo-anomalous responses caused by non-fire heat sources, heterogeneous land-cover patterns, or complex textures. Global average pooling and a multilayer perceptron are further used to generate channel-wise weights for adaptive feature recalibration, enabling the model to enhance fire-related features while suppressing background interference and isolated noise. 
In the temporal modeling stage, the selective state-space model Mamba is introduced to perform long-range dynamic modeling on pixel-wise temporal features, thereby improving the recognition of weak early thermal anomalies and their evolutionary trends. Compared with conventional recurrent structures, Mamba maintains strong temporal modeling capability while processing long sequences more efficiently. This makes it highly suitable for high-temporal-resolution continuous observations from geostationary satellites. By modeling long-range dependencies in pixel-wise temporal features, Fire-Mamba better captures the persistence, intensification, and diffusion characteristics of fire signals over time. Consequently, it improves the discrimination between genuine wildfire signals and short-term background fluctuations. Finally, to address class imbalance caused by the sparsity of fire pixels, Focal Loss is adopted at the prediction stage to encourage the model to focus more on hard samples and improve fire-class detection performance. Experimental results demonstrate that Fire-Mamba achieves superior performance across multiple quantitative metrics. Its average fire detection rate (FA) reaches 90.33%, significantly outperforming the second-best CNN-LSTM model (82.28%) and the official JAXA WLF L2 product (40.66%). The model also achieves an overall accuracy (OA) of 99.60%, while reducing the average omission rate to 9.67%, indicating high sensitivity to small fire pixels. Compared with most deep-learning baselines, Fire-Mamba reduces isolated false alarms while maintaining higher fire detection sensitivity. However, its FAR remains higher than that of the conservative JAXA WLF L2 product, indicating that false-alarm suppression still requires further improvement. 
In summary, Fire-Mamba achieves a lightweight integration of multi-scale spatial thermal-gradient perception and long-range temporal evolution modeling, providing strong technical support for large-scale near-real-time wildfire monitoring and early warning.
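As a rough illustration of why Focal Loss helps with sparse fire pixels, the sketch below implements the standard binary focal loss for a single pixel. The values `gamma=2.0` and `alpha=0.25` are the defaults commonly used in the literature and are assumptions here; the abstract does not report the settings used for Fire-Mamba.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one pixel.

    p: predicted fire probability, assumed to lie strictly in (0, 1).
    y: label, 1 = fire, 0 = non-fire.
    gamma, alpha: commonly used defaults, not values from the abstract.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified pixels
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

A confidently correct background pixel (`p = 0.01`, `y = 0`) contributes almost nothing, whereas a missed fire pixel (`p = 0.1`, `y = 1`) keeps most of its cross-entropy loss, so training gradients concentrate on the hard, rare samples.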
Han Junwei, Qian Xuelin, Xu Chang, Wang Haoyan, Zhang Dingwen
DOI:10.11834/jig.260181
Abstract: Against the backdrop of the advancing global "Dual Carbon" strategy and the rapid proliferation of the visual Artificial Intelligence (AI) industry, promoting the green transition of visual AI technologies has emerged as a crucial pathway toward sustainable socioeconomic development. National directives, such as China’s "14th Five-Year Plan for Digital Economy Development," explicitly emphasize the necessity of advancing green and low-carbon computing infrastructures. In recent years, the leaps in visual AI performance have been fundamentally driven by the "Scaling Law," which holds that model capability improves with continued expansion of model parameters and training data. While this brute-force trajectory has unlocked unprecedented capabilities in complex perception tasks, it has simultaneously triggered severe resource bottlenecks. The prohibitive financial and temporal costs associated with massive physical data collection and fine-grained manual annotation, coupled with the exorbitant energy consumption required for training and deploying massive architectures, pose stark challenges to the low-carbon transformation of the AI industry. In response to this impending crisis, "Green Visual AI" has garnered widespread attention as a pivotal research paradigm. The core objective of this study is to systematically review energy-efficient methodologies that harmonize technical performance with long-term ecological sustainability. This paper aims to provide a comprehensive theoretical foundation and practical framework by exploring optimization strategies that minimize data, computational, and human resource requirements across the entire lifecycle of visual AI systems. This review adopts a comprehensive survey methodology to systematically deconstruct the entire lifecycle of visual intelligence models. 
We construct a structured taxonomy organized around four core, interrelated stages: Data Collection, Data Annotation, Model Inference, and Model Iteration. To provide a holistic view of the field, we categorize and analyze the overarching strategies within each stage: In the Data Collection phase, the survey reviews literature aimed at circumventing costly physical data acquisition. We analyze data synthesis paradigms and data transfer mechanisms that reuse existing knowledge for new environments. In the Data Annotation phase, we evaluate methodologies designed to eliminate the reliance on exhaustive human-in-the-loop labeling. This encompasses weakly supervised learning and self-supervised learning frameworks. For Model Inference, the methodology categorizes hardware and algorithmic interventions into model lightweighting and inference acceleration. The review synthesizes approaches like knowledge distillation, efficient attention mechanisms, linear sequence architectures, and dynamic inference strategies. Finally, in the Model Iteration phase, we examine frameworks that mitigate the massive carbon footprint of repetitive retraining. The literature is organized into continual learning strategies that prevent catastrophic forgetting and parameter-efficient fine-tuning methods that adapt models with minimal parameter updates. The synthesis of current technological frameworks reveals a profound paradigm shift across all evaluated dimensions of the visual AI lifecycle, transitioning from resource-heavy processes to highly optimized, sustainable operations. Energy-Efficient Data Collection: The literature indicates a decisive shift from physical data collection toward virtual synthesis and cross-domain transfer. Generative Adversarial Networks, Diffusion Models, and Large Language Models are now highly capable of synthesizing high-fidelity, varied data. 
Furthermore, domain adaptation and open-vocabulary learning techniques enable models to effectively reuse source-domain knowledge, facilitating robust deployment in unknown target environments with minimal to zero new data acquisition. Annotation-Efficient Paradigms: To alleviate the immense human capital required for pixel-level annotations, the field is rapidly adopting self-driven learning signals. Weakly supervised methods successfully derive supervisory cues from coarse, image-level labels or pseudo-labels. More prominently, self-supervised strategies, particularly contrastive learning, Masked Image Modeling, and cross-modal alignment, extract intrinsic structural representations directly from massive unlabeled datasets. These methods effectively bypass the manual annotation bottleneck while yielding powerful, generalized foundation models. Low-Carbon Model Inference: Addressing the high-frequency energy costs of model deployment, current network lightweighting techniques exhibit remarkable efficacy. Knowledge distillation successfully transfers representational power from massive teacher models to compact student networks, shifting the computational burden away from the deployment phase. To break the quadratic complexity bottleneck of standard Transformers, researchers have developed linear attention mechanisms and linear sequence architectures. On the acceleration front, techniques like single-step sampling for diffusion models, Key-Value cache optimization, and dynamic routing architectures ensure models execute complex tasks with minimal computational overhead. Sustainable Model Iteration: In dynamic environments, retraining massive models from scratch for every concept drift is computationally prohibitive. The survey highlights continual learning strategies—including parameter regularization, data replay, and architectural increments—which successfully prevent catastrophic forgetting, allowing models to accumulate new knowledge continuously. 
Concurrently, PEFT methods, notably Adapters, Low-Rank Adaptation, and Prompt Tuning, permit adaptation to novel downstream tasks by updating only a minuscule fraction of parameters while freezing the pre-trained backbone, drastically lowering the barrier for sustained model evolution. The transition toward Green Visual AI represents a fundamental evolution of artificial intelligence from a resource-intensive discipline to a sustainable, eco-friendly ecosystem. This review demonstrates that achieving a synergistic balance between technological performance and energy efficiency is feasible through targeted, full-lifecycle optimizations. However, significant challenges remain. Future research trajectories may emphasize the integration of explicit physical constraints into generative models to ensure real-world consistency, the development of unified self-supervised frameworks for heterogeneous multi-modal data, and the deepening of hardware-software co-design. Ultimately, embedding energy efficiency into the foundational design principles of computer vision is imperative to ensure that the next generation of visual AI sustainably empowers industrial applications while actively advancing global carbon neutrality objectives.
Keywords: Green visual AI; energy-efficient data collection; energy-efficient data annotation; energy-efficient model inference; energy-efficient model training
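To make the phrase "a minuscule fraction of parameters" concrete, the following sketch counts the trainable parameters of a Low-Rank Adaptation update for a single weight matrix. The layer size (768 x 768) and the rank (8) are illustrative assumptions, not figures from the survey, and `lora_trainable_fraction` is a hypothetical helper name.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Fraction of parameters trained when a d_out x d_in weight matrix is
    frozen and only a low-rank update B @ A is learned, with
    B of shape d_out x rank and A of shape rank x d_in."""
    full = d_out * d_in                # parameters in the frozen matrix
    low_rank = rank * (d_out + d_in)   # parameters in B and A together
    return low_rank / full


# Illustrative sizes only: one 768 x 768 projection adapted at rank 8.
fraction = lora_trainable_fraction(768, 768, 8)
print(f"{fraction:.2%} of the layer's parameters are trainable")
```

For these assumed sizes the update trains 12,288 values against 589,824 frozen ones, about 2% of the layer. The fraction shrinks further as the layer widens, because the low-rank cost grows linearly with width while the full matrix grows quadratically.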
Huang Rongmei, Yu Hong, Xie Caiyun, Dai Xiaofang, Chen Ying, Dai Jingjie, Hong Ruxia
DOI:10.11834/jig.260176
Abstract: Objective: Camouflaged object detection (COD) is a significant and challenging task in computer vision, which focuses on identifying and segmenting objects that are highly similar to complex backgrounds in terms of color, texture, and shape. This technique presents important research value and wide application potential in agricultural monitoring, medical imaging, ecological protection, military reconnaissance, and other fields. However, existing camouflaged object detection methods still face several critical limitations. On the one hand, convolutional neural networks (CNNs) are restricted by insufficient effective receptive fields, making it difficult to capture global context information and long-range dependencies required for distinguishing camouflaged objects. On the other hand, vision Transformers rely on self-attention mechanisms with quadratic computational complexity, resulting in huge computational overhead and memory consumption, which makes it hard to balance detection accuracy and efficiency. In addition, most mainstream methods only use single-modal RGB images as input and ignore the rich geometric and spatial prior information contained in depth maps. The existing cross-modal fusion strategies are relatively simple and cannot fully exploit the complementary information between RGB and depth modalities, leading to poor detection performance in highly camouflaged scenes. Aiming at these problems, this paper conducts an in-depth study on RGB-D camouflaged object detection methods based on multi-modal state space models, so as to achieve accurate, efficient and robust camouflaged object detection by fusing appearance information and geometric priors. Method: This paper proposes a novel RGB-D camouflaged object detection framework based on a multi-modal state space model, named MambaCOD. 
First, considering the lack of real depth maps in public COD datasets, the advanced visual foundation model Depth Anything V2 is employed to generate high-quality pseudo-depth maps from raw RGB images, providing reliable geometric structure priors and constructing stable RGB-D multi-modal input pairs. Second, a parameter-shared dual-branch encoder is designed to extract hierarchical multi-scale pyramid features from RGB images and depth maps respectively, ensuring the consistency of feature extraction and reducing redundant parameters. Third, a multi-modality Mamba fusion module (M3FM) based on the state space model is proposed to achieve bidirectional reciprocal feature fusion between RGB and depth modalities. This module integrates depth-wise separable convolution, 2D selective scanning (SS2D), and bidirectional 2D selective scanning (Bi-SS2D), which can model long-range global dependencies with linear complexity, break through the bottlenecks of traditional CNNs and Transformers, and fully mine complementary information between modalities. Fourth, a dual-directional context mixture convolution module (DCM-Conv) based on multi-kernel asymmetric convolution is constructed for the decoder stage. By channel splitting, cascaded vertical and horizontal asymmetric depth-wise separable convolutions, and channel mixing operations, this module extracts multi-receptive-field contextual features while effectively controlling the number of parameters and computational costs. On this basis, a progressive multi-scale decoder is built to fuse adjacent-scale features layer by layer and gradually output the final camouflaged object prediction mask. 
Finally, a hybrid loss function combining binary cross-entropy (BCE) loss, IoU loss, and structural similarity (SSIM) loss is adopted to jointly optimize the model from pixel accuracy, global structure, and boundary integrity. Result: Comprehensive experiments are conducted on three challenging and widely used COD benchmark datasets: CAMO, COD10K, and NC4K. The proposed MambaCOD is compared with 11 state-of-the-art methods, including 6 RGB-based models and 5 RGB-D-based models. Quantitative results show that MambaCOD achieves the optimal performance on most key evaluation metrics, including structural measure (Sm), enhanced alignment measure (Em), weighted F-measure (wFm), and mean absolute error (MAE). Specifically, compared with the second-best method, the proposed method reduces the mean absolute error by 21.3%, 17.4%, and 12.5% on the three datasets respectively, and achieves the best values in all major metrics. Efficiency analysis indicates that the model has only 58.5M parameters and 47.6G FLOPs, which are significantly lower than most comparison methods; compared with FSPNet, the parameter quantity is reduced by 78.6%, and compared with the Samba model, FLOPs are decreased by 4.0%, achieving an excellent balance between accuracy and efficiency. Visual effect analysis demonstrates that MambaCOD generates segmentation masks highly consistent with ground truth, accurately restores the contours and fine details of camouflaged objects, effectively distinguishes targets from highly similar backgrounds, reduces background noise and false detections, and maintains complete segmentation for irregularly shaped and highly concealed targets. Ablation experiments verify that each core component, including M3FM and DCM-Conv, contributes effectively to performance improvement. 
Further experiments confirm that Depth Anything V2 provides higher-quality geometric priors than traditional depth generators such as DPT, and the proposed modules still maintain effectiveness under different backbone networks. Conclusion: This paper proposes a lightweight and high-performance RGB-D camouflaged object detection framework MambaCOD based on a multi-modal state space model. By introducing high-quality pseudo-depth maps generated by Depth Anything V2, the model enriches the geometric structure information of input features and strengthens the multi-modal fusion between RGB appearance and depth geometry. The designed M3FM module realizes efficient bidirectional cross-modal fusion based on Mamba, which effectively captures long-range dependencies with linear complexity and breaks through the limitations of traditional CNNs and Transformers. The DCM-Conv module constructs a high-efficiency multi-scale decoder through asymmetric multi-kernel convolution, further improving detection accuracy while controlling computational costs. Experimental results on multiple benchmark datasets show that the proposed method outperforms existing mainstream approaches in both detection performance and computational efficiency, achieving state-of-the-art COD results. The proposed method effectively solves the problems of insufficient feature representation, low cross-modal fusion efficiency, and unbalanced precision and efficiency in traditional methods, providing a new solution for high-precision and high-efficiency camouflaged object detection in complex scenes. In the future, we will focus on enhancing the model’s robustness to noisy depth inputs, extending the framework to video camouflaged object detection, and exploring lightweight deployment on resource-constrained edge devices to expand the practical application scope of the model.
Keywords: camouflaged object detection; RGB-D; state space model; multimodal fusion; depth features
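The hybrid loss named in this abstract combines BCE, IoU, and SSIM terms. The sketch below shows only the BCE and soft-IoU parts on flattened masks, and omits the SSIM term for brevity. The equal weights `w_bce` and `w_iou` and all function names are assumptions for illustration; the abstract does not give the weighting used in MambaCOD.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flattened masks (pixel accuracy)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def soft_iou_loss(pred, target, eps=1e-7):
    """1 - soft IoU, with products and sums standing in for set operations."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - (inter + eps) / (union + eps)


def hybrid_loss(pred, target, w_bce=1.0, w_iou=1.0):
    """Weighted sum of the two terms; the weights are assumed, not reported."""
    return w_bce * bce_loss(pred, target) + w_iou * soft_iou_loss(pred, target)
```

The BCE term penalizes each pixel independently, while the IoU term scores the overlap of the whole predicted region, which is why the two are usually paired for masks where the object occupies a small part of the image.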
Feng Renshuai, Liu Xilin, Xue Yuhao, Li Xinjing, Tan Yulin, Zhao Cairong
DOI:10.11834/jig.260154
Abstract: Objective: Industrial safety monitoring often involves rare but high-risk events that exhibit a pronounced long-tail distribution. In real deployment scenarios, these critical hazard categories occur infrequently, are expensive to annotate at scale, and are often defined by subtle visual cues that must be interpreted together with contextual evidence from the surrounding scene. As a result, conventional vision-based safety monitoring methods, which typically rely on large annotated datasets and closed-set classification assumptions, often struggle to generalize to unseen hazard categories in practical industrial environments. Although recent vision-language models have demonstrated strong cross-modal understanding and reasoning capabilities, their direct application to industrial safety recognition still faces several fundamental challenges. First, existing models are often insufficiently sensitive to small but decisive local details, such as missing protective equipment, improper operation near machinery, or subtle environmental anomalies. Second, their few-shot adaptation ability is unstable when the support examples are limited and the target category has never appeared during training. Third, the generated reasoning text may not be well aligned with the final decision, which reduces the reliability and interpretability of the output. Finally, these models may generate fluent but weakly grounded explanations that are not fully supported by the visual evidence. To address these issues, this study investigates a few-shot industrial risk recognition framework that aims to improve both generalization to unseen hazard categories and the interpretability, factual consistency, and structural reliability of the decision process. Methods: A contextual chain-of-thought driven framework is developed for few-shot industrial safety risk recognition. The proposed method is built on three tightly coupled components. 
First, a dual-database data organization strategy is adopted to support different stages of learning. A core chain-of-thought training set is constructed to teach the model a structured industrial reasoning pattern through observation, analysis, and conclusion-oriented supervision, while an external few-shot example database is designed to provide task-specific contextual examples for unseen hazard categories. This separation allows the model to learn general reasoning ability from structured supervision and then apply it to new tasks through contextual adaptation. Second, the model architecture is enhanced by a hierarchical vision encoder, a semantic consistency classification head, and an active perception with iterative refinement mechanism. The hierarchical vision encoder fuses shallow, middle, and deep visual features so that both global scene semantics and fine-grained local details can be utilized during reasoning. The semantic consistency classification head predicts the final risk label from the semantic representation of the generated reasoning text rather than from an isolated parallel branch, thereby encouraging the final decision to remain structurally aligned with the explanation. The active perception mechanism allows the model to trigger local refinement when the initial global observation is insufficient, ambiguous, or affected by clutter, occlusion, or small target scale, so that difficult samples can receive additional focused inspection on critical regions. Third, a two-stage training strategy is adopted. In the first stage, structured chain-of-thought supervision is used to inject an industrial safety reasoning pattern into the model by jointly optimizing reasoning generation, final risk classification, and image-text semantic grounding. 
In the second stage, meta-samples composed of contextual examples and a query image are used to explicitly train in-context generalization, so that the model learns to infer a new risk concept from only a few support examples rather than relying on memorized category names. A contrastive image-grounding loss is further introduced to strengthen the factual alignment between visual evidence and generated text, reduce hallucination, and improve the faithfulness of the reasoning process. Results: Experiments are conducted on a dedicated evaluation protocol for unseen industrial hazard recognition. The core supervised training set contains three common industrial risk categories with structured reasoning annotations, while the UH-14 benchmark is used to evaluate generalization to 14 unseen hazard categories under few-shot settings. Quantitative experiments are reported under 1-shot, 3-shot, and 5-shot settings. Under the 3-shot setting on UH-14, the proposed method achieves an F1-score of 68.56%, outperforming ChatCH-SFT by 12.81 percentage points over its 55.75% result. The F2-score improves from 57.83% to 70.81%, and recall reaches 72.40%, indicating a substantially stronger ability to reduce missed detections in safety-critical scenarios where recall is particularly important. Under the 1-shot setting, the proposed model still achieves an F1-score of 56.92%, demonstrating that the framework remains effective even when only a single support example is available for each unseen class. Under the 5-shot setting, the model reaches an F1-score of 74.49% and an F2-score of 76.27%, showing that the method can continue to benefit from additional contextual examples. Ablation experiments further verify the contribution of the major components. 
Removing the contextual learning stage causes a significant drop in performance, with the F1-score decreasing by 19.72 percentage points and the F2-score decreasing by 19.51 percentage points, which confirms that explicit contextual generalization training is central to the framework. Removing the chain-of-thought learning stage also leads to clear degradation, with the F1-score and F2-score dropping by 4.76 and 5.42 percentage points, respectively, indicating that structured reasoning supervision provides an important foundation for downstream few-shot transfer. Additional analyses show that the proposed design is beneficial not only for recognition accuracy but also for improving explanation consistency and strengthening the robustness of decisions on visually complex or ambiguous samples. Conclusion: The proposed framework improves few-shot recognition of unseen industrial risk categories while providing a more interpretable, structurally consistent, and visually grounded decision process. By coupling contextual learning, hierarchical visual perception, semantic consistency constraints, and conditional iterative refinement, the method is particularly suitable for industrial safety scenarios characterized by long-tail hazards, scarce annotations, strong dependence on contextual cues, and strict recall requirements. The framework does not merely improve classification performance; it also strengthens the factual reliability and logical coherence of generated explanations, which is important for human verification, expert review, and practical safety deployment in real industrial environments. In this sense, the study contributes not only a recognition model, but also a pathway toward more trustworthy multimodal reasoning for safety-critical applications. At the same time, the current study is still mainly focused on static-image scenarios. 
Future work may extend the method toward video-based temporal reasoning, deployment-oriented efficiency optimization, larger-scale open-category evaluation protocols, and more systematic construction of industrial reasoning data, so as to further improve practical applicability in real-world safety monitoring systems.
Keywords: danger warning; visual language model; multimodal feature fusion; context learning; few-shot learning
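As a reading aid for the F1 and F2 figures quoted in the abstract above, the following minimal sketch computes the general F-beta score from precision and recall. The function names `f_beta` and `precision_from_f1` are illustrative and do not come from the paper. The abstract does not report precision, so the value recovered here is only what the stated 3-shot F1 and recall would imply if both were computed on the same pooled counts, which is an assumption (the paper may average per class).

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision <= 0.0 and recall <= 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def precision_from_f1(f1: float, recall: float) -> float:
    """Invert F1 = 2PR / (P + R) for P; only valid when 2R > F1."""
    if 2.0 * recall <= f1:
        raise ValueError("no finite precision is consistent with these values")
    return f1 * recall / (2.0 * recall - f1)


# 3-shot UH-14 figures quoted in the abstract, written as fractions.
f1_3shot, recall_3shot = 0.6856, 0.7240

# Assumption: F1 and recall refer to the same pooled predictions.
implied_precision = precision_from_f1(f1_3shot, recall_3shot)
implied_f2 = f_beta(implied_precision, recall_3shot, beta=2.0)
print(f"implied precision = {implied_precision:.4f}, implied F2 = {implied_f2:.4f}")
```

Under that assumption the implied F2 comes out close to the reported 70.81%, which illustrates why F2 is the more informative figure when missed detections are costlier than false alarms.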
摘要:Semantic segmentation is a fundamental task in computer vision, aiming to assign a semantic label to each pixel in an image. Most conventional semantic segmentation methods are designed for a closed-set setting, where the categories involved in testing are predefined and largely consistent with those observed during training. Although this setting facilitates the learning of stable category representations and clear decision boundaries, it limits the applicability of segmentation models in real-world scenarios, where unseen objects and novel category names are frequently encountered. In this context, open-vocabulary semantic segmentation (OVSS) has emerged as an important research direction for extending semantic segmentation from closed-set prediction to open-world visual understanding. By using natural language descriptions as category definitions, OVSS enables models to identify and segment pixel-level regions corresponding to arbitrary textual concepts, thereby alleviating the dependence on fixed label spaces. The rapid development of vision-language pre-trained models, especially CLIP and related large-scale cross-modal models, has provided an important foundation for OVSS. Trained on massive image-text pairs, these models learn aligned visual and textual representations and exhibit strong transferability in open-category recognition. However, most existing vision-language models are mainly optimized for image-level semantic alignment and global visual recognition, making it difficult to directly satisfy the requirements of pixel-level dense prediction, such as accurate localization, boundary delineation, and fine-grained category discrimination. Therefore, how to effectively transfer image-level open-vocabulary recognition ability to pixel-level segmentation remains a key challenge in OVSS. 
This paper summarizes recent progress in OVSS from the perspectives of task background, representative methods, benchmark datasets, evaluation metrics, remote sensing extension, and future directions. First, the task background and basic characteristics of OVSS are introduced, and its relationship with conventional semantic segmentation and zero-shot semantic segmentation is clarified. Compared with conventional semantic segmentation, OVSS emphasizes generalization beyond predefined categories; compared with zero-shot semantic segmentation, it further benefits from large-scale image-text pre-training and allows more flexible category definitions through natural language descriptions. Second, representative OVSS methods are reviewed according to several major research routes, including zero-shot semantic segmentation, early explorations based on image-text supervision, two-stage methods, and single-stage methods. Early zero-shot segmentation methods mainly rely on semantic embeddings to transfer knowledge from seen categories to unseen categories, but their expressive ability is often limited by static semantic representations. Early studies based on image-text supervision explore whether transferable semantic region representations can be learned without dense pixel annotations, providing important inspiration for subsequent OVSS methods. Two-stage methods usually decompose the task into class-agnostic region generation and open-vocabulary region recognition. This paradigm is clear and modular, but its performance is highly dependent on the quality of candidate regions and may suffer from additional computational costs. In contrast, single-stage methods aim to integrate region modeling, vision-language alignment, and pixel-level prediction within a unified framework, which helps reduce cross-stage error accumulation and improve inference efficiency. 
Nevertheless, they still face challenges in fine-grained localization, dense semantic alignment, and robust generalization to unseen categories. The main ideas, technical characteristics, advantages, and limitations of these methods are further analyzed. In addition, this paper discusses the extension of OVSS to remote sensing imagery. Compared with natural images, remote sensing images usually exhibit top-down viewpoints, large scale variations, significant orientation changes, complex backgrounds, and cross-region distribution shifts, which make it difficult to directly transfer OVSS methods designed for natural-image scenarios to remote sensing applications. Recent studies have therefore begun to explore open-vocabulary remote sensing segmentation frameworks, dedicated benchmark datasets, training-free strategies, and multimodal fusion methods, indicating that OVSS is gradually expanding from general natural-image understanding to more complex professional visual scenarios. Commonly used datasets and evaluation metrics in OVSS are also summarized. Existing studies typically use COCO-Stuff as the main training benchmark and evaluate model generalization on datasets such as Pascal VOC, Pascal Context, and ADE20K under different vocabulary settings. At present, OVSS evaluation still mainly relies on conventional semantic segmentation metrics, especially mean Intersection over Union (mIoU). However, such metrics emphasize strict category matching and are insufficient for measuring semantic similarity, category hierarchy, and open-category generalization. Therefore, constructing evaluation protocols that better reflect the semantic openness of OVSS remains an important issue. 
Finally, this paper summarizes the major challenges in current OVSS research and discusses future directions, including precise pixel-level transfer of open-vocabulary recognition, adaptive region generation, more appropriate evaluation protocols, low-supervision and training-free learning, and remote sensing applications.
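The remark above that mIoU emphasizes strict category matching can be made concrete with a small sketch. The code below is a plain-Python illustration on flattened label maps; the function name `mean_iou`, the three label ids, and the four-pixel example are hypothetical and are not taken from any benchmark named in the abstract.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical label ids: 0 = "sofa", 1 = "couch", 2 = "floor".
target = [0, 0, 2, 2]
pred_exact = [0, 0, 2, 2]
pred_synonym = [1, 1, 2, 2]  # a near-synonym label instead of the annotated one

print(mean_iou(pred_exact, target, 3))
print(mean_iou(pred_synonym, target, 3))
```

In this toy case the second prediction segments the same region correctly but names it with a near-synonym, and it is scored as two fully wrong classes. This is the kind of behavior that motivates evaluation protocols sensitive to semantic similarity and category hierarchy.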
Abstract: Objective: Point cloud classification has emerged as a fundamental pillar of 3D computer vision, primarily driven by the rapid proliferation of LiDAR sensors and depth-sensing technologies in autonomous systems, industrial inspection, and robotic perception. Unlike structured 2D images, 3D point clouds are characterized by inherent irregularity, sparsity, and a lack of canonical ordering, which poses significant challenges for high-precision recognition. While recent convolution-based and MLP-based networks have achieved considerable progress, they often struggle to capture fine-grained local geometric structures and establish comprehensive global topological dependencies simultaneously. Many existing models suffer from limited feature representation capabilities when dealing with complex object geometries or noisy environments, often failing to maintain scale adaptability and robust discriminative power. This study aims to address these deficiencies by proposing a novel multi-granularity classification network that combines spatial-domain geometric features with frequency-domain decoupled information. The primary objective is to strengthen the modeling of local structures and refine feature fusion mechanisms, thereby achieving superior classification performance and structural robustness in 3D shape recognition tasks. Method: In this study, a multi-branch fusion architecture is developed to perform multi-granularity feature extraction and integration across different domains. The technical scheme is organized into several complementary modules designed to perceive point clouds from both spatial and spectral perspectives. The spatial domain processing is split into a point feature branch and a global geometric branch.
The point feature branch utilizes Multi-Layer Perceptrons to extract robust, fine-grained geometric representations directly from the raw point coordinates, ensuring the preservation of fundamental shape information. Simultaneously, the global geometric branch employs an improved edge convolution algorithm. Unlike traditional static graph convolutions, this branch utilizes an incremental neighbor sequence, specifically a k-sequence, to realize multi-scale aggregation of the global topology. This mechanism allows the network to perceive structural evolution ranging from micro-level local neighborhoods to macro-level global contours, significantly enlarging the receptive field of the model. To further emphasize critical geometric characteristics, a channel attention mechanism is incorporated to adaptively enhance significant geometric responses while suppressing redundant noise. In parallel, a local spectral feature extractor is designed for frequency-domain analysis. It decouples the features into low-frequency, high-frequency, and spectral difference information, effectively capturing various levels of structural variations and geometric skeletons that are often ignored by spatial-only operations. To facilitate long-range feature interaction within local regions without excessive computational overhead, a selective state space model, specifically the Mamba architecture, is integrated into the spectral branch. Finally, the network integrates a spatial transformation regularization strategy via T-Net to ensure rotation invariance, a dual-path pooling mechanism combining maximum and average pooling to retain representative features, and multiple classifiers to achieve the final robust classification. Result: The performance of the proposed multi-granularity network is evaluated on two mainstream benchmark datasets, namely ModelNet40 and ScanObjectNN.
ModelNet40 consists of 12,311 synthetic CAD models across 40 categories, providing an ideal environment for testing geometric learning. In contrast, ScanObjectNN provides a more challenging testbed with 15,000 real-world scanned objects containing significant background noise and occlusions. The quantitative evaluation focuses on two key metrics: Overall Accuracy and mean Accuracy. Experimental results show that the proposed method achieves an Overall Accuracy of 93.0% and a mean Accuracy of 90.7% on the ModelNet40 dataset. On the more challenging ScanObjectNN dataset, the model yields an Overall Accuracy of 82.4% and a mean Accuracy of 79.8%. On the ShapeNet Part dataset, the instance-average and class-average intersection-over-union (IoU) reach 85.94% and 83.32%, respectively. Comparative experiments demonstrate the clear superiority of this approach over several state-of-the-art models. Compared with existing mainstream methods, the classification accuracy is improved by an average of approximately 1% on ModelNet40 and 3% on ScanObjectNN. On the ShapeNet Part dataset, our approach outperforms state-of-the-art methods by about 0.8% in instance mIoU and 0.7% in class mIoU. Specifically, the integration of spectral decoupling and the selective state space model significantly enhances the model's discriminative power and scale adaptability. The experimental data confirm that the synergy of spatial-domain multi-scale features and frequency-domain decoupled information provides superior robustness against geometric deformations and real-world sensor noise, effectively reducing the performance gap between synthetic and real-world data. Conclusion: The experimental results demonstrate that the model consistently outperforms several state-of-the-art approaches, and the proposed multi-branch fusion algorithm significantly improves the overall feature representation capability.
This research offers an approach to 3D object recognition, demonstrating that multi-granularity analysis can effectively compensate for the information loss inherent in single-domain operations. The research is relevant to fields requiring high-precision 3D perception. Future work will focus on further optimizing the model structure to keep it lightweight while exploring the potential of this multi-granularity strategy in large-scale scene segmentation and other complex 3D vision tasks. By continuously refining the selective state space mechanism and spectral decoupling filters, we aim to develop a more generalized and efficient framework for 3D understanding that can adapt to increasingly complex sensing environments.
Keywords: point cloud classification; spatial domain; frequency domain; edge convolution model; selective state space model
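The dual-path pooling step mentioned in the abstract above (maximum and average pooling combined) can be sketched in a few lines. This is a minimal illustration, assuming the two pooled vectors are concatenated; the abstract only says they are combined, so the concatenation and the name `dual_path_pool` are assumptions rather than the authors' implementation.

```python
from typing import List, Sequence


def dual_path_pool(features: Sequence[Sequence[float]]) -> List[float]:
    """Pool an (N points x C channels) feature list into a 2C-dim descriptor."""
    if not features:
        raise ValueError("need at least one point")
    channels = list(zip(*features))  # transpose into C tuples of length N
    max_part = [max(ch) for ch in channels]             # strongest response per channel
    avg_part = [sum(ch) / len(ch) for ch in channels]   # overall response per channel
    return max_part + avg_part  # assumed combination: concatenation


points = [[1.0, 2.0], [3.0, 0.0]]  # two points, two channels (toy values)
print(dual_path_pool(points))
```

Both pooling paths are symmetric functions of the point set, so the descriptor does not depend on the order in which points are listed, which matters because point clouds lack a canonical ordering.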
Weng Zihui, Zhang Quan, Xie Xiaohua, Lai Jianhuang
DOI:10.11834/jig.260095
摘要:Diffusion models have rapidly become the dominant paradigm of generative visual models. However, the memorization-forgetting mechanisms inherent in these models make them unintentionally memorize sensitive information from training datasets, which further aggravates the privacy and copyright issues in the image domain. Although the memorization and forgetting mechanisms have been deeply studied in the field of language models, a systematic review on diffusion models, which are the core technology of visual generation tasks, is still lacking. Our survey aims to bridge this gap, and the existing works are reviewed from five aspects critically. First, we briefly introduce the theoretical modeling and architecture of diffusion models, including the formulations of denoising diffusion probabilistic models, score-based generative models, and flow matching, as well as the architectures of U-Net, DiT and MMDiT. Such theoretical and architectural foundations not only support the strong generative capability of diffusion models, but also provide the necessary preliminaries for understanding why and how memorization arises during the iterative denoising process. On this basis, the definitions of memorization in diffusion models are introduced from non-temporal and temporal perspectives. For non-temporal diffusion models, the memorization can be divided into global memorization and local memorization, where the former focuses on the global similarity between generated images and training images, and the latter is concerned with the partial replication of sensitive components such as faces, signatures and watermarks. For temporal diffusion models, the memorization can be further decomposed into content memorization and motion memorization, since the leakage in video diffusion models is reflected not only in the static appearance of objects but also in the dynamic motion patterns across frames. 
Once the manifestations of memorization are formally defined, a natural follow-up question is what factors give rise to such memorization behaviors and how they can be theoretically interpreted. Accordingly, the understanding of memorization in diffusion models is then discussed from the perspectives of model factor, data factor and theoretical view. The model factor is focused on the influence of over-parameterization and architectural components such as the cross-attention modules, while the data factor reveals that the long-tailed distribution and duplicated samples can significantly increase the memorization risk. The theoretical view is supported by the manifold memorization hypothesis and the geometry-adaptive harmonic representation, which explain why diffusion models tend to memorize specific samples from a geometric standpoint. Building upon these qualitative understandings, more quantitative tools are required for the practical assessment of memorization risks. To this end, the quantification methods of memorization are introduced from the perspectives of model auditor and malicious attacker. The auditor-based methods can be divided into proxy-based methods and replication-based methods. The proxy-based methods are designed to approximate the memorization score by influence functions, gradient variance, and geometric proxies such as the Hessian curvature. The replication-based methods utilize pixel-level, perceptual-level and semantic-level similarity metrics, such as SSIM, LPIPS, SSCD and CLIP, to detect the duplication between generated images and training images. The attacker-based methods can be divided into membership inference attacks and extraction attacks, which are designed to determine whether a sample is in the training dataset or to recover the original training samples directly. The threat models can be further categorized into white-box, gray-box and black-box settings according to the attacker’s knowledge of the target model. 
Given that these quantification methods have empirically verified the severity of memorization-induced privacy leakage, how to effectively mitigate such risks becomes the next critical concern. Therefore, the memorization mitigation methods under a unified framework of machine unlearning are introduced, including 1) differential privacy, 2) prompt optimization, and 3) machine unlearning. The differential privacy based methods are typically applied in the pre-training stage and can provide formal privacy guarantees through input perturbation, output perturbation, or gradient perturbation such as DP-SGD, but at the cost of generative utility. The prompt optimization based methods intervene at the inference stage by perturbing the text embedding or attention scores to prevent the model from generating memorized content, which is lightweight but less rigorous. The machine unlearning based methods can be focused on fine-tuning the model parameters to redirect the memorized concepts to null concepts, locating and editing the memorization-related neurons, or applying closed-form parameter editing to remove the influence of specific samples or concepts in a post-hoc manner. These three families of methods are complementary to one another in terms of intervention stage, computational overhead and privacy guarantee, and their integration provides a more comprehensive perspective on memorization mitigation in diffusion models. To sum up, although a series of progress has been made along the above five aspects, several critical challenges still remain to be addressed in future research. 1) The privacy data processing pipeline and benchmarks of diffusion models are required to be standardized further, since the absence of unified datasets and evaluation protocols hampers fair comparison and reproducibility. 
2) Memorization definitions and machine unlearning algorithms better suited to the characteristics of diffusion models need to be developed, since existing methods are mostly inherited from classifiers or language models and cannot fully exploit the iterative and cross-modal nature of diffusion processes. 3) Memorization-forgetting mechanisms in novel learning scenarios such as test-time training, federated learning and continual learning need to be explored, and deployment in vertical domains such as medical imaging and financial generation needs to be accelerated through cooperation between academia and industry. Furthermore, ethical issues related to data privacy policy need to be considered for the responsible deployment of diffusion models in the future.
Abstract: Objective: Remote sensing change captioning (RSCC) aims to automatically generate natural language descriptions for changes occurring between bi-temporal remote sensing images acquired over the same geographic region. Unlike traditional change detection, which mainly focuses on determining whether changes occur and where they are located, RSCC further requires the model to understand the semantic category, spatial position, and transformation relationship of changed objects. This makes RSCC a challenging task that integrates visual change perception, bi-temporal feature interaction, and language generation. Existing methods have made considerable progress by introducing attention mechanisms, Transformer-based structures, and generative modeling strategies. However, several problems remain. First, the deep features extracted by visual backbones usually contain a large amount of unchanged background information, while the truly changed regions may occupy only a small portion of the image. As a result, key changed regions and discriminative semantic channels may not be sufficiently highlighted before difference modeling. Second, the relationship between bi-temporal features is not limited to simple subtraction. The original pre-change and post-change features, change magnitude, multiplicative interaction, and semantic similarity may all contribute to change understanding. Directly relying on a single difference representation or simple feature concatenation may be insufficient to describe complex changes. Third, some recent methods achieve better performance by increasing model complexity, but the balance between captioning performance and model size remains important for practical remote sensing applications. To address these problems, this paper proposes a remote sensing change captioning framework guided by bi-temporal feature enhancement and multi-relation difference coupling. Method: The proposed framework follows an encoder-decoder structure.
RemoteCLIP-RN50, namely RemoteCLIP with a residual network-50 backbone, is adopted as the shared visual feature extractor to encode pre-change and post-change remote sensing images. The two temporal images are processed by the same backbone to ensure that their visual features are represented in a unified semantic space. After feature extraction, a bi-temporal spatial-channel enhancement (BSCE) strategy is introduced before difference modeling. The BSCE strategy performs channel enhancement and spatial enhancement on the features of both temporal images. Channel enhancement uses global contextual information to recalibrate semantic channels, so that channels related to changed objects can obtain stronger responses. Spatial enhancement further emphasizes local regions with change potential and suppresses irrelevant background responses. In this way, the enhanced bi-temporal features provide clearer and more discriminative inputs for subsequent difference modeling. On this basis, a multi-relation difference coupling (MRDC) unit is constructed. Instead of using only simple subtraction, MRDC jointly models the original bi-temporal features, absolute difference, multiplicative interaction, and cosine similarity. Absolute difference represents local change magnitude between the two temporal features. Multiplicative interaction captures co-response and feature interaction between the two temporal images. Cosine similarity describes semantic consistency between corresponding spatial positions. These complementary relationships are concatenated and compressed through feature fusion layers to obtain a compact visual change representation. The fused feature map is then flattened into a visual memory sequence. Finally, a Transformer-based text decoder is employed to generate change captions autoregressively. During training, the model is optimized using cross-entropy loss under the teacher-forcing strategy. 
During inference, greedy search is adopted to generate the final change description. Result: Experiments on the LEVIR-CC dataset show that the proposed method achieves 83.62 in BLEU-1, 60.22 in BLEU-4, 64.94 in ROUGE-L, and 128.58 in CIDEr, where BLEU-1 and CIDEr outperform the comparison methods. Meanwhile, the proposed method contains 41.50M parameters, fewer than several classic and recent representative methods, demonstrating a favorable balance between performance and model complexity. Additional experiments on the DUBAI-CC dataset show that the proposed method achieves 63.75 in BLEU-1, 34.14 in BLEU-4, 56.62 in ROUGE-L, and 90.09 in CIDEr, obtaining the best performance in BLEU-4, ROUGE-L, and CIDEr, which indicates its applicability across different datasets. Ablation studies demonstrate that both the bi-temporal feature enhancement strategy and the multi-relation difference coupling unit improve change captioning performance. Further relation synergy experiments show that, after bi-temporal feature enhancement, complete multi-relation coupling achieves the best performance in BLEU-1, ROUGE-L, and CIDEr, indicating that feature enhancement helps improve the complementary representation of different difference relations. Conclusion: The proposed method designs a targeted bi-temporal change representation process for remote sensing change captioning. By introducing BSCE before difference modeling and constructing MRDC to jointly represent multiple change relations, the method improves the discriminability of visual change features and provides more effective visual cues for language generation. Experimental results on LEVIR-CC and DUBAI-CC demonstrate that the proposed method achieves a favorable balance between captioning performance and model complexity, and it shows good applicability across different datasets.
The ablation, relation decomposition, synergy, and visualization analyses further confirm that the performance gain comes from the cooperation between bi-temporal feature enhancement and multi-relation difference coupling. Although the proposed method still has room for improvement in fine-grained word-level matching and complex scene description, it provides a practical and relatively simple framework for remote sensing change captioning. Future work will explore stronger language decoding strategies, cross-dataset semantic alignment, multi-dataset joint training, and lightweight deployment to further improve the robustness and generalization ability of RSCC models.
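The multi-relation difference coupling described in this abstract combines the original bi-temporal features with their absolute difference, multiplicative interaction, and cosine similarity. The sketch below shows those relations for a single spatial position in plain Python. The function names, the concatenation order, and the toy channel values are assumptions for illustration; the learned fusion layers that compress the concatenated vector are not shown.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine of the angle between two channel vectors (eps guards zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, eps)


def couple_relations(pre: Sequence[float], post: Sequence[float]) -> List[float]:
    """Concatenate the relations at one position: [pre, post, |pre-post|, pre*post, cos]."""
    if len(pre) != len(post):
        raise ValueError("pre and post features must have the same channel count")
    abs_diff = [abs(x - y) for x, y in zip(pre, post)]  # local change magnitude
    product = [x * y for x, y in zip(pre, post)]        # co-response of the two dates
    cos = cosine_similarity(pre, post)                  # semantic consistency
    return list(pre) + list(post) + abs_diff + product + [cos]


unchanged = couple_relations([1.0, 0.0], [1.0, 0.0])
changed = couple_relations([1.0, 0.0], [0.0, 1.0])
print(unchanged)
print(changed)
```

For C channels the coupled vector has 4C + 1 entries. In the toy example an unchanged position yields zero absolute difference and a similarity of one, while a changed position yields the opposite pattern, which shows how the relations carry complementary cues.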
Yu Yating, Cao Congqi, Wang Zhaoying, Zhang Yanning
DOI:10.11834/jig.260215
摘要:With the rapid expansion of the low-altitude economy and the increasing maturity of intelligent unmanned systems, unmanned aerial vehicles (UAVs) are gradually transforming from traditional remotely controlled flying platforms into autonomous aerial agents capable of perception, reasoning, and decision-making. Among the various sensing modalities available to UAVs, visual perception plays a central role in acquiring environmental information and enabling high-level situational awareness. Consequently, the capability of visual understanding directly determines the intelligence level and operational autonomy of UAV systems. In recent years, the emergence of visual foundation models, vision-language models, and multimodal large language models has substantially reshaped the technical paradigm of UAV visual understanding. This paradigm shift provides new opportunities for enabling UAV systems to operate effectively in complex and open environments where perception, reasoning, and decision making must be tightly coupled. To clarify this emerging research landscape, we introduce a capability-oriented analytical framework that organizes UAV visual understanding into three hierarchical levels: basic perception, semantic reasoning, and decision planning. This framework serves as the conceptual backbone of the survey and allows recent studies to be examined from a systematic capability-evolution standpoint. Specifically, from the task perspective, we construct a comprehensive taxonomy of UAV visual understanding tasks that includes four major categories: basic object perception, event semantic analysis, spatial environment understanding, and flight decision-making. Within this taxonomy, representative tasks such as object detection, target tracking, human action recognition, visual question answering, spatial reasoning, navigation, and autonomous flight control are analyzed in a unified manner. 
At the same time, we summarize several fundamental challenges that arise in aerial visual perception. These challenges include significant scale variations caused by high-altitude viewpoints, long-range observation that leads to small object representations, complex backgrounds with strong visual clutter, and dynamic environmental changes that require robust temporal reasoning. From the technical perspective, we review the methodological evolution that underlies the development of UAV visual understanding models. Early approaches were largely based on conventional deep learning architectures that focused on supervised visual perception tasks. Subsequent advances in visual foundation models significantly improved representation learning by leveraging large-scale pretraining and open-vocabulary multimodal alignment. More recently, large language models and multimodal large language models have introduced powerful reasoning capabilities and cross-modal interaction mechanisms that allow UAV systems to interpret visual observations in conjunction with natural language instructions and contextual knowledge. Building upon these developments, embodied vision-language-action models have begun to connect perception with action generation, thereby enabling UAVs to perform complex task planning and interactive decision making in real-world environments. 
From the capability perspective, we further examine how large visual understanding models enhance UAV intelligence across three dimensions: 1) visual perception enhancement, where large models improve robustness in open-vocabulary recognition, small-object detection, and fine-grained visual understanding; 2) vision-language reasoning, where multimodal models facilitate complex reasoning processes such as spatial relation reasoning, event interpretation, and cross-modal knowledge integration; 3) visual decision planning, where embodied multimodal models enable UAVs to translate perception and reasoning outcomes into actionable flight strategies, mission planning procedures, and adaptive control policies. In addition, we summarize representative UAV visual datasets and benchmarks that support the development and assessment of large multimodal models for visual understanding. Particular attention is given to the evolution of evaluation protocols. Traditional benchmarks often measure performance in narrowly defined tasks such as detection or classification. However, recent research increasingly emphasizes capability-oriented evaluation frameworks that assess broader competencies including reasoning ability, cross-task generalization, and decision support. This transition reflects a broader shift in the field toward evaluating integrated visual intelligence rather than isolated perception performance. Finally, we discuss several promising research directions that may shape the future development of UAV visual understanding systems. These directions include the construction of general-purpose visual foundation models tailored for aerial scenarios, the advancement of embodied UAV intelligence through vision-language-action integration, the development of real-time reasoning techniques and lightweight deployment strategies suitable for resource-constrained aerial platforms, and the establishment of safety-aware, trustworthy, and privacy-preserving UAV perception systems.
Keywords: multimodal large language model; visual foundation model; unmanned aerial vehicle (UAV); visual understanding; intelligent decision-making; review
Dong Yan, Jia Jijie, Gao Guangshuai, Gao Junyu, Li Xiangyun, Li Chunlei
DOI:10.11834/jig.260167
Abstract: Objective: Rotated Object Detection (ROD) is a critical task in remote sensing image processing, playing a vital role in numerous core application scenarios, including land surveying, national defense, agricultural monitoring, disaster emergency response, and urban planning. This task aims to accurately identify and locate direction-sensitive objects in remote sensing images, such as ships, aircraft, vehicles, buildings, and bridges. Unlike the objects handled by horizontal object detection in natural scenes, these remote sensing targets typically exhibit extreme scale variations, dense distribution, arbitrary orientations, and complex backgrounds. These characteristics render ROD a significantly more challenging research direction. Traditional Horizontal Bounding Box (HBB) detection methods fail to accurately fit the contours of rotated objects, often introducing excessive background redundancy that leads to small object detection failures and positioning inaccuracies. In contrast, Oriented Bounding Box (OBB) detection techniques effectively address this issue by predicting five parameters: the object's center coordinates, width, height, and rotation angle. Consequently, OBB has become the mainstream research direction in remote sensing. However, existing ROD methods still face two core challenges constraining performance improvement: the angular boundary discontinuity problem and the parameter coupling issue in regression tasks. The first challenge, angular boundary discontinuity, fundamentally stems from the inherent contradiction between the periodic nature of angular parameters (0° and 180° being equivalent in practical applications) and the requirement for continuous differentiability in regression tasks. When predicted angles approach these boundaries, minor angular errors can trigger abrupt jumps in the loss function, leading to unstable training, slow convergence, and prediction biases.
Although existing IoU-based joint optimization methods and various angle encoding schemes (e.g., the dual-angle approach and complex encoding) attempt to convert discrete angle prediction into a continuous regression task, they have not yet achieved stable, continuous angle estimation. The second challenge lies in the coupling of rotated box parameters. Mainstream rotated detectors (e.g., Rotated RetinaNet, Rotated Faster R-CNN) typically employ a shared regression head to jointly predict center coordinates, scale, and rotation angle. This design overlooks the fundamental differences in physical meaning and learning characteristics between these geometric parameters: center coordinates and scale represent translation and scaling transformations in Euclidean space, while the rotation angle represents a rotational transformation in a periodic space. Feature confusion and gradient conflicts between these parameters compromise the learning process. In particular, unstable angle predictions propagate errors to the other parameters through coordinate computations, significantly degrading overall model performance. To address these issues, this paper proposes a high-precision ROD algorithm based on confidence-weighted guided angle correction, built by extending the Rotated RetinaNet framework. Method: First, an Angle Modulation Module (AMM) is introduced to resolve angular boundary discontinuity. Inspired by complex exponential encoding-decoding mechanisms, this module maps angular parameters to continuous, quantizable complex-valued signals, then restores angle values via differentiable inverse transformations. When the angular frequency ω is set to 2, the encoding function produces consistent encoding outputs for target rotation angles θ and θ+π. This resolves the prediction instability caused by piecewise function fitting in traditional methods and guarantees continuity and differentiability in angle regression. 
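The angle encoding described above can be illustrated with a minimal sketch. The abstract does not give the exact encoding, so this assumes the simplest complex exponential form, e^{jωθ} stored as a (cos, sin) pair and decoded with `atan2`; the function names `encode_angle` and `decode_angle` are hypothetical and not taken from the paper.

```python
import math

OMEGA = 2  # angular frequency; with omega = 2, theta and theta + pi encode identically


def encode_angle(theta, omega=OMEGA):
    """Map an angle in radians to (cos(omega*theta), sin(omega*theta)).

    ASSUMPTION: this is a generic complex exponential encoding, not
    necessarily the exact form used by the AMM.
    """
    return math.cos(omega * theta), math.sin(omega * theta)


def decode_angle(real, imag, omega=OMEGA):
    """Invert the encoding; for omega = 2 the result lies in [0, pi)
    up to floating-point rounding."""
    phase = math.atan2(imag, real)           # phase in (-pi, pi]
    return (phase % (2 * math.pi)) / omega   # shift to [0, 2*pi), then undo omega


# Angles on either side of the 0/180 degree boundary are far apart as raw
# numbers (178 degrees) but close together in the encoded space.
near_hi = encode_angle(math.radians(179))
near_lo = encode_angle(math.radians(1))
gap = math.hypot(near_hi[0] - near_lo[0], near_hi[1] - near_lo[1])
```

Regressing the two encoded components instead of the raw angle means a box at 179° and the same box at 1° produce nearly identical targets, which is the continuity property the abstract attributes to the AMM.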
Second, to address parameter coupling in rotated boxes, a Three-Branch Decoupled Regression (TBDR) module is designed. This module decouples the traditional joint regression head into three parallel, independent prediction branches responsible for center coordinate, scale, and rotation angle prediction, respectively. The branches share the underlying feature extraction to remain parameter-efficient. Each branch independently learns the transformation rules for its corresponding geometric attribute: the center branch focuses on positional offsets between predicted and ground-truth bounding boxes; the scale branch handles width and height scaling relationships; and the angle branch concentrates on learning nonlinear periodic rotation patterns. This two-level architecture, which combines shared feature extraction with dedicated prediction branches, isolates feature interference and gradient conflicts between parameters and enables specialized modeling of each geometric parameter. Finally, to maximize the correction capability of the AMM while avoiding over-correction of high-confidence angle predictions, this paper introduces a Dynamic Angle Confidence Weighting (DACW) mechanism. This mechanism splits off a lightweight confidence prediction sub-branch from the angle branch and generates confidence scores in the range [0, 1] via a Sigmoid activation function to quantify the reliability of each angle prediction. The confidence score serves as a dynamic weight that regulates the correction intensity of the AMM outputs, while the hyperparameter λ fine-tunes the correction magnitude. Specifically, the mechanism reduces correction intensity for high-confidence predictions (high scores) to preserve original valid features, and increases correction intensity for low-confidence predictions (low scores) to eliminate boundary discontinuities and prediction biases. 
The final angle is obtained through a weighted fusion of the original angle prediction and the AMM-corrected angle. Result: Experimental results on two publicly available remote sensing datasets (DOTA v1.0 and High-Resolution Ship Collection 2016, HRSC2016) validate the effectiveness and generalization capability of the proposed method. Ablation experiments on DOTA v1.0 confirm that the optimal hyperparameter λ is 0.2. The combination of TBDR and DACW produces a synergistic effect, achieving a mean average precision (mAP) of 76.52%, a 1.80% improvement over the baseline model with only the AMM. Comparative experiments demonstrate that the proposed method achieves an mAP of 76.52% on the DOTA v1.0 test set (an 8.09% improvement over the baseline algorithm) and 90.30% on HRSC2016. Conclusion: This paper proposes a confidence-weighted guided angle correction algorithm for rotated object detection in remote sensing. By integrating three core modules (AMM, TBDR, and DACW), it systematically addresses two critical challenges in existing ROD: angular boundary discontinuity and rotated box parameter coupling. The algorithm achieves continuous angle regression through a complex exponential encoding-decoding mechanism, enables independent parameter optimization via a three-branch decoupling architecture, and implements adaptive angle correction using a dynamic confidence-weighted mechanism. Together, these three components resolve the angular discontinuity and parameter coupling issues and significantly enhance the detection accuracy and training stability for rotated objects. This approach provides a reliable technical basis for practical remote sensing applications in land surveying, national defense, and other fields.
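The confidence-weighted fusion in the abstract above can be sketched as follows. The abstract does not state the fusion formula, so the weight `lam * (1 - conf)` is an assumption chosen only to match the described behavior (less correction at high confidence, more at low confidence); the function names and the choice to blend along the shortest arc modulo π are likewise illustrative rather than the authors' implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping a logit to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fuse_angle(theta_raw, theta_amm, conf_logit, lam=0.2, period=math.pi):
    """Blend the raw angle with the AMM-corrected angle (radians).

    ASSUMPTION: correction weight = lam * (1 - confidence). The abstract
    only says that confidence regulates correction intensity and that
    lambda fine-tunes its magnitude.
    """
    conf = sigmoid(conf_logit)
    weight = lam * (1.0 - conf)
    # Shortest signed difference modulo the period, in [-period/2, period/2),
    # so that blending 179 degrees with 1 degree does not pass through 90.
    delta = (theta_amm - theta_raw + period / 2) % period - period / 2
    return (theta_raw + weight * delta) % period


# High confidence keeps the raw prediction almost unchanged;
# low confidence moves it toward the corrected angle by about lam.
confident = fuse_angle(math.radians(10), math.radians(20), conf_logit=10.0)
uncertain = fuse_angle(math.radians(10), math.radians(20), conf_logit=-10.0)
wrapped = fuse_angle(math.radians(179), math.radians(1), conf_logit=-10.0)
```

With the default `lam=0.2`, the value the ablation study reports as optimal, a fully uncertain prediction moves one fifth of the way toward the corrected angle under this assumed weighting.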
Guo Yurong, He Yufei, Zhang Ke, Zhang Tiefeng, Yang Hong
DOI:10.11834/jig.250581
Abstract: Objective: The precise identification of substation equipment is crucial for ensuring the stability and operational security of power systems. Traditional detection methods that rely on a single imaging modality, whether infrared thermal imaging or visible images, struggle to comprehensively characterize the multidimensional attributes of complex electrical apparatus. While the infrared modality effectively captures thermal information about equipment, it lacks the spatial resolution and textural detail required for structural assessment. Conversely, visible images provide excellent detail fidelity but are blind to the thermal phenomena directly related to equipment health status, and they are significantly affected by weather and environmental conditions. This complementarity highlights the core value of infrared-visible image fusion, a technique that synthesizes complementary diagnostic information into a unified visual representation. Despite its theoretical promise, existing fusion methods exhibit significant shortcomings in substation inspection scenarios: fused results often suffer from insufficient target saliency, ambiguous structural definition, and poor foreground-background differentiation. These deficiencies weaken the discriminative features required by downstream detection algorithms and ultimately reduce the accuracy and robustness of substation inspection. To address these limitations, this study proposes a dual-branch perceptual enhancement framework specifically designed to improve substation equipment inspection. Method: The architecture adopts a dual-branch paradigm. 
The shared-branch encoder uses a Transformer architecture to extract high-level structural commonalities that are invariant across the infrared and visible spectra, capturing the geometric continuity, topological relationships, and spatial configurations that define substation equipment categories. The complementary-branch encoder employs domain-optimized convolutional blocks to isolate modality-specific features, specifically the detailed texture information from visible images and the thermal information from infrared images. Compared with traditional methods, the proposed algorithm tightly integrates the feature enhancement mechanism with the substation equipment detection scenario. By targeting the enhancement of equipment structure and key details, it establishes a structure-detail decoupling fusion paradigm that significantly improves the topological integrity and feature saliency of equipment in the fused image. Specifically, a self-attention and structure enhancement module (SEM) operates on the shared features. This module dynamically constructs attention maps through cross-modal feature correlation and uses learned spatial weighting to selectively enhance equipment contours while suppressing irrelevant background structures, thereby improving the saliency of equipment structure and its distinguishability from the background. Simultaneously, the multibranch feature enhancement module (MFEM) processes the complementary features through parallel convolutional streams with cascaded refinement blocks to enhance the expression of detailed textures and thermal information of the target equipment. The refined feature tensors then undergo modality-specific fusion via feature fusion modules. 
Finally, the decoder reconstructs the fused features into the fused image, preserving both global structural coherence and local diagnostic details throughout the inverse transformation process. Result: Experimental validation was conducted using curated substation-specific infrared-visible image pairs capturing diverse equipment types under various operating conditions. In downstream equipment detection tasks, the algorithm performed strongly across key categories including suspension insulators, post insulators, current transformers, voltage transformers, and bushings. Quantitative assessment showed significant improvements in mean average precision (mAP@0.5): a 40.1% improvement over infrared-only detection, a 1.2% improvement over visible-only baselines, and a 3.9% gain over existing state-of-the-art (SOTA) methods, significantly enhancing the robustness of substation equipment detection. Regarding fused image quality, the method was evaluated on six established metrics: information entropy (EN), spatial frequency (SF), mutual information (MI), Q^{AB/F}, peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM). The fused outputs outperformed existing fusion methods on all six. Ablation studies isolating the architectural contributions confirmed that both the SEM and the MFEM are effective in both image fusion and object detection tasks. Conclusion: This research establishes a new paradigm for intelligent substation inspection through perceptually enhanced image fusion. By rethinking the synthesis of infrared and visible representations, the framework overcomes the limitations of conventional methods that compromise equipment detectability. 
Beyond the direct performance improvements, this study bridges a crucial gap between computer vision theory and power engineering practice, demonstrating that domain-aware fusion architectures are essential for mission-critical infrastructure applications. The methodology provides a foundational advancement toward fully autonomous power grid maintenance, inherently adaptable to the evolving inspection requirements within the energy sector.