Latest Issue

    Vol. 30, No. 9, 2025

      Artificial Intelligence Empowers Earth-Moon Space Perception

    • In the field of cislunar space perception, artificial intelligence technologies demonstrate the capability to build an intelligent perception system, providing a new direction for the intelligent development of cislunar space exploitation.
      Yu Dengyun, Yin Jihao, Liu Siqi, Wang Peng, Wang Huijuan, Zhang Yu, Jiang Hongxiang, Chen Pei
      Vol. 30, Issue 9, Pages: 2899-2910(2025) DOI: 10.11834/jig.250204
      Research progress and prospects of artificial intelligence-enabled space situational awareness technology in cislunar space
      Abstract: With the continuous advancement of deep space exploration and the rapid evolution of aerospace technologies, cislunar space, serving as a strategic nexus connecting Earth and deep space, is increasingly emerging as a critical arena for international space competition. The Moon’s unique orbital position and resource potential confer exceptional strategic value upon it: as Earth’s sole natural satellite, it not only provides a natural proving ground for deep space technologies but also serves as a resource base for future space development owing to its distinct environmental conditions (e.g., microgravity and extreme temperature variations) and rich reserves of materials (e.g., helium-3 and rare earth metals). According to recent reports, more than 20 lunar missions have been carried out globally since 2020, approximately 4 times the number from 2010 to 2019. This notable increase underscores the growing strategic focus on cislunar space and marks humanity’s formal entry into a new era of systematic cislunar development. However, current cislunar activities face two major technological issues: 1) a safety management crisis induced by exponential growth in space object density. The 2023 Cislunar space situational assessment report by the European Space Agency revealed that over 4 500 trackable objects larger than 10 cm presently exist in cislunar orbits, increasing by about 15% each year. These objects include functional spacecraft, derelict satellites, and collision-generated debris, whose unpredictable trajectories significantly elevate in-orbit collision risks. 2) Perception and prediction bottlenecks in complex dynamical environments. Owing to the coupled gravitational perturbations of the Earth and Moon together with solar radiation pressure, cislunar orbital dynamics exhibit highly nonlinear characteristics. Under these conditions, traditional tracking technologies, limited by meter-level ranging accuracy and simple models, achieve less than 60% continuous tracking accuracy, with trajectory prediction errors exceeding 200 km within 7 days. These limitations directly threaten the safety of future large-scale activities, such as lunar base construction and Earth-Moon transport systems, posing significant challenges to the high-precision intelligent perception and autonomous prediction technologies required to safeguard space assets. To address these challenges, artificial intelligence (AI) offers disruptive solutions. Next-generation AI technologies, particularly deep learning and reinforcement learning, demonstrate remarkable advantages in key domains through their robust nonlinear fitting capabilities and data-driven nature: 1) Object detection and recognition. The YOLOT model, which extends the YOLOv2 architecture with an attention mechanism and multiscale feature fusion, achieves an F1 score of 0.95 in object detection tasks under complex starry backgrounds. 2) Initial orbit determination. Researchers from a Turkish university developed an artificial neural network model that bypasses traditional orbital mechanics equations by learning spatiotemporal correlations directly from raw optical observation data, significantly reducing short-arc orbit determination times. 3) Long-term trajectory prediction. The MLP-TLE correction framework improves 30-day trajectory prediction accuracy for LEO targets by 40% through mining evolutionary patterns in historical TLE data, offering novel insights for deep-space object evolution modeling.
This article provides a systematic review of the application of AI in cislunar space situational awareness: 1) Mission planning and status analysis: it summarizes lunar exploration missions from major space powers, including China, the United States, Russia, and Europe, since the 21st century and analyzes their development plans and progress. 2) AI-driven situational awareness: by comparing traditional algorithms with AI approaches for key tasks such as space object detection and orbit prediction, it highlights AI’s advantages in computational efficiency and generalization performance. 3) Future development prospects: it proposes an evolutionary path from “dedicated algorithms to vertical domain large models”, focusing on breakthroughs in detecting faint targets under complex lighting and in intelligent orbit determination and prediction for cislunar objects. The ultimate goal is to build a vertical large model for cislunar situational awareness, significantly advancing the autonomy and intelligence of space operations. As an “experimental field” and a “resource repository” for deep space exploration, cislunar space is expected to see exponential growth in intelligent capabilities. Despite current challenges such as onboard computational constraints and insufficient validation of space-environment adaptability, AI technologies have demonstrated generality, adaptability, and scalability. These capabilities are transforming cislunar situational awareness from postevent analysis to real-time perception and from localized observation to comprehensive domain coverage. This technological evolution not only reshapes the strategic landscape among spacefaring nations but also provides a foundational capability for the sustainable exploration of deep space.
      Keywords: Lunar-Earth space; situation awareness; artificial intelligence (AI); target detection; orbit determination
    • Review of remote sensing image matching methods for lunar exploration

      In the field of deep space exploration, experts from China and abroad review the research progress of lunar image matching, providing important support for planetary surface mapping.
      Yang Yicheng, Peng Man, Wan Wenhui, Di Kaichang, Liu Zhaoqin, Li Lu
      Vol. 30, Issue 9, Pages: 2911-2942(2025) DOI: 10.11834/jig.250161
      Review of remote sensing image matching methods for lunar exploration
      Abstract: This paper provides a comprehensive review of remote sensing image matching methods used in lunar exploration, encompassing traditional and deep learning approaches. Driven by the advancements in deep space exploration technology, a wealth of diverse remote sensing image data has been acquired by lunar probes, offering fundamental support for planetary surface mapping, geological analysis, and resource exploration. However, the unique environmental characteristics of the Martian and lunar surfaces, such as extreme terrain variations, significant lighting changes, and a lack of prominent texture regions, pose substantial challenges to remote sensing image matching, including radiometric differences, geometric distortions, and scale variations. The review first introduces the relevant lunar and deep space exploration missions and the types of image data obtained. It then elaborates on the research progress of lunar image matching methods, categorized into three main areas: feature, dense, and deep learning matching methods. Traditional image matching techniques are divided into feature and dense matching. Feature matching involves extracting salient feature points like corners and edges from images and building feature descriptors for matching, establishing sparse correspondences. This technique is widely applied in tasks requiring high real-time performance, such as rapid localization, navigation, and image registration for orbiters and rovers. Dense matching, by contrast, generates continuous disparity fields through pixel-wise matching strategies, enabling the generation of high-precision depth maps. This technique is primarily used in tasks like 3D reconstruction of planetary surfaces, path planning for rover obstacle avoidance, and detailed reconstruction of local areas. However, traditional methods face several limitations in the context of planetary imaging, including decreased stability of feature descriptors due to extreme lunar surface lighting conditions, scarcity of feature points in weak texture areas, and increased mismatches caused by repetitive crater structures. The significant resolution differences and viewpoint changes between orbiter and rover images further complicate cross-scale matching. Deep learning methods have shown significant advantages in comparison with traditional approaches owing to their data-driven learning capabilities and end-to-end training mechanisms. They can effectively extract deep semantic features from images and model complex nonlinear spatial mapping relationships, thereby significantly improving image matching accuracy, robustness, and adaptability to various complex scenarios. The advantages are primarily twofold: First, the data-driven feature learning mechanism overcomes the reliance on manually designed features, enabling adaptive feature representation from large-scale remote sensing datasets and effectively mitigating mismatch issues caused by insufficient generalization of traditional feature descriptors. Second, deep neural networks exhibit stronger geometric reasoning capabilities in weak-texture areas, occluded regions, and areas with disparity discontinuities. Through multilevel feature fusion, global contextual information perception, and attention mechanism application, these models can leverage effective information from surrounding high-texture areas to assist matching inference in low-texture regions or handle ambiguities at occlusion boundaries by learning scene structures.
Hence, deep learning methods can achieve smooth and continuous disparity estimation, particularly excelling in maintaining disparity continuity at occlusion boundaries and performing context-based probabilistic inference optimization for texture-less regions. The end-to-end optimization further enhances overall matching performance by enabling mutual promotion between the feature extraction and matching processes. This study systematically reviews key advancements in applying deep learning techniques to lunar remote sensing image feature and dense matching. Deep learning-based feature matching methods are categorized into three types: feature extraction and description methods, end-to-end feature matching methods, and mismatch removal methods. Representative methods like SuperPoint and KeyNet automatically learn to extract robust feature points and generate compact descriptors. End-to-end methods like SuperGlue and LoFTR directly learn the correspondence between features or optimize the matching process. Mismatch removal methods employ additional constraints or optimization strategies to enhance matching accuracy. For deep learning-based dense matching, methods are typically divided into distributed and end-to-end approaches. Distributed methods integrate deep learning models into parts of the traditional pipeline, like in cost computation. End-to-end methods directly predict dense depth maps from image pairs using deep networks. Methods like GC-Net and PSM-Net are foundational in this area. Recent methods inspired by optical flow estimation, such as RAFT-Stereo and CREStereo, iteratively refine disparity results. This paper presents experimental comparisons of selected traditional and deep learning methods for feature matching and dense matching, considering the unique challenges in lunar remote sensing images, such as low contrast and complex lighting variations. In feature matching experiments using LROC NAC and Chang’E-3 rover images, traditional methods like scale-invariant feature transform (SIFT) and speeded up robust features (SURF) demonstrated stable performance across various metrics, including the number of correct matches, root mean squared error (RMSE), and repeatability. Deep learning methods like LoFTR and SuperGlue also showed balanced performance in terms of the number of filtered points and RMSE. In dense matching experiments using the LuSNAR dataset, the SGBM method achieved the lowest RMSE, indicating the highest accuracy, with the DepthAnything method performing very closely. In summary, this paper highlights that while traditional methods have mature theories and lower computational requirements, their limited adaptability restricts their application in complex deep space environments. Deep learning methods, with their strong data-driven feature learning and complex pattern modeling capabilities, offer significant performance improvements, particularly in challenging conditions. However, challenges remain regarding data scarcity, model complexity, and deployment on resource-limited space platforms. Future research should focus on developing efficient self-supervised and unsupervised learning algorithms to reduce reliance on labeled data, improving the realism of simulated environments, building standardized datasets, designing lightweight network architectures for on-board processing, and integrating multisource data and physical models to enhance accuracy and robustness in the challenging lunar environment.
The goal is to achieve real-time performance and develop integrated systems for matching, localization, and mapping to support future autonomous deep space exploration missions.  
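      For concreteness, the classical feature-matching baseline that the review compares against learning-based methods (SIFT detection, descriptor matching with Lowe's ratio test, RANSAC mismatch removal) can be sketched as below. This is a generic illustration, not the authors' evaluation code; the image file names are placeholders.

```python
# Minimal sketch of the classical feature-matching pipeline (SIFT + ratio test
# + RANSAC filtering) discussed in the review. File names are placeholders.
import cv2
import numpy as np

def sift_match(path_a, path_b, ratio=0.75):
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)

    # Brute-force matching with Lowe's ratio test to reject ambiguous matches.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des_a, des_b, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]

    # RANSAC homography as a simple mismatch-removal step.
    if len(good) >= 4:
        src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        good = [m for m, keep in zip(good, mask.ravel()) if keep]
    return kp_a, kp_b, good

# Example usage (hypothetical image pair):
# kp_a, kp_b, matches = sift_match("lroc_nac_tile.png", "change3_rover_view.png")
```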
      Keywords: deep space exploration; remote sensing image; feature matching; dense matching; deep learning
    • In the field of spacecraft monitoring, experts propose a space-ground coordinated monitoring scheme for near-lunar spacecraft based on radio passive sensing technology, providing a technical reference for the design and implementation of near-lunar spacecraft monitoring systems.
      Sun Jun, Chen Lue, Lu Weitao, Zhang Yujia, Kong Jing, Han Songtao
      Vol. 30, Issue 9, Pages: 2943-2950(2025) DOI: 10.11834/jig.250143
      Space-ground coordinated monitoring scheme for near-lunar spacecraft based on radio passive sensing
      Abstract: Objective: Aiming at the limitations of ground-based optical and radar systems for spacecraft monitoring at Earth-Moon distances, this study proposes a space-ground coordinated monitoring scheme for near-lunar spacecraft based on radio passive sensing technology, which does not actively transmit radio signals but only utilizes the downlink radio signals sent by spacecraft in orbit to realize target monitoring. Method: First, the theoretical basis of radio passive sensing is introduced: every two sensors form a measurement baseline and produce time-delay and frequency-difference observations of the target through the time difference of arrival (TDOA) and frequency difference of arrival (FDOA) methods. Then, three monitoring schemes applicable to near-lunar spacecraft are designed: space-based, ground-based, and space-ground cooperative. The space-based monitoring scheme mainly realizes passive monitoring of near-lunar spacecraft by deploying multiple space-based sensors at the first cislunar Lagrange point and forming a measurement baseline between every two sensors. The structure of the space-based monitoring system is preliminarily designed; it contains five subsystems, namely, the parabolic observation antenna subsystem, time and frequency reference subsystem, RF link and downconversion subsystem, signal sampling and recording subsystem, and high-speed communication-to-Earth subsystem. The workflow of the space-based monitoring system is also preliminarily analyzed and designed as follows: 1) the parabolic antenna of each space sensor, in accordance with the guidance file, is aligned with the lunar spacecraft target and receives its downlink RF signals. 2) The downlink RF signal is amplified, filtered, and converted into an intermediate frequency signal by the on-board downconversion system. 3) The intermediate frequency signal is collected and recorded by the on-board signal sampling and recording subsystem and converted into a digital signal. 4) The digital signal of the tracked target is sent to the ground station through the high-speed communication-to-Earth subsystem. 5) After the digital signals from two or more space sensors are gathered at the ground station, the signal processing center generates the TDOA and FDOA observations of the lunar spacecraft targets and determines their orbits. In this workflow design, the scenario in which a single space-based sensor completes the radio detection processing of a near-lunar space target on its own is not considered; instead, the signal processing of the space-based monitoring system is completed by the ground-based signal processing center. The sensors of the ground-based monitoring scheme are deployed on Earth’s surface, and radio monitoring of near-lunar spacecraft is carried out in a mode similar to the traditional joint observation of VLBI stations. In the space-ground cooperative monitoring scheme, the sensors are deployed at the first cislunar Lagrange point and on Earth’s surface, forming a joint space-ground measurement baseline to realize joint tracking of near-lunar spacecraft. The ground signal processing center then obtains the time-delay and frequency-difference observations of the target spacecraft, thus realizing orbit determination of the target.
Subsequently, the key technologies involved in the design of the near-lunar spacecraft monitoring scheme are analyzed, specifically sensor layout, target guidance, signal reception, TDOA/FDOA processing, error correction, and target orbit determination and forecasting. Result: A radio passive sensing monitoring test of a near-lunar spacecraft in orbit is carried out with a ground-based deep-space antenna. The signal of an unknown lunar spacecraft that randomly enters the main lobe of the deep-space antenna is successfully monitored, and spectrum analysis and frequency monitoring are carried out. The measured frequency shows a periodic pattern, of which exactly half a cycle (1 h) is covered by the observation time. This study infers that the period of the Doppler frequency variation generated by the spacecraft’s motion relative to the ground station is 2 h, which corresponds precisely to the Doppler frequency variation period of a typical circumlunar probe relative to a ground-based station. Through this experiment, the Doppler variation pattern of the target is characterized, and the feasibility of the monitoring scheme proposed in this paper is preliminarily verified. Conclusion: Effective radio monitoring of near-lunar spacecraft can be realized through multiple space- and ground-based antennas whose beams cover the near-lunar range. Coverage of the commonly used S/X measurement and control bands can be configured to determine their orbits and facilitate the safe operation of near-lunar spacecraft. The space-ground monitoring scheme for near-lunar spacecraft based on radio passive sensing proposed in this paper can provide useful technical references for the design and realization of subsequent near-lunar spacecraft monitoring systems.
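      For reference, the TDOA and FDOA observables produced by a baseline formed by sensors 1 and 2 can be written in their standard form (notation ours, not reproduced from the paper):

```latex
\Delta\tau_{12} = \frac{\lVert \mathbf{r}_t - \mathbf{r}_1 \rVert - \lVert \mathbf{r}_t - \mathbf{r}_2 \rVert}{c}, \qquad
\Delta f_{12} = -\frac{f_0}{c}\left[
\frac{(\mathbf{r}_t - \mathbf{r}_1)\cdot(\dot{\mathbf{r}}_t - \dot{\mathbf{r}}_1)}{\lVert \mathbf{r}_t - \mathbf{r}_1 \rVert}
- \frac{(\mathbf{r}_t - \mathbf{r}_2)\cdot(\dot{\mathbf{r}}_t - \dot{\mathbf{r}}_2)}{\lVert \mathbf{r}_t - \mathbf{r}_2 \rVert}
\right]
```

      where r_t and ṙ_t are the target position and velocity, r_i and ṙ_i the sensor states, f_0 the downlink carrier frequency, and c the speed of light. The orbit determination step then fits the target state to a set of such observations.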
      Keywords: near-lunar space; spacecraft; downlink signal; radio passive sensing; monitoring experiment
    • In the field of cislunar space awareness mission design and analysis, experts design an intelligent simulation system architecture based on “container cloud + microservices”, which effectively supports forward mission design and scheme optimization and provides a new direction for cislunar space mission research.
      Hu Jiaxin, Zheng Qingbiao, Zhu Yanwei, Wang Peng, Li Yongchang
      Vol. 30, Issue 9, Pages: 2951-2965(2025) DOI: 10.11834/jig.250183
      Architecture and implementation of an intelligent simulation system for cislunar space awareness mission design and analysis
      Abstract: Objective: Conducting cislunar space awareness missions is critical for advancing deep-space exploration and ensuring the safe operation of spacecraft in complex space environments. However, the cislunar environment, characterized by gravitational interactions, dynamic properties of space objects, and multi-factor coupling effects, presents significant challenges that traditional experimental validation and trial-and-error methods struggle to address efficiently and precisely. In recent years, China’s domestically developed aerospace simulation systems, such as the Aerospace Tool Kit developed by the National University of Defense Technology and the spacecraft system simulation software SpaceSim from Harbin Institute of Technology, have established autonomous technological frameworks that cover diverse mission types. Despite these advancements, specialized simulation solutions for deep-space exploration and cislunar space awareness remain underdeveloped, with limited integration of intelligent capabilities. The chaotic and nonlinear nature of the cislunar environment, combined with the dynamic variability and uncertainty inherent in perception tasks, imposes substantial demands on simulation system architectures, necessitating innovations in computational efficiency, adaptability, and scalability. To address these challenges, this study proposes a design philosophy that achieves modernization in four aspects: service orientation, standardization, domestic autonomy, and toolchain development. Method: The service-oriented approach employs a “container cloud + microservices” architecture, integrating dynamic load-balancing algorithms and an intelligent task-scheduling engine to build an elastically scalable service platform capable of resolving high-concurrency multi-user parallelism. Standardization efforts focus on developing cross-platform interoperable middleware under a unified framework to address heterogeneous system integration and model generalization. For domestic autonomy, a fully independent technology stack is established using Phytium Advanced RISC Machine processors and the Kylin V10 operating system. The toolchain development adopts a “platform + plugin” design to enable dynamic model component loading and flexible functional module integration, supporting maintainable frontend/backend interfaces and adaptable resource allocation. These innovations culminate in a three-tier intelligent simulation architecture (“basic support + service support + typical applications”) that addresses the chaotic dynamics of cislunar space, the dynamic uncertainty of perception tasks, and the computational complexity of large-scale spatiotemporal simulations. Key technologies include a cislunar orbital dynamics model library, Kubernetes-based intelligent task scheduling, and an end-to-end microservice integration framework, which collectively resolve challenges in computational scheduling, service autonomy, and standardized model encapsulation. These capabilities enable efficient high-precision three-body orbit calculations, multi-user parallel response, and intelligent control of simulation workflows. The developed cislunar space awareness mission design and analysis simulation system incorporates six core modules.
The first is the scenario design module, which leverages advanced technologies such as ThreeJS, WebGL, and Shader to address multi-view situational representation, multi-level spatial partitioning, multi-physical field visualization, multi-scenario synchronization, and multi-viewpoint navigation, achieving 3D visualization of large-scale spatiotemporal environments. The second is the situational awareness system design and analysis module, designed as a Cislunar Space System Tool Kits platform using ontological concepts to enable object-oriented system design and analysis capabilities, supporting iterative optimization through illumination analysis, orbital stability analysis, ground station visibility analysis, and target visibility analysis. The third is the mission planning module, which coordinates data transmission resources, tracking and control resources, and payload resources to generate integrated plans for data transmission, tracking, and payload operations. The fourth is the system-of-systems simulation module, enabling large-scale parallel simulations to validate design outcomes and mission workflows. The fifth is the experimental design module, which applies mature experimental design theories to generate effective combinations of test parameters and values, minimizing the test sample space. The sixth is the experimental evaluation module, which provides performance assessment and analysis for situational awareness tasks, allowing users to intuitively define and establish performance evaluation metrics. The system’s orbital dynamics model library encompasses 27 classical three-body orbits in four categories, such as halo, Lyapunov, and quasi-periodic orbits, supporting end-to-end mission phases from orbital design and control to target/environment simulation and capability evaluation. Result: The design and analysis of a space-based cislunar space awareness system serve as a case study. A situational awareness constellation is designed, incorporating libration-point satellite orbital configurations and long-term orbital maintenance planning. High-precision orbit prediction is achieved for both near-lunar resident targets and cislunar transfer trajectories, with 15-year integration of distant retrograde orbit and near-rectilinear halo orbit resident orbits under high-precision ephemeris models completed with minute-level computational performance. Comprehensive planning of data transmission, tracking and control, and payload resources for the awareness constellation is executed. Simulation-driven 3D visualization and multi-dimensional performance evaluation of mission scenarios are conducted, ultimately yielding quantified mission effectiveness metrics for the space-based cislunar awareness system. Conclusion: Case studies demonstrate the system’s ability to support forward design and iterative optimization of cislunar missions, exhibiting high efficiency, intelligence, stability, and scalability. On the basis of evaluation results, satellite orbital parameters (e.g., inclination, eccentricity) and constellation configurations (e.g., inter-satellite spacing, orbital phasing) are dynamically adjusted to iteratively optimize mission plans, enhancing system performance and operational capabilities.
This process ensures the scientific rigor and operational feasibility of designs while providing reliable technical foundations for subsequent engineering implementation, thereby solidifying its role as an important tool for advancing cislunar infrastructure development and deep-space exploration initiatives.  
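      A minimal sketch of the "model as a service" parallelism described above is given below: an orbit-propagation model wrapped as a stateless callable and independent scenario runs dispatched to a worker pool. The propagate_orbit stub and job fields are placeholders for illustration, not the system's actual interfaces or scheduler.

```python
# Sketch of dispatching independent simulation jobs to a worker pool, loosely
# mirroring the high-concurrency multi-user parallelism described above.
# propagate_orbit and the SimJob fields are illustrative placeholders.
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass

@dataclass
class SimJob:
    scenario_id: str
    initial_state: tuple    # (x, y, z, vx, vy, vz) in an assumed frame
    duration_days: float

def propagate_orbit(job: SimJob) -> dict:
    # Placeholder for a call to a real three-body/ephemeris propagation service.
    return {"scenario": job.scenario_id, "days": job.duration_days, "status": "done"}

def run_batch(jobs, max_workers=4):
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(propagate_orbit, jobs))

if __name__ == "__main__":
    jobs = [SimJob(f"dro-case-{i}", (0.8, 0.0, 0.0, 0.0, 0.5, 0.0), 15 * 365)
            for i in range(8)]
    for result in run_batch(jobs):
        print(result)
```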
      Keywords: system architecture; container technology; microservice technology; lunar space; three-body orbit
    • Research on the observation mechanism of cislunar space cataloging system

      In the field of cataloging and orbit determination of non-cooperative targets in cislunar space, experts propose key cataloging and orbit determination algorithms based on space-based optical observation, providing a solution for improving cataloging and orbit determination accuracy and timeliness.
      Chen Yanling, Huang Yong, Wang Peng
      Vol. 30, Issue 9, Pages: 2966-2974(2025) DOI: 10.11834/jig.250146
      Research on the observation mechanism of cislunar space cataloging system
      Abstract: Objective: In cislunar space, the core area for deep space exploration and lunar resource development, space activities are becoming increasingly frequent. Meanwhile, the number of space objects in near-Earth, cislunar, and deep space has grown rapidly. By the end of 2024, more than 100 lunar spacecraft were in orbit in cislunar space, and the need for timely cataloging, orbit determination, and monitoring of these objects has intensified. However, cataloging of noncooperative cislunar targets suffers from constrained orbit determination accuracy and timeliness owing to the lack of cooperative signals for precision measurement, insufficient deep-space tracking resources, and the trajectory complexity induced by multibody perturbations, which necessitates sophisticated dynamical models. This study focuses on the cataloging and orbit determination of cislunar space objects, investigating key algorithms for space target cataloging and orbit determination under the constraint of three-body orbits based on space-based optical observation. Method: This study considers the orbital stability characteristics of the distant retrograde orbit (DRO) and near-rectilinear halo orbit (NRHO), which are well suited for space-based monitoring systems. Accordingly, we utilize a four-satellite constellation consisting of two DRO and two NRHO spacecraft as space-based tracking stations. A dynamic statistical orbit determination algorithm is adopted for cataloging and orbit determination of typical noncooperative DRO and NRHO targets under different platform orbit errors, numbers and distributions of platforms, and observation data conditions. The orbit determination accuracy of DRO and NRHO targets under these conditions is then comprehensively analyzed. Result: Simulation results show that, under a 1 km orbit error of the space-based platform, a 2″ measurement noise level in the optical imaging observations, and a 10% dynamic model error in solar radiation pressure, the 2-DRO plus 2-NRHO cislunar space navigation constellation achieves a single-station orbit determination accuracy of 1–7 km for DRO targets, whereas dual-station observation of 3 h per day over 3 days achieves an orbit determination accuracy of about 1 km, with a predicted 1-day orbit accuracy better than 1.3 km. For NRHO targets, the orbit determination accuracy for single-station observation over 3 days is 1–3 km, while those for dual-station observation of 3 h per day, 6 h per day, and continuous collaborative observation are better than 1.2, 1.1, and 1.0 km, respectively. Under these three observation conditions, the predicted 1-day orbit errors (root mean square) are better than 3.0, 2.5, and 1.9 km, respectively. The results indicate that, with a 1 km error in the space-based station, the accuracy of dual-station observation is significantly improved compared with that of single-station observation. With the dual-station system observing for 3 h per day, the orbit determination accuracy for both DRO and NRHO targets is better than 1.2 km; the 1-day orbit prediction accuracy is better than 1.3 km for DRO targets, and NRHO orbit prediction reaches its best accuracy, better than 1.9 km, under continuous observation for 3 days. Conclusion: Under the current simulation conditions, the cataloging and orbit determination capability of a two-station system with 3 h daily observations outperforms that of a single station with continuous daily observations.
This conclusion holds significant engineering value for 1) evaluating and optimizing the observation efficiency and operational methodology of cislunar space cataloging and orbit determination, 2) supporting the layout planning of observation systems, and 3) enhancing the situational awareness capability in cislunar space.  
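      The dynamic statistical (batch least-squares) orbit determination step referenced above can be summarized by the standard weighted normal equations (generic formulation, not taken from the paper):

```latex
\delta\hat{\mathbf{x}}_0 = \left( H^{\mathsf T} W H \right)^{-1} H^{\mathsf T} W \, \delta\mathbf{y},
\qquad H = \frac{\partial \mathbf{y}}{\partial \mathbf{x}_0}
```

      where δy collects the observed-minus-computed optical angles, H contains the partials of the observations with respect to the epoch state x₀ propagated through the three-body dynamics, and W is the weight matrix built from the assumed 2″ measurement noise; the correction δx̂₀ is applied and the process iterated to convergence.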
      Keywords: cislunar space; non-cooperative target; space-based optical observation; cataloging and orbit determination; distant retrograde orbit (DRO); near-rectilinear halo orbit (NRHO)
    • In the field of space situational awareness, experts propose a KAN-based short-arc initial orbit determination model for high-orbit targets, effectively addressing the space-based optical observation problem and providing a new means of space object monitoring.
      Liu Hongyu, Zhang Yu, Zhang Chaoqun, Chen Guo, Yin Jihao
      Vol. 30, Issue 9, Pages: 2975-2987(2025) DOI: 10.11834/jig.250147
      Initial orbit determination of high-earth-orbit objects combined with Kolmogorov-Arnold network
      Abstract: Objective: Initial orbit determination (IOD) estimates a spacecraft’s orbit by using a limited set of observations, providing rapid trajectory insights and essential initial estimates for subsequent orbit refinement. As a fundamental component of space situational awareness (SSA), an efficient and accurate IOD method is crucial for downstream tasks such as track association, satellite cataloging, and anomaly detection. Optical observations are the primary means of acquiring space object data, but they lack direct range measurements, posing a major challenge for IOD. Traditional IOD methods typically rely on iterative range estimation to improve accuracy, leading to high computational costs and a strong dependence on precise dynamical models and sufficient observations. These challenges are exacerbated in short-arc scenarios, where limited physical constraints often result in trivial solutions or convergence failures. For high-earth-orbit objects, orbital dynamics are influenced by multiple perturbation factors, including the Earth’s non-spherical gravitational field, solar radiation pressure, and lunar gravitational effects, making orbit determination significantly more complex than in low Earth orbit. While deep learning has demonstrated strong nonlinear fitting capabilities, conventional neural networks function as black-box models and struggle to incorporate the underlying physical laws governing orbital motion. In contrast, the Kolmogorov-Arnold network (KAN), inspired by the Kolmogorov-Arnold representation theorem, offers an interpretable alternative. Unlike the traditional multilayer perceptron (MLP), KAN replaces fixed-weight parameters with learnable univariate functions, parameterized as splines, enabling adaptive activation functions and a more flexible topology. This design preserves strong fitting capabilities while improving interpretability. To overcome the limitations of traditional IOD techniques and conventional deep learning models in short-arc IOD tasks, we propose a novel KAN-based IOD method for high-earth-orbit objects using short-arc observations. By leveraging short-arc space-based optical observations, our method integrates deep learning with orbital physics, enhancing accuracy and computational efficiency synergistically. Method: In this study, we developed a model based on KAN for short-arc IOD tasks of high-earth-orbit objects. The model architecture consists of five interconnected KAN layers, employing a design strategy of dimensionality increase followed by dimensionality reduction. The model processes angular observation data of target objects acquired from space-based observers. Through its hierarchical structure, the network effectively extracts complex nonlinear relationships from the input data while focusing on the most critical features. Ultimately, the model outputs accurate predictions for the position vector and the velocity vector of the target object at the initial moment of the observation arc. Furthermore, we constructed a large-scale dataset of short-arc space-based optical observations for high-earth-orbit objects. This dataset is derived from real on-orbit satellite data provided by CelesTrak, selecting 581 high-orbit active satellites from its database as target objects. We further designed seven satellites to serve as space-based observers. We generated a total of 258 741 observation samples by simulation, which were split into training (70%), validation (15%), and test (15%) sets.
Each observation arc spans 300 seconds with a sampling interval of 60 seconds, recording the observation angles and their rates of change of the target satellites relative to the observer. The selected dataset features a diverse range of orbits for the satellites, providing a reliable data foundation for model training and performance evaluation. Using this dataset, we trained our model. To address inconsistencies in parameter units and magnitudes, we first applied data normalization before feeding the data into the model as a preprocessing step. The model employs a composite loss function that incorporates both positional and velocity errors, regulated by a weighting coefficient λ to prevent over-optimization toward either metric. For training, we adopted the Adam optimizer with an initial learning rate of 0.001, supplemented by a dynamic learning rate scheduler; if the validation loss failed to decrease for 100 consecutive epochs, then the learning rate was halved. This approach ensures rapid and stable convergence while improving training efficiency considerably. With a batch size of 512 and training performed on an NVIDIA GeForce RTX 3090 GPU, the entire training process took approximately five hours to complete 2 000 training epochs. Result: We evaluated our approach by comparing our model with four traditional IOD methods: the Laplace method, the Gauss method, the Double-R method, and the circle-orbit tracking method. The experiment was conducted on the dataset that we constructed. Quantitative evaluation metrics included the average position error (Ep), average velocity error (Ev), and the speed metric arcs per second (APS). Experimental results demonstrate that our method outperforms all other methods across multiple performance dimensions. In terms of accuracy, our model’s Ep is 27.458 km, which is only 0.24%, 0.58%, 0.49%, and 0.50% of those achieved by the four traditional IOD methods, respectively. Similarly, our model’s Ev is 3.904 m/s, which is only 0.46%, 1.14%, 0.95%, and 0.98% of the Ev values of the corresponding methods. Moreover, with regard to processing speed, our method’s APS is 154, 212, 822, and 132 times higher than that of the competing methods, demonstrating crucial advantages in both accuracy and efficiency. In addition, our approach exhibits outstanding stability. It produces no unsolvable cases and generates erroneous solutions at a rate less than 0.08% of that observed with traditional IOD methods. Furthermore, when compared with the conventional deep learning method MLP, our method achieves Ep and Ev values that are only 25.01% and 19.58% of those of the MLP, respectively, highlighting KAN’s superior nonlinear fitting capability over conventional deep learning methods. Conclusion: In this study, we proposed a KAN-based model for short-arc IOD tasks of high-earth-orbit objects. Experimental results show that our model demonstrates superior accuracy and efficiency compared with several traditional IOD methods while also outperforming conventional deep learning approaches. It exhibits exceptional stability and effectively solves the problem of short-arc initial orbit determination.
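      A minimal PyTorch sketch of the training setup described above is given below: a regression network maps normalized short-arc angle observations to position and velocity, trained with a composite loss L = L_pos + λ·L_vel under Adam (lr 0.001) with a plateau-based learning-rate halving. A plain MLP stands in for the paper's KAN layers; the input dimension and λ are illustrative assumptions, not the authors' configuration.

```python
# Sketch of the composite-loss IOD regression described above. A plain MLP
# stands in for the paper's KAN layers; sizes and lambda are assumptions.
import torch
import torch.nn as nn

N_OBS_FEATURES = 24   # assumed: angles and angle rates over a 300 s arc
LAMBDA = 1.0          # assumed weighting between position and velocity terms

class IODRegressor(nn.Module):
    def __init__(self, in_dim=N_OBS_FEATURES, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 6),      # (x, y, z, vx, vy, vz), normalized
        )

    def forward(self, obs):
        return self.net(obs)

def composite_loss(pred, target, lam=LAMBDA):
    # Position error on the first three components, velocity error on the rest.
    pos_err = torch.mean((pred[:, :3] - target[:, :3]) ** 2)
    vel_err = torch.mean((pred[:, 3:] - target[:, 3:]) ** 2)
    return pos_err + lam * vel_err

model = IODRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate if validation loss stalls; call
# scheduler.step(val_loss) after each validation pass.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=100)

def train_step(obs_batch, state_batch):
    optimizer.zero_grad()
    loss = composite_loss(model(obs_batch), state_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```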
      Keywords: initial orbit determination (IOD); Kolmogorov-Arnold network (KAN); short arc; space-based optical observations; high-earth-orbit objects; deep learning
    • In the field of spatially stable orbit design, experts propose a frequency data processing method based on image information extraction and, for the first time, construct semi-analytical solutions for spatially stable orbits near the Earth-Moon triangular libration points, providing an efficient solution for orbit design.
      Liu Mulin, Hou Xiyun, Wang Peng
      Vol. 30, Issue 9, Pages: 2988-2999(2025) DOI: 10.11834/jig.250144
      Feature identification based on photometry analysis for orbits around the Earth-Moon triangular libration points
      Abstract: Objective: The Earth-Moon triangular libration points (TLPs) have gained interest from many fields of astronomy, including natural body observation and trajectory design for probes and satellites. Early research was conducted using simple dynamical models such as the circular restricted three-body problem and the bicircular problem model. However, real-world scenarios necessitate high-fidelity ephemeris models that incorporate perturbations from the Sun and other celestial bodies, under which the TLPs cease to be equilibrium points. Despite this complexity, quasi-periodic orbits persist in their vicinity, exhibiting a complicated relationship between forced motions (driven by external perturbations) and free motions (intrinsic oscillations). Even though traditional numerical methods are robust, their iterative nature prevents them from capturing these dynamics efficiently, and their computational cost increases sharply with model fidelity. Method: This work addresses these challenges by proposing a hybrid semi-analytical approach that combines analytical insights with numerical refinements, significantly reducing computational overhead while preserving accuracy. The method takes only about 1% of the time that the traditional numerical method needs to compute the same orbit. With an analytical or semi-analytical solution of the target orbits used as the initial input, far fewer computational resources are needed to derive the final designed orbits. In this work, a method to construct semi-analytical orbits is presented. It is based on the fact that many orbits in cislunar space are combinations of forced motions and free motions, making their trajectories quasi-periodic. As a result, the semi-analytical solution for these orbits can be written in trigonometric series form. Changes in the amplitude parameters of the orbits cause changes in the coefficients of the series and in the frequency spectrum of the free motion components, enabling us to find the combinations of base frequencies of a given frequency object in the least squares sense. The study employs the JPL DE430/431 ephemeris to model the Earth-Moon-Sun system, integrating gravitational interactions in the geocentric celestial reference system. The equation of motion is transformed to the Earth-Moon synodic frame, where perturbations from the Sun and nonlinear effects are explicitly considered. Stable orbits are decomposed into forced and free components, with the latter governed by three fundamental frequencies ω_l, ω_s, and ω_v. Orbital trajectories are analyzed using the fast Fourier transform (FFT) to extract dominant frequency components. A hybrid approach that combines FFT with continuous Fourier transform (CFT) refinement is implemented to mitigate spectral leakage and aliasing effects. FFT initialization is used to identify prominent frequency peaks from time-domain orbital data. CFT optimization is used to determine precise frequencies by iteratively subtracting dominant signals and recalculating residuals, enhancing resolution beyond the FFT limitations. After all frequency data are extracted from the orbit data samples in the time domain, signal preprocessing is performed to prepare the data for integer-combination determination and polynomial fitting of coefficients. The distance between frequency data extracted from neighboring orbits is used as a constraint for identifying frequency objects that share the same integer combination of base frequencies.
Polynomials are then used to fit the coefficients of the frequency objects using the data prepared in the previous step. A test case is provided to validate the proposed method. Previous work showed the spatial structure of the stable regions in terms of γ values, namely γ = 0.4–0.6, γ = 0.65–0.7, and γ = 0.73–0.8. These regions are formed by certain resonances between the free frequencies of the spatial orbits and combinations of certain base frequencies of the motion of the Moon. Result: An analysis of stable orbits in the three spatial stable regions around the TLPs validated the effectiveness of the frequency analysis method and of the signal processing procedures for frequency data characterization. In this work, semi-analytical solutions for the γ values in the three stable regions are constructed for the first time. The proposed method reduces computational overhead considerably by replacing iterative numerical simulations with semi-analytical approximations. By isolating critical frequency components and their amplitude dependencies, the framework enables rapid design of stable orbits with tailored characteristics. Furthermore, the spectral analysis provides insights into instability mechanisms, such as resonance overlaps, guiding future mission planning. Conclusion: This work demonstrates the efficacy of image-based frequency extraction in constructing semi-analytical solutions for spatial orbits near the Earth-Moon TLPs. The integration of FFT/CFT refinement, dynamical modeling, and empirical fitting offers a robust pipeline for efficient orbit design. The method can be applied to other quasi-periodic orbits that consist of forced motions and free motions. Future applications could extend this approach to multi-body systems or deep-space missions, where computational efficiency is paramount.
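      The "FFT initialization, then iterative refinement" idea described above can be sketched as follows: take the dominant FFT peak, least-squares fit the corresponding sinusoid, subtract it, and repeat on the residual. The synthetic two-frequency signal is only for illustration, and the paper's CFT refinement is more elaborate than this bin-resolution sketch.

```python
# Sketch of FFT peak extraction with iterative subtraction of fitted sinusoids,
# illustrating the frequency-decomposition idea described above.
import numpy as np

def dominant_frequency(signal, dt):
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=dt)
    k = np.argmax(np.abs(spec[1:])) + 1   # skip the DC bin
    return freqs[k]

def fit_sinusoid(t, signal, f):
    # Least-squares fit of a*cos(2*pi*f*t) + b*sin(2*pi*f*t) + c.
    A = np.column_stack([np.cos(2 * np.pi * f * t),
                         np.sin(2 * np.pi * f * t),
                         np.ones_like(t)])
    coef, *_ = np.linalg.lstsq(A, signal, rcond=None)
    return A @ coef, coef

def extract_components(t, signal, n_components=3):
    residual = signal.copy()
    components = []
    dt = t[1] - t[0]
    for _ in range(n_components):
        f = dominant_frequency(residual, dt)
        model, coef = fit_sinusoid(t, residual, f)
        components.append((f, np.hypot(coef[0], coef[1])))
        residual = residual - model
    return components

if __name__ == "__main__":
    t = np.linspace(0.0, 200.0, 4001)
    y = 1.0 * np.cos(2 * np.pi * 0.123 * t) + 0.3 * np.sin(2 * np.pi * 0.047 * t)
    print(extract_components(t, y, n_components=2))
```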
      Keywords: triangular libration points (TLPs); quasi-periodic orbit; frequency analysis; quasi-analytical solution; orbit correction; numerical fitting
    • In the field of cislunar target cataloging and orbit determination, experts propose an intelligent short-arc trajectory classification method for orbit families that combines a CNN with the constrained admissible region (CAR), achieving highly robust trajectory classification and providing an effective reference for intelligent identification of complex cislunar orbits.
      Li Jiayi, Cai Han, Sun Xiucong, Gui Haichao, Zhang Chen
      Vol. 30, Issue 9, Pages: 3000-3011(2025) DOI: 10.11834/jig.250139
      Intelligent classification algorithm for cislunar trajectories integrating deep neural networks and constraint admissible region
      Abstract: Objective: The growing number of lunar and deep-space exploration missions has transformed the Earth-Moon environment into a dynamic and complex orbital domain, where the ability to rapidly and accurately classify trajectory families is critical for mission planning, space situational awareness, and orbital debris management. Traditional approaches, such as statistical clustering of orbital elements or permissible domain methods dependent on manually established boundaries, face critical limitations in this context. These methods fail to adequately address the nonlinear dynamics and analytical intractability of three-body gravitational interactions, resulting in low classification accuracy and computationally intensive orbit determination processes. To address these challenges, this study proposes a novel hybrid framework called convolutional neural network-constrained admissible region (CNN-CAR), which combines data-driven feature learning with physics-based constraints. By synergizing the pattern recognition capabilities of CNN with the dynamical principles of the circular restricted three-body problem (CR3BP), the framework aims to improve the computational efficiency of orbit determination and the precision of orbital family attribution, particularly for trajectories of unknown origin. This advancement provides a robust basis for real-time cataloging, autonomous trajectory screening, and reliable identification of objects in cislunar space. Method: The CNN-CAR framework is developed through a structured three-stage process. Initially, a comprehensive database of known Earth-Moon trajectory families is constructed, encompassing halo orbits at L1 and L2, as well as distant retrograde orbits (DROs). This database is populated with high-fidelity orbital parameters obtained from the JPL Horizons system. A CNN is subsequently trained on this dataset to automatically extract high-dimensional dynamical features from orbital state vectors, including positional coordinates and velocity components. The trained CNN functions as a feature extraction engine, capable of discerning complex nonlinear relationships among these parameters that are often obscured in traditional methods. In the second stage, initial CARs are defined using Jacobi constant constraints based on the CR3BP. These constraints establish permissible parameter ranges that reflect the dynamical stability criteria of each trajectory family. The CNN’s classification outputs are then integrated with these physics-informed boundaries to dynamically refine the CARs. This integration systematically excludes regions that are outside the learned feature clusters and inconsistent with the Jacobi thresholds, thereby narrowing the search space for orbit determination tasks while preserving physical validity. In the final stage, unknown trajectories are classified through a fusion mechanism that leverages both the deep semantic features extracted by the CNN and the predefined CARs for distinct orbital families. The framework evaluates the compatibility of a trajectory’s inferred Jacobi constant with the CARs and its similarity to the learned feature patterns. This approach enables the autonomous assignment of trajectories to the most probable orbital family even under conditions of uncertainty. Result: The performance of the proposed framework was rigorously validated through numerical simulations in a representative scenario where the observational platform was positioned on a DRO.
Three primary orbital families, halo L1, halo L2, and DRO, were selected for comprehensive testing of classification accuracy and CAR optimization. Compared with conventional methods that rely solely on Jacobi constant constraints, CNN-CAR reduced the admissible domain area by more than 50% in the simulations. This reduction not only minimizes the computational effort required for orbit determination but also maintains consistency with the constraints derived from the CR3BP. In the classification of unknown trajectories, the hybrid CNN-CAR model achieved a success rate of 50%, nearly doubling the accuracy of the standalone CAR method (25%). This substantial improvement is attributed to the CNN’s ability to discern subtle nonlinear features that are unattainable through purely physics-based methods. The framework also exhibited notable robustness against observational noise, maintaining a stable recognition rate of 50% even when positional uncertainties reached 1 000 arcseconds. Conclusion: The proposed CNN-CAR hybrid framework represents a step forward in trajectory classification and orbit determination within the Earth-Moon system. By unifying machine learning’s adaptability with the rigor of classical astrodynamics, the method overcomes the drawbacks of traditional techniques, balancing computational efficiency and physical interpretability. The dynamic optimization of admissible regions, guided by learned features and Jacobi constant constraints, effectively reduces search spaces while maintaining consistency with three-body dynamics. This ability is essential for real-time operations in complex orbital environments. The framework’s enhanced accuracy for unknown trajectories advances autonomous space situational awareness, enabling rapid identification of uncharacterized objects such as space debris or non-cooperative spacecraft. In addition, the framework shows robust performance under high-noise conditions and varied observational scenarios, suggesting its potential for practical applications such as lunar orbital servicing, cislunar debris mitigation, and multi-objective trajectory planning. This work highlights the benefits of combining data-driven methods with physics-based constraints. It provides a reliable approach to addressing orbital management challenges in the Earth-Moon system and offers a foundation for intelligent trajectory analysis in complex gravitational systems.
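      For reference, the CR3BP Jacobi constant used to bound the admissible regions can be computed as below, in the standard nondimensional rotating-frame form (some texts add a constant μ(1-μ) term). The per-family admissible bands shown are placeholders for illustration, not the values used in the paper.

```python
# Jacobi constant in the nondimensional CR3BP rotating frame and a toy
# "constrained admissible region" check. Family bands are placeholders.
import numpy as np

MU_EARTH_MOON = 0.01215  # Moon mass parameter (approximate)

def jacobi_constant(state, mu=MU_EARTH_MOON):
    x, y, z, vx, vy, vz = state
    r1 = np.sqrt((x + mu) ** 2 + y ** 2 + z ** 2)        # distance to Earth
    r2 = np.sqrt((x - 1.0 + mu) ** 2 + y ** 2 + z ** 2)  # distance to Moon
    return (x ** 2 + y ** 2) + 2.0 * (1.0 - mu) / r1 + 2.0 * mu / r2 \
        - (vx ** 2 + vy ** 2 + vz ** 2)

# Hypothetical admissible Jacobi-constant bands per orbit family.
FAMILY_BANDS = {"halo_L1": (3.00, 3.15), "halo_L2": (3.00, 3.15), "DRO": (2.90, 3.00)}

def admissible_families(state, bands=FAMILY_BANDS):
    c = jacobi_constant(state)
    return [name for name, (lo, hi) in bands.items() if lo <= c <= hi]
```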
      Keywords: space situational awareness; trajectory classification; convolutional neural network (CNN); constraint admissible region (CAR); cislunar space objects; fusion algorithm
    • Radar imaging and intelligent cognition for cislunar space objects

      In the field of cislunar space exploration, experts systematically explore radar detection mechanisms and information acquisition methods, providing a core pathway toward highly robust detection of cislunar space objects.
      Liu Dan, Zhang Lei, Shang Huatao, Ren Yuanzhen
      Vol. 30, Issue 9, Pages: 3012-3025(2025) DOI: 10.11834/jig.250148
      Radar imaging and intelligent cognition for cislunar space objects
      Abstract: Objective: With the unprecedented and continuous advancement of international lunar exploration initiatives (e.g., the Artemis program and Chang’e series) and ambitious deep space missions targeting Mars and beyond, cislunar space, the region encompassing the Earth-Moon system, is experiencing a dramatic surge in spacecraft traffic. This traffic includes lunar orbiters, transfer vehicles, landers, potential space stations (e.g., the Lunar Gateway), and even discarded mission stages or debris. Consequently, achieving highly robust, reliable, and persistent detection and characterization of objects within this vast and complex domain has become an essential prerequisite for ensuring the operational safety, collision avoidance, and long-term sustainability of current and future space activities. The inherent challenges in cislunar space, characterized by extreme distances, diverse and complex orbital dynamics far beyond near-Earth regimes, and weak signal returns, demand fundamentally new approaches. This study aims to systematically investigate the underlying physical mechanisms governing radar detection in this challenging environment and develop sophisticated information acquisition methodologies specifically tailored for cislunar objects. Our primary focus is on overcoming the critical technical hurdles that plague current systems, notably the severely low signal integration efficiency arising from long dwell times and intricate, non-Keplerian motion patterns and the consequent poor imaging performance (manifesting as severe defocusing and geometric distortions) caused by the highly complex, coupled orbital and, potentially, attitude dynamics that cannot be adequately modeled with traditional approximations. Method: Commencing with a rigorous analysis of representative cislunar orbital trajectories, this research systematically uncovered the complex variation patterns governing radial distance and velocity dynamics of diverse orbital targets relative to radar observation platforms, thereby establishing a foundational kinematic framework essential for subsequent signal modeling. Building upon this foundation, this study directly confronted the critical challenge that target attitude rotation (emerging as spin, precession, or tumbling) cannot be effectively simplified or decoupled from orbital motion over extended coherent integration intervals. This necessity drove the development of an integrated approach, fundamentally combining high-fidelity orbital motion equations with sophisticated electromagnetic scattering representations. This synthesis establishes that motion-constrained models, which dynamically account for the intricate coupling between orbital trajectory and target attitude, constitute the indispensable theoretical bedrock for advanced radar imaging processing under these uniquely complex dynamical conditions. Finally, to empirically validate the theoretical framework and assess practical implications, this investigation executed comprehensive empirical analyses grounded in prototypical lunar surface target detection scenarios. These analyses included comparative assessments of radar imaging results across distinct lunar regions characterized by varying topographies.
Through meticulous simulation and comparative evaluation, these analyses elucidated the precise mechanisms by which inherent orbital characteristics (such as pronounced eccentricity or rapid range-rate fluctuations) directly influence critical radial measurement variations, including range walk and Doppler smear. Simultaneously, the comparative imaging outcomes, particularly the contrast in focus quality between processing chains employing and omitting precise motion constraints, provided compelling quantitative evidence for the paramount significance of integrated motion-scattering models in substantially enhancing overall radar imaging performance for discrete objects and extended lunar surface features within the challenging cislunar domain. Result: This investigation yielded evidence of fundamentally distinct radial motion characteristics (encompassing range rate profiles, acceleration dynamics, and Doppler history complexity) across diverse classes of cislunar orbits. These inherent kinematic variations proved decisive in dictating the requisite radar signal processing methodologies, particularly governing coherent integration window selection, motion compensation algorithm complexity, and pulse repetition frequency configuration for mitigating ambiguities. Crucially, the analysis demonstrated that the intricate, inseparable coupling between target attitude rotation (whether spin, precession, or tumbling) and its underlying orbital trajectory fundamentally invalidates conventional radar imaging models predicated on simplified translational motion or decoupled rotational assumptions, inevitably leading to severe image degradation characterized by pronounced defocusing, geometric distortions, and unresolved scattering centers when traditional techniques are applied. In compelling contrast, advanced compensation algorithms rigorously incorporating precise, dynamically coupled motion constraints derived from integrated orbit-attitude models proved exceptionally effective in suppressing these debilitating imaging artifacts, restoring near-theoretical resolution limits, and achieving high-fidelity target reconstruction even under the most challenging dynamical regimes, as quantitatively evidenced by significant improvements in key image quality metrics including impulse response sharpness, sidelobe suppression, and geometric integrity. Furthermore, lunar surface imaging experiments, designed to simulate realistic operational scenarios, provided unequivocal validation: comparative analysis of synthetic aperture radar outputs across varied lunar terrains (such as rugged highlands versus smoother mare regions) consistently revealed that image formation processes neglecting the precise nonlinear motion constraints intrinsic to the observer’s orbit suffered from substantial smearing and spatial distortion. On the contrary, the systematic application of motion-constrained compensation techniques yielded consistently high-contrast, geometrically accurate renditions of surface features, thereby definitively reinforcing the critical operational necessity of these sophisticated models not only for discrete object characterization but also for reliable high-resolution mapping of extended lunar surfaces from complex cislunar vantage points. Conclusion: This study conclusively demonstrates that overcoming the formidable challenges of cislunar space object detection hinges on moving beyond traditional radar processing paradigms.
Developing sophisticated target imaging techniques and intelligent feature extraction methods fundamentally constrained by and integrated with high-fidelity orbital dynamics and attitude variation models represents the core, indispensable pathway to achieving highly robust, high-resolution detection and characterization of objects throughout the cislunar domain. The integrated modeling approach and motion-constrained processing validated here provide a powerful framework. This research direction holds importance not only for fundamental space science and lunar exploration but critically for ensuring the sustainable development of humanity’s activities beyond Earth orbit. The methodologies established form a critical foundation for future autonomous navigation, debris mitigation, and on-orbit servicing capabilities in the increasingly vital cislunar theater.  
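      To make the role of motion-constrained compensation concrete, the following is a minimal, illustrative sketch (not the authors' implementation): given a modeled slow-time range history r(t) for a scatterer whose motion couples orbital drift with rotation, the echo phase is counter-rotated by exp(+j4πr(t)/λ) before coherent integration. The radar parameters and the form of r(t) are assumptions made purely for illustration.

```python
import numpy as np

# Minimal sketch of motion-constrained coherent integration for one scatterer.
# Assumed (illustrative) parameters: X-band radar, range history combining an
# orbital drift, a quadratic term, and a spin-induced micro-motion.
c, fc = 3e8, 10e9                 # speed of light, carrier frequency (Hz)
lam = c / fc                      # wavelength (m)
prf, n_pulses = 100.0, 4096       # pulse repetition frequency, pulses integrated
t = np.arange(n_pulses) / prf     # slow time (s)

# Modeled range history r(t): drift + acceleration + rotational modulation
r_t = 4.0e8 + 1.5e3 * t + 0.8 * t**2 + 0.05 * np.sin(2 * np.pi * 0.5 * t)

# Slow-time phase history of the scatterer (unit amplitude plus complex noise)
echo = np.exp(-1j * 4 * np.pi * r_t / lam)
echo += 0.5 * (np.random.randn(n_pulses) + 1j * np.random.randn(n_pulses))

def integrate(signal):
    """Coherent integration via FFT; return peak power in dB."""
    spec = np.fft.fftshift(np.fft.fft(signal))
    return 10 * np.log10(np.max(np.abs(spec)) ** 2)

# Without compensation the energy smears across Doppler bins; compensating with
# the modeled r(t) refocuses it into a single bin.
uncompensated = integrate(echo)
compensated = integrate(echo * np.exp(+1j * 4 * np.pi * r_t / lam))
print(f"peak power: {uncompensated:.1f} dB (raw) vs {compensated:.1f} dB (compensated)")
```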
      关键词:cislunar space;radar imaging;orbital constraint;electromagnetic scattering model;intelligent cognition   
    • For automatic multi-target tracking in cislunar space, the authors propose an efficient DeepSORT-based tracking method that markedly improves tracking accuracy and real-time performance, offering a solution to multi-target tracking in complex scenes.
      Wang Lei, Zhang Xiaoming, Zhi Hui, Zhang Yindi, He Mengqiu, Jiang Xiaojun
      Vol. 30, Issue 9, Pages: 3026-3038(2025) DOI: 10.11834/jig.250138
      Study of high-efficiency tracking algorithm for multiple targets in cislunar space
      摘要:ObjectiveWith the resurgence of international lunar exploration in recent years, the Earth-Moon space, as a key strategic area connecting the Earth and deep space, has once again become a global focus. The density of space activities in this region has increased considerably, and real-time monitoring of Earth-Moon space targets has become urgently needed. The widespread application of wide-field telescopes has greatly expanded the coverage of optical detection, enabling the monitoring of broader areas of the sky and thereby improving monitoring efficiency remarkably. However, this technological advancement has also brought new challenges. Specifically, within a single field of view, multiple targets with different speeds and directions now exist, which complicates the tracking process. Moreover, issues such as mutual occlusion between targets, complex target trajectories, and the difficulty of detecting faint targets further increase the complexity of multi-target tracking. Against this background, breaking through the key technical problems of tracking the Earth-Moon space targets is necessary. In particular, improving the accuracy and robustness of multi-target tracking in the face of complex target trajectories, mutual occlusion between targets, and the difficulty of detecting faint targets is important. This study aims to develop an efficient and accurate multi-target tracking method for Earth-Moon space to meet monitoring needs in complex dynamic scenarios. Overcoming the key technical challenges of multi-target tracking in the Earth-Moon space will not only provide security guarantees for space missions but also offer strong technical support for China’s strategic layout in the Earth-Moon space, establishing a solid foundation for future deep-space exploration missions.MethodTo address the many challenges of multi-target tracking in Earth-Moon space, such as significant differences in target speeds, intermittent disappearance of faint targets, and crossing target trajectories, we propose a target tracking method based on an improved DeepSORT algorithm. This method can quickly and stably track multiple moving space targets in images. The first step of the algorithm is the use of a space target detection method based on image feature extraction. This step can quickly determine the positions of space targets in each frame and set bounding boxes according to the targets’ shapes. Subsequently, the algorithm uses a Kalman filter to model and predict the motion trajectories of the targets. On this basis, the algorithm integrates the extracted feature data of the space targets as their appearance features and calculates a cost matrix on the basis of the motion characteristics of their centroids. Finally, with the use of the Hungarian algorithm, associations are established between targets in the current frame and those in the previous frame to minimize the association cost. This process helps identify target identities and track their trajectories. Through this series of carefully designed steps, our method not only improves the tracking accuracy but also ensures the real-time performance and stability of the entire system, providing a reliable technical means for the effective monitoring of Earth-Moon space targets.ResultThe experiments were divided into two stages——simulation and field measurement——to verify the accuracy and effectiveness of the proposed multi-target tracking method. 
In the simulation stage, we generated simulated observation images of space targets in a 16-bit format. To make the simulated images more realistic, we created them by adding simulated targets to the actual telescope images of tracked space targets. The real background images were derived from the actual astronomical FITS images obtained by the 1-meter-aperture telescope in Xinjiang. The simulated images contained a total of 1 697 moving targets. Test results showed that the multiple object tracking accuracy (MOTA) of this tracking method reached 94.96%, and the IDF1 score reached 96.2%, which fully demonstrated the high efficiency and accuracy of this method in complex dynamic scenarios. In the field measurement stage, we selected 183 frames of images captured by the Nanshan One-meter Wide-field Telescope of Xinjiang Astronomical Observatory in target tracking mode as the test images. The field measurement results indicated that the MOTA of the targets could reach 90.9%, and the IDF1 score could reach 89.66%. The actual measurement data showed that the tracking accuracy was 91.5%, the IDF1 score was 91.7%, and the average processing time for 4 k × 4 k images was only 0.53 s per frame. Notably, this method demonstrated excellent robustness in complex scenarios such as the temporary disappearance of targets and crossing trajectories, outperforming traditional algorithms remarkably.ConclusionThe improved DeepSORT algorithm has made significant progress in the field of multi-target tracking in Earth-Moon space, successfully overcoming the key technical challenges in this area. By employing meticulously designed adaptive detection boxes and a deeply integrated feature extraction strategy, the algorithm not only significantly enhances tracking accuracy but also ensures the system’s real-time processing capabilities. Experimental results fully demonstrated the effectiveness of this method: When confronted with complex scenarios in wide-field observations, such as the intricate motion states of multiple targets, crossing trajectories, and the occasional non-detection or intermittent disappearance of faint targets, this method still manages to ensure the accurate tracking of multiple targets. The proposed multi-target tracking method for diverse motion space targets is capable of not only tracking space targets in an axis system but also of simultaneously tracking multiple moving targets within an image. Moreover, it has a relatively short runtime, which greatly improves monitoring efficiency. This achievement provides a reliable technical solution for Earth-Moon space target monitoring and offers crucial technical support for space situational awareness in future deep-space exploration missions. It has broad application prospects and significant strategic importance.  
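      As a rough illustration of the association step described above (not the authors' code), the sketch below blends a centroid-distance motion cost with an appearance cost and solves the assignment with the Hungarian algorithm via SciPy; the gating radius, weights, and feature dimensions are arbitrary assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_centroids, track_feats, det_centroids, det_feats,
              w_motion=0.5, max_cost=0.7):
    """Toy DeepSORT-style association: blend centroid distance with
    appearance (cosine) distance, then solve the assignment problem."""
    # Motion cost: distance between predicted and detected centroids,
    # normalized by an assumed gating radius of 50 pixels.
    d = np.linalg.norm(track_centroids[:, None, :] - det_centroids[None, :, :], axis=-1)
    motion_cost = np.clip(d / 50.0, 0.0, 1.0)

    # Appearance cost: 1 - cosine similarity of L2-normalized feature vectors.
    tf = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    df = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    app_cost = 1.0 - tf @ df.T

    cost = w_motion * motion_cost + (1.0 - w_motion) * app_cost
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    # Reject matches whose blended cost exceeds the gate.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]

# Example: 3 existing tracks vs. 3 new detections with random toy features.
rng = np.random.default_rng(0)
matches = associate(rng.uniform(0, 4096, (3, 2)), rng.normal(size=(3, 128)),
                    rng.uniform(0, 4096, (3, 2)), rng.normal(size=(3, 128)))
print(matches)
```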
      关键词:cislunar;image processing;multi-object tracking;space object;deep simple online and realtime tracking(DeepSORT)   
    • Intelligent temporal behavior detection for lunar space station

      In the aerospace domain, the authors propose an efficient temporal action detection framework that offers a lightweight, high-accuracy solution for monitoring astronaut activities, supporting the safety and efficiency of space missions.
      Miao Shilin, Tang Yepeng, Zhang Chunjie, Liu Chunkai, Zhang Jitao, Cao Jianfeng
      Vol. 30, Issue 9, Pages: 3039-3049(2025) DOI: 10.11834/jig.250135
      Intelligent temporal behavior detection for lunar space station
      摘要:ObjectiveThe rapid advancement of space exploration has intensified the demands for efficient and precise monitoring of astronaut activities during training and missions. With the rise of commercial spaceflight and lunar exploration, ensuring astronaut safety and operational efficiency is paramount. Video understanding technologies, particularly temporal action detection (TAD), play a critical role in evaluating astronaut performance, detecting anomalies, and optimizing workflows in resource-constrained environments such as spacecraft bases. TAD aims to localize and classify action instances within untrimmed videos, such as identifying moments when an astronaut deploys a tool or responds to an alert. However, conventional TAD methods face challenges in computational efficiency owing to their reliance on dense optical flow, a motion representation that calculates pixel-level displacements between consecutive frames. This process is prohibitively expensive, rendering it impractical for real-time space applications where hardware resources are limited and latency must be minimized. This study aims to develop a lightweight yet effective TAD framework. The primary objective is to bridge the gap between accuracy and deployability, enabling TAD systems to operate efficiently in scenarios such as onboard spacecraft systems, lunar habitat monitoring, or astronaut training simulations.MethodTo address the computational inefficiency of conventional TAD methods while preserving their accuracy for space applications, we propose a knowledge distillation framework that leverages sparse optical flow inputs within a teacher-student paradigm. This approach is designed to mitigate the performance degradation typically caused by reduced motion cues in sparse representations. The teacher model, which serves as a high-accuracy reference, processes dense optical flow features computed at frame intervals of 1 using established TAD architectures. By contrast, the student model operates on sparse optical flow features extracted at frame intervals of 4, significantly reducing computational overhead. To bridge the representational gap between sparse and dense motion features, our framework integrates a dual-level distillation strategy: feature-level alignment and response-level guidance. For feature alignment, a lightweight U-Net-style network is employed to refine sparse optical flow features, ensuring they approximate the spatial and temporal patterns of their dense counterparts. This alignment enables the student model to capture critical motion dynamics, such as astronaut movements or equipment interactions, even with sparsely sampled inputs. Simultaneously, the student model is trained to mimic the teacher’s predictions through pseudolabels, which provide supervisory signals for action classification and temporal boundary regression. The framework’s architecture-agnostic design allows seamless integration with diverse TAD methodologies, including anchor-based approaches that rely on predefined temporal segments, boundary-based methods that predict start-end pairs, and query-based detectors inspired by transformer architectures. During inference, only the lightweight student model is deployed, requiring solely RGB frames and sparsely sampled optical flow, thereby eliminating the need for resource-intensive dense flow computations. 
This design reduces optical flow extraction time by 75%, making it feasible for deployment in resource-constrained environments such as onboard spacecraft systems or lunar habitats. By combining feature-level distillation to preserve motion semantics and response-level supervision to ensure prediction fidelity, our framework achieves a balance between efficiency and accuracy, addressing the unique challenges of real-time astronaut monitoring in space missions.ResultOur framework achieves significant advancements in computational efficiency and detection accuracy by transferring motion insights from dense to sparse optical flow, making it viable for deployment in limited-resource environments. Experimental evaluations on the THUMOS14 dataset demonstrate that the proposed method reduces optical flow extraction time from 132.1 h to 32.5 h while maintaining competitive detection performance. By utilizing sparse optical flow sampled at frame intervals of 4, the model achieves an average mAP of 61.6% across intersection over union thresholds [0.3∶0.7], outperforming baseline methods (60.6%) that rely solely on RGB and sparse flow without distillation. We conduct a detailed error analysis and achieve ​​comprehensive improvements across all three aspects​​: false positive analysis, sensitivity analysis, and false negative analysis.ConclusionThis paper presents a computationally efficient TAD framework tailored for space exploration applications, where real-time processing and hardware constraints are paramount. By integrating knowledge distillation with sparse optical flow, we achieve a 4 times reduction in optical flow extraction time and 50% faster end-to-end processing while maintaining competitive accuracy. The framework’s success in reducing false positives and improving sensitivity to short actions underscores its potential for enhancing safety and operational efficiency in space missions. Future research directions will focus on advancing video understanding technologies tailored for space environments. Key areas include analyzing motion patterns under microgravity conditions, in which kinematic dynamics differ fundamentally from Earth-based movements, and addressing challenges in irregular illumination scenarios prevalent in spacecraft or lunar habitats. By addressing the critical trade-off between accuracy and deployability, this research advances video understanding capabilities for the next era of space exploration.  
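      The dual-level distillation objective can be sketched as follows; this is a toy rendering under assumed tensor shapes and loss weights, with a 1x1 convolution standing in for the U-Net-style alignment network, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat, student_logits, teacher_logits,
                      align_net, w_feat=1.0, w_resp=1.0, temperature=2.0):
    """Toy dual-level distillation: align refined sparse-flow features with
    dense-flow teacher features, and mimic the teacher's action predictions."""
    # Feature-level: a small refinement net maps sparse-flow features toward
    # the dense-flow representation; supervise with an L2 (MSE) penalty.
    refined = align_net(student_feat)
    feat_loss = F.mse_loss(refined, teacher_feat.detach())

    # Response-level: soften both distributions and match them with KL divergence.
    t = temperature
    resp_loss = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                         F.softmax(teacher_logits.detach() / t, dim=-1),
                         reduction="batchmean") * (t * t)
    return w_feat * feat_loss + w_resp * resp_loss

# Example with random tensors; a 1x1 convolution stands in for the alignment net.
align = torch.nn.Conv1d(256, 256, kernel_size=1)
s_feat, t_feat = torch.randn(2, 256, 100), torch.randn(2, 256, 100)
s_logit, t_logit = torch.randn(2, 100, 20), torch.randn(2, 100, 20)
print(distillation_loss(s_feat, t_feat, s_logit, t_logit, align))
```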
      关键词:action detection;temporal action localization(TAD);knowledge distillation;video understanding;space mission safety monitoring   
    • For lunar scientific exploration, the authors propose MSRT, a multi-scale super-resolution model that effectively restores fine-scale landforms in lunar NAC imagery, providing high-quality data to support geological analysis and landing-site assessment.
      Miao Dingruibo, Yan Jianguo, Tu Zhigang, Jean-Pierre Barriot
      Vol. 30, Issue 9, Pages: 3050-3065(2025) DOI: 10.11834/jig.250126
      Adaptive multi-scale super-resolution reconstruction for lunar images captured by a narrow-angle camera
      摘要:ObjectiveThe lunar south polar region is a prime focus for international scientific inquiry and future exploration missions due to its unique geological setting and the potential for significant volatile resources, notably water ice within permanently shadowed regions (PSRs). High-resolution imagery is crucial for characterizing these terrains, enabling safe landing site selection, effective rover operations, resource prospecting, and the advancement of our understanding of lunar geology. The narrow-angle camera (NAC) aboard the lunar reconnaissance orbiter (LRO) provides vital high-resolution (up to 0.5 m/pixel) data. However, the south polar region’s extreme illumination conditions——low sun angles and extensive shadowing——result in a poor signal-to-noise ratio (SNR) for NAC images. To compensate for this situation, the NAC often employs a pixel summation mode, especially in these polar areas, which averages or sums pixel values (e.g., in 2 × 2 to 4 × 4 blocks) to enhance SNR. However, although this process improves image brightness, it also severely degrades the spatial resolution, reducing a 0.5 m/pixel image to 1~2 m/pixel or coarser. This loss of detail critically impairs the identification of fine-scale geomorphological features (small craters, rock distributions, subtle slopes) that are essential for scientific interpretation (e.g., crater dating, regolith studies, volatile detection) and mission engineering (e.g., hazard avoidance for landers, rover trafficability assessment, ISRU site evaluation). This research directly addresses this critical data deficiency by aiming to significantly enhance the spatial resolution of lunar south polar NAC images degraded by pixel summation mode. The core objective is to develop an advanced super-resolution technique to effectively restore fine geomorphological details, thereby acquiring high-quality data to support lunar scientific research and upcoming exploration endeavors in this challenging but scientifically rich domain.MethodThis study introduces a novel deep learning framework called the multi-scale super-resolution transformer (MSRT), alongside a purpose-built dataset to tackle the multi-scale degradation problem in lunar NAC imagery. First, a large-scale, multi-scale lunar NAC image super-resolution dataset was constructed. This process involved selecting high-quality nominal-resolution (approximately 0.5 m/pixel) LRO NAC images from diverse lunar terrains. The NAC’s summation mode was emulated by downsampling these pristine images by factors of 2×, 3×, and 4× using average pooling, closely mimicking the pixel aggregation process. A controlled level of Gaussian noise, based on real NAC image characteristics, was added to simulate sensor noise. This process yielded an extensive dataset of paired low-resolution and high-resolution ground truth images, which are crucial for training and evaluating super-resolution models specifically designed for this type of degradation. The dataset’s unique focus on multiple, simulated summation mode degradation levels provides a robust platform for developing adaptive super-resolution solutions. Second, the MSRT model was proposed, strategically integrating the local feature extraction capabilities of convolutional neural networks (CNNs) with the global context modeling strengths of transformer architectures (specifically Swin transformer). MSRT’s architecture begins with an initial CNN layer for shallow feature extraction. 
These features are then processed by a shared backbone of cascaded residual Swin transformer blocks, which learns powerful, generalizable deep feature representations common to all upscaling factors. The core innovation lies in its multi-scale reconstruction stage, where multiple parallel, independent, and learnable scale-specific spatial-aware upsampling (S3U) modules are employed. Each S3U branch, dedicated to a specific upscaling factor (2×, 3×, or 4×), takes the shared deep features (combined with shallow features via skip connections) and uses scale-specific convolutional layers along with sub-pixel convolution to reconstruct the high-resolution image for its designated scale. This multi-branch reuse of deep features with unique terminal-scale adaptive upsampling structure enables a single MSRT model to efficiently handle various degradation levels, enhancing adaptability and reconstruction accuracy while managing model complexity, and is trained end-to-end using an L1 pixel loss function.ResultThe MSRT model’s efficacy was rigorously assessed through comprehensive experiments on the custom-built multi-scale lunar NAC dataset. Quantitative comparisons were made against several leading super-resolution methods (blind super-resolution generative adversarial network(BSRGAN), enhanced super-resolution generative adversarial network(ESRGAN), super-resolution residual network(SRResNet), deep plug-and-play super-resolution(DPSR), information multi-distillation network(IMDN)) using peak SNR (PSNR) and structural similarity index measure (SSIM) as key image quality metrics. Across all tested upscaling factors (2×, 3×, and 4×), MSRT demonstrated statistically significant and visually apparent outperformance over all baseline methods. In the particularly challenging 4× super-resolution task (recovering 0.5 m/pixel details from 2 m/pixel input), MSRT achieved a PSNR of 28.73 dB and an SSIM of 0.923 on the test set, substantially exceeding the performance of competing approaches (e.g., an approximately 3.0 dB PSNR gain over BSRGAN). Consistent superiority was also observed for the 2× and 3× tasks, highlighting MSRT’s robustness across varying degradation intensities. Detailed ablation studies were conducted to validate the contributions of MSRT’s key architectural components, such as the Swin transformer backbone depth, network width, and the S3U modules. These studies confirmed the effectiveness of the design choices and elucidated the impact of different parameters on performance across scales. Notably, the multi-scale training strategy for MSRT (simultaneously training all upsampling branches) proved effective, yielding results comparable to or better than those obtained by training separate single-scale MSRT instances, thus validating the efficiency of the shared-backbone multi-branch design. Qualitative assessments based on visual inspection of the reconstructed images further underscored MSRT’s advantages. The model effectively suppressed artifacts and noise commonly seen in super-resolved images while successfully restoring fine-scale lunar surface details that are crucial for geological interpretation. These details included the clear delineation of small impact crater morphologies, rock size and distribution patterns, subtle regolith textures, and other minute topographical features that were indiscernible in the degraded input images. 
Successful application to real-world LRO NAC images not seen during training also demonstrated the MSRT’s strong generalization capabilities.ConclusionThis research successfully developed and validated the MSRT model, a novel transformer-based deep learning approach for adaptive multi-scale super-resolution of lunar NAC imagery degraded by the summation mode, which is particularly prevalent in the challenging illumination conditions of the lunar south pole. The MSRT effectively addresses this multi-scale degradation, thereby considerably improving the reconstruction quality of fine lunar geomorphological features. The MSRT model’s innovative architecture, which combines a shared Swin transformer backbone for powerful cross-scale feature extraction with parallel, scale-specific S3U modules for tailored upsampling and reconstruction, forms its core technical contribution. This design enables a single model to efficiently and accurately handle multiple degradation levels, offering a practical and high-performing solution. The high-quality, high-resolution lunar surface data reconstructed by MSRT provides crucial support for detailed geological analyses (e.g., crater studies, regolith characterization), enhances the precision of landing site safety assessments, aids in safer and more efficient rover traverse planning, and facilitates more informed evaluations for potential in situ resource utilization. While MSRT shows significant promise, future work could explore more complex degradation models and the integration of physical or geometric priors. The architectural principles of the MSRT, particularly its strategy of using a shared backbone and specialized branches, offer valuable insights and a strong foundation for tackling similar multi-scale super-resolution challenges in other remote sensing applications, such as imagery from other planetary bodies or Earth observation systems. The MSRT thus represents a notable advancement in computational remote sensing and image restoration.  
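      The shared-backbone, multi-branch reconstruction idea can be illustrated with a compact PyTorch module; the plain convolutional backbone below is only a stand-in for the Swin transformer blocks and S3U modules described in the abstract, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleSRHead(nn.Module):
    """Toy shared-backbone / multi-branch SR head: one set of deep features,
    one sub-pixel upsampling branch per scale factor (2x, 3x, 4x)."""
    def __init__(self, channels=64, scales=(2, 3, 4)):
        super().__init__()
        # Stand-in for the shared deep feature extractor.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        # One scale-specific reconstruction branch per upscaling factor.
        self.branches = nn.ModuleDict({
            str(s): nn.Sequential(
                nn.Conv2d(channels, channels * s * s, 3, padding=1),
                nn.PixelShuffle(s),                     # sub-pixel convolution
                nn.Conv2d(channels, 1, 3, padding=1))
            for s in scales})

    def forward(self, lr_image, scale):
        feats = self.backbone(lr_image)                 # shared deep features
        return self.branches[str(scale)](feats)         # scale-specific output

# A single-channel 64 x 64 "summation-mode" patch upsampled by each factor.
model = MultiScaleSRHead()
x = torch.randn(1, 1, 64, 64)
for s in (2, 3, 4):
    print(s, model(x, s).shape)   # (1,1,128,128), (1,1,192,192), (1,1,256,256)
```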
      关键词:super-resolution reconstruction;lunar narrow-angle camera;Transformer;multi-scale;deep learning   

      Visual-Text Multimodal Large Language Models

    • TextLLM: a document multimodal large model based on dynamic resolution

      A recent research advance proposes TextLLM, a dynamic-resolution document multimodal large model that processes high-resolution document images without OCR tools and markedly improves document understanding performance.
      Yang Biao, Liu Yuliang, Liu Qiang, Zhu Yingying
      Vol. 30, Issue 9, Pages: 3068-3082(2025) DOI: 10.11834/jig.240608
      TextLLM: a document multimodal large model based on dynamic resolution
      摘要:ObjectiveThe advancement of document intelligence aims to realize the intelligent processing and interpretation of various document information, such as the processing of structured documents containing tables, forms, and invoices, among others, and text in natural scenes. Traditional deep learning methods excel at certain tasks, but because they are often optimized for a single task, they struggle to adapt to increasingly complex needs and diverse scenarios. When processing document text, optical character recognition (OCR) is usually required to extract text information. However, the processing speed and accuracy decrease. Moreover, the various steps involved in handling the reading of the pipeline text can lead to the accumulation of errors. In addition, relying on ready-made OCR models/APIs introduces additional engineering complexity, limits connections between text and its surrounding context, and can increase computational costs. OCR-free solutions have attracted increasing attention recently to mitigate the disadvantages of external systems before they are understood. With the rise of multimodal large models, the field of document intelligence is undergoing a revolution. These models integrate textual and visual information to enable a more comprehensive and accurate understanding of document content, potentially eliminating reliance on OCR tools. However, these multimodal large models still face challenges, especially when it comes to processing high-resolution document images and dealing with the ever-growing number of visual markers. Previous similar working resolutions are limited by the input size of the encoder, and seeing specific small text is generally difficult. Reducing the image resolution will also greatly increase the number of tokens of the image, which is challenging for memory and space consumption. A large multimodal model based on dynamic resolution is proposed to overcome these challenges. This model is designed to handle high-resolution document images, eliminating the need for OCR tools while being flexible enough to accommodate ever-increasing visual tokens.MethodIn this study, we improve the latest multimodal large model and train a large document model on the basis of dynamic resolution, called TextLLM. This process mainly involves dynamic adjustment of the image, block processing, visual coding, feature compression, and screening to extract and retain useful information to the maximum extent. First, the area closest to the predefined scaling size is determined on the basis of the original image size for subsequent processing, including scaling processing, image slicing, and global information acquisition. Next, the image blocks are fed into the visual encoder for encoding processing, including window attention mechanisms to recognize and fuse information, and extract local regional visual features and global visual features. Then, the image resampling module is used to process the segmented features, and the compressed visual features are obtained on the basis of the attention mechanism. The compression rate is set dynamically, and discrete compression rates and a learnable parameter are defined to learn the compression rate dynamically. Gumbel softmax is used for end-to-end training to learn the compression rate. According to the compression ratio, the number of features after compression is obtained. 
The most important features are selected and sorted by calculating the similarity and importance, and these important features are used for further aggregation. Finally, the attention mechanism of the large language model is used to capture the visual features related to the prompt words, the most relevant features are selected according to the attention distribution map of the prompt words, and the surrounding relevant features are retained. The integrated application of these steps helps in effectively processing and utilizing image information and improving the efficiency and accuracy of information extraction.ResultExperiments across multiple datasets compared TextLLM with the latest six methods, demonstrating considerably performance improvements in various document understanding benchmarks. It outperforms existing models on datasets such as DocVQA, WTQ, ChartQA, and TextVQA, achieving scores of 82.4, 37.6, 70.8, and 65.3, respectively. Furthermore, on the comprehensive OCRBench evaluation dataset, the model achieves a score of 601, proving its adaptability and overall strength in a variety of text-related tasks. Comparative experiments are performed across multiple datasets to verify the effectiveness of the algorithm, and results confirm that the proposed dynamic processing algorithm improved the model’s performance. Several mainstream scenario images are selected for visualization in this paper to further demonstrate the model's capabilities in text-related tasks across various scenarios. The model is able to accurately recognize text in scene images and documents and answer questions on the basis of its understanding, showing its strong text processing capabilities and adaptability. Introducing dynamic feature compression ensures that the model can autonomously learn the compression rate and sample different compression rates during the learning process, which to some extent serves as data augmentation. Furthermore, incorporating visual feature selection enables the model to focus more on text-related features. Reducing features further allowed it to achieve improvements across all datasets.ConclusionIn this study, we introduce an innovative document large model called TextLLM, which, on the basis of dynamic resolution, combines dynamic compression feature and dynamic selection algorithm. Extensive experimental validation shows that our model not only outperforms several state-of-the-art document large models but also achieves significant improvements in efficiency and accuracy.  
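      The dynamic-resolution preprocessing step can be sketched roughly as follows: pick the predefined tiling grid whose aspect ratio best matches the input, resize, slice into encoder-sized tiles, and keep a downscaled global view for page-level context. The tile size and candidate grids below are illustrative assumptions, not the model's actual configuration.

```python
from PIL import Image

TILE = 448  # assumed encoder input size; candidate (cols, rows) grids
GRIDS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1), (2, 3), (3, 2)]

def dynamic_tiles(image: Image.Image):
    """Toy dynamic-resolution preprocessing: choose the tiling grid whose aspect
    ratio best matches the document image, then return local tiles + a global view."""
    w, h = image.size
    # Grid whose aspect ratio is closest to the image's aspect ratio.
    cols, rows = min(GRIDS, key=lambda g: abs((g[0] / g[1]) - (w / h)))
    resized = image.resize((cols * TILE, rows * TILE))
    tiles = [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    # A single low-resolution view preserves page-level layout context.
    global_view = image.resize((TILE, TILE))
    return tiles, global_view

tiles, overview = dynamic_tiles(Image.new("RGB", (1700, 2200), "white"))
print(len(tiles), tiles[0].size, overview.size)   # 6 (448, 448) (448, 448)
```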
      关键词:multi-modality;large document model;document intelligence;dynamic resolution;dynamic compression ratio   
    • Recent research advances the generation of visual question-answering data for electronic documents, markedly improving the document-reading performance of multimodal large language models.
      Li Yuzhe, Fu Ling, Zhu Linghao, Luo Qidi, Tu Lai
      Vol. 30, Issue 9, Pages: 3083-3096(2025) DOI: 10.11834/jig.240610
      Multimodal large model-based method for generating visual Q&A data for electronic document images
      摘要:ObjectiveRecent advancements in multimodal large language models (MLLMs) have revolutionized the field of visual question answering (VQA), especially in the domain of text-centric document understanding. These models have demonstrated remarkable improvements in tasks that involve the integration of visual and textual information, with several state-of-the-art models currently leading the field. One critical task in this area is the generation of VQA datasets for electronic documents, which entails combining the textual content embedded within document images with their visual components to generate meaningful and contextually relevant questions and corresponding answers. The integration of high-quality, fine-tuned datasets—specifically designed for multimodal instruction-following tasks—has greatly enhanced the document comprehension capabilities of MLLM. However, existing VQA datasets, which are typically generated using manual annotation techniques or templating methods, face significant challenges in scale and quality. These limitations impede the scalability and overall effectiveness of training datasets for multimodal models. Therefore, this paper proposes an innovative method to automatically generate image-based VQA datasets for electronic documents by utilizing an MLLM. The goal of this work is to address the existing gaps in dataset quality and quantity, thereby facilitating better training and fine-tuning of MLLM in the context of document-based visual question answering tasks.MethodThe proposed methodology involves the use of a large-scale data generation framework, powered by an MLLM. This framework is divided into four distinct stages: self-question generation, quantity and format verification, data filtering, and consistency validation. In the initial stage, the MLLM is tasked with generating multiple question-answer (Q&A) pairs by processing the input electronic document images alongside their corresponding textual descriptions. This stage capitalizes on the model’s ability to simultaneously analyze both the visual and textual elements of the document, enabling it to generate a diverse array of questions that cover various aspects of the content, such as factual inquiries, inferential reasoning, and contextual understanding. The second stage focuses on ensuring that the generated Q&A pairs meet the required quantity and adhere to the correct formatting standards. This stage is critical for eliminating any inconsistencies, errors, or discrepancies in the formatting of the data, which could otherwise compromise the quality of the final dataset. In the third stage, data filtering is employed to refine the dataset by eliminating irrelevant or incorrect Q&A pairs. This process involves evaluating the generated Q&A pairs, along with their corresponding images and instructions, to identify and discard any irrelevant or improperly answered pairs. This step ensures that the dataset contains only high-quality questions that require multimodal reasoning capabilities for accurate responses. The final stage involves consistency validation, wherein the MLLM is used to generate multiple variations of the same Q&A pair. The objective of this stage is to verify that the answers remain consistent across different rephrasings of the same question. If inconsistencies in the answers are identified, then those pairs are discarded. 
This step not only ensures the reliability and accuracy of the dataset but also helps improve the robustness of the dataset by introducing diverse question formulations. By systematically applying these four stages, the proposed method enables the generation of a large-scale, high-quality VQA dataset for electronic documents, which can then be leveraged in fine-tune MLLMs to enhance their performance in document understanding tasks.ResultIn this study, a high-quality dataset was constructed, consisting of 324 546 images and 2 036 263 corresponding Q&A pairs. The overall correctness rate of 91.34% was achieved through random sampling of a sufficiently large number of images and their associated Q&A pairs, followed by manual verification of the selected samples. The effect of this dataset on improving the performance of MLLMs in document-based question answering tasks, such as DocVQA, was rigorously evaluated. Fine-tuning experiments on the LLaVA-OV and Deepseek-VL models demonstrated improvements of 1.4% and 2.6%, respectively, in average normalized Levenshtein similarity on DocVQA. In addition, ablation studies were conducted to assess the effectiveness of the data filtering process. These studies revealed that correctness filtering, relevance filtering, and external knowledge filtering each contributed to the enhancement of the performance of MLLMs when applied to the generated dataset. Interestingly, relevance filtering and external knowledge filtering did not conflict with one another. By contrast, applying both filtering methods simultaneously resulted in better model performance than when either one was applied alone. Furthermore, the entire data filtering process resulted in a 1.3% performance improvement for the Deepseek-VL model on the DocVQA dataset. Complementarity experiments with the DocVQA dataset showed a 1.3% performance gain when the model was fine-tuned on both the DocVQA dataset and a subset of the visual Q&A dataset. This finding demonstrated the ability of the generated dataset to complement manually labeled data and showcased the effectiveness of the synthetic data generation process. The superiority of the proposed dataset generation method was validated further through comparisons between the model performance achieved using the ALLaVA and TG-Doc datasets and the model performance obtained using the generated data from the proposed method. Specifically, 1 million instruction samples were randomly selected from each of these datasets for full fine-tuning of the LLaVA-OV model. Experimental results indicated that the generated data from the proposed method led to the most significant improvement in model performance. Finally, while the proposed dataset resulted in some improvement in model performance, the overall gain was somewhat limited. A more in-depth analysis revealed that redundant characters in the generated answers——such as unnecessary phrasing——contributed to a degradation in performance. This issue was addressed by conducting a post-processing experiment using Qwen2.5-14B. By removing redundant content from the model’s outputs, the post-processing technique enhanced performance considerably, indicating that further refinement in the dataset generation process could yield even better results.ConclusionThe proposed method for MLLM-driven generation of image-based VQA data for electronic documents effectively addresses the challenges of limited dataset size and poor data quality. 
The comprehensive evaluation of the dataset’s effect on model performance, along with the successful implementation of data filtering and post-processing strategies, demonstrates the potential of this approach to improve the robustness and accuracy of multimodal models in document-based visual question answering tasks. Future work could further refine this process to eliminate redundant content and optimize the generated dataset for even better performance.  
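      The four-stage generation pipeline can be outlined as a small driver around any multimodal LLM client; `ask_mllm`, the prompts, and the filtering heuristics below are placeholders for illustration, not the authors' prompts or code.

```python
import json

def generate_vqa(image_path, doc_text, ask_mllm, n_pairs=5):
    """Toy four-stage VQA data pipeline. `ask_mllm(image_path, prompt)` is a
    placeholder for any multimodal LLM client that returns a string."""
    # Stage 1: self-question generation from the image plus extracted text.
    raw = ask_mllm(image_path,
                   f"Read this document (text: {doc_text}). "
                   f"Return {n_pairs} question-answer pairs as a JSON list "
                   f'of {{"question": ..., "answer": ...}} objects.')
    # Stage 2: quantity and format verification.
    try:
        pairs = json.loads(raw)
    except json.JSONDecodeError:
        return []
    pairs = [p for p in pairs if isinstance(p, dict) and "question" in p and "answer" in p]
    if len(pairs) < n_pairs:
        return []
    # Stage 3: relevance/correctness filtering via a yes-or-no judgment.
    pairs = [p for p in pairs
             if "yes" in ask_mllm(image_path,
                 f"Can '{p['question']}' be answered from this document alone, "
                 f"and is '{p['answer']}' correct? Reply yes or no.").lower()]
    # Stage 4: consistency validation; keep a pair only if re-asking a rephrased
    # question yields an answer that agrees with the original one.
    kept = []
    for p in pairs:
        retry = ask_mllm(image_path, f"Answer briefly: {p['question']}")
        if p["answer"].strip().lower() in retry.strip().lower():
            kept.append(p)
    return kept
```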
      关键词:multimodal large model;electronic document image;visual instruction tuning dataset;visual perception and understanding;visual text   
    • For unsupervised graph representation learning, researchers propose the LAST-MGCL model, which combines local-global graph augmentation with collaborative modeling across multiple neural networks to improve node representation quality, providing an effective solution for unsupervised graph representation learning.
      Jiang Xuchu, Zhang Xiaowen
      Vol. 30, Issue 9, Pages: 3097-3110(2025) DOI: 10.11834/jig.240612
      Multidimensional graph contrastive learning based on graph enhancement and multi-neural networks
      摘要:ObjectiveGraph representation learning has been widely applied across various domains, including social networks, bioinformatics, and recommendation systems, due to its ability to effectively capture and encode structural and relational information within graph-structured data. Among existing approaches, unsupervised graph contrastive learning (GCL) has gained significant attention as it enables high-quality node representations without relying on extensive labeled data, making it particularly suitable for real-world applications where labeled annotations are costly and scarce. However, despite their advantages, current GCL methods suffer from several inherent limitations. Most existing graph contrastive learning techniques rely on single-perspective augmentation strategies, such as randomly removing edges or nodes, which can only capture a limited range of structural variations within a graph. However, graphs often exhibit intricate, multi-level dependencies that a single augmentation approach cannot adequately represent. For instance, a graph may contain subgraphs with varying node connectivity patterns or may encode higher-order relationships that are overlooked by such simple augmentations. As a result, relying solely on one perspective reduces the richness and diversity of learned node representations, limiting the model’s ability to generalize across different graph structures. Moreover, conventional contrastive learning frameworks often employ coarse-grained contrastive mechanisms, comparing entire subgraphs or large sets of nodes at a high level of abstraction. While this process can work in certain contexts, it fails to capture finer-grained distinctions in local node structures and semantic attributes, leading to suboptimal node embeddings. These limitations hinder the model’s ability to learn discriminative representations, thereby affecting its effectiveness in tasks such as node classification, clustering, and link prediction.MethodTo address the above issues, this paper proposes a method of local and SVD-based global augmentation with triple network for multi-dimensional graph comparative learning (LAST-MGCL). First, LAST-MGCL incorporates a local-global graph augmentation strategy, which combines two complementary techniques: a conditional variational autoencoder (CVAE) for local enhancement and a singular value decomposition (SVD)-based module for global enhancement. The CVAE is designed to enrich feature representations within local neighborhoods, enabling the model to better capture the intricate relationships between nodes that have limited neighborhood connectivity. This ability is particularly beneficial for sparse graphs, where local structure is often underrepresented. By generating richer local feature representations, the CVAE ensures that the model can more effectively process nodes with few neighbors, ultimately improving the overall quality of graph embeddings. The SVD-based module focuses on global structural patterns, leveraging SVD to capture the topological essence of the graph at a broader scale. This global enhancement ensures that key topological features are preserved, facilitating the model’s ability to generalize across different graph types. By combining these local and global enhancement techniques, LAST-MGCL creates a multi-granularity augmentation strategy that provides diverse views of the graph, enriching the learning process and improving the expressiveness of various graph neural networks. 
Second, LAST-MGCL adopts a triple encoding network architecture, which leverages the power of multi-head attention to process both the original and augmented graph data. In this architecture, the graph data are passed through three sub-networks, each guided by a multi-head attention mechanism that enables the model to focus on different aspects of the graph’s structure. The multi-head attention mechanism is designed to capture diverse, multi-scale dependencies across the graph, making it particularly effective at integrating information from various views of the graph. Through cross-network information exchange, the sub-networks collaborate, strengthening the model’s ability to integrate representations from different graph perspectives. This cross-network collaboration enhances multi-view fusion, which is critical for improving the robustness and stability of the learned graph embeddings. Ensuring that information is effectively shared between sub-networks enables the model to integrate complementary information from both original and augmented graph views, thus improving the overall representation quality. Third, LAST-MGCL introduces an innovative multi-dimensional contrastive learning optimization framework to further refine the learning process. This novel framework integrates multiple contrastive learning objectives to optimize graph representation learning across various dimensions. The contrastive loss is designed to combine cross-network contrastive learning, which aligns representations between the original and augmented graphs, with cross-view contrastive learning that enhances generalization across different augmented perspectives. Neighbor contrastive learning is incorporated to maintain local semantic coherence by focusing on the relationships between neighboring nodes within the graph. These objectives work together to reinforce the structural consistency and semantic alignment of graph representations at multiple granularities, ensuring that both local and global dependencies are effectively captured. By applying this multi-dimensional contrastive framework, LAST-MGCL addresses critical challenges such as the underutilization of contrastive information in traditional methods, the reliance on negative samples, and the difficulty of aligning representations across different graph views.ResultIn node classification tasks, the LAST-MGCL model demonstrates strong performance across several benchmark datasets, achieving classification accuracies of 83.1% on Cora, 72.6% on Citeseer, and 81.8% on PubMed. These results indicate that LAST-MGCL consistently outperforms state-of-the-art contrastive learning methods, offering superior classification accuracy and robustness. In addition, the node embeddings generated by LAST-MGCL exhibit more compact intra-cluster cohesion and clearer inter-cluster boundaries, highlighting the model’s ability to effectively capture and distinguish graph structures. Ablation experiments were conducted to assess the contribution of each model component. The results revealed that removing key components significantly degraded performance. For example, removing multi-dimensional contrastive learning resulted in the most significant performance drop, with a reduction of 9.7%. These findings underscore the importance of the combined local-global graph augmentation approach, which captures both local and global graph information, and the role of multi-dimensional contrastive learning in enhancing node interactions and clustering. 
Furthermore, a hyperparameter sensitivity analysis was performed to optimize model performance. Finally, t-SNE visualizations comparing LAST-MGCL to the best baseline models show that LAST-MGCL excels in node clustering and maintains clear class boundaries across all datasets, further validating its superior performance in representation learning.ConclusionIn summary, LAST-MGCL enhances the existing graph contrastive learning framework by integrating local-global graph augmentation, multi-view representation learning, and multidimensional contrastive optimization. Specifically designed for unsupervised graph learning, it provides an effective solution for learning high-quality node representations from unlabeled graph data.  
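      The SVD-based global augmentation can be illustrated in a few lines: reconstruct the adjacency matrix from its leading singular components so that dominant global topology is kept while local noise is smoothed. The rank and the toy graph below are arbitrary assumptions, not the paper's settings.

```python
import numpy as np

def svd_global_augment(adj: np.ndarray, rank: int = 5) -> np.ndarray:
    """Toy SVD-based global augmentation: keep only the top-`rank` singular
    components of the adjacency matrix, preserving dominant global structure
    while smoothing away local noise."""
    u, s, vt = np.linalg.svd(adj, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

# Example: a random undirected graph on 50 nodes, reconstructed at rank 5.
rng = np.random.default_rng(0)
a = (rng.random((50, 50)) < 0.1).astype(float)
a = np.triu(a, 1); a = a + a.T                    # symmetrize, no self-loops
a_aug = svd_global_augment(a, rank=5)
print(a_aug.shape, np.linalg.matrix_rank(a_aug))  # (50, 50) 5
```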
      关键词:graph representation learning;multiple twin network;multi-dimensional contrastive learning;local-global graph enhancement;graph neural network(GNN);large language model(LLM)   

      Review

    • Review of state space models in medical image processing

      Recent studies show that the SSM-based Mamba model has achieved breakthroughs in medical image processing, with performance poised to surpass the Transformer, offering new directions for AI-driven medical image analysis.
      Zou Maoyang, Wu Yulan, Gao Lin, Wang Zhongwei, Chen Ran
      Vol. 30, Issue 9, Pages: 3111-3126(2025) DOI: 10.11834/jig.240566
      Review of state space models in medical image processing
      摘要:The state space model (SSM) has excelled in computational efficiency over long sequences. With the introduction of the SSM-based Mamba model in 2024, the SSM has become the new high-profile AI architecture. It combines the strengths of CNN and Transformer technologies to efficiently capture local information and remote dependencies while maintaining linear complexity in computation. It exhibits great potential for processing high-resolution data or long-term sequential data. To fully understand the research and application of state space models in the field of medical image processing, we conduct a comprehensive survey to sort out the development lineage, key models, and application scenarios of SSM models in the field of medical image processing. First, we summarize the history of state space models, which describes the evolution of state space models from SSM to HiPPO, S4, Mamba, and Mamba-2. Focusing on SSM and Mamba, we describe the basics of SSM and enumerate important improved models of SSM. We then elaborate on the selectivity mechanism and hardware-aware state extension of Mamba. The 11 improved models of Mamba with their characteristics are summarized in Table 1. Second, in the field of medical image processing, the SSM model represented by Mamba has been used for tasks such as image segmentation, classification, alignment, fusion, and reconstruction. In addition to breakthroughs in disease prediction, medical image synthesis, radiotherapy dose prediction, the model has achieved excellent results in graphic segmentation. We explore the improvement and application of the SSM model to each of the medical graphics processing tasks. Medical image segmentation is suitable for the application of SSM model because the segmentation task corresponds to the characteristics of long sequences. Currently, nearly 40 papers on medical image segmentation based on SSM have been published. Moreover, only one model called Vivim performs video segmentation, while the rest of the models perform image segmentation. The vast majority of the models use the U-Net architecture, which has achieved remarkable success in various medical image segmentation tasks. The exploration of SSM-based medical image segmentation is elaborated. The technical features of each model, the ROI studied, and the image modalities involved are analyzed and presented. Medical image classification is an important task in the field of medical image processing and analysis. However, SSM has been rarely applied because the image classification task is not compatible with long sequences or autoregressive properties. Thus, only five main models have been developed. In medical image registration and fusion, multimodal medical image registration has always been a major challenge in image registration. Research on medical image registration and fusion based on SSM is limited, with only two models existing models. However, they overcome the challenge of multimodality. Two Mamba-based models have achieved good results in medical image fusion. The task of medical image reconstruction needs to deal with long sequences, which makes it highly suitable for the application of SSM models. At present, researchers have proposed five SSM-based models for medical image reconstruction. One SSM-based model each has been proposed by researchers in the fields of medical image synthesis, disease prediction, radiotherapy dose prediction, and surgical stage identification. 
We have analyzed the technical characteristics of all the models for each of the abovementioned tasks individually and described why they achieved good experimental results.Finally, we conclude with a discussion of the challenges facing state space modeling, which consist of five main areas. The first is the suboptimal performance of the SSM model for visual tasks with non-causal attributes. For data with non-causal attributes, the inherent limitations in the receptive field can be mutually compensated for by using bidirectional scanning and cross-scanning to extend the scanning direction for capturing the spatial information inherent in 2D or high-dimensional medical visual data. Despite these adaptations, the task of handling non-causal attributes remains challenging. The second is that the scanning scheme needs to be further optimized. Most medical image processing tasks use multi-dimensional data, and new scanning schemes need to be designed to enhance the feature learning of SSM. These improvements maximize the potential of high-dimensional non-causal visual data and boost the accuracy of SSM-based models. The third is the need to further improve the generalization and robustness of the model. Hidden states can accumulate or even amplify domain-specific information, which can affect the generalization performance of the model. The fourth is to improve the efficient fine-tuning of the underlying model. When SSM is developed as an infrastructure, pretraining models on various datasets such as ImageNet requires research into parameter-efficient fine-tuning techniques. Optimizing parameter-efficient fine-tuning in SSM-based models to minimize adjustments and reduce requirements for a large amount of computational resources is a highly demanded yet unexplored research area. The fifth is that the adaptability to specific medical image tasks needs to be improved. The reason is that medical image datasets are usually small in size and limited in diversity, which make the model prone to overfitting. The Mamba model may need to be further optimized in terms of efficient learning using limited data to reduce the reliance on large amounts of annotated data.The research discussed in this paper and its open source implementation are available on GitHub at https://github.com/wyl32123/ssm-medical-paper/tree/main.  
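      For readers new to SSMs, the core recurrence behind these models is the discretized linear system h_k = A_bar h_{k-1} + B_bar x_k, y_k = C h_k. The sketch below applies zero-order-hold discretization to a toy 1-D system; the dimensions and parameters are arbitrary and unrelated to any specific Mamba implementation.

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, dt):
    """Zero-order hold: A_bar = exp(dt*A), B_bar = A^-1 (A_bar - I) B."""
    A_bar = expm(dt * A)
    B_bar = np.linalg.solve(A, (A_bar - np.eye(A.shape[0])) @ B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Linear state-space recurrence over a 1-D input sequence x."""
    h = np.zeros(A_bar.shape[0])
    y = np.empty_like(x)
    for k, xk in enumerate(x):
        h = A_bar @ h + B_bar[:, 0] * xk      # state update
        y[k] = C @ h                          # readout
    return y

# Toy 4-state SSM filtering a noisy step signal.
rng = np.random.default_rng(0)
A = -np.diag([1.0, 2.0, 4.0, 8.0])            # stable diagonal dynamics
B = np.ones((4, 1)); C = np.full(4, 0.25)
A_bar, B_bar = discretize_zoh(A, B, dt=0.1)
x = (np.arange(200) > 50).astype(float) + 0.1 * rng.standard_normal(200)
print(ssm_scan(A_bar, B_bar, C, x)[:5])
```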
      关键词:state space model(SSM);Mamba;medical image segmentation;medical image classification;medical image registration and fusion;medical image reconstruction   

      Image Processing and Coding

    • For salient object detection in light field images, the researchers propose a detection network that fuses stage-wise differential features, effectively improving the accuracy of salient object detection.
      Jiang Wenhui, Zhu Chang, Fang Yuming, Cai Chao, Yan Jiebin
      Vol. 30, Issue 9, Pages: 3127-3140(2025) DOI: 10.11834/jig.240617
      Stage-wise diff-enhanced features for light field image salient object detection
      摘要:ObjectiveSalient object detection is a crucial computer vision technique aimed at automatically identifying the most attention-grabbing regions in images or videos, mimicking the attention mechanism of the human visual system. This technology is widely used in applications such as object detection, image and video compression, image retrieval, human-computer interaction, and advertising design, with the goal of improving processing efficiency, optimizing resource usage, and enhancing user experience. Light field images, known for capturing scene details at different depths, are widely employed to segment visually distinctive objects from their surroundings. Most current research focuses on the combination of focal stack and all-in-focus images. Focal stack images capture specific areas along the depth dimension while keeping other parts defocused; all-in-focus images synthesize the in-focus areas from these focal stack images to depict the entire scene. In focal stack image sequences, objects are only clearly imaged in the focal stack at certain depth levels, whereas other depth levels remain blurred. The large number of defocused and blurred areas affects the performance of salient object detection in light field images considerable. Reducing the effect of blurred information on effective object feature learning during the fusion of focal stack signals is critical. Traditional methods for processing focal stacks employ convolutional long short-term memory (ConvLSTM) to integrate features, but ConvLSTM overlooks the unique structural characteristics of light field images, which deteriorates detection accuracy due to blurred image areas. Existing methods typically use individual backbone networks to extract features from focal stack and all-in-focus images and only consider feature interactions during the decoding stage. These methods lack effective feature fusion during the encoding process, resulting in limited feature utilization. To address these issues, this paper proposes a stage-wise diff-enhanced network for light field image salient object detection, aimed at improving salient object detection in light field images.MethodTo enhance the performance of light field image salient object detection, we employ a transformer-based backbone to extract multi-scale, high-quality image features. In focal stack images, notable differences often appear near the in-focus regions of consecutive depth images, which provide planar localization information for various targets, aiding in pixel-level localization of salient objects. Inspired by this, we design a focal stack spatial awareness module based on multi-stage self-differential features. This module leverages the differences between focal stack feature maps to distinguish in-focus regions from defocused regions, mitigating the effect of blurriness in defocused areas on feature fusion and improving the effectiveness of focal stack feature encoding. However, due to influences from neighboring depth focal stacks, significant difference regions in focal stack maps often exceed the actual focus range of the current depth. To further accurately distinguish focused and defocused areas at different depths and enhance the precision of in-focus regions in each depth plane of focal stack images, thereby improving pixel-level localization accuracy, we propose an explicit multimodal stage-wise fusion method. 
This method achieves efficient fusion between focal stack and all-in-focus features through stage-wise interactions and fusion of differences between modalities. The fusion module complements the focal stack spatial awareness module, improving overall performance. Moreover, to increase feature utilization, we introduce both the focal stack spatial awareness module and the multimodal stage fusion module in the encoder stage, enabling early interaction of features and mitigating the issue of low feature utilization. Finally, we design a salient map decoder module that predicts salient objects from focal stack and all-in-focus features using separate decoders and then merges the results from both decoders to generate the final salient map, ensuring efficient feature fusion from both branches. Result: Experiments were conducted on the widely used Dalian University of Technology Light Field Focal Stack (DUTLF-FS), Hefei University of Technology Lytro (HFUT-Lytro), and Lytro Illum datasets, and the results were compared with those of 11 state-of-the-art methods. For quantitative analysis, we compared different models on the basis of mean absolute error (MAE), S-measure, max F-measure, and max E-measure. On the DUTLF-FS dataset, compared with the second-best-performing FESNet and without introducing additional depth map clues, our method achieved the same level of MAE (1.8%) and S-measure (95.1%) while outperforming FESNet in max F-measure (96.1%) and max E-measure (97.3%). On the HFUT-Lytro dataset, compared with the second-best-performing FESNet, our method reduced the MAE (5.4%) by 12.9% and performed better in max F-measure (82.1%), max E-measure (88.6%), and S-measure (83.9%). On the Lytro Illum dataset, compared with the second-best-performing LFTransNet, our method reduced the MAE (2.8%) by 22.2% and performed better in max F-measure (90.4%), max E-measure (95.1%), and S-measure (91.5%). In terms of visual analysis, our model exhibited strong robustness in complex scenarios, such as transparent objects, similar foreground and background, low-light conditions, very large objects, multiple salient objects, and multiple distracting objects, providing accurate and reliable salient object detection results under various challenging conditions. In terms of complexity, our model has a moderate number of parameters and floating-point operations while demonstrating efficient inference speed. Furthermore, ablation experiments on the HFUT-Lytro dataset verified the effectiveness of the designed modules and proposed methods; compared with the baseline model, our method reduced the MAE by 22.2%. Conclusion: The proposed salient object detection method effectively enhances salient region features in complex scenes and suppresses background regions, thereby improving detection accuracy.
      Keywords: salient object detection; light field image; stage-wise fusion; diff-enhanced features; multi-modal feature fusion
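      The self-difference mechanism described in the Method section above can be sketched in a few lines of PyTorch. The module name, tensor shapes, and sigmoid gating below are illustrative assumptions rather than the authors' exact design: adjacent focal-slice features are differenced, and the difference magnitude is converted into a spatial weight that emphasizes in-focus regions before fusion.

import torch
import torch.nn as nn

class FocalStackDiffAwareness(nn.Module):
    # Hypothetical sketch: re-weight focal-stack features by the magnitude of
    # differences between adjacent focal slices, so in-focus regions (where
    # consecutive slices differ most) are emphasized before fusion.
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, C, H, W) features of N focal slices at one encoder stage
        diffs = (feats[:, 1:] - feats[:, :-1]).abs()          # (B, N-1, C, H, W)
        diffs = torch.cat([diffs, diffs[:, -1:]], dim=1)      # pad back to N slices
        b, n, c, h, w = diffs.shape
        attn = torch.sigmoid(self.proj(diffs.flatten(0, 1)))  # (B*N, 1, H, W) weights
        attn = attn.view(b, n, 1, h, w)
        return feats + feats * attn                           # residual re-weighting

# Example with assumed shapes: 12 focal slices, 64-channel stage features.
# out = FocalStackDiffAwareness(64)(torch.randn(2, 12, 64, 32, 32))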

      Image Understanding and Computer Vision

    • Arbitrary image style transfer based on swapping attention mechanism

      In the field of image style transfer, the researchers propose an efficient method that requires no additional fine-tuning, handles a wide range of visual content and artistic styles, and produces high-fidelity, style-consistent transfer results.
      Zhang Yuxin, Dong Weiming, Xu Changsheng
      Vol. 30, Issue 9, Pages: 3141-3152(2025) DOI: 10.11834/jig.240652
      Arbitrary image style transfer based on swapping attention mechanism
      Abstract: Objective: Generative methods based on diffusion models have garnered widespread attention due to their diverse generation outcomes and high-quality effects. Diffusion models have a distinct advantage over traditional style transfer frameworks in terms of the diversity, quality, and realism of generated images and are effective in capturing complex data distributions. With their robust data modeling capabilities, diffusion models are particularly suitable for multimodal generation tasks. However, applying text-to-image diffusion models to style transfer tasks still faces several challenges. First, the artistic styles involved in style transfer often possess highly visual and artistic characteristics that are difficult to express precisely through language. Although text-guided stylization techniques can generate artistic images from natural language prompts, the textual prompts for the target style are often limited to a rough description of materials, artistic genres, artists, or famous artworks. Detailed auxiliary textual input is usually required to guide the generation process toward reproducing vivid content and style. Even so, the generated results may still fail to fully capture the creativity and conception of a specific painting. Researchers in the field of image generation have explored the use of reference images as visual cues to enhance the quality and diversity of generated images. These methods mainly rely on fine-tuning diffusion models or text embedding techniques, which are often accompanied by high fine-tuning costs, including the need for large amounts of data and computational resources. Method: The core of image style transfer lies in two aspects: transferring the artistic style of the image while maintaining the content information of the original image. We designed a novel three-branch parallel generation scheme, comprising the new image generation path, the content reference path, and the style reference path, with the goal of integrating style and content features into the image generation process. This feature fusion is achieved through an interactive attention module. Specifically, the process begins with three generation paths: one starts from text prompts and initial noise, naturally forming an image that matches the text prompts, while the other two paths are directed toward an artistic image and a content image, respectively. To generate an image that integrates the reference style features, we exploit the hierarchical generation characteristics of the diffusion model, which produces structural information in the deep layers of the network and texture, color, and other information in the shallow layers, corresponding to the content and style information of the task. On the basis of this characteristic, in the deep self-attention modules of the diffusion model's main network, the key and value features in the original path are exchanged with the corresponding features in the content path. Subsequently, in the shallow self-attention modules, the key and value features in the original path are exchanged with the corresponding features in the style path. In this way, the attention mechanism uses the similarity between the original query features and the reference key features to reconstruct the reference value features as weighted combinations. Tone consistency is a key factor in evaluating the effect of style transfer.
Although methods based on interactive attention perform well in transferring the brushstrokes and textures of artistic images, they often have difficulty aligning the overall tone of the content image with the style image. The image generated by a diffusion model depends not only on the generation network and text conditions but is also significantly influenced by the initial noise. Previous studies have pointed out that the color tone of the generated image is closely related to the sampling of the initial noise. On the basis of this finding, we address the tone consistency issue between the generated image and the style image by adjusting the initial noise: we propose a new method that improves tone consistency by aligning the initial noise of the generated image with the distribution of the style image. Result: We conducted a comparative analysis with five state-of-the-art traditional style transfer methods, namely, ArtFlow, AdaAttN, StyTr2, CAST, and AesPA-Net. For the test dataset, we utilized a collection of 100 artistic images from WikiArt and 100 real-world images from the Places365 dataset, randomly sampled from a dataset exceeding 100 000 images to ensure a fair comparison. Regarding implementation details, we employed Stable Diffusion XL (SDXL) with its default hyperparameters throughout all experiments. Our approach improved style accuracy by 4%, achieving a state-of-the-art style transfer effect. All baselines were evaluated using publicly available implementations and default configurations. Furthermore, we compared our method with two advanced diffusion model-based style transfer methods, namely, InST and Z*. Our method adeptly transferred the tonality and brushstrokes of the style image while achieving the best content preservation. Conclusion: Qualitative and quantitative comparisons with other arbitrary style transfer methods show that our method generates higher-quality and more accurate style transfer results. Future work will focus on further optimizing the algorithm to enhance the model's adaptability to various styles and content, as well as exploring more innovative application scenarios. We anticipate that this method will play a greater role in fields such as artistic creation and multimedia design, providing users with a richer and more personalized visual experience.
      Keywords: image generation; generative model; diffusion model; style transfer; attention mechanism
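      The key/value swapping described in the Method section above amounts to a small change inside standard multi-head attention. The function below is a minimal sketch under assumed tensor shapes, not the paper's implementation: queries stay on the generation path while keys and values are taken from a reference path, so deep blocks can borrow content-path features and shallow blocks can borrow style-path features.

import torch
import torch.nn.functional as F

def swapped_attention(q_gen: torch.Tensor, k_ref: torch.Tensor,
                      v_ref: torch.Tensor, num_heads: int = 8) -> torch.Tensor:
    # q_gen: (B, N, C) tokens from the generation path
    # k_ref, v_ref: (B, M, C) tokens from a reference (content or style) path
    b, n, c = q_gen.shape
    d = c // num_heads
    def split(x):  # (B, T, C) -> (B, heads, T, d)
        return x.view(b, -1, num_heads, d).transpose(1, 2)
    q, k, v = split(q_gen), split(k_ref), split(v_ref)
    out = F.scaled_dot_product_attention(q, k, v)   # weights from q-to-reference-k similarity
    return out.transpose(1, 2).reshape(b, n, c)     # reference values re-assembled per query

# Assumed usage: deep self-attention blocks swap in content-path K/V,
# shallow blocks swap in style-path K/V.
# feats = swapped_attention(q_generation, k_style, v_style)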

      Remote Sensing Image Processing

    • Recent research tackles the challenge of domain adaptation for remote sensing imagery, proposing a language-guided global pre-training and local fine-tuning framework that significantly improves cross-spatiotemporal transfer performance.
      Tao Chao, Guo Xin, Hu Keyan, Shen Yuxiang, Wang Hao
      Vol. 30, Issue 9, Pages: 3153-3170(2025) DOI: 10.11834/jig.240640
      Language-guided cross-spatiotemporal domain adaptation for remote sensing image semantic segmentation
      Abstract: Objective: With the development of large-scale visual models, pre-training on multi-source unlabeled remote sensing images to learn global visual features and fine-tuning for target tasks has become a new paradigm for domain adaptation in remote sensing image semantic segmentation. Across spatiotemporal domains, joint distribution shifts, comprising feature and label distribution shifts, occur due to variations in lighting, weather, phenology, natural landscapes, and human-made environments. This spatiotemporal heterogeneity complicates the accurate assessment of domain relevance, challenging the applicability of most "local-local" domain adaptation methods. In contrast, "global-local" learning strategies, which extract general visual features from a broad spectrum of unlabeled data, enhance the relevance of knowledge across domains. However, current global pre-training approaches primarily focus on low-level feature learning, which limits the ability to capture complex, high-level semantic relationships. Furthermore, during the fine-tuning phase with limited annotated samples, these samples often reflect only specific scenarios within the target domain and are insufficient to fully activate the relevant knowledge within the global model. Consequently, a major semantic gap persists between the globally trained models and the actual task requirements. This challenge manifests in two aspects: 1) a mismatch between the global pre-training objectives and the requirements of the target semantic segmentation task, as pre-trained features focused on low-level information may not align well with the need for deep semantic associations, thus limiting the effectiveness of model transfer; and 2) insufficient learning of target-specific semantic features during local fine-tuning due to the limited representativeness of the few annotated samples, which may fail to encompass the full range of variability within the target domain. To address these issues, this paper proposes a language-guided "global pre-training-local fine-tuning" framework for domain adaptation that overcomes the challenges associated with cross-spatiotemporal domain shifts in remote sensing images. Method: The proposed framework addresses the spatiotemporal heterogeneity of remote sensing data by leveraging a large-scale visual-language assistant, the large language and vision assistant (LLaVA), to generate textual descriptions of remote sensing images that include information on season, geographical area, and distribution of ground objects. Rich in semantic and contextual information, these language texts, when combined with visual features, enable the model to better understand the deep semantic associations across different remote sensing images. For the generative pre-training strategy of the global model, the complex contextual information in long texts aids in the reconstruction and generation of detailed image content. For the discriminative pre-training strategy, clear and concise short texts are beneficial for contrastive learning optimization. Therefore, this paper proposes a method for generating long and short textual descriptions of remote sensing images, tailored to the different pre-training strategies of the global model.
Within the global pre-training-local fine-tuning domain adaptation framework, the language text not only guides the global model in capturing and understanding the spatiotemporal distribution patterns within remote sensing images but also facilitates the rapid activation of associated domain knowledge in the local model: 1) during the global pre-training phase, textual descriptions that include information about the season, geographic region, and distribution of ground objects guide the model in learning associations between visual features and semantic information, thereby capturing and understanding the spatiotemporal patterns within the imagery; 2) during the local fine-tuning phase, similar textual descriptions assist in rapidly activating relevant domain knowledge embedded within the global model. Result: Three sets of global-local cross-spatiotemporal domain adaptation experiments for semantic segmentation were conducted, comparing discriminative, masked generative, and diffusion generative pre-training strategies to validate the effectiveness of the proposed framework. Taking global-local (Changsha, CS) as an example, language text guidance yielded performance improvements of 8.7%, 4.4%, and 2.9% over the no-guidance setting across the three pre-training strategies, with similar gains observed for global-local (Xingtan, XT) and global-local (Wuhan, WH). Compared with traditional local-local learning methods and global-local learning methods without text guidance, the proposed framework significantly enhances the model's transfer performance. Conclusion: This paper pioneers the exploration and validation of the positive role of language text in mitigating spatiotemporal domain shifts in remote sensing imagery, introducing a language-guided global pre-training-local fine-tuning framework for domain adaptation. This framework uses textual descriptions of remote sensing images to facilitate the global model's learning of the spatiotemporal distribution patterns of ground objects during pre-training and enhances the activation of relevant domain knowledge during local fine-tuning. Three multi-source, cross-spatiotemporal semantic segmentation experiments demonstrate that the proposed framework significantly improves model transfer performance compared with traditional local-local domain adaptation methods and global-local methods without text guidance. Future research will focus on two main directions: 1) investigating the impact of language text on model transfer performance over larger spatiotemporal scales. While this study conducted exploratory experiments using remote sensing data sampled across four seasons in the Hunan and Hubei regions, future work will extend the spatial coverage and temporal span to assess the feasibility of applying the proposed framework at national and even global scales; 2) developing more refined textual description methods for remote sensing images, potentially incorporating meteorological data (e.g., temperature and precipitation) and topographic information to enrich the descriptions.
      Keywords: remote sensing image; semantic segmentation; domain adaptation; visual-language model; spatiotemporal heterogeneity
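      The discriminative (contrastive) pre-training strategy mentioned in the Method section can be illustrated with a generic image-text objective. The sketch below is an assumed InfoNCE-style loss pairing image embeddings with embeddings of their generated short captions (season, region, land cover); it is a stand-in for the idea, not the paper's exact training objective.

import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # img_emb, txt_emb: (B, D) embeddings of B matched image/short-caption pairs
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)        # caption -> matching image
    return 0.5 * (loss_i2t + loss_t2i)

# Assumed usage with hypothetical encoders:
# loss = image_text_contrastive_loss(vision_encoder(images), text_encoder(short_captions))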