Abstract:
Objective: Research on computer-assisted instruction in physics education remains at a relatively exploratory and developmental stage, particularly in the area of dynamics problem solving, where persistent and multifaceted challenges continue to restrict large-scale pedagogical effectiveness. These challenges include the incomplete and fragmented representation of physical states, insufficient formalization of object interactions, opaque and non-explainable reasoning processes, limited adaptability to complex multi-body systems, and weak alignment between computational outputs and instructional logic. Although intelligent tutoring systems, digital simulations, and adaptive learning technologies have achieved considerable technological advancement, many existing solutions rely either on purely numerical computation frameworks or on rigid, template-driven procedural scripts. Such approaches often fail to establish a coherent structural bridge between low-level simulation data and high-level conceptual reasoning, thereby limiting semantic consistency, interpretability, and cross-problem generalization. Consequently, learners frequently receive algorithmically generated answers without adequate exposure to intermediate inferential steps, causal dependencies among variables, constraint propagation mechanisms, or the theoretical principles governing the solution trajectory. This lack of transparency not only reduces instructional clarity but also weakens students’ ability to construct transferable mental models of physical systems.
In response to these theoretical and methodological gaps, the present study seeks to develop a unified, extensible, and semantically grounded framework that enhances representational completeness, reasoning explicability, structural consistency, cognitive alignment, and domain generalizability in computer-assisted physics education through the systematic integration of structured scene modeling, knowledge-graph-based inference, and interactive visual analytics.
Method: To achieve this objective, this study proposes a hybrid reasoning architecture that integrates Physics Scene Graphs with a domain-specific knowledge graph to support intelligent physical process analysis, automated model selection, interpretable equation derivation, and structured problem solving. The framework begins by constructing a formally defined Physics Scene Graph derived from simulation data generated by a physics engine. Within this graph-based representation, physical entities are modeled as nodes enriched with multidimensional attributes, including mass, position, displacement, velocity, acceleration, net force, constraint conditions, and state-transition parameters. Mechanical interactions among entities, such as contact relationships, frictional forces, normal forces, tension links, collision events, composite bindings, and constraint hierarchies, are encoded as semantically annotated edges, forming a relational topology that captures both spatial configuration and dynamic dependency. This explicit structural formalization enables comprehensive modeling of instantaneous states, temporal evolution, constraint propagation, and multi-object coupling in a machine-interpretable and computationally robust manner.
Building upon this structured representation layer, a knowledge-graph-driven rule-based reasoning mechanism is introduced to dynamically identify, activate, and instantiate appropriate physical models according to contextual features of the problem scenario.
The knowledge graph encodes fundamental physical laws, mathematical expressions, symbolic transformation rules, boundary conditions, dimensional constraints, decomposition strategies, and pedagogical heuristics within a hierarchically organized semantic network. Through semantic matching, relational mapping, subgraph alignment, and rule triggering, entities and relations in the Physics Scene Graph are systematically mapped to corresponding theoretical constructs within the knowledge graph. This bidirectional mapping enables automated derivation of governing equations, symbolic manipulation of expressions, validation of physical consistency, incremental state updating, and stepwise numerical computation. Importantly, the reasoning pipeline is designed to be modular, traceable, and explainable, allowing explicit reconstruction of inference chains and transparent visualization of intermediate logical transitions. To operationalize the proposed architecture, an interactive physics teaching system was implemented, integrating automatic scene parsing, structured reasoning orchestration, real-time state synchronization, multimodal rendering modules, and user-interactive control mechanisms to facilitate both computational automation and pedagogical interpretability.
Result: Comprehensive experimental case analyses, benchmark evaluations, and user-centered empirical studies demonstrate that the proposed system is capable of automatically and reliably solving a broad spectrum of dynamics problems involving multi-object interactions, frictional constraints, composite systems, conditional equilibrium states, and dynamic state transitions. The generated analytical solutions exhibit high consistency with standard textbook methodologies and conventional classroom derivations, including systematic free-body diagram construction, equation decomposition, and sequential symbolic manipulation.
In addition to producing accurate numerical results, the system dynamically generates synchronized two-dimensional visualizations that depict object configurations, force vectors, motion trajectories, constraint relationships, and state-evolution sequences. These visual representations are temporally aligned with symbolic derivations, formula transformations, and computational procedures, forming a coherent multimodal explanatory environment. User evaluation results indicate that the integration of structured graphical representations with explicit reasoning chains substantially enhances conceptual clarity, cognitive engagement, and problem-solving confidence. Students reported improved understanding of force decomposition strategies, system boundary selection, inter-object dependency modeling, and the logical progression from physical principles to quantitative solutions. Moreover, the modular architecture of both the Physics Scene Graph and the knowledge graph demonstrates strong extensibility, scalability, and adaptability, allowing systematic expansion to additional domains such as kinematics, energy conservation analysis, rotational dynamics, oscillatory systems, and electromagnetism through incremental enrichment of semantic attributes and domain-specific inference rules.
Conclusion: This study presents a comprehensive, semantically coherent, and pedagogically interpretable framework that integrates structured scene representation, knowledge-graph-driven automated reasoning, and interactive multimodal visualization to advance the theoretical foundation and practical implementation of computer-assisted physics instruction. By systematically addressing limitations related to representational incompleteness, reasoning opacity, procedural rigidity, and limited cross-domain transferability, the proposed approach enhances both computational intelligence and educational effectiveness.
The synergistic coupling of simulation-based state modeling with explicit conceptual inference establishes a transparent, explainable, and cognitively aligned problem-solving paradigm that supports deep conceptual comprehension, structured analytical thinking, and meaningful knowledge transfer. Beyond the specific context of dynamics, the architectural principles and methodological innovations underlying this framework provide a transferable reference model for other branches of physics as well as highly abstract disciplines in computer-assisted education, including advanced mathematics, engineering mechanics, and computational modeling. Overall, this work contributes a theoretically grounded, technically robust, and pedagogically valuable blueprint for the design of next-generation intelligent educational systems that harmonize structured representation, automated reasoning, semantic transparency, and interactive visualization to foster more rigorous, immersive, and engaging learning experiences.
Keywords: computer-assisted instruction; scene graph; knowledge graph; physics problem solving; physical process analysis
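The scene-graph-plus-rule idea described in this abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration: the class names (`Body`, `Interaction`, `SceneGraph`), the single hard-coded rule in `solve_linked_system`, and the frictionless two-block example are hypothetical and are not taken from the authors' system, which derives its rules from a knowledge graph.

```python
from dataclasses import dataclass, field


@dataclass
class Body:
    """Scene-graph node: one physical entity with its state attributes."""
    name: str
    mass: float                 # kg
    acceleration: float = 0.0   # m/s^2, written back by the solver


@dataclass
class Interaction:
    """Scene-graph edge: a labelled mechanical relation between two bodies."""
    kind: str                   # e.g. "tension", "friction", "contact"
    source: str
    target: str


@dataclass
class SceneGraph:
    bodies: dict = field(default_factory=dict)
    interactions: list = field(default_factory=list)

    def add_body(self, body):
        self.bodies[body.name] = body

    def relate(self, kind, source, target):
        self.interactions.append(Interaction(kind, source, target))


def solve_linked_system(graph, pulled, force):
    """Hypothetical rule: bodies joined by 'tension' edges move as one system,
    so Newton's second law is applied to their total mass (no friction)."""
    linked = {pulled}
    changed = True
    while changed:  # grow the set of bodies reachable through tension edges
        changed = False
        for edge in graph.interactions:
            if edge.kind == "tension" and (edge.source in linked) != (edge.target in linked):
                linked |= {edge.source, edge.target}
                changed = True
    total_mass = sum(graph.bodies[name].mass for name in linked)
    a = force / total_mass
    for name in linked:
        graph.bodies[name].acceleration = a
    # Keep the intermediate steps so the reasoning chain can be shown to a learner.
    steps = [
        f"System {sorted(linked)} has total mass {total_mass} kg",
        f"Newton's second law: a = F / m = {force} / {total_mass} = {a} m/s^2",
    ]
    return a, steps


graph = SceneGraph()
graph.add_body(Body("A", mass=2.0))
graph.add_body(Body("B", mass=3.0))
graph.relate("tension", "A", "B")   # a string connects the two blocks
a, steps = solve_linked_system(graph, pulled="B", force=10.0)
for line in steps:
    print(line)
```

Returning the derivation steps alongside the number mirrors the abstract's emphasis on traceable inference chains; a fuller system would presumably select the rule by matching edge types against the knowledge graph rather than hard-coding it.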
Cai Weinan, Wang Zongji, Zhang Yuanben, Yin Yuhao, Liu Junyi
DOI:10.11834/jig.260160
Abstract:
Objective: In the contemporary landscape of geospatial engineering and computer vision, modern industrial workflows dedicated to Unmanned Aerial Vehicle (UAV) aerial surveying and high-fidelity 3D scene reconstruction have become foundational. Rapidly acquiring sparse 3D point clouds of target areas, typically through Structure-from-Motion (SfM) photogrammetry or airborne Light Detection and Ranging (LiDAR), has been established as a standard procedure. However, practical data collection is frequently hindered by severe physical and regulatory bottlenecks. Restricted by limited UAV battery endurance, strict airspace control regulations, or rigid predefined flight routes, the actual optical imagery collected during field operations can rarely achieve the dense, omnidirectional overlapping coverage required by ideal reconstruction algorithms. Consequently, researchers are often forced to work with highly sparse visual observations. When attempting to perform large-baseline view extrapolation or traditional graphics rendering utilizing these discrete and occluded 3D priors, existing methods face severe challenges. Traditional rendering pipelines often result in severe structural distortions, projection holes, and rendering artifacts due to the lack of dense multi-view continuity. On the other hand, the recent emergence of purely data-driven 2D generative diffusion models (such as Stable Diffusion) offers unprecedented capabilities in high-frequency texture hallucination. Yet, these models fundamentally lack explicit 3D physical constraints. During large-baseline view extrapolation, they easily violate epipolar geometric constraints, leading to severe perspective distortion, physical scale collapse, and the topological misalignment of critical ground objects. To address these critical limitations, this research proposes a novel generative view extrapolation method fused with sparse 3D priors.
The primary objective is to utilize discrete physical geometric skeletons to constrain and guide continuous pixel generation. By bridging the gap between graphical rendering and generative modeling, this research aims to enable high-fidelity, geometrically consistent, and controllable novel view extrapolation under extreme, sparse UAV observation conditions.
Method: To achieve the aforementioned objectives, the proposed framework explicitly injects multi-source physical spatial constraints into a large-scale Latent Diffusion Model (SDXL), fundamentally transforming it from a pure 2D image generator into a 3D-aware extrapolation engine. The methodology is meticulously structured into three core stages to ensure that geometric consistency and semantic richness are simultaneously preserved. First, the framework executes the Spatial Alignment of 3D Priors. Recognizing that raw point clouds are discrete and challenging for 2D networks to interpret, we project the sparse point cloud onto the target camera view using the corresponding extrinsic matrices. This projection yields an absolute depth map. By fusing this depth information with pixel-level coordinate maps, we construct a comprehensive Geometric Spatial Embedding (GSE) map. Furthermore, to provide the generative network with an optimal starting initialization, the source reference image undergoes a pre-alignment process via rigid pose transformation, mapping the source pixels to the approximate target perspective based on available geometric data. Second, we propose a Dual-Branch Forward Generation Pipeline with Decoupled Semantics and Geometry. To prevent the network from confusing structural boundaries with texture colors, the architecture handles these modalities independently. An IP-Adapter (Image Prompt Adapter) is employed as the semantic branch to extract global appearance characteristics (such as lighting conditions, material textures, and environmental atmosphere) directly from the source image.
Simultaneously, a ControlNet architecture serves as the geometric branch, meticulously injecting the constructed GSE features into the denoising steps. This decoupled dual-branch design forces the network to render continuous and realistic high-frequency textures (guided by the semantic branch) while strictly anchoring them to precise spatial topologies (constrained by the geometric branch). Third, the framework introduces 3D Perspective Supervision via Latent-Space Reprojection. Traditional pixel-space losses often fail to capture deep structural semantic errors during the diffusion process. Therefore, our method computes the geometric constraints directly within the latent space. At each denoising timestep, the model estimates the noise-free latent features. Using the known camera intrinsic and extrinsic matrices, these latent features are reprojected across views to compute a cross-view geometric reprojection loss. This rigorous supervision mechanism continuously penalizes any spatial drift during the reverse diffusion process, forcing the newly generated textures to strictly follow the physical perspective rules of epipolar geometry.
Result: The proposed framework was rigorously evaluated using datasets derived from real-world UAV aerial imagery and their corresponding reconstructed point clouds. Our experimental setup was explicitly designed to simulate extreme large-baseline view extrapolation tasks, where the target camera pose significantly deviates from the available sparse reference views. We conducted comprehensive qualitative and quantitative comparisons against state-of-the-art baselines, including purely data-driven models like VistaDream and adapted baselines such as SDXL equipped with standard ControlNet depth conditioning. Qualitative visual analyses demonstrate that the proposed method significantly outperforms all compared baseline models.
While foundational SDXL models suffered from severe structural drift, often manifesting as fractured roads or the unnatural intertwining of distinct architectural elements, our method accurately preserved the targeted ground objects. The generated objects exhibited geometrically correct recession strictly along the depth axis, maintaining precise topological boundaries without perspective distortion. Furthermore, compared to traditional multi-view stereo reconstruction pipelines, the images synthesized by our generative framework eliminated empty rendering holes and appeared significantly more natural and visually realistic. Quantitatively, the proposed method achieved superior performance across multiple core perception and semantic metrics. Traditional pixel-level metrics like PSNR often penalize generative models due to minor, physically plausible texture variations. Therefore, we focused on metrics that reflect human visual perception and deep semantic alignment. Our method reduced the Learned Perceptual Image Patch Similarity (LPIPS) score to 0.466, indicating its capability to synthesize high-fidelity textures that closely mimic real UAV footage. Moreover, the framework exhibited distinct advantages in large-scale visual-language model evaluations, comprehensively outperforming baseline models in both LLaVA-IQA aesthetic scoring and CLIP-Score semantic consistency evaluation.
Conclusion: This research represents a significant paradigm shift in the field of novel view synthesis, moving away from relying on dense pixel interpolation towards geometry-guided generative extrapolation. By seamlessly integrating the deterministic "rendering" principles of traditional computer graphics with the "generative" capabilities of advanced diffusion models, we have explored a highly effective solution for large-baseline view synthesis under sparse observation conditions.
The proposed method successfully breaks the long-standing dependence on multi-view continuity. By treating discrete 3D point clouds not merely as rendering primitives, but as rigorous physical anchors to constrain the generative diffusion process, the framework effectively mitigates spatial drift and perspective distortion. From a practical engineering standpoint, this research directly addresses the severe objective constraints faced in actual UAV aerial surveying, such as airspace restrictions and limited flight endurance. It provides a robust, controllable, and highly practical algorithm pipeline capable of accurately deducing large-scale spatial structures, thereby unlocking new potentials for digital twin modeling and complex scene reconstruction in industrial applications.
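The first stage of the method (projecting the sparse point cloud into the target view to obtain a depth map, then fusing it with pixel coordinates) can be sketched as follows. This is a minimal pinhole-camera illustration under stated assumptions: the function names, the use of 0.0 to mark holes, the nearest-point z-buffer, and the three-channel (depth, normalized u, normalized v) layout of the embedding are all hypothetical, since the abstract does not specify how the GSE map is actually constructed.

```python
def project_to_depth(points, fx, fy, cx, cy, R, t, width, height):
    """Project world-space points into a camera and keep the nearest depth per pixel.

    R (3x3 nested lists) and t (length 3) are the world-to-camera extrinsics;
    fx, fy, cx, cy are pinhole intrinsics. Pixels hit by no point stay 0.0.
    """
    depth = [[0.0] * width for _ in range(height)]
    for X in points:
        # World -> camera coordinates: Xc = R * X + t
        xc, yc, zc = (sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3))
        if zc <= 0:          # behind the camera, cannot be observed
            continue
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))
        if 0 <= u < width and 0 <= v < height:
            # z-buffer: a closer point overwrites a farther one
            if depth[v][u] == 0.0 or zc < depth[v][u]:
                depth[v][u] = zc
    return depth


def geometric_embedding(depth):
    """Assumed layout: each pixel stores (depth, normalized u, normalized v)."""
    h, w = len(depth), len(depth[0])
    return [
        [(depth[v][u], u / max(w - 1, 1), v / max(h - 1, 1)) for u in range(w)]
        for v in range(h)
    ]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cloud = [
    (0.0, 0.0, 2.0),    # lands on the centre pixel
    (0.0, 0.0, 5.0),    # same pixel but farther away, so it is occluded
    (1.0, 1.0, 1.0),    # lands on the bottom-right pixel
    (0.0, 0.0, -1.0),   # behind the camera, skipped
]
depth_map = project_to_depth(cloud, 1.0, 1.0, 1.0, 1.0, identity, [0.0, 0.0, 0.0], 3, 3)
gse = geometric_embedding(depth_map)
```

With a sparse cloud most pixels remain holes, which is the gap the abstract's generative branch is meant to fill; the sketch only covers the deterministic projection step.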
Li Xiao-Hui, Zhou Yu, Peng Liang-Rui, Chen Shan-Xiong, Lian Zhou-Hui, Gao Liang-Cai, Yin Xu-Cheng, Liu Cheng-Lin
DOI:10.11834/jig.260148
Abstract: Document Image Analysis and Recognition (DIAR) stands as a pivotal technological bridge, transforming static physical documents into dynamic, structured digital information. This field is currently undergoing a profound paradigm shift, evolving from a collection of isolated, task-specific pipelines towards an era of holistic, intelligent understanding powered by large-scale models. This comprehensive survey, synthesized from over twenty presentations at the “Document Image Micro-Salon” seminar series organized by the Technical Committee on Document Image Analysis and Recognition of the China Society of Image and Graphics, offers a systematic review of the groundbreaking contributions made by young Chinese scholars in recent years. We trace the trajectory of this evolution across three interconnected layers: the continuous refinement of foundational tasks, the architectural unification of these tasks into end-to-end systems, and the emergence of new intelligent parsing paradigms driven by large vision-language models (LVLMs).
At the foundational layer, significant innovations have redefined core challenges. In text recognition, research has moved decisively beyond closed-set, fully supervised settings. Pioneering work has formalized and addressed the Open-Set Text Recognition (OSTR) problem, enabling models to recognize novel characters (crucial for applications like ancient manuscript digitization or minority language processing) without costly retraining, through frameworks that learn to map character labels to visual prototypes. Concurrently, the field has embraced self-supervised and semi-supervised learning to overcome the bottleneck of expensive annotations.
Novel frameworks that synergistically combine contrastive learning with masked image modeling have demonstrated remarkable gains in general representation power, while unified architectures for joint supervised and self-supervised learning have effectively bridged the domain gap between synthetic and real-world data. Furthermore, there has been a concerted push towards more efficient and universal models. Architectures like SVTR have shown that a single, pure vision model can outperform traditional hybrid designs, and new decoding strategies have reconciled the speed-accuracy trade-off in sequence prediction. The frontier has even expanded to instruction-guided recognition, where models are trained to understand text by predicting rich character attributes, showcasing a deeper level of semantic interaction.
The deep structural complexities of mathematical expressions and tables have also seen dedicated advances. For handwritten mathematical expression recognition (HMER), graph-to-graph learning paradigms have enabled explicit modeling of hierarchical symbol relationships, while structured string decoders have offered a balanced approach between sequential and tree-based methods. More recently, leveraging the power of pre-training, specialized LVLMs for HMER have been developed, featuring hierarchical adapters that separately handle primitive character recognition and structural relationship inference, achieving state-of-the-art results that surpass both general-purpose and previous specialized models. In table structure recognition (TSR), solutions have been tailored to the wild variations found in real-world documents. Innovations include cycle-pairing modules for robust cell detection in distorted tables and visual-aligned sequential modeling that ensures precise physical bounding box prediction by enriching logical representations with fine-grained local visual features.
The ultimate goal is a unified framework capable of end-to-end, multi-turn conversational parsing of complex tabular data.
Building upon these refined components, the second layer of progress focuses on system-level integration. The traditional, error-prone pipeline of “detect-then-recognize” is being replaced by deeply fused, end-to-end architectures. Models like Text Perceptron and MANGO have pioneered methods to jointly optimize detection and recognition, using techniques such as fiducial point regression and mask attention to directly read character sequences from feature maps, thereby eliminating intermediate cropping steps and enabling seamless handling of arbitrary text shapes. This trend culminates in the vision of General OCR (OCR-2.0), embodied by unified transformer models that can process all forms of human-readable optical signals (text, formulas, tables, and charts) in a single, elegant framework. Yet, the field pragmatically acknowledges that meticulously engineered, high-performance pipelines remain highly valuable, as evidenced by open-source parsers like MinerU that integrate multiple SOTA modules with sophisticated post-processing rules to achieve production-grade quality.
Finally, the advent of the large model era has catalyzed a third wave of innovation, giving rise to new intelligent document parsing paradigms. Recognizing the limitations of generic LVLMs in professional OCR tasks (e.g., hallucination, inefficiency), the community has developed a new generation of specialized, lightweight OCR-LVMs. These models, such as PaddleOCR-VL and HunyuanOCR, strategically balance the benefits of end-to-end learning with the reliability of modular design, often through a two-stage “analyze-then-parse” workflow that first predicts a layout and then performs parallel, element-wise content recognition. This approach preserves structural integrity while maintaining high efficiency.
To support this rapid development, a new ecosystem of comprehensive evaluation benchmarks has emerged. These benchmarks, including OCRBench, OmniDocBench, and TextHalu-Bench, move far beyond simple accuracy metrics. They provide systematic, fine-grained assessments of a model’s capabilities across diverse languages, layouts, and degradation conditions, while also quantifying critical factors like its susceptibility to semantic hallucination and the cascading impact of its errors on downstream applications like RAG systems.
In summary, this survey delineates a clear and vibrant path of DIAR’s evolution: from the meticulous engineering of individual components, through their synergistic integration into robust systems, to the cognitive-level semantic understanding promised by the new generation of intelligent document parsers. It provides a crucial reference for the ongoing effort to build a universal document intelligence foundation that is not only highly accurate but also robust, interpretable, efficient, and truly capable of serving the complex demands of the real world. The algorithms, datasets, and evaluation metrics mentioned in this paper have been compiled at https://github.com/xhli-git/Micro-Salon-Survey.
Li Ce, Wang Kai, Xiao Limei, Wang Ru, Ping Mengmeng, Lu Ming
DOI:10.11834/jig.260054
Abstract:
Objective: Facial expression recognition (FER) has emerged as a pivotal research topic in the field of computer vision due to its broad applicability in diverse domains such as human–computer interaction, intelligent education systems, healthcare monitoring, mental health assessment, driver fatigue detection, and online behavior analysis. Accurate FER enables machines to interpret human emotions, thereby enhancing the naturalness and adaptability of interactive systems. Despite recent advances driven by convolutional neural networks (CNNs) and, more recently, Transformer-based architectures, the performance of FER models remains constrained in challenging scenarios characterized by high inter-class similarity and large intra-class variability. One of the primary limitations of existing methods lies in their insufficient ability to effectively model fine-grained discriminative features from facial key regions, such as the corners of the eyes and mouth, which are crucial for recognizing subtle emotional differences. This deficiency often results in feature representations that are dominated by global semantic patterns while failing to capture subtle but semantically critical local cues, ultimately reducing classification accuracy in complex and ambiguous expression categories.
Methods: To address these challenges, we propose a novel Cross-Fusion Multi-Level Receptive Field Network specifically designed for facial expression recognition. The central idea of our approach is to explicitly integrate local structural priors derived from facial landmarks with global semantic expression features, enabling a more balanced and discriminative feature representation that combines detailed regional cues with holistic facial context. Concretely, we first design a dual-stream feature extraction framework in which one stream encodes the global expression semantics of the entire face, while the other focuses on fine-grained geometric structures extracted from facial landmark points.
These two feature streams are then fused using a Transformer-based cross-fusion module, allowing rich bidirectional interactions in the spatial domain. This fusion mechanism effectively guides the model to attend to expression-critical regions while simultaneously leveraging the contextual information provided by the full-face representation.
A key technical innovation in our model is the Sliding Dilated Window Attention (SDWA) mechanism, which is designed to overcome the limitations of conventional global self-attention in standard Transformers. While global attention offers strong long-range dependency modeling and parallel computation efficiency, it lacks the local inductive bias inherent to CNNs, making it less effective for capturing small-scale discriminative features. To remedy this, the SDWA mechanism restricts attention computation to localized sliding windows, significantly reducing computational complexity while concentrating attention resources on critical facial areas. Furthermore, inspired by the dilated convolution paradigm, we introduce varying dilation rates within these sliding windows to expand the receptive field without sacrificing resolution. Specifically, we assign different dilation rates to different attention heads, thereby constructing a multi-level receptive field encoder within the Transformer. This design allows the network to simultaneously perceive small-scale details, such as subtle movements in the eyes or mouth corners, and larger-scale structural variations, enabling a richer and more semantically aligned representation of facial expressions.
Result: We evaluate our method on three widely used benchmark datasets: RAF-DB, AffectNet, and FERPlus. The proposed model achieves top-tier performance, with classification accuracies of 92.14% on RAF-DB (seven-class classification), 67.35% and 63.44% on AffectNet (seven- and eight-class classification settings), and 91.67% on FERPlus (eight-class classification).
Our experiments are conducted using a standardized training pipeline with data augmentation strategies such as random cropping, horizontal flipping, and color jittering, and optimization via AdamW with a learning rate warm-up schedule. We also perform a comprehensive ablation study, demonstrating that removing any of the three major components (the cross-fusion framework, the SDWA mechanism, or the multi-dilation receptive field encoder) results in a significant performance drop. Additionally, qualitative analysis using attention heatmaps confirms that the proposed method consistently focuses on semantically meaningful facial regions, such as the eyes and mouth, across different expression categories and datasets.
Beyond raw accuracy numbers, we compare our method with representative state-of-the-art FER models, including CNN-based architectures and pure Transformer baselines. Our network consistently outperforms these methods, particularly in categories with subtle local differences, validating the effectiveness of explicitly modeling key facial regions. Notably, the proposed architecture achieves this without incurring excessive computational cost, maintaining an inference speed suitable for real-time applications.
Conclusion: In this work, the proposed Cross-Fusion Multi-Level Receptive Field Network offers a new paradigm for FER by effectively combining the strengths of global and local feature modeling within a unified Transformer-based framework. Its ability to attend to subtle local details while preserving global semantic coherence leads to notable improvements in recognition accuracy on challenging benchmark datasets. Future work will explore adaptive attention windowing strategies and integration with temporal modeling modules to extend the approach to video-based FER and other dynamic facial analysis tasks.
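The core idea behind sliding dilated window attention, with a different dilation rate per attention head, can be sketched in one dimension. This is a simplified illustration rather than the authors' implementation: the abstract describes attention over 2D facial feature maps, and the function names, the window size of 3, and the dilation rates (1, 2, 3) used here are assumptions.

```python
import math


def dilated_window_indices(center, length, window, dilation):
    """Key positions a query at `center` attends to: `window` taps spaced `dilation` apart."""
    half = window // 2
    taps = [center + dilation * offset for offset in range(-half, half + 1)]
    return [i for i in taps if 0 <= i < length]   # drop taps outside the sequence


def window_attention(queries, keys, values, window, dilation):
    """Scaled dot-product attention restricted to each query's dilated window."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for i, q in enumerate(queries):
        idx = dilated_window_indices(i, len(keys), window, dilation)
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in idx]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]   # numerically stable softmax
        total = sum(weights)
        outputs.append([
            sum(w / total * values[j][c] for w, j in zip(weights, idx))
            for c in range(len(values[0]))
        ])
    return outputs


# One dilation rate per head: the same 3-tap window covers spans of 3, 5 and 7 positions.
head_dilations = (1, 2, 3)
values = [[float(i)] for i in range(9)]
zeros = [[0.0] for _ in range(9)]   # zero queries/keys give uniform weights inside each window
per_head = {d: window_attention(zeros, zeros, values, window=3, dilation=d)
            for d in head_dilations}
```

Each head still computes only `window` scores per query, so cost stays linear in sequence length while heads with larger dilation see a wider context; concatenating the heads would give the multi-level receptive field the abstract refers to.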
Zhou Tianyang, Mao Yuxiang, Ye Yongjing, Xia Shihong
DOI:10.11834/jig.260036
Abstract:
Objective: Constructing vivid personalized 3D head avatar models at the lowest possible cost from 2D images is an important research problem in computer graphics and virtual reality. Although 2D video generation models have recently achieved significant breakthroughs, explicit 3D avatars still play an irreplaceable role in fields such as virtual reality and human–computer interaction. Using 3D Gaussian Splatting (3DGS) as the rendering and 3D representation method can effectively improve the rendering quality and efficiency of avatar models. However, existing 3DGS-based avatar modeling methods either require several hours or even days of training or have difficulty reconstructing fine wrinkle details, making it hard to achieve both fast and high-quality reconstruction. In terms of implementation, existing methods all rely during training on the mesh tracked from the original data in the preprocessing stage to provide geometric information for the avatar model, yet the texture map corresponding to that mesh, an additional output of the preprocessing stage, is discarded. As a result, the avatar model has to learn appearance information from scratch from the original images. In addition, existing methods achieve expression animation of avatar models through linear combinations of Gaussian attribute blendshapes. However, taking the rotation attribute of Gaussians as an example, the addition of quaternions is mathematically meaningless as an operation on rotations. This leads to unreasonable rotation interpolation during expression animation, affecting the stability of the animation and the plausibility of the final avatar model's geometry.
To solve these problems, we propose TPAvatar (Head Avatar with Texture Prior), a novel 3DGS-based method that creates photorealistic head avatars from multi-view or monocular video sequences of a subject, achieving real-time and high-fidelity animation.
Method: By learning a latent feature space for Gaussian attributes, TPAvatar achieves a significant reduction in neural network parameters, resulting in a compact neural network model. Specifically, TPAvatar is the first to leverage a pre-trained DINOv2 model to extract a view-independent identity appearance prior from the texture map of the specific subject, construct a UV-aligned identity feature map, and provide improved initialization for the Gaussian model. For expression driving, TPAvatar establishes a set of implicit expression feature blendshapes for each Gaussian within the local space of each triangle of the mesh. By combining mesh binding with linear combinations of these expression feature blendshapes, it enables efficient and expressive animation of the avatar. The identity feature map and the expression feature map are first summed pixel by pixel and then decoded by a Gaussian decoder to obtain Gaussian attribute maps defined in the UV-local coordinate system. These Gaussian attributes are subsequently transformed into the global coordinate system, and the final images are rendered using 3DGS.
Result: Experimental results on the multi-view dataset NeRSemble and the monocular dataset INSTA demonstrate that TPAvatar can effectively handle multi-view reconstruction and monocular reconstruction tasks. Compared with existing methods such as GaussianAvatars, GEM, and RGBAvatar, TPAvatar achieves shorter training time, faster inference speed, and higher-quality reconstruction and animation, effectively balancing high fidelity and real-time performance.
Specifically, we evaluate these methods on subjects of different ages and genders under two tasks: novel view synthesis on the validation set and novel expression synthesis on the test set. In the multi-view reconstruction scenario, compared with the baseline methods GaussianAvatars/GEM, TPAvatar shortens the reconstruction time from 8/12 hours to 1.5 hours while achieving higher reconstruction quality: on the test set, PSNR increases by 1.5608/10.3556 and LPIPS decreases by 0.0037/0.0131. Compared with the baseline method RGBAvatar, TPAvatar significantly improves generalization in novel view synthesis while maintaining the advantage of fast reconstruction, with PSNR increasing by 0.5139 and LPIPS decreasing by 0.0155. In the monocular reconstruction scenario, compared with the existing state-of-the-art (SOTA) baseline RGBAvatar, TPAvatar increases PSNR by 0.1176 and decreases LPIPS by 0.0016. TPAvatar achieves an animation speed of 164 FPS, enabling real-time animation. Ablation studies further verify the effectiveness of both the identity feature module and the expression driving module.
Conclusion: By integrating texture features and constructing expression feature bases, TPAvatar enhances the animation quality and viewpoint generalization of 3DGS head avatar models, making it a practical and efficient 3D avatar reconstruction method capable of real-time rendering and animation. It is suitable for personalized 3D head avatar reconstruction from multi-view or monocular video input. The decoupled design of identity and expression gives the proposed method the potential to be extended to single-image avatar reconstruction. Code is available at: https://doi.org/10.57760/sciencedb.j00240.00128
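For readers less familiar with the PSNR figures quoted above, the following is a minimal sketch of the standard definition, assuming images flattened to lists of floats in [0, 1]. The function names are illustrative, and the abstract does not state how the authors average PSNR over frames or subjects.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_from_gain(gain_db):
    """MSE ratio implied by a PSNR gain for a single image: 10 ** (-gain / 10)."""
    return 10.0 ** (-gain_db / 10.0)


# Hypothetical four-pixel example with a uniform error of 0.1.
reference = [0.0, 0.5, 1.0, 0.25]
estimate = [0.1, 0.6, 1.1, 0.35]
print(f"PSNR = {psnr(reference, estimate):.2f} dB")
```

Because PSNR is logarithmic in the mean squared error, a gain of g dB on a single image corresponds to multiplying its MSE by 10^(-g/10); when PSNR is averaged across images this relation holds only approximately.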
Abstract:
Objective: As a fundamental data modality for autonomous driving, robotic navigation, and large-scale environmental perception, 3D point clouds pose stringent demands on place recognition technologies. Accurate place recognition relies on extracting robust environmental descriptors; however, despite recent advancements, mainstream deep learning methods are fundamentally constrained by three prominent limitations. First, conventional feature representations are strictly confined to the real-number domain, which inherently restricts models to capturing only geometric coordinates and local structural information. This paradigm discards crucial spatial attributes such as the point distribution phase, the relative positions of non-adjacent points, and complex global topological relations. Consequently, recognition accuracy is low for places that exhibit similar local structures but distinct global configurations. Second, existing feature interaction mechanisms struggle to align with the "unordered, sparse, and irregular" intrinsic properties of raw point clouds. Traditional grid-based convolutions, which rely on fixed local receptive fields, often suffer from "local information over-focus," severely neglecting the long-range semantic associations necessary for holistic place understanding. Finally, the pervasive trade-off between recognition performance and computational efficiency remains unresolved. High-accuracy baseline models typically demand excessive network depth and parameter scales (often reaching tens or hundreds of megabytes), yielding inference latencies that far exceed the 10 ms real-time threshold and thereby failing to meet the strict practical requirements of autonomous systems.
Aiming to systematically address these limitations of traditional real-number domain operations, this study explores a novel point cloud feature representation paradigm deeply integrated with interdisciplinary theories, striving to significantly enhance both the performance and interpretability of large-scale point cloud place recognition.
Methods: Inspired by the foundational wave-particle duality theory in physics, this study abstracts each individual point in the 3D point cloud as an elementary particle with fundamental physical states, constructing a comprehensive point cloud feature representation framework that embodies both "wave nature" and "particle nature". Correspondingly, an end-to-end place recognition method consisting of three synergistic core modules is proposed, with two specific implementation instances designed for each module to demonstrate the paradigm's flexibility. 1) Point Cloud Wave Property Expression Module: Leveraging the wave attribute of single particles, this initial module maps raw 3D point clouds from the constrained real-number space into a high-dimensional Hilbert space, representing each point as a complex-valued wave function. The amplitude component encodes fundamental geometric positions and local structural details, whereas the phase component captures relative spatial relationships and global distribution trends. This dual encoding achieves comprehensive modeling of both geometric and semantic point information, enhancing local context perception. 2) Point Cloud Particle Property Information Interaction Module: Emulating the "particle nature" behavior observed under multi-particle coherence, this phase-driven interaction mechanism constructs a dynamic spatial graph based on the wave function similarities between points, treating individual points as graph nodes and interaction intensities as adaptive edge weights.
By mathematically simulating particle energy transfer and interactive information exchange, it facilitates the adaptive fusion of high-dimensional feature information across both immediately adjacent points and semantically correlated non-adjacent distant points. This efficiently characterizes the dynamic dependencies and high-order semantic interactions that conventional real-number convolutions miss. 3) Point Cloud Feature Compact Encoding Module: To address the inherent high dimensionality and computational burden of complex-valued features, this final module leverages parameter-sharing complex-valued convolutions for deep feature extraction. It then applies efficient aggregation strategies (specifically, a Generalized Mean (GeM) pooling approach or a structurally refined Soft-NetVLAD instance) to compress these high-dimensional interactions into a highly compact, representative global feature vector tailored for the final retrieval and recognition task.
Results: Comprehensive evaluations were conducted across four foundational benchmark datasets (Oxford, U.S., R.A., B.D.) and two complex generalization datasets (KITTI Sequence 00, RobotCar Season). The empirical findings confirm that all instances of the proposed interdisciplinary paradigm effectively and consistently enhance the performance of established benchmark models, including PointNet++, MinkLoc3D, and PCAN, in terms of both top-1% average recall and retrieval precision. Notably, the specific combination of direct wave function modeling and GeM pooling achieves an unprecedented balance between performance and deployment efficiency. Requiring merely 6.12 MB of learnable parameters, this streamlined instance nearly doubles the model convergence speed during training and reduces the inference time to a mere 8 ms, strictly satisfying the critical 10 ms real-time processing requirement dictated by autonomous driving and robotic navigation systems.
Furthermore, the wave-particle paradigm exhibits exceptional robustness; in rigorous perturbation tests, it delivered a 7.1% average performance gain under severe rotational variances and secured a 5.8% average performance enhancement in demanding cross-domain environmental tests. Extensive ablation studies corroborate the structural necessity of modeling wave-particle coherence, demonstrating that the removal of any core module directly leads to immediate and notable performance degradation.
Conclusion: The proposed wave-particle duality paradigm exhibits excellent architectural versatility and plug-and-play characteristics, enabling significant performance enhancements of existing baseline models without necessitating extensive structural modifications. It empirically verifies the effectiveness of interdisciplinary theoretical fusion in breaking through traditional point cloud feature representation bottlenecks, resolving issues including real-number domain spatial limitations, mismatched feature interaction mechanisms, and rigid performance-efficiency trade-offs. This work provides a robust, interpretable, and highly efficient new research direction for point cloud place recognition, paving the way for future optimizations in complex-valued computation and expanding scenario-specific applications.
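The amplitude-phase encoding and GeM pooling named in the Methods can be illustrated with a small sketch. Everything here is an assumption made for illustration: the abstract does not give the similarity function used for edge weights, so the normalized complex inner product below is a stand-in, and applying GeM to feature magnitudes is likewise a guess at one plausible choice.

```python
import cmath
import math


def wave_feature(amplitudes, phases):
    """Encode each channel as amplitude * exp(i * phase)."""
    return [cmath.rect(a, p) for a, p in zip(amplitudes, phases)]


def coherence(psi_a, psi_b):
    """Magnitude of the normalized complex inner product, in [0, 1]."""
    inner = sum(a * b.conjugate() for a, b in zip(psi_a, psi_b))
    norm_a = math.sqrt(sum(abs(a) ** 2 for a in psi_a))
    norm_b = math.sqrt(sum(abs(b) ** 2 for b in psi_b))
    return abs(inner) / (norm_a * norm_b)


def gem_pool(values, p=3.0):
    """Generalized mean of non-negative values: (mean(v ** p)) ** (1 / p)."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


# Three hypothetical points with identical amplitudes but different phases.
amplitudes = [1.0, 1.0]
point_a = wave_feature(amplitudes, [0.0, 0.0])
point_b = wave_feature(amplitudes, [0.0, math.pi])  # opposite relative phase
point_c = wave_feature(amplitudes, [0.3, 0.3])      # global phase shift only

edge_ab = coherence(point_a, point_b)  # close to 0: phases cancel
edge_ac = coherence(point_a, point_c)  # close to 1: same relative phase

# GeM over per-point magnitudes; p = 1 is average pooling, large p nears max.
pooled = gem_pool([1.0, 2.0, 3.0], p=3.0)
print(edge_ab, edge_ac, pooled)
```

Points A and B have the same amplitudes, so a purely real-valued magnitude comparison would treat them as identical, while the phase-aware weight separates them. This is the kind of distinction the wave representation is described as capturing.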
Si Ruotong, Tang Yichao, Zhang Xinpeng, Li Sheng, Qian Zhenxing
DOI:10.11834/jig.260043
Abstract:
Objective: The imaging style of a digital camera, determined by its proprietary image signal processing (ISP) pipeline, constitutes a core intellectual property and a critical brand asset for manufacturers. It encompasses distinct visual characteristics including color tendency, tone and atmosphere, spatial sharpness and detail, and noise reduction, which together form brand-identifiable aesthetics, as exemplified by Canon's Picture Style system and Nikon's Vivid mode. However, the widespread availability of open-source deep ISP models and large-scale paired RAW-RGB datasets has made surrogate model attacks a severe threat. In such attacks, an adversary trains a data-driven deep ISP network on paired RAW-RGB data, where the RAW images are collected by the adversary's device and the RGB images are captured by the target camera, to mimic the proprietary imaging style with high fidelity. The theft can even be carried out in a black-box manner, without knowledge of the target model's structure or parameters. Existing digital watermarking methods are predominantly designed to resist conventional signal processing attacks or physical channel attacks, and they prove inadequate against surrogate model attacks. These attacks involve a highly nonlinear, data-driven transformation that can inadvertently destroy embedded watermarks during the style learning process, which is fundamentally distinct from traditional attack paradigms. To solve this problem, this paper proposes StyleSign, a robust watermarking framework specifically designed to protect camera imaging styles against surrogate model attacks.
The framework embeds an invisible watermark into every output RGB image of the protected ISP pipeline, ensuring that the watermark information survives the nonlinear transformations of surrogate attacks and remains reliably extractable from the attacker's generated outputs, providing verifiable evidence of style theft.
Method: StyleSign adopts an end-to-end trainable architecture that jointly optimizes three core modules: a multi-scale watermark encoder, an internal surrogate module, and a decoder. Specifically, the multi-scale watermark encoder is integrated into the protected ISP pipeline to imperceptibly embed a binary watermark into the final RGB image. To enhance robustness against subsequent nonlinear transformations, the encoder employs a squeeze-and-excitation-based discrete wavelet transform (SEDWT) module as its core unit. This module decomposes the fused image and watermark features into multiple frequency sub-bands and applies a channel attention module to emphasize style-relevant components, allowing the watermark to be embedded into style-relevant features rather than relying on the semantic content of the image. The key innovation of this framework is the internal surrogate module, which is designed to simulate the behavior of an attacker's surrogate ISP model during training. This module takes the same RAW image as the ISP pipeline as input and learns to reconstruct the watermarked RGB image, effectively mimicking the style transfer process of a surrogate attack while preserving the embedded watermark. Architecturally, it adopts a dual-branch design. The demosaicing branch leverages a global guided color mapping (GGCM) network to accurately capture and model the global color and tone characteristics of the target imaging style.
Meanwhile, the RAW branch uses a U-Net structure in which standard pooling operations are replaced with discrete wavelet transforms to retain high-frequency watermark details during downsampling and reconstruction, and an efficient channel attention module is integrated into the skip connections to further enhance watermark-related features. The final output is the pixel-wise average of the two branches' results. Finally, the decoder is jointly optimized on two inputs: the directly watermarked image from the encoder and the mimic image generated by the internal surrogate module.
Result: Experimental results on the Zurich RAW to RGB dataset demonstrate the effectiveness of StyleSign. In terms of fidelity, the embedded watermark introduces negligible visual distortion, with watermarked images attaining a peak signal-to-noise ratio (PSNR) of 37.26 dB, a structural similarity index measure (SSIM) of 0.9893, and a learned perceptual image patch similarity (LPIPS) of 0.0425, all close to the original image quality and outperforming the compared watermarking schemes. In terms of robustness, StyleSign demonstrates strong performance under both conventional image processing attacks and surrogate model attacks. For conventional attacks including color jitter, Gaussian noise, Gaussian blur, JPEG compression, and resizing, the watermark extraction bit error rate (BER) ranges from 0.63% to 0.83%, showing robustness against these distortions. More importantly, under four representative surrogate model attacks with different architectures, namely RAW-to-sRGB, AWNet, MW-ISPNet, and Airia CG, StyleSign achieves BER values as low as 1.07%, 1.19%, 0.99%, and 0.49%, respectively, outperforming the compared watermarking schemes.
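The bit error rate reported above is the fraction of watermark bits that the decoder recovers incorrectly. A minimal sketch follows, assuming watermarks represented as lists of 0/1 integers; the message length and values are hypothetical, since the abstract does not state the payload size.

```python
def bit_error_rate(embedded, extracted):
    """Fraction of positions where the extracted bit differs from the embedded bit."""
    if not embedded or len(embedded) != len(extracted):
        raise ValueError("bit sequences must be non-empty and of equal length")
    errors = sum(1 for a, b in zip(embedded, extracted) if a != b)
    return errors / len(embedded)


# Hypothetical 8-bit watermark with one flipped bit after an attack.
embedded = [1, 0, 1, 1, 0, 0, 1, 0]
extracted = [1, 0, 1, 0, 0, 0, 1, 0]
print(f"BER = {bit_error_rate(embedded, extracted):.2%}")
```

A BER of 50% is what random guessing would give on balanced bits, so values near 1% indicate that almost the entire message survives.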
Ablation studies further verify that the internal surrogate module is indispensable for ensuring robustness against surrogate model attacks, while the discriminator effectively ensures the visual quality of the watermarked images.
Conclusion: The proposed StyleSign framework addresses the difficulty that existing watermarking methods have in resisting surrogate model attacks aimed at camera imaging style theft. Through the joint optimization of the multi-scale watermark encoder, the dual-branch internal surrogate module, and the decoder, StyleSign maintains excellent watermark robustness and reliable extractability across various surrogate model attack scenarios, while having only a minor impact on image quality. This work provides an effective and generalizable technical solution for protecting the intellectual property of camera imaging styles, and also offers a new research direction for protecting the core intellectual property of camera manufacturers.
Keywords: watermarking; camera; image signal processing; copyright protection; camera imaging style
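The StyleSign abstract replaces pooling with discrete wavelet transforms so that high-frequency watermark detail survives downsampling. The sketch below uses a single-level orthonormal 2D Haar transform as a stand-in; the abstract does not name the wavelet the authors use, and the sub-band names follow one common convention among several.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform of an even-sized 2D list.

    Returns four half-resolution sub-bands (ll, lh, hl, hh).
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            ll_row.append((a + b + d + e) / 2.0)  # local average (scaled)
            lh_row.append((a - b + d - e) / 2.0)  # left-right difference
            hl_row.append((a + b - d - e) / 2.0)  # top-bottom difference
            hh_row.append((a - b - d + e) / 2.0)  # diagonal difference
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Invert haar_dwt2 and recover the full-resolution image."""
    image = []
    for r in range(len(ll)):
        top, bottom = [], []
        for c in range(len(ll[0])):
            s, h, v, x = ll[r][c], lh[r][c], hl[r][c], hh[r][c]
            top += [(s + h + v + x) / 2.0, (s - h + v - x) / 2.0]
            bottom += [(s + h - v - x) / 2.0, (s - h - v + x) / 2.0]
        image += [top, bottom]
    return image


# Hypothetical 4x4 single-channel image.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ll, lh, hl, hh = haar_dwt2(image)
restored = haar_idwt2(ll, lh, hl, hh)
print(ll)
print(restored == image)
```

Average pooling keeps only a scaled copy of the `ll` band and discards the other three, which is why it cannot be inverted; keeping all four sub-bands halves the resolution without losing information.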
Wang Baixiang, Huo Hongtao, Zheng Bowen, Li Zhiqian
DOI:10.11834/jig.260115
Abstract:
Objective: Pan-sharpening aims to fuse a high-resolution panchromatic (PAN) image with a low-resolution multispectral (LRMS) image to produce a high-resolution multispectral (HRMS) product. The key difficulty lies in injecting spatial textures from PAN without violating the spectral relationships among multispectral bands. Maintaining this balance becomes more challenging in heterogeneous scenes where land covers vary within a small neighborhood and structural edges cross semantic boundaries. Many deep learning approaches adopt convolutional neural networks (CNNs) and learn a global mapping from the concatenated inputs, but the locality of convolution and the lack of explicit region priors often lead to over-sharpening, ringing, and color shifts, especially around boundaries and in shadow and vegetation areas. In addition, sensor-dependent factors such as blur, noise, and band responses make it hard to transfer a fixed sharpening pattern across scenes. Hence, incorporating both region-aware priors and long-range context is important for stable spatio-spectral fusion.
Methods: We present a hierarchical clustering-guided pan-sharpening network with global context enhancement, referred to as HCPNet. The network follows an encoder–fusion–decoder design and predicts an HRMS output by adding a learned high-frequency residual to the LRMS image upsampled to the PAN resolution. Skip connections pass shallow textures to the decoder, and residual learning stabilizes optimization and reduces the risk of over-sharpening. The guidance mask used by the proposed method is computed directly from LRMS spectra and does not require extra annotations. Two complementary ingredients are introduced. First, region-aware fusion is driven by a hierarchical clustering prior. Instead of treating all pixels equally, we construct an explicit homogeneous-region guidance mask from the LRMS spectra.
To obtain reliable partitions in complex scenes, we adopt a hybrid strategy that combines agglomerative hierarchical clustering (agglomerative nesting, AGNES) and K-means. The hierarchical stage builds a dendrogram using spectral angle distance and Ward linkage, and an adaptive cluster number is selected according to clustering validity (Davies–Bouldin index) and the change rate across cut heights. The resulting cluster centroids are then used to initialize K-means, producing pixel-wise cluster labels that are reshaped into the guidance mask. This prior is injected into the backbone to route features and to realize differential processing: pixels within a cluster share convolutional responses to preserve spectral consistency in homogeneous areas, while inter-cluster interactions are handled through residual fusion to avoid block artifacts and to prevent the propagation of sharpening errors across boundaries. In the shallow encoder, a Hybrid-H block leverages relatively stable low-level features to generate a reliable partition and an index map; in deeper layers, a Hybrid-D block updates routing with lower overhead to adapt to feature distribution shifts as depth increases. A patch-centroid representation for cluster prototypes and a lightweight dynamic adjustment mechanism are used to improve stability across different scene complexities. In our implementation, the deep routing uses 32 clusters with a small filtering threshold of 0.005. Second, global context enhancement is achieved via an Efficient Feature Transformer (EFT) block. To overcome the limited receptive field of convolution, EFT augments local features with long-range dependencies using spatial-reduction multi-head self-attention. Convolutional features are projected to query, key, and value embeddings; attention weights are computed with scaled dot-product similarity and normalized by softmax; and the attended feature is fused back through residual connections. 
Spatial reduction is applied when computing keys and values, which lowers complexity while preserving coarse global structures. The resulting context feature helps propagate cues across distant regions, improves large-scale structural coherence, and reduces boundary artifacts caused by purely local sharpening. In the experiments, a lightweight configuration is used, including a single attention head, a small feed-forward expansion, and dropout regularization, to keep the additional cost moderate. The network is optimized with a multi-constraint objective. The total loss is a weighted sum of three terms: a spectral-angle loss encouraging the predicted and reference spectra to have similar directions, a clustering-consistency loss aligning feature distributions with cluster prototypes (including a symmetric Kullback–Leibler divergence term and a compactness term), and a reconstruction loss penalizing pixel-wise absolute errors. These terms jointly supervise spectral shape, region coherence, and fidelity of the reconstructed details. During training, images are normalized to the sensor radiometric range, and reduced-resolution samples are generated following the Wald protocol.
Results: Extensive experiments are conducted on the QuickBird and GaoFen-2 datasets. Under the reduced-resolution Wald protocol, where an HRMS reference is available, we report the spectral angle mapper (SAM), ERGAS (Erreur Relative Globale Adimensionnelle de Synthèse), and the mean universal image quality index averaged over bands (Qavg). HCPNet consistently improves these metrics across scenes and sensors. On QuickBird, it improves SAM and ERGAS by 3.4% and 4.0% and increases Qavg by 1.6% compared with the second-best baseline; on GaoFen-2, it improves SAM and ERGAS by 8.0% and 1.2%. Across both datasets, improvements are observed in urban and rural scenes, and the method avoids noticeable spectral shifts in vegetation and shadow regions.
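SAM and ERGAS, the two error metrics reported above, are both lower-is-better. The following is a minimal sketch of their usual definitions, assuming images given as lists of per-pixel spectra and a resolution ratio equal to the PAN pixel size divided by the multispectral pixel size (0.25 is a common value and is used here only as an example). The function names are illustrative, not the authors' code.

```python
import math


def spectral_angle(x, y):
    """Angle in radians between two spectra treated as vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))  # guard rounding
    return math.acos(cosine)


def sam_degrees(reference, fused):
    """Mean spectral angle over all pixels, in degrees."""
    angles = [spectral_angle(r, f) for r, f in zip(reference, fused)]
    return math.degrees(sum(angles) / len(angles))


def ergas(reference, fused, ratio):
    """ERGAS = 100 * ratio * sqrt(mean over bands of MSE_k / mean_k ** 2)."""
    n_pixels = len(reference)
    n_bands = len(reference[0])
    total = 0.0
    for k in range(n_bands):
        mse = sum((r[k] - f[k]) ** 2 for r, f in zip(reference, fused)) / n_pixels
        band_mean = sum(r[k] for r in reference) / n_pixels
        total += mse / band_mean ** 2
    return 100.0 * ratio * math.sqrt(total / n_bands)


# Hypothetical two-pixel, two-band example: fused spectra are scaled copies.
reference = [[1.0, 2.0], [3.0, 4.0]]
fused = [[2.0, 4.0], [6.0, 8.0]]
print(sam_degrees(reference, fused))   # near 0: spectral directions match
print(ergas(reference, fused, 0.25))   # above 0: radiometric error remains
```

The example shows why the two metrics are reported together: SAM ignores per-pixel intensity scaling and measures only spectral shape, while ERGAS penalizes the radiometric error that SAM misses.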
Qualitatively, the proposed method produces sharp yet clean edges and maintains consistent colors in homogeneous regions; it also suppresses ringing and halo effects near high-contrast boundaries, where many learning-based methods tend to over-inject high-frequency components. Ablation studies verify the complementary roles of the two key components: removing the clustering guidance increases boundary artifacts and degrades spectral stability, while removing EFT mainly harms cross-region consistency and large-scale structural coherence. Because clustering is performed on low-dimensional spectra and EFT uses spatial reduction, the overall computational overhead remains manageable in practice.
Conclusion: By combining an explicit hierarchical clustering prior for region-aware fusion with an efficient attention-based global context enhancement module, HCPNet improves pan-sharpening quality in heterogeneous remote-sensing scenes, delivering clearer boundaries with lower spectral distortion.
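The second half of the hybrid clustering strategy in the Methods, K-means seeded with centroids from the hierarchical stage and reshaped into a guidance mask, can be sketched as follows. This is an illustration under stated assumptions: the hierarchical stage is skipped and its centroids are supplied by hand, plain Euclidean distance is used in the K-means step, and all names are hypothetical.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two spectra."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_from_centroids(pixels, initial_centroids, iterations=10):
    """Lloyd iterations starting from the given centroids.

    Returns (labels, centroids); an empty cluster keeps its old centroid.
    """
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(pixels)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every pixel spectrum.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in pixels
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(pixels, labels) if lab == k]
            if members:
                centroids[k] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids


def labels_to_mask(labels, height, width):
    """Reshape a flat label list into a height x width guidance mask."""
    return [labels[r * width:(r + 1) * width] for r in range(height)]


# Hypothetical 2x3 LRMS patch with two bands, flattened row by row.
pixels = [[0.1, 0.1], [0.2, 0.1], [0.9, 0.8],
          [0.1, 0.2], [0.8, 0.9], [0.9, 0.9]]
seed_centroids = [[0.0, 0.0], [1.0, 1.0]]  # stand-in for the hierarchical stage

labels, centroids = kmeans_from_centroids(pixels, seed_centroids)
mask = labels_to_mask(labels, height=2, width=3)
print(mask)
```

Seeding K-means this way removes its dependence on random initialization, which is one plausible reason for pairing it with a deterministic hierarchical stage.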
Sun Renjie, Sun Yubao, Shao Shuai, Shuai Hui, Liu Qingshan
DOI:10.11834/jig.250606
Abstract:
Objective: Text-driven 3D human motion generation has emerged as a frontier research direction in multimodal content creation, holding great promise for applications in virtual reality, film production, and the metaverse. Despite significant progress, existing methods still face fundamental challenges in three aspects: precise semantic alignment between natural language descriptions and generated motions, fine-grained control over individual body parts, and global coordination that respects biomechanical constraints. Consequently, current solutions often suffer from semantic leakage, unnatural postures, and limited expressiveness. Moreover, most approaches either focus solely on motion synthesis without producing complete 3D assets, or generate static avatars without dynamic pose control. To address these limitations, we propose a novel cascaded diffusion framework that follows a "local-to-global, structure-to-appearance" generation pipeline, enabling end-to-end synthesis from raw text to high-fidelity, real-time renderable 3D human models with precise motion control.
Method: Our framework consists of four key stages, each designed to address a specific aspect of the text-to-3D human generation problem. First, a semantic decoupling module leverages a large language model (GPT-4) to automatically parse the input text into independent action descriptions for six anatomical body parts: head, left arm, right arm, torso, left leg, and right leg. This decomposition converts a global motion description into a set of part-specific textual instructions, explicitly separating semantics across different body regions. For body parts not mentioned in the original text, the parser assigns a "do nothing" instruction, preventing unintended movements. This step is crucial because it transforms a loosely coupled global description into a structured, machine-readable format that guides subsequent generation.
Second, we construct a local motion generation module composed of six parallel diffusion-based encoders, each conditioned on its corresponding part description. These encoders operate with gradient isolation, meaning that the training and inference processes for different body parts do not share gradients. This design fundamentally prevents semantic leakage, a common issue in prior work where an action described for one body part inadvertently affects others. Each encoder adopts a transformer-based denoising network. Starting from pure Gaussian noise, the network iteratively refines a latent code guided by the corresponding part text embedding produced by a pre-trained TMR encoder. The resulting latent representation captures the fine-grained motion characteristics of that specific body part, such as the trajectory, speed, and joint angles. Importantly, because the six encoders are independent, they can be trained in parallel on part-specific motion data extracted from full-body motion capture datasets. Third, a global motion fusion module integrates the six independent part latents into a coherent full-body pose. Simple concatenation of part latents would ignore the biomechanical dependencies between body regions. To address this, we employ a lightweight feed-forward network with GELU (Gaussian error linear unit) activation, augmented by the global semantic feature of the complete text. This network learns to enforce biomechanical constraints such as torso leaning backward during a forward kick, natural arm-leg coordination during walking, and maintaining overall balance. The fused latent is then decoded into SMPL parameters, producing a parametric human mesh that respects human skeletal kinematics. Fourth, for appearance enhancement and efficient rendering, we convert the SMPL mesh into a set of 3D Gaussians, a modern explicit representation that supports real-time differentiable rasterization.
Each Gaussian is defined by its position, covariance matrix, opacity, and spherical harmonics coefficients for color. To enrich geometric and textural details beyond the smooth SMPL mesh, we adopt a state-of-the-art 2D diffusion model (Flux) as a powerful visual prior. Through SDS (score distillation sampling), gradients from the 2D diffusion model are backpropagated to iteratively optimize the attributes of the 3D Gaussians while keeping their positions fixed to preserve the generated motion. This optimization runs for 4000 iterations, refining details such as skin texture, clothing wrinkles, and lighting effects. The final output is a fully textured 3D human model that can be rendered in real time without any post-processing.
Result: We conduct extensive experiments on two standard benchmarks, HumanML3D and KIT-ML, and compare our method against representative baselines including MotionDiffuse, MDM, MLD, DreamFusion, GaussianDreamer, and others. For quantitative evaluation, we employ multiple metrics. FID (Fréchet Inception Distance) measures the realism and diversity of generated motion sequences. CLIP-S (CLIP similarity) evaluates semantic alignment between rendered multi-view images and input text. Additionally, we introduce Part-FID (part-level Fréchet inception distance), which computes FID separately for each of the six body parts using dedicated feature extractors, providing a fine-grained assessment of local motion quality. Experimental results show that our method achieves an FID of 0.429, compared with 0.687 for MotionDiffuse and 0.747 for MDM. In terms of CLIP-S, our method attains 29.41 (ViT-L/14) and 44.39 (ViT-bigG-14), surpassing GaussianDreamer (27.23 and 41.88) and other text-to-3D baselines. The proposed Part-FID yields an average score of 1.26, which is 18.7% better than MotionDiffuse, with the most significant improvement observed on the torso, validating the effectiveness of our global fusion module in enforcing biomechanical coordination.
Ablation studies further confirm the contribution of each component: removing gradient isolation increases semantic leakage. Efficiency analysis shows that our method takes approximately 20 minutes for end-to-end generation, and the final 3D Gaussian representation enables real-time rendering at 24 frames per second, which is two orders of magnitude faster than NeRF-based renderers.
Conclusion: We present a comprehensive framework for text-driven 3D human motion generation that uniquely combines local motion generation with global fusion, supported by efficient 3D Gaussian splatting and a powerful 2D diffusion prior. The method achieves superior performance in motion realism, part-level control accuracy, semantic alignment, and rendering efficiency. It provides an end-to-end solution from natural language to high-quality, real-time renderable 3D human assets, opening new possibilities for interactive virtual human applications. Future work will focus on extending the framework to generate long-sequence motions with temporal consistency and incorporating multimodal control signals.
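The structured output of the semantic decoupling stage in the Method can be pictured with a small sketch. It assumes the language model's parse arrives as a dictionary keyed by body part, which is a hypothetical format, and it does not include the GPT-4 call itself. It shows only the completion rule stated in the abstract: unmentioned parts receive a "do nothing" instruction.

```python
BODY_PARTS = ("head", "left arm", "right arm", "torso", "left leg", "right leg")
DEFAULT_INSTRUCTION = "do nothing"


def complete_part_instructions(parsed):
    """Return one instruction per body part, filling gaps with the default.

    `parsed` maps a subset of BODY_PARTS to part-level action descriptions.
    """
    unknown = set(parsed) - set(BODY_PARTS)
    if unknown:
        raise ValueError(f"unknown body parts: {sorted(unknown)}")
    return {part: parsed.get(part, DEFAULT_INSTRUCTION) for part in BODY_PARTS}


# Hypothetical parse of "a person waves with the right hand while walking".
parsed = {
    "right arm": "raise and wave side to side",
    "left leg": "step forward in a walking cycle",
    "right leg": "step forward in a walking cycle",
}
instructions = complete_part_instructions(parsed)
for part, text in instructions.items():
    print(f"{part}: {text}")
```

Making the default explicit means each of the six part-level generators always receives a condition, so a part that the text never mentions is told to stay still rather than left unconstrained.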
Zuo Jialong, Deng Haoyou, Zuo Haotong, Zhou Hanyu, Zhu Jiaxin, Zhang Yicheng, Zhang Yiwei, Yan Yongxin, Huang Kaixing, Chen Weisen, Deng Yongtai, Jin Rui, Sang Nong, Gao Changxin
DOI:10.11834/jig.260029
Abstract: The rapid evolution of large-scale text-to-image generation models has fundamentally transformed the landscape of visual content creation. Driven by advances in diffusion models, large multimodal pretraining, and scalable inference pipelines, modern generative systems have demonstrated unprecedented capabilities in synthesizing visually compelling images across a wide range of styles, scenes, and semantic conditions. Commercial models such as Nano Banana Pro have attracted significant attention due to their strong zero-shot generation ability, robust semantic understanding, and impressive perceptual quality. However, despite their success in creative image synthesis, a critical and largely underexplored question remains: Can such foundation generative models serve as general-purpose solvers for traditional low-level vision tasks? Low-level vision tasks (including dehazing, deblurring, super-resolution, and others) have historically been dominated by task-specific, regression-based models. These models are typically trained under strong supervision with paired data and optimized using pixel-aligned objectives such as PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index measure). While highly effective within their target domains, such specialist models lack flexibility, often require costly retraining for new tasks, and struggle to generalize beyond their training distributions. In contrast, foundation generative models promise a unified alternative: a single pretrained model capable of addressing diverse vision tasks through natural language prompts, without task-specific fine-tuning. In this work, we present the first large-scale, systematic zero-shot evaluation of Nano Banana Pro across a broad spectrum of low-level vision tasks.
Specifically, we investigate whether Nano Banana Pro can function as a low-level vision all-rounder, that is, a generalist model capable of producing high-quality results across heterogeneous restoration, enhancement, and fusion tasks. To this end, we conduct an extensive evaluation covering 14 distinct low-level vision tasks across 40 datasets, encompassing both synthetic and real-world degradations. The evaluated tasks include deblurring (motion, defocus), super-resolution, image denoising, deraining, shadow removal, reflection removal, flare removal, low-light image enhancement, underwater image enhancement, HDR (high dynamic range) reconstruction, multi-focus image fusion, and infrared–visible image fusion, among others. All experiments are conducted under a standard zero-shot protocol. Nano Banana Pro is queried exclusively through simple, task-oriented natural language prompts, without any model fine-tuning, parameter adaptation, or task-specific post-processing. This setting is deliberately chosen to reflect realistic deployment scenarios and to assess the intrinsic capability of the model as a foundation visual system. For each task, we compare Nano Banana Pro against state-of-the-art specialist methods specifically designed for the corresponding task. Our comprehensive evaluation reveals a consistent and striking performance dichotomy. On one hand, Nano Banana Pro frequently produces results with superior perceptual quality, characterized by enhanced clarity, vivid textures, improved contrast, and visually pleasing color distributions. In many challenging scenarios, such as severe noise, extreme low-light conditions, heavy underwater color distortion, or strong atmospheric degradation, the model is able to hallucinate plausible high-frequency details and recover semantically coherent structures that rival or even surpass those generated by domain-specific methods.
Across multiple tasks, Nano Banana Pro achieves competitive or leading performance on no-reference perceptual metrics and consistently receives favorable qualitative assessments. On the other hand, when evaluated using traditional full-reference, pixel-aligned quantitative metrics, Nano Banana Pro systematically underperforms compared to specialist models. Metrics such as PSNR, SSIM, SCD(sum of correlations of differences), and VIF(visual information fidelity) consistently reveal notable gaps, particularly in tasks requiring strict structural alignment or physical signal fidelity. This discrepancy is especially pronounced in tasks like denoising, HDR reconstruction, and image fusion, where pixel-level consistency with the reference image is heavily rewarded. We attribute this behavior to the inherent stochastic and generative nature of diffusion-based models, which prioritize semantic plausibility and perceptual realism over deterministic pixel correspondence. As a result, even visually improved outputs may be penalized for global color shifts, localized texture synthesis, or subtle geometric deviations. Importantly, our analysis shows that these quantitative penalties do not necessarily indicate failure. In many datasets, the provided “ground-truth” images themselves contain residual noise, blur, or imperfect color balance. In such cases, Nano Banana Pro often generates cleaner, more visually appealing results that deviate from the reference but align better with human perception. This observation highlights a fundamental tension between regression-based evaluation paradigms and generative reconstruction behaviors, and suggests that current benchmarks may be insufficient for assessing foundation generative models. Beyond aggregate metrics, we conduct detailed task-wise and dataset-wise analyses to characterize the operational scope and limitations of Nano Banana Pro. 
The model excels in scenarios involving severe degradation, ambiguous structure, or incomplete information, where its strong semantic priors can compensate for missing signal. Conversely, it struggles in applications demanding strict physical accuracy, such as forensic analysis, scientific imaging, or safety-critical perception, where hallucinated details or slight structural inconsistencies may be unacceptable. Collectively, our findings position Nano Banana Pro as a powerful zero-shot contender for low-level vision, capable of delivering high perceptual quality across a remarkably diverse set of tasks without retraining. At the same time, achieving the pixel-level fidelity of domain specialists remains a significant challenge. Rather than framing this as a binary competition between generative and regression paradigms, our results suggest a more promising direction: strategic integration. Future robust vision systems may combine the semantic imagination of foundation generative models with the physical constraints and precision of task-specific networks, leveraging the strengths of both. In summary, this study provides the first comprehensive empirical answer to the question: Is Nano Banana Pro a low-level vision all-rounder? Our answer is nuanced. Nano Banana Pro substantially raises the upper bound of perceptual quality in zero-shot low-level vision, but has yet to establish a stable lower bound suitable for high-fidelity, safety-critical applications. By systematically documenting these strengths and limitations across 14 tasks and 40 datasets, this report offers a detailed reference point for future research on foundation models in low-level vision, and calls for the development of new evaluation frameworks that better reflect perceptual realism, semantic consistency, and downstream utility.
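The tension between pixel-aligned metrics and perceptual quality described in this abstract can be illustrated with a small worked example. The sketch below is not taken from the paper: it uses the standard PSNR definition on a hypothetical 8-pixel grayscale signal, and the pixel values, the function name `psnr`, and the two test perturbations are all assumptions chosen for illustration. It shows that a uniform brightness shift, which leaves image structure untouched, scores far worse than a small structural perturbation.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized flat pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixel positions.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 8-pixel grayscale reference (illustrative values only).
reference = [10, 50, 90, 130, 170, 210, 90, 50]

# Same structure, uniformly brightened by 10 gray levels (a global color shift).
shifted = [p + 10 for p in reference]

# Mean brightness preserved, but every pixel perturbed by +/-2 gray levels.
perturbed = [p + d for p, d in zip(reference, [2, -2, 2, -2, 2, -2, 2, -2])]

psnr_shifted = psnr(reference, shifted)      # MSE = 100, roughly 28.1 dB
psnr_perturbed = psnr(reference, perturbed)  # MSE = 4, roughly 42.1 dB
print(f"global shift: {psnr_shifted:.2f} dB, local perturbation: {psnr_perturbed:.2f} dB")
```

Under these assumptions the globally shifted signal loses about 14 dB relative to the locally perturbed one, even though its edges and textures are identical to the reference, which is consistent with the abstract's remark that global color shifts are penalized by full-reference metrics.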
Abstract: Objective: Scene text detection (STD) is a fundamental task in scene text reading and understanding, and plays an important role in enabling intelligent systems to perceive high-level semantic information from natural scenes. It provides essential technical support for various applications, such as autonomous driving, image retrieval, unmanned systems, and intelligent scene analysis. In recent years, with the rapid development of deep learning and visual representation modeling, STD has achieved substantial progress and attracted increasing research attention. Existing deep learning-based methods can generally be divided into regression-based, connected-component-based, and segmentation-based approaches. Among them, segmentation-based methods have become a mainstream solution due to their flexibility in pixel-level prediction and strong capability in detecting arbitrarily shaped text instances. However, most existing segmentation-based methods still implicitly assume that multi-scale features can be optimized under a unified supervision signal and fused within a shared semantic space. Such a strategy overlooks the intrinsic semantic heterogeneity across feature hierarchies. Specifically, low-level features contain rich spatial details but are vulnerable to pixel-level noise, whereas high-level features encode stronger semantic information but may lose fine-grained structural cues. Directly supervising and fusing these heterogeneous representations may lead to interference between low-level pixel noise and high-level semantic constraints, thereby weakening feature fusion effectiveness and reducing inference stability. From the perspective of representation learning, multi-scale features are not merely homogeneous representations at different spatial resolutions, but heterogeneous representations associated with different semantic granularities. 
Therefore, effective STD requires explicit modeling, alignment, and coordination of semantic information across different feature levels. Method: To address the above issues, we propose an efficient and effective STD framework, which consists mainly of a branch-wise distribution-aware modeling (BDM) module and a cross-semantic global knowledge integration (CGKI) module. Considering that conventional multi-scale text detection methods often ignore the semantic discrepancies among different feature levels at the supervision stage, the BDM module is designed from the perspective of label modeling. Specifically, it transforms pixel-level binary segmentation annotations into hierarchical distribution-aware supervision signals, enabling feature branches at different scales to independently learn text distribution semantics that are consistent with their corresponding receptive fields. In this way, the semantic interference among heterogeneous multi-scale features can be alleviated, and semantically aligned feature representations can be provided for subsequent feature fusion. Notably, the BDM module is only employed during the training stage and removed during inference, thus improving detection accuracy without introducing additional computational overhead. On the basis of intra-scale distribution-aware semantic alignment, we further design the CGKI module to explicitly model the collaborative relationships among different semantic levels. This module first enhances the representation of each scale within its own semantic space, and then performs controlled cross-scale interaction through adaptive scale reweighting and adjacent-scale information injection. By selectively recalibrating the importance of different scales and introducing complementary contextual information from neighboring levels, the CGKI module achieves global coordination and stable fusion of multi-scale semantics while maintaining a controllable computational cost. 
A ResNet equipped with deformable convolutions and a feature pyramid network (FPN) is adopted as the backbone. For the training stage, the model is either directly trained on public datasets for ablation studies or pre-trained on Synth150k for 10 epochs and then fine-tuned on real-world datasets for comparison experiments. SGD with an initial learning rate of 0.001 and a poly learning rate schedule is used for optimization, together with data augmentation strategies including random rotation, cropping, and flipping. Result: The proposed method is extensively evaluated against more than ten advanced methods on five widely used public text detection benchmarks, including MSRA-TD500, CTW1500, Total-Text, ICDAR2015, and MPSC. Precision (P), recall (R), and F-measure (F) are adopted as the evaluation metrics, where higher values indicate better detection performance. All inference tests are conducted on a single NVIDIA GTX 1080Ti GPU with an Intel i7-6800K CPU to ensure a consistent evaluation environment. Experimental results show that the proposed method consistently outperforms existing efficient STD methods on the above datasets while maintaining competitive inference speed. Specifically, on Total-Text, the proposed method improves the F-measure by 4.2% and 2.7% compared with DBNet++ and FEPE, respectively. On MSRA-TD500, it achieves F-measure improvements of 5.0% and 4.1% over DBNet++ and FEPE, respectively. On CTW1500, it gains 2.6% and 1.0% in F-measure against DBNet++ and FEPE, respectively. On ICDAR2015, it achieves F-measure gains of 2.8% and 2.7% relative to DBNet++ and FEPE, respectively. On the industrial scene text dataset MPSC, the proposed method surpasses existing advanced methods ISTD-DLA, ODM, and RT-DETR by 1.0%, 3.8%, and 1.3% in F-measure, respectively. Ablation studies on MSRA-TD500 further demonstrate the effectiveness of the proposed modules, confirming that BDM and CGKI can enhance multi-scale feature representation and fusion. 
In addition, visualization results on these datasets show that the proposed method can generate complete and accurate text boundaries in different scenes. Cross-dataset experiments further verify the generalization ability of the proposed method, where it achieves superior performance over existing representative methods such as ZTD, MTD, and CM-Net under both line-level and word-level annotation settings. Conclusion: This work presents an efficient and effective scene text detection method. By integrating BDM and CGKI, the proposed method enhances the semantic consistency and collaborative fusion of multi-scale text features, thereby improving the detection of complex text. Experimental results on multiple public benchmarks demonstrate that the proposed method achieves competitive detection accuracy and inference speed, outperforming existing efficient scene text detection methods. In future work, we will explore the integration of the proposed detection model with efficient text recognition models to establish an end-to-end efficient framework for text spotting.
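The abstract above mentions SGD with an initial learning rate of 0.001 and a poly learning rate schedule but does not give the schedule's parameters. As a minimal sketch, the commonly used polynomial decay form is shown below; the exponent `power=0.9`, the step counts, and the function name `poly_lr` are assumptions for illustration and are not stated in the source.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial ("poly") decay: base_lr * (1 - step / total_steps) ** power.

    power=0.9 is a common default and is an assumption here, not a value
    reported in the abstract.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return base_lr * (1.0 - step / total_steps) ** power


# Hypothetical run of 1000 optimizer steps starting from the reported 0.001.
schedule = [poly_lr(0.001, step, 1000) for step in range(0, 1001, 250)]
for step, lr in zip(range(0, 1001, 250), schedule):
    print(f"step {step:4d}: lr = {lr:.6f}")
```

The rate starts at the base value, decreases monotonically, and reaches zero at the final step; an exponent below 1 keeps the rate relatively high for most of training before a faster drop near the end.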
Abstract: Remote sensing technology serves as the core mechanism for the observation of the Earth and the understanding of surface environments. It plays an irreplaceable role in critical fields such as natural disaster monitoring, urban planning, resource exploration, and ecological protection. Over the past decade, driven by the rapid advancement of deep learning, the intelligent interpretation of remote sensing images has achieved breakthrough progress in fundamental vision tasks. However, the traditional deep learning paradigm is intrinsically built upon a closed-set assumption, meaning that models can only recognize a predefined and human-annotated set of fixed categories during the inference stage. When confronted with highly complex surface environments in real-world Earth observation scenarios, dynamic object morphologies, and rare ground objects with long-tail distributions, this traditional paradigm not only incurs prohibitive costs for the construction of massive pixel-level annotated datasets but also easily falls into the trap of domain-specific overfitting. Consequently, the generalization and response capabilities of this paradigm are severely challenged by unseen categories or sudden events, making it inadequate to meet the highly dynamic interpretation demands of the open world. In recent years, the rapid development of vision-language models has catalyzed a paradigm shift in artificial intelligence from task-specific models to general-purpose perception models. By mapping visual representations and natural language into a unified feature space through contrastive learning on massive image-text pairs, these models have broken the constraints of discrete labels. This enables a direct response to arbitrary natural language prompts, a capability known as open vocabulary perception. 
While this technology has demonstrated remarkable zero-shot generalization and cross-modal reasoning capabilities in the natural image domain, the direct application of these general vision-language models to the remote sensing domain encounters a severe domain gap. The uniqueness of remote sensing data poses multiple challenges to the adaptability of existing models. First, the distinct overhead imaging perspective causes drastic variations in object scale and complex background textures. Second, Earth observation tasks rely on multi-source heterogeneous data from SAR, multispectral or hyperspectral, and thermal infrared sensors. The underlying physical mechanisms of these sensors exceed the inherent inductive biases of models pre-trained solely on natural RGB images. Third, remote sensing objects often possess strong geospatial attributes and complex topological associations. To address these critical challenges, this paper provides a comprehensive and systematic review of recent advancements in open vocabulary perception for remote sensing images. We first delve into the foundational aspect of this field: vision-language pre-training for remote sensing. We extensively review the evolution of construction strategies for large-scale datasets. We highlight the transition from limited, human-annotated image-text pairs to massive datasets generated via heuristic rules, the integration of geographic metadata, and advanced multi-modal large language models. This includes innovative approaches that leverage OpenStreetMap, geographical coordinates, etc., to produce fine-grained, physics-aware descriptions across multiple modalities. Concurrently, we systematically summarize the progression of pre-training methodologies. While early approaches primarily focused on simple domain adaptation through continuous pre-training, recent state-of-the-art frameworks emphasize physics-aware encoding, fine-grained multi-level consistency learning, and geography-enhanced architectures. 
These frameworks better capture the intricate spatial relationships and modality diversities inherent in Earth observation data. Subsequently, this review conducts an in-depth analysis of the adaptation and optimization of open vocabulary perception techniques across a wide spectrum of crucial downstream tasks. For zero-shot scene classification and cross-modal retrieval, we discuss advanced strategies designed to mitigate the high intra-class similarity and complex inter-class variances typical in remote sensing. We emphasize the shift towards fine-grained local-global alignment, hard negative mining, dynamic soft-labeling, and prompt engineering. In the realm of open vocabulary image segmentation, we categorize the existing literature into training-based methods and training-free or annotation-free paradigms. Training-based methods leverage base categories to adapt models while preventing catastrophic forgetting through pseudo-label distillation and knowledge retention mechanisms. Training-free paradigms synergize foundational models, such as CLIP and the Segment Anything Model, to extract structural masks and align semantics without the updating of network weights. For open vocabulary object detection and remote sensing visual grounding, we explore the approaches of researchers to tackle extreme scale variations, arbitrary orientations, and dense object distributions. These approaches include innovative frameworks for pseudo-label generation, multi-scale feature alignment, cross-modality context modeling, and interactive grounding mechanisms. Furthermore, we examine open vocabulary change detection, where recent studies employ either combinations of pre-trained vision-language models or generative models to generate large-scale data. These approaches aim to identify arbitrary, text-specified surface transitions and simulate complex spatiotemporal changes without reliance on massive and costly bi-temporal pixel-level annotations. 
We also briefly touch upon emerging open vocabulary applications in three-dimensional urban point clouds and cross-domain archaeological remote sensing, illustrating the expanding horizon of this technology. Despite remarkable progress, the field of open vocabulary perception for remote sensing remains in a crucial developmental stage and faces several critical bottlenecks. This paper critically identifies the limitations of current research, including the severe scarcity of high-quality and geographically balanced training data. This scarcity leads to geographic biases and performance degradation in data-poor regions. Additionally, there is a prominent absence of genuinely fine-grained and long-tailed open vocabulary evaluation benchmarks that can accurately reflect the performance of a model in extreme or unknown real-world scenarios. The inadequate physical understanding of heterogeneous modalities and the inherent black-box unreliability of current large models in high-stakes decision-making scenarios further constrain practical deployments. To chart the course for future research, we outline several promising and essential trajectories. First, we anticipate a paradigm shift towards generative perception driven by multi-modal large language models. This shift unifies various spatial localization tasks into the direct generation of coordinate sequences or geometric property tokens to fully exploit the logical reasoning capabilities of foundational models. Second, we strongly advocate for the construction of rigorous, real-world, and fine-grained evaluation systems that incorporate complex spatiotemporal logic, diverse geographic conditions, and comprehensive evaluation metrics. Third, the development of omni-modal foundation models that explicitly integrate physical priors and deep learning is deemed crucial for the achievement of all-weather and all-spectrum Earth observation, moving beyond pure data-driven approaches. 
Furthermore, we highlight the necessity to extend perception from static spatial analysis to dynamic spatiotemporal causal reasoning to decode the evolutionary processes of the Earth. Finally, addressing the severe conflict between the massive parameter scale of foundation models and the limited computing power of aerospace edge devices requires focused research into efficient, trustworthy, and safe edge-cloud collaborative computing architectures. By systematically synthesizing these advancements and challenges, this comprehensive review aims to serve as a foundational roadmap for researchers and practitioners. It accelerates the transition of the intelligent interpretation of remote sensing from isolated, closed-set recognition toward artificial general intelligence capable of highly reliable, dynamic, and open-world perception.
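The open vocabulary mechanism summarized in this review (images and text mapped into a shared feature space, then matched against arbitrary prompts) can be sketched in a few lines. The example below is an illustration only: the 3-dimensional vectors stand in for the outputs of pretrained image and text encoders, and the prompts, values, and the function name `classify_open_vocabulary` are all hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def classify_open_vocabulary(image_embedding, text_embeddings):
    """Return the prompt most similar to the image, plus all cosine scores."""
    image = normalize(image_embedding)
    scores = {
        prompt: sum(a * b for a, b in zip(image, normalize(vector)))
        for prompt, vector in text_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Toy embeddings (hypothetical values); a real system would obtain these
# from the text and image encoders of a vision-language model.
text_embeddings = {
    "an aerial image of an airport": [0.9, 0.1, 0.2],
    "an aerial image of farmland": [0.1, 0.9, 0.1],
    "an aerial image of a harbor": [0.2, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.3]

label, scores = classify_open_vocabulary(image_embedding, text_embeddings)
print(label)
```

Because the candidate categories are supplied as free-form text at inference time, adding a new class amounts to adding a new prompt to the dictionary, with no retraining of the classifier head.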
Liu Zhen, Yang Qinzhe, Liu Liqin, Liu Chenyang, Zou Zhengxia, Shi Zhenwei
DOI:10.11834/jig.260078
Abstract: Objective: Giant pandas serve as a flagship species for global biodiversity conservation and play a key role in assessing ecosystem integrity and conservation effectiveness. Accurate and reliable detection of giant pandas in camera-trap images is thus essential for long-term wildlife monitoring, population assessment, and adaptive management of protected areas. In recent years, deep learning-based object detection algorithms have demonstrated remarkable success. However, directly deploying general-purpose detection models in wild scenarios remains challenging due to two fundamental issues. First, giant pandas are a rare species, and acquiring large volumes of high-quality, finely annotated training data from the wild is extremely costly, time-consuming, and often impractical. Second, there exists a substantial domain gap between commonly used pre-training datasets and unconstrained camera-trap images captured in natural habitats. To alleviate data scarcity and improve robustness in wild environments, we propose a unified generation–detection method termed PandaGenDet. Method: Rather than treating data augmentation and detection as independent components, the core idea of PandaGenDet is to improve detection robustness through multi-level collaboration between generative and discriminative models. Specifically, PandaGenDet consists of three complementary components operating at different representational levels. First, at the data level, we introduce a class-conditioned image generation module equipped with a Category-guidance Mechanism. This mechanism explicitly incorporates semantic category information into the generative process, guiding the synthesis of panda images with improved semantic consistency and target realism, making them more suitable as high-quality supplementary training samples. 
Second, at the image level, we design an Image Enhancer module to reduce the domain discrepancy between wild camera-trap images and the visual priors learned from large-scale pre-training datasets. The Image Enhancer is implemented as a modular and easily integrable component that performs a learnable image-level mapping prior to detection. By adaptively reshaping low-level and mid-level image statistics, this module maps target-domain images to representations that are more compatible with the detector’s pre-trained weights, without requiring any modification to the detector architecture. During training, the Image Enhancer and the detector are jointly optimized in an end-to-end manner, with all detector parameters fully fine-tuned from their pre-trained initialization. Third, at the feature level, we propose a Generative Feature Injector, which leverages the trained generative model as a multi-scale feature extractor. Hierarchical feature representations learned during the image generation process are extracted and injected into the detection backbone via a PSPNet (pyramid scene parsing network) and FPN (feature pyramid network) fusion network. This design lets the detector draw on the rich semantic and structural priors embedded in the generative model by transferring its multi-scale features into the detection network. Together, these mechanisms form a unified and extensible method for robust wildlife detection. Result: We conduct extensive experiments using Grounding DINO, a modern open-set object detection model, as the detection backbone. Evaluations are performed on the giant panda subset of the LoTE-Animal (long time-span dataset for endangered animal) dataset, which contains challenging camera-trap images representative of real-world conservation scenarios. Experimental results demonstrate that the proposed Category-guidance Mechanism significantly improves generative quality. 
Specifically, KID (kernel inception distance) decreases from 0.059 to 0.038, while FID (Fréchet inception distance) is reduced from 147.00 to 123.13, indicating that the synthesized images achieve higher fidelity and improved semantic consistency with real wild panda images. These improvements directly translate into more effective training data for detection. When the Image Enhancer is integrated into the Grounding DINO detector, notable gains in detection performance are observed. On the LoTE-Animal panda subset, mAP (mean average precision) increases from 88.8 to 89.7, while mAR (mean average recall) improves from 94.9 to 95.5, confirming the effectiveness of image-level domain adaptation. Further incorporating the Generative Feature Injector leads to additional performance improvements, with the detector achieving an mAP of 89.8, outperforming both the baseline and image-enhancer-only configurations. Finally, training the detector using a mixture of real images and high-quality synthetic images generated by the full PandaGenDet pipeline yields the best overall performance, achieving a final mAP of 90.1. Qualitative analyses further reveal that synthesized images exhibit more accurate panda poses, better integration with realistic environmental textures, and fewer semantic artifacts. Detection visualizations demonstrate high localization accuracy in challenging scenarios, including dense vegetation, low illumination, and partial occlusion. Furthermore, the final model demonstrates strong robustness in open-set detection, maintaining stable performance even when encountering object categories not present in the training dataset. Conclusion: This study presents PandaGenDet, a unified collaborative framework from data synthesis to object detection for giant panda monitoring in complex wild environments. 
By integrating data-level synthesis, image-level enhancement, and feature-level injection in a unified manner, the proposed method effectively addresses two major bottlenecks in real-world wildlife detection: the scarcity of annotated data and the presence of severe domain gaps between pre-training and deployment scenarios. Extensive experiments on camera-trap datasets demonstrate that PandaGenDet substantially improves both synthetic image fidelity and detection accuracy, while also enhancing open-set robustness. This three-level collaborative strategy significantly improves the detection performance of general-purpose models in complex wild environments.
Abstract: With the rapid emergence of multimodal large models (MLLMs), massive heterogeneous data are proliferating across various industrial and scientific domains. In this context, multi-view clustering (MVC) serves as a cornerstone technology for unsupervised knowledge discovery and latent correlation mining. Currently, MVC is undergoing a profound paradigm shift. Traditional surveys predominantly focus on the horizontal categorization of algorithmic network structures. However, this approach often fails to reveal the intrinsic evolutionary logic across different technological eras. Departing from these conventions, this paper proposes a prior-driven theoretical perspective to systematically reconstruct the developmental trajectory of MVC over the past two decades. This is achieved through a trans-paradigm analytical framework of Geometry-Semantics-Cognition. In the initial stage of shallow structural mining, the research paradigm focused on explicit mathematical constraints within original or kernel-induced feature spaces. Euclidean space methods, such as multi-view k-means and non-negative matrix factorization (NMF), identified global prototypes by minimizing squared error or Frobenius norm reconstruction loss. In contrast, affine space methods leveraged self-representation properties to model data as a union of low-dimensional subspaces, while manifold space techniques utilized spectral graph theory to transform clustering into optimal graph-cut problems by preserving local topological correlations. As the field transitioned into deep spatial modeling based on semantic collaborative priors, researchers utilized the powerful non-linear mapping capabilities of deep neural networks to project heterogeneous data into high-order semantic spaces. This evolution encompasses several distinct research paradigms. 
(1) Embedding space research focuses on deep subspace clustering, employing autoencoders to learn discriminative features while maintaining cross-view consistency. (2) Latent space methods utilize probabilistic generative models, such as Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN), to align latent distributions and infer missing view information through adversarial games. (3) Augmented space paradigms introduce contrastive learning to maximize mutual information between views via InfoNCE-like losses, thereby enhancing representation robustness. (4) Topological space studies leverage Graph Neural Networks (GNN) to synergistically mine intra-view geometric structures and inter-view complementary semantics. Moving into the current era of multimodal large models, this paper looks ahead to deep alignment based on cognitive priors, where MLLMs are viewed not merely as data sources but as knowledge bases containing human-level common sense and logical reasoning abilities. We systematically elucidate the infrastructural role of MVC in empowering massive data governance. Specifically, MVC facilitates semantic deduplication to enhance data quality and employs token-level clustering to optimize mixture-of-experts (MoE) routing for expert specialization. Furthermore, MVC enables hierarchical semantic chunking, which is critical for precise document retrieval within retrieval-augmented generation (RAG) frameworks. Beyond these applications, we analyze the potential for MLLM logical reasoning to back-propagate into clustering tasks. 
This synergy elevates MVC from pure statistical feature alignment to a new dimension of knowledge-driven cognitive logic consistency. Beyond theoretical frameworks, the review highlights the transformative impact of MVC in diverse real-world scenarios, ranging from multi-omics fusion in smart healthcare for cancer subtyping and spatial-spectral fusion in remote sensing for urban functional zone identification to cross-perspective pedestrian recognition in public security and unsupervised anomaly detection in industrial IoT networks. Despite these advancements, achieving a transition from laboratory benchmarks to robust industrial infrastructure requires addressing several core challenges. First, the scaling bottleneck must be overcome: the O(N²) or O(N³) computational complexity of traditional methods must be reduced to linear levels to handle million-scale web data. Second, open environments demand extreme robustness, covering not only severely incomplete views but also the pervasive long-tailed category imbalance in which minority abnormal samples are often overwhelmed by dominant patterns. Third, the evaluation system urgently needs to be rebuilt, moving away from legacy small-scale datasets with shallow features toward native heterogeneous multimodal benchmarks that reflect real-world weak alignment and complex noise distributions. Ultimately, this survey aims to provide a novel research roadmap for theoretical innovation and engineering practice, advocating for a transition toward intelligent decision systems characterized by high-level semantic decoupling and cognitive logic alignment in the era of multimodal large models.
Keywords: Multi-view clustering (MVC); Prior-driven learning; Multimodal large models (MLLMs); Geometric structure; Semantic collaboration; Cognitive alignment
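The survey above refers to contrastive multi-view methods that use InfoNCE-like losses to tie together representations of the same sample across views. A minimal pure-Python sketch of such a loss follows; it is a generic formulation rather than any specific method from the survey, and the temperature of 0.5, the 2-dimensional toy embeddings, and the function names are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss; view_a[i] and view_b[i] are the positive pair."""
    losses = []
    for i, anchor in enumerate(view_a):
        # Similarity of the anchor to every sample in the other view.
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the true cross-view partner.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings for two samples observed in two views (hypothetical values).
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_view_b = [[0.9, 0.1], [0.1, 0.9]]     # partners agree across views
misaligned_view_b = [[0.1, 0.9], [0.9, 0.1]]  # partners swapped

loss_aligned = info_nce(view_a, aligned_view_b)
loss_misaligned = info_nce(view_a, misaligned_view_b)
print(f"aligned: {loss_aligned:.4f}, misaligned: {loss_misaligned:.4f}")
```

The loss is lower when each sample is most similar to its own counterpart in the other view, which is the cross-view consistency that these contrastive clustering methods optimize for.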
Liu Erhu, Yuan Sijie, Li Haowen, Xu Shengjun, Hu Yu, Yang Tiantian
DOI:10.11834/jig.260045
摘要:ObjectiveBuilding extraction from remote sensing imagery is a fundamental and challenging task in remote sensing image interpretation and has been widely applied in urban planning, land-use analysis, disaster assessment, and geographic information system updating. With the rapid development of high-resolution and very-high-resolution (VHR) remote sensing sensors, buildings in aerial images exhibit large variations in scale, shape, texture, and spatial distribution. In addition, complex backgrounds, shadow interference, and spectral similarity between buildings and surrounding objects further increase the difficulty of accurate building extraction. Although convolutional neural network (CNN) based semantic segmentation methods have achieved remarkable progress in this field, existing approaches still face two major limitations. First, many networks lack sufficient capability to model multi-scale building features, resulting in missed detections or incomplete segmentation of buildings with large scale variations. Second, due to repeated down-sampling operations and insufficient boundary supervision, the extracted building boundaries are often blurred or discontinuous, which degrades the geometric accuracy of segmentation results. To address these issues, this paper proposes a novel remote sensing building extraction network that integrates multi-level feature extraction and edge enhancement, named the multi-level feature extraction and Edge-enhanced network (MFEE-Net), with the aim of improving both overall segmentation accuracy and boundary quality.MethodThe proposed MFEE-Net adopts an encoder–decoder architecture and is specifically designed to jointly enhance multi-scale feature representation and boundary detail preservation. In the encoding stage, a lightweight multi-scale feature extraction encoder is constructed using a newly designed residual multi branch convolution block (ResMBC) as the fundamental building unit. 
The ResMBC introduces parallel convolutional branches with different receptive fields, enabling the network to capture building structures and texture patterns at multiple spatial scales while retaining the local modeling advantages of standard convolution. The residual connection further facilitates stable training and effective feature propagation, allowing the encoder to generate rich and discriminative feature representations with relatively low computational cost. To effectively utilize features from different encoding depths, an interlayer feature fusion module (IFFM) is introduced between the encoder and decoder. Unlike simple skip connections, the IFFM jointly models spatial information and channel-wise correlations, enabling adaptive fusion of heterogeneous features from different layers. By enhancing feature complementarity and reducing semantic inconsistencies between low-level spatial details and high-level semantic representations, the IFFM alleviates information loss and improves the robustness of feature transmission during the decoding process. In the decoding stage, an edge-aware enhancement module (EAEM) is incorporated to explicitly refine building boundaries. The EAEM emphasizes edge-related features by enhancing boundary-sensitive responses and suppressing background interference, thereby improving the continuity and clarity of extracted building contours. Furthermore, a joint loss function with an edge-constrained auxiliary term is employed during training. This loss formulation encourages the network to simultaneously optimize building foreground regions and boundary details, leading to a coordinated improvement in segmentation completeness and edge precision. Result: Extensive experiments were conducted on two widely used public benchmark datasets, namely the WHU Aerial Building Dataset and the Massachusetts Building Dataset, to evaluate the performance of the proposed MFEE-Net.
Quantitative comparisons were performed against multiple state-of-the-art building extraction methods using standard evaluation metrics, including Intersection over Union (IoU), F1-score, precision, and recall. On the WHU Aerial Building Dataset, MFEE-Net achieved an IoU of 91.13%, an F1-score of 95.36%, a precision of 95.81%, and a recall of 94.92%, demonstrating superior performance in both overall accuracy and boundary consistency. On the Massachusetts Building Dataset, which contains lower-resolution imagery and poses greater challenges due to blurred edges and complex backgrounds, MFEE-Net attained an IoU of 75.46%, an F1-score of 86.01%, a precision of 87.84%, and a recall of 84.26%. These results indicate that the proposed network maintains robust performance under different spatial resolutions and scene complexities. Qualitative visual comparisons further reveal that MFEE-Net is capable of producing more complete building regions with clearer and more continuous boundaries, particularly in scenes with dense buildings, complex structures, and significant scale variations. Ablation studies validate the effectiveness of each proposed component, confirming that multi-level feature extraction, interlayer feature fusion, and edge-aware enhancement jointly contribute to performance improvements. Conclusion: This study proposes a novel remote sensing building extraction network, MFEE-Net, which integrates multi-level feature extraction and edge enhancement within an encoder–decoder framework. By leveraging a lightweight multi-scale encoder, an interlayer feature fusion strategy, and an edge-aware enhancement mechanism with boundary supervision, the proposed network effectively addresses the challenges of scale variation and boundary ambiguity in remote sensing building extraction.
Experimental results on public benchmark datasets demonstrate that MFEE-Net achieves competitive and stable performance, significantly improving both segmentation accuracy and boundary quality. The proposed approach provides an effective solution for high-precision building extraction in complex remote sensing scenarios and offers potential for practical applications in urban analysis and geospatial information processing.
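The four scores reported for MFEE-Net can be cross-checked against one another. The following is a minimal sketch, assuming all four metrics are computed for the building class from the same pixel-level confusion counts; under that assumption F1 = 2PR/(P+R) and IoU = F1/(2 - F1). The function names are illustrative and are not taken from the paper.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    return 2 * precision * recall / (precision + recall)


def iou_from_f1(f1):
    """IoU implied by F1 when both come from the same TP/FP/FN counts.

    F1 = 2TP / (2TP + FP + FN) and IoU = TP / (TP + FP + FN),
    which gives IoU = F1 / (2 - F1).
    """
    return f1 / (2 - f1)


# Scores quoted in the abstract, converted from percentages to fractions.
reported = {
    "WHU": {"precision": 0.9581, "recall": 0.9492, "f1": 0.9536, "iou": 0.9113},
    "Massachusetts": {"precision": 0.8784, "recall": 0.8426, "f1": 0.8601, "iou": 0.7546},
}

for name, m in reported.items():
    f1 = f1_from_precision_recall(m["precision"], m["recall"])
    iou = iou_from_f1(m["f1"])
    print(f"{name}: F1 from P/R = {f1:.4f}, IoU from reported F1 = {iou:.4f}")
```

For both datasets the derived values agree with the reported F1 and IoU to within rounding, which is consistent with the metrics having been computed from a single set of confusion counts.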
Zhang Zhihao, Fu Zhitao, Ji Yashuai, Zhang Xinshan, Tang Bohui
DOI:10.11834/jig.250573
Abstract: Objective: Synthetic Aperture Radar (SAR), as an active microwave remote sensing technology, is capable of all-weather and day-and-night data acquisition of the Earth's surface. However, due to its coherent imaging principle, the SAR receiver inevitably introduces significant speckle noise when processing the backscattered signals. Furthermore, SAR imagery primarily reflects the dielectric properties and geometrical structures of targets, rather than the spectral characteristics familiar to the human visual system. This leads to fundamental differences between SAR and optical images in terms of imaging mechanisms, physical properties, and image characteristics. Such differences severely limit their joint application and analysis in subsequent tasks. SAR-to-optical image translation, as a data processing technique capable of converting heterogeneous remote sensing images into data with homogeneous image characteristics, can effectively address the modality gap between SAR and optical imagery. It provides crucial technical support for subsequent applications and analyses such as SAR and optical image matching and SAR and optical image change detection. Therefore, research on SAR-to-optical image translation is of significant importance. However, existing SAR-to-optical image translation methods typically employ a single generator structure, which struggles to simultaneously maintain global semantic consistency and local textural-structural details. This often leads to generated images with semantic distortions and blurred details, consequently limiting their reliability and practical utility in real-world applications. To address this issue, this paper proposes a unidirectional knowledge transfer generative adversarial network (UKT-GAN) for dual-fidelity SAR-to-optical image translation.
The model aims to achieve dual fidelity in both global and local dimensions of the generated imagery through unidirectional knowledge transfer between its dual-branch network architecture. Method: The proposed method employs a dual-branch generation framework that allocates two fundamentally distinct tasks of local texture-structure detail reconstruction and global semantic information preservation to a Detail Reconstruction Subnetwork and a Semantic Preservation Subnetwork, respectively. This architectural design structurally overcomes the inherent limitation of single-branch generative adversarial networks in simultaneously maintaining global semantic consistency and local textural-structural fidelity. The detail reconstruction branch consists of a generator and a discriminator. The generator employs a U-Net architecture embedded with CBAM modules, which utilizes skip connections to achieve multi-scale fusion of feature maps from the encoder and decoder, thereby preserving richer spatial details. The discriminator adopts a shallow two-layer convolutional structure to focus on assessing the authenticity of local textural and structural details, while pixel-level loss constraints are applied to ensure the fidelity of local detail reconstruction. Similarly, the semantic preservation branch also comprises a generator and a discriminator. Its generator utilizes a ResNet framework integrated with PE-Transformer modules, leveraging the multi-head attention mechanisms and residual blocks within the PE-Transformer to perform multi-dimensional and deep extraction and integration of semantic features from the imagery. This ensures the complete and accurate retention of semantic information throughout the feature extraction and transmission process.
The discriminator employs a deeper four-layer convolutional network to evaluate the overall structural rationality and global semantic consistency of the generated images, supplemented by a combined pixel-level and feature-level loss constraint to maintain global semantic information consistency. Finally, a unidirectional consistency loss transfers the detail capture capability from the detail reconstruction branch to the semantic preservation branch. This process optimizes and refines the local textural and structural details of the optical images generated by the Semantic Preservation Subnetwork, ensuring that these generated images can simultaneously maintain both global semantic consistency and local textural-structural fidelity, thereby enhancing the overall quality of the imagery produced by the Semantic Preservation Subnetwork. Result: To validate the effectiveness of UKT-GAN in SAR-to-optical image translation, this paper conducts a comparative analysis with six mainstream image translation methods (Pix2pix, DCLGAN, Parallel-GAN, Conditional Diffusion, StegoGAN, and HVT-cGAN) on two public datasets: SEN1-2 and WHU-OPT-SAR. The evaluation employs four metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), and Root Mean Square Error (RMSE), assessing the quality of generated images from four aspects: global quality, structural similarity, deep feature fidelity, and pixel-level accuracy. On the farmland and mountain subsets of SEN1-2 and the WHU-OPT-SAR dataset, the optical images generated by UKT-GAN achieved the best results across all four metrics: PSNR, SSIM, LPIPS, and RMSE. Particularly notable improvements were observed in the LPIPS metric, with an improvement of at least 2.3% over the second-best results (lower LPIPS values indicate better performance).
On the building and forest subsets of SEN1-2, the optical images generated by UKT-GAN achieved the best results on SSIM and LPIPS. These findings demonstrate that the proposed UKT-GAN can generate higher-quality optical images compared to six mainstream image translation methods. Conclusion: This paper proposes a Unidirectional Knowledge Transfer Generative Adversarial Network (UKT-GAN) for dual-fidelity SAR-to-optical image translation. Through unidirectional consistency loss, the model effectively transfers the local textural-structural reconstruction capability from the Detail Reconstruction Subnetwork to the Semantic Preservation Subnetwork. This enables the generation of optical images that simultaneously maintain both global semantic information and local textural-structural details during testing, while requiring only the single Semantic Preservation Subnetwork to be deployed. Furthermore, experimental results demonstrate that compared to other methods, the proposed UKT-GAN can generate optical images with clearer structures and superior overall quality when processing different land cover types. Moreover, when confronted with data distribution shifts caused by varying sensor characteristics and imaging configurations, UKT-GAN maintains stable translation performance, exhibiting strong generalization capability. The open-source code for this paper is available at: https://www.scidb.cn/s/YNjqIf.
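Two of the four evaluation metrics named in this abstract, RMSE and PSNR, are tied together by a fixed formula. The sketch below assumes 8-bit images with a peak value of 255, represented as flat lists of pixel intensities; the helper names and the sample pixel values are illustrative and do not come from the paper or its code release.

```python
import math


def rmse(reference, generated):
    """Root mean square error between two equally sized pixel sequences."""
    if len(reference) != len(generated):
        raise ValueError("images must have the same number of pixels")
    squared = [(r - g) ** 2 for r, g in zip(reference, generated)]
    return math.sqrt(sum(squared) / len(squared))


def psnr(reference, generated, peak=255.0):
    """Peak signal-to-noise ratio in dB: 20 * log10(peak / RMSE)."""
    error = rmse(reference, generated)
    if error == 0:
        return math.inf  # identical images
    return 20 * math.log10(peak / error)


# Hypothetical pixel values, used only to exercise the functions.
optical = [10, 20, 30, 40]
translated = [12, 18, 33, 36]
print(f"RMSE = {rmse(optical, translated):.3f}")
print(f"PSNR = {psnr(optical, translated):.2f} dB")
```

Because PSNR is a monotone function of RMSE for a fixed peak value, the two metrics rank methods identically on a single image; differences between them in a results table come from averaging over images.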
Xuan Enyun, Li You, Li Ziwei, Yao Mengmeng, Guo Renzhong
DOI:10.11834/jig.250531
Abstract: Objective: Co-speech holistic motion generation aims to simultaneously achieve expressive gestures and precisely synchronized facial expressions. These two tasks have fundamentally different natures. Gesture generation is non-deterministic, representing a one-to-many mapping where the same speech can correspond to various natural motions requiring high diversity. Meanwhile, facial expression generation, especially lip movements, is deterministic, representing a one-to-one mapping that requires precise correspondence with phonemes and demands high accuracy. Existing methods face three critical limitations. First, employing fixed architectural designs, such as unidirectional conditional flows, imposes rigid task relationships and hinders models from capturing the true dynamic connections between gestures and expressions. Second, using manually designed static loss weights cannot adapt to the dynamic changes in task importance during training. Third, over-relying on minimizing differences from ground truth data leads to gesture overfitting and suppresses diversity. These deficiencies force existing methods into unavoidable trade-offs between facial synchronization and gesture diversity. This research aims to develop a unified adaptive framework that autonomously models and dynamically balances the relationship between these two tasks through learnable uncertainty mechanisms, simultaneously satisfying the dual objectives of gesture diversity and expression accuracy without manual intervention. Method: We propose a novel diffusion-based framework leveraging uncertainty-based multi-task learning for adaptive task balancing in holistic motion generation. This represents the first application of uncertainty-based loss weighting to speech-driven holistic motion synthesis. Our core innovation treats gesture and facial expression generation as distinct tasks within a unified framework, allowing their relationship to emerge naturally during training.
The framework employs a denoising diffusion probabilistic model operating on concatenated gesture and facial expression representations. The architecture incorporates shared features, including WavLM audio representations, word embeddings, speaker identity, and timestep encoding, alongside task-specific features like Gaussian noise vectors and seed motion sequences, to capture both commonalities and distinct requirements of each task. Cross-local attention mechanisms capture long-range dependencies across timesteps and modalities, while self-attention layers refine task-specific patterns. The key innovation introduces learnable parameters representing task-dependent homoscedastic uncertainty for gestures and expressions respectively. The total training objective integrates the losses of both tasks, dynamically weighted by these uncertainty parameters. This formulation automatically balances task contributions, as larger uncertainty values reduce penalties to encourage diversity, while smaller values increase penalties to enforce precision. The uncertainty parameters are jointly optimized with model parameters, enabling the dynamic discovery of optimal task weighting without manual intervention. Result: Comprehensive evaluations on the 76-hour BEAT dataset, featuring 30 speakers and a 98/16/16 data split, demonstrate significant improvements. Our method achieved the highest gesture diversity (52.5) compared to MambaTalk (51.6), DiffSHEG (47.4), DSG (48.5), and CaMN (43.2), with the best semantic relevance score (SRGR: 0.324). For facial expressions, we obtained the lowest Fréchet Distance, outperforming MambaTalk, DiffSHEG and SAiD. Ablation studies confirm the critical role of uncertainty-based weighting, as removing it decreased gesture diversity from 52.5 to 47.2 and increased facial FD from 9.18 to 10.5. The learned uncertainty parameters converged to weights of 0.506 for gestures and 0.494 for expressions, demonstrating autonomous task balancing.
Applying our mechanism to DiffSHEG and MambaTalk also improved their gesture diversity, validating generalizability. Qualitative analysis shows our gestures exhibit substantially greater diversity than baselines, which closely imitate ground truth. User studies with 17 participants evaluating nine video groups confirmed overwhelming preference for our method across gesture diversity, facial synchronization, and overall quality dimensions. Conclusion: This research presents a novel adaptive diffusion framework successfully addressing the fundamental challenge of simultaneously achieving precise facial synchronization and diverse gesture generation. By introducing uncertainty-based learnable parameters within a multi-task learning paradigm, our method enables automatic optimization of task relationships, eliminating manual tuning while achieving superior performance in both deterministic expression synthesis and non-deterministic gesture generation. Experimental results demonstrate significant improvements in facial accuracy (FD: 9.18), gesture diversity (52.5), and semantic relevance (SRGR: 0.324), with user studies confirming enhanced realism. This work provides an effective solution for creating lifelike virtual agents and opens new research directions for holistic motion generation through adaptive multi-task learning. The code for this paper is available at: https://doi.org/10.57760/sciencedb.j00240.00175.
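The abstract does not state the exact form of the uncertainty-weighted objective. The sketch below therefore uses the commonly cited homoscedastic-uncertainty formulation for multi-task learning, in which each task loss L_i is scaled by exp(-s_i) and regularized by s_i, with s_i = log(sigma_i^2) a learnable scalar. Treat this as an assumption about the general technique rather than the authors' implementation; all names and the sample loss values are hypothetical.

```python
import math


def task_weights(log_vars):
    """Effective per-task weights exp(-s_i); a larger uncertainty gives a smaller weight."""
    return [math.exp(-s) for s in log_vars]


def uncertainty_weighted_loss(losses, log_vars):
    """Sum over tasks of 0.5 * exp(-s_i) * L_i + 0.5 * s_i.

    The first term down-weights tasks with high uncertainty; the second
    term stops the uncertainties from growing without bound.
    """
    return sum(0.5 * math.exp(-s) * loss + 0.5 * s
               for loss, s in zip(losses, log_vars))


def optimal_log_vars(losses):
    """For fixed task losses, setting the derivative in s_i to zero gives s_i = log(L_i)."""
    return [math.log(loss) for loss in losses]


# Hypothetical gesture and expression losses at some training step.
losses = [2.0, 0.5]
best = optimal_log_vars(losses)
print("log-variances at the optimum:", best)
print("effective weights:", task_weights(best))
print("objective value:", uncertainty_weighted_loss(losses, best))
```

In training, the s_i values would be optimized by gradient descent together with the network parameters instead of being solved in closed form; the closed form is shown only to make the balancing behaviour visible, since a task with a larger loss ends up with a larger uncertainty and a smaller weight.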
Wang Zhixiang, Zhang Yayuan, Shang Wei, Yang Liu, Zhu Pengfei, Ren Dongwei
DOI:10.11834/jig.250659
Abstract: Objective: Arbitrary-scale video super-resolution (AVSR) aims to reconstruct high-resolution (HR) videos from low-resolution (LR) inputs under continuous scaling factors, including non-integer and asymmetric magnifications. Compared with fixed-scale video super-resolution (VSR), AVSR must generalize across a continuum of scales while maintaining temporal coherence amid complex motions, non-rigid deformations, and occlusions. In practice, three key issues often drive performance degradation: (i) scale generalization, where details plausible at one magnification may appear over-smoothed or over-sharpened at another; (ii) alignment error accumulation, where minor misalignments from optical-flow warping compound during recurrent propagation, causing flickering, ghosting, and motion artifacts; and (iii) robustness to unseen degradations, as real videos often diverge from training degradation models, complicating high-frequency restoration and temporal stability. This work develops an AVSR approach that enhances spatial detail recovery, temporal consistency, and scale generalization while maintaining deployment-friendly efficiency. Method: We propose SL-AVSR, an arbitrary-scale video super-resolution framework that integrates (1) an explicit multi-scale frequency prior derived from image Laplacian pyramids; (2) second-order composite-flow-guided propagation for temporal feature transfer; (3) second-order deformable alignment refinement for sub-pixel correction near motion boundaries and non-rigid regions; and (4) a scale-aware hyper-upsampling unit for efficient continuous scaling. SL-AVSR builds on a forward-looking recurrent architecture with a lightweight look-ahead mechanism. The current HR frame is reconstructed by fusing history-propagated features with a short window of future cues, avoiding the overhead of a full bidirectional pass.
First, to ensure scale-consistent guidance for detail restoration, we construct a Laplacian pyramid on the LR input to extract band-limited components representing multi-scale frequency information. These components are fused via learnable weights, enabling the network to prioritize appropriate frequency bands for different magnifications and content types. Unlike resource-intensive perceptual feature networks, this explicit prior is lightweight, interpretable, and imposes direct constraints on frequency discrepancies across scales. Second, to enhance alignment robustness in recurrent temporal aggregation, SL-AVSR employs second-order composite flow for feature propagation. Instead of using one-step displacements from single neighboring frames, we compose neighboring flows into two-step composite displacements, providing more stable cues under large motions and partial occlusions. This composite-flow-guided warping transfers features temporally, mitigating drift and curbing misalignment error accumulation. Third, to resolve residual misalignments persisting after flow-based warping—particularly around motion boundaries, non-rigid deformations, and occlusions—we introduce a second-order deformable alignment refinement module. This module predicts residual sampling offsets and modulation masks conditioned on warped features and the current context, enabling adaptive local corrections around flow-estimated displacements. The refinement is applied in both history propagation and look-ahead aggregation pathways, improving temporal feature correspondence and reducing motion artifacts. Fourth, to enable efficient continuous and asymmetric scaling, SL-AVSR incorporates a scale-aware hyper-upsampling unit. A compact hyper-network generates scale-specific convolution kernels that can be precomputed or cached for common output resolutions. 
This approach balances (i) direct interpolation (fast but limited in fidelity for large scales and fine textures) and (ii) implicit neural representation (INR)-based pixel-wise rendering (flexible but computationally expensive). By conditioning convolutional kernels on the target scale, SL-AVSR preserves convolution-based efficiency alongside arbitrary-scale flexibility. Result: Training is performed on standard VSR/AVSR benchmarks with continuous scale sampling, with evaluation under integer, non-integer, and asymmetric magnifications. Generalization is tested by applying models trained on one dataset directly to others without adaptation. Robustness is assessed under randomized synthetic degradations and real-world videos with unknown degradations. We report distortion metrics (PSNR, SSIM) and a perceptual metric (LPIPS) for fidelity and quality, alongside qualitative comparisons and time–space profile visualizations to evaluate temporal stability (e.g., flickering and alignment artifacts). Across scaling factors (including non-integer and asymmetric) and diverse video content, SL-AVSR achieves the best or consistently competitive quantitative performance against representative AVSR and arbitrary-scale image super-resolution (AISR) baselines. The explicit Laplacian-pyramid frequency prior delivers stable gains in detail recovery and scale generalization, evidenced by higher PSNR/SSIM and lower LPIPS across most scales. Qualitatively, SL-AVSR reconstructs structured regions (e.g., thin lines, repetitive patterns, man-made textures) more reliably and preserves stochastic textures with fewer over-smoothing artifacts, especially at large magnifications where frequency information is vulnerable. For temporal consistency, the second-order composite-flow-guided propagation and deformable alignment refinement reduce motion distortions like trailing edges, ghosting, and shimmering.
Time–space profiles reveal smoother, more continuous traces in SL-AVSR compared to competitors' blurred or jagged ones, indicating superior temporal aggregation. The look-ahead mechanism further boosts stability and perceptual quality by incorporating future context without a costly full-sequence backward pass. In cross-dataset tests, SL-AVSR sustains robust performance on unseen distributions, with gradual degradation as scaling increases. Under randomized and real-world degradations, it avoids severe artifact amplification, underscoring the resilience from explicit frequency guidance and second-order alignment. Efficiency analyses show SL-AVSR's favorable quality–efficiency trade-off, outperforming INR-based methods due to its kernel-generating hyper-upsampling and lightweight prior. Conclusion: We present SL-AVSR, an arbitrary-scale video super-resolution framework that unifies an explicit Laplacian-pyramid multi-scale frequency prior with second-order composite-flow-guided propagation and second-order deformable alignment refinement in a forward-looking recurrent architecture. The proposed design enhances spatial detail restoration and scale generalization while improving temporal consistency by mitigating alignment error accumulation under challenging motion patterns. The hyper-upsampling unit supports continuous scaling with practical efficiency, avoiding the high computational cost of pixel-wise implicit rendering. Extensive evaluations across datasets, scaling factors, and degradation conditions demonstrate SL-AVSR's strong balance of fidelity, perceptual quality, temporal coherence, and computational efficiency, positioning it as a practical solution for real-world arbitrary-scale video super-resolution. The code is publicly available through Science Data Bank: https://www.doi.org/10.57760/sciencedb.j00240.00181.
Keywords: arbitrary-scale video super-resolution; recurrent neural network; second-order deformable alignment; frequency prior; hyper-upsampling unit
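The Laplacian-pyramid frequency prior described in the SL-AVSR abstract can be illustrated on a one-dimensional signal. This is a minimal sketch under simplifying assumptions that are not from the paper: pairwise averaging stands in for the low-pass filter, sample repetition stands in for upsampling, and the fusion weights are fixed numbers rather than learned parameters. All function names are hypothetical.

```python
def downsample(signal):
    """Halve the resolution by averaging adjacent pairs (length must be even)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the resolution by repeating each sample."""
    return [value for value in signal for _ in range(2)]


def laplacian_pyramid(signal, levels):
    """Return [band_0, ..., band_{levels-1}, low_pass], finest band first.

    Each band holds the detail lost when moving to the next coarser level.
    """
    bands = []
    current = list(signal)
    for _ in range(levels):
        coarse = downsample(current)
        bands.append([a - b for a, b in zip(current, upsample(coarse))])
        current = coarse
    bands.append(current)
    return bands


def fuse(bands, weights=None):
    """Rebuild a signal from the pyramid, scaling each detail band by a weight.

    With all weights equal to 1 the original signal is recovered exactly.
    """
    details = bands[:-1]
    if weights is None:
        weights = [1.0] * len(details)
    current = bands[-1]
    for band, weight in zip(reversed(details), reversed(weights)):
        current = [u + weight * d for u, d in zip(upsample(current), band)]
    return current


row = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 8.0, 0.0]
pyramid = laplacian_pyramid(row, levels=2)
print("bands:", pyramid)
print("full reconstruction:", fuse(pyramid))
print("fine band suppressed:", fuse(pyramid, weights=[0.0, 1.0]))
```

Changing the per-band weights changes how much of each frequency band reaches the output, which is the role the abstract assigns to the learnable fusion weights when different magnifications call for different bands.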
Yebin Liu, Yao Mu, Qi Ye, Lin Gao, Xiaoguang Han, Anpei Chen, Yueqi Duan, Sida Peng, Tianjia Shao, Hongwen Zhang, Li Zhang, Yiyi Liao, Lan Xu, Xihui Liu, Yao Yao, Ruizhen Hu, Li Yi, Yuan Guo, Zhouhui Lian, Ziwei Liu, Baoquan Chen
DOI:10.11834/jig.260114
Abstract: As an interdisciplinary field spanning computer vision, graphics, artificial intelligence, and optical imaging, 3D vision serves as the core cornerstone for constructing Embodied General Intelligence (EGI) and the Metaverse. As the "Scaling Law" paradigm, upon which AI development relies, faces significantly diminishing marginal returns and encounters bottlenecks, the focus of both academia and industry is pivoting ever more clearly toward foundational subjects closely related to 3D vision, such as "World Models," "Spatial Intelligence," and "Embodied Intelligence," granting 3D vision unprecedented strategic attention and developmental opportunities. In 2025, the primary frontier trends in the field of 3D vision can be summarized as follows: 1) Feed-forward 3D reconstruction that supports spatiotemporal multi-image inputs: with breakthroughs in feed-forward 3D reconstruction technologies such as VGGT, obtaining scene structure and motion information through spatiotemporal multi-image feed-forward methods has become increasingly simple, bringing two profound impacts: firstly, it provides a solid foundation for 3D scene understanding for spatial intelligence, allowing many traditional 2D vision problems to be solved more fundamentally in 3D space; secondly, combined with efficient rendering technologies such as 3D Gaussian Splatting (3DGS), the threshold for high-quality 3D content production has been significantly lowered, paving the way for large-scale applications such as digital twins and the Metaverse.
2) The gradual fusion of 3D generation and 3D reconstruction: 3D AIGC technologies such as SAM3D support compositional and instance-level object generation under single-image input, with generation quality gradually reaching industrial-grade scanning standards, while simultaneously integrating with feed-forward reconstruction methods to gradually achieve the generation of authentic 3D structures and textures consistent with the input images; this will support feed-forward multi-instance reconstruction of dynamic complex scenes, significantly improving real-time, multimodal perception and understanding capabilities in complex scenarios. 3) The integration from video generation and world models to embodied intelligence: video generation technology is rapidly incorporating explicit or implicit 3D representations and evolving toward multi-view consistency, long sequences, and physical plausibility, directly driving the development of integrated "Perception-Generation-Interaction" world model technologies. These types of world models, combined with feed-forward 3D reconstruction technology, will form a complete "Multimodal Perception–3D Modeling–4D Generation–Real-time Interaction" 4D world model. At the same time, world model methods have begun to serve embodied intelligence, and a unified framework of "understanding-generation-execution" has begun to emerge. World models are widely regarded by the academic community as the key path to achieving generalizable embodied intelligence and ultimately leading to AGI. 4) Human behavior and video data becoming the core fuel driving breakthroughs: human operational spaces and interaction videos constitute a "data goldmine" for training embodied intelligence. 
The vast amount of human behavior videos on the internet, as well as first-person perspective data collected through simple devices, contain physical common sense, causal reasoning, and interaction preferences that serve as the natural fuel to break through the current data bottlenecks of embodied intelligence. By performing explicit 3D perceptual reconstruction or latent-space action alignment and learning on these data, a "data pyramid" base can be constructed to drive the scaling of embodied intelligence. 5) The evolution of the embodied training paradigm from imitation learning to interaction-driven reinforcement learning: the technical evolution of embodied intelligence VLA models is leaping from a supervised fine-tuning paradigm relying on expert demonstrations to a composite training architecture integrating online reinforcement learning. This shift effectively breaks the dependence on scarce high-quality data, enabling policies driven by sparse rewards to obtain generalization and exploration capabilities surpassing those of imitation learning, solving the challenges of exploration and stable updates in continuous action spaces. Simultaneously, the development of high-performance training systems and action-conditioned world models provides the infrastructure support for large-scale interaction data generation and efficient policy evolution, marking a new "post-training" stage for embodied intelligence centered on "interaction-driven" approaches. 
The selected top ten research advancements of the year in the field of 3D vision include: 1) Feed-forward 3D reconstruction constructing the foundation models for 3D vision (spatial intelligence); 2) The convergence of reconstruction and generation technical routes (video generation/3D generation), moving from mutual assistance to preliminary integration; 3) 3DGS/4DGS continuously improving representation efficiency, sparking a surge in scene modeling and volumetric video applications; 4) 3D generation: a leap from single-object visual realism to structuralized components/scenes and physical interactivity; 5) From video generation to world models: oriented toward spatiotemporal consistency, physical plausibility, and interactivity; 6) Unified multimodal large models for understanding and generation serving spatial intelligent perception; 7) Frontier shifts in digital humans: from appearance modeling to multimodal interaction; 8) Human data becoming the essential fuel to break through the Scaling Law of embodied intelligence; 9) Embodied intelligence foundation models evolving toward unified models of integrated "understanding-imagination-execution"; 10) The "post-training" moment of embodied intelligence: the paradigm shift of VLA models from imitation learning to online RL. Collectively, these breakthroughs have established the prototype of an integrated intelligent architecture characterized by “Multimodal perception - 3D modeling - 4D Generation - Real-time interaction”, providing critical technical support for the substantive advancement of spatial and embodied intelligence. To promote academic discourse, this paper extensively analyzes frontier trends in 3D vision and curates the top ten annual research advances, offering valuable reference perspectives for both academia and industry.
Keywords: 3D vision; embodied AI; world model; reconstruction and generation; spatial intelligence
Abstract: Objective: Hyperspectral image (HSI) classification is critical in remote sensing and widely used in land cover monitoring, agricultural survey, and urban planning. Mamba-based models have been increasingly applied in HSI classification due to their advantages in linear computational complexity and long-range dependency modeling. However, existing Mamba-based methods suffer from insufficient spatial information utilization and unreasonable spatial-spectral feature fusion, which leads to spatial information erosion and feature submergence, thereby limiting classification accuracy and efficiency. In this context, this study proposes a spatial information enhancement-based method, named SE-Mamba (Spatial Enhancement-Mamba), to improve classification accuracy and efficiency through effective integration of spatial and spectral information. Method: SE-Mamba incorporates two key designs focusing on the effective introduction and reasonable fusion of spatial information. First, a full-process spatial information enhancement mechanism is constructed, consisting of a front-end Spatial Enhancement Feature Extractor (SEFE) and a back-end High-Order Feature Refinement (HFR) module. Before the features are serialized and processed, SEFE explicitly encodes local structural priors and geometric dependencies into the feature map to alleviate the spatial information loss caused by Mamba serialization; HFR restores fine-grained geometric structures through high-order interaction and dual gate control enhancement mechanisms. Second, a rational spatial-spectral fusion architecture, namely the Spatial Spectral Collaborative Module (SSCM), is designed, which includes a Spatial-Spectral Fusion Module (SSFM).
The SSCM decouples spatial and spectral features into two separate branches to strengthen the independent representation of heterogeneous features, while the SSFM adopts a "calibration-first, then-fusion" strategy that achieves in-depth integration through cross-guidance and adaptive weight allocation, thereby avoiding spatial information erosion. For experimental verification, four representative hyperspectral datasets (HanChuan, HongHu, Houston, PaviaU) covering agricultural and urban scenes are used to evaluate the model's performance and robustness. The key evaluation indicators are Overall Accuracy (OA), Average Accuracy (AA), and the Kappa coefficient. The ablation experiments compare the performance of adding SEFE, HFR, and SSCM individually to the baseline model and, by removing the SEFE, HFR, and SSCM modules from the complete framework, analyze the role of each proposed module, the synergy between modules, and the coupling effect of the internal submodules of SSCM. Computational complexity, parameter size, and inference speed are also evaluated to assess the model's efficiency. Results: Experimental results on the four datasets show that SE-Mamba achieves the best OA and AA, with its Kappa coefficient also reaching a level comparable to state-of-the-art methods. Specifically, SE-Mamba attains an average OA of 96.07% across the four datasets, surpassing the benchmark model MambaHSI by 2.32%. In terms of efficiency, the computational complexity and parameter size of SE-Mamba are comparable to those of mainstream methods, while its inference speed is superior to that of some comparison models, achieving a good balance between classification accuracy and computational efficiency. Ablation experiments verify the effectiveness of each core module in spatial feature representation.
Compared with existing Mamba-based methods (e.g., MambaHSI), SE-Mamba effectively addresses spatial information erosion and feature submergence through spatial enhancement and optimized fusion, while preserving Mamba's linear computational advantage. Compared with traditional CNN/Transformer-based methods, SE-Mamba combines state space modeling with spatial enhancement and achieves more stable performance in complex scenes. Conclusions: The experiments verify that combining explicit spatial enhancement with state space modeling is effective, and that the two core strategies of SE-Mamba work together to alleviate spatial information erosion and feature submergence. By strengthening spatial feature extraction and optimizing spatial-spectral fusion, SE-Mamba maintains stable and efficient classification performance on complex agricultural, urban, and multi-category HSI datasets, improving both classification accuracy and efficiency. SE-Mamba provides a novel approach to HSI classification and a reference for state space-based remote sensing image processing, offering technical support for land cover monitoring and agricultural survey. Future work could design adaptive scanning mechanisms and introduce transfer learning to improve the model's adaptability to complex scenarios and its cross-regional generalization ability, and could promote its practical application on portable devices through lightweight design. The dataset and code related to this article have been shared [DOI: 10.57760/scientificdb.j00240.00182].
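The three evaluation indicators named in the abstract (OA, AA, and the Kappa coefficient) have standard definitions in terms of a confusion matrix. The following is a minimal sketch of those definitions, not the authors' evaluation code; the function name `classification_metrics` and the toy confusion matrix are hypothetical and carry no results from the paper.

```python
import numpy as np

def classification_metrics(conf):
    """Return (OA, AA, Kappa) for a confusion matrix.

    conf[i, j] = number of samples of true class i predicted as class j.
    Assumes every true class has at least one sample (no empty rows).
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()

    # Overall Accuracy: fraction of all samples classified correctly.
    oa = np.trace(conf) / total

    # Average Accuracy: mean of the per-class recalls.
    per_class = np.diag(conf) / conf.sum(axis=1)
    aa = per_class.mean()

    # Kappa: agreement corrected for the agreement expected by chance.
    expected = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa

# Toy, imbalanced two-class example (illustrative numbers only).
toy = [[90, 10],
       [5, 15]]
oa, aa, kappa = classification_metrics(toy)
print(f"OA={oa:.4f}  AA={aa:.4f}  Kappa={kappa:.4f}")
```

On imbalanced data OA and AA diverge, because OA weights every sample equally while AA weights every class equally, which is why the two are reported together with Kappa.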