Sun Renjie, Sun Yubao, Shao Shuai, Shuai Hui, Liu Qingshan
DOI:10.11834/jig.250606
摘要:Objective: Text-driven 3D human motion generation has emerged as a frontier research direction in multimodal content creation, holding great promise for applications in virtual reality, film production, and the metaverse. Despite significant progress, existing methods still face fundamental challenges in three aspects: precise semantic alignment between natural language descriptions and generated motions, fine-grained control over individual body parts, and global coordination that respects biomechanical constraints. Consequently, current solutions often suffer from semantic leakage, unnatural postures, and limited expressiveness. Moreover, most approaches either focus solely on motion synthesis without producing complete 3D assets, or generate static avatars without dynamic pose control. To address these limitations, we propose a novel cascaded diffusion framework that follows a “local-to-global, structure-to-appearance” generation pipeline, enabling end-to-end synthesis from raw text to high-fidelity, real-time renderable 3D human models with precise motion control. Method: Our framework consists of four key stages, each designed to address a specific aspect of the text-to-3D human generation problem. First, a semantic decoupling module leverages a large language model (GPT-4) to automatically parse the input text into independent action descriptions for six anatomical body parts: head, left arm, right arm, torso, left leg, and right leg. This decomposition converts a global motion description into a set of part-specific textual instructions, explicitly separating semantics across different body regions. For body parts not mentioned in the original text, the parser assigns a “do nothing” instruction, preventing unintended movements. This step is crucial because it transforms a loosely coupled global description into a structured, machine-readable format that guides subsequent generation. Second, we construct a local motion generation module composed of six parallel diffusion-based encoders, each conditioned on its corresponding part description. These encoders operate with gradient isolation, meaning that the training and inference processes for different body parts do not share gradients. This design fundamentally prevents semantic leakage—a common issue in prior work where an action described for one body part inadvertently affects others. Each encoder adopts a transformer-based denoising network. Starting from pure Gaussian noise, the network iteratively refines a latent code guided by the corresponding part text embedding produced by a pre-trained TMR encoder. The resulting latent representation captures the fine-grained motion characteristics of that specific body part, such as the trajectory, speed, and joint angles. Importantly, because the six encoders are independent, they can be trained in parallel on part-specific motion data extracted from full-body motion capture datasets. Third, a global motion fusion module integrates the six independent part latents into a coherent full-body pose. Simple concatenation of part latents would ignore the biomechanical dependencies between body regions. To address this, we employ a lightweight feed-forward network with GELU (Gaussian error linear unit) activation, augmented by the global semantic feature of the complete text. This network learns to enforce biomechanical constraints such as the torso leaning backward during a forward kick, natural arm-leg coordination during walking, and maintaining overall balance.
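The sketch below illustrates one possible organization of the gradient-isolated part denoisers and the GELU fusion network described above; module names, tensor sizes, and the detach-based isolation are illustrative assumptions rather than the authors' released implementation.

```python
# Hypothetical sketch: six part-wise denoisers conditioned on part text, plus a
# GELU feed-forward fusion network. Dimensions and the detach() isolation are assumptions.
import torch
import torch.nn as nn

PARTS = ["head", "left_arm", "right_arm", "torso", "left_leg", "right_leg"]

class PartDenoiser(nn.Module):
    """Transformer-based denoiser for a single body part (illustrative configuration)."""
    def __init__(self, latent_dim=256, text_dim=512, num_layers=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_latent, part_text_emb, t_emb):
        # Condition the noisy latent on the part-level text embedding and the timestep.
        cond = self.text_proj(part_text_emb).unsqueeze(1) + t_emb.unsqueeze(1)
        h = self.backbone(noisy_latent + cond)
        return self.head(h)  # predicted noise (or clean latent) for this part

class GlobalFusion(nn.Module):
    """Lightweight GELU feed-forward network fusing the six part latents."""
    def __init__(self, latent_dim=256, global_text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim * len(PARTS) + global_text_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, part_latents, global_text_emb):
        # Illustrative gradient isolation: detach each part latent so the fusion
        # gradients do not flow back into (and entangle) the six part denoisers.
        flat = torch.cat([z.detach().mean(dim=1) for z in part_latents], dim=-1)
        return self.net(torch.cat([flat, global_text_emb], dim=-1))
```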
The fused latent is then decoded into SMPL parameters, producing a parametric human mesh that respects human skeletal kinematics. Fourth, for appearance enhancement and efficient rendering, we convert the SMPL mesh into a set of 3D Gaussians—a modern explicit representation that supports real-time differentiable rasterization. Each Gaussian is defined by its position, covariance matrix, opacity, and spherical harmonics coefficients for color. To enrich geometric and textural details beyond the smooth SMPL mesh, we adopt a state-of-the-art 2D diffusion model (Flux) as a powerful visual prior. Through SDS (score distillation sampling), gradients from the 2D diffusion model are backpropagated to iteratively optimize the attributes of the 3D Gaussians while keeping their positions fixed to preserve the generated motion. This optimization runs for 4000 iterations, refining details such as skin texture, clothing wrinkles, and lighting effects. The final output is a fully textured 3D human model that can be rendered in real time without any post-processing. Result: We conduct extensive experiments on two standard benchmarks, HumanML3D and KIT-ML, and compare our method against representative baselines including MotionDiffuse, MDM, MLD, DreamFusion, GaussianDreamer, and others. For quantitative evaluation, we employ multiple metrics. FID (Fréchet inception distance) measures the realism and diversity of generated motion sequences. CLIP-S (CLIP similarity) evaluates semantic alignment between rendered multi-view images and input text. Additionally, we introduce Part-FID (part-level Fréchet inception distance), which computes FID separately for each of the six body parts using dedicated feature extractors, providing a fine-grained assessment of local motion quality. Experimental results demonstrate that our method achieves an FID of 0.429, outperforming MotionDiffuse (0.687) and MDM (0.747). In terms of CLIP-S, our method attains 29.41 (ViT-L/14) and 44.39 (ViT-bigG-14), surpassing GaussianDreamer (27.23 and 41.88) and other text-to-3D baselines. The proposed Part-FID yields an average score of 1.26, which is 18.7% better than MotionDiffuse, with the most significant improvement observed on the torso, validating the effectiveness of our global fusion module in enforcing biomechanical coordination. Ablation studies further confirm the contribution of each component: removing gradient isolation increases semantic leakage. Efficiency analysis shows that our method takes approximately 20 minutes for end-to-end generation, and the final 3D Gaussian representation enables real-time rendering at 24 frames per second, which is two orders of magnitude faster than NeRF-based renderers. Conclusion: We present a comprehensive framework for text-driven 3D human motion generation that uniquely combines local motion generation with global fusion, supported by efficient 3D Gaussian splatting and a powerful 2D diffusion prior. The method achieves superior performance in motion realism, part-level control accuracy, semantic alignment, and rendering efficiency. It provides an end-to-end solution from natural language to high-quality, real-time renderable 3D human assets, opening new possibilities for interactive virtual human applications. Future work will focus on extending the framework to generate long-sequence motions with temporal consistency and incorporating multimodal control signals.
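A minimal sketch of the appearance-refinement loop follows, assuming hypothetical handles `render_gaussians`, `diffusion_prior`, and Gaussian attribute names; it shows the standard SDS update with positions frozen, not the authors' exact optimizer or schedule.

```python
# Hedged SDS sketch: score distillation from a 2D diffusion prior updates Gaussian
# appearance attributes while positions stay fixed to preserve the generated pose.
import torch

def sds_refine(gaussians, render_gaussians, diffusion_prior, text_emb,
               iters=4000, lr=1e-2):
    # Only appearance attributes are optimized; positions are kept frozen.
    appearance = [gaussians.sh_coeffs.requires_grad_(True),
                  gaussians.opacity.requires_grad_(True)]
    gaussians.positions.requires_grad_(False)
    opt = torch.optim.Adam(appearance, lr=lr)

    for step in range(iters):
        image = render_gaussians(gaussians)            # differentiable rasterization
        t = torch.randint(20, 980, (1,))               # random diffusion timestep
        noise = torch.randn_like(image)
        noisy = diffusion_prior.add_noise(image, noise, t)       # placeholder API
        eps_pred = diffusion_prior.predict_noise(noisy, t, text_emb)
        # SDS gradient: (eps_pred - noise) acts as the gradient of the distillation
        # loss w.r.t. the rendered image; no backprop through the diffusion prior.
        grad = (eps_pred - noise).detach()
        loss = (image * grad).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return gaussians
```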
Zuo Jialong, Deng Haoyou, Zuo Haotong, Zhou Hanyu, Zhu Jiaxin, Zhang Yicheng, Zhang Yiwei, Yan Yongxin, Huang Kaixing, Chen Weisen, Deng Yongtai, Jin Rui, Sang Nong, Gao Changxin
DOI:10.11834/jig.260029
摘要:The rapid evolution of large-scale text-to-image generation models has fundamentally transformed the landscape of visual content creation. Driven by advances in diffusion models, large multimodal pretraining, and scalable inference pipelines, modern generative systems have demonstrated unprecedented capabilities in synthesizing visually compelling images across a wide range of styles, scenes, and semantic conditions. Commercial models such as Nano Banana Pro have attracted significant attention due to their strong zero-shot generation ability, robust semantic understanding, and impressive perceptual quality. However, despite their success in creative image synthesis, a critical and largely underexplored question remains: Can such foundation generative models serve as general-purpose solvers for traditional low-level vision tasks? Low-level vision tasks—including dehazing, deblurring, and super-resolution—have historically been dominated by task-specific, regression-based models. These models are typically trained under strong supervision with paired data and optimized using pixel-aligned objectives such as PSNR (peak signal-to-noise ratio) and SSIM (structural similarity index measure). While highly effective within their target domains, such specialist models lack flexibility, often require costly retraining for new tasks, and struggle to generalize beyond their training distributions. In contrast, foundation generative models promise a unified alternative: a single pretrained model capable of addressing diverse vision tasks through natural language prompts, without task-specific fine-tuning. In this work, we present the first large-scale, systematic zero-shot evaluation of Nano Banana Pro across a broad spectrum of low-level vision tasks. Specifically, we investigate whether Nano Banana Pro can function as a low-level vision all-rounder—a generalist model capable of producing high-quality results across heterogeneous restoration, enhancement, and fusion tasks. To this end, we conduct an extensive evaluation covering 14 distinct low-level vision tasks across 40 datasets, encompassing both synthetic and real-world degradations. The evaluated tasks include deblurring (motion, defocus), super-resolution, image denoising, deraining, shadow removal, reflection removal, flare removal, low-light image enhancement, underwater image enhancement, HDR (high dynamic range) reconstruction, multi-focus image fusion, and infrared–visible image fusion, among others. All experiments are conducted under a standard zero-shot protocol. Nano Banana Pro is queried exclusively through simple, task-oriented natural language prompts, without any model fine-tuning, parameter adaptation, or task-specific post-processing. This setting is deliberately chosen to reflect realistic deployment scenarios and to assess the intrinsic capability of the model as a foundation visual system. For each task, we compare Nano Banana Pro against state-of-the-art specialist methods specifically designed for the corresponding task. Our comprehensive evaluation reveals a consistent and striking performance dichotomy. On one hand, Nano Banana Pro frequently produces results with superior perceptual quality, characterized by enhanced clarity, vivid textures, improved contrast, and visually pleasing color distributions.
In many challenging scenarios—such as severe noise, extreme low-light conditions, heavy underwater color distortion, or strong atmospheric degradation—the model is able to hallucinate plausible high-frequency details and recover semantically coherent structures that rival or even surpass those generated by domain-specific methods. Across multiple tasks, Nano Banana Pro achieves competitive or leading performance on no-reference perceptual metrics and consistently receives favorable qualitative assessments. On the other hand, when evaluated using traditional full-reference, pixel-aligned quantitative metrics, Nano Banana Pro systematically underperforms compared to specialist models. Metrics such as PSNR, SSIM, SCD (sum of correlations of differences), and VIF (visual information fidelity) consistently reveal notable gaps, particularly in tasks requiring strict structural alignment or physical signal fidelity. This discrepancy is especially pronounced in tasks like denoising, HDR reconstruction, and image fusion, where pixel-level consistency with the reference image is heavily rewarded. We attribute this behavior to the inherent stochastic and generative nature of diffusion-based models, which prioritize semantic plausibility and perceptual realism over deterministic pixel correspondence. As a result, even visually improved outputs may be penalized for global color shifts, localized texture synthesis, or subtle geometric deviations. Importantly, our analysis shows that these quantitative penalties do not necessarily indicate failure. In many datasets, the provided “ground-truth” images themselves contain residual noise, blur, or imperfect color balance. In such cases, Nano Banana Pro often generates cleaner, more visually appealing results that deviate from the reference but align better with human perception. This observation highlights a fundamental tension between regression-based evaluation paradigms and generative reconstruction behaviors, and suggests that current benchmarks may be insufficient for assessing foundation generative models. Beyond aggregate metrics, we conduct detailed task-wise and dataset-wise analyses to characterize the operational scope and limitations of Nano Banana Pro. The model excels in scenarios involving severe degradation, ambiguous structure, or incomplete information, where its strong semantic priors can compensate for missing signal. Conversely, it struggles in applications demanding strict physical accuracy, such as forensic analysis, scientific imaging, or safety-critical perception, where hallucinated details or slight structural inconsistencies may be unacceptable. Collectively, our findings position Nano Banana Pro as a powerful zero-shot contender for low-level vision, capable of delivering high perceptual quality across a remarkably diverse set of tasks without retraining. At the same time, achieving the pixel-level fidelity of domain specialists remains a significant challenge. Rather than framing this as a binary competition between generative and regression paradigms, our results suggest a more promising direction: strategic integration. Future robust vision systems may combine the semantic imagination of foundation generative models with the physical constraints and precision of task-specific networks, leveraging the strengths of both. In summary, this study provides the first comprehensive empirical answer to the question: Is Nano Banana Pro a low-level vision all-rounder? Our answer is nuanced.
Nano Banana Pro substantially raises the upper bound of perceptual quality in zero-shot low-level vision, but has yet to establish a stable lower bound suitable for high-fidelity, safety-critical applications. By systematically documenting these strengths and limitations across 14 tasks and 40 datasets, this report offers a detailed reference point for future research on foundation models in low-level vision, and calls for the development of new evaluation frameworks that better reflect perceptual realism, semantic consistency, and downstream utility.
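To make the dual evaluation protocol concrete, the sketch below scores a restored output against its reference with the full-reference metrics discussed above; the prompting call to the generative model is abstracted behind a hypothetical `restore_with_prompt` callable, and no-reference perceptual scoring would be added alongside it.

```python
# Illustrative full-reference scoring for one low-level vision task, assuming
# images are float arrays in [0, 1]; `restore_with_prompt` is a placeholder.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(restored: np.ndarray, reference: np.ndarray) -> dict:
    """Pixel-aligned scores for one (restored, reference) image pair."""
    psnr = peak_signal_noise_ratio(reference, restored, data_range=1.0)
    ssim = structural_similarity(reference, restored, channel_axis=-1, data_range=1.0)
    return {"psnr": psnr, "ssim": ssim}

def evaluate_task(restore_with_prompt, prompt, pairs):
    """pairs: iterable of (degraded, reference) arrays for one task/dataset."""
    scores = [evaluate_pair(restore_with_prompt(deg, prompt), ref) for deg, ref in pairs]
    return {k: float(np.mean([s[k] for s in scores])) for k in scores[0]}
```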
摘要:Objective: Scene text detection (STD) is a fundamental task in scene text reading and understanding, and plays an important role in enabling intelligent systems to perceive high-level semantic information from natural scenes. It provides essential technical support for various applications, such as autonomous driving, image retrieval, unmanned systems, and intelligent scene analysis. In recent years, with the rapid development of deep learning and visual representation modeling, STD has achieved substantial progress and attracted increasing research attention. Existing deep learning-based methods can generally be divided into regression-based, connected-component-based, and segmentation-based approaches. Among them, segmentation-based methods have become a mainstream solution due to their flexibility in pixel-level prediction and strong capability in detecting arbitrarily shaped text instances. However, most existing segmentation-based methods still implicitly assume that multi-scale features can be optimized under a unified supervision signal and fused within a shared semantic space. Such a strategy overlooks the intrinsic semantic heterogeneity across feature hierarchies. Specifically, low-level features contain rich spatial details but are vulnerable to pixel-level noise, whereas high-level features encode stronger semantic information but may lose fine-grained structural cues. Directly supervising and fusing these heterogeneous representations may lead to interference between low-level pixel noise and high-level semantic constraints, thereby weakening feature fusion effectiveness and reducing inference stability. From the perspective of representation learning, multi-scale features are not merely homogeneous representations at different spatial resolutions, but heterogeneous representations associated with different semantic granularities. Therefore, effective STD requires explicit modeling, alignment, and coordination of semantic information across different feature levels. Method: To address the above issues, we propose an efficient and effective STD framework, which consists mainly of a branch-wise distribution-aware modeling (BDM) module and a cross-semantic global knowledge integration (CGKI) module. Considering that conventional multi-scale text detection methods often ignore the semantic discrepancies among different feature levels at the supervision stage, the BDM module is designed from the perspective of label modeling. Specifically, it transforms pixel-level binary segmentation annotations into hierarchical distribution-aware supervision signals, enabling feature branches at different scales to independently learn text distribution semantics that are consistent with their corresponding receptive fields. In this way, the semantic interference among heterogeneous multi-scale features can be alleviated, and semantically aligned feature representations can be provided for subsequent feature fusion. Notably, the BDM module is only employed during the training stage and removed during inference, thus improving detection accuracy without introducing additional computational overhead. On the basis of intra-scale distribution-aware semantic alignment, we further design the CGKI module to explicitly model the collaborative relationships among different semantic levels. This module first enhances the representation of each scale within its own semantic space, and then performs controlled cross-scale interaction through adaptive scale reweighting and adjacent-scale information injection.
By selectively recalibrating the importance of different scales and introducing complementary contextual information from neighboring levels, the CGKI module achieves global coordination and stable fusion of multi-scale semantics while maintaining a controllable computational cost. A ResNet equipped with deformable convolutions, together with a feature pyramid network (FPN), is adopted as the backbone. For the training stage, the model is either directly trained on public datasets for ablation studies or pre-trained on Synth150k for 10 epochs and then fine-tuned on real-world datasets for comparison experiments. SGD with an initial learning rate of 0.001 and a poly learning rate schedule is used for optimization, together with data augmentation strategies including random rotation, cropping, and flipping. Result: The proposed method is extensively evaluated against more than ten advanced methods on five widely used public text detection benchmarks, including MSRA-TD500, CTW1500, Total-Text, ICDAR2015, and MPSC. Precision (P), recall (R), and F-measure (F) are adopted as the evaluation metrics, where higher values indicate better detection performance. All inference tests are conducted on a single NVIDIA GTX 1080Ti GPU with an Intel i7-6800K CPU to ensure a consistent evaluation environment. Experimental results show that the proposed method consistently outperforms existing efficient STD methods on the above datasets while maintaining competitive inference speed. Specifically, on Total-Text, the proposed method improves the F-measure by 4.2% and 2.7% compared with DBNet++ and FEPE, respectively. On MSRA-TD500, it achieves F-measure improvements of 5.0% and 4.1% over DBNet++ and FEPE, respectively. On CTW1500, it gains 2.6% and 1.0% in F-measure against DBNet++ and FEPE, respectively. On ICDAR2015, it achieves F-measure gains of 2.8% and 2.7% relative to DBNet++ and FEPE, respectively. On the industrial scene text dataset MPSC, the proposed method surpasses existing advanced methods ISTD-DLA, ODM, and RT-DETR by 1.0%, 3.8%, and 1.3% in F-measure, respectively. Ablation studies on MSRA-TD500 further demonstrate the effectiveness of the proposed modules, confirming that BDM and CGKI can enhance multi-scale feature representation and fusion. In addition, visualization results on these datasets show that the proposed method can generate complete and accurate text boundaries in different scenes. Cross-dataset experiments further verify the generalization ability of the proposed method, where it achieves superior performance over existing representative methods such as ZTD, MTD, and CM-Net under both line-level and word-level annotation settings. Conclusion: This work presents an efficient and effective scene text detection method. By integrating BDM and CGKI, the proposed method enhances the semantic consistency and collaborative fusion of multi-scale text features, thereby improving the detection of complex text. Experimental results on multiple public benchmarks demonstrate that the proposed method achieves competitive detection accuracy and inference speed, outperforming existing efficient scene text detection methods. In future work, we will explore the integration of the proposed detection model with efficient text recognition models to establish an end-to-end efficient framework for text spotting.
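One way the branch-wise distribution-aware supervision could be realized is sketched below: the pixel-level binary text mask is converted into one soft text-distribution map per feature scale, so each branch is supervised at a granularity matched to its receptive field. The pooling-plus-smoothing choice is an assumption for illustration; the paper's exact label transformation is not specified in the abstract.

```python
# Hypothetical BDM-style label construction from a binary text mask.
import torch
import torch.nn.functional as F

def distribution_aware_labels(binary_mask, strides=(4, 8, 16, 32), blur_ks=5):
    """binary_mask: (B, 1, H, W) tensor with values in {0, 1}."""
    kernel = torch.ones(1, 1, blur_ks, blur_ks) / (blur_ks * blur_ks)
    labels = []
    for s in strides:
        # Average pooling turns hard pixels into a local text-density estimate
        # at the branch's native resolution.
        density = F.avg_pool2d(binary_mask, kernel_size=s, stride=s)
        # Light smoothing spreads the distribution to neighbouring cells.
        density = F.conv2d(density, kernel, padding=blur_ks // 2).clamp(0, 1)
        labels.append(density)
    return labels  # one soft supervision map per scale branch
```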
摘要:Remote sensing technology serves as the core mechanism for the observation of the Earth and the understanding of surface environments. It plays an irreplaceable role in critical fields such as natural disaster monitoring, urban planning, resource exploration, and ecological protection. Over the past decade, driven by the rapid advancement of deep learning, the intelligent interpretation of remote sensing images has achieved breakthrough progress in fundamental vision tasks. However, the traditional deep learning paradigm is intrinsically built upon a closed-set assumption, meaning that models can only recognize a predefined and human-annotated set of fixed categories during the inference stage. When confronted with highly complex surface environments in real-world Earth observation scenarios, dynamic object morphologies, and rare ground objects with long-tail distributions, this traditional paradigm not only incurs prohibitive costs for the construction of massive pixel-level annotated datasets but also easily falls into the trap of domain-specific overfitting. Consequently, the generalization and response capabilities of this paradigm are severely challenged by unseen categories or sudden events, making it inadequate to meet the highly dynamic interpretation demands of the open world. In recent years, the rapid development of vision-language models has catalyzed a paradigm shift in artificial intelligence from task-specific models to general-purpose perception models. By mapping visual representations and natural language into a unified feature space through contrastive learning on massive image-text pairs, these models have broken the constraints of discrete labels. This enables a direct response to arbitrary natural language prompts, a capability known as open vocabulary perception. While this technology has demonstrated remarkable zero-shot generalization and cross-modal reasoning capabilities in the natural image domain, the direct application of these general vision-language models to the remote sensing domain encounters a severe domain gap. The uniqueness of remote sensing data poses multiple challenges to the adaptability of existing models. First, the distinct overhead imaging perspective causes drastic variations in object scale and complex background textures. Second, Earth observation tasks rely on multi-source heterogeneous data from SAR, multispectral or hyperspectral, and thermal infrared sensors. The underlying physical mechanisms of these sensors exceed the inherent inductive biases of models pre-trained solely on natural RGB images. Third, remote sensing objects often possess strong geospatial attributes and complex topological associations. To address these critical challenges, this paper provides a comprehensive and systematic review of recent advancements in open vocabulary perception for remote sensing images. We first delve into the foundational aspect of this field: vision-language pre-training for remote sensing. We extensively review the evolution of construction strategies for large-scale datasets. We highlight the transition from limited, human-annotated image-text pairs to massive datasets generated via heuristic rules, the integration of geographic metadata, and advanced multi-modal large language models. This includes innovative approaches that leverage OpenStreetMap, geographical coordinates, etc., to produce fine-grained, physics-aware descriptions across multiple modalities. 
Concurrently, we systematically summarize the progression of pre-training methodologies. While early approaches primarily focused on simple domain adaptation through continuous pre-training, recent state-of-the-art frameworks emphasize physics-aware encoding, fine-grained multi-level consistency learning, and geography-enhanced architectures. These frameworks better capture the intricate spatial relationships and modality diversities inherent in Earth observation data. Subsequently, this review conducts an in-depth analysis of the adaptation and optimization of open vocabulary perception techniques across a wide spectrum of crucial downstream tasks. For zero-shot scene classification and cross-modal retrieval, we discuss advanced strategies designed to mitigate the high intra-class similarity and complex inter-class variances typical in remote sensing. We emphasize the shift towards fine-grained local-global alignment, hard negative mining, dynamic soft-labeling, and prompt engineering. In the realm of open vocabulary image segmentation, we categorize the existing literature into training-based methods and training-free or annotation-free paradigms. Training-based methods leverage base categories to adapt models while preventing catastrophic forgetting through pseudo-label distillation and knowledge retention mechanisms. Training-free paradigms synergize foundational models, such as CLIP and the Segment Anything Model, to extract structural masks and align semantics without updating network weights. For open vocabulary object detection and remote sensing visual grounding, we explore how researchers tackle extreme scale variations, arbitrary orientations, and dense object distributions. Their approaches include innovative frameworks for pseudo-label generation, multi-scale feature alignment, cross-modality context modeling, and interactive grounding mechanisms. Furthermore, we examine open vocabulary change detection, where recent studies employ either combinations of pre-trained vision-language models or generative models to generate large-scale data. These approaches aim to identify arbitrary, text-specified surface transitions and simulate complex spatiotemporal changes without reliance on massive and costly bi-temporal pixel-level annotations. We also briefly touch upon emerging open vocabulary applications in three-dimensional urban point clouds and cross-domain archaeological remote sensing, illustrating the expanding horizon of this technology. Despite remarkable progress, the field of open vocabulary perception for remote sensing remains in a crucial developmental stage and faces several critical bottlenecks. This paper critically identifies the limitations of current research, including the severe scarcity of high-quality and geographically balanced training data. This scarcity leads to geographic biases and performance degradation in data-poor regions. Additionally, there is a prominent absence of genuinely fine-grained and long-tailed open vocabulary evaluation benchmarks that can accurately reflect the performance of a model in extreme or unknown real-world scenarios. The inadequate physical understanding of heterogeneous modalities and the inherent black-box unreliability of current large models in high-stakes decision-making scenarios further constrain practical deployments. To chart the course for future research, we outline several promising and essential trajectories.
First, we anticipate a paradigm shift towards generative perception driven by multi-modal large language models. This shift unifies various spatial localization tasks into the direct generation of coordinate sequences or geometric property tokens to fully exploit the logical reasoning capabilities of foundational models. Second, we strongly advocate for the construction of rigorous, real-world, and fine-grained evaluation systems that incorporate complex spatiotemporal logic, diverse geographic conditions, and comprehensive evaluation metrics. Third, the development of omni-modal foundation models that explicitly integrate physical priors and deep learning is deemed crucial for the achievement of all-weather and all-spectrum Earth observation, moving beyond pure data-driven approaches. Furthermore, we highlight the necessity to extend perception from static spatial analysis to dynamic spatiotemporal causal reasoning to decode the evolutionary processes of the Earth. Finally, addressing the severe conflict between the massive parameter scale of foundation models and the limited computing power of aerospace edge devices requires focused research into efficient, trustworthy, and safe edge-cloud collaborative computing architectures. By systematically synthesizing these advancements and challenges, this comprehensive review aims to serve as a foundational roadmap for researchers and practitioners. It accelerates the transition of the intelligent interpretation of remote sensing from isolated, closed-set recognition toward artificial general intelligence capable of highly reliable, dynamic, and open-world perception.
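To ground the open-vocabulary mechanism the surveyed methods build on, the sketch below performs zero-shot classification of a remote sensing patch with a generic CLIP checkpoint; the checkpoint name and prompt template are illustrative assumptions, and a remote-sensing-adapted model of the kind reviewed above would normally replace them.

```python
# Illustrative zero-shot open-vocabulary classification with a CLIP-style model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_classify(image: Image.Image, class_names):
    # Arbitrary natural-language categories replace a fixed, closed label set.
    prompts = [f"a satellite image of a {c}" for c in class_names]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (1, num_classes) similarities
    probs = logits.softmax(dim=-1).squeeze(0)
    return dict(zip(class_names, probs.tolist()))

# Example: zero_shot_classify(Image.open("patch.png"), ["airport", "harbor", "farmland"])
```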
Liu Zhen, Yang Qinzhe, Liu Liqin, Liu Chenyang, Zou Zhengxia, Shi Zhenwei
DOI:10.11834/jig.260078
摘要:Objective: Giant pandas serve as a flagship species for global biodiversity conservation and play a key role in assessing ecosystem integrity and conservation effectiveness. Accurate and reliable detection of giant pandas in camera-trap images is thus essential for long-term wildlife monitoring, population assessment, and adaptive management of protected areas. In recent years, deep learning-based object detection algorithms have demonstrated remarkable success. However, directly deploying general-purpose detection models in wild scenarios remains challenging due to two fundamental issues. First, giant pandas are rare species, and acquiring large volumes of high-quality, finely annotated training data from the wild is extremely costly, time-consuming, and often impractical. Second, there exists a substantial domain gap between commonly used pre-training datasets and unconstrained camera-trap images captured in natural habitats. To alleviate data scarcity and improve robustness in wild environments, we propose a unified generation–detection method termed PandaGenDet. Method: Rather than treating data augmentation and detection as independent components, the core idea of PandaGenDet is to improve detection robustness through multi-level collaboration between generative and discriminative models. Specifically, PandaGenDet consists of three complementary components operating at different representational levels. First, at the data level, we introduce a class-conditioned image generation module equipped with a Category-guidance Mechanism. This mechanism explicitly incorporates semantic category information into the generative process, guiding the synthesis of panda images with improved semantic consistency and target realism, making them more suitable as high-quality supplementary training samples. Second, at the image level, we design an Image Enhancer module to reduce the domain discrepancy between wild camera-trap images and the visual priors learned from large-scale pre-training datasets. The Image Enhancer is implemented as a modular and easily integrable component that performs a learnable image-level mapping prior to detection. By adaptively reshaping low-level and mid-level image statistics, this module maps target-domain images to representations that are more compatible with the detector’s pre-trained weights, without requiring any modification to the detector architecture. During training, the Image Enhancer and the detector are jointly optimized in an end-to-end manner, with all detector parameters fully fine-tuned from their pre-trained initialization. Third, at the feature level, we propose a Generative Feature Injector, which leverages the trained generative model as a multi-scale feature extractor. Hierarchical feature representations learned during the image generation process are extracted and injected into the detection backbone via a PSPNet (pyramid scene parsing network) and FPN (feature pyramid network) fusion network. This design enables the detector to leverage the rich semantic and structural priors embedded within the generative model, transferring multi-scale semantic priors into the detection network. Together, these mechanisms form a unified and extensible method for robust wildlife detection. Result: We conduct extensive experiments using Grounding DINO, a modern open-set object detection model, as the detection backbone.
Evaluations are performed on the giant panda subset of the LoTE-Animal (long time-span dataset for endangered animal) dataset, which contains challenging camera-trap images representative of real-world conservation scenarios. Experimental results demonstrate that the proposed Category-guidance Mechanism significantly improves generative quality. Specifically, KID (kernel inception distance) decreases from 0.059 to 0.038, while FID (Fréchet inception distance) is reduced from 147.00 to 123.13, indicating that the synthesized images achieve higher fidelity and improved semantic consistency with real wild panda images. These improvements directly translate into more effective training data for detection. When the Image Enhancer is integrated into the Grounding DINO detector, notable gains in detection performance are observed. On the LoTE-Animal panda subset, mAP (mean average precision) increases from 88.8 to 89.7, while mAR (mean average recall) improves from 94.9 to 95.5, confirming the effectiveness of image-level domain adaptation. Further incorporating the Generative Feature Injector leads to additional performance improvements, with the detector achieving a mAP of 89.8, outperforming both the baseline and image-enhancer-only configurations. Finally, training the detector using a mixture of real images and high-quality synthetic images generated by the full PandaGenDet pipeline yields the best overall performance, achieving a final mAP of 90.1. Qualitative analyses further reveal that synthesized images exhibit more accurate panda poses, better integration with realistic environmental textures, and fewer semantic artifacts. Detection visualizations demonstrate high localization accuracy in challenging scenarios, including dense vegetation, low illumination, and partial occlusion. Furthermore, the final model demonstrates strong robustness in open-set detection, maintaining stable performance even when encountering object categories not present in the training dataset. Conclusion: This study presents PandaGenDet, a unified collaborative framework from data synthesis to object detection for giant panda monitoring in complex wild environments. By integrating data-level synthesis, image-level enhancement, and feature-level injection in a unified manner, the proposed method effectively addresses two major bottlenecks in real-world wildlife detection: the scarcity of annotated data and the presence of severe domain gaps between pre-training and deployment scenarios. Extensive experiments on camera-trap datasets demonstrate that PandaGenDet substantially improves both synthetic image fidelity and detection accuracy, while also enhancing open-set robustness. This three-level collaborative strategy significantly improves the detection performance of general-purpose models in complex wild environments.
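The sketch below illustrates the image-level enhancement idea under stated assumptions: a small learnable residual mapping is applied to camera-trap frames before a pre-trained detector and both are optimized jointly. The residual CNN design and the torchvision-style detector interface (a loss dictionary in training mode) are assumptions, not the authors' released module.

```python
# Hypothetical Image Enhancer jointly trained with a downstream detector.
import torch
import torch.nn as nn

class ImageEnhancer(nn.Module):
    """Learnable image-to-image mapping prepended to the detector."""
    def __init__(self, channels=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, x):
        # Residual formulation keeps the enhanced image close to the input,
        # reshaping low/mid-level statistics rather than re-synthesizing content.
        return (x + self.body(x)).clamp(0, 1)

def training_step(enhancer, detector, images, targets, optimizer):
    enhanced = enhancer(images)             # image-level domain adaptation
    losses = detector(enhanced, targets)    # assumes detector returns a loss dict
    loss = sum(losses.values())
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```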
摘要:With the rapid emergence of multimodal large models (MLLMs), massive heterogeneous data are proliferating across various industrial and scientific domains. In this context, multi-view clustering (MVC) serves as a cornerstone technology for unsupervised knowledge discovery and latent correlation mining. Currently, MVC is undergoing a profound and historic paradigm shift. Traditional surveys predominantly focus on the horizontal categorization of algorithmic network structures. However, this approach often fails to reveal the intrinsic evolutionary logic across different technological eras. Departing from these conventions, this paper proposes a pioneering, prior-driven theoretical perspective to systematically reconstruct the developmental trajectory of MVC over the past two decades. This is achieved through a trans-paradigm analytical framework of Geometry-Semantics-Cognition. In the initial stage of shallow structural mining, the research paradigm focused on explicit mathematical constraints within original or kernel-induced feature spaces. Euclidean space methods, such as multi-view k-means and non-negative matrix factorization (NMF), identified global prototypes by minimizing squared error or Frobenius norm reconstruction loss. In contrast, affine space methods leveraged self-representation properties to model data as a union of low-dimensional subspaces, while manifold space techniques utilized spectral graph theory to transform clustering into optimal graph-cut problems by preserving local topological correlations. As the field transitioned into deep spatial modeling based on semantic collaborative priors, researchers utilized the powerful non-linear mapping capabilities of deep neural networks to project heterogeneous data into high-order semantic spaces. This evolution encompasses several distinct research paradigms. (1) Embedding space research focuses on deep subspace clustering, employing autoencoders to learn discriminative features while maintaining cross-view consistency. (2) Latent space methods utilize probabilistic generative models, such as Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN), to align latent distributions and infer missing view information through adversarial games. (3) Augmented space paradigms introduce contrastive learning to maximize mutual information between views via InfoNCE-like losses, thereby enhancing representation robustness. (4) Topological space studies leverage Graph Neural Networks (GNN) to synergistically mine intra-view geometric structures and inter-view complementary semantics. Moving into the current era of multimodal large models, this paper prospectively explores deep alignment based on cognitive priors, where MLLMs are viewed not merely as data sources but as knowledge bases containing human-level common sense and logical reasoning abilities. We systematically elucidate the infrastructural role of MVC in empowering massive data governance. Specifically, MVC facilitates semantic deduplication to enhance data quality and employs token-level clustering to optimize mixture-of-experts (MoE) routing for expert specialization. Furthermore, MVC enables hierarchical semantic chunking, which is critical for precise document retrieval within retrieval-augmented generation (RAG) frameworks. Beyond these applications, we analyze the potential for MLLM logical reasoning to back-propagate into clustering tasks.
This synergy elevates MVC from pure statistical feature alignment to a new dimension of knowledge-driven cognitive logic consistency. Beyond theoretical frameworks, the review highlights the transformative impact of MVC in diverse real-world scenarios, ranging from multi-omics fusion in smart healthcare for cancer subtyping and spatial-spectral fusion in remote sensing for urban functional zone identification to cross-perspective pedestrian recognition in public security and unsupervised anomaly detection in industrial IoT networks. Despite these advancements, achieving a transition from laboratory benchmarks to robust industrial infrastructure requires addressing several core challenges. First, there is a scaling bottleneck: the O(N²) or O(N³) computational complexity of traditional methods must be reduced to linear levels to handle million-scale web data. Second, extreme robustness is required in open environments to handle not only severely incomplete views but also the pervasive long-tailed category imbalance, where minority abnormal samples are often overwhelmed by dominant patterns. Third, the evaluation system urgently needs reconstruction, moving away from legacy small-scale datasets with shallow features toward native heterogeneous multimodal benchmarks that reflect real-world weak alignment and complex noise distributions. Ultimately, this survey aims to provide a novel research roadmap for theoretical innovation and engineering practice, advocating for a transition toward intelligent decision systems characterized by high-level semantic decoupling and cognitive logic alignment in the era of multimodal large models.
关键词:Multi-view clustering (MVC);Prior-driven learning;Multimodal large models (MLLMs);Geometric structure;Semantic collaboration;Cognitive alignment
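As a concrete illustration of the contrastive "augmented space" paradigm surveyed above, the sketch below implements an InfoNCE-style cross-view consistency objective; the shapes, temperature, and symmetric form are illustrative assumptions rather than any single reviewed method.

```python
# Minimal InfoNCE-style cross-view consistency loss for multi-view clustering.
import torch
import torch.nn.functional as F

def cross_view_info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1):
    """z1, z2: (N, D) embeddings of the same N samples from two different views."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau            # (N, N) cross-view similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    # Each sample's positive is its counterpart in the other view; all other
    # samples act as negatives, maximizing cross-view mutual information.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```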
Liu Erhu, Yuan Sijie, Li Haowen, Xu Shengjun, Hu Yu, Yang Tiantian
DOI:10.11834/jig.260045
摘要:Objective: Building extraction from remote sensing imagery is a fundamental and challenging task in remote sensing image interpretation and has been widely applied in urban planning, land-use analysis, disaster assessment, and geographic information system updating. With the rapid development of high-resolution and very-high-resolution (VHR) remote sensing sensors, buildings in aerial images exhibit large variations in scale, shape, texture, and spatial distribution. In addition, complex backgrounds, shadow interference, and spectral similarity between buildings and surrounding objects further increase the difficulty of accurate building extraction. Although convolutional neural network (CNN) based semantic segmentation methods have achieved remarkable progress in this field, existing approaches still face two major limitations. First, many networks lack sufficient capability to model multi-scale building features, resulting in missed detections or incomplete segmentation of buildings with large scale variations. Second, due to repeated down-sampling operations and insufficient boundary supervision, the extracted building boundaries are often blurred or discontinuous, which degrades the geometric accuracy of segmentation results. To address these issues, this paper proposes a novel remote sensing building extraction network that integrates multi-level feature extraction and edge enhancement, named the multi-level feature extraction and edge-enhanced network (MFEE-Net), with the aim of improving both overall segmentation accuracy and boundary quality. Method: The proposed MFEE-Net adopts an encoder–decoder architecture and is specifically designed to jointly enhance multi-scale feature representation and boundary detail preservation. In the encoding stage, a lightweight multi-scale feature extraction encoder is constructed using a newly designed residual multi-branch convolution block (ResMBC) as the fundamental building unit. The ResMBC introduces parallel convolutional branches with different receptive fields, enabling the network to capture building structures and texture patterns at multiple spatial scales while retaining the local modeling advantages of standard convolution. The residual connection further facilitates stable training and effective feature propagation, allowing the encoder to generate rich and discriminative feature representations with relatively low computational cost. To effectively utilize features from different encoding depths, an interlayer feature fusion module (IFFM) is introduced between the encoder and decoder. Unlike simple skip connections, the IFFM jointly models spatial information and channel-wise correlations, enabling adaptive fusion of heterogeneous features from different layers. By enhancing feature complementarity and reducing semantic inconsistencies between low-level spatial details and high-level semantic representations, the IFFM alleviates information loss and improves the robustness of feature transmission during the decoding process. In the decoding stage, an edge-aware enhancement module (EAEM) is incorporated to explicitly refine building boundaries. The EAEM emphasizes edge-related features by enhancing boundary-sensitive responses and suppressing background interference, thereby improving the continuity and clarity of extracted building contours. Furthermore, a joint loss function with an edge-constrained auxiliary term is employed during training.
This loss formulation encourages the network to simultaneously optimize building foreground regions and boundary details, leading to a coordinated improvement in segmentation completeness and edge precision. Result: Extensive experiments were conducted on two widely used public benchmark datasets, namely the WHU Aerial Building Dataset and the Massachusetts Building Dataset, to evaluate the performance of the proposed MFEE-Net. Quantitative comparisons were performed against multiple state-of-the-art building extraction methods using standard evaluation metrics, including Intersection over Union (IoU), F1-score, precision, and recall. On the WHU Aerial Building Dataset, MFEE-Net achieved an IoU of 91.13%, an F1-score of 95.36%, a precision of 95.81%, and a recall of 94.92%, demonstrating superior performance in both overall accuracy and boundary consistency. On the Massachusetts Building Dataset, which contains lower-resolution imagery and poses greater challenges due to blurred edges and complex backgrounds, MFEE-Net attained an IoU of 75.46%, an F1-score of 86.01%, a precision of 87.84%, and a recall of 84.26%. These results indicate that the proposed network maintains robust performance under different spatial resolutions and scene complexities. Qualitative visual comparisons further reveal that MFEE-Net is capable of producing more complete building regions with clearer and more continuous boundaries, particularly in scenes with dense buildings, complex structures, and significant scale variations. Ablation studies validate the effectiveness of each proposed component, confirming that multi-level feature extraction, interlayer feature fusion, and edge-aware enhancement collaboratively contribute to performance improvements. Conclusion: This study proposes a novel remote sensing building extraction network, MFEE-Net, which integrates multi-level feature extraction and edge enhancement within an encoder–decoder framework. By leveraging a lightweight multi-scale encoder, an interlayer feature fusion strategy, and an edge-aware enhancement mechanism with boundary supervision, the proposed network effectively addresses the challenges of scale variation and boundary ambiguity in remote sensing building extraction. Experimental results on public benchmark datasets demonstrate that MFEE-Net achieves competitive and stable performance, significantly improving both segmentation accuracy and boundary quality. The proposed approach provides an effective solution for high-precision building extraction in complex remote sensing scenarios and offers potential for practical applications in urban analysis and geospatial information processing.
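A minimal sketch of the ResMBC idea follows, assuming illustrative branch widths and kernel choices (the third branch uses a dilated 3×3 convolution for a 5×5 receptive field); the authors' exact block configuration is not given in the abstract.

```python
# Hypothetical residual multi-branch convolution block (ResMBC) sketch.
import torch
import torch.nn as nn

class ResMBC(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, channels, 1)                       # point-wise
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)            # 3x3
        self.branch5 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)  # dilated, 5x5 field
        self.fuse = nn.Sequential(
            nn.Conv2d(3 * channels, channels, 1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Parallel branches capture building structures at multiple spatial scales.
        multi = torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)
        # Residual connection stabilizes training and preserves spatial detail.
        return self.act(x + self.fuse(multi))
```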
Zhang Zhihao, Fu Zhitao, Ji Yashuai, Zhang Xinshan, Tang Bohui
DOI:10.11834/jig.250573
摘要:Objective: Synthetic Aperture Radar (SAR), as an active microwave remote sensing technology, is capable of all-weather and day-and-night data acquisition of the Earth's surface. However, due to its coherent imaging principle, the SAR receiver inevitably introduces significant speckle noise when processing the backscattered signals. Furthermore, SAR imagery primarily reflects the dielectric properties and geometrical structures of targets, rather than the spectral characteristics familiar to the human visual system. This leads to fundamental differences between SAR and optical images in terms of imaging mechanisms, physical properties, and image characteristics. Such differences severely limit their joint application and analysis in subsequent tasks. SAR-to-optical image translation, as a data processing technique capable of converting heterogeneous remote sensing images into data with homogeneous image characteristics, can effectively address the modality gap between SAR and optical imagery. It provides crucial technical support for subsequent applications and analyses such as SAR and optical image matching and SAR and optical image change detection. Therefore, research on SAR-to-optical image translation is of significant importance. However, existing SAR-to-optical image translation methods typically employ a single generator structure, which struggles to simultaneously maintain global semantic consistency and local textural-structural details. This often leads to generated images with semantic distortions and blurred details, consequently limiting their reliability and practical utility in real-world applications. To address this issue, this paper proposes a unidirectional knowledge transfer generative adversarial network (UKT-GAN) for dual-fidelity SAR-to-optical image translation. The model aims to achieve dual fidelity in both global and local dimensions of the generated imagery through unidirectional knowledge transfer between the two branches of its dual-branch network architecture. Method: The proposed method employs a dual-branch generation framework that allocates the two fundamentally distinct tasks of local texture-structure detail reconstruction and global semantic information preservation to a Detail Reconstruction Subnetwork and a Semantic Preservation Subnetwork, respectively. This architectural design structurally overcomes the inherent limitation of single-branch generative adversarial networks in simultaneously maintaining global semantic consistency and local textural-structural fidelity. The detail reconstruction branch consists of a generator and a discriminator. The generator employs a U-Net architecture embedded with CBAM (convolutional block attention module) modules, which utilizes skip connections to achieve multi-scale fusion of feature maps from the encoder and decoder, thereby preserving richer spatial details. The discriminator adopts a shallow two-layer convolutional structure to focus on assessing the authenticity of local textural and structural details, while pixel-level loss constraints are applied to ensure the fidelity of local detail reconstruction. Similarly, the semantic preservation branch also comprises a generator and a discriminator. Its generator utilizes a ResNet framework integrated with PE-Transformer modules, leveraging the multi-head attention mechanisms and residual blocks within the PE-Transformer to perform multi-dimensional and deep extraction and integration of semantic features from the imagery.
This ensures the complete and accurate retention of semantic information throughout the feature extraction and transmission process. The discriminator employs a deeper four-layer convolutional network to evaluate the overall structural rationality and global semantic consistency of the generated images, supplemented by a combined pixel-level and feature-level loss constraint to maintain global semantic information consistency. Finally, a unidirectional consistency loss transfers the detail capture capability from the detail reconstruction branch to the semantic preservation branch. This process optimizes and refines the local textural and structural details of the optical images generated by the Semantic Preservation Subnetwork, ensuring that these generated images can simultaneously maintain both global semantic consistency and local textural-structural fidelity, thereby enhancing the overall quality of the imagery produced by the Semantic Preservation Subnetwork. Result: To validate the effectiveness of UKT-GAN in SAR-to-optical image translation, this paper conducts comparative analysis with six mainstream image translation methods, including Pix2pix, DCLGAN, Parallel-GAN, Conditional Diffusion, StegoGAN, and HVT-cGAN, on two public datasets: SEN1-2 and WHU-OPT-SAR. The evaluation employs four metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), and Root Mean Square Error (RMSE), assessing the quality of generated images from four aspects: global quality, structural similarity, deep feature fidelity, and pixel-level accuracy. On the farmland and mountain subsets of SEN1-2 and the WHU-OPT-SAR dataset, the optical images generated by UKT-GAN achieved optimal results across all four metrics: PSNR, SSIM, LPIPS, and RMSE. Particularly notable improvements were observed in the LPIPS metric, with an improvement of at least 2.3% over the second-best results (where lower LPIPS values indicate better performance). On the building and forest subsets of SEN1-2, the optical images generated by UKT-GAN achieved optimal results on SSIM and LPIPS. These findings demonstrate that the proposed UKT-GAN can generate higher-quality optical images compared to six mainstream image translation methods. Conclusion: This paper proposes a Unidirectional Knowledge Transfer Generative Adversarial Network (UKT-GAN) for dual-fidelity SAR-to-optical image translation. Through unidirectional consistency loss, the model effectively transfers the local textural-structural reconstruction capability from the Detail Reconstruction Subnetwork to the Semantic Preservation Subnetwork. This enables the generation of optical images that simultaneously maintain both global semantic information and local textural-structural details during testing, while requiring only the single Semantic Preservation Subnetwork to be deployed. Furthermore, experimental results demonstrate that compared to other methods, the proposed UKT-GAN can generate optical images with clearer structures and superior overall quality when processing different land cover types. Moreover, when confronted with data distribution shifts caused by varying sensor characteristics and imaging configurations, UKT-GAN maintains stable translation performance, exhibiting strong generalization capability. The open-source code for this paper is available at: https://www.scidb.cn/s/YNjqIf.
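One plausible form of the unidirectional consistency loss is sketched below: the semantic preservation branch is pulled toward the detail branch's output, while a stop-gradient prevents any signal from flowing back into the detail branch, so knowledge transfers in one direction only. The L1 form is an assumption; the paper's exact loss is not given in the abstract.

```python
# Hedged sketch of a one-way (teacher-style) consistency loss between branches.
import torch
import torch.nn.functional as F

def unidirectional_consistency_loss(sem_output: torch.Tensor,
                                    detail_output: torch.Tensor) -> torch.Tensor:
    """sem_output, detail_output: optical images generated by the two branches."""
    # detach() blocks gradients into the detail branch: it acts as a fixed teacher
    # for local texture/structure, refining only the semantic preservation branch.
    return F.l1_loss(sem_output, detail_output.detach())
```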
Xuan Enyun, Li You, Li Ziwei, Yao Mengmeng, Guo Renzhong
DOI:10.11834/jig.250531
摘要:Objective: Co-speech holistic motion generation aims to simultaneously achieve expressive gestures and precisely synchronized facial expressions. These two tasks have fundamentally different natures. Gesture generation is non-deterministic, representing a one-to-many mapping where the same speech can correspond to various natural motions requiring high diversity. Meanwhile, facial expression generation, especially lip movements, is deterministic, representing a one-to-one mapping that requires precise correspondence with phonemes and demands high accuracy. Existing methods face three critical limitations. First, employing fixed architectural designs, such as unidirectional conditional flows, imposes rigid task relationships and hinders models from capturing the true dynamic connections between gestures and expressions. Second, using manually designed static loss weights cannot adapt to the dynamic changes in task importance during training. Third, over-relying on minimizing differences from ground truth data leads to gesture overfitting and suppresses diversity. These deficiencies force existing methods into unavoidable trade-offs between facial synchronization and gesture diversity. This research aims to develop a unified adaptive framework that autonomously models and dynamically balances the relationship between these two tasks through learnable uncertainty mechanisms, simultaneously satisfying the dual objectives of gesture diversity and expression accuracy without manual intervention. Method: We propose a novel diffusion-based framework leveraging uncertainty-based multi-task learning for adaptive task balancing in holistic motion generation. This represents the first application of uncertainty-based loss weighting to speech-driven holistic motion synthesis. Our core innovation treats gesture and facial expression generation as distinct tasks within a unified framework, allowing their relationship to emerge naturally during training. The framework employs a denoising diffusion probabilistic model operating on concatenated gesture and facial expression representations. The architecture incorporates shared features, including WavLM audio representations, word embeddings, speaker identity, and timestep encoding, alongside task-specific features like Gaussian noise vectors and seed motion sequences, to capture both commonalities and distinct requirements of each task. Cross-local attention mechanisms capture long-range dependencies across timesteps and modalities, while self-attention layers refine task-specific patterns. The key innovation introduces learnable parameters representing task-dependent homoscedastic uncertainty for gestures and expressions, respectively. The total training objective integrates the losses of both tasks, dynamically weighted by these uncertainty parameters. This formulation automatically balances task contributions, as larger uncertainty values reduce penalties to encourage diversity, while smaller values increase penalties to enforce precision. The uncertainty parameters are jointly optimized with model parameters, enabling the dynamic discovery of optimal task weighting without manual intervention. Result: Comprehensive evaluations on the 76-hour BEAT dataset, featuring 30 speakers and a 98/16/16 data split, demonstrate significant improvements. Our method achieved the highest gesture diversity (52.5) compared to MambaTalk (51.6), DiffSHEG (47.4), DSG (48.5), and CaMN (43.2), with the best semantic relevance score (SRGR: 0.324).
For facial expressions, we obtained the lowest Fréchet Distance, outperforming MambaTalk, DiffSHEG, and SAiD. Ablation studies confirm the critical role of uncertainty-based weighting: removing it decreased gesture diversity from 52.5 to 47.2 and increased facial FD from 9.18 to 10.5. The learned uncertainty parameters converged to weights of 0.506 for gestures and 0.494 for expressions, demonstrating autonomous task balancing. Applying our mechanism to DiffSHEG and MambaTalk also improved their gesture diversity, validating its generalizability. Qualitative analysis shows that our gestures exhibit substantially greater diversity than those of baselines, which closely imitate the ground truth. User studies with 17 participants evaluating nine video groups confirmed an overwhelming preference for our method across gesture diversity, facial synchronization, and overall quality.ConclusionThis research presents a novel adaptive diffusion framework that successfully addresses the fundamental challenge of simultaneously achieving precise facial synchronization and diverse gesture generation. By introducing uncertainty-based learnable parameters within a multi-task learning paradigm, our method enables automatic optimization of task relationships, eliminating manual tuning while achieving superior performance in both deterministic expression synthesis and non-deterministic gesture generation. Experimental results demonstrate significant improvements in facial accuracy (FD: 9.18), gesture diversity (52.5), and semantic relevance (SRGR: 0.324), with user studies confirming enhanced realism. This work provides an effective solution for creating lifelike virtual agents and opens new research directions for holistic motion generation through adaptive multi-task learning. The codebase of the paper is available at: https://doi.org/10.57760/sciencedb.j00240.00175.
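To make the adaptive weighting mechanism described in this abstract concrete, the following minimal PyTorch sketch implements homoscedastic-uncertainty loss weighting over two task losses (gesture and expression). The class and variable names are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Kendall-style homoscedastic uncertainty weighting for multiple tasks.

    Learns log-variances s_i = log(sigma_i^2); the total loss is
    sum_i exp(-s_i) * L_i + s_i, so a task with larger learned uncertainty
    receives a smaller penalty (encouraging diversity) while a task with
    smaller uncertainty is penalized more strongly (enforcing precision).
    """

    def __init__(self, num_tasks: int = 2):
        super().__init__()
        # Initialize log-variances to zero, i.e. sigma^2 = 1 (equal weighting).
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])  # 1 / sigma_i^2
            total = total + precision * loss + self.log_vars[i]
        return total

# Usage sketch: the two scalars stand in for the diffusion denoising losses
# of the gesture branch and the facial-expression branch.
criterion = UncertaintyWeightedLoss(num_tasks=2)
gesture_loss = torch.tensor(0.8, requires_grad=True)
face_loss = torch.tensor(0.3, requires_grad=True)
total_loss = criterion([gesture_loss, face_loss])
total_loss.backward()  # log_vars are optimized jointly with the model
```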
Wang Zhixiang, Zhang Yayuan, Shang Wei, Yang Liu, Zhu Pengfei, Ren Dongwei
DOI:10.11834/jig.250659
摘要:ObjectiveArbitrary-scale video super-resolution (AVSR) aims to reconstruct high-resolution (HR) videos from low-resolution (LR) inputs under continuous scaling factors, including non-integer and asymmetric magnifications. Compared with fixed-scale video super-resolution (VSR), AVSR must generalize across a continuum of scales while maintaining temporal coherence amid complex motions, non-rigid deformations, and occlusions. In practice, three key issues often drive performance degradation: (i) scale generalization, where details plausible at one magnification may appear over-smoothed or over-sharpened at another; (ii) alignment error accumulation, where minor misalignments from optical-flow warping compound during recurrent propagation, causing flickering, ghosting, and motion artifacts; and (iii) robustness to unseen degradations, as real videos often diverge from training degradation models, complicating high-frequency restoration and temporal stability. This work develops an AVSR approach that enhances spatial detail recovery, temporal consistency, and scale generalization while maintaining deployment-friendly efficiency.MethodWe propose SL-AVSR, an arbitrary-scale video super-resolution framework that integrates (1) an explicit multi-scale frequency prior derived from image Laplacian pyramids; (2) second-order composite-flow-guided propagation for temporal feature transfer; (3) second-order deformable alignment refinement for sub-pixel correction near motion boundaries and non-rigid regions; and (4) a scale-aware hyper-upsampling unit for efficient continuous scaling. SL-AVSR builds on a forward-looking recurrent architecture with a lightweight look-ahead mechanism. The current HR frame is reconstructed by fusing history-propagated features with a short window of future cues, avoiding the overhead of a full bidirectional pass. First, to ensure scale-consistent guidance for detail restoration, we construct a Laplacian pyramid on the LR input to extract band-limited components representing multi-scale frequency information. These components are fused via learnable weights, enabling the network to prioritize appropriate frequency bands for different magnifications and content types. Unlike resource-intensive perceptual feature networks, this explicit prior is lightweight, interpretable, and imposes direct constraints on frequency discrepancies across scales. Second, to enhance alignment robustness in recurrent temporal aggregation, SL-AVSR employs second-order composite flow for feature propagation. Instead of using one-step displacements from single neighboring frames, we compose neighboring flows into two-step composite displacements, providing more stable cues under large motions and partial occlusions. This composite-flow-guided warping transfers features temporally, mitigating drift and curbing misalignment error accumulation. Third, to resolve residual misalignments persisting after flow-based warping—particularly around motion boundaries, non-rigid deformations, and occlusions—we introduce a second-order deformable alignment refinement module. This module predicts residual sampling offsets and modulation masks conditioned on warped features and the current context, enabling adaptive local corrections around flow-estimated displacements. The refinement is applied in both history propagation and look-ahead aggregation pathways, improving temporal feature correspondence and reducing motion artifacts. 
Fourth, to enable efficient continuous and asymmetric scaling, SL-AVSR incorporates a scale-aware hyper-upsampling unit. A compact hyper-network generates scale-specific convolution kernels that can be precomputed or cached for common output resolutions. This approach balances (i) direct interpolation (fast but limited in fidelity for large scales and fine textures) and (ii) implicit neural representation (INR)-based pixel-wise rendering (flexible but computationally expensive). By conditioning convolutional kernels on the target scale, SL-AVSR preserves convolution-based efficiency alongside arbitrary-scale flexibility.ResultTraining occurs on standard VSR/AVSR benchmarks with continuous scale sampling, with evaluation under integer, non-integer, and asymmetric magnifications. Generalization is tested by applying models trained on one dataset directly to others without adaptation. Robustness is assessed under randomized synthetic degradations and real-world videos with unknown degradations. We report distortion metrics (PSNR, SSIM) and a perceptual metric (LPIPS) for fidelity and quality, alongside qualitative comparisons and time–space profile visualizations to evaluate temporal stability (e.g., flickering and alignment artifacts). Across scaling factors (including non-integer and asymmetric) and diverse video content, SL-AVSR achieves the best or consistently competitive quantitative performance against representative AVSR and arbitrary-scale image super-resolution (AISR) baselines. The explicit Laplacian-pyramid frequency prior delivers stable gains in detail recovery and scale generalization, evidenced by higher PSNR/SSIM and lower LPIPS across most scales. Qualitatively, SL-AVSR reconstructs structured regions (e.g., thin lines, repetitive patterns, man-made textures) more reliably and preserves stochastic textures with fewer over-smoothing artifacts, especially at large magnifications where frequency information is vulnerable. For temporal consistency, the second-order composite-flow-guided propagation and deformable alignment refinement reduce motion distortions like trailing edges, ghosting, and shimmering. Time–space profiles reveal smoother, more continuous traces in SL-AVSR compared to competitors' blurred or jagged ones, indicating superior temporal aggregation. The look-ahead mechanism further boosts stability and perceptual quality by incorporating future context without a costly full-sequence backward pass. In cross-dataset tests, SL-AVSR sustains robust performance on unseen distributions, with gradual degradation as scaling increases. Under randomized and real-world degradations, it avoids severe artifact amplification, underscoring the resilience from explicit frequency guidance and second-order alignment. Efficiency analyses show SL-AVSR's favorable quality–efficiency trade-off, outperforming INR-based methods due to its kernel-generating hyper-upsampling and lightweight prior.ConclusionWe present SL-AVSR, an arbitrary-scale video super-resolution framework that unifies an explicit Laplacian-pyramid multi-scale frequency prior with second-order composite-flow-guided propagation and second-order deformable alignment refinement in a forward-looking recurrent architecture. The proposed design enhances spatial detail restoration and scale generalization while improving temporal consistency by mitigating alignment error accumulation under challenging motion patterns. 
The hyper-upsampling unit supports continuous scaling with practical efficiency, avoiding the high computational cost of pixel-wise implicit rendering. Extensive evaluations across datasets, scaling factors, and degradation conditions demonstrate SL-AVSR's strong balance of fidelity, perceptual quality, temporal coherence, and computational efficiency, positioning it as a practical solution for real-world arbitrary-scale video super-resolution. The code is publicly available through Science Data Bank:https://www.doi.org/10.57760/sciencedb.j00240.00181.
关键词:arbitrary-scale video super-resolution;recurrent neural network;second-order deformable alignment;frequency prior;hyper-upsampling unit
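The second-order composite-flow propagation described in the SL-AVSR abstract can be illustrated by chaining two one-step backward flows into a two-step displacement before feature warping. The sketch below assumes backward optical flows stored in pixels with (dx, dy) channels; the function names are hypothetical and this is not the released implementation.

```python
import torch
import torch.nn.functional as F

def backwarp(x: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp tensor x (N, C, H, W) backward by flow (N, 2, H, W) in pixels."""
    n, _, h, w = flow.shape
    # Base sampling grid in pixel coordinates (x first, then y).
    gy, gx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((gx, gy), dim=0).float().to(x.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                           # (N, 2, H, W)
    # Normalize coordinates to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)       # (N, H, W, 2)
    return F.grid_sample(x, grid_norm, align_corners=True)

def compose_second_order_flow(flow_t_tm1, flow_tm1_tm2):
    """Chain flow(t -> t-1) with flow(t-1 -> t-2) into flow(t -> t-2)."""
    # The second flow must be sampled at the positions reached by the first.
    return flow_t_tm1 + backwarp(flow_tm1_tm2, flow_t_tm1)

# Usage: propagate features from frame t-2 to frame t with the composite flow.
feat_tm2 = torch.randn(1, 16, 64, 64)
flow_t_tm1 = torch.zeros(1, 2, 64, 64)
flow_tm1_tm2 = torch.zeros(1, 2, 64, 64)
flow_t_tm2 = compose_second_order_flow(flow_t_tm1, flow_tm1_tm2)
aligned = backwarp(feat_tm2, flow_t_tm2)
```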
Yebin Liu, Yao Mu, Qi Ye, Lin Gao, Xiaoguang Han, Anpei Chen, Yueqi Duan, Sida Peng, Tianjia Shao, Hongwen Zhang, Li Zhang, Yiyi Liao, Lan Xu, Xihui Liu, Yao Yao, Ruizhen Hu, Li Yi, Yuan Guo, Zhouhui Lian, Ziwei Liu, Baoquan Chen
DOI:10.11834/jig.260114
摘要:As an interdisciplinary field spanning computer vision, graphics, artificial intelligence, and optical imaging, 3D vision serves as the core cornerstone for constructing Embodied General Intelligence (EGI) and the Metaverse. As the "Scaling Law" paradigm, upon which AI development relies, faces significantly diminishing marginal returns and encounters bottlenecks, the focus of both academia and industry is pivoting ever more clearly toward foundational subjects closely related to 3D vision, such as "World Models," "Spatial Intelligence," and "Embodied Intelligence," granting 3D vision unprecedented strategic attention and developmental opportunities. In 2025, the primary frontier trends in the field of 3D vision can be summarized as follows: 1) Feed-forward 3D reconstruction that supports spatiotemporal multi-image inputs: with breakthroughs in feed-forward 3D reconstruction technologies such as VGGT, obtaining scene structure and motion information through spatiotemporal multi-image feed-forward methods has become increasingly simple, bringing two profound impacts: firstly, it provides a solid foundation for 3D scene understanding for spatial intelligence, allowing many traditional 2D vision problems to be solved more fundamentally in 3D space; secondly, combined with efficient rendering technologies such as 3D Gaussian Splatting (3DGS), the threshold for high-quality 3D content production has been significantly lowered, paving the way for large-scale applications such as digital twins and the Metaverse. 2) The gradual fusion of 3D generation and 3D reconstruction: 3D AIGC technologies such as SAM3D support compositional and instance-level object generation under single-image input, with generation quality gradually reaching industrial-grade scanning standards, while simultaneously integrating with feed-forward reconstruction methods to gradually achieve the generation of authentic 3D structures and textures consistent with the input images; this will support feed-forward multi-instance reconstruction of dynamic complex scenes, significantly improving real-time, multimodal perception and understanding capabilities in complex scenarios. 3) The integration from video generation and world models to embodied intelligence: video generation technology is rapidly incorporating explicit or implicit 3D representations and evolving toward multi-view consistency, long sequences, and physical plausibility, directly driving the development of integrated "Perception-Generation-Interaction" world model technologies. These types of world models, combined with feed-forward 3D reconstruction technology, will form a complete "Multimodal Perception–3D Modeling–4D Generation–Real-time Interaction" 4D world model. At the same time, world model methods have begun to serve embodied intelligence, and a unified framework of "understanding-generation-execution" has begun to emerge. World models are widely regarded by the academic community as the key path to achieving generalizable embodied intelligence and ultimately leading to AGI. 4) Human behavior and video data becoming the core fuel driving breakthroughs: human operational spaces and interaction videos constitute a "data goldmine" for training embodied intelligence. 
The vast amount of human behavior videos on the internet, as well as first-person perspective data collected through simple devices, contain physical common sense, causal reasoning, and interaction preferences that serve as the natural fuel to break through the current data bottlenecks of embodied intelligence. By performing explicit 3D perceptual reconstruction or latent-space action alignment and learning on these data, a "data pyramid" base can be constructed to drive the scaling of embodied intelligence. 5) The evolution of the embodied training paradigm from imitation learning to interaction-driven reinforcement learning: the technical evolution of embodied intelligence VLA models is leaping from a supervised fine-tuning paradigm relying on expert demonstrations to a composite training architecture integrating online reinforcement learning. This shift effectively breaks the dependence on scarce high-quality data, enabling policies driven by sparse rewards to obtain generalization and exploration capabilities surpassing those of imitation learning, solving the challenges of exploration and stable updates in continuous action spaces. Simultaneously, the development of high-performance training systems and action-conditioned world models provides the infrastructure support for large-scale interaction data generation and efficient policy evolution, marking a new "post-training" stage for embodied intelligence centered on "interaction-driven" approaches. The selected top ten research advancements of the year in the field of 3D vision include: 1) Feed-forward 3D reconstruction constructing the foundation models for 3D vision (spatial intelligence); 2) The convergence of reconstruction and generation technical routes (video generation/3D generation), moving from mutual assistance to preliminary integration; 3) 3DGS/4DGS continuously improving representation efficiency, sparking a surge in scene modeling and volumetric video applications; 4) 3D generation: a leap from single-object visual realism to structuralized components/scenes and physical interactivity; 5) From video generation to world models: oriented toward spatiotemporal consistency, physical plausibility, and interactivity; 6) Unified multimodal large models for understanding and generation serving spatial intelligent perception; 7) Frontier shifts in digital humans: from appearance modeling to multimodal interaction; 8) Human data becoming the essential fuel to break through the Scaling Law of embodied intelligence; 9) Embodied intelligence foundation models evolving toward unified models of integrated "understanding-imagination-execution"; 10) The "post-training" moment of embodied intelligence: the paradigm shift of VLA models from imitation learning to online RL. Collectively, these breakthroughs have established the prototype of an integrated intelligent architecture characterized by “Multimodal perception - 3D modeling - 4D Generation - Real-time interaction”, providing critical technical support for the substantive advancement of spatial and embodied intelligence. To promote academic discourse, this paper extensively analyzes frontier trends in 3D vision and curates the top ten annual research advances, offering valuable reference perspectives for both academia and industry.
关键词:3D vision;Embodied AI;World model;reconstruction and generation;spatial intelligence
摘要:ObjectiveHyperspectral Image (HSI) classification is critical in remote sensing and widely used in land cover monitoring, agricultural survey, and urban planning. Mamba-based models have been increasingly applied in HSI classification due to their advantages in linear computational complexity and long-range dependency modeling. However, existing Mamba-based methods suffer from insufficient spatial information utilization and unreasonable spatial-spectral feature fusion, which leads to spatial information erosion and feature submergence, thereby limiting classification accuracy and efficiency. In this context, this study proposes a spatial information enhancement-based method, named SE-Mamba (Spatial Enhancement-Mamba), to improve classification accuracy and efficiency through effective integration of spatial and spectral information.MethodSE-Mamba incorporates two key designs focusing on the effective introduction and reasonable fusion of spatial information. First, a full-process spatial information enhancement mechanism is constructed, consisting of a front-end Spatial Enhancement Feature Extractor (SEFE) and a back-end High-Order Feature Refinement (HFR) module. Before the features are serialized and processed, SEFE explicitly encodes local structural priors and geometric dependencies into the feature map to alleviate the spatial information loss caused by Mamba serialization; HFR restores fine-grained geometric structures through high-order interaction and dual gate control enhancement mechanisms. Second, a rational spatial-spectral fusion architecture, namely the Spatial Spectral Collaborative Module (SSCM), is designed, which includes a Spatial-Spectral Fusion Module (SSFM). The SSCM decouples spatial and spectral features into two separate branches to strengthen the independent representation of heterogeneous features, while the SSFM adopts a "calibration-first, then-fusion" strategy to achieve in-depth integration through cross-guidance and adaptive weight allocation, thereby avoiding spatial information erosion. For experimental verification, four representative hyperspectral datasets (HanChuan, HongHu, Houston, and PaviaU) covering agricultural and urban scenes are used to evaluate the model's performance and robustness. Key evaluation indicators include Overall Accuracy (OA), Average Accuracy (AA), and the Kappa coefficient. The ablation study evaluates SEFE, HFR, and SSCM separately on top of the baseline model and, by removing each of them from the complete framework, analyzes the contribution of each module, the synergy among modules, and the coupling effect of the sub-modules within SSCM. Computational complexity, parameter size, and inference speed are also evaluated to assess the model's efficiency.ResultExperimental results on the four representative datasets (HanChuan, HongHu, Houston, and PaviaU) demonstrate that SE-Mamba achieves the best Overall Accuracy (OA) and Average Accuracy (AA), with its Kappa coefficient also reaching a level comparable to the state-of-the-art methods. Specifically, SE-Mamba attains an average OA of 96.07% across the four datasets, surpassing the benchmark model MambaHSI by 2.32%. In terms of efficiency, the computational complexity and parameter size of SE-Mamba are comparable to those of mainstream methods, while its inference speed is superior to that of some comparison models, achieving a good balance between classification accuracy and computational efficiency.
Ablation experiments verify the effectiveness of each core module in spatial feature representation. Compared with existing Mamba-based methods (e.g., MambaHSI), SE-Mamba effectively addresses spatial information erosion and feature submergence through spatial enhancement and optimized fusion, while preserving Mamba's linear computational advantage. Compared with traditional CNN/Transformer-based methods, SE-Mamba combines state space modeling with spatial enhancement, achieving more stable performance in complex scenes.ConclusionExperiments verify that the combination of explicit spatial enhancement and state space modeling is effective, and the two core strategies of SE-Mamba synergistically alleviate spatial information erosion and feature submergence. By strengthening spatial feature extraction and optimizing spatial-spectral fusion, SE-Mamba maintains stable and efficient classification performance on complex agricultural, urban, and multi-category HSI datasets, achieving improved classification accuracy and efficiency. SE-Mamba provides a novel approach for HSI classification and serves as a reference for state space-based remote sensing image processing, offering technical support for land cover monitoring and agricultural survey. Future work could consider designing adaptive scanning mechanisms and introducing transfer learning to enhance the model's adaptability to complex scenarios and cross-regional generalization ability, and promote its practical application on portable devices through lightweight design. The dataset and code related to this article have been shared [DOI: 10.57760/scientificdb.j00240.00182].
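As a rough illustration of the "calibration-first, then-fusion" strategy with cross-guidance and adaptive weight allocation described for the SSFM above, the following PyTorch sketch gates each branch with the other branch and then fuses the calibrated features with learned weights. Module and parameter names are assumptions, not the SE-Mamba code.

```python
import torch
import torch.nn as nn

class CrossGuidedFusion(nn.Module):
    """Illustrative 'calibrate first, then fuse' spatial-spectral fusion.

    Each branch is recalibrated by a gate predicted from the other branch
    (cross-guidance), and the calibrated features are combined with
    adaptive weights so that neither branch submerges the other.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.gate_from_spectral = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())
        self.gate_from_spatial = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())
        # Adaptive per-branch fusion weights predicted from the joint features.
        self.weight_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2, kernel_size=1),
            nn.Softmax(dim=1))

    def forward(self, spatial_feat, spectral_feat):
        # Calibration: each branch is modulated by a gate from the other branch.
        spatial_cal = spatial_feat * self.gate_from_spectral(spectral_feat)
        spectral_cal = spectral_feat * self.gate_from_spatial(spatial_feat)
        # Fusion: adaptive weights balance the two calibrated branches.
        w = self.weight_head(torch.cat([spatial_cal, spectral_cal], dim=1))
        return w[:, 0:1] * spatial_cal + w[:, 1:2] * spectral_cal

fusion = CrossGuidedFusion(channels=32)
out = fusion(torch.randn(2, 32, 15, 15), torch.randn(2, 32, 15, 15))
```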
Chen Yaoyi, Wang Na, Peng Yanchun, Chen Jiahao, Qiao Pengxu, Wang Wei, Qin Chuan
DOI:10.11834/jig.250661
摘要:ObjectiveScreen-shooting resistant watermarking technology is an effective copyright authentication technique that has received widespread attention in recent years. It utilizes a pre-designed embedding algorithm to embed secret information into the cover image. When the copyright of the image is found to be infringed, the corresponding extraction algorithm can be used to extract the secret information, thereby achieving copyright protection. However, with screen content images serving as a predominant medium for information transmission, the widespread use of screen-shooting enables low-cost duplication of screen content, posing severe challenges to copyright protection. Although screen-shooting resistant robust watermarking technology has emerged as an effective solution for copyright authentication and infringement tracing, existing research faces a critical problem: the lack of dedicated screen content image datasets. Current deep learning-based watermarking models are primarily trained on natural image datasets such as ImageNet, COCO, and MIR-Flickr. These natural image datasets focus on real-world scenes with rich textures and complex color distributions, which differ fundamentally from screen content images characterized by text-dominated content, large uniform color backgrounds, and vector-based elements like lines and diagrams. This domain gap leads to significant visual quality degradation (e.g., visible artifacts) when models trained on natural images are applied to screen content. Additionally, existing screen-related datasets are designed for quality assessment tasks and contain a large number of noise-processed images, making them unsuitable for training robust watermarking models that require accurate simulation of real-world screen content scenarios. To address these issues, this study aims to construct a large-scale, high-quality dedicated screen content image dataset to support the development of screen-shooting resistant robust watermarking technologies, thereby bridging the performance gap between natural image and screen content watermarking applications and enhancing the practicality of copyright protection for screen-based information.MethodA screen content image dataset for screen-shooting resistant watermarking (SCID) was constructed specifically for screen-shooting resistant robust watermarking tasks. The dataset was categorized into six functional themes based on common usage scenarios: webpage applications, chat applications, programming environments, engineering drawings, online meetings, and office applications. For webpage images, diverse sources were collected through search engines, including official websites, social media platforms, open-source project repositories, and large language model conversation interfaces, covering text-heavy pages, image-rich content, and interactive interfaces. Chat images included interfaces from popular communication software such as QQ and WeChat, as well as public account push and conversation interfaces. Programming images captured code displays and runtime results from various development platforms. Engineering drawing images consisted of both 2D blueprints and 3D model renderings, covering mechanical, architectural, and electrical design scenarios. Online meeting images were collected from remote collaboration tools like Tencent Meeting and FeiShu, including live lectures, video conferences, and remote desktop control scenes.
Office images were captured from common office software (Microsoft Office and WPS), including Word documents, Excel spreadsheets, PDF files, and PPT presentations. In total, the SCID contains 17 101 high-resolution images, integrating text, images, diagrams, and video frames to simulate both daily and professional screen usage scenarios.ResultTo validate the effectiveness of SCID, five deep learning watermarking methods (StegaStamp, MBRS, PIMoG, HiFiMSFA, and MTVDGAN) were selected for comparative experiments, and the performance of the models was evaluated from three key dimensions: visual quality, robustness against digital attacks, and robustness against real screen-shooting attacks. For visual quality (assessed by PSNR and SSIM), models trained on natural image datasets showed a significant drop of 2~4 dB in PSNR when tested on SCID, indicating obvious visual artifacts in screen content watermarking. In contrast, models trained on SCID maintained stable PSNR and SSIM when tested on natural image datasets, demonstrating that SCID-trained models retain excellent visual quality across domains without additional fine-tuning. In digital attack experiments (including random cropping, JPEG compression, Gaussian blur, Gaussian noise, median filtering, and salt-and-pepper noise), the accuracy difference (AD) of SCID-trained models was consistently better than that of natural image-trained models. This indicates that SCID-trained models have smaller performance fluctuations when transferred between screen content and natural images, reflecting stronger versatility. For real screen-shooting attacks, although the SCID-trained model did not show a significant performance improvement over the model trained on natural image datasets, the fluctuation range of AD remained within 0.1%, indicating that the performance fluctuation is still within an acceptable range. These results collectively confirm that SCID not only improves the visual quality of screen content watermarking but also maintains strong robustness against both digital and real-world screen-shooting attacks, while ensuring excellent generalization to natural images.ConclusionThis study addresses the lack of dedicated datasets for screen-shooting resistant watermarking by constructing the large-scale SCID, which covers 17 101 images across six practical themes. Comparative experiments using five watermarking methods demonstrate that SCID effectively resolves the visual quality degradation issue of natural image-trained models when applied to screen content, while enabling models to retain stable performance on natural images. The dataset's diverse content and realistic scenario simulation enhance the generalization and practicality of watermarking models, providing critical data support for the development of screen content copyright protection technologies. Additionally, SCID can serve as a benchmark dataset for screen-shooting resistant watermarking research, promoting standardized evaluation and technological innovation in the field. This work contributes to advancing copyright protection for digital screen content and provides a foundation for addressing infringement and information leakage challenges in cross-media transmission scenarios. To facilitate replication and verification by academic peers, the dataset constructed in this paper will be made public upon acceptance of the paper, and the complete download link will be provided in the article at that time.
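The visual-quality and extraction-accuracy evaluation described above relies on standard metrics. The sketch below shows a minimal PSNR and bit-accuracy computation for a watermarking model, assuming images normalized to [0, 1] and binary messages predicted as logits; all names are illustrative rather than taken from the compared methods.

```python
import torch

def psnr(cover: torch.Tensor, stego: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between cover and watermarked images in [0, max_val]."""
    mse = torch.mean((cover - stego) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def bit_accuracy(msg_true: torch.Tensor, msg_logits: torch.Tensor) -> torch.Tensor:
    """Fraction of correctly extracted watermark bits (msg_true holds 0/1 values)."""
    msg_pred = (msg_logits > 0).float()
    return (msg_pred == msg_true).float().mean()

# Usage sketch with dummy tensors standing in for a watermarking model's outputs.
cover = torch.rand(1, 3, 256, 256)
stego = cover + 0.01 * torch.randn_like(cover)   # hypothetical embedded image
print(f"PSNR: {psnr(cover, stego):.2f} dB")

msg = torch.randint(0, 2, (1, 30)).float()       # 30-bit hypothetical message
logits = (msg - 0.5) * 4 + 0.1 * torch.randn_like(msg)
print(f"Bit accuracy: {bit_accuracy(msg, logits):.3f}")
```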
Yang Jingxiang, Zeng Jian-An, Diao Wenxiu, Xiao Liang
DOI:10.11834/jig.260110
摘要:Hyperspectral images (HSIs) contain rich spatial-spectral information, providing high discriminative ability and supporting wide applications such as remote sensing, geological examination, and medical diagnosis. Traditional spectral imaging technologies, including whisk-broom, push-broom, and staring imaging, suffer from bulky equipment, long acquisition periods, and limited spatial-temporal resolution, which hinders their use in dynamic scenes and on moving platforms. Recently, compressive spectral imaging has attracted growing research interest. The coded aperture snapshot spectral imaging (CASSI) system can capture the compressed measurement of a 3D HSI within a single exposure, and this high efficiency has made it a hot spot in computational imaging. One key technology in CASSI is HSI reconstruction, which aims to restore a high-quality latent HSI from the compressed measurement. Over the last decades, numerous HSI reconstruction algorithms have been proposed. In this overview, we comprehensively review recent advancements in spectral imaging and reconstruction methods. First, we analyze the physical process of compressive spectral imaging and formulate the spatial-spectral degradation model. Then, we model CASSI reconstruction as an ill-posed inverse problem, which requires priors as regularization to reduce the solution space. Taking the prior as the organizing viewpoint, we divide current HSI reconstruction technologies into four categories: 1) model-driven methods based on hand-crafted priors, 2) data-driven methods based on deep learning networks, 3) model-data joint-driven methods based on deep priors, and 4) the recently proposed generative diffusion prior. Through this structured analysis, the overview aims to offer valuable insights into the core ideas, design paradigms, and evolution of different methods, highlight the persistent challenges, and provide an outlook on future development trends. The model-driven methods rely on hand-crafted priors; various priors such as total variation and sparsity have been proposed as regularization in the HSI reconstruction problem. They are mathematically interpretable and can generalize to different imaging systems as long as the degradation model is accurate, but hand-crafted priors may be simplistic and fail to fully capture the complex spatial-spectral characteristics of HSIs. Moreover, the iterative optimization process of the reconstruction model is computationally expensive for real-time applications, and tuning its hyper-parameters is also difficult. The data-driven methods use deep learning networks to learn a mapping between measurements and HSIs. Different networks (e.g., convolutional networks and Transformers) have been designed to exploit spatial-spectral features for HSI reconstruction. Normally, after learning complex data-driven features, high-fidelity HSIs can be inferred efficiently. However, such networks are black boxes with limited interpretability; moreover, they may fail catastrophically when the spatial-spectral degradation is unknown or unseen during inference. The model-data joint-driven methods combine the strengths of model-driven and data-driven approaches. They originate from the traditional HSI reconstruction model but replace the hand-crafted prior with an implicit deep prior. Classic optimization algorithms are used to minimize the HSI reconstruction model, and the iterative solutions are unrolled into a deep network, with each iterative solution becoming an unfolding stage in the network.
The hand-crafted prior is replaced with a learnable denoiser acting as a deep proximal operator, and the unrolled network is trained in an end-to-end manner. Since the network is designed under the guidance of imaging physics, it offers higher interpretability and robustness under varying degradations than purely data-driven methods, and by learning the deep prior it achieves higher reconstruction quality than hand-crafted priors. However, these networks can be regarded as discriminative models learned with regression losses; they tend to produce deterministic results that are effectively the "average" of the distribution of potential ground truths, which leads to blurry outputs and hinders the reconstruction of fine-grained image structures. Diffusion models can generate diverse and highly realistic content, so leveraging the generative diffusion prior may remedy this limitation and has shown potential in HSI reconstruction. Furthermore, in this overview we select 12 mainstream HSI reconstruction methods and compare their performance on widely used datasets. The experimental code and data are available at: https://github.com/DDXNJUST/Computational-Imaging/. Based on the experimental results, we finally discuss the shortcomings of existing works and future research directions, including several pain points such as representing complex spatial-spectral features, limited generative ability and content distortion, and the disconnection among compressive imaging, HSI reconstruction, and downstream tasks. The purpose of this overview is to provide a comprehensive introduction to spectral imaging and reconstruction, and to present valuable insights for future advancement.
关键词:compressive spectral imaging;computational reconstruction;imaging model;deep learning;model and data driven
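The spatial-spectral degradation model formulated in this overview can be summarized, for the single-disperser CASSI case, as coding each band with the aperture mask, shifting it by a band-dependent dispersion, and summing the bands into one 2D measurement. The NumPy sketch below assumes a one-pixel-per-band shift and is only an illustration of the forward model, not a specific system's calibration.

```python
import numpy as np

def cassi_forward(hsi: np.ndarray, mask: np.ndarray, step: int = 1) -> np.ndarray:
    """Simulate a single-disperser CASSI measurement.

    hsi  : (H, W, L) hyperspectral cube
    mask : (H, W) binary or real-valued coded aperture
    step : dispersion shift in pixels per band along the width axis
    Returns a single 2D measurement of shape (H, W + step * (L - 1)).
    """
    h, w, bands = hsi.shape
    meas = np.zeros((h, w + step * (bands - 1)), dtype=hsi.dtype)
    for l in range(bands):
        coded = hsi[:, :, l] * mask                 # aperture coding
        meas[:, l * step : l * step + w] += coded   # band-dependent shift + sum
    return meas

# Usage sketch with a random cube and a random binary mask.
cube = np.random.rand(64, 64, 28)
mask = (np.random.rand(64, 64) > 0.5).astype(np.float64)
y = cassi_forward(cube, mask)
print(y.shape)  # (64, 91)
```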
Feng Jianjiang, Jia Wei, Li Qi, Cui Zhe, Zhao Cairong, Lei Zhen, Wang Caiyong, Kang Wenxiong, Yu Shiqi, Fei Lunke, Li Xiaobai, Ye Mang, Wei Jianze, Cao Shiwen, Sun Shibo, Xie Tianming, Zheng Weishi, Yang Hongyu, Huang Junduan
DOI:10.11834/jig.260069
摘要:Biometric recognition has become a fundamental enabling technology for modern digital society, serving as a core infrastructure for identity authentication, access control, and human-centered intelligence. By exploiting intrinsic physiological and behavioral characteristics, biometrics offers inherent advantages over traditional knowledge- or token-based authentication mechanisms, including higher security, improved usability, and stronger resistance to impersonation. Over the past decade, and especially since 2021, the rapid advancement of deep learning, sensing technologies, and large-scale data resources has profoundly reshaped the landscape of biometric research and applications. Biometric systems have evolved from task-specific, handcrafted pipelines into data-driven, end-to-end intelligent systems capable of operating in complex, unconstrained, and large-scale real-world environments. This report provides a comprehensive review of the development of biometric recognition technologies from 2021 to 2025, covering both methodological advances and emerging application paradigms. We focus on major biometric modalities, including face, iris, fingerprint, palmprint, finger and palm vein, body, gait, and person re-identification, while also highlighting the increasingly critical role of security, privacy protection, and trustworthiness in biometric systems. Face recognition remains the most widely deployed biometric modality due to its non-contact nature, low acquisition cost, and high social acceptance. Recent progress has been driven by innovations in network architectures, large-scale training data, and discriminative loss functions. Convolutional neural networks and vision transformers have significantly improved representation capacity, while margin-based and quality-aware losses have enhanced intra-class compactness and inter-class separability. At the same time, face detection and alignment have advanced toward robust performance under extreme conditions such as low resolution, severe illumination variation, occlusion, and large pose changes. Beyond recognition, face generation and synthesis have emerged as both an enabling technology and a security challenge. Generative adversarial networks and diffusion models now support high-fidelity, controllable, and even 3D-aware face generation, facilitating data augmentation, virtual humans, and animation, while simultaneously raising new threats in the form of deepfakes and identity spoofing. Iris recognition continues to be regarded as one of the most secure biometric modalities due to the high uniqueness and stability of iris texture. In recent years, research has shifted from controlled laboratory settings toward less constrained and mobile scenarios. Advances in iris acquisition include visible-light imaging, mobile-device-based capture, and near-eye sensing in VR/AR environments. Deep learning-based iris segmentation and localization methods have greatly improved robustness against noise, occlusion, and cross-domain variations. In parallel, iris feature representation has evolved from handcrafted binary codes toward deep embeddings and hybrid models that preserve compatibility with traditional matching schemes.
Synthetic iris data generation has gained attention as a means to address data scarcity and privacy constraints, although its impact on large-scale recognition performance and potential security implications remain open research questions. Fingerprint recognition, as one of the most mature biometric technologies, has undergone a significant transformation with the adoption of deep learning. Modern fingerprint systems address challenges such as low-quality latent prints, partial fingerprints, distortion, and cross-sensor variability through deep enhancement, minutiae extraction, dense descriptors, and multi-stage matching frameworks. The emergence of new fingerprint modalities, including contactless fingerprints, 3D fingerprints, and optical coherence tomography (OCT)-based internal fingerprints, has expanded the representational space of fingerprint biometrics and enabled improved robustness against distortion and spoofing. At the same time, large-scale real and synthetic fingerprint datasets have facilitated systematic evaluation and training of deep models. Palmprint and palm-vein recognition have experienced rapid growth in both research and large-scale deployment. Advances in deep learning have addressed long-standing challenges in region-of-interest extraction, alignment, and feature robustness, enabling commercial applications such as contactless palm payment systems. Multimodal fusion of palmprint and vein patterns has further enhanced recognition accuracy and security. In addition, progress in palm image synthesis and 3D palm modeling has opened new directions for data augmentation and unconstrained recognition. Body-based biometrics, including person re-identification and gait recognition, play an increasingly important role in surveillance and public safety scenarios where close-range or cooperative acquisition is not feasible. Recent work has focused on cross-view, cross-domain, and long-term recognition under clothing changes, occlusion, and low-resolution conditions. Deep spatiotemporal modeling, attention mechanisms, and sequence-level representations have significantly improved robustness, while also raising concerns about privacy, fairness, and ethical deployment. Alongside performance improvements, security and privacy have become central themes in biometric research. Biometric systems are increasingly exposed to sophisticated attacks, including presentation attacks, adversarial examples, template inversion, and deepfake-based impersonation. Consequently, substantial efforts have been devoted to spoof detection, adversarial defense, secure template protection, and cancellable biometrics. Privacy-preserving techniques such as template transformation, encryption, and federated or decentralized learning are gaining importance as regulatory requirements and public awareness intensify. Ensuring that biometric systems are not only accurate but also trustworthy, explainable, and compliant with data protection regulations is now a core research objective. Beyond identity authentication, biometrics is expanding toward broader human-centered applications, including human–computer interaction, healthcare monitoring, behavioral analysis, and immersive virtual environments. This shift reflects a transition from "who you are" to "how you are," where biometric signals contribute to comprehensive perception and intelligent interaction rather than mere identification. In summary, the period from 2021 to 2025 has witnessed rapid and multifaceted progress in biometric recognition.
Advances in deep learning, sensing, and data generation have substantially improved accuracy, robustness, and scalability across modalities, while new challenges in security, privacy, and societal impact have emerged. By systematically reviewing recent developments across core biometric technologies and security frameworks, this report aims to provide a holistic perspective on the current state of the field and to outline promising directions for future research and deployment. We hope this work will serve as a valuable reference for researchers, engineers, and policymakers seeking to understand and shape the next generation of biometric systems.
关键词:biometrics;face recognition;iris recognition;fingerprint and palmprint recognition;finger and palm vein recognition;person re-identification;gait recognition;spoof detection
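As an example of the margin-based losses mentioned in the face-recognition discussion above, the sketch below implements an additive angular margin (ArcFace-style) classification loss in PyTorch. The scale and margin values are common illustrative choices, not prescriptions from the surveyed works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAngularMarginLoss(nn.Module):
    """ArcFace-style loss: an angular margin m is added to the target-class angle
    before a scaled softmax, tightening intra-class compactness and widening
    inter-class separation."""

    def __init__(self, feat_dim: int, num_classes: int, s: float = 64.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the margin only to the ground-truth class angle.
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).float()
        logits = torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(self.s * logits, labels)

# Usage sketch with hypothetical embedding size and identity count.
criterion = AdditiveAngularMarginLoss(feat_dim=512, num_classes=1000)
loss = criterion(torch.randn(8, 512), torch.randint(0, 1000, (8,)))
```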
摘要:Active speaker detection (ASD) aims to identify speakers and their active speech intervals within video sequences by leveraging both audio and visual modalities. ASD serves as a foundational technology for applications such as media content analysis, human-computer interaction, intelligent meeting systems, and audio-visual speech recognition. Despite the significant progress driven by the rapid development of deep learning since 2015, real-world deployment still encounters challenges from complex environmental factors such as visual occlusions, acoustic interference, overlapping speech, and dynamic camera movements. To address these developments and challenges, this survey provides a comprehensive review of ASD technologies over the past 25 years, categorizing existing methodologies into vision-based and audio-visual methods. The first category, vision-based methods, infers speech activity entirely from visual cues, such as lip contours, facial movement, and body gestures. These methods are valuable where audio is entirely missing or heavily corrupted by acoustic interference. While immune to acoustic degradation, vision-based methods inherently struggle to distinguish actual speech from non-speech lip movements and are highly sensitive to low image resolution, non-frontal head poses, and occlusions. The second category, audio-visual methods, constitutes the mainstream of current research by harnessing the complementary nature of auditory and visual signals. This survey further subdivides this category into three major paradigms. (a) Matching-based methods identify speakers by learning cross-modal correspondences, typically without extensive manual annotations. This paradigm is split into two distinct routes: synchronization-based and identity-based association. Synchronization-based methods measure the short-term temporal alignment between lip motions and acoustic signals, utilizing contrastive learning to project audio and visual features into a shared embedding space. While these methods benefit from self-supervised learning paradigms, they require tight audio-visual synchronization and can fail under desynchronization or in dubbed videos. Alternatively, identity-based association methods focus on long-term consistency. They typically cluster acoustic speaker embeddings and facial feature sequences separately and then associate voices with faces based on co-occurrence statistics or cross-modal face-voice matching networks. This route is highly robust to dubbing, off-screen voices, and poor visual quality (e.g., in egocentric videos) but relies heavily on the accuracy of intermediate clustering steps. (b) Fusion-based classification methods formulate ASD as a fully supervised "speaking vs. non-speaking" binary classification task for each candidate face at every time step. This pipeline generally involves four crucial stages: feature extraction, feature fusion, temporal modeling, and final speaker activity detection. In the feature extraction stage, modern architectures employ large-scale pre-trained acoustic encoders and deep visual backbones. To effectively integrate these multi-modal streams, dynamic fusion strategies such as cross-attention mechanisms, gating networks, and uncertainty-aware adaptive fusion have largely replaced simple static concatenation. Furthermore, temporal context modeling has evolved from local, short-term processing to Recurrent Neural Networks (RNNs) and then global spatiotemporal reasoning using Transformers and Graph Neural Networks (GNNs). 
By explicitly modeling the complex interactive dynamics among multiple candidate speakers and the global scene context, fusion-based classification achieves state-of-the-art accuracy on most benchmarks. However, it demands large amounts of densely annotated data and suffers from domain shift issues. (c) Hybrid methods seek to combine the complementary strengths of both matching and classification paradigms to tackle complex scenarios. By integrating short-term speech behavior (via synchronization or classification) with long-term identity verification (speaker profiles), hybrid systems effectively suppress interference from non-target speakers, overlapping voices, and off-screen narrators, thereby significantly enhancing overall robustness in real-world environments. Beyond this algorithmic taxonomy, this survey also extensively summarizes benchmark datasets and evaluation metrics commonly used in the ASD community. The paradigm shift in dataset curation is traced from early, heavily constrained laboratory recordings with limited participants to large-scale, in-the-wild datasets. Modern benchmarks feature thousands of hours of video spanning movies, video logs, egocentric wearable camera views, and even surveillance footage. Commonly used evaluation metrics such as mean Average Precision (mAP) are also discussed. Finally, this survey concludes by highlighting the technical trends and outlining several persistent open problems. Despite achieving near-perfect scores on certain benchmarks, current state-of-the-art models exhibit limited cross-dataset generalization, particularly struggling with diverse languages, out-of-domain scenarios, and extreme conditions. Moreover, existing systems lack a deep semantic understanding of conversational dynamics, such as turn-taking logic, interruptions, and non-verbal social cues. To address these bottlenecks, future research should focus on constructing more inclusive datasets, exploring data-efficient learning, integrating Large Language and Vision-Language Models (LLMs/VLMs) for semantic reasoning, and developing lightweight architectures for edge deployment.
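The synchronization-based matching paradigm described in this survey typically projects audio and lip features into a shared embedding space with a contrastive objective. The following sketch shows a symmetric InfoNCE loss over time-aligned audio/visual embeddings; the shapes and temperature value are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def av_sync_contrastive_loss(audio_emb: torch.Tensor,
                             visual_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of time-aligned audio/lip embeddings.

    audio_emb, visual_emb: (B, D) features of the same B time windows;
    row i of each tensor forms the positive pair, every other row a negative.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Audio-to-visual and visual-to-audio retrieval losses, averaged.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage sketch with random embeddings standing in for encoder outputs.
loss = av_sync_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
```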
摘要:ObjectiveIn recent years, brain networks have become an indispensable cornerstone in brain disorder research. Currently, functional magnetic resonance imaging (fMRI) is commonly used to construct functional connectivity (FC) networks, while diffusion tensor imaging (DTI) is utilized to build structural connectivity (SC) networks. Yet two limitations remain prominent: many approaches still operate on a single modality or adopt shallow multi-modal fusion, leaving the potential dependencies between SC and FC under-explored; and although Structure-Function Coupling (SFC) has been shown to differ across brain regions between healthy controls and patients and to hold biomarker potential, its relationship to classification has not been systematically integrated into representation learning. This study aims to operationalize SFC as an explicit prior that guides multi-modal representation learning, so that high-order cross-modal associations can be captured and translated into robust diagnostic performance.MethodThis paper presents SFC-HGNN, a Structure-Function Coupling-guided foundational framework that combines a dual-stream encoder, cross-modal reconstruction pretraining, and hypergraph computation to model high-order relationships within and between SC and FC. The core design is to use an SFC matrix as a bridge that explicitly guides two hypergraph neural network (HGNN) branches—one per modality—so that each branch learns modality-appropriate group relations while remaining informed by the other modality. Concretely, the SFC matrix is first constructed to capture the consistency between FC and SC patterns across brain regions using a rank-based correlation measure, and this matrix is fused with each modality through shallow feature mapping to form initial node-level features for the two streams. On the functional stream, hyperedges are formed via a sparse-representation method, providing data-driven, high-order groupings appropriate for FC. On the structural stream, hyperedges are built with a k-nearest-neighbor strategy to reflect global structural associations. Each branch then employs HGNN layers to realize high-order message passing on its respective hypergraph, thereby encoding latent dependencies between SC and FC under SFC guidance. To fully exploit cross-modal dependencies, a cross-modal reconstruction pretraining task is introduced. During pretraining, the functional stream is trained to reconstruct the structural connectivity matrix and the structural stream to reconstruct the functional connectivity matrix via decoders. The decoders are optimized with a reconstruction objective together with a constraint that emphasizes appropriate symmetry and sparsity of the reconstructed matrices. This pretraining forces the encoder to internalize modality-bridging information under SFC guidance. In the subsequent downstream tuning phase, the encoder parameters are frozen; latent node representations from the two branches are flattened and concatenated into a global feature vector, which is then fed into a lightweight multilayer perceptron (MLP) classifier optimized with cross-entropy. Freezing the encoder keeps downstream training simple and stable while preserving the cross-modal dependencies captured during pretraining.ResultExperiments are conducted on two public multi-modal brain imaging datasets. On ADNI, we use 332 subjects (64 AD, 129 MCI, and 139 NC). On ABIDE, we include 86 subjects with paired fMRI and DTI (ASD/NC).
For both datasets, we adopt the AAL atlas; fMRI data are preprocessed with DPARSF and DTI data with PANDA. The model is implemented in PyTorch and trained on an NVIDIA RTX 4090 GPU. We follow a two-stage training protocol (cross-modal pretraining followed by frozen-encoder tuning) and evaluate with 5-fold cross-validation, reporting ACC, AUC, F1, and specificity. We compare SFC-HGNN against representative single-modality baselines (BrainNetCNN, GAT, HGNN+, BrainGNN) and multi-modality methods (MME-GCN, Cross-GNN). Across diagnostic tasks, SFC-HGNN achieves state-of-the-art performance, consistently improving ACC, AUC, and F1. While certain baselines occasionally yield higher specificity, these cases are often accompanied by markedly lower F1, suggesting reduced stability; in contrast, SFC-HGNN maintains a better overall balance among metrics and demonstrates stronger robustness. Ablation studies further isolate the contributions of the proposed components: introducing SFC as a cross-modal bridge (without pretraining) improves accuracy by 1.6% and 2.3% on AD vs. NC and ASD vs. NC, respectively, and adding cross-modal reconstruction pretraining brings additional gains of 2.1% and 1.9%. These results indicate that SFC guidance together with cross-modal reconstruction effectively encourages the encoder to capture latent SC-FC dependencies that transfer to downstream classification. Finally, to assess interpretability, we conduct significant-hyperedge analysis using group-level t-tests and visualize discriminative functional and structural hyperedges for AD vs. NC and ASD vs. NC; the identified regions and connections align with known patterns of network alteration, supporting the neurobiological plausibility of the learned high-order representations.ConclusionBy treating SFC as an explicit bridge and pairing dual HGNN encoders with a cross-modal reconstruction pretraining paradigm, this study introduces a principled multi-modal framework for brain network analysis and brain disorder diagnosis. The approach encourages each modality to be informed by the other and to encode high-order associations that are useful for downstream classification, while a frozen-encoder tuning strategy keeps optimization stable and lightweight. On ADNI and ABIDE, SFC-HGNN consistently surpasses single and multi-modality baselines, with gains reflected not only in accuracy and AUC but also in a better balance between F1 and specificity, highlighting robustness. The significant-hyperedge findings further provide neurobiologically plausible insights that complement the quantitative improvements. Overall, SFC-HGNN advances multi-modal brain disorder diagnosis by unifying SFC-guided hypergraph encoding, cross-reconstruction pretraining, and simple downstream tuning into a coherent pipeline that achieves superior and reliable performance across datasets.
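One common way to realize the rank-based structure-function coupling measure described in this abstract is to correlate each region's FC profile with its SC profile using Spearman correlation. The sketch below follows that reading; excluding the diagonal and the exact normalization are assumptions, not the authors' exact construction.

```python
import numpy as np
from scipy.stats import spearmanr

def structure_function_coupling(fc: np.ndarray, sc: np.ndarray) -> np.ndarray:
    """Per-region SFC: Spearman correlation between each region's FC and SC profiles.

    fc, sc: (N, N) functional / structural connectivity matrices over N regions.
    Returns an (N,) vector; self-connections (the diagonal) are excluded.
    """
    n = fc.shape[0]
    sfc = np.zeros(n)
    for i in range(n):
        mask = np.arange(n) != i                      # drop the self-connection
        rho, _ = spearmanr(fc[i, mask], sc[i, mask])  # rank-based correlation
        sfc[i] = rho
    return sfc

# Usage sketch with symmetric random matrices standing in for FC and SC
# (90 regions, matching a typical AAL parcellation).
rng = np.random.default_rng(0)
fc = rng.random((90, 90)); fc = (fc + fc.T) / 2
sc = rng.random((90, 90)); sc = (sc + sc.T) / 2
print(structure_function_coupling(fc, sc).shape)  # (90,)
```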
Liang Shutong, Xie Dongjin, Li Dong, Zhang Hui, Jia Xiaofeng, Wang Fei-Yue, Li Yidong, Li Lingxi
DOI:10.11834/jig.260100
摘要:In recent years, the rapid advancement of foundation models, including large language models (LLMs), vision–language models (VLMs) and world models, has introduced a paradigm shift that enables humanoid robots to transition from laboratory demonstrations to open-world applications such as household services, industrial manufacturing, and medical assistance. As the primary end-effector responsible for high-dimensional and fine-grained physical interaction, multi-fingered dexterous hands represent one of the most challenging and emblematic platforms in embodied intelligence, due to their high degrees of freedom, strongly nonlinear contact dynamics, and tightly coupled multimodal feedback mechanisms. The emergence of vision–language–action (VLA) models and large-scale foundation architectures, the breakthrough application of diffusion models and flow matching in continuous control policy generation, hybrid reinforcement–imitation learning frameworks, and advances in high-resolution tactile sensing, variable-stiffness mechanisms, and rigid–soft hybrid materials are collectively driving a fundamental transition in dexterous hands—from a paradigm of “rigid high-precision” mechanical determinism toward an integrated, perception–learning–execution–centered closed-loop intelligent system. This paper presents a comprehensive review of robotic dexterous hands across four dimensions: mechanical structures, intelligence capability grading, data resources, and benchmarking methodologies. First, from a historical perspective, we systematically trace the evolution of mechanical architectures and hardware paradigms, summarizing representative technical routes including fully actuated multi-finger designs, underactuated compliant mechanisms, tendon-driven systems, soft robotic hands, and rigid–soft hybrid structures. Our analysis indicates that the evolution of dexterous hand mechanisms is not merely an accumulation of degrees of freedom, but rather a gradual shift toward engineering-oriented paradigms characterized by underactuated coupling, material compliance, and hybrid structural design. By embedding adaptive coordination mechanisms into the mechanical body through passive responses, these approaches effectively reduce actuation and control dimensionality while physically enhancing robustness against object diversity and contact uncertainty. Building upon this foundation, we propose a systematic five-level taxonomy of dexterous intelligence (H1–H5) centered on the evolution of perceptual capability. H1 (Perception-Free) is characterized by open-loop program execution and teleoperation, where the system lacks environmental modeling and policy generation capabilities. H2 (Single-Modal Perception) introduces either vision or tactile feedback to enable perception-driven grasping and basic stability regulation. H3 (Multimodal Perception) integrates vision, tactile, and force sensing through deep multimodal collaboration, supporting complex fine manipulation tasks such as precision assembly, deformable object manipulation, and tool use. At this stage, systematic methodologies emerge across three technical directions: hierarchical task planning, multimodal servo control, and data-driven policy learning. H4 (Open Perception) centers on vision–language–action models and addresses perceptual generalization, long-horizon task planning, and deep multimodal fusion to enable language-guided open-world task understanding and zero-shot manipulation. 
H5 (Dynamic Perception) envisions autonomous, evolving general manipulation capabilities supported by deep multimodal dynamic perception and real-time coordination mechanisms, representing a historical leap from robots as “tools” to embodied “symbiotic agents.” This taxonomy provides a unified reference framework for evaluating the technological transition of dexterous hands from repetitive execution to open-world task planning and ultimately toward autonomous evolution. Furthermore, we systematically review the key data resources and evaluation benchmarks that support dexterous intelligence from two complementary dimensions: real-world interaction and high-fidelity simulation. At the data level, real-world datasets offer ecological validity but suffer from high collection costs, limited scalability, and safety risks. Synthetic datasets and simulation platforms enable large-scale and diverse data generation at controllable costs but remain constrained by simplified contact models and the simulation-to-reality gap. We outline the evolution of synthetic datasets from static grasp poses to dynamic manipulation sequences and analyze representative resources in terms of their contributions to grasp generation, cross-hand generalization, articulated object manipulation, and long-horizon modeling. We further summarize the technological progression of simulation platforms from basic physical validation to high-fidelity interaction and cross-domain transfer. In terms of evaluation, we categorize performance metrics into outcome-oriented and process-oriented dimensions, including task success rate, grasp cycle time, target pose error, normalized task error, contact region error, stability and drop rate, as well as efficiency and robustness. Benchmark tasks are organized into five families: stable grasping and transport; re-grasping and contact transition; in-hand manipulation and reorientation; constrained operation and assembly; and tool use and functional manipulation. Together, these constructs form a systematic two-dimensional evaluation spectrum spanning contact complexity and temporal depth, emphasizing reproducible and diagnostically meaningful standards for assessing generalization capability and deployment readiness. Finally, we summarize the core challenges and future directions toward the general-purpose deployment of dexterous hands. From a data perspective, the scarcity of real interaction data and the persistent simulation-to-reality gap remain fundamental bottlenecks for effective policy transfer. From a modeling perspective, efficient and robust multimodal joint representations, 3D foundation model construction, and interpretable decision-making mechanisms have yet to converge into a unified theoretical framework, while inherent tensions persist between model scale and real-time inference requirements. From a hardware perspective, long-standing engineering trade-offs exist between high degrees of freedom and low cost, reliability, and lightweight design, as well as between precision force–tactile control and structural simplicity. 
Looking ahead, deep integration of perception, decision-making, and execution; incorporation of physical commonsense and causal reasoning through world models and embodied foundation models; generative AI–driven data-efficient learning and simulation credibility enhancement; biomimetic variable-stiffness mechanisms and endogenous tactile sensing through soft–hard co-design; and long-term real-world deployment with closed-loop optimization in high-value scenarios such as intelligent manufacturing, domestic service, and specialized operations will be critical pathways. These efforts will drive dexterous hands from laboratory prototypes toward reliable real-world applications, ultimately achieving the goal of general embodied intelligence capable of perceiving, reasoning, and manipulating “like a human hand.” This work provides a unified capability framework and systematic reference for understanding and tracking the frontier of robotic dexterous hands, offering theoretical guidance and practical insights for future research in hardware paradigm evolution, intelligence capability transition, and data and benchmarking system construction.
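To make the H1–H5 intelligence grading above easier to operationalize, the sketch below encodes the five levels as a small Python lookup together with a naive grading rule derived directly from the level definitions (open-loop execution, single-modal feedback, multimodal collaboration, language-guided open-world manipulation, and autonomous evolution). All class, field, and function names are illustrative assumptions of this summary, not artifacts of the reviewed survey.

```python
from dataclasses import dataclass
from enum import IntEnum


class DexterityLevel(IntEnum):
    """Illustrative encoding of the H1-H5 grading described in the review."""
    H1_PERCEPTION_FREE = 1     # open-loop programs and teleoperation
    H2_SINGLE_MODAL = 2        # vision OR tactile feedback
    H3_MULTIMODAL = 3          # vision + tactile + force collaboration
    H4_OPEN_PERCEPTION = 4     # VLA-style, language-guided, zero-shot
    H5_DYNAMIC_PERCEPTION = 5  # autonomous, evolving general manipulation


@dataclass
class SystemProfile:
    """Hypothetical capability flags used only for this sketch."""
    modalities: frozenset      # e.g. frozenset({"vision", "tactile", "force"})
    language_conditioned: bool = False
    zero_shot_generalization: bool = False
    autonomous_self_improvement: bool = False


def grade(profile: SystemProfile) -> DexterityLevel:
    """Map a capability profile onto the coarsest matching H-level."""
    if profile.autonomous_self_improvement:
        return DexterityLevel.H5_DYNAMIC_PERCEPTION
    if profile.language_conditioned and profile.zero_shot_generalization:
        return DexterityLevel.H4_OPEN_PERCEPTION
    if len(profile.modalities) >= 2:
        return DexterityLevel.H3_MULTIMODAL
    if len(profile.modalities) == 1:
        return DexterityLevel.H2_SINGLE_MODAL
    return DexterityLevel.H1_PERCEPTION_FREE


if __name__ == "__main__":
    vla_hand = SystemProfile(modalities=frozenset({"vision", "tactile"}),
                             language_conditioned=True,
                             zero_shot_generalization=True)
    print(grade(vla_hand).name)  # H4_OPEN_PERCEPTION
```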
Mu Yao, Zhao Hao, Hu Ruizhen, Zhang Li, Li Hongyang, Yang Jiaolong, Wang Jingbo, Han Lei, Su Yongfeng, Xu Kai, Yang Yi, Li Jiang, Dai Ruoli, Chen Baoquan, Liu Yebin, Yi Li
DOI:10.11834/jig.260059
摘要:As a critical and rapidly evolving domain within artificial intelligence, Embodied AI represents the convergence of computer vision, natural language processing, and robotics, aiming to create intelligent agents capable of perceiving, reasoning, and acting within the physical world. However, despite the transformative success of Large Language Models (LLMs) in the digital realm, Embodied AI faces unprecedented and multifaceted challenges that hinder the direct replication of the “large-scale pre-training plus scaling law” paradigm. These challenges include extreme data heterogeneity across different robot morphologies, strong physical constraints that demand safety and precision, and the prohibitively expensive interaction costs associated with collecting real-world robotic data. Consequently, simply scaling up model parameters without addressing these domain-specific hurdles has proven insufficient for achieving general-purpose robotic intelligence. This paper comprehensively reviews the frontier technical evolution of Embodied AI, offering a systematic analysis across four critical dimensions: data, models, systems, and evaluation, to chart a path toward more robust and generalized embodied agents. In terms of data, we propose a “Data Pyramid” structure designed to maximize data efficiency and transferability. This hierarchical framework advocates for the foundational use of massive, low-cost simulation data and internet-scale video datasets at the bottom layer to build broad physical commonsense and visual representations; the utilization of human interaction data (such as ego-centric videos and teleoperation logs) in the middle layer to facilitate behavioral mapping and intent understanding; and the strategic application of a small amount of high-quality real-world robot data at the top layer for fine-tuning and final skill deployment, thereby bridging the reality gap. Regarding models, the paper critically discusses the current state of mainstream Vision-Language-Action (VLA) models, highlighting that while they excel at semantic understanding, they encounter significant scaling bottlenecks in continuous control and fine-grained manipulation. To overcome this, we identify “World Models” as a pivotal new direction for embodied pre-training. By learning to simulate environmental dynamics, predict future states, and understand causal relationships without explicit supervision, world models promise to endow agents with deeper physical intuition and superior generalization capabilities in unseen environments. In terms of systems, we observe a paradigm shift where the architecture is evolving from monolithic end-to-end models toward an Operating System-like “Hierarchical Architecture.” This evolution achieves the necessary decoupling of high-level semantic planning, powered by the reasoning capabilities of LLMs, from low-level motion control, which ensures precise execution and hardware compliance. This modular approach not only improves system robustness but also facilitates easier debugging and component upgrades. Finally, the paper examines the critical issues within current evaluation systems, specifically focusing on the challenges of authenticity in simulation benchmarks and the lack of reproducibility in real-world experiments. We argue that the field suffers from fragmented metrics that fail to capture the complexity of open-world interaction. 
In conclusion, we provide a forward-looking perspective on the inevitable integration of locomotion and manipulation—moving beyond stationary arms to mobile manipulators—and anticipate the arrival of the “ImageNet moment” for Embodied AI, where standardized datasets and benchmarks will catalyze a Cambrian explosion of robotic capabilities, ultimately bridging the gap between digital intelligence and physical reality.
关键词:Embodied AI;Data Pyramid;World Models;VLA Models;Hierarchical Control Architecture;Embodied Evaluation
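The Operating System-like “Hierarchical Architecture” discussed in this abstract amounts to placing semantic planning and motion control behind separate interfaces. The minimal Python sketch below, with hypothetical names (HighLevelPlanner, LowLevelController, run_task) that do not come from the paper, shows one simple way such a decoupled loop could be wired, including a naive re-planning step when a subgoal fails; it is a reading of the idea, not the architecture of any specific system in the survey.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Subgoal:
    description: str        # semantic step, e.g. "move the gripper above the mug"
    timeout_s: float = 5.0


class HighLevelPlanner(Protocol):
    """Semantic planning layer; in practice backed by an LLM or VLM."""
    def plan(self, instruction: str, scene_summary: str) -> list: ...


class LowLevelController(Protocol):
    """Motion-control layer responsible for precise, hardware-compliant execution."""
    def execute(self, subgoal: Subgoal) -> bool: ...


def run_task(instruction: str, scene_summary: str,
             planner: HighLevelPlanner, controller: LowLevelController) -> bool:
    """Decoupled loop: plan at the semantic level, execute step by step,
    and hand failures back to the planner instead of the controller."""
    subgoals = planner.plan(instruction, scene_summary)
    for sg in subgoals:
        if not controller.execute(sg):
            recovery = planner.plan(f"recover: {sg.description}", scene_summary)
            if not all(controller.execute(s) for s in recovery):
                return False
    return True


if __name__ == "__main__":
    class ScriptedPlanner:
        def plan(self, instruction, scene_summary):
            return [Subgoal("reach object"), Subgoal("grasp"), Subgoal("place")]

    class AlwaysSucceeds:
        def execute(self, subgoal):
            print("executing:", subgoal.description)
            return True

    print(run_task("tidy the table", "one mug on the table",
                   ScriptedPlanner(), AlwaysSucceeds()))
```

Because the two layers meet only at the Subgoal interface, either side can be replaced, for instance a stronger LLM planner or a different controller, without touching the other, which reflects the debugging and upgrade benefit the abstract highlights.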
摘要:This study explores the fundamental theories, models and architectures of next-generation artificial neural networks (ANNs) for the 21st century, with the goal of constructing high-performance ANNs featuring adaptive topology, interpretability, strong generalization capability and high energy efficiency. Since ANNs were first proposed in the 1940s, the field has witnessed over 80 years of development. Looking ahead to the mid-21st century, ANNs can be categorized into five generations based on five core dimensions, and this generational evolution constitutes the core trajectory of neural network research and advancement. This paper defines five generations of ANNs (abbreviated as xG-ANNs) from five perspectives: neuronal unit, information coding, network structure, learning mechanism and Turing test. 1G-ANNs: Threshold logic networks, represented by the M-P model and the Perceptron; 2G-ANNs: Continuous activation networks (e.g., sigmoid, tanh), typified by the classical back propagation (BP) network; 3G-ANNs: Spiking neural networks (SNNs), as formulated by Maass (1997); 4G-ANNs: Deep neural networks (DNNs), represented by AlexNet, ResNet, Transformer architectures and the attention mechanism, which have passed the conversational Turing test for disembodied intelligence; 5G-ANNs: Cognitive neural networks (CoNNs), with five defining characteristics: (1) cognitive units integrating memory, reasoning and human-like attention; (2) hybrid coding of semantic symbols and distributed representations; (3) modular cognitive architecture with dynamically reconfigurable topology; (4) meta-learning, causal reasoning and lifelong learning; (5) an embodied Turing test that remains to be passed. A core academic consensus has formed regarding 5G-ANNs: such networks integrate neural computing, symbolic reasoning and cognitive architecture with low power consumption, support dynamic topology, memory cognition, neuro-symbolic fusion and embodied/world models, and exhibit intrinsic merits including adaptive structure, few-shot generalization, interpretability, low energy consumption and embodied cognition. At present, the field is in the era of 4G-ANNs, which are characterized by data-driven fitting, deep learning, the attention mechanism and Transformer frameworks. Represented by the large language model-based ChatGPT, 4G-ANNs have passed the conversational Turing test, yet such validation is a "black-box" assessment restricted to the emergence of disembodied intelligence. The root causes lie in the inherent asymmetry of large language models built on the scaling law of parameter expansion, their lack of comprehension of the physical laws governing the real world, and critical drawbacks including jagged multi-modal and multi-form intelligent outputs and poor energy efficiency. In contrast, neural networks deployed in embodied intelligent robots lack autonomous intelligence and can merely execute predefined actions per programmed instructions, leaving a huge gap between disembodied intelligence and embodied intelligence in 4G-ANN systems. Addressing these critical limitations of 4G-ANNs calls for the support of novel theories, models and architectural designs. 
Currently, extensive debates and divergences persist over the developmental orientation and technical routes of next-generation ANNs. This paper analyzes and summarizes the developmental progress of mainstream theories, models and architectures across the first four generations of ANNs, focuses on the characteristics of several typical 4G-ANN models and their enhanced variants, and reviews representative architectures for 5G-ANNs, including world models, the Joint Embedding Predictive Architecture (JEPA), cognitive spiral models and intelligence-trace cellular network frameworks. Finally, grounded in the theory of cognitive physics and the practical technology of driving-brain cognition, this work proposes a lightweight architecture termed the Embodied Cognitive Physics Neural Network (E-CoPNN), which incorporates the core characteristics of fifth-generation ANNs. Specifically, E-CoPNN features dynamic topology (a three-layer nested structure and dual systems), memory cognition (three categories of memory), neuro-symbolic fusion (integration of Euclidean and non-Euclidean data, combination of lexemes and concepts), and embodied cognition/world models (fusion of disembodied and embodied intelligence, integration of cognitive space and physical space). The proposed architecture embodies the typical properties of 5G-ANNs: adaptive structure, few-shot generalization, interpretability, low power consumption and embodied cognition. Targeting brain-like general intelligence, 5G-ANNs take dynamic topology, autonomous memory, cognitive reasoning and structural evolution as core attributes, shifting the ANN research paradigm from data fitting to structural reconstruction and thus representing a pivotal direction for achieving machine autonomous intelligence and continuous learning. Conclusion and Significance: The development of next-generation ANNs for the 21st century will drive a series of transformative revolutions: philosophically, a paradigm shift from mind-body dualism to embodied cognitive science based on embodied perception monism; theoretically, a disciplinary expansion from 20th-century biophysics to 21st-century cognitive physics; model-wise, a fundamental transition of ANN research from data fitting to structural reconstruction; in application, bridging the long-standing gap between disembodied intelligence and embodied intelligence in ANN development; and generationally, an evolutionary leap from 4G-ANNs to 5G-ANNs characterized by brain-like cognitive capabilities and adaptive topological structures. This work lays a solid foundation for the widespread application of embodied cognitive robots with learning, self-development, self-correction and human-robot interaction capabilities. Furthermore, it supports the convergent development of the four cutting-edge Nano-Bio-Info-Cogno (NBIC) technologies led by cognitive integration, enhances human intellectual competence, and ushers in a new round of cognitive revolution.
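As a compact reference for the generational taxonomy defined in this abstract, the snippet below tabulates the five xG-ANN generations with the representative models it names; the dictionary layout and the helper function are purely illustrative assumptions of this summary and add no information beyond the text.

```python
# Illustrative tabulation of the five xG-ANN generations named in the abstract.
# Only details stated there are included; key and field names are ours.
XG_ANNS = {
    "1G": {"family": "threshold logic networks",
           "representatives": ("M-P model", "Perceptron")},
    "2G": {"family": "continuous activation networks (sigmoid, tanh)",
           "representatives": ("classical BP network",)},
    "3G": {"family": "spiking neural networks (SNNs)",
           "representatives": ("SNNs (Maass, 1997)",)},
    "4G": {"family": "deep neural networks (DNNs)",
           "representatives": ("AlexNet", "ResNet", "Transformer"),
           "turing_test": "conversational test passed (disembodied)"},
    "5G": {"family": "cognitive neural networks (CoNNs)",
           "representatives": ("world models", "JEPA", "E-CoPNN"),
           "turing_test": "embodied test not yet passed"},
}


def generation_of(model_name: str):
    """Return the generation label whose representatives include model_name."""
    for gen, info in XG_ANNS.items():
        if model_name in info["representatives"]:
            return gen
    return None


print(generation_of("Transformer"))  # -> 4G
```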