Abstract: Objective: Hyperspectral image (HSI) classification is critical in remote sensing and is widely used in land cover monitoring, agricultural surveying, and urban planning. Mamba-based models have been increasingly applied to HSI classification owing to their linear computational complexity and long-range dependency modeling. However, existing Mamba-based methods suffer from insufficient spatial information utilization and unreasonable spatial-spectral feature fusion, which leads to spatial information erosion and feature submergence, thereby limiting classification accuracy and efficiency. In this context, this study proposes a spatial information enhancement-based method, named SE-Mamba (Spatial Enhancement-Mamba), to improve classification accuracy and efficiency through effective integration of spatial and spectral information. Method: SE-Mamba incorporates two key designs focusing on the effective introduction and reasonable fusion of spatial information. First, a full-process spatial information enhancement mechanism is constructed, consisting of a front-end Spatial Enhancement Feature Extractor (SEFE) and a back-end High-Order Feature Refinement (HFR) module. Before the features are serialized and processed, SEFE explicitly encodes local structural priors and geometric dependencies into the feature map to alleviate the spatial information loss caused by Mamba serialization; HFR restores fine-grained geometric structures through high-order interaction and dual-gate control enhancement mechanisms. Second, a rational spatial-spectral fusion architecture, namely the Spatial-Spectral Collaborative Module (SSCM), is designed, which includes a Spatial-Spectral Fusion Module (SSFM). The SSCM decouples spatial and spectral features into two separate branches to strengthen the independent representation of heterogeneous features, while the SSFM adopts a "calibrate first, then fuse" strategy to achieve in-depth integration through cross-guidance and adaptive weight allocation, thereby avoiding spatial information erosion. For experimental verification, four representative hyperspectral datasets (HanChuan, HongHu, Houston, and PaviaU) covering agricultural and urban scenes are used to evaluate the model's performance and robustness. Key evaluation indicators include Overall Accuracy (OA), Average Accuracy (AA), and the Kappa coefficient. Ablation experiments compare the performance of adding SEFE, HFR, and SSCM individually to the baseline model, and analyze the contribution of each proposed module, the synergy among modules, and the coupling effect of the sub-modules within SSCM by removing SEFE, HFR, and SSCM from the complete framework. Computational complexity, parameter size, and inference speed are also evaluated to assess the model's efficiency. Results: Experimental results on the four representative datasets (HanChuan, HongHu, Houston, and PaviaU) demonstrate that SE-Mamba achieves the best Overall Accuracy (OA) and Average Accuracy (AA), with its Kappa coefficient also reaching a level comparable to state-of-the-art methods. Specifically, SE-Mamba attains an average OA of 96.07% across the four datasets, surpassing the benchmark model MambaHSI by 2.32%. In terms of efficiency, the computational complexity and parameter size of SE-Mamba are comparable to those of mainstream methods, while its inference speed is superior to that of some comparison models, achieving a good balance between classification accuracy and computational efficiency.
Ablation experiments verify the effectiveness of each core module in spatial feature representation. Compared with existing Mamba-based methods (e.g., MambaHSI), SE-Mamba effectively addresses spatial information erosion and feature submergence through spatial enhancement and optimized fusion, while preserving Mamba's linear computational advantage. Compared with traditional CNN/Transformer-based methods, SE-Mamba combines state space modeling with spatial enhancement, achieving more stable performance in complex scenes. Conclusions: Experiments verify that the combination of explicit spatial enhancement and state space modeling is effective, and that the two core strategies of SE-Mamba synergistically alleviate spatial information erosion and feature submergence. By strengthening spatial feature extraction and optimizing spatial-spectral fusion, SE-Mamba maintains stable and efficient classification performance on complex agricultural, urban, and multi-category HSI datasets, achieving improved classification accuracy and efficiency. SE-Mamba provides a novel approach for HSI classification and serves as a reference for state space-based remote sensing image processing, offering technical support for land cover monitoring and agricultural surveying. Future work could consider designing adaptive scanning mechanisms and introducing transfer learning to enhance the model's adaptability to complex scenarios and its cross-regional generalization ability, and could promote practical deployment on portable devices through lightweight design. The dataset and code related to this article have been shared [DOI: 10.57760/scientificdb.j00240.00182].
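As a rough illustration of the "calibrate first, then fuse" strategy described above, the following PyTorch sketch shows one way cross-guidance and adaptive weight allocation could be combined: each branch is calibrated by a gate computed from the other branch before the two are fused with adaptively learned weights. Module and parameter names are hypothetical and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossGuidedFusion(nn.Module):
    """Sketch of a calibrate-then-fuse block: cross-guided gating
    followed by adaptive weight allocation over the two branches."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate_spa = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_spe = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.weight = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2, 1),
            nn.Softmax(dim=1))

    def forward(self, spa: torch.Tensor, spe: torch.Tensor) -> torch.Tensor:
        # Cross-guidance: each feature map is calibrated by a gate
        # derived from the other branch before any mixing happens.
        spa_cal = spa * self.gate_spe(spe)
        spe_cal = spe * self.gate_spa(spa)
        # Adaptive weight allocation over the calibrated branches.
        w = self.weight(torch.cat([spa_cal, spe_cal], dim=1))  # (B, 2, 1, 1)
        return w[:, :1] * spa_cal + w[:, 1:] * spe_cal
```

Calibrating before fusing is what keeps the weaker (spatial) signal from being submerged by the dominant spectral signal during the weighted sum.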
Chen Yaoyi, Wang Na, Peng Yanchun, Chen Jiahao, Qiao Pengxu, Wang Wei, Qin Chuan
DOI:10.11834/jig.250661
Abstract: Objective: Screen-shooting resistant watermarking is an effective copyright authentication technique that has received widespread attention in recent years. It uses a pre-designed embedding algorithm to embed secret information into a cover image; when the copyright of the image is found to be infringed, the corresponding extraction algorithm can extract the secret information, thereby achieving copyright protection. However, with screen content images serving as a predominant medium for information transmission, the widespread use of screen-shooting enables low-cost duplication of screen content, posing severe challenges to copyright protection. Although screen-shooting resistant robust watermarking has emerged as an effective solution for copyright authentication and infringement tracing, existing research faces a critical problem: the lack of dedicated screen content image datasets. Current deep learning-based watermarking models are primarily trained on natural image datasets such as ImageNet, COCO, and MIR-Flickr. These natural image datasets focus on real-world scenes with rich textures and complex color distributions, which differ fundamentally from screen content images characterized by text-dominated content, large uniform color backgrounds, and vector-based elements such as lines and diagrams. This domain gap leads to significant visual quality degradation (e.g., visible artifacts) when models trained on natural images are applied to screen content. Additionally, existing screen-related datasets are designed for quality assessment tasks and contain a large number of noise-processed images, making them unsuitable for training robust watermarking models that require accurate simulation of real-world screen content scenarios. To address these issues, this study aims to construct a large-scale, high-quality dedicated screen content image dataset to support the development of screen-shooting resistant robust watermarking technologies, thereby bridging the performance gap between natural image and screen content watermarking applications and enhancing the practicality of copyright protection for screen-based information. Method: A screen content image dataset for screen-shooting resistant watermarking (SCID) was constructed specifically for screen-shooting resistant robust watermarking tasks. The dataset is categorized into six functional themes based on common usage scenarios: webpages, chat applications, programming environments, engineering drawings, online meetings, and office applications. Webpage images were collected from diverse sources through search engines, including official websites, social media platforms, open-source project repositories, and large language model conversation interfaces, covering text-heavy pages, image-rich content, and interactive interfaces. Chat images included interfaces from popular communication software such as QQ and WeChat, as well as public-account push messages and conversation interfaces. Programming images captured code displays and runtime results from various development platforms. Engineering drawing images consisted of both 2D blueprints and 3D model renderings, covering mechanical, architectural, and electrical design scenarios. Online meeting images were collected from remote collaboration tools such as Tencent Meeting and FeiShu, including live lectures, video conferences, and remote desktop control scenes.
Office images were captured from common office software (Microsoft Office and WPS), including Word documents, Excel spreadsheets, PDF files, and PPT presentations. In total, SCID contains 17 101 high-resolution images, integrating text, images, diagrams, and video frames to simulate both daily and professional screen usage scenarios. Result: To validate the effectiveness of SCID, five deep learning watermarking methods (StegaStamp, MBRS, PIMoG, HiFiMSFA, and MTVDGAN) were selected for comparative experiments, and model performance was evaluated along three key dimensions: visual quality, robustness against digital attacks, and robustness against real screen-shooting attacks. For visual quality (assessed by PSNR and SSIM), models trained on natural image datasets showed a significant drop of 2–4 dB in PSNR when tested on SCID, indicating obvious visual artifacts in screen content watermarking. In contrast, models trained on SCID maintained stable PSNR and SSIM when tested on natural image datasets, demonstrating that SCID-trained models retain excellent visual quality across domains without additional fine-tuning. In digital attack experiments (including random cropping, JPEG compression, Gaussian blur, Gaussian noise, median filtering, and salt-and-pepper noise), the accuracy difference (AD) of SCID-trained models was consistently better than that of natural image-trained models. This indicates that SCID-trained models have smaller performance fluctuations when transferred between screen content and natural images, reflecting stronger versatility. For real screen-shooting attacks, although the SCID-trained model did not show significant performance improvement over the model trained on natural image datasets, the fluctuation of AD was within 0.1%, indicating that the performance variation remains within an acceptable range. These results collectively confirm that SCID not only improves the visual quality of screen content watermarking but also maintains strong robustness against both digital and real-world screen-shooting attacks, while ensuring excellent generalization to natural images. Conclusion: This study addresses the lack of dedicated datasets for screen-shooting resistant watermarking by constructing the large-scale SCID, which covers 17 101 images across six practical themes. Comparative experiments using five watermarking methods demonstrate that SCID effectively resolves the visual quality degradation of natural image-trained models when applied to screen content, while enabling models to retain stable performance on natural images. The dataset's diverse content and realistic scenario simulation enhance the generalization and practicality of watermarking models, providing critical data support for the development of screen content copyright protection technologies. Additionally, SCID can serve as a benchmark dataset for screen-shooting resistant watermarking research, promoting standardized evaluation and technological innovation in the field. This work contributes to advancing copyright protection for digital screen content and provides a foundation for addressing infringement and information leakage challenges in cross-media transmission scenarios. To facilitate replication and verification by academic peers, the dataset constructed in this paper will be made public upon acceptance, with the complete download link provided in the article at that time.
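For context on the metrics above: PSNR is the standard fidelity measure, and the accuracy difference (AD) quantifies how much watermark-extraction accuracy fluctuates across domains. A minimal numpy sketch, assuming AD is the absolute gap between in-domain and cross-domain bit-recovery accuracy (the paper's exact definition may differ):

```python
import numpy as np

def psnr(cover: np.ndarray, watermarked: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio between cover and watermarked images."""
    mse = np.mean((cover.astype(np.float64) - watermarked.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def bit_accuracy(decoded: np.ndarray, embedded: np.ndarray) -> float:
    """Fraction of watermark bits recovered correctly."""
    return float(np.mean(decoded == embedded))

def accuracy_difference(acc_in_domain: float, acc_cross_domain: float) -> float:
    """Assumed AD: gap in extraction accuracy when a model is
    transferred between screen content and natural images."""
    return abs(acc_in_domain - acc_cross_domain)
```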
Yang Jingxiang, Zeng Jian-An, Diao Wenxiu, Xiao Liang
DOI:10.11834/jig.260110
Abstract: Hyperspectral images (HSIs) contain rich spatial-spectral information, offering high discriminative ability and wide applications in remote sensing, geological examination, and medical diagnosis. Traditional spectral imaging technologies, including whisk-broom, push-broom, and staring imaging, suffer from bulky equipment, long acquisition periods, and limited spatial-temporal resolution, which hinders their use in dynamic scenes and on moving platforms. Recently, compressive spectral imaging (SCI) has attracted growing research interest. Coded aperture snapshot spectral imaging (CASSI) captures a compressed measurement of a 3D HSI within a single exposure; this high efficiency has made it a hot topic in computational imaging. A key technology in CASSI is HSI reconstruction, which aims to restore a high-quality latent HSI from the compressed measurement. Over the past decades, numerous HSI reconstruction algorithms have been proposed. In this overview, we comprehensively review recent advances in spectral imaging and reconstruction methods. First, we analyze the physical process of compressive spectral imaging and formulate the spatial-spectral degradation model. Then, we model CASSI reconstruction as an ill-posed inverse problem, which requires priors as regularization to reduce the solution space. Taking the prior as the organizing viewpoint, we divide current HSI reconstruction technologies into four categories: 1) model-driven methods based on hand-crafted priors; 2) data-driven methods based on deep learning networks; 3) model-data joint-driven methods based on deep priors; and 4) methods based on the recently proposed generative diffusion prior. Through this structured analysis, the overview aims to offer valuable insights into the core ideas, design paradigms, and evolution of different methods, highlight persistent challenges, and provide an outlook on future development trends. Model-driven methods rely on hand-crafted priors; various priors, such as total variation and sparsity, have been proposed as regularization for the HSI reconstruction problem. They are mathematically interpretable and can generalize to different imaging systems as long as the degradation model is accurate. However, hand-crafted priors may be too simplistic to fully capture the complex spatial-spectral characteristics of HSIs. The iterative optimization process of the reconstruction model is computationally expensive for real-time applications, and tuning its hyper-parameters is also difficult. Data-driven methods use deep learning networks to learn a mapping between measurements and HSIs. Different networks (e.g., convolutional networks and Transformers) have been designed to exploit spatial-spectral features for HSI reconstruction. Normally, after learning complex data-driven features, a high-fidelity HSI can be inferred efficiently. However, such networks are black boxes with limited interpretability; moreover, they may fail catastrophically when the spatial-spectral degradation is unknown or unseen during inference. Model-data joint-driven methods combine the strengths of model-driven and data-driven methods. They originate from the traditional HSI reconstruction model but replace the hand-crafted prior with an implicit deep prior. Classic optimization algorithms are used to minimize the reconstruction objective, and the iterative solutions are unrolled into a deep network, with each iteration becoming an unfolding stage of the network.
The hand-crafted prior is replaced with a learnable denoiser acting as a deep proximal operator, and the unrolled network is trained end to end. Since the network is designed under the guidance of imaging physics, it has higher interpretability and robustness under varying degradations than purely data-driven methods, and by learning the deep prior it achieves higher quality than hand-crafted priors. However, these networks can be regarded as discriminative models trained with regression losses; they tend to produce deterministic results that average over the distribution of potential ground truths, which leads to blurry outputs and hinders the reconstruction of fine-grained image structures. Diffusion models can generate diverse and highly realistic content, so leveraging the generative diffusion prior may remedy this limitation and has shown potential in HSI reconstruction. Further, in this overview, we select 12 mainstream HSI reconstruction methods and compare their performance on widely used datasets. The experimental code and data are available at: https://github.com/DDXNJUST/Computational-Imaging/. Based on the experimental results, we finally discuss the shortcomings of existing works and future research trends, including several pain points such as representing complex spatial-spectral features, limited generative ability and content distortion, and the disjointed relationship among compressive imaging, HSI reconstruction, and downstream tasks. The purpose of this overview is to provide a comprehensive introduction to spectral imaging and reconstruction, and to present valuable insights for future advancement.
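To make the degradation model concrete, here is a minimal numpy sketch of the single-disperser CASSI forward process described above: coded-aperture modulation, band-wise spectral shearing, and summation onto a 2D detector. The one-pixel shear per band and the array shapes are illustrative assumptions.

```python
import numpy as np

def cassi_forward(hsi: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Simulate a CASSI measurement: modulate each spectral band by the
    coded aperture, shear bands along one spatial axis, and sum on the sensor.
    hsi: (H, W, L) hyperspectral cube; mask: (H, W) binary coded aperture."""
    H, W, L = hsi.shape
    measurement = np.zeros((H, W + L - 1))      # detector widened by the shear
    for l in range(L):
        coded = hsi[:, :, l] * mask             # element-wise modulation
        measurement[:, l:l + W] += coded        # one-pixel dispersion per band
    return measurement

# Reconstruction then solves the ill-posed inverse problem
#   min_x ||y - Phi(x)||^2 + lambda * R(x),
# where Phi is the forward operator above and R encodes the prior
# (total variation, sparsity, a learned deep prior, or a diffusion prior).
```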
Keywords: compressive spectral imaging;computational reconstruction;imaging model;deep learning;model and data driven
Feng Jianjiang, Jia Wei, Li Qi, Cui Zhe, Zhao Cairong, Lei Zhen, Wang Caiyong, Kang Wenxiong, Yu Shiqi, Fei Lunke, Li Xiaobai, Ye Mang, Wei Jianze, Cao Shiwen, Sun Shibo, Xie Tianming, Zheng Weishi, Yang Hongyu, Huang Junduan
DOI:10.11834/jig.260069
Abstract: Biometric recognition has become a fundamental enabling technology for modern digital society, serving as a core infrastructure for identity authentication, access control, and human-centered intelligence. By exploiting intrinsic physiological and behavioral characteristics, biometrics offers inherent advantages over traditional knowledge- or token-based authentication mechanisms, including higher security, improved usability, and stronger resistance to impersonation. Over the past decade, and especially since 2021, the rapid advancement of deep learning, sensing technologies, and large-scale data resources has profoundly reshaped the landscape of biometric research and applications. Biometric systems have evolved from task-specific, handcrafted pipelines into data-driven, end-to-end intelligent systems capable of operating in complex, unconstrained, and large-scale real-world environments. This report provides a comprehensive review of the development of biometric recognition technologies from 2021 to 2025, covering both methodological advances and emerging application paradigms. We focus on major biometric modalities, including face, iris, fingerprint, palmprint, finger and palm vein, body, gait, and person re-identification, while also highlighting the increasingly critical role of security, privacy protection, and trustworthiness in biometric systems. Face recognition remains the most widely deployed biometric modality due to its non-contact nature, low acquisition cost, and high social acceptance. Recent progress has been driven by innovations in network architectures, large-scale training data, and discriminative loss functions. Convolutional neural networks and vision transformers have significantly improved representation capacity, while margin-based and quality-aware losses have enhanced intra-class compactness and inter-class separability. At the same time, face detection and alignment have advanced toward robust performance under extreme conditions such as low resolution, severe illumination variation, occlusion, and large pose changes. Beyond recognition, face generation and synthesis have emerged as both an enabling technology and a security challenge. Generative adversarial networks and diffusion models now support high-fidelity, controllable, and even 3D-aware face generation, facilitating data augmentation, virtual humans, and animation, while simultaneously raising new threats in the form of deepfakes and identity spoofing. Iris recognition continues to be regarded as one of the most secure biometric modalities due to the high uniqueness and stability of iris texture. In recent years, research has shifted from controlled laboratory settings toward less constrained and mobile scenarios. Advances in iris acquisition include visible-light imaging, mobile-device-based capture, and near-eye sensing in VR/AR environments. Deep learning-based iris segmentation and localization methods have greatly improved robustness against noise, occlusion, and cross-domain variations. In parallel, iris feature representation has evolved from handcrafted binary codes toward deep embeddings and hybrid models that preserve compatibility with traditional matching schemes.
Synthetic iris data generation has gained attention as a means to address data scarcity and privacy constraints, although its impact on large-scale recognition performance and potential security implications remain open research questions. Fingerprint recognition, as one of the most mature biometric technologies, has undergone a significant transformation with the adoption of deep learning. Modern fingerprint systems address challenges such as low-quality latent prints, partial fingerprints, distortion, and cross-sensor variability through deep enhancement, minutiae extraction, dense descriptors, and multi-stage matching frameworks. The emergence of new fingerprint modalities, including contactless fingerprints, 3D fingerprints, and optical coherence tomography (OCT)-based internal fingerprints, has expanded the representational space of fingerprint biometrics and enabled improved robustness against distortion and spoofing. At the same time, large-scale real and synthetic fingerprint datasets have facilitated systematic evaluation and training of deep models. Palmprint and palm-vein recognition have experienced rapid growth in both research and large-scale deployment. Advances in deep learning have addressed long-standing challenges in region-of-interest extraction, alignment, and feature robustness, enabling commercial applications such as contactless palm payment systems. Multimodal fusion of palmprint and vein patterns has further enhanced recognition accuracy and security. In addition, progress in palm image synthesis and 3D palm modeling has opened new directions for data augmentation and unconstrained recognition. Body-based biometrics, including person re-identification and gait recognition, play an increasingly important role in surveillance and public safety scenarios where close-range or cooperative acquisition is not feasible. Recent work has focused on cross-view, cross-domain, and long-term recognition under clothing changes, occlusion, and low-resolution conditions. Deep spatiotemporal modeling, attention mechanisms, and sequence-level representations have significantly improved robustness, while also raising concerns about privacy, fairness, and ethical deployment. Alongside performance improvements, security and privacy have become central themes in biometric research. Biometric systems are increasingly exposed to sophisticated attacks, including presentation attacks, adversarial examples, template inversion, and deepfake-based impersonation. Consequently, substantial efforts have been devoted to spoof detection, adversarial defense, secure template protection, and cancellable biometrics. Privacy-preserving techniques such as template transformation, encryption, and federated or decentralized learning are gaining importance as regulatory requirements and public awareness intensify. Ensuring that biometric systems are not only accurate but also trustworthy, explainable, and compliant with data protection regulations is now a core research objective. Beyond identity authentication, biometrics is expanding toward broader human-centered applications, including human–computer interaction, healthcare monitoring, behavioral analysis, and immersive virtual environments. This shift reflects a transition from “who you are” to “how you are,” where biometric signals contribute to comprehensive perception and intelligent interaction rather than mere identification. In summary, the period from 2021 to 2025 has witnessed rapid and multifaceted progress in biometric recognition.
Advances in deep learning, sensing, and data generation have substantially improved accuracy, robustness, and scalability across modalities, while new challenges in security, privacy, and societal impact have emerged. By systematically reviewing recent developments across core biometric technologies and security frameworks, this report aims to provide a holistic perspective on the current state of the field and to outline promising directions for future research and deployment. We hope this work will serve as a valuable reference for researchers, engineers, and policymakers seeking to understand and shape the next generation of biometric systems.
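As one concrete instance of the margin-based losses credited above with improving intra-class compactness and inter-class separability, here is a minimal PyTorch sketch of an additive angular margin (ArcFace-style) softmax head; the scale s and margin m values are common illustrative defaults, not prescriptions from this report.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginHead(nn.Module):
    """Additive angular margin softmax head (ArcFace-style sketch)."""
    def __init__(self, feat_dim: int, num_classes: int,
                 s: float = 64.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized features and class weights.
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class logit, which
        # forces embeddings of the same identity to cluster more tightly.
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, labels)
```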
Keywords: biometrics;face recognition;iris recognition;fingerprint and palmprint recognition;finger and palm vein recognition;person re-identification;gait recognition;spoof detection
Abstract: Active speaker detection (ASD) aims to identify speakers and their active speech intervals within video sequences by leveraging both audio and visual modalities. ASD serves as a foundational technology for applications such as media content analysis, human-computer interaction, intelligent meeting systems, and audio-visual speech recognition. Despite the significant progress driven by the rapid development of deep learning since 2015, real-world deployment still encounters challenges from complex environmental factors such as visual occlusions, acoustic interference, overlapping speech, and dynamic camera movements. Against this backdrop of progress and challenges, this survey provides a comprehensive review of ASD technologies over the past 25 years, categorizing existing methodologies into vision-based and audio-visual methods. The first category, vision-based methods, infers speech activity entirely from visual cues, such as lip contours, facial movement, and body gestures. These methods are valuable where audio is entirely missing or heavily corrupted by acoustic interference. While immune to acoustic degradation, vision-based methods inherently struggle to distinguish actual speech from non-speech lip movements and are highly sensitive to low image resolution, non-frontal head poses, and occlusions. The second category, audio-visual methods, constitutes the mainstream of current research by harnessing the complementary nature of auditory and visual signals. This survey further subdivides this category into three major paradigms. (a) Matching-based methods identify speakers by learning cross-modal correspondences, typically without extensive manual annotations. This paradigm splits into two distinct routes: synchronization-based and identity-based association. Synchronization-based methods measure the short-term temporal alignment between lip motions and acoustic signals, utilizing contrastive learning to project audio and visual features into a shared embedding space. While these methods benefit from self-supervised learning paradigms, they require tight audio-visual synchronization and can fail under desynchronization or in dubbed videos. Alternatively, identity-based association methods focus on long-term consistency. They typically cluster acoustic speaker embeddings and facial feature sequences separately and then associate voices with faces based on co-occurrence statistics or cross-modal face-voice matching networks. This route is highly robust to dubbing, off-screen voices, and poor visual quality (e.g., in egocentric videos) but relies heavily on the accuracy of intermediate clustering steps. (b) Fusion-based classification methods formulate ASD as a fully supervised "speaking vs. non-speaking" binary classification task for each candidate face at every time step. This pipeline generally involves four crucial stages: feature extraction, feature fusion, temporal modeling, and final speaker activity detection. In the feature extraction stage, modern architectures employ large-scale pre-trained acoustic encoders and deep visual backbones. To effectively integrate these multi-modal streams, dynamic fusion strategies such as cross-attention mechanisms, gating networks, and uncertainty-aware adaptive fusion have largely replaced simple static concatenation. Furthermore, temporal context modeling has evolved from local, short-term processing to Recurrent Neural Networks (RNNs) and then to global spatiotemporal reasoning using Transformers and Graph Neural Networks (GNNs).
By explicitly modeling the complex interactive dynamics among multiple candidate speakers and the global scene context, fusion-based classification achieves state-of-the-art accuracy on most benchmarks. However, it demands large amounts of densely annotated data and suffers from domain shift issues. (c) Hybrid methods seek to combine the complementary strengths of both matching and classification paradigms to tackle complex scenarios. By integrating short-term speech behavior (via synchronization or classification) with long-term identity verification (speaker profiles), hybrid systems effectively suppress interference from non-target speakers, overlapping voices, and off-screen narrators, thereby significantly enhancing overall robustness in real-world environments. Beyond this algorithmic taxonomy, this survey also extensively summarizes benchmark datasets and evaluation metrics commonly used in the ASD community. The paradigm shift in dataset curation is traced from early, heavily constrained laboratory recordings with limited participants to large-scale, in-the-wild datasets. Modern benchmarks feature thousands of hours of video spanning movies, video logs, egocentric wearable camera views, and even surveillance footage. Commonly used evaluation metrics such as mean Average Precision (mAP) are also discussed. Finally, this survey concludes by highlighting the technical trends and outlining several persistent open problems. Despite achieving near-perfect scores on certain benchmarks, current state-of-the-art models exhibit limited cross-dataset generalization, particularly struggling with diverse languages, out-of-domain scenarios, and extreme conditions. Moreover, existing systems lack a deep semantic understanding of conversational dynamics, such as turn-taking logic, interruptions, and non-verbal social cues. To address these bottlenecks, future research should focus on constructing more inclusive datasets, exploring data-efficient learning, integrating Large Language and Vision-Language Models (LLMs/VLMs) for semantic reasoning, and developing lightweight architectures for edge deployment.
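For reference, the mAP metric mentioned above scores per-frame speaking confidences against binary ground truth. A minimal sketch using scikit-learn, assuming average precision is computed per video and then averaged; exact pooling conventions vary across ASD benchmarks.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def asd_map(predictions: dict) -> float:
    """Mean average precision over videos.
    predictions maps video_id -> (labels, scores), where labels are
    per-face-per-frame binary speaking labels and scores are the
    model's speaking confidences for the same entries."""
    aps = [average_precision_score(labels, scores)
           for labels, scores in predictions.values()]
    return float(np.mean(aps))

# Example with two toy videos:
preds = {
    "vid_a": (np.array([1, 0, 1, 1]), np.array([0.9, 0.2, 0.7, 0.6])),
    "vid_b": (np.array([0, 0, 1]),    np.array([0.1, 0.4, 0.8])),
}
print(f"mAP = {asd_map(preds):.3f}")
```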
Abstract: Objective: In recent years, brain networks have become an indispensable cornerstone of brain disorder research. Currently, functional magnetic resonance imaging (fMRI) is commonly used to construct functional connectivity (FC) networks, while diffusion tensor imaging (DTI) is utilized to build structural connectivity (SC) networks. Yet two limitations remain prominent: many approaches still operate on a single modality or adopt shallow multi-modal fusion, leaving the potential dependencies between SC and FC under-explored; and although Structure-Function Coupling (SFC) has been shown to differ across brain regions between healthy controls and patients and to hold biomarker potential, its relationship to classification has not been systematically integrated into representation learning. This study aims to operationalize SFC as an explicit prior that guides multi-modal representation learning, so that high-order cross-modal associations can be captured and translated into robust diagnostic performance. Method: This paper presents SFC-HGNN, a Structure-Function Coupling-guided foundational framework that combines a dual-stream encoder, cross-modal reconstruction pretraining, and hypergraph computation to model high-order relationships within and between SC and FC. The core design is to use an SFC matrix as a bridge that explicitly guides two hypergraph neural network (HGNN) branches, one per modality, so that each branch learns modality-appropriate group relations while remaining informed by the other modality. Concretely, the SFC matrix is first constructed to capture the consistency between FC and SC patterns across brain regions using a rank-based correlation measure, and this matrix is fused with each modality through shallow feature mapping to form initial node-level features for the two streams. On the functional stream, hyperedges are formed via a sparse-representation method, providing data-driven, high-order groupings appropriate for FC. On the structural stream, hyperedges are built with a k-nearest-neighbor strategy to reflect global structural associations. Each branch then employs HGNN layers to realize high-order message passing on its respective hypergraph, thereby encoding latent dependencies between SC and FC under SFC guidance. To fully exploit cross-modal dependencies, a cross-modal reconstruction pretraining task is introduced. During pretraining, the functional stream is trained to reconstruct the structural connectivity matrix and the structural stream to reconstruct the functional connectivity matrix via decoders. The decoders are optimized with a reconstruction objective together with a constraint that emphasizes appropriate symmetry and sparsity of the reconstructed matrices. This pretraining forces the encoder to internalize modality-bridging information under SFC guidance. In the subsequent downstream tuning phase, the encoder parameters are frozen; latent node representations from the two branches are flattened and concatenated into a global feature vector, which is then fed into a lightweight multilayer perceptron (MLP) classifier optimized with cross-entropy. Freezing the encoder keeps downstream training simple and stable while preserving the cross-modal dependencies captured during pretraining. Result: Experiments are conducted on two public multi-modal brain imaging datasets. On ADNI, we use 332 subjects (64 AD, 129 MCI, and 139 NC). On ABIDE, we include 86 subjects with paired fMRI and DTI (ASD/NC).
For both datasets, we adopt the AAL atlas; fMRI data are preprocessed with DPARSF and DTI data with PANDA. The model is implemented in PyTorch and trained on an NVIDIA RTX 4090 GPU. We follow a two-stage training protocol (cross-modal pretraining followed by frozen-encoder tuning) and evaluate with 5-fold cross-validation, reporting ACC, AUC, F1, and specificity. We compare SFC-HGNN against representative single-modality baselines (BrainNetCNN, GAT, HGNN+, BrainGNN) and multi-modality methods (MME-GCN, Cross-GNN). Across diagnostic tasks, SFC-HGNN achieves state-of-the-art performance, consistently improving ACC, AUC, and F1. While certain baselines occasionally yield higher specificity, these cases are often accompanied by markedly lower F1, suggesting reduced stability; in contrast, SFC-HGNN maintains a better overall balance among metrics and demonstrates stronger robustness. Ablation studies further isolate the contributions of the proposed components: introducing SFC as a cross-modal bridge (without pretraining) improves accuracy by 1.6% and 2.3% on AD vs. NC and ASD vs. NC, respectively, and adding cross-modal reconstruction pretraining brings additional gains of 2.1% and 1.9%. These results indicate that SFC guidance together with cross-modal reconstruction effectively encourages the encoder to capture latent SC-FC dependencies that transfer to downstream classification. Finally, to assess interpretability, we conduct significant-hyperedge analysis using group-level t-tests and visualize discriminative functional and structural hyperedges for AD vs. NC and ASD vs. NC; the identified regions and connections align with known patterns of network alteration, supporting the neurobiological plausibility of the learned high-order representations. Conclusion: By treating SFC as an explicit bridge and pairing dual HGNN encoders with a cross-modal reconstruction pretraining paradigm, this study introduces a principled multi-modal framework for brain network analysis and brain disorder diagnosis. The approach encourages each modality to be informed by the other and to encode high-order associations that are useful for downstream classification, while a frozen-encoder tuning strategy keeps optimization stable and lightweight. On ADNI and ABIDE, SFC-HGNN consistently surpasses single- and multi-modality baselines, with gains reflected not only in accuracy and AUC but also in a better balance between F1 and specificity, highlighting robustness. The significant-hyperedge findings further provide neurobiologically plausible insights that complement the quantitative improvements. Overall, SFC-HGNN advances multi-modal brain disorder diagnosis by unifying SFC-guided hypergraph encoding, cross-reconstruction pretraining, and simple downstream tuning into a coherent pipeline that achieves superior and reliable performance across datasets.
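As an illustration of the coupling measure described in the Method section, the following sketch computes a per-region structure-function coupling score as the Spearman rank correlation between each region's SC and FC connectivity profiles. Treating each row of the connectivity matrices as a region's profile is an assumption; the paper's exact SFC construction may differ.

```python
import numpy as np
from scipy.stats import spearmanr

def sfc_profile(sc: np.ndarray, fc: np.ndarray) -> np.ndarray:
    """Per-region structure-function coupling.
    sc, fc: (N, N) structural and functional connectivity matrices
    over N brain regions (e.g., N = 90 for the AAL atlas).
    Returns an (N,) vector of Spearman correlations between each
    region's SC and FC connectivity profiles."""
    n = sc.shape[0]
    coupling = np.zeros(n)
    for i in range(n):
        profile_sc = np.delete(sc[i], i)   # drop the self-connection
        profile_fc = np.delete(fc[i], i)
        coupling[i], _ = spearmanr(profile_sc, profile_fc)
    return coupling
```

A rank-based measure is a natural choice here because SC and FC edge weights live on very different scales, and Spearman correlation is invariant to monotonic rescaling.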
Liang Shutong, Xie Dongjin, Li Dong, Zhang Hui, Jia Xiaofeng, Wang Fei-Yue, Li Yidong, Li Lingxi
DOI:10.11834/jig.260100
Abstract: In recent years, the rapid advancement of foundation models, including large language models (LLMs), vision–language models (VLMs) and world models, has introduced a paradigm shift that enables humanoid robots to transition from laboratory demonstrations to open-world applications such as household services, industrial manufacturing, and medical assistance. As the primary end-effector responsible for high-dimensional and fine-grained physical interaction, multi-fingered dexterous hands represent one of the most challenging and emblematic platforms in embodied intelligence, due to their high degrees of freedom, strongly nonlinear contact dynamics, and tightly coupled multimodal feedback mechanisms. The emergence of vision–language–action (VLA) models and large-scale foundation architectures, the breakthrough application of diffusion models and flow matching in continuous control policy generation, hybrid reinforcement–imitation learning frameworks, and advances in high-resolution tactile sensing, variable-stiffness mechanisms, and rigid–soft hybrid materials are collectively driving a fundamental transition in dexterous hands, from a paradigm of “rigid high-precision” mechanical determinism toward an integrated, perception–learning–execution–centered closed-loop intelligent system. This paper presents a comprehensive review of robotic dexterous hands across four dimensions: mechanical structures, intelligence capability grading, data resources, and benchmarking methodologies. First, from a historical perspective, we systematically trace the evolution of mechanical architectures and hardware paradigms, summarizing representative technical routes including fully actuated multi-finger designs, underactuated compliant mechanisms, tendon-driven systems, soft robotic hands, and rigid–soft hybrid structures. Our analysis indicates that the evolution of dexterous hand mechanisms is not merely an accumulation of degrees of freedom, but rather a gradual shift toward engineering-oriented paradigms characterized by underactuated coupling, material compliance, and hybrid structural design. By embedding adaptive coordination mechanisms into the mechanical body through passive responses, these approaches effectively reduce actuation and control dimensionality while physically enhancing robustness against object diversity and contact uncertainty. Building upon this foundation, we propose a systematic five-level taxonomy of dexterous intelligence (H1–H5) centered on the evolution of perceptual capability. H1 (Perception-Free) is characterized by open-loop program execution and teleoperation, where the system lacks environmental modeling and policy generation capabilities. H2 (Single-Modal Perception) introduces either vision or tactile feedback to enable perception-driven grasping and basic stability regulation. H3 (Multimodal Perception) integrates vision, tactile, and force sensing through deep multimodal collaboration, supporting complex fine manipulation tasks such as precision assembly, deformable object manipulation, and tool use. At this stage, systematic methodologies emerge across three technical directions: hierarchical task planning, multimodal servo control, and data-driven policy learning. H4 (Open Perception) centers on vision–language–action models and addresses perceptual generalization, long-horizon task planning, and deep multimodal fusion to enable language-guided open-world task understanding and zero-shot manipulation.
H5 (Dynamic Perception) envisions autonomous, evolving general manipulation capabilities supported by deep multimodal dynamic perception and real-time coordination mechanisms, representing a historical leap from robots as “tools” to embodied “symbiotic agents.” This taxonomy provides a unified reference framework for evaluating the technological transition of dexterous hands from repetitive execution to open-world task planning and ultimately toward autonomous evolution. Furthermore, we systematically review the key data resources and evaluation benchmarks that support dexterous intelligence from two complementary dimensions: real-world interaction and high-fidelity simulation. At the data level, real-world datasets offer ecological validity but suffer from high collection costs, limited scalability, and safety risks. Synthetic datasets and simulation platforms enable large-scale and diverse data generation at controllable costs but remain constrained by simplified contact models and the simulation-to-reality gap. We outline the evolution of synthetic datasets from static grasp poses to dynamic manipulation sequences and analyze representative resources in terms of their contributions to grasp generation, cross-hand generalization, articulated object manipulation, and long-horizon modeling. We further summarize the technological progression of simulation platforms from basic physical validation to high-fidelity interaction and cross-domain transfer. In terms of evaluation, we categorize performance metrics into outcome-oriented and process-oriented dimensions, including task success rate, grasp cycle time, target pose error, normalized task error, contact region error, stability and drop rate, as well as efficiency and robustness. Benchmark tasks are organized into five families: stable grasping and transport; re-grasping and contact transition; in-hand manipulation and reorientation; constrained operation and assembly; and tool use and functional manipulation. Together, these constructs form a systematic two-dimensional evaluation spectrum spanning contact complexity and temporal depth, emphasizing reproducible and diagnostically meaningful standards for assessing generalization capability and deployment readiness. Finally, we summarize the core challenges and future directions toward the general-purpose deployment of dexterous hands. From a data perspective, the scarcity of real interaction data and the persistent simulation-to-reality gap remain fundamental bottlenecks for effective policy transfer. From a modeling perspective, efficient and robust multimodal joint representations, 3D foundation model construction, and interpretable decision-making mechanisms have yet to converge into a unified theoretical framework, while inherent tensions persist between model scale and real-time inference requirements. From a hardware perspective, long-standing engineering trade-offs exist between high degrees of freedom and low cost, reliability, and lightweight design, as well as between precision force–tactile control and structural simplicity. 
Looking ahead, deep integration of perception, decision-making, and execution; incorporation of physical commonsense and causal reasoning through world models and embodied foundation models; generative AI–driven data-efficient learning and simulation credibility enhancement; biomimetic variable-stiffness mechanisms and endogenous tactile sensing through soft–hard co-design; and long-term real-world deployment with closed-loop optimization in high-value scenarios such as intelligent manufacturing, domestic service, and specialized operations will be critical pathways. These efforts will drive dexterous hands from laboratory prototypes toward reliable real-world applications, ultimately achieving the goal of general embodied intelligence capable of perceiving, reasoning, and manipulating “like a human hand.” This work provides a unified capability framework and systematic reference for understanding and tracking the frontier of robotic dexterous hands, offering theoretical guidance and practical insights for future research in hardware paradigm evolution, intelligence capability transition, and data and benchmarking system construction.
Mu Yao, Zhao Hao, Hu Ruizhen, Zhang Li, Li Hongyang, Yang Jiaolong, Wang Jingbo, Han Lei, Su Yongfeng, Xu Kai, Yang Yi, Li Jiang, Dai Ruoli, Chen Baoquan, Liu Yebin, Yi Li
DOI:10.11834/jig.260059
Abstract: As a critical and rapidly evolving domain within artificial intelligence, Embodied AI represents the convergence of computer vision, natural language processing, and robotics, aiming to create intelligent agents capable of perceiving, reasoning, and acting within the physical world. However, despite the transformative success of Large Language Models (LLMs) in the digital realm, Embodied AI faces unprecedented and multifaceted challenges that hinder the direct replication of the “large-scale pre-training plus scaling law” paradigm. These challenges include extreme data heterogeneity across different robot morphologies, strong physical constraints that demand safety and precision, and the prohibitively expensive interaction costs associated with collecting real-world robotic data. Consequently, simply scaling up model parameters without addressing these domain-specific hurdles has proven insufficient for achieving general-purpose robotic intelligence. This paper comprehensively reviews the frontier technical evolution of Embodied AI, offering a systematic analysis across four critical dimensions: data, models, systems, and evaluation, to chart a path toward more robust and generalized embodied agents. In terms of data, we propose a “Data Pyramid” structure designed to maximize data efficiency and transferability. This hierarchical framework advocates for the foundational use of massive, low-cost simulation data and internet-scale video datasets at the bottom layer to build broad physical commonsense and visual representations; the utilization of human interaction data (such as ego-centric videos and teleoperation logs) in the middle layer to facilitate behavioral mapping and intent understanding; and the strategic application of a small, high-quality amount of real-world robot data at the top layer for fine-tuning and final skill deployment, thereby bridging the reality gap. Regarding models, the paper critically discusses the current state of mainstream Vision-Language-Action (VLA) models, highlighting that while they excel at semantic understanding, they encounter significant scaling bottlenecks in continuous control and fine-grained manipulation. To overcome this, we identify “World Models” as a pivotal new direction for embodied pre-training. By learning to simulate environmental dynamics, predict future states, and understand causal relationships without explicit supervision, world models promise to endow agents with deeper physical intuition and superior generalization capabilities in unseen environments. In terms of systems, we observe a paradigm shift where the architecture is evolving from monolithic, single end-to-end models toward an Operating System-like “Hierarchical Architecture.” This evolution achieves the necessary decoupling of high-level semantic planning—powered by the reasoning capabilities of LLMs—and low-level motion control, which ensures precise execution and hardware compliance. This modular approach not only improves system robustness but also facilitates easier debugging and component upgrades. Finally, the paper examines the critical issues within current evaluation systems, specifically focusing on the challenges of authenticity in simulation benchmarks and the lack of reproducibility in real-world experiments. We argue that the field suffers from fragmented metrics that fail to capture the complexity of open-world interaction.
In conclusion, we provide a forward-looking perspective on the inevitable integration of locomotion and manipulation—moving beyond stationary arms to mobile manipulators—and anticipate the arrival of the “ImageNet moment” for Embodied AI, where standardized datasets and benchmarks will catalyze a Cambrian explosion of robotic capabilities, ultimately bridging the gap between digital intelligence and physical reality.
Keywords: Embodied AI;Data Pyramid;World Models;VLA Models;Hierarchical Control Architecture;Embodied Evaluation
Abstract: For the 21st century, this study is dedicated to exploring the fundamental theories, models and architectures of next-generation artificial neural networks (ANNs), with the goal of constructing high-performance ANNs featuring adaptive topology, interpretability, strong generalization capability and high energy efficiency. Since the initial proposition of ANNs in the 1940s, the field has witnessed over 80 years of development. Looking toward the mid-21st century, ANNs can be categorized into five generations based on five core dimensions, and such generational evolution constitutes the core trajectory of neural network research and advancement. This paper defines five generations of ANNs (abbreviated as xG-ANNs) from five perspectives: neuronal unit, information coding, network structure, learning mechanism and Turing test. 1G-ANNs: Threshold logic networks, represented by the M-P model and the Perceptron; 2G-ANNs: Continuous activation networks (e.g., sigmoid, tanh), typified by the classical back propagation (BP) network; 3G-ANNs: Spiking neural networks (SNNs), as introduced by Maass (1997); 4G-ANNs: Deep neural networks (DNNs), represented by AlexNet, ResNet, Transformer architectures and the attention mechanism, which have passed the conversational Turing test for disembodied intelligence; 5G-ANNs: Cognitive neural networks (CoNNs), with five defining characteristics: (1) cognitive units integrating memory, reasoning and human-like attention; (2) hybrid coding of semantic symbols and distributed representations; (3) modular cognitive architecture with dynamic reconfigurable topology; (4) meta-learning, causal reasoning and lifelong learning; (5) an embodied Turing test that remains to be passed. A core academic consensus has been formed regarding 5G-ANNs: such networks integrate neural computing, symbolic reasoning and cognitive architecture with low-power consumption, support dynamic topology, memory cognition, neuro-symbolic fusion and embodied/world models, and exhibit intrinsic merits including adaptive structure, few-shot generalization, interpretability, low energy consumption and embodied cognition. At present, the field is in the era of 4G-ANNs, which are characterized by data-driven fitting, deep learning, the attention mechanism and Transformer frameworks. Represented by large language model-based ChatGPT, 4G-ANNs have passed the conversational Turing test, yet such validation is a "black-box" assessment restricted to the emergence of disembodied intelligence. The root causes lie in the inherent asymmetry of large language models built on the scaling law of parameter expansion, their lack of comprehension of physical laws governing the real world, as well as critical drawbacks including jagged multi-modal and multi-form intelligent outputs and inferior energy efficiency. In contrast, neural networks deployed in embodied intelligent robots lack autonomous intelligence and can merely execute predefined actions per programmed instructions, leading to a huge gap between disembodied intelligence and embodied intelligence in 4G-ANN systems. Addressing these critical limitations of 4G-ANNs calls for the support of novel theories, models and architectural designs.
Currently, extensive debates and divergences persist over the developmental orientation and technical routes of next-generation ANNs. This paper analyzes and summarizes the developmental progress of mainstream theories, models and architectures of the first four generations of ANNs, focuses on the characteristics of several typical 4G-ANN models and their enhanced variants, and reviews representative architectures for 5G-ANNs, including world models, Joint Embedding Predictive Architecture (JEPA), cognitive spiral models and intelligence-trace cellular network frameworks. Finally, grounded in the theory of cognitive physics and practical technologies of driving brain cognition, this work proposes a lightweight architecture termed Embodied Cognitive Physics Neural Network (E-CoPNN), which incorporates the core characteristics of fifth-generation ANNs. Specifically, E-CoPNN features dynamic topology (three-layer nested structure and dual systems), memory cognition (three categories of memory), neuro-symbolic fusion (integration of Euclidean and non-Euclidean data, combination of lexemes and concepts), and embodied cognition/world models (fusion of disembodied and embodied intelligence, integration of cognitive space and physical space). The proposed architecture embodies the typical properties of 5G-ANNs: adaptive structure, few-shot generalization, interpretability, low power consumption and embodied cognition. Targeting brain-like general intelligence, 5G-ANNs take dynamic topology, autonomous memory, cognitive reasoning and structural evolution as core attributes, shifting the ANN research paradigm from data fitting to structural reconstruction, thus representing a pivotal direction for achieving machine autonomous intelligence and continuous learning. Conclusion and Significance: At present, the development of next-generation ANNs for the 21st century will drive a series of transformative revolutions: philosophically, a paradigm shift from mind-body dualism to embodied cognitive science based on embodied perception monism; theoretically, a disciplinary expansion from 20th-century biophysics to 21st-century cognitive physics; model-wise, a fundamental transition of ANN research from data fitting to structural reconstruction; in application, bridging the long-standing gap between disembodied intelligence and embodied intelligence in ANN development; generationally, an evolutionary leap from 4G-ANNs to 5G-ANNs featuring brain-like cognitive capabilities and adaptive topological structures. This work lays a solid foundation for the widespread application of embodied cognitive robots with learning, self-development, self-correction and human-robot interaction capabilities. Furthermore, it supports the convergent development of Nano-Bio-Info-Cogno (NBIC) — the four cutting-edge technologies led by cognitive integration, enhances human intellectual competence, and ushers in a new round of cognitive revolution.
Hu Jianfang, Huang Linjiang, Zhai Wei, Yan Ruisong, Li Chenglin, Zheng Weishi, He Ran, Zha Zhengjun, Xiong Hongkai
DOI:10.11834/jig.260085
Abstract: The intelligent driving foundation model integrates vision, language, and action through multimodal learning, driving the evolution of autonomous systems from the traditional “perception–planning–control” architecture towards an end-to-end unified paradigm. By leveraging the capabilities of unified representation, generative reasoning, and few-shot generalization, it significantly enhances system robustness and decision-making intelligence. From the research perspective, the intelligent driving foundation model incorporates achievements from multiple disciplines: the latest advances in visual computing, natural language processing, reinforcement learning, cognitive science, computer graphics, and virtual simulation are comprehensively applied within this system. At the industry level, global leading automotive and technology companies have also regarded large models as the technological cornerstone of the next generation of intelligent driving systems. This report systematically reviews the latest progress in the intelligent driving foundation model at both the international and domestic levels, covering decision planning, environmental perception, visual question answering, and data generation. Specifically, the decision planning section is dedicated to achieving a direct mapping from perception inputs to planning outputs through a unified large model architecture, while maintaining explainability and generalization capabilities. On one hand, to address the shortcomings of traditional end-to-end driving models in terms of interpretability, generalization ability, and safety verification, researchers have proposed a series of approaches that integrate language models with visual models, achieving “model thinking readability” through natural language. On the other hand, large language models also provide a unified language-level interface for the collaborative optimization of multi-vehicle planning, enabling different vehicle agents to engage in semantic communication, share intentions and policies, and form a group intelligence similar to the “implicit collaboration” among human drivers. The perception and motion prediction section covers not only detecting surrounding objects, but also understanding environmental semantics, inferring the behavioral intentions of other traffic participants, and performing multi-target trajectory prediction in dynamic scenarios. Traditional perception systems rely on dense labeling and geometric reconstruction models, which often experience performance degradation in long-tail scenarios (such as extreme weather and emergencies). To address these issues, the academic community has recently introduced large language models and multimodal fusion mechanisms, incorporating semantic reasoning into visual perception to achieve “semantic-enhanced visual understanding”. This perception-semantic integration design significantly deepens the understanding of complex environments by autonomous driving systems. Predictive capabilities can also be enhanced by introducing language reasoning. Some methods use language prompts as semantic guidance, combining visual and motion features for future trajectory prediction, an approach referred to as “language prompt guided prediction”. To enable the model to explain itself and communicate with human drivers in natural language, researchers have further introduced visual question answering into intelligent driving foundation models.
With this approach, driving models can not only answer questions such as “why is the vehicle slowing down” and “can we change lanes”, but also adjust policies based on semantic questions, achieving explainable and intervenable driving intelligence. Retrieval-augmented generation and chain-of-thought techniques have been applied as effective means of enhancing the question-answering capabilities of autonomous driving systems; this report discusses related methods in the visual question answering section. Data is a key driving factor for enhancing the capabilities of autonomous driving systems. High-quality, diverse, and semantically consistent driving data thus directly determines the generalization and safety performance of the model. Traditional data collection and annotation methods are extremely costly: for example, the annotation cost for urban-level autonomous driving can reach $3–5 per frame, with less than 5% coverage of long-tail scenarios. To address this issue, research focus has shifted to automatic annotation, self-supervised learning, and generative data synthesis. These methods aim to reduce dependence on manual annotation, synthesize in virtual space the rare samples that are difficult to capture in the real world, and form a closed-loop data engine enabling the co-evolution of models and data. The data generation section, organized by data source, explores how automatic annotation, generative data synthesis, world models, and integrated virtual–real simulation methods solve the problems of high cost and insufficient long-tail coverage in autonomous driving data. On these bases, this report conducts a horizontal comparison and further analyzes China’s strengths and limitations in terms of data resources, computational infrastructure, algorithmic innovation, and standardization. International research leads in theoretical depth and the integration of multimodal fusion, showing great innovative potential in unified architectures, generative world models, and collective intelligence, while China has significant advantages in engineering applications, real-time optimization, and scenario adaptation, with unique practical experience in data closed loops, automatic annotation, and computational optimization. Looking ahead, intelligent driving technology faces challenges in real-time performance, safety, and personalization. This report recommends strengthening fundamental research and public infrastructure, establishing a unified open-source data platform to promote the sharing and collaboration of multimodal data, building trustworthy AI evaluation systems, advancing personalized driving and human–AI alignment, and fostering an autonomous and controllable innovation ecosystem. Intelligent driving foundation models have become a crucial enabler for the high-quality development of China’s automotive industry and a new frontier in applied artificial intelligence. The algorithms and open-source codes mentioned have been summarized at: https://github.com/Ruisong-Yan/Intelligent-Driving-Foundation-Model, and can also be accessed via: https://www.scidb.cn/detail?dataSetId=3921ce7e24e44cf98428e3bc1494c410.
摘要:ObjectiveFacial Action Unit (AU) detection is a fundamental yet challenging task in affective computing and computer vision. AU, defined in the Facial Action Coding System (FACS), corresponds to the subtle activation of specific facial muscles, providing objective measurements that reveal human emotions, cognitive states, and mental health conditions. Automatic AU detection plays a critical role in a wide array of applications, including advanced emotion recognition systems, non-intrusive mental health monitoring, and natural human–computer interaction. However, despite significant progress, existing methods face two primary limitations. First, spatial relationship modeling often relies on fixed, predefined priors or learns only shallow feature correlations. These approaches are insufficient to capture the complex, non-linear co-activation and antagonistic patterns inherent in facial expressions, such as the synergistic relationship between AUs during a smile or the mutually exclusive nature of certain brow movements. Second, temporal dynamic modeling typically depends on computationally intensive techniques like optical flow or manually crafted constraints. These methods are not only prone to noise and tracking errors but are also difficult to scale for real-time applications due to their high computational cost. To overcome these issues, we propose a unified and lightweight spatiotemporal fusion framework for AU detection that jointly captures intra-frame spatial dependencies and inter-frame temporal dynamics in an efficient and scalable manner.MethodOur framework employs a ResNet-18 architecture as the backbone to extract robust per-frame facial feature representations. Building upon these features, the model integrates two core, lightweight modules: Spatial Relationship Modeling (SRM) and Temporal Relationship Modeling (TRM). The SRM module leverages Graph Neural Networks (GNN) to explicitly encode intra-frame AU structures. Instead of relying on static priors, it dynamically computes pairwise AU feature similarities via cosine similarity. A Top-K pruning mechanism is then applied to the resulting graph, retaining only the strongest connections to suppress spurious correlations and ensure a compact, interpretable, and efficient AU graph structure. This dynamic graph construction requires no additional anatomical labels. The TRM module focuses on temporal dynamics by first generating motion-sensitive features through simple frame-difference operations. This step effectively reduces redundant texture information, highlighting regions of change. A temporal GNN is then applied to these motion features to propagate information across a sequence of frames. This allows the model to distinguish transient noise from genuine AU activations that evolve smoothly over time. To integrate the complementary strengths of spatial context and temporal motion, we propose a Spatiotemporal Feature Fusion (SFF) strategy. This module concatenates the features from the SRM and TRM branches and employs an adaptive attention mechanism to dynamically balance their contributions on a per-instance basis. Finally, AU recognition is performed using a cosine-similarity-based classification module, which is robust to feature scale. 
A weighted cross-entropy loss function is employed during training to effectively alleviate the pervasive class imbalance problem by emphasizing the learning of infrequent AUs.ResultExtensive experiments were conducted to validate the proposed framework on two benchmark datasets, BP4D and DISFA. The results demonstrate that our approach achieves state-of-the-art or highly competitive performance while maintaining low computational and memory costs. On the BP4D dataset, the model reached an average F1-score of 66.00%, outperforming recent methods such as RE-Net and AC2D. On the DISFA dataset, it achieved an average F1-score of 65.34%, closely approaching the best-reported result. Comprehensive ablation studies confirmed the necessity and effectiveness of each module: removing the SRM module significantly impaired the detection of weak or infrequent AUs, removing the TRM module degraded performance on dynamic AU recognition, and removing the SFF fusion strategy reduced the model's adaptability and overall feature integration efficiency. The synergy of SRM, TRM, and SFF was essential for achieving superior performance. Notably, the framework excelled in detecting dynamic AUs such as AU6 (cheek raiser) and AU7 (lid tightener), demonstrating its particular strength in modeling temporal consistency and motion patterns. Statistical tests further verify the significance of our improvements over baselines. These findings highlight that the model is particularly robust in handling varying illumination conditions and subtle expression changes, showcasing its strong generalization capability for unconstrained environments.ConclusionIn conclusion, the proposed spatiotemporal fusion framework effectively integrates spatial and temporal modeling to enhance AU detection accuracy and robustness, while maintaining a lightweight and efficient design suitable for real-world applications. It provides an effective solution for AU detection in complex, dynamic environments and shows strong potential for deployment in affective computing and human–computer interaction. The model's ability to dynamically balance spatial and temporal features based on the input context is a key contribution. Future work will explore fine-grained local feature modeling around specific AUs, multi-scale temporal fusion strategies, and the integration of multimodal data (e.g., audio and text) to further improve generalization and performance in unconstrained real-world scenarios. We also aim to validate the framework across diverse demographic groups and clinical populations.
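To make the dynamic graph construction in the SRM module concrete, below is a minimal sketch of cosine-similarity adjacency with Top-K pruning. The function name build_au_graph, the default value of k, and the row-normalization step are illustrative assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def build_au_graph(au_feats: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Build a dynamic AU adjacency matrix from per-AU node features.

    au_feats: (N, C) tensor holding one C-dimensional feature per AU.
    Returns an (N, N) adjacency in which each node keeps only its
    k strongest connections (Top-K pruning suppresses spurious links).
    """
    normed = F.normalize(au_feats, dim=-1)
    sim = normed @ normed.t()                      # pairwise cosine similarity
    topk_vals, topk_idx = sim.topk(k + 1, dim=-1)  # +1 keeps the self-loop
    adj = torch.zeros_like(sim)
    adj.scatter_(-1, topk_idx, topk_vals)          # zero out all weaker edges
    # Row-normalize so each node aggregates a weighted mean of its neighbors.
    return adj / adj.sum(dim=-1, keepdim=True).clamp_min(1e-6)
```

Because the graph is recomputed from the current features, co-activation patterns can change per sample without any anatomical priors, which matches the label-free design described above.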
摘要:ObjectiveThe discrete Laplacian operator is a cornerstone in the field of 3D geometry processing and computer graphics, serving as the fundamental differential operator for a wide array of algorithms including spectral shape analysis, geometry compression, mesh smoothing, parameterization, and physics-based simulation. In the context of manifold triangular meshes, the Laplacian operator (often discretized via the cotangent formula) is well-defined and theoretically rigorous. However, with the rapid development of 3D acquisition technologies such as LiDAR and RGB-D cameras, point clouds have become a ubiquitous representation for 3D geometry. Unlike meshes, point clouds lack explicit topological connectivity and often exhibit irregular sampling densities, making the definition of a robust and accurate Laplacian operator a mathematically ill-posed and challenging problem. Traditional approaches typically attempt to approximate the underlying manifold by constructing local triangulations (e.g., Delaunay triangulation) or estimating tangent planes. While effective on high-quality data, these geometric methods are inherently fragile and computationally expensive when processing sparse, noisy, or non-uniform data common in real-world scans. Recently, the Neural Laplacian Operator (NeLo) has emerged as a promising data-driven alternative, utilizing Graph Neural Networks (GNNs) to learn edge weights on K-nearest neighbor (KNN) graphs. By imitating the behavior of the ground-truth Laplacian on a set of probe functions, NeLo achieves high accuracy on clean data. However, NeLo and its variants suffer from two critical limitations that hinder their practical application in "in-the-wild" scenarios. First, they typically rely on relative Cartesian coordinates as the primary input feature for the neural network. This design implicitly entangles the learned geometric descriptors with the global pose of the object, rendering the operator highly sensitive to rigid transformations. Consequently, the performance degrades significantly when the input point cloud is arbitrarily rotated, which is a common occurrence in robotic perception and autonomous driving. Second, existing architectures predominantly employ single-scale feature aggregation mechanisms. This design struggles to resolve the inherent trade-off in differential operator estimation: suppressing noise in flat regions requires a large receptive field, while preserving high-frequency details in sharp feature areas requires precise local information. A single scale often leads to either over-smoothing of sharp edges or under-smoothing of noise. To address these limitations, this paper proposes a novel Rotation-Robust Neural Laplacian Operator (RR-NeLo), a geometry-aware and pose-invariant framework designed to predict accurate, intrinsic Laplacian matrices for unoriented and noisy point clouds.MethodThe proposed RR-NeLo establishes an end-to-end learning framework comprising three synergistic components: a Local Reference Frame (LRF) alignment module, a dual-channel multi-scale backbone network, and a rotation-consistent training scheme. First, to fundamentally eliminate the interference of global pose variation, we introduce a robust LRF alignment module. Instead of feeding raw coordinates directly into the network, we compute a weighted covariance matrix for the local neighborhood of each point. 
By performing Eigen-decomposition on this covariance matrix, we derive a stable, content-based local coordinate system (canonical frame) that aligns with the principal geometric directions of the local surface. All neighboring points are then projected into this rotation-invariant canonical space. This transformation ensures that the input features to the subsequent GNN are intrinsic to the shape and invariant to any global rotation or translation, thereby guaranteeing the SE(3)-invariance of the learned operator. Second, to address the scale dilemma, we design a dual-channel multi-scale backbone network. Unlike previous methods that operate on a fixed KNN graph, our network constructs two parallel graph branches with varying neighborhood radii: a local detail branch focused on capturing fine geometric structures like corners and edges, and a global context branch focused on integrating broader neighborhood information for robust noise suppression. The features extracted from these two branches are fused via a Squeeze-and-Excitation (SE-Block) channel attention mechanism. The SE-Block adaptively recalibrates the channel-wise feature responses, assigning higher weights to fine-grained features in high-curvature regions and higher weights to coarse-grained features in planar regions. This mechanism allows the network to dynamically balance the trade-off between detail preservation and smoothing based on local geometric complexity. Finally, the network is trained using a hybrid loss function. In addition to the standard supervised loss that minimizes the discrepancy between the predicted Laplacian and the ground-truth mesh Laplacian on a set of spectral and spatial probe functions, we incorporate a novel rotation consistency loss. This regularization term explicitly enforces the network to produce identical Laplacian matrix values (up to permutation) for a shape and its randomly rotated counterparts. This constraint further regularizes the solution space and improves the generalization ability of the model on unseen orientations. Dataset DOI: 10.57760/sciencedb.31651.ResultTo validate the effectiveness of the proposed method, extensive comparative experiments were conducted on the synthetic ShapeNet dataset (comprising 12k shapes across 17 categories) and real-world noisy scans from the Kinect v2 dataset. We benchmarked RR-NeLo against representative traditional methods, including the Graph Laplacian, Heat Method, and NManifold, as well as the baseline deep learning model NeLo. Quantitative evaluations demonstrate that RR-NeLo significantly outperforms existing state-of-the-art techniques. More importantly, on the randomly rotated ShapeNet test set, where the baseline NeLo suffers a severe performance drop due to pose variation, RR-NeLo maintains high accuracy, reducing the Mean Squared Error (MSE) by 27.6% and improving the F-measure (a metric for structural fidelity) by 3.1% compared to NeLo. This result confirms the effectiveness of the LRF module in achieving robust rotation invariance. Furthermore, qualitative and quantitative evaluations on downstream geometry processing tasks—including heat diffusion, Laplacian smoothing, and geodesic distance computation—reveal that RR-NeLo effectively removes noise while preserving sharp features.
The experiments on real-world Kinect v2 scans further demonstrate that RR-NeLo generalizes well to uncontrolled data with severe noise and non-uniform sparsity, generating plausible Laplacian matrices without requiring explicit mesh reconstruction.ConclusionThis paper presents RR-NeLo, a robust neural framework for learning the discrete Laplacian operator on point clouds. By integrating Local Reference Frame alignment, the proposed method successfully disentangles geometric feature learning from global pose, solving the rotation sensitivity issue inherent in previous neural operators. The introduction of the dual-channel multi-scale attention mechanism further enhances the operator's ability to preserve high-frequency geometric details while suppressing noise. Experimental results confirm that RR-NeLo sets a new state-of-the-art in terms of accuracy, robustness, and generalization capability. This work provides a reliable mathematical tool for differential geometry processing on unoriented, non-uniform point clouds, bridging the gap between deep learning and classical geometry processing. Future work will explore extending this framework to learn other differential operators, such as the gradient and divergence, to support a wider range of physical simulations on point-based representations.
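The LRF alignment step can be illustrated in a few lines: eigendecomposition of a weighted neighborhood covariance yields a canonical frame into which the neighbors are projected. The function name, the inverse-distance weighting, and the omission of eigenvector sign disambiguation are simplifying assumptions, not the paper's exact procedure:

```python
import torch

def local_reference_frame(neighbors: torch.Tensor, center: torch.Tensor) -> torch.Tensor:
    """Project a point's neighborhood into a pose-invariant canonical frame.

    neighbors: (K, 3) coordinates of the K nearest neighbors.
    center:    (3,)   coordinates of the query point.
    """
    rel = neighbors - center                    # relative coordinates, (K, 3)
    # Inverse-distance weights: nearby points dominate the covariance.
    w = 1.0 / (rel.norm(dim=-1) + 1e-6)
    cov = (rel * w[:, None]).t() @ rel          # (3, 3) weighted covariance
    _, eigvecs = torch.linalg.eigh(cov)         # eigenvalues in ascending order
    frame = eigvecs.flip(-1)                    # principal direction first
    # Coordinates in this frame depend only on local shape, not global pose,
    # so a rotated copy of the input yields the same features (up to sign).
    return rel @ frame
```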
摘要:ObjectiveMultimodal emotion recognition aims to understand the emotional states of specific subjects by fusing data from multiple modalities such as text, vision, and audio. However, the inherent heterogeneity in the representation forms and statistical distributions of different modality data means that emotional semantics in the latent feature space are often entangled and coupled with modality-specific non-emotional noise. This feature entanglement not only hinders the model's effective learning of key emotional features but also limits the interpretability of the model's decision-making process. Furthermore, existing feature fusion strategies often adopt simple concatenation operations or coarse-grained attention mechanisms, making it difficult to effectively capture fine-grained emotional semantic interaction cues between modalities in complex cross-modal contexts, ultimately resulting in fused emotional representations that lack sufficient discriminability. To this end, an interpretable invertible disentanglement and adaptive fusion method for multimodal emotion recognition is proposed.MethodFirst, to reduce the loss of semantic information during the feature learning phase and achieve structured feature disentanglement, an invertible attention mask-based disentanglement (IAMD) module is designed. Based on invertible neural networks (INN), a bidirectional invertible mapping structure is constructed between the latent representations of each modality's features and emotional semantic factors, and an attention mask mechanism is combined to disentangle latent features in the channel dimension into two parts: one part capturing shared features with semantic consistency across modalities, and the other retaining specific features containing unique attributes of each modality. Second, to further enhance the disentanglement effect from an information-theoretic perspective, a mutual information constraint (MIC) mechanism is constructed. The semantic consistency of emotional features in the shared subspace is enhanced by calculating and maximizing the mutual information between shared features as well as between shared features and emotion labels. Meanwhile, by minimizing the mutual information between specific features and emotion labels conditioned on shared features, the model is constrained to strip modality-specific attributes not directly related to the emotion task into the specific feature subspace, thereby reducing the interference of modality-redundant noise on emotional semantics. Finally, addressing the issue of insufficient interaction during the feature fusion phase, a semantic-guided adaptive feature fusion (SGAFF) module is designed. The cross-modal consistent emotional semantic information captured in the shared subspace is utilized by this module as contextual cues to perform fine-grained semantic correction and guidance on modality-specific features through residual connections, and a dual-branch prediction structure is constructed, in which a gating mechanism adaptively assigns weights to the shared branch and the specific-guided branch, thereby enhancing the discriminability of the fused representation.ResultExtensive comparative experiments and ablation studies were conducted on the CMU-MOSI, CMU-MOSEI, and UR-FUNNY datasets. Specifically, on CMU-MOSI, the model improved mean absolute error (MAE) and 7-class accuracy (Acc-7) by 2.4% and 2.9%, respectively, compared to the disentangled language focused (DLF) model.
On CMU-MOSEI, it yielded improvements of 2.6% and 1.7% in MAE and Pearson correlation coefficient (Corr), respectively, compared to the Transformer-based multimodal binding learning (TMBL) model. Furthermore, on UR-FUNNY, the model improved the F1-score (F1) by 5.6% compared to the modality-invariant and specific analysis (MISA) model. In addition, detailed ablation experiments verified the necessity of the IAMD, MIC, and SGAFF modules for improving model performance. Feature visualization analysis based on t-distributed stochastic neighbor embedding confirmed that the model realized the effective separation of emotional semantics and modality noise in the latent space. Furthermore, fusion weight visualization confirmed that the model adaptively assigned higher contributions to the specific-guided branch, confirming the role of fine-grained complementary cues in final emotion judgment.ConclusionIn summary, the proposed method achieves interpretable disentanglement of emotional semantic information from modality-specific noise based on INN and mutual information constraints; at the same time, through the semantic-guided adaptive fusion strategy, it realizes deep and fine-grained interactions between cross-modal emotional semantic features, thereby improving the accuracy and robustness of multimodal emotion recognition in complex scenarios. Although the proposed method achieves significant progress in model interpretability, the introduction of invertible transformations and multiple mutual information constraints increases the model's computational complexity. The method is applicable to multimodal scenarios with complex modal heterogeneity, as well as tasks with strict requirements on quantitative emotion recognition metrics. Future work will focus on lightweight disentanglement for emotion recognition tasks, to further improve the inference efficiency and generalization ability of the model in scenarios with limited computational resources. The source code has been archived at https://doi.org/10.57760/sciencedb.j00240.00138.
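As a rough illustration of the gated dual-branch fusion described for the SGAFF module, the sketch below weights a shared branch against a specific-guided branch. The class name, the single-layer guidance projection, and the sigmoid gate are expository assumptions rather than the published architecture:

```python
import torch
import torch.nn as nn

class GatedDualBranchFusion(nn.Module):
    """Adaptive weighting of shared and specific-guided emotion features."""

    def __init__(self, dim: int):
        super().__init__()
        self.guide = nn.Linear(dim, dim)  # semantic guidance of specific features
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, shared: torch.Tensor, specific: torch.Tensor) -> torch.Tensor:
        # Residual correction: shared semantics refine the specific branch.
        guided = specific + self.guide(shared)
        # Per-dimension gate decides how much each branch contributes.
        g = self.gate(torch.cat([shared, guided], dim=-1))
        return g * shared + (1.0 - g) * guided
```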
Zheng Yajing, Zhao Rui, Zhu Lin, Liu Yujia, Huang Tiejun
DOI:10.11834/jig.260128
摘要:With the rapid development of neuromorphic vision sensors, spike cameras have emerged as a promising paradigm for continuous-time visual perception. Unlike conventional frame-based cameras that sample scenes at fixed frame rates, spike cameras encode luminance variations as asynchronous binary spike streams triggered by intensity accumulation at each pixel. This sensing mechanism changes the way visual information is acquired and represented. Instead of producing discrete image frames, spike cameras generate continuous spike events that directly reflect temporal changes in scene intensity. As a result, spike cameras provide several unique advantages, including extremely high temporal resolution, wide dynamic range, low motion blur, and sparse event-driven representations. These properties make spike cameras particularly suitable for challenging visual environments such as high-speed motion analysis, extreme illumination conditions, and subtle temporal change detection, where traditional imaging systems often encounter limitations. Spike vision also introduces new challenges for visual computing. The statistical characteristics and data structures of spike streams differ significantly from those of conventional images or videos. Spike cameras produce sparse binary spike sequences in continuous time rather than dense intensity frames. Many established computer vision algorithms therefore cannot be directly applied to spike data without modification. Effective processing of spike streams requires new modeling strategies that explicitly consider the temporal dynamics and sparsity of the signals. This survey reviews recent progress in spike vision research from the perspective of hierarchical modeling of continuous-time spike representations. Existing methods are organized into several levels reflecting the progressive expansion of spike vision capabilities, ranging from signal modeling and reconstruction to semantic perception and system deployment. At the lowest level, physically consistent modeling of spike generation and sensor noise provides a foundation for understanding spike data statistics. Studies in this direction analyze pixel triggering mechanisms, noise characteristics, and spike accumulation processes, forming the basis for reliable signal processing and algorithm design. Low-level visual reconstruction methods aim to recover stable visual signals from spike streams. Representative tasks include intensity reconstruction, high dynamic range (HDR) imaging, motion deblurring, super-resolution, and low-light enhancement. These approaches convert spike sequences into interpretable intensity representations while preserving the temporal information contained in the spike data. The next level focuses on spatiotemporal modeling. The continuous-time nature of spike streams enables joint modeling of spatial structure and temporal motion. Research in this area addresses problems such as optical flow estimation, motion segmentation, and dynamic scene analysis. Compared with frame-based methods, spike-based models provide improved temporal fidelity in fast motion scenarios. At the semantic perception level, spike representations are increasingly applied to tasks such as object detection, recognition, and multi-object tracking. Continuous spike streams are integrated with deep neural networks, transformer architectures, or spiking neural networks (SNNs) to perform higher-level visual reasoning. These methods exploit the temporal sparsity of spike data while maintaining low-latency processing.
Spike cameras have also been introduced into three-dimensional scene modeling. Recent studies combine spike streams with neural implicit representations to reconstruct static or dynamic 3D scenes. Continuous spike measurements provide detailed temporal information that can benefit dynamic scene reconstruction and neural rendering. System-level considerations play an important role in practical spike vision deployment. Evaluation of spike-based methods involves not only accuracy but also system metrics such as latency, throughput, and energy consumption. These metrics become particularly important in real-time perception systems and edge computing scenarios. Progress in spike vision has also been supported by the development of datasets, simulation tools, and open-source platforms. Because collecting spike camera data can be costly, spike simulators are widely used for algorithm development and validation. Simulation methods attempt to reproduce sensor physics and temporal spike generation processes. Public datasets and benchmarking protocols further support reproducible research. The SpikeCV platform provides a unified open-source framework for spike vision research. It integrates datasets, algorithm implementations, hardware interfaces, and evaluation tools, allowing researchers to rapidly prototype and evaluate spike-based algorithms. The platform has helped facilitate collaborative development and reproducible experiments within the community. Research activity in spike vision has grown rapidly in recent years. Publications, open-source resources, and benchmark datasets have increased steadily. Two international competitions organized in 2025 attracted broad participation from academic institutions and industry teams. These competitions encouraged standardized task definitions and stimulated methodological progress in the field. However, continuous-time representation learning for spike streams is still an active area of research. Large-scale self-supervised learning for spike data remains largely unexplored. Multimodal fusion with complementary sensors introduces additional challenges in temporal alignment and noise modeling. System-level optimization involving latency, throughput, and energy consumption also requires further investigation. Hardware–algorithm co-design with neuromorphic processors may provide new opportunities for efficient spike-based computation. Spike vision represents an emerging direction that reconsiders visual perception from a continuous-time perspective. Advances in sensor technology, representation learning, and system integration are gradually forming a new framework for visual computing. Continued progress in spike-based sensing, modeling, and deployment may enable high-speed, energy-efficient visual intelligence for future perception systems. Taken together, this survey provides a structured overview of spike vision research from the perspectives of sensing principles, representation modeling, perception algorithms, and system-level infrastructure. By organizing existing studies within a hierarchical framework of continuous-time spike representations, the survey highlights how spike vision methods evolve from low-level signal reconstruction to high-level semantic perception and three-dimensional scene understanding. The discussion of datasets, simulators, evaluation protocols, and open-source platforms further reflects the growing research ecosystem surrounding spike-based vision.
Through this synthesis, we aim to clarify the relationships among different research directions, summarize the current development status of the field, and provide a reference for future work on continuous-time visual computing and neuromorphic perception systems.
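The integrate-and-fire sensing principle that underlies spike cameras can be summarized with a simplified simulator, assuming ideal noise-free pixels and subtract-on-fire reset; real sensors add noise and quantization effects that this sketch omits:

```python
import numpy as np

def simulate_spike_stream(frames: np.ndarray, threshold: float = 255.0) -> np.ndarray:
    """Convert a high-rate luminance sequence (T, H, W) into binary spikes.

    Each pixel accumulates incoming intensity; whenever the accumulator
    crosses the threshold, it emits a spike and subtracts the threshold,
    preserving the residual charge for the next time step.
    """
    acc = np.zeros(frames.shape[1:], dtype=np.float64)
    spikes = np.zeros(frames.shape, dtype=np.uint8)
    for t, frame in enumerate(frames):
        acc += frame                    # continuous intensity accumulation
        fired = acc >= threshold
        spikes[t][fired] = 1            # asynchronous binary spike events
        acc[fired] -= threshold         # keep residual charge after firing
    return spikes
```

Because brighter pixels cross the threshold more often, inter-spike intervals directly encode scene intensity, which is the property that the low-level reconstruction methods surveyed above exploit.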
Li Chunyi, Zhang Jianbo, Xiao Jiahao, Yan Bowen, Guo Shengyu, Ye Tongrui, Lin Weisi, Zhai Guangtao
DOI:10.11834/jig.250550
摘要:Embodied intelligence, in which agents sense, reason, act, and learn through a tight “body–environment–task” feedback loop, is widely regarded as the most promising route to general-purpose artificial intelligence. Paradoxically, while model parameters, training data, and compute budgets have grown by four orders of magnitude in only five years, the evaluation ecosystem remains fragmented, ad hoc, and largely irreproducible. The same functional capability (e.g., “pick-and-place”) is formalized as object-detection mAP in one paper, as 3-D IoU in a second, as task-success rate in a third, and as human-preference score in a fourth, with absolute gaps exceeding 10–15% among “state-of-the-art” results. Because environments, random seeds, physics parameters, sensor noise models, and success criteria are rarely open-sourced, the community cannot tell whether an apparent improvement comes from algorithmic innovation, data scale, evaluation cherry-picking, or simple benchmark overfitting. This uncertainty has become a critical bottleneck for both scientific progress and industrial adoption.
摘要:With the rapid development of image editing tools, mobile image processing applications, and generative artificial intelligence techniques, visually realistic manipulated images can now be produced with increasing ease, which poses serious challenges to the authenticity and credibility of digital visual content. In practical scenarios, tampered images are often subjected not only to splicing, copy-move, or content removal operations, but also to multiple post-processing procedures, such as compression, resampling, filtering, smoothing, and format conversion. These operations tend to weaken the forensic traces left by tampering and make the manipulated regions exhibit weak residual artifacts, ambiguous boundaries, and inconsistent structural cues. In addition, the discriminative clues useful for tampering localization are usually distributed across different feature domains. RGB images provide rich semantic appearance information, whereas noise-related representations contain subtle forensic evidence that is less dependent on visual semantics. However, existing methods often fail to fully exploit the complementarity between these heterogeneous modalities, and they remain vulnerable to semantic interference, weak trace attenuation, and complex degradation conditions. To address these problems, a direction-aware cross-modal multi-level tampering localization framework, termed DA-CMTL, was proposed for robust image tampering localization. The proposed method aimed to improve the representation and reasoning of manipulation traces by jointly modeling appearance information, forensic residual information, and directional structural priors in a unified framework. Specifically, a dual-branch architecture was constructed to extract complementary features from the RGB domain and the noise domain, respectively. The RGB branch was used to preserve contextual and structural appearance cues, while the noise branch was designed to highlight subtle statistical anomalies and residual inconsistencies caused by tampering. To enhance the discriminative ability of the extracted features, a hierarchical spatial-channel attention mechanism was introduced. This mechanism enabled the network to focus on suspicious local regions in the spatial dimension while adaptively emphasizing manipulation-sensitive responses in the channel dimension, thereby improving the perception of weak tampering traces and reducing the influence of irrelevant background content. Furthermore, considering that manipulated boundaries and residual artifacts often exhibit directional dependency, a multi-directional cross-modal fusion module was designed to incorporate horizontal, vertical, and diagonal directional priors into the feature interaction process. Through explicit direction-aware modeling, the proposed framework was able to better capture structural continuity, boundary variation, and anisotropic trace distribution in manipulated regions. In addition, an adaptive cross-modal gating strategy was employed to achieve collaborative fusion between RGB features and noise features. This strategy allowed the network to suppress redundant or conflicting information across modalities and selectively strengthen integrity-critical cues that were more informative for tampering localization. Extensive experiments were conducted on multiple public image tampering localization datasets containing diverse manipulation types and challenging post-processing conditions. 
The effectiveness of the proposed method was evaluated by using quantitative metrics such as F1-score and intersection over union, together with qualitative comparisons, ablation studies, and robustness analyses. Experimental results showed that the proposed DA-CMTL framework achieved superior or highly competitive performance compared with several mainstream tampering localization methods on different benchmarks. In particular, the proposed method demonstrated stronger robustness and more stable localization performance in scenarios involving complex manipulations, weak residual traces, and multiple post-processing disturbances. The predicted tampering masks were more accurate and structurally complete, especially for irregular boundaries, fine-grained manipulated regions, and semantically confusing areas. Comparative analysis further indicated that some existing methods were prone to semantic-related false detections or incomplete region localization when forensic clues became weak or fragmented, whereas the proposed framework was more effective in suppressing semantic interference and preserving detailed manipulation structures. The ablation results verified that each designed component contributed positively to the final performance, and the combination of hierarchical attention, direction-aware modeling, and adaptive cross-modal fusion yielded the most significant improvement. These findings demonstrated that explicitly integrating directional priors and cross-modal complementary reasoning was beneficial for enhancing image tampering localization under realistic conditions. Overall, the proposed method provided an effective solution for modeling weak and direction-sensitive manipulation traces, and it showed promising application value in multimedia forensics, digital evidence analysis, misinformation detection, and trustworthy image authentication.
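The adaptive cross-modal gating strategy can be pictured as a channel-wise gate over the RGB and noise streams, as in the sketch below; the module name and the squeeze-style pooling design are assumptions, and the paper's actual gating may differ in detail:

```python
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    """Channel-wise gate fusing RGB-domain and noise-domain feature maps."""

    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze spatial dims
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
        # The gate, inferred from both modalities jointly, decides per channel
        # whether appearance cues or forensic residuals are more reliable,
        # suppressing redundant or conflicting information across modalities.
        g = self.fc(torch.cat([rgb, noise], dim=1))
        return g * rgb + (1.0 - g) * noise
```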
摘要:ObjectiveAt present, deep learning–based steganalysis methods generally outperform traditional handcrafted-feature approaches in conventional settings; however, under white-box adversarial attacks constructed by exploiting model gradients, their ability to detect adversarially generated stego images often degrades substantially. Meanwhile, the adversarial perturbations introduced to deceive deep neural networks may induce non-natural local pixel variations or abnormal neighborhood dependencies in stego images, thereby disrupting high-order statistical regularities. Such artifacts make adversarial stego images more detectable by traditional steganalysis methods that rely on handcrafted features and statistical analysis. Therefore, a key problem is how to effectively exploit the complementary strengths of deep learning–based and traditional approaches within a unified steganalysis framework. In addition, existing adversarial-training strategies typically require a large number of adversarial stego images with ground-truth labels, whereas in real-world deployments such images are difficult to obtain and their labels are often unavailable. Motivated by these practical constraints, this paper proposes an integrated steganalysis framework designed for adversarial steganography scenarios.MethodFirst, we employ a traditional steganalysis method based on SRM (Spatial Rich Model) handcrafted features and a deep learning–based steganalysis method, Ye-Net, as heterogeneous base learners. For an input sample, both detectors output the posterior probability that the sample is a stego image, and their outputs are integrated through an ensemble strategy to reduce the performance volatility of any single detector under adversarial perturbations. Since the two detectors exhibit markedly different output scales and calibration characteristics, we introduce a sigmoid-based alignment with a normalization intensity factor to match their output distributions and map them into a comparable probability space. After this calibration, we obtain a probability-level feature representation that serves as the input to the subsequent classifier. At the classifier level, we construct a deep classifier based on DANN (Domain-Adversarial Neural Network). The classifier consists of a feature extractor, a label predictor, and a domain classifier. The feature extractor and the domain classifier are connected via a gradient reversal layer and engage in an adversarial game: the feature extractor is trained not only to minimize the label predictor loss, but also to suppress domain separability, thereby learning domain-invariant features that are transferable across the non-adversarial domain and the adversarial stego domain. The learned features are then fed into the label predictor to produce the final label. This design enables effective training in scenarios where ground-truth labels in the adversarial stego domain are unknown or unavailable. In addition, we observe that in adversarial steganography settings a subset of adversarial stego images may undergo excessively large shifts in the feature space. During domain-adversarial alignment, such samples can be misleadingly pulled toward the cover region, which triggers negative transfer and deteriorates training effectiveness. To address this issue, we introduce a deviated-sample identification module, implemented with a multi-layer perceptron, before domain-adversarial training. 
The module identifies and filters out target-domain stego images with overly strong adversarial deviations, thereby mitigating negative transfer, stabilizing domain alignment, and further improving cross-domain generalization under adversarial perturbations.ResultTo evaluate robustness under different adversarial intensities, this paper constructs multiple test sets by mixing adversarial stego images and non-adversarial stego images at varying ratios. The experimental results reveal a distinct performance divergence among baseline detectors as the adversarial ratio increases. Deep learning-based models (including Ye-Net, SRNet, and LWENet) demonstrate superior detection accuracy in non-adversarial scenarios; however, their overall detection error Pe (Probability of Error) deteriorates significantly as the proportion of adversarial stego images rises, confirming that deep discriminative features are highly vulnerable to targeted gradient-based perturbations. Conversely, traditional methods based on handcrafted features (such as SRM and SPAM) exhibit an inverse trend, where error rates decrease or stabilize under high-intensity adversarial steganography conditions, indicating that statistical residual features are more sensitive to the abnormal artifacts introduced by adversarial perturbations. Leveraging the complementary nature of these characteristics, the proposed fusion-based steganalysis model consistently maintains the lowest and most stable detection error across most mixing ratios, outperforming even the specialized adversarial defense model, KDNFT. Experimental results demonstrate that under adversarial attacks of varying intensities, the proposed model reduces the average error rate by 15.95% and 6.06% compared to traditional models (SPAM and SRM, respectively), by 10.93% to 19.50% compared to deep learning models (Ye-Net, SRNet, and LWENet), and by 5.90% compared to the robustness-enhanced method KDNFT, achieving state-of-the-art (SOTA) performance in adversarial steganography scenarios.ConclusionThis paper presents a fusion-based steganalysis framework that achieves stable and effective cross-domain alignment without requiring ground-truth labels in the adversarial stego domain. By integrating SRM and Ye-Net within an ensemble representation, aligning heterogeneous outputs via normalization intensity factors, and adopting a domain-adversarial transfer classifier enhanced with a deviated-sample filtering mechanism implemented with a multi-layer perceptron, the proposed approach substantially improves robustness and generalization under adversarial steganography conditions. The framework provides a practical and promising pathway toward more accurate and highly reliable steganalysis systems in adversarial environments.
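The gradient reversal layer at the core of the DANN-based classifier is compact enough to show in full. The following is the standard construction, with the scaling factor lambd treated as a hyperparameter:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient backward."""

    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient pushes the feature extractor to confuse
        # the domain classifier, yielding domain-invariant features that
        # transfer from the non-adversarial to the adversarial stego domain.
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)
```

Placed between the feature extractor and the domain classifier, this layer lets a single backward pass jointly train the label predictor and the adversarial domain game.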
摘要:ObjectiveScarce annotated data is a critical bottleneck constraining the performance of remote sensing hyperspectral image classification models. The small-sample condition not only leads to model overfitting but also severely limits generalization capability. To address this issue, this study explores a novel solution that couples data augmentation with semantic enhancement. We utilize pre-trained language models to generate diverse semantic prompts, which guide the development of a data augmentation technique that strictly maintains category consistency. By generating training samples that are semantically rich and distributionally diverse, this technique aims to fundamentally strengthen the model's understanding of the core semantics of categories, rather than merely memorizing the limited training samples. Experimental results demonstrate that the proposed method effectively enhances the model's classification accuracy and generalization capability on the target domain, offering a new pathway for small-sample hyperspectral image classification.MethodsThis study employs the HyperBlend data augmentation method and the PromptMix mechanism. At the data level, the HyperBlend method performs cropping and blending of hyperspectral images from the same category, followed by the application of random rectangular masks. This process simulates real-world scenarios involving occlusion, noise, and illumination variations, thereby generating blended images that maintain semantic consistency while exhibiting visual diversity. The method is straightforward to implement, allows adjustable mask ratios, and effectively expands the training dataset. At the semantic level, the PromptMix mechanism utilizes a pre-trained BERT model to generate three diverse textual prompts for each target domain category, from which semantic features are extracted. During training, different prompt features are randomly selected as class prototypes, and a diversity regularization loss is introduced to constrain the discrepancies among these features. This prevents semantic representations from becoming homogenized, thereby enriching class representations and mitigating overfitting. In terms of loss function design, the framework integrates classification loss, cross-modal alignment loss, target-domain supervised contrastive loss, and the newly proposed diversity regularization loss, achieving end-to-end optimization.ResultsExperiments were conducted on four standard hyperspectral datasets: Indian Pines (IP), Houston (HT), Salinas (SA), and WHU-Hi-LongKou (LK), achieving significant improvements. On the IP dataset, compared with the SCFDA method, the overall accuracy (OA) increased by 3.8%, the average accuracy (AA) by 2.67%, and the Kappa coefficient (KC) by 4.3%. On the SA dataset, compared with the MEDPL method, OA improved by 1.16%, AA by 0.84%, and KC by 1.28%. On the LK and HT datasets, the classification performance was comparable to the best existing results of the SCFDA and MEDPL methods, demonstrating strong competitiveness.ConclusionThis paper introduces HyperBlend, a novel data augmentation method that generates high-quality, diverse training samples through structured image blending and mask operations with minimal computational overhead.
By strategically fusing spectral-spatial features from limited labeled data, HyperBlend effectively simulates realistic inter-class variations and boundary conditions, addressing the data scarcity challenge in hyperspectral image analysis. Concurrently, we propose the PromptMix mechanism, which introduces semantic-level diversity through learnable prompt embeddings to enhance the model's semantic discrimination capability. This mechanism operates in the feature space, generating semantically meaningful perturbations that encourage the model to learn more robust feature representations. The synergy between HyperBlend and PromptMix creates a comprehensive augmentation framework that operates at both the pixel level (through structured image mixing) and the semantic level (through prompt-based feature manipulation). The combined approach significantly improves model robustness and generalization performance in few-shot learning scenarios. Crucially, both methods feature simple implementation architectures requiring neither complex network designs nor substantial parameter additions. This practical and efficient solution provides a valuable reference for the hyperspectral image analysis community, particularly for tasks with limited labeled data. Extensive experiments demonstrate consistent performance improvements across multiple benchmark datasets, validating the effectiveness of our approach for enhancing model adaptability in data-constrained environments. Key contributions include: 1) the HyperBlend method for structured spectral-spatial augmentation; 2) the PromptMix mechanism for semantic-level feature diversification; and 3) a lightweight, easily implementable framework that significantly boosts few-shot learning performance without additional computational burden.
关键词:hyperspectral image (HSI); cross-domain few-shot learning; HyperBlend data augmentation method; PromptMix mechanism; multimodal feature alignment
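A minimal sketch of the same-class blend-and-mask operation described for HyperBlend is given below; the blend coefficient, the zero-fill masking, and the function signature are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def hyperblend(a: np.ndarray, b: np.ndarray, alpha: float = 0.5,
               mask_ratio: float = 0.3, rng=None) -> np.ndarray:
    """Blend two same-class hyperspectral patches, then mask a random rectangle.

    a, b: (H, W, C) patches drawn from the SAME class, so the blended
    sample keeps its label while gaining visual diversity; the mask
    simulates occlusion, noise, or illumination dropout.
    """
    rng = rng or np.random.default_rng()
    mixed = alpha * a + (1.0 - alpha) * b
    h, w = mixed.shape[:2]
    mh, mw = max(1, int(h * mask_ratio)), max(1, int(w * mask_ratio))
    y = rng.integers(0, h - mh + 1)
    x = rng.integers(0, w - mw + 1)
    mixed[y:y + mh, x:x + mw, :] = 0.0   # random rectangular mask
    return mixed
```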
Liu Rong, Tang Qiling, Wang Yan, Chen Pengzhou, Shu Chang, Wang Shuai, Yue Jianchi
DOI:10.11834/jig.260056
摘要:ObjectiveMitotic count is a key quantitative indicator for evaluating the grade and prognosis of invasive breast cancer in histopathological assessment. The accuracy of its detection and statistics directly affects clinicians' judgment of the proliferative activity and malignancy of the tumor, and also bears on the selection of treatment plans and the prediction of therapeutic efficacy. In recent years, deep learning-based mitotic target detection has played an important role in clinical diagnosis. Existing object detection approaches can generally be categorized into anchor-based and anchor-free methods. Anchor-based detectors generate candidate regions by predefining a large number of anchor boxes with different scales and aspect ratios on feature maps, enabling relatively accurate localization. Nevertheless, the design of anchors introduces numerous redundant hyperparameters, resulting in high computational cost and memory consumption, which consequently limits detection efficiency. In contrast, anchor-free detection methods predict object locations and categories in a pixel-wise manner, leading to a simpler architecture and higher computational efficiency. However, compared with two-stage detectors that use RoI Pooling or RoI Align to perform feature resampling and spatial alignment on candidate regions, single-stage detectors lack effective region aggregation mechanisms. This limitation makes it difficult to extract robust and discriminative features from complex backgrounds, thereby affecting detection performance. At the same time, the acquisition of breast pathological images often involves different brands of microscope scanners, differentiated imaging parameter configurations, and varying tissue staining procedures. These factors lead to significant domain discrepancies in pixel distributions, staining intensity, cellular boundary clarity, and background texture patterns across datasets. Consequently, detection models trained on data from a single laboratory often suffer performance degradation when applied to data from other institutions, as the input feature distributions change and the learned decision boundaries may no longer remain valid. Such domain shifts may introduce decision biases in the model and lead to incorrect predictions in downstream tasks such as survival analysis and tumor grading, ultimately hindering the reliability and large-scale clinical deployment of these models. In addition, in breast pathological image analysis, apart from the mitotic cells that need to be accurately identified, there are also a large number of interference samples with morphological features highly similar to mitotic cells; we define these samples as hard negative samples. These samples mainly include cell nuclei in interphase, apoptotic cells that have shrunk or fragmented, cell fragments produced during tissue section preparation, and artifact structures formed by slice folds or uneven staining. Because these samples overlap substantially with mitotic cells in key morphological features such as nucleus-to-cytoplasm ratio, staining intensity, and local texture patterns, even experienced pathologists often need to examine them repeatedly under high-magnification microscopy for confirmation.
Such hard negative samples easily induce misclassification during model inference, thereby significantly increasing the false positive rate and ultimately impairing the accuracy of mitotic cell counting.MethodTo alleviate the feature misalignment problem of anchor-free detectors, inspired by the deformable attention module in Deformable DETR and by deformable convolutional networks (Deformable ConvNets), an adaptive context alignment strategy is proposed, which dynamically aggregates context information that matches the target structure, effectively improving the expressiveness and discriminability of features. To address the decline in generalization performance caused by domain differences in pathological image analysis, a multi-granularity cross-domain adaptive module is designed. This module operates at three granularities: at the image level, it eliminates apparent interference caused by the color style and texture distribution of the overall image, as well as differences in staining depth and device parameters; at the foreground level, it aligns the local structure of the cell region through center perception and suppresses background responses; at the category level, the information of hard negative samples is fully utilized, and high-confidence pseudo labels are adopted as the supervisory signal. Two parallel branches are optimized separately: one branch is responsible for aligning the feature distributions of similar samples between the source domain and the target domain; the other focuses on enhancing the inter-class distinguishability between mitotic cells and hard negative samples. Through the synergy of multi-granularity feature alignment, the impact of data distribution heterogeneity on model inference is effectively weakened, significantly improving the detection accuracy, stability, and generalization ability of the model on breast pathological images from unseen domains.ResultsThe proposed method demonstrates superior detection performance and generalization ability on mainstream public datasets, fully verifying its effectiveness in the mitotic cell detection task. On the ICPR MITOSIS 2014 dataset, the method achieves the best F-score reported to date, a significant improvement of 5.5% over existing best methods; on the MIDOG2021 dataset, it further achieves the highest recall, significantly reducing the risk of missing key target cells in clinical pathological analysis.ConclusionThe proposed detection method addresses the inherent defects of single-stage detectors in feature processing, effectively improving the accuracy of mitotic cell detection. At the same time, its excellent generalization performance breaks through the performance barriers of traditional models between source-domain and target-domain data, providing a more reliable and efficient solution for the automated and precise detection of mitotic cells in breast pathological images.
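To clarify the offset-based sampling idea behind the adaptive context alignment strategy, the sketch below predicts per-location offsets and gathers context via grid sampling. The module name, the number of sampling points, and predicting offsets directly in normalized coordinates are simplifications of deformable attention, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveContextAlign(nn.Module):
    """Sample features at learned offsets around each location and fuse them."""

    def __init__(self, channels: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        # Predict (dx, dy) in normalized coordinates for each sampling point.
        self.offset = nn.Conv2d(channels, 2 * num_points, kernel_size=3, padding=1)
        self.proj = nn.Conv2d(channels * num_points, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        offsets = self.offset(feat).view(b, self.num_points, 2, h, w)
        # Base sampling grid in normalized [-1, 1] coordinates (x first).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feat.device),
            torch.linspace(-1, 1, w, device=feat.device), indexing="ij")
        base = torch.stack([xs, ys], dim=-1)             # (h, w, 2)
        samples = []
        for p in range(self.num_points):
            off = offsets[:, p].permute(0, 2, 3, 1)      # (b, h, w, 2)
            grid = base.unsqueeze(0) + off               # shift grid by learned offsets
            samples.append(F.grid_sample(feat, grid, align_corners=True))
        # Aggregate context gathered at the deformed sampling points.
        return self.proj(torch.cat(samples, dim=1))
```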
Zhang Hui, Tian Yonglin, Wang Yutong, Gou Chao, Li Xuan, Wang Fei-Yue
DOI:10.11834/jig.250231
摘要:Driven by the rapid advancement of deep learning and large-scale computing, visual perception systems have achieved remarkable progress in a wide range of domains, including autonomous driving, intelligent transportation, security surveillance, medical diagnostics, industrial inspection, and human–robot interaction. Modern vision algorithms, empowered by massive datasets and increasingly sophisticated neural architectures, are now capable of performing object detection, semantic segmentation, scene understanding, and even causal reasoning with unprecedented accuracy. Despite this rapid growth, the development of visual intelligence still faces several critical bottlenecks. Chief among these challenges are the highly imbalanced data distributions that characterize real-world environments, the scarcity of long-tail or rare-event samples that are essential for robustness, and the substantial human and financial cost associated with large-scale manual annotation. These factors significantly hinder the performance, safety, and generalizability of deep perception systems, especially in complex, dynamic, or safety-critical scenarios. Parallel Images technology, emerging as a novel image generation and modeling methodology grounded in parallel systems theory and the ACP (Artificial systems, Computational experiments, and Parallel execution) framework, offers a promising pathway to address these limitations. The core idea behind Parallel Images is to construct controllable, high-fidelity artificial scene systems that reflect the structure, behavior, physics, and semantics of their real-world counterparts. Within these artificial systems, computational experiments can be conducted at scale, allowing for the controlled generation of diverse visual data that capture variations in illumination, geometry, environmental conditions, sensor characteristics, and task-specific factors. Through the interaction and iterative feedback between virtual and real environments, Parallel Images establish a closed-loop mechanism of “modeling–training–feedback–optimization,” enabling perception models to continuously evolve, validate hypotheses, and improve performance under systematically generated variations. This closed-loop mechanism differentiates Parallel Images from traditional synthetic data generation in several important aspects. First, instead of passively producing static rendered images, Parallel Images emphasize dynamic parallelism, wherein virtual agents, environments, and tasks evolve in sync with real-world processes. Second, the approach integrates multimodal feedback, bridging visual, geometric, physical, and semantic modalities, to ensure consistency and translatability across domains. Third, the framework supports scalable modeling of rare, dangerous, or expensive scenarios that are difficult or impossible to capture in real life, such as near-crash events in autonomous driving, rare diseases in medical imaging, or hazardous industrial operations. These capabilities make Parallel Images a powerful tool for enhancing the robustness, safety, and domain generalization of modern perception systems. This paper provides a comprehensive and systematic review of the theoretical foundations, methodological innovations, and developmental trajectory of Parallel Images technology. We begin by revisiting its roots in parallel intelligence and the ACP paradigm, detailing how artificial systems serve as controlled experimental platforms that complement real-world data collection. 
We then examine recent technical advances encompassing three major research directions aligned with the “artificial scenes – computational experiments – parallel execution” framework: 1) multimodal data–driven virtual scene generation, which employs generative adversarial networks, diffusion models, neural radiance fields, and 3D Gaussian splatting to overcome data scarcity and annotation bottlenecks, enabling the creation of controllable, editable, and semantically consistent synthetic environments; 2) multi-view feature fusion and virtual–real model transfer, aimed at addressing feature discrepancies and semantic misalignment across heterogeneous visual modalities through cross-modal alignment, multi-granularity adaptive transfer, and domain-bridging strategies that enhance generalization and adaptability in hybrid virtual–real environments; 3) parallel reasoning through heterogeneous data and knowledge fusion, which integrates structured information extraction, external knowledge guidance, scene graphs, temporal logic, and large language models to advance perceptual understanding toward semantic-level reasoning and decision-making, thereby supporting continuous optimization and closed-loop evolution in complex scenes. Beyond summarizing technological developments, this paper also situates Parallel Images within the broader context of emerging trends in generative artificial intelligence and foundation models. With the rise of diffusion models, neural radiance fields (NeRF), and large-scale multimodal models, Parallel Images are poised to integrate more deeply with generative simulation pipelines. We discuss how these innovations can strengthen the fidelity, controllability, and adaptability of artificial visual data, potentially enabling new capabilities such as task-conditioned scene synthesis, human–AI co-simulation, interactive data generation, and closed-loop autonomous scenario exploration, providing crucial support for building general visual systems with continuous learning and feedback-driven optimization. Finally, the paper identifies several open challenges and future research directions that are essential for advancing the development of Parallel Image systems. These challenges include achieving high-quality expansion of virtual data, bridging the semantic gap between virtual and real domains, and enabling real-time, tightly coupled virtual–real interaction. Addressing these issues will require advances in intelligent generation models, self-supervised quality evaluation, unified data standards, causality-aware cross-domain alignment, and low-latency virtual–real collaboration supported by next-generation communication and sensing technologies. We argue that solving these challenges will be critical for pushing forward the frontier of synthetic visual intelligence and unlocking the full potential of Parallel Images in real-world applications.