Latest issue

Vol. 30, Issue 4, 2025

      Review

• Research progress and trends on models and structures of cognitive machines

In the field of cognitive-ability research, the authors examine machine cognition models and construction methods, offering new models, structures, and methodologies for designing a new generation of cognitive machines and a new perspective for exploring the cognitive mechanisms of humans and machines.
      Bao Hong, Zheng Ying, Liang Tianjiao
      Vol. 30, Issue 4, Pages: 895-921(2025) DOI: 10.11834/jig.240108
      Research progress and trends on models and structures of cognitive machines
摘要:How can machines possess cognitive abilities akin to humans? Cognitive abilities are measured through intelligence, and human intelligence is an emergent property of cognitive processes. This paper investigates the structure of cognition starting from cognitive models, as structure determines the cognitive functions of machines. This research aims to provide innovative architectural and technological methodologies for the design of new-generation intelligent machines. Answering this question requires starting from cognitive models to study the boundary constraints of human perception of the physical world and the mapping to cognitive space. The structure of the model determines its cognitive functions, and the combination of “model + structure” determines the emergent properties of the machine (system). This work utilizes a comprehensive approach that includes analysis, source tracing, and deduction to provide an in-depth review of the origins, evolutionary path, and emerging trends within the domain of cognitive machine models and their structural development. The first step is revisiting the cross-disciplinary interpretation of physicist Erwin Schrödinger’s “thought experiment model and parallel universe view” since the early 20th century, which initiated a cognitive revolution with the concept that “The organism feeds on negative entropy”. This discovery unveiled the mysteries of life and led to the emergence of new disciplines such as molecular biology and genomics. The computational theory, cybernetics, and information theory established during the same era continue to influence the development of information-physical systems to this day, with the “Turing machine model + von Neumann architecture” laying the foundation for the invention of general-purpose computing machines and the formation of new disciplines such as computer science and technology. Furthermore, Turing’s profound question “Can machines think?” and his “Turing test” for measuring machine intelligence attempted to unravel the mystery of the emergence of intelligence and greatly inspired and influenced the establishment of the artificial intelligence discipline. This paper focuses on analyzing and reviewing the milestone advancements and existing issues of the “deep learning models + convolutional neural networks” and the “large language model + Transformer structures” established over the past two decades. These issues also arise from several structural limitations when artificial intelligence (AI) systems operate on computers with the von Neumann architecture. To address these issues and overcome structural limitations, this work reviews the latest developments in the models and structures proposed by three of the most representative scientists in China and abroad: Yann LeCun’s “world model + self-supervision”, Fei-Fei Li’s “spatial intelligence + behavioral vision”, and Deyi Li’s “most basic element + cognitive structural chain”. In particular, the cognitive physics based on the “four elements” founded by Deyi Li provides a unified theoretical framework for human cognition and machine cognition, constituting the four basic patterns of machine cognition—the cognitive helix structural model and the OOXA structural chain. It indicates that humans have four fundamental cognitive modes: induction, deduction, creation, and discovery.
These modes are formalized as four basic modes of machine cognition: Observe Orient Act (OOA), Observe Orient Decide Act (OODA), Observe Orient Create Act (OOCA), and Observe Orient Hypothesis Act (OOHA). This study discusses the concept of machines relying on negative entropy and its measurement methods and uses our research and application of the cognitive process of unmanned driving machines as an example to provide ideas, theoretical foundations, and architectural frameworks for further research and for the establishment of credible, controllable, and interpretable machine cognition models and structures. Finally, this work provides a prospective overview of future research and development trends in this field. Structure is the cornerstone of machine cognition, and “model + structure” determines the emergence of machines (systems). Through the research methodology and evaluation of “model + structure”, this work provides a research idea and paradigm for exploring the “mechanism of human and machine cognition” and for solving major problems of artificial intelligence such as “how machines cognize”.
关键词:cognitive machine; cognitive physics; cognitive nucleus; model; structure; emergence; negentropy
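The abstract above formalizes four basic modes of machine cognition (OOA, OODA, OOCA, OOHA). As a purely illustrative aid, the short Python sketch below encodes that OOXA structural chain as data; the stage sequences follow the acronyms in the abstract, while the class name and printing are assumptions made for illustration, not the authors' formalism.

```python
from dataclasses import dataclass

# Illustrative encoding of the four OOXA machine-cognition modes named in the
# abstract, paired with the human cognitive patterns they correspond to
# (induction, deduction, creation, discovery). Only the stage sequences come
# from the abstract; the rest is an assumption for illustration.

@dataclass(frozen=True)
class CognitiveMode:
    name: str            # human cognitive pattern the mode corresponds to
    stages: tuple        # ordered OOXA structural chain

OOXA_MODES = [
    CognitiveMode("induction", ("Observe", "Orient", "Act")),                # OOA
    CognitiveMode("deduction", ("Observe", "Orient", "Decide", "Act")),      # OODA
    CognitiveMode("creation",  ("Observe", "Orient", "Create", "Act")),      # OOCA
    CognitiveMode("discovery", ("Observe", "Orient", "Hypothesis", "Act")),  # OOHA
]

if __name__ == "__main__":
    for mode in OOXA_MODES:
        print(f"{mode.name:>9}: " + " -> ".join(mode.stages))
```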
• Single-view 3D object reconstruction based on deep learning: survey

In computer vision, deep learning-based single-view 3D object reconstruction has made notable progress, providing solutions for industrial production, medical diagnostics, and other fields.
      Liu Cao, Cao Ting, Kang Wenxiong, Jiang Zhaohui, Yang Chunhua, Gui Weihua, Liang Xiaojun
      Vol. 30, Issue 4, Pages: 922-952(2025) DOI: 10.11834/jig.240389
      Single-view 3D object reconstruction based on deep learning: survey
      摘要:Single-view 3D object reconstruction seeks to leverage the 2D structure of a single-view image to reconstruct the 3D shape of an object, facilitating subsequent tasks such as 3D object detection, 3D object recognition, and 3D semantic segmentation. Single-view 3D object reconstruction has emerged as a pivotal topic in computer vision, with wide-ranging applications in industrial production, medical diagnostics, virtual reality, and other fields. Traditional single-view 3D object reconstruction methods rely on a combination of geometric templates and geometric assumptions to complete the 3D reconstruction of specific scene objects. However, the traditional methods based on geometric templates are tailored to specific objects, limiting their generality and scalability. Meanwhile, the traditional methods based on geometric assumptions require strong prior conditions for the object, which limits the reconstruction quality of different changing scenes. Current single-view 3D object reconstruction methods based on deep learning have made significant progress in terms of the applicability of reconstructed objects and the robustness of reconstructed models through data-driven approaches. To further understand their current development, this work systematically analyzes and summarizes single-view 3D object reconstruction methods based on deep learning from following aspects: commonly used datasets and evaluation indicators, method classification and improvement innovation, problem challenges, and development trends in single-view 3D object reconstruction.First, this work focuses on the commonly used datasets and evaluation indicators in single-view 3D object reconstruction. These datasets are the foundation of 3D reconstruction methods based on deep learning and can be divided into three categories: red-green-blue-depth (RGBD) datasets, which contain object depth information and are commonly used for algorithm testing and evaluation; synthetic datasets, which contain large-scale rendering object images and 3D shape data and are commonly used for algorithm training and evaluation; and real-scene datasets, which contain a limited number of real object images and 3D shape and are commonly used for algorithm evaluation. Evaluation indicators, mainly including distance evaluation indicators and classification evaluation indicators, can quantitatively demonstrate algorithm performance. Distance evaluation indicators are mainly used to evaluate the shape distance between the reconstructed 3D model and the ground truth 3D model. The smaller the value, the closer the overall shape of the reconstructed 3D model is to the ground truth 3D model. Classification evaluation indicators are mainly used to evaluate the accuracy of the 3D shape classification of each point in the 3D space. The larger the value, the more accurate the reconstructed 3D model. Second, this work analyzes the field of single-view 3D object reconstruction based on deep learning and systematically summarizes the research works related to supervised, unsupervised, and semi-supervised learning single-view 3D object reconstructions. Supervised learning single view 3D object reconstruction methods mainly focus on reconstruction resolution in the early stage. With the improvement of 3D representations, especially the application of implicit 3D representation, the high-resolution reconstruction of object details has become possible. 
Subsequent works have improved and innovated various aspects, such as input image, encoding and decoding, prior knowledge, and general structure, to further solve reconstruction problems, such as unknown perspectives, key details, and generalized shapes. Unsupervised learning single view 3D object reconstruction methods mainly focus on improving the rendering process in the early stage, laying the foundation for unsupervised learning. Subsequent works have improved and innovated from the perspectives of rendering image quantity, image feature attributes, and additional prior knowledge to further solve various problems, such as lighting and background interference. Semi-supervised learning single view 3D object reconstruction methods are mainly divided into 2D and 3D labeled data-based methods. The former proposes a small-sample data training paradigm and a general data training paradigm to overcome challenges such as difficulty in 3D annotation, deviation, and inconsistency between annotation data and test data. The latter enhances the robust generalization performance of reconstruction through semantic and perspective information. The above three learning methods have their own advantages and disadvantages in terms of technical frameworks. Supervised learning methods utilize 3D labeled data for learning and reconstruction, resulting in high reconstruction quality; however, they are limited by the high cost of data annotation. Unsupervised learning methods can directly use 2D images to learn and reconstruct, effectively reducing training costs; however, the reconstruction quality is not stable enough. Semi-supervised learning methods propose a paradigm for joint learning of labeled and unlabeled data to address the problems of high data annotation cost and unstable reconstruction quality, combining the advantages of the two methods mentioned above. These methods have been widely studied. This work summarizes the unresolved challenges from the perspectives of data, training paradigms, evaluation metrics, and reconstruction performance and proposes future development trends and key technologies of single-view 3D object reconstruction methods based on deep learning. For the difficulty in collecting data from wild objects, studies must focus on how to use internet object image data to build datasets and develop efficient interactive annotation tools to reduce data collection costs and annotation costs. For the insufficient learning of local object structures, studies must focus on the training paradigm guided by prior knowledge of local object structures to enhance the accuracy and reliability of single-view 3D object reconstruction. For the limited reconstruction performance of few-shot 3D annotated data, studies must design the optimal combination of different tasks and develop multitask learning methods to obtain effective semantic information of objects and supplement effective object reconstruction supervision information. For the neglected local structure assessment of existing evaluation indicators, studies must design reconstruction evaluation indicators that focus on the reconstruction results of local structures to further guide high-precision reconstruction optimization. For the problems of long training optimization cycles and limited object categories faced by existing methods, studies must focus on 3D foundation models that achieve universal category object shape reconstruction to promote the development and application of single-view 3D object reconstruction methods.  
关键词:deep learning; 3D object reconstruction; single-view; supervised learning; unsupervised learning; semi-supervised learning
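The abstract above describes distance-based and classification-based evaluation indicators for reconstructed 3D models without naming specific formulas. As a hedged illustration, the Python sketch below computes two commonly used representatives, the symmetric Chamfer distance between point clouds and the voxel IoU; treating these as the intended indicators is an assumption, not a statement of the surveyed papers' exact protocols.

```python
import numpy as np

def chamfer_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3).
    Lower is better: the reconstructed shape lies closer to the ground truth."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (N, M) pairwise distances
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def voxel_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two boolean occupancy grids of equal shape.
    Higher is better: more points in 3D space are classified correctly."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts_gt = rng.normal(size=(1024, 3))
    pts_pred = pts_gt + rng.normal(scale=0.01, size=(1024, 3))   # slightly perturbed shape
    vox = rng.random((32, 32, 32)) > 0.5
    print("Chamfer:", chamfer_distance(pts_pred, pts_gt))
    print("IoU (identical grids):", voxel_iou(vox, vox))
```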
• This review examines heart rate variability (HRV) estimation from facial video, highlighting its noninvasive nature and suitability for real-time monitoring in health surveillance and disease diagnosis. Owing to their strong pattern-recognition capability, deep learning techniques can effectively extract complex visual features and handle nonlinear physiological signals, showing clear advantages in improving estimation accuracy. The review aims to provide a comprehensive perspective on facial video-based HRV estimation and to serve as a reference for technical innovation and application in academia and industry.
      Zhou Caiying, Zhan Xinlong, Wei Yuanwang, Zhang Xianchao, Li Yonggang, Wang Chaochao, Ye Xiaolang
      Vol. 30, Issue 4, Pages: 953-976(2025) DOI: 10.11834/jig.240314
      Review of heart rate variability parameter estimation methods in facial video
      摘要:Heart rate variability (HRV) analysis has emerged as a powerful tool in health monitoring and disease diagnosis, offering valuable insights into the autonomic nervous system’s regulation of the cardiovascular system. Estimating HRV from facial video is an innovative approach that combines convenience and non-invasiveness, which holds great promise for advancing personalized healthcare. This method utilizes facial video to capture subtle changes in skin color caused by blood flow variations, allowing for remote and continuous monitoring of heart rate dynamics. HRV reflects the variations in the time intervals, known as RR intervals, between successive heartbeats. It serves as a non-invasive marker of cardiac autonomic function and provides a dynamic assessment of the balance between the sympathetic and parasympathetic branches of the autonomic nervous system. The significance of HRV lies in its ability to reveal underlying physiological conditions that may not be immediately apparent through standard vital sign measurements. For instance, a reduced HRV can indicate stress, fatigue, or the early onset of cardiovascular disease, making it a valuable metric for both preventive and therapeutic health strategies. The key parameters in HRV analysis include both time-domain and frequency-domain metrics. Time-domain measures, such as the standard deviation of NN intervals and the root mean square of successive differences, provide insights into overall heart rate dynamics and short-term variability. Frequency-domain measures, such as low-frequency and high-frequency components and their ratio, help evaluate the balance between sympathetic and parasympathetic activity. These parameters are vital for assessing individual health, particularly in relation to cardiovascular conditions, stress levels, and autonomic nervous system disorders. In healthcare, HRV has a wide range of applications across various domains. In disease prevention, HRV analysis can detect early signs of cardiovascular issues by identifying deviations from normal HRV patterns, potentially indicating autonomic dysfunction or underlying heart conditions. For example, individuals with lower HRV may be at a higher risk of sudden cardiac death or myocardial infarction. Continuous monitoring of HRV can therefore serve as a predictive marker for these events, enabling earlier interventions that could save lives. During rehabilitation, HRV monitoring assists in tracking recovery progress and adjusting treatment plans. Changes in HRV can guide modifications in exercise regimens, physiotherapy, or medication dosages, offering a more personalized approach to patient care and optimizing recovery outcomes. HRV also plays a crucial role in mental health, emotional management, and stress monitoring. Analyzing HRV allows healthcare providers to better understand a patient’s stress levels, emotional state, and overall cardiovascular health, enabling more tailored and effective treatment strategies. Facial video acquisition and data preprocessing are critical steps in HRV estimation. Obtaining high-quality RGB image data requires video capture devices with appropriate resolution and frame rate. Stable and consistent video capture conditions are essential to ensure accurate HRV estimation. Technical requirements for video frame extraction include precise synchronization and alignment of frames to maintain consistency across analyses. 
Data cleaning and normalization processes involve removing artifacts, correcting for illumination variations, and standardizing the data for analysis. Effective preprocessing ensures that the facial video accurately reflects the physiological signals needed for HRV estimation. Various methods are used for HRV parameter estimation. Traditional signal processing techniques, such as blind source separation and skin model-based methods, have been employed for years. Blind source separation aims to isolate the desired physiological signal from noise and interference, while skin model-based methods leverage physiological models to estimate heart rate from subtle changes in facial color due to blood flow variations. Frequency-domain analysis decomposes the HRV signal into its frequency components to assess autonomic function, while time-frequency analysis provides a comprehensive view of HRV dynamics by combining time and frequency information. Emerging deep learning algorithms have shown great promise in HRV estimation from facial videos. Supervised convolutional neural networks can learn complex features from labeled data, enhancing the ability to extract relevant information from facial videos. Recurrent neural networks are effective for modeling temporal dependencies in sequential data, which is particularly useful for HRV estimation where time-series analysis is critical. Transformer models, known for their capacity to handle long-range dependencies and capture intricate patterns, offer further advantages in this domain. Although less commonly used for HRV estimation, unsupervised generative adversarial networks provide potential for generating synthetic data to augment training datasets, improving model robustness and reducing the reliance on large-scale labeled datasets. The performance of traditional and deep learning methods varies across different application scenarios. Traditional methods often perform well in controlled environments but may struggle with complex scenes or dynamic changes, such as varying lighting conditions or head movements. On the other hand, deep learning methods, while more adept at handling complex and noisy data, require large amounts of labeled training data and significant computational resources. This trade-off highlights the strengths and limitations of each approach and underscores the importance of selecting appropriate methods based on specific application needs. Facial video-based HRV estimation has several practical applications. In health assessment, continuous HRV monitoring can provide real-time insights into a patient’s health status, enabling timely interventions and personalized treatment adjustments. Emotional recognition involves analyzing facial expressions and HRV to understand emotional states, which can be particularly useful in mental health diagnostics and therapy. Mental stress evaluation uses HRV data to identify individuals at risk of stress-related conditions, which is crucial for preventing burnout and promoting workplace well-being. Fatigue detection is vital for ensuring safety in various professional settings, such as aviation, transportation, and healthcare, where fatigue-related errors could have serious consequences. Early warning of cardiovascular diseases can be achieved through HRV monitoring, providing early alerts for potential health issues and enabling preventative measures. Despite the progress made in facial video-based HRV estimation, there are still challenges to overcome. 
Subject head movements and different lighting conditions can affect estimation accuracy, making it essential to develop robust algorithms that can handle these variations. Model selection and training strategies need to be optimized to improve performance in diverse real-world scenarios. Enhancing the real-time performance and robustness of these algorithms is crucial for their practical application, particularly in wearable and mobile health monitoring devices. Reducing dependency on large-scale labeled datasets through semi-supervised or unsupervised learning approaches could make these technologies more accessible and scalable, expanding their use in both clinical and consumer health settings. In conclusion, facial video-based HRV estimation technology holds great promise for health monitoring and disease diagnosis. By addressing current challenges and exploring future research directions, this technology can be further refined and integrated into everyday health practices. The ability to estimate HRV non-invasively from facial video has the potential to revolutionize the field of telemedicine and personalized health, offering a convenient, cost-effective, and accessible tool for continuous health monitoring. As research progresses, this innovative approach may become a standard component of modern healthcare, providing valuable insights into individual health status and enhancing overall quality of life.  
关键词:heart rate variability (HRV); facial video; physiological monitoring; signal processing; deep learning
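The abstract names the standard deviation of NN intervals (SDNN), the root mean square of successive differences (RMSSD), and the LF/HF ratio as key HRV parameters. The sketch below shows one conventional way to compute them from an RR-interval series with NumPy/SciPy; the resampling rate, band limits, and Welch settings are common defaults assumed here, not the pipeline of any particular surveyed method.

```python
import numpy as np
from scipy.signal import welch

def hrv_time_domain(rr_ms: np.ndarray) -> dict:
    """SDNN and RMSSD from RR intervals given in milliseconds."""
    diffs = np.diff(rr_ms)
    return {
        "SDNN": float(np.std(rr_ms, ddof=1)),          # overall variability
        "RMSSD": float(np.sqrt(np.mean(diffs ** 2))),  # short-term variability
    }

def hrv_frequency_domain(rr_ms: np.ndarray, fs: float = 4.0) -> dict:
    """LF/HF ratio from RR intervals: resample to an evenly spaced series, then
    estimate the PSD with Welch's method. Band limits follow the usual
    0.04-0.15 Hz (LF) and 0.15-0.4 Hz (HF) convention; simplified sketch only."""
    t = np.cumsum(rr_ms) / 1000.0                      # beat times in seconds
    t_even = np.arange(t[0], t[-1], 1.0 / fs)
    rr_even = np.interp(t_even, t, rr_ms)
    f, psd = welch(rr_even - rr_even.mean(), fs=fs, nperseg=min(256, len(rr_even)))
    lf_band = (f >= 0.04) & (f < 0.15)
    hf_band = (f >= 0.15) & (f < 0.40)
    lf = np.trapz(psd[lf_band], f[lf_band])
    hf = np.trapz(psd[hf_band], f[hf_band])
    return {"LF": float(lf), "HF": float(hf),
            "LF/HF": float(lf / hf) if hf > 0 else float("inf")}

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    rr = 800 + 50 * rng.standard_normal(300)           # synthetic RR series (~75 bpm)
    print(hrv_time_domain(rr))
    print(hrv_frequency_domain(rr))
```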

      Transformer Cross-Modal Perception

• A recent study reports that the YOLO-SF-TV model performs well in detecting the third ventricle in transcranial ultrasound, with a marked improvement in accuracy, providing a new tool for the early diagnosis of Parkinson's disease.
      Wan Ao, Gao Hongling, Zhou Xiao, Xue Zheng, Mou Xingang
      Vol. 30, Issue 4, Pages: 977-988(2025) DOI: 10.11834/jig.240293
YOLO-SF-TV: a detection model for the third ventricle in transcranial ultrasound images
摘要:Objective: Cognitive impairment is the most dangerous nonmotor symptom of Parkinson’s disease (PD) and affects approximately 25%–30% of patients every year. This condition seriously affects their quality of life and increases the risk of death. However, the accuracy of clinical diagnosis of PD cognitive impairment is still limited, and the proportion of patients with PD diagnosed before the age of 50 years is less than 4%. Some scholars have proposed that transcranial ultrasound imaging of the third ventricle can assist doctors in the diagnosis of PD cognitive impairment. As a rapid, noninvasive, and low-cost detection method, transcranial ultrasound imaging has been gradually applied to the diagnosis of cognitive dysfunction in patients with PD, helping doctors find the disease in time and treat it as soon as possible. Owing to the low signal-to-noise ratio of transcranial ultrasound images and the poor imaging quality, complexity, and similarity of target tissues, specialized physicians rely on manual detection. However, this process is time consuming and labor intensive and may produce variable detection results owing to subjective factors related to the operator. Deep learning has been increasingly integrated into the medical field; in particular, deep learning-based computer-aided diagnosis (CAD) systems have been used to diagnose PD with good results. In this work, a YOLO-SF-TV network based on the Swin Transformer and multiscale feature fusion is proposed for third ventricle detection in transcranial ultrasound images to assist physicians in early diagnosis. Method: A total of 2 400 transcranial ultrasound images of the third ventricle and the corresponding labels are acquired to form a dataset, and the third ventricle region in each image is manually labeled by a professional. The YOLO-SF-TV network consists of backbone, neck, and head components, which extract image features, fuse image features, and detect and classify targets, respectively. This work uses an algorithm based on YOLOv8 and the window-attention-based Swin Transformer to improve the backbone network and strengthen its ability to model global information. SPP-FCM, a spatial pyramid pooling module, is connected to the Swin Transformer network to enhance the network’s sensitivity and integrate multiscale information. The SPP-FCM structure combines the characteristics of the CSPC structure in YOLOv7 while introducing a multihead attention mechanism (MHAM) in the multilevel pooling part, which reduces the sensitivity of the model to noise and outliers during the extraction of multidimensional features. In the multiscale feature fusion PAFPN part of the network, the PAFPN-DM module is proposed by combining depthwise separable convolution (DCOW) with the multihead attention mechanism added to the backbone feature output layer, improving the network’s ability to understand the important global and local information in feature maps of different scales. At the same time, traditional convolution is replaced with a depthwise separable convolution module, which improves the sensitivity of the network to different channels by convolving each channel individually, ensuring model accuracy while reducing the training parameters and difficulty and enhancing the generalization ability of the model. Result: Fivefold cross-validation was performed on the dataset to validate the performance of the different networks.
The dataset was randomly divided into five equal folds, of which four at a time were used as the training set and the remaining one as the test set. The training input images were resized to 640 × 640 pixels, and the training set was expanded using data augmentation methods such as random flipping, random-angle rotation, and Mosaic. The initial learning rate was set to 0.001 and decayed to 0.1 times its value every 50 epochs, with a momentum of 0.9, a weight decay coefficient of 0.000 5, and a batch size of 8. The experiments were run on a GeForce RTX 3090 GPU under the Ubuntu 20.04 operating system and the PyTorch framework, and mean average precision (mAP) was used to measure detection performance. Experimental results show that the YOLO-SF-TV algorithm achieves 98.69% detection accuracy on transcranial ultrasound third ventricle targets, an improvement of 2.12% over the YOLOv8 model, so its detection accuracy surpasses that of typical models. Conclusion: The proposed YOLO-SF-TV model performs excellently in the detection of the third ventricle in transcranial ultrasound images. The SPP-FCM and PAFPN-DM modules enhance the model’s detection capability, generalization, and robustness. The produced dataset helps promote research on third ventricle detection in transcranial ultrasound images.
关键词:transcranial ultrasound imaging; computer-aided diagnosis (CAD); third ventricle; deep learning; YOLOv8; Swin Transformer
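The abstract above reports the training protocol (fivefold cross-validation, 640 × 640 inputs, initial learning rate 0.001 decayed by 0.1× every 50 epochs, momentum 0.9, weight decay 0.000 5, batch size 8). The PyTorch sketch below shows how such a schedule could be wired up; the model factory, dataset, and loss interface are placeholders assumed for illustration, not the released YOLO-SF-TV code.

```python
import torch
from sklearn.model_selection import KFold

# Sketch of the reported schedule: fivefold cross-validation, SGD with lr 1e-3
# decayed by 0.1x every 50 epochs, momentum 0.9, weight decay 5e-4, batch size 8.
# `model(images, targets)` is assumed to return the training loss; the dataset is
# assumed to yield (image, target) pairs. Both are placeholders.

def train_fold(model, train_set, epochs: int = 150):
    loader = torch.utils.data.DataLoader(train_set, batch_size=8, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss = model(images, targets)   # placeholder detector loss interface
            loss.backward()
            optimizer.step()
        scheduler.step()                    # decay lr by 0.1 every 50 epochs

def cross_validate(model_factory, dataset, n_splits: int = 5):
    indices = list(range(len(dataset)))
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=0).split(indices):
        model = model_factory()
        train_fold(model, torch.utils.data.Subset(dataset, train_idx))
        # mAP would be computed here on torch.utils.data.Subset(dataset, test_idx)
```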
• Dual-branch U-Net with hybrid attention for hyperspectral pansharpening

In hyperspectral pansharpening, the researchers propose DUNet-HA, a dual-branch U-Net based on a hybrid attention mechanism that effectively fuses spatial-spectral information and markedly improves the quality of the pansharpened images.
      Yang Yong, Wang Xiaozheng, Liu Xuan, Huang Shuying, Liu Ziyang, Wang Shuzhao
      Vol. 30, Issue 4, Pages: 989-1002(2025) DOI: 10.11834/jig.240410
      Dual-branch U-Net with hybrid attention for hyperspectral pansharpening
      摘要:ObjectiveHyperspectral (HS) images have rich spectral information and are obtained by sampling hundreds of contiguous narrow spectral bands using spectral imaging systems; however, they typically have low spatial resolution due to the low energy of narrow spectral bands. Single-band panchromatic (PAN) images from PAN imaging systems provide rich spatial information but have low spectral resolution. In some remote sensing applications that require high-resolution hyperspectral (HRHS) images with high spectral and spatial resolution, neither PAN nor HS images can meet the requirements. HS pansharpening aims to fuse the spatial information from PAN images with the spectral information from HS images to obtain HRHS images. This technology has received considerable attention in the field of remote sensing and is of great significance in various remote sensing tasks, such as military surveillance, environmental monitoring, object identification, and classification. HS pansharpening methods are mainly divided into two categories: traditional methods and deep learning (DL)-based methods. The former can be classified into component substitution-based methods, multiresolution analysis-based methods, Bayesian-based methods, and model-based methods. Although these traditional methods are easy to implement and physically interpretable, they often suffer from spatial and spectral distortion issues due to inappropriate prior assumptions and imprecise manual feature extraction. Owing to their powerful feature learning capability, DL-based methods have been widely applied to HS pansharpening tasks. Although these methods have better performance than traditional methods, spectral and spatial distortions still exist in the fused images because of failure to handle spectral and spatial features differently and the complex mapping relationships between multichannel images. With the introduction of the transformer architecture that can learn the global correlation features of images, some researchers have attempted to improve the performance of HS pansharpening methods by establishing relationships between two modal features using this architecture. However, the application of the Transformer architecture is limited by its high computational cost and low parameter efficiency. To effectively fuse PAN and HS images, this work proposes a dual-branch U-Net network based on hybrid attention (DUNet-HA) for HS pansharpening to achieve multiscale feature fusion. Spatial attention branches, spectral attention branches, and dual-cross attention module branches are constructed at each scale. These branches are used to enhance the spatial and spectral features of PAN and HS images and to achieve complementary cross-modal features. The dual-cross attention module is designed to avoid the complex query matrix computation in the Transformer.MethodThe proposed DUNet-HA includes two U-Net branches, one for PAN images and the other for upsampled HS images, to extract and complement texture and spectral features. At each scale, a hybrid attention module (HAM) is constructed to encode features from both types of images. Each HAM comprises a spatial attention module (Spa-A), a spectral attention module (Spe-A), and a dual-cross attention module (DCAM). The Spa-A and Spe-A enhance the texture and spectral features of PAN and HS images, respectively, and the DCAM corrects and complements these features. The enhanced and corrected features are integrated to obtain the encoded features at each scale. 
The decoder primarily facilitates feature fusion and reconstruction. In addition, we use a DCAM to capture global contextual information and directly integrate encoded features, decoded features, and corrected complementary features at the same scale to handle high-level spatial and spectral features. The proposed DCAM is a novel cross-attention structure that uses query-independent matrix computation instead of the attention computation in Transformer architecture, reducing computational cost. DCAM maps the cross-feature space of PAN and HS images to guide feature interaction for correction and supplementation.ResultExtensive experiments are conducted on the following three widely used HS datasets to validate the effectiveness of the proposed DUNet-HA: Pavia center, Botswana, and Chikusei. We compare DUNet-HA with several state-of-the-art (SOTA) methods, including five traditional methods (CNMF, GFPCA, SFIM, GSA, and MTF_GLP_HPM) and six DL-based methods (SSFCNN, HyperPNN, DHP-DARN, DIP-Hyperkite, Hyper-DSNet, and HyperRefiner). To evaluate the performance of all methods, we use five objective indicators, namely, spectral cross correlation (SCC), spectral angle mapper (SAM), root mean square error (RMSE), relative dimensionless global error of synthesis (ERGAS), and peak signal-to-noise ratio (PSNR). Experimental results with a scale factor of 4 demonstrate that the proposed method outperforms other SOTA methods in objective results and visual effects. In particular, the PSNR, SAM, and ERGAS of the proposed method on the Pavia center dataset are improved by 1.10 dB, 0.40, and 0.28, respectively, compared with those of the second-best method, HyperRefiner. Its objective results on the other two datasets also surpass those of HyperRefiner. Visual results indicate that our proposed method is superior in recovering fine-grained spatial textures and spectral details. Ablation studies further demonstrate that the DCAM structure significantly improves the fusion process.ConclusionThis work proposes a dual-branch interactive U-Net network named DUNet-HA for HS pansharpening. This network extracts and reconstructs spatial and spectral information from PAN and HS images through a parallel dual U-Net structure to achieve accurate fusion results. At each scale of the network’s encoder, a HAM is constructed to enhance the spatial features of PAN images and the spectral features of HS images using spatial attention and spectral attention, respectively. A DCAM is utilized to complement these features, reducing the modality differences between PAN and HS image features and enabling their mutual supplementation for feature interaction guidance. Compared with the classic hybrid Transformer attention structure, the DCAM improves network performance while reducing the number of parameters and computational cost. Extensive experimental results on three widely used HS datasets demonstrate that the proposed DUNet-HA outperforms several SOTA methods in quantitative and qualitative evaluations.  
关键词:hyperspectral (HS) pansharpening; modality differences; hybrid attention module (HAM); dual-cross attention module (DCAM); Transformer; spatial-spectral dependency relationship
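Among the objective indicators listed in the abstract, SAM and PSNR have compact standard definitions. The sketch below computes them for a fused hyperspectral cube, assuming the conventional formulas (mean spectral angle in degrees; PSNR averaged over bands); the paper's exact evaluation code may differ.

```python
import numpy as np

def sam(pred: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    """Mean spectral angle (degrees) between fused and reference cubes of shape (H, W, B)."""
    p = pred.reshape(-1, pred.shape[-1]).astype(np.float64)
    r = ref.reshape(-1, ref.shape[-1]).astype(np.float64)
    cos = (p * r).sum(axis=1) / (np.linalg.norm(p, axis=1) * np.linalg.norm(r, axis=1) + eps)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

def psnr(pred: np.ndarray, ref: np.ndarray, peak: float = 1.0) -> float:
    """Band-wise PSNR (dB), averaged over bands, for cubes normalized to [0, peak]."""
    mse = np.mean((pred.astype(np.float64) - ref.astype(np.float64)) ** 2, axis=(0, 1))
    return float(np.mean(10.0 * np.log10(peak ** 2 / np.maximum(mse, 1e-12))))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.random((64, 64, 102))                       # e.g., a Pavia-like cube with 102 bands
    fused = np.clip(ref + 0.01 * rng.standard_normal(ref.shape), 0, 1)
    print("SAM (deg):", sam(fused, ref), "PSNR (dB):", psnr(fused, ref))
```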
• In remote sensing scene classification, the authors propose a global-local feature coupling network that effectively improves the model's feature learning ability and relieves the computational burden of the self-attention mechanism.
      Wang Junjie, Li Wei, Zhang Mengmeng, Gao Yunhao, Zhao Boyu
      Vol. 30, Issue 4, Pages: 1003-1016(2025) DOI: 10.11834/jig.240228
      Global-local feature coupling network for remote sensing scene classification
      摘要:ObjectiveConvolutional neural networks (CNNs) have received significant attention in remote sensing scene classification due to their powerful feature induction and learning capabilities; however, their local induction mechanism hinders the acquisition of global dependencies and limits the model performance. Visual Transformers (ViTs) have gained considerable popularity in various visual tasks including remote sensing image processing. The core of ViTs lies in their self-attention mechanism, which enables the establishment of global dependencies and alleviates the limitations of CNN-based algorithms. However, this mechanism introduces high computational costs. Calculating the interactions between key-value pairs requires computations across all spatial locations, leading to huge computational burden and heavy memory footprint. Furthermore, the self-attention mechanism focuses on modeling global information while ignoring local detailed feature. To solve the above problems, this work proposes a global-local feature coupling network for remote sensing scene classification.MethodThe overall network architecture of the proposed global-local feature coupling network consists of multiple convolutional layers and dual-channel coupling modules, which include a ViT branch based on dual-grained attention and a branch with depth-wise separable convolution. Feature fusion is achieved using the proposed adaptive coupling module, facilitating an effective combination of global and local features and thereby enhancing the model’s capability to understand remote sensing scene images. On the one hand, a dual-grained attention is proposed to dynamically perceive data content and achieve flexible computation allocation to alleviate the huge computational burden caused by self-attention mechanisms. This dual-grained attention enables each query to focus on a small subset of key-value pairs that are semantically most relevant. Less relevant key-value pairs are initially filtered out at a coarse-grained region level so that the most relevant key-value pairs can be identified to efficiently achieve global attention. This step is accomplished by constructing a regional correlation graph and pruning it to retain only the top-k regions with the highest correlation. Each region only needs to focus on its top-k most relevant regions. Once the attention regions are determined, the next step involves collecting fine-grained key/value tokens to achieve token-to-token attention, thereby realizing a dynamic and query-aware sparse attention. For each query, the irrelevant key-value pairs are initially filtered out based on a coarse-grained region level, and fine-grained token-to-token attention is then employed within the set of retained candidate regions. On the other hand, an adaptive coupling module is utilized to combine the branches of CNN and ViT for the integration of global and local features. This module consists of two coupling operations: spatial coupling and channel coupling, which take the outputs of the two branches of ViT and depth-wise separable convolution as input and adaptively reweight the features from global and local feature dimensions. At this point, the global and local information from the scene image can be aggregated within the same module, achieving a comprehensive fusion.ResultExperiments are conducted on the UC merced land-use dataset (UCM), aerial image dataset (AID), and Northwestern Polytechnical University remote sensing image scene classification dataset (NWPU-RESISC4). 
The proposed method is compared with state-of-the-art CNN-based and ViT-based methods to demonstrate its superiority. At different training ratios, the proposed method achieves the best classification results with an accuracy of 99.71% ± 0.20%, 94.75% ± 0.09%, 97.05% ± 0.12%, 92.11% ± 0.20%, and 94.10% ± 0.17%. Ablation experiments are also performed on the three datasets to intuitively demonstrate the positive effect of the proposed two modules on the experimental results. The dual-grained attention and adaptive coupling module alleviate the model’s calculation pressure and improve its classification performance.ConclusionA novel global-local feature coupling network is proposed for remote sensing scene classification. First, a new dynamic dual-grained attention that utilizes sparsity to save computation while involving only GPU-friendly dense matrix multiplication is proposed to address the computational cost issue associated with the conventional attention mechanism in ViTs. Furthermore, an adaptive coupling module is designed to facilitate the mixing and interaction of information from two branches to comprehensively integrate global and local detailed features, thus significantly enhancing the representational capabilities of the extracted features. Extensive experimental results on the three datasets have demonstrated the effectiveness of the global-local feature coupling network.  
关键词:scene classification; remote sensing image; global and local features; coupling module; attention mechanism
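The dual-grained attention described in the abstract above first prunes key-value regions at a coarse level, keeping only the top-k most relevant regions per query region, and then applies fine token-to-token attention within the retained regions. The single-head PyTorch sketch below illustrates that routing idea under simplifying assumptions (1D token sequence, mean-pooled region descriptors); it is not the authors' implementation.

```python
import torch

def dual_grained_attention(q, k, v, region_size: int = 16, topk: int = 4):
    """Sketch of query-aware sparse attention: coarse region-level affinities keep
    only the top-k key/value regions per query region, then fine token-to-token
    attention is computed inside the retained regions. q, k, v: (B, N, C) with N
    divisible by region_size. Simplified single-head version, for illustration."""
    B, N, C = q.shape
    S, R = region_size, N // region_size
    qr, kr, vr = (t.view(B, R, S, C) for t in (q, k, v))

    # Coarse grain: region descriptors and a region-to-region affinity graph.
    q_region = qr.mean(dim=2)                        # (B, R, C)
    k_region = kr.mean(dim=2)                        # (B, R, C)
    affinity = q_region @ k_region.transpose(1, 2)   # (B, R, R)
    idx = affinity.topk(topk, dim=-1).indices        # (B, R, topk) retained regions

    # Gather tokens of the retained key/value regions for every query region.
    idx_exp = idx[..., None, None].expand(B, R, topk, S, C)
    k_sel = torch.gather(kr.unsqueeze(1).expand(B, R, R, S, C), 2, idx_exp).reshape(B, R, topk * S, C)
    v_sel = torch.gather(vr.unsqueeze(1).expand(B, R, R, S, C), 2, idx_exp).reshape(B, R, topk * S, C)

    # Fine grain: token-to-token attention restricted to the retained regions.
    attn = torch.softmax(qr @ k_sel.transpose(-1, -2) / C ** 0.5, dim=-1)  # (B, R, S, topk*S)
    return (attn @ v_sel).reshape(B, N, C)

if __name__ == "__main__":
    x = torch.randn(2, 256, 64)
    print(dual_grained_attention(x, x, x).shape)     # torch.Size([2, 256, 64])
```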
• A recent study proposes a hyperspectral image change detection method based on hybrid attention and a bidirectional-gated network, which effectively improves detection accuracy and clearly outperforms mainstream methods such as BiT and CBANet.
      Li Xiangtan, Gao Feng, Sun Yue, Dong Junyu
      Vol. 30, Issue 4, Pages: 1017-1026(2025) DOI: 10.11834/jig.240360
      Hybrid attention and bidirectional-gated network for hyperspectral image change detection
      摘要:ObjectiveHyperspectral images (HSIs) provide rich spectral and spatial information essential for a wide range of applications, including disaster assessment, resource management, and land surveys. The rich data content of HSIs allows for a detailed analysis of various materials and surfaces, making them invaluable in monitoring environmental changes, assessing natural disasters, and managing agricultural resources. However, the application of HSIs in change detection tasks is often hindered by several challenges, primarily due to the presence of various types of noise. HSIs are particularly susceptible to Gaussian noise, which is typically introduced during image acquisition due to sensor limitations and environmental conditions. This type of noise can obscure the spectral signatures of materials, making it difficult to accurately detect changes over time. Striping noise, which occurs due to inconsistencies in the sensor response across different spectral bands, further complicates the interpretation of HSIs. These noises can significantly degrade the quality of the data, leading to less reliable change detection results. Therefore, advanced methods that can effectively mitigate the effect of noise and enhance the accuracy of change detection in HSIs are urgently needed. Traditional methods often fall short in addressing the intricate balance between preserving local details and capturing global context, both of which are crucial for accurate change detection. This task is particularly challenging in complex and variable environments where changes can be subtle and distributed across different spatial and spectral dimensions.MethodTo address these issues, this work proposes a novel HSI change detection method based on hybrid attention and bidirectional gated networks. The proposed method integrates local and global features by enhancing the self-attention mechanism and feedforward neural network in the transformer architecture. In particular, a hybrid attention module (HAM) is introduced, which employs parallel structures of convolutional neural networks (CNNs) and multi-layer perceptron (MLP) layers with gating (gMLPs) to extract local features and global contextual information, respectively. CNNs are effective at identifying fine details and local patterns in the spatial domain, which are crucial for detecting subtle changes in HSIs. By using convolutional layers, the HAM can efficiently extract and represent local information. gMLPs with gating mechanisms are adept at modeling long-range dependencies and global interactions within the data, allowing the HAM to balance the detailed local information extracted by the CNNs with a broad, global perspective to ensure that the overall feature representation is comprehensive and robust against noise. This dual approach effectively balances the extraction of fine local details and broad global context, thereby reducing the impact of noise and enhancing feature representation. A bidirectional-gated feed-forward network (BGFN) is also constructed to further improve feature extraction and integration. The BGFN is designed to enhance feature extraction and integration across channel and spatial dimensions, providing a holistic understanding of the hyperspectral data. It leverages a bidirectional gating mechanism to selectively emphasize relevant features while suppressing irrelevant ones. 
This selective emphasis is crucial for dealing with the noise inherent in HSIs because it allows the network to focus on the most informative and significant features, thereby improving the accuracy of change detection. By enhancing the interactions across channel and spatial dimensions, the BGFN ensures that the extracted features are well-integrated and representative of the underlying data. This comprehensive fusion of local and global information is the key to the deep integration of multitemporal HSI features for accurate and reliable change detection.ResultExtensive experiments were conducted on three hyperspectral datasets: Framland, Hermiston, and River datasets. The proposed method was compared with six mainstream methods, including BiT and CBANet, to evaluate its performance in terms of accuracy and Kappa coefficient. On the River dataset, HBFormer achieves a reduction in false positive (FP), false negative (FN), and overall error (OE), an improvement of 0.13% in accuracy, and an improvement of 0.77% in Kappa coefficient compared with CBANet. On the Farmland dataset, HBFormer achieves the lowest FP, a 0.34% improvement in accuracy, and a 2.02% improvement in Kappa coefficient over BiT. On the Hermiston dataset, HBFormer achieves the lowest FP and OE, with 1% and 2.08% improvements in accuracy and Kappa coefficient over CBANet, respectively. Ablation experiments were also conducted to assess the contribution of each component of the proposed method. The results demonstrated that the hybrid attention module and the bidirectional gated network play crucial roles in enhancing change detection accuracy by effectively integrating local and global information. The proposed method consistently outperforms the mainstream methods across all datasets, showcasing its superior capability in handling noise and providing accurate change detection. These results underline the importance of integrating local and global features and highlight the robustness of the hybrid attention and bidirectional gated network approach.ConclusionThe comprehensive experiments on three hyperspectral datasets validate the efficacy of the proposed method in change detection tasks. The hybrid attention and bidirectional gated network approach enhances accuracy and robustness and offers a scalable solution for various HIS analysis applications. The overarching goal of this research is to enhance the performance of change detection in HSIs, making it accurate and reliable even under complex and variable conditions. The proposed method’s ability to effectively fuse local and global features results in improved change detection accuracy and resilience to noise, making it a valuable tool for remote sensing tasks. Its significant performance gains over current mainstream methods such as BiT and CBANet underscore the method’s potential for practical deployment in real-world scenarios. This study opens up avenues to explore the application of this method across diverse datasets and environments. Enhancing the generalization capabilities of the proposed method could lead to even broader applicability and stronger support for real-world remote sensing tasks. The proposed HIS change detection method based on hybrid attention and bidirectional gated network presents a significant advancement in the field. It addresses the challenges posed by noise and the need for robust feature integration, providing a reliable and effective solution for complex and variable environments. 
Its promising results and potential for further improvement make this method a valuable contribution to HSI analysis and remote sensing.
关键词:change detection; hyperspectral image (HSI); remote sensing; bidirectional attention; Transformer
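The results reported above are given in terms of accuracy, the Kappa coefficient, and FP/FN/OE counts. The sketch below computes these from a binary change map using the standard confusion-matrix definitions; it is an independent illustration, not the authors' evaluation code.

```python
import numpy as np

def change_detection_scores(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Overall accuracy, Cohen's kappa, and FP/FN/OE counts for a binary change
    map (True/1 = changed pixel). Standard confusion-matrix definitions assumed."""
    pred, gt = pred.astype(bool).ravel(), gt.astype(bool).ravel()
    tp = int(np.sum(pred & gt))
    tn = int(np.sum(~pred & ~gt))
    fp = int(np.sum(pred & ~gt))
    fn = int(np.sum(~pred & gt))
    n = tp + tn + fp + fn
    oa = (tp + tn) / n
    # Expected agreement by chance, from the marginal distributions.
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (oa - pe) / (1 - pe) if pe < 1 else 1.0
    return {"OA": oa, "Kappa": kappa, "FP": fp, "FN": fn, "OE": fp + fn}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.random((200, 200)) > 0.8
    pred = gt ^ (rng.random((200, 200)) > 0.98)   # flip roughly 2% of pixels
    print(change_detection_scores(pred, gt))
```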
• For video data analysis in public security systems, the authors propose the HK-DETR algorithm, which effectively improves the precision and efficiency of detecting dangerous knife-holding behavior.
      Jin Tao, Hu Peiyu
      Vol. 30, Issue 4, Pages: 1027-1040(2025) DOI: 10.11834/jig.240295
      HK-DETR: improved knife-holding dangerous behavior detection algorithm based on RT-DETR
      摘要:ObjectiveIn contemporary society, public safety concerns have garnered increasing attention, particularly in crowded venues such as subway stations, railway terminals, and commercial centers, where timely and accurate detection and response to potential threatening behaviors are paramount for maintaining societal stability. The extensive deployment of network cameras by public security systems serves as a vital surveillance tool, capable of capturing and recording vast amounts of video data in real-time, providing a rich source of information for security analysis. Nevertheless, a pivotal challenge arises when delving into the depths of these video data: the automated detection of pedestrians engaging in dangerous knife-wielding behaviors. The complexity of this task stems primarily from the diversity in knife shapes and sizes, ranging from conventional elongated knives to folding knives and daggers, each exhibiting distinct visual representations in images, posing significant challenges for detection algorithms. Furthermore, occlusions, a common occurrence in real-world surveillance scenarios, including body occlusions between pedestrians, obstructions by trees or buildings, can lead to incomplete target feature information, thereby compromising detection performance. Additionally, multi-object occlusion, prevalent in densely populated areas, where multiple pedestrians or objects overlap in images, exacerbates the difficulty in accurately distinguishing and localizing knife-wielding individuals. To address these issues and enhance the precision and efficiency of detecting dangerous knife-wielding behaviors, this paper proposes an algorithm named human-knife detection Transformer (HK-DETR), which is an improvement upon the real-time detection Transformer (RT-DETR). Building upon the inherent strengths of RT-DETR, HK-DETR incorporates numerous optimizations and innovations tailored specifically to the characteristics of knife-detection tasks.MethodFirst, we meticulously designed the inverted residual cascade block (IRCB) as a fundamental building block (BasicBlock) within the backbone network. This innovative design not only achieves a lightweight network architecture, effectively alleviating computational resource scarcity, but also significantly reduces redundant computations. By optimizing the processing flow of feature maps, the IRCB module substantially enhances the backbone network's ability to capture and distinguish diverse features, thereby laying a solid foundation for subsequent complex knife detection tasks. We propose the cross stage partial-parallel multi-atrousconvolution (CSP-PMAC) module, a revolutionary feature fusion strategy. This module directs the network to focus more intently on capturing and integrating multi-scale feature information during the fusion stage, which is pivotal for identifying knives of varying shapes and angles. This design equips the model with exceptional adaptability, enabling it to accurately identify both small knives and large knives, thus significantly improving the model’s performance in complex scenarios. In further optimizing the model, we have selected the novel Haar wavelet-based downsampling (HWD) module as a downsampling method to replace the traditional downsampling mechanism within the network. By leveraging its unique hierarchical wavelet decomposition technique, the HWD module effectively diminishes data dimensionality while retaining richer details of object scale variations. 
This enriches and refines feature representations in subsequent multiscale feature fusion, enhancing the model’s robustness in handling scale variations. Finally, to comprehensively enhance detection accuracy, we have adopted the minimum point distance based intersection over union (MPDIoU) loss function. This improved loss function optimizes object localization accuracy by more precisely measuring the overlap between predicted bounding boxes and actual target boxes. It not only considers classification accuracy but also intensifies the pursuit of localization precision, enabling the model to maintain superior detection performance even in the presence of dense or overlapping targets.ResultAblation experiments were conducted on the pedestrian knife-carrying dataset, which revealed that each improvement strategy, when applied individually, contributed to a certain degree of performance enhancement for the original RT-DETR model, despite the persistence of challenges such as missed detections and confidence issues in some cases. However, when these improvement strategies were combined, a significant boost in detection performance was achieved. To validate the effectiveness of the proposed model, comparative experiments were performed on the pedestrian knife-carrying dataset. The results demonstrated that compared to the original RT-DETR algorithm, the refined model exhibited a 25% reduction in network parameters while achieving improvements of 2.3%, 5.5%, and 5.2% in accuracy, recall, and mean average precision(mAP), respectively. When benchmarked against YOLOv5m, YOLOv8m, and Gold-YOLO-s, the refined model, with a lower number of network parameters, demonstrated notable mAP enhancements of 6.3%, 5.2%, and 1.8%, respectively.ConclusionThe proposed HK-DETR algorithm in this paper exhibits remarkable performance advantages in the task of automatically detecting dangerous knife-carrying behaviors of pedestrians in video data captured by public security system network cameras. This algorithm effectively addresses the challenges posed by the diversity of knife shapes and sizes, occlusion, and multi-target overlapping in complex scenarios, while significantly enhancing detection accuracy, recall rate, and mAP through a series of innovative designs. Compared to the original RT-DETR algorithm and other mainstream detection models such as YOLOv5m, YOLOv8m, and Gold-YOLO-s, HK-DETR achieves notable performance improvements. This result underscores the algorithm’s ability to maintain high efficiency and accuracy in diverse and complex environments, offering robust technical support for the field of public security surveillance. Within the realm of public safety, HK-DETR holds immense potential for widespread adoption in surveillance systems of public places like railway stations, airports, subway stations, and shopping malls, enabling real-time detection and early warning of potential knife-related dangers, thereby providing timely and effective information support for law enforcement agencies. Moreover, as technology continues to evolve and mature, the HK-DETR algorithm is poised to expand its reach into other domains, such as intelligent transportation and industrial automation, offering potent solutions to an array of practical problems.  
关键词:knife-holding behavior detection; real-time detection Transformer (RT-DETR); object detection; multi-scale feature fusion; Transformer; dangerous behavior detection
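The abstract above adopts the minimum point distance based IoU (MPDIoU) loss for localization. The sketch below follows the published MPDIoU formulation (IoU penalized by the normalized squared distances between corresponding top-left and bottom-right corners of the predicted and ground-truth boxes); it is an independent sketch for illustration, not the HK-DETR implementation.

```python
import torch

def mpdiou_loss(pred, target, img_w: int, img_h: int, eps: float = 1e-7):
    """MPDIoU loss for boxes given as (N, 4) tensors in (x1, y1, x2, y2) image
    coordinates: loss = 1 - (IoU - d_tl^2 / (w^2 + h^2) - d_br^2 / (w^2 + h^2)),
    where d_tl and d_br are corner distances and (w, h) is the image size."""
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    norm = img_w ** 2 + img_h ** 2
    d_tl = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2  # top-left corners
    d_br = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2  # bottom-right corners
    mpdiou = iou - d_tl / norm - d_br / norm
    return (1.0 - mpdiou).mean()

if __name__ == "__main__":
    p = torch.tensor([[10.0, 10.0, 110.0, 210.0]])
    t = torch.tensor([[12.0, 8.0, 115.0, 205.0]])
    print(mpdiou_loss(p, t, img_w=640, img_h=640))
```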

      Dataset

• Indoor large-scale panoramic visual localization dataset

In computer vision, the researchers present a large-scale indoor visual localization benchmark dataset built with a panoramic camera, providing a comprehensive way to evaluate visual localization algorithms.
      Yu Hailin, Liu Jiarun, Ye Zhichao, Chen Xinyu, Zhan Ruohao, ShenTu Yichun, Lu Zhongyun, Zhang Guofeng
      Vol. 30, Issue 4, Pages: 1041-1058(2025) DOI: 10.11834/jig.240284
      Indoor large-scale panoramic visual localization dataset
摘要:Objective: Visual localization aims to estimate the 6 degree-of-freedom (DoF) pose of a query image based on an offline reconstructed map and plays an important role in various fields, including autonomous driving, mobile robotics, and augmented reality. In the past decade, researchers have made significant progress in the accuracy and efficiency of visual localization. However, due to the complexity of real-world scenes, practical visual localization still faces many challenges, especially illumination changes, seasonal changes, repeating textures, symmetrical structures, and similar scenes. To address the issues of illumination and seasonal changes under long-term conditions, researchers have proposed a series of datasets and provided evaluation benchmarks to compare and improve visual localization algorithms. With the rapid development of deep learning, the robustness of learned image features has surpassed that of traditional hand-crafted ones. Researchers have proposed many keypoint detection, feature description, and feature matching methods based on deep neural networks. Some of these features and matching methods have already been used in visual localization and have shown promising results in long-term tasks. However, visual localization datasets specific to large-scale and complex indoor scenes are still lacking. Existing indoor visual localization datasets have limited scene sizes or pose relatively mild challenges. The area of most scenes in these datasets ranges from a few square meters to several thousand square meters. Many large and complex indoor scenes that are often encountered in daily life are not included, such as underground parking lots, dining halls, and office buildings. These scenes often show severe repetitive textures and symmetrical structures. To promote research on visual localization in large-scale and complex indoor scenes, we propose a new large-scale indoor visual localization dataset covering multiple scenarios that include repeating textures, symmetrical structures, and similar scenes. Method: We selected four commonly encountered indoor scenes in everyday life: an underground parking lot, a dining hall, a teaching building, and an office building. We used an Insta360 OneR panoramic camera as the collection device to densely capture these scenes. The size of the collected scenes ranges from 1 500 square meters to 9 000 square meters. To achieve accurate reconstruction of these scenes, we propose a panorama-based structure-from-motion (SfM) system. This system leverages the wide field of view offered by panoramas to address the challenge of constructing large-scale SfM models for indoor scenes. Different from existing techniques that rely on complex and costly 3D scanning equipment or extensive manual annotation, our proposed method can accurately reconstruct challenging large-scale indoor scenes using only panoramic cameras. To restore the true scale of the reconstruction, we adopted an interactive approach that aligns the dense reconstruction to computer-aided design drawings. The accuracy of the proposed large-scale indoor visual localization dataset was quantitatively and qualitatively analyzed through measurement and rendering comparison. To create a suitable database or reference model for evaluating current state-of-the-art visual localization methods, we converted the reconstructed panoramic images into perspective images.
Each panorama was divided into six perspective images with a resolution of 600 × 600 pixels, taken at 60° intervals along the yaw angle direction. Each perspective image has a field-of-view of 60 degrees.ResultThe scale error of all scene reconstructions in the proposed indoor visual localization dataset is within 1%. Based on the four proposed indoor scenes, we conducted evaluations on multiple state-of-the-art visual localization algorithms, including four advanced image retrieval methods and eight visual localization algorithms. Our rigorous quantitative and qualitative analysis reveals significant room for improvement in current state-of-the-art methods when confronted with large-scale and complex indoor scenarios. For instance, we observed the underperformance of the state-of-the-art feature matching methods SuperGlue and LoFTR compared with the basic nearest neighbor (NN) matching approach on certain datasets. Significant performance degradation on the datasets was recorded for PixLoc (based on end-to-end training) and ACE (based on scene coordinate regression). Furthermore, we designed a new visual localization evaluation metric that can effectively reflect the urgent problems arising during the practical application of current visual localization methods. This new benchmark strongly suggests the necessity of developing new criteria in these scenarios to ensure reliable and accurate localization judgments.ConclusionThe proposed large-scale and complex indoor visual localization dataset exhibits distinct characteristics compared with existing indoor datasets. On the one hand, it poses greater challenges than existing indoor datasets in terms of repetitive textures, symmetric structures, and similar scenes. On the other hand, it contains a wide range of scenarios and can provide a comprehensive evaluation of visual localization algorithms. In addition, the benchmarks provided in this paper can be used by researchers to compare and improve visual localization algorithms and to advance visual localization in practical indoor scenarios. The dataset is available at https://github.com/zju3dv/PanoIndoor.
      关键词:visual localization;dataset;feature matching;pose estimation;repeated textures   
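To make the panorama-to-perspective conversion described in the abstract concrete, the sketch below renders pinhole views from an equirectangular panorama at 60° yaw intervals with a 60° field of view and 600 × 600 resolution, matching the stated protocol; the function names, the OpenCV-based resampling, and the input file name are illustrative assumptions, not the authors' released code.

```python
import numpy as np
import cv2  # opencv-python

def pano_to_perspective(pano, yaw_deg, fov_deg=60.0, size=600):
    """Render one pinhole view from an equirectangular panorama at a given yaw."""
    H, W = pano.shape[:2]
    f = 0.5 * size / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels

    # Pixel grid of the target perspective image, centered at the principal point.
    u, v = np.meshgrid(np.arange(size), np.arange(size))
    x = (u - size / 2.0) / f
    y = (v - size / 2.0) / f
    z = np.ones_like(x)

    # Rotate the viewing rays about the vertical axis by the requested yaw.
    yaw = np.radians(yaw_deg)
    xr = np.cos(yaw) * x + np.sin(yaw) * z
    zr = -np.sin(yaw) * x + np.cos(yaw) * z

    # Ray direction -> spherical angles -> equirectangular pixel coordinates.
    lon = np.arctan2(xr, zr)                      # [-pi, pi]
    lat = np.arctan2(y, np.sqrt(xr ** 2 + zr ** 2))  # [-pi/2, pi/2]
    map_x = ((lon / (2 * np.pi) + 0.5) * W).astype(np.float32)
    map_y = ((lat / np.pi + 0.5) * H).astype(np.float32)
    return cv2.remap(pano, map_x, map_y, cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_WRAP)

if __name__ == "__main__":
    pano = cv2.imread("panorama.jpg")  # hypothetical input file
    # Six views at 60-degree yaw intervals, as described in the abstract.
    views = [pano_to_perspective(pano, yaw) for yaw in range(0, 360, 60)]
```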
    • Diffusion model-generated video dataset and detection benchmarks AI导读

In the field of video generation, researchers have created DGVD, a large-scale, multi-type generated-video dataset that provides a new benchmark for generated-video detection and advances the field.
      Zheng Tianpeng, Chen Yanxiang, Wen Xinzhe, Li Yancheng, Wang Zhiyuan
      Vol. 30, Issue 4, Pages: 1059-1071(2025) DOI: 10.11834/jig.240259
      Diffusion model-generated video dataset and detection benchmarks
摘要:ObjectiveDiffusion models have shown remarkable success in video generation. An example is OpenAI Sora, where a simple text or image can be used to generate a video. However, this convenient video generation also raises concerns regarding the potential abuse of generated videos for deceptive purposes. Existing detection techniques primarily target face videos. However, datasets specialized for the detection of forged scene videos generated by diffusion models are lacking, and many existing datasets suffer from limitations such as single conditional modality and insufficient data volume. To address these challenges, we propose a multiconditional generated video dataset and corresponding detection benchmark. Traditional detection methods for generated videos often rely on a single conditional modality, such as text or image, which restricts their ability to effectively detect a wide range of generated videos. For instance, algorithms trained solely on videos generated by a text-to-video (T2V) model may fail to identify videos generated by an image-to-video (I2V) model. By introducing multiconditional generated videos, we aim to provide a comprehensive and robust dataset that encompasses T2V- and I2V-generated videos. The proposed dataset development process involves collecting diverse multiconditional generated videos and real videos downloaded from the Internet. Each generative method produces a substantial number of videos that can be utilized to train the detection model. These diverse conditions and the large number of generated videos ensure that the dataset encompasses the broad features of diffusion videos, thereby enhancing the effectiveness of generated video detection models trained on this dataset.MethodOur generated video dataset can provide training data for detection, allowing the detector to recognize whether a video is AI-generated or not. It uses existing advanced diffusion models (T2V and I2V) to generate numerous videos. Prompt text is one of the keys to generating high-quality videos. We use ChatGPT to generate these prompt texts. To obtain general prompt texts, we set 15 entity words, such as dog and cat, and then combine them with a template as the input for ChatGPT. In this way, we obtain 231 prompt texts for generating T2V videos. Different from the T2V model, the I2V model uses an image as the input condition to produce a moving image. We use the existing advanced T2V and I2V methods for video generation and combine the results with real videos obtained from the web to build the final dataset. The generation quality of the videos is evaluated using existing advanced methods for generated video evaluation. We combine the metrics of EvalCrafter and AIGCBench to evaluate generated videos. For the detection module, we use advanced video detection methods (four image-level video detectors and six video-level video detectors) to evaluate the performance of existing detectors on our dataset with different experiments.ResultWe introduce a generalized scene generated video dataset (diffusion-generated video dataset, DGVD) and corresponding detection benchmarks constructed using multiple generated video detection methods. A generated video quality estimation method covering T2V and I2V is proposed by combining the current state-of-the-art generated video evaluation methods EvalCrafter and AIGCBench.
Generated video detection experiments are conducted on four image-level detectors (CNN detection, CNNdet; diffusion reconstruction error, DIRE; wavelet domain forgery clues, WDFC; and deep image fingerprint, DIF) and six video-level detectors (inflated 3D, I3D; expand 3D, X3D; convnets 2D, C2D; Slow; SlowFast; and multiscale vision Transformer, MViT). We set up two experiments: within-class detection and cross-class detection. For within-class detection, we train and evaluate the performance of the detectors using the T2V test dataset. For cross-class detection, we train the detectors on the T2V test dataset but evaluate their performance on the I2V test dataset. Experimental results demonstrate that the image-level detection methods are unable to effectively detect unknown data and exhibit poor generalization. Conversely, the video-level detection methods perform well on the videos generated by methods implemented with the same backbone network. However, they still cannot achieve good generalizability in other classes. These results indicate that existing video detectors are unable to identify the majority of videos generated by diffusion models.ConclusionIn this work, we introduce a novel dataset, DGVD, that encompasses a diverse array of categories and generative scenarios to address the need for advancements in generated video detection. By providing a comprehensive dataset and corresponding benchmarks, we offer a challenging environment for training and evaluating detection models. This dataset and its corresponding benchmarks highlight the current gaps in generated video detection and serve as a basis for further progress in the field. We hope to make significant strides toward enhancing the robustness and effectiveness of generated video detection systems, ultimately driving innovation and advancement in the field. The dataset and code in this paper can be downloaded at https://cstr.cn/31253.11.sciencedb.22031 and https://github.com/ZenT4n/DVGD.
      关键词:video generation;diffusion model;generated video detection;prompt text generation;video quality evaluation   
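The abstract describes building T2V prompts by combining 15 entity words with a template before querying ChatGPT. The snippet below sketches that combine-entities-into-a-template step; the entity list, template wording, and grouping scheme are placeholders, since the paper's exact recipe (which yields 231 prompts) is not given in the abstract.

```python
from itertools import combinations

# Hypothetical stand-ins: the paper's 15 entity words and exact template are not
# listed in the abstract, so these serve only to illustrate the idea.
ENTITIES = ["dog", "cat", "car", "boat", "tree", "bird", "horse", "train",
            "robot", "flower", "bridge", "mountain", "river", "person", "building"]
TEMPLATE = ("Write a short, vivid text-to-video prompt describing a scene "
            "that contains {subjects}.")

def build_prompt_requests(entities=ENTITIES, group_size=2):
    """Combine entity words into template-based requests to send to ChatGPT."""
    return [TEMPLATE.format(subjects=" and ".join(group))
            for group in combinations(entities, group_size)]

if __name__ == "__main__":
    reqs = build_prompt_requests()
    print(len(reqs), "prompt requests")  # 105 here; the paper's scheme yields 231
```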

      Image Processing and Coding

• In the field of image copyright protection, researchers propose a feature-point-enhanced local watermarking algorithm under perceptual constraint and guidance, which effectively improves feature point stability and enhances watermark robustness and imperceptibility.
      Guo Na, Huang Ying, Niu Baoning, Guan Hu, Lan Fangpeng, Zhang Shuwu
      Vol. 30, Issue 4, Pages: 1072-1083(2025) DOI: 10.11834/jig.240348
      Concurrent watermark embedding and feature point enhancement with perceptual constraint and guidance
摘要:ObjectiveThe technology of image watermarking plays a vital and indispensable role in the realm of copyright protection by embedding unique identifiers within digital images. This technological advancement enables content owners to assert and verify ownership, as well as trace unauthorized use of their intellectual property. Local image watermarking technology embeds the watermark into specific regions of an image, which helps to prevent the watermark from being compromised by cropping attacks while minimizing visual distortion as much as possible. As a result, local image watermarking technology, compared to global watermarking technology, ensures the robustness and integrity of the watermark while maintaining the visual quality of the original image. The localization and synchronization of the embedding region in local image watermarking are typically facilitated by feature points, which serve as reference markers for embedding the watermark and are critical for ensuring accurate extraction during watermark detection processes. Feature points provide a consistent and reliable framework for embedding the watermark, thereby enhancing the precision and effectiveness of the watermarking technique. However, watermark embedding and potential image attacks can easily cause the displacement of feature points, resulting in inaccurate localization of the embedded region. This displacement can consequently lead to failures in watermark extraction, compromising the effectiveness of the watermarking process. To address these challenges, it is imperative to enhance the stability of feature points in local watermarking technologies. Stability refers to the resilience of feature points against various image distortions or attacks that could potentially alter their intended positions. This stability directly impacts the effectiveness of watermark embedding and subsequent extraction processes, thereby influencing the overall robustness of the copyright protection mechanism. Stable feature points ensure that the watermark remains accurately embedded and can be reliably extracted, even in the face of adversarial conditions.MethodThis paper proposes concurrent watermark embedding and feature point enhancement with perceptual constraint and guidance (CoEE), a method that performs both watermark embedding and feature point enhancement by adaptively modifying pixels once. It can achieve three significant effects: improved stability of feature points, enhanced robustness, and ensured imperceptibility to maintain the visual quality of the image. The adaptability of the algorithm is underscored by two key aspects. Firstly, an optimization function is designed to obtain the optimal pixel modification strategy that enhances the strength of feature points and embeds the watermark simultaneously. This strategy involves a careful analysis and adjustment of pixel values to ensure that the feature points maintain their stability and resilience even after the watermark is embedded. By doing so, it prevents the weakening of feature point stability that typically occurs during the watermark embedding process, thereby improving the system’s resistance to various forms of attacks. This enhanced resistance ensures that the watermark remains robust and detectable under adverse conditions, thereby providing a reliable means of copyright protection.
Secondly, the total amount of pixel modifications during the watermark embedding process is constrained by the peak signal-to-noise ratio (PSNR), which serves as a quantitative measure of the changes allowed in the image to maintain its visual quality. To go a step further, the total amount of pixel modifications is allocated to individual pixels under perceptual guidance, taking the human visual system’s sensitivity to changes in different parts of the image into account. The allocation strategy aims to maximize the imperceptibility of the embedded watermark, ensuring that the modifications are distributed in such a manner that they remain largely unnoticed by the human eye. By adhering to the PSNR constraints, the strategy guarantees that the watermarked image maintains its high visual quality. This careful balance between robustness and imperceptibility ensures that the watermark is invisible to viewers, while still being robust enough to be detected and accurately extracted, thereby achieving a seamless integration of copyright protection within the digital content.ResultThe experiments demonstrate that CoEE algorithm significantly enhances the stability of feature points. When subjected to various attack scenarios, the accuracy of watermark extraction using the CoEE algorithm significantly surpasses the performance of the current state-of-the-art watermarking algorithms. This superior performance is consistently observed provided that the PSNR of the watermarked image remains above the threshold of 40 dB. Consequently, the CoEE algorithm demonstrates a notable improvement in watermark extraction accuracy and resilience under diverse attack conditions.ConclusionThe algorithm proposed in this paper significantly enhances the stability of feature points, resulting in superior performance in both watermark invisibility and watermark robustness.  
      关键词:local watermarking;feature point;perceptual guidance;imperceptibility;robustness   
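One way to read the PSNR-constrained, perceptually guided allocation described above: a PSNR floor P implies a mean-squared-error budget of 255² / 10^(P/10), which can then be distributed across pixels according to a perceptual weight map. The sketch below illustrates this bookkeeping with a local-variance proxy for perceptual masking; the weighting and allocation rule are assumptions, not CoEE's actual optimization.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def psnr_budget(num_pixels, psnr_floor_db=40.0, peak=255.0):
    """Total squared-modification budget implied by requiring PSNR >= psnr_floor_db."""
    mse_max = peak ** 2 / (10.0 ** (psnr_floor_db / 10.0))
    return mse_max * num_pixels

def perceptual_weights(gray):
    """Illustrative perceptual map: local standard deviation as a texture-masking
    proxy (busier regions tolerate larger changes), normalized to sum to one."""
    gray = gray.astype(np.float64)
    mu = uniform_filter(gray, size=3)
    var = uniform_filter(gray ** 2, size=3) - mu ** 2
    w = np.sqrt(np.clip(var, 0.0, None)) + 1e-3
    return w / w.sum()

def allocate_modifications(gray, psnr_floor_db=40.0):
    """Per-pixel modification amplitude whose total squared sum meets the PSNR floor."""
    w = perceptual_weights(gray)
    per_pixel_sq = psnr_budget(gray.size, psnr_floor_db) * w
    return np.sqrt(per_pixel_sq)
```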
• In computer vision, researchers propose MLFN, a new super-resolution reconstruction method that effectively improves image reconstruction accuracy and model performance.
      Song Xiaogang, Zhang Pengfei, Liu Wanbo, Lu Xiaofeng, Hei Xinhong
      Vol. 30, Issue 4, Pages: 1084-1099(2025) DOI: 10.11834/jig.240042
      Image superresolution reconstruction based on multiscale large-kernel attention feature fusion network
      摘要:ObjectiveImage superresolution reconstruction is a foundational and critical task in computer vision that aims to enhance the resolution and visual quality of low-resolution images. With the rapid advancement of deep learning technologies, a plethora of image superresolution methods have been developed. Most of these methods leverage the power of deep learning models to achieve superior performance. Early methods were predominantly based on convolutional neural networks (CNNs), which gained popularity due to their efficient local feature extraction through the sliding window mechanism and parameter sharing. One inherent limitation of CNNs is their restricted receptive field, which limits their ability to capture long-range dependencies and contextual information within an image. Hence, convolution-based methods may struggle to fully restore fine details in distant regions of an image. With the advent of Transformer technology in computer vision, self-attention mechanisms have demonstrated remarkable capability in capturing global dependencies across an entire image to restore superresolution images with enhanced clarity and detail. Nonetheless, the associated increase in computational cost introduces significant challenges, particularly in algorithmic complexity and resource consumption. Therefore, balancing high-precision image reconstruction with reduced computational resources is essential for the broad adoption and practical application of superresolution reconstruction techniques across real-world scenarios.MethodTo address these challenges, this work proposes a super-resolution reconstruction method, MLFN, based on a multi-scale large-kernel attention feature fusion network. This approach involves four main stages. Initially, the low-resolution image is specifically preprocessed and then inputted into the network. The input image undergoes processing through an unrestricted blueprint convolution and is then fed into a multipath feature extraction module for global and local feature extraction. Within this multipath feature extraction module, a multiscale large-kernel separable convolution block is introduced to enhance the network’s receptive field while minimizing parameter consumption. Finally, a lightweight normalized attention mechanism is incorporated at the end to further improve reconstruction accuracy. This network adopts a multipath structure to learn different horizontal feature representations, thereby enhancing its multiscale extraction capability. A multiscale large-kernel separable convolution block is designed to balance the powerful global information-capturing ability of self-attention mechanisms and the strong local perception ability of convolution for the extraction of global and local features. A lightweight normalized attention module is incorporated at the end to further enhance model performance while achieving a lightweight design for the network model.ResultMLFN utilized the DF2K dataset comprising 800 and 2 650 training images for training, and its performance was evaluated on test sets including five benchmark datasets: Set5, Set14, BSD100, Urban100, and Manga109. Set5 was used as the validation set. Bicubic interpolation was applied to downsample high-resolution images to the desired scales (×2, ×3, and ×4), simulating low-resolution images with the downsampled counterparts. Given the human visual system’s higher sensitivity to luminance details than color changes, evaluations were conducted on the Y channel (luminance) in the YCbCr color space. 
Peak signal-to-noise ratio(PSNR) and structural similarity index(SSIM) were chosen as evaluation metrics. Experiments were conducted on five publicly available test datasets by comparing MLFN with 11 representative methods to showcase the performance of our approach. Results indicate that our proposed MLFN consistently outperforms IMDN at various upscaling factors, with an average PSNR improvement of 0.2 dB. The reconstructed images exhibit a significant visual advantage. Parameter count comparison reveals that our proposed model maintains a certain advantage over other advanced lightweight methods.ConclusionThis study introduces a novel superresolution reconstruction method, MLFN, which is based on a multiscale large-kernel attention feature fusion network. The core of this approach lies in the integration of a multiscale large-kernel separable convolution block that enhances the overall quality of image reconstruction by effectively balancing the strengths of self-attention mechanisms, which excel at capturing long-range global dependencies, with the robust local perception ability of convolutional layers. This combination allows for an accurate and efficient extraction of global and local features from images, addressing a common limitation in traditional convolutional neural networks that struggle to capture long-range contextual information. Moreover, the method introduces multipath feature extraction blocks designed to capture different horizontal feature representations. This multipath architecture allows the network to gather image details at various scales, significantly improving the reconstruction accuracy across different resolutions. Another important aspect of the proposed method is the adoption of a lightweight normalized attention mechanism, which enhances the model’s capability by selectively focusing on important features while avoiding the introduction of additional fully connected or convolutional layers, thus reducing the overall parameter count and making the model light and efficient. Owing to its optimized architecture, this method achieves high performance in image reconstruction quality and model lightweighting and is particularly suitable for applications on mobile devices and embedded systems where computational resources are limited. Hence, MLFN provides a lightweight and efficient solution for image superresolution reconstruction tasks.  
      关键词:image super-resolution (SR) reconstruction;large kernel separation convolution;attention mechanism;feature fusion;multi-path learning   
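The following PyTorch sketch illustrates the kind of multiscale large-kernel separable convolution block the abstract describes: parallel depthwise large-kernel (dilated) branches fused by point-wise convolution with a residual connection. Kernel sizes, dilation rates, and branch count are illustrative assumptions, not MLFN's exact configuration.

```python
import torch
import torch.nn as nn

class LargeKernelSeparable(nn.Module):
    """Depthwise large-kernel conv (optionally dilated) followed by a 1x1 conv."""
    def __init__(self, channels, kernel_size, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class MultiScaleLargeKernelBlock(nn.Module):
    """Parallel large-kernel separable branches fused by a 1x1 conv.
    Kernel sizes and dilations here are illustrative, not the paper's settings."""
    def __init__(self, channels, specs=((5, 1), (7, 2), (9, 3))):
        super().__init__()
        self.branches = nn.ModuleList(
            [LargeKernelSeparable(channels, k, d) for k, d in specs])
        self.fuse = nn.Conv2d(channels * len(specs), channels, 1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.fuse(feats)  # residual connection

if __name__ == "__main__":
    block = MultiScaleLargeKernelBlock(64)
    print(block(torch.randn(1, 64, 48, 48)).shape)  # torch.Size([1, 64, 48, 48])
```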
• In icing wind tunnel image processing, researchers propose the MSFF-GAN dehazing method, which effectively improves image quality in cloudy and hazy environments and provides accurate data for aircraft icing research.
      Zhou Wenjun, Yang Xinling, Zuo Chenglin, Wang Yifan, Peng Bo
      Vol. 30, Issue 4, Pages: 1100-1117(2025) DOI: 10.11834/jig.240343
      MSFF-GAN: a dehazing model for icing wind tunnel images in cloudy conditions
      摘要:ObjectiveDuring high-altitude flights, aircraft are often exposed to extremely low-temperature environments, particularly below the freezing point. Such conditions make it easy for water vapor and liquid water in contact with the aircraft surface to condense and form ice layers. This icing phenomenon poses a significant threat to flight safety as it can alter the aircraft’s surface structure, affecting its aerodynamic performance and potentially leading to serious flight accidents. Therefore, the study of aircraft icing is crucial, as it relates to the safeguarding of flight safety and the enhancement of aircraft performance. To delve deeper into the impact of aircraft icing on flight performance, icing wind tunnels play a pivotal role as ground test facilities. They are capable of simulating high-altitude cloud and hazy environments, enabling researchers to replicate the icing conditions that aircraft may encounter during flight and observe and analyze the effects of icing on aircraft performance. However, in cloudy and haze environments, the quality of captured images is often severely compromised. Atmospheric scattering effects can cause images to appear grayish, reduce contrast, and obscure the originally visible ice structures. This not only hinders researchers’ observations and documentation of the icing process but also reduces the accuracy of icing detection, thereby affecting subsequent research and analysis. Therefore, preprocessing the captured images before utilizing the icing wind tunnel for aircraft icing research is particularly important. Image dehazing techniques, as effective image processing methods, can significantly enhance image quality, making originally blurred ice structures clearly visible. By applying dehazing techniques, not only can researchers improve their observation of the icing process, but also enhance the accuracy of icing detection, providing more reliable data support for subsequent research and analysis.MethodTraditional image dehazing methods suffer from issues such as parameter sensitivity and long processing time, which directly impact the effectiveness and efficiency of dehazing. In recent years, image dehazing methods based on deep learning have garnered widespread research attention. Through end-to-end learning, this approach demonstrates strong adaptability and dehazing capabilities. However, deep learning-based dehazing methods require a significant amount of labeled data and high computational resources for support. To effectively address the issues of insufficient feature extraction and residual haze in current deep learning-based image dehazing methods when processing icing wind tunnel images, this paper proposes a generative adversarial network with multi-scale feature fusion for image dehazing (MSFF-GAN). MSFF-GAN aims to leverage the exceptional image generation capabilities and supervised learning characteristics of GANs to achieve more automated and efficient dehazing. First of all, the generative adversarial network introduces a competitive mechanism to make the generator and the discriminator compete with each other in the training process, thus continuously optimizing the generation results. This mechanism helps the generator learn more complex image distributions and generate images with higher quality. 
Secondly, the generative adversarial network adopts supervised learning for training, which means that it can directly use a large amount of labeled data for learning, so as to better understand the inherent structure and characteristics of images. The generative adversarial network consists of a generator and a discriminator. The generator primarily consists of two modules: feature fusion and enhancement strategies. The feature fusion module effectively integrates multi-scale features of the image by employing back-projection techniques. The enhancement strategy module, on the other hand, achieves gradual refinement of intermediate results through a concise network design, thereby enhancing the overall image quality. To extract richer image information from hazy images and further enhance the dehazing effect, we extract prior features of hazy images, namely the dark channel and color attenuation, and integrate them into the network structure. Furthermore, by setting the discriminator in the GAN-based dehazing network to be multi-scale, it can synthesize information from different scales, providing more comprehensive contextual information and feedback. This approach improves the visual quality of the image across multiple receptive fields. Finally, to further enhance the dehazing effect, multiple loss functions are employed to jointly constrain the training of the dehazing model.ResultWe select six different sets of icing and haze images of aircraft wings in various cloud and haze scenarios within the wind tunnel, under conditions of different observation angles and with the values of median volume diameter (MVD) and liquid water content (LWC) set to 25 μm and 1.31 g/m³, 20 μm and 1.0 g/m³, and 20 μm and 0.5 g/m³, respectively, to form the test sets. Subsequently, experiments are conducted on these six different icing wind tunnel scene test sets, comparing the method proposed in this paper with four traditional dehazing methods and four deep learning-based dehazing methods. The experimental results demonstrate that the dehazed images obtained by our method are the clearest and achieve the best dehazing effect. Notably, our method produces dehazed images with higher clarity, effectively preserving the original colors and texture information of the icing wind tunnel images, especially without compromising the shape of the icing regions on the wings. We also conducted experiments on synthetic datasets from public datasets and on real foggy images, and the results demonstrated significant defogging effects, resulting in an improvement in image quality. Furthermore, we conducted ablation experiments to verify the effectiveness of our proposed dehazing method. These experiments confirmed that our method significantly improves dehazing performance and achieves better results in evaluation metrics. These positive outcomes highlight the robustness and generalization capabilities of our dehazing model in various cloud and haze environments within the wind tunnel.ConclusionThe dehazing model proposed in this paper has demonstrated satisfactory dehazing effects and excellent generalization performance in icing wind tunnel simulations with varying degrees of haze concentration. The model is capable of effectively eliminating the haze from the images, significantly restoring their clarity, and providing researchers with improved visual outcomes.
Additionally, the model preserves the shape and color of the icing region on the aircraft wing to a large extent while also reconstructing crucial parts of the background area around the wing. This provides researchers with clearer image information, which is crucial for subsequent icing detection and related work.  
      关键词:icing wind tunnel;cloud environment;wing icing image dehazing;generating adversarial network(GAN);multi-scale feature fusion   
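The two prior features mentioned above, the dark channel and color attenuation, have standard formulations; the sketch below computes them and stacks them with the hazy image as extra input channels. The patch size and the channel-stacking scheme are assumptions about how such priors could be fed to a generator, not MSFF-GAN's exact design.

```python
import cv2
import numpy as np

def dark_channel(img_bgr, patch=15):
    """Dark channel prior: per-pixel min over RGB followed by a local min filter."""
    min_rgb = img_bgr.min(axis=2)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    return cv2.erode(min_rgb, kernel)

def color_attenuation(img_bgr):
    """Color attenuation cue: brightness minus saturation (haze tends to raise
    brightness and lower saturation)."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32) / 255.0
    return hsv[..., 2] - hsv[..., 1]

def with_priors(img_bgr):
    """Stack the hazy image with its two prior maps as extra input channels."""
    dc = dark_channel(img_bgr).astype(np.float32) / 255.0
    ca = color_attenuation(img_bgr)
    rgb = img_bgr.astype(np.float32) / 255.0
    return np.dstack([rgb, dc, ca])  # H x W x 5 input for the generator
```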
    • Fuzzy diffusion model for seen-through document image restoration AI导读

In document image processing, researchers propose a fuzzy diffusion model that effectively removes the seen-through phenomenon and improves the accuracy and efficiency of seen-through removal.
      Wang Yijie, Gong Jiaxin, Liang Zongbao, Chong Qianpeng, Cheng Xiang, Xu Jindong
      Vol. 30, Issue 4, Pages: 1118-1129(2025) DOI: 10.11834/jig.240350
      Fuzzy diffusion model for seen-through document image restoration
      摘要:ObjectiveDocument images have significant applications across various fields, such as optical character recognition (OCR), historical document restoration, and electronic reading. While scanning or shooting a document, factors such as ink density and paper transparency may cause the content from the reverse side to become visible through the paper, resulting in a digital image with a “seen-through” phenomenon that affects practical applications. Image acquisition is often affected by various sources of uncertainty, including differences in camera equipment performance, paper quality, lighting conditions, lens shake, and variations in the physical properties of the documents themselves. All these random factors contribute to the noise in document images and may complicate the seen-through phenomenon, thereby influencing subsequent tasks such as text recognition, word identification, and layout analysis. Although restoring the content of document images is important, the backgrounds of many color document images also provide valuable information. Recovering color images with complex backgrounds affected by the seen-through phenomenon presents its own challenges. Despite the improvement in image quality achieved by existing methods for removing seen-through effects from document images, algorithms specifically tailored to handle variations in seen-through effects, complex background colors, and influence of uncertainty factors have not yet been developed. This work aims to develop a comprehensive algorithm for addressing the diverse seen-through problems in regular document images, handwritten document images, and color document images. We propose the fuzzy diffusion model (FDM) that integrates fuzzy logic with conditional diffusion models, introducing a novel approach to document image enhancement and restoration. The objective of this algorithm is to restore document images affected by various types and degrees of seen-through phenomenon.MethodThe overall process of this algorithm can be divided into forward diffusion and corresponding reverse denoising. First, we gradually add continuous Gaussian noise to the input image using mean-reverting stochastic differential equations, resulting in a seen-through mean state with fixed Gaussian noise. We then train a neural network to progressively predict the noise at the current time step from the image with added noise and estimate the score function based on the predicted noise. In the reverse process, we gradually restore the low-quality image by simulating the corresponding reverse-time stochastic differential equation until a clean image without seen-through effects is generated. To address the uncertainty factors in document images, we specifically design a fuzzy block in the skip connection part of the noise network to compute the affiliation of each pixel point in the image. The fuzzy operation uses nine surrounding pixels including the pixel itself, and the final affiliation of the pixel is obtained after fuzzy inference. We draw inspiration from the U-Net structure in denoising diffusion probabilistic model, except that we remove all group normalization layers and self-attention layers to improve inference efficiency. In the middle part, we introduce atrous spatial pyramid pooling (ASPP) to maximize the expansion of the receptive field and extract richer features. Finding matching pairs of seen-through images in the real world is challenging, so we propose a new protocol for synthetic seen-through images. 
During training, we input seen-through images as conditional information, along with the noise-added images, into the noise network to allow the model to directionally learn the target distribution. After the model is trained, seen-through images are used as conditional input to progressively predict the noise in the noise image, generating clear document images.ResultWe trained our model separately on the synthetic grayscale dataset and the synthetic color dataset and tested it on three synthetic datasets and two real datasets. The test sets include synthetic grayscale document images, synthetic color document images, synthetic handwritten document images, the Media Team Oulu document dataset, and real CET-6 seen-through document images. Compared with five representative existing methods, our proposed method achieved the best visual effects, effectively eliminating the noise present in the original images to some extent. It also achieved the best results on the following four evaluation metrics: peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), learned perceptual image patch similarity (LPIPS), and Fréchet inception distance (FID). Our method achieved the best PSNR and FID on the grayscale dataset with values of 35.05 dB and 30.69, respectively. On the synthetic color dataset, our method obtained the highest SSIM, LPIPS, and FID values of 0.986, 0.005 3, and 20.03, respectively. To validate the stability of the proposed method, we also provided the variance values when evaluating the SSIM metric. Our method achieved the best result of 0.0053.ConclusionThe proposed FDM effectively addresses various challenges in the task of removing seen-through effects from document images, including the lack of paired seen-through document images, residual seen-through effects, difficulty in handling complex backgrounds, and addressing uncertainty factors in images uniformly. It can effectively and accurately remove the seen-through phenomenon in different types of document images and is expected to be integrated into various practical hardware devices such as cameras and scanners.  
      关键词:diffusion model;fuzzy logic;image restoration;seen-through removal;stochastic differential equation(SDE)   
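As a rough illustration of the mean-reverting forward process described above (the state drifts toward the degraded "seen-through mean" while Gaussian noise accumulates), the sketch below runs an Euler-Maruyama discretization of dx = θ(μ − x)dt + σ dW. The coefficients, step count, and toy degradation are illustrative and not FDM's exact formulation.

```python
import torch

def forward_mean_reverting(x0, mu, steps=100, theta=2.0, sigma=0.5):
    """Euler-Maruyama simulation of dx = theta * (mu - x) dt + sigma dW.

    x0: clean document image; mu: its seen-through counterpart (same shape).
    The final state is approximately mu corrupted by stationary Gaussian noise.
    theta, sigma, and steps are illustrative constants.
    """
    dt = 1.0 / steps
    x, traj = x0.clone(), [x0]
    for _ in range(steps):
        drift = theta * (mu - x) * dt
        diffusion = sigma * (dt ** 0.5) * torch.randn_like(x)
        x = x + drift + diffusion
        traj.append(x)
    return traj

if __name__ == "__main__":
    clean = torch.rand(1, 3, 64, 64)
    seen_through = 0.7 * clean + 0.3 * torch.rand(1, 3, 64, 64)  # toy degradation
    states = forward_mean_reverting(clean, seen_through)
    print(len(states), states[-1].shape)
```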

      Image Analysis and Recognition

    • Semantic segmentation of light-field angle cue representations AI导读

In light-field semantic segmentation, researchers propose an end-to-end network model that effectively improves segmentation performance and offers a new direction for applying light-field technology to scene understanding.
      Cheng Xinyi, Jia Chen, Zhang Zixuan, Shi Fan
      Vol. 30, Issue 4, Pages: 1130-1140(2025) DOI: 10.11834/jig.240391
      Semantic segmentation of light-field angle cue representations
      摘要:ObjectiveLight-field images are high-dimensional data capturing multiview information of scenes and encompassing rich geometric and angular details. In light-field semantic segmentation, the goal is to assign semantic labels to each pixel in the light-field image, distinguishing different objects or parts of objects. Traditional 2D or 3D image segmentation methods often struggle with challenges, such as variations in illumination, shadows, and occlusions, when applied to light-field images, leading to reduced segmentation accuracy and poor robustness. Leveraging the angular and geometric information inherent in light-field images, light-field semantic segmentation aims to overcome these challenges and improve segmentation performance. However, existing algorithms are typically designed for red/green/blue (RGB) or RGB-depth image inputs and do not effectively utilize the structural information of light fields for semantic segmentation. Moreover, previous studies mainly focused on handling redundant light-field data or manually crafted features, and the highly coupled 4D nature of light field data poses a barrier to conventional convolutional neural network (CNN) modeling approaches. Prior works primarily focused on object localization and segmentation in planar spatial positions, lacking detailed angular semantic information for each object. Therefore, we propose a CNN-based light-field semantic segmentation network for processing light-field macro-pixel images. Our approach incorporates angular feature extractor (AFE) to learn angular variations between different views within the light-field image and employs dilated convolution operations to enhance semantic correlations across multiple channels.MethodAn end-to-end semantic segmentation network is proposed for static images, starting from the construction of multiscale light-field macro-pixel images. It is based on various backbone networks and dilated convolutions. The challenge of efficiently extracting spatial features from macro-pixel images is addressed by employing atrous spatial pyramid pooling (ASPP) in the encoder module to extract high-level fused semantic features. In the experiments, the dilation rates for the ASPP module are selected as r [12, 24, 36] to enrich spatial features under the same-sized feature maps and achieve good semantic segmentation results. Multiscale spatial features are efficiently extracted using different dilation rates in convolutions. In the decoder module, feature modeling is performed to enhance the nonlinear expression of low-level semantic features in macro-pixel images and channel correlation representation. Semantic features from the encoder are upsampled four times and concatenated with the features generated by the angle model to enhance the interactivity between the features in the network. These features are further refined through 3 × 3 convolution operations, combining angle and spatial features for enhanced feature expression. Finally, segmentation results are outputted through fourfold upsampling. An AFE is introduced in the decoder stage to enhance the expression of light field features and fully extract rich angular features from light-field macro-pixel images. AFE operates as a special convolution with kernel size K × K, stride K, and dilation rate 1, where K equals the angular resolution of the light field. Input features for the angle model are derived from the Conv2_x layer of ResNet-101, preserving complete low-dimensional macro-pixel image features. 
This design is crucial for capturing angular relationships between pixels in subaperture images and avoids the loss of angular information during consecutive downsampling. The incorporation of angular features enables the model to better distinguish boundaries between different categories and provides more accurate segmentation results. In complex scenarios such as uneven illumination, occlusion, or small objects, ASPP can extract a broad context and AFE can capture complementary angular information between macro-pixel images. Their synergistic effect significantly enhances the semantic segmentation performance.ResultQuantitative and qualitative comparison experiments were conducted on the LFLF dataset against various optical flow methods to assess the performance of the proposed model. For a fair comparison, the baseline parameters were used as benchmarks. The model achieved 88.80% segmentation accuracy on the test set, outperforming all the selected state-of-the-art (SOTA) methods. Compared with all the other baseline methods, the proposed approach achieved a performance improvement of over 2.15%, allowing for a precise capture of subtle changes in images and thus accurate segmentation boundaries. Compared with five other semantic segmentation methods, this approach demonstrated significant superiority in segmentation boundary accuracy. Relevant ablation experiments were conducted to investigate the advantages of the AFE and multiscale ASPP. Removing ASPP and AFE resulted in a significant decrease in mean intersection over union (mIoU) to 22.51%, a substantial drop of 66.29%. This finding demonstrated that the complete model integrating ASPP and AFE effectively utilizes multiscale information and angular features to achieve optimal semantic segmentation performance. In particular, removing the multiscale ASPP led to a performance decrease of 6.58% because supplementary multiscale semantic features cannot be obtained at a single scale. Similarly, removing the AFE caused a performance drop of 2.36% due to the absence of the guided angular clue features necessary for capturing specific optical flow information. Therefore, the synergistic effect of AFE and multiscale ASPP significantly enhances the semantic segmentation performance. Four popular backbone networks, namely, ResNet101, DRN, MobileNet, and Xception, were utilized to explore the optimal backbone network for the proposed algorithm. When the backbone network was ResNet101, the highest mIoU obtained was 88.80%.ConclusionExisting methods are limited by their inability to utilize angular information from light-field images, which hinders the accurate delineation of object boundaries. The proposed approach demonstrates superior performance in overall image segmentation tasks, effectively mitigating the issues of oversegmentation and missegmentation. Owing to its AFE and ASPP components, the proposed method can accurately capture subtle changes in images and thereby achieve precise segmentation boundaries. Compared with five other semantic segmentation methods, the proposed approach demonstrates significant advantages in the accuracy of segmentation boundaries. This work introduces a novel light field image semantic segmentation method that takes light-field macro-pixel images as input to achieve end-to-end semantic segmentation. A simple and efficient angular feature extraction model is designed and integrated into the network to extract the angular features of the light field and enhance nonlinearity in the macro-pixel image features.
Furthermore, the proposed model is evaluated against SOTA algorithms. Owing to its efficient network architecture capable of capturing rich structural cues of light fields, the model achieves the highest mIoU score of 88.80% in semantic segmentation tasks. Experimental results demonstrate the feasibility and effectiveness of the proposed model in enhancing semantic segmentation, offering new research directions for the application of light field technology in understanding scenes.  
      关键词:semantic segmentation;light field imaging;macro-pixel image;angle cues;atrous convolution   
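The angular feature extractor is described as a convolution with kernel size K × K, stride K, and dilation 1, where K is the angular resolution of the light field, applied to macro-pixel features. A minimal PyTorch sketch follows; the channel widths and the example angular resolution are assumptions.

```python
import torch
import torch.nn as nn

class AngularFeatureExtractor(nn.Module):
    """Convolution over each K x K macro-pixel (K = angular resolution):
    kernel K, stride K, dilation 1, as described in the abstract."""
    def __init__(self, in_ch, out_ch, angular_res):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=angular_res,
                              stride=angular_res, dilation=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, macro_pixel_feat):
        # Input spatial size is (H*K, W*K); output is (H, W):
        # one angular descriptor per macro-pixel.
        return self.act(self.conv(macro_pixel_feat))

if __name__ == "__main__":
    K = 7                                       # e.g., 7 x 7 angular views
    feat = torch.randn(1, 64, 32 * K, 32 * K)   # macro-pixel image features
    afe = AngularFeatureExtractor(64, 128, K)
    print(afe(feat).shape)                      # torch.Size([1, 128, 32, 32])
```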
• In video salient object detection, researchers propose a boundary-guided network with multi-feature aggregation that effectively improves the boundary quality of detected salient objects.
      Zhang Rongguo, Zheng Xiaoge, Wang Lifang, Hu Jing, Liu Xiaojun
      Vol. 30, Issue 4, Pages: 1141-1154(2025) DOI: 10.11834/jig.240243
      Boundary-guided video salient object detection with multi-feature aggregation
      摘要:ObjectiveVideo salient object detection aims to identify and highlight important objects or regions in a video and has been widely applied in various computer vision tasks, such as target tracking, medical analysis, and video surveillance. Owing to the development of deep learning technologies, especially convolutional neural networks, significant progress has been made in the field of video salient object detection over the past few decades. Deep learning models can automatically generate feature representations of salient objects by learning a large amount of annotated data, thereby achieving the efficient detection and localization of salient objects. However, existing methods often fall short in exploring the correlation between boundary cues and spatiotemporal features and fail to adequately consider relevant contextual information during feature aggregation, leading to imprecise detection results. Therefore, we propose a boundary-guided video salient object detection network with multifeature aggregation that integrates salient object boundary information and object information within a unified model, fostering complementary collaboration between them.MethodFirst, two adjacent video images are used to generate optical flow maps, and the spatial and motion features of salient objects are extracted from the RGB images and optical flow maps of video frames, respectively. The boundary features of salient objects in video frames are obtained by integrating low-level local edge information and high-level global position information from the spatial features. At different resolutions, the boundary features of the salient objects are coupled with the features of the salient objects themselves. The interaction and cooperation between the boundary features and the salient object features enhance the complementarity between these two information types, emphasizing and refining the boundary features of the objects and thus accurately localizing the salient objects in the video images. A multilayer feature attention aggregation module is then used to enhance the representation capability of features to fully utilize the extracted multilevel features and achieve the selective dynamic aggregation of semantic and scale-inconsistent multilevel features. This process is conducted by varying the size of the spatial pool to achieve channel attention and using point-wise convolution to aggregate local and global contextual information in the channels, prompting the network to pay attention to large objects with global distribution and small objects with local distribution to facilitate the recognition and detection of salient objects under extreme scale variations and thereby generate the final salient object detection map. In the training stage, random rotation, multiscale (scale values set to {0.75, 1, 1.25}), and mixed losses are employed. The mixed loss helps the network learn transformations between input images and ground truth at the pixel, block, and image levels by combining weighted binary cross-entropy loss, structural similarity loss, dice loss from the boundary guidance module, and intersection over union loss to accurately segment salient object regions with clear boundaries.ResultThe proposed method is evaluated on the densely annotated video segmentation (DAVIS), Freiburg-Berkeley motion segmentation (FBMS), video saliency (ViSal), and MCL datasets using three evaluation metrics, namely, mean absolute error (MAE), S-measure, and F-measure. 
A comparison with five existing methods is also performed. Results indicate that the proposed method can generate salient maps with clear boundaries and outperforms other methods in terms of F-measure across the four datasets. On the DAVIS dataset, the proposed method achieves the same MAE, slightly lower S-measure (by 0.7%), and higher F-measure (by 0.2%) than the best-performing dynamic spatiotemporal network (DSNet) model. On the FBMS dataset, the proposed method achieves the best MAE of 3.7%, matching that of the SCANet method. It also improves the S-measure by 0.3% and the F-measure by 0.9% compared with the second-best method. On the ViSal dataset, the MAE of the proposed method is only 0.1% lower than that of the optimal method STVS, and its F-measure is 0.2% higher than that of STVS. On the MCL dataset, the proposed method achieves the best MAE of 2.2%, and its S-measure and F-measure are 1.6% and 0.6% higher than those of the second-best method saliency-shift aware VSOD (SSAV), respectively. The video salient object detection results of the proposed method and comparative methods are visualized for an intuitive observation, showing that previous methods mostly yield detection results with high-quality region accuracy but rough and blurry boundaries. By contrast, the proposed method can generate detection results with clear boundaries. Ablation experiments on relevant modules are also conducted to demonstrate the effectiveness of the different modules employed.ConclusionIn this study, a network capable of achieving interactive collaboration between salient object boundary information and spatiotemporal information is proposed. It performs well on four public datasets and effectively enhances the boundary quality of detected salient objects in video frames. However, the method also has certain limitations. For instance, it may fail to detect or misidentify salient objects when multiple such objects are present in the video frames. In our future work, we plan to explore other efficient spatiotemporal feature extraction schemes to capture all salient object features in video frames and improve the detection capabilities of the current algorithm.  
      关键词:video image;salient object detection;boundary guidance;multi-scale features;feature aggregation   
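The mixed loss described above combines weighted binary cross-entropy, structural similarity, boundary Dice, and IoU terms. The sketch below shows one way to assemble such a loss in PyTorch; the term weights, the boundary-based pixel weighting, and the pooling-based SSIM approximation are assumptions rather than the paper's exact settings (predictions are assumed to be sigmoid outputs in [0, 1]).

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, gt, eps=1e-6):
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + gt.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def iou_loss(pred, gt, eps=1e-6):
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = (pred + gt - pred * gt).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()

def ssim_loss(pred, gt, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM approximation computed with 11x11 average pooling (no Gaussian window)."""
    mu_p = F.avg_pool2d(pred, 11, 1, 5)
    mu_g = F.avg_pool2d(gt, 11, 1, 5)
    var_p = F.avg_pool2d(pred * pred, 11, 1, 5) - mu_p ** 2
    var_g = F.avg_pool2d(gt * gt, 11, 1, 5) - mu_g ** 2
    cov = F.avg_pool2d(pred * gt, 11, 1, 5) - mu_p * mu_g
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    return (1 - ssim).mean()

def mixed_loss(pred_sal, boundary_pred, gt_sal, gt_boundary, w=(1.0, 1.0, 1.0, 1.0)):
    """Weighted BCE + SSIM + boundary Dice + IoU; weights are illustrative."""
    pix_w = 1.0 + 4.0 * gt_boundary  # emphasize boundary pixels in the BCE term
    bce = F.binary_cross_entropy(pred_sal, gt_sal, weight=pix_w)
    return (w[0] * bce + w[1] * ssim_loss(pred_sal, gt_sal)
            + w[2] * dice_loss(boundary_pred, gt_boundary)
            + w[3] * iou_loss(pred_sal, gt_sal))
```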

      Image Understanding and Computer Vision

    • 3D stylized portrait synthesis and structured face modeling AI导读

In 3D face stylization, researchers propose an example-based method for 3D face stylization and structured modeling that effectively builds high-quality structured 3D stylized face models and generates high-quality stylized face views from all angles together with texture maps.
      Hu Jiaping, Zhou Yang
      Vol. 30, Issue 4, Pages: 1155-1169(2025) DOI: 10.11834/jig.240380
      3D stylized portrait synthesis and structured face modeling
摘要:ObjectiveFacial image stylization and 3D face modeling are important tasks in the fields of computer graphics and vision, with significant applications in virtual reality and social media, including popular technologies such as virtual live streaming, virtual imaging, and digital avatars. This work focuses on 3D facial stylization and generation to produce novel and stylized facial views from a given real face image and a style reference. The novel views can be rendered at corresponding angles by inputting camera poses in 3D space. Meanwhile, these views need to maintain good 3D multi-view consistency while expressing the exaggerated geometry and colors characteristic of the given artistic style reference. Existing methods for 3D facial generation can be broadly categorized into two types: those based on 3D deformable models and those based on implicit neural representations. Methods based on 3D deformable models often struggle to express non-facial components such as hairstyles and glasses, which severely limits the quality of the generated results. On the other hand, methods based on implicit neural representations, while capable of achieving good generation results, tend to produce severely distorted facial views under large camera poses, such as side profiles. Additionally, the results of implicit methods typically include only facial geometry and multi-view facial views, making it difficult to integrate them with mature rendering pipelines. This limitation hinders their application in practical scenarios. Consequently, both types of existing 3D facial generation methods face challenges in producing high-quality 3D stylized facial models with good structured modeling, i.e., 3D facial meshes and topologically complete texture maps.MethodTo address the shortcomings of existing methods, this paper proposes a novel approach for 3D stylized facial generation and structured modeling. The goal of this paper is to train a 3D aware stylized facial generator within the style domain of a specified artistic facial sample. This generator should be capable of producing high-quality 3D facial views from any angle in the specified style, including large-pose side profiles and back views. Furthermore, based on multi-view facial data, the generator should produce structured 3D facial models, including facial geometric mesh models and corresponding texture maps. To achieve this, the paper proposes a two-stage method for 3D stylized facial generation and structured modeling. The method comprises two main steps: 3D aware facial generator domain transfer and multi-view constrained facial texture optimization. In the first stage, the paper utilizes 2D facial stylization prior methods to perform data augmentation on artistic style samples, generating a small-scale artistic style facial dataset. Subsequently, the camera poses and facial masks of the facial images in this dataset are extracted sequentially.
The annotated stylized facial dataset is then used to fine-tune a 3D aware generator in the natural style domain. The fine-tuned 3D aware generator can generate high-quality multi-view facial views and 3D mesh models. The focus of the second stage is to optimize facial textures using multi-view images from a set of directions. The paper first performs smoothing and UV unwrapping on the facial mesh. To align the volumetric rendered facial views with the differentiable rendered facial views for pixel-level loss optimization of facial textures, the paper proposes a simple and effective facial view alignment strategy based on mask affine transformation. Finally, multi-view facial supervision is used to optimize facial textures, and the final facial texture map is obtained through texture fusion.ResultTo demonstrate the superiority of the proposed method, the paper compares the two-stage 3D facial generation and structured modeling method with existing advanced baseline methods. This comparison illustrates the quality of 3D aware stylized facial generation and the effectiveness of structured facial mesh generation. Additionally, to demonstrate the effectiveness of each stage component of the proposed method, the paper includes ablation studies for the key components of the method. These ablation studies illustrate the correctness of the proposed method. Qualitative and quantitative experiments show that the proposed method can effectively construct high-quality structured 3D stylized facial models and generate high-quality stylized facial views. Moreover, the explicitly modeled structured facial models can be more conveniently applied to downstream tasks related to 3D faces. The results indicate that the proposed method not only achieves superior performance in generating stylized facial views but also ensures the structural integrity and applicability of the generated 3D facial models in practical scenarios.ConclusionIn general, this work presents a comprehensive approach to 3D stylized facial generation and structured modeling, addressing the limitations of existing methods. By leveraging a two-stage process that includes domain transfer and multi-view constrained texture optimization, the proposed method achieves high-quality results in both facial view generation and structured modeling. The effectiveness of the method is demonstrated through extensive experiments, highlighting its potential for practical applications in virtual reality, social media, and other related fields.
      关键词:visual content synthesis;3D portrait stylization;3D aware generation;domain adaptation;texture optimization   
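For the mask-affine alignment step mentioned in the abstract (aligning volume-rendered and mesh-rendered facial views before computing pixel-level texture losses), the sketch below estimates a similarity transform from two binary face masks using image moments and warps one view onto the other; the moment-based estimate is an illustrative choice, and the paper's strategy may differ in detail.

```python
import cv2
import numpy as np

def mask_affine(src_mask, dst_mask):
    """Similarity transform (uniform scale + translation) mapping the source mask
    onto the destination mask, estimated from image moments; masks are assumed to
    be non-empty, single-channel binary images."""
    ms, md = cv2.moments(src_mask), cv2.moments(dst_mask)
    cs = np.array([ms["m10"] / ms["m00"], ms["m01"] / ms["m00"]])
    cd = np.array([md["m10"] / md["m00"], md["m01"] / md["m00"]])
    scale = np.sqrt(md["m00"] / ms["m00"])  # ratio of mask areas
    return np.array([[scale, 0.0, cd[0] - scale * cs[0]],
                     [0.0, scale, cd[1] - scale * cs[1]]], dtype=np.float32)

def align_view(src_img, src_mask, dst_mask):
    """Warp a volume-rendered view so its mask overlaps the mesh-rendered mask,
    enabling pixel-level texture losses between the two renderings."""
    A = mask_affine(src_mask, dst_mask)
    h, w = dst_mask.shape[:2]
    return cv2.warpAffine(src_img, A, (w, h), flags=cv2.INTER_LINEAR)
```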

      Medical Image Processing

• In the prognosis of nonischemic dilated cardiomyopathy, researchers propose a multimodal cardiac magnetic resonance prognosis model based on hybrid matching distillation and contrastive mutual information estimation, which effectively improves prognostic accuracy in small-sample settings.
      Wei Ran, Qi Xiaoming, He Yuting, Jiang Sheng, Qian Wen, Xu Yi, Zhu Yinsu, Pascal Haigron, Shu Huazhong, Yang Guanyu
      Vol. 30, Issue 4, Pages: 1170-1182(2025) DOI: 10.11834/jig.240349
      Knowledge distillation and mutual information of multimodal MRI for disease prognosis
摘要:ObjectiveNonischemic dilated cardiomyopathy (NIDCM) is a heart condition that can lead to severe outcomes, such as heart failure or sudden cardiac death. Accurate prognosis plays a crucial role in the early diagnosis and effective treatment of this disease. Multimodal cardiac magnetic resonance (CMR) imaging captures heart data from different perspectives and is essential for prognosis. Each CMR modality provides unique and complementary information about the heart’s structure and function. However, leveraging these multimodal data sources to predict prognosis poses two main challenges. First, the regions of interest (RoIs) differ among CMR modalities even when imaging the same disease. Hence, combining these modalities into a cohesive predictive model becomes challenging. The distinct characteristics and distribution of data from different modalities also create difficulties in capturing comprehensive information about the prognosis of NIDCM. Second, the limited labeled training data exacerbate the problem. Owing to the difficulty in labeling such data, the available dataset is small. Hence, the risk of a deep learning model falling into local optima increases, hindering its ability to generalize and achieve good predictive performance. Therefore, a specialized approach that can address the complexity of multimodal data representations and the limitations of small sample sizes is warranted.MethodTo overcome these challenges, we propose a novel model based on hybrid matching distillation and contrastive mutual information estimation. Its design focuses on two aspects: improving the representation of multimodal CMR images and preventing the model from falling into local optima due to the limited training data. The first component of our method involves combining different CMR modalities into pairs. Each pair is treated as a unique data source, and the image features corresponding to these modality pairs are extracted. Given that the prognosis objective is consistent across modalities but their feature distributions vary, a hybrid matching distillation network that enforces logical distribution consistency between the modalities is employed. It learns to associate and match different image feature distributions across modalities by leveraging the inherent consistency in prognosis objectives. This matching constrains the extraction of features from each modality, ensuring that the deep learning network can jointly represent multimodal features. As a result, the network effectively captures the complementary information from various modalities, leading to good predictive performance. The second component is a mutual information contrastive learning strategy applied to estimate potential classification boundaries across the multimodal feature distribution. This step introduces a regularization term into the prognosis model to prevent it from falling into a local optimum during training with a small sample size. The contrastive learning strategy aims to maximize the mutual information between modalities while learning meaningful feature representations. By estimating the classification boundaries, the model can discern the subtle differences in the feature space and consequently enhance its ability to generalize from limited data. This strategy regularizes the learning process and ensures that the model captures the most informative aspects of the multimodal data.
The two components, hybrid matching distillation and contrastive mutual information estimation, work together to build a robust prognosis model for NIDCM. By exploiting the logical consistency across modalities and the mutual information between them, the model achieves improved feature representation and avoids overfitting to the small sample size.
Result: Experiments were conducted on a clinical dataset of patients with NIDCM to evaluate the performance of the proposed model. The results were compared with those of six state-of-the-art methods to assess the model's effectiveness. Performance was evaluated using two key metrics: F1 score and accuracy (Acc). The F1 score is particularly useful for assessing the balance between precision and recall, and Acc reflects the overall correctness of predictions. The proposed model achieved an F1 score of 81.25% and an accuracy of 85.61% on the NIDCM dataset. These results demonstrate a significant improvement over the baseline models, highlighting the effectiveness of hybrid matching distillation and contrastive mutual information estimation in handling multimodal CMR images for prognosis prediction. The two complementary components allow the model to make full use of the limited training data and to capture the complex correlations between the different modalities. An additional experiment was conducted on a public brain tumor dataset to further validate the generalization capability of the model. This dataset also contains multimodal medical images and allowed us to verify whether the proposed method can be applied beyond the NIDCM domain. The model achieved an F1 score of 85.07% and an accuracy of 87.72% on this dataset, again outperforming the four baseline methods. These results show that the proposed model can be applied to other medical imaging tasks beyond NIDCM, making it a versatile and effective tool for prognosis prediction.
Conclusion: A prognosis network model based on hybrid matching distillation and contrastive mutual information estimation is proposed to address the two major challenges in using multimodal CMR images for NIDCM prognosis. Hybrid matching distillation ensures that the model learns to represent multimodal data by leveraging the logical consistency across different CMR modalities, thereby improving its ability to capture the complementarity between modalities. Contrastive mutual information estimation provides regularization by estimating classification boundaries, preventing the model from overfitting to small sample sizes. Experiments demonstrate that this approach significantly improves the accuracy and F1 score of the prognosis model, outperforming several state-of-the-art methods. The model's generalization capability was further confirmed through its successful application to a brain tumor dataset, proving its versatility across medical imaging tasks. In addition to its high predictive performance, this method demonstrates the potential of deep learning in handling complex medical imaging problems where data scarcity is a concern. By combining hybrid matching distillation with contrastive mutual information estimation, the model handles the intricate relationships between different modalities and produces robust, reliable predictions even with limited labeled data.
Future work should explore extending this method to other forms of multimodal medical imaging beyond CMR, such as combining magnetic resonance imaging (MRI) and computed tomography (CT) scans for comprehensive diagnosis. Further improving training efficiency and reducing the computational complexity of the model would make it accessible for widespread clinical use. This research opens up new avenues for multimodal prognosis models and lays a foundation for future innovations in medical image analysis.
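For readers who want a concrete picture of the two training signals described in this abstract, the following is a minimal PyTorch sketch, not the authors' implementation: the module names, loss weights, the symmetric KL term standing in for the "logical distribution consistency" of hybrid matching distillation, and the InfoNCE objective standing in for the contrastive mutual information estimator are all assumptions made for illustration.

```python
# Hedged sketch: pairwise logit consistency + InfoNCE-style mutual-information
# regularizer for two CMR modalities. All names and weights are illustrative
# assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoModalityPrognosisNet(nn.Module):
    """Toy two-branch network: one encoder per modality, one classifier each."""

    def __init__(self, feat_dim=128, num_classes=2):
        super().__init__()
        # Stand-in encoders; real CMR branches would be CNN backbones.
        self.enc_a = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.cls_a = nn.Linear(feat_dim, num_classes)
        self.cls_b = nn.Linear(feat_dim, num_classes)

    def forward(self, x_a, x_b):
        f_a, f_b = self.enc_a(x_a), self.enc_b(x_b)
        return f_a, f_b, self.cls_a(f_a), self.cls_b(f_b)


def pairwise_consistency(logits_a, logits_b):
    """Symmetric KL between the two modalities' prediction distributions.

    Encourages both branches to agree on the prognosis, one plausible reading
    of the cross-modality distribution consistency described in the abstract.
    """
    log_p_a = F.log_softmax(logits_a, dim=1)
    log_p_b = F.log_softmax(logits_b, dim=1)
    kl_ab = F.kl_div(log_p_a, log_p_b.exp(), reduction="batchmean")
    kl_ba = F.kl_div(log_p_b, log_p_a.exp(), reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)


def infonce_mi(f_a, f_b, temperature=0.1):
    """InfoNCE loss: matched modality pairs are positives, the rest negatives.

    Minimizing it maximizes a lower bound on the mutual information between
    the two modalities' features.
    """
    f_a = F.normalize(f_a, dim=1)
    f_b = F.normalize(f_b, dim=1)
    logits = f_a @ f_b.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(f_a.size(0), device=f_a.device)
    return F.cross_entropy(logits, targets)


# Usage sketch for one training step (binary prognosis labels y).
model = TwoModalityPrognosisNet()
x_a, x_b = torch.randn(8, 1, 64, 64), torch.randn(8, 1, 64, 64)
y = torch.randint(0, 2, (8,))
f_a, f_b, logits_a, logits_b = model(x_a, x_b)
loss = (F.cross_entropy(logits_a, y) + F.cross_entropy(logits_b, y)
        + 0.5 * pairwise_consistency(logits_a, logits_b)   # distillation-style term
        + 0.1 * infonce_mi(f_a, f_b))                       # MI regularizer
loss.backward()
```

In practice the encoders would be modality-specific CMR backbones and the loss weights would be tuned on validation data; the sketch only shows how the consistency and mutual-information terms could be combined with the supervised objective.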
      关键词:contrastive learning; hybrid distillation; multi-modal cardiac magnetic resonance (CMR); mutual information estimation; prognosis
    • In the field of diagnosing anterior cruciate ligament injuries of the knee, the authors propose the SSAMNet model, which effectively improves diagnostic accuracy and specificity and has clinical application value.
      Liu Yingli, Cha Yinqiu, Huang Yishan, Gao Ming
      Vol. 30, Issue 4, Pages: 1183-1194(2025) DOI: 10.11834/jig.240302
      Classification algorithm for anterior cruciate ligament injury embedded with slice sequence association mode
      摘要:Objective: Prompt diagnosis of anterior cruciate ligament (ACL) injuries of the knee has been shown to reduce the risk of osteoarthritis, further knee injuries, and other complications. As a common imaging method for identifying ACL injuries, magnetic resonance imaging (MRI) is characterized by low acquisition time cost and non-invasiveness. MRI data can be regarded as three-dimensional and contain more detail than two-dimensional images. Doctors need to combine multiple images within a dataset to make a comprehensive judgment and draw a conclusion, so diagnosis takes a long time and intelligent assisted diagnosis is necessary. Currently, deep learning-based methods are mainly used for ACL injury classification; they can be divided into algorithms based on 3D convolutional neural networks (3D CNNs) and those based on 2D convolutional neural networks (2D CNNs). ACL injury classification algorithms using 3D CNNs often suffer from high computational costs and insufficient data usage, while algorithms based on 2D CNNs ignore the third-dimensional correlation (the correlation along the slice dimension) and the morphological diversity of the ACL. To solve these problems, an ACL injury classification algorithm embedded with a slice sequence association mode, SSAMNet, is proposed.
Method: SSAMNet uses the classic AlexNet model based on 2D CNNs as the backbone network to acquire discriminative features for ACL injury classification. Through the slice sequence information fusion (SFS) module, sequence properties are learned in parallel from adjacent and full slices of the MRI data. First, the initial slice features extracted by the backbone network are divided into slice groups along the channel dimension, and a shift operation is performed between the groups to merge adjacent slice features. Then, key information is extracted with the help of the initial slice features, the slice sequence relationship is modeled by SSAMNet, global slice information is merged and processed, and finally the association pattern in the slice feature mapping is established. Second, the multi-level scale feature adaptive attention (MSFAA) module splices the multi-level features processed by the backbone network into multi-scale feature groups. The horizontal and vertical direction features are processed through mean expansion to obtain directional scale weight coefficients, so that weights are redistributed across the different correlation scales to accommodate the variable shape and location of the ACL region. Class imbalance is a common problem in the knee MRI datasets selected for this task, where the number of positive samples representing ACL tears exceeds the number of negative samples representing intact ACLs. This problem is not only reflected in the selected datasets but also exists in the real world, where the number of individuals with torn ACLs surpasses that of those with intact ACLs. Therefore, we minimize a weighted binary cross-entropy loss to reduce the impact of class imbalance on algorithm performance. During training, the learning rate is initialized to 1E-5 and is multiplied by 0.95 every 10 epochs. All experiments are performed in PyTorch on an NVIDIA RTX 3090 GPU.
To ensure the fairness of the comparative experiments, we report the mean and standard deviation over three runs.
Result: We use accuracy, sensitivity, specificity, and area under the curve (AUC) as evaluation metrics, calculated from true positives, true negatives, false positives, and false negatives. The experiments are conducted on the MRNet dataset and the knee MRI dataset, whose data distributions differ. The MRNet dataset is currently the largest public knee joint dataset, comprising 1 370 sets of MRI data, and the knee MRI dataset contains 917 sets. Because the knee MRI dataset is not divided into training and test sets, we use stratified random sampling to split it and fivefold cross-validation to complete the external validation experiment. The final results on the MRNet dataset show that the AUC of SSAMNet reaches 98.4%, the best performance among the compared networks based on 2D CNNs and 3D CNNs, with a specificity of 97% and an accuracy of 91.4%. Moreover, the ablation results on this dataset show that both SFS and MSFAA benefit the ACL injury discrimination task; in particular, adding the SFS module brings a significant performance improvement (AUC increases by more than 2.5%), which illustrates the effectiveness of the embedded slice sequence correlation mode. The joint use of SFS and MSFAA increases AUC by more than 4%. A fivefold cross-validation experiment is then conducted on the knee MRI dataset, where the AUC of the proposed model reaches 88.7% and its accuracy exceeds that of the other models by 3%. The ROC curves remain stable across the five differently divided folds, showing that the model is stable and generalizes well. Finally, the visualization results show that SSAMNet effectively focuses on the ACL area and on the indicators that affect the identification of ACL injuries, confirming that the model's identification of tear injuries is consistent with the underlying evidence and that it can effectively identify ACL injuries.
Conclusion: The proposed SSAMNet performs excellently in ACL injury discrimination, and its robustness is confirmed in external validation experiments with different data distributions, revealing its potential clinical application value. However, the current datasets contain only single-label data. Considering the comprehensive data required for clinical use, we will utilize patients' past medical records and other information as prior knowledge to construct an auxiliary diagnostic framework for knee joint diseases in subsequent work. The code of this paper will be open-sourced soon: https://github.com/wabk/SSAMNet.
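The abstract's description of the SFS module (grouping slice features along the channel dimension and shifting between groups) and of the weighted binary cross-entropy loss can be pictured with the minimal PyTorch sketch below. The tensor shapes, shift fraction, pooling head, and pos_weight value are illustrative assumptions, not the released SSAMNet code (see the repository link above).

```python
# Hedged sketch of a slice-shift fusion step: a fraction of each slice's
# channels is exchanged with the neighboring slices so adjacent-slice
# information mixes before classification. Names and shapes are assumptions.
import torch
import torch.nn as nn


def slice_shift(features, shift_fraction=0.25):
    """Shift a fraction of channels forward/backward along the slice axis.

    features: (batch, num_slices, channels, h, w) per-slice feature maps.
    """
    b, s, c, h, w = features.shape
    fold = max(1, int(c * shift_fraction) // 2)
    out = torch.zeros_like(features)
    out[:, 1:, :fold] = features[:, :-1, :fold]                   # toward later slices
    out[:, :-1, fold:2 * fold] = features[:, 1:, fold:2 * fold]   # toward earlier slices
    out[:, :, 2 * fold:] = features[:, :, 2 * fold:]              # remaining channels kept
    return out


class SliceFusionHead(nn.Module):
    """Toy head: mix adjacent slices, pool globally, output a tear/intact logit."""

    def __init__(self, channels=256, num_classes=1):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, slice_features):               # (b, s, c, h, w)
        mixed = slice_shift(slice_features)
        pooled = mixed.mean(dim=(1, 3, 4))            # aggregate slices and space
        return self.fc(pooled)                        # (b, 1) logit


# Weighted binary cross-entropy to counter class imbalance: pos_weight would be
# set from the training split's (#negatives / #positives) ratio; 0.7 below is a
# placeholder for a dataset where positive (tear) samples dominate.
head = SliceFusionHead()
feats = torch.randn(4, 24, 256, 8, 8)                 # e.g., 24 slices per exam
labels = torch.randint(0, 2, (4, 1)).float()
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([0.7]))
loss = criterion(head(feats), labels)
loss.backward()
```

The shift operation costs no extra parameters, which is in keeping with the abstract's emphasis on keeping a 2D-CNN backbone while still modeling correlation along the slice dimension; the actual SSAMNet modules (SFS and MSFAA) are of course richer than this pooling head.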
      关键词:magnetic resonance imaging (MRI); 3D image classification; slice feature aggregation; adaptive scale attention; 2D convolutional neural network (2D CNN)