Latest Issue

    Vol. 30, No. 8, 2025

      Continual Learning and Image Processing Applications

    • Comprehensive survey of continual learning

      A survey of the latest research progress in continual learning, examining the "plasticity-stability" balance between learning new and old knowledge and offering new ideas for meeting the demands of complex real-world tasks.
      Lyu Fan, Wang Liang, Li Xi, Zheng Weishi, Zhang Zhang, Zhou Tao, Hu Fuyuan
      Vol. 30, Issue 8, Pages: 2599-2632(2025) DOI: 10.11834/jig.240661
      Comprehensive survey of continual learning
      摘要:Continual learning (CL), which is also known as lifelong learning, is a major concern in the field of machine learning. It refers to the capability of a system to learn new tasks while retaining knowledge gained from previous ones. CL has broad relevance in many real-world applications that require continual adaptation and learning such as autonomous driving, robotics, and medical diagnosis. CL succumbs to the catastrophic forgetting issue, which occurs when a model overwrites or significantly diminishes previously learned information while training on new tasks. This issue is prominent in traditional machine learning paradigms where models are trained on static datasets. The goal of CL is to develop models that exhibit plasticity, which allows them to adapt to new information, and stability, which ensures that previous knowledge is not lost. Various strategies have been proposed in the literature to address this challenge, which are comprehensively reviewed in this study. Broadly, CL techniques can be classified into those based on continual training and those based on prompts. Methods based on continual training can be further divided into three main categories: replay-based, regularization-based, and dynamic architecture methods. Replay-based methods are among the most popular and well-established techniques in CL. These methods aim to mitigate catastrophic forgetting by storing and replaying data from previously learned tasks. By revisiting samples from prior tasks during training on new tasks, the model is effectively reminded of past knowledge, which helps maintain accuracy on earlier tasks. Replay-based methods can be further categorized into two types: replay memory construction and replay memory utilization. Replay memory construction focuses on saving the replay memory, and methods can be categorized into three types depending on how knowledge is stored: storing raw data, building generative models, and storing data features to construct memory. Memory replay methods that store raw data directly save the original data from old tasks along with the corresponding labels. The raw data are available in various formats, such as images, videos, audio, or text. During training on a new task, the stored memory data from old tasks are trained together with the new task data, which helps prevent forgetting. Storing raw data for memory replay requires no additional operations for storage and utilization, and it maintains consistency with the training process of the original model. The method of building generative models has been influenced by the recent development of generative models, where representative samples of past knowledge are generated for replay. These methods can be classified based on the type of generative model used. In CL, the strength of generative models lies in their ability to generate high-quality synthetic data, which helps the model overcome catastrophic forgetting issues common in traditional methods. Ultimately, the long-term memory capacity of the model is significantly enhanced. The method of storing data features is chosen when data features provide a good representation of the original data. In situations where privacy protection and storage constraints are a concern, storing data features rather than raw data or additional models becomes a practical solution. Replay memory utilization focuses on effectively leveraging the stored samples to enhance the efficacy of replay. 
This approach includes techniques such as data augmentation, knowledge distillation, Bayesian methods, optimization, and gradient projection, as well as representation alignment and bias correction. Regularization-based methods aim to address catastrophic forgetting by introducing constraints on the parameters of the model, which prevents them from undergoing drastic changes when learning new tasks. This approach enforces stability by preserving the critical aspects of previously learned knowledge through penalizing large deviations in the parameters of the model. Regularization methods can be categorized into those based on Laplace approximation, task representation constraints, Bayesian regularization, and knowledge distillation. The Laplace approximation technique approximates the posterior distribution of the old task as a Gaussian distribution, which imposes constraints on the important parameters of the old tasks. This type of regularization method works by adding penalty terms to the loss function to restrict changes to the parameters related to previous tasks when training on new tasks. In doing so, it achieves balanced learning of new and old tasks, which reduces the risk of catastrophic forgetting. Task representation constraint-based methods utilize the historical model to generate representations of current samples while applying regularization constraints based on these historical representations. The core idea of Bayesian regularization is to use a Bayesian framework to update model parameters, which allows the retention of knowledge from previous tasks while learning new ones. By regularizing model parameters within the Bayesian framework, this approach effectively balances the learning of new and old tasks, which reduces catastrophic forgetting. By contrast, knowledge distillation-based regularization methods do not require additional storage of replay memories. Instead, they leverage stored historical models as teachers while using the current training samples as input to guide learning. Dynamic architecture methods address new tasks and knowledge by gradually adjusting the structure of the model. These methods dynamically modify the architecture or parameters of the neural network based on changes in input data and the demands of new tasks. Thus, they add or reallocate network resources to learn new knowledge without forgetting the old. Dynamic architecture methods ensure CL by automatically expanding the network, activating important parameters, and freezing irrelevant parts, which allow the model to adapt to new tasks while avoiding catastrophic forgetting. Dynamic architecture methods can be further divided into multi-expert and subnetwork structures, dynamic scarification and masking techniques, dynamic structural adjustment, and the learning of additional task-related modules. In recent years, pre-trained large models, such as Transformer-based architectures, have achieved remarkable success across various domains due to their ability to generalize effectively across tasks. These models, which are commonly pre-trained on vast amounts of data, have strong representational power and are increasingly being applied in CL scenarios due to their ability to learn new tasks with minimal forgetting. Pre-trained models can be used in CL through two primary strategies: fine-tuning- and prompt-based methods. Fine-tuning involves adapting a pre-trained model to a new task by updating some or all of its parameters. 
In the context of CL, fine-tuning can be performed by freezing certain layers of the pre-trained model and only updating specific parts, such as task-specific layers, to prevent the model from losing knowledge gained from earlier tasks. Another approach is to fine-tune the model using task-specific learning rates, which ensures that important parameters for previous tasks are modified less during the training of new tasks. Prompt-based methods represent a newer approach to leveraging pre-trained models for CL. Rather than adjusting the internal parameters of the model, prompt-based methods guide its behavior by designing and inputting task-specific prompts. These prompts serve as auxiliary information that helps the model focus on the relevant aspects of the new task without altering its underlying architecture or parameters. Despite the advancements in CL techniques, several challenges remain. An important concern is the scalability of CL methods, especially in real-world applications where the number of tasks and data diversity can be vast. Another issue is achieving an optimal balance between plasticity and stability, given that methods that overemphasize stability may limit the ability of the model to learn effectively from new tasks. Future research in CL is expected to focus on the integration of pre-trained large models with traditional CL techniques. It will explore novel architectural designs and optimization strategies that can better address the demands of complex, real-world tasks. Additional works are needed to develop CL methods that are computationally efficient, particularly for large-scale models, without sacrificing performance or flexibility. By combining the strengths of large models with classical CL techniques, the field is set to achieve considerable advancements in creating intelligent systems capable of lifelong learning in dynamic and ever-changing environments.  
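      As a concrete illustration of the regularization family surveyed above, the sketch below shows the generic quadratic penalty that Laplace-approximation methods in the EWC style add to the new-task loss. The `fisher` importance estimates, `old_params` snapshot, and weighting `lam` are hypothetical inputs chosen for illustration, not quantities defined in the surveyed papers.

```python
import torch


def laplace_penalty(model, old_params, fisher, lam=100.0):
    """Quadratic penalty discouraging drift of parameters that were
    important for old tasks (diagonal Laplace/EWC-style approximation).

    old_params / fisher: dicts mapping parameter name -> tensor, assumed
    to be snapshotted after training on the previous task.
    """
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty


# Usage during new-task training (hypothetical names):
# loss = task_loss(model(x), y) + laplace_penalty(model, old_params, fisher)
# loss.backward()
```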
      Keywords: continual learning (CL); catastrophic forgetting (CF); memory replay; regularization; dynamic architecture; pre-trained model (PTM); review
    • Zhou Yifan, Du Kaile, Lyu Fan, Hu Fuyuan, Liu Guangcan
      Vol. 30, Issue 8, Pages: 2633-2644(2025) DOI: 10.11834/jig.240643
      Class activation map replay and minimum entropy sampling-based multilabel class-incremental learning
      摘要:ObjectiveMultilabel class-incremental learning seeks to enable models to learn new label information continuously while ensuring robust performance on previously acquired tasks within the context of multilabel classification. This task diverges from the label exclusivity assumption inherent in single-label problems because multilabel classification is effectively modeled as multiple binary classification tasks. In this framework, each sample can be associated with multiple labels simultaneously, reflecting the complexity of real-world scenarios. Moreover, the task setting in class-incremental learning imposes major constraints on how labels are assigned. Specifically, samples can only receive positive labels corresponding to their respective tasks, while the labels for categories pertaining to other tasks remain unknown. Consequently, this limitation results in a substantial loss of negative samples across categories because many potential negative examples are not represented in the available data. The implications of this scenario are profound. Not only do the various classes suffer from a considerable deficit of negative samples, but the learning process is further complicated by the emergence of numerous unknown mappings between classes. This lack of clarity can lead to severe label confusion, wherein the model struggles to distinguish between similar categories. As a result, the effectiveness of multilabel class-incremental learning is severely hindered, necessitating the development of novel methods to address these challenges.MethodTo address the issue of label confusion, this study introduces a novel bidirectional class activation mapping replay method, designed to facilitate the effective transfer of supervisory information across tasks. The primary innovation of our model lies in the establishment of a new storage area integrated within the traditional replay strategy. This dedicated storage area is specifically allocated to retain the class activation maps of the positive samples, thereby preserving critical information about the features associated with each class. Class activation mapping is a simple yet effective weakly supervised method for object localization that is capable of generating pixel-level masks for foreground objects. In an ideal scenario, an accurate class activation map can effectively represent the precise location of the target object within the image. Furthermore, leveraging the inherent spatial exclusivity of different objects within a single image enables the realization of cross-task information transfer. In the next step, we implement a cross-entropy-based sampling method aimed at selecting samples with highly accurate class activation maps. This careful selection process ensures that only the most representative samples are stored as replay samples, thereby enhancing the quality of the supervisory information available for future tasks. During the execution of subsequent tasks, we utilize these stored samples in a replay mechanism. This replay provides not only standard supervision for the positive class activation map output via the current network but also reverse supervision for the other class. Such reverse supervision ensures that the prominent regions of the class activation maps for the other classes do not overlap with the positive class. Reverse supervision for potentially confusing categories does not interfere with the classification of other positive classes across different tasks. 
Clearly, different objects occupy distinct regions within an image. If we assume that the localization of the class activation maps is sufficiently accurate, then the prominent regions of the class activation maps related to one class will not overlap with those belonging to other classes, as reflected in our reverse supervision loss function, which would yield a very low loss value. Consequently, the method we designed can correct the localization and output of the network for the confusing categories without disrupting the ability to learn the positive labels and features.ResultThis study selects a range of widely recognized and representative class-incremental comparison methods based on recent advancements in the field, including classic regularization-based methods, replay-based methods, and the top-performing replay and regularization methods from the single-label class-incremental learning, along with methods specifically designed for multilabel class-incremental learning. In simple task settings, our model, while slightly lagging behind the state-of-the-art in terms of mean average precision, demonstrates remarkable advantages in terms of the F1 score. Importantly, as the number of tasks increases, our method exhibits substantial advantages. Furthermore, our experiments demonstrate that our model effectively reduces the false positive rate in network predictions. Particularly in more challenging task settings, our model shows a considerable advantage, with a nearly 30% reduction in false positive rate compared with other methods. This result indicates that our model considerably alleviates the confusion associated with multilabel class-incremental learning. Finally, we provide extensive ablation experiments to validate the effectiveness of each module within our model.ConclusionThis study posits that the label confusion caused by label absence in multilabel class-incremental learning is as crucial as the problem of forgetting. Class activation maps reveal the prominent regions that the network focuses on, ideally corresponding to the locations of labeled objects. Given the practical exclusivity of object positions, we propose a bidirectional class activation mapping replay method that stores the class activation maps of the positive samples during the class-incremental learning process. This method leverages these stored class activation maps as supervisory information in future tasks. Moreover, a class activation map exclusivity loss function is designed to implement the reverse supervision, thereby distinguishing the new categories across tasks from the prominent regions of stored positive classes, thereby alleviating confusion. Furthermore, we introduce a cross-entropy-based sampling method to select samples with reliable class activation maps. Experimental results demonstrate that the proposed bidirectional class activation mapping replay method remarkably enhances the model accuracy in the context of multilabel class-incremental learning.  
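      To make the reverse-supervision idea concrete, the minimal sketch below penalizes overlap between the activation maps of newly learned classes and a replayed positive-class activation map. The tensor shapes, the sigmoid normalization, and the simple product-overlap form are assumptions chosen for illustration, not the exclusivity loss defined in the paper.

```python
import torch


def cam_exclusivity_loss(new_cams, stored_cam):
    """Penalize overlap between the activation maps of newly learned
    classes and the stored activation map of an old positive class.

    new_cams:   (B, C_new, H, W) class activation maps from the current model
    stored_cam: (B, 1, H, W) replayed activation map of an old positive label
    Both are normalized to [0, 1] before computing the overlap.
    """
    new_cams = torch.sigmoid(new_cams)      # squash to [0, 1]
    stored_cam = stored_cam.clamp(0.0, 1.0)
    overlap = new_cams * stored_cam         # high only where salient regions coincide
    return overlap.mean()
```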
      Keywords: class-incremental learning (CIL); multi-label classification; multi-label class-incremental learning (MLCIL); class activation mapping; minimum entropy sampling
    • Liu Ye, Bao Na, Cao Kerang, Chen Ji, Wang Xing
      Vol. 30, Issue 8, Pages: 2645-2659(2025) DOI: 10.11834/jig.240764
      Dynamic adaptive super-resolution for degraded images in continuous testing scenarios
      摘要:ObjectiveThe objective of image super-resolution is to enhance the quality of low-resolution images to match that of high-resolution images. Conventional super-resolution methods assume that data distribution during training and testing remains consistent. These methods typically focus on static domain images with relatively straightforward degradation types. However, in practical applications, image degradation is frequently influenced by numerous factors, including noise, blur, and alterations in illumination, resulting in a substantial domain shift between the training and testing environments. This domain shift results in a substantial decline in performance for traditional methods when applied in open environments. In light of these challenges, focus on the use of domain adaptation during continuous testing has been growing. This approach involves dynamically adjusting the model’s adaptability during testing, enabling it to adapt to real-time changes in image degradation types. This, in turn, leads to improvements in recovery performance under varying degradation conditions. This dynamic adaptation approach has been shown to better handle the diversity and uncertainty present in real-world environments, enhancing the stability and robustness of image super-resolution tasks.MethodTo improve the adaptability and robustness of super-resolution models in continuously changing open environments, this study proposes a new framework for dynamically adaptive image super-resolution during testing, called continuous test-time dynamic adaptation for super-resolution (CTDA-SR). The framework addresses the domain shift problem in complex scenarios through a dynamic domain adaptation strategy. Specifically, by designing a self-supervised dual-student network, the model can deeply explore the internal features of images. It can also fully exploit these features during the testing phase, enabling it to learn common features across different scales effectively. During the testing phase, the original input images are subjected to a downsampling operation. Subsequently, the first student network performs an upsampling operation on these downsampled images, striving to restore their original low-resolution state. This process aims to reconstruct the images preliminarily and lay the foundation for subsequent fine processing. Subsequently, the second student network upsamples the output from the first student network, with the objective of generating higher-resolution images for more accurate image reconstruction. Concurrently, the teacher model upsamples the original low-resolution images and then downsamples its output to ensure resolution consistency. The consistency loss between the outputs of the student networks and the teacher model is calculated to ensure that the model remains robust to variations in the input, a vital component for adapting to continuously changing environments. A multilevel cycle consistency loss function is introduced to enhance the model’s generalization capability. 
The implementation of cycle consistency constraints at multiple scales enables the network to learn the internal relationships between different scales, utilize the complementary information among scales to optimize the reconstruction results, enhance the model’s adaptability to images of different resolutions, and boost the generalization ability of the student networks.ResultThe experimental findings demonstrate that the proposed methodology in this study surpasses existing methods in multiple dynamic domain super-resolution tasks, including continuous degradation scenarios. The framework enhances the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of reconstructed images, thereby demonstrating stability and robustness. The experiment involved a comparison with 10 methods across eight datasets. In the U-Test1 dataset, compared with the model that performed second best, the proposed method increased the PSNR by 0.22 dB and the SSIM by 0.03. In the U-Test3 dataset, when compared with the second-best performing model, the proposed method improved the PSNR by 0.23 dB and the SSIM by 0.01. In the U-Test4 dataset, the proposed method increased the PSNR by 0.39 dB, while the SSIM decreased by 0.01. In the B-Test1 dataset, compared with the second-best performing model, the PSNR was improved by 0.11 dB, and the SSIM was increased by 0.01. In addition, comparative experiments in reverse order were conducted on the Urban100 dataset to verify the effectiveness of the algorithm. The results of these experiments confirmed that the proposed algorithm enhances the super-resolution effect of degraded images in continuous degradation environments.ConclusionThe proposed framework offers a pioneering solution for image super-resolution tasks in continuously changing environments. The proposed framework is effective due to its efficient adaptive capabilities, self-supervised dual-student network design, and unique loss functions. These elements collectively improve the performance of super-resolution tasks in dynamic domain conditions. The framework dynamically adjusts the model’s adaptability during the testing phase, effectively extracting internal image features and learning common features across different scales, ensuring robustness in complex degradation conditions. The adaptive strategy of CTDA-SR not only addresses the domain shift problem but also maintains high recovery performance in various dynamically changing degradation environments. Experimental findings demonstrate that the framework remarkably enhances the PSNR and SSIM of images while exhibiting excellent stability and adaptability when handling degradation types such as noise, blur, and illumination changes. This framework offers a novel approach for image super-resolution research, particularly for its application in dynamic open environments, where it demonstrates considerable promise.  
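      A minimal sketch of the kind of cycle-consistency objective described above is given below, assuming `student1`, `student2`, and `teacher` are upsampling networks supplied by the caller. The bicubic resampling, the L1 form of the losses, and the two-term combination are illustrative assumptions, not the CTDA-SR definition.

```python
import torch
import torch.nn.functional as F


def cycle_consistency_losses(lr, student1, student2, teacher, scale=2):
    """Minimal sketch of multi-scale cycle consistency for test-time SR.

    lr: low-resolution test image, shape (B, C, H, W).
    student1 / student2 / teacher: upsampling networks (callables); their
    exact architectures are assumptions, not the paper's definitions.
    """
    # Cycle 1: downsample the input, let student1 try to recover it.
    lr_small = F.interpolate(lr, scale_factor=1.0 / scale, mode="bicubic",
                             align_corners=False)
    rec_lr = student1(lr_small)                 # should match the original lr
    loss_cycle1 = F.l1_loss(rec_lr, lr)

    # Cycle 2: student2 upsamples student1's output toward SR resolution,
    # while the teacher upsamples lr directly; the outputs should agree.
    sr_student = student2(rec_lr)
    with torch.no_grad():
        sr_teacher = teacher(lr)
    loss_consistency = F.l1_loss(sr_student, sr_teacher)

    return loss_cycle1 + loss_consistency
```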
      Keywords: super-resolution (SR); degraded images; domain adaptation; teacher-student model; loss function; continuous change
    • Continual testing time domain adaptive image classification method

      Lu Tingyang, Lyu Fan, Zhou Tao, Yao Rui, Hu Fuyuan
      Vol. 30, Issue 8, Pages: 2660-2674(2025) DOI: 10.11834/jig.240739
      Continual testing time domain adaptive image classification method
      摘要:ObjectiveIn recent years, Deep Neural Networks have demonstrated exceptional performance in numerous computer vision tasks, including image classification, dense prediction, and image segmentation. The success of these tasks largely depends on the consistency between the test data distribution and the training data. However, in real-world application scenarios, consistency is often disrupted due to unknown changes in factors such as weather and lighting. This disruption leads to increased domain diversity and results in poor generalization and performance degradation of deployed models. The concept of continuous test-time adaptation (CTTA) has been proposed to address this challenge, with the aim of enabling the adaption of pre-trained models to the evolving target data distribution. CTTA is designed to allow source pre-trained models to adapt to the continuously changing target domain without using any source data. Existing CTTA primarily focuses on self-training model adaptation, which employs pseudo-labels from model-predicted data-augmented samples within the mean teacher framework to achieve self-training. However, this study posits that the essence of CTTA is the transformation of inter-domain feature styles. Through experiments, this study has found that existing methods, which use random data augmentation strategies, ignore the importance of inter-domain differences. The use of simple and singular data augmentation leads to issues such as insufficient model stability and generalization, which hinders knowledge transfer across certain domains. To this end, this study proposes a CTTA method that addresses inter-domain inconsistency differences, particularly within image classification tasks in the field of computer vision. The research explores ways to improve the adaptability of models to new domains through continuous testing adaptation techniques.MethodFirst, this study posits that the essence of domain variation lies in the variation in feature styles between domains, which acknowledges differences in features across target domains. Consequently, the study aims to construct a flexible data augmentation strategy based on domain differences. Calculating inter-domain differences is a key technique for measuring the distributional differences between features of different domains, which is specifically important in the field of CTTA. The Gram matrix, as a tool for assessing feature style differences, has been widely applied in this domain. The use of the Gram matrix more accurately quantifies and understands the differences in feature distribution between domains, which help calculate the appropriate elasticity factor for subsequent flexible data augmentation operations. This approach considers inter-domain differences from a data preprocessing perspective, which enables the model to adapt flexibly to continuously changing domains. Second, based on the differences in inter-domain feature styles, the study proposes a global elastic symmetric cross-entropy consistency loss function. This function incorporates the elasticity factor, which is calculated based on inter-domain differences, into the pseudo-label and loss function levels. It considers inter-domain differences from the perspective of model training optimization, which involves constructing a global elastic symmetric cross-entropy loss function that balances model generalization and stability. 
The elastic symmetric cross-entropy loss function is a loss function that combines differences in inter-domain feature styles, which adaptively adjusts pseudo-label outputs by constructing a flexible data augmentation strategy. The goal of this method is to achieve a balance between the generalization and stability of the model. Specifically, it enhances the adaptability of the model to new domains by dynamically adjusting the pseudo-labels and the weights between forward and backward cross-entropy based on differences in inter-domain feature styles. Finally, a confidence-based pseudo-label correction strategy is proposed. Given that the study implements controllable elastic data augmentation, data are elastically enhanced according to the degree of inter-domain differences, which produces a diverse set of outputs. However, elastic data augmentation may result in a large number of strongly augmented samples, which can obscure sample characteristics. As a result, the model has difficulty correctly predicting the true labels of strongly augmented samples, which leads to low-quality pseudo-labels and error accumulation. Therefore, using high-confidence weak data augmentation predictions to correct strong data augmentation predictions reduces the issue of low-quality pseudo-labels caused by high-intensity data augmentation during the CTTA phase. In this way, the accumulation of errors is effectively suppressed.ResultThis study conducted comprehensive comparative experiments on the CIFAR10-C, CIFAR100-C, and ImageNet-C datasets, which were followed by comparison of results with those of various advanced algorithms. The experimental results indicate that the algorithm proposed in this study has achieved significant performance improvements over the baseline method CoTTA on these datasets. Specifically, on the CIFAR10-C, CIFAR100, and ImageNet-C datasets, the error rates were reduced by approximately 2.3%, 2.7%, and 3.6%, respectively. These results demonstrate that the algorithm can effectively enhance the robustness and accuracy of the model across datasets of varying difficulty and complexity. Further ablation experiments were conducted on the CIFAR10-C dataset to verify the effectiveness of each module within the algorithm. Ablation experiments are a method of controlling variables, where modules are added individually or in combination to test their impact on overall performance. This approach helps understand the specific contributions of each module to the performance of the algorithm, which provides a basis for algorithm optimization. Furthermore, the study designed experiments with random domain inputs to more closely align with real-world domain variation scenarios. The experimental results show that the elastic symmetric cross-entropy based on domain difference detection has a lower error rate under random domain input conditions than existing methods, which verifies the effectiveness of the approach. This enhancement in capability is crucial because models in practice often need to make predictions in constantly changing environments. In other words, the model should better adapt to unknown domain changes, which maintains high performance in practical applications.ConclusionThe presented algorithm effectively balances the generalization and stability of the model in scenarios of continuous test-time adaptation while significantly reducing the accumulation of errors during the test-time adaptation process. 
This balance is crucial for machine learning models to maintain performance in dynamically changing environments. Our research focuses not only on the theoretical foundations of the algorithm but also on its effectiveness in practical applications. In future research, we plan to further explore the performance and applicability of continuous test-time adaptation algorithms in more open and complex environments. Therefore, we need to design and implement experimental setups that are closer to real-world conditions to more accurately evaluate and optimize the algorithms. Our goal is to ensure that these algorithms are not only robust in theory but also demonstrate strong adaptability and robustness in practical applications. Furthermore, we will consider the scalability and computational efficiency of the algorithms, given that these factors are crucial for the practical deployment of the algorithms in large-scale or resource-constrained environments. We hope that through these studies, we can provide more powerful and reliable tools for the fields of machine learning and artificial intelligence to address the increasingly complex challenges of the real world.  
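      The Gram-matrix style distance at the core of the domain-difference measurement can be sketched as follows. How the resulting scalar is mapped to an elasticity factor for augmentation strength and loss weighting is left abstract here, since that mapping is specific to the paper.

```python
import torch


def gram_matrix(feat):
    """Gram matrix of a feature map, commonly used to describe feature style.

    feat: (B, C, H, W) activations from some layer of the backbone.
    Returns a (B, C, C) matrix normalized by the number of positions.
    """
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)


def domain_difference(feat_current, feat_reference):
    """Scalar style distance between current-batch features and a reference
    (e.g., features kept from the previous domain). Using the Frobenius norm
    of the Gram difference is an illustrative choice, not the paper's exact
    formulation.
    """
    g_cur = gram_matrix(feat_current).mean(dim=0)
    g_ref = gram_matrix(feat_reference).mean(dim=0)
    return torch.norm(g_cur - g_ref, p="fro")
```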
      Keywords: continual test-time adaptation (CTTA); Gram matrix; domain difference; global symmetric cross-entropy; elastic data augmentation; pseudo-label self-correction; continual learning
    • Adapter-enhanced two-stage continual defect detection

      In industrial product defect discrimination, the AETS method significantly improves model adaptability and generalization through an adapter-enhancement module and a two-stage training strategy.
      Feng Jun, Meng Xujing, Shang Yuquan, Niu Chaofan
      Vol. 30, Issue 8, Pages: 2675-2689(2025) DOI: 10.11834/jig.240663
      Adapter-enhanced two-stage continual defect detection
      摘要:ObjectiveThe difficulty in obtaining defect samples in industrial scenarios causes many existing anomaly detection algorithms to rely solely on normal samples during the training phase. These methods demonstrate strong performance in industrial defect detection by training a single model tailored to specific product types. However, traditional anomaly detection methods mainly focus on the current task in industrial product defect discrimination, which frequently results in catastrophic forgetting knowledge of previously learned tasks when the model is trained on new tasks. As product objects and defect types continuously change and diversify in real-world industrial environments, anomaly detection models need to possess greater flexibility and continuous adaptation capability to cope with new tasks and data. Therefore, this study proposes an adapter-enhanced two-stage continual defect detection (AETS) method tailored for continuous anomaly detection tasks.MethodFirst, based on the AdaptFormer framework, this study introduces an external attention mechanism to augment the capability of the model to capture global dependencies across sequential tasks, which enhances generalization performance on new tasks. This enhancement is crucial in industrial anomaly detection, which enables the AETS model to adapt to diverse and complex defect types across different products. Furthermore, combining parameter-efficient tuning technique with vision Transformer (ViT) pre-trained models, the AETS framework incorporates two primary stages of training to enhance the ability of the model to retain previously learned knowledge while effectively adapting to new tasks. The first stage focuses on adaptation, where a full parameter fine-tuning strategy is employed to mitigate the domain shift between natural and industrial images. This procedure is critical given that the differences in feature distributions between the two domains can severely hinder the performance of industrial anomaly detection models when directly applied to industrial datasets. The AETS model integrates an external attention mechanism into the existing AdaptFormer architecture to further enhance the adaptation process. This integration allows the model to capture long-range dependencies across sequential tasks, which is vital for maintaining performance across different domains and tasks. This incorporation of external attention enables the model to better generalize to new tasks by improving the representation learning of global contextual information, which not only improves the ability of the model to attend to relevant features but also enhances its capacity to model global contextual information. By refining the representation learning of the model, the external attention mechanism improves the flexibility and robustness. In the second stage, which is known as efficient fine-tuning, the AETS method utilizes adapter modules to selectively fine-tune a minimal number of parameters. This way reduces computational overhead while retaining essential knowledge from previously learned tasks. By freezing most of the parameters, the model can preserve knowledge from prior tasks and mitigate catastrophic forgetting, which is a common challenge in continual learning settings. The adapter-enhanced module is designed to enhance the capacity of the model to handle new tasks without overwriting previously learned information. 
This capability ensures that the model can continuously adapt to evolving data distributions in industrial environments. Forgetting measure (FM) only evaluates the forgetting compared with the maximum performance achieved in the past for each task at the completion of the final model training. This study introduces a novel evaluation metric called forgetting fluctuation rate (FFR), which is specifically designed for continuous learning scenarios to provide a comprehensive assessment of the catastrophic forgetting problem. FFR is used for quantifying the extent of forgetting during the learning process and can provide a more granular assessment of model stability across sequential tasks. By measuring the fluctuation in forgetting over time, FFR helps evaluate how well the model retains its knowledge while learning new tasks, which offers a more robust evaluation of continual learning performance in industrial defect detection scenarios. When the FFR value is low, the forgetting ability of the model is relatively smooth, with less performance fluctuation. Conversely, higher values mean greater fluctuations in forgetting.ResultWe conduct extensive experiments using three benchmark datasets from the MVTecAD dataset, which are specifically divided to evaluate the performance of the model in different industrial defect detection scenarios. The AETS method demonstrates significant improvements in average accuracy (ACC). Specifically, it achieves scores of 84.21%, 89.16%, and 78.49% on the MVTec-MCIL, MVTec-SCIL, and MVTec+MTD datasets, respectively. These results indicate that AETS outperforms other continual learning models, especially in terms of efficiency and resource utilization. Compared with five continuous learning methods, AETS exhibits the best performance in terms of ACC and FM metrics, with the smallest number of training parameters. Furthermore, when compared with six state-of-the-art parameter-efficient tuning methods, AETS achieves optimal FFR values. This result demonstrates that our method not only improves model accuracy but also provides a lightweight solution that is more practical for real-world industrial applications. We perform ablation studies to further validate the effectiveness of our proposed method. Specifically, the ablation experiments focus on selecting the optimal scaling factor and determining the best structure for the adapter enhancement module. These experiments are crucial in achieving a balanced trade-off between the plasticity of the model, which enables it to adapt to new tasks, and its stability, which helps preserve previously learned knowledge.ConclusionThis study presents a novel continual anomaly detection method called AETS, which leverages the powerful feature representation capability of pre-trained models combined with adapter-enhanced modules to improve continual learning performance in industrial defect detection tasks. AETS, which adopts a two-stage training process, effectively addresses domain shift issues between natural and industrial images. It also efficiently captures task-relevant features through parameter-efficient fine-tuning, which not only mitigates catastrophic forgetting but also enhances the training efficiency of the model. In addition, the introduction of FFR validates the robustness and adaptability of the model in dynamic industrial environments. The experimental results demonstrate that AETS can achieve competitive performance across various benchmark datasets. 
It not only maintains high detection accuracy but also significantly enhances the stability and plasticity of the model in the context of continuous learning. By mitigating catastrophic forgetting, AETS effectively enables the model to adapt to new tasks without compromising previously learned knowledge. These advantages make AETS particularly well suited for real-world industrial applications, where accuracy and the ability to continuously learn from evolving data are essential for practical deployment and long-term performance.  
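      The adapter modules used in the efficient fine-tuning stage follow the familiar bottleneck pattern, which the sketch below illustrates for a frozen ViT token sequence. The hidden width, scaling factor, and GELU nonlinearity are illustrative defaults, not the AETS configuration.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Lightweight adapter placed alongside a frozen ViT block
    (AdaptFormer-style down-project / nonlinearity / up-project with a
    scaling factor). Dimensions and the scaling value are assumptions.
    """

    def __init__(self, dim=768, bottleneck=64, scale=0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        self.scale = scale

    def forward(self, x):
        # x: (B, N, dim) token sequence; the adapter output is added back
        # to the frozen branch as a scaled residual.
        return x + self.scale * self.up(self.act(self.down(x)))
```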
      Keywords: continual learning (CL); industrial product defect detection; anomaly detection; adapter enhancement; two-stage training; parameter-efficient tuning (PET); forgetting fluctuation rate (FFR)

      Review

    • Visual saliency detection for 3D object models

      Visual saliency detection for 3D models imitates the human visual system to locate regions carrying important visual information, providing solutions for tasks such as model simplification, segmentation, and compression.
      Ding Xiaoying, Zhang Xinfeng, Lin Weisi, Chen Zhenzhong
      Vol. 30, Issue 8, Pages: 2690-2710(2025) DOI: 10.11834/jig.240614
      Visual saliency detection for 3D object models
      摘要:With the fast development of point cloud acquisition technology and the popularity of portable point cloud acquisition equipment, generating 3D model with high density and rich texture information is much easier than ever before. A three-dimensional (3D) model can represent the real world with rich information and has been widely applied in various applications in our daily life, such as smart cities, autonomous driving, virtual reality, and product design, making 3D model processing a hot research topic in the fields related to computer vision and 3D graphics. However, the growing number and size of 3D models bring great challenges to storage, transmission, and processing. With this huge amount of data, directly processing the whole 3D model in real time is impractical or uneconomical in many situations. Inspired by the selective mechanism of the human visual system (HVS), which indicates that the HVS selectively processes the huge amount of visual information that comes into our eyes and distributes more of the limited attention to regions with important visual information, researchers begin to investigate whether this selective mechanism can be applied to deal with 3D models. In essence, 3D visual saliency detection aims to detect the visually salient regions on 3D models and can help with various human-centered applications in 3D model processing, e.g., resizing, simplification, segmentation, and quality assessment. Different from 2D image and video data which have regular pixel distribution, a 3D model contains depth information and noise data and is usually with large data size. More often than not, points are distributed unevenly in 3D space, making it hard to extend existing image and video visual saliency detection methods to deal with 3D models straightforwardly. In addition, in contrast to 2D image and video visual saliency detection which have been studied by researchers for decades, research toward 3D visual saliency detection has just started during recent years. Considering the fact that humans are the final users for most small 3D object while machines are final users for large-scale/city-scale scenarios, this paper summarizes studies related to visual saliency detection for 3D object models to provide a comprehensive survey. First, we summarize existing 3D visual saliency detection methods. Considering the different kind of features used by researchers, these 3D visual saliency detection methods can be classified into two categories: handcrafted feature-based methods and deep learning-based methods. The former category can be further divided into two different types, namely, the single-scale feature-based methods and the multi-scale feature-based ones. In addition, according to different 3D representations, these methods can also be divided into 3D point-cloud based methods and 3D mesh based ones. We also investigate their feature extraction strategies and multi-feature integration frameworks. Compared with traditional handcrafted feature-based methods, deep learning-based methods achieve higher similarity with the ground-truth human labeling and fixation density maps. The reason is that deep learning-based features usually have stronger expressive power when compared with handcrafted features. Then, we summarize 3D visual saliency detection databases, which can be used to achieve and evaluate the performance of different 3D visual saliency detection methods. 
On the basis of different data collection approaches, these databases can be categorized into two different classes: mouse tracking-based database and eye tracking-based database. Limited by eye-tracking technology, earlier 3D visual saliency detection databases were mainly constructed using mouse-tracking technology. Compared with eye tracking, mouse tracking is cheaper and easier to implement. With development of eye-tracking technology in recent years, researchers tend to use eye-tracking technology to construct a 3D visual saliency detection database because recording the human visual behavior is more straightforward. During the eye-tracking experiment, several subjects are invited to observe 3D models or their projected images freely without any specific task. Their eye-movement data are collected during this observation and subsequently processed to construct the database. Moreover, we provide clear illustrations about the commonly used evaluation metrics for evaluating the performance of 3D visual saliency detection methods, such as the correlation coefficient score and the area under curve (AUC) score. In addition, we introduce how 3D visual saliency detection can be used as a guidance to improve the visual performance of other 3D model processing applications including 3D model resizing, simplification, denoising, and quality assessment. In 3D model resizing, uniformly adjusting the size of the entire 3D model may lead to unsatisfying visual performance because it could potentially cause excessive stretching of certain areas on the 3D model and lead to visual distortions, introducing visual saliency information can better preserve the structural features of the 3D model. In 3D model simplification, 3D visual saliency detection result identifies regions that are visually salient, which can guide the simplification procedure to preserve more details in visually salient regions and lead to more visual appealing results. In 3D model denoising, the visual saliency feature can help determine parameters of the denoising filter, allowing for enhanced preservation of detailed features in the denoised 3D model. In 3D model quality assessment, distortions in visually salient regions are more annoying than distortions appearing in nonsalient regions. Thus, accurately detecting visually salient regions on 3D model is crucial to 3D quality assessment and can help obtain more precise quality evaluation results. Finally, we provide an in-depth conclusion about 3D model visual saliency detection and discuss the problems that must be solved. Moreover, we introduce the development trends of 3D model visual saliency detection from two perspectives: content of the 3D model and design of the algorithm, hoping that it can help with further improvement and trigger further exploration in the research community.  
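      The two evaluation metrics mentioned above can be computed as follows for a predicted saliency map and ground-truth fixation data; the inputs are assumed to be flattened per-vertex (or per-pixel) arrays, and the rank-based AUC ignores ties.

```python
import numpy as np


def correlation_coefficient(pred, gt):
    """Pearson correlation between a predicted saliency map and a
    ground-truth fixation density map (both flattened to 1D arrays)."""
    pred = (pred - pred.mean()) / (pred.std() + 1e-8)
    gt = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((pred * gt).mean())


def auc_score(pred, fixated):
    """Rank-based AUC: probability that a fixated point receives higher
    predicted saliency than a non-fixated one (Mann-Whitney form, ties
    not specially handled).

    pred:    1D array of predicted saliency values.
    fixated: 1D boolean array marking ground-truth salient points.
    """
    order = pred.argsort()
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(pred) + 1)
    n_pos = fixated.sum()
    n_neg = len(pred) - n_pos
    u = ranks[fixated].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))
```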
      Keywords: 3D model visual saliency detection (3D-VSD); 3D model processing; visual attention mechanism; eye tracking; feature extraction; deep learning
    • Deep learning for image reflection removal: a survey

      Smartphone photography is now ubiquitous and image capture is convenient, but glass reflections degrade image quality. This paper examines progress in deep learning-based reflection removal and offers solutions to this source of image degradation.
      Hong Yuchen, Lyu Youwei, Wan Renjie, Li Si, Shi Boxin
      Vol. 30, Issue 8, Pages: 2711-2728(2025) DOI: 10.11834/jig.240537
      Deep learning for image reflection removal: a survey
      摘要:The widespread adoption of smartphone photography has simplified image acquisition, leading to generation of massive amounts of image data for training intelligent visual perception models. However, several image degradation issues hinder the use of captured image data, one of which is the glass reflection. When people take photos through transparent materials such as glass windows, the presence of reflections can severely degrade the quality of the captured images and interfere with downstream computer vision tasks. Reflection removal aims to separate different scene components located on either side of the glass from reflection-contaminated mixture images, which eliminates glass reflections to obtain clear transmission images. Reflection removal, as an attractive topic in computational photography and computer vision, has obtained considerable attention from researchers and rapidly developed with the extensive application of deep learning in computational photography problems. This study comprehensively reviews recent advancements in deep learning-based reflection removal. First, we start by analyzing the image formation model of mixture images, followed by examining the effects of glass material and camera characteristics on the properties of reflection and transmission images, including refraction, absorption, and reflection effects of glass and image blur caused by camera depth of field. Second, from the perspective of input images, we summarize publicly available reflection removal datasets and conduct statistical analysis of their application scenarios, specific purposes, data scale, and resolution attributes. Synthetic data created based on theoretical imaging models are important for large-scale training dataset creation, but they still exhibit distribution discrepancies compared with real captured images. Therefore, a key strategy to enhance model performance is to use a portion of real data during training. Benchmark datasets from real data need to be constructed to comprehensively evaluate the performance of reflection removal algorithms in real-world settings. Third, from the viewpoint of deep learning models, we systematically compare the network design, loss functions, and evaluation metrics of reflection removal networks. For network design, researchers have primarily employed three types of network structures to construct the network models, namely, direct, cascaded, and concurrent structures, depending on different strategies to predict transmission and reflection images. As for loss functions, early methods mainly utilize pixel loss and edge loss. Subsequently, more sophisticated loss functions have been introduced to constrain the perceptual quality of predicted images and the correlation between transmission and reflection images. These functions guide the optimization of network models toward more realistic and higher-quality restoration. For evaluating the quality of reflection removal results, similarities across various statistical characteristics between the predicted and reference images are used as metrics. Based on the employed auxiliary information in reflection removal methods, we propose a systematical taxonomy to categorize existing approaches into four types: image feature-based, text feature-based, geometry characteristics-based, and light characteristics-based. 
The rapid development of deep learning has enabled the use of deep neural networks in reflection removal methods, exploiting image features to extract low- or high-level image characteristics from large training datasets, thereby enhancing reflection removal performance. However, due to the inherent ill-posed nature of the problem, incorporating additional auxiliary information becomes crucial when dealing with complex reflection scenarios. Methods based on geometry characteristics use panoramic cameras or capture multiple images from different camera positions to obtain additional views of scenes, which provides auxiliary contextual information. Methods based on light characteristics leverage the discrepancy in the light paths between rays from the transmission and reflection scenes, such as variations in light conditions or polarization characteristics, to provide key cues for effective reflection removal. With the emergence of multimodal large language models, methods based on text features introduce natural language descriptions to cooperate with the image modality and provide semantic guidance for reflection removal. Consequently, state-of-the-art results are obtained without requiring additional hardware support. Finally, by discussing unresolved key challenges within the field, we offer a summary and outlook for reflection removal research. This survey provides a systematic review on recent advances in deep learning for the reflection removal problem, which helps researchers develop a profound understanding of reflection removal techniques and facilitate future research.  
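      The linear image formation model referred to above is often used to synthesize training mixtures. The toy sketch below blends a transmission layer with a blurred reflection layer, where the blending weight and the Gaussian blur (standing in for depth-of-field effects) are illustrative assumptions rather than any specific synthesis pipeline from the literature.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def synthesize_mixture(transmission, reflection, alpha=0.8, blur_sigma=2.0):
    """Toy linear formation model for a reflection-contaminated image:
    mixture = alpha * transmission + (1 - alpha) * blurred reflection.

    transmission, reflection: float arrays in [0, 1], shape (H, W, 3).
    alpha and blur_sigma are illustrative defaults, not values from the survey.
    """
    blurred_reflection = gaussian_filter(reflection,
                                         sigma=(blur_sigma, blur_sigma, 0))
    mixture = alpha * transmission + (1.0 - alpha) * blurred_reflection
    return np.clip(mixture, 0.0, 1.0)
```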
      Keywords: computational photography; image restoration; reflection removal; convolutional neural network (CNN); diffusion model; perceptual quality

      Dataset

    • Multimodal lie detection dataset based on Chinese dialogue

      In lie detection, researchers at Southeast University have constructed SEUMLD, the first Chinese multimodal lie detection dataset, providing an important data source for studying multimodal lie detection in Chinese language contexts.
      Xu Xiaolin, Zheng Wenming, Lian Hailun, Li Sunan, Liu Jiateng, Liu Anbang, Lu Cheng, Zong Yuan, Liang Zongbao
      Vol. 30, Issue 8, Pages: 2729-2742(2025) DOI: 10.11834/jig.240571
      Multimodal lie detection dataset based on Chinese dialogue
      摘要:ObjectiveLying is an intentional act of distorting facts or spreading false information. Although lying can lead to misunderstandings and a crisis of trust, it can also help avoid harm or protect personal privacy in certain situations. Lie detection aims to identify dishonest behavior from the physiological or behavioral manifestations of the speaker, and it holds significant application potential in criminal investigation, security screening, and high-risk commercial transactions. However, current studies mainly rely on lie detection datasets based on English language contexts due to the lack of publicly available Chinese lie detection research datasets. Considering differences in language and culture, lie detection algorithms developed for English scenarios may be unsuitable for Chinese language contexts. Moreover, existing lie detection datasets are limited by insufficient subject motivation to deceive and small sample sizes. To address these issues, this study introduces the first publicly available multimodal lie detection dataset based on Chinese dialogues, which features a more comprehensive stimulus paradigm and a larger scale —— Southeast University multimodal lie detection dataset (SEUMLD).MethodIn terms of data collection, the SEUMLD dataset adopts the guilty knowledge test paradigm, which systematically designs an experimental process that includes simulated crimes and simulated interrogations, starting with the typical crime of “theft.” By linking experimental performance with rewards, the study maximizes the motivation of participants to lie. SEUMLD includes 76 participants, all of whom have lived for an extended period in a Chinese language environment. The dataset records video, audio, and electrocardiogram (ECG) data during the simulated interrogation process, with a total of 3 224 conversation segments. The total duration of the SEUMLD dataset is 6.1 h, with the average duration of each participant being 288.8 s, and the average duration of each segment being 2.4 s. With regard to data annotation, the dataset provides annotations for determining whether a single subject is engaging in lying during a long conversation (coarse-grained annotation) and precise annotations for each segment within the long conversation (fine-grained annotation). Based on the constructed SEUMLD dataset, this study first designed cross-linguistic lie detection experiments to verify the impact of language and cultural differences on dishonest behavior. Then, transfer learning experiments were conducted to evaluate the generalization performance of SEUMLD. Interrogation-based lie detection not only needs to judge whether the interrogated subject is engaging in lying but also requires accurately assessing whether the subject is engaging in lying in response to specific questions, which is crucial for obtaining key clues during the interrogation process. Thus, this study conducts benchmark experiments using classical lie detection methods based on unimodal data and multimodal data fusion. An in-depth analysis of the results from the lie detection benchmark experiments is performed to address the challenge of insufficient lying cues in fine-grained detection. It posits that logical or temporal correlations may exist between multiple questions and answers within a long conversation. 
Experimental results demonstrate that leveraging the correlations between multiple responses can enhance the feature expression of individual short temporal sequences, which improves the performance of fine-grained lie detection.ResultThe cross-linguistic lie detection experiments demonstrated significant differences in lying behavior across different languages and cultural contexts, which makes models developed based on popular English lie detection datasets unsuitable for Chinese scenarios. Therefore, constructing a lie detection dataset in a Chinese context is crucial for studying the lying patterns of native Chinese speakers. The results of the transfer learning experiments validated the excellent generalization performance of SEUMLD. SEUMLD demonstrates excellent adaptability across different scenarios and collection paradigms, which advances lie detection algorithms toward practical applications. The benchmark experiment results show that 1) the best unweighted average recall (UAR) for coarse-grained lie detection based on unimodal signals on ECG, video, and audio signals are 0.718 9, 0.757 6, and 0.674 2, respectively. 2) The best UAR results for fine-grained lie detection based on unimodal signals on video and audio signals are 0.630 5 and 0.617 2, respectively, which are significantly lower than the performance of coarse-grained lie detection. 3) In the fine-grained lie detection with context information, the best UAR results for the fine-grained lie detection based on video signals and audio signals are improved to 0.709 6 and 0.707 5, respectively. 4) The experimental results of multimodal information fusion showed that the lie detection performance reaches the best after multimodal information fusion, with UAR recognition results of 0.808 3 in the coarse-grained lie detection and 0.737 9 in the fine-grained lie detection.ConclusionThe SEUMLD constructed in this study provides an important data source for studying multimodal lie detection in a Chinese context and holds significant value for future research on the lying patterns of native Chinese speakers. The dataset is available at:https://aip.seu.edu.cn/2024/1219/c54084a515309/page.htm or https://doi.org/10.57760/sciencedb.22548. Furthermore, based on SEUMLD, this study conducts a series of intriguing experiments, including cross-linguistic experiments, transfer learning experiments, and benchmark tests utilizing classical lie detection methods. These experiments not only provide valuable insights into the challenges and opportunities in cross-cultural and multimodal lie detection but also establish a comprehensive set of baseline results. These benchmarks serve as a practical reference for researchers interested in this domain, which enables them to quickly and efficiently initiate their studies.  
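      For reference, the unweighted average recall (UAR) reported in the benchmarks is simply the mean of per-class recalls, as in the short sketch below (integer class ids, e.g., 0 = truthful and 1 = deceptive, are assumed).

```python
import numpy as np


def unweighted_average_recall(y_true, y_pred):
    """Unweighted average recall (UAR): the mean of per-class recalls."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for cls in np.unique(y_true):
        mask = y_true == cls
        recalls.append((y_pred[mask] == cls).mean())
    return float(np.mean(recalls))


# Example with two classes:
# unweighted_average_recall([0, 0, 1, 1], [0, 1, 1, 1]) -> 0.75
```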
      Keywords: lie detection; Chinese lie detection; multimodal; dataset; benchmark
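      For readers reproducing the benchmark, below is a minimal sketch of the unweighted average recall (UAR) metric reported above, using its standard definition (the mean of per-class recalls, so the minority "lie" class is not drowned out by the majority class); the label arrays in the example are placeholders.

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    """UAR = mean over classes of (correct predictions in class / samples in class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = []
    for c in np.unique(y_true):
        mask = (y_true == c)
        recalls.append((y_pred[mask] == c).mean())
    return float(np.mean(recalls))

# Example: two-class truth/lie labels (0 = truthful, 1 = deceptive).
print(unweighted_average_recall([0, 0, 0, 1, 1], [0, 0, 1, 1, 0]))  # ~0.583
```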

      Image Processing and Coding

    • Image colorization via color query and feature enhancement (AI digest)

      In image colorization, a new model built on an encoder-decoder structure and a color prediction network markedly improves colorization quality, offering a new solution for old-photo restoration and black-and-white film enhancement.
      Yu Bing, Xiang Xue, Fan Zhenghui, Huang Dongjin, Ding Youdong
      Vol. 30, Issue 8, Pages: 2743-2757(2025) DOI: 10.11834/jig.240506
      Image colorization via color query and feature enhancement
      摘要:ObjectiveImage colorization refers to the process of predicting plausible colors for each pixel in a grayscale image, with the goal not necessarily being an exact restoration of the original color. Given that the same object can often be assigned different colors, colorization inherently exhibits multimodal characteristics. This complexity presents challenges for researchers and sustained interest in the field. As an important direction in computer vision, image colorization has gained considerable attention, particularly with the advancements in deep learning. Traditional colorization methods often rely on user input, such as scribble-based or reference image-based techniques, which, while effective, are hindered by limitations such as the inability to handle batch processing and extended processing times. To address these issues and reduce manual intervention, deep learning techniques have propelled the development of automatic colorization. Fully automatic methods, which eliminate the need for user interaction, can efficiently colorize images; however, challenges including the accuracy of color restoration and the preservation of image details remain. Deep learning approaches, particularly those based on convolutional neural networks and generative adversarial networks, have demonstrated improvements in colorization performance but continue to face issues such as insufficient semantic understanding and blurred details. Recently, Transformer models have shown promise in image colorization tasks by leveraging their ability to capture long-range dependencies, further enhancing results. However, existing methods still struggle with challenges such as color bleeding and loss of detail clarity, especially in highly detailed images. Achieving fully automated, natural, and plausible colorization remains an ongoing research challenge.MethodIn this study, we propose an end-to-end grayscale image colorization method utilizing encoder-decoder architecture for fully automatic colorization. Given an input grayscale image, the network predicts the chrominance channels in the CIELAB color space to generate a colorized image. The encoder employs ConvNeXt to leverage its multiscale semantic representation capabilities, effectively extracting high-level semantic features from the grayscale image. Multiscale feature maps are passed from the encoder to the decoder through convolutional connection layers, progressively restoring the image’s spatial resolution. These feature maps are then fed into the color prediction network, where a pixel enhancement block (PEB) refines the color predictions. The PEB is designed to focus on and enhance specific regions of the image, improving color matching accuracy by utilizing convolutional layers and pooling operations to generate spatial attention weights. These weights are element-wise multiplied with the original image, enabling spatial enhancement and better capturing important regions for improved color precision. The color query block in the color prediction network employs a Transformer-based approach, incorporating learnable color embedding memories that store sequences of color representations. Through cross-attention and self-attention mechanisms, color embeddings are progressively correlated with image features, reducing dependence on manual priors and improving sensitivity to semantic information. This approach mitigates issues such as color bleeding. 
Furthermore, to enhance the learning of color information and latent features from the grayscale image, the feature enhancement block generates attention maps using convolutional operations with varying kernel sizes. These maps are fused with the original image through convolutional layers to produce the final output tensor. This methodology ensures effective reconstruction of color and structural information, thereby enhancing the overall performance of the image colorization process.ResultIn the experiments, the proposed colorization model was trained on the large-scale ImageNet dataset and extensively evaluated across multiple benchmark datasets, including ImageNet (val5k), ImageNet (val50k), COCO-Stuff, ADE20K, and CelebA-HQ. The evaluation metrics included Frechet inception distance (FID), colorfulness score (CF), and peak signal-to-noise ratio (PSNR), which assess the realism, color quality, and reconstruction accuracy of the generated images. The model utilized a pretrained ConvNeXt-L as the encoder, paired with a multiscale decoder, and was optimized using the AdamW optimizer. All experiments were performed on four Tesla A100 GPUs. Comparative results showed that the proposed method remarkably outperformed existing approaches such as DeOldify, Wu et al., BigColor, ColorFormer, and DDColor, particularly in terms of color richness and realism. Quantitative comparisons across five test datasets further demonstrated that the proposed model consistently achieved the lowest FID scores, indicating superior image quality and strong generalization. Compared with the second-best model, the FID is reduced by 0.2, while the PSNR is improved by 0.13 dB. While previous methods achieved higher CF scores, a higher colorfulness score does not always correlate with better visual quality. Thus, the metric △CF was introduced to measure the difference in colorfulness between generated and real images. The proposed method achieved the lowest △CF scores across all datasets, reflecting its ability to generate more natural and realistic colorizations while preserving image diversity. Given the subjective nature of image colorization, a user study was also conducted, showing that over 30% of users preferred the colorization results produced by the proposed method. Additionally, ablation studies confirmed the effectiveness of the model architecture in enhancing colorization performance.ConclusionThe proposed colorization model is good at capturing and reproducing the details and color relationships in the image, achieving high-quality colorization results.  
      Keywords: image colorization; grayscale image; spatial attention; color query; feature enhancement
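      The pixel enhancement block described above derives spatial attention weights from pooling and convolution operations and multiplies them element-wise with the features. Below is an illustrative sketch of such a spatial re-weighting step; it is a generic formulation, not the authors' exact PEB, and the kernel size is an assumption.

```python
import torch
import torch.nn as nn

class SpatialEnhanceBlock(nn.Module):
    """Pool across channels, derive a spatial weight map with conv + sigmoid,
    and re-weight the input features (generic spatial attention)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                       # x: (B, C, H, W)
        avg_pool = x.mean(dim=1, keepdim=True)  # (B, 1, H, W)
        max_pool = x.amax(dim=1, keepdim=True)  # (B, 1, H, W)
        weights = self.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * weights                      # element-wise spatial re-weighting

feats = torch.randn(2, 64, 56, 56)
print(SpatialEnhanceBlock()(feats).shape)       # torch.Size([2, 64, 56, 56])
```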

      Image Analysis and Recognition

    • In salient object detection for 360° omnidirectional images, the DPNet model effectively addresses geometric distortion and background interference, offering a new solution for this task.
      Chen Xiaolei, Du Zelong, Zhang Xuegong, Wang Xing
      Vol. 30, Issue 8, Pages: 2758-2774(2025) DOI: 10.11834/jig.240592
      Distortion-adaptive and position-aware network for salient object detection in 360° omnidirectional image
      摘要:ObjectiveSalient object detection (SOD) in the field of computer vision originates from the study of human visual attention mechanisms. Its goal is to emulate the human ability to focus on specific objects or areas in complex scenes that naturally capture the interest of human vision. SOD, as a foundational research area in computer vision, is important for various downstream tasks, such as object tracking, semantic segmentation, person re-identification, camouflaged object detection, and image retrieval. In recent years, the advancements in Virtual reality(VR) and augmented reality(AR) technologies have expanded SOD beyond traditional 2D images to encompass 360° omnidirectional images (or panoramic images). The application of 360° SOD serves as a crucial preprocessing step for enhancing the efficiency of subsequent advanced visual tasks. These tasks include coding, editing, stitching, quality assessment, and transmission of 360° omnidirectional images. In contrast to traditional 2D images, 360° omnidirectional images exhibit the following core differences: 360° omnidirectional images are spherical. Given that no encoder has been specifically developed for spherical images, 360° omnidirectional images need to be projected into 2D images for further processing. Common projection methods mainly encompass equirectangular projection (ERP), cube-map projection, and octahedron projection. Regardless of the projection method used, geometric distortion is inevitable. This geometric distortion severely impacts the performance of SOD, which results in a significant decline in performance when traditional 2D SOD methods are directly applied to 360° omnidirectional images. Therefore, addressing the challenge of geometric distortion generated by 360° omnidirectional image projection is the core problem in the field of SOD in 360° omnidirectional images (360° SOD). In recent years, some 360° SOD methods have been established to solve the problem of geometric distortion caused by projection and have achieved good detection results to a certain extent. However, their approaches are either limited in effectiveness or rely on artificially designed features, which restricts the ability of the model to detect salient objects in 360° omnidirectional images. Meanwhile, most of the models have poor detection results when facing complex scenes or scenes with low contrast between foreground and background, which are easily interfered with by the background. This study introduces a distortion-adaptive and position-aware network (DPNet) for 360° SOD to solve the abovementioned problems. The aim is to further solve the problem of background interference in complex scenes by considering geometric distortion of ERP image, which helps better detect salient objects in 360° omnidirectional images.MethodDPNet combines vision transformer (ViT) and convolutional neural networks (CNNs) to build the basic framework of the network. It uses ViT and CNNs to design the encoder and constructs a combination decoder based on U-Net architecture to decode the features from the two encoders step by step. This way combines the global coding advantages of ViT and the local coding advantages of CNN. Compared with previous dual parallel structures, the two encoder backbones of our network are parallel, and the ViT backbone plays a guiding role for the CNN backbone. In other words, the CNN backbone can complement the detail information based on the semantic features extracted by the ViT backbone. 
On the one hand, this study proposes two distortion-adaptive detection modules, namely, distortion-adaptive module (DAM) and position-aware module (PAM), to solve the geometric distortion problem caused by ERP. DAM models geometric distortion in feature maps through channel-by-channel deformable convolution. PAM calculates spatial weights along the latitude and longitude, which directs the network to adaptively focus on salient regions in the image. Specifically, the global features extracted by the ViT backbone are processed by the DAM to model the geometric distortion. Then, two branches are extracted: one branch is sent to the decoder, and the other branch is sent to PAM to provide position prior information. PAM is placed in the shallow layer of CNN backbone and is responsible for fusing the position prior information with the information extracted in the shallow layer of CNN backbone to guide the subsequent feature extraction. In this way, DPNet can decide which regions of the 360° omnidirectional images should be focused on according to the characteristics of ERP and specific input images. On the other hand, a salient information enhancement module (SIEM) is proposed to further solve the problem of background interference in complex scenes. Currently, most SOD methods use structures such as U-Net to simply aggregate feature maps at different scales. This process inevitably treats a large amount of non-salient information contained in the low-level features as useful information, which leads to poor detection results. For addressing this issue, SIEM uses high-level features to guide low-level features, filters non-salient information, and prevents the influence of background interference on the effectiveness of 360° SOD.ResultWe compare our model with 13 state-of-the-art methods on 2 public datasets, namely, 360-SOD and 360-SSOD. Notably, its overall performance on 8 evaluation metrics is better than those of the latest 13 methods. In addition, the generalization experiment is set up, and the excellent generalization performance of the model is confirmed by cross-validation. Then, an ablation experiment is conducted to verify the performance of the proposed module. Finally, a set of complexity comparison experiments proves that the proposed model DPNet achieves a good balance in terms of detection accuracy and model complexity.ConclusionThe existing 360° SOD methods cannot effectively address the geometric distortion problem after projection and the background interference problem in complex scenes. Thus, we propose a distortion-adaptive and position-aware 360° SOD network (DPNet) based on ViT and CNNs. The proposed DAM and PAM play a pivotal role in guiding the network to focus on areas requiring attention based on the distinctive characteristics of ERP and specific input images. In addition, the proposed SIEM works to guide low-level features with high-level features, which effectively filters out non-salient information present in low-level features and enhances the salient information. These capabilities can help the model effectively deal with the background interference problem in complex scenes. Through an extensive set of experiments, we demonstrate that our method outperforms 13 state-of-the-art SOD methods, which establishes its superiority in 360°SOD applications.  
      Keywords: 360° omnidirectional image; salient object detection (SOD); distortion-adaptive; position-aware; anti-background interference
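      As a rough illustration of the position-aware idea above (computing weights along the latitude and longitude of an ERP feature map), the sketch below pools the map along each axis and derives per-axis weights that are broadcast back onto the features. The 1D convolutions and kernel sizes are assumptions, not the authors' exact PAM.

```python
import torch
import torch.nn as nn

class PositionAwareWeighting(nn.Module):
    """Weight ERP features along the latitude (height) and longitude (width) axes."""
    def __init__(self, channels: int):
        super().__init__()
        self.lat_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.lon_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                 # x: (B, C, H, W)
        lat = self.sigmoid(self.lat_conv(x.mean(dim=3)))  # (B, C, H), pooled over longitude
        lon = self.sigmoid(self.lon_conv(x.mean(dim=2)))  # (B, C, W), pooled over latitude
        return x * lat.unsqueeze(3) * lon.unsqueeze(2)    # broadcast back to (B, C, H, W)

erp_feats = torch.randn(1, 128, 32, 64)                   # ERP maps typically have a 1:2 aspect ratio
print(PositionAwareWeighting(128)(erp_feats).shape)       # torch.Size([1, 128, 32, 64])
```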
    • Sun Zhongbin, Hu Shuai, Zhang Fan, Zhou Yong
      Vol. 30, Issue 8, Pages: 2775-2789(2025) DOI: 10.11834/jig.240490
      Improved YOLOv7-tiny with long short range dependency feature pyramid network
      摘要:ObjectiveIn recent years, YOLOv7-tiny has become a commonly used method in real-time object detection. The entire training process of this method is conducted in a single network due to its lightweight network architecture design and fewer parameters. The method offers fast detection speed without relying on sliding windows or region proposals, which makes it suitable for tasks with limited resources and high real-time requirements. However, YOLOv7-tiny has two problems in the feature fusion stage: one is information loss in adjacent layer feature fusion, and the other involves differences in non-adjacent layer feature information. Specifically, YOLOv7-tiny uses the traditional nearest neighbor upsampling method in adjacent layer feature fusion, which may lead to jagged edges in the generated feature map. This condition reduces the quality and expression ability of the feature map. The problem of non-adjacent layer feature differences occurs during the bidirectional fusion process of YOLOv7-tiny using feature pyramids. The unique information of upper and lower layers is gradually “diluted”. As a result, feature maps contain different scale information in the feature extraction and detection stages, which may seriously affect the ability of the model to detect large- or small-scale objects.MethodThis study proposes a long short range dependency feature pyramid network (LSRD-FPN) to solve the two problems. The network will be employed to improve the YOLOv7-tiny method. LSRD-FPN consists of two key components: the local short range dependency (SRD) mechanism and the global long range dependency (LRD) mechanism. SRD improves the upsampling method and introduces an attention mechanism. It uses the lightweight feature upsampling method CARAFE instead of the traditional nearest neighbor upsampling method, with an increase of only approximately 20 000 parameters. In addition, adding a non-parametric attention mechanism SimAM after local feature fusion aims to enhance feature representation and enhance perceptual range, which effectively reduces the problem of information loss during the feature fusion process. LRD is inspired by the ResNet and Libra R-CNN models by introducing cross layer connection modules. In this study, multi-scale feature maps of different resolutions in the backbone network are scaled and adjusted to the same scale. Then, these maps are fused and assigned to different levels in the detection stage. The extreme scale object feature information of the backbone network is directly input into the detection stage. This improvement enhances not only the feature expression ability of the model but also its performance in multi-scale object detection tasks.ResultThe training process of this study is conducted under the Ubuntu 20.04.4LTS operating system, with a GPU configured as an NVIDIA RTX 3090 and a graphics memory size of 24 GB. The input image is fixed to 640 × 640 pixels, the batch size is set to 16, and 100 epochs are trained. Other parameter settings are set using the default YOLOv7-tiny settings. The method proposed in this study is compared on two datasets with different scenarios and quantities, namely, the Traffic Detection Dataset TDD and the Coal mine underground drilling site object detection dataset Cmudsodd. This experiment uses YOLOv7-tiny as the benchmark and embeds LSRD-FPN into the YOLOv7-tiny. 
After 100 epochs of training, the experimental results show that the method achieves performance improvements of 1.3% mAP and 0.5% mAP compared with the benchmark model YOLOv7-tiny on the TDD and Cmudsodd datasets, respectively. Despite significant performance improvements, the number of parameters remains at a relatively low level. This study conducts ablation experiments on two sub models of LSRD-FPN, namely, LRD and SRD. The local SRD mechanism achieves improvements of 0.6% mAP and 0.2% mAP on the TDD and Cmudsodd datasets, respectively. The global LRD mechanism achieves improvements of 0.7% mAP and 0.3% mAP on the TDD and Cmudsodd datasets, respectively. Compared with other real-time object detection algorithms with the same number of parameters, the algorithm proposed in this study improves the TDD dataset by 2.6% mAP compared with YOLOv5-s and by 0.2% mAP compared with YOLOv8-n. In contrast to the two algorithms, the Cmudsodd dataset shows improvements of 2.1% mAP and 4.4% mAP. In addition, the frame per second (FPS) of the model proposed in this study is higher than 160, which meets the requirements of real-time detection tasks. Therefore, the proposed method not only improves performance but also exhibits rapid deployment, which can be more quickly applied to practical scenarios.ConclusionThe proposed LSRD-FPN method can effectively improve the detection performance of the object detection model while involving fewer parameters and floating-point operations to ensure that the model meets the requirements of real-time detection speed. In addition, LSRD-FPN can be applied to not only the YOLOv7-tiny model but also other object detection models. The plug and play nature of LSRD-FPN eases its deployment to other object detection models and results in performance improvements.  
      Keywords: object detection; feature fusion; feature pyramid; YOLOv7-tiny; multiscale feature
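      The SRD mechanism above adds the parameter-free SimAM attention after local feature fusion. The sketch below follows the commonly published SimAM formulation (an energy term derived from per-channel variance, turned into sigmoid weights); it is included for orientation and is not the authors' code.

```python
import torch

def simam(x: torch.Tensor, e_lambda: float = 1e-4) -> torch.Tensor:
    """Parameter-free SimAM attention over a (B, C, H, W) feature map."""
    _, _, h, w = x.shape
    n = h * w - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # squared deviation per pixel
    v = d.sum(dim=(2, 3), keepdim=True) / n             # channel-wise variance
    e_inv = d / (4 * (v + e_lambda)) + 0.5               # inverse energy
    return x * torch.sigmoid(e_inv)

fused = torch.randn(2, 128, 40, 40)   # e.g., a fused FPN level
print(simam(fused).shape)             # torch.Size([2, 128, 40, 40])
```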
    • Image classification network with random dilated convolution (AI digest)

      RDCNet achieves a breakthrough in image classification by effectively extracting fine-grained features and improving classification accuracy, pointing to a new direction for image recognition research.
      Jiang Wentao, You Zhuocheng, Yuan Heng
      Vol. 30, Issue 8, Pages: 2790-2807(2025) DOI: 10.11834/jig.240746
      Image classification network with random dilated convolution
      摘要:ObjectiveImage classification tasks are particularly common in modern computer vision tasks. However, with the continuous development of deep learning methods, effectively extracting fine-grained features, suppressing noise interference, and dealing with target features in complex backgrounds remain difficult to be solved. In particular, although residual networks exhibit strong feature learning capabilities, they often struggle to efficiently learn fine-grained features due to the diversity of training data, background noise, and differences between target objects at different scales. Conventional convolutional operations typically ignore detailed information at different scales when dealing with such complex problems. They are susceptible to overfitting when dealing with noisy or irrelevant regions, which leads to performance degradation. This study proposes image classification network with random dilated convolution (RDCNet) to address the abovementioned challenges. The network aims to solve the problems of difficulty in fine-grained feature extraction, background noise interference, and overfitting. Several innovative designs have enabled the extraction of key features from complex backgrounds and improved the classification ability of the network.MethodRDCNet is based on the classical ResNet-34 as the backbone network, which utilizes its powerful residual connectivity property to enhance the training depth and stability of the network. In this study, several innovative modules are proposed on this basis to enhance the feature extraction capability of the network. First, the multi-branch random dilated convolution module is proposed, which realizes the effective capture of fine-grained features from different scales and sensory fields by the convolution operation of multiple branches and the design of randomly inflated convolution kernel. Compared with the traditional convolution operation, the expansion convolution can expand the receptive field without increasing the computational effort. Thus, it enhances the capture of the multi-scale information in the image. The design of the randomized expansion rate ensures that the network can adapt to different target scales and variations by diversifying the convolution kernel structure, which further improves the ability of fine-grained feature extraction. The Fine-Grained Feature Enhancement module is introduced to enhance the sensitivity of the network to small objects and detailed features by fusing global information with local features, which in turn improves the representation of local features. The module extracts global features of an image using a global average pooling operation and models them jointly with local features to help the network better understand contextual information. With this feature enhancement mechanism, the network can classify small differences more accurately, which boosts the ability to recognize tiny targets. At the same time, the introduction of random masking mechanism dynamically masks part of the input features and convolutional kernel weights. This process not only enables learning of more robust representations through diverse feature combinations but also effectively reduces overfitting and improves the ability to adapt to noise and unknown inputs. 
This mechanism is similar to the Dropout technique, but the difference is that the random masking mechanism is based on the dynamic masking of features and convolutional kernels, which has higher flexibility to cope with various complex input data. Finally, the Context Excitation module is proposed to enhance the ability of the network to focus on key features by introducing contextual information and dynamically adjusting the weights of feature channels. In image classification tasks, features in certain regions are commonly critical to the final classification results, while background noise may negatively affect the results. This module helps the network focus on the critical features that contribute to the classification task by adaptively adjusting the importance of each feature channel while suppressing interference from irrelevant regions.ResultThis study conducts experiments on the CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof datasets to verify the effectiveness of RDCNet in image classification tasks. The experimental results show that RDCNet exhibits a significant performance improvement on all the datasets. RDCNet compares with the model with the second highest performance in the CIFAR-10 dataset by 0.02%, which confirms its superiority in fine-grained feature extraction. The classification accuracy of RDCNet on the CIFAR-100 dataset is enhanced by 1.12%, which demonstrates its superiority in dealing with large-scale classification tasks. The classification accuracy of RDCNet on the SVHN dataset is increased by 0.18%, which highlights its superiority in dealing with the Street View Digit Recognition problem. The classification accuracy of RDCNet on the Imagenette dataset is improved by 4.73%, which verifies its ability to recognize target features in more complex backgrounds. The classification accuracy of RDCNet on the Imagewoof dataset is raised by 3.56%, which further proves the excellent performance of the network in different scenarios. In addition, to deeply analyze the roles of each innovative module in RDCNet, this study conducts ablation experiments to further demonstrate their unique contributions. These experiments involve gradually removing key modules from the network to evaluate their impact on the overall performance, which demonstrates how the synergy between the modules enables the network to maintain superior performance when dealing with complex tasks. Through these experiments, this study demonstrates the superiority of RDCNet on multiple datasets, especially in handling complex background, noise, and fine-grained features.ConclusionWe propose RDCNet to effectively solve the fine-grained feature extraction, noise interference suppression, and overfitting problems in image classification tasks. This network introduces an innovative multi-branch randomized cavity convolution module, a fine-grained feature enhancement module, a randomized masking mechanism, and a contextual excitation module. The experimental results show that the classification performance of RDCNet on multiple standard datasets is significantly improved, especially its target recognition ability in complex background. The key contributions of RDCNet are stronger fine-grained feature sensitivity through multi-scale feature extraction, fine-grained feature enhancement, and contextual information modeling. It excels in extracting rich feature information in multi-scales and contexts. 
It focuses on key features and discriminates targets in complex contexts, which lead to excellent performance in classification tasks. In addition, the introduction of the random masking mechanism effectively enhances the robustness of the network and reduces the risk of overfitting, which enable it to perform more stably in complex real-world scenarios. Future research can further explore the application of this method in other visual tasks and its integration with other advanced techniques to boost its performance.  
      Keywords: image classification; residual network; dilated convolution; random dilated convolution; fine-grained feature; random masking mechanism
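      Below is a minimal sketch of the multi-branch random dilated convolution idea described above: parallel 3×3 branches whose dilation rates are randomly sampled, with padding matched to the dilation so every branch keeps the spatial size. The branch count, the dilation pool, and summation as the fusion rule are assumptions, not the authors' exact module.

```python
import random
import torch
import torch.nn as nn

class MultiBranchRandomDilatedConv(nn.Module):
    """Parallel conv branches with randomly sampled dilation rates, summed at the output."""
    def __init__(self, in_ch: int, out_ch: int, num_branches: int = 3,
                 dilation_choices=(1, 2, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList()
        for _ in range(num_branches):
            d = random.choice(dilation_choices)          # dilation sampled at construction time
            self.branches.append(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d))

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)  # all branches keep the spatial size

x = torch.randn(2, 64, 32, 32)
print(MultiBranchRandomDilatedConv(64, 64)(x).shape)        # torch.Size([2, 64, 32, 32])
```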
    • For ship re-identification in foggy weather, a network model based on multiple feature cascade enhancement and cross-layer adaptive fusion effectively improves ship matching accuracy, offering a new solution to the difficulty of identifying ships in fog.
      Sun Wei, Guan Fei, Zhang Xiaorui, Shen Xinyi
      Vol. 30, Issue 8, Pages: 2808-2821(2025) DOI: 10.11834/jig.240646
      Foggy ship re-identification network based on multiple feature cascade enhancement and cross-layer adaptive fusion
      摘要:ObjectiveShip re-identification (ReID) technologies play a crucial role in the field of maritime navigation and surveillance. Their core object is to accurately retrieve other images of the same ship captured by different cameras from a given ship image in the database. These technologies can be regarded as a key sub-problem in the field of image retrieval, which exhibits extensive application prospects and significant practical value in various fields such as maritime traffic monitoring, continuous ship tracking, and maritime criminal investigation. With the increasingly busy maritime traffic and the frequent occurrence of complex weather, ship ReID in foggy weather has become an urgent technical problem to be solved. Most existing ship ReID methods are only applicable under sunny days. In foggy environment, ship images typically suffer from blurred features and loss of details, which introduce significant challenges for accurate ship identification. A foggy ship ReID network (FSRNet) based on multiple feature cascade enhancement and cross-layer adaptive fusion is proposed to address this issue.MethodA multiple feature cascade enhancement (MFCE) module is proposed to address the challenge of identifying fuzzy and difficult-to-discern ship features in foggy images. By extracting the overall and local multiple features of the ship, we solve the problem of image blur and detail loss caused by fog. In the local fine processing, convolution and sigmoid activation function are used to generate weights to weight each pixel of the input feature map, which emphasizes the key details of the ship in the input feature map (e.g., hull edges, structural lines, and logo text). This process is conducive to reducing the fuzzy influence of fog on image details, such that the key features of the ship are enhanced and can still be clearly presented in foggy weather. At the global awareness level, global feature vectors are generated by global average pooling and full connection layer and then applied to the input feature map by weight extension. This global context information enables the overall shape and position of the ship to be accurately captured in foggy images, as well as enhances the clarity of its overall outline. Furthermore, a cross-layer adaptive fusion (CAF) module is proposed, which predicts the importance of shallow and deep features of ResNet50 through adaptive weights. Then, this module integrates these features across layers. Based on the ResNet50 design, CAF aims to transfer the deep semantic information layer by layer to the lower level feature map, which improves the richness of the multi-scale feature representation of ships. Multiple 1 × 1 convolution layers are used to adjust channels for achieving effective fusion of feature graphs at different levels, which ensures channel consistency before feature fusion. Then, bilinear interpolation is used to gradually up-sample the spatial dimension of the deep feature map to the same as that of the shallow layer. This process achieves spatial alignment and avoids the information loss caused by spatial misalignment. After the spatial alignment of feature maps, the method of gradual addition is adopted for feature fusion, which not only can retain the shallow rich detail information but also can integrate the deep high-level semantic information, which is conducive to realizing the complementarity and enhancement of multi-level features. 
Next, the feature maps of each level are transformed into feature vectors through global average pooling, and these feature vectors are combined to form a comprehensive feature representation that integrates multi-level and multi-scale information. Given that the features of different levels of the network are affected to different degrees under foggy weather, flexibly dealing with complex and changeable foggy scenarios by directly blending them with equal weights is difficult. Therefore, an adaptive weight predictor is designed in this study. This predictor is composed of multiple fully connected layers and is used to process the feature input composed of multi-level feature vectors. In addition, a new dataset specifically designed for foggy ship ReID, which is named Warships-Foggy, is constructed. By adjusting the parameters in the atmospheric scattering model, we synthesize ship images under various foggy conditions to simulate real foggy scenes. This way effectively addresses the challenge of training and evaluating ship ReID models in foggy environments.ResultComparison and ablation experiments are conducted on the Warships-Foggy dataset. These experiments include comparison of functions of FSRNet modules, comparison of the function of each module in MFCE, and comparison of different amounts of MFE and DenseBlock integration in MFCE. The experiments also assess the effectiveness of the adaptive weight predictor in CAF, the validity of Dropout layer parameters in CAF, and the performance of different network models on the dataset. The mean average precision is 92.39%, while the cumulative matching characteristics for the top 1, 5, and 10 ranks are 94.35%, 97.58% and 98.39%, respectively. The experimental results show that the proposed network model improves the accuracy of ship matching and shows excellent performance.ConclusionThe network model combines the two tasks of image feature enhancement and ship ReID for the first time, and it realizes ship ReID with high precision.  
      Keywords: foggy ship re-identification; feature enhancement; adaptive weight; feature fusion; ResNet50
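      The MFCE description above combines per-pixel weights (convolution + sigmoid) with global channel weights (global average pooling + fully connected layers). The sketch below cascades these two steps in a generic form; the reduction ratio and layer layout are assumptions rather than the authors' exact module.

```python
import torch
import torch.nn as nn

class LocalGlobalEnhance(nn.Module):
    """Local per-pixel re-weighting followed by global channel re-weighting."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.local = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=3, padding=1),
                                   nn.Sigmoid())
        self.global_fc = nn.Sequential(nn.Linear(channels, channels // reduction),
                                       nn.ReLU(inplace=True),
                                       nn.Linear(channels // reduction, channels),
                                       nn.Sigmoid())

    def forward(self, x):                            # x: (B, C, H, W)
        x = x * self.local(x)                        # emphasize hull edges, structural lines, text
        g = self.global_fc(x.mean(dim=(2, 3)))       # (B, C) global channel weights
        return x * g.unsqueeze(-1).unsqueeze(-1)     # extend the weights over H and W

x = torch.randn(2, 256, 24, 24)
print(LocalGlobalEnhance(256)(x).shape)              # torch.Size([2, 256, 24, 24])
```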
    • Wu Zhize, Chen Xin, Xu Tong, Nian Fudong, Wang XiaoFeng, Li Teng
      Vol. 30, Issue 8, Pages: 2822-2834(2025) DOI: 10.11834/jig.240352
      Dynamic multi-granularity graph convolutional networks for skeleton-based action recognition
      摘要:ObjectiveIn recent years, methods based on graph convolutional networks (GCNs) have become increasingly popular in human skeleton-based action recognition, which resulted in significant strides in this challenging domain. These advances are primarily attributed to the ability of GCNs to model spatial and temporal dependencies inherent in human skeletal data. However, traditional graph convolutions exhibit notable limitations, particularly in capturing interaction information between distant nodes. This shortcoming leads to suboptimal performance in recognizing non-natural connections within the skeleton graph, which is a crucial aspect for accurately modeling complex human actions. Traditional GCNs are adept at processing locally connected nodes, but their efficacy diminishes as the distance between nodes increases. This concern is common in the context of human skeletons, where actions often involve coordinated movements of body parts that are not directly connected. For instance, actions involving simultaneous hand and foot movements necessitate an understanding of long-range dependencies. The inability of conventional GCNs to effectively capture these dependencies results in a limited understanding of the overall action, which reduces recognition accuracy. Moreover, existing approaches that attempt to model complex spatial relationships usually encounter significant issues related to feature redundancy and an exponential increase in parameter count. Although these methods are sophisticated, they tend to generate a large number of redundant features, which not only increase computational complexity but also hamper the overall efficiency of the model.MethodA novel multi-granularity graph structure called the dynamic multi-granularity graph convolutional network (DMG-GCN) is proposed for skeleton graph construction to address the aforementioned challenges. This approach involves designing three different granularity graph structures, with each of them being tailored to capture distinct aspects of the skeletal data. By combining various human body joint points in innovative ways, these multi-granularity graphs enable the model to capture interaction information between non-naturally connected nodes more effectively. This hierarchical representation allows for a more nuanced understanding of the spatial relationships within the skeleton graph. Based on the multi-granularity graph structure, a dynamic adjacency matrix is introduced in spatial modeling. Unlike static adjacency matrices, which remain fixed regardless of the specific action being performed, the dynamic adjacency matrix adapts depending on the current spatial configuration of the nodes. This adaptability ensures a more accurate representation of the semantic relationships between nodes, which leads to improved recognition performance. In addition to the dynamic adjacency matrix, a spatial reorganization convolution module is proposed to mitigate feature redundancy and growing parameter volume. This module operates by cross-reconstructing information-rich and -poor features through separation-reconstruction operations. The module effectively distinguishes and reorganizes these features. Thus, it reduces spatial dimension feature redundancy, which enhances the efficiency and performance of the model. During the feature fusion stage, a new six-stream fusion method is introduced, which leverages the complementary information derived from the three-granularity graph structures. 
This method integrates the diverse insights provided by each granularity level, which leads to a more comprehensive understanding of the skeletal data. The integration of these streams guarantees that the model captures the full spectrum of spatial and temporal dependencies, which significantly improves overall performance.ResultThe efficacy of the proposed approach is confirmed by its performance on benchmark datasets. Compared with the baseline method CTR-GCN, the proposed method achieves improvements of 0.6%, 0.7%, and 0.7% on the NTU-RGB+D, NTU-RGB+D 120, and Northwestern-UCLA datasets, respectively. These improvements are seemingly modest, but they represent significant advancements in the highly competitive field of human skeleton-based action recognition. The ablation studies further validate the effectiveness of the multi-granularity graph structure and spatial channel reconstruction convolution within the proposed architecture. These studies highlight the individual contributions of each component, which demonstrates how the multi-granularity approach enhances the ability of the model to capture complex interactions while the spatial reorganization convolution reduces redundancy and improves efficiency. In addition, comparative visualizations underscore the superiority of the dynamic adjacency matrix over conventional adjacency matrices. These visualizations reveal how the dynamic matrix more effectively captures semantically informative connections between nodes, which facilitates a deeper understanding of complex actions.ConclusionOur DMG-GCN represents a significant advancement in spatiotemporal modeling for human skeleton-based action recognition. By integrating a multi-granularity graph structure with spatial channel reconstruction convolution, this approach expands the receptive field of GCNs and substantially reduces feature redundancy. The dynamic adjacency matrix further enhances the capability of the model to capture intricate semantic relationships, which leads to more accurate and nuanced action recognition. The proposed DMG-GCN not only addresses the limitations of traditional GCNs but also sets a new benchmark for future research in the field. Its innovative approach to handling long-distance node interactions and reducing feature redundancy lays the foundation for developing more advanced and efficient models. As human skeleton-based action recognition continues to evolve, the principles and techniques introduced by DMG-GCN are likely to inspire further advancements. Such innovations will drive the field toward even greater accuracy and applicability in real-world scenarios.  
      Keywords: graph convolution; skeleton-based action recognition; multi-granularity; feature redundancy; reconstruction convolution
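      As a rough single-granularity illustration of a graph convolution with a dynamic adjacency (a fixed skeleton graph refined by a data-dependent similarity term), see the sketch below. The embedding layers and softmax normalization are assumptions, and the multi-granularity graphs and six-stream fusion of DMG-GCN are not reproduced here.

```python
import torch
import torch.nn as nn

class DynamicGraphConv(nn.Module):
    """Graph convolution whose adjacency is the sum of a fixed skeleton graph
    and a data-dependent pairwise-similarity term."""
    def __init__(self, in_ch: int, out_ch: int, adjacency: torch.Tensor):
        super().__init__()
        self.register_buffer("A", adjacency)                 # (V, V) skeleton graph
        self.theta = nn.Linear(in_ch, out_ch)                 # embeddings for similarity
        self.phi = nn.Linear(in_ch, out_ch)
        self.proj = nn.Linear(in_ch, out_ch)

    def forward(self, x):                                     # x: (B, V, C) joint features
        sim = torch.softmax(self.theta(x) @ self.phi(x).transpose(1, 2), dim=-1)
        A_dyn = self.A.unsqueeze(0) + sim                     # static + dynamic topology
        return A_dyn @ self.proj(x)                           # aggregate neighbor features

V = 25                                                        # e.g., NTU-RGB+D joint count
A = torch.eye(V)                                              # placeholder adjacency
x = torch.randn(4, V, 64)
print(DynamicGraphConv(64, 128, A)(x).shape)                  # torch.Size([4, 25, 128])
```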
    • Local spatiotemporal convolutional network for gait recognition (AI digest)

      A recent study advances gait recognition, improving recognition accuracy with a local spatiotemporal convolutional network.
      Ding Xinnan, Ye Nan, Duan Xin, Wang Kejun
      Vol. 30, Issue 8, Pages: 2835-2850(2025) DOI: 10.11834/jig.240710
      Local spatiotemporal convolutional network for gait recognition
      摘要:ObjectiveThe development of technology and the expansion of application scenarios have made biometric recognition technology a mainstream technology in the field of future security authentication. Gait is a biometric property that can be utilized to distinguish target objects based on their walking patterns. However, capturing the motion pattern hidden within a series of frames is challenging due to the complexity of video data. Existing gait recognition methods encounter difficulties in learning gait motion habits under a wide range of recognition conditions and achieving excellent real-time performance. These complexities are due to the influence of random and diverse external factors such as complex backgrounds, pedestrian clothing, walking directions, and illumination changes. We propose a gait recognition method based on the local spatiotemporal convolutional network to autonomously learn gait motion patterns for addressing the aforementioned issue.MethodUnlike introducing manual motion features, this network directly endows the two-dimensional convolutional network with the ability to extract temporal information. It adaptively learns complex underlying structures and patterns in video data driven by data to capture action features. The network can also adapt to different gait data through continuous learning, which makes the model highly universal. Specifically, inspired by the partitioning method, a global bidirectional spatial pooling method was proposed to decrease the dimensionality of gait tensors, and local strips were employed as the fundamental units to describe the details in the gait space. By using global bidirectional spatial pooling, gait features were divided into horizontal and vertical local features in the spatial domain, and the method of partitioning was utilized to focus on gait details while reducing dimensionality. In this way, the local spatiotemporal convolutional layer was designed to integrate the spatial and temporal domains, which allowed for adaptive learning of strip-based gait motion. The local spatiotemporal convolution attempts to involve the spatial domain, channel domain, and time domain in two-dimensional convolution operations. This novel layer allows the temporal and spatial dimensions to participate in the learning of the convolutional network, which enables the two-dimensional convolutional structure to capture spatiotemporal features of gait. Asymmetric convolution can also be extended to local spatiotemporal convolution to construct asymmetric local spatiotemporal convolution layers. Asymmetric convolution can explicitly enhance the representational power of standard square convolution kernels, and integrating asymmetric convolution can better extract spatial features. In addition, a local spatiotemporal pooling method can combine the discriminative local gait spatiotemporal representations from multiple frames to generate more discriminative gait features. By this means, the dimensionality reduction of the gait tensor is achieved, which allows the time-domain dimension to participate in the calculation of two-dimensional convolution, and the spatial representation details of gait are integrated, which reduces the loss of spatial features.ResultExtensive experiments on gait benchmark datasets have demonstrated the effectiveness of various parts of the designed network, and comparisons with other methods have also demonstrated the superiority of this approach. 
The experiments on two benchmark public datasets show that the proposed method outperforms other current gait recognition approaches. The proposed method achieves the best recognition performance under three training settings on the CASIA-B dataset. It achieves average recognition accuracies of 97.3%, 93.7%, and 83.8% under three walking scenarios, respectively, and 85.8% on OU-MVLP dataset. Moreover, the comparison with other recent methods confirms the superiority of this method.ConclusionThe experimental results show that the local spatiotemporal convolutional network has excellent spatiotemporal feature learning capacity and has a positive effect on improving gait recognition performance. These findings demonstrate that the local spatiotemporal convolutional network can effectively learn the spatiotemporal features and show the superiority of the proposed method. Therefore, the proposed local spatiotemporal convolutional network can adaptively capture spatiotemporal features of gait, which provides a new method for research in the field of gait recognition.  
      Keywords: gait recognition; spatiotemporal features; convolutional neural network (CNN); local features; deep learning
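      Below is a minimal sketch of global bidirectional spatial pooling as described above: a per-frame feature map is collapsed into horizontal-strip and vertical-strip descriptors. The max-plus-mean pooling operator is an assumption borrowed from common partition-based gait models, not necessarily the authors' exact choice.

```python
import torch

def global_bidirectional_spatial_pooling(x: torch.Tensor):
    """Collapse a (B, C, H, W) gait feature map into horizontal strips (pool over
    width) and vertical strips (pool over height)."""
    horizontal = x.amax(dim=3) + x.mean(dim=3)   # (B, C, H): one vector per horizontal strip
    vertical = x.amax(dim=2) + x.mean(dim=2)     # (B, C, W): one vector per vertical strip
    return horizontal, vertical

x = torch.randn(8, 256, 16, 11)                   # typical silhouette feature map size
h, v = global_bidirectional_spatial_pooling(x)
print(h.shape, v.shape)                           # (8, 256, 16) and (8, 256, 11)
```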

      Medical Image Processing

    • Single vertebra 2D/3D registration with fusion of local and global features (AI digest)

      In medical image registration, a single-vertebra 2D/3D rigid registration network that fuses local detail and global position features effectively improves registration accuracy and robustness.
      Yang Xiaolong, Zhang Zhancheng, Xu Shaokang, Zhang Baocheng, Luo Xiaoqing, Hu Fuyuan
      Vol. 30, Issue 8, Pages: 2851-2865(2025) DOI: 10.11834/jig.240502
      Single vertebra 2D/3D registration with fusion of local and global features
      摘要:ObjectivePrecise pose estimation of intraoperative X-ray images relative to preoperative computed tomography (CT) volumes is a fundamental aspect of 2D/3D registration, which is essential for improving the accuracy of medical analysis and integration in various fields, including surgical planning, radiation therapy, and neural navigation. Traditional methods can be divided into intensity- and feature-based approaches. In intensity-based methods, different ray-tracing techniques are typically used to project 3D volumes for generating simulated 2D X-rays, which are known as digitally reconstructed radiographs (DRR). An optimizer is then employed to find the optimal spatial transformation that maximizes the similarity between the DRR and the corresponding X-ray. Intensity-based methods can achieve high accuracy, but these methods encounter some challenges, such as long registration time due to iterative pose search and the need for generating numerous DRRs for similarity calculation. Moreover, iterative pose search relies on good initialization; otherwise, it may converge to local maximization, which results in registration failure. Meanwhile, feature-based methods typically utilize landmarks for matching to determine the spatial transformation, which can be specific points of anatomical structures or key points. Alternatively, feature detection operators such as Harris, or segmentation can also be employed to extract features. These methods extract features while filtering out a large amount of image information, which leads to higher computation efficiency but lower accuracy than intensity-based methods. Recently, deep learning-based models have emerged as powerful tools for medical image registration. Their feature extraction capability has efficiently addressed the time-consuming issue of traditional methods. The suitable network is designed to learn the feature representation that can describe the complex mapping relationship between images and corresponding labels. This network directly regresses transformation parameters and avoids the need for extensive searching and sampling in the pose space. The intraoperative planar images of complete spines cannot establish rigid correspondence with the preoperative CT, and existing registration algorithms commonly face issues such as low registration and insufficient robustness when dealing with the complete structures of the spine. To address the issue, we proposed a registration method that combines local detail features and global position features for 2D/3D rigid registration in a single vertebra manner.MethodConvolutional neural networks can enhance the ability of the model to learn the shape, boundaries, and local structures of vertebra, while the Transformer uses the self-attention mechanism to effectively capture global dependencies between images and extract key features of the vertebra. Therefore, by combining the characteristics of both structures, we proposed a multi-stage dual-branch network to effectively extract local and global features from single vertebra images for learning the relationship between features and spatial transformations. The aim was to improve the performance of the regressor. In each stage, the local branch utilizes down-sampling operation and stacked convolution blocks to capture details, edges, and other local information more effectively while reducing computational load. 
The global branch employs convolution-based patch embedding and multi-head self-attention mechanism to capture the positional relationship between various features and reduce the interference of background in images. The feature fusion module, which is based on the channel and spatial attention mechanism, maps the features of different branches to the same feature space and adaptively fuses local details and global contextual features at different stages, which enhances the expressive ability of the model. Moreover, we progressively optimize feature representations in a coarse-to-fine manner within the network to better capture relevant information. This approach improves the perceptual capability of the network at various scales and levels. Finally, we incorporate auxiliary registration heads to predict pose parameters from the multi-stage fused features, which enhances the supervision information available to network and helps the network gradually optimize pose predictions during training. Ultimately, the final registration accuracy is improved. The input of the network is a single grayscale low-dose X-ray image, and the output is the predicted pose parameters of 6 degrees of freedom (DoFs), which are used to obtain the registered image by the projecting operation. Our network is implemented using the PyTorch framework. The input images are resized to 128 × 128 pixels for training, the learning rate is set to 6e-5, and the weight decay is 0.05. The Adam learning procedure is accelerated using a NVIDIA GeForce RTX 3090 GPU device, which takes approximately 6 h for 320 iterations.ResultWe compared our model with 5 state-of-the-art models on 30 simulated datasets, including 3 iterative optimization-based methods (OPT-GO, OPT-GC, and OPT-NGI) and 2 deep learning-based methods (ResRegNet and EFbackbone-based). The quantitative evaluation metrics contained mean target registration error (mTRE), mean absolute error (MAE), and the registration time, and we provided several visualized registration results of each method. Comparative experiments demonstrated that our model outperforms other methods on the single vertebra datasets. The visualized results also indicate that the registered images have minimal pose deviation from the corresponding target images and demonstrate good alignment, which prove that our proposed method can improve the accuracy in the single vertebra 2D/3D image registration tasks. The average mTRE is 1.40 mm, and the average MAE of 6DoF pose parameters is 0.008. Compared with the second-best method, our model reduces mTRE by approximately 2.70 mm and MAE by 0.02. Furthermore, we conducted a series of ablation experiments and provided corresponding quantitative metrics. These experiments clearly demonstrate the effectiveness of each module in our model, including the local branch, global branch, dual-branch feature fusion module, and auxiliary registration heads.ConclusionIn this study, we proposed a dual-branch single vertebra 2D/3D registration model that contains feature fusion modules to integrate local and global features, which improves the registration accuracy. Moreover, the auxiliary registration heads realized through the features at different layers can enhance supervision information, which increases the stability of the registration model. The experimental results show that our model outperforms several state-of-the-art registration methods and further enhances the registration performance.  
      Keywords: medical image; 2D/3D registration; single vertebra; deep learning; feature fusion
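      To make the regression target concrete, the sketch below shows a generic head that maps a fused feature vector to the 6 pose degrees of freedom, together with the optimizer settings reported above (Adam, learning rate 6e-5, weight decay 0.05). The backbone is omitted, and the feature dimension and L1 training loss are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PoseRegressionHead(nn.Module):
    """Map a fused feature vector to 6 DoF pose parameters (3 rotations + 3 translations)."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(inplace=True),
                                 nn.Linear(256, 6))          # (rx, ry, rz, tx, ty, tz)

    def forward(self, feats):                                # feats: (B, feat_dim)
        return self.mlp(feats)

head = PoseRegressionHead()
# Optimizer settings quoted in the abstract (learning rate 6e-5, weight decay 0.05).
optimizer = torch.optim.Adam(head.parameters(), lr=6e-5, weight_decay=0.05)
pose = head(torch.randn(2, 512))
loss = nn.functional.l1_loss(pose, torch.zeros_like(pose))   # e.g., L1 against ground-truth 6 DoF
loss.backward()
optimizer.step()
print(pose.shape)                                            # torch.Size([2, 6])
```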

      Remote Sensing Image Processing

    • In remote sensing image segmentation, GPDEA-UNet, an asymmetric building segmentation network with global perception and detail enhancement, effectively improves building segmentation accuracy.
      Xu Shengjun, Liu Yurui, Liu Erhu, Liu Jun, Shi Ya, Li Xiaohan
      Vol. 30, Issue 8, Pages: 2866-2883(2025) DOI: 10.11834/jig.240629
      Global perception and detail enhancement network for building segmentation in remote sensing images
      摘要:ObjectiveRemote sensing images are a type of earth observation data with wide coverage, rich spectral information, and variable target structures. The advancement of computer technology has steadily increased the demand for accurate and efficient extraction of buildings across diverse domains and industries. Meanwhile, the application prospects of semantic segmentation techniques in remote sensing image have progressively demonstrated substantial practical significance. By utilizing the semantic segmentation technology for remote sensing images, detailed information such as the spatial distribution and density of buildings and other infrastructures can be efficiently extracted. This information will play a crucial role in land surveying, urban planning, and post-disaster assessments. However, this advancement has simultaneously increased the complexity of semantic segmentation for buildings in remote sensing images. Consequently, the challenge of efficiently and accurately extracting building information from high-resolution imagery has emerged as a pivotal concern in the field of semantic segmentation of remote sensing images, which demands urgent attention and resolution. In recent years, deep learning has notably advanced in the field of semantic segmentation of remote sensing images. These advancements are due to its ability to learn any data distribution without requiring prior statistical knowledge of the input data, its capacity for self-learning target features, and its strong generalization capabilities. However, the process of semantic segmentation for remote sensing images of buildings faces substantial obstacles, which are primarily due to robust interferences such as varying lighting conditions, seasonal changes, and complex background information, as well as the intricate architectural structures and edge details of the buildings themselves. To address these challenges, this study proposes a global perception and detail enhancement asymmetric-UNet (GPDEA-UNet) network for building semantic segmentation in remote sensing images.MethodFirst, the proposed network using UNet architecture constructs a feature encoder module based on the selective state space module. This module is specifically designed to meticulously extract the texture, boundary, and deep semantic features of buildings in remote sensing images. It leverages the visual state space as its fundamental building block and incorporates dynamic convolution decomposition (DCD) to significantly enhance the extraction of intricate features and context information in the remote sensing images while effectively reducing computational overhead. Second, a multi-scale dual cross-attention (MDCA) module is introduced to further broaden the global receptive field of the network and tackle the semantic discrepancy challenges posed by the codec during skip connections. MDCA represents an advanced attention-weighting mechanism that harmoniously integrates cross-channel attention and cross-spatial attention. This module substantially enhances the capability of the network to extract and fuse feature information pertinent to the region and boundary of the segmented target. Meanwhile, it effectively resolves the interdependencies among multi-scale encoder features in channel and spatial dimensions, which bridges the semantic gap between encoder and decoder features. 
Finally, a detail enhancement decoder module is designed to restore the resolution of the extracted feature maps and to address the loss of image detail during upsampling. This module builds on DCD and incorporates a cascade upsampling (CU) module, which is engineered to capture richer semantic information, retain feature details and semantic integrity, and ultimately ensure accurate, fine-grained segmentation results. By integrating these components, the network achieves highly refined segmentation of buildings in remote sensing images. Result: Experimental results demonstrate the strong robustness of GPDEA-UNet across datasets. On the WHU Aerial Imagery Dataset (WHU), the network achieves an intersection over union (IoU) of 91.60%, precision of 95.36%, recall of 95.89%, and an F1-score of 95.62%. On the Massachusetts Building Dataset, it attains an IoU of 73.51%, precision of 79.44%, recall of 86.81%, and an F1-score of 82.53%. Compared with other state-of-the-art networks, these quantitative indicators show that GPDEA-UNet attains optimal performance on the WHU dataset and optimal or near-optimal performance on the Massachusetts Building Dataset. Qualitative analysis further shows that the proposed network produces superior segmentation results on both datasets and maintains high-quality segmentation even for remote sensing images with inferior imaging quality, such as low resolution, noise, or occlusion. Conclusion: An asymmetric building segmentation network with global perception and detail enhancement is proposed by combining a selective state space module with a multi-scale dual cross-attention mechanism. Experiments on two remote sensing datasets show that the proposed network effectively improves the accuracy and visual quality of building segmentation and exhibits remarkable robustness and versatility. The high precision and recall achieved in the experiments highlight its ability to excel not only in high-quality building segmentation but also in challenging scenarios. This study shows that the proposed network has excellent generality and application potential in remote sensing image segmentation and provides a new idea and method for remote sensing image processing research and applications.  
      Keywords: remote sensing images; building segmentation; visual state space; dynamic convolution decomposition (DCD); cross-attention; detail enhancement
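      A minimal PyTorch sketch may help make the dual cross-attention idea in the abstract above concrete: a decoder feature re-weights an encoder skip feature first along the channel dimension and then along the spatial dimension before fusion. The module name, reduction ratio, pooling choices, and single-scale setting are illustrative assumptions, not the GPDEA-UNet implementation.

      # Illustrative sketch of a dual (channel + spatial) cross-attention fusion
      # between an encoder skip feature and a decoder feature; not the paper's code.
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class DualCrossAttentionFusion(nn.Module):
          def __init__(self, enc_channels: int, dec_channels: int, reduction: int = 8):
              super().__init__()
              # project the encoder skip feature to the decoder's channel width
              self.enc_proj = nn.Conv2d(enc_channels, dec_channels, kernel_size=1)
              # cross-channel attention: pooled decoder context re-weights encoder channels
              self.channel_gate = nn.Sequential(
                  nn.Linear(dec_channels, dec_channels // reduction),
                  nn.ReLU(inplace=True),
                  nn.Linear(dec_channels // reduction, dec_channels),
                  nn.Sigmoid(),
              )
              # cross-spatial attention: avg/max maps of both streams produce a spatial mask
              self.spatial_gate = nn.Sequential(
                  nn.Conv2d(4, 1, kernel_size=7, padding=3),
                  nn.Sigmoid(),
              )
              self.fuse = nn.Conv2d(2 * dec_channels, dec_channels, kernel_size=3, padding=1)

          def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
              # bring the skip feature to the decoder's resolution and width
              enc = self.enc_proj(enc_feat)
              enc = F.interpolate(enc, size=dec_feat.shape[-2:], mode="bilinear", align_corners=False)

              # 1) cross-channel attention driven by globally pooled decoder context
              ctx = dec_feat.mean(dim=(2, 3))                      # (B, C)
              enc = enc * self.channel_gate(ctx)[:, :, None, None]

              # 2) cross-spatial attention from avg/max maps of both streams
              maps = torch.cat(
                  [enc.mean(1, keepdim=True), enc.amax(1, keepdim=True),
                   dec_feat.mean(1, keepdim=True), dec_feat.amax(1, keepdim=True)], dim=1)
              enc = enc * self.spatial_gate(maps)

              # fuse the re-weighted skip feature with the decoder feature
              return self.fuse(torch.cat([enc, dec_feat], dim=1))

      if __name__ == "__main__":
          skip = torch.randn(2, 256, 32, 32)   # encoder skip feature
          dec = torch.randn(2, 128, 64, 64)    # upsampled decoder feature
          out = DualCrossAttentionFusion(256, 128)(skip, dec)
          print(out.shape)                     # torch.Size([2, 128, 64, 64])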
    • In the field of semantic segmentation of remote sensing images, the authors propose an efficient method that fine-tunes the large-scale model SAM to achieve efficient semantic segmentation of remote sensing images, with performance superior to other deep learning methods.
      Liu Siyong, Zhao Yili
      Vol. 30, Issue 8, Pages: 2884-2896(2025) DOI: 10.11834/jig.240540
      DP-SAM: efficient semantic segmentation of remote sensing images by fine-tuning SAM
Abstract: Objective: The segment anything model (SAM) has become a large-scale model benchmark in the field of zero-shot segmentation of natural images. However, remote sensing images contain diverse, complex scenes and rich semantic information, and SAM is a segmentation model that depends on prompt information; directly applying this large model to remote sensing image segmentation therefore suffers from oversegmentation and requires numerous manual, expert-level prompts. In addition, processing high-resolution remote sensing images in this way itself requires substantial computing resources. In response to these problems, this study proposes an effective SAM fine-tuning method to cope with the complexity of remote sensing image segmentation. The method combines the advantages of convolutional neural networks (CNNs) and Transformers, and by carefully adjusting the parameters and structure of SAM, it aims to reduce the risk of oversegmentation and the reliance on manual prompts, thereby improving the applicability and effectiveness of the model in this field. In this way, the method better adapts to the characteristics of remote sensing images, improves the accuracy and efficiency of segmentation, and provides a more reliable solution for remote sensing image segmentation tasks. Method: First, the SAM image encoder is frozen during training so that its original pretrained weights are retained, and a new lightweight CNN encoder path based on ResNet18 is introduced for fine-tuning. This design exploits the strong prior knowledge of the original SAM image encoder while allowing the dual-path segment anything model (DP-SAM) to learn through the additional CNN path, which helps avoid forgetting in the model. Second, the decoder also has two paths. The mask decoder adopts a fine-tuned, prompt-free approach, which eliminates the dependence on the prompt encoder module and makes the model more flexible and versatile. The CNN decoder path concatenates the shallow-to-deep semantic features of the CNN encoder so that more feature information can be fused. In this way, the CNN decoder and the mask decoder each output a prediction mask; comparing the effect of fusing the two masks with that of outputting them separately is therefore an informative experiment for deriving the optimal segmentation strategy. We name this fine-tuned two-path model DP-SAM. It not only improves segmentation accuracy but also remains lightweight and versatile, providing a more comprehensive and efficient SAM fine-tuning solution for remote sensing image segmentation. Result: We evaluate DP-SAM on two publicly available labeled datasets, Potsdam and Vaihingen. Both consist of high-resolution images of complex scenes, including dense streets and large building complexes. These complex scenes pose challenges for remote sensing image segmentation, requiring models with good generalization and sensitivity to detail; evaluation on these two datasets therefore gives a comprehensive picture of how DP-SAM performs on high-resolution remote sensing images of complex scenes. Experimental results show that DP-SAM performs well in semantic segmentation. On the Potsdam dataset, DP-SAM achieves 86.2% mIoU and a 92.7% F1-score; on the Vaihingen dataset, it achieves 85.9% mIoU and a 92.4% F1-score. 
These results highlight the strong performance and robustness of DP-SAM in remote sensing image segmentation tasks. We also conducted ablation experiments to determine whether the prediction masks generated by the two decoder paths should be fused or output separately when training the fine-tuned model; these experiments verified the optimal mask-generation strategy for the dual-path DP-SAM decoder. Conclusion: The proposed method brings large-scale models to the semantic segmentation scenarios required in remote sensing. DP-SAM not only effectively captures semantic information in images but also produces highly accurate segmentation results, which is crucial for remote sensing image processing and analysis. Moreover, it provides reliable technical support and solutions for practical applications such as map production and environmental monitoring and promotes the widespread practical use of remote sensing technology. Through this method, we demonstrate the strong performance of large models in complex scenes, bring improved accuracy and efficiency to remote sensing image analysis, and promote the development and application of remote sensing technology. The source code of this work will be available at https://github.com/Jacky-Android/DP-SAM.  
      Keywords: segment anything model (SAM); zero-shot; semantic segmentation of remote sensing images; image encoder; prompt-free; mask decoder
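      To make the dual-path design described above concrete, here is a minimal PyTorch sketch, assuming a frozen pretrained image encoder standing in for SAM's ViT encoder, a trainable ResNet18 path, and two lightweight prompt-free heads. The stub encoder, head shapes, class count, and averaging fusion are illustrative assumptions rather than the released DP-SAM code, which would wrap the segment-anything encoder and mask decoder directly.

      # Illustrative dual-path skeleton: a frozen encoder path plus a trainable
      # ResNet18 path, each feeding its own decoder head; not the DP-SAM code.
      import torch
      import torch.nn as nn
      import torch.nn.functional as F
      from torchvision.models import resnet18

      class DualPathSegmenter(nn.Module):
          def __init__(self, frozen_encoder: nn.Module, frozen_dim: int,
                       num_classes: int, fuse_masks: bool = True):
              super().__init__()
              self.frozen_encoder = frozen_encoder
              for p in self.frozen_encoder.parameters():   # keep pretrained prior knowledge fixed
                  p.requires_grad = False

              # trainable CNN path: ResNet18 backbone up to its last conv stage
              cnn = resnet18(weights=None)
              self.cnn_encoder = nn.Sequential(*list(cnn.children())[:-2])  # (B, 512, H/32, W/32)

              # two lightweight, prompt-free heads (placeholders for the real decoders)
              self.frozen_head = nn.Conv2d(frozen_dim, num_classes, kernel_size=1)
              self.cnn_head = nn.Conv2d(512, num_classes, kernel_size=1)
              self.fuse_masks = fuse_masks

          def forward(self, x: torch.Tensor):
              h, w = x.shape[-2:]
              with torch.no_grad():                         # frozen path: no gradient updates
                  f_frozen = self.frozen_encoder(x)
              f_cnn = self.cnn_encoder(x)

              m_frozen = F.interpolate(self.frozen_head(f_frozen), size=(h, w),
                                       mode="bilinear", align_corners=False)
              m_cnn = F.interpolate(self.cnn_head(f_cnn), size=(h, w),
                                    mode="bilinear", align_corners=False)

              if self.fuse_masks:                           # fusion vs. separate output is the ablation axis
                  return (m_frozen + m_cnn) / 2
              return m_frozen, m_cnn

      if __name__ == "__main__":
          # stand-in for the frozen SAM image encoder: any module mapping images to a feature map
          dummy_encoder = nn.Sequential(nn.Conv2d(3, 256, kernel_size=16, stride=16))
          model = DualPathSegmenter(dummy_encoder, frozen_dim=256, num_classes=6)
          logits = model(torch.randn(1, 3, 512, 512))
          print(logits.shape)                               # torch.Size([1, 6, 512, 512])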