Abstract: Given the continuous expansion of training datasets and the rapid evolution of deep learning architectures, vision foundation models and large language models have demonstrated remarkable generalization and adaptability across diverse downstream tasks, drawing increasing attention from the research community. In remote sensing (RS), data are highly heterogeneous across sources, modalities, spatial scales, and temporal dimensions. Designing pretrained RS foundation models (RSFMs) that capture such complex geospatial dependencies is critical for robust feature representation and intelligent interpretation of RS imagery. This paper presents a comprehensive review of recent progress in pretraining strategies for RSFMs, with emphasis on unimodal and multimodal learning paradigms. For unimodal models, representative frameworks based on self-supervised contrastive learning and masked image modeling are summarized. They leverage large-scale optical, hyperspectral, and radar imagery to learn transferable visual representations. These pretraining methods substantially enhance downstream performance in land cover classification, object detection, semantic segmentation, and change detection. For multimodal models, we analyze the integration of image-text, image-location, and image-audio modalities through contrastive alignment strategies and cross-modal embedding learning, which improves semantic coherence, generalizability, and interpretability in geospatial representation learning. Furthermore, widely adopted RS pretraining datasets are systematically summarized, including their data sources, modality compositions, spatial resolutions, and annotation characteristics. Representative datasets, such as BigEarthNet, SEN12MS, and SkySenseGPT, are reviewed to demonstrate the diversity and scale of existing data resources. 
The importance of building open, standardized, and reproducible data repositories is emphasized, as these datasets serve as the foundation for training scalable and generalizable RSFMs. From a methodological perspective, this paper discusses the major pretraining paradigms that have shaped the current landscape of RSFMs, including contrastive self-supervised learning, generative self-supervised learning, and hybrid teacher-student distillation. These paradigms aim to maximize representational consistency between augmented views, reconstruct masked information, and align intermediate features between models, thereby enabling the extraction of semantically rich and transferable geospatial features. Despite these advances, several challenges remain unresolved in the development of RSFMs. Data-related issues, such as the scarcity of well-annotated multimodal datasets, geographic and temporal imbalance, and high acquisition costs, continue to hinder large-scale model training. Model scalability poses another limitation, as the billion-parameter-level architectures demand extensive computational resources and energy consumption during training and inference. Moreover, current RSFMs often suffer from limited cross-domain and cross-sensor generalization, thereby leading to performance degradation when applied to new regions or modalities. Transparency and interpretability also remain pressing concerns, as understanding the internal mechanisms of deep RSFMs and improving their robustness against adversarial perturbations are essential for reliable real-world deployment. Future research may address these challenges by focusing on developing scalable multimodal architectures that can jointly process optical, synthetic aperture radar, hyperspectral, and textual data, as well as by designing lightweight RSFMs through model compression, sparse training, and modular architecture optimization. 
Improving cross-domain and cross-temporal generalization by incorporating domain adaptation, meta-learning, and transfer learning techniques will further enhance model robustness under diverse acquisition conditions. In addition, integrating explainable artificial intelligence approaches, uncertainty quantification, and attention-based visualization can improve the interpretability and trustworthiness of RSFMs, thereby enabling their safe application in operational RS systems. Overall, this paper provides a systematic and forward-looking overview of the current development status, pretraining methodologies, benchmark datasets, and existing challenges of RSFMs. This work aims to offer a theoretical and methodological reference for the future construction of intelligent, scalable, and trustworthy foundation models in the RS domain by consolidating advances in unimodal and multimodal pretraining paradigms.
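The contrastive self-supervised paradigm described in this abstract, which maximizes representational consistency between augmented views, is commonly written as an InfoNCE-style objective. The following is a minimal sketch under that assumption; the function names, the temperature value, and the toy embeddings are illustrative and are not taken from any specific RSFM covered by the review.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    view_a[i] and view_b[i] are assumed to be embeddings of two augmented
    views of the same image; every other pairing acts as a negative.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy batch: two images, 2-D embeddings (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(anchors, aligned)
loss_swapped = info_nce(anchors, swapped)
```

The loss is small when matching views are the most similar pair in the batch and large when they are not, which is the behavior contrastive pretraining relies on.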
Abstract: Objective: Charts are a vital tool for information presentation in research and business analysis, offering a clear and intuitive understanding of complex data relationships and trends. However, the inability to access the underlying raw data of these visual representations creates a serious barrier to in-depth analysis and further data utilization. Chart data extraction (CDE) technology aims to bridge this gap by accurately extracting data from visual charts and converting them into structured numerical values or tables, enabling complex metric calculations, chart type conversions, and other downstream tasks. Method: This study addresses the challenges of CDE in the Chinese language context. We present the construction of a large-scale Chinese bar chart dataset, which contains 58,712 images covering vertical, horizontal, and stacked bar charts, with complex scenarios (e.g., multiangle text rotation). We derive specialized datasets for chart text recognition, text classification, and legend detection, providing a robust foundation for Chinese chart understanding tasks. To evaluate and compare different CDE approaches, we propose two benchmark models. Rule-based chart data extraction: this method combines a text recognition algorithm, the differentiable binarization network (DBNet), with an object detection algorithm, You Only Look Once version 5 (YOLOv5), to extract textual elements and graphical features. A rule-based library, designed around chart structure characteristics, is then employed for data structure reconstruction. Large model fine-tuning for chart data extraction: this approach takes a pretrained large vision-language model (i.e., Qwen-VL-2B) and adapts it to the CDE task through parameter-efficient fine-tuning with low-rank adaptation. This method leverages the model’s general knowledge representation capabilities and enhances its performance on specific tasks with minimal computational resources. 
To assess the performance of the proposed methods and existing state-of-the-art models, we design and implement a CDE and type conversion system by using PyQt5. This system integrates multiple models, including the rule-based approach, the fine-tuned large model, and other open-source CDE models, enabling users to easily extract data and convert chart types. Result: Experiments conducted on the Chinese bar chart dataset demonstrate the effectiveness of the proposed methods. Among the compared methods, the rule-based approach achieves the best performance overall (F1 score of 69.97%), particularly for charts without data labels. This success can be attributed to the use of optical character recognition and object detection models specifically trained for the dataset, along with the image segmentation and multiscale line sampling true value correction algorithm. On the English Understanding Data Visualization via Question Answering (DVQA) dataset, the fine-tuned large model outperforms existing state-of-the-art methods, achieving an F1 score of 61.28%. This result highlights the model’s strong generalization capabilities and robustness in cross-lingual scenarios. Qualitative analysis further confirms the effectiveness of the proposed methods. In comparison with other models, our approach demonstrates superior performance in handling complex chart structures and irregular text, accurately extracting metadata from charts. Ablation experiments are also conducted to investigate the contributions of the different components of the large model fine-tuning approach. Results reveal that the combined fine-tuning of the visual encoder, language decoder, and cross-modal adapter achieves excellent performance, indicating the necessity of a holistic optimization strategy for complex CDE tasks. Conclusion: This study presents a comprehensive approach to CDE in the Chinese language context. 
The construction of a large-scale Chinese bar chart dataset and the proposal of two benchmark models provide valuable resources and reference standards for future research. The developed CDE and type conversion system demonstrates the practical application potential of the proposed methods. The current dataset focuses on bar charts and synthetic data, so future work may integrate real-world charts and additional chart types to enhance dataset realism and diversity. Further research should also investigate other robust model structures and training methods to address the limitations of large models in handling complex chart structures and irregular text. The developed system can be further expanded with functionalities such as additional chart type conversions and chart style modifications. In conclusion, this study contributes to the field of Chinese CDE by offering new solutions and promoting the development of multimodal data analysis. We believe that further research and development in this area will unlock the full potential of chart data and enable efficient and insightful data analyses in various domains. The dataset is available at https://doi.org/10.57760/sciencedb.j00240.00052. The code is available at https://github.com/maqiuping59/ChineseChartExtract.
Keywords: large language model fine-tuning; multimodal data extraction; Chinese chart dataset; vision-language joint learning; data visualization reverse engineering
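One step that a rule-based pipeline of the kind described above typically needs is converting a detected bar's pixel position into a data value by using the recognized axis tick labels. The sketch below illustrates that idea with a least-squares line through the ticks. It is an assumption about how such a rule could look, not the authors' rule library; `fit_axis`, `bar_value`, and the tick coordinates are hypothetical.

```python
def fit_axis(ticks):
    """Fit value = slope * pixel_y + intercept from (pixel_y, value) tick pairs.

    The tick pairs are assumed to come from text recognition (label values)
    and detection (label positions) on the value axis.
    """
    n = len(ticks)
    mean_px = sum(px for px, _ in ticks) / n
    mean_val = sum(val for _, val in ticks) / n
    sxx = sum((px - mean_px) ** 2 for px, _ in ticks)
    sxy = sum((px - mean_px) * (val - mean_val) for px, val in ticks)
    slope = sxy / sxx
    intercept = mean_val - slope * mean_px
    return slope, intercept


def bar_value(bar_top_px, slope, intercept):
    """Map the pixel row of a bar's top edge to a data value."""
    return slope * bar_top_px + intercept


# Hypothetical vertical bar chart: pixel rows grow downward in image space.
ticks = [(400, 0.0), (300, 50.0), (200, 100.0)]
slope, intercept = fit_axis(ticks)
value = bar_value(250, slope, intercept)  # bar top detected at row 250
```

Using all recognized ticks rather than only two makes the mapping less sensitive to a single misread label, at the cost of assuming a linear axis.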
Abstract: Objective: Motion deblurring is an important yet complex task in image restoration. In many practical scenarios, images exhibit complex motion blur because of camera shake, rapidly moving objects, or changes in shooting conditions. This type of blur is often nonuniform and causes a substantial loss of high-frequency details. High-frequency details are crucial for image clarity, recognition, and visual experience, making their restoration a research hotspot in image processing. However, existing deblurring methods typically perform poorly when confronted with these complexities, and the restored images often suffer from noticeable blurriness and detail loss. This not only affects image quality but also limits performance in high-precision application scenarios, such as video surveillance, autonomous driving, and medical imaging. Therefore, designing an effective motion deblurring method that can handle complex nonuniform blur has become an urgent problem. Method: A motion image deblurring method based on a Transformer architecture and multiscale feature fusion is proposed in this study to address these challenges. The method uses an encoder-decoder network structure, which is widely used in deep learning because of its strong feature extraction and restoration capabilities in image tasks. In the encoder part, a module that combines convolutional neural networks (CNNs) and a dual-attention mechanism is designed. CNNs effectively extract local features from images, and the dual-attention mechanism guides the model’s focus to important areas in the image. This combination not only enhances the depth and breadth of feature extraction but also substantially improves the model’s ability to capture key details, providing a solid foundation for subsequent deblurring. 
In the feature fusion section, a multiscale feature fusion module is employed to enhance feature expressiveness through a gated deep convolutional feedforward network and a feature enhancement module. The multiscale feature fusion strategy allows the network to integrate feature information from different scales, ensuring that details are preserved during the restoration process and effectively addressing varying degrees of blur across different regions of the image. The feature enhancement module extracts high-frequency information to enhance the clarity and realism of the final image, ensuring that critical details are not lost during reconstruction. In the decoder part, a modified Transformer module is introduced. Fourier transform is integrated into the feature extraction process within the feedforward network layer to enhance frequency-domain information. Fourier transform aids in processing images in the frequency domain, providing rich frequency information and effectively improving the model’s ability to recover subtle features in blurred images. This improvement offers a new approach to feature processing, allowing for a high level of restoration performance in actual deblurring tasks. Result: To evaluate the effectiveness of the proposed method, extensive comparative experiments are conducted on the GoPro and Human-Centric Indoor Deblurring (HIDE) datasets, and in-depth comparisons with existing mainstream methods are performed. Experimental results indicate that the proposed method achieves considerable advantages across multiple metrics. On the GoPro dataset, the peak signal-to-noise ratio (PSNR) reaches 32.70 dB, and the structural similarity index measure (SSIM) is 0.954, indicating that the quality of the restored images is very high and details are largely recovered. On the HIDE dataset, PSNR reaches 30.53 dB, with an SSIM of 0.922, further validating the effectiveness and adaptability of the proposed method. 
Ablation experiments are performed to provide in-depth analysis of each module’s contributions, verifying the positive effect of the proposed innovative points on motion image deblurring. These experiments clearly demonstrate the unique roles of each module and provide quantitative evidence that confirms the effectiveness of multiscale feature fusion and the Transformer module. Conclusion: The motion deblurring method that is based on the Transformer architecture and multiscale feature fusion substantially outperforms existing mainstream methods in experiments conducted on the GoPro and HIDE datasets. This research provides a novel solution to the critical issue of motion image deblurring and showcases the potential applicability of the proposed method in practical scenarios, particularly in image processing, video surveillance, and autonomous driving. Future research efforts could focus on extending the principles underlying this method to other image restoration tasks, such as denoising and super-resolution, further enhancing the model’s versatility and robustness. Moreover, exploring the incorporation of domain-specific knowledge and prior information related to the nature of blur into the deblurring process could guide the model toward achieving highly favorable outcomes in real-world applications. Continual optimization and refinement of this approach are expected to create opportunities in the field of image restoration. By doing so, this method has the potential to substantially contribute to the advancement of related technologies and applications, including but not limited to improving image quality in photography and enhancing the performance of image processing systems used in various industries. In summary, the proposed motion deblurring method represents a remarkable advancement in the ongoing effort to overcome the challenges posed by motion blur in images. 
With the advent of deep learning techniques and modern image processing methodologies, these advancements can be leveraged to develop solutions that directly address the limitations of existing approaches. Whether used to enhance the quality of consumer photography, improve security camera footage, or refine images captured in autonomous driving systems, successful motion deblurring techniques have far-reaching implications. Related resources are available at https://doi.org/10.57760/sciencedb.j00240.00170 and https://github.com/zh7546/project.git.
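The PSNR values reported in this abstract follow the standard definition, PSNR = 10 log10(peak^2 / MSE), where MSE is the mean squared error between the restored and reference images. The sketch below computes it on flat lists of pixel intensities; the function name, the peak value of 1.0, and the toy pixels are assumptions for illustration, and the paper's exact evaluation protocol is not specified here.

```python
import math


def psnr(reference, restored, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized pixel lists.

    `peak` is the maximum possible intensity (1.0 for normalized images,
    255 for 8-bit images).
    """
    mse = sum((r - t) ** 2 for r, t in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy example: every pixel is off by 0.1 on a [0, 1] intensity scale.
reference = [0.0, 0.0, 0.0, 0.0]
restored = [0.1, 0.1, 0.1, 0.1]
score = psnr(reference, restored)
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.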
Abstract: Objective: In 3D scene reconstruction and novel view synthesis, neural radiance fields (NeRF) stand as a landmark advancement that has significantly driven the development of neural rendering-based 3D scene reconstruction methods. Moreover, 3D Gaussian splatting (3DGS) has emerged as a simple yet computationally efficient approach, gaining recognition for its fast training and rendering. However, these methods typically operate only under ideal input conditions and often struggle to handle motion blur, a critical issue given that motion blur in input images severely degrades rendering quality. Some existing methods attempt to address this issue by employing trainable parameters or models during reconstruction to predict the trajectory of camera motion blur. Nevertheless, this predictive approach fails to reflect the real trajectory of camera movement accurately, which limits the accuracy of 3D scene reconstruction. Other methods integrate deblurring models with the 3DGS framework to accelerate model training and rendering. These methods, however, suffer from quality degradation, such as artifacts when rendering novel views, because of the risk of model overfitting during training. This study proposes a novel method that fuses event camera data with the 3DGS framework to overcome the limitations of the aforementioned approaches. Method: An event camera triggers event responses only when the pixel brightness changes beyond a preset threshold, enabling it to accurately capture the instantaneous motion trajectories of objects and subtle brightness fluctuations in dynamic scenes. The proposed method fully leverages the unique advantage of event cameras: microsecond-level temporal resolution. 
For the sequential grayscale images output by the model, the method designs an event bin calculation module: first, it divides the grayscale image sequence at fixed time intervals and defines the time window corresponding to two consecutive frames as an event bin unit. Subsequently, it counts the number of events triggered by brightness changes for each pixel within this unit, thereby generating event bin data predicted by the model. The method further introduces real event bins as supervision signals to ensure the effectiveness of event information supervision. Real event bins are constructed using synchronously collected raw event camera data, and the pixel-level L2 error between the predicted event bins and the real event bins is calculated. Then, this error is incorporated into the model’s loss function. This supervision mechanism forces the model to learn the laws of the continuous blurring process at high temporal resolution, thereby avoiding traditional methods’ problem of losing details of dynamic processes because of reliance on discrete frame images. The information loss of traditional frame images in dynamic scenes can be compensated by combining the event bin calculation module with the 3DGS framework, thereby providing rich spatiotemporal constraints for deblurring reconstruction. Second, the method innovatively introduces a Gaussian shape attribute transformation network, which establishes a prediction and adjustment mechanism for attributes such as scaling and rotation of each 3D Gaussian primitive in the 3DGS framework. During the model training phase, this transformation mechanism achieves dynamic optimization of Gaussian primitives through iterative adjustments: after each training epoch, the parameters of all Gaussian primitives are updated based on the transformation attributes predicted by the network, gradually correcting the deviation between the initial Gaussian primitives and the real scene structure. 
This approach effectively prevents the Gaussian model from overfitting to the training views. During the rendering process, these transformation attributes can dynamically adjust the spatial distribution of 3D Gaussian primitives in real time, thereby allowing the Gaussian primitives to conform to the motion states and geometric structures of dynamic objects in the scene. As a result, the constraints of the original view data on the model are broken, and the degradation of rendering quality caused by geometric mismatches is avoided when switching to novel views. Result: Experimental results demonstrate the superiority of the proposed method. On synthetic datasets, it outperforms the existing methods in terms of peak signal-to-noise ratio, structural similarity index, and learned perceptual image patch similarity. This finding indicates a significant reduction in image noise and blur, and the rendered results exhibit structural information (such as edges and textures) that is highly consistent with real scenes. On real-world datasets, the method achieves a substantial reduction in the blind/referenceless image spatial quality evaluator value, thereby verifying the method’s deblurring capability in real and complex scenarios. Compared with NeRF-based methods, the proposed method reduces the training time from 48 h to less than 1 h (representing an approximately 50-fold improvement in training efficiency) and achieves a real-time rendering speed of 140 frames/s. Conclusion: Comprehensive experimental results confirm that the proposed method addresses the shortcomings of existing neural rendering-based deblurring reconstruction techniques in novel view generalization ability. It also achieves dual improvements in image clarity and computational efficiency through innovations in data fusion and network structure. This method provides a practical technical solution for high-fidelity and real-time reconstruction of dynamic scenes.
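The event bin supervision described in this abstract can be illustrated with the usual event generation model, in which a pixel fires one event each time its log intensity changes by a contrast threshold. The sketch below counts signed events between two consecutive grayscale frames and computes a pixel-level L2 error against a real event bin. The function names, the threshold of 0.2, the truncation rule, and the toy frames are assumptions for illustration, not details confirmed by the paper.

```python
import math


def predicted_event_bin(frame_prev, frame_next, threshold=0.2, eps=1e-6):
    """Signed event count per pixel between two consecutive grayscale frames.

    Assumes one event per `threshold` change in log intensity; the count is
    truncated toward zero, and its sign gives the event polarity.
    """
    bins = []
    for row_prev, row_next in zip(frame_prev, frame_next):
        row = []
        for a, b in zip(row_prev, row_next):
            delta = math.log(b + eps) - math.log(a + eps)
            row.append(int(delta / threshold))
        bins.append(row)
    return bins


def event_bin_l2(predicted, real):
    """Mean squared (pixel-level L2) error between two event bins."""
    squared = [
        (p - r) ** 2
        for row_p, row_r in zip(predicted, real)
        for p, r in zip(row_p, row_r)
    ]
    return sum(squared) / len(squared)


# Toy 1x2 frames: the first pixel brightens fourfold, the second is static.
frame_prev = [[0.1, 0.5]]
frame_next = [[0.4, 0.5]]
pred_bin = predicted_event_bin(frame_prev, frame_next)
real_bin = [[5, 0]]  # hypothetical bin accumulated from raw event data
loss = event_bin_l2(pred_bin, real_bin)
```

In a training loop, the hard truncation would block gradients, so a differentiable variant (for example, keeping the unrounded ratio) would likely be needed; the truncated form is used here only to make the counting explicit.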
Abstract: Objective: The rapid advancement of deep learning techniques has resulted in significant progress in image inpainting, particularly in using semantic structures to guide the inpainting process. Semantic structures, such as semantic label maps, have become increasingly popular because they provide valuable contextual information about missing or damaged regions of images. They help guide the inpainting algorithm to restore missing content in a way that aligns with the overall context and semantic meaning of the image. Traditional inpainting models often rely on pixel-level information and texture-based techniques to reconstruct missing regions. However, they may struggle to maintain consistent semantic structures, particularly when dealing with complex images or large damaged areas. Recent advances have shown that combining semantic information with inpainting can significantly improve the quality and realism of image restoration. In particular, semantic segmentation has become an essential tool for extracting meaningful context from an image to guide the inpainting process. Although semantic segmentation models provide rich semantic information that is crucial for accurate image inpainting, many existing methods remain unidirectional. These models often rely solely on pretrained semantic segmentation networks to provide guidance, without considering feedback from the inpainting model itself to improve or refine the segmentation. This limitation can prevent the model from fully correcting semantic errors, resulting in insufficiently accurate inpainting outcomes. We propose a novel image inpainting method that introduces feedback correction within a semisupervised learning framework to overcome the aforementioned challenges. 
This approach allows for a bidirectional interaction between the inpainting and segmentation models, thereby enabling them to refine each other iteratively and ultimately enhance the final inpainting results. Method: The proposed method follows a three-stage progressive image inpainting algorithm, which includes coarse inpainting, semisupervised semantic correction, and fine inpainting. Each of these stages plays a critical role in improving the quality and accuracy of the final image restoration, particularly in scenarios where semantic biases and texture errors are prevalent. In the first stage, coarse inpainting is performed. The goal of this module is to create an initial pixel-level reconstruction of the missing areas. At this stage, the algorithm focuses on restoring basic structures and textures within the damaged regions. Although the inpainting process may still introduce some semantic errors or distortions, it provides a solid foundation for further refinement. This coarse inpainting serves as the basis for the following stages, which aim to address semantic inconsistencies and improve texture details. The second stage involves semisupervised semantic correction. Here, we leverage a cross-image semantic consistency strategy to enhance the semantic understanding of the damaged regions and improve the accuracy of segmentation. This module employs semisupervised learning techniques to generate pseudo-labels for unlabeled images, which are then used to refine the semantic segmentation of the inpainted areas. The segmentation network can correct any semantic errors introduced during the initial inpainting stage by incorporating feedback from the inpainting model. This feedback loop helps ensure that the inpainted regions align closely with the rest of the image in terms of texture and semantic content. The third and final stage is fine inpainting. 
In this stage, the semantic segmentation model, now trained with the refined semantic labels, guides the inpainting model to restore the missing content with high precision. The improved semantic labels allow the inpainting model to generate accurate pixel-level details and textures, thereby ensuring that the final result is semantically consistent with the original image. This stage refines the image restoration process by incorporating the corrected segmentation and the inpainted content, thereby resulting in a realistic and seamless reconstruction. Result: We conducted extensive experiments using two publicly available datasets, namely, CelebA-HQ and Cityscapes, to validate the effectiveness of the proposed method. The evaluation was based on several commonly used metrics, including learned perceptual image patch similarity (LPIPS), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM). These metrics are designed to assess the perceptual quality, structural consistency, and overall fidelity of the inpainted images. The experimental results demonstrate that the proposed method outperforms existing inpainting techniques. On the CelebA-HQ dataset, our approach achieves a 5.88% reduction in LPIPS, a 0.52% increase in PSNR, and a significant improvement in SSIM, thereby indicating a notable enhancement in perceptual quality. The results on the Cityscapes dataset are even more impressive than those on the CelebA-HQ dataset, with the LPIPS decreasing by 6.15%, the SSIM increasing by 1.58%, and the PSNR values remaining superior to those of the other methods. These improvements highlight the effectiveness of the proposed feedback correction mechanism, which helps address semantic biases and texture errors in the inpainting process. Ablation studies were also conducted to confirm the contribution of the semantic correction mechanism further. 
Results show that the integration of semantic correction significantly enhances the inpainting quality, thereby validating the importance of the interaction between the inpainting and segmentation models. Conclusion: This study provides new insights into the synergy between semantic segmentation and image inpainting. The proposed method addresses semantic inconsistencies and texture errors more effectively than previous approaches by fostering an interactive relationship between the two models. Experimental results demonstrate that the feedback correction mechanism, implemented within a semisupervised learning framework, significantly improves the overall performance of image inpainting tasks. This approach opens new possibilities for high-quality image restoration, particularly in applications where semantic consistency is crucial for achieving realistic and contextually accurate results.
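The pseudo-label generation used in the semisupervised semantic correction stage is not detailed in the abstract. A common choice in semisupervised segmentation is confidence thresholding, where only pixels whose top class probability is high enough receive a pseudo-label and the rest are ignored during training. The sketch below assumes that scheme; `pseudo_labels`, `IGNORE_LABEL`, the 0.8 threshold, and the toy probabilities are hypothetical.

```python
IGNORE_LABEL = 255  # assumed marker for pixels excluded from the loss


def pseudo_labels(pixel_probs, confidence=0.8):
    """Assign a pseudo-label to each pixel from its class probabilities.

    pixel_probs is a list with one probability vector per pixel. A pixel
    keeps its most likely class only when that probability reaches
    `confidence`; otherwise it is marked with IGNORE_LABEL.
    """
    labels = []
    for probs in pixel_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(best if probs[best] >= confidence else IGNORE_LABEL)
    return labels


# Toy prediction for two pixels over three classes (hypothetical values).
probs = [
    [0.90, 0.05, 0.05],  # confident: class 0 is kept
    [0.40, 0.35, 0.25],  # ambiguous: pixel is ignored
]
labels = pseudo_labels(probs)
```

Filtering low-confidence pixels limits how much segmentation noise from the coarsely inpainted regions is fed back into training.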
Abstract: Objective: Single-image super-resolution (SISR) focuses on generating a high-resolution (HR) image from a single low-resolution (LR) input by recovering the texture and structural details that are often lost during image degradation. In recent years, SISR has been widely applied in various domains, including computer vision, image processing, and public surveillance. Current SISR methods include traditional approaches (such as interpolation, reconstruction, and example-based techniques) that rely on prior knowledge but often struggle to recover fine details and textures. By contrast, deep learning-based methods, particularly convolutional neural networks, have made significant progress. However, many of these methods improve performance primarily by increasing network depth, which introduces feature redundancy and inefficient use of multilayer features, resulting in artifacts such as texture distortion and edge inconsistency. This work addresses these challenges by proposing an adaptive feature fusion recursive network (AFFRN), which incorporates an adaptive feature fusion module (AFFM) consisting of three specialized branches. AFFRN facilitates efficient information propagation and progressive feature enhancement by employing a recursive refinement mechanism; thus, it can significantly improve texture reconstruction quality. Method: The proposed AFFRN consists of three main stages. First, the input LR image is processed by several convolutional layers to extract initial low-level representations, which primarily capture basic edge and texture information. These feature maps are then passed through AFFM, which performs dynamic and depth-aware fusion of multilevel features. 
Within AFFM, three specialized branches work in parallel to extract and integrate complementary feature representations: 1) the detail attention branch (DAB), which focuses on enhancing shallow-layer features using attention mechanisms to preserve fine textures and edges; 2) the detail exploration branch (DEB), which captures deep contextual information through densely connected blocks combined with downsampling and upsampling to enable modeling of long-range dependencies and global structures; 3) the weight assignment branch (WAB), which, as the core innovation, adaptively assigns weights to features from DAB and DEB based on their relative importance to facilitate selective fusion that suppresses redundancy and emphasizes discriminative information. Moreover, AFFM is integrated into a recursive framework, where the fused features from each iteration are fed back into the module for subsequent refinement. This recursive refinement, which is critical for accurate super-resolution reconstruction, facilitates coarse-to-fine structural recovery and gradual enhancement of hierarchical features. Finally, the aggregated and enhanced feature representations are passed through convolutional and transposed convolutional layers to restore spatial resolution and reconstruct the output HR image. This design allows AFFRN to balance depth, efficiency, and representational power effectively, thereby resulting in superior performance on texture-rich and structurally complex image regions. Result: AFFRN is trained on 900 images selected from the DIV2K dataset, with data augmentation being performed through random horizontal and vertical flips to enhance data diversity. Training is performed using the Adam optimizer, with L1 loss serving as the loss function. Performance is evaluated on five public benchmark datasets: Set5, Set14, BSD100, Urban100, and Manga109. HR images are downsampled using bicubic interpolation with scale factors of ×2, ×3, and ×4 to simulate LR inputs. 
The Y channel of the YCbCr color space is used for evaluation, with peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) serving as the assessment metrics. AFFRN is compared with 19 representative SISR methods. Experimental results demonstrate that AFFRN achieves the highest average PSNR and SSIM values across all scaling factors. Moreover, it recovers highly accurate textures and fine structural details. On the Urban100 dataset, AFFRN achieves PSNR gains of 0.75 dB, 0.53 dB, and 0.54 dB over AFSMNet at scaling factors of ×2, ×3, and ×4, respectively. On the Manga109 dataset, the proposed method also outperforms AFSMNet by 0.11 dB, 0.17 dB, and 0.38 dB at the same scales. Ablation studies further validate the effectiveness and necessity of each module within the proposed network. Removing the core WAB from AFFM results in a PSNR decrease of 0.24 dB, thereby emphasizing the critical role of the proposed WAB in performance improvement.
Conclusion: This study proposes AFFRN, a network specifically designed for the SISR task. The architecture integrates a recursive learning strategy with an AFFM, which plays a critical role in progressively enhancing feature representations across multiple depths. Within the AFFM, the DAB is responsible for refining shallow features to preserve high-frequency textures, whereas the DEB captures deep semantic information by modeling global contextual dependencies. WAB adaptively allocates importance weights to merge these complementary features effectively, thereby promoting informative feature integration while suppressing redundancy. The recursive framework further strengthens feature propagation by enabling iterative refinement, thereby allowing the model to improve detail recovery progressively over multiple stages. Comprehensive evaluations on five benchmark datasets demonstrate that, compared with existing state-of-the-art methods, AFFRN achieves competitive performance in quantitative metrics and visual quality.
Ablation studies further validate the necessity and effectiveness of each component, especially highlighting the critical role of the WAB in enhancing reconstruction accuracy. These results confirm that the proposed adaptive multibranch fusion combined with recursive refinement can effectively alleviate feature redundancy and inefficiency issues, thereby leading to rich texture recovery and highly accurate super-resolution results.
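The evaluation protocol described above (PSNR computed on the Y channel of YCbCr) can be illustrated with a minimal sketch. The abstract does not state which RGB-to-Y conversion is used; the BT.601 studio-range coefficients below are an assumption, and the function names `rgb_to_y` and `psnr_y` are hypothetical.

```python
import math


def rgb_to_y(pixel):
    """Luma of an 8-bit RGB pixel using BT.601 studio-range weights (assumed convention)."""
    r, g, b = pixel
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(reference, restored, peak=255.0):
    """PSNR in dB on the Y channel of two equally sized RGB images.

    Each image is a list of rows, and each row is a list of (r, g, b) tuples.
    """
    squared_error = 0.0
    count = 0
    for ref_row, out_row in zip(reference, restored):
        for ref_px, out_px in zip(ref_row, out_row):
            diff = rgb_to_y(ref_px) - rgb_to_y(out_px)
            squared_error += diff * diff
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


# Hypothetical toy images: a reference and a uniformly brightened copy.
hr = [[(100, 100, 100), (100, 100, 100)]]
sr = [[(110, 110, 110), (110, 110, 110)]]
score = psnr_y(hr, sr)
```

Because PSNR is logarithmic in the mean squared error, gains of a few tenths of a decibel, such as those reported over AFSMNet, correspond to small but consistent reductions in per-pixel error.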
Abstract:
Objective: Texture filtering is a fundamental task in computer vision that attempts to preserve significant structures and smooth out irrelevant textures in an image through filtering. It is useful for many image analysis and processing tasks, such as contour detection, image segmentation, image abstraction, detail enhancement, tone mapping, and image denoising. Existing texture filtering algorithms can be mainly divided into traditional filtering algorithms, such as algorithms based on local and global optimization, and neural network filtering algorithms. Traditional filtering algorithms can hardly use high-level semantics in images and rely heavily on manual parameter adjustment, whereas neural network filtering algorithms achieve better results with the help of the powerful feature representation ability of convolutional neural networks (CNNs). However, neural network filtering algorithms have fixed inference parameters after training. Textures and structures of varying scales, which require different filtering strengths, exist in images; adaptively smoothing large-scale and strong-gradient textures while maintaining small-scale and weak-gradient structures is difficult for a neural network with fixed parameters. In addition, high-quality texture filtering datasets for training neural networks are lacking, and the common practice is to select the filtering results of existing algorithms as imperfect labels or to make synthetic structure-texture images with domain shift from real scene images. However, both practices have flaws, resulting in poor generalization performance of the algorithm on real scene images. To address the lack of adaptive filtering capability, this study proposes a two-stage dynamic side window filtering kernel prediction network (SWNet).
It employs a reverse, divide-and-conquer strategy of smoothing first and then reconstructing: large-scale textures are suppressed, structural and texture pixels are distinguished during reconstruction, and clear structures are restored. For the issue of domain shift in the dataset, an improved structure-texture image synthesis method is proposed, and a hybrid synthetic texture filtering (HSTF) dataset is constructed.
Method: We first design a hierarchical U-Net encoding-decoding module based on transformers and CNNs to generate a structure area segmentation map and an oversmooth image. Then, based on guided filtering, side window filtering, and dynamic convolution, a filter kernel prediction module is designed to predict the sampling points and weight values of eight groups of side window filter kernels under the guidance of the structure area segmentation map, the oversmooth image, and the original input image. Afterward, the oversmooth image is sampled and filtered. Finally, a linear fusion is used to obtain the filtering result. The HSTF dataset consists of two parts: a filling subdataset that fills different texture maps in the segmented area and a fusion subdataset that fuses a structure background map with a single texture map. They are combined for network training.
Result: In the experiment, the SWNet algorithm is compared with 18 algorithms, including classical and latest algorithms, on six datasets. Compared with the second-best algorithm on the test set of the HSTF dataset, the SWNet algorithm improves the peak signal-to-noise ratio (PSNR) value by 2.581 dB and the structure similarity index measure (SSIM) value by 0.033. On the benchmark for edge-preserving image smoothing (BEPS) dataset, the SWNet algorithm improves the PSNR value by 4.616 dB and the SSIM value by 0.090. On the Nankai smoothing (NKS) dataset, it increases the PSNR value by 1.942 dB and the SSIM value by 0.012.
On the structure-preserving image smoothing (SPS) dataset, SWNet raises the PSNR value by 2.701 dB and the SSIM value by 0.034. Meanwhile, a visual effect comparison experiment is carried out on the RTV, BEPS, and ImageNet2012 datasets to intuitively verify the effectiveness of the SWNet algorithm from the viewpoint of human eye perception. Unlike other algorithms, the SWNet algorithm presents a finer structure preservation and smoother texture area (no remnant texture and close to a flat pure color). We also retrain some classical algorithms on the original dataset, the filling subdataset, the fusion subdataset, and the HSTF dataset. The comparative experiment shows that the HSTF dataset can effectively improve the structure preservation and texture smoothing effect of the algorithm on real scene images.
Conclusion: The SWNet algorithm proposed in this paper combines the advantages of global dependence capture of transformers and low-level feature extraction of CNNs. It uses an oversmooth image and a structure area segmentation map to guide the smoothing and preservation processes. With the help of the strong structure-preserving ability of the side window filter kernel and the design of dynamic sampling points and weight values, SWNet can effectively smooth out different scale textures while maintaining the significant and narrow structures, thereby realizing scale-adaptive filtering. Moreover, the HSTF dataset considers the structure and texture pattern of real scene images, which can effectively improve the generalization performance of the algorithm on real scene images.
Keywords: texture filtering; scale adaptive; dataset generation; dynamic side window; guided filtering; Transformer
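The structure-preserving property attributed to side window kernels in the abstract above can be seen in a one-dimensional toy case. The sketch below is not SWNet: it assumes a fixed mean filter with only two side windows (left and right) instead of eight learned 2D kernels, and the function names are hypothetical. Each sample is replaced by the side-window mean that is closest to its original value, so a window never has to straddle an edge.

```python
def box_filter(signal, radius):
    """Centered mean filter; windows are clipped at the borders."""
    n = len(signal)
    out = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def side_window_filter(signal, radius):
    """Mean filter that places the sample at the edge of the window, not its center.

    For every sample, the left-side and right-side window means are computed
    and the one closest to the original sample value is kept.
    """
    n = len(signal)
    out = []
    for i in range(n):
        left = signal[max(0, i - radius): i + 1]
        right = signal[i: min(n, i + radius + 1)]
        candidates = [sum(left) / len(left), sum(right) / len(right)]
        out.append(min(candidates, key=lambda mean: abs(mean - signal[i])))
    return out


# Hypothetical step edge: the centered filter blurs it, the side window filter keeps it.
step = [0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0]
blurred = box_filter(step, 2)
preserved = side_window_filter(step, 2)
```

In this toy setting the selection is a hard minimum; the abstract instead describes predicted sampling points and weights followed by a linear fusion, which can be read as a learned, soft counterpart of the same idea.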
Abstract:
Objective: Driven by standards such as Moving Picture Experts Group (MPEG), H.26x, audio video coding standard (AVS), and AOMedia video (AVx), video coding technology has evolved significantly over the past three decades. These standards commonly adopt a block-based hybrid framework that improves compression efficiency through spatial/temporal prediction, adaptive entropy coding, and rate-distortion optimization. However, lossy compression inevitably introduces artifacts such as blurring, blocking, and ringing. Traditional encoders address these issues using hand-crafted filters, such as a deblocking filter, a sample adaptive offset, an adaptive loop filter, a constrained directional enhancement filter, and loop restoration. Deep neural networks, offering superior performance due to their nonlinear representation power, have been recently introduced to replace rule-based filters. However, existing models often require large parameter counts and substantial memory, thereby limiting their deployment on mobile devices. This work addresses these challenges by introducing a compact, deployment-friendly neural network loop filter tailored for mobile and edge environments.
Method: We propose a lightweight and deployment-friendly approach built on a U-Net backbone to tackle the challenges of high memory consumption, slow inference speed, and poor hardware compatibility in neural network-based loop filtering. The network adopts an encoder-decoder structure with progressive downsampling to reduce interblock complexity and overall computational cost. Structural reparameterization is employed to optimize the network further for deployment: during training, multibranch convolutions enhance feature representation capacity, whereas at inference, they are merged into single-branch standard convolutions to minimize operator dependencies, simplify execution, and reduce intermediate memory usage.
A key challenge of U-Net architectures is the memory overhead caused by long skip connections, given that feature maps from each encoder stage must be stored for later use. We address this bottleneck by designing a multiscale feature fusion and compression module. First, encoder features from all stages are compressed via pointwise convolutions and spatially aligned at the smallest resolution using downsampling. Then, these features are concatenated along the channel dimension and undergo a second compression stage to generate a compact and expressive representation. In the decoder, this representation is reconstructed to its original resolution and further enhanced using reparameterized convolutional blocks. This design significantly reduces peak memory consumption while preserving feature fidelity and improving filtering quality, effectively achieving a joint optimization of memory footprint, inference speed, and reconstruction performance. The entire network is trained end-to-end by taking image residuals as input and using mean squared error as the loss function. Finally, the trained model is embedded into the AOMedia Video 1 (AV1) encoder pipeline and combined with an R-D cost optimization strategy to select the optimal filtered frames adaptively during the loop filtering stage, thereby improving overall coding efficiency under real-world video coding scenarios.
Result: Extensive experiments conducted within the AV1 encoding framework demonstrate that the proposed method achieves a -5.09% BD-rate (Bjøntegaard delta rate) with 77 frames/s at 1080p in intraframe prediction and a -4.32% BD-rate in interframe prediction. In particular, under intraframe prediction, our model attains an average BD-rate of -5.09%.
This finding indicates that our model outperforms most competing methods, such as variable-filter-size residue learning convolutional neural network (VRCNN) (-3.48%), MobileNetV2 (-2.66%), and RepSR (re-parameterization super resolution) (-3.81%). Moreover, it achieves -5.52% and -6.75% in ClassA1 (4K) and ClassA3 (720p), respectively, thereby demonstrating strong performance across both high- and mid-resolution content. Under interframe prediction, the proposed method maintains competitive performance with an average BD-rate of -4.32%, closely matching the average BD-rate of deep semi-smooth Newton network (DeepSN-Net) (-4.35%) while exhibiting stable results across test sequences. For instance, in ClassA1, our model attains a larger gain (-4.61%) than all other baselines. In ClassA3, where DeepSN-Net achieves the highest gain (-6.23%), our model still performs comparably (-5.77%) with low fluctuation across resolutions. Moreover, the model’s peak memory consumption during the Msc3 (memory of skip connection 3) stage is reduced from 27.69 MB to 1.73 MB, a 93.75% decrease, by adopting the feature compression strategy with compression coefficients α₁ = α₂ = α₃ = β = 0.5. This substantial reduction is critical for deployment on resource-constrained platforms. When deployed on a Snapdragon 8 Gen1 neural processing unit (NPU), the model processes 1080p video at 77 frames/s with a peak memory usage of only 55 MB. This finding confirms the model’s suitability for real-time mobile inference under strict memory and latency constraints.
Conclusion: The lightweight neural network loop filtering method proposed in this work effectively balances compression efficiency, memory footprint, and inference speed, thereby delivering an integrated solution tailored for mobile and edge device deployment.
Compared with the existing state-of-the-art neural network video coding models, our approach achieves competitive or superior coding gains while drastically reducing resource consumption, particularly in peak memory usage and inference latency. This combination of high performance and deployment friendliness makes the proposed method highly suitable for real-time video processing scenarios on mobile devices equipped with neural network accelerators. Our results validate the practical viability of structurally reparameterized convolution modules combined with feature fusion compression in mitigating memory bottlenecks and accelerating inference without sacrificing reconstruction quality. This work provides a promising direction for advancing neural network video coding technologies toward widespread adoption in mobile and edge computing environments to enable enhanced video quality and efficient bandwidth utilization with low power and resource demands.
Keywords: loop filtering; feature compression; mobile deployment; AOMedia Video 1 (AV1); structural re-parameterization; neural network model optimization; low memory footprint; low latency
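The structural reparameterization step mentioned in the abstract above (multibranch convolutions at training time, a single standard convolution at inference) rests on the linearity of convolution. The sketch below assumes a RepVGG-style block with a 3×3 branch, a 1×1 branch, and an identity branch on a single channel without normalization layers; the abstract does not specify the branch layout, so this configuration and all names are hypothetical.

```python
def conv2d_same(image, kernel, bias=0.0):
    """Single-channel 2D cross-correlation with zero padding (output size equals input size)."""
    size = len(kernel)
    half = size // 2
    height, width = len(image), len(image[0])
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = bias
            for ky in range(size):
                for kx in range(size):
                    yy, xx = y + ky - half, x + kx - half
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += kernel[ky][kx] * image[yy][xx]
            row.append(acc)
        output.append(row)
    return output


def multibranch_forward(image, k3, b3, k1, b1):
    """Training-time block: 3x3 branch + 1x1 branch + identity branch, summed."""
    out3 = conv2d_same(image, k3, b3)
    out1 = conv2d_same(image, k1, b1)
    return [[a + b + c for a, b, c in zip(r3, r1, r0)]
            for r3, r1, r0 in zip(out3, out1, image)]


def merge_branches(k3, b3, k1, b1):
    """Fold the 1x1 and identity branches into the center tap of the 3x3 kernel."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1[0][0] + 1.0  # 1x1 weight plus identity (weight 1)
    return merged, b3 + b1


# Hypothetical toy weights and input, chosen only for illustration.
image = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 1.0, 1.0]]
k3 = [[0.1, -0.2, 0.0], [0.3, 0.5, -0.1], [0.0, 0.2, 0.4]]
k1 = [[0.7]]
b3, b1 = 0.05, -0.02

training_out = multibranch_forward(image, k3, b3, k1, b1)
merged_kernel, merged_bias = merge_branches(k3, b3, k1, b1)
inference_out = conv2d_same(image, merged_kernel, merged_bias)
max_gap = max(abs(a - b)
              for row_a, row_b in zip(training_out, inference_out)
              for a, b in zip(row_a, row_b))
```

The merged block produces the same output with one operator and no intermediate branch tensors, which is consistent with the reduced operator dependencies and intermediate memory usage that the abstract attributes to this technique.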
Abstract:
Objective: The propagation of misinformation has emerged as a significant challenge in the digital age. Traditional multimodal fake news detection research has primarily focused on the binary classification of content authenticity and lacks the capability to identify specific types of tampering and to localize tampered regions. Fake content that combines visual and textual modalities in cyberspace has proliferated with the rapid advancement of multimedia technologies. Despite some progress in multimodal media tampering detection and localization, existing studies commonly face critical issues, including insufficient cross-modal hierarchical information interaction and limited accuracy in localizing tampered regions. This study addresses these challenges by proposing a tampering detection framework based on multiview visual-language information interaction.
Method: The proposed method is designed to tackle the key challenges in multimodal deep forgery detection and localization by leveraging multiple perspectives of visual-language interaction. First, a hierarchical tampering contrastive learning mechanism is constructed using global and local dual-view feature embedding. This mechanism achieves fine-grained cross-modal semantic alignment, effectively capturing the semantic inconsistency in tampered regions. The approach integrates multiple strategies to enhance the model’s ability to detect and localize forgeries. Second, a forgery perception interaction module is designed. It incorporates multiscale feature extraction and frequency-domain feature fusion techniques. In this way, the localization capability of tampered features at different granularities is significantly improved. In addition, a cross-modal gating fusion module is introduced to optimize the information interaction between modalities.
This approach is achieved through a dynamic weight allocation strategy that enhances the model’s discriminative power in multimodal deep forgery detection and fine-grained classification tasks. The proposed method utilizes advanced techniques in deep learning, including transformers and bidirectional encoder representations from transformers (BERT), to achieve fine-grained alignment and tampering detection. Vision transformers (ViTs) are used for image feature extraction, whereas BERT is used for extracting textual features. The embedded features are subjected to tampering contrastive learning from global and local perspectives. Unlike conventional methods, which typically only pull close-matching image-text pairs and push away nonmatching pairs, the proposed approach employs the InfoNCE loss function to push away tampered image-text pairs simultaneously, thereby reinforcing their semantic inconsistency. This method enhances the model’s ability to distinguish between tampered and nontampered content, thereby addressing the critical issue of cross-modal alignment in traditional methods. Moreover, a dual-stream cross-modal attention mechanism is utilized to facilitate deep-level information interaction between visual and textual modalities. This mechanism enables the model to capture detailed information from both modalities and improve the overall detection accuracy. In addition to the attention mechanism, the forgery perception interaction module and cross-modal gating fusion module further refine the model’s ability to detect subtle discrepancies in content across different modalities. These modules collectively improve the detection of fine-grained tampering, which is crucial for accurately localizing tampered regions in multimodal content. A multitask learning framework is employed to handle various downstream tasks simultaneously, including bounding box regression, binary classification, multilabel classification, and token-level tampering detection. 
The multitask learning module consists of a set of lightweight multilayer perceptrons (MLPs) designed to handle these tasks efficiently. The model can simultaneously learn from various types of labeled data and generalize across different types of tasks by employing a multitask learning approach, thereby providing a comprehensive solution for multimodal deep forgery detection. Specific Implementation: The proposed framework is implemented on a high-performance computing environment using the DGM4 dataset. The model is trained for 50 epochs on an NVIDIA RTX 4090 GPU, leveraging the PyTorch framework. The dataset used for training contains various multimodal content, including images and corresponding text, which is essential for evaluating the model’s effectiveness in detecting and localizing tampered content across different modalities. ViT is used for extracting image features, which allows the model to process visual data efficiently and capture detailed spatial information. BERT, a powerful transformer-based model for natural language processing, is used for text feature extraction. Then, these features are embedded into a shared space, where tampering contrastive learning is applied. The contrastive learning mechanism ensures that the embeddings of tampered image-text pairs are pushed away from the embeddings of authentic image-text pairs, thereby reinforcing the semantic inconsistencies in tampered regions. This fine-grained contrastive learning approach allows the model to identify subtle discrepancies in both the image and text modalities, thereby improving the detection of tampered regions. In terms of information interaction, the dual-stream cross-modal attention mechanism is employed to allow for deep interaction between the visual and textual features. This mechanism facilitates the exchange of information between the two modalities, thereby enabling the model to capture complex relationships between the image and the text. 
The forgery perception interaction module further enhances the model’s ability to detect subtle tampering by integrating multiscale features and fusing frequency-domain information. This approach enables the model to capture tampering features at different scales and granularities, thereby improving its ability to localize tampered regions effectively. The cross-modal gating fusion module is another crucial component of the proposed framework. It optimizes the interaction between modalities by using a dynamic weight allocation strategy. This strategy ensures that the most relevant features from each modality are given high importance during the decision-making process, thereby improving the model’s discriminative power. This module is particularly effective in scenarios where the tampered content is subtle or involves complex interactions between visual and textual elements. The multitask learning module, consisting of lightweight MLPs, is designed to support various downstream tasks. These tasks include bounding box regression for localizing tampered regions, binary classification for distinguishing tampered and authentic content, multilabel classification for handling multiple types of tampering, and token-level tampering detection for identifying tampered tokens in text. The multitask framework allows the model to learn from multiple sources of supervision and generalize across different tasks, thereby improving its overall performance.
Result: Experimental results demonstrate that the proposed model outperforms existing approaches, such as the hierarchical multimodal manipulation reasoning transformer framework based on hierarchical reasoning, in multimodal deep forgery localization tasks. In the image deep forgery localization task, the model achieves a 6.41% improvement in the intersection over union at a threshold of 75% (IoU75) metric, thereby indicating a significant enhancement in the precision of localized tampered regions.
In text tampering localization tasks, the model improves the recall and F1 scores by 5.63% and 2.01%, respectively, thereby demonstrating its superior ability to detect tampered text content. These improvements are a direct result of the fine-grained alignment and deep interaction between visual and textual features enabled by the proposed framework. Compared with the visual-language pretraining with a gate fusion framework, the proposed model exhibits a comprehensive performance advantage in the evaluation of multimodal multitask learning. The model’s ability to handle various tasks simultaneously, coupled with its robust performance in image and text modalities, makes it a highly effective solution for multimodal deep forgery detection and localization.
Conclusion: The multiview visual-language information interaction model proposed in this paper exhibits significant superiority over other models in multimodal deep forgery detection and localization tasks, thereby providing a novel technical solution for the multimedia content security field.
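The tampering contrastive objective described in the abstract above uses the InfoNCE loss and additionally pushes tampered image-text pairs away. A minimal sketch for a single anchor follows, assuming precomputed similarity scores and treating tampered pairs simply as extra negatives; the exact formulation in the paper is not given in the abstract, and the function names and the temperature value are hypothetical.

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.07):
    """InfoNCE loss for one anchor: -log(exp(s+/t) / sum_j exp(s_j/t)).

    Uses the log-sum-exp trick for numerical stability.
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(value - top) for value in logits))
    return log_sum - logits[0]


def tamper_aware_info_nce(pos_sim, mismatched_sims, tampered_sims, temperature=0.07):
    """Assumed variant: tampered pairs join the ordinary mismatched pairs as negatives."""
    return info_nce(pos_sim, list(mismatched_sims) + list(tampered_sims), temperature)


# Hypothetical cosine similarities for one authentic image-text pair.
loss_close_tamper = tamper_aware_info_nce(0.9, [0.1, -0.2], [0.8])
loss_far_tamper = tamper_aware_info_nce(0.9, [0.1, -0.2], [0.0])
```

The loss is lower when the tampered pair is already far from the anchor, so minimizing it drives tampered pairs away from authentic ones in the shared embedding space.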
Abstract:
Objective: In recent years, domain adaptive object detection has become an important area of research with the growing demand for robust object detection systems across different environmental conditions. Traditional object detectors that use deep learning methods obtain relatively high performance, but they largely rely on large-scale labeled datasets, and labeling such datasets is costly. This challenge motivates the exploration of domain adaptation that aims to maintain good detection performance by learning domain-invariant features or training models with cross-domain generalization ability. Recent studies in domain adaptive object detection have increasingly adopted the mean teacher paradigm, which leverages the stability of the parameters of the teacher model during training to address critical challenges in cross-domain scenarios. The method can enhance the invariance of the model to specific domain shift while preserving the discriminative ability of the feature representations by enforcing the consistency between the predictions of the student model and the outputs of the teacher model. However, these studies often overlook the fact that the student model may learn noisy pseudo-labels generated by the teacher model. This situation can consequently lead to performance degradation. If the teacher model produces a noisy pseudo-label, then it could potentially steer the student model toward an incorrect objective, ultimately leading to the generation of erroneous feedback for the teacher model.
Thus, a novel domain adaptive object detection framework named negative teaching and negative learning (NTNL), which integrates a negative teaching module, a negative learning module, and an adaptive weighting mechanism, is proposed in this study to alleviate the aforementioned problem.
Method: NTNL not only considers the positive aspects but also incorporates negative thinking; that is, it evaluates the probability that an object does not belong to a certain class. In particular, NTNL introduces the negative teaching strategy to the teacher model, which can iteratively reduce the likelihood of generating incorrect pseudo-labels for uncertain samples, thereby improving the quality of guidance for the student model. Moreover, NTNL employs negative learning such that the student model can be guided to recognize complementary labels. Furthermore, this study designs an adaptive weighting mechanism to address the scale variation of negative teaching due to the different number of classes in diverse tasks. This mechanism can ensure that the proportion of negative teaching loss in the total loss is roughly consistent across different transfer learning tasks.
Result: Extensive experiments are conducted on three cross-domain object detection tasks for performance evaluation, including weather adaptation (Cityscapes to Foggy Cityscapes), cross-style city road adaptation (Cityscapes to BDD100K), and real-world to clipart adaptation (Pascal VOC to Clipart1k). The proposed method NTNL generally obtains better performance than the compared methods. The mean average precision values of NTNL on the above three tasks are 63.4%, 52.5%, and 45.4%, which are 8.0%, 4.7%, and 1.7% higher than the mean average precision values of the recent methods, respectively. The proposed NTNL achieves such good performance, probably because it employs negative teaching and negative learning, both of which help in predicting the class of hard samples correctly.
In addition, NTNL can balance the weight of negative teaching in different cross-domain object detection tasks because of the design of the adaptive weighting mechanism, thereby improving the generalization ability of the model. This study conducts a series of ablation experiments, and the resultant findings confirm the effectiveness of each component within the proposed NTNL framework. This study also analyzes the parameter sensitivity to reveal the impact of these parameters on performance. Finally, this study utilizes t-distributed stochastic neighbor embedding (t-SNE) to represent the image features extracted by different models, thereby demonstrating the advantages of the proposed NTNL intuitively. The visualization reveals that the feature clusters generated by NTNL exhibit a high degree of separation, particularly for semantically similar categories, such as riders and people, which are difficult to distinguish; thus, the discriminative effect becomes pronounced.
Conclusion: This study proposes a novel domain adaptive object detection method, NTNL, which reduces the impact of noisy pseudo-labels via negative teaching and negative learning. This method also improves the generalization ability of the model with an adaptive weighting mechanism. Comprehensive comparisons with other domain adaptive object detection methods show that the proposed NTNL can achieve superior performance on public cross-domain object detection tasks, thereby proving its validity. Future work will explore incorporating the degree of the domain shift and task difficulty into the adaptive weighting mechanism to improve model generalization further.
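The idea of evaluating the probability that an object does not belong to a class, as described in the abstract above, is commonly expressed with a complementary-label loss of the form -log(1 - p_k). The sketch below shows this generic negative learning loss next to ordinary cross-entropy; it is an assumed illustration, not the NTNL loss or its adaptive weighting, and all names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def positive_loss(logits, label):
    """Standard cross-entropy: the input belongs to `label`."""
    return -math.log(softmax(logits)[label])


def negative_loss(logits, complementary_label, eps=1e-12):
    """Negative learning loss: the input does NOT belong to `complementary_label`."""
    prob = softmax(logits)[complementary_label]
    return -math.log(1.0 - prob + eps)


# Hypothetical three-class scores; class 1 is given as a complementary label.
loss_when_avoiding = negative_loss([5.0, 0.0, 0.0], 1)
loss_when_violating = negative_loss([0.0, 5.0, 0.0], 1)
```

A complementary label carries less information than a true label, but it is also less likely to be wrong, which is why such losses are attractive when pseudo-labels are noisy.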
Abstract:
Objective: Unmanned aerial vehicle (UAV) imagery is characterized by an extremely high density of small objects (often < 16 × 16 pixels) and frequent corruption by weather-related noise, such as haze, rain, and motion blur. These factors cause severe signal attenuation and geometric ambiguity, making small-object detection a persistent bottleneck for convolutional neural networks and Transformer detectors. Although two- and one-stage frameworks have achieved remarkable progress on natural-scene datasets, their accuracy decreases considerably on UAV benchmark datasets, including VisDrone2019-DET, UAV Benchmark Object Detection and Tracking (UAVDT), and Car Parking Lot (CARPK). This study aims to enhance small-object precision without increasing the model size or sacrificing real-time capability.
Method: To address the challenges above, this study proposes YOLO-WF, an enhanced YOLOv8 architecture that synergistically fuses wavelet convolution with frequency-domain attention. First, YOLO-WF embeds a composite Fourier self-attention module (CFSA) in the backbone. By applying 2D fast Fourier transform (FFT) to feature maps, CFSA estimates cross-channel covariance on the amplitude spectrum, generates a learnable frequency mask that amplifies harmonics beneficial for small-object detection, and suppresses high-frequency noise caused by haze, rainfall, and motion blur. The modulated spectrum is then transformed back to the spatial domain via inverse FFT, enabling the model to obtain a global receptive field and long-range dependencies without increasing convolution kernel sizes. Second, a low-frequency wavelet transform convolution module (LOWTC) is inserted at the feature extraction stage. Utilizing second-order Daubechies-4 wavelets, LOWTC decomposes the input features into approximation (A) and detail (D) subbands. The A subband, containing object shape and contextual information, is processed by dilated depth-wise convolutions to capture long-range dependencies.
The D subband, which preserves edges and textures, is unchanged. This divide-and-conquer strategy not only enlarges the receptive field but also alleviates the long-distance modeling deficiency of standard convolutions while keeping parameter growth and computational overhead low, resulting in rich, robust feature representations for small and medium objects. Third, after shallow feature extraction, a dedicated P2 detection head for objects smaller than 32 pixels is added, together with a small-object label assignment strategy, to prevent large-instance interference and substantially improve the localization accuracy of tiny targets. Extensive experiments on VisDrone2019, UAVDT, and CARPK benchmarks demonstrate that YOLO-WF achieves substantial accuracy gains over the baseline model.
Result: All experiments were conducted on a single NVIDIA GeForce RTX 4090 D (24,210 MiB) running Ubuntu 16.04. The software stack comprised Python 3.9 and PyTorch 2.6.0+cu124. Input images were resized to 640 × 640 pixels. The mini-batch size was set to 4 because of limited GPU memory. In accordance with the standard practice for fair comparison, no ImageNet pretrained weights were loaded; all models were trained from scratch for 100 epochs. On the VisDrone2019 dataset, the average precision (AP) metrics of AP50, APs (small objects), and APm (medium objects) reach 47.1%, 19.9%, and 40.3%, respectively. On the UAVDT dataset, the AP50, APs, and APm metrics are 81.56%, 38.54%, and 65.12%, respectively. Furthermore, on the CARPK dataset, the AP50, APs, and APm metrics are 94.3%, 33.3%, and 67.9%, respectively. These results indicate that the YOLO-WF model not only achieves high detection accuracy for small objects but also maintains good performance for objects of different sizes. In addition, comparisons with several typical detection methods show that the YOLO-WF model has a substantial increase in average detection accuracy.
For example, compared with the original YOLOv8 model, the YOLO-WF model achieves an average improvement of 6.5% in detection accuracy on the VisDrone2019 dataset and 2.44% on the UAVDT dataset. These improvements highlight the effectiveness of the proposed YOLO-WF model in enhancing the detection performance for small objects in UAV-captured images. Sequential module removal on VisDrone2019 shows that the small-object detection head contributes the most (+3.6% APs), followed by CFSA (+2.7%) and LOWTC (+2.1%). Combining CFSA with LOWTC yields an extra +1.2%, indicating that frequency-domain enhancement and wavelet context are complementary. Replacing standard convolutions with GSConv recovers the 0.3% lost to the additional detection head, validating the slimming strategy.
Conclusion: The proposed YOLO-WF model demonstrates notable improvements in the detection of small objects in UAV-captured images. By incorporating the CFSA module, the LOWTC module, and a specialized detection head, the model effectively enhances the extraction of key features and improves the detection capability for small objects. Experimental results on multiple benchmark datasets validate the superior performance of the YOLO-WF model in terms of detection accuracy and recall. This study provides a promising solution for small-object detection in UAV-captured images, and the proposed model has the potential to be applied in various practical scenarios, such as surveillance, traffic monitoring, and environmental monitoring. Future work may focus on optimizing the model architecture and integrating additional techniques to further enhance the detection performance and the adaptability of the model to different types of UAV-captured images and environmental conditions.
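The frequency-domain step described for CFSA (transform, modulate the spectrum with a mask, transform back) can be illustrated with a minimal sketch. This is not the authors' module: it works on a 1D signal rather than 2D feature maps, uses a naive DFT from the standard library instead of an FFT, and the mask is set by hand rather than learned. The names `dft`, `idft`, and `apply_frequency_mask` are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def apply_frequency_mask(signal, mask):
    """Transform, scale each frequency bin by its mask weight, transform back."""
    spectrum = dft(signal)
    modulated = [m * s for m, s in zip(mask, spectrum)]
    return [v.real for v in idft(modulated)]

# A slow component plus a fast alternating "noise" component (highest bin, n/2).
n = 8
slow = [math.cos(2 * math.pi * t / n) for t in range(n)]
noisy = [s + 0.5 * (-1) ** t for t, s in enumerate(slow)]

# Hand-set mask (an assumption; the abstract describes a learnable mask):
# keep bins 0, 1, and 7 (the slow component), suppress everything else.
mask = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
filtered = apply_frequency_mask(noisy, mask)
```

Because every output sample depends on every input sample through the transform, a single multiplication in the frequency domain acts over the whole signal, which is the sense in which such a module has a global receptive field.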
Abstract:
Objective: Object tracking, which aims to track a target continuously in a given video sequence, is one of the important research directions in computer vision. It has been widely used in many fields, such as video surveillance, human-computer interaction, and visual navigation. The fundamental challenge lies in maintaining robust tracking performance across diverse environmental conditions, including extreme lighting variations, occlusions, and dynamic backgrounds. The emergence of deep learning has significantly advanced the field of object tracking. However, coping with the challenges posed by scene attributes remains difficult. RGB tracking uses only visible light information and often struggles in low-light conditions, foggy weather, or nighttime environments, where color and texture information becomes unreliable or completely unavailable. In comparison, thermal infrared images use temperature as the source of information, do not depend on lighting conditions, and can provide considerable detail in low-light environments. However, thermal infrared images suffer from low resolution, noise contamination, and a lack of texture details. RGB-thermal (RGB-T) object tracking is a downstream task of visual tracking. It fuses visible and thermal infrared images so that the two modalities complement each other, and its core lies in how the features are fused. It has emerged as a promising solution to overcome the limitations of single-modality tracking. As the earliest deep learning method, the convolutional neural network (CNN) is widely used in Siamese networks for RGB-T object tracking. After the emergence of the transformer, researchers have used a unified network for joint feature learning and relation modeling, thereby facilitating the interaction between the template and search region.
However, existing methods tend to rely on either a CNN or a transformer alone for feature extraction. Moreover, they underutilize complementary local and global feature relationships even though various complex fusion modules are designed to facilitate intermodal information interaction. In particular, CNNs excel at capturing local structural details but lack long-range dependencies, whereas transformers capture global context but may dilute fine-grained details during attention computation. Thus, a three-branch multistage fusion network for RGB-T object tracking, which models shallow local features and deep global features separately and promotes the wide propagation of multistage features, is proposed in this study to address the aforementioned issue. Our approach leverages the complementary strengths of CNNs and transformers while mitigating their respective limitations through a carefully designed multistage fusion strategy.
Method: The RGB modality possesses abundant detailed features, whereas the thermal infrared modality, despite lacking detailed information, still has significant global features that cannot be overlooked. We enhance and fuse features by employing different fusion strategies at different layers to achieve a robust representation and prevent interference between local and global features. First, considering that attention may dilute local features, we design a convolutional fusion module (CFM) to extract and integrate local features between adjacent patches. CFM preserves the spatial structure of local information through convolutional operations that maintain the inherent spatial relationships between neighboring patches. This module is embedded in an additional branch that directly processes the features after patch embedding, and the branch is then merged into the backbone. The additional branch is designed to operate in parallel with the main transformer backbone, thereby ensuring that local feature extraction does not interfere with global context modeling.
The three-branch design improves the retention of local features and facilitates the propagation of shallow local information to deep layers while avoiding attention distractions. In addition, an attention fusion enhancement module (AFEM), a partially weight-shared module, is designed to extract and fuse global features. AFEM employs a dual-path linear projection strategy where one path uses shared parameters to maintain feature consistency across modalities, whereas the other path uses independent parameters to handle modality-specific characteristics adaptively. This design effectively normalizes the feature distributions of RGB and thermal modalities before fusion, thereby addressing the challenge of heterogeneous data distributions between different sensor modalities. AFEM is directly inserted into the backbone, where it can integrate the global features of deep layers without affecting local features. Our method facilitates the integration of local and global features and significantly enhances model robustness through the design of additional branches and multistage fusion. The multistage fusion strategy enables progressive refinement of features across different network depths, with shallow layers focusing on local detail preservation and deep layers emphasizing global contextual understanding.
Result: Our method is evaluated on three standard RGB-T object tracking datasets: LasHeR, RGBT210, and RGBT234. Results show that our method achieves a precision rate (PR)/success rate (SR) of 83.7%/60.6% on RGBT210 and an MPR/MSR of 86.4%/83.5% on RGBT234, thereby demonstrating our method’s robustness to modality-specific challenges. On LasHeR, our method (72.0%/67.7%/57.0%) achieves 2.4%/1.5% gains in PR/SR compared with the baseline (69.6%/65.9%/55.5%). We evaluate the challenge attributes on LasHeR, and the results indicate that our method achieves the best performance under most of the challenging attributes.
For instance, our method achieves a PR of 59.2% and far surpasses other methods in abrupt illumination variation, thereby demonstrating the effectiveness of our local-global feature fusion strategy. We also perform ablation experiments to demonstrate the effectiveness of the modules. Results show that our designed modules and the three-branch architecture effectively enhance performance and that the modules reinforce one another when combined.
Conclusion: In this study, we propose a three-branch multistage fusion network for RGB-T object tracking that obtains rich semantic information by extracting and fusing local and global features. Our method decouples local and global feature extraction and processes the two kinds of features through dedicated pathways before fusion. In this way, our method effectively addresses the fundamental limitation of existing RGB-T tracking approaches, which overemphasize either local details or global context at the expense of the other. With the additional branch, the tracker effectively captures features at different stages and facilitates the propagation of local features. The three-branch architecture enables the network to maintain high-quality local details throughout the entire processing pipeline. This capability is particularly crucial for tracking small objects or objects with fine-grained textures. Experimental results show that our algorithm outperforms state-of-the-art RGB-T object tracking algorithms and effectively improves RGB-T object tracking performance. Our method achieves superior performance across multiple benchmark datasets while maintaining reasonable computational efficiency, running at 22.7 frames per second (FPS).
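The dual-path projection attributed to AFEM above (one path with parameters shared across modalities, one with modality-specific parameters) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the two paths are combined, so the sum used here is an assumption, and `matvec`, `dual_path_project`, and all weight values are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def dual_path_project(x, shared_W, specific_W):
    """Shared projection plus modality-specific projection (combined by sum; an assumption)."""
    shared = matvec(shared_W, x)
    specific = matvec(specific_W, x)
    return [a + b for a, b in zip(shared, specific)]

# One weight matrix is reused for both modalities...
shared_W = [[1.0, 0.0], [0.0, 1.0]]
# ...while each modality also gets its own matrix (toy values).
rgb_W = [[0.5, 0.0], [0.0, 0.0]]
tir_W = [[0.0, 0.0], [0.0, 2.0]]

token = [2.0, 4.0]
rgb_out = dual_path_project(token, shared_W, rgb_W)
tir_out = dual_path_project(token, shared_W, tir_W)
```

The shared path gives both modalities a common component, and the specific path lets each modality deviate from it, which is one way to read "partially weight-shared".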
Abstract:
Objective: Teaching behavior recognition has a wide range of applications in smart classrooms. In particular, the state of students can be analyzed in real time, and teachers can be given accurate feedback on teaching behaviors to help them adjust the teaching rhythm and improve their teaching methods. In this way, the efficiency of classroom interactions and the quality of knowledge transfer are enhanced. However, in actual teaching scenarios, the promotion of various teaching reforms has given rise to new interactive and technology-integrated teaching behaviors in the smart classroom, such as collaborative group discussions using tablets and students displaying results on whiteboards. These new teaching behaviors differ from traditional ones. Moreover, the number of annotated samples is quite limited because of the cost of annotation. In such a situation, the main challenge is to ensure that the model is capable of few-shot continual learning, i.e., that it accurately recognizes new teaching behaviors without catastrophically forgetting old ones. Most existing few-shot continual learning algorithms are based on pretrained vision-language models (e.g., contrastive language-image pre-training (CLIP)), which are adapted through a fine-tuning network to match image and text features. However, these studies often ignore the fact that behavioral labels such as “listening”, “writing”, and “using a tablet” themselves contain rich semantic information. For example, the label “writing” contains multilayered semantic information. At the basic semantic level, it refers to “the action of a person who holds a writing instrument in his/her hand to record symbols on a specific carrier”, which includes the sequential action logic of “hold-move-leave a mark”.
At the behavioral entity level, it involves the student subject, objects such as a pen, notebook, or tablet, and the interactions between them, e.g., “student, with a pen, on a notebook”. At the scene perception level, it refers to the student writing on a desk in a classroom scene, which richly presents the deeper details of the behavior. However, most existing algorithms simply match “writing” as a single text label with image features, thereby failing to parse out the action logic of the underlying semantics, the interaction relationships of behavioral entities, and the contextual associations of scene perception. Moreover, these algorithms have a weak grasp of the deeper connotations of the behavior, so the model has difficulty understanding behavioral actions. Thus, recognition robustness in few-sample scenarios is affected. To this end, a scalable chain-of-thought-guided (SCOTG) algorithm for few-shot continual teaching behavior recognition is proposed.
Method: We generate detailed descriptive text about the behavior labels through a chain of thought to expand and mine the semantics of the labels, and we extract (subject, predicate, object) triples to condense the result into structured knowledge. In this way, the key entities and relationships in the behavior can be accurately reflected, and the model can deeply understand and identify the behavioral actions. In the “association” stage, a three-stage chain-of-thought Q&A is adopted. The first stage is the basic semantic level, which lets the large language model (LLM) identify the subject of the behavior and the location where it occurs by adding the necessary nouns. This level also converts the behavior label into a basic sentence. The second stage is the semantic extension at the entity interaction level. It further emphasizes the interactivity of the behavior by introducing other visual entities related to the action.
The third and final stage is the semantic extension at the scene-aware level, which lets the LLM generate detailed scene descriptions, capture the dynamic changes and reactions of the subjects in the scene, and enrich the details of the behaviors at a fine granularity. In the “refinement” stage, the LLM is used to extract the triples from the sentences, condense the knowledge, remove redundant information, and make it highly structured. Meanwhile, the concise triples can be handled effectively by the CLIP text encoder. In addition to the scaling of text labels, the SCOTG algorithm designs a multilevel cross-modal matching mechanism that computes the similarity between the textual features of triples at different levels and the multilayer visual features of the image. This algorithm also employs a hierarchical weighting strategy that sets different weights according to the contribution of each layer of features to the behavior recognition task and obtains the total similarity that is finally used for classification. This strategy takes full advantage of the hierarchical semantic structure of the text-side triples (e.g., base action, entity interaction, and scene perception triples) and their cross-modal associations with the image-side multilayer visual features (e.g., low-level texture features, mid-level part features, and high-level scene features). Among them, the matching of the base action triple with low-level visual features can capture fine-grained action details such as “grip-move”. The matching of the entity interaction triple with mid-level visual features can strengthen the recognition of “subject-interaction-object” entity relationships, and the matching of the scene perception triple with high-level visual features can capture the contextual logic of the behavior.
The model can understand the nature of the behavior from semantic to visual dimensions through this kind of multilevel alignment. Thus, the model can effectively make up for the defect of traditional single-feature matching, which is insufficient for capturing deeper associations. Compared with traditional methods, the SCOTG algorithm freezes the backbone of the pretrained vision-language model, scales only the behavioral labels, and reduces the computational complexity by adapting the vision-language model through prompt learning.
Result: Experiments were conducted on a classroom scene image dataset with 32 behavioral categories, with seven methods being used for comparison. Under the three-way five-shot task setting, the SCOTG algorithm improved the average accuracy over all tasks by 1.98% and the accuracy in the final task by 1.36% compared with the model with the second-highest performance. Under the three-way three-shot task setting, the average accuracy of the SCOTG algorithm over all tasks improved by 1.03% compared with that of the model with the second-highest performance. Under the three-way one-shot task setting, the average accuracy of the SCOTG algorithm over all tasks improved by 0.78% compared with that of the model with the second-highest performance.
Conclusion: In this study, we propose the SCOTG algorithm for few-shot continual teaching behavior recognition and design a multilevel cross-modal matching mechanism. Experimental results show that the proposed SCOTG algorithm effectively improves the model’s understanding of teaching behaviors and enhances the model’s ability to recognize novel teaching behaviors in few-sample scenarios. The code is available at https://github.com/2002zlj/scotg.
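The hierarchical weighting described in the Method section (per-level text-to-visual similarity, combined with level weights into a total similarity used for classification) can be sketched as below. This is a minimal illustration, not the released SCOTG code: the feature vectors, the level weights, and the names `cosine`, `total_similarity`, and `classify` are all hypothetical, and real features would come from the CLIP text and image encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def total_similarity(text_feats, visual_feats, weights):
    """Weighted sum of per-level cosine similarities (one text/visual pair per level)."""
    return sum(w * cosine(t, v) for t, v, w in zip(text_feats, visual_feats, weights))

def classify(visual_feats, class_text_feats, weights):
    """Return the label whose triple features best match the image, plus all scores."""
    scores = {label: total_similarity(feats, visual_feats, weights)
              for label, feats in class_text_feats.items()}
    return max(scores, key=scores.get), scores

# Toy 2D features for three levels: base action, entity interaction, scene perception.
image_levels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
class_text_feats = {
    "writing":   [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    "listening": [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]],
}
level_weights = [0.2, 0.3, 0.5]  # assumed values, not from the paper

label, scores = classify(image_levels, class_text_feats, level_weights)
```

In this toy case "listening" matches the image only at the entity interaction level, so its score equals that level's weight, while "writing" matches at all three levels and wins.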
Abstract:
Objective: Karyotype analysis has always been the gold standard for detecting fetal chromosomal abnormalities. As one of its core steps, chromosome classification is the premise of subsequent anomaly recognition. Traditional chromosome classification mainly relies on manual operation by karyotype experts. However, this approach is inefficient and highly subjective. Thus, high-performance automatic classification methods are urgently needed. Although remarkable progress has been made in automatic classification methods based on deep learning, these approaches often rely heavily on large-scale labeled datasets and on the assumption that the training and test sets are independent and identically distributed. In clinical practice, chromosome data annotation is costly, and the distribution of chromosome data from different sources varies significantly because of differences in banding technique and acquisition equipment. Domain adaptation methods, which can transfer the knowledge of the labeled source domain to the unlabeled target domain data, provide a solution for this issue. However, common domain adaptation methods require access to both source domain and target domain data. This requirement is difficult to reconcile with the increasingly strict data privacy protection demanded in medical systems. In addition, the difference in karyotype resolution caused by a different number of chromosome bands is the main factor that degrades cross-domain knowledge transfer for chromosomes. In response to these issues, we propose a bi-confidence pseudo-label guided source-free domain adaptation for chromosome classification with karyotype resolution discrepancy (BCPL-SFDA).
Method: The overall framework of BCPL-SFDA is divided into two stages: source domain model training and target domain model training.
In the training stage of the source domain model, the model is pretrained with labeled source domain data to acquire the basic features and discriminative ability needed for chromosome classification. In the training stage of the target domain model, the pretrained source domain model is used to initialize the network weights, and then transfer learning is carried out on the unlabeled target domain data. In this study, a deep and shallow double-branch feature retention mechanism for karyotype resolution discrepancy is constructed. This mechanism differs from the traditional classification model, which relies only on the last layer of deep features for classification. The mechanism in this study combines deep and shallow features to obtain transferable knowledge at different levels of the source domain. It retains the deep semantic prior of the source domain and the shallow texture and shape representation. It also realizes the progressive alignment of the feature spaces of the source domain and the target domain. The model’s ability to understand and recognize chromosome images of different resolutions is significantly enhanced. Furthermore, a bi-confidence class-centric pseudo label (BCCPL) strategy, which targets the large intraclass variation and small interclass variation typical of chromosome data, is proposed. It combines high and low confidence samples to explore the intraclass diversity and interclass distinctions of chromosomes thoroughly.
Result: In this study, two G-band chromosome datasets with different resolutions, the chromosome based on inception-ResNet (CIR-Net) dataset and a private dataset (Private), are used. First, we compare the proposed method with 10 domain adaptation methods on these datasets.
For Private→CIR-Net tasks, the classification accuracy of the proposed method reaches 96.11%, which is 1.17% higher than that of the best domain adaptation method, a hybrid model of structurally regularized deep clustering (H-SRDC), and 6.76% higher than that of the best source-free domain adaptation method, source hypothesis transfer (SHOT). In CIR-Net→Private tasks, the proposed method achieves significant advantages: compared with the second-best method, domain adversarial training of neural networks (DANN), the proposed method shows an improvement of 9.97%. Experimental results show the significant performance advantages of the proposed method. Second, ablation experiments are performed to demonstrate the effectiveness of each component. We compare feature retention mechanisms with different numbers of branches, and the double-branch feature retention mechanism achieves the best accuracy. In gradient-weighted class activation mapping (Grad-CAM) visualization, shallow and deep features are extracted by this mechanism, which realizes dual attention to the overall chromosome structure and local details. We also verify the effectiveness of the loss function Lsim. The BCCPL strategy effectively improves the learning ability of the model under complex data distributions. Finally, the limitations of this method on short chromosomes are discussed through visual analysis.
Conclusion: A source-free domain adaptation method, BCPL-SFDA, is proposed in this paper for chromosome classification to address cross-domain karyotype resolution discrepancy and the inherent large intraclass differences and small interclass differences of chromosome data. In particular, BCPL-SFDA adopts a dual-path framework, which preserves the deep semantic prior of the source domain and the shallow texture and morphological representation.
It also realizes the progressive alignment of the feature space between the source domain and the target domain, thereby effectively overcoming the performance limitation of single-layer features when dealing with karyotype resolution discrepancy. Moreover, the BCCPL strategy is designed to combine high and low confidence samples and generate reliable pseudo-labels through multicenter prototype clustering. Thus, the attention of the model to easily confused samples is enhanced, and the performance of the model is improved. Results of cross-domain classification on two chromosome datasets with different resolutions demonstrate that this method can effectively enhance the recognition capability for transferring between chromosome images of different resolutions. Moreover, it can optimize the extraction of intraclass and interclass features and improve the performance of cross-domain chromosome classification. In the future, fine-grained feature extraction of short and small chromosomes can be studied to solve the limitations of such methods. This study can provide an important reference for further improving the performance of the model in cross-resolution chromosome classification tasks.
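The idea of combining high and low confidence samples to produce pseudo-labels can be illustrated with a simplified sketch. It is not the BCCPL implementation: it uses a single mean prototype per class instead of the multicenter prototype clustering mentioned above, the 0.8 threshold is an assumed value, and all function names are hypothetical. High-confidence samples keep the classifier's prediction and define the prototypes; low-confidence samples are relabeled by their nearest prototype.

```python
def split_by_confidence(probs, threshold=0.8):
    """Indices of samples whose top probability is at/above vs. below the threshold."""
    high, low = [], []
    for i, p in enumerate(probs):
        (high if max(p) >= threshold else low).append(i)
    return high, low

def class_prototypes(features, probs, high_idx):
    """Mean feature of the high-confidence samples predicted as each class."""
    sums, counts = {}, {}
    for i in high_idx:
        label = probs[i].index(max(probs[i]))
        acc = sums.setdefault(label, [0.0] * len(features[i]))
        for d, value in enumerate(features[i]):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def nearest_prototype(feature, prototypes):
    """Class whose prototype is closest in squared Euclidean distance."""
    def sq_dist(proto):
        return sum((a - b) ** 2 for a, b in zip(feature, proto))
    return min(prototypes, key=lambda c: sq_dist(prototypes[c]))

def pseudo_labels(features, probs, threshold=0.8):
    """Pseudo-label every sample using both confidence groups."""
    high, low = split_by_confidence(probs, threshold)
    protos = class_prototypes(features, probs, high)
    labels = {i: probs[i].index(max(probs[i])) for i in high}
    labels.update({i: nearest_prototype(features[i], protos) for i in low})
    return [labels[i] for i in range(len(features))]

# Toy target-domain features and source-model probabilities (two classes).
features = [[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.2, 4.0], [3.8, 3.9]]
probs = [[0.9, 0.1], [0.95, 0.05], [0.1, 0.9], [0.15, 0.85], [0.55, 0.45]]
labels = pseudo_labels(features, probs)
```

The last sample is the interesting one: the source model weakly predicts class 0, but its feature lies next to the class 1 prototype, so the pseudo-label becomes 1.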
Abstract:
Objective: Some network parameters remain in an unstable optimization state during training because of the lack of ground truth images for supervision in infrared and visible image fusion tasks. As a result, deep semantic information from the source images is difficult to preserve effectively during the fusion process. Existing fusion methods commonly suffer from degradation in multilevel semantic representation and lack effective cross-level feature interaction mechanisms; thus, full integration of shallow details with deep semantic information during fusion becomes difficult. In addition, transformer-based fusion methods incur substantial computational overhead when modeling global features. This study addresses these challenges by proposing a multilevel infrared and visible image fusion network based on Mamba, thereby leveraging recent advances in state space modeling.
Method: In this study, we propose a novel infrared and visible image fusion algorithm that constructs a multilevel feature extraction architecture. The model effectively alleviates the loss of semantic information during deep network training by employing a progressive downsampling strategy and dense connections during the feature encoding phase, thereby preserving fine-grained textures and high-level semantic features from the source images. During the feature extraction stage, we introduce an innovative F-Mamba module that leverages the selective memory mechanism of state space models and hardware-aware algorithms. This design mitigates the limitations of traditional convolutional receptive fields while maintaining linear computational complexity, thereby enabling the network to efficiently capture cross-level feature representations ranging from local textures to global semantic information. A cross-level feature aggregation module is proposed to further enhance the extraction of complementary features across levels.
This module employs multiscale dilated convolutions to align and fuse shallow visual features with deep semantic features, thereby achieving effective preservation of fine-grained semantic information in cross-modal images.
Result: Comparative experiments against 13 traditional and deep learning-based fusion methods were conducted on the multispectral road scenarios (MSRS) dataset, the visible-infrared paired dataset for low-light vision (LLVIP), and a new dataset of aligned infrared and visible images (RoadScene). In the subjective evaluation, the proposed method demonstrated a clear advantage in the restoration of target detail features and visual quality. With regard to the objective evaluation, our method achieved the best values in six objective metrics on the MSRS dataset: entropy, spatial frequency (SF), visual information fidelity, peak signal-to-noise ratio (PSNR), average gradient (AG), and edge intensity (EI). Compared with the best values of the 13 existing fusion algorithms in these six metrics, those of our method increased by 3.03%, 1.56%, 15.89%, 7.26%, 2.61%, and 1.62%, respectively. The values achieved by our method in SF, PSNR, AG, and EI on the LLVIP dataset are the best among all methods. Compared with the best values of the 13 existing fusion algorithms in these four metrics, those of our method increased by 6.42%, 0.45%, 6.47%, and 7.23%, respectively. Moreover, the values achieved by our method in AG and EI on the RoadScene dataset are the best among all methods. In addition, we validated the effectiveness of each component in the network through ablation experiments. In the computational efficiency comparison experiments, the proposed method demonstrates significant advantages over most transformer-based and Mamba-based approaches.
In semantic segmentation experiments, the proposed method outperforms the second-best approach by 0.42% in terms of mean intersection over union (mIoU), thereby demonstrating the effectiveness of our fusion algorithm in preserving multilevel semantic features.
Conclusion: In this study, we propose a Mamba-based multilevel fusion network that integrates a hierarchical feature extraction architecture with the F-Mamba module to preserve deep semantic features from the source images effectively while maintaining linear computational complexity. Experimental results show that, compared with 13 existing fusion methods, the proposed approach demonstrates superior performance in preserving fine-grained semantic features, restoring target details, and achieving high computational efficiency.
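The claim of linear computational complexity follows from the recurrent form of a state space model: each step updates a hidden state from the previous one, so a sequence of length L costs O(L). The sketch below shows a scalar, time-invariant recurrence and checks it against the equivalent quadratic-time convolution. It is not the F-Mamba module: Mamba-style selective models make the parameters depend on the input, which is omitted here, and the names `ssm_scan` and `ssm_as_convolution` are hypothetical.

```python
def ssm_scan(xs, a, b, c):
    """Linear-time recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def ssm_as_convolution(xs, a, b, c):
    """Same outputs computed as y_t = sum_k c * a**k * b * x_{t-k} (quadratic time)."""
    return [sum(c * (a ** k) * b * xs[t - k] for k in range(t + 1))
            for t in range(len(xs))]

# An impulse input exposes the decaying memory of the state.
impulse_response = ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)

# Both forms agree on an arbitrary input.
xs = [1.0, -2.0, 0.5, 3.0]
scan_out = ssm_scan(xs, a=0.5, b=1.0, c=2.0)
conv_out = ssm_as_convolution(xs, a=0.5, b=1.0, c=2.0)
```

The hidden state carries information from all earlier positions, so the receptive field covers the whole sequence without the pairwise cost of attention.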
Abstract:
Objective: Dataset distillation offers a promising solution for synthesizing lightweight training datasets. This technique significantly reduces computational requirements and memory consumption, thereby alleviating model training burdens and accelerating the overall process. It is particularly valuable in resource-constrained environments where efficient data utilization is crucial. However, current mainstream methods overemphasize the performance of distilled datasets. They often employ bi-level optimization pipelines for data synthesis, which results in prohibitively high computational costs. This constraint severely limits their applicability to large-scale and high-resolution datasets. Furthermore, these methods rely on matching-based architectures to update distilled data. This approach makes them vulnerable to architecture overfitting, consequently impairing their cross-framework generalization. Practical challenges also arise from the characteristics of real-world datasets. Large-scale datasets demand excessive computational resources, thereby degrading distillation efficiency. Conversely, small-sample datasets contain insufficient information, which compromises the quality of the generated data. Dataset distribution imbalances further exacerbate these issues, thereby affecting efficiency and output quality. These fundamental limitations collectively restrict the practical deployment of existing distillation methods across diverse real-world scenarios.
Method: This study proposes a novel dataset distillation method that integrates coreset optimization and prototype-augmented diffusion to overcome these limitations. Our method is built on a decoupled distillation framework, which decomposes the traditional bi-level optimization pipeline into two manageable stages: data preprocessing and distilled data synthesis. This decomposition significantly reduces computational costs and enables efficient distillation for large-scale, high-resolution datasets.
In particular, the proposed method comprises two key modules: a prototype-augmented latent diffusion model (PA-LDM) and an adaptive coreset optimization module (ACOM). We introduce an improved latent diffusion model named PA-LDM for efficient and high-quality synthesis of distilled data. This model incorporates prototype learning techniques into the latent space diffusion process, thereby significantly enhancing distillation performance and data versatility. PA-LDM operates in the latent space to capture essential dataset features while minimizing redundancy. We develop ACOM to address the impact of the original dataset scale and data distribution. This preprocessing module combines coreset selection and data augmentation to optimize the dataset’s scale and distribution adaptively, thereby ensuring that the resulting distilled data are balanced and representative.
Result: We conduct extensive experiments on multiple public benchmark datasets to evaluate the performance of our proposed method comprehensively. These datasets include three small-scale datasets (CIFAR-10, CIFAR-100, and Tiny-ImageNet) and four large-scale datasets (ImageNet-1K and its subsets). Our method is compared with eight state-of-the-art techniques, including matching-based distillation methods (e.g., differentiable Siamese augmentation (DSA) and trajectory matching (MTT)) and decoupled distillation approaches (e.g., squeeze, recover and relabel (SRe2L) and generalized various backbone and statistical matching (V-GBSM)). Our approach consistently outperforms most existing methods across various images-per-class (IPC) settings on all datasets.
For small-scale dataset distillation, the proposed method achieves top-1 accuracy of up to 56.4% and 73.5% under IPC = 10 and 50, respectively. Compared with classical distillation methods, such as DSA and SRe2L, it yields average performance improvements of 8.3% and 27.5%, respectively. On large-scale datasets, our method reaches top-1 accuracy of 57.6%, 84.9%, and 90.6% at IPC = 10, 50, and 100, respectively, outperforming state-of-the-art decoupled distillation methods by an average of 0.4% to 3.0%. Moreover, while achieving superior distillation performance, our method improves the synthesis efficiency of distilled data by more than 2.16 times, thereby demonstrating a better trade-off between compression ratio and model performance. Conclusion: The proposed dataset distillation method, which combines coreset optimization with prototype-augmented diffusion, achieves high efficiency and superior quality in distilling datasets of varying scales. It adapts well to imbalanced data distributions through adaptive processing while generating high-resolution distilled images with high visual fidelity. These synthesized images not only serve as effective substitutes for the original training data but also exhibit excellent cross-architecture generalization. Our method reduces the time required for knowledge compression by 49% while keeping performance degradation below 0.5%, with some cases even achieving lossless compression. This result enables efficient model training while preserving critical data characteristics, marking a significant advance for practical dataset distillation in diverse real-world scenarios.
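The class-balancing role that the abstract above assigns to ACOM can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `balance_by_coreset`, the centroid-distance selection rule, and the use of index repetition as a stand-in for data augmentation are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import cycle, islice
import math


def balance_by_coreset(features, labels, per_class):
    """Return sample indices with exactly `per_class` entries for every class.

    Hypothetical helper: over-represented classes are reduced to the samples
    closest to the class centroid (a simple coreset rule); under-represented
    classes are padded by repeating indices, standing in for augmentation.
    """
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    chosen = []
    for label in sorted(by_class):
        idx = by_class[label]
        if len(idx) >= per_class:
            dim = len(features[idx[0]])
            # Class centroid in feature space.
            center = [sum(features[i][d] for i in idx) / len(idx) for d in range(dim)]
            # Keep the per_class samples nearest to the centroid.
            idx = sorted(idx, key=lambda i: math.dist(features[i], center))[:per_class]
        else:
            # Too few samples: cycle through them until the quota is met.
            idx = list(islice(cycle(idx), per_class))
        chosen.extend(idx)
    return chosen
```

In a real pipeline the repeated indices would be passed through an augmentation step so that the padded samples are not exact duplicates.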
Abstract: Objective: The primary objective of construction site hazard identification is to raise safety management standards in operational construction environments by leveraging advanced automation technologies. In recent years, the widespread adoption of large language models (LLMs) has opened new avenues for research in this critical domain. The overarching goal is to mitigate human error, enhance detection efficiency, and proactively prevent accidents through intelligent systems capable of interpreting complex visual and textual data from construction sites. An analysis of current LLM-based research reveals that existing methods fall into two categories, each with its own advantages and limitations. The first leverages image-text matching to perform collaborative reasoning by integrating visual input with textual hazard descriptions. The second constructs domain-specific datasets to fine-tune large models through instruction tuning or to guide them via multiturn dialogues. The former enhances the alignment between images and semantic representations through multimodal fusion yet is limited in capturing complex hazard characteristics. The latter strengthens the model’s analytical depth by infusing domain knowledge but suffers from high training costs and poor generalizability. Method: This study addresses these limitations by proposing a retrieval-augmented generation method for risk detection and hazard identification that dynamically integrates external knowledge bases with retrieved case contexts through prompt tuning, thereby resolving misjudgments caused by LLMs’ lack of domain knowledge and weakened feature associations. 
The proposed architecture comprises three interdependent core modules. The first is the retrieval database module, which serves as the external knowledge repository and is populated with a comprehensive collection of historical construction hazard cases. Each entry is a multimodal data object comprising an image and its corresponding textual annotation, which describes the hazard type, its location, and contextual information. The integrity, diversity, and relevance of this database are paramount because it forms the foundational knowledge source for the entire system. The second is the image similarity retrieval module, which retrieves the most relevant cases from the database for a new query image from a construction site. At its core is a vision-language model, specifically the contrastive language-image pretraining (CLIP) model, which maps images and text into a shared, high-dimensional semantic embedding space. A new query image is encoded into an embedding vector, which is then compared against the precomputed embeddings of all images in the retrieval database using a similarity metric. The top-K most semantically similar cases are retrieved, thereby ensuring that the subsequent reasoning steps are informed by visually and contextually analogous examples. The third is the LLM retrieval-augmented reasoning module, which is the central reasoning engine. The retrieved similar cases (both their images and text) are formatted into a structured prompt, thereby providing the LLM (e.g., GLM-4V) with a few-shot learning context. This prompt, which also includes the query image, guides the LLM to perform in-context learning. 
This entire framework operates in a training-free manner for the LLM itself. Thus, no additional fine-tuning is required, which reduces computational overhead and enhances scalability and ease of deployment. Result: The proposed framework was evaluated empirically on real-world construction site data encompassing various hazard scenarios and environmental conditions. It was tested across multiple state-of-the-art large language models to assess the robustness and general applicability of the approach. When integrated with the GLM-4V model, the retrieval-augmentation framework achieved a recognition accuracy of 50%, an improvement of 35.49% over the vanilla GLM-4V model without retrieval augmentation. Beyond this aggregate metric, a category-wise analysis revealed consistent performance gains across a majority of individual hazard types, indicating that the method enhances the model’s capability broadly rather than being biased toward a specific hazard category. Furthermore, in ablation studies, the learned perceptual image patch similarity (LPIPS) metric was introduced as an alternative to CLIP in the image similarity retrieval module. The results demonstrated the clear superiority of the CLIP-based semantic retrieval strategy over the LPIPS-based perceptual strategy in this task. Conclusion: The similar-case retrieval-augmented method proposed in this paper delivers a significant advance in automated construction site safety monitoring. 
It markedly enhances the key performance indicators of LLMs, namely accuracy and contextual understanding, in the complex task of hazard identification. The framework demonstrates robust generalization across a wide spectrum of multicategory hazard scenarios, thereby addressing the core limitations of previous approaches related to training cost and adaptability. The training-free nature of the approach makes it particularly attractive for real-world deployment, offering a scalable and sustainable solution for enhancing on-site safety protocols.
Keywords: large language model (LLM); risk detection; multimodal recognition; retrieval-augmented generation; prompt tuning
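The top-K retrieval step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are taken as already computed by an image encoder such as CLIP, cosine similarity is assumed as the metric (the abstract only says "a similarity metric"), and the function names `cosine` and `retrieve_top_k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def retrieve_top_k(query_vec, case_vecs, k):
    """Return indices of the k database cases most similar to the query.

    `case_vecs` stands for the precomputed embeddings of the images stored in
    the retrieval database; the result is ordered from most to least similar.
    """
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(case_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]
```

The returned indices would then be used to look up the stored images and annotations that are formatted into the few-shot prompt for the LLM.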
Abstract: Objective: Line drawing extraction is one of the key tasks in computer vision and image processing. It aims to automatically extract contour and edge information with semantic continuity and structural consistency from an original image by using edge detection and feature learning techniques, thereby providing high-quality structured input for downstream tasks such as animation coloring, style transfer, image generation, and illustration restoration. This task requires the model not only to identify the main outline of the object accurately but also to maintain line continuity and a sound overall structure while suppressing interference from irrelevant background and texture details. Existing line drawing extraction methods can obtain relatively clear lines in regular scenes; however, for images with complex textures and rich backgrounds, balancing line detail fidelity against background purity is difficult, and problems such as line breakage, blurred contours, loss of local details, and background artifacts are prone to occur. These problems cause the extraction results to lack semantic integrity and artistic consistency, thereby reducing the input quality of downstream tasks and failing to meet the demands of actual creation and industrial applications for high-precision line drawings. In response, this study proposes a high-fidelity line drawing extraction model, the cross-level enhanced aggregation and refinement network (CLEAR-Net), based on cross-level response fusion and joint loss optimization. The model fully utilizes multiscale semantic information by integrating feature responses at different levels and introduces a joint optimization strategy, thereby effectively improving the quality of line extraction. 
It suppresses background artifacts while ensuring structural consistency, thereby obtaining clean line drawing results. Method: This study makes structural improvements to U2-Net and introduces a deconvolution module to enhance the model’s response to features at different levels. By adding deconvolution operations in the upsampling stage, the model can fully restore the spatial detail of deep responses, so that delicate edge structures are extracted during multiscale feature fusion. Subsequently, a dynamic side aggregation module is proposed to achieve dynamic fusion and optimization of cross-level features. This module automatically allocates aggregation weights based on the correlation between features of different layers, strikes a balance between global structural information and local texture details, and effectively enhances the coherence and integrity of the line structure. A background suppression supervision mechanism is proposed for the background artifacts common in complex texture scenes. This mechanism enables the model to dynamically penalize pseudo-responses in the background area and effectively reduces the interference of background noise, which enhances the purity and robustness of the results. To further improve the quality of the generated results, a joint loss function is designed that combines the background suppression loss with an improved cross-entropy loss. As a result, background artifacts are suppressed and foreground lines are optimized, improving both line quality and background purity. Finally, in collaboration with a professional art team, this study builds the first high-precision hand-drawn dataset, ArtLine-2K, which contains 2 000 pairs of high-quality rendered line drawing samples covering various painting styles and complex scenes. This dataset was expanded to 10 000 pairs of samples through data augmentation. 
Thus, the scarcity of high-quality labeled data in the line drawing extraction task is effectively alleviated. A systematic comparison was conducted with multiple advanced methods on the ArtLine-2K dataset. Result: Experimental results show that the outputs of CLEAR-Net are difficult to distinguish from the real annotations with the naked eye. Its core error metrics with respect to the real annotations, mean squared error (MSE, 0.000 247) and mean absolute error (MAE, 0.004 810), reach subpixel accuracy (MAE < 0.005), thereby achieving breakthrough performance on ArtLine-2K. The generated results were evaluated by professional painters and could be directly used for secondary creation. Ablation experiments were also conducted on ArtLine-2K to verify the effectiveness of the proposed method. Conclusion: CLEAR-Net achieves breakthrough performance on ArtLine-2K, and its outputs are almost indistinguishable from the real annotations. With an MSE of 0.000 247 and an MAE of 0.004 810, the error of the proposed model reaches the subpixel level (MAE < 0.005), which is significantly better than that of existing methods. Compared with other models, CLEAR-Net performs outstandingly in detail restoration, line continuity, and background purity. The generated line drawings have clear, natural lines with smooth edges and no artifacts, and professional artists judged them directly usable for secondary creation. A systematic ablation experiment on ArtLine-2K verifies the effectiveness of the model structure design and loss function. The results show that the feature extraction module, side fusion mechanism, background suppression loss, and smooth heating cross-entropy jointly and significantly reduce the error. Compared with the baseline model, the proposed model achieves more than 95% improvement in overall performance. 
Furthermore, CLEAR-Net still maintains stable performance on low-quality datasets, such as Anime Sketch Colorization Pair, thereby demonstrating excellent cross-domain generalization ability and robustness.
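The two error metrics reported for CLEAR-Net, MSE and MAE, follow their standard definitions and can be computed as in the short sketch below. The intensity scale is an assumption: the abstract does not state it, and the example simply treats pixel values as floats in [0, 1]. The function name `mse_mae` is hypothetical.

```python
def mse_mae(pred, target):
    """Return (MSE, MAE) between two equally sized 2D images.

    Images are given as nested lists of floats; values in [0, 1] are assumed
    here, which the source abstract does not confirm.
    """
    diffs = [
        p - t
        for row_p, row_t in zip(pred, target)
        for p, t in zip(row_p, row_t)
    ]
    n = len(diffs)
    mse = sum(d * d for d in diffs) / n        # mean squared error
    mae = sum(abs(d) for d in diffs) / n       # mean absolute error
    return mse, mae
```

Because MSE squares each difference, a few large errors raise it much more than they raise MAE, which is why the two values are usually reported together.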
Abstract: Objective: The initial value problem of geodesics on surfaces is not only a classical and fundamental issue in differential geometry but also a topic of considerable significance in a wide range of application domains. From a purely mathematical perspective, geodesics are the natural generalization of straight lines to curved spaces. They describe the shortest or locally shortest paths constrained to lie on a given surface, which makes them essential objects of study in Riemannian geometry and global analysis. Beyond mathematics, geodesics play a crucial role in fields such as geometric modeling, computer graphics, computer-aided design, robotics, and scientific computing. For example, in geometric modeling and shape analysis, geodesics are used for surface parameterization, shape matching, and feature extraction. In graphics processing, they provide efficient ways to compute distances, generate texture maps, and design deformations of complex surfaces. In scientific simulations, they are often indispensable for accurately capturing trajectories or flow lines constrained by curved geometries. This wide-ranging importance underscores the need for efficient, accurate, and stable computational methods. Traditionally, geometric methods for computing geodesics exploit the intrinsic properties of curves on surfaces. A common strategy constructs the geodesic path iteratively: the position of the curve at each subsequent step is estimated by a Taylor expansion, thereby propagating the geodesic path forward. This geometric approach offers clear intuition and relatively simple formulations. However, in practice, it often encounters severe limitations in accuracy and computational efficiency. 
Because Taylor expansions truncate high-order terms, local errors accumulate along the geodesic trajectory and result in noticeable global deviations. Furthermore, iterative propagation tends to involve repeated geometric computations that can be time-consuming for complex surfaces. Our work addresses these challenges by introducing several enhancements to the traditional geometric framework. Method: The first and most important improvement builds on a key geometric property of geodesics: at any point on a smooth surface, the curvature vector of a geodesic is parallel to the surface normal vector. This fact allows us to establish a direct and robust formulation for computing curvature vectors without relying on approximations that introduce unnecessary complexity or numerical instability. The fact that arc-length-parametrized curves have tangent vectors of unit length allows us to improve the computational accuracy even further. By combining these two observations, we derive a new method for computing the curvature vector that reduces the number of intermediate steps and eliminates certain sources of numerical error. This approach not only makes the computation straightforward but also significantly improves numerical accuracy. As a consequence, the overall geometric algorithm achieves a good balance between computational simplicity and precision. Second, error estimates for the improved geometric method are provided. Result: Extensive experiments were conducted to validate the proposed method, with comparisons against one state-of-the-art geometric method and one classical numerical method. Compared with the geometric method, our method reduces run time by approximately 25%. This reduction translates to substantial savings when dealing with large-scale problems or real-time applications. 
The improvements in accuracy are particularly striking: our method achieves precision gains of 1 to 3 orders of magnitude over the geometric method. For perspective, the geometric method would need at least 13 times the computation time of our method to achieve the same level of accuracy. Such an improvement highlights the efficiency of our formulation and its potential to handle demanding computational tasks. In comparison with the numerical method, the advantages of the proposed method are equally noteworthy. In particular, the run time of our approach is approximately 75% lower than that of the numerical method. In terms of accuracy, the numerical method can attain high accuracy under certain circumstances; however, its accuracy often varies substantially with the surface geometry. By contrast, our method provides consistently stable accuracy across different test cases, thereby demonstrating its robustness. This stability is particularly valuable in applications such as simulation or optimization. Conclusion: The proposed method offers a reliable solution for surface geodesic computation by integrating theoretical soundness with practical efficiency. It preserves the computational stability inherited from the original geometric framework while simultaneously achieving high levels of accuracy and efficiency. The ability to balance these three aspects (precision, efficiency, and stability) is rarely observed in existing methods, which tend to optimize one aspect at the expense of the others. Consequently, the proposed method, which can meet the requirements of diverse applications ranging from academic research to industrial deployment, stands out as a more practical and versatile solution than existing methods. The contributions of this work can be summarized as follows. 
First, we introduce a novel formulation for computing the curvature vector of geodesics, grounded in fundamental geometric properties. This formulation simplifies computations while improving numerical accuracy. Second, we provide rigorous error estimates. Third, comprehensive experimental evaluations show that the proposed method is superior to the geometric method in accuracy and run time and that it outperforms the numerical method in computational stability and efficiency. Together, these contributions establish a method that not only advances the state of the art in geodesic computation but also expands the practical applicability of geometric algorithms to highly challenging and large-scale scenarios.
Keywords: geodesic; curvature vector; geometry; parametric surface; initial value problem
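The geometric property this abstract relies on (the curvature vector of a geodesic is parallel to the surface normal) together with Taylor-step propagation can be illustrated on the simplest case, the unit sphere. The sketch below is not the authors' algorithm, which targets parametric surfaces: the sphere, the second-order step, the projection back onto the surface, and the function names are all assumptions chosen to keep the example short. On the unit sphere the normal at p is p itself and every geodesic has curvature 1, so the curvature vector is simply -p.

```python
import math


def _normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def geodesic_on_unit_sphere(p, t, length, steps):
    """Trace a geodesic from point p with tangent direction t (assumed tangent).

    Each step applies a second-order Taylor expansion
        p(s + h) ~ p + h * t + (h**2 / 2) * k,   t(s + h) ~ t + h * k,
    where k = -p is the curvature vector (parallel to the normal p).
    """
    p, t = _normalize(p), _normalize(t)
    h = length / steps
    for _ in range(steps):
        k = [-c for c in p]
        p_next = [p[i] + h * t[i] + 0.5 * h * h * k[i] for i in range(3)]
        t_next = [t[i] + h * k[i] for i in range(3)]
        # Project the point back onto the sphere.
        p = _normalize(p_next)
        # Keep the tangent orthogonal to the normal and of unit length,
        # as required for an arc-length parametrization.
        along = sum(t_next[i] * p[i] for i in range(3))
        t = _normalize([t_next[i] - along * p[i] for i in range(3)])
    return p, t
```

Starting at (1, 0, 0) with tangent (0, 1, 0) and integrating over a length of π/2 should end close to (0, 1, 0), a quarter of a great circle; the residual error comes from the truncated higher-order Taylor terms that the abstract identifies as the source of accumulated deviation.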
Abstract: Objective: Clothed human generation, which aims to recover the 3D geometry and texture of the human body from input data to generate accurate 3D human models, is a challenging problem in computer vision and computer graphics. The need for high-quality generation has become increasingly critical with the growing demand for realistic 3D human models in applications such as virtual reality and augmented reality. Traditional multiview generation methods typically require specialized equipment to capture images from multiple viewpoints and are therefore often expensive and impractical for everyday use. By contrast, single-view images are easy to obtain from the web, which makes single-view generation more cost-effective and simplifies the model creation process. Given these advantages, we use a single view as input to recover the 3D model of a clothed human. However, single-view images lack comprehensive spatial information and structural details of occluded regions, which makes recovering a complete 3D shape difficult. As a result, existing methods based on implicit functions struggle to learn rear-view information effectively, leading to overly smooth and unrealistic back regions in the generated 3D human model. Methods that incorporate diffusion models show some potential for enhancing texture detail. However, most of them lack view consistency constraints, which makes full recovery of the local texture details of the human body difficult. Additionally, the absence of precise geometric constraints during the diffusion process causes discrepancies between the generated models and the true geometry, particularly when handling complex 3D structures. 
Existing methods also typically assume a uniform point distribution across spatial regions, ignoring variations in the distribution of query points caused by differences in distance from the human body surface. This assumption makes it difficult for them to adapt to the differing geometric complexity of various body regions. As a result, they face limitations when generating the surfaces of loose clothing, which have complex and variable geometries. This study addresses these challenges by proposing a generation method that integrates pose diffusion priors with multiview consistency and combines three mechanisms: pose diffusion prior generation, multiview consistency constraints, and adaptive geometry generation. This approach not only preserves the generative capabilities of the diffusion model but also introduces geometric constraints to ensure the accuracy of the generation. Furthermore, it can generate high-quality 3D human models by incorporating the probability distribution of human body structure. Method: This study constructs a method for single-view clothed human generation. First, a human pose estimation algorithm extracts 25 key points, which are encoded into Gaussian heatmaps to model spatial continuity. This encoding enables the model to understand the spatial relationships around the key points. The Gaussian heatmaps, combined with the human mask and UV mapping, are used to construct a pose feature vector. This feature vector guides the denoising process of the latent diffusion model and generates 2D diffusion images for unseen viewpoints through an adaptive cross-attention mechanism. 
Second, the normal information of the skinned multi-person linear model expressive (SMPLX) human template estimated from the input image is fused with the 2D diffusion image and input into the cross-view normal consistency network, where the multiview consistency mechanism extracts the corresponding 3D spatial features for each viewpoint. Finally, the voxelized features of the SMPLX human template and the 3D spatial features are fused and input into the distribution prediction network for spatial occupancy probability estimation. By learning the distribution parameters of each point, the model can express geometric uncertainty at different spatial locations and sample from the learned probability distribution. Then, the 3D features, voxelized features, and sampling results are input into the occupancy prediction network to achieve 3D clothed human generation. The entire model is trained on the THuman2.0 (Tsinghua human 2.0) dataset, with 490 images used for training and 21 images for testing. To further evaluate the generalization ability of the model, we tested it on the CAPE (clothed auto-person encoding) dataset. This dataset is divided into two subsets: CAPE fitted poses (CAPE-FP), which contains 75 images used to assess the geometric generation accuracy of the method under simple poses, and CAPE nonfitted poses (CAPE-NFP), which contains 75 images and focuses on evaluating the method’s adaptability to complex poses. The experiments are conducted on an NVIDIA GeForce RTX 3090 GPU, with a learning rate of 1 × 10⁻⁴ and a batch size of 2. Result: We conducted experiments on the THuman2.0 and CAPE datasets and compared the single-view clothed human generation results with those of six other methods. Chamfer distance (CD) is used to evaluate the overall geometric similarity of the 3D human body, and point-to-surface distance (P2S) is used to assess the geometric accuracy of the reconstructed surface. 
For both metrics, lower values indicate better performance. On the THuman2.0 dataset, the CD and P2S of the proposed method were 6.27% and 5.74% lower, respectively, than those of the best-performing comparison method. On the CAPE-FP and CAPE-NFP subsets, the CD and P2S of the proposed method were better than those of the other comparison methods. On the entire CAPE dataset, CD decreased by an average of 8.67% and P2S by an average of 2.38%. These quantitative results show that our method generalizes well to unseen data and can effectively handle human generation in complex poses. An inference efficiency comparison shows that the computational complexity of our method is lower than that of similar diffusion-based methods. The experimental results indicate that combining pose diffusion priors with multiview consistency helps recover the texture details of the 3D human body and that adaptive geometry generation enables accurate recovery of complex clothing topologies. Conclusion: The single-view 3D clothed human generation method proposed in this paper, which combines pose diffusion priors and multiview consistency, effectively recovers the local details of the clothed human and accurately generates 3D human models with complex topological structures, such as rich wrinkle details and loose clothing.
Keywords: single-view clothed human generation; pose diffusion priors; multiview consistency constraints; distribution prediction network; probability distribution
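The step in which detected key points are encoded into Gaussian heatmaps can be sketched as below. This is a generic illustration rather than the paper's code: the function name `keypoints_to_heatmaps`, the default `sigma`, the map resolution, and the (x, y) pixel coordinate convention are assumptions, since the abstract only states that 25 key points are encoded as Gaussian heatmaps.

```python
import math


def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Encode (x, y) key points as one Gaussian heatmap per point.

    Each map has value 1.0 at the key point and decays smoothly with distance,
    so nearby pixels carry information about where the joint is.
    Returns a list of height x width nested lists.
    """
    heatmaps = []
    for x0, y0 in keypoints:
        heatmaps.append([
            [
                math.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2.0 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ])
    return heatmaps
```

Compared with a one-hot encoding of the key point location, the smooth falloff gives a spatially continuous signal, which matches the abstract's stated motivation of modeling spatial continuity around the key points.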
Abstract: Objective: Semantic segmentation, which aims to perform dense, pixel-level classification, has emerged as a pivotal technology in computer vision. In Earth observation, where it is termed remote sensing image semantic segmentation, this task is fundamental for interpreting vast amounts of geospatial data. Its applications are wide-ranging and critical, including land-cover mapping, urban development monitoring, change detection, and environmental surveillance. Early approaches relied on low-level features and classic machine learning, but they struggled with the immense complexity of very-high-resolution imagery. The advent of deep learning, particularly convolutional neural networks, revolutionized the field. Models based on the fully convolutional network and U-Net architectures recast semantic segmentation as an end-to-end pixel-labeling problem and excel at learning hierarchical feature representations directly from data. However, two persistent and critical challenges still hinder their performance. The first is the information imbalance in multiscale feature fusion. The encoder-decoder structure generates high-level features that are rich in semantic context but spatially coarse, and low-level features that are spatially precise but semantically weak. Standard fusion strategies, such as symmetric skip connections, treat these features equally. This leads to suboptimal fusion in which fine details are diluted by overly smooth semantic information or high-level context is corrupted by irrelevant low-level texture. The second is the difficulty of extracting directional features. Man-made and natural structures such as roads, rivers, and building boundaries are inherently linear and exhibit strong directional anisotropy. 
Standard square convolutional kernels are isotropic and ill-suited for capturing these long, continuous structures, which often leads to segmentation outputs with discontinuous lines and fragmented object boundaries. To overcome these two challenges, this study aims to develop a novel deep learning architecture that intelligently fuses multiscale features while explicitly enhancing the network’s ability to perceive directional information, thereby producing accurate and structurally coherent segmentation maps. Method: We propose a new network architecture, the selective attention with directional feature enhancement network (SADENet), which is built on an encoder-decoder framework with a ResNet backbone as the feature encoder. The core innovations are two new modules integrated into the decoder path: the top-k cross-attention (TCA) module and the directional feature enhancement module (DFEM). The TCA module is designed to fuse features selectively. It employs an asymmetric cross-attention mechanism in which the high-level feature acts as the query, whereas the low-level feature provides the key and value. This allows a context-aware selection process in which high-level semantic information actively guides the search for the most relevant fine-grained details. A top-k selection strategy is incorporated to maintain computational tractability: it prunes the attention matrix to consider only the most significant feature interactions, thereby improving efficiency. DFEM is engineered to address the challenge of linear feature segmentation. It consists of a multibranch parallel structure in which each branch uses a pair of asymmetric 1D convolutions, one horizontal (1 × k) and one vertical (k × 1), to capture features along the cardinal orientations explicitly and independently. 
By using multiple branches with varying kernel sizes (k = 1, 3, 5, 7), the module can model directional structures across a range of scales. The features from these parallel branches are then adaptively fused using a position enhancement module. The entire network is implemented in PyTorch. As detailed in the paper, training used the Lookahead optimizer wrapping AdamW, with a cosine annealing learning rate scheduler and an initial learning rate of 6 × 10⁻⁴. The models were trained for 105 epochs with a batch size of 8. We applied extensive data augmentation, including random cropping, flipping, rotation, and mosaic augmentation. All experiments were conducted on a workstation equipped with an NVIDIA RTX 4090 GPU. Result: We evaluated SADENet against five representative state-of-the-art methods (U-Net, DeepLabv3+, BANet, UNetFormer, and the CNN and multiscale Transformer fusion network (CMTFNet)) on two public benchmark datasets: ISPRS Vaihingen and ISPRS Potsdam. On the Vaihingen dataset, our model achieved a mean intersection-over-union (mIoU) of 84.68% and an overall accuracy (OA) of 93.55%, outperforming the second-best method, CMTFNet, by 0.94% in mIoU and 2.4% in OA. This superior performance was observed across all six land-cover classes. On the more challenging Potsdam dataset, SADENet achieved an mIoU of 86.84% and a mean F1-score of 92.84%, the best performance among all compared methods. We conducted comprehensive ablation experiments on the Vaihingen dataset to validate the effectiveness of the proposed components. Starting from a U-Net baseline scoring 81.25% mIoU, integrating the TCA module alone increased the mIoU by 1.99 percentage points to 83.24%. 
Similarly, integrating only the DFEM improved the baseline by 2.31 percentage points to 83.56%, a larger gain than that from the TCA module alone, which underscores the contribution of directional feature enhancement to feature representation. The complete SADENet model, which combines both modules, achieved the final mIoU of 84.68%, a total gain of 3.43 percentage points over the baseline. This result confirms that both modules are effective and complementary. Visual comparisons of the segmentation maps further illustrated the model’s advantages: sharper building boundaries, continuous and correctly delineated road networks, and a superior ability to distinguish small, densely packed objects such as cars, where other methods tended to produce blurred or merged results. These observations support the model’s practical utility.

Conclusion: In this study, we proposed a novel network, SADENet, which integrates a selective attention mechanism and a directional feature enhancement module for remote sensing image semantic segmentation. The experimental results show that our model outperforms several state-of-the-art approaches on two widely used benchmark datasets. The designed modules address the key issues of multiscale feature imbalance and poor linear structure representation, thereby leading to accurate and structurally coherent segmentation results. SADENet provides a useful new tool for geospatial analysis by enabling targeted feature fusion and enhancing the perception of anisotropic structures.
The work demonstrates that substantial improvements in segmentation quality can be achieved by carefully designing architectural components tailored to specific data challenges, thereby paving the way for highly reliable automated interpretation of complex remote sensing imagery.
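
The top-k cross-attention idea described in the Method part can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation: the function name `topk_cross_attention`, the `(batch, tokens, channels)` layout, the omission of learned query/key/value projections, and the choice to mask pruned scores with negative infinity before the softmax are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def topk_cross_attention(high, low, k):
    """Cross-attention in which each query keeps only its k best-matching keys.

    high: (B, Nq, C) high-level tokens, used as queries.
    low:  (B, Nk, C) low-level tokens, used as keys and values.
    k:    number of key positions kept per query (hypothetical parameter).
    """
    scale = high.shape[-1] ** -0.5
    # Scaled dot-product similarity between every query and every key.
    scores = torch.matmul(high, low.transpose(1, 2)) * scale  # (B, Nq, Nk)
    k = min(k, scores.shape[-1])
    top_vals, top_idx = scores.topk(k, dim=-1)
    # Scores outside the top-k become -inf, so softmax assigns them zero weight.
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, top_idx, top_vals)
    attn = F.softmax(masked, dim=-1)
    # Each output token is a weighted sum of the k selected low-level tokens.
    return torch.matmul(attn, low), attn


if __name__ == "__main__":
    torch.manual_seed(0)
    high = torch.randn(2, 4, 8)   # 4 high-level tokens per sample
    low = torch.randn(2, 16, 8)   # 16 low-level tokens per sample
    fused, attn = topk_cross_attention(high, low, k=3)
    print(fused.shape, attn.shape)
```

This sketch still builds the full score matrix before pruning, so it shows the selection behavior rather than the efficiency gain reported in the abstract.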
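
The multibranch directional structure of DFEM can likewise be sketched in PyTorch. The abstract gives only the kernel shapes (1 × k and k × 1) and the kernel sizes (k = 1, 3, 5, 7), so the rest is assumed: the class names are hypothetical, the horizontal and vertical outputs are summed within each branch, and a plain 1 × 1 convolution stands in for the position enhancement module, whose design is not described here.

```python
import torch
import torch.nn as nn


class DirectionalBranch(nn.Module):
    """One branch: a horizontal 1 x k conv and a vertical k x 1 conv in parallel."""

    def __init__(self, channels, k):
        super().__init__()
        pad = k // 2  # "same" padding for odd k keeps the spatial size unchanged
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, k), padding=(0, pad))
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(k, 1), padding=(pad, 0))

    def forward(self, x):
        # Assumption: the two orientations are computed independently and summed.
        return self.horizontal(x) + self.vertical(x)


class DirectionalEnhancement(nn.Module):
    """Parallel branches with different kernel sizes, fused by a 1 x 1 conv.

    The 1 x 1 fusion is a stand-in, not the paper's position enhancement module.
    """

    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([DirectionalBranch(channels, k) for k in kernel_sizes])
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))


if __name__ == "__main__":
    module = DirectionalEnhancement(channels=8)
    x = torch.randn(2, 8, 16, 20)
    print(module(x).shape)  # same shape as the input
```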
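
The learning rate schedule mentioned in the Method part can be made concrete with the standard closed form of cosine annealing. The initial rate (6 × 10⁻⁴) and the epoch count (105) come from the abstract. The minimum rate of 0, the per-epoch stepping, and the function name are assumptions, and the Lookahead wrapper around AdamW is not modeled.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=105, base_lr=6e-4, min_lr=0.0):
    """Closed-form cosine annealing from base_lr down to min_lr.

    min_lr = 0.0 is an assumption; the abstract does not state a minimum rate.
    """
    cos_term = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos_term)


if __name__ == "__main__":
    for epoch in (0, 26, 52, 79, 105):
        print(epoch, f"{cosine_annealing_lr(epoch):.2e}")
```

Under these assumptions the rate starts at the initial value, passes through half of it at the midpoint of training, and reaches the minimum at the final epoch.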
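
For readers less familiar with the reported metrics, the following sketch computes OA, mIoU, and mean F1 from a confusion matrix using their standard definitions. The function name and the toy two-class matrix are illustrative only. The benchmark's own evaluation protocol may differ in detail, for example in which classes are included.

```python
def segmentation_metrics(conf):
    """Compute OA, mIoU, and mean F1 from a square confusion matrix.

    conf[i][j] counts pixels of true class i that were predicted as class j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    oa = sum(conf[i][i] for i in range(n)) / total
    ious, f1s = [], []
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp                          # true class i, predicted otherwise
        fp = sum(conf[r][i] for r in range(n)) - tp     # predicted class i, truly otherwise
        union = tp + fp + fn
        ious.append(tp / union if union else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if union else 0.0)
    return {"OA": oa, "mIoU": sum(ious) / n, "mF1": sum(f1s) / n}


if __name__ == "__main__":
    toy = [[8, 2],
           [1, 9]]  # hypothetical two-class confusion matrix
    print(segmentation_metrics(toy))
```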