Abstract: The rapid advancement of deep learning has profoundly accelerated the development of autonomous driving (AD) systems, enabling them to operate in increasingly complex, dynamic, and unstructured real-world environments. By leveraging large-scale, high-dimensional data and increasingly powerful neural network architectures, these systems have achieved substantial progress in a wide range of core tasks, including high-precision perception through sensor fusion, accurate prediction of surrounding agent behaviors, and real-time decision-making under complex traffic constraints. This progress has enabled autonomous vehicles (AVs) to navigate structured environments such as urban streets and highways. However, ensuring the safety and reliability of AD remains a fundamental and unresolved challenge, particularly when systems encounter rare, unexpected, or hazardous situations that lie outside the typical distribution of training and test data. Scenarios involving long-tail events, adversarial perturbations, or distribution shifts can severely compromise system performance, leading to outcomes that may threaten safety despite high overall accuracy under standard conditions. These edge cases are difficult to capture through conventional data collection because of their infrequency and unpredictability, and are often underrepresented in existing benchmarks. Consequently, the research community has increasingly recognized the need to construct evaluation datasets that are not only large in scale but also rich in diversity and semantic relevance to meaningfully assess system robustness under real-world variability. To support such efforts, recent work has begun to explore how to systematically generate challenging and informative scenarios through synthetic data, simulation platforms, and generative modeling techniques. These trends underscore a shift toward more proactive, data-centric approaches to validation, emphasizing the importance of evaluating AD systems not only in nominal cases but also under the full spectrum of environmental complexity and operational risk. To address the pressing need for reliable evaluation under rare and safety-critical scenarios, this survey presents a comprehensive review of data generation techniques aimed at enhancing safety assessment in AD systems. Although such scenarios occur infrequently in real-world operations, they often correspond to critical failure points where system vulnerabilities are most pronounced. Traditional validation pipelines, which rely heavily on precollected datasets or limited real-world testing, are insufficient for uncovering these edge cases given their constrained scenario coverage and the high cost, risk, and unpredictability associated with capturing rare events in physical environments. In light of these limitations, we emphasize artificial data generation as a proactive and scalable approach for constructing diverse, targeted, and informative evaluation scenarios. Rather than concentrating solely on algorithmic refinements within individual modules, we adopt a holistic perspective that investigates how safety vulnerabilities emerge across the entire autonomy stack, encompassing perception, prediction, and planning. While these components are often evaluated in isolation, they operate in a highly interconnected manner in deployed systems.
Errors in perception may cascade into inaccurate predictions or unsafe planning decisions, while flaws in predictive reasoning can trigger suboptimal or hazardous behaviors in downstream planning, even when upstream inputs are correct. These safety risks frequently stem not only from the limitations inherent to individual modules but also from complex interdependencies, hidden feedback loops, and susceptibility to environmental uncertainties. Such cross-cutting failure pathways highlight the need for scenario generation frameworks that capture not only isolated errors but also integrated, system-level interactions that more accurately reflect the operational complexities of real-world AD. To support rigorous and targeted evaluation under these conditions, this survey organizes existing data generation approaches into three methodological paradigms: data-driven generative modeling, which employs machine learning models to synthesize diverse and realistic driving data distributions; optimization-based methods, which explicitly seek failure-inducing conditions through adversarial testing, scenario search, or reward-driven exploration; and semantics-guided generation, which leverages high-level constraints, formal specifications, or scene priors to ensure interpretability, control, and relevance in generated scenarios. For each paradigm, we analyze the underlying principles, technical implementations, and typical application domains, and we critically assess their strengths and limitations in capturing failure modes that are essential for robust safety evaluation. Through this analysis, we aim to clarify the landscape of scenario generation techniques, identify gaps in current methodologies, and provide a foundation for future research toward more comprehensive and risk-aware validation of AD systems. Despite the progress made, current data generation techniques still face several critical challenges that constrain their effectiveness in supporting system-level safety validation. A key difficulty lies in balancing the trade-off between data realism and diversity. While high-fidelity samples are necessary to ensure physical plausibility and semantic coherence, excessive emphasis on realism may hinder the generation of rare, safety-critical scenarios that are essential for robustness evaluation. Conversely, approaches that prioritize diversity may produce semantically inconsistent or physically implausible samples, reducing their applicability to real-world systems. These challenges are further complicated by the differing requirements across system modules, where perception tasks emphasize sensory accuracy and geometric consistency, while prediction and planning demand coherent behavioral evolution and decision feasibility. In addition, most existing methods lack mechanisms to incorporate system feedback into the generation process, limiting their ability to adaptively target failure-prone conditions. The absence of an integrated testing pipeline that connects data generation, simulation, and deployment also restricts the overall effectiveness of these methods in realistic safety assessment. Moving forward, future research should focus on developing adaptive and interpretable generation frameworks that can dynamically adjust generation objectives on the basis of task context and system feedback while building high-precision simulation platforms that support consistent multimodal data modeling and semantically controllable scenario construction. 
Advancing these capabilities will be essential for transforming data generation into a reliable foundation for the evaluation and improvement of safety-critical AD systems.
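To make the optimization-based paradigm discussed above concrete, the following is a minimal, self-contained Python sketch of failure-seeking scenario search: scenario parameters of a toy longitudinal cut-in situation are sampled and the sample that minimizes a simple safety surrogate (the minimum gap) is retained. The kinematic model, parameter ranges, and surrogate are illustrative assumptions only and are not drawn from any specific benchmark, simulator, or tool described in the survey.

```python
import numpy as np

def min_gap(cut_in_distance, cut_in_speed, ego_speed=20.0, ego_decel=4.0,
            reaction_time=0.5, dt=0.05, horizon=8.0):
    """Toy longitudinal cut-in scenario: returns the minimum gap (m) between
    the ego vehicle and a vehicle that cuts in ahead of it. All dynamics are
    simplified, illustrative assumptions."""
    gap = cut_in_distance
    ego_v, lead_v = ego_speed, cut_in_speed
    t, smallest = 0.0, cut_in_distance
    while t < horizon and gap > 0.0:
        # Ego brakes after a fixed reaction time if it is closing in on the leader.
        if t >= reaction_time and ego_v > lead_v:
            ego_v = max(ego_v - ego_decel * dt, 0.0)
        gap += (lead_v - ego_v) * dt
        smallest = min(smallest, gap)
        t += dt
    return smallest

def random_scenario_search(n_trials=2000, seed=0):
    """Optimization-based scenario generation in its simplest form: randomly
    sample scenario parameters and keep those that minimize a safety surrogate."""
    rng = np.random.default_rng(seed)
    worst = None
    for _ in range(n_trials):
        d0 = rng.uniform(5.0, 40.0)      # cut-in distance (m)
        v_lead = rng.uniform(5.0, 20.0)  # cut-in vehicle speed (m/s)
        score = min_gap(d0, v_lead)
        if worst is None or score < worst[0]:
            worst = (score, d0, v_lead)
    return worst

if __name__ == "__main__":
    gap, d0, v_lead = random_scenario_search()
    print(f"most critical sampled scenario: gap={gap:.2f} m, "
          f"cut-in distance={d0:.1f} m, cut-in speed={v_lead:.1f} m/s")
```

In practice, the same loop is typically driven by adversarial or black-box optimizers against a full simulator and the system under test, rather than random search over a closed-form surrogate.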
Abstract:
Objective: The rapid evolution of artificial intelligence-generated content (AIGC), particularly text-to-image models such as Stable Diffusion, Midjourney, and DALL-E, presents a dual-use dilemma. While enabling unprecedented creative applications, the proliferation of hyper-realistic AI-generated images poses severe societal risks, including the spread of disinformation, sophisticated fraud, and intellectual property infringement. This situation underscores the urgent need for robust and reliable detection technologies. However, conventional detection methods, which rely on offline training, are fundamentally ill-equipped for this dynamic environment. The “train-once, deploy-forever” paradigm causes them to fail when faced with the ceaseless emergence of novel generative models, leading to a swift decay in performance. This critical limitation highlights the need for a more adaptive approach. Continual learning (CL), a paradigm designed to enable models to learn sequentially from a continuous data stream while mitigating “catastrophic forgetting,” offers a promising solution. Yet, its application to the diverse and fast-paced domain of modern AIGC detection is hindered by two significant, unaddressed gaps. First, there is a conspicuous absence of a specialized, large-scale benchmark for systematically evaluating continual AIGC detection methods. Second, and more critically, a unique real-world data constraint creates a novel and formidable learning challenge. This constraint arises because new generative models (positive samples) are often released and become accessible, while their corresponding, high-quality real training images (negative samples) remain proprietary and unavailable. This condition leads to a novel “mixed dual- and single-class” incremental learning problem: initial learning tasks may possess both positive and negative samples, but subsequent tasks are often restricted to positive samples only. This scenario fundamentally violates the core assumptions of most existing CL algorithms, rendering them ineffective. To address these profound challenges, this study establishes a foundational methodology for the continual detection of AI-generated images. Our primary objective is to construct a comprehensive benchmark and a robust framework capable of dynamically adapting to an ever-expanding stream of generative models, particularly under these realistic and challenging data constraints.
Method: First, we introduce and release a continual AI-generated image detection (CAID) benchmark, the first large-scale dataset specifically tailored for this task. It contains high-quality images from five state-of-the-art generative models (Stable Diffusion v1.5, DALL-E 2, Imagen, Midjourney, and Parti) and corresponding real images, organized into a sequential task stream to simulate the real-world emergence of new AIGC technologies. Building on this benchmark, we formally define the CAID problem, which requires a model not only to perform binary classification (real vs. fake) but also to achieve multiclass source attribution (i.e., identifying the specific generator model), a task crucial for “copyright identification”. To standardize evaluation, we design three evaluation scenarios with increasing difficulty based on data replay constraints.
Scenario 1 (full replay): a lenient setting where a small buffer of historical positive (fake) and negative (real) samples can be replayed. Scenario 2 (negative-only replay): a more practical setting where only historical negative samples can be replayed, reflecting intellectual property or privacy restrictions. Scenario 3 (no replay): the most stringent setting, completely forbidding access to any past samples and forcing the model to learn incrementally from single-class data. Then, we propose tailored solutions for these scenarios. For Scenarios 1 and 2, we adapt existing CL methods using a “negative sample sharing” mechanism. This solution ensures that a small set of real images from the initial task is consistently available, providing the necessary negative class information to stabilize training and prevent the failure of standard loss functions. For the most severe no-replay scenario (Scenario 3), in which our experiments find that existing methods catastrophically fail, we propose a novel universal conversion framework. This framework is engineered to rescue failing methods by systematically addressing the core breakdown points of loss function invalidation, severe classifier output bias, and feature representation drift. We integrate three synergistic components to achieve this objective. First, we use knowledge distillation (KD), in which the model from the previous task acts as a “teacher” to guide the current model, preserving knowledge of past classes by matching its output logits. Second, we replace the standard linear layer with a cosine-normalized (CN) classifier, which calibrates output logits across all tasks to mitigate bias. Finally, the framework utilizes prompt tuning (PT) by freezing the pretrained vision transformer backbone and training only a small set of learnable “prompt” parameters. This parameter-efficient approach drastically reduces overfitting on new single-class data and preserves the integrity of the learned feature space.
Result: Our extensive experiments on the CAID benchmark validate the efficacy of our methodology. In Scenarios 1 and 2, adapted CL methods such as FOSTER and S-Prompts perform well, confirming the value of replaying historical data. The most compelling results emerge from Scenario 3. Standard replay-free methods (LwF, EWC, and S-Prompts) completely collapse, with their average accuracy (AA) dropping to random-guess levels (~50%), and suffer from extreme catastrophic forgetting. The application of our universal conversion framework resurrects these methods. Their AA surges dramatically to 65%, and, critically, catastrophic forgetting is largely eliminated, with average forgetting (AF) scores dropping to near-zero levels. This finding provides unequivocal evidence that our framework successfully navigates the extreme challenges of no-replay, single-class incremental learning. An in-depth ablation study confirms that all three components (KD, CN, and PT) are indispensable and work synergistically to achieve this remarkable performance recovery. Furthermore, t-SNE visualizations verify that our framework significantly reduces feature drift, while frequency spectrum analysis highlights the inherent difficulty of the CAID dataset in comparison with traditional deepfakes, justifying the need for our approach.
Conclusion: This study makes a foundational contribution to the critical and burgeoning field of AIGC detection.
We have constructed and released the first large-scale benchmark for CAID, formally defined the problem, and identified the novel and practical “mixed dual- and single-class” learning challenge. Our proposed solutions, particularly the innovative universal conversion framework, provide a robust and effective strategy for developing adaptive detection systems, demonstrating how to maintain performance even under the most stringent real-world data constraints. By open-sourcing our dataset and code, we aim to provide a solid foundation and catalyze future research in this vital area. Ultimately, our findings offer significant methodological support for building the next generation of future-proof detection systems capable of keeping pace with the relentless evolution of AI generation technologies.
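The following Python sketch illustrates, in generic form, the three components named in the Method section above (knowledge distillation, cosine normalization, and prompt tuning). The class names, dimensions, and stand-in backbone interface are our own illustrative assumptions and do not reproduce the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Cosine-normalized classifier: logits are scaled cosine similarities between
    L2-normalized features and class weights, keeping logit magnitudes comparable
    across old and new tasks."""
    def __init__(self, feat_dim, num_classes, scale=16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.scale = scale

    def forward(self, feats):
        feats = F.normalize(feats, dim=-1)
        weight = F.normalize(self.weight, dim=-1)
        return self.scale * feats @ weight.t()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Knowledge distillation: match the previous-task model's softened outputs."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

class PromptedBackbone(nn.Module):
    """Parameter-efficient adaptation: the (stand-in) backbone is frozen and only
    a small set of learnable prompt tokens is trained for each new task."""
    def __init__(self, backbone, embed_dim=768, num_prompts=10):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # freeze pretrained weights
        self.prompts = nn.Parameter(torch.zeros(num_prompts, embed_dim))

    def forward(self, tokens):               # tokens: (B, N, D) patch embeddings
        prompts = self.prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.backbone(torch.cat([prompts, tokens], dim=1))

if __name__ == "__main__":
    dummy_backbone = nn.Sequential(nn.Linear(768, 768))  # stand-in for a frozen ViT
    model = PromptedBackbone(dummy_backbone)
    feats = model(torch.randn(2, 196, 768)).mean(dim=1)  # pooled (B, 768) features
    head = CosineClassifier(768, num_classes=6)          # real + 5 generators
    print(head(feats).shape)                             # torch.Size([2, 6])
```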
Abstract:
Objective: Meters, as critical components of substations, are essential for maintaining power grid stability. However, prolonged exposure to harsh environments such as extreme weather and temperature fluctuations makes them prone to defects such as cracks and deformations. These issues can disrupt operations and threaten grid reliability. Early detection is vital to prevent cascading failures. While recent advancements in computer vision have improved defect identification, training high-performance models relies on large, accurately labeled datasets, which are costly and limited by the scarcity of real-world defect data for specialized equipment such as meters. Generative data augmentation methods offer an effective and promising solution to this issue. Techniques such as generative adversarial networks and denoising diffusion probabilistic models have proven their capability to generate visually compelling images by training on large-scale datasets. These methods are widely used to supplement existing datasets, enhance data diversity, and improve model training efficiency. However, when applied to small-sample datasets of substation meter defects, these approaches face great challenges, such as contour distortion, insufficient texture details, and excessive similarity to original images. These issues degrade the quality of generated images, fail to capture the subtle characteristics of meter defects, and limit their usefulness in tasks such as defect detection and segmentation. To overcome these limitations, this study proposes a novel defect generation method based on the Stable Diffusion model, which is specifically designed for small-sample scenarios. By leveraging its capability to balance high-quality and diverse image generation, this approach addresses the weaknesses of existing methods by improving the fidelity and variability of generated defect images. The proposed method ensures better alignment with real-world applications and enhances the applicability of synthetic images in downstream detection and analysis tasks. Ultimately, it contributes to improved defect detection performance and increased reliability in industrial applications.
Method: This study proposes a novel small-sample defect generation method for substation meters based on a diffusion model. This method addresses the limitations of existing approaches in capturing structural and defect-specific features while achieving high-quality image generation to substantially enhance downstream applications. First, the pre-trained Stable Diffusion model was fine-tuned using a meter knowledge embedding strategy. This process effectively integrated the structural characteristics of substation meters and defect features into the model weights, which enhanced the capability of the model to comprehend meter patterns and improved the accuracy of key feature representation and reconstruction. Second, a crack feature modeling module was developed. This module utilized a structured preprocessing approach to process normal meter images and seamlessly integrated them with existing defect masks, which generated control images with precise geometric and spatial constraints. The module effectively delineated the spatial distribution of defects, which ensured accuracy and consistency in defect localization, as well as provided reliable conditional guidance for subsequent generation processes. Finally, an innovative hypernetwork-based conditional generation mechanism was introduced.
While maintaining the diversity of generated images, this mechanism achieved precise manipulation of defect shapes, positions, and other characteristics. By dynamically adjusting model weights and refining input conditions, the hypernetwork effectively ensured local constraints and global coherence during the generation process. It enabled precise control over defect generation while moderately reducing strict constraints on other details, which gave the model the creative flexibility to balance high-quality and diverse image generation.
Result: Comprehensive experimental validation was conducted on the constructed substation meter dataset, which demonstrated that the proposed method can generate high-quality images with diverse and precise defect characteristics. The resulting images closely aligned with real-world application scenarios. The introduction of synthetic data significantly improved the performance of the model in downstream defect detection tasks. Notably, when 40% synthetic data were added to the training set, model precision increased by 26.9%, and mAP50 improved by 19.1%, which further verified the effectiveness of the proposed method in enhancing detection accuracy and robustness. Moreover, comparative experiments with advanced mainstream methods highlighted the superiority of the proposed approach. Fréchet inception distance (FID) and inception score (IS) were used as evaluation metrics to measure the similarity between generated and real images and the diversity of generated images, respectively. A lower FID score indicates higher generation quality, which reflects a smaller gap between the distributions of generated and real images. Meanwhile, a higher IS score demonstrates better clarity and diversity of the generated images. Experimental results show that the proposed method achieved the best performance in FID and IS metrics, with scores of 76.72 and 2.45, respectively, which significantly surpassed other mainstream methods.
Conclusion: This study proposes a small-sample generation method based on the Stable Diffusion model, which focuses on the generation of substation meter defect images. Experimental results demonstrate that the proposed method effectively addresses issues associated with existing generation models, such as poor quality and high redundancy of images generated from small-sample specialized datasets. By producing high-quality defect images, the method significantly enhances the accuracy and robustness of downstream defect detection tasks. Thus, it provides a solid and reliable technical foundation for the stable operation of power systems.
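As a rough illustration of the hypernetwork-based conditional generation idea described above, the sketch below conditions a frozen linear projection (standing in for a layer of a pretrained diffusion model) on a defect-mask embedding through a small hypernetwork that predicts a low-rank weight delta. All module names, dimensions, and the low-rank parameterization are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class HyperModulatedLinear(nn.Module):
    """Hypernetwork-style conditioning: an MLP maps a condition vector (e.g., an
    embedding of the defect-mask control image) to a low-rank weight delta that
    modulates a frozen linear projection, loosely mirroring how a hypernetwork
    could inject defect shape/position cues into a pretrained generator."""
    def __init__(self, in_dim, out_dim, cond_dim, rank=4):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        for p in self.base.parameters():
            p.requires_grad = False              # keep pretrained weights fixed
        self.hyper = nn.Sequential(
            nn.Linear(cond_dim, 128), nn.SiLU(),
            nn.Linear(128, rank * (in_dim + out_dim)),
        )
        self.rank, self.in_dim, self.out_dim = rank, in_dim, out_dim

    def forward(self, x, cond):
        # Predict low-rank factors A (out_dim x rank) and B (rank x in_dim).
        ab = self.hyper(cond)
        a, b = ab.split([self.rank * self.out_dim, self.rank * self.in_dim], dim=-1)
        a = a.view(-1, self.out_dim, self.rank)
        b = b.view(-1, self.rank, self.in_dim)
        delta_w = torch.bmm(a, b)                # (B, out_dim, in_dim)
        base_out = self.base(x)                  # frozen pretrained path
        delta_out = torch.einsum("boi,bi->bo", delta_w, x)
        return base_out + delta_out

# Example: condition a 320-dim projection on a 64-dim mask embedding.
layer = HyperModulatedLinear(in_dim=320, out_dim=320, cond_dim=64)
print(layer(torch.randn(2, 320), torch.randn(2, 64)).shape)  # torch.Size([2, 320])
```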
Abstract:
Objective: Video steganography is a technique used to hide secret information within videos, leveraging videos as carriers for secure communication. Existing video steganography methods primarily embed secret information through the modification of redundant information within cover videos. Although these approaches have been widely adopted, they still suffer from considerable limitations. First, modifying original videos often leaves detectable traces, increasing the vulnerability of these methods to detection via advanced steganalysis networks, which are adept at identifying unnatural patterns introduced by such modifications. Second, robustness is a critical challenge for steganographic methods because videos are often subjected to lossy processing during transmission, such as compression, bit errors, and frame rate conversion. Current approaches that address robustness typically train secret extractors on distorted video data generated by attack layers simulating known types of distortions. Despite their effectiveness in controlled environments, these methods heavily depend on prior knowledge regarding the types and intensities of possible distortions that videos might encounter during transmission. This dependence limits their adaptability and effectiveness in real-world scenarios where transmission distortions are unpredictable and diverse. To overcome these limitations, this paper introduces, for the first time, a generative video steganography scheme based on diffusion models. Unlike traditional methods that require modification of existing videos as carriers, the proposed approach directly generates videos from secret information, bypassing the need for carrier videos and substantially enhancing steganographic security. Additionally, the proposed method operates in a stable and compact latent space to help embed and extract secret information. This approach not only reduces the computational complexity but also enhances robustness against various video processing distortions commonly encountered in practical applications. By eliminating cover video modifications and leveraging the properties of latent diffusion models, this approach represents a remarkable advancement in video steganography.
Method: The proposed method introduces a novel bijective mapping module that establishes a direct and reversible mapping between secret information and Gaussian noise. This module is specifically designed to facilitate seamless embedding and extraction of secret information within the latent space of text-to-video (T2V) generative models. During the embedding phase, the sender applies the bijective mapping module to transform the secret information into a Gaussian-like noise distribution using an inverse discrete cosine transform (IDCT-2D). The transformed Gaussian noise, which contains the embedded secret, is then input into the latent space of a pre-trained T2V model. The T2V model iteratively denoises the latent Gaussian noise, producing the final steganographic video. This embedding process seamlessly integrates secret information into the video during generation, ensuring that the final output video appears visually natural and eliminating the need to modify existing cover videos. During the secret extraction phase, the receiver processes the received steganographic video, which may have undergone lossy transmission or various distortions.
The reverse denoising process of the T2V model is applied to recover the latent Gaussian noise from the received video. Once the Gaussian noise is retrieved, the bijective mapping module uses a discrete cosine transform (DCT-2D) to reconstruct the original secret information. The use of reversible transformations ensures accurate extraction of the embedded information, even under challenging conditions. To further improve extraction accuracy, the method employs a repeated embedding strategy, in which the same secret is embedded multiple times throughout the generated video. This strategy allows for majority voting during the extraction phase, effectively mitigating the effects of lossy transmission and increasing overall robustness.
Result: To simulate real-world scenarios, an attack layer was incorporated into the experiments to apply various video processing distortions to the steganographic videos. This layer emulated common distortions such as video compression, bit errors, and frame rate conversion. For video compression, H.264 and H.265 codecs were tested under different intensity levels, including average bitrate (ABR) and constant rate factor (CRF) settings. Bit errors were then simulated to replicate imperfections in transmission channels, where specific bits are flipped during the process. Frame rate conversion scenarios were specifically designed to mimic frame rate adjustments caused by device or network constraints. Each distortion type was applied at two intensity levels, with high levels leading to substantial video degradation. Extensive experiments on the WebVid-10M dataset demonstrated the effectiveness of the proposed method compared to two baseline approaches. The proposed method achieved notably higher embedding capacities, with bit rates of 96, 192, and 384 bits per frame, surpassing the baseline methods. The proposed approach maintained high secret extraction accuracy, exceeding 90%, even under severe video distortions such as high-intensity H.264 and H.265 compression. In contrast, both baseline methods demonstrated substantial reductions in extraction accuracy under similar conditions, highlighting the robustness of the proposed method. The visual quality of the generated videos was also evaluated using the Fréchet video distance (FVD) and contrastive language-image pre-training (CLIP) metrics, which measured text alignment, domain similarity, and motion smoothness. Experimental results revealed that the steganographic videos maintained visual quality comparable to videos generated by the T2V model in the absence of secret embedding, demonstrating their practicality and usability in real-world applications. Additionally, the security of the proposed method was validated through theoretical analysis and steganalysis experiments. Theoretical evaluation involved visualizing the stego noise, which revealed a distribution closely resembling a Gaussian distribution. Furthermore, the autocorrelation function verified the independence of stego noise elements. For the steganalysis evaluation, three advanced steganalysis networks were used to detect hidden information in the generated steganographic videos. Detection rates remained close to 50%, indicating the effective detection resistance of the proposed method.
This resistance highlights the strong security of the approach, which avoids altering existing videos and instead operates within a generative framework to produce natural and undetectable outputs.
Conclusion: This paper presents a novel generative video steganography scheme that uses diffusion models to facilitate video generation from secret information, addressing the limitations of traditional approaches. By operating within a stable latent space, the proposed method ensures higher robustness against common video processing distortions, including compression, bit errors, and frame rate conversion. Experimental results demonstrate that the proposed method achieves substantially higher embedding capacities compared to state-of-the-art baseline methods, while maintaining over 90% secret extraction accuracy under varying levels of video distortion. Additionally, the proposed approach exhibits high visual quality and strong resistance to steganalysis. These results highlight the practicality and security of the proposed scheme, making it a promising solution for secure communication in real-world scenarios. This research not only introduces a new paradigm for video steganography but also establishes a foundation for further research in generative steganographic techniques.
Keywords: steganography; diffusion model; text-to-video (T2V) generative model; robust video steganography; generative video steganography
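A simplified, self-contained Python sketch of the bijective mapping idea described in the Method section above: secret bits are written as signed coefficients, an orthonormal IDCT-2D produces a Gaussian-like latent, and the orthonormal DCT-2D recovers the bits. The coefficient layout and amplitudes are illustrative assumptions, and the sketch omits the T2V denoising/inversion steps and the repeated-embedding majority voting.

```python
import numpy as np
from scipy.fft import dctn, idctn

def bits_to_noise(bits, shape=(64, 64), amp=1.0, seed=0):
    """Illustrative forward mapping: place +/-amp coefficients (one per secret bit)
    in a coefficient grid, randomize the unused coefficients, and apply an
    orthonormal IDCT-2D so the spatial-domain result is Gaussian-like."""
    rng = np.random.default_rng(seed)
    coeffs = rng.standard_normal(shape) * amp           # filler coefficients
    flat = coeffs.reshape(-1)
    flat[: len(bits)] = np.where(np.asarray(bits) > 0, amp, -amp)
    return idctn(flat.reshape(shape), norm="ortho")     # Gaussian-like latent noise

def noise_to_bits(noise, n_bits):
    """Inverse mapping: an orthonormal DCT-2D recovers the coefficient grid, and
    the signs of the first n_bits coefficients give back the secret."""
    coeffs = dctn(noise, norm="ortho").reshape(-1)
    return (coeffs[:n_bits] > 0).astype(int)

secret = np.random.default_rng(1).integers(0, 2, size=96)   # e.g., 96 bits per frame
latent = bits_to_noise(secret)
recovered = noise_to_bits(latent, len(secret))
print("exact recovery without distortion:", np.array_equal(secret, recovered))
```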
Abstract: Remote sensing object detection (RSOD) has emerged as a critical research direction, attracting considerable attention due to its complexity and fundamental importance. The primary objective of RSOD is to accurately classify and regress the position of the objects of interest in remote sensing imagery (RSI). According to the image acquisition platform, RSI can be classified into three classes: satellite-, aerial-, and drone-based. The sensors mounted on these platforms capture ground information from different altitudes and views, which fulfills diverse observational needs and applications ranging from urban planning to environmental monitoring. With the advent of deep learning, especially convolutional neural networks (CNNs) and Transformers, detection accuracy has improved substantially owing to their powerful feature representation capability and better adaptivity. However, small object detection (SOD), as a subfield of object detection, has seen only limited performance improvement in recent years. This study focuses on small object detection in remote sensing, for which we summarize seven major challenges. 1) Less available features. Small objects typically occupy only a few pixels in RSI because of their absolutely small sizes, which leads to limited appearance information and detailed spatial features. In addition, with multiple convolution and pooling operations in deep CNNs, information loss becomes severe as the feature layers deepen. In general, the detailed feature information of small objects resides in shallow layers. To address this issue, super-resolution and multi-scale feature fusion are usually employed. Super-resolution methods enhance the details of small objects by reconstructing high-frequency information and refining spatial resolution through advanced upsampling techniques and feature restoration algorithms. This approach rectifies structural distortions and improves the clarity of features. In addition, the incorporation of shallow CNN layers can retain rich spatial details, which facilitates precise localization. The fusion of shallow and deep features enables networks to capture fine-grained details of small objects and global semantic information, which ultimately improves detection accuracy and robustness. 2) Misaligned evaluation metrics. The evaluation metric of object detection often relies on intersection over union (IoU), but it can be problematic for small objects, which are highly sensitive to even minor localization errors. Such errors can lead to a significant drop in IoU and an increase in false negatives. To address this issue, several alternative evaluation metrics, such as mean average precision (mAP) at lower IoU thresholds, have been proposed to capture the performance of models on small objects effectively. Meanwhile, the assignment of anchor boxes further complicates small object detection, as fewer anchors match small objects, which makes precise localization more difficult. Therefore, improving loss functions and evaluation metrics specifically tailored for small object detection is essential. Moreover, the sensitivity of IoU during non-maximum suppression can result in the erroneous suppression of small objects, which negatively impacts overall detection performance. Developing robust evaluation strategies that account for the unique characteristics of small objects is critical for enhancing detection accuracy. 3) Large image coverage. Remote sensing images typically have high resolution and cover wide areas.
As a result, small objects appear relatively smaller, which makes them harder to detect. Directly processing entire images leads to high computational costs and can result in missed detections. Many existing detection frameworks are optimized for medium-sized images and objects. Thus, novel methods that effectively enhance object detection within large-area RSI are necessary. Strategies such as image segmentation or region proposal networks can help mitigate these challenges by allowing models to focus on specific areas of interest rather than processing entire images indiscriminately. 4) Complex background interference. Remote sensing images encompass a diverse array of terrains, man-made structures, and natural elements. This condition complicates the separation of small objects from their complex backgrounds. Environmental factors, including atmospheric conditions, terrain variations, lighting, and shadows, can further increase false positive and false negative rates. To tackle these challenges, strategies such as context-aware learning and label noise suppression have been developed. Contextual learning utilizes spatial and semantic relationships to enhance object detection accuracy. Meanwhile, label noise suppression techniques help mitigate the impact of labeling uncertainty on model performance, which improves robustness. 5) Non-uniform data distribution. Objects in remote sensing images are often sparsely distributed, with certain regions densely populated while others are not. This non-uniform distribution necessitates a focus on clustered areas to enhance detection efficiency. However, traditional uniform or random crop processing methods may prove inefficient for detecting dense objects, which leads to wasted computational resources. Developing mechanisms that localize and prioritize these dense areas represents a critical challenge in improving detection performance. 6) Uncertain orientation information. Owing to the bird’s-eye perspective of RSI, objects appear in arbitrary orientations. Standard horizontal bounding boxes often fail to accurately describe these rotated objects. By contrast, oriented bounding boxes facilitate highly precise localization by incorporating rotation angles. However, several issues remain unexplored, including label mismatching, positive-negative sample imbalance, and limited feature supervision for geometrically complex small objects. These challenges can lead to suboptimal learning of key object features, which requires further research and innovation in bounding box representation. 7) Scarcity of specialized datasets. Existing remote sensing datasets predominantly focus on medium to large objects, with relatively few designed specifically for small object detection. Although some datasets such as AI-TOD, TinyPerson, and SODA exist, they are often limited in terms of the number of object categories and instances available. Furthermore, class frequency imbalances exacerbate the issues related to model performance in specific applications. The lack of comprehensive small object datasets hinders the development of robust models tailored to this task, which necessitates the construction of new datasets that address these limitations. This study presents a comprehensive review of deep learning-based techniques for small object detection in RSI.
We analyze the primary challenges associated with this task and highlight recent advancements in key areas such as data augmentation, super-resolution techniques, multi-scale feature fusion, anchor box mechanisms, contextual information integration, and label assignment strategies. Furthermore, we provide an overview of commonly used benchmark datasets, evaluation metrics, and various applications related to small object detection. Despite the significant progress that has been made in the field of RSOD, numerous challenges remain, particularly concerning small object detection. Future research directions should focus on developing innovative solutions to address these challenges, providing insights that will advance the field and enhance the performance of small object detection in remote sensing applications. By addressing these critical areas, researchers can contribute to more effective and accurate remote sensing methodologies, ultimately enhancing our understanding and monitoring of the world.
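The worked example below, using a plain IoU computation, quantifies the metric sensitivity described under challenge 2): the same 4-pixel localization shift leaves a large object comfortably matched but pushes a small object below the common 0.5 threshold. The box sizes are illustrative.

```python
def iou(box_a, box_b):
    """Axis-aligned IoU for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A 4-pixel shift barely affects a 100x100 object ...
print(iou((0, 0, 100, 100), (4, 0, 104, 100)))   # ~0.92
# ... but drops an 8x8 object below the common 0.5 matching threshold.
print(iou((0, 0, 8, 8), (4, 0, 12, 8)))          # ~0.33
```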
Abstract: The rapid development of online learning and communication has gradually increased the application of handwritten mathematical expressions in fields such as education and technology. The capability to accurately recognize handwritten mathematical expressions and convert them into structured formats such as MathML or LaTeX has become a critical issue. However, owing to diverse handwriting styles and the complex two-dimensional structure of mathematical expressions, this problem remains highly challenging. Current research on handwritten mathematical expression recognition (HMER) is primarily divided into two categories: traditional and deep learning-based methods. Among these approaches, deep learning-based methods have attracted considerable attention due to their capability to model the problem end-to-end using encoder-decoder architectures. These methods typically involve three key steps: feature extraction, alignment, and regression, each of which presents its own set of difficulties. Feature extraction refers to the process by which the encoder extracts high-level semantic information from the input image. One of the primary challenges in this step is learning semantic invariant features. Owing to differences in writing habits, the same symbol can be written in vastly different styles by different individuals. This stylistic variability can lead to significant challenges in recognizing symbols accurately. For instance, symbols such as “x” and “X” or “C” and “c” can appear very similar in handwritten form, which makes it difficult for the model to distinguish between them. In addition, the same symbol may vary in size depending on its position within the expression. For example, a comma or a period in a superscript or subscript position may be much smaller than the same symbol in the main body of the expression. Therefore, the vision encoder must learn to extract features that are invariant to these stylistic, size, and shape differences. This capability is crucial for ensuring that the model can accurately recognize symbols regardless of how they are written or where they appear in the expression. Alignment refers to the process by which the decoder gradually aligns the extracted visual features with the corresponding text features using attention mechanisms. A key challenge in this step is the “lack of coverage” problem. Ideally, the model should convert each text structure on the image into an appropriate text representation without repeating or omitting any parts. However, in practice, models often suffer from over-parsing (repeating parts of the expression) or under-parsing (omitting parts of the expression). This issue arises because the model lacks global alignment information, which leads to difficulty in determining which parts of the image have already been processed. To address this concern, some models introduce coverage vectors. These vectors keep track of the attention distribution over previous time steps, which ensures that the model does not repeatedly attend to the same region of the image. This mechanism helps mitigate over-parsing and under-parsing issues, which improves the overall accuracy of the model. Regression refers to the process by which the decoder generates the output sequence in an autoregressive manner. This step involves two main challenges: the output imbalance problem and the modeling of 2D structure relationships.
The output imbalance problem arises because models typically predict the output sequence from left to right (L2R), which can lead to more accurate predictions for the prefix of the sequence compared with the suffix. The reason is that the capability of the model to capture dependencies between symbols diminishes as the distance between them increases. To address this problem, some models employ bidirectional training strategies, where the model is trained to predict the sequence in both the L2R and right-to-left (R2L) directions. This approach helps balance the accuracy of the prefix and suffix predictions. Moreover, mathematical expressions often have a nested 2D structure, which can be challenging to model using traditional sequence-based approaches. Some models attempt to address this issue by incorporating tree-structured decoders, which explicitly model the hierarchical relationships between symbols. However, these methods often require highly complex training procedures and may not always outperform sequence-based approaches. Building on the above introduction of the “feature extraction”, “alignment”, and “regression” steps and their associated challenges, this study then reviews the improvements that previous works have made to address them. For example, in the feature extraction step, models such as DenseWAP and ABM have introduced multi-scale feature extraction techniques to better capture fine-grained details in the input image. In the alignment step, models including WAP and CoMER have introduced coverage vectors and attention refinement modules to address the lack of coverage problem. In the regression step, models such as BTTR and ABM have employed bidirectional training strategies to improve the modeling of long-range dependencies. After discussing these improvements, this study supplements the analysis with the experimental performance of several models on printed datasets. These experiments show that the main bottleneck of existing models is the diversity of writing strokes in handwritten expressions. While models perform well on printed datasets, where the style and size of symbols are consistent, their performance drops significantly on handwritten datasets due to the variability in writing styles. This drop in performance highlights the need for further research into improving the robustness of models to different writing styles. In addition to discussing traditional and deep learning-based methods, this study also conducts tests on HMER datasets using large multimodal models (LMMs) such as GPT-4V and Qwen-VL-Max. These models, which are trained on vast amounts of multimodal data, have shown impressive performance in various vision and language tasks. However, their performance on HMER tasks is still suboptimal compared with that of specialized models. This gap is likely due to the lack of handwritten mathematical expression data in their training sets and the difficulty of capturing the complex 2D structure of mathematical expressions. The results of these experiments suggest that, while LMMs have potential in HMER, their performance can still be improved. Finally, this study points out future directions for development in the field of HMER. One promising direction is the integration of tree-structured prediction with sequence-based approaches.
While tree-structured decoders offer a more interpretable way to model the hierarchical relationships in mathematical expressions, they often require highly complex training procedures and may not always outperform sequence-based methods. Future research could explore ways to combine the strengths of both approaches by pre-training the decoder on large-scale LaTeX datasets and fine-tuning it on handwritten expression data. Another direction is the development of more robust data augmentation techniques to improve the capability of the model to handle diverse writing styles. Furthermore, the creation of larger and more diverse datasets, including multi-line expressions and mixed handwritten and printed text, could help advance the field. While significant progress has been made in the field of HMER, many challenges still need to be addressed. By overcoming these challenges and exploring new directions, researchers can continue to improve the accuracy and robustness of HMER models, thereby facilitating their adoption in real-world applications such as education, document digitization, and scientific research.
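A minimal Python sketch of the coverage attention idea reviewed above (as used in WAP-style decoders): the accumulated attention map is fed back into the score computation so previously attended regions are down-weighted. The layer sizes and the additive scoring form are illustrative assumptions rather than any specific model's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoverageAttention(nn.Module):
    """Coverage attention for an encoder-decoder HMER model: the running sum of
    past attention maps (the coverage vector) enters the score computation so that
    already-attended image regions are discouraged, mitigating over- and
    under-parsing."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.w_cov = nn.Linear(1, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_feats, dec_state, coverage):
        # enc_feats: (B, L, enc_dim) flattened image features
        # dec_state: (B, dec_dim)    current decoder hidden state
        # coverage:  (B, L)          sum of attention weights from past steps
        scores = self.v(torch.tanh(
            self.w_enc(enc_feats)
            + self.w_dec(dec_state).unsqueeze(1)
            + self.w_cov(coverage.unsqueeze(-1))
        )).squeeze(-1)                               # (B, L)
        alpha = F.softmax(scores, dim=-1)
        context = torch.bmm(alpha.unsqueeze(1), enc_feats).squeeze(1)
        return context, alpha, coverage + alpha      # updated coverage

# Toy usage: a 14x14 feature map flattened to 196 positions.
attn = CoverageAttention(enc_dim=256, dec_dim=128, attn_dim=128)
enc, cov = torch.randn(2, 196, 256), torch.zeros(2, 196)
ctx, alpha, cov = attn(enc, torch.randn(2, 128), cov)
print(ctx.shape, alpha.shape, cov.shape)
```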
Abstract:
Objective: Image denoising represents a fundamental challenge in the field of image processing, with the primary goal of recovering clear images from their noise-degraded counterparts. Throughout the image acquisition and formation processes, multiple factors such as suboptimal lighting conditions, temperature fluctuations, and imaging system corrections can considerably contribute to the presence of noise in the final images. The impact of image noise extends beyond mere visual perception degradation; specifically, it substantially affects the accuracy of advanced image processing tasks, including image segmentation and object recognition. Traditional denoising approaches, which require manual tuning of numerous parameters, are complex and time-consuming. While convolutional neural network (CNN)-based denoising methods have demonstrated promising results, pure Transformer networks have also shown great effectiveness in image denoising. Moreover, CNN-based denoising methods are inherently constrained by their convolution kernel sizes, which limits their capability to utilize global image information. Conversely, Transformer-based methods effectively leverage global image information, but they demand rapidly increasing computational resources for enhanced detail restoration. In addition, the original Swin Transformer adapts poorly to high-resolution image inputs. In response to these challenges, we have developed a U-Net image denoising method based on Swin Transformer V2, which successfully integrates Transformer features with conventional convolutional features. Thus, it achieves remarkable denoising performance and visual quality across standard image denoising datasets.
Method: We present a novel image denoising network method based on Swin Transformer V2. The network consists of downsampling and upsampling stages. During downsampling, images undergo feature extraction in progressive feature spaces. Each encoder layer contains a different number of DB-Transformer and Transformer blocks. In each DB-Transformer block, parallel Transformer and local convolution branches independently extract Transformer and local convolution feature maps, respectively, and these features interact before being passed to the next block. During upsampling, the network reconstructs images from extracted features. The upsampling decoders contain only Transformer blocks, with each decoder preceded by a feature fusion module that receives features from downsampling and upsampling stages. The feature fusion module incorporates a global average pooling component and a multilayer perceptron, which, through a softmax function, generate dynamic weights that enable the network to adaptively select more informative features from different feature maps. Long-skip connections are employed before the final output, as noisy and clean images share considerable information, and these connections prevent gradient vanishing. To enhance the adaptability and denoising performance of Swin Transformer V2 blocks within our network, we strategically position layer normalization before self-attention computation. This placement accelerates network convergence and suits small-scale model training. Furthermore, instead of utilizing masked shifted window-based multi-head self-attention, we implement mirror padding for incomplete window sections, which enhances the contribution of edge pixels to training. This approach ensures their equal importance in image denoising tasks.
We train on a combined dataset of the Berkeley segmentation dataset 500 (BSD500), diverse 2K resolution high-quality images (DIV2K), Flickr2K, and the Waterloo exploration database (WaterlooED), with random patch selection from each image. Our experiments utilize the Charbonnier loss function and progressive training mechanism, and they are conducted on a single NVIDIA GeForce RTX 4070Ti Super GPU.
Result: To validate the effectiveness of the model, we conducted comprehensive testing using four widely recognized datasets in the image denoising domain: color Berkeley segmentation dataset (CBSD68), Kodak24, McMaster, and color Urban100. With peak signal-to-noise ratio (PSNR) as the primary evaluation metric, our denoising experiments achieved impressive average PSNR values of 28.59 dB, 29.87 dB, 30.27 dB, and 29.88 dB across these datasets at a noise level of 50. Compared with traditional algorithms, our approach demonstrates significantly enhanced denoising effects and visual perception, with PSNR metrics surpassing those of CNN-based denoising methods. Notably, while achieving comparable performance to Transformer-based methods, our denoising algorithm requires only 26.12% of the floating-point operations. We also conducted extensive ablation studies to verify the effectiveness of our proposed method. In particular, various aspects including the number of convolution blocks, feature fusion modules, and Transformer block improvements were examined. The experimental results convincingly demonstrate that our approach effectively balances training efficiency with image denoising performance.
Conclusion: We have successfully developed and implemented a U-Net deep learning network model based on Swin Transformer V2 for image denoising, which definitively establishes the viability of Swin Transformer V2 in the image denoising domain. Our network architecture effectively combines the strengths of local convolution and Transformer. Thus, it not only efficiently extracts valuable information from both types of feature maps but also achieves superior training efficiency. The experimental results comprehensively demonstrate that our proposed network architecture offers significant advantages in detail restoration and operational efficiency.
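For concreteness, the sketch below shows two generic ingredients mentioned in the Method section: the Charbonnier loss and a global-average-pooling + MLP + softmax fusion that weights encoder (skip) and decoder features. The channel counts and MLP sizes are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier loss: a smooth, robust variant of L1 commonly used for
    image restoration training."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

class AdaptiveFeatureFusion(nn.Module):
    """Illustrative fusion in the spirit of the module described above: global
    average pooling plus an MLP produce softmax weights that adaptively blend
    the encoder (skip) features and the decoder features."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),
        )

    def forward(self, enc_feat, dec_feat):           # both: (B, C, H, W)
        pooled = torch.cat([
            F.adaptive_avg_pool2d(enc_feat, 1).flatten(1),
            F.adaptive_avg_pool2d(dec_feat, 1).flatten(1),
        ], dim=1)
        w = F.softmax(self.mlp(pooled), dim=-1)      # (B, 2) dynamic weights
        w_enc, w_dec = w[:, 0], w[:, 1]
        return enc_feat * w_enc.view(-1, 1, 1, 1) + dec_feat * w_dec.view(-1, 1, 1, 1)

fuse = AdaptiveFeatureFusion(channels=96)
x = fuse(torch.randn(2, 96, 32, 32), torch.randn(2, 96, 32, 32))
print(x.shape, charbonnier_loss(x, torch.zeros_like(x)).item() > 0)
```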
Abstract:
Objective: Dynamic point clouds refer to collections of 3D data points that represent the surfaces of objects in a scene at various instances in time. These point clouds are increasingly being utilized across a wide array of applications, including virtual reality, augmented reality, autonomous vehicles, and robotics. Each of these applications demands high-fidelity representations of dynamic environments, which makes the efficient handling of dynamic point clouds a critical area of research. However, the compression of dynamic point clouds presents great challenges due to the complex spatial and temporal correlations inherent in these 3D structures. As the demand for real-time and high-quality 3D data increases, addressing these challenges becomes essential for the advancement of technologies in various fields.
Method: In recognition of the intricate challenges associated with dynamic point cloud compression, this study proposes an innovative method that effectively combines two distinct yet complementary techniques: a convolutional feature extraction algorithm and a target neighborhood feature extraction algorithm. The first component of the proposed method, that is, the convolutional feature extraction algorithm, plays a crucial role in distilling local spatiotemporal motion information from aligned point clouds. By applying convolutional operations, this method can extract hierarchical feature representations that encapsulate essential spatial patterns and motion dynamics present in the point cloud data. This hierarchical representation is vital for understanding how objects within the scene move and interact over time. The second key aspect of the proposed compression method involves the use of an adaptive k-nearest neighbor (kNN) algorithm. Given the varying motion amplitudes of dynamic point clouds, capturing long-distance spatiotemporal information effectively is important. The adaptive kNN algorithm helps achieve this capability by identifying relevant neighboring points across frames. By utilizing the kNN algorithm, the proposed method can identify and focus on critical data points that provide context and continuity over time. Therefore, even points that are not immediately adjacent in space but are relevant to the current data frame can still be considered, which maintains a broader view of the spatial dynamics at play. One of the fundamental innovations of this method is its capability to predict the occupancy probability of each voxel block in the current frame based on the extracted dynamic feature information. This prediction mechanism enables the compression algorithm to accurately determine which areas of the voxel grid are likely to contain meaningful data points. By anticipating these areas, the method can be more selective regarding the data it retains, which allows for efficient compression without sacrificing important details. This predictive capability is especially useful in scenarios where computational resources are limited, as it enables effective data management.
Result: The effectiveness of the proposed method is evaluated against established standards and cutting-edge techniques in the field. The methodology adheres to general testing conditions recommended by the Moving Picture Experts Group (MPEG) for training and evaluation. In comparative analyses, the proposed method demonstrates impressive performance against existing dynamic point cloud compression standards such as GeS (inter) and MPEG’s video-based point cloud compression method (V-PCC).
Specifically, the proposed method achieves average Bjontegaard Delta (BD) rate gains of 94.14% (96.67%) and 82.93% (72.57%) in terms of D1 (D2) peak signal-to-noise ratio (PSNR), respectively. When compared with the two most advanced learning-based dynamic point cloud compression methods currently available, the results remain equally compelling. The proposed method achieves average BD rate gains of 22.53% (26.73%) and 14.21% (40.80%) on PSNR D1 (D2). These metrics further illustrate the superiority of the proposed approach, which showcases its capability to achieve high-quality reconstructions of dynamic point clouds while significantly optimizing data size. The implications of this research extend beyond mere numerical improvements. The proposed compression method represents a meaningful advancement in the field of dynamic point cloud data handling, which provides critical support for applications that rely on the efficient transmission and storage of large volumes of dynamic data.
Conclusion: The dynamic point cloud compression method proposed in this study effectively addresses the intricate challenges posed by dynamic data in 3D spaces through the innovative integration of convolutional feature extraction and adaptive kNN techniques. By accurately predicting voxel occupancy and leveraging the strengths of each component algorithm, the proposed method demonstrates significant improvements in compression efficiency while maintaining high-quality point cloud reconstructions. Beyond these numerical improvements, this research marks a meaningful advancement in dynamic point cloud data handling and provides critical support for applications that rely on the efficient transmission and storage of large volumes of dynamic data. Future research directions could explore additional enhancements, such as integrating diverse machine learning strategies, improving real-time processing capabilities, and validating the approach with a wider range of dynamic datasets. Ultimately, such improvements can drive further innovation in dynamic point cloud technologies. The continuous evolution of this field has strong potential for unlocking new possibilities in immersive experiences, autonomous systems, and various other domains that depend on precise and efficient 3D data management.
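The sketch below illustrates, under our own simplifying assumptions, the cross-frame kNN gathering and occupancy prediction described in the Method section: neighbors of each current-frame point are found in the previous frame with torch.cdist/topk, and a small head maps the aggregated neighbor features to an occupancy probability. It is a conceptual stand-in, not the proposed network.

```python
import torch
import torch.nn as nn

def cross_frame_knn(curr_pts, prev_pts, prev_feats, k=8):
    """For every point of the current frame, find its k nearest neighbors in the
    previous (reference) frame and gather their features, providing long-distance
    spatiotemporal context.
    curr_pts: (B, N, 3), prev_pts: (B, M, 3), prev_feats: (B, M, C)."""
    dists = torch.cdist(curr_pts, prev_pts)                  # (B, N, M)
    knn_dist, knn_idx = dists.topk(k, dim=-1, largest=False)
    idx = knn_idx.unsqueeze(-1).expand(-1, -1, -1, prev_feats.size(-1))
    neigh = torch.gather(
        prev_feats.unsqueeze(1).expand(-1, curr_pts.size(1), -1, -1), 2, idx
    )                                                        # (B, N, k, C)
    return neigh, knn_dist

class OccupancyHead(nn.Module):
    """Toy head predicting per-point (or per-voxel-block) occupancy probability
    from aggregated neighbor features."""
    def __init__(self, channels):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, neigh_feats):                # (B, N, k, C)
        pooled = neigh_feats.max(dim=2).values
        return torch.sigmoid(self.mlp(pooled)).squeeze(-1)   # (B, N)

curr, prev = torch.rand(1, 1024, 3), torch.rand(1, 1024, 3)
feats = torch.randn(1, 1024, 32)
neigh, _ = cross_frame_knn(curr, prev, feats, k=8)
print(OccupancyHead(32)(neigh).shape)              # torch.Size([1, 1024])
```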
Abstract:
Objective: Drones equipped with 4K cameras have emerged as a transformative technology across diverse domains, ranging from traffic surveillance and criminal investigation to military reconnaissance and disaster response. The capability to capture aerial images over vast areas enables unprecedented visual analysis capabilities. In drone-based applications, object detection serves as a fundamental technology for scene perception, which facilitates precise object recognition and tracking to enhance operational efficiency and safety in complex scenes. Despite the success of deep learning-based object detection models on natural scenes, drone-based detection presents unique challenges. A notable challenge is the motion blur introduced by the high-speed flight of the drone, which degrades fine details of small objects in the captured images. High-speed motion often leads to blurred images that are difficult for traditional models to process, especially when detecting small or distant objects that require high spatial resolution. Although recent advances in diffusion models have shown promise in addressing super-resolution tasks (i.e., providing finer details in blurry images), their use in drone images remains limited due to the sparse distribution of objects in the scenes. Full-image super-resolution is computationally expensive and inefficient when the objects of interest are few and scattered across large areas. Another major challenge arises from the high-altitude perspective of drone images, where ground objects appear smaller and are easily overwhelmed by complex backgrounds. This inherent scale reduction makes small objects particularly difficult to detect and classify, as their features become less distinctive against varying ground patterns and environmental conditions. The accurate detection of such small objects demands fine-grained local spatial information and global contextual features. Local features are essential for capturing distinctive small object characteristics, while global features provide crucial contextual cues for distinguishing objects from similar background patterns. Although hybrid architectures that combine convolutional neural networks and Transformers attempt to leverage their complementary advantages in local feature extraction and global dependency modeling, these approaches still present inherent limitations. The sequence-based processing of the Transformer, even with positional encoding, fundamentally constrains local spatial perception due to the conversion of 2D spatial structures into 1D sequences. Furthermore, the independent stacking of convolution and self-attention layers, rather than facilitating feature integration, tends to intensify the disconnect between local and global representations. This architectural limitation results in feature separation rather than meaningful fusion, which particularly impacts the capability of the model to detect and classify small objects in complex aerial scenes.
Method: This study proposes a novel small object detection model based on foreground refinement and multidimensional inductive bias self-attention to address the aforementioned challenges. We introduce a foreground refinement module (FRM) that employs a class-agnostic multi-layer aggregated activation map to identify regions of interest, thereby overcoming the computational inefficiency of processing high-resolution drone images. This module enables targeted refinement that preserves computational resources and enhances critical small object details.
At the core of our method, we propose a multidimensional inductive bias self-attention network (MIBSN), which consists of multidimensional self-attention, hybrid enhanced feed-forward, scale coupling module, and progressive interaction module. The multidimensional self-attention module decomposes attention operations across horizontal and vertical dimensions, and it implements adaptive window-based attention to significantly enhance spatial position awareness. Within the multidimensional self-attention, we innovatively incorporate an inductive bias perception path that operates in parallel with the attention mechanism, which is specifically designed to capture fine-grained local features. The parallel computation generates global contextual information and local spatial details, which are jointly processed by a hybrid enhanced feed-forward module utilizing dynamic convolution kernels. This design ensures the Transformer inherits strong inductive bias properties while maintaining comprehensive feature representation capabilities in complex backgrounds. To effectively handle the multi-scale nature of objects in drone images, the scale coupling module employs adaptive kernel configurations while dynamically interacting with multidimensional self-attention. This way maximizes the preservation of local and global features across different scales. Finally, the progressive interaction module progressively aggregates neighborhood features in the network decoder, which enables sufficient restoration of small object details in the final prediction maps. By integrating FRM and MIBSN, our method provides a comprehensive solution for small object detection. The FRM enhances fine-grained details through targeted refinement of regions of interest, while MIBSN ensures robust feature extraction through its multi-component architecture. Specifically, the combination of local spatial information and global context in MIBSN, along with the enhanced detail preservation from FRM, enables accurate detection and classification of small objects even in complex drone images.ResultTo validate the effectiveness of our proposed method, we conduct extensive experiments on three challenging datasets: vision meets drones 2019 (VisDrone2019), unmanned aerial vehicle detection and tracking (UAVDT), and Northwestern Polytechnical University Very High Resolution 10-Class Dataset (NWPU VHR-10). Our method demonstrates remarkable performance on VisDrone2019, which is a large-scale drone-captured dataset. Specifically, it achieves 53.2% average precision at the IoU threshold of 0.5 (AP@50) and 32.8% mean average precision across all IoU thresholds from 0.5 to 0.95 (mAP), which substantially surpass those of existing state-of-the-art object detection methods. Notably, compared with the current leading Transformer-based detector DINO (detection Transformer with improved denoising anchor boxes), our approach achieves a significant improvement of 7.9%. On UAVDT benchmark, our approach achieves 38.7% AP@50 and 23.4% mAP. This substantial gain can be attributed to our innovative multidimensional inductive bias self-attention network, which effectively enhances local feature learning and global dependency modeling. To further demonstrate the generalization capability of our method in handling small objects from aerial perspectives, we evaluate our approach on the NWPU VHR-10 remote sensing dataset. 
Despite the different imaging characteristics between drone and satellite imagery, both scenarios share common challenges in small object detection, including scale variation and complex backgrounds. Our method achieves a state-of-the-art mAP of 64.5% and AP@50 of 93.9% on this dataset, which validates its robust feature extraction capabilities across different aerial imaging domains. We conduct comprehensive ablation studies on FRM and MIBSN to systematically analyze the contribution of each proposed component to the overall performance improvement for providing deeper insights into the effectiveness of our method.ConclusionIn this study, we propose a novel small object detection method for drone images. This method integrates a foreground refinement module for targeted refinement processing and a multidimensional inductive bias self-attention network for enhanced feature learning. Our approach effectively addresses the challenges of small object detection. Comprehensive evaluations on VisDrone2019, UAVDT, and NWPU VHR-10 datasets demonstrate the superior performance of our method, which greatly advances the detection accuracy of small objects in drone images. The code is available at https://github.com/CUMT-GMSC/MIBSN.
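The abstract above describes decomposing self-attention across horizontal and vertical dimensions to strengthen spatial position awareness. The snippet below is a generic PyTorch sketch of such an axial (row-wise then column-wise) self-attention block, not the released MIBSN code; the class name, head count, and tensor sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AxialSelfAttention(nn.Module):
    """Row-wise followed by column-wise multi-head self-attention.

    A generic stand-in for attention decomposed along the horizontal and
    vertical dimensions; it is not the MIBSN implementation.
    """
    def __init__(self, dim, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Horizontal pass: each of the B*H rows attends over its W positions.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)
        # Vertical pass: each of the B*W columns attends over its H positions.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, c).permute(0, 3, 2, 1)

feat = torch.randn(2, 64, 32, 32)
print(AxialSelfAttention(64)(feat).shape)      # torch.Size([2, 64, 32, 32])
```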
摘要:ObjectiveThe acceleration of global urbanization and the rapid development of intelligent transportation has highlighted the importance of road construction and maintenance in urban management. With the rapid growth of urban population and the increase in the number of motor vehicles, road facilities are facing great pressure. Road surface defects, such as cracks and potholes, are prone to occur under the action of multiple factors such as frequent vehicle crushing and climate change. These defects not only affect driving comfort but also potentially pose serious traffic safety risks. Pavement defect detection is an important part of road maintenance and management and is related to traffic safety and road service life. Traditional pavement defect detection methods have disadvantages such as low efficiency, poor security, strong subjectivity, and insufficient accuracy. The development of computer technology and artificial intelligence has encouraged researchers to apply machine vision and deep learning technology to pavement defect detection, which greatly improved the performance and efficiency of pavement defect detection. However, due to the large amount of parameters and calculation, the traditional deep learning model has difficulty meeting the real-time detection requirements on embedded devices and edge devices with limited hardware resources. This complexity limits the promotion and application of deep learning technology in the field of road defect detection. This study aims to reduce the computational resource requirements and optimize the target detection performance of the pavement defect detection algorithm. Based on the YOLOv8 (you only look once version 8) object detection framework, along with partial convolution and inception depthwise convolution mechanism, a lightweight pavement defect detection algorithm called YOLOv8n-PIVI is proposed.MethodFirst, the PartialBlock module and IDBlock (inception depthwise block) module are introduced into the backbone network to reduce the number of parameters of the backbone network and improve the processing capability of the network for features of different scales. While introducing the lightweight advantage of partial convolution, the PartialBlock module also supplements the 1 × 1 convolution to extract the overall information and arrange features. This convolution not only makes up for the lack of feature analysis capability of partial convolution to a certain extent but also helps partial convolution learn a more effective spatial feature extraction method. The IDBlock module inherits the advantages of partial lightweight and deeper feature analysis of inception depthwise convolution, and it performs a more comprehensive analysis of features at a smaller cost through the traditional 1 × 1 convolution. In addition, PartialBlock module and IDBlock module contain a MaxPool module in the tail to expand the receptive field of network features and reduce the number of network parameters. The organic combination of the PartialBlock module and the IDBlock module helps reduce the number of backbone network parameters and optimize the feature extraction capability of the backbone network. Second, inspired by the excellent feature parsing capability and the applicability in resource-constrained scenarios of VanillaNet, we introduce VanillaBlock, which is the core module of VanillaNet, into the feature fusion network of YOLOv8n. 
This incorporation aims to optimize the feature fusion capability of the model and reduce the computational burden of the model. Finally, the inception depthwise convolution module with low overhead and multi-scale feature parsing capability is introduced to replace the two series 3 × 3 convolutions in the original detection head by combining the 1 × 1 convolution with low overhead. This way reduces the computational cost of the detection head and improves its parsing capability of different scale features. A more lightweight and efficient ID-Detect (inception depthwise detect) is proposed to make the detection head lighter and further improve the performance of the detection head.ResultThe comparison and ablation experiments are conducted on the Pothole Dataset, which consists of 1 784 still images taken by vehicle-mounted cameras and other mobile devices. The dataset is divided into a training set (1 265 images), a validation set (401 images), and a test set (118 images) with an 11∶3∶1 ratio. To verify the effectiveness of the improved model in optimizing the pavement defect detection effect, the improved model is compared with some classical algorithms in the field of object detection, including Centernet, RetinaNet, Faster R-CNN (faster region-based convolutional neural networks), SSD (single shot multibox detector), Efficientdet, and YOLOv8. Experimental results show that the Precision and Recall of the optimization algorithm proposed in this study are improved to a certain extent compared with the baseline, and the mAP50 (mean average precision at 50% intersection over union) reaches 0.55. It is 3.5 percentage points higher than 0.515 of the baseline algorithm. The FPS (frames per second) reaches 243, which is 43 higher than the baseline algorithm. The parameter number of the model is reduced from 3.01 × 106 to 2.01 × 106, which is only 67% of the baseline. The calculation amount is reduced from 8.1 GFLOPs (giga floating point of operations per second) to 5.9 GFLOPs, which is only 72% of the baseline. The model parameter file is reduced from 6.3 MB to 4.3 MB, and the memory occupation is decreased by nearly one-third compared with the original model. Compared with the classical algorithms in the field of object detection, the proposed algorithm has certain advantages in detection accuracy, computational complexity, and FPS compared with other algorithms in the model of the same parameter magnitude. In the ablation experiment, the PartialBlock module, IDBlock module, VanillaBlock module, and ID-Detect module introduced in this study are individually added to the baseline model and removed from the final model. The effectiveness of the modules introduced in this study in improving the performance of the baseline algorithm for pavement defect detection is verified. In the subsequent comparative analysis, the improvement of the optimization algorithm proposed in this study compared with the baseline algorithm in pavement defect detection performance is shown in a more detailed and intuitive way through the comparison of experimental result index parameters, the comparison of training process index changes, and the comparison of pavement pothole detection effects. Subsequently, we also design experiments to roughly simulate the performance of the pavement defect detection algorithm proposed in this study in the actual situation, and the shortcomings of the algorithm are analyzed. 
In addition, we compare the proposed algorithm and the baseline algorithm on the RDD2022 (road damage detector 2022–China-MotorBike) dataset and VOC2012 (visual object classes 2012) dataset to analyze the robustness of the algorithm. Experimental results show that the proposed algorithm can adapt to the multi-type road defect detection tasks in China on the RDD2022–China-MotorBike dataset and the multi-class daily object detection tasks on the VOC2012 dataset. Meanwhile, it shows better object detection performance than the baseline model. Therefore, the proposed algorithm can adapt to different object detection tasks, has good adaptability and robustness, and is expected to be applicable to highly complex and diverse road defect detection scenarios.ConclusionThe comprehensive experimental results show that the optimized algorithm YOLOv8n-PIVI proposed in this study can achieve high object detection performance with lightweight computing resource requirements. The model parameter file size is as low as 4.3 MB. The algorithm also has good adaptability and robustness, which help reduce the application limitation of road defect detection algorithms and broaden the applicable scene of pavement defect detection systems.
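As a rough illustration of the partial-convolution idea behind the PartialBlock described above (a 3 × 3 convolution applied to only a slice of the channels, followed by a cheap 1 × 1 convolution that mixes all channels), here is a minimal PyTorch sketch. The module name, channel ratio, and normalization choices are assumptions rather than the YOLOv8n-PIVI implementation.

```python
import torch
import torch.nn as nn

class PartialConvBlock(nn.Module):
    """Partial 3x3 convolution on a channel slice plus a 1x1 mixing conv.

    Only the first `dim // ratio` channels go through the 3x3 convolution,
    the rest are passed through untouched; a cheap 1x1 convolution then
    mixes information across all channels. A sketch of the PartialBlock
    idea, not the published implementation.
    """
    def __init__(self, dim, ratio=4):
        super().__init__()
        self.part = dim // ratio
        self.conv3 = nn.Conv2d(self.part, self.part, 3, padding=1)
        self.mix = nn.Sequential(
            nn.Conv2d(dim, dim, 1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x1, x2 = torch.split(x, [self.part, x.size(1) - self.part], dim=1)
        x1 = self.conv3(x1)                           # spatial mixing on a slice
        return self.mix(torch.cat([x1, x2], dim=1))   # cheap channel mixing

x = torch.randn(1, 64, 80, 80)
print(PartialConvBlock(64)(x).shape)                  # torch.Size([1, 64, 80, 80])
```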
摘要:ObjectivePhotoacoustic tomography (PAT) is an emerging imaging technique that combines optical contrast with acoustic resolution, and it demonstrates great potential in the field of biomedical applications in recent years. Its non-invasive nature, high contrast, and high resolution make it an important tool in medical diagnosis and research. However, due to limitations in the imaging principles and the complexity of biological tissues, photoacoustic images often exhibit blurring, artifacts, and low contrast, which restrict their accuracy and reliability in clinical applications. Improving image quality is crucial for the further advancement of PAT technology. Current research generally treats image reconstruction and segmentation as two separate steps. First, the inverse problem is solved to reconstruct a 2D or 3D image from the photoacoustic signals collected by the ultrasonic detectors, which realizes the “signal-to-image” conversion. Then, regions of interest (e.g., specific tissues or structures) are segmented or identified from the reconstructed images, which realizes the “image-to-feature” conversion. This decoupled processing framework shows substantial limitations in practical applications. In the image reconstruction phase, errors inevitably arise due to inaccuracy of the physical model, unreasonable assumptions (e.g., the acoustically homogeneous medium and constant speed of sound), noise interference, and incomplete data acquisition. These errors accumulate in subsequent segmentation and target recognition tasks, which greatly reduces the accuracy of automated analysis. This inadequacy, in turn, limits the potential of PAT in fields such as precision medicine and real-time diagnosis. To address this issue, this study proposes an end-to-end deep learning framework that innovatively integrates image reconstruction and segmentation tasks into a unified network. Through the interaction and collaborative optimization between the two tasks, the framework fully leverages their potential correlations to considerably improve the overall quality of PAT images and achieve higher accuracy in image segmentation.MethodThis study presents an end-to-end joint reconstruction and segmentation deep learning framework, which aims to enhance the quality of PAT image reconstruction and segmentation accuracy through iterative optimization. The framework consists of two core modules: the reconstruction and segmentation modules. The reconstruction module is used to generate an initial image using deep neural networks while minimizing reconstruction errors caused by factors such as incomplete physical models or noises. The segmentation module performs further image processing based on the reconstruction module, with its task being the precise segmentation of the regions of interest. To capture the details of the target regions in the image, the segmentation module adopts a context-aware mechanism, which combines global and local features of the image. Specifically, the segmentation module employs a hybrid architecture that combines fully convolutional networks (FCNs) and DenseNet. The encoder-decoder structure of the FCN enables pixel-level classification and extracts low-level and high-level features from the image, while DenseNet enhances the utilization of information from previous layers through dense connections, which improves feature representation capability. 
The combination of FCN and DenseNet allows the segmentation module to better process complex image information and achieve higher precision semantic segmentation. The key aspect of the framework is the establishment of a bidirectional feedback mechanism between the reconstruction and segmentation modules, which creates a deep collaboration loop. In this mechanism, the output of the segmentation module serves as a constraint that guides the reconstruction module to refine the restoration of the target region, while the image provided by the reconstruction module offers important prior information to the segmentation module. Ultimately, the segmentation of the target regions is further optimized. Through this bidirectional feedback, the two modules promote each other during the iterative process, which continuously improves the accuracy of reconstruction and segmentation. In each iteration, the reconstruction module corrects reconstruction errors with the feedback from the segmentation module, while the segmentation module extracts more accurate features from the progressively optimized reconstructed image, which improves segmentation precision. Unlike traditional methods, this framework avoids the issue of accumulation of errors due to independent tasks through deep collaboration. In conventional approaches, image reconstruction and segmentation are typically processed separately, which leads to a lack of information sharing between the two tasks and ultimately affects the overall accuracy of the results. On the contrary, the proposed framework optimizes both modules jointly. This way ensures that the output of each module provides higher-quality input for the next, which ensures synchronized improvement of reconstruction and segmentation tasks. Ultimately, through multiple iterations, this deep integration of information not only greatly enhances image clarity but also substantially improves segmentation accuracy.ResultTo validate the effectiveness of the proposed method, several simulation experiments, phantom studies, and in vivo experiments were conducted. Experimental results demonstrate remarkable advantages over the two-step scheme in image reconstruction and segmentation tasks. In simulation experiments, the proposed method improved the structural similarity index (SSIM) of the reconstructed images by up to 10.01% and the peak signal-to-noise ratio (PSNR) by up to 12.15%. In segmentation tasks, the DICE coefficient and the Jaccard index increased by up to 13.27% and 6.08%, respectively, while the average symmetric surface distance (ASSD) was reduced by up to 16.55%. In phantom experiments, the method exhibited robust performance even in complex scenarios (e.g., heterogeneous media and variable speed of sound). Compared with other joint reconstruction and segmentation methods, the SSIM and PSNR of the reconstructed images can be enhanced by approximately 3.65% and 4.96%, respectively; the DICE, Jaccard index, and ASSD for image segmentation can be raised by nearly 2.73%, 3.85%, and reduced by nearly 5.99%, respectively. Further in vivo experiments validated the applicability of this method in real-world biological imaging. In tumor detection and vascular imaging tasks, the images generated by the proposed method exhibited sharp boundaries, accurate shapes, and significantly improved segmentation integrity of the target regions, which exceed those of traditional methods. 
Therefore, the proposed method has strong robustness and generalizability.ConclusionThis study proposes an end-to-end deep learning framework that jointly performs image reconstruction and segmentation tasks, which effectively solves the critical issue of error accumulation in traditional two-step methods. Compared with existing approaches, the proposed method better restores the shape and boundaries of the target region, reduces information loss and blurring, and demonstrates great performance improvements in simulation, phantom, and in vivo experiments. The study shows that the proposed method has significant theoretical importance and demonstrates great potential in practical applications. Future research will further explore the application potential of this framework in real-time processing and large-scale datasets to promote the clinical translation of PAT into medical diagnosis and treatment.
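The framework above jointly optimizes reconstruction and segmentation. A minimal sketch of such a joint objective, assuming a pixel-wise mean-squared-error term for reconstruction and a soft Dice term for segmentation combined with an assumed weighting factor, is given below; the function names and weights are illustrative and do not reproduce the paper's loss.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, eps=1e-6):
    """1 - Dice on sigmoid probabilities for a binary mask."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def joint_loss(recon, recon_gt, seg_logits, seg_gt, lam=0.5):
    """Weighted sum of reconstruction MSE and segmentation Dice loss."""
    return F.mse_loss(recon, recon_gt) + lam * soft_dice_loss(seg_logits, seg_gt)

recon = torch.rand(2, 1, 128, 128)
seg_logits = torch.randn(2, 1, 128, 128)
gt_img = torch.rand(2, 1, 128, 128)
gt_mask = (torch.rand(2, 1, 128, 128) > 0.5).float()
print(joint_loss(recon, gt_img, seg_logits, gt_mask).item())
```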
摘要:ObjectiveSkeleton data have become a prime candidate for use with graph convolutional network (GCN) because of their lightweight nature and intrinsic topological structure. The alignment between skeleton data and GCN has attracted considerable attention in developing human action recognition techniques based on these data. These techniques leverage the strengths of GCN to interpret the skeletal structures and movements inherent in human actions. However, traditional graph convolution methods encounter challenges in effectively modeling long-range node relationships, which are crucial for accurately recognizing complex actions. This limitation arises from the inherent design of conventional graph convolutions, which typically focus on local neighborhood information and struggle with capturing dependencies between distant nodes in the graph.MethodTo overcome this challenge, we propose a novel approach called skeleton large-kernel and contextual GCN (SLK-GCN). This innovative network aims to enhance spatial features from two distinct perspectives, which improves the capability to model long-range dependencies and capture the complexity of human actions effectively. The first key component of our SLK-GCN is the skeleton-large kernel convolution (SLKC) operator. This operator is designed to expand the receptive field and enhance channel adaptability, which leads to improved spatial feature extraction. Traditional convolutional kernels have limited capability in capturing extensive spatial relationships due to their relatively small receptive fields. By contrast, SLKC employs large kernel convolution networks, which significantly broaden the receptive field. This broader receptive field allows the model to simulate long-range dependencies between nodes effectively. In this way, SLKC enhances the capacity of the model to handle the spatial complexities inherent in human action recognition tasks. The large kernel approach not only captures a wide array of spatial information but also ensures that the extracted features are comprehensive and nuanced, which contributes to enhanced overall model performance. In addition to SLKC, we introduce a lightweight global context modeling (GCM) module as the second key component of SLK-GCN. The GCM module is designed to automatically learn and adapt to the topological structure of the skeleton while integrating contextual features from a global perspective. Traditional models often fail to consider the global context; instead, they focus on local node interactions. However, capturing global relationships between nodes is essential for understanding the full scope of human actions, especially those involving complex movements that span multiple joints and limbs. The GCM module addresses this gap by capturing global relationships and contextual information across the entire skeleton. This integration of global context enhances the representational capacity and robustness of the model, which allows it to accurately interpret and classify various human actions.ResultTo validate the effectiveness of our proposed SLK-GCN, we conducted extensive experiments on several widely used datasets, including NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA. These datasets are well regarded in the field of human action recognition and provide a diverse set of scenarios and action types for comprehensive evaluation. Experimental results demonstrate that SLK-GCN exhibits significant advantages and effectiveness in human action recognition tasks. 
Specifically, the incorporation and combination of SLKC and GCM enable SLK-GCN to effectively extract and utilize spatial features when processing complex skeleton data. This enhanced feature extraction capability translates to improved accuracy and robustness in recognizing various human actions. The success of SLK-GCN can be attributed to several factors. First, the capability of the SLKC operator to simulate long-range dependencies ensures that the model captures the intricate spatial relationships between different parts of the skeleton. This capability is particularly important for recognizing actions that involve coordinated movements across multiple joints. Second, the integration of global context by the GCM module provides a holistic view of the skeleton, which enables the model to consider the broader context in which individual movements occur. This holistic perspective is crucial for accurately interpreting complex actions that cannot be understood solely through local interactions. Furthermore, the combination of SLKC and GCM in SLK-GCN represents a synergistic approach to feature enhancement. While SLKC focuses on expanding the receptive field and capturing long-range dependencies, GCM complements this feature by integrating global contextual information. Overall, these components ensure that SLK-GCN has a comprehensive understanding of local and global spatial features, which leads to superior performance in human action recognition tasks. The implications of our work extend beyond the immediate scope of skeleton-based human action recognition. The principles underlying SLK-GCN, particularly the emphasis on large receptive fields and GCM, can be applied to other domains where capturing complex spatial relationships is essential. For example, similar approaches could be adapted for use in gesture recognition, sign language interpretation, and even broader applications in computer vision and robotics where understanding spatial dependencies is critical.ConclusionThe spatial feature enhanced graph convolutional network (SLK-GCN) represents an important advancement in the field of human action recognition. By addressing the limitations of traditional graph convolutions in modeling long-range node relationships, SLK-GCN offers a robust solution for capturing the complexity of human actions. The innovative combination of SLKC and GCM enables SLK-GCN to effectively extract and utilize spatial features, which results in improved accuracy and robustness. Our extensive experimental validation on multiple datasets underscores the effectiveness of this approach, which highlights its potential for broader applications in understanding and interpreting complex spatial data.
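To make the large-kernel idea concrete, the snippet below sketches a generic depthwise large-kernel operator over skeleton features laid out as (batch, channels, frames, joints): a 5 × 5 depthwise convolution, a dilated 7 × 7 depthwise convolution, and a 1 × 1 convolution whose output gates the input. This is a common decomposition for enlarging the receptive field and only approximates the spirit of SLKC; the module name and kernel settings are assumptions.

```python
import torch
import torch.nn as nn

class LargeKernelSpatialOp(nn.Module):
    """Depthwise large-kernel operator used as an attention-like gate.

    Decomposes a large receptive field into a 5x5 depthwise conv, a dilated
    7x7 depthwise conv, and a 1x1 conv, then multiplies the result back onto
    the input. A generic sketch in the spirit of SLKC, not the paper's code.
    """
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        gate = self.pw(self.dw_dilated(self.dw(x)))
        return x * gate                         # large-kernel context as a gate

feats = torch.randn(4, 64, 32, 25)              # 32 frames, 25 joints
print(LargeKernelSpatialOp(64)(feats).shape)    # torch.Size([4, 64, 32, 25])
```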
摘要:ObjectiveDistracted driving, as a primary cause of traffic accidents, claims over 1.2 million lives globally each year. Notably, nearly 90% of incidents involve high-risk behaviors, such as mobile phone usage. While existing end-to-end recognition methods based on convolutional neural networks (CNNs) effectively capture local features (e.g., hand posture and facial expressions), their performance is constrained by the limited receptive fields of convolutional operations. This limitation hinders the modeling of global spatial relationships in driving scenarios (e.g., limb coordination and cabin environment interactions). This inadequacy results in significant accuracy degradation under complex scenarios. Although vision Transformers (ViTs) achieve global feature extraction through self-attention mechanisms, their practical deployment involves multiple challenges. For example, traditional ViT models overly rely on final-layer classification tokens for decision making, which leads to underutilization of fine-grained semantic information from shallow and middle layers. Moreover, their self-attention mechanisms, while modeling global dependencies, inadequately address localized detail features, which causes misclassification among behavior categories with similar spatial distributions (e.g., phone calls vs. smoking). In addition, the quadratic computational complexity of standard multi-head attention relative to image resolution hinders real-time performance on resource-constrained vehicular platforms. Furthermore, existing studies predominantly rely on simulated driving datasets (e.g., State Farm), which exhibit limitations such as single-view capture, highly standardized actions, and low background complexity. These datasets diverge markedly from real-world driving environments characterized by multi-view perspectives, dynamic lighting variations, and complex background interference. As a result, the generalizability of the model is restricted. To address the aforementioned issues, this study proposes a dual-stage ViT framework for distracted driving behavior recognition, which integrates global and local features.MethodThe proposed framework employs a two-stage collaborative architecture to enhance distracted driving detection through hierarchical feature integration and computational optimization. In the first stage, an improved ViT addresses the semantic loss in shallow layers by aggregating class token features from intermediate Transformer layers (9–12). This method is achieved via a weighted fusion mechanism, where learnable coefficients——optimized through end-to-end training——dynamically balance contributions from different layers. A hierarchical supervision strategy further reinforces feature preservation by applying cross-entropy loss at multiple network depths, which preserves early-stage visual cues (e.g., preliminary hand positions or facial orientations) for global context modeling. The second stage focuses on challenging cases by synergizing the broad scene understanding of ViT with the localized precision of MobileNetV3. Spatial attention masks generated from the outputs of ViT first amplify driver-centric regions (e.g., hand-screen interactions) while suppressing irrelevant elements such as window reflections and moving backgrounds through adaptive pixel-wise masking. The refined images then feed into MobileNetV3 to capture micro-scale patterns, such as finger flexion during screen operations or distinct shapes of held objects (e.g., cigarettes vs. phones). 
A cross-attention mechanism bridges the two modalities: the global context of ViT (serving as query vectors) interacts with the localized key-value pairs of MobileNetV3. This mechanism enables bidirectional feature alignment where high-level scene semantics guide the interpretation of fine details, and vice versa. For optimized computational efficiency, a dynamic attention pipeline first segments images into 16 × 16 blocks, calculates region affinity scores to construct adjacency matrices, and retains Top-k high-relevance areas (e.g., 68% of hand/face regions). The remaining zones undergo token-level attention with reduced spatial granularity, which cuts redundant calculations by 26.87% while maintaining discriminative power. An adaptive threshold η governs sample routing: strict filtering (η = 1.0) for standardized datasets such as State Farm ensures efficiency, while relaxed thresholds (η = 0.9) for real-world data redirect ambiguous cases (e.g., low-light conditions and occluded actions) to stage 2 for reevaluation. This dual-phase approach achieves 99.69% mean average precision across datasets while operating at 5.48 GFLOPs, which demonstrates robust performance in diverse scenarios from controlled lab environments to complex road conditions. By strategically combining multi-scale feature extraction with context-aware computation allocation, the method overcomes the inherent limitations of ViT in handling fine details and computational overhead. Therefore, it is a practical solution for in-vehicle safety systems.ResultExperiments were conducted on the State Farm dataset and a custom dataset of dangerous driving behaviors in passenger vehicles built by the research team. These experimental results were then compared with those of a number of different algorithms, including the CNN-based counterfactual attention learning (CAL), attentive pairwise interaction network (API-NET), Stacked-LSTM, progressive-attention convolutional neural network (PA-CNN), and MobileNetV3. In addition, the Transformer-based vision Transformer (ViT), cross-attention multi-scale vision Transformer (CrossViT), and Transformer architecture for fine-grained recognition (TransFG) were compared. Results demonstrate that the method achieves mean average precisions (mAP) of 99.69% and 96.87% on the State Farm dataset and the self-constructed dataset of hazardous driving behaviors in passenger vehicles, respectively. Moreover, the numbers of floating-point operations (FLOPs) are 5.48 G and 6.82 G, while the number of parameters is 93.31 M. Compared with the TransFG algorithm, the proposed method achieves increases of 0.98% and 1.04% in mAP and reductions of 25.54% and 17.23% in FLOPs, with an increase of only 7.08 M in the number of parameters. Furthermore, the detection efficacy of ViT can be enhanced through the incorporation of a multimodal information interaction module, a token information supplementation module, and a two-stage attention module. The addition of the multimodal information interaction module results in improvements of 1.24% and 0.97% in the mAP of ViT. The token information supplementation module leads to improvements of 0.15% and 0.64% in the mAP of ViT, while the two-stage attention module yields improvements of 0.16% and 0.19% in the mAP. Although the accuracy gains from the two-stage attention module are modest, it reduces FLOPs considerably. The test results demonstrate that the proposed method outperforms the state-of-the-art TransFG algorithm in terms of mAP, FLOPs, and the number of parameters.
The visualization analysis demonstrates that the model can focus on key areas such as hands and faces while significantly suppressing window reflections and dynamic background interference.ConclusionThis study takes dangerous driving behavior detection as its starting point and introduces the ViT and MobileNetV3 models. Moreover, a two-stage ViT-based model for dangerous driving behavior recognition, called TS-ViT, is proposed to address the large number of ViT parameters and the high computational cost. In the first stage of inference, the input images are fed into the ViT model, and images with obvious features are recognized directly. In the second stage, the global features of ViT and the local features of MobileNetV3 are fused through the cross-attention and self-attention mechanisms. This fusion re-recognizes the images with complex backgrounds and poor lighting that remain from the first stage. The process also filters out background information and retains foreground information, which helps distinguish the complex features of the images. The method proposed in this study can accurately identify dangerous driving behaviors in complex scenarios and demonstrates superior robustness. It provides a new perspective for research on classification tasks.
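The cross-attention fusion of global ViT tokens with local CNN features described above can be sketched as follows: ViT tokens serve as queries and flattened MobileNetV3-style feature-map positions serve as keys and values, with a 1 × 1 convolution aligning channel widths. The dimensions, projection layer, and residual fusion are illustrative assumptions, not the TS-ViT implementation.

```python
import torch
import torch.nn as nn

class GlobalLocalCrossAttention(nn.Module):
    """ViT tokens (queries) attend over CNN feature-map tokens (keys/values).

    A minimal sketch of the kind of cross-attention fusion described above;
    dimensions and the projection layer are illustrative assumptions.
    """
    def __init__(self, vit_dim=768, cnn_dim=576, heads=8):
        super().__init__()
        self.proj = nn.Conv2d(cnn_dim, vit_dim, 1)    # align channel widths
        self.attn = nn.MultiheadAttention(vit_dim, heads, batch_first=True)

    def forward(self, vit_tokens, cnn_feat):
        # vit_tokens: (B, N, vit_dim); cnn_feat: (B, cnn_dim, H, W)
        kv = self.proj(cnn_feat).flatten(2).transpose(1, 2)  # (B, H*W, vit_dim)
        fused, _ = self.attn(vit_tokens, kv, kv)
        return vit_tokens + fused               # residual fusion of local detail

tokens = torch.randn(2, 197, 768)               # ViT-B token sequence
local = torch.randn(2, 576, 7, 7)               # MobileNetV3-like feature map
print(GlobalLocalCrossAttention()(tokens, local).shape)  # torch.Size([2, 197, 768])
```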
摘要:ObjectiveNowadays, navigation and positioning technology has become an indispensable part of people’s daily life, and the satellite positioning system has been successfully built and widely used. However, in the environment where buildings are dense or satellites are blocked indoors, satellite positioning is inaccurate due to signal interference. Therefore, visual navigation and positioning technology has been developed to overcome this difficulty. This technology determines the position and attitude of the camera in 3D space through image processing and computer vision technology. It is an important means to solve the navigation and positioning problem in the satellite navigation rejection scene. Usually, visual matching navigation requires the pre-construction of 3D point cloud information of the scene. The acquisition of 3D point cloud models can be mainly divided into three types: manual model construction by mathematical modeling software, mapping by professional instruments, and crowdsourcing mapping by consumer terminals. These models are time consuming and laborious in constructing large-scale feature point cloud databases. Meanwhile, visual modeling of video stream data based on consumer terminals has advantages such as low cost, convenient data update, and wide spatial coverage. However, due to the large number of video frames and image redundancy, the 3D model reconstruction calculation cost is high, the cumulative error is large, and even the reconstruction failure is caused. Thus, this study proposes a 3D reconstruction key frame extraction method based on mutual check weighted optical flow.MethodFirst, the image scene is pre-classified. In the process of video shooting, the camera passes through multiple scenes, and the video frame changes in different scenes. The pixel changes of the straight road are mainly distributed at the edge of the image. At the same time, the pixels of the entire image have changed at the turning point, such that the video needs to be pre-classified. The system receives the self-collected video stream data. Then, it combines the gyroscope data obtained by the mobile terminal to divide the scene into two types: straight road and turning road, which provides a basis for subsequent targeted optical flow aggregation adjustment and adjacent frame similarity calculation. Then, cross-check adjacent frame matching is conducted, followed by the use of the Scale Invariant Feature Transform algorithm to detect feature points and their descriptors in the previous frame image. The matching points in the second frame are calculated by fast library for approximate nearest neighbors matching and pyramid Lucas-Kanade(LK) optical flow method, and the feature points successfully matched by both methods are detected. The incorrect matching points are eliminated by calculating the 2D Euclidian distance. Strong matching point pairs are obtained to capture the dynamic changes between frames, which ensures the accuracy and effectiveness of the subsequent optical flow calculation. Finally, the total optical flow field is aggregated, and the similarity of adjacent frames is calculated. Considering the differences in the video frames of the straight road and the turning road, the contribution of matching feature points of adjacent frames also varies in the similarity calculation. Therefore, optical flow aggregation needs to be conducted by weighting. 
After detecting the vanishing point of the image and taking it as the center, different weights are assigned to the strong matching point pairs near the vanishing point according to the scene classification. Next, the aggregate information of the total optical flow field of adjacent frames is weighted to obtain the optical flow changes between frames. Then, the images with significant motion changes are judged as key frames according to the set threshold, and the key frames of the video stream are finally extracted.ResultIn the experiment, a video acquisition app developed by the research group for consumer terminals is used to collect data from the target environment. In different scenes with different lighting and ground feature distributions, four groups of data are obtained using different traveling routes and speeds, and each group records the scene video and gyroscope data synchronously. The proposed algorithm is compared with traditional key frame extraction algorithms: each algorithm is used to screen video key frames, and the number of extracted key frames is counted. The structural similarity index is used to calculate the number of high-similarity frames among the key frames, where high-similarity frames are pairs of extracted key frames with high visual similarity to each other. A greater number of high-similarity frames corresponds to higher redundancy in the extracted key frames. Then, the key frames extracted on straight and turning roads are compared with the corresponding original video frames, and the proportion of extracted key frames relative to the original video frames is calculated for different scenes to evaluate the scene adaptability of the algorithm. Finally, the key frames extracted by the algorithm are used in a 3D model reconstruction experiment, and the road map is drawn with global navigation satellite system (GNSS) data to evaluate the integrity of the reconstructed model. Experimental results show that the proposed algorithm can reduce the total number of video frames to approximately 10%, with a minimum of 4.56%. Meanwhile, the proportion of high-similarity frames among the key frames is less than 3%, with a minimum of 1.91%, which is significantly lower than that of the other algorithms. In addition, the number of key frames extracted by the proposed algorithm on turning roads is much larger than that on straight roads. This behavior is better than that of the other algorithms and meets the expected number of images required in different scenes for 3D reconstruction. The integrity percentages of the final model reconstruction are 100%, 100%, 97.46%, and 96.54% on the four groups of data, which is clearly better than the other algorithms.ConclusionThis study proposes a key frame extraction method for 3D reconstruction based on mutual check weighted optical flow. This method can effectively reduce the number of video frames and improve the quality of key frame screening in diversified scenes. At the same time, the extracted key frames can improve the matching accuracy and stability of adjacent frames and enhance the robustness of 3D reconstruction.
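For intuition, the snippet below sketches a strongly simplified version of flow-based key frame selection with OpenCV: Lucas-Kanade optical flow between consecutive frames is aggregated with weights that de-emphasize points near an assumed vanishing point (here simply the image center), and a frame becomes a key frame when the weighted flow relative to the last key frame exceeds a threshold. The mutual SIFT/FLANN-LK check, gyroscope-based scene classification, and vanishing-point detection of the actual method are omitted; all names and thresholds are placeholders.

```python
import cv2
import numpy as np

def weighted_flow_score(prev_gray, curr_gray, vanish_pt, sigma=200.0):
    """Aggregate LK optical flow magnitude, down-weighting points near an
    assumed vanishing point (straight-road setting). Illustrative only."""
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=300,
                                 qualityLevel=0.01, minDistance=7)
    if p0 is None:
        return 0.0
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)
    good = status.ravel() == 1
    if not good.any():
        return 0.0
    flow = np.linalg.norm(p1[good] - p0[good], axis=2).ravel()
    dist = np.linalg.norm(p0[good].reshape(-1, 2) - vanish_pt, axis=1)
    weights = 1.0 - np.exp(-(dist ** 2) / (2 * sigma ** 2))  # emphasize image edges
    return float((weights * flow).sum() / (weights.sum() + 1e-6))

def extract_keyframes(video_path, threshold=3.0):
    """Keep frames whose weighted flow relative to the last key frame exceeds a threshold."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    if not ok:
        cap.release()
        return []
    last = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    vanish_pt = np.array([frame.shape[1] / 2.0, frame.shape[0] / 2.0])  # assumed, not detected
    keyframes, idx = [0], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if weighted_flow_score(last, gray, vanish_pt) > threshold:
            keyframes.append(idx)
            last = gray
    cap.release()
    return keyframes
```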
摘要:ObjectivePoint cloud semantic segmentation aims at assigning semantic labels to individual points in 3D space. However, the performance of segmentation models often deteriorates when they are applied to unseen domains due to domain shifts——differences in the distribution of data between the source and target domains. To address this problem, domain adaptation techniques, particularly test-time adaptation (TTA), have gained attention. Unlike traditional unsupervised domain adaptation (UDA), TTA does not require access to labeled source-domain data, which makes it a privacy-preserving and practical solution. TTA aims to fine-tune a model on unlabeled target-domain data, which enables it to generalize better to the target domain without compromising privacy. Recent advancements have incorporated multi-modal data, such as pairing point cloud data with images, to further improve domain adaptation performance. These methods enhance feature representations and provide richer statistical estimations for pseudo-label generation. However, in the case of domain shift, alignment deviations may occur between modalities, which misleads the model to make incorrect predictions. In addition, the domain differences of the modalities may further lead to inconsistent model understanding and fragmented semantic predictions, which affects the performance of point cloud semantic segmentation tasks. To this end, the use of pre-trained foundation models, such as contrastive language-image pretraining (CLIP) and segment anything model (SAM), with their rich intrinsic prior knowledge, has shown promising results in improving semantic segmentation performance in domain adaptation.MethodThis study proposes a foundation model-driven TTA method that incorporates knowledge from pre-trained models to improve multi-modal domain adaptation. The approach consists of two main components. One is the CLIP-based text-embedded segmentation layer. Based on the CLIP model, this layer uses text embeddings tailored to the categories of the target domain to replace the original segmentation layer in the model. This integration enhances the semantic generalization capability of the model, which allows it to adapt better to the target domain without requiring additional labeled data. The capability of CLIP to align text and image feature spaces through contrastive learning aids in transferring semantic knowledge from large-scale datasets to domain-specific tasks. The other is the SAM mask-based feature consistency constraint module. Built upon SAM, this module generates object masks with consistent semantics across the point cloud and image modalities. By enforcing feature consistency within these masks, the module improves feature extraction and helps the model maintain semantic coherence between different modalities. This way ensures that features remain aligned and consistent throughout the adaptation process, which enhances segmentation performance, particularly in scenarios involving complex domain shifts.ResultThe method is evaluated on three multi-modal domain adaptation scenarios using datasets: nuScenes, A2D2, and SemanticKITTI. These datasets offer diverse challenges, such as geographic and lighting domain shifts, as well as cross-dataset adaptation. Our method is compared with state-of-the-art TTA and UDA approaches to assess its effectiveness. Experimental results demonstrate the effectiveness of the proposed method across various domain adaptation scenarios. 
In the A2D2-SemanticKITTI adaptation setting, the method achieves a mean intersection over union (mIoU) of 48.3% in the 2D adaptation scenario, which surpasses the state-of-the-art method by 4.6%. In the 3D adaptation setting, the method achieves an mIoU of 42.9%. These results highlight the capability of the proposed method to improve 2D and 3D semantic segmentation performance by leveraging the two proposed modules. Compared with UDA methods, which rely on source and target domain data, the proposed method performs competitively despite only utilizing unlabeled target-domain data. This performance demonstrates the practicality of the method in real-world scenarios where access to source data is limited due to privacy concerns. Furthermore, qualitative results illustrate substantial improvements in segmentation quality, with the method producing highly accurate and coherent results in complex environments. Ablation studies further validate the proposed modules. The text-embedded segmentation layer significantly enhanced 2D performance, with the mIoU increasing from 45.8% to 48.1%. The SAM mask-based feature consistency constraint module had a greater impact on 3D performance, improving the mIoU from 40.2% to 41.8%. When both modules were combined, the method achieved the highest performance across 2D and 3D tasks, with an mIoU of 48.3% for 2D and 42.9% for 3D. These experiments confirm that the combination of both modules is crucial for achieving optimal performance in multi-modal domain adaptation.ConclusionIn this study, we propose a TTA method that uses the knowledge of large vision foundation models to guide adaptation at test time. By integrating visual-text information and local feature consistency constraints, it substantially improves the generalization performance of point cloud semantic segmentation in various scenarios. Extensive experiments across three real-world scenarios demonstrate the superior performance of our method.
关键词:point cloud;semantic segmentation;test-time adaptation(TTA);vision foundation model;multi-modality
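A minimal sketch of the CLIP-style text-embedded segmentation layer described above is shown below: per-point features are projected into the text-embedding space and classified by cosine similarity to one text embedding per target-domain category. The text embeddings are assumed to be precomputed with a CLIP text encoder, and the dimensions, temperature, and projection layer are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEmbeddedSegHead(nn.Module):
    """Classify per-point features by cosine similarity to class text embeddings.

    The text embeddings are assumed to be precomputed with a CLIP text encoder
    (one prompt such as "a photo of a {class}" per target-domain category);
    this sketch only shows the similarity-based replacement of a linear
    segmentation layer, not the full TTA pipeline.
    """
    def __init__(self, feat_dim, text_embeds, temperature=0.07):
        super().__init__()
        self.proj = nn.Linear(feat_dim, text_embeds.shape[1])  # map features into CLIP space
        self.register_buffer("text", F.normalize(text_embeds, dim=1))
        self.temperature = temperature

    def forward(self, point_feats):             # (N_points, feat_dim)
        f = F.normalize(self.proj(point_feats), dim=1)
        return f @ self.text.t() / self.temperature   # (N_points, N_classes) logits

num_classes, clip_dim = 10, 512
text_embeds = torch.randn(num_classes, clip_dim)    # placeholder for CLIP text features
head = TextEmbeddedSegHead(feat_dim=96, text_embeds=text_embeds)
logits = head(torch.randn(4096, 96))
print(logits.shape)                              # torch.Size([4096, 10])
```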
摘要:ObjectiveImage dehazing is a crucial task in computer vision and image processing. This task aims to recover potential haze-free images from those degraded by atmospheric scattering phenomena such as haze and fog. Image dehazing research holds substantial practical importance and application value. It enhances the performance and reliability of visual systems under adverse weather conditions, which provides technical support for critical applications such as autonomous driving, intelligent surveillance, and remote sensing image analysis. It also serves as a preprocessing step to improve the performance of advanced computer vision tasks including object detection, image classification, and scene understanding. Simultaneously, image dehazing technology has driven the development of atmospheric scattering models and image restoration theory, which promotes innovative applications of deep learning in low-level computer vision tasks. It has great value for improving the effectiveness of public safety, environmental monitoring, and intelligent transportation systems, which serves as a crucial supporting technology for achieving all-weather, all-scenario visual perception capabilities. Existing methods utilize differential features of clear-degraded image pairs in spatial and frequency domains to achieve dehazing with certain effectiveness. However, three major problems remain: limitations in spatial domain feature extraction and fusion, poor performance in frequency domain feature fusion, and inefficient integration of features from both domains. To address these issues, we propose the dual-domain feature fusion network (DFFNet), which focuses on the fusion of features from spatial and frequency domains.MethodDFFNet is based on soft reconstruction (SR) to recover potential haze-free images and consists of two key modules. First, we design a spatial-domain feature fusion module (SFFM) more suitable for image soft reconstruction. This module adopts a Transformer-style architecture that captures global features through large-kernel attention mechanisms to accurately locate hazy regions in images. Meanwhile, it models local features through pixel attention mechanisms to effectively restore image edges and details. The two attention mechanisms jointly satisfy the dual requirements for global and local features in the image soft reconstruction process. Specifically, these features are mapped and fused using a convolutional feed-forward network. The SFFM maintains a relatively small parameter scale and low computational complexity. Meanwhile, according to the spectral convolution theorem, visual feature processing in the spatial domain and frequency domain is essentially equivalent——convolution operations in the spatial domain are equivalent to multiplication operations in the frequency domain. Therefore, we choose to emphasize high-frequency components through convolution. We propose the frequency-domain feature fusion module (FFFM), which adopts an implicit method to process high-frequency information without requiring explicit Fourier transforms. By stacking multiple convolutions as high-pass filters, the module enhances high-frequency components in the input features and enriches their diversity. As a result, the capability of the model to restore high-frequency details in images is improved. Increasing the number of stacked convolutional layers can extract richer high-frequency features, which enhances the quality of the output features of the FFFM module. 
However, considering parameter scale and computational efficiency, we ultimately choose to stack three convolutional layers for extracting three different types of high-frequency features. To better capture contextual information while optimizing computational overhead, the FFFM module is only applied between the encoder and decoder at the third scale. The output features of the first and second scale encoders are adaptively fused with the upsampled output features of the second and third scale decoders through SKFusion, respectively. The output features of the first scale decoder, after patch unembed, are processed together with the input image through the SR module to recover the potential haze-free image. In DFFNet, the fusion of spatial and frequency domain features occurs in the FFFM module located at the network bottleneck. The features at the bottleneck have already extracted rich multi-scale spatial information through multiple SFFM layers, which possess strong semantic expression capabilities. Simultaneously, as the connection point between the encoder and decoder, the features fused at this position can directly influence the subsequent recovery process. These fused features allow high-frequency information to effectively guide the entire decoding process, which enhances restoration of image edges and details.ResultThe DFFNet, which combines the two key module designs, demonstrates performance exceeding those of current state-of-the-art methods on two benchmark datasets. DFFNet-L is the first dehazing network to achieve peak signal-to-noise ratio values exceeding 43 dB on the SOTS-Indoor test set and 36 dB on the Haze4K dataset. The specific values are 43.83 dB and 36.39 dB, which outperform those of the current state-of-the-art image dehazing method MixDehazeNet-L by 1.21 dB and 0.45 dB, respectively. Furthermore, DFFNet is more lightweight, with only 46.0% of the parameters and 67.1% of the floating point operations compared with MixDehazeNet-L. In the visualization experiments, DFFNet-L demonstrates the highest quality of haze-free image restoration. This superior performance is primarily attributed to its large receptive field design, which enables the model to fully capture and utilize global contextual information in images for precise dehazing. This advantage allows DFFNet-L to accurately identify the spatial distribution characteristics of haze in scenes, which achieves thorough haze removal, uniform and natural color restoration, and improved overall visual effects in the recovered images. Simultaneously, DFFNet-L leverages the frequency domain differential features between clear and degraded image pairs, which significantly enhances its capability to restore high-frequency components of images. This improved capability results in sharper edge contours and clearer textural details in the recovered images, which enhance the accuracy of detail restoration and make the results highly closely approximate the haze-free ground truth images. Moreover, the main modules of DFFNet, SFFM, and FFFM possess good transferability and scalability, which allow them to be conveniently transferred to other computer vision tasks. As a result, new solutions for improving model performance are provided.ConclusionThis study proposes a dual-domain feature fusion network that combines the advantages of convolutional neural network models and Transformer models, which effectively addresses the existing problems in dual-domain feature fusion and achieves excellent dehazing results. 
The code is available at https://github.com/WWJ0720/DFFNet.
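As a rough sketch of the implicit high-frequency enhancement that the FFFM performs without explicit Fourier transforms, the module below extracts several high-pass residuals by subtracting smoothed versions of the feature map, re-weights them with 1 × 1 convolutions, and adds them back. The construction, kernel sizes, and module name are assumptions meant only to illustrate the idea, not the DFFNet implementation.

```python
import torch
import torch.nn as nn

class ImplicitHighFreqEnhance(nn.Module):
    """Enhance high-frequency content without an explicit Fourier transform.

    Extracts high-pass residuals by subtracting smoothed versions of the
    feature map, re-weights them with 1x1 convolutions, and adds them back.
    A generic sketch of the FFFM idea, not the DFFNet code.
    """
    def __init__(self, dim, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.blurs = nn.ModuleList(
            [nn.AvgPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes]
        )
        self.weights = nn.ModuleList(
            [nn.Conv2d(dim, dim, 1) for _ in kernel_sizes]
        )

    def forward(self, x):
        out = x
        for blur, w in zip(self.blurs, self.weights):
            out = out + w(x - blur(x))          # high-pass residual, re-weighted
        return out

feat = torch.randn(1, 96, 64, 64)
print(ImplicitHighFreqEnhance(96)(feat).shape)  # torch.Size([1, 96, 64, 64])
```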
摘要:ObjectiveThe analysis of histopathological images provides essential pathological information for tumor diagnosis and treatment, and it is widely regarded by experts as the gold standard for tumor diagnosis. Prior to digitization, histopathological images undergo a complex, multi-step process that includes sampling, fixation, dehydration, embedding, sectioning, staining, and mounting. This process is labor intensive, has limited automation, and is subject to the subjective actions of pathologists, which can introduce artifacts. These artifacts may obscure normal tissue and complicate the assessment of true lesions, which leads to potential misdiagnoses or missed diagnoses that can impact patient treatment and prognosis. Furthermore, artifacts in histopathological images can hinder the performance of computer-aided diagnostic systems, which underscores the importance of effective artifact detection in finalized images. Deep learning methods for identifying and differentiating artifacts with prominent morphological and color distinctions, such as ink marks, tissue folds, and blurriness, have shown promising results. However, certain artifact types, such as looseness and tissue bubbles, are visually similar, which makes them particularly challenging to differentiate. Effective differentiation requires models that can maintain sensitivity to local features while also achieving a comprehensive understanding of global patterns and structures, which ensures the robust integration of local and global information. However, current methods for artifact detection have not fully addressed the challenges posed by these visually similar artifacts, which indicates an urgent need for further investigation in this area. This study aims to address the limitations of existing methods in integrating and interacting with local and global information for detecting highly similar artifacts. Thus, this study proposes a local and global information interactive fusion network MoLiNet (mobile linear net) to improve the multi-classification of pathological image artifacts.MethodIn pathology, histopathological images are generally stored as whole-slide images (WSIs) organized in a pyramid structure. Researchers typically preprocess WSIs by segmenting them into smaller image patches. After patch segmentation, we further process these patches using an HSV color space-based edge detection module, which is called HSVEF. HSVEF is designed to improve the attention of the network to fissures. In this work, we extend the concept of a parallel framework by employing a dual-branch network. The CNN branch utilizes the lightweight CNN MobileNetV3, which enables efficient feature extraction while enhancing the capacity of the network to capture local details. As image patches pass through this branch, features are progressively refined from low- to higher-level representations. The Transformer branch is constructed using MGL-Trans (mobile gated linear Transformer), which takes dimensionally compressed image features as input. Within the MGL-Trans module, the conventional feedforward layer is replaced by the mobile gated linear unit (MobileGLU), which leverages a gated linear unit (GLU). The MobileGLU design enables the model to selectively amplify or suppress features at different stages, which enhances its capability to recognize complex image patterns. A focused linear cross-attention (FLCA) mechanism is incorporated between the two branches, which enables effective interaction between local and global features. 
In our model, the FLCA mechanism is concretely implemented as the feature gather module and the feature diffusion module. Finally, the features extracted from both branches are concatenated and fed into a classification head for categorization. This architecture not only enriches the model's overall understanding of image features but also enhances its performance in complex image analysis tasks.
Result: The proposed network surpasses comparable state-of-the-art methods in classification accuracy and computational efficiency. On the Ningbo Clinical Pathology Diagnosis Center Similar Artifact Dataset, our model achieves accuracy, precision, recall, and F1-score values of 0.942 4, 0.947 7, 0.949 8, and 0.948 6, respectively, outperforming the dual-branch architecture Dyn-Perceiver and the artifact classification methods AR-Classifier and DKL. Notably, our network reduces parameters and computational load by 80.91% and 96.94%, respectively, compared with DKL. In addition, heatmap visualization confirms that our method accurately identifies critical regions of pathological artifacts and assigns higher weights to these areas, which greatly enhances the accuracy of multi-class artifact detection.
Conclusion: Compared with similar methods, our approach effectively extracts artifact information from images. In this network, we designed MGL-Trans based on Transformer principles; it allows the model to selectively enhance or suppress features at various stages, improving its capability to recognize complex image patterns. We also developed the focused linear cross-attention mechanism, which builds on cross-attention and uses an efficient focusing function to better capture dependencies between two input sequences while maintaining computational efficiency and high performance. These innovations enable the model to effectively distinguish similar artifacts in histopathological slides while significantly reducing computational resource consumption, offering an efficient solution for the quality assessment of pathological images.
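For readers unfamiliar with gated linear units, the following minimal sketch shows the kind of GLU-based feedforward the abstract attributes to MobileGLU, in which one linear projection gates another so that features can be selectively amplified or suppressed. The expansion ratio, activation, and omission of any mobile-style depthwise convolution are assumptions, not the paper's actual design.

```python
# Illustrative sketch only: a gated-linear-unit feedforward block.
# Expansion ratio and activation are placeholder assumptions.
import torch
import torch.nn as nn

class GatedLinearFeedForward(nn.Module):
    def __init__(self, dim: int, expansion: int = 2):
        super().__init__()
        hidden = dim * expansion
        self.value = nn.Linear(dim, hidden)   # content path
        self.gate = nn.Linear(dim, hidden)    # gating path
        self.act = nn.SiLU()
        self.out = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) token sequence; the gate rescales the content path
        # element-wise, passing or suppressing features per channel.
        return self.out(self.value(x) * self.act(self.gate(x)))
```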
Abstract:
Objective: Low-dose computed tomography (LDCT) has attracted considerable attention in medical imaging because it substantially reduces radiation exposure while preserving clinical diagnostic value. In dental medicine, cone-beam CT (CBCT), a pivotal 3D imaging technology, faces a critical challenge: radiation dose control affects not only image quality but also patient safety. However, dose reduction inevitably intensifies quantum noise and artifact generation, creating a dual challenge in oral imaging because of the complex anatomical characteristics of the region. Although denoising techniques for low-dose pulmonary CT have advanced considerably, research on noise suppression in low-dose oral CBCT remains limited. The wide dynamic range between high-density structures, such as teeth and alveolar bone, and low-density soft tissues, including gingiva and mucosa, together with the inherently low contrast of delicate anatomical features such as root canals, often leads to over-smoothing and loss of critical details in current deep learning-based low-dose CT denoising algorithms. These limitations hinder clinical diagnostic accuracy and underscore the need for noise reduction strategies tailored to the unique imaging challenges of oral CBCT. Therefore, this study explores an efficient denoising method for low-dose oral CBCT images that suppresses image noise and artifacts while preserving detailed information on tooth edges, pulp chambers, and soft tissues, under the premise of ensuring radiation safety.
Method: A multi-scale attention-based encoder-decoder network is proposed to address the excessive noise, prominent artifacts, and blurred details in low-dose dental CBCT images. The model adopts an end-to-end encoder-decoder architecture that uses multi-layer convolution, downsampling, and upsampling to map noisy images directly to high-quality outputs. During encoding, the network extracts low-, medium-, and high-level features through multi-layer convolution and pooling, progressively capturing global and local semantic information. During decoding, upsampling and residual connections restore image details while preserving high-resolution information. The core innovations of the model are the multi-scale feature fusion module (MFFM) and the directional attention feature refinement module (DAFRM). The MFFM uses parallel convolutional kernels of varying sizes to extract multi-scale features and then applies self-attention to dynamically weight and fuse them, comprehensively integrating information from tooth contours, bone structures, and soft tissue textures. The DAFRM focuses on regions with distinct spatial directional characteristics, such as the enamel-dentin junction and pulp chambers; by combining adaptive pooling with attention weighting along the horizontal and vertical orientations, it strengthens the representation of critical anatomical details. To further optimize denoising performance, a hybrid loss function is designed that combines pixel-wise mean squared error, a smoothness loss, and a structural similarity loss.
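As a concrete illustration of such a hybrid objective, the sketch below combines the three terms named in the abstract: pixel-wise MSE, a smoothness term (here total variation, one common choice), and an SSIM-based term. The loss weights and the uniform-window SSIM approximation are assumptions; the paper's exact formulation is not given in the abstract.

```python
# Illustrative sketch only: hybrid denoising loss = MSE + smoothness + SSIM.
# Weights and the uniform-window SSIM approximation are assumptions.
import torch
import torch.nn.functional as F

def ssim(pred, target, window: int = 11, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2):
    # Uniform-window SSIM for (B, 1, H, W) images scaled to [0, 1].
    pad = window // 2
    mu_p = F.avg_pool2d(pred, window, 1, pad)
    mu_t = F.avg_pool2d(target, window, 1, pad)
    var_p = F.avg_pool2d(pred * pred, window, 1, pad) - mu_p ** 2
    var_t = F.avg_pool2d(target * target, window, 1, pad) - mu_t ** 2
    cov = F.avg_pool2d(pred * target, window, 1, pad) - mu_p * mu_t
    ssim_map = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    return ssim_map.mean()

def total_variation(x):
    # Anisotropic total variation as a simple smoothness penalty.
    return (x[..., :, 1:] - x[..., :, :-1]).abs().mean() + \
           (x[..., 1:, :] - x[..., :-1, :]).abs().mean()

def hybrid_loss(pred, target, w_mse=1.0, w_tv=0.05, w_ssim=0.5):
    # Loss weights are placeholders, not the paper's values.
    return (w_mse * F.mse_loss(pred, target)
            + w_tv * total_variation(pred)
            + w_ssim * (1.0 - ssim(pred, target)))
```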
Through multi-objective joint optimization, the network suppresses noise while maintaining pixel-level accuracy, structural continuity, and realistic restoration of edge details. This approach balances radiation safety and diagnostic fidelity, providing a robust framework for high-quality image reconstruction in low-dose dental CBCT applications.
Result: Experiments were conducted on a low-dose dental CBCT dataset to systematically validate the effectiveness of the proposed denoising algorithm, and its performance was compared with that of seven state-of-the-art denoising methods. The results demonstrate significant improvements across all evaluation metrics. The proposed model achieves a peak signal-to-noise ratio of 38.40 dB, outperforming RED-CNN and WGAN by 3.54 dB and 14 dB, respectively. For the structural similarity index, the model attains 0.952 4, representing improvements of 3% and 18% over the same baselines. Visual comparisons further confirm that the proposed method preserves fine structures such as tooth edges and root canals while eliminating noise, avoiding common over-smoothing artifacts, and yielding denoised images with superior visual clarity. To further verify robustness, additional tests were performed on CBCT datasets acquired at varying dose levels. The results indicate that the proposed model consistently delivers high-quality denoising across both higher and lower doses, demonstrating strong adaptability to variations in noise distribution and intensity as well as reliable generalization in clinical scenarios with heterogeneous imaging parameters.
Conclusion: The proposed multi-scale attention encoder-decoder network for low-dose oral CBCT denoising effectively reduces noise and artifacts induced by low-dose imaging while preserving critical structural details, providing a high-quality imaging foundation for subsequent clinical applications such as endodontic diagnosis and orthodontic treatment planning.
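The DAFRM is described only at a high level (adaptive pooling and attention weighting along horizontal and vertical orientations). The sketch below shows a directional attention block in that spirit, close in structure to coordinate attention; the reduction ratio, normalization, and activations are assumptions rather than the module's actual design.

```python
# Illustrative sketch only: directional (horizontal/vertical) attention block.
# Reduction ratio, normalization, and activations are assumptions.
import torch
import torch.nn as nn

class DirectionalAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height -> (B, C, 1, W)
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
        )
        self.attn_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.attn_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Direction-wise descriptors along each spatial axis.
        feat_h = self.pool_h(x)                         # (B, C, H, 1)
        feat_w = self.pool_w(x).permute(0, 1, 3, 2)     # (B, C, W, 1)
        y = self.reduce(torch.cat([feat_h, feat_w], dim=2))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.attn_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.attn_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        # Reweight the feature map along both orientations.
        return x * a_h * a_w
```

A block like this would typically be inserted after selected encoder or decoder stages to emphasize anatomically oriented structures, though where and how often it is applied in the proposed network is not stated in the abstract.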