Abstract: Images are essential carriers of visual information and play an integral role in various aspects of human life, ranging from daily interactions to complex technological applications. However, during processes such as acquisition, transmission, and storage, images are often exposed to a range of environmental and technical factors that can lead to quality degradation. This degradation not only diminishes visual perception and causes information loss but also has broader implications, adversely affecting computer vision tasks. A decline in image quality reduces the accuracy of critical computer vision applications such as semantic segmentation and object detection, which rely heavily on high-quality input images. In application scenarios where precision and reliability are crucial, such as autonomous driving, intelligent healthcare, and other safety-critical environments, image degradation can notably hinder user experience and compromise the reliability of data-driven systems. Image restoration and enhancement technologies aim to recover degraded images to their original clarity and fidelity. These technologies seek to restore images free from distortion, thereby improving subjective visual quality and enhancing the performance of subsequent tasks that depend on these images. While traditional image restoration techniques have proven effective in addressing mild degradation, they often struggle with more complex or severe degradations, especially when multiple degradation factors are involved. This limitation has spurred researchers to investigate more advanced methods capable of addressing a wide range of complex and intricate degradation scenarios. Recent advancements in computational hardware, coupled with rapid developments in deep learning, have led to substantial breakthroughs in vision and multimodal large models. These models, backed by sophisticated architectures and extensive training, have shown exceptional potential across various fields. Leveraging these advancements, image restoration and enhancement technologies have made substantial progress, offering promising solutions to previously challenging problems. This paper provides a systematic review of the current research landscape in image restoration and enhancement, offering an in-depth analysis of several core technologies driving advancements in this area. The primary contributions of this paper are structured around the following six focal areas: 1) Compilation and analysis of datasets for image restoration and enhancement tasks: the effectiveness of image restoration methods is heavily dependent on the quality and scale of the datasets used for training and evaluation. This paper offers a thorough compilation of datasets commonly employed in image restoration tasks, including denoising, deraining, and dehazing. It also provides insights into the characteristics of these datasets, including their scale, quality, and the techniques employed to generate low-quality images, enabling a comprehensive understanding of dataset effects on restoration performance. 2) Exploration of vision Transformer (ViT) in image restoration and enhancement: Vision Transformers (ViTs) have introduced the powerful Transformer architecture to the field of image processing. By enabling the modeling of long-range dependencies, ViTs have shown considerable promise in image restoration and enhancement tasks.
This paper offers a systematic review of the application of ViT-based methods while evaluating their potential for handling complex image degradation patterns. 3) Summary of diffusion model-based image restoration and enhancement methods: diffusion models have emerged as effective solutions for addressing complex image degradation and restoring fine details in challenging cases. This paper provides a summary of recent advancements in diffusion model-based image restoration, focusing on the unique strengths of the iterative denoising process. Compared to traditional methods, diffusion models exhibit strong capabilities in recovering details for severely degraded images. However, they also present certain risks, such as the potential to generate content that may appear less realistic. 4) Analysis of the potential of X-anything models in image restoration and enhancement tasks: represented by models such as the segment anything model (SAM), X-anything models utilize extensive pre-training and prior information to perform robust zero-shot predictions, even when applied to degraded images with limited or no labeling. This paper explores the potential applications of SAM and similar models in image restoration, emphasizing their capability to provide stable restoration results through zero-shot learning. This approach proves particularly advantageous in scenarios involving unlabeled or weakly labeled data. 5) Application of multimodal large models in image restoration and enhancement: with the emergence of multimodal large models such as CLIP and GPT-4V, researchers have started harnessing the powerful information fusion capabilities of these models for image restoration and enhancement. This paper highlights the advantages of multimodal models in tackling complex restoration tasks by examining how they leverage pre-trained semantic information to guide the restoration process. The integration of these semantic features enables multimodal models to outperform traditional methods, particularly in challenging scenarios where conventional techniques often fall short. 6) Challenges and prospects of image restoration and enhancement technologies: despite the considerable advancements in recent years, image restoration and enhancement technologies continue to face substantial challenges in practical applications. Key obstacles include the difficulty of obtaining high-quality, diverse training data, the high computational demands of these models, and the need for enhanced model stability across various conditions. This paper thoroughly examines these challenges and explores prospective research directions, such as improving model adaptability to resource constraints, developing more efficient data acquisition methods, and enhancing model robustness. These directions aim to provide valuable insights for researchers and practical applications, fostering further development in the field. Overall, this paper provides a comprehensive overview of the research advancements in image restoration and enhancement, both domestically and internationally. This paper seeks to encourage new ideas and introduce innovative directions for future research and applications in this rapidly evolving field by systematically summarizing current progress and analyzing key technological innovations.
Keywords: image restoration and enhancement; visual large model; large multimodal model (LMM); vision Transformer (ViT); diffusion model; X-anything; computer vision
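To make the iterative denoising process highlighted in contribution 3) concrete, the following minimal sketch shows a DDPM-style sampler conditioned on a degraded input image; the interface eps_model(x_t, y, t), the noise schedule betas, and all names are illustrative assumptions rather than the formulation of any specific surveyed method.

```python
import torch

@torch.no_grad()
def restore_with_diffusion(eps_model, y_degraded, betas):
    """Hypothetical DDPM-style ancestral sampler guided by a degraded image y_degraded.

    eps_model(x_t, y, t) is assumed to predict the noise present in x_t at step t.
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(y_degraded)                 # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = eps_model(x, y_degraded, t)            # noise prediction conditioned on the degraded input
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise      # one iterative denoising step
    return x                                         # restored image estimate
```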
Abstract: With the advent of the big data era, video platforms such as YouTube, TikTok, and Kuaishou have gained popularity due to their rich video content. However, the explosive growth of data has also made it difficult for users to retrieve content that interests them. Traditional unimodal retrieval relies on manual annotations, limiting flexibility and incurring high costs. Video-text retrieval (VTR) addresses this issue by using deep learning to enable cross-modal retrieval between video and text, allowing for the retrieval of the most relevant content from a corresponding database based on either a text or video query. Early VTR methods relied on predefined concepts for retrieval, which lacked scalability. VTR based on joint embedding spaces has become mainstream, bridging modality differences through feature extraction and alignment, while holding significant value in the fields of natural language processing and computer vision. This method has also been widely used in various sectors, such as healthcare, social media, and short videos. Video-text retrieval based on joint embedding spaces involves four key technologies: video feature representation extraction, text feature representation extraction, video-text feature alignment, and the objective function. The goal of video feature representation extraction is to convert videos into feature vectors that computers can better understand. This is mainly divided into two aspects: spatiotemporal and multimodal features. On the one hand, spatiotemporal features are obtained by extracting spatial information from video frames and modeling temporal information. On the other hand, multimodal features involve integrating audio, subtitles, and motion information within the video to enhance video understanding. Methods based on multimodal features aggregate rich multimodal information, effectively improving retrieval performance. However, these methods also have high dataset requirements and require large amounts of labeled data to extract features from various modalities. Furthermore, such methods lack intelligent multimodal fusion mechanisms, are unable to coordinate the relationships between different modalities, and still need improvement in retrieval efficiency. The goal of text feature representation extraction is to map high-dimensional discrete language sentences into low-dimensional dense feature representations, in which the key element is the effective modeling of sequential relationships within the text. Early methods used bag-of-words and word2vec to represent word embeddings, followed by recurrent neural networks (RNNs) and convolutional neural networks (CNNs) to model dependencies between words. Recently, the Transformer model, with its self-attention mechanism, has enabled parallel processing of text data and captured global dependency information, thereby achieving breakthroughs in multiple benchmarks; it is currently the most competitive approach. Video-text feature alignment maps the feature representations of both video and text into a shared embedding space for similarity computation. Coarse-grained feature alignment is achieved by calculating global similarity, which is efficient but unable to capture subtle semantic differences. In comparison, fine-grained feature alignment focuses on aligning local information by capturing low-level features and semantic information at lower and higher layers, respectively. Fine-grained alignment may also enhance the model's perception of detail through explicit alignment, thus improving retrieval accuracy.
Objective functions include triplet loss and contrastive loss. Triplet loss optimizes the model by ensuring that the similarity between positive sample pairs is higher than that of negative sample pairs, but it is greatly influenced by the quality of negative samples and batch size. Contrastive loss shortens the distance between positive sample pairs and increases the distance between negative sample pairs. It does not require setting a margin threshold and is commonly used to optimize VTR models, allowing it to overcome some of the limitations of triplet loss. VTR models typically adopt a pretraining and fine-tuning strategy, in which they are pretrained on large-scale image-text and video datasets and then fine-tuned on benchmark datasets specific to video-text retrieval. The benchmark datasets are summarized in terms of their quantity and duration. The evaluation metrics for testing the models include R@1, R@5, R@10, MdR (median rank), and MnR (mean rank). Several conclusions can be drawn by comparing the test results of various models on typical datasets. First, in the extraction of multimodal video feature representations, although existing methods extract and aggregate multimodal information using expert models, adding excessive modality information has not significantly improved model performance. In fact, this may introduce noise, highlighting the urgent need for intelligent modality fusion methods. Second, the extraction of spatiotemporal feature representations is crucial for model performance, particularly the advantages of the Transformer architecture in modeling spatiotemporal information. Future research will focus more on how to effectively relate time and space information to enhance video representation capabilities. Additionally, fine-grained information interaction can effectively improve model performance; however, the complexity of the model structure makes optimization and implementation difficult. Therefore, more efficient fine-grained interaction methods must be explored. Finally, the ranking of distinct models on different datasets can vary, reflecting the influence of dataset differences and model structures. This finding indicates the need to develop VTR models with stronger generalization capabilities. Several challenges and future directions for video-text retrieval are also discussed. The first challenge is the lack of high-quality datasets, which limits model training. In particular, existing datasets have limited evaluation of temporal information modeling, and there is an urgent need for standardized, high-quality datasets. Second, retrieval efficiency is often overlooked in existing methods. In large-scale video data retrieval, improving efficiency without sacrificing accuracy will be a major focus of future research. Third, scalable retrieval models remain a challenge. Given that current models require fine-tuning for each dataset, future research must focus on leveraging the general knowledge of foundational models to improve their adaptability and transferability. Finally, the exploration of unsupervised learning methods is becoming a trend, with future research focusing on the optimization of models using large amounts of unlabeled video data.
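To illustrate the two objective functions summarized above, the sketch below contrasts a margin-based triplet loss with a symmetric InfoNCE-style contrastive loss over a batch of video and text embeddings; it is a generic formulation under these assumptions, not the exact loss of any particular VTR model.

```python
import torch
import torch.nn.functional as F

def triplet_loss(v, t_pos, t_neg, margin=0.2):
    """Margin-based triplet loss: a positive pair must beat a negative pair by the margin."""
    sim_pos = F.cosine_similarity(v, t_pos)
    sim_neg = F.cosine_similarity(v, t_neg)
    return F.relu(margin + sim_neg - sim_pos).mean()

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: matched video-text pairs lie on the diagonal of the similarity matrix."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                     # pairwise cosine similarities scaled by temperature
    labels = torch.arange(len(v), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```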
Abstract: Artificial intelligence-generated content (AIGC), which refers to the use of artificial intelligence (AI) technology to generate digital content, such as text, images, videos, and three-dimensional (3D) assets, has developed rapidly in the past few years, triggering a technological revolution. In the field of 3D AIGC, text-guided 3D editing has research significance and application value. Guided by the target text, it can change the geometry and appearance of existing 3D assets, thereby creating diversified and high-quality 3D assets. Compared with other guiding conditions, such as reference images and sketches, the 3D content editing paradigm guided by natural language has the advantages of friendly interaction, high efficiency, and strong practicability. This paradigm also has wide application potential in virtual/augmented reality, autonomous driving, robotics, and other fields. In recent years, the emergence and development of a series of key technologies, such as advanced neural representations, generative models, and text-guided image generation and editing, have led to significant progress in text-guided 3D editing and achieved certain outcomes. However, editing 3D content with text guidance remains a challenging task. Unlike the text-guided 3D generation task of generating 3D assets from scratch, text-guided 3D editing modifies existing 3D assets and changes their geometric structure and appearance, among other properties, to obtain a new asset that conforms to the description of the target text. In the process of 3D editing, the first core problem is to ensure that the non-edited areas are not affected while completing the edits required by the target text. Second, it is difficult to correctly understand the target text and edit 3D assets that are semantically consistent with the target text, especially when the target text describes complex scenes, including multiple objects and different attributes. Furthermore, selecting 3D representations that are suitable for editing is a complex task, and both explicit (e.g., voxels and meshes) and implicit (e.g., neural radiance fields and distance functions) representations have advantages and disadvantages in terms of representation ability and efficiency. Finally, the lack of a large dataset of text-3D assets and the inconsistency of multiple perspectives make text-guided 3D editing more challenging. In recent years, neural radiance fields and 3D Gaussian splatting have been proposed. Owing to their advantages, such as continuity and highly photorealistic rendering, these representations have enabled significant progress in high-quality 3D reconstruction and rendering of scenes. With large pretrained text-image alignment models, neural radiance fields have also been extended to text-guided 3D generation. Therefore, a simple way to implement text-guided 3D editing is to fine-tune the pretrained text-guided 3D generation model and modify the geometry, appearance, and other properties of the 3D asset so that it meets the new target text description. Earlier methods supervised the adjustment of the neural radiance fields with a contrastive language-image pretraining loss to align them with the new target text. Recent methods mostly utilize score distillation sampling loss optimization to edit neural radiance fields. However, this approach based on fine-tuning generation models can only change 3D assets globally and does not support fine-grained 3D editing.
At the same time, the emergence of large text-image datasets and pretrained text-image alignment models has promoted the flourishing development of text-guided image editing techniques. Introducing representative image editing techniques into 3D editing is therefore a promising direction for solving the task of text-guided 3D editing. This editing paradigm avoids the need for text-3D data pairs by lifting 2D image editing to neural radiance fields, thereby enabling key advances in text-guided 3D editing. Early methods performed image editing on images rendered from existing 3D models so that they conform to the target text and then used the edited images to reconstruct the target 3D model, thereby completing a 3D edit that meets the target text. Subsequent methods further improve the editing quality and efficiency through multiview-consistent editing and generalized editing. However, such methods rely on the ability of text to guide image editing and can only use image editing to provide implicit constraints without explicit control of the 3D editing process, which is not ideal for high-quality 3D editing. To achieve more accurate editing, recent research has focused on introducing explicit editing constraints into the editing process, thus limiting 3D editing to the editable area and avoiding unnecessary editing while meeting the requirements of the target text. These methods can automatically determine the editing region from the semantic correspondence between the target text and the image, thus enabling high-quality 3D editing. In view of these significant advances, the literature must be systematically summarized and analyzed for researchers interested in the field of text-guided 3D editing. This paper focuses on the latest advancements in text-guided 3D editing based on neural radiance fields and 3D Gaussian splatting, summarizing existing research from the aspects of methodological essence and editing capabilities. Specifically, this paper categorizes current research into three types according to their editing constraints, namely, unconstrained, implicit constraints, and explicit constraints, to deeply analyze the essence of each method. In addition, the paper discusses the editing capabilities of these methods from various perspectives, including types of editing (e.g., geometry and appearance), scope of editing (e.g., objects and scenes), and editing robustness (e.g., global or local controllability). Finally, the paper analyzes the challenges faced by current research and offers insights and prospects for potential future research directions. In summary, the contributions of this paper are as follows: 1) it offers the first review of text-guided 3D editing based on neural radiance fields and 3D Gaussian splatting, 2) it provides a set of effective classification criteria that summarize the existing research work according to the essence of the methods, and 3) it discusses the 3D editing capabilities of existing studies on the basis of this classification.
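For reference, the score distillation sampling objective mentioned above is commonly written as the gradient below, where g(θ) renders the 3D representation θ into an image x, ε̂_φ is the pretrained diffusion model's noise prediction conditioned on the target text y, and w(t) is a timestep weighting; this is the standard formulation from the text-to-3D literature rather than a contribution of the surveyed editing papers.

\[
\nabla_{\theta}\mathcal{L}_{\mathrm{SDS}}\big(\phi, x = g(\theta)\big)
= \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_{\phi}(x_t; y, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]
\]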
Abstract: Objective: In recent years, video-text cross-modal retrieval has garnered widespread attention from academia and industry due to its significant application value in areas such as video recommendation, public safety, sports analysis, and personalized advertising. This task primarily involves video retrieval (VR) and video moment retrieval (VMR), aiming to identify videos or video moments from a video library or a specific video that are semantically most similar to a given query text. The inherent heterogeneity between video and text, as they belong to different modalities, makes direct feature matching highly challenging. Thus, the key challenge in video-text cross-modal retrieval lies in effectively aligning these two cross-modal data types in the feature space to achieve precise semantic relevance calculation. Current methods primarily focus on enhancing semantic matching across modalities through cross-modal interactions on existing datasets to improve retrieval performance. Although modeling has seen significant progress, issues inherent to the datasets remain unexplored. In the context of video-text cross-modal retrieval, this study observes an ill-posed problem during training with existing datasets, manifested as a single query text corresponding to multiple videos or video moments, leading to nonunique retrieval results. These one-to-many samples frequently lead to model confusion during training, hinder the alignment of cross-modal feature representations, and degrade overall model performance. For instance, if a query text describes a target video and a nontarget video, then retrieving the latter during training is penalized as incorrect, thereby artificially increasing the distance between the query text and the nontarget video in the feature space, despite their high semantic relevance. This paper defines these problematic one-to-many samples as hard samples, whereas one-to-one samples are defined as easy samples. To address this issue, this paper proposes an iterative optimization method for VR data using large language model guidance. By leveraging the built-in knowledge of large language models, this method augments one-to-many video-text pairs with fine-grained information and iteratively refines them into one-to-one mappings. Method: Initially, the dataset is divided into easy and hard sample sets based on video-text similarity. Specifically, the similarity between the query text and all videos is calculated. If the similarity between the query text and the target video is not the highest, then the data pair is classified into the hard sample set; otherwise, it is classified into the easy sample set. For videos in the hard sample set, several frames are uniformly sampled and input into an image-text generation model to produce frame-level descriptive texts. This process aims to capture fine-grained information, such as objects not described by the query text, detailed appearances, and color attributes in the video. However, given that multiple frames may contain similar scenes and objects, the extracted fine-grained textual descriptions are often redundant and noisy. To address this, an iterative optimization module based on video-text semantic associations is introduced. This module combines the original query text with fine-grained information extracted from the target video and integrates it with a carefully designed prompt template, which is input into a large language model. The model then generates a refined, fine-grained, and unique query text.
The quality of the optimization results depends significantly on the design of the prompt templates. The templates include the following key elements: 1) clear task descriptions; 2) relevant examples that meet specified conditions; and 3) specific requirements, such as extracting co-occurring content across multiple frames during summarization. The emphasis on co-occurring content is justified by two key reasons: first, such content often carries critical and essential information; second, summarizing shared elements effectively reduces the likelihood of introducing erroneous descriptions. High-quality outputs from large language models typically result from multiple interactions with the user, as these models can refine their responses based on user feedback. Inspired by this, the study aims to automate the optimization process without requiring predefined interaction rounds. To further optimize the fine-grained query text, an iterative condition based on video-text semantic association is designed. Specifically, the optimized query text and corresponding video are encoded through an encoder. If the similarity of the extracted features in the feature space meets a predefined condition, then the optimized query text is deemed satisfactory, and the optimization process is terminated. Otherwise, if the condition is not met, then the current optimization results are used to update the prompt information, and the query text is further refined iteratively until the dataset no longer contains one-to-many issues for any query text. Finally, the optimized data are used to train the video-text cross-modal retrieval model. Result: The effectiveness of the proposed method was validated on multiple mainstream video-text cross-modal retrieval datasets. In the VMR task, four neural network models trained on the Charades-STA dataset and optimized using the proposed method showed an average improvement of 2.42% in the R@1, IoU = 0.5 metric, with a maximum improvement of 3.23%. When IoU = 0.7, performance improvements reached up to 4.38%. On the QVHighlights dataset, the performance of MomentDETR and QDDETR improved by 5.48% and 1.35%, respectively, with an average improvement of 3% when IoU = 0.7. In the VR task, two methods demonstrated an average improvement of 1.4% in the R@1 metric on the MSR-VTT dataset, with a maximum improvement of 1.6%. These results demonstrate the proposed method’s effectiveness and its generalizability across different datasets. Conclusion: The proposed iterative optimization method for VR data using large language model guidance effectively alleviates the one-to-many issue in datasets. A single optimization of the dataset can enhance the retrieval performance of multiple methods. This approach offers a novel perspective for video-text cross-modal retrieval research and promotes advancements in related technologies.
Keywords: video understanding; cross-modal retrieval; cross-modal feature alignment; large language model (LLM); data optimization
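The iterative optimization loop described in the Method section can be summarized by the hypothetical sketch below; sample_frames, caption_model, build_prompt, llm, and encoder are assumed callables standing in for the frame sampler, image-text generation model, prompt template, large language model, and retrieval encoder, and the similarity threshold is illustrative.

```python
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def refine_query(query, video, sample_frames, caption_model, build_prompt, llm, encoder,
                 sim_threshold=0.3, max_rounds=5):
    """Hypothetical LLM-guided refinement of a one-to-many (hard) query text; all callables are assumptions."""
    details = [caption_model(f) for f in sample_frames(video)]   # frame-level fine-grained descriptions
    refined = query
    for _ in range(max_rounds):
        refined = llm(build_prompt(refined, details))            # merge the query with co-occurring details
        q = encoder.encode_text(refined)
        v = encoder.encode_video(video)
        if cosine_sim(q, v) >= sim_threshold:                    # stop once the pair is matched closely enough
            break
    return refined                                               # refined, one-to-one query text
```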
Abstract: Objective: With the advancement of technology, the demand for large models in speech interaction has increasingly grown in recent years. This paper investigates a self-supervised pretrained large model based on the speech information disentanglement technique, aiming to train a model that extracts linguistic, paralinguistic, and nonlinguistic information from speech. This approach leverages vast amounts of unannotated data, enabling the model to learn effectively even in the absence of labels. By achieving the independence of the extracted speech representations, downstream models can clearly distinguish between different types of information, which is a crucial step in enhancing the accuracy and controllability of speech processing. Specifically, the core of the disentanglement technique lies in effectively extracting and separating different layers of information within speech signals. Linguistic information conveys specific content, while paralinguistic information includes the speaker’s emotions, intonation, and other nuances. Nonlinguistic information may encompass the speaker’s physiological state, environmental sounds, or background noise. Therefore, by modularizing these types of information, the model not only gains a better understanding of speech content but also allows for flexible adjustments or replacements of these elements as needed. This process provides comprehensive and detailed speech information for downstream language and generative models, considerably enhancing their support for complex verbal interactions. Furthermore, this technique facilitates multitask learning. In speech interaction tasks, different application scenarios have varying demands for speech information. Through disentanglement, the model can adapt to these diverse requirements, achieving higher performance across multiple tasks, such as speech recognition, emotion recognition, and speech synthesis. By leveraging the speech information disentanglement technique, the self-supervised pretrained large model not only offers flexibility for practical applications but also opens up new avenues for further research. Method: To address this challenge, we propose an information disentanglement-based self-supervised speech representation learning model that effectively leverages vast amounts of unannotated data, achieving high-quality speech information disentanglement. Specifically, we build upon an encoder-style self-supervised learning (SSL) framework and introduce two lightweight specialized modules. Both modules enhance the model’s capacity to extract pitch variation and speaker identity from speech signals, which are crucial for achieving expressive and contextually rich speech generation. We employ a residual removal approach that systematically disentangles the extracted pitch variation and speaker information from the main processing branch, thus ensuring that these components do not interfere with the learning of content information. The main branch is then trained using HuBERT’s masked prediction mechanism, which optimizes the deep layers of the encoder for superior performance in linguistic tasks. This method allows for the progressive extraction and refinement of the representations of pitch variation, speaker identity, and content from the input speech, thereby fostering a more nuanced understanding of the speech signal.
Furthermore, we combine the diverse representations obtained from different layers, strategically adjusting their weights to generate task-specific representations tailored for various downstream speech processing applications. Such flexibility is essential for effectively addressing the distinct requirements of tasks such as speech recognition, emotion recognition, and voice conversion. Additionally, we introduce a progressive generator that builds upon these representations, resulting in the seamless execution of downstream speech generation tasks. This comprehensive approach not only enhances the model’s adaptability and performance across multiple tasks but also paves the way for more sophisticated and context-aware speech interaction systems. Result: Experimental results indicate that our proposed method demonstrates significant advantages across various tasks, including speech recognition, speaker verification, speech enhancement, emotion recognition, and voice conversion. Notably, in the disentanglement-based emotion and voice conversion task, our model achieves substantial improvements in emotional similarity, speaker similarity, word accuracy, and audio quality ratings compared to the second-best model. These enhancements reflect the model’s capability to effectively disentangle and manipulate various speech information components, which, in turn, contributes to a natural and expressive speech output. This result underscores the effectiveness of our approach in not only improving the quality of synthesized speech but also enhancing the controllability of emotional and speaker characteristics, ultimately paving the way for more sophisticated applications in human-computer interaction. Conclusion: Incorporating information disentanglement into the pretrained extraction model enhances the analysis and synthesis capabilities of speech information. This capability enables the model to clearly identify and manipulate aspects of speech, such as linguistic content, emotional state, and speaker identity. Improved clarity of speech features is crucial for advancing large models focused on verbal interaction. Furthermore, this approach offers insights and practical tools that are applicable in various fields, from enhancing speech recognition to improving emotional expressiveness in generated speech. This method also fosters more nuanced human-computer communication and encourages research into complex speech dynamics, potentially leading to innovations in personalized virtual assistants and adaptive language learning tools and ultimately enriching user experiences in human-computer interactions.
Keywords: information disentanglement; self-supervised learning (SSL); speech codec; speech interaction large model; speech synthesis
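The layer-weighted combination described above is commonly realized as a learnable softmax-weighted sum of the encoder's hidden states, as in the minimal sketch below; the module name and tensor layout are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    """Learnable weighted sum of per-layer SSL representations for a downstream task."""

    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))     # one scalar weight per encoder layer

    def forward(self, layer_feats):
        # layer_feats: stacked hidden states of shape (num_layers, batch, time, dim)
        w = torch.softmax(self.weights, dim=0)
        return (w.view(-1, 1, 1, 1) * layer_feats).sum(dim=0)    # task-specific representation
```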
Abstract: Objective: The task of generating medical reports involves producing accurate and comprehensive examination results based on symptoms observed in medical images. This technology can ease the workload of radiologists, reduce diagnostic errors due to a lack of experience, and expedite clinical workflows. Medical report generation shares similarities with image captioning, but it presents two unique challenges: generating long texts and handling the imbalance in medical data distribution. Current approaches often train models specifically for medical report generation from scratch, relying on limited publicly available data. Due to their insufficient capability to fuse visual and textual features and generate rich, detailed information, these models frequently underperform. Large multimodal models (LMMs), which combine visual encoders and large language models (LLMs), are well-suited for image-based text generation tasks. They have the capability to recognize images and generate high-quality, knowledge-rich text, making them promising candidates for medical report generation. However, the application of LMMs in Chinese medical report generation remains in its early stages, particularly in terms of accurately understanding medical images and generating normatively correct medical reports. Moreover, these models often suffer from hallucination issues, where the generated responses appear logical but are factually incorrect or unfounded. This paper proposes a Chinese medical report generation model that integrates semantic fine-tuning and cross-modal retrieval augmentation (FRCM) to address these challenges. Method: Building upon the LMM framework of LLaVA, this paper fine-tunes and adapts the visual encoder and LLM to the medical domain. A collaborative training strategy that incorporates general data and domain-specific data is proposed, and a novel cross-modal retrieval-augmented strategy is introduced during the inference phase. The paper also translates the largest dataset in the medical report generation domain, MIMIC-CXR, into Chinese and uses it as in-domain data for research on Chinese medical report generation. First, considering the unique characteristics of medical images and Chinese medical reports, the corresponding modules of LLaVA are replaced with a medical visual encoder that is trained on a large volume of medical images and a medical LLM optimized for Chinese language processing. This adaptation enhances the model’s capability to effectively handle medical data. Second, a two-phase training strategy is employed, using general and domain-specific data. In the first phase of training, only the projection layer is trained. The domain-specific data facilitates medical image-text alignment, ensuring the model can accurately link medical images to their corresponding textual descriptions. In the second phase of training, the parameters of the projection layer are further updated, and a low-rank adaptation method is applied to fine-tune the LLM. The domain-specific data enhances the model’s capacity to generate professional Chinese medical reports, while the general data improves the model’s understanding of complex instructions. Throughout the entire training process, medical images are processed by the visual encoder to extract global and local feature vectors. These local feature vectors are then projected into visual embeddings that align with the dimensionality of the LLM’s embedding space.
Medical reports and instructions are tokenized into text embeddings by the tokenizer of the LLM. These text embeddings, along with the visual embeddings, are input into the LLM during training. Finally, a cross-modal retrieval-augmented strategy is proposed to further mitigate the hallucination problem in the model. This strategy incorporates a cross-modal similar report retrieval module. During inference, the global feature vectors produced by the visual encoder are layer-normalized and input into the report retrieval module, which performs a cross-modal retrieval from the image to relevant reports. The retrieved similar reports are then used as supplementary knowledge, providing additional context to the LLM. This approach helps reduce hallucinations, thereby improving the accuracy and robustness of the generated medical reports. Result: On the Chinese MIMIC-CXR dataset, the FRCM model outperformed other Chinese medical report generation models, such as XrayGLM and XrayPULSE, achieving notable improvements in several evaluation metrics. Specifically, FRCM showed increases of 10.4%, 10.1%, 9.7%, 9.1%, 6.6%, 9.4%, and 38.4% in BLEU-1, BLEU-2, BLEU-3, BLEU-4, recall-oriented understudy for gisting evaluation-longest common subsequence (ROUGE-L), metric for evaluation of translation with explicit ORdering (METEOR), and consensus-based image description evaluation (CIDEr) scores, respectively. When compared to models fine-tuned on LLaVA and Qwen-VL, FRCM also achieved score improvements of 4.1%, 3.1%, 3.3%, 3.6%, and 25.1% in BLEU-1, BLEU-2, BLEU-3, BLEU-4, and CIDEr, respectively. Ablation experiments revealed the effectiveness of the proposed approach. Data ablation demonstrated that adding diverse general data during training enhances the model’s capability to follow complex instructions, resulting in better utilization of additional knowledge and improving the quality of the generated medical reports. Module ablation further highlighted the importance of key components within FRCM, which substantially enhance its performance. Three case studies demonstrated that the Chinese medical reports generated by FRCM were superior to those produced by other models in terms of accuracy and information richness. Conclusion: This paper proposes FRCM, a model designed to generate Chinese medical reports from medical images. Unlike traditional medical report generation methods, FRCM leverages LMMs to effectively address the challenges associated with long text generation and the imbalance of medical data in the report generation task. While LMMs are typically pre-trained on extensive general datasets, they face limitations in recognizing medical images and generating specialized medical reports. The paper builds upon the LLaVA model framework, utilizing a medical visual encoder and a medical LLM, fine-tuning them semantically. This study introduces a similar report retrieval module to further mitigate the inherent hallucination problem of LMMs. This module supplies additional knowledge during the inference stage, aiding the model in generating more accurate reports. Experimental results show that FRCM performs satisfactorily in generating Chinese medical reports.
Keywords: Chinese medical report generation; large multimodal model (LMM); retrieval enhancement; semantic fine-tuning; knowledge guidance
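A minimal sketch of the cross-modal retrieval-augmented inference step might look as follows, assuming the global image features and the stored report features have been projected into a shared space beforehand; the function names, top-k value, and prompt layout are illustrative rather than the exact FRCM implementation.

```python
import numpy as np

def retrieve_similar_reports(img_global_feat, report_feats, report_texts, k=3):
    """Hypothetical cross-modal retrieval: rank stored reports by cosine similarity to the image feature."""
    q = img_global_feat / (np.linalg.norm(img_global_feat) + 1e-8)
    db = report_feats / (np.linalg.norm(report_feats, axis=1, keepdims=True) + 1e-8)
    top_idx = np.argsort(-(db @ q))[:k]
    return [report_texts[i] for i in top_idx]

def build_generation_prompt(instruction, retrieved_reports):
    """Retrieved reports are appended as supplementary context to reduce hallucination."""
    context = "\n".join(f"Reference report {i + 1}: {r}" for i, r in enumerate(retrieved_reports))
    return f"{context}\n{instruction}"
```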
Abstract: Objective: Ultrasound imaging plays a crucial role in medical diagnosis due to its convenience, non-invasive nature, and cost-effectiveness, making it an essential tool in clinical settings. However, accurately localizing and extracting detailed features from ultrasound images, especially when dealing with complex pathological boundaries such as nodules and cysts, remains a major challenge. While traditional convolutional neural networks (CNNs) excel at feature extraction through convolutional layers, their limited receptive fields often result in the loss of crucial global information. Conversely, Transformer-based models are proficient at capturing global features through self-attention mechanisms but tend to struggle with capturing fine local details effectively. Additionally, their high computational requirements limit their use in real-time medical applications. The segment anything model (SAM) has recently demonstrated success in natural image segmentation. However, its performance notably declines when applied to medical images, particularly ultrasound, often necessitating manual intervention. This decline is primarily due to the fact that SAM is trained exclusively on natural images, which differ substantially in domain distribution from medical images. An enhanced SAM model (i.e., SAM combined with counterfactual prompt and cascaded decoder (SAMCD)) is proposed to address this limitation. Method: SAMCD enhances the existing SAM framework by incorporating a Bypass CNN image encoder, a simple cross-branch interaction adapter (SCIA), a counterfactual intervention prompt generator, and a cascaded decoder. First, a Bypass CNN encoder and a novel module named SCIA are used. The limited local information of the ViT encoder is compensated for by integrating the Bypass CNN encoder with the SCIA module, thereby enhancing the capability of the model to capture fine details. Next, a counterfactual intervention mechanism based on causal learning is introduced to adapt to the prompts produced by the prompt generator and optimize its output. This mechanism encourages the model to focus on generating factual prompts, strengthening the learning capability of the prompt generator, improving segmentation precision, and reducing dependency on high-quality prompts. Furthermore, a cascaded decoder is incorporated to capture rich edge information. The original SAM decoder is used to create a prior mask, followed by an edge-attention-enhanced Transformer decoder and a pixel-level decoder. This multistage decoding process enables the model to better capture and refine rich edge information, leading to highly accurate segmentation results. Finally, a two-stage training strategy is employed to enhance the segmentation performance of the model and accelerate convergence. The first stage focuses on training the interactive segmentation model, while the second stage concentrates on training the automatic segmentation model that incorporates a prompt generator. In the experiments, the hardware platform is NVIDIA GeForce RTX 3090, the programming language is Python 3.9, and the deep learning framework is PyTorch. The network is trained with a batch size of 4, a learning rate of 0.0001, and 200 training epochs, and the Adam optimizer is selected.
SAMCD is initialized with SAM weights before training, and the images are scaled to 256 × 256 pixels using bilinear interpolation during training. Result: Experiments were conducted on the TN3K and BUSI datasets to evaluate the performance of the SAMCD model, using a range of metrics including Dice similarity coefficient (DSC), mean intersection over union (mIoU), Hausdorff distance (HD), accuracy (Acc), sensitivity (Sen), and specificity (Spe). Notably, lower HD values indicate better segmentation performance, while higher values for metrics such as DSC and mIoU (ranging from 0 to 1) indicate better performance. In these evaluations, the SAMCD model achieved a DSC score of 83.66% on the thyroid nodule 3K (TN3K) dataset and 84.29% on the breast ultrasound image (BUSI) dataset, surpassing the performance of the original SAM, MedSAM, SAMed, and SAMCT in almost all metrics. Compared to SAMCT, the SAMCD model shows improvements of 0.91% and 0.16% on mIoU and Acc, respectively, on the TN3K dataset. Furthermore, SAMCD outperforms other SAM-related comparison models by an average of 20.43% and 12.91% in performance. When compared to non-SAM approaches, SAMCD achieves 4.65%, 3.29%, 13.58%, 5.16%, and 2.22% higher DSC values than U-Net, CE-Net, SwinUnet, TransFuse, and TransUNet, respectively, on the TN3K dataset. In terms of Acc, SAMCD outperforms TransFuse and TransUNet by 0.79% and 0.29%, respectively, while also demonstrating an average improvement of 4.95% in Sen and 0.46% in Spe over the five non-SAM methods. Additionally, SAMCD requires fewer training parameters and consumes less computational resources compared to SAM-related models. Ablation experiments and visual analyses further confirm the substantial performance improvements provided by the SAMCD method. Conclusion: SAMCD leverages the strong feature extraction capabilities of SAM by enhancing its encoder, prompt generator, decoder, and training strategy. These improvements enable SAMCD to accurately capture complex local details and small targets in ultrasound images, thereby substantially improving the automatic segmentation performance for ultrasound medical imaging.
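The reported training setup (Adam, learning rate 0.0001, batch size 4, 200 epochs, bilinear resizing to 256 × 256 pixels) roughly corresponds to a loop like the sketch below; the loss function and data interface are assumptions, since the abstract does not specify them.

```python
import torch
from torch import nn, optim
import torch.nn.functional as F

def train_samcd(model, loader, epochs=200, lr=1e-4, device="cuda"):
    """Minimal training-loop sketch mirroring the reported setup; the loss choice is an assumption."""
    model.to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()                           # assumed binary segmentation loss
    for _ in range(epochs):
        for images, masks in loader:                             # loader assumed to yield batches of size 4
            images = F.interpolate(images, size=(256, 256), mode="bilinear", align_corners=False)
            masks = F.interpolate(masks, size=(256, 256), mode="nearest")
            logits = model(images.to(device))
            loss = criterion(logits, masks.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```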
Abstract: Objective: Celadon is not only a dazzling pearl among the cultural treasures of the Chinese nation but also a cultural messenger in cultural exchanges between China and other countries. It has rich historical and cultural connotations and demonstrates excellent artistic value. Its elegant shape and moist glaze make it an outstanding representative of traditional Chinese craft aesthetics. The production of celadon embodies the wisdom and creativity of ancient craftsmen and is an important carrier for the inheritance of excellent traditional Chinese culture. In the context of cultural digitization, constructing a cross-modal knowledge graph of celadon is one of the key technologies for promoting the protection and inheritance of celadon culture. In this process, matching the same entities across different modalities, which involves aligning the different modal features of equivalent entities, is crucial. However, the inherent structural differences between cross-modal data present challenges for alignment tasks. Traditional methods that rely on manually annotated data can ensure the accuracy of alignment to some extent, but they have problems such as low efficiency and high cost. In addition, coarse-grained annotated data can hardly meet the requirements for fine-grained concepts and for entity recognition when constructing a cross-modal knowledge graph. At present, the vision-language pretraining (VLP) model can effectively capture cross-modal semantic associations by learning rich cross-modal representations from large-scale unlabeled image-text pair data. The strong cross-modal understanding ability of the VLP model can provide precise semantic associations and fine-grained entity recognition for aligning entities of different modalities in graph construction. Here, a cross-modal entity alignment method based on the VLP model, which can map multiple features of images, is proposed to maximize the degree of matching between celadon images and text. Method: The cross-modal entity alignment method proposed in this study, which maps multiple features of images, is initialized with the publicly available VLP model for both the image and the text encoders, and the parameters of the encoders remain unchanged during the training process. The method mainly consists of four parts. First, on the basis of the visual characteristics of celadon images, local features in terms of contour, texture, and color are extracted. Then, a gated multifusion unit is introduced to adaptively assign weights to the image features, and the extracted multiple local image features are used to generate reliable fused features. Furthermore, a multilayer fully connected mapper is designed to learn the mapping of the fused features to an appropriate intermediate representation space by using multiple layers of nonlinear transformations, guiding the text encoder to generate text features that match the image features more closely.
Finally, the model is trained and optimized via the information noise-contrastive estimation (InfoNCE) loss function, that is, by optimizing the similarity of positive sample pairs and the difference in negative sample pairs through calculating the cosine similarity between cross-modality features, thereby establishing the connection between image features and text features. Result: The proposed method was compared with four of the latest benchmark methods in an experimental comparison, namely, contrastive VLP in Chinese (CN-CLIP), context optimization (CoOp), conditional context optimization (CoCoOp), and mapping pictures to words (Pic2Word). The quantitative evaluation metrics are the recall rates, including R@1, R@5, R@10, and the mean recall (MR). The experiments were conducted using the ChinaWare dataset, so all methods were trained on this dataset. A data table comparing each method’s performance on recall rate metrics was provided. In terms of the MR metric, the proposed method outperformed zero-shot CN-CLIP ViT-B/16 by 3.2% in the text-to-image alignment task and by 7.5% in the image-to-text task. CoOp focuses on text features; the proposed method outperforms it by 11.4% and 12.1%, respectively. Moreover, CoCoOp considers image features on the basis of CoOp, and the proposed method outperforms CoCoOp by 8.4% and 9.5%, respectively. Pic2Word also focuses on original image features and does not fully utilize other local image features to improve model performance, and the proposed method outperforms Pic2Word by 5.8% and 5.6%, respectively. Conclusion: The cross-modal entity alignment method proposed in this study can fully explore the effective intermediate representation of image features to reconstruct text features without changing the parameters of the VLP model, thereby improving the cross-modal recognition accuracy of the details of celadon. The experimental results show that this method is superior to several state-of-the-art methods and has improved the performance of alignment. Ultimately, a celadon cross-modal knowledge graph with 8 949 nodes and 18 211 relationships was successfully constructed by applying technologies such as ontology modeling, data mining, and the cross-modal entity alignment method proposed in this study.
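As a rough illustration of how a gated multifusion unit might adaptively weight the extracted local features (contour, texture, and color) before contrastive training, consider the hypothetical module below; its structure is one plausible reading of the description above, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Hypothetical gated fusion of several local image features into one fused feature."""

    def __init__(self, dim, num_feats=3):
        super().__init__()
        self.gate = nn.Linear(num_feats * dim, num_feats)        # produces one gate weight per feature

    def forward(self, feats):
        # feats: list of (batch, dim) local feature vectors, e.g., contour, texture, color
        stacked = torch.stack(feats, dim=1)                      # (batch, num_feats, dim)
        weights = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)      # adaptively weighted fused feature
```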
Abstract: Objective: With the rapid development of facial image synthesis technology, from simple image editing techniques to complex generative adversarial networks, people can easily create highly realistic fake facial images and videos, negatively impacting social information security. The low accuracy of existing face forgery detection methods and their poor generalization capability can be attributed to the notable differences in data distribution among samples generated by various forgery methods. A method called “multivariate and soft blending sample-driven image-text alignment for Deepfake detection”, which fully utilizes the multimodal alignment of images and texts to capture subtle traces of face forgery, is proposed to address this challenge. Method: Considering that traditional face forgery detection methods are typically trained on forged images of a single forgery mode and thus struggle with complex forgery modes, the multivariate and soft blending augmentation (MSBA) method is introduced. Multivariate and soft blending images are generated by randomly mixing forged images of different forgery modes with various weights. The network model learns to estimate the blending weights of each forgery mode and the forgery intensity map from these images, enhancing the capability to capture multiple forgery clues simultaneously, thereby further improving the detection capabilities for complex and unknown forgery patterns. The diverse forgery modes and intensities present in face forgery images degrade the performance of the network model in distinguishing real faces from fake ones. A multivariate forgery intensity estimation (MFIE) module based on the MSBA method is proposed to address this issue. This module effectively learns from face forgery images with varying modes and intensities, guiding the image encoder to extract highly generalized features and improving the overall detection accuracy of the network framework. The main contributions include the following: 1) as the first work to integrate the CLIP model into the face forgery detection task, a multivariate and soft blending sample-driven image-text alignment network framework is proposed for face forgery detection, leveraging the multimodal information alignment of images and texts to substantially enhance detection accuracy. 2) A multivariate and soft blending augmentation (MSBA) method is introduced to enhance the capability of the network model to recognize various forgery patterns. This method is utilized to synthesize multivariate and soft blending images encompassing complex forgery patterns. Building upon the MSBA approach, a multivariate forgery intensity estimation (MFIE) module that guides the network model to deeply mine features related to forgery patterns and intensities within facial forgery images has been further developed. The MSBA method and MFIE module, working in tandem, drive the backbone network to selectively extract targeted forgery cues from images that encompass a range of forgery patterns, thus enhancing the generalization and robustness of the model. 3) Experimental results demonstrate highly competitive performance in in-domain and cross-domain tests across datasets, including FaceForensics++ (FF++), Celeb-DF, DeepFake detection challenge (DFDC), DeepFake detection challenge preview (DFDCP), Deepfake detection (DFD), and DeeperForensics-1.0 (DFV1).
In the experimental process, 16 frames are extracted from each video for the training dataset and 32 frames for the testing dataset, with all images resized to 224 × 224 pixels and normalized to the range [0,1] before network input. In the experimental setup, the network model is initialized using a pre-trained CLIP model with a 16 × 16 image patch size. For the training process, the AdaN optimizer is employed, setting the initial learning rate to 2E-5 and the batch size to 64. After 75 training epochs, a cosine annealing strategy is applied for 25 additional epochs, reducing the learning rate to 2E-7. The proposed method is implemented within the PyTorch framework and is trained using a single NVIDIA GeForce RTX 3090 GPU. Following previous work, the area under the ROC curve (AUC) and accuracy (ACC) metrics are primarily employed to evaluate the performance of the network. Result: In in-domain experiments, the approach achieves a notable improvement in performance metrics compared with the strongest existing methods. Specifically, a marked improvement of 3.32% in ACC and a notable increase of 4.02% in the AUC are observed. In cross-domain experiments, the proposed method is tested and compared with six existing methods on the image level across five datasets, resulting in an average improvement of 3.27% in the AUC metric. Ablation study results indicate that the proposed MSBA method and the MFIE module both provide positive contributions to the enhancement of face forgery detection performance. Conclusion: The CLIP network framework is designed for the face forgery detection task, which substantially enhances the accuracy of detecting forged faces. The proposed method of MSBA, coupled with the MFIE module, plays a crucial supportive role. These contributions have led to performance gains that surpass those of existing methods. The model parameters and computational complexity of the method are relatively high due to the use of large-scale language-vision models, which leads to certain limitations in response speed. Future work will consider reducing the computational overhead of the model while maintaining or even further improving the accuracy and robustness of face forgery detection.
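The MSBA augmentation described above amounts to drawing random convex blending weights over forged images of different forgery modes, as in the hypothetical sketch below; the Dirichlet sampling of the weights is an assumption, and the returned weights would serve as soft supervision for the blending-weight estimation branch.

```python
import numpy as np

def multivariate_soft_blend(forged_images, rng=None):
    """Hypothetical MSBA-style augmentation: randomly mix forged images of different forgery modes.

    forged_images: list of images of identical shape (H, W, C), one per forgery mode.
    Returns the blended image and the blending weights used as soft supervision targets.
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = rng.dirichlet(np.ones(len(forged_images)))         # random convex combination over modes
    stacked = np.stack([img.astype(np.float32) for img in forged_images])
    blended = np.tensordot(weights, stacked, axes=1)             # weighted sum over the forgery-mode axis
    return blended, weights
```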
Abstract: This is the 30th in the annual survey series of bibliographies on image engineering in China. This statistical analysis aims to capture the up-to-date development of image engineering in China, provide a targeted literature-searching facility for readers working in related areas, and supply useful recommendations for the editors of journals and potential authors of papers. Specifically, considering the wide distribution of related publications in China, all references (865) on image engineering research and technique are selected carefully from the research papers (2 892 in total) published in all issues (154) of a set of 15 Chinese journals. These 15 journals are considered important ones in which papers concerning image engineering are of higher quality and are relatively concentrated. The selected references are initially classified into five categories (image processing, image analysis, image understanding, technique application, and survey) and then into 23 specialized sub-categories in accordance with their main contents (the same as in the last 19 years). Analysis and discussions of the statistics of the classification results by journal and by category are also presented. In addition, as a roundup of the 30 years of this review series, the 20 164 image engineering articles selected from the 76 790 academic research and technical application papers published in a total of 3 734 issues over the past 30 years are divided into six five-year periods, and comprehensive statistics and analysis are provided on the selection of image engineering literature and the number of papers in each sub-category. Analysis of the 2024 statistics shows that, from a research perspective, image analysis has currently received the most attention, with image segmentation and primitive detection, object detection and recognition, as well as human biometric feature extraction and validation being the focus of research; from an application perspective, remote sensing, radar, sonar, surveying and mapping are the most active fields, and the development and application of new image technologies are expanding rapidly. A comparison of the 30 years of statistical data shows that the number of papers in some sub-categories of the four categories of image processing, image analysis, image understanding, and technique application has grown continuously and remained in the lead, whereas the numbers in some other sub-categories are gradually decreasing, reflecting changes in different research directions over the years. In conclusion, this work presents a general and up-to-date picture of the continuing progress, in both depth and breadth, of image engineering in China in 2024. The 30-year statistics also provide readers with more comprehensive and credible information on the development trends of various research directions.
摘要:Objective: Scene text image super-resolution (STISR) is an emerging visual enhancement technology aimed at enhancing the quality of low-resolution text images and improving text readability. This technology is extensively applied in fields such as autonomous driving, document retrieval, and text recognition. Existing STISR techniques can be categorized into traditional and deep learning methods. Traditional methods offer some image enhancement but heavily rely on manual feature extraction and complex prior knowledge. In contrast, deep learning methods showcase considerable advantages due to their robust automatic feature extraction and learning capabilities. Therefore, deep learning STISR methods have become increasingly popular within the academic community. Recent deep learning-based STISR methods leverage rich semantic priors from text images to guide image reconstruction and text recovery. Unlike methods that solely rely on visual feature extraction, such as convolutional neural networks, multilayer perceptrons, and Transformers, these emerging methods integrate semantic priors with visual feature analysis for more efficient image reconstruction. However, they often overlook dynamic features of text structure, such as the contextual connections between neighboring characters and directional features. Therefore, the text semantic priors generated are not effectively aligned or fused with image features, limiting the quality of the reconstructed images. To overcome these limitations, a new cross-modal fusion super-resolution method, which incorporates text structure dynamic perception, is proposed to enhance the quality of low-resolution text images and text readability. Method: In this study, an innovative cross-modal fusion STISR method that leverages text structure dynamic perception is proposed. The process begins with low-resolution text images being corrected using a spatial transform network module and a thin plate splines module. These modules adjust the spatial positions of irregular characters, preparing the images for further processing. The corrected images are then processed through a text structure dynamic perception module and a semantic space alignment module to extract image modal and text modal features. The text structure dynamic perception module, which includes a direction sensing block and a context linkage unit, captures multiscale directional features and identifies the contextual relationships between neighboring characters, respectively, accurately capturing the dynamic features of the text structure within the image modal. The semantic space alignment module processes the low-resolution images using recognition rendering and binarization to derive text semantic priors and mask priors. These priors are then combined through feature addition to generate advanced text semantic priors, which are aligned with image features through affine transformations, guided by the image modal. Finally, the developed cross-modal fusion module employs an adaptive weight distribution strategy to enhance the interactive integration of text and image features across modals, producing the final super-resolved text image. Result: The proposed method was evaluated against 13 mainstream methods, with a primary focus on the text recognition accuracy of the reconstructed text images (this metric is critical, given the unique nature of text images, where readability and recognizability are paramount).
Secondary evaluation metrics included peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), which, despite their limitations in fully capturing image quality due to the misalignment of low-resolution and high-resolution images in real datasets, supplemented the evaluation. Experiments conducted on the real dataset TextZoom demonstrated that the proposed method achieved recognition accuracies of 67.1%, 56.6%, and 63.6% on three standard text recognizers, namely, the attentional scene text recognizer (ASTER), convolutional recurrent neural network (CRNN), and multi-object rectified attention network (MORAN), respectively, outperforming the existing representative method, PERMR, by 3.0%, 4.6%, and 3.1%, respectively. In terms of image quality, the proposed method achieved PSNR and SSIM values of 21.9 dB and 0.789, ranking first and second, respectively, among all compared methods. Additionally, visual comparisons further highlighted the superior quality and readability of the text images reconstructed using the proposed approach. Conclusion: In this study, a novel STISR method that effectively generates advanced text semantic priors by accurately capturing dynamic text structure features is proposed. This method facilitates the alignment and integration of text semantic priors with image features, drastically improving the quality of reconstructed text images and enhancing text readability. Experimental results demonstrate that the proposed method outperforms other mainstream methods, delivering notable improvements in image quality and text readability.
关键词:scene text image super-resolution (STISR); dynamic features of text structure; multi-scale orientation feature; semantic space alignment; cross-modal fusion
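The adaptive-weight cross-modal fusion described above can be illustrated with a small PyTorch module that predicts per-pixel weights for the image and text-prior branches before a residual refinement. This is a hedged sketch of the general idea only; the module name, layer sizes, and gating design are assumptions rather than the paper's implementation.

```python
# Illustrative sketch of adaptive-weight cross-modal fusion between text-prior and image features.
import torch
from torch import nn

class CrossModalFusion(nn.Module):
    """Fuses an image feature map with an aligned text-prior feature map
    using adaptively predicted per-pixel weights (hypothetical design)."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=1),   # one weight map per modality
            nn.Softmax(dim=1),
        )
        self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([img_feat, text_feat], dim=1))      # (B, 2, H, W)
        fused = w[:, 0:1] * img_feat + w[:, 1:2] * text_feat        # weighted sum of modalities
        return img_feat + self.refine(fused)                        # residual connection

# Example: fuse 64-channel features of a 16 x 64 low-resolution text image.
fusion = CrossModalFusion(64)
out = fusion(torch.randn(2, 64, 16, 64), torch.randn(2, 64, 16, 64))
print(out.shape)  # torch.Size([2, 64, 16, 64])
```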
摘要:ObjectiveRegong art, originating from the Longwu River valley in the Tibetan region of Huangnan, Qinghai Province, has flourished in this area, forming a unique regional artistic style. In 2009, this art was inscribed on the UNESCO Representative List of the Intangible Cultural Heritage of Humanity. Thangka, as one of the most important forms of Regong art, embodies the rich historical and cultural heritage of the Tibetan region, holding substantial historical, cultural, and artistic value. During the field collection process, many Thangka works displayed issues such as cracks, tears, water stains, and mold spots due to poor preservation conditions. However, traditional restoration methods are not only inefficient but also risk causing further damage, making them unsuitable for the proper conservation and development of Regong art. Consequently, conducting research on the restoration of damaged Thangka images is urgently needed. However, attempts to restore Thangka images using current enhancement and restoration algorithms have encountered several issues, such as blurred texture lines and misaligned repairs. These issues arise because the complexity and diversity of Thangka images make it difficult for existing models to capture their unique structural and textural characteristics.MethodTherefore, an interactive Thangka image restoration network, LSFNet, guided by line draft repair, is proposed to address the aforementioned challenges. This method comprises three parts. First, the interactive line restoration involves collaboration with Thangka artists to guide the restoration of the line structure, ensuring that the restored lines closely resemble those in real Thangka images. Second, the style and texture restoration phase is where an overall style and texture module is constructed to learn the unique characteristics of Thangka images. By integrating channel attention mechanisms and fully connected layers, the module captures global information and synthesizes it into preliminary restoration features. Finally, the refinement restoration phase introduces a linear attention module during the downsampling process. This module captures local and global dependencies, allowing the model to extract features of different scales, further refining the restoration, eliminating restoration traces, and enhancing the overall image quality. PatchGAN is also adopted as the discriminator, dividing the input image into multiple receptive fields and conducting independent binary classification for each receptive field to assess whether it matches the texture characteristics of the target image. This approach effectively enables pixel-level supervision, enhancing the overall image restoration quality.ResultThis paper created a Thangka restoration dataset comprising a total of 25 000 images, collected through field research and data gathering. The Canny algorithm was employed to extract edge line art, resulting in a line art dataset. Additionally, Photoshop tools were used to simulate damage to the Thangka images, generating 5 000 mask maps. Another 1 000 mask maps were sourced from public datasets to enhance restoration performance under various damage conditions, combining them into a final dataset containing 6 000 mask maps. All datasets have a resolution of 256 × 256 pixels. The proposed method in this paper was trained, tested, and compared with other restoration methods, including DeepFillv2, EdgeConnect, DFNet, HiFill, and T-Former, using the dataset created in this study. 
Results indicate that the proposed method exhibits strong repair performance, with superior metrics in peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and learned perceptual image patch similarity (LPIPS) on the Thangka dataset compared to other methods. Specifically, compared to the second-best performing model, the proposed method achieved a 10.55% increase in PSNR, a 1.8% increase in SSIM, and a 57.98% reduction in LPIPS. Experimental results demonstrate that the proposed interactive Thangka image restoration method, based on line art repair, can effectively restore damaged Thangka images, producing results that are closer to authentic Thangka images. Conclusion: This article proposes an interactive line drawing repair method, guided by Thangka artists, to repair damaged areas of the line drawing. Subsequently, a style and texture restoration phase is employed to learn the distinctive style features of Thangka images. Finally, a fine-tuning repair process further optimizes the restoration results. Experimental results demonstrate that this method effectively repairs damaged Thangka images, producing restoration outcomes that conform to the style and content of Thangka art.
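As a rough illustration of how the line-art dataset described above could be produced, the following OpenCV sketch extracts Canny edge maps from 256 × 256 Thangka images. The directory names, blur kernel, and Canny thresholds are assumptions; only the use of the Canny operator and the 256 × 256 resolution come from the abstract.

```python
# Illustrative sketch: building a line-art (edge) dataset from Thangka images with the Canny operator.
# Paths and thresholds are assumptions, not the authors' exact settings.
import os
import cv2

SRC_DIR, DST_DIR = "thangka_images", "thangka_linework"
os.makedirs(DST_DIR, exist_ok=True)

for name in os.listdir(SRC_DIR):
    img = cv2.imread(os.path.join(SRC_DIR, name))
    if img is None:
        continue
    img = cv2.resize(img, (256, 256))                    # dataset resolution reported in the paper
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (3, 3), 0)             # mild smoothing before edge detection
    edges = cv2.Canny(gray, threshold1=100, threshold2=200)
    cv2.imwrite(os.path.join(DST_DIR, name), edges)      # white lines on black background
```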
摘要:Objective: With the rapid advancement of deep learning technologies, image recognition accuracy has substantially improved. However, traditional deep learning models are heavily dependent on large-scale, well-labeled datasets. This reliance on manually annotated data not only incurs high costs but also poses challenges when models need to adapt to new, unseen data or operate in data-scarce situations. Zero-shot learning (ZSL) has emerged as a promising solution to these challenges. This approach enables models to recognize unseen, unlabeled data by training on labeled data from seen classes. Depending on the task setting, ZSL is typically categorized into conventional zero-shot learning (CZSL) and generalized zero-shot learning (GZSL). CZSL occurs when the test samples only contain unseen class samples, while GZSL is applicable when the test samples include both seen and unseen class samples. GZSL is more aligned with real-world situations; therefore, improving the accuracy of generalized zero-shot image recognition is of notable practical importance. Generative methods are the most effective approach for addressing GZSL because they can generate visual features for unseen classes, transforming the ZSL problem into a conventional supervised learning task. However, generative zero-shot recognition encounters challenges, such as insufficient discriminative information in generated features, inconsistencies between pseudo-visual features and semantic information, and domain shift, ultimately reducing recognition accuracy. A generative zero-shot image recognition method combined with double contrastive embedding learning is proposed to address these issues. Method: A generative framework based on the variational autoencoder–generative adversarial network (VAE-GAN) is constructed to address the main challenge of insufficient discriminability in generated features, and a contrastive embedding module is integrated. Through a collaborative training strategy involving multiple network components, the quality of pseudo-features is notably enhanced, leading to substantial improvements in the accuracy of zero-shot image recognition. Additionally, leveraging conditional VAE-GAN as the core generative network, an innovative dual contrastive learning strategy, which effectively integrates intra-domain and cross-domain information, is proposed, maximizing information utilization. Specifically, intra-domain contrastive learning among pseudo-sample instances of unseen classes and their prototypes is introduced, ensuring that generated pseudo-visual features align closely with semantic information, which mitigates confusion between visible and unseen classes. Cross-domain center-prototype contrastive learning, which strengthens the alignment between visual centers and semantic prototypes, is then implemented. This approach effectively reduces inter-class variance, facilitating more efficient cross-domain knowledge transfer. Consequently, the model becomes less reliant on visible classes, partially alleviating domain shift, which results in exceptional performance even in unknown domains. Result: The experimental framework was rigorously evaluated on zero-shot and generalized zero-shot recognition tasks across four distinct datasets, and the results were compared to the latest field advancements. In zero-shot recognition, the proposed method achieved state-of-the-art performance on the AWA1 and CUB datasets, showing a notable improvement of 2.2% and 2.7% in T1 values, respectively, over the second-best models.
On the AWA2 and SUN datasets, the proposed approach demonstrated competitive performance, further displaying its robustness across diverse data environments. For generalized zero-shot recognition, the algorithm outperformed competitors by achieving the highest H values on the AWA1, AWA2, and CUB datasets, revealing improvements of 0.6%, 0.8%, and 2.8%, respectively, over the second-best methods. Notably, this approach is the only one to achieve an accuracy over 70% across all three datasets, emphasizing its exceptional generalization capability. The method performed competitively on the SUN dataset, further reinforcing its overall effectiveness. The observed performance gains can be attributed to the generated pseudo-features, which exhibit a high degree of similarity to the genuine features of the visible classes. This approach helps mitigate semantic confusion and effectively addresses domain shift. Ablation studies were conducted on the AWA1 and CUB datasets to further validate the effectiveness of the proposed approach, providing empirical evidence of its contributions. Additionally, the impact of the number of generated samples on model performance was evaluated by conducting extensive experiments with varying sample sizes in zero-shot and generalized zero-shot scenarios across all four datasets, offering valuable insights for optimizing the generation process to enhance recognition accuracy. Finally, t-SNE dimensionality reduction and visualization experiments were performed on randomly selected unseen classes from the coarse-grained AWA2 and fine-grained CUB datasets to assess the effectiveness of using a low-dimensional embedding space in the algorithm. These experiments visually represent the embedded features and demonstrate the advantages of the proposed approach in capturing discriminative and meaningful representations in a low-dimensional space.ConclusionExperimental results indicate that the proposed method enhances the accuracy of zero-shot and generalized zero-shot image recognition, effectively addressing the challenges associated with current contrastive learning-based generative zero-shot image recognition, while also demonstrating good generalization performance. Future research will consider a trade-off between recognition accuracy and efficiency by designing lightweight networks to reduce model complexity and improve recognition efficiency.
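The intra-domain instance-prototype contrast described above is commonly implemented as an InfoNCE-style loss between embedded pseudo-features and class prototypes. The snippet below is a simplified, hypothetical rendering of that idea, not the authors' exact objective.

```python
# Sketch of a prototype-based contrastive (InfoNCE-style) loss that pulls generated pseudo-visual
# features toward their class prototypes and pushes them away from other prototypes.
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(features: torch.Tensor,
                               labels: torch.Tensor,
                               prototypes: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """features: (N, D) embedded pseudo-features; labels: (N,) class indices;
    prototypes: (C, D) one semantic/visual prototype per class."""
    feats = F.normalize(features, dim=1)
    protos = F.normalize(prototypes, dim=1)
    logits = feats @ protos.t() / temperature          # (N, C) scaled cosine similarities
    # Each feature is attracted to its own class prototype and repelled from the others.
    return F.cross_entropy(logits, labels)

# Example: 8 pseudo-features of dimension 128 over 5 classes.
loss = prototype_contrastive_loss(torch.randn(8, 128),
                                  torch.randint(0, 5, (8,)),
                                  torch.randn(5, 128))
print(loss.item())
```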
摘要:ObjectiveIn the industrial manufacturing domain, quality control plays a crucial role in ensuring the reliability and safety of products. As a key part of quality control, the efficiency and accuracy of defect detection directly influence product quality and the economic benefits of enterprises. Traditional defect detection methods, such as artificial vision detection or rule-based image processing techniques, are often limited by the subjective judgment of operators and the complexity of rules. These methods are also challenging to adapt to the high efficiency and accuracy demands of modern industries. With the advancements in deep learning, especially the breakthroughs in deep neural networks for image recognition, defect detection based on deep learning has gradually become a research hotspot. Current defect detection solutions often model the problem based on the data distribution of normal samples, identifying and labeling defect samples as outliers in the data. However, in practical applications, products can exhibit a wide variety of defect types with different morphologies, making it difficult for a single model to effectively address all defect types. Therefore, training a separate model for each defect type has become a common practice. While this approach helps to address specific defect categories, it requires large amounts of training data and computing resources, and the maintenance and updating of multiple models can be extremely cumbersome. Researchers have designed a unified detection framework that can handle multiple defect types in a single model to solve these problems. However, these unified frameworks are often trained with a single fixed perturbation scheme, limiting the capability of the model to learn various flawed features. In addition, these frameworks often fail to fully leverage the feature information output by the encoders and decoders at various layers of the network. As a result, the models may become overly dependent on specific sampled features, which can limit the generalization capability of the model.MethodThis paper proposes an innovative improved multiclass defect detection network that substantially improves the generalization capability and robustness of the model to address the challenges in unified detection frameworks. This improvement is achieved by introducing a feature perturbation pool and a multilayer feature fusion strategy. On the one hand, in traditional deep learning model training, the model is typically trained only on raw datasets, which leads to overfitting to a specific data set. As a result, the model performs poorly when faced with various changes encountered in actual production scenarios. The feature perturbation pool introduces a series of stochastic perturbations to the training data to address this problem. These perturbations, such as rotation, scaling, cropping, and color dithering of the input data, simulate the various changes the model may encounter in practical applications. Through this data augmentation, the model learns more generalized feature representations, improving its adaptability to a wide range of practical applications. On the other hand, in traditional deep learning model training, only the features from the last layer are typically used for classification or other tasks. However, feature maps at different levels of the network capture important information at different scales and levels of abstraction, which is crucial for effective defect identification and localization. 
Therefore, this paper proposes a multilayer feature fusion strategy that integrates feature information from different levels within the network. Specifically, a skip connection method is employed to directly fuse the low-level feature maps from the encoder with the high-level feature maps from the decoder. This approach allows the model to capture global context information while retaining local details, ultimately enhancing its capability to identify and localize defects accurately. Result: Compared to the current state-of-the-art methods, the proposed method exhibits excellent performance. The defect detection accuracy and defect localization accuracy achieve 97.17% and 96.93%, respectively, on the MVTec-AD dataset. Furthermore, on the VisA dataset, the two accuracies reach 91.08% and 99.08%, respectively. Conclusion: The proposed multitype defect detection network, which combines a feature perturbation pool with multilayer feature fusion, exhibits enhanced robustness and an improved capability to capture complex relationships between features. The network not only shows remarkable potential in theoretical exploration but also has broad prospects for practical industrial applications. In future developments, exploring more efficient feature extraction and fusion technologies, along with smarter training strategies, will be key to addressing a wider range of defect types and more complex production environments. These advancements can promote the evolution of industrial quality control systems toward a smarter and more efficient direction, making substantial contributions to improving product quality and enhancing production efficiency.
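A perturbation pool of the kind described above can be sketched as a set of stochastic torchvision transforms from which a few are sampled per training image. The concrete transforms and their parameter values below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a "perturbation pool": stochastic augmentations (rotation, scaling/cropping,
# color jitter, flipping) sampled at random during training to diversify learned features.
import random
import torch
from torchvision import transforms

perturbation_pool = [
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=256, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomHorizontalFlip(p=1.0),
]

def perturb(image: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Apply k perturbations drawn at random from the pool to a (C, H, W) tensor in [0, 1]."""
    for t in random.sample(perturbation_pool, k):
        image = t(image)
    return image

# Example: perturb a normal sample before feeding it to the detection network.
augmented = perturb(torch.rand(3, 256, 256))
print(augmented.shape)
```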
摘要:ObjectiveFine-grained image classification focuses on distinguishing between categories that are visually similar but semantically distinct, making it more challenging than traditional image classification tasks. In practice, collecting a large amount of labeled fine-grained image data is often time-consuming and costly. Accurate data annotation often requires domain expertise, adding to the complexity of building such datasets. Traditional classification methods struggle to capture the subtle variations between categories in fine-grained images, which can result in poor performance, particularly when working with limited samples. Therefore, leveraging few-shot learning(FSL) methods has become essential for addressing fine-grained image classification problems. Few-shot fine-grained image classification aims to accurately differentiate between similar categories with only a few labeled examples. Among the current leading approaches, metric-based meta-learning methods are widely used for few-shot learning. However, these methods often rely on global image features, making it difficult to fully capture the intricate structures and subtle differences inherent in fine-grained images. Moreover, existing few-shot learning techniques often face considerable challenges when applied to fine-grained classification tasks, such as large intra-class variability and high inter-class similarity. These challenges can severely limit classification performance. A novel few-shot fine-grained image classification method that integrates bifurcated attention and feature interaction mechanisms is proposed to address these issues.MethodThe proposed approach begins with the introduction of a bifurcated attention module within the feature extraction network. This module is designed to dynamically adjust the model’s focus on different parts of the image using two distinct pathways: spatial attention and channel attention. The spatial attention pathway enables the model to prioritize regions of the image based on their spatial relevance, while the channel attention pathway adjusts the focus according to the importance of different feature channels. By combining the two distinct pathways, the model can flexibly emphasize the most critical aspects of the image, which enhances its capability to capture fine-grained details. The features extracted from the two attention branches are then concatenated along the channel dimension, enabling the model to integrate more detailed and discriminative features essential for accurate fine-grained image classification. By directing attention to important areas of the image, the bifurcated attention module reduces unnecessary computations on irrelevant or less informative regions, resulting in improved computational efficiency. This approach not only lowers the overall computational load but also allows it to better capture the subtle differences between fine-grained categories. Additionally, a random sampling strategy is incorporated to create query subsets, which helps reduce the number of parameters involved in the classification process. This strategy contributes to a more streamlined computational process, enhancing the model’s efficiency in few-shot tasks. Once the query subsets are generated, the features of each category are averaged within the subset, and a feature interaction module is introduced. This module calculates the correlation between the query subset and the support set samples, enabling the model to effectively understand the relationships between them. 
Leveraging these computed correlations, the model adaptively assigns weights to the support features, emphasizing the most distinctive regions in the feature space of the samples. The inter-channel dependencies within the support features are also considered, selectively highlighting the most important features. This approach allows the model to better focus on key aspects of the image that are most relevant to distinguishing between similar categories. Relation network-based metrics are combined with cosine similarity to measure the correlation between query samples and support set prototypes, further enhancing classification performance. By incorporating both metrics, the model can accurately assess the similarities and differences between samples, ultimately achieving improved few-shot fine-grained image classification performance.ResultThe proposed method demonstrates strong performance across several benchmark datasets. On the caltech-UCSD birds-200-2011(CUB-200-2011) dataset, the classification accuracy of the proposed method outperforms the second-best method by 5.95% and 1.21% in the 5-way 1-shot and 5-way 5-shot task settings, respectively. This finding indicates a substantial improvement in fine-grained classification accuracy, particularly in the few-shot learning scenario where only a limited number of labeled examples are available. Similarly, on the Stanford Dogs dataset, the method achieves 4.15% and 2.29% improvements in classification accuracy for the 1-shot and 5-shot tasks, respectively, compared to the second-best method. These results further demonstrate the effectiveness of the proposed approach in addressing few-shot fine-grained classification challenges. Additionally, on the Stanford Cars dataset, the method outperforms most comparative methods, highlighting its strong generalizability across various fine-grained image datasets. Furthermore, complexity analysis experiments reveal that the proposed bifurcated attention module strikes a balance between memory overhead and training time. Despite its enhanced capability to capture detailed features, the module does not introduce substantial computational complexity. Visualization experiments further validate the effectiveness of the module in capturing long-range dependencies in fine-grained images. By focusing on these dependencies, the method can more comprehensively identify distinctive features, resulting in improved classification performance.ConclusionThe few-shot fine-grained image classification method proposed in this study improves sample feature representation without notably increasing model complexity. By integrating bifurcated attention and feature interaction mechanisms, the proposed method effectively captures the subtle differences between fine-grained categories, leading to improved classification performance. Additionally, the method optimizes the sample distribution in feature space, ensuring that samples from the same category are more tightly clustered, while samples from different categories are more distinctly separated. In comparison to other baseline methods, the proposed approach outperforms them in terms of classification accuracy and overall performance on few-shot fine-grained classification tasks.
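The bifurcated attention module described above, with parallel spatial and channel pathways whose outputs are concatenated along the channel dimension, can be sketched roughly as follows. The squeeze-and-excitation style channel gate and the 7 × 7 spatial gate are common design choices assumed here, not details taken from the paper.

```python
# Sketch of a "bifurcated attention" block: parallel channel- and spatial-attention pathways,
# concatenated along the channel dimension. A simplified, hypothetical rendering.
import torch
from torch import nn

class BifurcatedAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Channel pathway: squeeze-and-excitation style gating.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )
        # Spatial pathway: a 7x7 convolution over pooled per-pixel descriptors.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        channel_branch = x * self.channel_gate(x)
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        spatial_branch = x * self.spatial_gate(pooled)
        return torch.cat([channel_branch, spatial_branch], dim=1)   # 2C output channels

feat = BifurcatedAttention(64)(torch.randn(4, 64, 21, 21))
print(feat.shape)  # torch.Size([4, 128, 21, 21])
```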
摘要:Objective: Estimating joint hand-object poses from a single RGB image is a particularly challenging task, primarily due to the severe occlusions that often occur during hand-object interactions, which complicate the identification of critical features. Interactive scenes involving hands and objects are typically dynamic and complex, making it difficult for traditional computer vision techniques to handle such intricate situations, especially in the presence of major occlusions. Furthermore, many existing hand-object feature extraction networks rely on feature pyramid networks (FPNs) to fuse multiscale features, aiming to capture information across different levels. However, FPN-based methods frequently encounter issues related to the loss of channel information during the feature extraction process, which can negatively impact the accuracy of final pose estimations. A novel hand-object feature enhancement complementary (HOFEC) model, which is designed to optimize the feature extraction and fusion processes, is proposed, thereby enhancing pose estimation performance under complex backgrounds and occlusion conditions and addressing the aforementioned challenges. Method: 1) A novel architecture, known as the channel attention-guided feature pyramid network (CAG-FPN), is introduced to effectively address the prevalent issue of channel information loss within feature extraction processes. This model strategically integrates a channel attention mechanism into the traditional FPN framework, thereby enhancing its capability to discern and highlight the intricate relationships and implications among various existing channels in the input data during the critical multiscale feature fusion stage. The channel attention mechanism operates by dynamically adjusting the weights assigned to different feature channels based on their relevance to the task at hand. This dynamic weighting enables the network to more effectively identify and utilize crucial feature information that is essential for accurate recognition. Additionally, this architecture is enhanced with a dual-stream ResNet-50 network, built around the principles of local sharing. This innovative approach enables the joint construction of a comprehensive hand-object feature extraction network, notably boosting the overall feature extraction capabilities of the model. As a result, the model exhibits a marked improvement in its capability to capture and represent hand and object features, particularly in complex scenes with high variability and occlusion. 2) A sophisticated spatial attention module is designed to simultaneously enhance hand and object features while extracting critical information regarding the occluded regions that may hinder visibility, effectively addressing the challenges posed by mutual occlusion during hand-object interactions. The implementation of the spatial attention mechanism allows the model to selectively focus on important areas of interest, thereby improving its capability to accurately recognize and interpret occluded regions that are essential for effective pose estimation. In addition to the spatial attention module, a cross-attention module that facilitates the exchange of secondary features between the hand and object has also been innovatively designed. This module injects the secondary features of the hand into the primary features of the object and vice versa, thus fostering a robust complementarity between hand and object features.
Through this design, the module effectively integrates occlusion information from the hand and object regions while employing a correlation matrix to filter out irrelevant background noise. This dual approach ensures that the processes of feature enhancement and mutual complementarity are conducted with high precision and thoroughness. Consequently, this approach notably improves the overall accuracy of pose estimation in scenarios with complex and dynamic hand-object interactions. 3) By employing separate hand and object decoders, the poses of the hand and the object can be independently recovered. The two decoders consider the interaction effects between the hand and the object during the information fusion process, ensuring that the final pose output information is accurate and consistent. This design enables the model to effectively handle pose estimation in complex hand-object interaction scenarios, providing robust and reliable technical support for practical applications.ResultCompared to state of the art(SOTA) models, the proposed method demonstrates competitive performance in hand and object pose estimation tasks on the HO3D and Dex-ycb datasets. On the HO3D dataset, the hand pose estimation metrics PAMPJPE and PAMPVPE show an improvement of 0.1 mm over the next best model, HandOccNet. Additionally, the object pose estimation metric ADD-0.1D surpasses the suboptimal HFL-Net by 2.1%. On the Dex-ycb dataset, comparisons with seven recent models reveal that the hand pose estimation metrics MPJPE and PAMPJPE improve by 0.2 mm and 0.1 mm, respectively, over HFL-Net, while the object pose estimation metric ADD-0.1D shows a 6.4% improvement over HFL-Net.ConclusionThe HOFEC model proposed in this paper aims to improve the accuracy of hand-object pose estimation in interactive scenarios by facilitating the complementary information exchange between the hand and the object. By introducing a channel attention mechanism and incorporating shuffling operations, the proposed model not only addresses the issue of channel information loss in FPNs but also strengthens and supplements features across different scales. A feature enhancement module based on spatial attention is designed to enhance the hand and object features at the spatial scale, while simultaneously extracting secondary features for the hand and object. Through a cross-attention mechanism, these secondary features are used to complement the primary features of the hand and object, effectively filtering out irrelevant background information linked to the secondary features. This approach successfully addresses the challenge of underutilizing occlusion information, ultimately improving the accuracy of the hand-object pose estimation task. Building upon this foundation, a hand-object decoder that separately decodes the hand and object is developed, ultimately reconstructing their complete poses. Experimental results have shown that, even in cases of severe occlusion during hand-object interaction, the proposed HOFEC model can still accurately estimate hand and object poses.
摘要:ObjectiveDeep learning has made remarkable progress in 3D model classification. However, most existing classification methods rely on supervised learning, which limits their capability to recognize only the model categories seen during training. With the development of computer-aided design and LiDAR sensor technologies, an increasing number of novel 3D model classes are emerging, presenting a challenge: how to effectively identify model classes that were not encountered during training. Zero-shot learning has been proposed to address this challenge. However, this approach faces a major limitation due to the shortage of large-scale datasets with high-quality semantic information. To overcome this issue, many existing methods introduce large-scale pre-trained models with rich semantic information from 2D image domains, such as the contrastive language-image pre-training (CLIP) network. While these methods project 3D models to 2D space to meet the input requirements of CLIP visual encoder and achieve reasonable results, they do not fully capture the 3D information from the datasets and fail to leverage the knowledge inherent to the 3D domain. A straightforward approach to addressing this limitation is to adopt the learning strategy of multiview convolutional neural networks, which involves fine-tuning the CLIP visual encoder and optimizing its network parameters using a 3D model dataset. The goal is to leverage the advantages of 2D data annotation while incorporating the inherent characteristics of 3D models. However, this strategy does not yield effective results for CLIP. The fine-tuned network tends to overfit the training set, causing it to gradually forget much of the valuable 2D knowledge during the tuning process. Therefore, this strategy is not feasible. This paper proposes a consistency constraint guided network (CCG-Net) for zero-shot 3D model classification to overcome these problems.MethodCCG-Net aims to leverage the strengths of 2D and 3D domains while mitigating the issues of overfitting and knowledge forgetting. CCG-Net comprises fixed and dynamic parts. The fixed part of the network employs a frozen CLIP model to learn cross-modal information from large-scale 2D visual and semantic data. Stopping the backpropagation in this part forces the network to focus on preserving 2D information. In contrast, the dynamic part is a learnable encoder designed for extracting global features from 3D models, with a strong emphasis on acquiring 3D knowledge. A view consistency constraint is applied in the dynamic part to guide the extraction of 3D features. This design ensures that the 2D knowledge from the pre-trained model is fully preserved, while also allowing the network to learn new information from 3D data. The information from two modalities is then effectively fused into comprehensive 3D model features, which are used for classification. Mask consistency constraints are introduced to enhance the extraction of features for 3D data and improve the robustness of the 3D encoding process. This constraint guides the network in enhancing its capability to learn the 3D model through self-supervised learning. The specific approach involves employing different masking methods to obtain a diverse set of mask features. Once these features are generated, the next step is to constrain their consistency. 
The network can effectively learn and integrate the essential characteristics from the masked data by ensuring the consistency of these mask features, finally enhancing model robustness and accuracy. Additionally, the pre-trained network employs a mutual exclusion loss, which assumes a mutual exclusion relationship between the labels to be classified. However, this network is unsuitable for the zero-shot task of tuning on a small-scale dataset. A non-mutual exclusion loss, guided by the homogeneity consistency constraints, is also proposed to address this issue, ensuring the accuracy of the learning direction and the network’s capability to generalize its learning during training on a small-scale dataset.ResultThree different consistency constraint schemes work collaboratively within the network to optimize its parameters, effectively preventing overfitting during fine-tuning on 3D data. This approach enhances the reliability and generalization of feature extraction, ultimately enhancing zero-shot classification performance. Quantitatively, on the ZS3D dataset, the proposed method achieves a classification accuracy of 70.1%, marking a substantial 9.2% improvement over the current best results, achieved by discriminative feature-guided zero-shot learning of 3D model classification (DFG-ZS3D). Additionally, this method demonstrates improvements on the dataset proposed by Cheraghian, achieving classification accuracies of 57.8%, 19.9%, and 12.2% on the ModelNet10, McGill, and Shrec 2015 subsets, respectively. These results correspond to improvements of 22.8%, 3.3%, and 2.3% over the state-of-the-art methods. The ScanObjectNN dataset, which comprises 3D models obtained from real-world scans rather than synthetic data, further validates the effectiveness of CCG-Net. On this dataset, CCG-Net attains the highest performance across its three subsets, with classification accuracies of 32.4%, 28.9%, and 19.3% on the OBJ_ONLY (Object only), OBJ_BG(Object and background), and PB_T50_RS(Object augmented rot scale) subsets, respectively. The performance improvement on real-world datasets further validates the generalization capability of the proposed method. Additionally, ablation experiments confirm the effectiveness of the three consistency constraints. Finally, qualitative analysis results of the confusion matrix demonstrate that the network can avoid overfitting to a certain extent. However, this analysis also reveals shortcomings in the capability of the network to extract discriminative features, providing a perspective for future research.ConclusionCompared to methods that rely solely on pre-trained models, the proposed approach in this paper leverages the strengths of language-image pre-trained network while incorporating knowledge from the 3D modeling domain through view consistency constraint. This method improves the robustness and generalization capability of the network by designing self-supervised enhancement under mask consistency constraint and refining the homogeneity consistency constraint loss function. Therefore, this method achieves accurate improvement for zero-shot 3D model classification.
关键词:3D model classification; zero-shot learning; self-supervised learning; image-text pre-training; visual-language multimodality
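The mask consistency constraint described above can be approximated by enforcing agreement between features extracted from two randomly masked versions of the same 3D model. The following sketch is a simplified, hypothetical formulation using a cosine-distance consistency term; the masking scheme and encoder are placeholders, not the CCG-Net implementation.

```python
# Sketch of a mask consistency constraint: features from two randomly masked views of the same
# 3D model are pulled together. A simplified, hypothetical formulation.
import torch
import torch.nn.functional as F

def random_point_mask(points: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    """points: (B, N, 3). Keep a random subset of points per model."""
    B, N, _ = points.shape
    idx = torch.rand(B, N, device=points.device).argsort(dim=1)[:, : int(N * keep_ratio)]
    return torch.gather(points, 1, idx.unsqueeze(-1).expand(-1, -1, 3))

def consistency_loss(encoder, points: torch.Tensor) -> torch.Tensor:
    f1 = F.normalize(encoder(random_point_mask(points)), dim=1)
    f2 = F.normalize(encoder(random_point_mask(points)), dim=1)
    return (1 - (f1 * f2).sum(dim=1)).mean()     # cosine distance between the two masked views

# Example with a toy encoder that mean-pools points into a 3-D descriptor.
toy_encoder = lambda p: p.mean(dim=1)
print(consistency_loss(toy_encoder, torch.randn(4, 1024, 3)).item())
```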
摘要:Objective: Gait recognition, a promising biometric identification technology, offers advantages such as resistance to disguise and the capability to perform long-range surveillance, making it widely used in criminal investigations, industry applications, surveillance systems, and social security. However, in practical applications, elements such as observation angles and complicated backgrounds still pose challenges to the accuracy of existing gait recognition algorithms. Gait recognition methods mainly rely on feature extraction to obtain distinctive feature maps for identification. However, current algorithms fail to adequately capture and utilize human biometrics present in gait data. Fortunately, the deep residual module has been proven effective in extracting high-level features. This study offers a high-precision gait recognition method to address these limitations and achieve consistent recognition in practical applications. This method comprehensively extracts multiple features using the deep residual module and combines gait features with body shape features, improving overall recognition accuracy. Method: This study employs a 50-layer residual network (ResNet-50) as the backbone for constructing a multibranch feature fusion network based on human skeleton information. The network combines deep feature extraction and network fusion modules to ensure reliable and accurate gait identification. High-resolution network (HRNet) is used to extract skeletal information, utilizing information blending across parallel networks at varying resolutions. This approach improves the network’s recognition accuracy and its capability to identify features from low-resolution images. The gait cycle is extracted by analyzing the similarity of leg movement features, which serves as a hyperparameter for the network to minimize computational load while preserving important information. Following data augmentation, feature modeling is performed using pre-network skeleton information. Optimal residual modules are then applied within the deep residual module for scale alignment and information transfer, enabling the in-depth extraction of gait and body shape features. These features are divided into three branches: skeletal motion, gait speed, and body proportion. A multibranch feature fusion module is implemented to enhance the network’s structure and its information interaction mechanisms. This module uses an information transfer and weight allocation mechanism, similar to the attention mechanism. First, the skeletal motion and body proportion branches are concatenated at the input end using matrix concatenation, combining them into spatial information. These concatenated features are then mapped to a low-dimensional space, producing a set of combined features. The fused features are further mapped to the spatial and velocity branches through an activation function, where the activation values act as weight parameters for each branch. This process adjusts the weights of each branch’s feature maps based on their relevance to the task, combining features from different branches to leverage their complementary strengths. This approach enables the comprehensive identification of target identity information, addressing the limitations of low discrimination and recognition accuracy. This strategy improves the generalizability and recognition accuracy of the network. Result: The MPII human pose dataset (MPII) comprises 6 619 test images, 14 679 training images, and 2 726 validation images.
The HRNet demonstrates excellent performance in point localization. For input images of 256 × 256 pixels with a threshold of 0.01, the average head-normalized probability of correct keypoints (PCKh) for each point exceeds 83%. Furthermore, the keypoint localization for the lower extremities is particularly strong, with PCKh values greater than 95% for the ankle and knee, meeting the precision requirements for further experiments. In the ablation experiment, three distinct walking scenarios were considered: normal walking (NM), walking with backpacks (BG), and walking with a coat or jacket (CL). The accuracy of gait recognition using a single feature was relatively low, especially when wearing a thick coat. However, by combining the three types of features for classification, the recognition accuracy for the NM group substantially improved, reaching 94.52%. This approach also demonstrated high resistance to interfering elements, with the overall recognition rate increasing by 4.50% compared to the second-best method. The cross-view dataset from the Institute of Automation, Chinese Academy of Sciences (CASIA-B) was chosen for training and testing to assess the generalization capabilities of the model across diverse angles. The initial training parameters are as follows: batch size = 64, learning rate = 0.000 1, learning decay rate = 0.01, and dropout = 0.35. The model demonstrated robust performance in cross-view experiments, particularly excelling at 36° and 126°, as well as at nearby intervals. In the multistate experiment, the recognition rate for the NM group reached an impressive 97.36%. Furthermore, the proposed method outperforms similar algorithms, particularly for the CL group. Gait recognition technology is applicable in indoor and outdoor environments. However, outdoor environments present additional challenges, including lighting changes, angle adjustments, and dynamic backgrounds. Consequently, algorithms trained on indoor datasets often struggle to perform well in outdoor settings. Therefore, the algorithm in this study was validated using a self-built outdoor dataset. While most existing gait detection algorithms experience a loss of over 15% in accuracy when transitioning from indoor to outdoor environments, the proposed method achieved a 4.1% accuracy improvement over the second-best gait detection algorithm, demonstrating strong generalization potential.ConclusionThe gait recognition method presented in this study effectively leverages the robust resilience of skeleton information and the advantages of multifeature fusion. This method efficiently reduces interference in challenging environments, such as complicated backgrounds, thick clothing, and varying angles, resulting in stable and high-precision recognition of target identities, making it well-suited to meet the requirements of practical applications.
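The fusion mechanism described above, in which concatenated skeletal-motion and body-proportion features are mapped to a low-dimensional space and re-expanded into branch-wise weights through an activation function, can be sketched as follows. The layer sizes and the exact weighting scheme are illustrative assumptions.

```python
# Sketch of the described multibranch fusion: skeletal-motion and body-proportion features are
# concatenated, squeezed to a low-dimensional code, and re-expanded into per-branch weights.
import torch
from torch import nn

class MultiBranchFusion(nn.Module):
    def __init__(self, dim: int = 256, bottleneck: int = 64):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Linear(2 * dim, bottleneck), nn.ReLU(inplace=True))
        self.spatial_weight = nn.Sequential(nn.Linear(bottleneck, dim), nn.Sigmoid())
        self.velocity_weight = nn.Sequential(nn.Linear(bottleneck, dim), nn.Sigmoid())

    def forward(self, motion, proportion, speed):
        combined = self.squeeze(torch.cat([motion, proportion], dim=1))   # low-dim joint code
        spatial = (motion + proportion) * self.spatial_weight(combined)   # reweighted spatial branch
        velocity = speed * self.velocity_weight(combined)                 # reweighted speed branch
        return torch.cat([spatial, velocity], dim=1)                      # fused identity feature

fused = MultiBranchFusion()(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
print(fused.shape)  # torch.Size([8, 512])
```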
摘要:Objective: Colorectal cancer, a high-incidence and extremely harmful disease, represents a serious threat to human health. Statistics show that approximately 95% of cases develop from the progressive growth of colon polyps, highlighting the critical importance of early identification and monitoring of polyps in reducing the incidence of colorectal cancer. However, traditional manual diagnostic methods often suffer from high omission rates, limiting the effectiveness of early intervention. In this context, the introduction of deep learning technology offers a promising solution. By thoroughly analyzing the characteristics of lesions, including the precise location and morphological structure of polyps, deep learning models can substantially enhance the screening efficiency and accuracy of doctors, thus driving innovation in the prevention and treatment of colorectal cancer. In recent years, with the rapid advancements in deep learning technology, its application in medical image analysis and other fields has made remarkable breakthroughs. Notably, models such as convolutional neural networks (CNNs) and visual Transformers (ViTs) have been widely adopted in medical tasks, demonstrating excellent performance and accelerating the clinical adoption of computer-aided diagnosis technologies. Given the complex characteristics of colorectal polyp images, such as substantial morphological heterogeneity and blurred edge definitions, this study introduces an innovative polyp boundary clue deep fusion network (PBCDF-Net). This network focuses on improving the segmentation accuracy of polyp images by accurately capturing and segmenting polyp boundaries through the integration of multilayer features. Verified across multiple datasets, the PBCDF-Net has demonstrated excellent performance, not only deepening the understanding of the pathological characteristics of polyps but also providing a powerful tool for clinical practice, with important practical value and a forward-looking perspective. Method: The proposed PBCDF-Net uses Res2Net-50 as its backbone, which extracts features from different receptive fields to enhance the multiscale representation capability of the target object, enabling the network to exhibit strong feature extraction and model expression capabilities. This paper specially designs a boundary clue mining module (BCMM) to address the fuzziness and uncertainty of polyp boundaries. This module extracts effective boundary clues from low-level feature layers, which are rich in texture and detail information, by incorporating specific operators. These boundary clues are then integrated with advanced semantic feature layers, allowing for the precise identification of polyp locations and resulting in more accurate and effective boundary information. Subsequently, the mined boundary clues are fused with semantic feature layers from different levels to achieve higher-precision segmentation of polyp fuzzy boundaries. This fusion compensates for the lack of boundary detail in the semantic feature layers, thereby further improving the model’s segmentation performance. Considering the important morphological differences and structural complexity of polyps, a foreground target enhancement module (FTEM) is also designed. This module enhances the features of small, hard-to-detect polyps and polyps with complex structures, improving the network’s capability to identify and perceive polyps with different tissue structures.
A deep feature fusion module (DFFM) is designed during the decoding stage to efficiently integrate boundary detail features and enhanced target features. This module performs a preliminary fusion of the two feature types and then applies hierarchical cross-over to the preliminarily fused features. The feature fusion process is ultimately achieved through deep fusion, with the DFFM implemented in the form of cascade transfer fusion. This method ensures reliable correlation between the upper and lower features. For dataset processing, this article uses experimental data configurations from various mainstream networks, including PraNet. Specifically, the training data comprises a total of 1 450 polyp images, sourced from the Kvasir dataset (900 images) and the CVC-ClinicDB dataset (550 images). For testing, the remaining data from the Kvasir and CVC-ClinicDB datasets are combined with additional datasets, including ETIS, CVC-ColonDB, and CVC-300. This study uses five key indicators to evaluate model performance: average Dice, average IoU, structure measure, weighted F measure, and enhanced alignment measure.ResultThe study comprehensively evaluated the performance of the proposed PBCDF-Net model on the colorectal polyp segmentation task, using five public datasets (Kvasir, ETIS, CVC-ColonDB, CVC-ClinicDB and CVC-300) as benchmarks for testing. Additionally, one-in-out crossover experiments were conducted on the latest PolypGen dataset. These datasets cover a wide range of polyp morphological features, ensuring the evaluation is comprehensive and valid. A systematic comparative analysis of its performance is conducted using nine advanced segmentation methods to objectively evaluate the progress of PBCDF-Net: U-Net, UNet++, SFA, PraNet, ACSNet, CCBANet, DCRNet, ECTransNet and CIFGNet. Experimental results demonstrate the excellent segmentation capabilities of PBCDF-Net across various datasets. In particular, on the CVC-ClinicDB dataset, PBCDF-Net outperforms CCBANet, with increases of 6.6%, 7.4%, 3.4%, 7%, and 4.9% in key metrics such as mDice, mIoU, structural metrics, weighted F-measure, and augmented alignment metrics, respectively. Similarly, on the Kvasir and CVC-300 datasets, PBCDF-Net exhibits an average improvement of 4.5%, 6.2%, 2.5%, 6.3% and 2.9% across all evaluation metrics compared to recent methods. In addition, cross-experiment results on the PolypGen dataset indicate that PBCDF-Net improves by 4.6% and 4.9% on mDice and mIoU, respectively, compared to PraNet, outperforming several state-of-the-art methods across several metrics. These improvements highlight the strong capability of PBCDF-Net to maintain segmentation structural integrity and detail fidelity, demonstrating its notable advantages in improving the quality of segmentation outputs.ConclusionThe constructed PBCDF-Net model demonstrated excellent performance in colorectal polyp segmentation tasks. Through careful subjective evaluation, the consistent performance of the network across multiple datasets was confirmed, highlighting its strong adaptability to polyp size diversity and edge fuzziness, as well as its high accuracy in defining polyp contours. 
Additionally, the ablation analysis of the three core components in the network design (the boundary clue mining module, the foreground target enhancement module, and the deep feature fusion module) clearly confirmed that these components played a decisive role in enhancing the segmentation accuracy of the model, effectively improving the overall performance of the algorithm.
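The boundary clue mining idea, lifting edge cues from detail-rich low-level features with a specific operator and injecting them into high-level semantic features, can be sketched as below. The fixed Laplacian kernel and fusion convolution are assumptions standing in for the paper's BCMM design rather than its actual layers.

```python
# Sketch of boundary-clue mining: edge cues are extracted from a low-level feature map with a fixed
# Laplacian-style operator and fused with an upsampled high-level semantic map. Hypothetical design.
import torch
import torch.nn.functional as F
from torch import nn

class BoundaryClueMining(nn.Module):
    def __init__(self, low_channels: int, high_channels: int):
        super().__init__()
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.register_buffer("laplacian", lap.view(1, 1, 3, 3).repeat(low_channels, 1, 1, 1))
        self.low_channels = low_channels
        self.fuse = nn.Conv2d(low_channels + high_channels, high_channels, 3, padding=1)

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        # Depthwise Laplacian highlights texture/edge responses in the low-level features.
        edges = F.conv2d(low_feat, self.laplacian, padding=1, groups=self.low_channels)
        high_up = F.interpolate(high_feat, size=low_feat.shape[-2:], mode="bilinear",
                                align_corners=False)
        return self.fuse(torch.cat([edges, high_up], dim=1))   # boundary-aware semantic features

out = BoundaryClueMining(64, 256)(torch.randn(1, 64, 88, 88), torch.randn(1, 256, 22, 22))
print(out.shape)  # torch.Size([1, 256, 88, 88])
```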
摘要:Objective: Magnetic resonance fingerprinting (MRf) is a rapid and efficient quantitative imaging technique that simultaneously provides multiple physiological tissue parameters. It encodes tissue differences into a unique pattern of fingerprints using pseudo-random pulse excitation sequences. Quantitative parameter maps can be obtained via fingerprint reconstruction and parameter inversion through signal recovery and pattern recognition. Similar to fast MR imaging techniques, non-Cartesian sparse sampling is utilized in MRf to accelerate the scanning process, allowing the entire field of view to be sampled within a single repetition time. Highly sparse downsampling can introduce significant aliasing noise in the reconstructed images used to generate fingerprint sequences. Hence, reconstructing clean and distinctive fingerprints is crucial for bridging qualitative measurements with quantitative physiological tissue parameter maps in MRf. Model-based reconstruction methods remove aliasing noise by solving optimization problems. Various constraints, such as low-rank and sparsity priors, are applied to the optimization problem to ensure the convergence of solutions. However, many of these prior constraints fail to adequately balance denoising and edge preservation due to the diversity of images. The low-rank constraint focuses on noise suppression, which easily leads to the oversmoothing of reconstructed fingerprints. The sparsity constraint alleviates this problem to some extent. Given the diversity of image blocks, the best sparse representation and reconstruction of signals cannot be achieved when using a fixed sparsifying transform and a uniform sparsity level. Blind compressed sensing adaptively learns the features and structures of the data without making any prior assumptions. Method: Based on the above analysis, we propose an adaptive sparsifying transform learning-based MRf reconstruction method. First, the images are reconstructed through iterative processes involving sparsifying transform domain learning and sparse representation-based reconstruction. The adaptively learned sparse transformation achieves low sparsity levels, effectively removing aliasing artifacts compared with conventional sparsifying transforms such as the wavelet or Fourier transform. Second, given that the MRf dictionary serves as the ideal estimation of fingerprints, we can retrieve the temporal features of fingerprints by incorporating the MRf dictionary into the reconstruction model. In each iteration, the reconstructed fingerprints are updated with the best-matching dictionary atoms to improve the discrimination of fingerprints. Finally, singular value decomposition (SVD) is applied to compress the temporal dimension of fingerprints based on the correlation of signals at adjacent time points to accelerate the reconstruction process. Thus, reconstruction and dictionary matching are carried out in a subspace spanning 5–10 singular components. Simulation experiments compare the proposed method with other state-of-the-art model-based reconstruction methods to verify its effectiveness. Result: The reconstructed tissue parameter maps demonstrated that our approach achieves higher accuracy than five other MRf reconstruction methods. The average relative errors of the three parameter maps reconstructed with our approach are 4.67%, 4.2%, and 1.12%, which are reduced by more than 5% compared with those of conventional MRf methods.
Our approach also improves on other model-based methods without incurring additional computation time. It involves several key parameters, including the block size, the number of blocks, and the number of singular components; their optimal values are determined experimentally, and the performance of the approach under different parameter settings is presented in the discussion section. Conclusion: We introduce an adaptive sparsifying transform learning-based reconstruction method for MRF that enhances the accuracy and quality of tissue parameter maps. Our approach effectively mitigates aliasing artifacts by leveraging the MRF dictionary and incorporates SVD to compress the temporal dimension without increasing computation time. The results indicate that our approach outperforms other MRF reconstruction techniques. The findings contribute to the advancement of MRF technology toward clinical applications and hold significant value for medical imaging, particularly in early disease detection and precision medicine. By improving image quality and the accuracy of parameter measurement, the approach helps clinicians diagnose lesions promptly and accurately and optimize treatment plans. However, the current algorithm still requires manual adjustment of the iteration thresholds to achieve optimal sparsification for different sampling trajectories. Future research should focus on adaptive threshold selection to further enhance the versatility and practical applicability of the algorithm.
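As a concrete illustration of the SVD compression and subspace dictionary matching step summarized above, the following minimal NumPy sketch projects an MRF dictionary onto its leading temporal singular vectors and matches fingerprints by maximum normalized inner product in that subspace. The array shapes, the rank of 8, the toy random data, and the function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: SVD temporal compression of an MRF dictionary and
# dictionary matching in the compressed subspace (toy data, assumed shapes).
import numpy as np

def compress_dictionary(D, rank=8):
    """Project an MRF dictionary (n_atoms x n_timepoints) onto the
    subspace spanned by its leading right singular vectors."""
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    Vr = Vt[:rank].T                      # (n_timepoints x rank) temporal basis
    return D @ Vr, Vr

def match_fingerprints(X, D_sub, Vr):
    """Match reconstructed fingerprints X (n_voxels x n_timepoints) to
    dictionary atoms via maximum normalized inner product in the subspace."""
    X_sub = X @ Vr                                        # compress voxel signals
    Xn = X_sub / np.linalg.norm(X_sub, axis=1, keepdims=True)
    Dn = D_sub / np.linalg.norm(D_sub, axis=1, keepdims=True)
    corr = np.abs(Xn @ Dn.T)                              # correlation with every atom
    return np.argmax(corr, axis=1)                        # index of best-matching atom

# Toy usage: random atoms and noisy fingerprints stand in for simulated data.
rng = np.random.default_rng(0)
D = rng.standard_normal((500, 1000))      # 500 atoms, 1000 time points
X = D[rng.integers(0, 500, size=64)] + 0.1 * rng.standard_normal((64, 1000))
D_sub, Vr = compress_dictionary(D, rank=8)
labels = match_fingerprints(X, D_sub, Vr)
```

Because the subspace typically contains only 5–10 components, the inner products in `match_fingerprints` run over a handful of dimensions rather than the full time series, which is where the reported acceleration of reconstruction and matching comes from.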
摘要:Objective: Single-modality medical imaging is often insufficient to provide a comprehensive picture of lesion characteristics, including structure, metabolism, and other critical details. Medical images can generally be categorized into anatomical and functional imaging. Anatomical imaging offers rich information on body structure but little insight into metabolic processes, whereas functional imaging captures metabolism but lacks structural detail. In clinical practice, doctors therefore combine medical images from multiple modalities to diagnose diseases, localize lesions, and plan surgeries. However, simultaneously inspecting multiple modalities is not intuitive and may not fully capture all relevant features of a lesion, so multimodal medical image fusion is commonly employed to integrate and enhance the information from different imaging techniques. A common challenge in medical image fusion is how to fully retain the unique features of each modality while effectively integrating the features shared between modalities. In currently used two-branch encoding methods, the interaction of shared cross-modal features is often insufficient, which limits the establishment of feature correlations between multimodal images. To address these issues, a multiscale medical image fusion network is designed based on progressive feature extraction, frequency domain information supplementation, and image reconstruction with a Swin Transformer and a convolutional neural network (CNN). Method: First, a multiscale feature extraction module guided by gradient information is designed and integrated into a three-branch feature extraction architecture. The left and right branches extract the features unique to each medical imaging modality, while the middle branch extracts the features shared between modalities. The architecture comprises several of these gradient-guided multiscale feature extraction modules, each of which can integrate features from all scale levels simultaneously. It fully considers the information interaction between modalities, progressively extracts the common and unique features across modalities, and effectively integrates multiscale features from multimodal medical images. Second, a progressive fusion module incorporating cross-attention is designed to fully utilize frequency domain information and guide the fusion process at the modality level. This fusion module enhances the interaction of spatial domain information between modalities and leverages high- and low-frequency positional information from the frequency domain, steering the model toward more targeted multimodal fusion. Finally, a Swin-CNN reconstruction module is designed to model the relationship between the global and local features of medical images. The reconstruction module uses a Swin Transformer to capture global information, such as the overall structure and shape of the image, while employing a CNN to extract regional features, such as local texture details.
The reconstruction module can thus effectively improve the quality of fused images by integrating the global and local feature information of medical images simultaneously (a minimal sketch of this global-local design is given after this abstract). Result: The experiments use the MRI-SPECT and MRI-PET fusion datasets from the Harvard Medical School whole brain database and the GFP-PC fusion dataset from the John Innes Center. In terms of visual quality, the proposed fusion model effectively preserves the structural and functional features of the different medical imaging modalities and improves the quality of the fused images. The fused images produced by the model have the following advantages: 1) they contain richer texture details and sharper features, such as edges and contours, and effectively preserve the information-rich regions of each modality; 2) they retain the visual features of all original medical images, without bias toward the information of any single modality; 3) they are rendered cleanly, with no artifacts degrading the visual quality. In terms of quantitative comparison, the model achieves the best results on all eight image fusion evaluation metrics in the MRI-SPECT and MRI-PET fusion tasks. Compared with the second-best model, mutual information (MI) and discrete cosine transform feature mutual information (FMIdct) are markedly improved: MI improves by 4.42% and 17.30% on the two tasks, respectively, and FMIdct improves by 5.17% and 11%, respectively. In the GFP-PC fusion task, six best and two second-best results are achieved; compared with the second-best model, MI and visual information fidelity (VIF) improve substantially, by 16.43% and 16.87%, respectively. Ablation experiments on the network structure and the loss function are also conducted to analyze the results and evaluate the contribution of each part of the model. The experimental results show that every model component and the loss function enhance the fusion effect. Conclusion: The proposed fusion model leverages the common and unique features of different medical imaging modalities and progressively integrates multiscale information using a three-branch architecture. It also uses a progressive fusion module with cross-attention to fuse high- and low-frequency features in a targeted manner and attends to both global and local attributes of medical images during reconstruction, effectively enhancing the quality of multimodal medical image fusion. The proposed model performs well in three medical image fusion tasks and generalizes well. It can provide multimodal fused medical images with clear contour structures and rich texture details, aiding doctors in clinical diagnosis and improving diagnostic efficiency and accuracy. Future studies will investigate the constraints or effects that downstream tasks, such as medical semantic segmentation, impose on image fusion, and the network architecture will be optimized for specific tasks to ensure close integration between tasks such as semantic segmentation and image fusion.
This research aims to improve the quality of fused images while enhancing the performance of downstream tasks, thereby expanding the application possibilities of multimodal medical image fusion.
关键词:multimodal medical image fusion;multiscale;progressive extraction and fusion;frequency domain information guidance;global-local reconstruction
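To make the Swin-CNN reconstruction idea in the abstract above more tangible, the following toy PyTorch block pairs patch-token self-attention (standing in for a full Swin Transformer) with a convolutional branch for local texture and fuses the two by concatenation. The layer sizes, patch size, and fusion choice are illustrative assumptions and do not reproduce the paper's architecture.

```python
# Toy global-local reconstruction block: attention over patch tokens models
# global structure, a CNN branch keeps local texture (assumed sizes only).
import torch
import torch.nn as nn

class GlobalLocalBlock(nn.Module):
    def __init__(self, channels=32, patch=8, heads=4):
        super().__init__()
        self.patch = patch
        # Global branch: multi-head self-attention over non-overlapping patches.
        self.attn = nn.MultiheadAttention(channels * patch * patch, heads, batch_first=True)
        # Local branch: plain convolutions for texture details.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Fuse the two branches back to the original channel count.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):                       # x: (B, C, H, W), H and W divisible by patch
        b, c, h, w = x.shape
        p = self.patch
        # Split the feature map into patch tokens: (B, num_patches, C*p*p).
        tokens = (x.reshape(b, c, h // p, p, w // p, p)
                    .permute(0, 2, 4, 1, 3, 5)
                    .reshape(b, (h // p) * (w // p), c * p * p))
        g, _ = self.attn(tokens, tokens, tokens)
        # Fold the attended tokens back into an image-shaped feature map.
        g = (g.reshape(b, h // p, w // p, c, p, p)
               .permute(0, 3, 1, 4, 2, 5)
               .reshape(b, c, h, w))
        l = self.local(x)
        return self.fuse(torch.cat([g, l], dim=1))

# Usage on a dummy fused-feature map.
block = GlobalLocalBlock(channels=32, patch=8)
out = block(torch.randn(1, 32, 64, 64))        # -> (1, 32, 64, 64)
```

In the paper's actual module the global branch is a Swin Transformer with shifted windows; the plain patch attention here is only a compact stand-in used to sketch the global-local split.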
摘要:Objective: Medical report generation leverages natural language processing and machine learning techniques to convert medical data, such as medical images, into structured text reports. This approach aims to improve efficiency in the medical field and reduce the workload of healthcare professionals. More than 3.6 billion X-ray examinations are conducted worldwide each year, the majority of which are chest X-rays, and this number continues to rise. The growing volume places increasing pressure on radiologists and affects the quality and speed of clinical decision-making. Inspired by advanced general image captioning methods, current medical report generation methods can be broadly categorized into three types: encoder-decoder methods, cross-modal alignment methods, and knowledge-enhanced methods. Among these, encoder-decoder methods, particularly those built on the Transformer architecture, are the most widely adopted, because the Transformer excels at encoding long-range dependencies and learning efficient feature representations, making it well suited to medical report generation. However, most of these methods generate diagnostic reports end to end without classifying the types of chest diseases and therefore lack the support of disease-label semantic information. Despite the remarkable progress driven by deep learning, several challenges remain. First, a semantic gap exists between medical images and text reports: most existing methods align only the prominent information in images and text and overlook fine-grained interactions, which weakens the connection between local image regions and the corresponding text in the reports. Second, the complexity and diversity of organs and diseases in chest X-ray images, along with special cases such as complications and coexisting diseases, make generating accurate text reports challenging. In addition, the key clinical information in medical reports often comes from descriptions of abnormalities; because abnormalities are absent from many images and reports, models tend to generate normal and highly similar reports. To address these issues, this paper proposes a chest X-ray report generation method that integrates knowledge enhancement and feature alignment. The approach includes three main modules: 1) an image and text feature representation module that extracts detailed features from text reports and chest images; 2) a knowledge enhancement module that incorporates prior medical knowledge to guide the learning of visual features; 3) a global-local feature alignment module that promotes semantic alignment among images, reports, and disease labels, thereby improving the accuracy and completeness of the generated reports. Method: First, chest X-ray images and text reports are taken as input, and an image and text feature representation module is constructed with visual and textual encoders that extract global and local features from the images and text, respectively. Then, a chest prior knowledge graph is introduced to enable knowledge-enhanced visual feature learning through pathological image knowledge encoding, yielding enhanced visual features after fusion.
Finally, cross-attention is employed to align the global-local image and text features and the visual and disease-label features across modalities, and multi-head attention in the encoder-decoder is used to generate accurate chest X-ray reports. Result: The effectiveness of the proposed method is validated through comparative experiments on two challenging datasets, IU X-Ray and MIMIC-CXR. On the IU X-Ray dataset, the BLEU-1, BLEU-3, and BLEU-4 scores reach 0.505, 0.235, and 0.178, respectively, improving on most existing methods for the same task. This result indicates that the proposed model offers advantages in text fluency and focuses better on disease regions, generating correct label information and notably improving various report metrics. The CIDEr and ROUGE-L scores are also comparable to those of other methods, demonstrating that knowledge enhancement and feature alignment positively affect the quality of the generated reports. On the MIMIC-CXR dataset, the BLEU-2 and BLEU-3 metrics increase by 0.4% and 1.2%, respectively, over the second-best method, demonstrating the robustness of the model, which maintains its advantages even on more complex and diverse data. The CE metrics on MIMIC-CXR also improve, with precision, recall, and F1 reaching 0.428, 0.343, and 0.360, respectively, indicating the effectiveness of the method in generating complete and consistent reports. Qualitative experiments show that the generated medical reports are largely consistent with the reference reports. For abnormal images, the model accurately identifies the abnormal regions, and the generated reports not only describe the disease conditions but also detail the lesions. The vocabulary of the generated reports is professional and closely aligned with the ground-truth reports, demonstrating that knowledge enhancement helps the model produce accurate, domain-specific text. Ablation experiments confirm that incorporating the image and text feature representation, knowledge enhancement, and feature alignment modules improves report generation over the base model. This finding indicates that introducing external knowledge into the medical report generation model and fully exploiting the visual features of similar medical images allow the model to learn sufficient visual features during encoding, guiding it to generate more complete and accurate reports during decoding. Conclusion: The proposed chest X-ray report generation method captures detailed features of images and text, models the relationships between global-local features and disease categories, and strengthens the alignment between images and text, enabling the generation of complete and accurate medical reports.
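As a rough illustration of the cross-attention alignment described above, the following minimal PyTorch sketch lets report-token features attend over image-region features so that each generated token can be grounded in local visual evidence. The feature dimension, sequence lengths, and single-layer design are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal cross-modal alignment sketch: text queries attend to image regions
# (dimensions and the single attention layer are assumed, not the paper's setup).
import torch
import torch.nn as nn

class CrossModalAlignment(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # Queries come from the text side, keys/values from image regions.
        aligned, attn_weights = self.cross_attn(
            query=text_feats, key=image_feats, value=image_feats)
        # Residual connection keeps the original textual information.
        return self.norm(text_feats + aligned), attn_weights

# Dummy usage: 60 report tokens attend over 49 image-region features.
align = CrossModalAlignment(dim=512, heads=8)
text = torch.randn(2, 60, 512)
image = torch.randn(2, 49, 512)
fused, weights = align(text, image)    # fused: (2, 60, 512), weights: (2, 60, 49)
```

The returned attention weights indicate which image regions each report token attends to, which is the kind of global-local correspondence the alignment module is intended to encourage.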