语义微调和跨模态检索增强的中文医学报告生成
Semantic fine-tuning and cross-modal retrieval-augmented Chinese medical reports generation
2024, pages 1-17
Online publication date: 2024-12-23
DOI: 10.11834/jig.240451
Li Hengtai, Liu Hui, Chen Gongguan, et al. Semantic fine-tuning and cross-modal retrieval-augmented Chinese medical reports generation [J]. Journal of Image and Graphics, 2024: 1-17
Objective
Medical report generation aims to produce accurate diagnostic findings from medical images, thereby reducing the burden on physicians and improving clinical efficiency. However, Chinese medical report generation still falls short in accurately understanding medical images and describing them in standard report language, and it suffers from hallucination. To address these problems, this paper proposes a Chinese medical report generation model based on semantic fine-tuning and cross-modal retrieval augmentation (semantic fine-tuning and cross-modal retrieval-augmented Chinese medical reports generation, FRCM).
Method
Building on the large multimodal model LLaVA, this paper adapts and fine-tunes its visual encoder and large language model for the medical domain, and proposes a collaborative training strategy that combines general data with domain-specific data: general data improves the model's ability to understand complex instructions, while domain-specific data equips the model with medical image-text alignment capability and the ability to generate professional Chinese medical reports. In the inference phase, a novel cross-modal retrieval augmentation strategy is proposed; the retrieved guiding knowledge effectively alleviates the model's hallucination problem and further improves the accuracy and robustness of the generated medical reports.
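As an illustration of how retrieved guiding knowledge can be injected at inference time, the following is a minimal sketch of knowledge-guided prompt assembly. The prompt template, the helper name build_guided_prompt, and the sample reports are illustrative assumptions, not FRCM's exact prompt.

```python
# Sketch: assembling a knowledge-guided prompt at inference time.
# Template and helper names are illustrative assumptions, not FRCM's
# actual implementation.

def build_guided_prompt(instruction: str, similar_reports: list[str]) -> str:
    """Prepend retrieved similar reports as guiding knowledge so the
    LLM can ground its generated report on them."""
    knowledge = "\n".join(
        f"参考报告{i + 1}: {r}" for i, r in enumerate(similar_reports)
    )
    return (
        "以下是与该影像相似的历史报告, 仅供参考:\n"
        f"{knowledge}\n\n"
        f"指令: {instruction}\n"
        "请根据影像内容生成规范的中文医学报告。"
    )

prompt = build_guided_prompt(
    "请为这张胸部X光片生成诊断报告。",
    ["双肺纹理清晰, 未见实变。", "心影大小正常, 纵隔居中。"],
)
print(prompt)
```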
Result
On the Chinese MIMIC-CXR dataset, compared with the XrayGLM and XrayPULSE models, FRCM improves the seven metrics BLEU-1, BLEU-2, BLEU-3, BLEU-4 (bilingual evaluation understudy with n-grams, BLEU-n), ROUGE-L (recall-oriented understudy for gisting evaluation-longest common subsequence), METEOR (metric for evaluation of translation with explicit ORdering), and CIDEr (consensus-based image description evaluation) by 10.4%, 10.1%, 9.7%, 9.1%, 6.6%, 9.4%, and 38.4%, respectively. Compared with models fine-tuned on LLaVA and Qwen-VL, FRCM improves the five metrics BLEU-1, BLEU-2, BLEU-3, BLEU-4, and CIDEr by 4.1%, 3.1%, 3.3%, 3.6%, and 25.1%, respectively. Ablation results show that the training method and key components used by FRCM effectively improve model performance. Two case studies further demonstrate that the Chinese medical reports generated by FRCM surpass those of other models in accuracy and information richness.
Conclusion
By designing training and inference strategies for a large multimodal model, this paper combines the advantages of semantic fine-tuning and retrieval augmentation to generate more detailed and accurate Chinese medical reports.
Objective
The task of generating medical reports involves producing accurate and comprehensive examination results based on symptoms observed in medical images. This technology can alleviate the burden on radiologists, reduce diagnostic errors caused by lack of experience, and expedite clinical workflows. Medical report generation is similar to image captioning, but it presents two unique challenges: long text generation and imbalanced medical data distribution. Current approaches tend to train a task-specific model from scratch on limited publicly available data; because they cannot adequately fuse visual and textual features or generate information-rich text, their performance is often suboptimal. Large multimodal models (LMMs), composed of visual encoders and large language models (LLMs), can recognize images and generate high-quality, knowledge-rich text, making them particularly suitable for image-based text generation tasks. Their emergence provides a novel solution for the medical report generation task. However, LMMs are still in the early stages in the field of Chinese medical report generation, especially in accurately understanding medical images and describing them in standard report language. Moreover, these models have inherent hallucination issues, where the generated responses appear logical but are actually incorrect or unfounded. To address the above problems, this paper proposes a Chinese medical report generation model based on semantic fine-tuning and cross-modal retrieval augmentation (FRCM).
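To make the LMM structure concrete, below is a minimal sketch of LLaVA-style feature fusion, in which visual features are projected into the LLM's embedding space and concatenated with text embeddings. The dimensions and tensor shapes are assumptions for illustration, not FRCM's actual configuration.

```python
# Sketch: LLaVA-style fusion of visual and text embeddings.
# Feature sizes below are assumed, not FRCM's actual dimensions.
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096            # assumed feature sizes
projector = nn.Linear(vision_dim, llm_dim)  # projection layer

patch_features = torch.randn(1, 256, vision_dim)  # local visual features
text_embeddings = torch.randn(1, 32, llm_dim)     # tokenized instruction

visual_embeddings = projector(patch_features)     # map into LLM space
llm_input = torch.cat([visual_embeddings, text_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 288, 4096])
```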
Method
Based on the LMM framework of LLaVA, this paper fine-tunes and adapts the visual encoder and LLM for the medical domain. It proposes a collaborative training strategy using general data and domain-specific data, and introduces a novel cross-modal retrieval-augmented strategy during the inference phase. The paper translates MIMIC-CXR, the largest dataset in the medical report generation domain, into Chinese and uses it as in-domain data for research on Chinese medical report generation. First, considering the characteristics of medical images and Chinese medical reports, the corresponding modules of LLaVA are replaced with a medical visual encoder trained on a large number of medical images and a medical LLM with strong Chinese processing capabilities, allowing the model to better handle data in the medical field. Second, a two-phase training strategy using both general and domain-specific data is employed. In the first training phase, only the projection layer is trained: the domain-specific data enables the model to achieve medical image-text alignment, while the general data enhances the model's generalization capability. In the second training phase, the parameters of the projection layer are further updated, and a low-rank adaptation method is used to fine-tune the LLM: the domain-specific data gives the model the ability to generate professional Chinese medical reports, and the general data improves the model's understanding of complex instructions. Throughout training, medical images are encoded by the visual encoder into global and local feature vectors. The local feature vectors are projected into visual embeddings with the same dimensions as the LLM's embedding space. Medical reports and instructions are tokenized into text embeddings by the LLM's tokenizer and fed into the LLM together with the visual embeddings for training. Finally, to further alleviate the model's hallucination problem, a cross-modal retrieval-augmented strategy is proposed, built around a cross-modal similar-report retrieval module. During inference, the global feature vectors produced by the visual encoder are layer-normalized and passed to the similar-report retrieval module, which performs image-to-report cross-modal retrieval. The retrieved similar reports are then provided to the LLM as additional knowledge, reducing hallucinations and improving the accuracy and robustness of the generated medical reports.
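The cross-modal similar-report retrieval module can be sketched as follows, assuming the report bank is pre-encoded into a shared embedding space and indexed with FAISS (Douze et al., 2024, cited below). The embedding size, the toy report bank, and the use of inner-product search over L2-normalized vectors are assumptions rather than the paper's exact design.

```python
# Sketch: image-to-report retrieval with a FAISS inner-product index
# over normalized report embeddings, queried with the (layer-normalized)
# global image feature. Sizes and normalization choices are assumptions.
import faiss
import numpy as np

d = 512                                    # assumed shared embedding size
report_texts = ["双肺未见明显实变。", "心影增大, 建议复查。"]  # toy report bank
report_vecs = np.random.rand(len(report_texts), d).astype("float32")
faiss.normalize_L2(report_vecs)            # cosine similarity via inner product

index = faiss.IndexFlatIP(d)
index.add(report_vecs)

def retrieve_similar_reports(global_image_feature: np.ndarray, k: int = 2):
    """Return the k reports most similar to the global image feature."""
    q = global_image_feature.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return [(report_texts[i], float(s)) for i, s in zip(ids[0], scores[0])]

print(retrieve_similar_reports(np.random.rand(d)))
```

Inner-product search over normalized vectors is equivalent to cosine similarity, a common choice for cross-modal matching; in the same spirit, the second-phase low-rank adaptation described above is the kind of fine-tuning that standard LoRA implementations provide.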
Result
On the Chinese MIMIC-CXR dataset, compared to the LMMs XrayGLM and XrayPULSE for Chinese medical report generation, FRCM achieved improvements of 10.4%, 10.1%, 9.7%, 9.1%, 6.6%, 9.4%, and 38.4% in BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr scores, respectively. Compared to models fine-tuned on LLaVA and Qwen-VL, FRCM achieved score improvements of 4.1%, 3.1%, 3.3%, 3.6%, and 25.1% in BLEU-1, BLEU-2, BLEU-3, BLEU-4, and CIDEr, respectively. In ablation experiments, both data ablation and module ablation were conducted. The data ablation demonstrated that adding diverse general data during training enhances the model's ability to follow complex instructions, allowing it to better exploit additional knowledge and thereby improving the quality of the generated medical reports. The module ablation revealed that the key components used in FRCM significantly enhance its performance. Furthermore, two case studies demonstrated that the Chinese medical reports generated by FRCM are superior to those produced by other models in terms of accuracy and information richness.
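For readers reproducing the evaluation, the BLEU-n scores above are computed on tokenized reports; below is a minimal sketch using NLTK, assuming character-level tokenization of the Chinese text (the paper's actual tokenization scheme may differ).

```python
# Sketch: character-level BLEU-1..4 for a generated Chinese report.
# Character tokenization is an assumption; the paper may segment differently.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = list("双肺纹理清晰，未见明显实变。")   # ground-truth report
candidate = list("双肺纹理清晰，心影大小正常。")   # generated report

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple([1.0 / n] * n)  # uniform n-gram weights for BLEU-n
    score = sentence_bleu([reference], candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```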
Conclusion
This paper proposes FRCM, aimed at generating Chinese medical reports from medical images. Unlike traditional medical report generation methods, this study leverages LMM techniques to effectively address the challenges of long text generation and imbalanced medical data in the task of medical report generation. However, LMMs are typically pre-trained on extensive general data and have limitations in recognizing medical images and generating specialized medical reports. Based on the LLaVA model framework, this paper utilizes a medical visual encoder and a medical LLM, fine-tuning them semantically. To further mitigate the inherent hallucination problem of LMMs, we designed a similar report retrieval module. This module provides additional knowledge during the inference stage to assist the model in generating more accurate reports. Experimental results show that FRCM performs satisfactorily in the task of Chinese medical report generation.
Keywords: Chinese medical report generation; large multimodal model; retrieval enhancement; semantic fine-tuning; knowledge guidance
Bai J Z, Bai S, Yang S S, Wang S J, Tan S N, Wang P, Lin J Y, Zhou C and Zhou J R. 2023. Qwen-vl: a frontier large vision-language model with versatile abilities [EB/OL]. [2024-07-31]. https://arxiv.org/pdf/2308.12966.pdf
Banerjee S and Lavie A. 2005. Meteor: an automatic metric for MT evaluation with improved correlation with human judgments//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Ann Arbor, USA: Association for Computational Linguistics: 65-72
Chen G H, Chen S N, Zhang R F, Chen J Y, Wu X B, Zhang Z Y, Chen Z H, Li J Q, Wan X and Wang B Y. 2024. Allava: harnessing gpt4v-synthesized data for lite vision-language models [EB/OL]. [2024-07-31]. https://arxiv.org/pdf/2402.11684.pdf
Chen Z H, Song Y, Chang T H and Wan X. 2020. Generating radiology reports via memory-driven transformer//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics: 1439-1449 [DOI: 10.18653/v1/2020.emnlp-main.112]
Deria A, Kumar K, Chakraborty S, Mahapatra D and Roy S. 2024. Inverge: intelligent visual encoder for bridging modalities in report generation//Proceedings of 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Seattle, USA: IEEE: 2028-2038 [DOI: 10.1109/CVPRW63382.2024.00208]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: transformers for image recognition at scale [EB/OL]. [2024-07-31]. https://arxiv.org/pdf/2010.11929.pdf
Douze M, Guzhva A, Deng C Q, Johnson J, Szilvasy G, Mazaré P, Lomeli M, Hosseini L and Jégou H. 2024. The faiss library [EB/OL]. [2024-07-31]. https://arxiv.org/pdf/2401.08281.pdf
Du H J and Liu X L. 2020. Image description generation method based on inhibitor learning. Journal of Image and Graphics, 25(2): 333-342 [DOI: 10.11834/jig.190222]
Harzig P, Chen Y Y, Chen F and Lienhart R. 2019. Addressing data bias problems for chest x-ray image report generation [EB/OL]. [2024-07-31]. https://arxiv.org/pdf/1908.02123.pdf
Hu E, Shen Y L, Wallis P, Allen-Zhu Z, Li Y Z, Wang S, Wang L and Chen W Z. 2022. Lora: low-rank adaptation of large language models//Proceedings of the Tenth International Conference on Learning Representations. [s.l.]: OpenReview.net
Huang Z Z, Zhang X F and Zhang S T. 2023. Kiut: knowledge-injected u-transformer for radiology report generation//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, Canada: IEEE: 19809-19818 [DOI: 10.1109/CVPR52729.2023.01897]
Jing B Y, Xie P T and Xing E. 2018. On the automatic generation of medical imaging reports//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics: 2577-2586 [DOI: 10.18653/v1/P18-1240]
Johnson A E W, Pollard T J, Berkowitz S J, Greenbaum N R, Lungren M P, Deng C, Mark R G and Horng S. 2019. Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1): #317 [DOI: 10.1038/s41597-019-0322-0]
Li C Y, Wong C, Zhang S, Usuyama N, Liu H T, Yang J W, Naumann T, Poon H and Gao J F. 2024. Llava-med: training a large language-and-vision assistant for biomedicine in one day//Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 28541-28564
Li C Y, Liang X D, Hu Z T and Xing E P. 2018. Hybrid retrieval-generation reinforced agent for medical image report generation//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal, Canada: Curran Associates Inc.: 1537-1547
Lin C Y. 2004. Rouge: a package for automatic evaluation of summaries//Proceedings of Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics: 74-81
Liu C, Tian Y H, Chen W D, Song Y and Zhang Y D. 2024a. Bootstrapping large language models for radiology report generation//Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI: 18635-18643 [DOI: 10.1609/aaai.v38i17.29826]
Liu F L, Wu X, Ge S, Fan W and Zou Y X. 2021a. Exploring and distilling posterior and prior knowledge for radiology report generation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 13748-13757 [DOI: 10.1109/CVPR46437.2021.01354]
Liu F L, Yin C C, Wu X, Ge S, Zhang P and Sun X. 2021b. Contrastive attention for automatic chest x-ray report generation//Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Online: Association for Computational Linguistics: 269-280 [DOI: 10.18653/v1/2021.findings-acl.23]
Liu H T, Li C Y, Wu Q Y and Lee Y J. 2024b. Visual instruction tuning//Proceedings of the 37th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc.: 34892-34916
Liu M F, Hu H J, Li L J, Yu Y and Guan W L. 2022. Chinese image caption generation via visual attention and topic modeling. IEEE Transactions on Cybernetics, 52(2): 1247-1257 [DOI: 10.1109/TCYB.2020.2997034]
Liu Z Y, Sun Z Y, Zang Y H, Li W, Zhang P, Dong X Y, Xiong Y J, Lin D H and Wang J Q. 2024c. Rar: retrieving and ranking augmented mllms for visual recognition [EB/OL]. [2024-07-31]. https://arxiv.org/pdf/2403.13805.pdf
Longpre S, Yauney G, Reif E, Lee K, Roberts A, Zoph B, Zhou D, Wei J, Robinson K, Mimno D and Ippolito D. 2024. A pretrainer's guide to training data: measuring the effects of data age, domain coverage, quality, & toxicity//Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics. Mexico City, Mexico: Association for Computational Linguistics: 3245-3276 [DOI: 10.18653/v1/2024.naacl-long.179]
Luo L, Ning J Z, Zhao Y W, Wang Z J, Ding Z Y, Chen P, Fu W R, Han Q Y, Xu G T, Qiu Y Z, Pan D H, Li J R, Li H, Feng W D, Tu S B, Liu Y Q, Yang Z H, Wang J, Sun Y Y and Lin H F. 2024. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks. Journal of the American Medical Informatics Association, 31(9): 1865-1874 [DOI: 10.1093/jamia/ocae037]
Moor M, Huang Q, Wu S, Yasunaga M, Dalmia Y, Leskovec J, Zakka C, Reis E P and Rajpurkar P. 2023. Med-flamingo: a multimodal medical few-shot learner//Proceedings of the 3rd Machine Learning for Health Symposium. New Orleans, USA: PMLR: 353-367
Papineni K, Roukos S, Ward T and Zhu W J. 2002. Bleu: a method for automatic evaluation of machine translation//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, USA: Association for Computational Linguistics: 311-318 [DOI: 10.3115/1073083.1073135]
Petinaux B, Bhat R, Boniface K and Aristizabal J. 2011. Accuracy of radiographic readings in the emergency department. The American Journal of Emergency Medicine, 29(1): 18-25 [DOI: 10.1016/j.ajem.2009.07.011]
Shen Y Q, Chen Z, Mamalakis M, Liu Y G, Li T B, Su Y Z, He J J, Liò P and Wang Y G. 2024. Toursynbio: a multi-modal large model and agent framework to bridge text and protein sequences for protein engineering [EB/OL]. [2024-09-18]. https://arxiv.org/pdf/2408.15299.pdf
Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Hou L, Clark K, Pfohl S, Cole-Lewis H, Neal D, Schaekermann M, Wang A, Amin M, Lachgar S, Mansfield P, Prakash S, Green B, Dominowska E, Arcas B A Y, Tomasev N, Liu Y, Wong R, Semturs C, Mahdavi S S, Barral J, Webster D, Corrado G S, Matias Y, Azizi S, Karthikesalingam A and Natarajan V. 2023. Towards expert-level medical question answering with large language models [EB/OL]. [2024-07-31]. https://arxiv.org/pdf/2305.09617.pdf
Tan Y, Zhang Z X, Li M C, Pan F, Duan H, Huang Z J, Deng H, Yu Z H, Yang C, Shen G Y, Qi P, Yue C Y, Liu Y X, Hong L, Yu H Q, Fan G S and Tang Y. 2024. Medchatzh: a tuning llm for traditional Chinese medicine consultations. Computers in Biology and Medicine, 172: #108290 [DOI: 10.1016/j.compbiomed.2024.108290]
Thawkar O C, Shaker A M, Mullappilly S S, Cholakkal H, Anwer R M, Khan S, Laaksonen J and Khan F S. 2023. Xraygpt: chest radiographs summarization using medical vision-language models//Proceedings of the 23rd Workshop on Biomedical Natural Language Processing. Bangkok, Thailand: Association for Computational Linguistics: 440-448 [DOI: 10.18653/v1/2024.bionlp-1.35]
Tian S B, Jin Q, Yeganova L, Lai P, Zhu Q Q, Chen X Y, Yang Y F, Chen Q Y, Kim W, Comeau D C, Islamaj R, Kapoor A, Gao X and Lu Z Y. 2024a. Opportunities and challenges for chatgpt and large language models in biomedicine and health. Briefings in Bioinformatics, 25(1): #bbad493 [DOI: 10.1093/bib/bbad493]
Tian Y H, Gan R Y, Song Y, Zhang J X and Zhang Y D. 2024b. Chimed-gpt: a Chinese medical large language model with full training regime and better alignment to human preferences//Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics: 7156-7173 [DOI: 10.18653/v1/2024.acl-long.386]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Vedantam R, Zitnick C L and Parikh D. 2015. Cider: consensus-based image description evaluation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE: 4566-4575 [DOI: 10.1109/CVPR.2015.7299087]
Vinyals O, Toshev A, Bengio S and Erhan D. 2015. Show and tell: a neural image caption generator//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, USA: IEEE: 3156-3164 [DOI: 10.1109/CVPR.2015.7298935]
Wang C, Li M Y, He J J, Wang Z R, Darzi E, Chen Z, Ye J, Li T B, Su Y Z, Ke J, Qu K L, Li S X, Yu Y, Liò P, Wang T Y, Wang Y G and Shen Y Q. 2024. A survey for large language models in biomedicine [EB/OL]. [2024-09-18]. https://arxiv.org/pdf/2409.00133.pdf
Wang H C, Liu C, Xi N W, Qiang Z W, Zhao S D, Qin B and Liu T. 2023a. Huatuo: tuning llama model with Chinese medical knowledge [EB/OL]. [2024-07-31]. https://arxiv.org/pdf/2304.06975.pdf
Wang S, Tang L Y, Lin M Q, Shih G, Ding Y and Peng Y F. 2022a. Prior knowledge enhances radiology report generation//AMIA Joint Summits on Translational Science Proceedings. [s.l.]: AMIA: 486-495
Wang Z Y, Han H W, Wang L, Li X and Zhou L P. 2022b. Automated radiographic report generation purely on transformer: a multicriteria supervised approach. IEEE Transactions on Medical Imaging, 41(10): 2803-2813 [DOI: 10.1109/TMI.2022.3171661]
Wang Z Y, Liu L Q, Wang L and Zhou L P. 2023b. R2gengpt: radiology report generation with frozen llms. Meta-Radiology, 1(3): #100033 [DOI: 10.1016/j.metrad.2023.100033]
Wang Z Y, Zhou L P, Wang L and Li X. 2021. A self-boosting framework for automated radiographic report generation//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, USA: IEEE: 2433-2442 [DOI: 10.1109/CVPR46437.2021.00246]
Xing S X, Fang J Z, Ju Z H, Guo Z and Wang Y. 2024. Research on automatic generation of multimodal medical image reports based on memory driven. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi, 41(1): 60-69 [DOI: 10.7507/1001-5515.202304001]
Xiong H L, Wang S, Zhu Y T, Zhao Z H, Liu Y X, Huang L L, Wang Q and Shen D G. 2023. Doctorglm: fine-tuning your Chinese doctor is not a herculean task [EB/OL]. [2024-07-31]. https://arxiv.org/pdf/2304.01097.pdf
Xu D X, Chen Y Y, Wang J Y, Huang Y, Wang H P, Jin Z, Wang H X, Yue W H, He J, Li H and Huang Y. 2024a. Mlevlm: improve multi-level progressive capabilities based on multimodal large language model for medical visual question answering//Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics: 4977-4997 [DOI: 10.18653/v1/2024.findings-acl.296]
Xu Z Y, Feng C, Shao R L, Ashby T, Shen Y, Jin D, Cheng Y, Wang Q F and Huang L F. 2024b. Vision-flan: scaling human-labeled tasks in visual instruction tuning [EB/OL]. [2024-07-31]. https://arxiv.org/pdf/2402.11690.pdf
Yan H, Liu Y L, Jin L W and Bai X. 2023. The development, application, and future of llm similar to chatgpt. Journal of Image and Graphics, 28(9): 2749-2762 [DOI: 10.11834/jig.230536]
Yang S H, Zhao H J, Zhu S B, Zhou G Y, Xu H F, Jia Y X and Zan H Y. 2024. Zhongjing: enhancing the Chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue//Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver, Canada: AAAI: 19368-19376 [DOI: 10.1609/aaai.v38i17.29907]
Yao Y F, Duan J H, Xu K D, Cai Y F, Sun Z B and Zhang Y. 2024. A survey on large language model (llm) security and privacy: the good, the bad, and the ugly. High-Confidence Computing, 4(2): #100211 [DOI: 10.1016/j.hcc.2024.100211]
Zhang C, Liwicki S and Cipolla R. 2022. Beyond the cls token: image reranking using pretrained vision transformers//Proceedings of the 33rd British Machine Vision Conference 2022. London, UK: BMVA Press: #80
Zhang J F, Han X X, Yang Z H, Wang Z C, Zheng J J, Yang Z M and Zhu J M. 2021. Radiology residency training in China: results from the first retrospective nationwide survey. Insights into Imaging, 12(1): #25 [DOI: 10.1186/s13244-021-00970-2]
Zhang L Y, Shu J H, Hu J L, Li F F, He J J, Wang P and Shen Y Q. 2024. Exploring the potential of large language models in radiological imaging systems: improving user interface design and functional capabilities. Electronics, 13(11): #2002 [DOI: 10.3390/electronics13112002]
Zhang S, Xu Y B, Usuyama N, Xu H W, Bagga J, Tinn R, Preston S, Rao R, Wei M, Valluri N, Wong C, Tupini A, Wang Y, Mazzola M, Shukla S, Liden L, Gao J F, Lungren M P, Naumann T, Wang S and Poon H. 2023. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs [EB/OL]. [2024-07-31]. https://arxiv.org/pdf/2303.00915.pdf
Zhang Y X, Wang X S, Xu Z Y, Yu Q H, Yuille A and Xu D G. 2020. When radiology report generation meets knowledge graph//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI: 12910-12917 [DOI: 10.1609/aaai.v34i07.6989]