Zhu Yi, Li Xiu (Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055)

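Abstract
As medical imaging technology keeps improving, the number of medical reports that radiologists must write each day keeps growing. With the rise of deep learning, deep-learning-based medical image captioning has been used to generate medical reports automatically, with notable results. This paper comprehensively surveys recent work on deep medical image captioning, including the latest methods, datasets, and evaluation metrics in the field, analyzes their respective strengths and weaknesses, and presents them organized by model structure; it is the first survey in China devoted to the medical image captioning task. Current deep medical image captioning techniques mainly extend the encoder-decoder structure, including but not limited to adding retrieval methods, template matching methods, attention mechanisms, reinforcement learning, and knowledge graphs. Retrieval and template matching methods are simple, but owing to the particular nature of medical reports they still perform well on this task; attention mechanisms let the model focus on a particular part of the image and text when generating a report and have been adopted by almost all mainstream models; reinforcement learning overcomes the bottleneck in medical image captioning that gradient-descent training does not match discrete language generation evaluation metrics; knowledge graph methods incorporate human physicians' prior knowledge of diseases and effectively improve the clinical accuracy of generated reports. In addition, new architectures such as the Transformer are increasingly replacing recurrent neural networks (RNN) and even convolutional neural networks (CNN) as the network backbone. Finally, this paper discusses the problems that deep medical image captioning still needs to solve and future research directions, in the hope of advancing the technique toward real-world deployment.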
A survey of medical image captioning technique: encoding, decoding and latest advance

Zhu Yi, Li Xiu (Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China)

Medical image captioning is a labor-intensive daily task for radiologists. Emerging deep medical image captioning techniques have the potential to generate medical captions automatically. This survey addresses three challenges: 1) organizing the material into a feasible and clear structure for readers; 2) strengthening the deep medical image captioning task itself; 3) improving the methods introduced. First, the aims and objectives are identified. Then, the literature on deep medical image captioning up to 2021 is reviewed, including the latest methods, datasets, and evaluation metrics, together with a comparative analysis of the medical and generic image captioning tasks. The techniques are introduced according to their network structure. Current deep medical image captioning techniques mainly extend the encoder-decoder structure by adding retrieval-based methods, template-matching methods, attention mechanisms, reinforcement learning, and knowledge graphs. Specifically, the encoder-decoder structure combines a convolutional neural network (CNN) for image feature extraction with a recurrent neural network (RNN) for caption generation, and the two networks are linked by an intermediate vector called the context vector. Some models are based on a CNN-RNN-RNN structure, known as a hierarchical RNN or hierarchical long short-term memory (LSTM) network. This structure stacks two kinds of RNN: one generates topic vectors and the other generates the caption, with caption generation guided by the topic vector. Although retrieval-based and template-matching methods are relatively simple, they remain effective because medical captions feature highly repetitive content and distinctive sentence patterns. The attention mechanism lets the model focus on a particular part of the image and sentence while the caption is generated, and it makes the length of the context vector variable.
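The attention step mentioned above can be illustrated with a minimal sketch. It assumes plain dot-product scoring between a decoder state and a few image-region feature vectors; the function names `softmax` and `attend` and the toy values are hypothetical and are not taken from any surveyed model, which would typically use learned projections on CNN feature maps.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, decoder_state):
    """Dot-product attention (an assumed scoring function).

    Each image region is scored by its similarity to the decoder state;
    the context vector is the weighted average of the region features.
    """
    scores = [
        sum(f * h for f, h in zip(region, decoder_state))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attend(regions, state)
```

Here the first and third regions align with the decoder state and receive equal, larger weights than the second, so the context vector is dominated by their features. The context is recomputed at every decoding step.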
Reinforcement learning (RL) applied to medical image captioning can also alleviate the mismatch between gradient-descent training and discrete language generation evaluation metrics. RL can further be used in a multi-agent form that guides the form of the output before the decoder runs, yielding well-balanced and logical medical content. A knowledge graph integrates expert prior knowledge into the model: diseases with similar features occupy nearby nodes in the graph, and disease information is updated through graph convolution. Integrating a medical knowledge graph effectively improves the clinical accuracy of the generated report. These methods are compatible with one another; for example, template-matching methods and attention-based RL can be used simultaneously. In addition, Transformer-related structures are increasingly used as the backbone network in place of RNNs and CNNs. The Transformer, or self-attention block, can be trained in parallel and can capture long-distance dependencies between tokens, which makes it a better feature extractor. Popular datasets for deep medical image captioning are IU X-Ray and MIMIC-CXR, in which frontal and lateral chest X-ray images are paired with a single report made up of multiple sentences. Medical annotations such as Medical Subject Headings (MeSH) or Unified Medical Language System (UMLS) keywords help generate more accurate reports because they can be treated as extra information, and classifying these tags can serve as a pretraining task. Generic natural language generation metrics are used to evaluate the reports generated by deep medical image captioning models. Newer metrics such as SPICE, SPIDEr, and BERTScore have been developed beyond the existing BLEU-n, ROUGE, METEOR, and CIDEr scores.
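To make the n-gram metrics above concrete, the following sketch computes the clipped (modified) n-gram precision that underlies BLEU-n. It covers only that component under simplifying assumptions: a single reference, whitespace tokenization, and no brevity penalty or geometric averaging across n. The helper names and the two toy sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision against a single reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # candidate is shorter than n tokens
    clipped = sum(
        min(count, ref_counts[gram]) for gram, count in cand_counts.items()
    )
    return clipped / total


# Toy report sentences (illustrative only, not drawn from any dataset).
reference = "the heart size is normal".split()
candidate = "the heart is normal".split()
p1 = modified_precision(candidate, reference, 1)  # all 4 unigrams match
p2 = modified_precision(candidate, reference, 2)  # 2 of 3 bigrams match
```

The candidate drops the word "size" yet still scores a unigram precision of 1.0, which hints at why n-gram overlap alone may not reflect the clinical content of a report.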
Finally, four future research directions are outlined. 1) More diverse and more accurate datasets, covering other modalities such as magnetic resonance imaging (MRI) and color Doppler ultrasound. Current datasets mostly consist of chest X-ray images and are therefore limited to a single body part and a single modality; broader datasets would make models more robust and more adaptable to various tasks. 2) Evaluation metrics that are more clinically accurate and more cost-effective than generic natural language generation (NLG) metrics such as BLEU and ROUGE. Existing generic NLG metrics are not the best way to evaluate medical text, and better metrics would make better use of radiologists' time. 3) Unsupervised and semi-supervised methods that lower the dataset-related cost of the medical image captioning task. Building on existing pre-trained models such as ViLBERT and VL-BERT can reduce both the cost and the number of training samples required. 4) Integration of more prior knowledge into the model, and more detailed report generation through multi-round conversation.