Multi-label classification of chest X-ray images with a pre-trained vision Transformer model

Xing Suxia, Ju Zihan, Liu Zijiao, Wang Yu, Fan Fuqiang (Beijing Technology and Business University, Beijing 100048, China)

Abstract
Objective Computer-based detection and classification of diseases in chest X-ray images currently suffer from high misdiagnosis rates and low accuracy. Building on a pre-trained vision Transformer (ViT) model, this paper uses transfer learning to realize computer-aided diagnosis of chest X-ray images and to improve diagnostic accuracy and efficiency. Method A ViT model with a convolutional neural network (CNN) front end, pre-trained on a very large natural-image dataset, is adopted. The model structure is fine-tuned, the backbone network is initialized with the pre-trained ViT parameters, and the model is then trained again on chest X-ray datasets to perform multi-label disease classification. Result On the IU X-Ray dataset, the mean AUC (area under the ROC curve) of the ViT model before and after transfer learning is compared. The pre-trained ViT model achieves a mean AUC of 0.774, an improvement of 0.208 over training without transfer learning. Ablation experiments on model structure and data preprocessing are conducted, and the attention mechanism of the ViT is visualized, further verifying the model's effectiveness. Finally, the fine-tuned ViT model is trained on the ChestX-ray14 and CheXpert datasets, reaching mean AUC scores of 0.839 and 0.806, improvements of 0.014 to 0.031 over comparison methods. Conclusion Compared with other methods, the ViT model achieves higher multi-label classification accuracy on chest X-ray images, and transfer learning improves the classification performance and generalization of the ViT model while reducing training cost. Ablation experiments and model visualization show that a ViT model incorporating a CNN structure can focus on meaningful regions and efficiently extract visual features from chest X-ray images.
Keywords
Multi-label classification of chest X-ray images with pre-trained vision Transformer model

Xing Suxia, Ju Zihan, Liu Zijiao, Wang Yu, Fan Fuqiang(Beijing Technology and Business University, Beijing 100048, China)

Abstract
Objective Chest X-ray screening and diagnosis are essential in modern radiology. Interpretation of chest X-ray images still depends heavily on clinical experience and is prone to misdiagnosis and missed diagnoses. Automatically detecting and identifying one or more potential diseases in an image with computer-based techniques can improve diagnostic efficiency and accuracy. Compared with natural images, chest X-ray images make it harder to detect and distinguish multiple lesions accurately within a single image, because abnormal areas occupy a small proportion of the image and have complex appearances. Deep learning models based on convolutional neural networks (CNNs) have been widely used in medical imaging. The convolution kernel is sensitive to local detail and can extract rich image features, but it cannot capture global information, and the extracted features are contaminated by redundant information such as background, muscles, and bones, which limits performance on multi-label classification tasks to a certain extent. Recently, the vision Transformer (ViT) model has shown advantages in computer-vision tasks: it can capture information from multiple regions of the entire image simultaneously and effectively. However, it requires large-scale training data to achieve good performance, and factors such as patient privacy and manual annotation cost limit the size of chest X-ray datasets. To reduce the model's dependence on data scale and improve multi-label classification performance, we apply a CNN-based pre-trained ViT model with transfer learning for computer-aided diagnosis and multi-label classification of chest X-ray images.
Method The CNN-based ViT model is pre-trained on a very large-scale labeled dataset to obtain initial model parameters, and its structure is fine-tuned to the characteristics of the chest X-ray data. A 1×1 convolution layer converts the single-channel chest X-ray image to the three channels the backbone expects. The number of output nodes of the linear layer in the classifier is changed from 1 000 to the number of chest X-ray classification labels, with sigmoid as the activation function. The backbone network is initialized with the pre-trained ViT parameters and then trained on the chest X-ray dataset to perform multi-label classification. The experiments use Python 3.7 and PyTorch 1.8 to build the model and an RTX 3090 GPU for training, with the stochastic gradient descent (SGD) optimizer, the binary cross-entropy (BCE) loss function, an initial learning rate of 1E-3, and cosine-annealing learning-rate decay. For training, each image is scaled to 512×512 pixels and a 224×224-pixel region is randomly cropped as model input; data augmentation randomly applies flipping, perspective transformation, shearing, translation, zooming, and brightness changes. For testing, each chest X-ray image is scaled to 256×256 pixels and a 224×224 center crop is fed to the trained model. Result The experiments are performed on IU X-Ray, a small-scale chest X-ray dataset, and the model is evaluated quantitatively by the mean area under the ROC curve (AUC) across all classification labels. The results show that the pre-trained ViT model reaches a mean AUC of 0.774, while the accuracy and training efficiency of the non-pre-trained ViT model drop significantly: its mean AUC is only 0.566, which is 0.208 lower.
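The training configuration above (SGD, BCE loss with sigmoid outputs, initial learning rate 1E-3, cosine-annealing decay) might be set up in PyTorch roughly as follows. The momentum value, the `T_max` horizon, and the toy model and data are assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

# toy stand-in for the fine-tuned ViT: linear layer + sigmoid per label
model = nn.Sequential(nn.Linear(768, 14), nn.Sigmoid())
criterion = nn.BCELoss()  # binary cross-entropy over per-label probabilities
# initial learning rate 1E-3 as in the paper; momentum is an assumption
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = CosineAnnealingLR(optimizer, T_max=50)  # T_max is an assumption

for epoch in range(2):  # shortened loop with random data for illustration
    feats = torch.randn(4, 768)
    targets = torch.randint(0, 2, (4, 14)).float()
    optimizer.zero_grad()
    loss = criterion(model(feats), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()  # cosine-annealed learning-rate decay
```

A full pipeline would replace the random tensors with the resized, randomly cropped, and augmented chest X-ray batches described above.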
In addition, attention heat maps are generated from the ViT model to strengthen its interpretability, and a series of ablation experiments examine data augmentation, model structure, and batch size. The fine-tuned ViT model is also trained on the ChestX-ray14 and CheXpert datasets, reaching mean AUC scores of 0.839 and 0.806, improvements of 0.014 and 0.031 over comparison methods. Conclusion A pre-trained ViT model is applied to multi-label classification of chest X-ray images via transfer learning. The experimental results show that the ViT has strong multi-label classification performance on chest X-ray images, and its attention mechanism helps it focus precisely on lesion-relevant regions such as the interior of the chest cavity and the heart. Transfer learning improves the classification performance and generalization of the ViT on small-scale datasets while greatly reducing training cost. Ablation experiments demonstrate that the combined CNN-Transformer model outperforms single-structure models, and that data augmentation and smaller batch sizes improve performance, although smaller batches lengthen training time. Future research can focus on extracting complex disease information and high-level semantics, such as small lesions, disease location, and severity.
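The evaluation metric used above, mean AUC across labels, can be sketched as follows: the AUC for each label is computed with the Mann-Whitney pairwise formulation and then averaged. The data below are toy values for illustration, not results from the paper.

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC for one binary label via the Mann-Whitney pairwise formulation."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # AUC undefined without both classes
    # fraction of (positive, negative) pairs ranked correctly, ties count half
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def mean_auc(y_true, y_score):
    """Average AUC over all classification labels (columns)."""
    return float(np.nanmean([auc_score(y_true[:, k], y_score[:, k])
                             for k in range(y_true.shape[1])]))

# toy example: 4 samples, 2 labels
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.3, 0.8], [0.8, 0.3], [0.1, 0.4]])
print(round(mean_auc(y_true, y_score), 3))  # → 0.875
```

In practice the same average could be obtained with `sklearn.metrics.roc_auc_score` applied per label; the explicit version here just makes the pairwise definition visible.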
Keywords
