Zhao Xiaoming1, Liao Yuehui1, Zhang Shiqing2, Fang Jiangxiong2, He Xiaxia2, Wang Guoyu2, Lu Hongsheng2 (1. Hangzhou Dianzi University; 2. Taizhou Central Hospital, Taizhou University)
Objective Computer-aided detection and classification of breast tumors in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) currently suffer from low accuracy and a lack of available datasets. Method To address these problems, we build a usable breast DCE-MRI image dataset and propose a Local-Global Cross Attention Fusion Network (LG-CAFN), which fuses a convolutional neural network (CNN) branch oriented toward local feature learning with a Vision Transformer (ViT) branch oriented toward global feature learning, to automatically diagnose breast tumor DCE-MRI images and improve the accuracy and efficiency of breast cancer diagnosis. The network uses a cross-attention mechanism to effectively fuse the local image features extracted by the CNN branch with the global image features extracted by the ViT branch, yielding more discriminative image features for benign/malignant classification of breast tumor DCE-MRI images. Result On the collected breast cancer DCE-MRI dataset, two sets of experiments covering different breast DCE-MRI sequences were conducted, comparing LG-CAFN with VGG16, deep residual networks (ResNet), Squeeze-and-Excitation Networks (SENet), ViT, and SwinT (Swin Transformer). The experimental results show that LG-CAFN achieves higher accuracy than the other methods in both sets of experiments. Conclusion The proposed LG-CAFN has excellent local-global feature learning ability and can effectively improve the benign/malignant classification performance on breast tumor DCE-MRI images.
A breast tumor DCE-MRI image classification method integrating Vision Transformer with CNN
Zhao Xiaoming1, Liao Yuehui1, Zhang Shiqing2, Fang Jiangxiong2, He Xiaxia2, Wang Guoyu2, Lu Hongsheng2 (1. Hangzhou Dianzi University; 2. Taizhou Central Hospital, Taizhou University)
Objective Among women in the United States, breast cancer is the most frequently detected type of cancer, except for nonmelanoma skin cancer. It ranks as the second highest cause of cancer-related deaths in women, following lung cancer. Breast cancer cases have been on the rise in recent years, but the number of deaths caused by breast cancer has remained steady or decreased, likely owing to improved early detection techniques and more effective treatment options. Magnetic resonance imaging (MRI), especially dynamic contrast-enhanced (DCE)-MRI, has shown promising results in screening women at high risk of breast cancer and in staging newly diagnosed patients. As a result, MRI, especially DCE-MRI, is increasingly recognized as a valuable adjunct diagnostic tool for the timely detection of breast cancer. With the development of artificial intelligence, many deep learning models based on Convolutional Neural Networks (CNN), such as VGG and ResNet, have been widely used in medical image analysis. These models automatically extract deep features from images, eliminating the need for hand-crafted feature extraction and saving considerable time and effort. However, CNNs struggle to capture global information, which is highly valuable for the diagnosis of breast tumors in medical images. To acquire global information, the Vision Transformer (ViT) has been proposed and has achieved remarkable results in computer vision (CV) tasks. ViT uses a convolution operation to split the entire input image into many small image patches. ViT can then process these patches simultaneously with multi-head self-attention layers and capture global information across different regions of the input image. However, ViT inevitably loses local information while capturing global information.
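The patch-splitting step described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation; the 224×224 input size and 16×16 patch size are standard ViT defaults assumed here for concreteness.

```python
import numpy as np

# Hedged sketch: split a 224x224 RGB image into 16x16 patches and
# flatten each patch into one token row, as ViT does before feeding
# tokens to its multi-head self-attention layers.
img = np.zeros((224, 224, 3))                    # H x W x C input image
P = 16                                           # assumed patch size
H, W, C = img.shape
patches = img.reshape(H // P, P, W // P, P, C)   # cut both spatial axes
patches = patches.transpose(0, 2, 1, 3, 4)       # group the 14x14 patch grid
tokens = patches.reshape(-1, P * P * C)          # one flattened row per patch
print(tokens.shape)                              # (196, 768)
```

Each of the 196 tokens is then linearly projected to the model dimension before self-attention; in practice this split-and-project is usually implemented as a single strided convolution.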
To integrate the advantages of CNNs and ViT, several studies have combined the two to obtain more comprehensive feature representations and achieve better performance in breast tumor diagnosis tasks. Method Motivated by these observations, we propose a novel cross attention fusion network based on CNN and ViT, which simultaneously extracts local detail information with the CNN and global information with the ViT. A Non-local block then fuses this information to classify breast tumor DCE-MRI images. The model consists of three main parts: local CNN and global ViT branches, a feature coupling unit (FCU), and a cross attention fusion module. The CNN subnetwork uses SENet to capture local information, while the ViT subnetwork captures global information. The feature maps extracted by the two branches usually have different dimensions; to address this, we adopt the FCU to eliminate the feature dimension misalignment between the branches. Finally, the Non-local block computes the correspondences between the two inputs. We adopt the first two stages (stage-1 and stage-2) of SENet50 as the local CNN subnetwork and a 7-layer ViT (ViT-7) as the global subnetwork. Each stage of SENet50 is composed of residual blocks and SE blocks. Each residual block contains a 1×1 convolution layer, a 3×3 convolution layer, and a 1×1 convolution layer, and each SE block contains a global average pooling layer, two FC layers, and a sigmoid activation function. The number of residual blocks and SE blocks is set to 3 in stage-1 and 4 in stage-2. The 7-layer ViT contains seven encoder layers, each of which includes two LayerNorms, a multi-head self-attention module, and a simple MLP block. The FCU contains a 1×1 convolution, a BatchNorm layer, and nearest neighbor interpolation.
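The SE block described above (global average pooling, two FC layers, sigmoid gating) can be sketched in plain numpy. This is an illustrative toy with made-up sizes (8 channels, reduction to 2), not the SENet50 configuration used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation over a (C, H, W) feature map:
    global average pool -> FC (reduce) + ReLU -> FC (restore) -> sigmoid,
    then rescale each channel by its learned gate in (0, 1)."""
    c = x.shape[0]
    s = x.mean(axis=(1, 2))            # squeeze: global average pooling -> (C,)
    z = np.maximum(s @ w1 + b1, 0.0)   # first FC with channel reduction + ReLU
    g = sigmoid(z @ w2 + b2)           # second FC restores C channel gates
    return x * g.reshape(c, 1, 1)      # excite: per-channel rescaling

# Toy check with C=8 and an assumed reduction ratio of 4
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1, b1 = rng.standard_normal((8, 2)), np.zeros(2)
w2, b2 = rng.standard_normal((2, 8)), np.zeros(8)
y = se_block(x, w1, b1, w2, b2)
print(y.shape)   # (8, 4, 4)
```

Because the sigmoid gate lies in (0, 1), the block can only attenuate channels, which is what lets the network emphasize informative feature maps relative to less useful ones.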
The Non-local block consists of four 1×1 convolutions and a Softmax function. Result We compare the model with five other deep learning models, VGG16, ResNet50, SENet50, ViT, and SwinT (Swin Transformer), and conduct two sets of experiments using different breast tumor DCE-MRI sequences to evaluate the robustness and generalization of the model. The quantitative evaluation metrics are accuracy and area under the ROC curve (AUC). Compared with VGG16 and ResNet50 in the two sets of experiments, accuracy increased by 3.7% and 3.6%, and AUC increased by 0.045 and 0.035 on average, respectively. Compared with SENet50 and ViT-7 in the two sets of experiments, accuracy increased by 3.2% and 1.1%, and AUC increased by 0.035 and 0.025 on average, respectively. Compared with SwinT in the two sets of experiments, accuracy increased by 3.0% and 2.6%, and AUC increased by 0.05 and 0.03, respectively. In addition, Class Activation Maps (CAM) of the learned feature representations are generated to improve the interpretability of the models. Finally, we conduct a series of ablation experiments to verify the effectiveness of the proposed method. Specifically, we compare different fusion strategies, such as feature-level fusion and decision-level fusion, with our cross attention fusion module. Compared with feature-level fusion in the two sets of experiments, accuracy increased by 1.6% and 1.3%, and AUC increased by 0.03 and 0.02, respectively. Compared with decision-level fusion in the two sets of experiments, accuracy increased by 0.7% and 1.8%, and AUC increased by 0.02 and 0.04, respectively. These experimental results demonstrate the effectiveness of our method for the breast tumor DCE-MRI image classification task. Conclusion In this study, a novel cross attention fusion network based on a local CNN and a global Vision Transformer (LG-CAFN) is proposed for benign and malignant classification of breast DCE-MRI images.
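The cross attention computed by the Non-local block can be sketched as scaled dot-product attention between the two branches' features. In this hedged numpy sketch the four learned matrix projections stand in for the block's four 1×1 convolutions, the 2D feature maps are assumed already flattened to token rows, and all sizes are illustrative, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_nonlocal(a, b, wq, wk, wv, wo):
    """Non-local style cross attention between feature sets a (N, C) and
    b (M, C): queries from a, keys/values from b, so each position in a
    aggregates information from every position in b."""
    q, k, v = a @ wq, b @ wk, b @ wv                 # three 1x1-conv-style projections
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, M) correspondence map
    return a + (attn @ v) @ wo                       # fourth projection + residual

rng = np.random.default_rng(1)
C = 16                               # assumed channel dimension
a = rng.standard_normal((196, C))    # e.g. ViT branch tokens (toy sizes)
b = rng.standard_normal((196, C))    # e.g. flattened CNN branch feature map
ws = [rng.standard_normal((C, C)) * 0.1 for _ in range(4)]
out = cross_nonlocal(a, b, *ws)
print(out.shape)   # (196, 16)
```

The residual connection keeps the branch's own features intact while the attention term injects complementary information from the other branch, which is the fusion behavior the ablation experiments compare against plain feature-level and decision-level fusion.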
Extensive experiments demonstrate the superior performance of our method compared with several state-of-the-art methods. Although LG-CAFN is applied here only to the diagnosis of breast tumor DCE-MRI images, the approach can be readily transferred to other medical image diagnostic tasks. Therefore, in future work we intend to extend our approach to other modalities, such as breast ultrasound and breast CT images. In addition, we will explore automatic segmentation of breast DCE-MRI images to analyze them more comprehensively and help radiologists make more accurate diagnoses.