Abstract: Objective Spine CT images suffer from poorly displayed tissue structure, low contrast, and noise interference. Traditional segmentation algorithms have low accuracy, require manual intervention, often achieve only semi-automatic segmentation, and cannot meet real-time requirements. The U-Net model based on convolutional neural networks (CNN) has become the standard for medical image segmentation, but it is still limited in modeling long-range interactions. The Transformer integrates a global self-attention mechanism that captures long-range feature dependencies and has shown great advantages in computer vision in recent years. This paper proposes TransAGUNet (transformer attention gate u-net), a hybrid CNN-Transformer segmentation model, to achieve efficient automatic segmentation of spine CT images. Method The proposed model combines the Transformer, the attention gate (AG) mechanism, and U-Net into an encoder-decoder structure. The encoder uses a hybrid Transformer-CNN architecture to extract local and global features; the decoder uses a CNN architecture and incorporates AGs into the skip connections, concatenating the attention map of each downsampled feature map with the upsampled feature map from the next layer, fusing low-level and high-level features for finer segmentation. The experiments use the sum of Dice loss and weighted cross-entropy as the loss function to address the imbalance between positive and negative samples. Result On the VerSe2020 dataset, the Dice coefficient of the proposed algorithm is 4.47%, 2.09%, 2.44%, and 2.23% higher than those of the mainstream CNN segmentation models U-Net, Attention U-Net, U-Net++, and U-Net3+, respectively, and 2.25% and 1.08% higher than those of the recent hybrid Transformer-CNN segmentation models TransUNet and TransNorm. Conclusion The proposed algorithm outperforms the above six segmentation models on spine CT images, effectively improving segmentation accuracy with good real-time performance.
Spine CT image segmentation based on Transformer
LU Ling, QI Weimin (Jianghan University)
Abstract: Objective Spine disease has become highly prevalent in the contemporary era and increasingly affects younger people, so its diagnosis and treatment are particularly critical. In computer-aided diagnosis, segmenting the spine region from the background in spine CT images, combined with 3D reconstruction technology, can help physicians observe spine lesion areas more clearly and provides theoretical support for simulating surgical paths and planning surgery. The accuracy of spine CT image segmentation is therefore critical: efficient and accurate segmentation can restore the actual position and physiological shape of the patient's vertebrae to the greatest extent possible, allowing physicians to better grasp the distribution of lesions. Segmentation is made difficult by the complex structure of the spine and by the poorly displayed tissue structure, low contrast, and noise interference in spine CT images. Manual annotation relies on the physician's prior knowledge and clinical experience; the results are highly subjective and time-consuming to produce, and long working hours may introduce deviations that affect the patient's diagnosis. Traditional computer-assisted segmentation methods mainly rely on low-level features of the image such as texture, shape, and color; they often achieve only semi-automatic segmentation, underuse the image information, and yield low accuracy that cannot meet the demand for real-time segmentation. Segmentation methods based on deep learning can achieve automatic segmentation, effectively extract image features, and improve segmentation accuracy.
In medical image analysis, a branch of computer vision, segmentation algorithms based on convolutional neural networks (CNN) have been proposed one after another and have become the mainstream research direction. Among them, the characteristics of the U-Net structure, together with the fixed anatomical structure and multimodality of medical images, make U-Net perform well in medical image segmentation and have made it the benchmark for the task. However, the inherent locality of the convolutional structure limits long-range interactions. In contrast, the Transformer, a non-CNN architecture, integrates a global self-attention mechanism to capture long-range feature dependencies and is widely used in natural language processing tasks such as machine translation and text classification. In recent years, researchers have introduced the Transformer into computer vision and achieved state-of-the-art results in tasks such as image classification and image segmentation. In this paper, we combine the advantages of the CNN architecture and the Transformer and propose TransAGUNet (transformer attention gate u-net), a hybrid CNN-Transformer segmentation model, to achieve efficient automatic segmentation of spine CT images. Method The proposed model combines the Transformer, U-Net, and the attention gate (AG) mechanism into an encoder-decoder structure. The encoder uses a hybrid Transformer-CNN architecture consisting of ResNet50 and the ViT model. For each sliced spine CT image, ResNet50 first extracts low-level features, and the feature maps from three downsampling stages are retained; patch embedding and position embedding are then performed, and the resulting patches are fed into the Transformer encoder to learn long-range contextual dependencies and extract global features. The decoder adopts a CNN architecture that recovers the image size layer by layer using 2D bilinear upsampling at a rate of 2×.
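The patch-embedding step described above can be sketched as follows. This is a minimal NumPy illustration of TransUNet-style hybrid encoding, not the authors' implementation; the feature-map size, patch size, projection dimension of 768, and random weight initialisation are assumptions for demonstration only:

```python
import numpy as np

def patch_embed(feature_map, patch_size, embed_dim, rng):
    """Split a CNN feature map into flattened patches, linearly project
    each patch to embed_dim, and add a (here randomly initialised)
    position embedding, producing tokens for the Transformer encoder."""
    c, h, w = feature_map.shape
    assert h % patch_size == 0 and w % patch_size == 0
    n = (h // patch_size) * (w // patch_size)  # number of patches
    # Rearrange (C, H, W) -> (N, patch_size * patch_size * C)
    patches = (feature_map
               .reshape(c, h // patch_size, patch_size, w // patch_size, patch_size)
               .transpose(1, 3, 2, 4, 0)
               .reshape(n, patch_size * patch_size * c))
    w_proj = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02  # linear projection
    pos = rng.standard_normal((n, embed_dim)) * 0.02                    # position embedding
    return patches @ w_proj + pos

rng = np.random.default_rng(0)
fmap = rng.standard_normal((256, 16, 16))  # e.g. a ResNet50 feature map, C=256, 16x16
tokens = patch_embed(fmap, patch_size=1, embed_dim=768, rng=rng)
print(tokens.shape)  # (256, 768): one token per spatial position
```

With a CNN backbone providing an already-downsampled feature map, a patch size of 1 (one token per spatial location) is the common choice in such hybrids.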
The AG structure is incorporated into the bottom three skip connections to fuse shallow features with higher-level features for fine segmentation: the attention map corresponding to the downsampled features is computed, concatenated with the upsampled features of the next layer, and then decoded by two ordinary convolutions and one 1×1 convolution. Finally, a binary classifier distinguishes foreground from background pixel by pixel to obtain the spine segmentation prediction map. The AG adds few parameters, is easily integrated into CNN models, and can automatically learn the shape and size of the target to highlight salient features and suppress feature responses in irrelevant regions; its probability-based soft attention replaces an explicit localization module, eliminating the need to delineate an ROI, and improves the sensitivity and accuracy of the model at little computational cost. The experiments use the sum of Dice loss and weighted cross-entropy loss as the loss function to address the uneven distribution of positive and negative samples. Result The proposed algorithm was tested on the VerSe2020 dataset. Its Dice coefficient improved by 4.47%, 2.09%, 2.44%, and 2.23% over the mainstream CNN segmentation networks U-Net, Attention U-Net, U-Net++, and U-Net3+, respectively, and by 2.25% and 1.08% over the excellent recent hybrid Transformer-CNN segmentation models TransUNet and TransNorm.
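The attention gate over a skip connection can be sketched as follows. This minimal NumPy sketch assumes the additive soft-attention formulation of Attention U-Net, sigmoid(psi(ReLU(Wx·x + Wg·g))), with randomly initialised 1×1-convolution weights for illustration; it is not the authors' exact implementation:

```python
import numpy as np

def attention_gate(x, g, inter_ch, rng):
    """Additive attention gate on a skip connection.
    x: skip features (C_x, H, W) from the encoder;
    g: gating features (C_g, H, W) from the coarser decoder level,
       assumed already brought to the same spatial size.
    Returns x weighted by a per-pixel attention map in (0, 1)."""
    wx = rng.standard_normal((inter_ch, x.shape[0])) * 0.1   # 1x1 conv on x
    wg = rng.standard_normal((inter_ch, g.shape[0])) * 0.1   # 1x1 conv on g
    psi = rng.standard_normal((1, inter_ch)) * 0.1           # 1x1 conv to 1 channel
    q = np.maximum(np.einsum('ic,chw->ihw', wx, x)
                   + np.einsum('ic,chw->ihw', wg, g), 0.0)   # ReLU(Wx·x + Wg·g)
    alpha = 1.0 / (1.0 + np.exp(-np.einsum('oi,ihw->ohw', psi, q)))  # sigmoid attention map
    return x * alpha  # suppress responses in irrelevant regions

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))   # skip features
g = rng.standard_normal((128, 8, 8))  # gating signal from the decoder
out = attention_gate(x, g, inter_ch=32, rng=rng)
print(out.shape)  # (64, 8, 8)
```

Because the attention map lies in (0, 1), the gate can only scale features down, which is what lets it act as a probability-based soft localisation module.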
To verify the validity of the proposed model, corresponding ablation experiments were conducted. The results show that the decoding structure designed in this paper improves the Dice coefficient by 0.75% over TransUNet, and by a further 1.5% after adding AG. To explore the effect of the number of AG connections on model performance, experiments were conducted with different numbers of AGs; the results show that the Dice coefficient is smallest without any AG, and that performance is optimal when AGs are added to the three skip connections at the 1/2, 1/4, and 1/8 resolution scales. Conclusion The proposed algorithm achieves the best segmentation results on spine CT images among the six CNN and hybrid Transformer-CNN segmentation models above, effectively improving the segmentation accuracy of spine CT images with good real-time performance.
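For concreteness, the loss described in the Method section, the sum of Dice loss and weighted cross-entropy, can be sketched as follows. This is a minimal NumPy sketch for binary segmentation; the foreground weight of 2.0 is an illustrative assumption, as the abstract does not state the exact class weights:

```python
import numpy as np

def dice_plus_weighted_ce(pred, target, pos_weight=2.0, eps=1e-6):
    """Combined loss for binary segmentation under class imbalance.
    pred:   predicted foreground probabilities, shape (H, W), in (0, 1)
    target: ground-truth mask, shape (H, W), values {0, 1}
    pos_weight up-weights the (rarer) foreground class."""
    # Dice loss: 1 - 2|P ∩ G| / (|P| + |G|)
    inter = (pred * target).sum()
    dice_loss = 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    # Weighted binary cross-entropy, averaged over pixels
    ce = -(pos_weight * target * np.log(pred + eps)
           + (1.0 - target) * np.log(1.0 - pred + eps)).mean()
    return dice_loss + ce

pred = np.array([[0.9, 0.1], [0.2, 0.8]])
target = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = dice_plus_weighted_ce(pred, target)
print(loss)  # small loss for a prediction close to the mask
```

The Dice term directly optimises region overlap regardless of class frequency, while the weighted cross-entropy term stabilises pixel-wise training, which is why the two are commonly summed for imbalanced medical segmentation.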