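Lu Ling, Qi Weimin (School of Artificial Intelligence, Jianghan University, Wuhan 430056)
Objective Spine computed tomography (CT) images suffer from poorly displayed tissue structure, low contrast, and noise. Traditional segmentation algorithms have low accuracy and require manual intervention, so they usually achieve only semi-automatic segmentation and cannot meet real-time requirements. U-Net, which is based on the convolutional neural network (CNN), has become the standard for medical image segmentation, but its long-range interaction remains limited. Transformer integrates a global self-attention mechanism that captures long-range feature dependencies and has shown great advantages in computer vision. This paper proposes TransAGUNet (Transformer attention gate U-Net), a hybrid CNN and Transformer segmentation model, for efficient automatic segmentation of spine CT images. Method The proposed model combines Transformer, the attention gate (AG) mechanism, and U-Net into an encoder-decoder structure. The encoder uses a hybrid Transformer and CNN architecture to extract local and global features. The decoder uses a CNN architecture with AG in the skip connections: the attention map corresponding to each downsampled feature map is concatenated with the feature map obtained by upsampling the next layer, fusing low-level and high-level features for finer segmentation. The sum of Dice loss and weighted cross-entropy is used as the loss function to address the imbalance between positive and negative samples. Result On the VerSe2020 dataset, the Dice coefficient of the proposed algorithm is 4.47%, 2.09%, 2.44%, and 2.23% higher than that of the mainstream CNN segmentation models U-Net, Attention U-Net, U-Net++, and U-Net3+, respectively, and 2.25% and 1.08% higher than that of the Transformer and CNN hybrid models TransUNet and TransNorm, respectively. Conclusion Compared with these six segmentation models, the proposed algorithm achieves the best segmentation performance on spine CT images, effectively improves segmentation accuracy, and offers good real-time performance.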
Spine CT image segmentation based on Transformer
Lu Ling, Qi Weimin (School of Artificial Intelligence, Jianghan University, Wuhan 430056, China)
Objective The incidence of spine diseases has risen in recent years and increasingly affects younger people, so their diagnosis and treatment are particularly important. With 3D reconstruction and computer-aided diagnosis, segmenting the spine region from the background in spine computed tomography (CT) images helps physicians observe the lesion area clearly and supports surgical path simulation and surgical planning. Accurate segmentation is critical for restoring the actual position and physiological shape of the patient's vertebrae as faithfully as possible, which allows physicians to understand the distribution of lesions. However, spine segmentation is made difficult by the complex structure of the spine and by the poorly displayed tissue structure, low contrast, and noise of spine CT images. Manual annotation relies on the physician's prior knowledge and clinical experience; it is time consuming, its results are highly subjective, and long working hours may introduce deviations that affect the diagnosis. Traditional computer-based segmentation methods mainly use low-level image features, such as texture, shape, and color, and can often achieve only semi-automatic segmentation. Moreover, these methods do not fully exploit the image information, and their low segmentation accuracy fails to meet the demand for real-time segmentation. Segmentation methods based on deep learning can segment automatically, extract image features effectively, and improve segmentation accuracy. In computer vision (CV), a succession of medical image segmentation algorithms based on the convolutional neural network (CNN) has been proposed, and these algorithms have become the mainstream research direction in medical image analysis.
Among these algorithms, U-Net benefits from its own structural characteristics and from the relatively fixed structure of multi-modality medical images, and it has become a benchmark for medical image segmentation. However, the inherent limitations of convolution lead to problems such as limited long-range interaction. By contrast, Transformer, a non-CNN architecture, integrates a global self-attention mechanism that captures long-range feature dependencies and is widely used in natural language processing tasks such as machine translation and text classification. In recent years, researchers have introduced Transformer into computer vision and achieved advanced results in tasks such as image classification and image segmentation. This paper combines the advantages of the CNN architecture and Transformer to propose a hybrid CNN and Transformer segmentation model called Transformer attention gate U-Net (TransAGUNet), which segments spine CT images efficiently and automatically. Method The proposed model combines Transformer, U-Net, and the attention gate (AG) mechanism to form an encoder-decoder structure. The encoder uses a hybrid Transformer and CNN architecture that combines the ResNet50 and ViT models. For the sliced spine CT images, low-level features are first extracted by ResNet50, the feature maps from three downsampling stages are retained, and patch embedding and position embedding are then performed. The resulting patches are fed to the Transformer encoder to learn long-range contextual dependencies and extract global features.
The decoder uses a CNN structure that recovers the image size layer by layer through 2D bilinear upsampling at a rate of 2×. AG is incorporated into three skip connections, counted from the bottom up: the attention map corresponding to the downsampled features is concatenated with the upsampled features of the next layer and then decoded by two ordinary convolutions and one 1×1 convolution. The output then enters a binary classifier that separates foreground from background pixel by pixel to produce the spine segmentation prediction map. AG adds few parameters, is easily integrated into CNN models, and automatically learns the shape and size of the target, highlighting salient features and suppressing feature responses in irrelevant regions. It replaces an explicit localization module with probability-based soft attention, which removes the need to delineate a region of interest (ROI) and improves the sensitivity and accuracy of the model at a small computational cost. The experiments use the sum of Dice loss and weighted cross-entropy loss as the loss function to address the imbalance between positive and negative samples. Result The proposed algorithm is tested on the VerSe2020 dataset. Its Dice coefficient is 4.47%, 2.09%, 2.44%, and 2.23% higher than that of the mainstream CNN segmentation networks U-Net, Attention U-Net, U-Net++, and U-Net3+, respectively, and 2.25% and 1.08% higher than that of the Transformer and CNN hybrid segmentation models TransUNet and TransNorm, respectively. To verify the validity of the proposed model, several ablation experiments are performed. The results show that, compared with TransUNet, the redesigned decoding structure improves the Dice coefficient by 0.75%, and adding AG improves it by 1.5%.
To explore how the number of AG connections affects model performance, experiments are conducted with AG added to different numbers of skip connections. The results show that the Dice coefficient is lowest when no AG is added and that the best performance is achieved by adding AG to the three skip connections at the 1/2, 1/4, and 1/8 resolution scales. Conclusion Compared with the above six segmentation models (four CNN models and two Transformer and CNN hybrid models), the proposed algorithm achieves the best segmentation results on spine CT images, effectively improves segmentation accuracy, and offers good real-time performance.
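
The loss named in the Method part, Dice loss summed with weighted cross-entropy, can be illustrated with a minimal pure-Python sketch for the binary (spine versus background) case. The abstract does not give the weighting scheme or any numerical values, so the function names, the `pos_weight` parameter, and the smoothing constants below are assumptions made for illustration, not the authors' implementation.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss for flat lists of foreground probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    # eps (assumed value) keeps the ratio defined when both masks are empty.
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def weighted_bce(probs, targets, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy; the foreground term is scaled by pos_weight.

    pos_weight is a hypothetical knob: values above 1 penalize missed
    foreground (spine) pixels more heavily than false positives.
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def combined_loss(probs, targets, pos_weight=1.0):
    """Sum of Dice loss and weighted cross-entropy, as described in the abstract."""
    return dice_loss(probs, targets) + weighted_bce(probs, targets, pos_weight)


if __name__ == "__main__":
    labels = [1, 0, 0, 0, 0, 0, 0, 1]  # sparse foreground
    predicted = [0.7, 0.1, 0.2, 0.1, 0.0, 0.1, 0.3, 0.4]
    print(combined_loss(predicted, labels, pos_weight=3.0))
```

The Dice term depends on the overlap between prediction and label rather than on the number of background pixels, which is why it is commonly paired with cross-entropy when the foreground occupies a small fraction of the image.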
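
The attention gate in each skip connection can be written compactly. The abstract does not give the exact formulation, so the equations below assume the additive soft-attention gate of Attention U-Net; the symbols (skip feature $x_i$ at pixel $i$, gating feature $g_i$ from the coarser decoder layer resampled to the same resolution, weights $W_x$, $W_g$, $\psi$, and biases $b_g$, $b_\psi$) are illustrative names, not notation taken from the paper.

```latex
\begin{aligned}
q_i      &= \psi^{\top}\,\mathrm{ReLU}\!\left(W_x x_i + W_g g_i + b_g\right) + b_\psi, \\
\alpha_i &= \sigma(q_i) \in (0, 1), \\
\hat{x}_i &= \alpha_i \, x_i, \\
d        &= \mathrm{Conv}\!\left(\left[\hat{x} \,;\, \mathrm{Up}_{2\times}(g)\right]\right).
\end{aligned}
```

Here $\sigma$ is the sigmoid, so $\alpha_i$ acts as a per-pixel soft weight that suppresses irrelevant regions of the skip feature map; $[\,\cdot\,;\,\cdot\,]$ denotes channel-wise concatenation of the gated skip features with the 2× bilinearly upsampled decoder features, followed by the convolutions described in the Method part.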