Spine CT image segmentation based on Transformer
2023, Vol. 28, No. 11, Pages: 3618-3628
Print publication date: 2023-11-16
DOI: 10.11834/jig.221084
Lu Ling, Qi Weimin. 2023. Spine CT image segmentation based on Transformer. Journal of Image and Graphics, 28(11):3618-3628
Objective
The incidence of spine diseases has increased in recent years and increasingly affects younger individuals, making their diagnosis and treatment particularly critical. Segmenting the spine region from the background in spine computed tomography (CT) images, combined with 3D reconstruction technology and computer-aided diagnosis, can help physicians observe spinal lesion areas clearly and provides support for surgical path simulation and surgical planning. Accurate segmentation of spine CT images is essential for restoring the actual position and physiological shape of the patient's vertebrae as faithfully as possible, allowing physicians to understand the distribution of lesions. However, spine segmentation is complicated by the complex structure of the spine and by the poor display of tissue structure, low contrast, and noise interference in spine CT images. Segmentation by manual annotation relies on the physician's prior knowledge and clinical experience; its results are highly subjective and time consuming to produce, and long working hours may introduce deviations that affect diagnosis. Traditional computer-assisted segmentation methods mainly use low-level features of the image, such as texture, shape, and color, and can often achieve only semi-automatic segmentation. They do not fully exploit the image information, and their low segmentation accuracy fails to meet the demand for real-time segmentation. Segmentation methods based on deep learning can realize automatic segmentation, effectively extract image features, and improve segmentation accuracy. In computer vision (CV), medical image segmentation algorithms based on convolutional neural networks (CNNs) have been proposed in quick succession and have become the mainstream research direction in medical image analysis. Among them, the characteristics of the U-Net structure, together with the relatively fixed structure of multi-modal medical images, give U-Net strong performance in medical image segmentation and have made it a benchmark for the task. However, the inherent limitations of the convolutional structure lead to problems such as limited long-distance interaction. By contrast, Transformer, a non-convolutional architecture, integrates a global self-attention mechanism that captures long-range feature dependencies and is widely used in natural language processing tasks such as machine translation and text classification. In recent years, researchers have introduced Transformer into computer vision and achieved state-of-the-art results in tasks such as image classification and image segmentation. This paper combines the advantages of the CNN architecture and Transformer to propose a hybrid CNN and Transformer segmentation model, Transformer attention gate U-Net (TransAGUNet), for efficient and automated segmentation of spine CT images.
Method
The proposed model combines Transformer, U-Net, and the attention gate (AG) mechanism to form an encoder-decoder structure. The encoder uses a hybrid Transformer and CNN architecture consisting of ResNet50 followed by a ViT. For the sliced spine CT images, low-level features are first extracted by ResNet50, and the feature maps from three downsampling stages are retained; patch embedding and position embedding are then applied, and the resulting patches are fed to the Transformer encoder to learn long-range contextual dependencies and extract global features. The decoder adopts a CNN architecture that applies 2D bilinear upsampling at a 2× rate to recover the image size layer by layer. The AG structure is incorporated into the bottom three skip connections: the attention map corresponding to the downsampled features is obtained, concatenated with the upsampled features from the layer below, and then decoded by two ordinary convolutions. A final 1 × 1 convolution acts as a binary classifier that distinguishes foreground from background pixel by pixel to produce the spine segmentation prediction map; in this way shallow features are fused with higher-level features for fine segmentation. The AG adds few parameters, is easily integrated into CNN models, and automatically learns the shape and size of the target, highlighting salient features and suppressing feature responses in irrelevant regions. Through probability-based soft attention it replaces an explicit localization module, eliminating the need to delineate an ROI, and improves the sensitivity and accuracy of the model at a small computational cost. The experiments use the sum of Dice loss and weighted cross-entropy loss as the loss function to address the uneven distribution of positive and negative samples.
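The authors' implementation is not included in this abstract. As a rough illustration, the following is a minimal PyTorch sketch of an additive attention gate in the style of Attention U-Net (Oktay et al., 2018), the mechanism the decoder incorporates into its skip connections; the module name, channel arguments, and interpolation choice are assumptions for illustration, not the paper's code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Additive attention gate applied to a skip connection.

    g: gating features from the coarser decoder level (smaller spatial size).
    x: encoder skip features to be reweighted.
    Returns x multiplied by a learned attention map in [0, 1], highlighting
    salient regions and suppressing irrelevant ones.
    """

    def __init__(self, g_channels, x_channels, inter_channels):
        super().__init__()
        self.w_g = nn.Conv2d(g_channels, inter_channels, kernel_size=1)
        self.w_x = nn.Conv2d(x_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, g, x):
        # Match the gating signal to the spatial size of the skip features.
        g = F.interpolate(g, size=x.shape[2:], mode="bilinear", align_corners=False)
        attn = torch.sigmoid(self.psi(F.relu(self.w_g(g) + self.w_x(x))))
        return x * attn

# In the decoder, the gated skip features would then be concatenated with
# the 2x bilinearly upsampled features from the layer below, e.g.:
#   up = F.interpolate(dec, scale_factor=2, mode="bilinear", align_corners=False)
#   out = torch.cat([gate(dec, skip_from_encoder), up], dim=1)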
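The loss described above, the sum of Dice loss and weighted cross-entropy, can be sketched as follows for binary spine/background segmentation. The class weights shown are placeholder assumptions, since the abstract does not state the weighting used:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DiceWeightedCELoss(nn.Module):
    """Sum of Dice loss and class-weighted cross-entropy, countering the
    imbalance between scarce vertebra pixels and abundant background pixels.
    The class weights below are placeholders, not values from the paper."""

    def __init__(self, class_weights=(0.3, 0.7), smooth=1e-5):
        super().__init__()
        self.register_buffer("w", torch.tensor(class_weights))
        self.smooth = smooth

    def forward(self, logits, target):
        # logits: (N, 2, H, W) raw scores; target: (N, H, W) int64 labels in {0, 1}
        ce = F.cross_entropy(logits, target, weight=self.w)
        prob = F.softmax(logits, dim=1)[:, 1]          # foreground probability map
        tgt = target.float()
        inter = (prob * tgt).sum(dim=(1, 2))
        denom = prob.sum(dim=(1, 2)) + tgt.sum(dim=(1, 2))
        dice = (2.0 * inter + self.smooth) / (denom + self.smooth)
        return ce + (1.0 - dice).mean()                # Dice loss = 1 - Dice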
Result
The proposed algorithm is tested on the VerSe2020 dataset. Its Dice coefficient improves on the mainstream CNN segmentation models U-Net, Attention U-Net, U-Net++, and U-Net3+ by 4.47%, 2.09%, 2.44%, and 2.23%, respectively, and on the strong Transformer and CNN hybrid segmentation models TransUNet and TransNorm by 2.25% and 1.08%, respectively. To verify the validity of the proposed model, several ablation experiments are performed: compared with TransUNet, the designed decoding structure improves the Dice coefficient by 0.75%, and by 1.5% after AG is added. To explore the effect of the number of AG connections on model performance, experiments are conducted with different numbers of AGs; the Dice coefficient obtained without AG is the smallest, and the best performance is achieved by adding AG to the three skip connections at the 1/2, 1/4, and 1/8 resolution scales.
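For reference, the Dice coefficient used in these comparisons measures the overlap between the predicted and ground-truth masks. A minimal NumPy sketch for binary masks (not the authors' evaluation code):

import numpy as np

def dice_coefficient(pred, gt, eps=1e-7):
    """Dice = 2 * |pred AND gt| / (|pred| + |gt|) for binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

# Example: two masks with 4 foreground pixels each, overlapping in 3,
# give Dice = 2 * 3 / (4 + 4) = 0.75.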
Conclusion
Compared with the six models above (four CNN segmentation models and two Transformer and CNN hybrid segmentation models), the proposed algorithm achieves the best segmentation results on spine CT images, effectively improving segmentation accuracy while maintaining good real-time performance.
Keywords: spine CT image; medical image segmentation; deep learning; Transformer; attention gate (AG)
References
Azad R, Al-Antary M T, Heidari M and Merhof D. 2022. TransNorm: Transformer provides a strong spatial normalization mechanism for a deep segmentation model. IEEE Access, 10: 108205-108215 [DOI: 10.1109/ACCESS.2022.3211501]
Ba J L, Kiros J R and Hinton G E. 2016. Layer normalization [EB/OL]. [2022-11-23]. https://arxiv.org/pdf/1607.06450.pdf
Chen J N, Lu Y Y, Yu Q H, Luo X D, Adeli E, Wang Y, Lu L, Yuille A L and Zhou Y Y. 2021. TransUNet: Transformers make strong encoders for medical image segmentation [EB/OL]. [2022-11-23]. https://arxiv.org/pdf/2102.04306.pdf
Chen L W and Rudnicky A. 2022. Fine-grained style control in Transformer-based text-to-speech synthesis//Proceedings of ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore, Singapore: IEEE: 7907-7911 [DOI: 10.1109/ICASSP43922.2022.9747747]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2021. An image is worth 16 × 16 words: Transformers for image recognition at scale//Proceedings of the 9th International Conference on Learning Representations. [s.l.]: OpenReview.net
Egonmwan E and Chali Y. 2019. Transformer and seq2seq model for paraphrase generation//Proceedings of the 3rd Workshop on Neural Generation and Translation. Hong Kong, China: Association for Computational Linguistics: 249-255 [DOI: 10.18653/v1/D19-5627]
Guo D F and Terzopoulos D. 2021. A Transformer-based network for anisotropic 3D medical image segmentation//Proceedings of the 25th International Conference on Pattern Recognition. Milan, Italy: IEEE: 8857-8861 [DOI: 10.1109/ICPR48806.2021.9411990]
He K M, Zhang X Y, Ren S Q and Sun J. 2016. Deep residual learning for image recognition//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, USA: IEEE: 770-778 [DOI: 10.1109/CVPR.2016.90]
Huang G, Liu Z, Van Der Maaten L and Weinberger K Q. 2017. Densely connected convolutional networks//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA: IEEE: 2261-2269 [DOI: 10.1109/CVPR.2017.243]
Huang H M, Lin L F, Tong R F, Hu H J, Zhang Q W, Iwamoto Y, Han X H, Chen Y W and Wu J. 2020. UNet 3+: a full-scale connected UNet for medical image segmentation//Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain: IEEE: 1055-1059 [DOI: 10.1109/ICASSP40776.2020.9053405]
Jha D, Smedsrud P H, Riegler M A, Johansen D, de Lange T, Halvorsen P and Johansen H D. 2019. ResUNet++: an advanced architecture for medical image segmentation//Proceedings of the 2019 IEEE International Symposium on Multimedia. San Diego, USA: IEEE: 225-230 [DOI: 10.1109/ISM46123.2019.00049]
Ji S Y and Xiao Z Y. 2021. Integrated context and multi-scale features in thoracic organs segmentation. Journal of Image and Graphics, 26(9): 2135-2145 [DOI: 10.11834/jig.200558]
Jin S N, Zhou D B, He B and Gu J J. 2021. Segmentation of spine CT images based on multi-scale feature fusion and attention mechanism. Computer Systems and Applications, 30(10): 280-286 [DOI: 10.15888/j.cnki.csa.008118]
Kolařík M, Burget R, Uher V, Říha K and Dutta M K. 2019. Optimized high resolution 3D dense-U-Net network for brain and spine segmentation. Applied Sciences, 9(3): #404 [DOI: 10.3390/app9030404]
Li X and He J. 2018. Application of 3D fully convolution network in spine segmentation. Electronic Science and Technology, 31(11): 75-79 [DOI: 10.16180/j.cnki.issn1007-7820.2018.11.019]
Li X M, Chen H, Qi X J, Dou Q, Fu C W and Heng P A. 2018. H-DenseUNet: hybrid densely connected UNet for liver and tumor segmentation from CT volumes. IEEE Transactions on Medical Imaging, 37(12): 2663-2674 [DOI: 10.1109/TMI.2018.2845918]
Liu Y, Wu R R, Tang L and Song N N. 2022. U-Net-based mediastinal lymph node segmentation method in bronchial ultrasound elastic images. Journal of Image and Graphics, 27(10): 3082-3091 [DOI: 10.11834/jig.210225]
Liu Z L, Chen G, Shan Z Y and Jiang X Q. 2018. Segmentation of spine CT image based on deep learning. Computer Applications and Software, 35(10): 200-204, 273 [DOI: 10.3969/j.issn.1000-386x.2018.10.036]
Löffler M T, Sekuboyina A, Jacob A, Grau A L, Scharr A, El Husseini M, Kallweit M, Zimmer C, Baum T and Kirschke J S. 2020. A vertebral segmentation dataset with fracture grading. Radiology: Artificial Intelligence, 2(4): #190138 [DOI: 10.1148/ryai.2020190138]
Long J, Shelhamer E and Darrell T. 2015. Fully convolutional networks for semantic segmentation//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. Boston, USA: IEEE: 3431-3440 [DOI: 10.1109/CVPR.2015.7298965]
Milletari F, Navab N and Ahmadi S A. 2016. V-Net: fully convolutional neural networks for volumetric medical image segmentation//Proceedings of the 14th International Conference on 3D Vision. Stanford, USA: IEEE: 565-571 [DOI: 10.1109/3DV.2016.79]
Oktay O, Schlemper J, Le Folgoc L, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla N Y, Kainz B, Glocker B and Rueckert D. 2018. Attention U-Net: learning where to look for the pancreas [EB/OL]. [2022-11-23]. https://arxiv.org/pdf/1804.03999.pdf
Ronneberger O, Fischer P and Brox T. 2015. U-Net: convolutional networks for biomedical image segmentation//Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention. Munich, Germany: Springer: 234-241 [DOI: 10.1007/978-3-319-24574-4_28]
Sekuboyina A, Kukačka J, Kirschke J S, Menze B H and Valentinitsch A. 2018. Attention-driven deep learning for pathological spine segmentation//Proceedings of the 5th International Workshop on Computational Methods and Clinical Applications in Musculoskeletal Imaging. Quebec City, Canada: Springer: 108-119 [DOI: 10.1007/978-3-319-74113-0_10]
Shi Y Y, Wang Y Q, Wu C Y, Yeh C F, Chan J, Zhang F, Le D and Seltzer M. 2021. Emformer: efficient memory Transformer based acoustic model for low latency streaming speech recognition//Proceedings of ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto, Canada: IEEE: 6783-6787 [DOI: 10.1109/ICASSP39728.2021.9414560]
Strudel R, Garcia R, Laptev I and Schmid C. 2021. Segmenter: Transformer for semantic segmentation//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision. Montreal, Canada: IEEE: 7242-7252 [DOI: 10.1109/ICCV48922.2021.00717]
Tian F Y, Zhou M Q, Yan F, Fan L and Geng G H. 2020. Spinal CT segmentation based on AttentionNet and DenseUnet. Laser and Optoelectronics Progress, 57(20): #201008 [DOI: 10.3788/LOP57.201008]
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł and Polosukhin I. 2017. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA: Curran Associates Inc.: 6000-6010
Wang W X, Chen C, Ding M, Yu H, Zha S and Li J Y. 2021. TransBTS: multimodal brain tumor segmentation using Transformer//Proceedings of the 24th International Conference on Medical Image Computing and Computer-Assisted Intervention. Strasbourg, France: Springer: 109-119 [DOI: 10.1007/978-3-030-87193-2_11]
Xiao X, Lian S, Luo Z M and Li S Z. 2018. Weighted res-UNet for high-quality retina vessel segmentation//Proceedings of the 9th International Conference on Information Technology in Medicine and Education. Hangzhou, China: IEEE: 327-331 [DOI: 10.1109/ITME.2018.00080]
Zhao Y M, Li Q and Guan X. 2020. Lightweight brain tumor segmentation algorithm based on a group convolutional neural network. Journal of Image and Graphics, 25(10): 2159-2170 [DOI: 10.11834/jig.200247]
Zhou H Y, Guo J S, Zhang Y H, Yu L Q, Wang L S and Yu Y Z. 2022. nnFormer: interleaved Transformer for volumetric segmentation [EB/OL]. [2022-11-23]. https://arxiv.org/pdf/2109.03201v1.pdf
Zhou Z W, Siddiquee M M R, Tajbakhsh N and Liang J M. 2018. UNet++: a nested U-Net architecture for medical image segmentation//Proceedings of the 4th International Workshop on Deep Learning in Medical Image Analysis. Granada, Spain: Springer: 3-11 [DOI: 10.1007/978-3-030-00889-5_1]