嵌入卷积增强型Transformer的头影解剖关键点检测
Cephalometric landmark keypoints localization based on convolution-enhanced Transformer
2023, Vol. 28, No. 11: 3590-3601
Print publication date: 2023-11-16
DOI: 10.11834/jig.220933
杨恒, 顾晨亮, 胡厚民, 张劲, 李康, 何凌. 2023. 嵌入卷积增强型Transformer的头影解剖关键点检测. 中国图象图形学报, 28(11):3590-3601
Yang Heng, Gu Chenliang, Hu Houmin, Zhang Jing, Li Kang, He Ling. 2023. Cephalometric landmark keypoints localization based on convolution-enhanced Transformer. Journal of Image and Graphics, 28(11):3590-3601
目的
准确可靠的头影测量分析在正畸诊断、术前规划以及治疗评估中起着重要作用,其常依赖于解剖关键点间的相互关联。然而,人工注释往往受限于速率与准确性,并且不同位置的结构可能共享相似的图像信息,这使得基于卷积神经网络的方法难有较高的精度。Transformer在长期依赖性建模方面具有优势,这对确认关键点的位置信息有所帮助,因此开发一种结合Transformer的头影关键点自动检测算法具有重要意义。
方法
本文提出一种基于卷积增强型Transformer的U型架构用于侧位头影关键点定位,并将其命名为CETransNet(convolutional enhanced Transformer network)。通过改进Transformer模块并将其引入至U型结构中,在建立全局上下文连接的同时也保留了卷积神经网络获取局部信息的能力。此外,为更好地回归预测热图,提出了一种指数加权损失函数,使得监督学习过程中关键点附近像素的损失值能得到更多关注,并抑制远处像素的损失。
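The method embeds a convolution-enhanced Transformer into a U-shaped network so that global self-attention and local convolutional features coexist. The exact module design is given in the full paper; as an illustrative sketch only, the NumPy code below fuses global self-attention over flattened feature tokens with a depthwise 3 × 3 convolution branch by simple addition (the projection weights, kernel, and additive fusion are assumptions, not the published architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depthwise_conv3x3(x, k):
    """Per-channel 3x3 convolution. x: (H, W, C), k: (3, 3, C)."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero padding, stride 1
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W, :] * k[i, j]
    return out

def conv_enhanced_attention(x, Wq, Wk, Wv, k_dw):
    """Global attention over tokens plus a local depthwise-conv branch."""
    H, W, C = x.shape
    tokens = x.reshape(H * W, C)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(C))          # global context modeling
    global_out = (attn @ v).reshape(H, W, C)
    local_out = depthwise_conv3x3(x, k_dw)        # local fine-grained detail
    return global_out + local_out                  # assumed fusion by addition
```

A real implementation would add multi-head attention, normalization, and a feed-forward block at each level of the U-shaped encoder-decoder; the sketch only shows how the two information paths can be combined.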
结果
在2个测试集上, CETransNet分别实现了1.09 mm和1.39 mm的定位误差值,并且2 mm内精度达到了87.19%和76.08%。此外,测试集1中共有9个标志点达到了100%的4 mm检测精度,同时多达12个点获得了90%以上的2 mm检测精度;测试集2中,尽管只有9个点满足90%的2 mm检测精度,但4 mm范围内有10个点被完全检测。
结论
CETransNet能够快速、准确且具备鲁棒性地检测出解剖点的位置,性能优于目前先进方法,并展示出一定的临床应用价值。
Objective
Accurate and reliable cephalometric image measurement and analysis, which usually depend on the correlation among anatomical landmark points, play essential roles in orthodontic diagnosis, preoperative planning, and treatment evaluation. However, manual annotation hinders the speed and accuracy of measurement to a certain extent. Therefore, an automatic cephalometric landmark detection algorithm for daily diagnosis needs to be developed. However, anatomical landmarks account for only a small proportion of an image, and structures at different positions may share similar curvatures, shapes, and surrounding soft-tissue appearance, making them difficult to distinguish. Current methods based on convolutional neural networks (CNNs) extract depth features by applying down-sampling to facilitate the building of a global connection, but these methods may suffer from spatial information loss and inefficient context modeling, hence preventing them from meeting accuracy requirements in clinical applications. Transformer has advantages in long-term dependency modeling but is not good at capturing local features, hence explaining the insufficient accuracy of models based on pure Transformer for key point localization. Therefore, an end-to-end model with global context modeling and better local spatial feature representation must be built to solve these problems.
Method
To detect the anatomical landmarks efficiently and effectively, a U-shaped architecture based on convolution-enhanced Transformer called CETransNet is proposed in this paper to locate the key points of lateral cephalometric images. The overwhelming success of UNet lies in its ability to analyze the local fine-grained nature of an image at the deep level, but this method suffers from global spatial information loss. By improving the Transformer module and introducing it into the U-shaped structure, the ability of convolutional networks to obtain local information is retained while global context connection is established. In addition, to efficiently regress and predict the heatmaps, an exponential weighted loss function is proposed so that the loss values near the landmark pixels receive more attention in the supervised learning process and the loss of distant pixels is suppressed. Each image is rescaled to 768 × 768 pixels while preserving its original aspect ratio via zero padding, and data augmentation is performed via random rotation, Gaussian noise addition, and elastic transformation. During the training phase, experiments are conducted on a server using Tesla V100 SXM3-32 GB GPUs. The model is optimized by an Adam optimizer with a batch size of 2, and the initial learning rate is set to 0.000 1 and multiplied by 0.75 every 5 epochs.
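The exponential weighted loss (named EWSmoothL1 in the Result section) up-weights heatmap pixels near the ground-truth landmark and suppresses distant ones. The exact formula appears in the full paper; the NumPy sketch below uses one plausible form under stated assumptions: a weight map w = exp(γ · H_gt) applied to an elementwise SmoothL1 residual, where γ is a hypothetical hyperparameter and H_gt is the ground-truth Gaussian heatmap in [0, 1]:

```python
import numpy as np

def smooth_l1(residual, beta=1.0):
    """Elementwise SmoothL1: quadratic below beta, linear above."""
    a = np.abs(residual)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)

def ew_smooth_l1(pred, target, gamma=2.0):
    """Exponentially weighted SmoothL1 over a heatmap (assumed form).

    Pixels near the landmark (target close to 1) get weight up to
    exp(gamma); background pixels (target close to 0) get weight 1,
    so the same residual costs more near the landmark.
    """
    weight = np.exp(gamma * target)
    return float(np.mean(weight * smooth_l1(pred - target)))
```

With this weighting, an identical prediction error contributes exp(γ) times more to the loss at the landmark peak than in the background, which is the behavior the abstract describes.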
Result
To demonstrate its strengths, CETransNet is compared with the most advanced methods, and ablation studies are performed to confirm the contribution of each component. Experiments were performed on a public X-ray cephalometric dataset. Quantitative results show that CETransNet obtains mean radial error (MRE) values of 1.09 mm and 1.43 mm on the two test datasets, respectively, and the accuracies within the clinically accepted 2 mm error are 87.16% and 76.08%. A total of 9 key points in Test1 achieve a 100% successful detection rate (SDR) within 4 mm, and up to 12 landmarks reach a detection accuracy above 90% in the clinically allowable 2.0 mm region. In Test2, although only 9 points reach a 90% SDR within 2 mm, 10 points are completely detected within 4 mm. Compared with the best competing method, CETransNet improves the MRE by 2.7% and 2.1% on the two datasets, respectively. CETransNet also outperforms other popular vision Transformer methods on the benchmark Test1 dataset and achieves a 2.16% SDR improvement within 2 mm over the sub-optimal model. Meanwhile, analysis of the influence of the backbone network on model performance reveals that ResNet-101 reaches the minimal MRE, while ResNet-152 obtains the best SDR within 2 mm. Results of the ablation studies show that the convolution-enhanced Transformer decreases the MRE by 0.3 mm and improves the SDR within 2.0 mm by 7.36%. Meanwhile, the proposed EWSmoothL1 further reduces the MRE to 1.09 mm. Benefiting from these components, CETransNet can detect the position of anatomical landmarks quickly, accurately, and robustly.
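The two metrics reported above can be computed directly from predicted and ground-truth landmark coordinates. The sketch below assumes coordinates are already converted to millimetres and stored as (N, 2) arrays (these array shapes and variable names are illustrative, not from the paper):

```python
import numpy as np

def mre(pred, gt):
    """Mean radial error: average Euclidean distance, in the input unit."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def sdr(pred, gt, radius):
    """Success detection rate: fraction of landmarks within `radius`."""
    d = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(d <= radius))
```

For example, `sdr(pred, gt, 2.0)` gives the clinically accepted 2 mm accuracy, and the same call with 2.5, 3.0, and 4.0 yields the other thresholds commonly reported for this benchmark.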
Conclusion
This paper proposes a cephalometric landmark detection framework with a U-shaped architecture that embeds the convolution-enhanced Transformer in each residual layer. By fusing the advantages of both Transformer and CNNs, the proposed framework effectively captures long-term dependencies and local features and thus obtains the spatial position and structural information of key points. To address the ambiguity caused by other similar structures in an image, an exponential weighted loss function is proposed so that the model focuses more on the loss in the target area than in other regions. Experimental results show that CETransNet achieves the best MRE and SDR performance compared with advanced methods, especially in the clinically allowable 2.0 mm region. A series of ablation experiments also proves the effectiveness of the proposed modules, confirming that CETransNet performs competently in anatomical landmark detection and possesses great potential to solve the problems in cephalometric analysis and treatment planning. In future work, other lightweight models with better robustness will be designed.
头影测量; 关键点检测; 视觉Transformer; 注意力机制; 热图回归; 卷积神经网络(CNN)
cephalometric measurement; landmark keypoints localization; vision Transformer; attention mechanism; heatmap regression; convolutional neural network (CNN)