Cephalometric landmark keypoints localization based on convolution-enhanced Transformer

Yang Heng; Gu Chenliang; Hu Houmin; Zhang Jing; Li Kang; He Ling

Published: 2023-11-17
DOI: 10.11834/jig.220933
2023 | Volume 28 | Number 11

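Yang Heng¹, Gu Chenliang², Hu Houmin¹, Zhang Jing³, Li Kang¹, He Ling³ (1. School of Electrical Engineering, Sichuan University, Chengdu 610065; 2. China Southwest Electronic Technology Research Institute, Chengdu 610036; 3. School of Biomedical Engineering, Sichuan University, Chengdu 610065)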

Abstract

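Objective Accurate and reliable cephalometric analysis plays an important role in orthodontic diagnosis, preoperative planning, and treatment evaluation, and it usually relies on the relationships among anatomical landmarks. However, manual annotation is limited in speed and accuracy, and structures at different positions may share similar image information, which makes it difficult for methods based on convolutional neural networks (CNNs) to reach high accuracy. Transformer has advantages in long-range dependency modeling, which helps determine landmark positions, so developing an automatic cephalometric landmark detection algorithm that incorporates Transformer is of great significance. Method This paper proposes a U-shaped architecture based on a convolution-enhanced Transformer for landmark localization in lateral cephalograms, named CETransNet (convolutional enhanced Transformer network). By improving the Transformer module and introducing it into the U-shaped structure, global context connections are established while the ability of CNNs to capture local information is retained. In addition, to better regress the predicted heatmaps, an exponential weighted loss function is proposed so that, during supervised learning, the loss of pixels near a landmark receives more attention and the loss of distant pixels is suppressed. Result On the two test sets, CETransNet achieves localization errors of 1.09 mm and 1.39 mm, with accuracies within 2 mm of 87.19% and 76.08%, respectively. In test set 1, 9 landmarks reach 100% detection accuracy within 4 mm, and as many as 12 landmarks exceed 90% detection accuracy within 2 mm. In test set 2, although only 9 landmarks reach 90% detection accuracy within 2 mm, 10 landmarks are fully detected within 4 mm. Conclusion CETransNet detects the positions of anatomical landmarks quickly, accurately, and robustly, outperforms current state-of-the-art methods, and shows a certain clinical application value.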
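The two figures reported in the Result part (localization error in millimetres and accuracy within a radius such as 2 mm or 4 mm) can be illustrated with a minimal NumPy sketch. The function names, the toy coordinates, the pixel spacing of 0.1 mm per pixel, and the use of "less than or equal to" at the threshold are all assumptions made for illustration; the abstract does not state them.

```python
import numpy as np


def radial_errors_mm(pred_px, true_px, mm_per_pixel):
    """Euclidean distance between predicted and reference landmarks, in mm.

    pred_px, true_px: arrays of shape (..., 2) holding (x, y) pixel coordinates.
    mm_per_pixel: physical pixel spacing (assumed value, not from the abstract).
    """
    pred_px = np.asarray(pred_px, dtype=float)
    true_px = np.asarray(true_px, dtype=float)
    return np.linalg.norm(pred_px - true_px, axis=-1) * mm_per_pixel


def mean_radial_error(errors_mm):
    """Average radial error over all images and landmarks."""
    return float(np.mean(errors_mm))


def successful_detection_rate(errors_mm, threshold_mm):
    """Percentage of landmarks whose radial error is within threshold_mm."""
    within = np.asarray(errors_mm) <= threshold_mm
    return float(np.mean(within) * 100.0)


# Toy data: 2 images x 3 landmarks, coordinates in pixels (hypothetical values).
pred = np.array([[[10, 10], [50, 52], [100, 130]],
                 [[12, 10], [50, 50], [90, 90]]])
true = np.array([[[10, 10], [50, 50], [100, 100]],
                 [[10, 10], [50, 50], [90, 95]]])

errors = radial_errors_mm(pred, true, mm_per_pixel=0.1)
print("MRE (mm):", mean_radial_error(errors))
print("SDR within 2 mm (%):", successful_detection_rate(errors, 2.0))
print("SDR within 4 mm (%):", successful_detection_rate(errors, 4.0))
```

In this toy case one landmark is off by 3 mm, so it counts as detected at the 4 mm radius but not at the 2 mm radius, which is why the two rates can differ for the same set of predictions.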

Keywords

cephalometric measurement; keypoint detection; vision Transformer; attention mechanism; heatmap regression; convolutional neural network (CNN)

Cephalometric landmark keypoints localization based on convolution-enhanced Transformer

Yang Heng¹, Gu Chenliang², Hu Houmin¹, Zhang Jing³, Li Kang¹, He Ling³ (1. School of Electrical Engineering, Sichuan University, Chengdu 610065, China; 2. China Southwest Electronic Technology Research Institute, Chengdu 610036, China; 3. School of Biomedical Engineering, Sichuan University, Chengdu 610065, China)

Abstract

Objective Accurate and reliable cephalometric image measurement and analysis, which usually depend on the correlation among anatomical landmarks, play essential roles in orthodontic diagnosis, preoperative planning, and treatment evaluation. However, manual annotation limits the speed and accuracy of measurement, so an automatic cephalometric landmark detection algorithm for daily diagnosis needs to be developed. This is difficult because anatomical landmarks occupy only a small proportion of an image, and structures at different positions may share similar curvatures, shapes, and surrounding soft-tissue information that are hard to distinguish. Current methods based on convolutional neural networks (CNNs) extract deep features by down-sampling to help build global connections, but they may suffer from spatial information loss and inefficient context modeling, which prevents them from meeting the accuracy requirements of clinical applications. Transformer has advantages in long-range dependency modeling but is weak at capturing local features, which explains the insufficient accuracy of pure-Transformer models for keypoint localization. Therefore, an end-to-end model with global context modeling and better local spatial feature representation must be built to solve these problems. Method To detect anatomical landmarks efficiently and effectively, this paper proposes CETransNet, a U-shaped architecture based on a convolution-enhanced Transformer, to locate the keypoints of lateral cephalometric images. The success of UNet lies in its ability to analyze the local fine-grained nature of an image at the deep level, but it suffers from global spatial information loss. By improving the Transformer module and introducing it into the U-shaped structure, the ability of convolutional networks to obtain local information is retained while global context connections are established. In addition, to regress and predict the heatmaps efficiently, an exponential weighted loss function is proposed so that the loss near the landmark pixels receives more attention during supervised learning and the loss of distant pixels is suppressed. Each image is rescaled to 768×768 pixels, with zero padding used to preserve its original aspect ratio, and data augmentation is performed via random rotation, Gaussian noise addition, and elastic transformation. During the training phase, experiments are conducted on a server using Tesla V100 SXM3-32 GB GPUs. The model is optimized with the Adam optimizer and a batch size of 2; the initial learning rate is set to 0.0001 and decayed by a factor of 0.75 every 5 epochs. Result To demonstrate its strengths, CETransNet is compared with state-of-the-art methods, and ablation studies are performed to confirm the contribution of each component. Experiments are performed on a public X-ray cephalometric dataset. Quantitative results show that CETransNet obtains mean radial error (MRE) values of 1.09 mm and 1.43 mm on the two test datasets, respectively, and the accuracies within the clinically accepted 2 mm error are 87.16% and 76.08%. A total of 9 keypoints in Test1 achieve a 100% successful detection rate (SDR) within 4 mm, and up to 12 landmarks reach a detection accuracy of 90% within the clinically allowable 2.0 mm region. In Test2, although only 9 points reach an SDR of 90% within 2 mm, 10 points are completely detected within 4 mm. Compared with the best competing method, CETransNet improves the MRE by 2.7% and 2.1% on the two datasets, respectively. CETransNet also outperforms other popular vision Transformer methods on the benchmark Test1 dataset and achieves a 2.16% SDR improvement within 2 mm over the second-best model. Meanwhile, an analysis of the influence of the backbone network on model performance reveals that ResNet-101 reaches the minimal MRE, while ResNet-152 obtains the best SDR within 2 mm. Results of ablation studies show that the convolution-enhanced Transformer decreases the MRE by 0.3 mm and improves the SDR within 2.0 mm by 7.36%. The proposed EWSmoothL1 loss further reduces the MRE to 1.09 mm. Benefitting from these components, CETransNet can detect the positions of anatomical landmarks quickly, accurately, and robustly. Conclusion This paper proposes a cephalometric landmark detection framework with a U-shaped architecture that embeds the convolution-enhanced Transformer in each residual layer. By fusing the advantages of Transformer and CNNs, the proposed framework effectively captures both long-range dependencies and local characteristics and thus obtains the special position and structure information of keypoints. To address the ambiguity caused by other similar structures in an image, an exponential weighted loss function is proposed so that the model focuses on the loss of the target area rather than on the other parts. Experimental results show that CETransNet achieves the best MRE and SDR performance compared with advanced methods, especially in the clinically allowable 2.0 mm region. A series of ablation experiments also demonstrates the effectiveness of the proposed modules, confirming that CETransNet performs competently in anatomical landmark detection and has great potential to solve problems in cephalometric analysis and treatment planning. In future work, other lightweight models with better robustness will be designed.
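The idea behind the exponential weighted loss, emphasizing pixels near a landmark and suppressing distant ones during heatmap regression, can be sketched as follows. The abstract does not give the formula of EWSmoothL1, so the weighting used here, exp(alpha × target heatmap) applied to a Smooth L1 term and normalized by the sum of weights, is only one plausible form chosen for illustration. The function names, `alpha`, `beta`, `sigma`, and the heatmap size are likewise hypothetical.

```python
import numpy as np


def gaussian_heatmap(height, width, center_xy, sigma):
    """Target heatmap with value 1 at the landmark, decaying with distance."""
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def smooth_l1(diff, beta=1.0):
    """Elementwise Smooth L1: quadratic below beta, linear above it."""
    abs_diff = np.abs(diff)
    return np.where(abs_diff < beta,
                    0.5 * abs_diff ** 2 / beta,
                    abs_diff - 0.5 * beta)


def exp_weighted_smooth_l1(pred, target, alpha=4.0, beta=1.0):
    """Smooth L1 weighted by exp(alpha * target) (assumed form, not the paper's).

    The weight is about 1 far from the landmark and exp(alpha) at its peak,
    so errors near the landmark dominate the averaged loss.
    """
    weights = np.exp(alpha * target)
    return float(np.sum(weights * smooth_l1(pred - target, beta)) / np.sum(weights))


# One landmark at the centre of a small 32x32 heatmap (toy sizes).
target = gaussian_heatmap(32, 32, center_xy=(16, 16), sigma=3.0)

# Two predictions with the same 0.5 error, placed near and far from the landmark.
pred_near = target.copy()
pred_near[16, 16] += 0.5
pred_far = target.copy()
pred_far[0, 0] += 0.5

near_loss = exp_weighted_smooth_l1(pred_near, target)
far_loss = exp_weighted_smooth_l1(pred_far, target)
print("loss, error near landmark:", near_loss)
print("loss, error far from landmark:", far_loss)
```

With `alpha=0` the weights are all 1 and the two predictions receive the same loss; with a positive `alpha` the error at the landmark is penalized more heavily than the identical error in the background, which is the behaviour the abstract attributes to the proposed loss.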

Keywords

cephalometric measurement; landmark keypoints localization; vision Transformer; attention mechanism; heatmap regression; convolutional neural network (CNN)