3D motion control model of the tongue during Mandarin pronunciation

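Liu Chan1, Zhang Shaochuan1, Qian Zhaopeng1, Niu Haijun1,2 (1. School of Biological Science and Medical Engineering, Beihang University, Beijing 100083; 2. Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Beijing 100083)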

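Abstract
Objective Accurate visualization of the articulators and their movement during speech is important for understanding articulation mechanisms, for diagnosing and treating speech disorders, and for research on human-computer speech interaction. The tongue is a key organ of speech production, but it is hard to visualize because it moves quickly, deforms in complex ways, and is hidden from view during articulation. This study therefore proposes a 3D dynamic control model of the tongue for Mandarin vowel and consonant articulation, built with a statistical modeling approach. Method First, magnetic resonance images (MRI) of a speaker were acquired during Mandarin vowel and consonant articulation, tongue contours were extracted by manual labeling, and static 3D mesh models were built. Second, with the model vertices as variables, control parameters were extracted by linear principal component analysis and the tongue motion control equations were established. Finally, the simulation performance of the tongue motion control during articulation was evaluated. Result Six control parameters for the 3D model were extracted, covering the tongue tip, tongue body, tongue dorsum, and jaw. The jaw parameter controls the tongue rotation caused by jaw opening and closing; the tongue body and tongue dorsum parameters control front-back movement and arching-flattening movement, respectively; and the three tongue tip parameters control up-down, front-back, and upward-curling movement of the tip. The six parameters account for 87.4% of the 3D tongue movement variation, a better simulation result than the motion control results reported for other languages. Conclusion The proposed method can be applied effectively to tongue modeling and 3D motion control for Mandarin pronunciation, and it reduces the complexity of 3D tongue motion modeling. The results provide useful information for visualizing the articulators during Mandarin pronunciation.
Keywords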
3D motion control model of tongue for Mandarin pronunciation

Liu Chan1, Zhang Shaochuan1, Qian Zhaopeng1, Niu Haijun1,2(1.School of Biological Science and Medical Engineering, Beihang University, Beijing 100083, China;2.Beijing Advanced Innovation Center for Biomedical Engineering, Beihang University, Beijing 100083, China)

Abstract
Objective The accurate visualization of vocal organs and their movement patterns during pronunciation is crucial for understanding pronunciation mechanisms, diagnosing and treating speech disorders, and human-computer interaction research. As an important vocal organ, the tongue is not completely visible and moves rapidly and flexibly during speech; therefore, it is difficult to visualize. Advances in medical imaging techniques in recent years have made it possible to capture clear tongue images, thus promoting the development of modeling strategies. The three most common strategies are parametric modeling, physiological modeling, and statistical modeling. Statistical modeling has the advantages of simple calculation, few control parameters, fast simulation, and strong interpretability, and it is suitable for developing a real-time speech training system. However, few studies have applied this method to tongue modeling for Mandarin Chinese pronunciation, and existing statistical models have drawbacks in precision and simulation capability. Therefore, this study proposes an improved 3D dynamic control model of the tongue based on statistical modeling for Mandarin vowel-consonant pronunciation. Method The control parameters were extracted using statistical modeling based on linear principal component analysis. The model assumes that the tongue shape varies linearly with the control parameters. First, a representative corpus was established on the basis of tongue shape variation during Mandarin vowel-consonant pronunciation. The corpus included a set of 49 artificially sustained articulations designed to cover the maximal range of Mandarin allophones, namely, 8 vowels, 40 consonants in consonant-vowel (CV) sequences, and a rest position. 
On the basis of the corpus, sagittal volume images of the tongue from one speaker were acquired by magnetic resonance imaging (MRI), and supplementary images of the hard palate, jaw, and teeth were acquired by computed tomography (CT). The images were preprocessed, and the upper and lower jaws from the CT images were then manually merged into the MRI volumes. The 3D tongue volume assembled from the sagittal MRI slices was resliced horizontally to obtain the corresponding axial slices. According to the distribution of the tongue muscles, a tongue contour-marking method was designed on the basis of a semi-polar grid previously proposed for tongue research. The contours of the tongue were then manually edited in the sagittal and transverse MRI slices using this method to build triangular meshes, which were combined into full 3D models of the tongue. The 3D surface mesh of the resting tongue was selected as the reference articulation and then fitted by elastic deformation to each of the 3D sets of planar contours, to satisfy the requirement of linear principal component analysis (LCA) that vertices correspond across observations. The vertices of the geometric model were taken as variables, the control parameters were extracted from the midsagittal contours of the models using a statistical method, and the simulation error of these parameters and their contribution to controlling the overall 3D shape of the tongue were evaluated. Result A triangular surface mesh tongue model consisting of more than 2 000 vertices was established for Mandarin vowel-consonant pronunciation, and its movement was controlled by six parameters. One parameter affects the tongue rotation around a point on the tongue back, two others control the front-back and flattening-bunching movements of the tongue, respectively, and the last three control the up-down, front-back, and upward-curling movements of the tongue tip, respectively. 
The six parameters were combined, and the 3D tongue model was reconstructed. The parameters could explain 87.4% of the variance in the tongue shape, only 2% below the optimal result from a raw principal component analysis with the same number of components. The vertices of the whole 3D model had a reconstruction error of 0.149 cm. The absolute values of the control parameters were compared between vowels and consonants. The tongue body moves more during vowel pronunciation, whereas the tip moves more flexibly during consonant pronunciation. The effects of these parameters can be interpreted from a biomechanical perspective by analyzing the tongue motions produced by its various muscles. Compared with the control parameters of a French tongue model, a parameter for controlling the tongue tip movement was added to our model, and a parameter for controlling the tongue root was removed, to reflect the strong dependence of Mandarin pronunciation on the tongue tip. The difference in the contribution rate of each control parameter between the two language models was consistent with the characteristics of tongue movement in each language. Conclusion This work produced a number of valuable results. First, a database of 3D geometrical descriptions of the tongue was established for a speaker sustaining a set of 49 Chinese allophones covering the speech possibilities of the subject. Second, LCA of these data revealed that six components could account for approximately 87.4% of the total variance in the tongue shape, the highest value reached compared with those of statistical models for other languages. The parameter extraction method gives the parameters a biomechanical interpretation. The statistical model is suitable for Chinese tongue modeling, but the steps for parameter extraction must be adjusted according to the pronunciation characteristics of each language to achieve the ideal simulation effect. 
Comparing tongue shape changes between vowels and consonants, and between Mandarin and French pronunciation, on the basis of control parameter values provides a new way of studying Mandarin pronunciation and comparing different languages.
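The two figures reported in the abstract, the share of variance explained by six components and the per-vertex reconstruction error, can be illustrated with a minimal sketch of a raw principal component analysis on mesh vertices. This is not the authors' guided parameter extraction from midsagittal contours; it is only the raw PCA baseline that the abstract uses for comparison. The data are synthetic, and the function names, the 200-vertex mesh size, and the noise level are assumptions made for illustration.

```python
import numpy as np


def fit_linear_shape_model(shapes, n_params):
    """Fit a raw PCA shape model (hypothetical helper, not the paper's code).

    shapes: array of shape (n_observations, n_vertices * 3), one flattened
    mesh per articulation, with vertices in correspondence across rows.
    """
    mean_shape = shapes.mean(axis=0)
    centered = shapes - mean_shape
    # Rows of vt are orthonormal principal directions of the centered data.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_params]
    # Control parameter values: projection of each shape on each component.
    scores = centered @ components.T
    explained = (singular_values[:n_params] ** 2).sum() / (singular_values ** 2).sum()
    return mean_shape, components, scores, float(explained)


def reconstruct(mean_shape, components, scores):
    """Linear model: shape = mean shape + parameters times components."""
    return mean_shape + scores @ components


def rms_vertex_error(original, rebuilt):
    """Root-mean-square Euclidean distance between corresponding vertices."""
    diff = (original - rebuilt).reshape(original.shape[0], -1, 3)
    distances = np.linalg.norm(diff, axis=2)
    return float(np.sqrt((distances ** 2).mean()))


# Synthetic stand-in for 49 articulations of a 200-vertex mesh (assumed sizes).
rng = np.random.default_rng(0)
n_obs, n_vertices, n_latent = 49, 200, 6
basis = rng.normal(size=(n_latent, n_vertices * 3))
latent = rng.normal(size=(n_obs, n_latent))
shapes = latent @ basis + 0.1 * rng.normal(size=(n_obs, n_vertices * 3))

mean_shape, components, scores, explained = fit_linear_shape_model(shapes, n_params=6)
rebuilt = reconstruct(mean_shape, components, scores)
error = rms_vertex_error(shapes, rebuilt)
print(f"explained variance ratio: {explained:.3f}")
print(f"RMS vertex error: {error:.3f}")
```

In a guided analysis such as the one the abstract describes, the components would instead be derived from articulatory measurements (for example, jaw position or midsagittal contour coordinates), which typically costs a little explained variance in exchange for interpretable parameters.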
Keywords
