3D motion control model of the tongue for Mandarin pronunciation
2019, Vol. 24, No. 11, pp. 1942-1951
Received: 2018-12-10; Revised: 2019-06-15; Accepted: 2019-06-22; Published in print: 2019-11-16
DOI: 10.11834/jig.180644
Objective
The accurate visualization of the articulators and their movement patterns during speech production is important for understanding pronunciation mechanisms, diagnosing and treating speech disorders, and studying human-computer speech interaction. The tongue, a key organ of speech production, is difficult to visualize because it moves rapidly, deforms in complex ways, and is invisible during pronunciation. This paper therefore proposes a statistical 3D dynamic control model of the tongue for Mandarin vowel and consonant pronunciation.
Method
First, magnetic resonance images (MRI) of a speaker were acquired during Mandarin vowel and consonant pronunciation; the tongue contours were extracted by manual marking, and static 3D mesh models were built. Next, with the model vertices as variables, control parameters were extracted by linear principal component analysis, and the tongue motion control equations were established. Finally, the quality of the simulated tongue motion control during pronunciation was evaluated.
Result
Six 3D-model motion control parameters were extracted, covering the tongue tip, tongue body, tongue dorsum, and jaw. The jaw parameter controls the tongue rotation caused by jaw opening and closing; the tongue-body and tongue-dorsum parameters control the front-back, arching, and hollowing movements of the tongue; and the tongue-tip parameters control the up-down, front-back, and upward-curling movements of the tip. Together, the six parameters explain 87.4% of the variance in 3D tongue motion, a simulation result better than the motion control results reported for other languages.
Conclusion
The proposed method can be effectively applied to tongue modeling and 3D motion control for Mandarin pronunciation and reduces the complexity of 3D tongue motion modeling. The results provide useful information for visualizing the articulators during Mandarin pronunciation.
Objective
The accurate visualization of vocal organs and their movement patterns during pronunciation is crucial for understanding the pronunciation mechanism, diagnosing and treating speech diseases, and human-computer interaction research. As an important vocal organ, the tongue is not completely visible and moves rapidly and flexibly during speaking; therefore, it is difficult to visualize. Advances in medical imaging techniques in recent years have made it possible to capture clear tongue images, thus promoting the development of modeling strategies. Among these strategies, the three most common methods are parametric modeling, physiological modeling, and statistical modeling. Statistical modeling has the advantages of simple calculation, minimal control parameters, fast simulation speed, and strong interpretability, and it is suitable for developing a real-time speech training system. However, few studies have applied this method to tongue modeling for Chinese Mandarin pronunciation, and existing statistical models have drawbacks in precision and simulation capability. Therefore, this study proposes an improved 3D dynamic control model of the tongue based on statistical modeling for Mandarin vowel-consonant pronunciation.
Method
The control parameters were extracted by statistical modeling based on linear principal component analysis, under the assumption that the tongue motion and the control parameters are linearly related. First, a representative corpus was established on the basis of tongue shape variation during Mandarin vowel-consonant pronunciation. The corpus comprised 49 artificially sustained articulations designed to cover the maximal range of Mandarin allophones, namely 8 vowels, 40 consonants in consonant-vowel (CV) sequences, and a rest position. On the basis of this corpus, sagittal volume images of the tongue from one speaker were acquired by magnetic resonance imaging (MRI), and supplementary images of the hard palate, jaw, and teeth were acquired by computed tomography (CT). The images were preprocessed, and the upper and lower jaws in the CT images were manually filled into the MRI. The 3D tongue volume composed of the sagittal MRI slices was segmented horizontally to obtain the corresponding axial slices. According to the distribution of the tongue muscles, a tongue contour-marking method was designed on the basis of a semi-polar grid proposed for tongue research. Afterward, the tongue contours were manually edited in the sagittal and transverse MRI using the designed method to build models described as triangular meshes, which were combined into full 3D models of the tongue. The 3D surface mesh model of the resting tongue was selected as the reference articulation and then fitted by elastic deformation to each of the 3D sets of planar contours to meet the requirement of linear principal component analysis (LCA) for vertex correspondence across observations. The vertices of the geometric model were taken as variables, the control parameters were extracted from the midsagittal contours of the models using a statistical method, and the simulation error of these parameters and their contribution to controlling the overall 3D shape of the tongue were evaluated.
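The parameter-extraction step described above can be sketched in NumPy. This is a minimal illustration only: the observation matrix is synthetic random data standing in for the 49 articulations, and the shapes and variable names are assumptions, not the paper's actual data or code. It shows how linear PCA yields a small set of control parameters and a measure of the variance they capture.

```python
import numpy as np

# Hypothetical observation matrix: 49 articulations, each a mesh of
# n_vertices 3D points flattened into one row of length 3 * n_vertices.
rng = np.random.default_rng(0)
n_obs, n_vertices = 49, 2000
X = rng.normal(size=(n_obs, 3 * n_vertices))  # stand-in for tongue meshes

# Center on the mean shape (cf. the rest-position reference articulation).
mean_shape = X.mean(axis=0)
Xc = X - mean_shape

# Linear PCA via SVD; each right-singular vector is one deformation mode.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 6                          # number of control parameters retained
components = Vt[:k]            # (6, 3V) deformation basis
params = Xc @ components.T     # (49, 6) control parameters, one row per shape

# Proportion of total shape variance captured by the six modes.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
```

With real articulatory data the six modes would correspond to the interpretable jaw, tongue-body, tongue-dorsum, and tongue-tip parameters; with random data the ratio `explained` is of course much lower than the 87.4% reported.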
Result
A triangular surface mesh tongue model consisting of more than 2 000 vertices was established for Mandarin vowel-consonant pronunciation, and its movement was controlled by six parameters. One parameter governs the tongue rotation around a point on the tongue back, two control the front-back and flattening-bunching movements of the tongue, and the last three control the up-down, front-back, and upward-curling movements of the tongue tip. The six parameters were combined to reconstruct the 3D tongue model. The parameters could explain 87.4% of the variance in tongue shape, only 2% below the optimal result from a raw principal component analysis with the same number of components. The vertices of the whole 3D model had a reconstruction error of 0.149 cm. The absolute values of the control parameters were compared from the vowel and consonant perspectives: the tongue body moves more during vowel pronunciation, whereas the tip moves more flexibly during consonant pronunciation. The effects of these parameters can be interpreted from a biomechanical perspective by analyzing the tongue motions caused by its various muscles. Compared with the control parameters of a French tongue model, a parameter for controlling the tongue tip movement was added to our model, and a parameter for controlling the tongue root was removed, reflecting the strong dependence of Mandarin pronunciation on the tongue tip. The difference in the contribution rate of each control parameter between the two language models is consistent with the characteristics of the tongue movement in each language.
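The reconstruction and error measurement reported here follow directly from the linear model: a shape is rebuilt as the mean shape plus a parameter-weighted sum of deformation modes, and the residual gives a per-vertex error. The sketch below uses synthetic data and an assumed orthonormal six-mode basis; it illustrates the computation, not the paper's actual figures.

```python
import numpy as np

rng = np.random.default_rng(1)
n_vertices = 2000

# Assumed model pieces: a mean shape and an orthonormal basis of six
# deformation modes (rows), as produced by a linear PCA of tongue meshes.
mean_shape = rng.normal(size=3 * n_vertices)
components = np.linalg.qr(rng.normal(size=(3 * n_vertices, 6)))[0].T  # (6, 3V)

# Project an observed shape onto the six-mode basis ...
observed = mean_shape + rng.normal(size=3 * n_vertices)  # synthetic shape
p = components @ (observed - mean_shape)                 # six control parameters

# ... and rebuild it: mean shape + parameter-weighted sum of modes.
recon = mean_shape + p @ components

# Per-vertex RMS reconstruction error (cf. the 0.149 cm reported).
err = (recon - observed).reshape(-1, 3)
rms = np.sqrt((err ** 2).sum(axis=1).mean())
```

Because the basis is orthonormal, the reconstruction is the least-squares best six-mode approximation of the observed shape, which is why the residual shrinks as modes are added.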
Conclusion
This work produced several valuable results. First, a database of 3D geometrical descriptions of the tongue was established for a speaker sustaining a set of 49 Chinese allophones covering the speech possibilities of the subject. Second, linear principal component analysis of these data revealed that six components account for approximately 87.4% of the total variance in tongue shape, the highest value compared with the statistical models of other languages. The parameter-extraction method gives the parameters a biomechanical interpretation. The statistical model is suitable for Chinese tongue modeling, but the parameter-extraction steps must be adjusted to the pronunciation characteristics of each language to achieve the ideal simulation effect. Contrasting tongue shape changes between vowels and consonants, and between Mandarin and French pronunciation, on the basis of the control parameter values provides a new way of studying Mandarin pronunciation and comparing different languages.