|Abstract: ||本研究的目的是使用等效電路模型以及數學函數來建構產生母音的構音模型，其中結合了聲帶模型與聲道模型，並且使用文獻上所提供的生理參數作為依據，模擬人類正常情形下的語音產生。為了驗證此模型的正確與否，本論文分別使用Takemoto以及Story兩位學者在文獻中所提供的核磁共振造影(Magnetic Resonance Imaging, MRI)的聲道面積函數(vocal tract area function)，還有Rosenberg學者提供的聲門信號以及Two-Mass模型來驗證此模型。|
聲帶(vocal folds)位於喉部，是左右對稱的瓣膜結構，利用振動來產生聲音。在模擬聲門訊號中，我們使用Rosenberg學者所提供的數學函數以及Two-Mass模型來產生聲門訊號。Two-Mass模型是使用物理模型轉化成等效電路模型，其中利用兩個質量塊表示聲帶，彈簧以及阻尼來表示聲帶的肌肉運動。在聲道模型中，我們將聲道當成不同管子所組合而成的多節管。在無損管(lossless tube)的模型中，我們可以利用流速以及聲壓的變化得出一數學模型，但是雖然這種方法較為簡單，卻忽略了聲道管壁對於語音的影響。MAEDA學者則是提供了一個包含了聲道管壁能量消耗的聲道系統模型，其中也給出了將模型轉換成等效電路的方法，利用此模型，就可以結合聲門訊號產生想要的語音。
本論文使用Story學者(/AA/、/IY/、/UW/、/AE/、/AO/)以及Takemoto(/a/、/i/、/u/、/e/、/o/)學者在文獻中所提供的母音聲道面積函數模擬母音，並且比較母音的前三共振峰值與聲道形狀的關聯。另外也使用Rosenberg學者提供的聲門訊號以及Two-Mass聲帶模型產生的訊號與MAEDA模型作結合，並且觀察使用不同的聲門訊號對語音會有什麼影響。研究結果顯示，Rosenberg訊號與Two-Mass聲帶模型在頻域上一樣保有低通濾波器的特性。而Two-Mass聲帶模型，在低頻的能量上會較明顯，高頻的能量則衰減的較快。搭配本論文的構音系統模型，這兩種聲門信號都能夠模擬英、日文母音的發音。但是搭配DIVA (Directions Into Velocities Articulator, DIVA) 模型在模擬日文的時候，共振峰值超出DIVA能夠模擬的範圍，所以沒辦法產生正確的日文母音。至於我們的模擬結果與Story學者(其聲道節數依不同母音分別為42~46節)的結果比較，前三共振峰值的平均誤差分別為-7.4、2.58以及-0.46%；而比較Takemoto學者(其聲道節數依不同母音分為68~75節)的模擬結果，前三個共振峰值平均誤差為-2.01、1.99以及0.75%。以上結果顯示，本論文的模型可以成功的模擬英文以及日語母音的聲音，並且能夠使用生理參數調控模型，而且當聲道分割成越多節管時，本模型的母音前三共振峰的準確度越高。
The purpose of this study is to build an articulatory model of vowel production that uses an equivalent lumped-element electric circuit and mathematical functions to represent the vocal fold and vocal tract systems. The model takes its physiological parameters from the literature and simulates an individual's vowel production under normal circumstances. To verify the model, we used vocal tract area functions of vowels from the magnetic resonance imaging (MRI) studies of Takemoto's group and of Story, together with two glottal sources: the Rosenberg glottal signal and the two-mass model.
The vocal folds, located in the larynx, are a pair of symmetrical mucous-membrane folds that generate sound through vibration. We simulated the glottal signal in two ways: with the mathematical function given by Rosenberg, and with the two-mass model, which converts a physical model into an equivalent circuit and represents the vocal folds as two coupled mass-spring-damper systems.
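As an illustration of a Rosenberg-type glottal source, the sketch below generates one period of a commonly cited trigonometric pulse shape: a raised-cosine opening phase followed by a quarter-cosine closing phase. The function name `rosenberg_pulse` and the default opening and closing fractions are assumptions chosen for illustration; the exact variant and parameter values used in the thesis are not stated in this abstract.

```python
import math


def rosenberg_pulse(n_samples, open_frac=0.4, close_frac=0.16):
    """Return one period of a trigonometric Rosenberg-style glottal pulse.

    open_frac and close_frac are the fractions of the period spent in the
    opening and closing phases (assumed values, not taken from the thesis).
    """
    n_open = int(round(open_frac * n_samples))
    n_close = int(round(close_frac * n_samples))
    pulse = []
    for n in range(n_samples):
        if n < n_open:
            # Opening phase: rises smoothly from 0 towards 1.
            g = 0.5 * (1.0 - math.cos(math.pi * n / n_open))
        elif n < n_open + n_close:
            # Closing phase: falls from 1 towards 0 along a quarter cosine.
            g = math.cos(math.pi * (n - n_open) / (2.0 * n_close))
        else:
            # Closed phase: no glottal flow.
            g = 0.0
        pulse.append(g)
    return pulse


one_period = rosenberg_pulse(100)
```

Repeating `one_period` at the desired fundamental frequency gives a periodic source signal that can drive a vocal tract model.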
In this study, the vocal tract from the glottis to the lips was modeled as a tube made up of many concatenated sections. In the lossless tube model, a mathematical description of the vocal tract can be derived from the variation of volume velocity and sound pressure along the tube. Although this approach is relatively simple, it ignores the effect of the vocal tract wall on the speech produced. In contrast, MAEDA proposed a vocal tract model that accounts for energy dissipation at the vocal tract wall and also gave a way to transform the physical model into an equivalent electric circuit. Combining MAEDA's vocal tract model with a glottal signal makes it possible to synthesize the desired vowels.
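The lossless concatenated-tube idea can be sketched as follows. Each section is treated as a short lossless acoustic line, the sections are chained from glottis to lips, and the formants are taken as the frequencies where the lip-to-glottis volume velocity ratio vanishes. This is a minimal sketch under stated assumptions, not MAEDA's lossy model: the names `chain_d` and `lossless_formants`, the speed of sound, and the zero-pressure (no radiation load) condition at the lips are all choices made for illustration.

```python
import math

C_SOUND = 350.0  # assumed speed of sound in warm, moist air (m/s)


def chain_d(freq_hz, areas, section_len):
    """Lower-right element D of the glottis-to-lips chain matrix.

    Each lossless section has the matrix [[cos, j*z*sin], [j*sin/z, cos]]
    with z proportional to 1/area. Only the bottom row (j*c, d) is tracked,
    which keeps the arithmetic real-valued.
    """
    theta = 2.0 * math.pi * freq_hz * section_len / C_SOUND
    cs, sn = math.cos(theta), math.sin(theta)
    c, d = 0.0, 1.0  # bottom row of the identity matrix
    for area in areas:  # ordered from glottis to lips
        z = 1.0 / area  # the constant rho*c cancels in d, so it is omitted
        c, d = c * cs + d * sn / z, d * cs - c * z * sn
    return d


def lossless_formants(areas, section_len, f_max=5000.0, step=5.0):
    """Find formants as zeros of D(f), assuming zero pressure at the lips."""
    found = []
    f_lo = step
    d_lo = chain_d(f_lo, areas, section_len)
    while f_lo < f_max:
        f_hi = f_lo + step
        d_hi = chain_d(f_hi, areas, section_len)
        if (d_lo > 0.0) != (d_hi > 0.0):  # sign change brackets a zero
            lo, hi, lo_positive = f_lo, f_hi, d_lo > 0.0
            for _ in range(40):  # refine the bracket by bisection
                mid = 0.5 * (lo + hi)
                if (chain_d(mid, areas, section_len) > 0.0) == lo_positive:
                    lo = mid
                else:
                    hi = mid
            found.append(0.5 * (lo + hi))
        f_lo, d_lo = f_hi, d_hi
    return found


# Uniform tube: 35 sections of 0.5 cm, each with a 3 cm^2 cross-section.
uniform_formants = lossless_formants([3.0e-4] * 35, 0.005)
```

For the uniform 17.5 cm tube in the usage line, the first three values come out close to 500, 1500, and 2500 Hz, matching the quarter-wavelength formula (2n - 1)c / (4L). Replacing the constant list with a measured area function gives the lossless estimate for that vowel shape.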
We used the vowel area functions published by Story (/AA/, /IY/, /UW/, /AE/, /AO/) and by Takemoto (/a/, /i/, /u/, /e/, /o/) to simulate the corresponding vowels, and compared the first three formant frequencies with the vocal tract shapes. We also combined the Rosenberg glottal signal and the two-mass vocal fold model with the MAEDA model and observed how the choice of glottal signal affects the synthesized vowels.
The results showed that both the Rosenberg signal and the two-mass model have low-pass characteristics in the frequency domain. The two-mass model, however, showed more pronounced low-frequency energy and faster attenuation at high frequencies. Combined with the vocal tract model of this study, both glottal signals were able to simulate English and Japanese vowels. When they were combined with the vocal tract portion of the DIVA (Directions Into Velocities of Articulators) model, however, correct Japanese vowels could not be produced, because the formant frequencies fell outside the range that the DIVA model can simulate.
In addition, we compared our simulations with Story's results (42 to 46 vocal tract sections, depending on the vowel) and found mean errors for the first three formant frequencies of -7.4%, -2.58%, and -0.46%, respectively. Compared with Takemoto's results (68 to 75 vocal tract sections, depending on the vowel), the mean errors were -2.01%, 1.99%, and -0.75%, respectively. In summary, our model can simulate an individual's vowel production under normal circumstances using physiological data from the literature, and the accuracy of the first three formants increases as the vocal tract is divided into more sections.
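The mean errors quoted above can be read as signed percentage differences averaged over the vowels, one mean per formant. The sketch below shows that arithmetic. The sign convention (simulated minus reference, divided by reference) is an assumption, since the abstract does not define it, and the formant values in the usage lines are hypothetical placeholders, not data from the thesis or from Story or Takemoto.

```python
def mean_percent_errors(simulated, reference):
    """Mean signed percentage error per formant, averaged over vowels.

    simulated and reference are lists with one tuple of formant
    frequencies (F1, F2, F3, ...) per vowel, in the same vowel order.
    """
    n_vowels = len(simulated)
    n_formants = len(simulated[0])
    means = []
    for k in range(n_formants):
        total = 0.0
        for sim, ref in zip(simulated, reference):
            # Assumed convention: positive when the simulation is too high.
            total += 100.0 * (sim[k] - ref[k]) / ref[k]
        means.append(total / n_vowels)
    return means


# Hypothetical placeholder values in Hz, for illustration only.
hypothetical_simulated = [(510.0, 1470.0, 2500.0), (291.0, 2040.0, 3000.0)]
hypothetical_reference = [(500.0, 1500.0, 2500.0), (300.0, 2000.0, 3000.0)]
errors = mean_percent_errors(hypothetical_simulated, hypothetical_reference)
```

Because the errors are signed, overestimates and underestimates on different vowels can cancel in the mean, so a small mean error does not by itself imply a small error on every vowel.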