Abstract: As society has paid increasing attention to people with disabilities in recent years, more and more individuals with articulation disorders now seek speech therapy and rehabilitation. Developing a tool that speech-language therapists can use to assist diagnosis and rehabilitation has therefore become increasingly important. The purpose of this study is to develop a software-based visual speech diagnosis and rehabilitation system. Through the user interface, the user can record speech signals from a normal speaker and from a patient with an articulation disorder, then compare the two signals' waveform, spectrum, spectrogram, and fundamental frequency to obtain a quantitative, objective analysis. In addition, the system scores the patient's speech by comparing it with the normal speaker's on several aspects, including pitch, vowel classification, voiced/unvoiced detection, fricative detection, and sound intensity. For pitch extraction, the cepstrum method and the Simplified Inverse Filter Tracking (SIFT) algorithm were adopted; for vowel classification, the K-Nearest Neighbors (K-NN) and Multilayer Perceptron (MLP) models were used. The voiced/unvoiced decision is based on pitch, short-time energy, and zero-crossing rate, while fricatives are scored from pitch together with the location and intensity of the strident spectral peaks. The system's automatic scoring quantifies the speech with the Adaptive Signed Correlation Index (ASCI): for vowels, it measures the similarity in pitch and voiced/unvoiced behavior between the patient's speech and the normal speaker's, using the Euclidean distance between the two as the scoring criterion; consonants are scored by comparing the strident peaks and voiced/unvoiced information. Finally, the system averages the vowel and consonant scores to give the user a quantitative result for the speech signal.
To evaluate the practicality, functionality, and accuracy of the system, this study analyzed eight sets of recordings (six adults and two children; four males and four females) collected in a previous study in cooperation with the speech-therapy teams of the rehabilitation departments of Taipei Veterans General Hospital and the Xinwu Branch of Taoyuan General Hospital, Ministry of Health and Welfare. Over a total of 2,086 analyzed frames, the cepstrum method's pitch error rate of 5.32% was slightly lower than the SIFT method's 6.6%. For vowel classification, the MLP model's accuracy (92.61% for men, 86.75% for women, and 83.75% for children) and speed were slightly better than the K-NN model's (91.67%, 86.21%, and 80.69%). The system's fricative scoring agreed with the ratings of four evaluators (males, 23-27 years old) in 79.7% of cases (51 of 64 ratings), and the overall scores agreed in 81.25% of cases (156 of 192 items). Besides being simple to operate, the system, as the experimental results show, provides professional analysis and can serve as a tool for assessing, diagnosing, and rehabilitating the articulation of people with speech disorders.

Since speech is one of the primary means of communication, the need for speech diagnosis and rehabilitation among patients with speech disorders is increasing. Therefore, the development of an advanced system to assist speech diagnosis and rehabilitation assessment is becoming more important. The purpose of this study is to develop a tool to assist speech therapy and rehabilitation, focused on a simple interface that lets the assessment be performed without particular knowledge of speech processing while also providing deeper analysis of the speech that can be useful to the speech therapist. In practice, the tool provides automatic scoring based on comparing the patient's speech signal with a normal speaker's on several aspects, including pitch, vowel classification, voiced/unvoiced detection, fricative detection, and sound intensity, to provide a quantitative analysis. To obtain accurate pitch estimation, this research compared two pitch tracking algorithms: the cepstrum method and the Simplified Inverse Filter Tracking (SIFT) method. In addition, this study compared two popular classification algorithms, K-Nearest Neighbors (K-NN) and Multilayer Perceptron (MLP), for classifying vowels based on pitch and formants.
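The cepstrum pitch tracker compared above can be sketched as follows. This is a minimal illustration of the general technique, not the thesis's implementation: the frame length, window choice, and 60-400 Hz search range are assumptions for demonstration.

```python
import numpy as np

def cepstrum_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the pitch (Hz) of one frame via the real cepstrum.

    A peak in the cepstrum at quefrency T samples corresponds to a
    fundamental frequency of fs / T Hz. fmin/fmax bound the search
    range (illustrative values, not the thesis's parameters).
    """
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)  # avoid log(0)
    cepstrum = np.fft.irfft(log_mag)
    # Search only quefrencies that map into the plausible pitch range.
    qmin = int(fs / fmax)
    qmax = min(int(fs / fmin), len(cepstrum) - 1)
    peak = qmin + np.argmax(cepstrum[qmin:qmax])
    return fs / peak

# Usage: a synthetic harmonic tone with a 200 Hz fundamental at 8 kHz.
fs = 8000
t = np.arange(1024)
frame = sum(np.sin(2 * np.pi * 200 * k * t / fs) for k in range(1, 5))
print(f"{cepstrum_pitch(frame, fs):.1f} Hz")  # close to 200 Hz
```

The log-magnitude step separates the harmonic fine structure (which repeats every F0 bins) from the vocal-tract envelope, which is why the peak quefrency reveals the pitch period.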
The voiced/unvoiced decision employed speech information including pitch, short-time energy, and zero-crossing rate, while fricative detection employed pitch together with the location and intensity of the spectral peaks. The automatic scoring was then done with the Adaptive Signed Correlation Index (ASCI) to quantify the similarity of the pitch contour and the voiced/unvoiced classification. For vowel quality scoring, the Euclidean distance was measured as the scoring quantification when the two classes differed. For strident fricative detection, the scoring was based on the location of the spectral peak of the fricative segments, using distance metrics, and on the voiced/voiceless classification. Finally, the overall score was computed as the average of all feature scores.
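The energy and zero-crossing-rate part of the voiced/unvoiced decision can be sketched as below. The thresholds and the rule itself are illustrative assumptions (the thesis additionally uses pitch information); the sketch only shows the standard heuristic that voiced frames have high short-time energy and low zero-crossing rate, while unvoiced (fricative-like) frames show the opposite.

```python
import numpy as np

def voiced_unvoiced(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Classify a frame as 'voiced', 'unvoiced', or 'silence'.

    energy_thresh and zcr_thresh are assumed demo values, not the
    thesis's calibrated parameters.
    """
    energy = np.mean(frame ** 2)               # short-time energy
    signs = np.sign(frame)
    zcr = np.mean(np.abs(np.diff(signs)) > 0)  # zero-crossing rate per sample
    if energy < energy_thresh * 0.1:
        return "silence"                       # too quiet to classify
    if zcr < zcr_thresh and energy >= energy_thresh:
        return "voiced"                        # strong, low-frequency periodicity
    return "unvoiced"                          # weak/noisy, many zero crossings

# Usage: a 100 Hz tone (voiced-like) vs. white noise (fricative-like).
fs = 8000
t = np.arange(400)
tone = np.sin(2 * np.pi * 100 * t / fs)
noise = 0.3 * np.random.default_rng(0).standard_normal(400)
print(voiced_unvoiced(tone))   # voiced
print(voiced_unvoiced(noise))  # unvoiced
```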
In order to evaluate the performance and practicality of the system, this study analyzed eight patients' speech recordings (6 adults and 2 children; 4 males and 4 females) that had been recorded in a previous study in cooperation with Taipei Veterans General Hospital and Taoyuan General Hospital. On the pitch algorithm comparison, the experiment showed that over a total of 2,086 frames the cepstrum method had a gross pitch error (GPE) of 5.32%, lower than the SIFT method's 6.6%. For vowel classification, the MLP method provided better accuracy (92.61% for men, 86.75% for women, and 83.75% for children) than the K-NN method (91.67%, 86.21%, and 80.69%) and was up to 5 times faster in computation time. On fricative detection in particular, 51 of 64 audio observations (79.7%) made by 4 respondents (graduate students and laboratory members, males aged 23-27) were consistent with the tool's output. In total, of the 192 audio and visual observations made by the 4 respondents, 156 of the tool's grading results were consistent (81.25%). The experimental results also showed the advantage of the tool's simple and professional modes in indicating the differences between a normal speaker's speech and a speech-disordered patient's on several aspects, assisting speech diagnosis and rehabilitation.