Abstract (English)
This thesis analyzes the differences in Chinese speech comprehension under simulated noise and reverberation conditions across various cochlear implant (CI) speech coding strategies, and develops a deep learning dereverberation strategy for CI speech encoding to address reverberation. A CI is an assistive hearing device implanted in the cochlea that stimulates the auditory nerve directly, helping to restore auditory perception for patients with severe hearing loss. Many CI speech coding strategies exist; the Advanced Combination Encoder (ACE) strategy is the most widely used and commercialized. In recent years, researchers have also developed CI processing strategies based on neural network architectures, such as ElectrodeNet-CS. In quiet environments, CI speech coding strategies typically provide good speech recognition and comprehension; however, under noise and reverberation interference, the speech recognition and comprehension abilities of CI users decline drastically.
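The ACE strategy's core idea is n-of-m channel selection: in each analysis frame, only the n band envelopes with the largest magnitudes (out of m filterbank channels) are kept for electrode stimulation. The sketch below is a simplified illustration of that principle only, with hypothetical values; `n_of_m_select` is not part of any CI toolkit, and a real coding strategy also performs filterbank analysis, loudness mapping, and pulse sequencing.

```python
def n_of_m_select(envelopes, n):
    """Keep the n largest of m channel envelopes for one frame; zero the rest.

    envelopes: list of m non-negative band-envelope magnitudes (one frame).
    Returns a list of the same length with non-selected channels set to 0.0.
    """
    # Indices of the n channels with the largest envelope values.
    keep = sorted(range(len(envelopes)), key=lambda i: envelopes[i])[-n:]
    return [e if i in keep else 0.0 for i, e in enumerate(envelopes)]


# Toy frame: 8 channels, keep the 4 maxima (channels 1, 3, 5, 7 here).
frame = [0.1, 0.9, 0.3, 0.7, 0.05, 0.6, 0.2, 0.8]
selected = n_of_m_select(frame, 4)
```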
This study first conducted a clinical trial comparing the ACE and ElectrodeNet-CS strategies, examining differences in the word recognition scores (WRS) of cochlear implant users under various noise conditions. According to the results, both strategies achieved an average WRS above 80\% in clean speech. However, as the noise level increased, the average scores of CI users declined significantly; at a signal-to-noise ratio (SNR) of -5 dB, recognition scores dropped below 10\%. Noise thus severely impairs the speech recognition of CI users, and current CI speech coding strategies still cannot handle it effectively. To evaluate whether the neural-network-based coding strategy performs comparably to the traditional one, paired-samples t-tests were applied to the experimental data of the two strategies. The tests showed no significant differences between the two strategies, indicating that a neural network strategy can achieve functionality similar to the traditional strategy.
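A paired-samples t-test compares the two strategies listener by listener, testing whether the mean of the per-listener score differences departs from zero. A minimal sketch with hypothetical WRS values (the actual trial data are not reproduced here; in practice `scipy.stats.ttest_rel` gives the t statistic and p-value directly):

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Paired-samples t statistic for two equal-length score lists.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are per-subject differences.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of differences
    return mean_d / (sd_d / math.sqrt(n))


# Hypothetical WRS (%) for the same five listeners under the two strategies.
wrs_ace = [82, 78, 85, 80, 76]
wrs_enet = [80, 79, 84, 81, 75]
t_stat = paired_t(wrs_ace, wrs_enet)
```

A small |t| (compared against the t distribution with n - 1 degrees of freedom) is what "no significant difference" means here.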
Subsequently, a second clinical trial was conducted using the LS-Unet deep learning dereverberation model as a preprocessing step for both strategies. The study compared the speech recognition performance of the two strategies under various noise and reverberation conditions, evaluating whether LS-Unet preprocessing effectively improves the speech recognition ability of cochlear implant users affected by noise and reverberation. According to the results, the average word recognition scores under all noise and reverberation conditions were below 50\%: adding LS-Unet to either strategy did not yield good recognition performance, indicating that there is still significant room for improvement in handling noise and reverberation. Among the participants of Clinical Trial II, one case scored markedly higher than the others, exceeding 70\% under all conditions. This individual was the only participant with congenital hearing loss who received a cochlear implant before acquiring speech, which may explain the superior recognition ability.
Finally, the clinical trial results showed that current cochlear implant speech coding strategies cannot effectively convert speech signals under noise and reverberation conditions. This study therefore aims to improve the recognition ability of cochlear implant users affected by reverberation by using a Unet model to separate speech features from reverberant signals. By adding network layers with different functions, an innovative deep learning dereverberation CI speech coding strategy, RT-Unet, was developed. The model takes reverberant speech spectral signals as input, processes them, and outputs electrode stimulation signals. This thesis describes the model architecture, training methods, and evaluation results in detail. In the evaluation, RT-Unet achieved excellent objective scores on reverberant speech: the average Short-Time Objective Intelligibility (STOI) score across all reverberation-time conditions reached 0.76622, and the Normalized Covariance Metric (NCM) score reached 0.91302. Compared with the objective scores of other CI speech coding strategies, RT-Unet performed better, demonstrating the feasibility of the RT-Unet architecture and providing a promising research and development direction for cochlear implant speech signal processing.
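Reverberant test material for a given reverberation time (RT60, the time for the sound level to decay by 60 dB) is commonly produced by convolving clean speech with a room impulse response (RIR). The sketch below assumes a simplified exponentially decaying white-noise RIR model rather than a measured or image-method RIR, and the function names are illustrative only:

```python
import math
import random


def synth_rir(rt60, fs, length_s=0.5):
    """Decaying-noise room impulse response with a target RT60.

    The amplitude envelope exp(-decay * t) falls by 60 dB (a factor of
    1000) at t = rt60 seconds, matching the definition of RT60.
    """
    decay = math.log(1000.0) / rt60  # amplitude decay rate in 1/s
    rng = random.Random(0)  # fixed seed for reproducibility
    return [rng.gauss(0.0, 1.0) * math.exp(-decay * (i / fs))
            for i in range(int(length_s * fs))]


def convolve(x, h):
    """Direct-form convolution: reverberant speech = clean speech * RIR."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y
```

Varying `rt60` over the training set is how the "reverberation-time conditions" evaluated above are typically parameterized.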
References
Berouti, M., Schwartz, R., & Makhoul, J. (1979). Enhancement of speech corrupted by acoustic noise. In ICASSP '79: IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 4, pp. 208–211).
Ephraim, Y., & Malah, D. (1984). Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6), 1109–1121.
Gajecki, T., Zhang, Y., & Nogueira, W. (2023). A deep denoising sound coding strategy for cochlear implants. IEEE Transactions on Biomedical Engineering.
Goldsworthy, R. L., & Greenberg, J. E. (2004). Analysis of speech-based speech transmission index methods with implications for nonlinear operations. The Journal of the Acoustical Society of America, 116(6), 3679–3689.
Hochberg, I., Boothroyd, A., Weiss, M., & Hellman, S. (1992). Effects of noise and noise suppression on speech perception by cochlear implant users. Ear and Hearing, 13(4), 263–271.
Huang, E. H.-H., Chao, R., & Tsao, Y. (2024). ElectrodeNet – a deep learning based sound coding strategy for cochlear implants. IEEE Transactions on Cognitive and Developmental Systems, 16(1), 346–357.
Lea, C., Flynn, M. D., Vidal, R., Reiter, A., & Hager, G. D. (2017). Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 156–165).
León, D., & Tobar, F. (2021). Late reverberation suppression using U-Nets. arXiv preprint arXiv:2110.02144.
Lim, J. S., & Oppenheim, A. V. (1979). Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE, 67(12), 1586–1604.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431–3440).
Luo, Y., & Mesgarani, N. (2019). Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8), 1256–1266.
Nogueira, W., Büchner, A., Lenarz, T., & Edler, B. (2005). A psychoacoustic "NofM"-type speech coding strategy for cochlear implants. EURASIP Journal on Advances in Signal Processing, 2005, 1–16.
World Health Organization. (2021). WHO: 1 in 4 people projected to have hearing problems by 2050. Retrieved from https://www.who.int/news/item/02-03-2021-who-1-in-4-people-projected-to-have-hearing-problems-by-2050
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III (pp. 234–241).
Sun, L., Du, J., Dai, L.-R., & Lee, C.-H. (2017). Multiple-target deep learning for LSTM-RNN based speech enhancement. In 2017 Hands-Free Speech Communications and Microphone Arrays (HSCMA) (pp. 136–140).
Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2010). A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4214–4217).
Wilson, B. S., Finley, C. C., Lawson, D. T., Wolford, R. D., Eddington, D. K., & Rabinowitz, W. M. (1991). Better speech recognition with cochlear implants. Nature, 352(6332), 236–238.
Wouters, J., McDermott, H. J., & Francart, T. (2015). Sound coding in cochlear implants: From electric pulses to hearing. IEEE Signal Processing Magazine, 32(2), 67–80.
Xu, Y., Du, J., Dai, L.-R., & Lee, C.-H. (2014). A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1), 7–19.
Zhao, L., Zhu, W., Li, S., Luo, H., Zhang, X.-L., & Rahardja, S. (2024). Multi-resolution convolutional residual neural networks for monaural speech dereverberation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
吳俊易. (2023). Neural networks applied to cochlear implant speech dereverberation (Unpublished master's thesis). Institute of Electrical Engineering, National Central University.
林金賢. (2021). A study of deep learning for speech reverberation suppression (Unpublished master's thesis). Institute of Electrical Engineering, National Central University.
黃國原. (2009). Effects of simulated cochlear implant channel number, stimulation rate, and binaural listening on Mandarin speech recognition in noise (Unpublished master's thesis). Institute of Electrical Engineering, National Central University.
黃銘緯. (2005). Mandarin speech perception test in noise in Taiwan (Unpublished master's thesis). Graduate Institute of Speech and Hearing Disorders Sciences, National Taipei College of Nursing.