Abstract (English)
This thesis analyzes the differences in Chinese speech comprehension under simulated noise and reverberation conditions across various cochlear implant (CI) speech coding strategies, and develops a deep learning dereverberation strategy for CI speech encoding to address reverberation. A CI is an assistive hearing device implanted in the cochlea that stimulates the auditory nerve directly, helping to restore auditory perception for patients with severe hearing loss. Many CI speech coding strategies exist; the Advanced Combination Encoder (ACE) strategy is the most widely used and commercialized. In recent years, researchers have also developed CI processing strategies based on neural network architectures, such as ElectrodeNet-CS. In quiet environments, CI speech coding strategies typically provide good speech recognition and comprehension; however, under noise and reverberation interference, the speech recognition and comprehension abilities of CI users decline drastically.
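The ACE strategy's core idea is n-of-m channel selection: in each analysis frame, only the n band envelopes with the largest magnitudes (out of m filterbank channels) are kept for electrode stimulation. The sketch below is a simplified illustration of that principle only, with hypothetical values; `n_of_m_select` is not part of any CI toolkit, and a real coding strategy also performs filterbank analysis, loudness mapping, and pulse sequencing.

```python
def n_of_m_select(envelopes, n):
    """Keep the n largest of m channel envelopes for one frame; zero the rest.

    envelopes: list of m non-negative band-envelope magnitudes (one frame).
    Returns a list of the same length with non-selected channels set to 0.0.
    """
    # Indices of the n channels with the largest envelope values.
    keep = sorted(range(len(envelopes)), key=lambda i: envelopes[i])[-n:]
    return [e if i in keep else 0.0 for i, e in enumerate(envelopes)]


# Toy frame: 8 channels, keep the 4 maxima (channels 1, 3, 5, 7 here).
frame = [0.1, 0.9, 0.3, 0.7, 0.05, 0.6, 0.2, 0.8]
selected = n_of_m_select(frame, 4)
```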
This study first conducted a clinical trial comparing the ACE and ElectrodeNet-CS strategies, examining differences in the word recognition scores (WRS) of cochlear implant users under various noise conditions. According to the results, both strategies achieved an average WRS above 80\% in clean speech. However, as the noise level increased, the average scores of CI users declined significantly; at a signal-to-noise ratio (SNR) of -5 dB, recognition scores dropped below 10\%. Noise thus severely impairs the speech recognition of CI users, and current CI speech coding strategies still cannot handle it effectively. To evaluate whether the neural-network-based coding strategy performs comparably to the traditional one, paired-samples t-tests were applied to the experimental data of the two strategies. The tests showed no significant differences between the two strategies, indicating that a neural network strategy can achieve functionality similar to the traditional strategy.
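A paired-samples t-test compares the two strategies listener by listener, testing whether the mean of the per-listener score differences departs from zero. A minimal sketch with hypothetical WRS values (the actual trial data are not reproduced here; in practice `scipy.stats.ttest_rel` gives the t statistic and p-value directly):

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Paired-samples t statistic for two equal-length score lists.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are per-subject differences.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of differences
    return mean_d / (sd_d / math.sqrt(n))


# Hypothetical WRS (%) for the same five listeners under the two strategies.
wrs_ace = [82, 78, 85, 80, 76]
wrs_enet = [80, 79, 84, 81, 75]
t_stat = paired_t(wrs_ace, wrs_enet)
```

A small |t| (compared against the t distribution with n - 1 degrees of freedom) is what "no significant difference" means here.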
Subsequently, a second clinical trial was conducted using the LS-Unet deep learning dereverberation model as a preprocessing step for both strategies. The study compared the speech recognition performance of the two strategies under various noise and reverberation conditions, evaluating whether LS-Unet preprocessing effectively improves the speech recognition ability of cochlear implant users affected by noise and reverberation. According to the results, the average word recognition scores under all noise and reverberation conditions were below 50\%: adding LS-Unet to either strategy did not yield good recognition performance, indicating that there is still significant room for improvement in handling noise and reverberation. Among the participants of Clinical Trial II, one case scored markedly higher than the others, exceeding 70\% under all conditions. This individual was the only participant with congenital hearing loss who received a cochlear implant before acquiring speech, which may explain the superior recognition ability.
Finally, the clinical trial results showed that current cochlear implant speech coding strategies cannot effectively convert speech signals under noise and reverberation conditions. This study therefore aims to improve the recognition ability of cochlear implant users affected by reverberation by using a Unet model to separate speech features from reverberant signals. By adding network layers with different functions, an innovative deep learning dereverberation CI speech coding strategy, RT-Unet, was developed. The model takes reverberant speech spectral signals as input, processes them, and outputs electrode stimulation signals. This thesis describes the model architecture, training methods, and evaluation results in detail. In the evaluation, RT-Unet achieved excellent objective scores on reverberant speech: the average Short-Time Objective Intelligibility (STOI) score across all reverberation-time conditions reached 0.76622, and the Normalized Covariance Metric (NCM) score reached 0.91302. Compared with the objective scores of other CI speech coding strategies, RT-Unet performed better, demonstrating the feasibility of the RT-Unet architecture and providing a promising research and development direction for cochlear implant speech signal processing.
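Reverberant test material for a given reverberation time (RT60, the time for the sound level to decay by 60 dB) is commonly produced by convolving clean speech with a room impulse response (RIR). The sketch below assumes a simplified exponentially decaying white-noise RIR model rather than a measured or image-method RIR, and the function names are illustrative only:

```python
import math
import random


def synth_rir(rt60, fs, length_s=0.5):
    """Decaying-noise room impulse response with a target RT60.

    The amplitude envelope exp(-decay * t) falls by 60 dB (a factor of
    1000) at t = rt60 seconds, matching the definition of RT60.
    """
    decay = math.log(1000.0) / rt60  # amplitude decay rate in 1/s
    rng = random.Random(0)  # fixed seed for reproducibility
    return [rng.gauss(0.0, 1.0) * math.exp(-decay * (i / fs))
            for i in range(int(length_s * fs))]


def convolve(x, h):
    """Direct-form convolution: reverberant speech = clean speech * RIR."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y
```

Varying `rt60` over the training set is how the "reverberation-time conditions" evaluated above are typically parameterized.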
References
Berouti, M., Schwartz, R., & Makhoul, J. (1979). Enhancement of speech corrupted by acoustic noise. In ICASSP '79: IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 4, pp. 208–211).
Ephraim, Y., & Malah, D. (1984). Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6), 1109–1121.
Gajecki, T., Zhang, Y., & Nogueira, W. (2023). A deep denoising sound coding strategy for cochlear implants. IEEE Transactions on Biomedical Engineering.
Goldsworthy, R. L., & Greenberg, J. E. (2004). Analysis of speech-based speech transmission index methods with implications for nonlinear operations. The Journal of the Acoustical Society of America, 116(6), 3679–3689.
Hochberg, I., Boothroyd, A., Weiss, M., & Hellman, S. (1992). Effects of noise and noise suppression on speech perception by cochlear implant users. Ear and Hearing, 13(4), 263–271.
Huang, E. H.-H., Chao, R., & Tsao, Y. (2024). ElectrodeNet – a deep learning based sound coding strategy for cochlear implants. IEEE Transactions on Cognitive and Developmental Systems, 16(1), 346–357.
Lea, C., Flynn, M. D., Vidal, R., Reiter, A., & Hager, G. D. (2017). Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 156–165).
León, D., & Tobar, F. (2021). Late reverberation suppression using U-Nets. arXiv preprint arXiv:2110.02144.
Lim, J. S., & Oppenheim, A. V. (1979). Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE, 67(12), 1586–1604.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431–3440).
Luo, Y., & Mesgarani, N. (2019). Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8), 1256–1266.
Nogueira, W., Büchner, A., Lenarz, T., & Edler, B. (2005). A psychoacoustic "NofM"-type speech coding strategy for cochlear implants. EURASIP Journal on Advances in Signal Processing, 2005, 1–16.
World Health Organization. (2021). WHO: 1 in 4 people projected to have hearing problems by 2050. Retrieved from https://www.who.int/news/item/02-03-2021-who-1-in-4-people-projected-to-have-hearing-problems-by-2050
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III (pp. 234–241).
Sun, L., Du, J., Dai, L.-R., & Lee, C.-H. (2017). Multiple-target deep learning for LSTM-RNN based speech enhancement. In 2017 Hands-Free Speech Communications and Microphone Arrays (HSCMA) (pp. 136–140).
Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2010). A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 4214–4217).
Wilson, B. S., Finley, C. C., Lawson, D. T., Wolford, R. D., Eddington, D. K., & Rabinowitz, W. M. (1991). Better speech recognition with cochlear implants. Nature, 352(6332), 236–238.
Wouters, J., McDermott, H. J., & Francart, T. (2015). Sound coding in cochlear implants: From electric pulses to hearing. IEEE Signal Processing Magazine, 32(2), 67–80.
Xu, Y., Du, J., Dai, L.-R., & Lee, C.-H. (2014). A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1), 7–19.
Zhao, L., Zhu, W., Li, S., Luo, H., Zhang, X.-L., & Rahardja, S. (2024). Multi-resolution convolutional residual neural networks for monaural speech dereverberation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
吳俊易. (2023). Neural networks applied to cochlear implant speech dereverberation (Unpublished master's thesis). Institute of Electrical Engineering, National Central University.
林金賢. (2021). A study of deep learning for speech reverberation suppression (Unpublished master's thesis). Institute of Electrical Engineering, National Central University.
黃國原. (2009). Effects of simulated cochlear implant channel number, stimulation rate, and binaural listening on Mandarin speech recognition in noise (Unpublished master's thesis). Institute of Electrical Engineering, National Central University.
黃銘緯. (2005). Mandarin speech perception test in noise in Taiwan (Unpublished master's thesis). Graduate Institute of Speech and Hearing Disorders Sciences, National Taipei College of Nursing.