References
Allen, J. B., & Berkley, D. A. (1979). Image method for efficiently simulating small-room acoustics. Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943-950.
Bees, D., Blostein, M., & Kabal, P. (1991). Reverberant speech enhancement using cepstral processing. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 977-980.
Benesty, J., Sondhi, M. M., & Huang, Y. (2007). Springer Handbook of Speech Processing, Ch. 4.6. Springer.
Delcroix, M., Yoshioka, T., Ogawa, A., & Kubo, Y. (2014). Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the REVERB challenge. In Proc. REVERB Challenge, pp. 1-8.
Delfarah, M., & Wang, D. L. (2017). Features for masking-based monaural speech separation in reverberant conditions. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 1085-1094.
Erdogan, H., Hershey, J. R., Watanabe, S., & Le Roux, J. (2015). Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 708-712.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., & Pallett, D. S. (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST Speech Disc 1-1.1. Tech. Rep., vol. 93.
Gillespie, B. W., Malvar, H. S., & Florencio, D. A. (2001). Speech dereverberation via maximum-kurtosis subband adaptive filtering. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3701-3704.
Habets, E. A. (2010). Room impulse response generator. Technische Universiteit Eindhoven.
Han, K., Wang, Y., & Wang, D. (2014). Learning spectral mapping for speech dereverberation. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4661-4665.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778.
Hussain, T., Siniscalchi, S. M., Lee, C.-C., Wang, S.-S., Tsao, Y., & Liao, W.-H. (2017). Experimental study on extreme learning machine applications for speech enhancement. IEEE Access, vol. 5, pp. 25542-25554.
Jin, Z., & Wang, D. L. (2009). Supervised learning approach to monaural segregation of reverberant speech. IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 625-638.
Lee, W. J., Wang, S. S., Chen, F., Lu, X., Chien, S. Y., & Tsao, Y. (2018). Speech dereverberation based on integrated deep and ensemble learning algorithm. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5454-5458.
Li, J., Akagi, M., & Suzuki, Y. (2006). Noise reduction based on microphone array and post-filtering for robust hands-free speech recognition in adverse environments. Ph.D. dissertation, School of Information Science, Japan Advanced Institute of Science and Technology, Japan.
Loizou, P. C. (2007). Speech Enhancement: Theory and Practice. CRC Press.
Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2014). Ensemble modeling of denoising autoencoder for speech spectrum restoration. In Proc. INTERSPEECH.
Ma, J., Hu, Y., & Loizou, P. C. (2009). Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions. Journal of the Acoustical Society of America, vol. 125, no. 5, pp. 3387-3405.
Miyoshi, M., & Kaneda, Y. (1988). Inverse filtering of room acoustics. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 2, pp. 145-152.
Mohammadiha, N., & Doclo, S. (2016). Speech dereverberation using nonnegative convolutive transfer function and spectro-temporal modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 2, pp. 276-289.
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proc. International Conference on Machine Learning, pp. 807-814.
Neely, S. T., & Allen, J. B. (1979). Invertibility of a room impulse response. Journal of the Acoustical Society of America, vol. 66, pp. 165-169.
Nisa, H. K. (2021). Speech dereverberation based on HELM framework for cochlear implant coding strategy. Master's thesis, Institute of Electrical Engineering, National Central University.
Radlovic, B. D., Williamson, R. C., & Kennedy, R. A. (2000). Equalization in an acoustic reverberant environment: robustness results. IEEE Transactions on Speech and Audio Processing, vol. 8, no. 3, pp. 311-319.
Rix, A. W., Beerends, J. G., Hollier, M. P., & Hekstra, A. P. (2001). Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 749-752.
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. CoRR, vol. abs/1505.00387.
Boll, S. F. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113-120.
Taal, C. H., Hendriks, R. C., Heusdens, R., & Jensen, J. (2011). An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136.
Virtanen, T., Gemmeke, J., & Raj, B. (2013). Active-set Newton algorithm for overcomplete non-negative representations of audio. IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 11, pp. 2277-2289.
Wang, D., & Lim, J. (1982). The unimportance of phase in speech enhancement. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 30, no. 4, pp. 679-681.
Wang, Y., Narayanan, A., & Wang, D. (2014). On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849-1858.
Williamson, D. S., Wang, Y., & Wang, D. L. (2016). Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483-492.
Williamson, D. S., & Wang, D. L. (2017). Time-Frequency Masking in the Complex Domain for Speech Dereverberation and Denoising. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1492-1501.
Wu, M., & Wang, D. L. (2006). A two-stage algorithm for one-microphone reverberant speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 774-784.
Xiao, X., et al. (2016). Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation. EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, pp. 1-18.
Yoshioka, T., & Nakatani, T. (2012). Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening. IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 10, pp. 2707-2720.
Zhang, X. L., & Wang, D. L. (2016). A deep ensemble learning method for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 967-977.
Department of Statistics, Ministry of Health and Welfare, Taiwan (R.O.C.). (2020). Retrieved from https://dep.mohw.gov.tw/DOS/cp-2976-13827-113.html
高士喆. (2014). Speech enhancement using a Bayesian estimator of perceptually motivated spectral amplitude. Master's thesis, Institute of Electrical Engineering, National Taipei University of Technology.
陳星瑋. (2019). Multi-channel sound source direction estimation and speech enhancement based on deep neural networks. Master's thesis, Institute of Communications Engineering, National Chiao Tung University.
黃國原. (2009). Effects of simulated cochlear implant channel number, stimulation rate, and binaural hearing on Mandarin speech recognition in noise. Master's thesis, Institute of Electrical Engineering, National Central University.
黃銘緯. (2005). Mandarin speech perception in noise test in Taiwan. Master's thesis, Institute of Speech and Hearing Science, National Taipei College of Nursing.
楊宗翰. (2012). A dereverberation method using adaptive beamforming and gain-attenuation post-filtering. Master's thesis, Institute of Electrical and Control Engineering, National Chiao Tung University.