References
[1] S. Boll, “Suppression of Acoustic Noise in Speech Using Spectral Subtraction,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113–120, Apr. 1979.
[2] A. Yelwande, S. Kansal and A. Dixit, “Adaptive Wiener filter for speech enhancement,” 2017 International Conference on Information, Communication, Instrumentation and Control (ICICIC), 2017, pp. 1–4.
[3] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean square error short-time spectral amplitude estimator,” in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, Dec. 1984.
[4] I. Cohen and B. Berdugo, “Noise estimation by minima controlled recursive averaging for robust speech enhancement,” in IEEE Signal Processing Letters, vol. 9, no. 1, pp. 12–15, Jan. 2002.
[5] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[6] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 448–456.
[7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[8] D. Stoller, S. Ewert and S. Dixon, “Wave-U-Net: A multi-scale neural network for end-to-end audio source separation,” arXiv preprint arXiv:1806.03185, 2018.
[9] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[10] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey and B. Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in Latent Variable Analysis and Signal Separation (Lecture Notes in Computer Science), Springer, 2015, pp. 91–99.
[11] D. Wang, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines. Springer, 2005, pp. 181–197.
[12] A. Narayanan and D. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 7092–7096.
[13] H. Erdogan, J. R. Hershey, S. Watanabe and J. Le Roux, “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 708–712.
[14] D. S. Williamson, Y. Wang and D. Wang, “Complex ratio masking for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
[15] K. Paliwal, K. Wójcicki and B. Shannon, “The importance of phase in speech enhancement,” Speech Communication, vol. 53, no. 4, pp. 465–494, 2011.
[16] C. K. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun et al., “The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results,” arXiv preprint arXiv:2005.13981, 2020.
[17] Y. Xia, S. Braun, C. K. A. Reddy, H. Dubey, R. Cutler and I. Tashev, “Weighted Speech Distortion Losses for Neural-Network-Based Real-Time Speech Enhancement,” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 871–875.
[18] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang and L. Xie, “DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement,” arXiv preprint arXiv:2008.00264, 2020.
[19] X. Hao, X. Su, R. Horaud and X. Li, “FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement,” ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6633–6637.
[20] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018.
[21] S. Braun and I. Tashev, “Data augmentation and loss normalization for deep noise suppression,” arXiv preprint arXiv:2008.06412, 2020.
[22] C. H. Taal, R. C. Hendriks, R. Heusdens and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4214–4217.
[23] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 5206–5210.
[24] J. F. Gemmeke et al., “Audio Set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, Mar. 2017, pp. 776–780.
[25] J. Thiemann, N. Ito and E. Vincent, “The Diverse Environments Multi-channel Acoustic Noise Database (DEMAND): A database of multichannel environmental noise recordings,” presented at ICA 2013, Montreal, Canada, 2013, paper 035081.
[26] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5220–5224.
[27] G. Pirker, M. Wohlmayr, S. Petrik and F. Pernkopf, “A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario,” in Proc. Interspeech, 2011.