Thesis 108522071: Detailed Record




Author: Kao, Kang-Jie (高康捷)    Department: Computer Science and Information Engineering
Thesis Title: 基於知識蒸餾之單通道語音增強
(Single-Channel Speech Enhancement Based on Knowledge Distillation)
Related theses:
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Preprocessing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Applications of a High-Quality Dictation System
★ Deep Learning and Accelerated Robust Features for Recognition and Detection of Calcaneal Fractures in CT Images
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Trend Prediction for Financial Products
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning Methods to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
Files: The full text is not available through the system (permanently restricted).
Abstract (Chinese): In recent years, deep neural networks have developed rapidly in the field of speech enhancement. Large architectures with greater depth and more layers achieve better noise reduction, but practical applications such as real-time communication and real-time speech recognition mostly run on mobile devices, smart appliances, and similar equipment whose limited computing power leaves insufficient resources for heavy computation. To overcome this problem, recent research has therefore moved toward low-latency, lightweight models that obtain equal or better results with fewer parameters.
This thesis builds on the Dual-Signal Transformation LSTM Network (DTLN) and proposes a knowledge-distillation training method. In this scheme, the teacher model is a trained DTLN whose layers are deeper and wider, while the student model keeps the original configuration. Because DTLN is formed by cascading two LSTM (Long Short-Term Memory) networks, the teacher distills the two parts of the student separately. Experimental results show that this approach achieves a better distillation effect, making the student a network with a comparable parameter count and better noise reduction.
Abstract (English): In recent years, deep neural networks have developed rapidly in the field of speech enhancement. Large-scale neural network architectures with many deep layers achieve better noise reduction. In practice, however, applications such as instant messaging and real-time speech recognition must run on mobile devices, smart home appliances, and similar equipment with limited computing performance and insufficient resources for heavy computation. To overcome this problem, recent research tends toward low-latency, lightweight models that obtain equal or better results with fewer parameters.
Based on the Dual-Signal Transformation LSTM Network (DTLN), this thesis proposes a knowledge-distillation training method. The teacher model is a trained DTLN with deeper and wider layers, and the student model keeps the original configuration. Since DTLN is formed by cascading two LSTM (Long Short-Term Memory) networks, the teacher model distills the two parts of the student model separately. Experimental results show that this method achieves a better distillation effect, making the student model a network with an equivalent number of parameters and better noise reduction.
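As a concrete illustration of the two-part distillation the abstracts describe, here is a minimal training-loss sketch in PyTorch. It assumes hypothetical student and teacher callables that each return the intermediate signal after DTLN's first separation core together with the final time-domain output; this interface, the MSE matching terms, and the weights alpha and beta are illustrative assumptions, not the published DTLN API or the thesis's exact loss.

import torch
import torch.nn.functional as F

def snr_loss(estimate, target, eps=1e-8):
    # Negative SNR in dB between the enhanced signal and clean speech;
    # minimizing this maximizes the signal-to-noise ratio.
    noise = target - estimate
    snr = 10 * torch.log10(
        (target.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps)
    )
    return -snr.mean()

def distillation_loss(student, teacher, noisy, clean, alpha=0.5, beta=0.5):
    # The teacher is frozen; only the student receives gradients.
    with torch.no_grad():
        t_stage1, t_out = teacher(noisy)
    s_stage1, s_out = student(noisy)

    hard = snr_loss(s_out, clean)           # supervision from clean speech
    soft1 = F.mse_loss(s_stage1, t_stage1)  # distill the first LSTM core
    soft2 = F.mse_loss(s_out, t_out)        # distill the second LSTM core
    return hard + alpha * soft1 + beta * soft2

Distilling the two cascaded cores separately, rather than matching only the final output, is the key point of the abstract: each stage of the student gets a direct target from the corresponding stage of the deeper, wider teacher.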
Keywords:
★ single-channel speech enhancement
★ knowledge distillation
★ deep neural network
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1-1 Research Background and Objectives
1-2 Research Methods and Chapter Overview
Chapter 2: Related Work
2-1 Background on Speech Enhancement
2-2 Dual-Signal Transformation LSTM Network
2-3 Background of Knowledge Distillation
2-3-1 Why Knowledge Distillation Works
2-3-2 The Knowledge Distillation Method (Softmax) (standard definition sketched after this outline)
2-3-3 Loss Function
2-3-4 Adjusting the Temperature
2-3-5 Matching Logits
Chapter 3: Proposed Architecture
3-1 Speech Datasets and Preprocessing
3-2 Architecture of the Proposed Model
3-2-1 Student Model
3-2-2 Teacher Model
3-2-3 Overall Model Architecture
3-3 Loss Functions
3-3-1 Signal-to-Noise Ratio (SNR)
3-3-2 Scale-Invariant Signal-to-Noise Ratio (SI-SNR) (standard definition sketched after this outline)
3-3-3 Loss Function of the Proposed Model
Chapter 4: Experimental Results and Discussion
4-1 Experimental Environment
4-1-1 Test Sets
4-1-2 Evaluation Metrics
4-2 Comparison and Discussion of Results
4-2-1 Teacher Model
4-2-2 Knowledge Distillation Architecture Experiments
4-2-3 Loss Functions
Chapter 5: Conclusions and Future Work
Chapter 6: References
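The outline above points at two formula-level ingredients of the method: the temperature-scaled softmax from Hinton et al.'s knowledge-distillation formulation (section 2-3-2) and the SI-SNR objective (section 3-3-2). Since the full text is not openly available, the following are the standard published definitions rather than the thesis's own notation.

q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}

s_{\text{target}} = \frac{\langle \hat{s}, s \rangle \, s}{\lVert s \rVert^{2}}, \qquad
e_{\text{noise}} = \hat{s} - s_{\text{target}}, \qquad
\text{SI-SNR} = 10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^{2}}{\lVert e_{\text{noise}} \rVert^{2}}

At T = 1 the first expression is the ordinary softmax; raising T softens the teacher's output distribution so its relative probabilities over wrong classes carry more signal to the student. SI-SNR differs from plain SNR only in first projecting the estimate onto the clean reference, which makes the measure invariant to rescaling of the estimate.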
Advisor: 王家慶 (Jia-Ching Wang)    Date of Approval: 2021-10-27