Master's/Doctoral Thesis 108522059: Detailed Record




Author: Po-Hsun Chen (陳柏勳)    Department: Computer Science and Information Engineering
Thesis Title: EPG2S: Speech Synthesis Technology Based on Electropalatography Signal
(Chinese title: EPG2S:基於電子硬顎圖訊號的語音生成技術)
  1. The author has agreed to make this electronic thesis openly available immediately.
  2. The released electronic full text is licensed for academic research only; users may search, read, and print it for personal, non-commercial purposes.
  3. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.

Abstract (Chinese): Synthesizing speech from articulatory movement information can benefit real-world applications, such as patients with damaged vocal cords, scenarios requiring silent communication, and high-noise environments. In this study, we explore an alternative data source, electropalatography (EPG), and propose a novel multimodal EPG-to-speech (EPG2S) synthesis system. Our model has two goals: (1) synthesize speech using only EPG signals; (2) if the speaker's speech signal can also be captured in a noisy environment, leverage the EPG signal for speech enhancement (SE). We investigate two fusion strategies for the EPG2S system, namely late fusion (LF) and early fusion (EF). Experimental results on a Mandarin corpus show that, for the first goal, the speech synthesized by the proposed multimodal EPG2S systems is on average preferred over speech corrupted by real-world background noise at SNR levels of -5 dB or lower. For the second goal, these systems outperform the audio-only SE system on the PESQ, STOI, and ESTOI speech evaluation metrics. These results verify the feasibility of synthesizing speech from EPG signals and the effectiveness of incorporating them into SE systems.
Abstract (English): Synthesizing speech from articulatory movement can benefit patients with vocal cord disorders, situations requiring silence, and high-noise environments. In this study, we explore alternative data, namely electropalatography (EPG), and propose a novel multimodal EPG-to-speech (EPG2S) synthesis system. Our model has two goals: (1) synthesize speech using only the EPG signal; (2) when the speaker's audio signal can also be captured in a noisy environment, perform speech enhancement (SE) by leveraging the EPG signal. Two fusion strategies are investigated for the EPG2S system, namely late fusion (LF) and early fusion (EF). Experimental results on a Mandarin corpus show that, for the first goal, the speech synthesized by the proposed multimodal EPG2S systems is on average preferred over speech corrupted by real-world noise at SNR levels of -5 dB or lower. For the second goal, these systems outperform their audio-only SE counterparts on the PESQ, STOI, and ESTOI speech evaluation metrics. These results verify the feasibility of using EPG signals to synthesize speech and the effectiveness of incorporating them into the SE system.
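The abstract contrasts early fusion (EF) and late fusion (LF) of the EPG signal with the speech signal. As a rough illustration only, the PyTorch sketch below shows one common way the two strategies are realized: EF concatenates the two feature streams at the input, while LF encodes each modality separately and fuses the resulting embeddings before decoding. The feature dimensions (62 EPG contact points, 257-bin spectra), layer sizes, and class names are assumptions made for this sketch, not the architecture actually used in the thesis.

# Illustrative early- vs. late-fusion models (assumed dimensions and layers,
# not the thesis's actual EPG2S networks).
import torch
import torch.nn as nn

class EarlyFusionEPG2S(nn.Module):
    """EF: concatenate EPG and noisy-speech features at the input."""
    def __init__(self, epg_dim=62, spec_dim=257, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(epg_dim + spec_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, spec_dim),  # predict a clean/enhanced spectral frame
        )

    def forward(self, epg, noisy_spec):
        # epg: (batch, time, epg_dim); noisy_spec: (batch, time, spec_dim)
        return self.net(torch.cat([epg, noisy_spec], dim=-1))

class LateFusionEPG2S(nn.Module):
    """LF: encode each modality separately, then fuse the embeddings."""
    def __init__(self, epg_dim=62, spec_dim=257, hidden=256):
        super().__init__()
        self.epg_enc = nn.Sequential(nn.Linear(epg_dim, hidden), nn.ReLU())
        self.spec_enc = nn.Sequential(nn.Linear(spec_dim, hidden), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, spec_dim),
        )

    def forward(self, epg, noisy_spec):
        fused = torch.cat([self.epg_enc(epg), self.spec_enc(noisy_spec)], dim=-1)
        return self.decoder(fused)

if __name__ == "__main__":
    epg = torch.randn(2, 100, 62)      # dummy EPG contact patterns over time
    noisy = torch.randn(2, 100, 257)   # dummy noisy log-magnitude spectra
    print(EarlyFusionEPG2S()(epg, noisy).shape)  # torch.Size([2, 100, 257])
    print(LateFusionEPG2S()(epg, noisy).shape)   # torch.Size([2, 100, 257])

For the EPG-only synthesis goal, the noisy-speech branch would simply be dropped and the decoder driven by the EPG features alone.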
Keywords (Chinese): ★ multimodal
★ electropalatography
★ speech synthesis
★ speech enhancement
Keywords (English): ★ multimodal
★ electropalatography
★ speech synthesis
★ speech enhancement
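The abstract above compares the EPG2S outputs against speech corrupted by real-world noise at SNRs of -5 dB and below, and scores them with PESQ, STOI, and ESTOI. The snippet below is a minimal sketch of how such a test condition and those three scores could be produced with the third-party Python packages pesq and pystoi; the thesis's actual evaluation pipeline is not shown in this record, and the signals here are random placeholders (in practice, real clean and noisy utterances would be loaded, since PESQ expects speech-like input).

# Sketch: mix noise at a target SNR and compute PESQ/STOI/ESTOI.
# Assumes `pip install pesq pystoi`; placeholder signals stand in for real speech.
import numpy as np
from pesq import pesq
from pystoi import stoi

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise reaches the requested SNR in dB."""
    noise = np.resize(noise, clean.shape)           # repeat/truncate to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

if __name__ == "__main__":
    fs = 16000                                      # assumed sampling rate
    clean = np.random.randn(fs * 3)                 # placeholder for a clean utterance
    noise = np.random.randn(fs * 3)                 # placeholder for a recorded noise clip
    noisy = mix_at_snr(clean, noise, snr_db=-5)     # the -5 dB condition from the abstract

    print("PESQ :", pesq(fs, clean, noisy, "wb"))           # wide-band PESQ
    print("STOI :", stoi(clean, noisy, fs, extended=False))
    print("ESTOI:", stoi(clean, noisy, fs, extended=True))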
Table of Contents: Abstract (in Chinese) i
Abstract ii
Table of Contents iii
List of Figures iv
List of Tables v

1 Introduction 1
1.1 Silent Speech and Speech Enhancement . . . . . . . 1
1.2 Research Motivation and Purpose . . . . . . . . . 2
1.3 Paper Architecture . . . . . . . . . . . . . . . . 3

2 Related Work 4
2.1 Deep-learning-based SE . . . . . . . . . . . . . . 4
2.2 Multimodal SE . . . . . . . . . . . . . . . . . . 5
2.3 Deep Feature Loss . . . . . . . . . . . . . . . . 6

3 Proposed Method 7
3.1 The Overall EPG2S Structure . . . . . . . . . 7
3.1.1 The LF Strategy (EPG2S!) . . . . . . . . . . 10
3.1.2 The EF Strategy (EPG2S) . . . . . . . . . . . 11
3.2 Training Stages and Loss Function . . . . . . . . 13

4 Experiment 14
4.1 Data Analysis . . . . . . . . . . . . . . . . . . 14
4.2 Experimental Setup . . . . . . . . . . . . . . . 15

5 Evaluation results and discussions 17
5.1 The Performance of EPG-to-Speech . . . . . . . . . 17
5.2 The Performance of the EPG2S with Audio Input . . 19
5.3 Analyze the Performance of LF and EF Strategies . 20
5.4 Analyze Low-Resource EPG Signal . . . . . . . . . 21

6 Conclusion 23

Reference 24
Appendix A 31
Appendix B 32
Advisors: Richard Tzong-Han Tsai (蔡宗翰), Yu Tsao (曹昱)    Date of Approval: 2021-09-27