Master's/Doctoral Thesis 104552021: Detailed Record




Author: Hsiang-Hao Chu (朱祥豪)    Department: Department of Computer Science and Information Engineering, in-service master's program
Thesis title: The Design and Application of High Quality Spoken System (高品質口述系統之設計與應用)
Related theses:
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Pre-processing
★ Design and Applications of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Accelerated Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Trend Prediction for Financial Products
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ A Study on End-to-End Mandarin Speech Synthesis
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
★ Methods and a System for Automatic Recipe Generation Using LLMs
Files: electronic full text viewable in the system only (permanently restricted, never open to the public)
Abstract (Chinese) This thesis studies the technology of a high-quality neural-network-based spoken (text-to-speech) system and extends it to related designs and applications. The biggest difference from earlier work is that we now have more training data, faster hardware, and a wider variety of enhancement techniques that can be combined with speech synthesis, so that the quality of synthesized speech is much closer to a real human voice. To apply this technology in daily life, a flexible tool that supports multiple technologies is needed for implementation. The system is developed mainly in Python and installed on the Linux operating system. It requires a tool that supports an external front end, and the front-end output must be HTS labels with state-level alignment. Two vocoders are currently supported: STRAIGHT and WORLD. Before training the neural network, min-max normalization is applied to the linguistic features, while mean-variance normalization is applied to the output acoustic features. For acoustic modelling, a feedforward neural network and a long short-term memory recurrent neural network (LSTM-RNN) are implemented in the system. In addition, three related applications are introduced based on the characteristics and strengths of the system. Finally, we hope that, beyond continuous improvement in quality and performance, the system can be extended to wherever it is needed in Taiwan.
Abstract (English) This thesis focuses on the technology of a high-quality neural-network-based spoken (text-to-speech) system and extends it to related designs and applications. The biggest difference from earlier work is that we now have more training data, faster hardware, and a greater variety of enhancement techniques that can be combined with speech synthesis, making the quality of synthetic speech much closer to a real human voice. To bring this technology into everyday use, a flexible tool that supports multiple technologies is needed for implementation. The system is developed mainly in Python and installed on the Linux operating system. It requires a tool that supports an external front end, and the front-end output must be HTS labels with state-level alignment. Two vocoders are currently supported: STRAIGHT and WORLD. Before training the neural network, min-max normalization is applied to the linguistic features, while the output acoustic features are normalized with mean-variance normalization. For acoustic modelling, a feedforward neural network and a long short-term memory recurrent neural network (LSTM-RNN) are implemented in the system. In addition, three related applications are introduced to illustrate the characteristics and strengths of the system. Finally, we look forward to continually improving the system's quality and performance, and to extending it to wherever it is needed in Taiwan.
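The abstracts describe a data preparation step in which the linguistic input features are scaled with min-max normalization and the output acoustic features are standardized with mean-variance normalization before training. Below is a minimal NumPy sketch of those two operations, assuming the features are already stacked frame by frame into 2-D arrays; the function names and the [0.01, 0.99] target range are illustrative assumptions, not details taken from the thesis.

import numpy as np

def minmax_normalize(linguistic, lo=0.01, hi=0.99):
    # Scale each linguistic feature dimension into [lo, hi].
    f_min = linguistic.min(axis=0)
    f_max = linguistic.max(axis=0)
    span = np.where(f_max - f_min == 0.0, 1.0, f_max - f_min)
    scaled = lo + (linguistic - f_min) / span * (hi - lo)
    return scaled, f_min, f_max  # keep the statistics for synthesis time

def meanvar_normalize(acoustic):
    # Standardize each acoustic feature dimension to zero mean, unit variance.
    mean = acoustic.mean(axis=0)
    std = acoustic.std(axis=0)
    std = np.where(std == 0.0, 1.0, std)
    return (acoustic - mean) / std, mean, std  # statistics are used to de-normalize predictions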
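For the acoustic model itself, the abstracts state that a feedforward neural network and an LSTM-based recurrent network map the normalized linguistic features to acoustic features. The following PyTorch sketch outlines those two model shapes only; the thesis does not specify this toolkit, and the class names and layer sizes here are illustrative assumptions.

import torch.nn as nn

class FeedforwardAcousticModel(nn.Module):
    # A stack of fully connected tanh layers mapping frame-level
    # linguistic features to acoustic features, frame by frame.
    def __init__(self, in_dim, out_dim, hidden=1024, layers=4):
        super().__init__()
        blocks, prev = [], in_dim
        for _ in range(layers):
            blocks += [nn.Linear(prev, hidden), nn.Tanh()]
            prev = hidden
        blocks.append(nn.Linear(prev, out_dim))
        self.net = nn.Sequential(*blocks)

    def forward(self, x):  # x: (frames, in_dim)
        return self.net(x)

class LSTMAcousticModel(nn.Module):
    # Feedforward layers followed by an LSTM layer, so each output
    # frame also depends on the preceding frames of the utterance.
    def __init__(self, in_dim, out_dim, hidden=512):
        super().__init__()
        self.pre = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, x):  # x: (batch, frames, in_dim)
        h = self.pre(x)
        h, _ = self.lstm(h)
        return self.out(h)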
Keywords (Chinese) ★ 口述 (spoken/dictation)    Keywords (English) (none)
Table of Contents    Abstract (Chinese) I
Abstract (English) II
Acknowledgements III
Table of Contents IV
List of Figures VI
List of Tables VII
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Objectives 1
1.3 Background 2
Chapter 2 System Architecture 4
2.1 Speech Synthesis Tools 4
2.2 Acoustic Modelling 5
2.3 System Realization 8
Chapter 3 System Implementation 10
3.1 Environment Deployment 10
3.2 Software Installation 11
3.3 Training Data 13
3.4 Implementation Results 16
Chapter 4 Applications of the System 19
4.1 Application to News Listening 19
4.2 Application to Intelligent Game Voice-over 19
4.3 Application to Smart Fairy Tale Books 21
Chapter 5 Conclusion and Future Work 23
References 24
References    [1] R. A. J. Clark, K. Richmond, and S. King, “Multisyn: Open-domain unit selection for the Festival speech synthesis system,” Speech Communication, vol. 49, no. 4, pp. 317–330, 2007.

[2] H. Zen, K. Tokuda, and A. W. Black, “Statistical parametric speech synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.

[3] T. Merritt, J. Latorre, and S. King, “Attributing modeling errors in HMM synthesis by stepping gradually from natural to modelled speech,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4220–4224.

[4] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, “Speech synthesis based on hidden Markov models,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.

[5] Z.-H. Ling, S.-Y. Kang, H. Zen, A. Senior, M. Schuster, X.-J. Qian, H. M. Meng, and L. Deng, “Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends,” IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 35–52, 2015.

[6] H. Zen, “Acoustic modeling in statistical parametric speech synthesis - from HMM to LSTM-RNN,” in Proc. MLSLP, 2015, invited paper.

[7] T. Weijters and J. Thole, “Speech synthesis with artificial neural networks,” in Proc. Int. Conf. on Neural Networks, 1993, pp. 1764–1769.

[8] G. Cawley and P. Noakes, “LSP speech synthesis using backpropagation networks,” in Proc. Third Int. Conf. on Artificial Neural Networks, 1993, pp. 291–294.

[9] C. Tuerk and T. Robinson, “Speech synthesis using artificial neural networks trained on cepstral coefficients.” in Proc. European Conference on Speech Communication and Technology (Eurospeech), 1993, pp. 4–7.

[10] M. Riedi, “A neural-network-based model of segmental duration for speech synthesis,” in Proc. European Conference on Speech Communication and Technology (Eurospeech), 1995, pp. 599–602.

[11] O. Karaali, G. Corrigan, N. Massey, C. Miller, O. Schnurr, and A. Mackie, “A high quality text-to-speech system composed of multiple neural networks,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, 1998, pp. 1237–1240.

[12] Z.-H. Ling, L. Deng, and D. Yu, “Modeling spectral envelopes using Restricted Boltzmann Machines and Deep Belief Networks for statistical parametric speech synthesis,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2129–2139, 2013.

[13] S. Kang, X. Qian, and H. Meng, “Multi-distribution deep belief network for speech synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 8012–8016.

[14] S. Kang and H. Meng, “Statistical parametric speech synthesis using weighted multi-distribution deep belief network,” in Proc. Interspeech, 2014, pp. 1959–1963.

[15] H. Zen and A. Senior, “Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 3844–3848.

[16] B. Uria, I. Murray, S. Renals, and C. Valentini, “Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4465–4469.

[17] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 7962–7966.

[18] H. Lu, S. King, and O. Watts, “Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis,” in Proc. 8th ISCA Speech Synthesis Workshop (SSW), 2013, pp. 281–285.

[19] Y. Qian, Y. Fan, W. Hu, and F. K. Soong, “On the training aspects of deep neural network (DNN) for parametric TTS synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 3829–3833.

[20] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King, “Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4460–4464.

[21] K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “The effect of neural networks in statistical parametric speech synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4455–4459.

[22] O. Watts, G. E. Henter, T. Merritt, Z. Wu, and S. King, “From HMMs to DNNs: where do the improvements come from?” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2016.

[23] C. Valentini-Botinhao, Z. Wu, and S. King, “Towards minimum perceptual error training for DNN-based speech synthesis,” in Proc. Interspeech, 2015, pp. 869–873.

[24] Z. Wu and S. King, “Minimum trajectory error training for deep neural networks, combined with stacked bottleneck features,” in Proc. Interspeech, 2015, pp. 309–313.

[25] Y. Fan, Y. Qian, F. K. Soong, and L. He, “Sequence generation error (SGE) minimization based deep neural networks training for text-to-speech synthesis,” in Proc. Interspeech, 2015, pp. 864–868.

[26] Z. Wu and S. King, “Improving trajectory modelling for DNN-based speech synthesis by using stacked bottleneck features and minimum generation error training,” IEEE Transactions on Audio, Speech, and Language Processing, 2016.

[27] Y. Fan, Y. Qian, F. Xie, and F. K. Soong, “TTS synthesis with bidirectional LSTM based recurrent neural networks,” in Proc. Interspeech, 2014, pp. 1964–1968.

[28] H. Zen and H. Sak, “Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4470–4474.

[29] Z. Wu, O. Watts, and S. King, “Merlin: An Open Source Neural Network Speech Synthesis System,” in Proc. 9th ISCA Speech Synthesis Workshop (SSW9), Sunnyvale, CA, USA, September 2016.

[30] Z. Wu and S. King, “Investigating gated recurrent neural networks for speech synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2016.

[31] SPTK official website, http://sp-tk.sourceforge.net/

[32] T. Merritt, R. A. Clark, Z. Wu, J. Yamagishi, and S. King, “Deep neural network-guided unit selection synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2016.

[33] Q. Hu, Z. Wu, K. Richmond, J. Yamagishi, Y. Stylianou, and R. Maia, “Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning,” in Proc. Interspeech, 2015, pp. 854–858.

[34] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, 2016.

[35] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.

[36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[37] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.

[38] Festival official download page, http://festvox.org/packed/festival/2.4/

[39] Training data download link provided by Merlin, http://104.131.174.95/slt_arctic_full_data.zip

[40] Onmyoji (陰陽師) official website, https://www.onmyojigame.com/#2

[41] Merlin discussion thread, https://github.com/CSTR-Edinburgh/merlin/issues/18

[42] Commercially available audio storybook, http://shopping.windmill.com.tw/product.php?product_num=10155936
Advisor: Jia-Ching Wang (王家慶)    Date of approval: 2017-7-25
