Name: Chen-Yu Cheng (鄭晨佑)
Department: Department of Electrical Engineering (電機工程學系)
Thesis title: Application of deep learning to multi-scenario speech synthesis systems (深度學習應用於多場景語音合成系統)
Full text: viewable in the system (available after 2027-09-01)
Abstract (Chinese):
In the current wave of technological progress, deep learning has become the core force driving the development of artificial intelligence. Speech synthesis systems, as an important bridge for human-computer interaction, play a key role in improving machines' comprehension and natural interactivity. This study examines the application of deep learning to speech synthesis systems, with detailed analysis and experiments on two scenarios of high application potential: nursing and warehousing.
We selected and implemented the Mini Multi-Band Inverse Short-Time Fourier Transform Variational Inference with adversarial learning for end-to-end Text-To-Speech (Mini-MB-iSTFT-VITS) model, which reduces the number of hidden channels and layers in the Multi-Band Inverse Short-Time Fourier Transform Variational Inference with adversarial learning for end-to-end Text-To-Speech (MB-iSTFT-VITS) model, significantly lowering the parameter count and speeding up inference. Experimental results show that the model trained on our self-built Waki corpus performs well on the subjective Mean Opinion Score (MOS) as well as the objective Deep Noise Suppression Mean Opinion Score (DNSMOS) and UTokyo-SaruLab Mean Opinion Score (UTMOS), indicating high naturalness and quality of the synthesized speech. In the subjective Comparative Mean Opinion Score (CMOS) evaluation, the Mini-MB-iSTFT-VITS model scores slightly below real speech but close to the MB-iSTFT-VITS model, showing little difference in naturalness between the two. In the inference-speed tests, the Mini-MB-iSTFT-VITS model shows a clear advantage regardless of the corpus used: its Real-Time Factor (RTF) results show faster inference than the MB-iSTFT-VITS model, making it better suited to real-time applications.
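As a rough illustration of the reduction described above, the following Python sketch (not the thesis code; the channel and layer counts are assumed for illustration only) counts the parameters of two stacks of 1-D convolutions to show how shrinking the hidden-channel width and layer count cuts model size:

```python
import torch.nn as nn

def conv_stack(hidden_channels: int, n_layers: int) -> nn.Module:
    """Toy stack of 1-D convolutions standing in for a hidden-layer block."""
    return nn.Sequential(*[
        nn.Conv1d(hidden_channels, hidden_channels, kernel_size=3, padding=1)
        for _ in range(n_layers)
    ])

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

# Hypothetical settings: a wider/deeper baseline vs. a reduced "mini" variant.
baseline = conv_stack(hidden_channels=192, n_layers=6)
mini = conv_stack(hidden_channels=96, n_layers=3)
print(f"baseline params: {count_params(baseline):,}")  # about 0.66M
print(f"mini params:     {count_params(mini):,}")      # about 0.08M
```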
In the nursing scenario, speech synthesis can markedly improve nurses' working efficiency, reduce operational errors, and ensure that patients receive timely and accurate care. In this thesis we developed a nursing system that lets nurses set spoken nursing reminders, so that they can devote more attention to the nursing task at hand. Because a large share of a nurse's time is actually spent documenting completed nursing actions, the system also writes the reminder time and the nursing action to be performed into the nursing record in our nursing database at the moment the reminder is set. This can substantially shorten nurses' working hours and further lighten their workload.
In the warehousing scenario, speech synthesis can effectively improve workflow efficiency, help employees complete daily operations more quickly, reduce errors caused by human factors, and raise overall operational efficiency. In this thesis we developed a warehousing system: after a QR code is scanned, the system takes an average of 1.59 seconds to generate three speech segments and 0.51 seconds to generate a single segment, showing that it can generate speech in real time to help pickers reduce errors caused by human factors. The system also sends information to the warehouse database when a picker starts and finishes picking, so that warehouse managers can improve overall operational efficiency.

Abstract (English):
In the current wave of technological advancements, deep learning technology has become a core force driving the development of artificial intelligence. Speech synthesis systems, as an important bridge for human-computer interaction, play a crucial role in enhancing machine understanding and natural interaction. This study explores the application of deep learning technology in speech synthesis systems, with a detailed analysis and experiments focusing on two scenarios with high application potential: nursing and warehousing.
We selected and implemented a Mini Multi-Band Inverse-Short-Time-Fourier-Transform Variational Inference with adversarial learning for end-to-end Text-To-Speech model (Mini-MB-iSTFT-VITS). By reducing the number of hidden channels and layers in the Multi-Band Inverse-Short-Time-Fourier-Transform Variational Inference with adversarial learning for end-to-end Text-To-Speech (MB-iSTFT-VITS) model, we significantly lowered the number of parameters and accelerated inference speed. Experimental results show that the model trained with the self-created Waki corpus performed excellently in Mean Opinion Score (MOS) for subjective evaluation, Deep Noise Suppression Mean Opinion Score (DNSMOS) for objective evaluation, and UTokyo-SaruLab Mean Opinion Score (UTMOS) for objective evaluation, demonstrating high naturalness and quality of the synthesized speech. In the subjective Comparative Mean Opinion Score (CMOS) evaluation, although the Mini-MB-iSTFT-VITS model slightly lagged behind real speech, its performance was close to the MB-iSTFT-VITS model, indicating minimal difference in speech naturalness between the two. Additionally, in the inference speed tests, the Mini-MB-iSTFT-VITS model showed significant advantages regardless of the corpus used. The Real-Time Factor (RTF) results indicated its inference speed surpassed that of the MB-iSTFT-VITS model, making it more suitable for real-time application scenarios.
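For reference, the Real-Time Factor quoted above is simply the wall-clock synthesis time divided by the duration of the generated audio. The short sketch below shows the computation; the `synthesize` callable is a hypothetical stand-in, not the thesis code:

```python
import time

def real_time_factor(synthesize, text: str) -> float:
    """RTF = wall-clock synthesis time / duration of the synthesized audio.

    `synthesize` is a hypothetical TTS callable returning (waveform, sample_rate),
    where `waveform` is a 1-D sequence of samples. RTF < 1 means the model
    generates speech faster than real time.
    """
    start = time.perf_counter()
    waveform, sample_rate = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / (len(waveform) / sample_rate)
```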
In the nursing scenario, the application of speech synthesis technology can significantly improve the efficiency of nursing staff, reduce operational errors, and ensure that patients receive timely and accurate care. In this paper, we developed a nursing system that helps nurses set up voice reminders, allowing them to focus more on current nursing tasks. Since much of the nurses' time is spent recording completed nursing actions, the system records the reminder times and the nursing actions to be performed in the nursing database. This can substantially reduce nurses' working hours and further lighten their workload.
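The table of contents indicates the scenario systems are built on Firebase; the sketch below only illustrates the flow described above (Cloud Firestore is assumed, and the collection and field names are hypothetical, not taken from the thesis), writing the reminder time and the nursing action into the nursing record at the moment the reminder is created:

```python
from datetime import datetime, timezone

import firebase_admin
from firebase_admin import credentials, firestore

# One-time setup; the service-account path is a placeholder.
firebase_admin.initialize_app(credentials.Certificate("serviceAccount.json"))
db = firestore.client()

def set_voice_reminder(nurse_id: str, patient_id: str, action: str, remind_at: datetime) -> None:
    """Schedule a spoken reminder and log it to the nursing record in one step."""
    record = {
        "nurse_id": nurse_id,
        "patient_id": patient_id,
        "action": action,                         # e.g. "change IV drip" (illustrative)
        "remind_at": remind_at,                   # when the voice reminder should fire
        "logged_at": datetime.now(timezone.utc),  # documentation happens automatically
    }
    # Writing the reminder doubles as the care-record entry, so the nurse
    # does not have to document the action again afterwards.
    db.collection("nursing_records").add(record)
```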
In the warehousing scenario, the application of speech synthesis technology can effectively improve workflow efficiency, help employees complete daily operations more quickly, reduce errors caused by human factors, and enhance overall operational efficiency. In this paper, we developed a warehousing system where the average time to generate three segments of speech after scanning a QR code is 1.59 seconds, and the average time to generate one segment of speech is 0.51 seconds. This indicates that the system can generate speech in real time to assist pickers in reducing errors caused by human factors. Additionally, the system sends information to the warehousing database when pickers start and finish picking, allowing warehouse managers to improve overall operational efficiency.
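To make the per-scan timing figures above concrete, here is a hedged sketch of how such generation times could be measured; the `synthesize` callable and the item texts are hypothetical, and this is not the thesis code:

```python
import time

def measure_scan_latency(synthesize, item_texts: list[str]) -> float:
    """Wall-clock time to generate all speech segments for one QR-code scan.

    `synthesize` is a hypothetical TTS callable taking one text and returning
    a waveform. The thesis reports roughly 1.59 s for three segments and
    roughly 0.51 s for a single segment on its hardware.
    """
    start = time.perf_counter()
    for text in item_texts:
        synthesize(text)
    return time.perf_counter() - start
```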
Keywords (Chinese): speech synthesis (語音合成)
Keywords (English): Speech synthesis
Table of Contents:
Abstract (Chinese) i
Abstract (English) iii
Table of Contents v
List of Figures viii
List of Tables xi
Chapter 1 Introduction 1
1.1 Research Motivation 1
1.2 Literature Review 2
1.2.1 Traditional Speech Synthesis Methods 2
1.2.2 Deep-Learning-Based Speech Synthesis Methods 3
1.3 Research Objectives 6
1.4 Thesis Organization 7
Chapter 2 Neural Network Fundamentals 9
2.1 Overview of Neural Networks 9
2.1.1 Recurrent Neural Networks (RNN) 9
2.1.2 Long Short-Term Memory (LSTM) 10
2.1.3 Loss Functions 12
2.2 The VITS Network 14
2.2.1 Kullback-Leibler (KL) Divergence: Introduction and Derivation 15
2.2.2 Derivation of Variational Inference 16
2.2.3 Derivation of the Variational Inference Formulation Used in VITS 19
2.2.4 Mel Spectrogram 21
2.2.5 Generative Adversarial Networks (GAN) 21
2.2.6 Monotonic Alignment Search (MAS) 23
2.2.7 Stochastic Duration Predictor 24
2.2.8 Loss Functions in VITS 26
2.3 The MB-iSTFT-VITS Network 27
2.3.1 Network Architecture 28
2.3.2 Multi-Band Generation 28
2.3.3 Applying iSTFTNet Techniques to MB-iSTFT-VITS 29
2.4 Conclusion 30
Chapter 3 Experimental Procedures and Methods 31
3.1 Introduction to the Corpora 31
3.1.1 LJ Speech 31
3.1.2 The Biaobei Corpus 32
3.1.3 Taiwanese-Accented Mandarin Corpus 33
3.2 Development Environment and Database Design for Each Scenario 37
3.2.1 Firebase 37
3.2.2 React 38
3.3 Warehousing System Functional Design 38
3.3.1 Scenario Architecture 38
3.3.2 System Interface 39
3.4 Nursing System Functional Design 42
3.4.1 Scenario Architecture 42
3.4.2 System Interface 43
3.5 Mini-MB-iSTFT-VITS Training Setup and Implementation 45
3.5.1 Hardware and Software Specifications Used for Training 45
3.5.2 Mini-MB-iSTFT-VITS Training Parameter Settings 47
3.6 Conclusion 48
Chapter 4 Results and Discussion 49
4.1 Experimental Equipment 49
4.2 Speech Naturalness Evaluation Methods 50
4.2.1 Mean Opinion Score (MOS) 51
4.2.2 Deep Noise Suppression Mean Opinion Score (DNSMOS) 51
4.2.3 UTokyo-SaruLab Mean Opinion Score (UTMOS) 52
4.2.4 Comparative Mean Opinion Score (CMOS) 52
4.3 Speech Evaluation Results 53
4.3.1 Mean Opinion Score Evaluation Results 54
4.3.2 DNSMOS Evaluation Results 55
4.3.3 UTMOS Evaluation Results 57
4.3.4 Comparative Mean Opinion Score Evaluation Results 58
4.3.5 Inference Time Results 60
4.4 Overall Operation Results of the Nursing System 61
4.5 Overall Operation Results of the Warehousing System 70
4.6 Results and Discussion 76
4.6.1 Discussion of the MOS, DNSMOS, UTMOS, and CMOS Results 76
4.6.2 Discussion of the Effectiveness of Speech Synthesis in Warehousing and Nursing 79
4.6.3 Conclusion 82
Chapter 5 Conclusion and Future Work 85
5.1 Conclusion 85
5.2 Future Work 86
References 88
References:
Cui, Y., Wang, X., He, L., & Soong, F. K. (2020). An efficient subband linear prediction for LPCNet-based neural synthesis. INTERSPEECH.
Databaker Technology. (2017). Chinese Standard Mandarin Speech Corpus. https://www.data-baker.com/open_source.html
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
Ito, K., & Johnson, L. (2017). The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/
Kaneko, T., Tanaka, K., Kameoka, H., & Seki, S. (2022). iSTFTNet: Fast and lightweight mel-spectrogram vocoder incorporating inverse short-time Fourier transform. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Kawamura, M., Shirahata, Y., Yamamoto, R., & Tachibana, K. (2023). Lightweight and high-fidelity end-to-end text-to-speech with multi-band generation and inverse short-time Fourier transform. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Kim, J., Kim, S., Kong, J., & Yoon, S. (2020). Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. Advances in Neural Information Processing Systems, 33, 8067-8077.
Kim, J., Kong, J., & Son, J. (2021). Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. International Conference on Machine Learning.
Kong, J., Kim, J., & Bae, J. (2020). HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33, 17022-17033.
Luo, R., Tan, X., Wang, R., Qin, T., Li, J., Zhao, S., Chen, E., & Liu, T.-Y. (2021). LightSpeech: Lightweight and fast text to speech with neural architecture search. ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Reddy, C. K., Gopal, V., & Cutler, R. (2021). DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. (2020). FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558.
Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. (2019). FastSpeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems, 32.
Saeki, T., Xin, D., Nakata, W., Koriyama, T., Takamichi, S., & Saruwatari, H. (2022). UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022. arXiv preprint arXiv:2204.02152.
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., & Skerry-Ryan, R. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Tan, X., Chen, J., Liu, H., Cong, J., Zhang, C., Liu, Y., Wang, X., Leng, Y., Yi, Y., & He, L. (2024). NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., & Xie, L. (2021). Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech. 2021 IEEE Spoken Language Technology Workshop (SLT).
Yu, C., Lu, H., Hu, N., Yu, M., Weng, C., Xu, K., Liu, P., Tuo, D., Kang, S., & Lei, G. (2020). DurIAN: Duration informed attention network for speech synthesis. INTERSPEECH.
Advisor: Chao-Min Wu (吳炤民)
Review date: 2024-07-24