NCU Institutional Repository: Item 987654321/95763


    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/95763


    Title: Speech Synthesis System Based on Optimal Transport Conditional Flow Matching; OT-CFM Based Text to Speech Systems
    Authors: 金珉旭;Jin, Min-Xyu
    Contributors: Department of Computer Science and Information Engineering
    Keywords: deep learning; speech synthesis; flow matching
    Date: 2024-08-08
    Issue Date: 2024-10-09 17:15:26 (UTC+8)
    Publisher: National Central University
    Abstract: Traditional speech synthesis methods rely mainly on statistical parametric speech synthesis or concatenative synthesis techniques. These methods depend on manually extracted speech features and elaborate algorithms, and the resulting speech lacks naturalness and emotion, so synthesis quality is poor. Since the rise of deep learning in the 2010s, researchers have explored deep neural networks (DNNs) to improve the quality of synthesized speech; today, deep learning models and algorithms have completely replaced traditional synthesis methods and can generate speech comparable to real human voices. Current speech synthesis models still have the following drawbacks: training and inference remain relatively slow and time-consuming, and although generating natural, fluent speech is no longer difficult, the output often lacks emotional variation and sounds monotonous.

    This thesis builds a speech synthesis system on an optimal transport conditional flow matching (OT-CFM) generative model, which generates speech of high naturalness and high similarity while keeping training and inference efficient. The system covers two tasks: multilingual speech synthesis and Chinese emotional speech synthesis. The multilingual system is trained on three datasets, Carolyn, JSUT, and the Vietnamese Voice Dataset, to support Chinese, Japanese, and Vietnamese. The Chinese emotional speech synthesis system uses ESD-0001, a Chinese dataset with emotional styles, together with a pre-trained wav2vec emotional style extractor that extracts emotional features from the training speech, so that the model learns to transfer the emotional styles in the dataset to the generated speech.
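    For readers unfamiliar with flow matching, the following is a minimal sketch of the standard OT-CFM training objective from the flow-matching literature; the notation (the small constant \sigma_{\min}, the velocity network v_\theta, and the condition c for text/speaker/emotion) is generic and is not taken from this thesis. Given a target feature x_1 (e.g., a mel-spectrogram), noise x_0 ~ N(0, I), and t ~ U[0, 1]:

        x_t = (1 - (1 - \sigma_{\min}) t)\, x_0 + t\, x_1,
        u_t = x_1 - (1 - \sigma_{\min})\, x_0,
        \mathcal{L}_{\mathrm{OT\text{-}CFM}}(\theta) = \mathbb{E}_{t, x_0, x_1} \big\| v_\theta(x_t, t, c) - u_t \big\|^2 .

    At inference time, features are produced by integrating dx/dt = v_\theta(x, t, c) from x_0 to t = 1 with a small number of ODE (e.g., Euler) steps, which is what makes training and synthesis fast relative to many-step diffusion models.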
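    The pre-trained wav2vec emotional style extractor mentioned in the abstract is not specified in detail on this page. The Python sketch below shows one common way to obtain an utterance-level style embedding from a pre-trained wav2vec 2.0 encoder; the checkpoint name, the mean pooling, and the way the embedding would condition the TTS model are illustrative assumptions rather than details from the thesis.

        # Hypothetical sketch: utterance-level emotion/style embedding from a pre-trained
        # wav2vec 2.0 encoder. Checkpoint and pooling are assumptions, not the thesis setup.
        import torch
        import torchaudio
        from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

        feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
        encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

        def style_embedding(wav_path: str) -> torch.Tensor:
            wav, sr = torchaudio.load(wav_path)                    # (channels, samples)
            wav = torchaudio.functional.resample(wav, sr, 16_000)  # wav2vec 2.0 expects 16 kHz audio
            inputs = feature_extractor(wav.mean(dim=0).numpy(),
                                       sampling_rate=16_000, return_tensors="pt")
            with torch.no_grad():
                hidden = encoder(**inputs).last_hidden_state       # (1, frames, hidden_dim)
            return hidden.mean(dim=1).squeeze(0)                   # mean-pool to a fixed-size style vector

        # Example use: the resulting vector could be concatenated to the text-encoder output
        # as an extra condition when training the emotional TTS model.
        # emb = style_embedding("ESD-0001_happy_utt.wav")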
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

    Files in This Item:

    File          Description    Size    Format
    index.html                   0 KB    HTML


    All items in NCUIR are protected by copyright, with all rights reserved.

