NCU Institutional Repository (theses and dissertations, past exam papers, journal articles, and research projects for download): Item 987654321/90032


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/90032


    Title: Code-switching TTS Based on Self-supervised Learning Approach and Domain Adaptation Speaker Encoder (語碼轉換語音合成基於自監督學習與領域自適應之語者編碼器)
    Author: Pai, Cheng-Hsun (白承勲)
    Contributor: Department of Computer Science and Information Engineering
    Keywords: Code-switching; Text-to-Speech Synthesis; Self-supervised Learning; Domain Adaptation
    Date: 2022-09-21
    Upload time: 2022-10-04 12:08:41 (UTC+8)
    Publisher: National Central University
    Abstract: In recent years, end-to-end speech synthesis models based on deep learning have made significant progress in speech quality and are gradually replacing conventional methods as the mainstream approach. With globalization, demand for code-switching text-to-speech (TTS) in devices and services such as voice assistants, navigation systems, and station announcements has grown, and related research has received increasing attention. Code-switching occurs when a speaker alternates between two or more languages within a single conversation or sentence; a common example is mixing Chinese and English. Ideally, a speaker proficient in multiple languages would record a code-switching dataset covering those languages. However, such speakers are hard to find and labeling is expensive, so most research instead combines multiple monolingual datasets.

    When only monolingual datasets are available, the challenges for code-switching TTS are keeping the speaker's voice consistent across language switches and preserving the naturalness of the synthesized speech, including speech quality, accent, and speaking rate. Current mainstream work uses an encoder-decoder architecture together with speaker and language embeddings to characterize the voice of a specific speaker and the prosody of a language. Some studies use separate monolingual encoders to model language information. Even with these methods, synthesizing highly natural, speaker-consistent speech remains challenging.

    To address these problems, we introduce self-supervised learning and frame-level domain adversarial training into a speaker encoder trained on the speaker verification task. This encourages speaker embeddings from different languages to follow a consistent distribution in speaker space and thereby improves code-switching TTS performance. We also adopt a non-autoregressive TTS model to address the unnatural speaking rate that arises in cross-lingual TTS. We first show that, on a mixed-language dataset combining LibriTTS and AISHELL3, a speaker encoder trained on self-supervised representations achieves a 4.968% absolute reduction in equal error rate (EER) on speaker verification compared with conventional MFCC features, indicating that self-supervised representations generalize better to datasets with complex domains. We then obtain mean opinion scores (MOS) of 3.635 for naturalness and 3.675 for speaker similarity on the code-switching TTS task. Our approach removes the need, found in earlier work, for multiple monolingual encoders to model language information, and adds frame-level domain adversarial training to optimize speaker embeddings in speaker space for code-switching TTS.
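    The abstract reports its speaker verification result as an absolute EER reduction. As a reader's aid, here is a minimal sketch of how an EER and an absolute reduction could be computed from trial scores. It is not the thesis's evaluation code: the function name `equal_error_rate`, the threshold sweep, and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) for similarity scores where higher means 'same speaker'.

    Sketch only: sweeps every observed score as a threshold and picks the one
    where false acceptance and false rejection rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # A trial is accepted when its score is >= t.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores, made up for illustration (NOT data from the thesis).
baseline_eer, _ = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
improved_eer, _ = equal_error_rate([0.9, 0.8, 0.7, 0.65], [0.6, 0.3, 0.2, 0.1])

# An absolute reduction is a difference in percentage points,
# as opposed to a relative reduction (difference divided by the baseline).
absolute_drop = (baseline_eer - improved_eer) * 100
```

    Under this reading, a "4.968% absolute EER decrease" would mean the two systems' EERs differ by 4.968 percentage points; the abstract does not give the individual EER values.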
    Appears in collections: [Graduate Institute of Computer Science and Information Engineering] Theses and dissertations

    Files in this item:

    File        Description  Size  Format  Views
    index.html               0Kb   HTML    159


    All items in NCUIR are protected by the original copyright.
