Code-switching TTS Based On Self-supervised Learning Approach And Domain Adaptation Speaker Encoder


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/90032

Title:	語碼轉換語音合成基於自監督學習與領域自適應之語者編碼器; Code-switching TTS Based On Self-supervised Learning Approach And Domain Adaptation Speaker Encoder
Author:	Pai, Cheng-Hsun (白承勲)
Contributor:	Department of Computer Science and Information Engineering
Keywords:	Code-switching; Text To Speech Synthesis; Self-supervised Learning; Domain Adaptation
Date:	2022-09-21
Upload time:	2022-10-04 12:08:41 (UTC+8)
Publisher:	National Central University
Abstract:	In recent years, deep learning-based end-to-end models have been widely adopted in speech synthesis and have brought significant progress in speech quality, gradually replacing conventional approaches as the mainstream. With globalization, devices such as voice assistants, navigation systems, and station announcement systems increasingly require code-switching TTS, and related research has received growing attention. Code-switching occurs when a speaker alternates between two or more languages within a single conversation or sentence; a common example is mixing Chinese and English. Ideally, a speaker proficient in multiple languages would record a code-switching dataset containing those languages. However, such speakers are hard to find and labeling is expensive, so most research instead combines multiple monolingual datasets.
When only monolingual datasets are available, the challenges for code-switching TTS are keeping the speaker's voice consistent across language switches and ensuring the naturalness of the synthesized speech, including speech quality, accent, and speaking rate. Current mainstream research uses an encoder-decoder architecture, with speaker embeddings and language embeddings to characterize a specific speaker's voice and the prosody of a language. Some studies use separate monolingual encoders, one per language, to model language information. Even with these methods, synthesizing highly natural, speaker-consistent speech remains challenging. To address these problems, we introduce self-supervised learning and frame-level domain adversarial training into a speaker encoder based on the speaker verification task, encouraging speaker embeddings from different languages to follow a consistent distribution in the speaker space and thereby improving code-switching TTS performance. We also use a non-autoregressive TTS model to address the unnatural speaking rate that arises in cross-lingual TTS. We first show that, on a mixed-language dataset combining LibriTTS and AISHELL3, a speaker encoder trained on self-supervised representations achieves a 4.968% absolute reduction in EER on the speaker verification task compared with conventional MFCC features, indicating that self-supervised representations generalize better on datasets with complex domains. We then obtain MOS scores of 3.635 for naturalness and 3.675 for speaker similarity on the code-switching TTS task. Our approach removes the need, found in previous literature, for multiple single-language encoders to model linguistic information, and adds frame-level domain adversarial training to optimize speaker embeddings in the speaker feature space for code-switching TTS.
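
For readers unfamiliar with the EER metric cited in the abstract, the following is a minimal sketch of how an equal error rate can be estimated from speaker verification trial scores. It is not taken from the thesis: the function name `equal_error_rate`, the threshold sweep, and the toy scores are all illustrative assumptions. An "absolute" EER reduction, as reported above, is the difference between two such values in percentage points.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate (eer, threshold) from similarity scores.

    Assumption: a higher score means "more likely the same speaker".
    This is a simple threshold sweep, not the thesis's evaluation code.
    """
    # Candidate thresholds are the observed scores themselves.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy, made-up scores for illustration only (not results from the thesis).
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(same_speaker, different_speaker)
print(f"EER = {eer:.3f} at threshold {threshold}")
```

With these toy scores, one same-speaker trial in four is rejected and one different-speaker trial in four is accepted at the chosen threshold, so the two error rates meet at 0.25.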
Appears in collections:	[Graduate Institute of Computer Science and Information Engineering] Master's and Doctoral Theses


All items in NCUIR are protected by original copyright.
