Zero-shot Voice Conversion Based on Speaker Embedding Domain Generalization

Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/90046

Title:	Zero-shot Voice Conversion Based on Speaker Embedding Domain Generalization (基於語者特徵領域泛化之零資源語音轉換系統)
Authors:	鄭俊祥;CHENG, CHUN-HSIANG
Contributors:	Department of Computer Science and Information Engineering
Keywords:	voice conversion;speaker embedding;text-to-speech;domain generalization;meta-learning
Date:	2022-09-23
Issue Date:	2022-10-04 12:08:57 (UTC+8)
Publisher:	National Central University (國立中央大學)
Abstract:	Recent advances in deep learning have made voice conversion practical: given an utterance from any source speaker, a system keeps only the semantic content (such as the text) and replaces the speaker characteristics (such as pitch, speaking rate, and energy) with those of a target speaker. Good conversion quality, however, requires enough training data and a model that generalizes well across domains. As a result, voice conversion usually works better for seen speakers (speakers whose data was used in training) than for unseen speakers (speakers whose data was not used in training). Although recent work has addressed voice conversion for unseen speakers, the synthesized quality is still lower than for seen speakers. This thesis therefore aims to build a zero-shot (zero-resource) Chinese voice conversion system that improves speech quality for unseen speakers.
The proposed system achieves zero-shot voice conversion mainly by effectively disentangling the semantic information and the speaker information in speech. A pre-trained speech recognition model, Wav2vec 2.0, extracts the semantic information from the source speaker's speech, and a WavLM model extracts the speaker information from the target speaker's speech. A Robust MAML model then maps the target speaker's information into a domain-generalized feature space so that it can be applied directly to any unseen speaker domain. Finally, through transfer learning, the FastSpeech2 speech synthesis model combines the source speaker's semantic information with the domain-generalized speaker information to synthesize speech in the target speaker's voice.
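The four-stage data flow described in the abstract can be sketched as follows. This is only an illustrative outline, not the thesis implementation: the class `VoiceConversionPipeline`, the `toy_*` functions, and all data shapes are hypothetical stand-ins, and no real Wav2vec 2.0, WavLM, Robust MAML, or FastSpeech2 code is called.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

Wave = List[float]      # raw audio samples (toy representation)
Vector = List[float]    # one embedding or one output frame
Frames = List[Vector]   # a sequence of frames


@dataclass
class VoiceConversionPipeline:
    """Hypothetical wiring of the four stages named in the abstract."""
    content_encoder: Callable[[Wave], Frames]        # role played by Wav2vec 2.0
    speaker_encoder: Callable[[Wave], Vector]        # role played by WavLM
    speaker_generalizer: Callable[[Vector], Vector]  # role played by the Robust MAML model
    synthesizer: Callable[[Frames, Vector], Frames]  # role played by FastSpeech2

    def convert(self, source_wave: Wave, target_wave: Wave) -> Frames:
        content = self.content_encoder(source_wave)      # semantic info from the source
        speaker = self.speaker_encoder(target_wave)      # speaker info from the target
        speaker_dg = self.speaker_generalizer(speaker)   # map to the generalized space
        return self.synthesizer(content, speaker_dg)     # frames in the target voice


# Toy stand-ins (assumptions) so the sketch runs without any model weights.
def toy_content_encoder(wave: Wave) -> Frames:
    # Group samples into frames of two values each.
    return [wave[i:i + 2] for i in range(0, len(wave), 2)]


def toy_speaker_encoder(wave: Wave) -> Vector:
    # Summarize the whole utterance as a single mean value.
    return [sum(wave) / len(wave)]


def toy_speaker_generalizer(embedding: Vector) -> Vector:
    # L2-normalize the embedding; a placeholder for the learned mapping.
    norm = math.sqrt(sum(x * x for x in embedding)) or 1.0
    return [x / norm for x in embedding]


def toy_synthesizer(content: Frames, speaker: Vector) -> Frames:
    # Condition every content frame on the speaker vector by concatenation.
    return [frame + speaker for frame in content]


pipeline = VoiceConversionPipeline(
    toy_content_encoder, toy_speaker_encoder,
    toy_speaker_generalizer, toy_synthesizer,
)
converted = pipeline.convert([0.1, 0.2, 0.3, 0.4], [2.0, 4.0])
```

The point of the sketch is the separation of concerns: the content path sees only the source utterance, the speaker path sees only the target utterance, and the two meet only in the synthesizer, which is what allows a target speaker never seen in training to be swapped in.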
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

Files in This Item:

File	Description	Size	Format
index.html		0Kb	HTML	129	View/Open
