Non-Parallel Voice Conversion for Speech Synthesis using Gated Recurrent Networks and Contrastive Learning: A Hybrid Deep Learning Approach


Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/92696

Title:	Non-Parallel Voice Conversion for Speech Synthesis using Gated Recurrent Networks and Contrastive Learning: A Hybrid Deep Learning Approach
Authors:	比馬特;Prihasto, Bima
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Speech synthesis;voice conversion;non-parallel data;recurrent neural networks;contrastive learning;hard negative example;attention mechanism
Date:	2023-07-29
Issue Date:	2023-10-04 16:08:42 (UTC+8)
Publisher:	National Central University
Abstract:	This dissertation contributes to speech processing, specifically speech synthesis and voice conversion, in three parts. First, RNN-based models are well suited to speech synthesis, but long computation time remains a problem. The dissertation builds a new RNN architecture by modifying the minimal gated unit (MGU): it removes the unit state history from some of the MGU equations. The resulting architecture is twice as fast as other MGU-based architectures while producing speech of equal quality. Second, contrastive learning has previously been applied to non-parallel voice conversion, but the synthesis results were unsatisfactory because the information content of the source speech was not preserved and the timbre and prosody could not be adjusted to match the target voice. The dissertation introduces a hard-negative-example approach to contrastive learning, called CNEG-VC. This technique generates instance-wise negative examples from the voice input and uses an adversarial loss to produce hard negative examples, which improves non-parallel voice conversion performance. Finally, the dissertation proposes using selective attention over spectral features as the anchor point for contrastive learning in non-parallel voice conversion, called CSA-VC. This technique selects queries by measuring the probability distribution of each row and uses the reduced attention matrix to ensure that source relations are preserved in the synthesis.
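The query selection described for CSA-VC can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not say which measurement of each row's distribution is used, so Shannon entropy is assumed here, and the function names (`attention_matrix`, `select_queries`) and the toy feature values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def row_entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_matrix(features):
    """Self-attention over feature vectors: row-wise softmax of dot products."""
    scores = [[sum(a * b for a, b in zip(q, k)) for k in features]
              for q in features]
    return [softmax(row) for row in scores]


def select_queries(features, num_queries):
    """Keep the rows whose attention distribution is most peaked.

    Assumption: "most peaked" means lowest entropy. The abstract only says
    that a measurement of each row's probability distribution is used.
    Returns the selected row indices and the reduced attention matrix.
    """
    attn = attention_matrix(features)
    entropies = [row_entropy(row) for row in attn]
    order = sorted(range(len(attn)), key=lambda i: entropies[i])
    selected = sorted(order[:num_queries])
    reduced = [attn[i] for i in selected]
    return selected, reduced


# Toy "spectral" features: 4 frames, 2 dimensions each (made-up values).
frames = [[3.0, 0.0], [0.0, 3.0], [0.1, 0.1], [0.0, 0.2]]
selected, reduced = select_queries(frames, num_queries=2)
print(selected)      # indices of the frames kept as queries
print(len(reduced))  # number of rows in the reduced attention matrix
```

With these toy values, the two frames with large magnitudes produce sharply peaked attention rows and are kept, while the two near-zero frames produce almost uniform rows and are dropped. In a full system the reduced matrix would then be applied to both source and converted features so that the same relations are compared in the contrastive loss; that step is omitted here.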
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

