Thesis 109522075 – Detailed Record




Author: Chun-Hsiang Cheng (鄭俊祥)    Department: Computer Science and Information Engineering
Thesis Title: 基於語者特徵領域泛化之零資源語音轉換系統 (Zero-shot Voice Conversion Based on Speaker Embedding Domain Generalization)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ 波束形成與音訊前處理之嵌入式系統實現
★ 語音合成及語者轉換之應用與設計
★ 基於語意之輿情分析系統
★ 高品質口述系統之設計與應用
★ 深度學習及加速強健特徵之CT影像跟骨骨折辨識及偵測
★ 基於風格向量空間之個性化協同過濾服裝推薦系統
★ RetinaNet應用於人臉偵測
★ 金融商品走勢預測
★ 整合深度學習方法預測年齡以及衰老基因之研究
★ 漢語之端到端語音合成研究
★ 基於 ARM 架構上的 ORB-SLAM2 的應用與改進
★ 基於深度學習之指數股票型基金趨勢預測
★ 探討財經新聞與金融趨勢的相關性
★ 基於卷積神經網路的情緒語音分析
★ 運用深度學習方法預測阿茲海默症惡化與腦中風手術存活
Full text: not publicly available (permanently restricted)
Abstract (Chinese): In recent years, advances in deep learning have made once far-fetched ideas feasible. Through voice conversion, the voice of any source speaker keeps only its semantic information (such as the text), while the speaker information (such as pitch, speaking rate, and energy) is converted into the voice of another, target speaker. However, achieving good conversion quality requires enough training data to train the model sufficiently, and the model's generalization ability must be improved so that it infers well in any domain. Voice conversion therefore usually works better for registered speakers (speakers whose data were used during training) and worse for unregistered speakers (speakers whose data were not used during training). Although recent studies have also targeted voice conversion for unregistered speakers, the synthesized quality is still lower than that for registered speakers. This thesis therefore aims to build a zero-resource Chinese voice conversion system that improves the speech quality for unregistered speakers in the voice conversion task.
This thesis constructs a zero-resource voice conversion system that works mainly by effectively decoupling the semantic information and the speaker information in speech. The pretrained speech recognition model wav2vec 2.0 extracts the semantic information of the source speaker, and the WavLM model extracts the speaker information of the target speaker. A Robust MAML model then maps the target speaker's information into a domain-generalized space so that it can be applied directly to any unregistered (unseen) speaker domain. Finally, through transfer learning, the speech synthesis model FastSpeech2 combines the semantic information with the domain-generalized speaker information to synthesize the target speaker's speech, yielding a zero-resource voice conversion system.
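To make the pipeline described above concrete, below is a minimal inference sketch in Python. It is illustrative rather than the thesis implementation: the Hugging Face transformers checkpoints are assumed stand-ins for the pretrained wav2vec 2.0 and WavLM encoders, and project_to_generalized_space, fastspeech2, and hifigan are hypothetical placeholders for the Robust MAML projection, the multi-speaker FastSpeech2 synthesizer, and the vocoder.

```python
# Minimal zero-shot voice conversion inference sketch (illustrative, not the thesis code).
import torch
import torchaudio
from transformers import Wav2Vec2Model, WavLMModel

# Pretrained encoders; checkpoint names are assumptions, not the thesis's exact models.
wav2vec2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")   # content (semantic) encoder
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")      # speaker encoder


def load_16k_mono(path: str) -> torch.Tensor:
    """Load a waveform, downmix to mono, and resample to the 16 kHz rate both encoders expect."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0, keepdim=True)
    return torchaudio.functional.resample(wav, sr, 16_000)


def convert(source_path, target_path, project_to_generalized_space, fastspeech2, hifigan):
    src = load_16k_mono(source_path)   # source speaker: provides the linguistic content
    tgt = load_16k_mono(target_path)   # target speaker: provides the voice identity

    with torch.no_grad():
        # Frame-level semantic features from the source utterance.
        content = wav2vec2(src).last_hidden_state        # (1, T, 768)
        # Utterance-level speaker embedding from the target utterance (mean-pooled here).
        spk = wavlm(tgt).last_hidden_state.mean(dim=1)   # (1, 768)
        # Map the embedding into the domain-generalized speaker space (Robust MAML head).
        spk = project_to_generalized_space(spk)
        # Synthesize a mel-spectrogram conditioned on content + speaker, then vocode it.
        mel = fastspeech2(content, spk)
        wav = hifigan(mel)
    return wav
```

Mean-pooling the WavLM hidden states into an utterance-level embedding is one simple choice; the thesis may derive its speaker representation differently.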
Abstract (English): In recent years, the development of deep learning has made it possible to pursue ideas that once seemed fanciful. Through voice conversion, the voice of any source speaker retains only its semantic information (such as the text), while the speaker information (such as pitch, speaking rate, and energy) is converted into that of another, target speaker. However, achieving good conversion quality requires enough training data to train the model sufficiently, and the model's generalization ability must be improved so that it infers well in any data domain. Voice conversion therefore usually performs better on registered speakers (whose data are used during training) and worse on unregistered speakers (whose data are not used during training). Although recent research has also targeted voice conversion for unregistered speakers, the synthesized quality remains lower than that for registered speakers. This thesis therefore aims to construct a zero-resource Chinese voice conversion system that improves the voice quality for unregistered speakers in the voice conversion task.
This thesis constructs a zero-resource voice conversion system that works mainly by effectively decoupling the semantic information and the speaker information in speech. The pretrained speech recognition model wav2vec 2.0 extracts the semantic information from the source speaker, and the WavLM model extracts the speaker information from the target speaker. A Robust MAML model then maps the target speaker's embedding into a domain-generalized feature space so that it can be applied directly to any unregistered (unseen) speaker domain. Finally, through transfer learning, the FastSpeech2 model synthesizes the target speaker's voice from the source speaker's semantic information and the domain-generalized speaker information.
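To illustrate the domain-generalization step described above, the following is a minimal MAML-style episodic training sketch for a speaker-embedding projection head (PyTorch 2.x, for torch.func). It assumes precomputed WavLM embeddings grouped into per-speaker-domain episodes with support and query batches; ProjectionHead, the episode format, and the speaker-ID loss are illustrative assumptions, and the thesis's Robust MAML additionally reweights the worst-performing domains, which this sketch omits.

```python
# MAML-style episodic training sketch for a speaker-embedding projection head
# (illustrative only; the names below are assumptions, not the thesis code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Projects a raw WavLM speaker embedding into a domain-generalized space."""

    def __init__(self, dim_in=768, dim_out=256, num_speakers=100):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim_in, 512), nn.ReLU(), nn.Linear(512, dim_out))
        # Speaker-ID head used only during meta-training; num_speakers is illustrative.
        self.classifier = nn.Linear(dim_out, num_speakers)

    def forward(self, x):
        return self.classifier(self.proj(x))


def maml_outer_step(model, episodes, outer_opt, inner_lr=1e-2):
    """One meta-update: adapt on each episode's support set, accumulate the query loss."""
    meta_loss = torch.zeros(())
    for support_x, support_y, query_x, query_y in episodes:
        # Inner loop: one gradient step on the support set of this speaker domain.
        support_loss = F.cross_entropy(model(support_x), support_y)
        grads = torch.autograd.grad(support_loss, model.parameters(), create_graph=True)
        fast_weights = {
            name: param - inner_lr * grad
            for (name, param), grad in zip(model.named_parameters(), grads)
        }
        # Outer loop: evaluate the adapted weights on the held-out query set.
        query_logits = torch.func.functional_call(model, fast_weights, (query_x,))
        meta_loss = meta_loss + F.cross_entropy(query_logits, query_y)

    outer_opt.zero_grad()
    meta_loss.backward()
    outer_opt.step()
    return meta_loss.item()
```

A typical driver would call maml_outer_step in a loop over sampled episodes, with something like torch.optim.Adam(model.parameters(), lr=1e-3) as outer_opt.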
Keywords (Chinese): ★ 語音轉換 (voice conversion)
★ 語者編碼 (speaker embedding)
★ 語音合成 (speech synthesis)
★ 領域泛化 (domain generalization)
★ 元學習 (meta-learning)
Keywords (English): ★ voice conversion
★ speaker embedding
★ text-to-speech
★ domain generalization
★ meta-learning
Table of Contents
Chinese Abstract i
English Abstract ii
Table of Contents iv
List of Figures vii
List of Tables viii
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Research Objectives 2
1.3 Methodology and Thesis Organization 2
Chapter 2 Overview of Voice Conversion and Related Work 4
2.1 Speaker Features in Speech 4
2.1.1 One-hot Vector 4
2.1.2 D-vector 5
2.1.3 X-vector 6
2.2 Transformer 7
2.2.1 Self-Attention 8
2.2.2 Multi-head Attention 9
2.2.3 Positional Encoding (PE) 10
2.3 Speech Recognition Models 11
2.3.1 Contrastive Predictive Coding (CPC) 11
2.3.2 wav2vec 2.0 12
2.3.3 HuBERT 14
2.4 Speech Synthesis Models 15
2.4.1 FastSpeech 15
2.5 Vocoder 17
2.5.1 HiFi-GAN 17
2.6 Related Work on Voice Conversion 18
Chapter 3 Zero-shot Voice Conversion Based on Speaker Embedding Domain Generalization 20
3.1 Speaker Feature Extraction Model 21
3.1.1 WavLM 21
3.1.2 WavLM Model Architecture 21
3.1.3 Gated Relative Position Bias 23
3.1.4 Masked Speech Denoising and Prediction 24
3.1.5 WavLM Speaker Feature Extraction Model 24
3.2 Speaker Feature Generalization Model 25
3.2.1 Model-Agnostic Meta-Learning (MAML) 26
3.2.2 MAML Data Setup 26
3.2.3 MAML Training Procedure 27
3.2.4 Robust MAML 29
3.3 Multi-speaker Speech Synthesis Model 31
3.3.1 FastSpeech2 31
3.3.2 FastSpeech2 Model Architecture 32
3.3.3 Variance Adaptor 32
3.3.4 Multi-speaker FastSpeech2 Model 33
3.4 Voice Conversion Model 34
3.4.1 Voice Conversion Model Architecture 35
3.4.2 Inference Procedure 36
Chapter 4 Experiments 37
4.1 Datasets 37
4.1.1 AISHELL-1 37
4.1.2 AISHELL-3 38
4.2 Experimental Setup 39
4.2.1 Hardware and Environment 39
4.2.2 Speaker Feature Extraction Model Settings 40
4.2.3 Training the X-vector Speaker Feature Extraction Model 41
4.2.4 Training the WavLM Speaker Feature Extraction Model 41
4.2.5 Training the WavLM + Robust MAML Speaker Feature Extraction Model 43
4.2.6 Training the Speech Synthesis Model 43
4.3 Results and Analysis 44
4.3.1 Evaluation Methods 44
4.3.2 Speaker Feature Extraction Performance 45
4.3.3 Voice Conversion Results for Registered (Seen) Speakers 46
4.3.4 Speaker Features of Unregistered (Unseen) Speakers 47
4.3.5 Voice Conversion Results for Unregistered (Unseen) Speakers 47
Chapter 5 Conclusions and Future Work 49
Chapter 6 References 50
Advisor: Jia-Ching Wang (王家慶)    Date of Approval: 2022-09-23