References
[1] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-Based Speech Synthesis,” in Proc. Eurospeech, 1999.
[2] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for HMM-based speech synthesis,” in Proc. ICASSP 2000, 2000. [Online]. Available: https://ieeexplore.ieee.org/document/861820 (accessed Jul. 04, 2022).
[3] Y. Ren et al., “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,” arXiv, arXiv:2006.04558, Mar. 2021. doi: 10.48550/arXiv.2006.04558.
[4] J. Donahue, S. Dieleman, M. Bińkowski, E. Elsen, and K. Simonyan, “End-to-End Adversarial Text-to-Speech.” arXiv, Mar. 17, 2021. Accessed: Jul. 04, 2022. [Online]. Available: http://arxiv.org/abs/2006.03575
[5] R. J. Weiss, R. Skerry-Ryan, E. Battenberg, S. Mariooryad, and D. P. Kingma, “Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2021, pp. 5679–5683. doi: 10.1109/ICASSP39728.2021.9413851.
[6] Y. Wang et al., “Tacotron: Towards End-to-End Speech Synthesis.” arXiv, Apr. 06, 2017. doi: 10.48550/arXiv.1703.10135.
[7] J. Shen et al., “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.” arXiv, Feb. 15, 2018. doi: 10.48550/arXiv.1712.05884.
[8] Y. Ren et al., “FastSpeech: Fast, Robust and Controllable Text to Speech.” arXiv, Nov. 20, 2019. doi: 10.48550/arXiv.1905.09263.
[9] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, “Neural Speech Synthesis with Transformer Network.” arXiv, Jan. 30, 2019. doi: 10.48550/arXiv.1809.08895.
[10] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan, “Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes.” arXiv, Nov. 21, 2018. doi: 10.48550/arXiv.1811.09021.
[11] Y. Zhang et al., “Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning.” arXiv, Jul. 24, 2019. doi: 10.48550/arXiv.1907.04448.
[12] Z. Liu and B. Mak, “Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers.” arXiv, Nov. 26, 2019. doi: 10.48550/arXiv.1911.11601.
[13] J. Yang and L. He, “Towards Universal Text-to-Speech,” in Interspeech 2020, Oct. 2020, pp. 3171–3175. doi: 10.21437/Interspeech.2020-1590.
[14] Z. Cai, Y. Yang, and M. Li, “Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario.” arXiv, May 20, 2020. doi: 10.48550/arXiv.2005.10441.
[15] Y. Cao et al., “End-to-end Code-switched TTS with Mix of Monolingual Recordings,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019. doi: 10.1109/ICASSP.2019.8682927.
[16] L. Xue, W. Song, G. Xu, L. Xie, and Z. Wu, “Building a mixed-lingual neural TTS system with only monolingual data.” arXiv, Aug. 22, 2019. doi: 10.48550/arXiv.1904.06063.
[17] X. Zhou, X. Tian, G. Lee, R. Das, and H. Li, “End-to-End Code-Switching TTS with Cross-Lingual Language Model,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, p. 7618. doi: 10.1109/ICASSP40776.2020.9054722.
[18] H. Hemati and D. Borth, “Using IPA-Based Tacotron for Data Efficient Cross-Lingual Speaker Adaptation and Pronunciation Enhancement.” arXiv, Mar. 31, 2022. Accessed: Jul. 05, 2022. [Online]. Available: http://arxiv.org/abs/2011.06392
[19] S. Zhao, T. H. Nguyen, H. Wang, and B. Ma, “Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion.” arXiv, Oct. 15, 2020. Accessed: Jul. 05, 2022. [Online]. Available: http://arxiv.org/abs/2010.08136
[20] S. Nakayama, A. Tjandra, S. Sakti, and S. Nakamura, “Speech Chain for Semi-Supervised Learning of Japanese-English Code-Switching ASR and TTS,” in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, doi: 10.1109/SLT.2018.8639674.
[21] A. Vaswani et al., “Attention Is All You Need.” arXiv, Dec. 05, 2017. doi: 10.48550/arXiv.1706.03762.
[22] D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv, May 19, 2016. Accessed: Jul. 04, 2022. [Online]. Available: http://arxiv.org/abs/1409.0473
[23] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” arXiv preprint arXiv:2106.07447, Jun. 2021, Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/2106.07447
[24] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” arXiv preprint arXiv:2006.11477, Oct. 2020, Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/2006.11477
[25] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Apr. 2018, pp. 5329–5333. doi: 10.1109/ICASSP.2018.8461375.
[26] Y. Ganin et al., “Domain-Adversarial Training of Neural Networks,” arXiv, arXiv:1505.07818, May 2016. doi: 10.48550/arXiv.1505.07818.
[27] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv preprint arXiv:1810.04805, May 2019, Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/1810.04805
[28] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized Autoregressive Pretraining for Language Understanding,” arXiv preprint arXiv:1906.08237, Jan. 2020, Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/1906.08237
[29] A. van den Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,” arXiv preprint arXiv:1807.03748, Jan. 2019, Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/1807.03748
[30] C. Wang et al., “UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data,” arXiv preprint arXiv:2101.07597, Jun. 2021, Accessed: Apr. 30, 2022. [Online]. Available: http://arxiv.org/abs/2101.07597
[31] S. Chen et al., “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” arXiv, arXiv:2110.13900, Jan. 2022. doi: 10.48550/arXiv.2110.13900.
[32] F. Wang, W. Liu, H. Liu, and J. Cheng, “Additive Margin Softmax for Face Verification,” IEEE Signal Process. Lett., vol. 25, no. 7, pp. 926–930, Jul. 2018, doi: 10.1109/LSP.2018.2822810.
[33] M. Zhao, Y. Ma, M. Liu, and M. Xu, “The SpeakIn System for VoxCeleb Speaker Recognition Challenge 2021.” arXiv, Sep. 05, 2021. Accessed: Jun. 23, 2022. [Online]. Available: http://arxiv.org/abs/2109.01989
[34] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi,” 2017. doi: 10.21437/INTERSPEECH.2017-1386.
[35] K. Shih, R. Valle, R. Badlani, A. Lancucki, W. Ping, and B. Catanzaro, “RAD-TTS: Parallel Flow-Based TTS with Robust Alignment Learning and Diverse Synthesis,” in ICML 2021 Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.
[36] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989, doi: 10.1109/5.18626.
[37] J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search,” arXiv, arXiv:2005.11129, Oct. 2020. doi: 10.48550/arXiv.2005.11129.
[38] J. Kahn et al., “Libri-Light: A Benchmark for ASR with Limited or No Supervision,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 7669–7673. doi: 10.1109/ICASSP40776.2020.9052942.
[39] G. Chen et al., “GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio.” arXiv, Jun. 13, 2021. Accessed: Jul. 18, 2022. [Online]. Available: http://arxiv.org/abs/2106.06909
[40] C. Wang et al., “VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation.” arXiv, Jul. 27, 2021. Accessed: Jul. 18, 2022. [Online]. Available: http://arxiv.org/abs/2101.00390
[41] T. Wolf et al., “HuggingFace’s Transformers: State-of-the-art Natural Language Processing.” arXiv, Jul. 13, 2020. Accessed: Jul. 18, 2022. [Online]. Available: http://arxiv.org/abs/1910.03771
[42] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis.” arXiv, Oct. 23, 2020. doi: 10.48550/arXiv.2010.05646.
[43] Y. Jia, H. Zen, J. Shen, Y. Zhang, and Y. Wu, “PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS.” arXiv, Jun. 07, 2021. Accessed: Jul. 19, 2022. [Online]. Available: http://arxiv.org/abs/2103.15060
[44] G. Zhang et al., “Mixed-Phoneme BERT: Improving BERT with Mixed Phoneme and Sup-Phoneme Representations for Text to Speech,” 2022.