dc.description.abstract | In recent years, deep learning-based end-to-end models have been widely used in speech synthesis and have achieved significant progress in speech quality. Deep learning-based approaches have gradually become mainstream, replacing conventional ones. With globalization, applications such as voice assistants, navigation systems, and station announcements have increased the demand for code-switching TTS, and related research has received much attention. Code-switching occurs when a speaker alternates between two or more languages within a single conversation or sentence; a common example is mixing Chinese and English. Ideally, a speaker proficient in multiple languages would record code-switching speech covering all the target languages. However, such speakers are hard to find and labeling is costly, so most research combines multiple monolingual datasets instead. When only monolingual datasets are available, code-switching TTS faces several challenges, including keeping the speaker's voice consistent when code-switching occurs and ensuring the naturalness of the synthesized speech, such as its speed, accent, and quality. Recent research mainly uses encoder-decoder E2E frameworks, in which speaker and language embeddings are introduced to characterize the speaker's voice and the global prosody of each language. Some studies use multiple separate monolingual encoders to model linguistic information. Despite these methods, high-quality, speaker-consistent speech synthesis remains a challenging task. To address these problems, we propose introducing self-supervised learning and frame-level domain adversarial training into a speaker verification-based speaker encoder, which encourages speaker embeddings of different languages to share the same distribution in the speaker space and thereby improves the performance of code-switching TTS.
We also adopt a non-autoregressive TTS model to address the unnatural speaking rate of synthesized speech that occurs in cross-lingual TTS. We first show that, on the mixed monolingual datasets of LibriTTS and AISHELL3, self-supervised representations yield a 4.968% absolute EER reduction compared with conventional MFCC features, indicating that self-supervised representations generalize better on datasets spanning complex domains. We then obtain naturalness and speaker similarity MOS scores of 3.635 and 3.675, respectively, on the code-switching TTS task. Our approach removes the need, seen in prior work, for multiple monolingual encoders to model linguistic information, and introduces frame-level domain adversarial training to optimize speaker embeddings in the speaker space for code-switching TTS. | en_US |
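To make the frame-level domain adversarial training mentioned above concrete, the following is a minimal numpy sketch of its standard building block, a gradient reversal layer (GRL). All names and values here are illustrative assumptions, not the thesis implementation: the forward pass is the identity, while the backward pass negates (and scales) the gradient, so the speaker encoder is pushed to produce frame-level features from which a language classifier cannot tell the languages apart.

```python
import numpy as np

# Hypothetical reversal strength; a tunable hyperparameter in DAT setups.
LAMBDA = 1.0

def grl_forward(x):
    # Identity in the forward direction: the language classifier sees the
    # encoder's frame-level features unchanged.
    return x

def grl_backward(grad_from_classifier, lam=LAMBDA):
    # Reversed, scaled gradient flows back to the speaker encoder, pushing
    # it to *confuse* the language classifier (adversarial objective).
    return -lam * grad_from_classifier

# Frame-level features from a hypothetical speaker encoder: (frames, dims).
frames = np.ones((4, 3))
feats = grl_forward(frames)

# Gradient of the language-classification loss w.r.t. those features.
g = np.full((4, 3), 0.5)
g_to_encoder = grl_backward(g)
print(g_to_encoder[0, 0])  # -0.5: the encoder receives the negated gradient
```

In a full system this layer would sit between the speaker encoder and an auxiliary per-frame language classifier, trained jointly with the TTS and speaker verification objectives.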