Abstract: | The term speech synthesis refers to artificially synthesizing speech that is nearly indistinguishable from real human talk. Early approaches, such as statistical methods, were tedious and error-prone; moreover, in a pipeline architecture, an error at one stage has a knock-on effect on every subsequent stage. In recent years, with the boom in deep learning, the technology for building Text-To-Speech (TTS) systems on deep learning architectures has matured considerably, and a wide variety of deep-network TTS applications have entered everyday life. Now that TTS can synthesize realistic speech, users are no longer satisfied with realism alone: they want to synthesize the voice of a specified speaker, speech in a specified accent, or speech carrying a particular emotion. A modern TTS system must be able to generate speech according to user preferences.

Tacotron2 is a classic deep learning architecture for TTS. It consists of an encoder that encodes the input text, an attention mechanism that converts the encoder output into decoder input, a decoder that outputs a spectrogram, and a vocoder that converts the spectrogram into audio.

Regarding the attention mechanism, besides Location Sensitive Attention, which Google's Tacotron2 implementation adopts, and Forward Attention, which is also commonly used in speech synthesis models, Monotonic Chunkwise Attention (MoChA) is another option. MoChA restricts the range of ordinary soft attention to a fixed-length chunk, in the hope of improving network accuracy. To date, however, most MoChA-related research has been applied to the field of speech recognition. This thesis implements the three attention mechanisms above on a multi-speaker Tacotron2 model. Comparing MoChA against the other two, we find that the audio produced with MoChA is not better; instead, in this setting MoChA loses its advantage of supporting streaming. |
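The chunk restriction described above can be sketched in a few lines. The following is a minimal, simplified NumPy illustration (not the thesis's implementation): it omits MoChA's monotonic stopping-probability mechanism and simply assumes a boundary position has already been chosen, then applies a soft attention (softmax) over only the last `chunk_size` encoder steps ending at that boundary. The function name and the toy energies are invented for illustration.

```python
import numpy as np

def chunkwise_soft_attention(energies, boundary, chunk_size):
    """Softmax over only the `chunk_size` encoder steps ending at `boundary`,
    instead of over the full sequence -- the chunk idea behind MoChA."""
    start = max(0, boundary - chunk_size + 1)
    window = energies[start:boundary + 1]
    weights = np.exp(window - window.max())   # numerically stable softmax
    weights /= weights.sum()
    full = np.zeros_like(energies)            # steps outside the chunk get zero weight
    full[start:boundary + 1] = weights
    return full

# Toy example: 8 encoder steps, boundary chosen at step 5, chunk of length 3,
# so only steps 3..5 receive nonzero attention weight.
e = np.array([0.1, 0.5, 0.2, 1.0, 2.0, 1.5, 0.3, 0.1])
w = chunkwise_soft_attention(e, boundary=5, chunk_size=3)
```

Because attention is computed only over a bounded window rather than the whole encoder output, this formulation is what allows MoChA-style models to operate on streaming input.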