Abstract: Speech Translation (ST) is an interdisciplinary field at the intersection of Natural Language Processing (NLP) and speech processing, aiming to translate speech in one language directly into speech or text in another. The technology is one of the significant achievements of modern science: it enables barrier-free communication, promotes global exchange and cooperation, and advances language education. With the acceleration of globalization and cross-cultural exchange, speech translation has become increasingly important across a wide range of application scenarios and a focal point of research for many scholars.

Deep-learning translation tasks can be divided into several types: Text-to-Text, Text-to-Speech, Speech-to-Text, and Speech-to-Speech. Among these, Text-to-Text, Speech-to-Text, and Speech-to-Speech translation draw particular attention. Large language models (such as GPT) possess strong comprehension and generation capabilities, so Text-to-Text translation performs especially well when supported by large amounts of high-quality training data.

Speech-to-Speech translation can adopt a three-stage cascaded approach, chaining an Automatic Speech Recognition (ASR) model, a Machine Translation (MT) model, and a Text-to-Speech (TTS) model in sequence. This approach, however, makes the drawbacks of cascaded models (error propagation and high latency) more pronounced. Direct Speech-to-Speech Translation models alleviate these drawbacks, yet their performance still lags behind strong cascaded systems, primarily because Speech-to-Speech training data is scarce; even with data augmentation techniques, results remain inferior to cascaded models. Overcoming this data scarcity, or generating high-quality Speech-to-Speech training data, is therefore a crucial issue. This thesis aims to find a balance between the two approaches, so that the model achieves both high performance and low latency.
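The three-stage cascade described above can be sketched as a simple function composition. This is a minimal illustration, not the thesis's implementation: the `asr`, `mt`, and `tts` functions below are hypothetical stubs standing in for real models, and exist only to show how errors and latency accumulate stage by stage.

```python
# Sketch of a 3-stage cascaded speech-to-speech translation pipeline.
# asr/mt/tts are placeholder stubs (assumptions, not real models):
# in practice each would be a trained neural model.

def asr(audio: bytes) -> str:
    # Placeholder ASR: pretend the audio decodes to this transcript.
    # A recognition error here would propagate through MT and TTS.
    return "hello world"

def mt(text: str, target_lang: str) -> str:
    # Placeholder MT: a toy lookup table stands in for a translation model.
    toy = {"hello world": "你好 世界"}
    return toy.get(text, text)

def tts(text: str) -> bytes:
    # Placeholder TTS: encode the text as bytes in place of a waveform.
    return text.encode("utf-8")

def cascaded_s2st(audio: bytes, target_lang: str = "zh") -> bytes:
    # The stages run strictly in sequence, so total latency is the sum
    # of all three, and each stage inherits the previous stage's errors --
    # the two cascade drawbacks the abstract refers to.
    transcript = asr(audio)
    translation = mt(transcript, target_lang)
    return tts(translation)

print(cascaded_s2st(b"dummy-audio"))
```

A direct model, by contrast, would map source audio to target speech in a single step, avoiding the intermediate text bottleneck at the cost of needing scarce paired speech-to-speech data.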