非監督式華語與母語間的神經機器翻譯方法;Unsupervised Neuro Machine Translation between Mandarin and Mother Tongue

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/82307

Title:	非監督式華語與母語間的神經機器翻譯方法; Unsupervised Neuro Machine Translation between Mandarin and Mother Tongue
Authors:	蔡宗翰;王昱鈞
Contributor:	Department of Computer Science and Information Engineering, National Central University
Keywords:	Unsupervised Neural Machine Translation; Translation between Mandarin and Mother Tongue; Mandarin-Cantonese Translation; Mandarin-Taiwanese Translation; Mandarin-Vietnamese Translation; Teacher-Student Framework; Generative Adversarial Networks
Date:	2020-01-13
Upload time:	2020-01-13 14:39:25 (UTC+8)
Publisher:	Ministry of Science and Technology
Abstract:	Mandarin is the main language of Taiwan. Other mother tongues in use include Taiwanese and Hakka among local residents, Cantonese among people from Hong Kong and Macao, and Vietnamese among new immigrants. Mother tongues are receiving growing attention in education, and in business they are an effective way to build rapport with customers. With the rapid spread of artificial intelligence, many Mandarin-language intelligent customer service systems are already online. If machine translation systems between Mandarin and these mother tongues can be developed, every existing Mandarin customer service system could adapt its output to the user's mother tongue, which gives the technology great market potential. Progress has stalled, however, because Mandarin-mother tongue parallel corpora are scarce. Advances in neural machine translation (NMT) have brought major breakthroughs in cross-lingual processing: in addition to parallel corpora, monolingual corpora can now serve as training data for machine translation when combined with advanced unsupervised deep learning algorithms. This gives Mandarin-mother tongue machine translation a valuable opportunity, although many unexplored obstacles remain.

This project will proceed step by step, ordered by language characteristics and difficulty, to develop a complete set of Mandarin-mother tongue NMT techniques based on unsupervised deep learning algorithms.

In the first year, we plan to develop deep-learning-based techniques for bilingual encyclopedia entry linking and bilingual sentence pairing. We will first identify corresponding bilingual entries (Mandarin-Cantonese, Mandarin-Taiwanese, Mandarin-Vietnamese) in online encyclopedias to serve as bilingual dictionaries, and then identify corresponding bilingual sentences within those entries to serve as training corpora for machine translation.

In the second year, we will start with Mandarin-Cantonese, mainly because both use Chinese characters and some corpora already exist. We plan to use the rule-based translation system developed by City University of Hong Kong as the initial training data generator and, through a shared Mandarin-Cantonese encoder and a back-translation mechanism, iteratively train Mandarin-to-Cantonese and Cantonese-to-Mandarin neural translation models in an unsupervised manner. We then intend to transfer the Mandarin-Cantonese approach to Mandarin-Taiwanese. The main challenge is that much of the available Taiwanese corpus is written in Pe̍h-ōe-jī romanization rather than Chinese characters.

In the third year, we will explore NMT through a pivot language. Considering new immigrants, the New Southbound Policy, language similarity, and marketability, we choose Vietnamese, a language that once belonged to the Chinese-character cultural sphere, as the translation target. Because Mandarin-Vietnamese parallel sentences are difficult to collect, we plan to use an unsupervised teacher-student framework to train a Mandarin-Vietnamese translation model from Mandarin-English and English-Vietnamese parallel corpora, with English as the pivot language. In addition, we will construct generative adversarial networks (GANs) to make maximal use of the Mandarin-Vietnamese, Mandarin-English, and English-Vietnamese corpora and further improve Mandarin-Vietnamese translation accuracy.

In the fourth year, we will deploy the Mandarin-mother tongue machine translation techniques developed earlier as online services, initially focusing on translating social network messages and online customer service messages. We expect to collect more data from these services to further improve NMT quality.
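Relation:	Science & Technology Policy Research and Information Center, National Applied Research Laboratories
Appears in Collections:	[Department of Computer Science and Information Engineering] Research Project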
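
The abstract's third-year plan trains a Mandarin-Vietnamese model through English as a pivot, using a teacher-student framework. The record does not describe the implementation, so the following is only a minimal Python sketch of one common simplification of that idea: a teacher that translates English into Vietnamese rewrites the English side of a Mandarin-English corpus, yielding pseudo Mandarin-Vietnamese pairs on which a student model could then be trained. All names (`TOY_EN_VI`, `toy_teacher`, `build_pseudo_parallel`) and the toy sentences are hypothetical, and the dictionary lookup merely stands in for a real English-Vietnamese NMT model.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical toy teacher data: a word-for-word English-to-Vietnamese lookup.
# In a real system the teacher would be an NMT model trained on an
# English-Vietnamese parallel corpus; this table is an assumption for illustration.
TOY_EN_VI = {"hello": "xin chào", "thank": "cảm ơn", "you": "bạn"}


def toy_teacher(en_sentence: str) -> str:
    """Translate English to Vietnamese word by word; unknown words pass through."""
    return " ".join(TOY_EN_VI.get(w, w) for w in en_sentence.lower().split())


def build_pseudo_parallel(
    zh_en_pairs: Iterable[Tuple[str, str]],
    teacher: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Replace the English (pivot) side of each pair with the teacher's output.

    The result is a pseudo Mandarin-Vietnamese corpus that a student
    Mandarin-to-Vietnamese model could be trained on.
    """
    return [(zh, teacher(en)) for zh, en in zh_en_pairs]


# Hypothetical Mandarin-English parallel sentences.
zh_en = [("你好", "Hello"), ("謝謝你", "Thank you")]
pseudo_zh_vi = build_pseudo_parallel(zh_en, toy_teacher)

for zh, vi in pseudo_zh_vi:
    print(zh, "->", vi)
```

The sketch covers only the data-generation step. Training the student, and any variant in which the student matches the teacher's output distribution instead of its single best translation, is left out.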

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	276

All items in NCUIR are protected by original copyright.
