Master's/Doctoral Thesis 108552015: Detailed Record




Author: Sin-En Lu (呂昕恩)    Department: In-Service Master Program, Department of Computer Science and Information Engineering
Thesis Title: 基於台語與華語之語碼混合資料集與翻譯模型
(Hokkien-Mandarin Code-Mixing Dataset and Neural Machine Translation)
Related Theses
★ A Real-time Embedding Increasing for Session-based Recommendation with Graph Neural Networks
★ 基於主診斷的訓練目標修改用於出院病摘之十代國際疾病分類任務
★ 混合式心臟疾病危險因子與其病程辨識於電子病歷之研究
★ 基於 PowerDesigner 規範需求分析產出之快速導入方法
★ 社群論壇之問題檢索
★ 非監督式歷史文本事件類型識別──以《明實錄》中之衛所事件為例
★ 應用自然語言處理技術分析文學小說角色之關係:以互動視覺化呈現
★ 基於生醫文本擷取功能性層級之生物學表徵語言敘述:由主成分分析發想之K近鄰算法
★ 基於分類系統建立文章表示向量應用於跨語言線上百科連結
★ Code-Mixing Language Model for Sentiment Analysis in Code-Mixing Data
★ 藉由加入多重語音辨識結果來改善對話狀態追蹤
★ 對話系統應用於中文線上客服助理:以電信領域為例
★ 應用遞歸神經網路於適當的時機回答問題
★ 使用多任務學習改善使用者意圖分類
★ 使用轉移學習來改進針對命名實體音譯的樞軸語言方法
★ 基於歷史資訊向量與主題專精程度向量應用於尋找社群問答網站中專家
Access Rights:
1. The electronic full text of this thesis is authorized for immediate open access.
2. The open-access electronic full text is licensed to users only for personal, non-profit academic research: searching, reading, and printing.
3. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese) Code-mixing between Taiwanese Hokkien and Mandarin is a common spoken-language phenomenon in Taiwan, yet Taiwan only began establishing an official writing system for Hokkien in the 21st century. The absence of an official writing system not only means that the NLP field faces a shortage of resources, making breakthroughs on dialectal code-mixing tasks difficult, but also signals the difficulty of passing the language on. Starting from these problems, this study briefly introduces the history of Taiwanese Hokkien and the code-mixing phenomenon in Taiwan, discusses the language proportions and grammatical structure of Taiwanese code-mixing, builds a Hokkien-Mandarin code-mixing dataset based on written Taiwanese, and surveys existing word-segmentation tools applicable to written Taiwanese.
We also describe how to train a Hokkien language model and, using the proposed dataset, develop a Hokkien code-mixing translation model based on XLM.

To fit code-mixing scenarios, we propose a dynamic language identification (DLI) mechanism and use transfer learning to improve the translation model's performance.
Finally, to address problems with cross-entropy (CE), we propose three loss-function reconstructions that exploit word-level similarity. Our Word Boundary Insertion (WBI) mechanism resolves the incompatibility between word-level information and character-level pre-trained models, and it injects WordNet knowledge into the model. Compared with standard CE, experiments on monolingual and code-mixed datasets show that our best loss function improves BLEU by 2.42 points (62.11 to 64.53) on monolingual data and by 0.7 points (62.86 to 63.56) on code-mixed data. These experiments demonstrate that even with a character-level pre-trained language model, lexical information can be carried into downstream tasks.
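
The word-segmentation tools mentioned above are discussed only in prose. As a rough illustration of what dictionary-driven segmentation of mixed Hokkien-Mandarin text involves, here is a minimal greedy longest-match sketch over a toy lexicon; it is not the Articut toolkit used in this thesis, and the lexicon entries and function name are hypothetical.

```python
# Minimal sketch: dictionary-based longest-match segmentation of mixed
# Hokkien/Mandarin text. Illustrative only -- the thesis uses the Articut
# toolkit; this toy lexicon and helper are hypothetical stand-ins.
HOKKIEN_LEXICON = {"袂使", "足濟", "歹勢"}    # hypothetical Hokkien entries
MANDARIN_LEXICON = {"我們", "今天", "事情"}   # hypothetical Mandarin entries

def longest_match_segment(text, lexicons=(HOKKIEN_LEXICON, MANDARIN_LEXICON), max_len=4):
    """Greedily match the longest lexicon entry; unknown characters
    fall back to single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if any(candidate in lex for lex in lexicons) or length == 1:
                tokens.append(candidate)
                i += length
                break
    return tokens

print(longest_match_segment("我們今天袂使做這件事情"))
# ['我們', '今天', '袂使', '做', '這', '件', '事情']
```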
Abstract (English) Code-mixing is a complicated task in natural language processing (NLP), especially when the mixed languages are dialects. In Taiwan, code-mixing is a common phenomenon, and the most popular code-mixed language pair is Hokkien and Mandarin. However, Hokkien lacks NLP resources. We therefore propose a Hokkien-Mandarin code-mixing dataset and offer an efficient Hokkien word segmentation method through an open-source toolkit; together these help overcome the morphological issues of languages in the Sino-Tibetan family. We modify an XLM (cross-lingual language model) with a dynamic language identification (DLI) mechanism and use transfer learning to train it on our proposed dataset for translation tasks. We find that by applying linguistic knowledge and rules and by providing language tags, the model achieves good translation performance on code-mixed data while maintaining the quality of monolingual translation.
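
The DLI mechanism is only named above. The following minimal sketch shows the kind of per-token language tagging it implies: each segmented token receives a language id, yielding the language sequence that XLM-style models consume in parallel with the token ids. The ids, the lexicon-membership test, and the `tag_languages` helper are illustrative assumptions, not the thesis implementation.

```python
# Sketch of a dynamic language identification (DLI) step: tag each token
# with a language id before feeding an XLM-style model, which takes a
# language sequence alongside the token ids. All names are hypothetical.
HOK, ZH = 0, 1  # hypothetical language ids for Hokkien and Mandarin

def tag_languages(tokens, hokkien_lexicon):
    """Assign one language id per token: Hokkien if the token appears in
    the Hokkien lexicon, otherwise Mandarin."""
    return [HOK if tok in hokkien_lexicon else ZH for tok in tokens]

tokens = ["我們", "今天", "袂使", "做", "這", "件", "事情"]
print(tag_languages(tokens, {"袂使", "足濟", "歹勢"}))
# [1, 1, 0, 1, 1, 1, 1]
```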

Most recent neural machine translation (NMT) models, including XLM, use cross-entropy as the loss function. However, standard cross-entropy penalizes the model whenever it fails to generate the ground-truth token, removing any opportunity to consider other plausible outputs; this can cause overcorrection and overconfidence. Reconstructing the loss function with word similarity has been proposed as a remedy, but existing solutions do not suit Chinese, because most Chinese models are pre-trained at the character level. In this work we propose a simple but effective method, Word Boundary Insertion (WBI), which reconciles word-level similarity with character-level models by reconstructing the loss function of Chinese NMT models. WBI considers word similarity without modifying or retraining the language model. We propose three modified loss functions for use with XLM, whose calculation also draws on WordNet. Compared with standard cross-entropy, experimental results on both monolingual and code-mixed Hokkien-Mandarin datasets show that our best loss function achieves BLEU improvements of 2.42 points (62.11 to 64.53) on monolingual data and 0.7 points (62.86 to 63.56) on code-mixed data.
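
The abstract names three modified loss functions without giving their formulas. As one plausible instantiation of the general idea, redistributing part of the one-hot target mass onto similar tokens in the spirit of label smoothing, here is a hedged PyTorch sketch; the similarity matrix, the mixing weight `alpha`, and the function name are assumptions rather than the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def similarity_smoothed_ce(logits, target, sim, alpha=0.1):
    """Cross-entropy against a softened target distribution.

    logits: (batch, vocab) decoder scores
    target: (batch,) gold token ids
    sim:    (vocab, vocab) nonnegative similarity scores, e.g. WordNet-derived
            and mapped onto character tokens via a WBI-style alignment
    alpha:  probability mass moved from the gold token onto similar tokens
    """
    soft = sim[target]                              # (batch, vocab) similarity rows
    soft = soft / soft.sum(dim=-1, keepdim=True)    # normalize each row
    hard = F.one_hot(target, logits.size(-1)).float()
    mixed = (1.0 - alpha) * hard + alpha * soft     # softened target distribution
    return -(mixed * F.log_softmax(logits, dim=-1)).sum(-1).mean()

# Toy usage with a vocabulary of 5 and a batch of 2.
logits = torch.randn(2, 5)
target = torch.tensor([1, 3])
sim = torch.rand(5, 5)
print(similarity_smoothed_ce(logits, target, sim))
```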
Keywords (Chinese) ★ 語碼混合 (code-mixing)
★ 機器翻譯 (machine translation)
★ 損失函數重構 (loss function reconstruction)
★ 低資源語言 (low-resource language)
Keywords (English) ★ Code-Mixing
★ Neural Machine Translation
★ Loss Function Reconstruction
★ Low Resource
★ WordNet
Table of Contents
Chinese Abstract
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
    1.1.1 Code-Mixing
    1.1.2 Neural Machine Translation
  1.2 Goal
2 Background of Taiwanese Hokkien
  2.1 History of Taiwanese Hokkien
  2.2 Taiwanese Hokkien Writing System and Difficulties
  2.3 Difficulties in Written Taiwanese Hokkien
    2.3.1 Ambiguous Word Boundaries in Written Taiwanese Hokkien
    2.3.2 Literary and Colloquial Readings
    2.3.3 Insufficient Resources
  2.4 Recap
3 Related Work
  3.1 Code-Mixing Theory
  3.2 Code-Mixing Research in Taiwanese Hokkien
  3.3 Code-Mixed Corpora
  3.4 Code-Mixed Translation
  3.5 Pre-trained Language Models
  3.6 Transfer Learning
  3.7 Chinese Language Models and Machine Translation
  3.8 Recap
4 Dataset and Evaluation
  4.1 Data Source
  4.2 Synthetic Hokkien-Mandarin Code-Mixed Data
    4.2.1 Problems in Using Chinese Toolkits
    4.2.2 Articut: A Solution for Hokkien Word Segmentation
    4.2.3 Synthetic Approach
  4.3 Data Analysis
    4.3.1 Human Scoring
    4.3.2 Code-Mixing Complexity
    4.3.3 Inter-rater Score
  4.4 Recap
5 Hokkien Language Model and Translation System
  5.1 Assumption
  5.2 Hokkien Language Model
  5.3 XLM Model
    5.3.1 Dynamic Language Identification and AutoEncoder
    5.3.2 Transfer Learning
  5.4 Loss Function
    5.4.1 Word Boundary Insertion (WBI)
    5.4.2 Word Similarity
    5.4.3 Loss Function Modification
  5.5 Recap
6 Experiments and Results
  6.1 Dataset, Baseline, and Evaluation Metrics
  6.2 Experiment Settings
  6.3 Metrics
  6.4 Hokkien Language Model and XLM
  6.5 Dynamic Language Identification and AutoEncoder
  6.6 Transfer Learning
  6.7 Loss Function
  6.8 Recap
7 Loss Function Analysis
  7.1 Similar-Word Configuration
  7.2 Proposed Loss Function
  7.3 Case Study
    7.3.1 Similar-Word Configuration
    7.3.2 Proposed Loss Function
  7.4 Recap
8 Conclusion and Future Work
  8.1 Conclusion
  8.2 Future Work
Bibliography
Advisor: Richard Tzong-Han Tsai (蔡宗翰)    Review Date: 2022-01-21