Master's/Doctoral Thesis 108423016: Detailed Record




Name: Ming-Hsuan Chen (陳明萱)    Department: Information Management
Thesis Title: 改進自注意力機制於神經機器翻譯之研究 (Improving the Self-Attention Mechanism for Neural Machine Translation)
Related Theses
★ A Web-Based Collaborative Instructional Design Platform: The Case of the Junior High School Grade 1-9 Curriculum
★ Applying a Content Management Mechanism to Frequently Asked Questions (FAQ)
★ Applying Mobile Multi-Agent Technology to a Course Scheduling System
★ A Study of Access Control Mechanisms and Domestic Information Security Regulations
★ On Introducing an NFC Mobile Transaction Mechanism into Credit Card Systems
★ App-Based Recommendation Services in E-Commerce: The Case of Company P
★ Building a Service-Oriented System to Improve Production Processes: The Case of Company W's PMS System
★ Planning and Deployment of a TSM Platform for NFC Mobile Payment
★ Applying Keyword Marketing at a Semiconductor Distributor: The Case of Company G
★ A Study of Domestic Track-and-Field Competition Information Systems: The Case of the 2014 National Intercollegiate Track and Field Open
★ Evaluating the Deployment of a ULD Tracking and Management System for Airline Ramp Operations: The Case of Company F
★ A Study of Information Security Management Maturity after Adopting an Information Security Management System: The Case of Company B
★ Applying Data Mining Techniques to Movie Recommendation: The Case of Online Video Platform F
★ Using BI Visualization Tools for Security Log Analysis: The Case of Company S
★ An Empirical Study of a Real-Time Analysis System for Privileged Account Login Behavior
★ Detecting and Handling Anomalous Email System Usage: The Case of Company T
  1. The access status of this electronic thesis is: approved for immediate open access.
  2. The open-access electronic full text is licensed to users solely for academic research, limited to personal, non-commercial retrieval, reading, and printing.
  3. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the content without authorization, so as to avoid infringement.

Abstract (Chinese) The goal of neural machine translation is to convert a source-language sentence into the target language with a deep learning model while preserving the meaning of the source sentence and producing correct syntax. One of the most commonly used models in recent years is the Transformer, whose self-attention mechanism captures the global information of a sentence; it performs well on many natural language processing tasks. However, studies have pointed out that the self-attention mechanism tends to learn repetitive information and cannot effectively learn local information in text. This study therefore improves the self-attention mechanism of the Transformer by adding a gate mechanism and the K-means clustering algorithm, yielding Gated Attention and Clustered Attention respectively, where Gated Attention further comprises a Top-k % method and a Threshold method. By centralizing the Attention Map, the model's ability to capture local information is strengthened, so that it learns more diverse sentence relationships and improves translation quality.
  This study applies the Top-k % and Threshold methods of Gated Attention, as well as Clustered Attention, to a Chinese-to-English translation task, using BLEU as the evaluation metric; they reach 25.30, 24.69, and 24.69 respectively. A hybrid model that combines both attention mechanisms achieves a best result of 24.88, which does not surpass using a single method alone. The experiments confirm that the proposed improvements outperform the original Transformer, and further show that using a single attention mechanism better helps the Transformer learn textual information while achieving the goal of centralizing the Attention Map.
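A minimal, hypothetical sketch of the Top-k % variant of Gated Attention described above, assuming a PyTorch implementation: the function name topk_gated_attention and the keep_ratio argument are illustrative choices of this record, not the thesis code. It shows one way a gate could centralize the attention map, by keeping only the largest scores in each row and masking out the rest before the softmax.

    import torch
    import torch.nn.functional as F

    def topk_gated_attention(q, k, v, keep_ratio=0.3):
        """Scaled dot-product attention that keeps only the top keep_ratio
        fraction of scores in each row of the attention map (assumed variant)."""
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / d_k ** 0.5      # (..., L_q, L_k)

        # Gate: find the k-th largest score per query position and mask out
        # everything below it, so softmax concentrates on fewer key positions.
        n_keep = max(1, int(scores.size(-1) * keep_ratio))
        kth = scores.topk(n_keep, dim=-1).values[..., -1:]
        gated = scores.masked_fill(scores < kth, float("-inf"))

        weights = F.softmax(gated, dim=-1)                  # centralized attention map
        return weights @ v, weights

    # Toy usage: one head, sequence length 5, model dimension 8
    q = k = v = torch.randn(1, 5, 8)
    out, attn = topk_gated_attention(q, k, v, keep_ratio=0.4)

Under the same assumptions, the Threshold variant mentioned in the abstract would replace the per-row top-k cutoff with a fixed score threshold; the record does not specify either cutoff value, so keep_ratio here is arbitrary.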
Abstract (English) The purpose of Neural Machine Translation (NMT) is to translate a source sentence into a target sentence with deep learning models while preserving the semantic meaning of the source sentence and producing correct syntax. In recent years, the Transformer has become one of the most commonly used models: it captures the global information of sentences through the Self-Attention Mechanism and performs well in many Natural Language Processing (NLP) tasks. However, some studies have indicated that the Self-Attention Mechanism learns repetitive information and cannot learn the local information of texts effectively. Therefore, we modify the Self-Attention Mechanism in the Transformer and propose Gated Attention and Clustered Attention, by adding a gate mechanism and the K-means clustering algorithm respectively; Gated Attention further includes a Top-k % method and a Threshold method. These approaches centralize the Attention Map so that the model can better capture local information and learn more diverse relationships within sentences, allowing the Transformer to produce higher-quality translations.
In this work, we apply Clustered Attention, as well as the Top-k % and Threshold methods of Gated Attention, to Chinese-to-English translation tasks, obtaining 24.69, 25.30, and 24.69 BLEU, respectively. The best result of a hybrid model that uses both attention mechanisms at the same time is 24.88 BLEU, which does not surpass using a single attention mechanism. In our experiments, the proposed models outperform the vanilla Transformer. Furthermore, using only one attention mechanism helps the Transformer learn textual information better and achieves the goal of Attention Map centralization.
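The abstracts do not state exactly where K-means enters the self-attention computation, so the sketch below is only one plausible reading, for illustration: cluster the attention scores of a single query row and keep the cluster of key positions with the highest mean score. The helper name clustered_attention_row, the n_clusters value, and the use of scikit-learn's KMeans are assumptions, not the author's method.

    import numpy as np
    from sklearn.cluster import KMeans

    def clustered_attention_row(scores_row, n_clusters=2):
        """Cluster one query's attention scores with K-means and suppress every
        key position outside the strongest cluster (assumed variant)."""
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(
            scores_row.reshape(-1, 1))
        # Strongest cluster = the one whose members have the highest mean score.
        best = max(range(n_clusters), key=lambda c: scores_row[labels == c].mean())
        gated = np.where(labels == best, scores_row, -np.inf)
        weights = np.exp(gated - gated.max())
        return weights / weights.sum()                 # centralized attention row

    # Toy usage: attention scores of one query over six key positions
    row = np.array([0.1, 2.3, 2.1, 0.2, 0.0, 1.9])
    print(clustered_attention_row(row, n_clusters=2))

How the number of clusters K is chosen is outside the scope of this sketch; the thesis treats that question separately (see Sections 2-5-2 and 4-3-2 in the table of contents below).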
Keywords (Chinese) ★ Neural Machine Translation
★ Transformer
★ Self-Attention Mechanism
★ Gate Mechanism
★ Clustering Algorithm
Keywords (English) ★ Neural Machine Translation
★ Transformer
★ Self-Attention Mechanism
★ Gate Mechanism
★ Clustering Algorithms
Thesis Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Organization of the Thesis
2. Literature Review
2-1 Neural Machine Translation
2-2 Encoder-Decoder Architecture
2-2-1 RNN
2-2-2 LSTM
2-2-3 RNN Encoder-Decoder
2-3 Transformer
2-3-1 Word Embeddings
2-3-2 Residual Connections and Layer Normalization
2-3-3 FFN
2-3-4 Linear Layer and Softmax
2-4 Attention Mechanism
2-4-1 Self-Attention Mechanism
2-4-2 Multi-Head Attention Mechanism
2-4-3 Related Work on the Self-Attention Mechanism
2-5 Clustering Algorithms
2-5-1 K-means
2-5-2 Choosing the Value of K
3. Research Methodology
3-1 Data Preprocessing
3-2 Model Training
3-2-1 Attention Map
3-2-2 Gated Attention
3-2-3 Clustered Attention
3-2-4 Multi-Head Attention Mechanism
3-3 Evaluation of Results
3-3-1 Generating Translated Sentences
3-3-2 Computing BLEU
4. Experiments
4-1 Experimental Setup
4-1-1 Experimental Environment and Parameter Settings
4-1-2 Datasets
4-2 Experimental Design and Results
4-2-1 Experiment 1: Model Performance under Different Hyperparameter Settings
4-2-2 Experiment 2: Performance of Gated Attention and Clustered Attention
4-2-3 Experiment 3: Model Performance with Different Combinations of Attention Heads
4-3 Discussion and Analysis
4-3-1 Analysis of Attention Maps
4-3-2 Analysis of the Optimal K Value
5. Conclusions and Future Work
5-1 Conclusions
5-2 Research Limitations
5-3 Future Research Directions
References
References
Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). OPTICS: Ordering Points to Identify the Clustering Structure. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 49–60. https://doi.org/10.1145/304182.304187
Arora, P., Deepali, & Varshney, S. (2016). Analysis of K-Means and K-Medoids Algorithm For Big Data. Procedia Computer Science, 78, 507–512. https://doi.org/10.1016/j.procs.2016.02.095
Arthur, D., & Vassilvitskii, S. (2006). k-means++: The Advantages of Careful Seeding. Stanford.
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. ArXiv:1607.06450 [Cs, Stat]. http://arxiv.org/abs/1607.06450
Babhulgaonkar, A. R., & Bharad, S. V. (2017). Statistical Machine Translation. 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), 62–67. https://doi.org/10.1109/ICISIM.2017.8122149
Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to Align and Translate. ArXiv:1409.0473 [Cs, Stat]. http://arxiv.org/abs/1409.0473
Chen, B., & Cherry, C. (2014). A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. Proceedings of the Ninth Workshop on Statistical Machine Translation, 362–367. https://doi.org/10.3115/v1/W14-3346
Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers. ArXiv:1904.10509 [Cs, Stat]. http://arxiv.org/abs/1904.10509
Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734. https://doi.org/10.3115/v1/D14-1179
Cordonnier, J.-B., Loukas, A., & Jaggi, M. (2020). Multi-Head Attention: Collaborate Instead of Concatenate. ArXiv:2006.16362 [Cs, Stat]. http://arxiv.org/abs/2006.16362
Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2017). Language Modeling with Gated Convolutional Networks. ArXiv:1612.08083 [Cs]. http://arxiv.org/abs/1612.08083
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Kdd, 96(34), 226–231.
Garg, A., & Agarwal, M. (2018). Machine Translation: A Literature Review. ArXiv:1901.01122 [Cs]. http://arxiv.org/abs/1901.01122
Gehring, J., Auli, M., Grangier, D., & Dauphin, Y. N. (2017). A Convolutional Encoder Model for Neural Machine Translation. ArXiv:1611.02344 [Cs]. http://arxiv.org/abs/1611.02344
Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. ArXiv:1705.03122 [Cs]. http://arxiv.org/abs/1705.03122
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315–323. http://proceedings.mlr.press/v15/glorot11a.html
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing Machines. ArXiv:1410.5401 [Cs]. http://arxiv.org/abs/1410.5401
Gu, J., Wang, C., & Zhao, J. (2019). Levenshtein Transformer. ArXiv:1905.11006 [Cs]. http://arxiv.org/abs/1905.11006
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778. https://doi.org/10.1109/CVPR.2016.90
He, P., Liu, X., Gao, J., & Chen, W. (2021). DeBERTa: Decoding-enhanced BERT with Disentangled Attention. ArXiv:2006.03654 [Cs]. http://arxiv.org/abs/2006.03654
Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ArXiv:1502.03167 [Cs]. http://arxiv.org/abs/1502.03167
Jin, X., & Han, J. (2010). K-Means Clustering. In C. Sammut & G. I. Webb (Eds.), Encyclopedia of Machine Learning (pp. 563–564). Springer US. https://doi.org/10.1007/978-0-387-30164-8_425
Kalchbrenner, N., & Blunsom, P. (2013). Recurrent Continuous Translation Models. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1700–1709. https://www.aclweb.org/anthology/D13-1176
Kaufman, L., & Rousseeuw, P. J. (2009). Finding Groups in Data: An Introduction to Cluster Analysis. New York, NY: John Wiley and Sons.
Kodinariya, T., & Makwana, P. (2013). Review on Determining of Cluster in K-means Clustering. International Journal of Advance Research in Computer Science and Management Studies, 1, 90–95.
Lakew, S. M., Cettolo, M., & Federico, M. (2018). A Comparison of Transformer and Recurrent Neural Networks on Multilingual Neural Machine Translation. Proceedings of the 27th International Conference on Computational Linguistics, 641–652. https://www.aclweb.org/anthology/C18-1054
Lample, G., & Conneau, A. (2019). Cross-lingual Language Model Pretraining. ArXiv:1901.07291 [Cs]. http://arxiv.org/abs/1901.07291
Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-Based Neural Machine Translation. ArXiv:1508.04025 [Cs]. http://arxiv.org/abs/1508.04025
MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, 281–297.
Mehdad, Y., Negri, M., & Federico, M. (2012). Match without a Referee: Evaluating MT Adequacy without Reference Translations. Proceedings of the Seventh Workshop on Statistical Machine Translation, 171–180. https://www.aclweb.org/anthology/W12-3122
Meng, F., Lu, Z., Li, H., & Liu, Q. (2016). Interactive Attention for Neural Machine Translation. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2174–2185. https://www.aclweb.org/anthology/C16-1205
Na, S., Xumin, L., & Yong, G. (2010). Research on k-means Clustering Algorithm: An Improved k-means Clustering Algorithm. 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, 63–67. https://doi.org/10.1109/IITSI.2010.74
Okpor, M. D. (2014). Machine Translation Approaches: Issues and Challenges. 11(5), 7.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. https://doi.org/10.3115/1073083.1073135
Popescu-Belis, A. (2019). Context in Neural Machine Translation: A Review of Models and Evaluations. ArXiv:1901.09115 [Cs]. http://arxiv.org/abs/1901.09115
Raganato, A., Scherrer, Y., & Tiedemann, J. (2020). Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation. ArXiv:2002.10260 [Cs]. http://arxiv.org/abs/2002.10260
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0
Rush, A. (2018). The Annotated Transformer. Proceedings of Workshop for NLP Open Source Software (NLP-OSS), 52–60. https://doi.org/10.18653/v1/W18-2509
Singh, S. P., Kumar, A., Darbari, H., Singh, L., Rastogi, A., & Jain, S. (2017). Machine Translation Using Deep Learning: An overview. 2017 International Conference on Computer, Communications and Electronics (Comptelix), 162–167. https://doi.org/10.1109/COMPTELIX.2017.8003957
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. ArXiv:1409.3215 [Cs]. http://arxiv.org/abs/1409.3215
Tan, Z., Wang, S., Yang, Z., Chen, G., Huang, X., Sun, M., & Liu, Y. (2020). Neural Machine Translation: A Review of Methods, Resources, and Tools. ArXiv:2012.15515 [Cs]. http://arxiv.org/abs/2012.15515
Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2020). Efficient Transformers: A Survey. ArXiv:2009.06732 [Cs]. http://arxiv.org/abs/2009.06732
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. ArXiv:1706.03762 [Cs]. http://arxiv.org/abs/1706.03762
Voita, E., Talbot, D., Moiseev, F., Sennrich, R., & Titov, I. (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. ArXiv:1905.09418 [Cs]. http://arxiv.org/abs/1905.09418
Wang, B., Wang, A., Chen, F., Wang, Y., & Kuo, C.-C. J. (2019). Evaluating Word Embedding Models: Methods and Experimental Results. APSIPA Transactions on Signal and Information Processing, 8. https://doi.org/10.1017/ATSIP.2019.12
Wang, Z., Ma, Y., Liu, Z., & Tang, J. (2019). R-Transformer: Recurrent Neural Network Enhanced Transformer. ArXiv:1907.05572 [Cs, Eess]. http://arxiv.org/abs/1907.05572
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. https://arxiv.org/abs/1609.08144v2
Xin, M., & Wang, Y. (2019). Research on Image Classification Model Based on Deep Convolution Neural Network. EURASIP Journal on Image and Video Processing, 2019(1), 40. https://doi.org/10.1186/s13640-019-0417-8
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical Attention Networks for Document Classification. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1480–1489. https://doi.org/10.18653/v1/N16-1174
Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., & Ahmed, A. (2021). Big Bird: Transformers for Longer Sequences. ArXiv:2007.14062 [Cs, Stat]. http://arxiv.org/abs/2007.14062
林佳蒼 (2020). 多向注意力機制於翻譯任務改進之研究 [Research on improving the multi-head attention mechanism for translation tasks]. Master's thesis, Graduate Institute of Information Management, National Central University, Taoyuan City, Taiwan.
Advisor: Shi-Jen Lin (林熙禎)    Review Date: 2021-8-2