SPITAC: Weakly Supervised Learning for Paraphrase Identification with Two-Stage Training in Ancient Chinese Literature

Please use this permanent URL to cite or link to this document: https://ir.lib.ncu.edu.tw/handle/987654321/89774

Title:	SPITAC: Weakly Supervised Learning for Paraphrase Identification with Two-Stage Training in Ancient Chinese Literature
Author:	Chu, Yen-Tzu (朱彥慈)
Contributor:	In-service Master's Program, Department of Computer Science and Information Engineering
Keywords:	Digital Humanities; Paraphrase Identification; Text Alignment; Weakly Supervised Learning; Deep Learning
Date:	2022-09-23
Upload time:	2022-10-04 11:59:13 (UTC+8)
Publisher:	National Central University
Abstract:	Text alignment techniques have been studied in digital humanities research on ancient Chinese literature to help historians compare different documents. However, these studies did not align texts from the perspective of "same meaning". In this work, we introduce paraphrase identification, the natural language processing (NLP) task of determining whether two texts convey the same meaning, into digital humanities research on ancient Chinese literature, and apply it to the Book of the Later Han, Records of the Three Kingdoms, and Zizhi Tongjian as examples. Applying state-of-the-art (SOTA) methods to paraphrase identification, however, raises two limitations: (1) insufficient training data and (2) the text length limit of attention-based methods.
To address these issues, we propose Weakly Supervised Learning for Paraphrase Identification with Two-Stage Training in Ancient Chinese Literature (SPITAC). The scheme consists of two components: pseudo-label training set generation and two-stage training. Pseudo-label training set generation uses a rule-based method to produce the training dataset automatically, which addresses the lack of training data. To handle the text length limit, we adopt a sentence filter that deletes unimportant sentences so that the text fits within the maximum length. Two-stage training enables the classifier to better identify hard negative samples, which improves model performance. The experimental results show that our weakly supervised approach achieves results close to those of supervised learning. In the ablation study, the sentence filter and two-stage training together improve the F1 score by 4.14 over the baseline. Finally, we demonstrate and analyze instances from actual texts to show the effect of our method, and discuss the difficulties of this task and directions for future improvement.
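Appears in collections:	[In-service Master's Program, Department of Computer Science and Information Engineering] Theses and Dissertations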

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	318

All items in NCUIR are protected by original copyright.
