An Event Alignment Model for Ancient Chinese Literature without Requirement of Manually Labeled Data: A Case Study of the Qing Shi-Lu and Manchu Old Archives

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/93555

Title:	An Event Alignment Model for Ancient Chinese Literature without Requirement of Manually Labeled Data: A Case Study of the Qing Shi-Lu and Manchu Old Archives
Author:	Chao, Jo-Yu (趙若羽)
Contributor:	Department of Computer Science and Information Engineering
Keywords:	Digital Humanities; Paraphrase Identification; Text Alignment; ChatGPT; Deep Learning
Date:	2024-01-25
Upload time:	2024-09-19 17:13:20 (UTC+8)
Publisher:	National Central University
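Abstract:	In digital humanities research on ancient Chinese literature, several studies have explored text alignment techniques that help historians compare different documents, but none of them align texts from the perspective of "same meaning". This study therefore introduces the concept of the paraphrase identification task from natural language processing to find passages with the same meaning across different texts, using the Book of the Later Han (後漢書), Records of the Three Kingdoms (三國志), and Zizhi Tongjian (資治通鑑) as examples. Adopting state-of-the-art natural language processing techniques for paraphrase identification, however, raises two limitations: (1) insufficient training data and (2) the text length limit of attention-based methods. To address these problems, this study proposes a weakly supervised learning framework that applies two-stage training to paraphrase identification in ancient Chinese literature (SPITAC). The method has two main parts: pseudo-labeled training set generation and two-stage training. In pseudo-labeled training set generation, a rule-based method automatically produces the training dataset, which addresses the shortage of training data. To handle the text length limit, a sentence filter removes unimportant sentences, shortening the text to within the maximum length. The two-stage training design lets the classifier better identify hard negative samples, which improves model performance. Experimental results show that the weakly supervised method achieves results close to supervised learning. In ablation experiments, the sentence filter and two-stage training effectively improve performance, raising the F1 score by 4.14 and surpassing the baseline model. Finally, the study demonstrates and analyzes the method's output on real texts and discusses the task's difficulties and directions for future improvement.

Implementing text alignment on ancient Chinese literature offers significant assistance to academics investigating historical events, particularly as variations may occur in the descriptions of an event across different texts. These variations represent valuable research materials. However, current studies rarely align text from the perspective of the "same event". To develop a tool that better fits the practical application conditions of text alignment in ancient Chinese literature, we adopted the ideas of earlier work and redefined "paraphrase" in the Paraphrase Identification task (a Natural Language Processing task that determines whether two texts convey the same meaning) to facilitate text alignment for ancient Chinese literature. This work encounters two primary challenges: 1) the deficiency of training data and 2) the input length limitation of attention-based methods. To address these issues, we propose the Event Alignment Model for Ancient Chinese Literature without Requirement of Manually Labeled Data. In this framework, we use ChatGPT to generate a training set, thereby overcoming the lack of training data. Furthermore, we resolve the text length limitation by employing a data slicing method that reduces paragraph size to within a maximum length. Additionally, the GujiBERT model is used for paraphrase identification. Experimental results show that our proposed EAMAC significantly outperforms the baseline and exhibits considerable stability and applicability when applied to other texts.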
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	16

All items in NCUIR are protected by original copyright.
