非監督式歷史文本事件類型識別──以《明實錄》中之衛所事件為例;Unsupervised Event Type Identification of Historical Texts: A Case Study of Wei-so Events in the Ming Shilu

Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/72179

Title:	非監督式歷史文本事件類型識別──以《明實錄》中之衛所事件為例;Unsupervised Event Type Identification of Historical Texts: A Case Study of Wei-so Events in the Ming Shilu
Authors:	賴郁婷;LAI,YU-TING
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Event type identification;Text clustering;Paragraph Vector;Ming Shilu;Wei-suo;Natural Language Processing;Classical Chinese
Date:	2016-08-03
Issue Date:	2016-10-13 14:30:42 (UTC+8)
Publisher:	National Central University
Abstract:	Natural language processing (NLP) for classical Chinese is challenging because resources are scarce. Existing work is still at the early stages of sentence segmentation, word segmentation, and named entity recognition (NER), and much remains to be done before a fine-grained event extraction system for classical Chinese is feasible. Yet identifying specific topics or events in text has long been a central goal of information extraction, and applying event extraction to historical texts would be of considerable help to humanities scholars.

Current event extraction methods require event templates (the target event types) to be defined in advance, and existing templates do not fit historical documents. Defining templates and annotating training data both demand substantial time, labor, and domain expertise, which is especially difficult for historical texts. The lack of word boundaries and POS tags is a further barrier to applying these methods. We therefore use text clustering as a preprocessing step for event extraction, to identify the event types a text contains so that event templates can be derived later. Clustering groups similar passages together, so paragraphs of the same event type fall into the same cluster. The proposed unsupervised method first embeds the text with the Paragraph Vector model, treats the resulting clusters as event types, and then trains an event type classifier on the clustered data.

This thesis implements a preliminary automatic event type identification approach and applies it to the Ming Shilu, using events involving the "wei-so" as a case study. We also develop a web interface that helps researchers trace the thread of an event more quickly. We hope this work offers historians a new research method for systematically analyzing the evolution of historical events, suggests a new direction for text mining of classical Chinese, and lays a foundation for future work on event extraction from historical texts.
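The embed, cluster, then classify pipeline described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation. To stay runnable with the standard library only, it swaps the Paragraph Vector model for normalised character-bigram counts, uses a simple spherical k-means with farthest-first initialisation, and uses a nearest-centroid rule as the classifier. The function names and the four toy paragraphs are invented for illustration and are not drawn from the Ming Shilu.

```python
import math
from collections import Counter


def embed(text, n=2):
    """Character n-gram counts, L2-normalised (a stand-in for Paragraph Vector)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    norm = math.sqrt(sum(c * c for c in grams.values())) or 1.0
    return {g: c / norm for g, c in grams.items()}


def cosine(a, b):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(x * b.get(g, 0.0) for g, x in a.items())


def centroid(vectors):
    """Normalised sum of sparse vectors."""
    total = Counter()
    for v in vectors:
        for g, x in v.items():
            total[g] += x
    norm = math.sqrt(sum(x * x for x in total.values())) or 1.0
    return {g: x / norm for g, x in total.items()}


def nearest(vector, centers):
    """Index of the center most similar to the vector."""
    return max(range(len(centers)), key=lambda j: cosine(vector, centers[j]))


def kmeans(vectors, k, iters=20):
    """Spherical k-means with deterministic farthest-first initialisation."""
    centers = [vectors[0]]
    while len(centers) < k:
        # Pick the vector least similar to every center chosen so far.
        centers.append(min(vectors, key=lambda v: max(cosine(v, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [nearest(v, centers) for v in vectors]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = centroid(members)
    return labels, centers


def classify(text, centers):
    """Nearest-centroid classifier trained on the cluster assignments."""
    return nearest(embed(text), centers)


# Invented toy paragraphs: two "founding" events and two "supply" events.
paragraphs = [
    "garrison founded at northern pass",
    "garrison founded at eastern pass",
    "grain shipped to hungry troops",
    "grain shipped to weary troops",
]
labels, centers = kmeans([embed(p) for p in paragraphs], k=2)
print(labels)
print(classify("grain shipped to distant troops", centers))
```

In this sketch the cluster indices play the role of event types, and a new paragraph receives the type of its nearest cluster centroid. A real system would replace `embed` with trained paragraph embeddings and would need a principled way to choose the number of clusters.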
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

Files in This Item:

File	Description	Size	Format
index.html		0Kb	HTML	277	View/Open
