Thesis/Dissertation 103522053: Detailed Record




Name 賴郁婷 (Yu-Ting Lai)   Department Department of Computer Science and Information Engineering
Thesis Title 非監督式歷史文本事件類型識別──以《明實錄》中之衛所事件為例
(Unsupervised Event Type Identification of Historical Texts: A Case Study of Wei-so Events in the Ming Shilu)
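  1. The author has authorized this electronic thesis for immediate open access.
  2. Electronic full texts that have reached their open-access date are licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes.
  3. Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.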

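Abstract (Chinese) Natural language processing research on classical Chinese is constrained by scarce resources, and existing work remains at the early stages of sentence segmentation, word segmentation, and named entity extraction. Identifying specific topics or events in text, however, has long been a central goal of information extraction, and applying event extraction techniques to historical texts should be of great help to humanities scholars.
  Existing event extraction techniques require event templates to be defined in advance, and the existing templates do not fit historical documents. Defining event templates and annotating training data both demand considerable time and labor and depend on expert knowledge, which makes them especially difficult for historical texts. We therefore use text clustering as a preprocessing step for event extraction, aiming to identify the event types a text contains so that event templates can be derived later. Text clustering groups similar passages together, so paragraphs of the same event type fall into the same cluster. The unsupervised event type identification method proposed in this thesis first vectorizes the text with the Paragraph Vector model, then takes the clustering results as event types and uses them to train an event type classifier.
  This study implements a preliminary form of automated event type identification and applies it to the Ming Shilu, taking the identification of wei-so-related events as an example. We also develop a web system that helps researchers trace the course of events more quickly. We hope on the one hand to offer humanities scholars a new research method, and on the other to propose a new research direction for text mining of classical Chinese, laying a foundation for future event extraction research.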
Abstract (English) Natural language processing (NLP) for classical Chinese is very challenging because of the lack of resources. Current work focuses mainly on named entity recognition (NER), sentence segmentation, and word segmentation, and much remains to be done before a fine-grained event extraction system for classical Chinese can be built.
  Current event extraction methods require the target event types to be specified in advance, which is a high barrier for historical texts. The lack of word boundaries and POS tags is a further obstacle to applying these methods. We therefore develop a tool that classifies paragraphs into event categories, which should make it easier to develop new extraction tools. We first embed the texts with the Paragraph Vector model and apply unsupervised text clustering to group paragraphs by event type. We then use the resulting categorized data to train an automatic text classifier.
  In this thesis, we propose an unsupervised event type identification approach based on paragraph embedding and apply it to the Ming Shilu, focusing on events involving “wei-so”. We also develop a web interface that gives users an overview of how an event unfolds. We believe such a tool can help historians systematically analyze the evolution of historical events. The system also offers a new research direction for mining historical texts and lays a foundation for future work on event extraction from historical texts.
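The embed, cluster, then classify flow described in the abstract can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: character-bigram counts stand in for the Paragraph Vector model, a small cosine k-means stands in for the clustering step (the abstract does not name the algorithm), a nearest-centroid rule stands in for the trained classifier, and the four paragraphs are invented English glosses rather than Ming Shilu text.

```python
import math
from collections import Counter


def bigrams(text):
    """Character bigrams; no word segmentation is needed."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def unit(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def embed(text, vocab):
    """Stand-in embedding (assumption): normalised bigram counts, not Paragraph Vector."""
    counts = Counter(bigrams(text))
    return unit([float(counts[g]) for g in vocab])


def cosine(u, v):
    """Cosine similarity of two unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def nearest(vec, centroids):
    """Index of the most similar centroid; doubles as a toy classifier."""
    return max(range(len(centroids)), key=lambda j: cosine(vec, centroids[j]))


def kmeans(vectors, k, iters=10):
    """Cosine k-means with deterministic farthest-point seeding."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Seed with the vector least similar to every centroid chosen so far.
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iters):
        labels = [nearest(v, centroids) for v in vectors]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = unit([sum(col) / len(members) for col in zip(*members)])
    return labels, centroids


# Invented toy paragraphs (hypothetical, not quotations from the Ming Shilu).
train = [
    "establish new garrison at the northern border",
    "establish new garrison at the southern coast",
    "transfer grain supplies to the troops",
    "transfer grain supplies to the capital",
]
vocab = sorted({g for p in train for g in bigrams(p)})
vectors = [embed(p, vocab) for p in train]

# Step 1: cluster the paragraphs; cluster ids act as event-type labels.
labels, centroids = kmeans(vectors, k=2)

# Step 2: assign an unseen paragraph to one of the discovered event types.
new_paragraph = "establish new garrison at the western pass"
predicted = nearest(embed(new_paragraph, vocab), centroids)
print(labels, predicted)
```

In a fuller setup the stand-ins would be swapped for a learned paragraph embedding and a classifier trained on the cluster labels; the sketch only shows how cluster assignments can serve as training labels when no annotated event types exist.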
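Keywords (Chinese) ★ Event type identification (事件類型辨識)
★ Text clustering (文本聚類)
★ Paragraph Vector
★ Ming Shilu (明實錄)
★ Wei-so (衛所)
★ Natural language processing (自然語言處理)
★ Classical Chinese (古漢語)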
Keywords (English) ★ Event type identification
★ Text clustering
★ Paragraph Vector
★ Ming Shilu
★ Wei-so
★ Natural Language Processing
★ Classical Chinese
Table of Contents Abstract (Chinese)
Abstract
Acknowledgments
Contents
List of figures
List of tables
1 Introduction
2 Related Works
2.1 Classical Chinese processing
2.2 Sentence clustering
2.3 Sentence representation
3 Method
3.1 Formal problem definition
3.2 System flow
3.2.1 Module 1 – Time extraction module
3.2.2 Module 2 – Named Entity Recognizer
3.2.3 Module 3 – Wei-so entity linking
3.2.4 Module 4 – Paragraph embedding
3.2.5 Module 5 – Clustering
3.2.6 Module 6 – Classifier
4 Experiment
4.1 Dataset
4.2 Experimental protocols
4.2.1 Baseline
4.2.2 The proposed method
4.3 Evaluation methodology
4.4 Experimental results
4.4.1 Considering different training texts of paragraph vectors
4.4.2 Comparison between different parameters
4.4.3 Comparison between different dimensions
4.4.4 Comparison with baseline
5 Discussion
5.1 Result
5.2 Error analysis
6 Humanities interpretation
6.1 System introduction
6.2 Analysis and comparison of results
7 Conclusion
8 Future work
References
Advisor 蔡宗翰 (Richard Tzong-Han Tsai)   Approval Date 2016-8-3