Name: 朱彥慈 (Yen-Tzu Chu)
Department: In-service Master Program, Department of Computer Science and Information Engineering
Thesis title: SPITAC: Weakly Supervised Learning for Paraphrase Identification with Two-Stage Training in Ancient Chinese Literature (應用二階段訓練於中國古代文獻釋義識別的弱監督學習架構)
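- This electronic thesis is approved for immediate open access.
- The open-access electronic full text is licensed to users solely for academic research: personal, non-profit searching, reading, and printing.
- Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.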
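Abstract (Chinese, translated) In digital humanities research on ancient Chinese literature, several studies have explored text alignment techniques that help historians compare different documents. However, these studies do not align texts from the perspective of "same meaning".
This study therefore introduces paraphrase identification, a natural language processing task, to find passages in different texts that carry the same meaning, and applies it to the Book of the Later Han (後漢書), Records of the Three Kingdoms (三國志), and Zizhi Tongjian (資治通鑑) as examples.
However, adopting the most advanced natural language processing techniques for paraphrase identification raises two limitations: (1) insufficient training data and (2) the text length limit of attention-based methods.
To address these problems, this study proposes SPITAC, a weakly supervised learning framework that applies two-stage training to paraphrase identification in ancient Chinese literature. The method has two main parts: pseudo-label training set generation and two-stage training. In pseudo-label training set generation, a rule-based method automatically produces the training dataset, which addresses the shortage of training data. To address the text length limit, a sentence filter removes unimportant sentences so that the text fits within the maximum length. Two-stage training is designed so that the classifier better identifies hard negative samples, which improves model performance.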
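The abstract does not state which rule produces the pseudo labels, so the following is only a minimal sketch of the general idea, assuming a character-overlap rule. The function names, the Jaccard measure, and both thresholds are hypothetical choices made for illustration, and the input strings are placeholder characters rather than quotations from the histories.

```python
def char_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of characters in two passages."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def pseudo_label(pairs, pos_threshold=0.6, neg_threshold=0.2):
    """Assign weak labels with a hypothetical overlap rule.

    High overlap -> label 1 (paraphrase), low overlap -> label 0.
    Pairs that fall between the thresholds are dropped as ambiguous.
    """
    labeled = []
    for a, b in pairs:
        score = char_jaccard(a, b)
        if score >= pos_threshold:
            labeled.append((a, b, 1))
        elif score <= neg_threshold:
            labeled.append((a, b, 0))
    return labeled


# Placeholder passages (not real historical text).
candidate_pairs = [
    ("甲乙丙丁戊己", "甲乙丙丁戊庚"),  # mostly shared characters
    ("甲乙丙丁", "甲乙子丑"),          # partial overlap, ambiguous
    ("甲乙丙丁戊己", "子丑寅卯辰巳"),  # no shared characters
]
pseudo_labeled = pseudo_label(candidate_pairs)
```

Dropping the ambiguous middle band trades training set size for cleaner labels. Whether the thesis makes the same trade-off is not stated in the abstract.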
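The experimental results show that the weakly supervised method achieves results close to those of supervised learning. In the ablation study, the sentence filter and two-stage training effectively improve performance, raising the F1 score by 4.14 and outperforming the baseline model. Finally, this study demonstrates and analyzes the method's output on actual texts, and discusses the difficulties of the task and directions for future improvement.
Abstract (English) Text alignment techniques have been studied in digital humanities research on ancient Chinese literature to help historians align documents. Nevertheless, these studies did not align text from a "same meaning" perspective.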
In our work, we introduce paraphrase identification, a natural language processing (NLP) task that determines whether two texts convey the "same meaning", into digital humanities research on ancient Chinese literature, and apply it to the Book of the Later Han, Records of the Three Kingdoms, and Zizhi Tongjian as examples.
However, applying state-of-the-art methods to paraphrase identification raises two limitations: (1) insufficient training data and (2) the text length limit of attention-based methods.
To handle these issues, we propose Weakly Supervised Learning for Paraphrase Identification with Two-Stage Training in Ancient Chinese Literature (SPITAC). The scheme consists of two components: pseudo-label training set generation and two-stage training. Pseudo-label training set generation uses a rule-based method to generate the training dataset automatically, which overcomes the lack of training data. To handle the text length limit, we adopt a sentence filter that deletes unimportant sentences and shrinks the text to within the maximum length. Two-stage training enables the classifier to identify hard negative samples more effectively, which improves model performance.
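The sentence filter can be illustrated with a small sketch. The abstract says only that unimportant sentences are deleted until the text fits the maximum length, so the importance score used here (the share of a sentence's characters that also appear in the paired passage) is an assumption, as are the function names and the use of character count as a stand-in for model token count. The inputs are placeholder characters.

```python
import re


def split_sentences(text: str) -> list:
    """Split on the Chinese full stop, keeping it attached to each sentence."""
    return re.findall(r"[^。]*。|[^。]+$", text)


def filter_sentences(text: str, reference: str, max_len: int) -> str:
    """Drop the least relevant sentences until the text is at most max_len.

    Relevance is a hypothetical score: the fraction of a sentence's
    characters that also occur in the reference passage.
    """
    sentences = split_sentences(text)
    ref_chars = set(reference)
    scores = [sum(ch in ref_chars for ch in s) / len(s) for s in sentences]
    keep = set(range(len(sentences)))
    # Visit sentences from lowest to highest score; always keep at least one.
    for idx in sorted(range(len(sentences)), key=lambda i: scores[i]):
        if sum(len(sentences[i]) for i in keep) <= max_len or len(keep) == 1:
            break
        keep.discard(idx)
    # Re-join the surviving sentences in their original order.
    return "".join(sentences[i] for i in sorted(keep))


# Placeholder passages (not real historical text).
long_text = "甲乙丙。丁戊己。庚辛壬。"
paired_text = "甲乙丙庚辛"
shortened = filter_sentences(long_text, paired_text, max_len=8)
```

In this toy input the middle sentence shares no characters with the paired passage, so it is removed first and the remaining two sentences fit the length budget. A graph-based ranking such as TextRank would be another plausible way to score sentences; the abstract does not say which is used.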
The experimental results show that our weakly supervised approach achieves results close to those of the supervised learning method. In the ablation study, the two proposed components, the sentence filter and two-stage training, improve the F1 score by 4.14 compared to the baseline. Finally, we demonstrate and analyze example instances to show the effect of our method and point out the remaining challenges for this task.
Keywords
★ Digital Humanities (數位人文)
★ Paraphrase Identification (釋義識別)
★ Text Alignment (文本對齊)
★ Weakly Supervised Learning (弱監督學習)
★ Deep Learning (深度學習)
Table of Contents
Chinese Abstract
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
2 Definition of Paraphrasing in Ancient Chinese Literature
3 Related work
3.1 Related Digital Humanity Research
3.2 Paraphrase Identification
3.3 Long Text Matching
3.4 Weak and Semi Supervision
3.4.1 Semi-supervised
3.4.2 Weakly Supervised
4 Methodology
4.1 Task Description
4.2 Framework Overview
4.3 Pseudo-label Training Set Generation
4.3.1 Get Candidate Set
4.3.2 Get Pseudo Labeled Data
4.3.3 Sentence Filter
4.4 Two Stage Training
4.4.1 BERT Fine-tuning
4.4.2 Final Training
5 Evaluation Dataset and Metrics
6 Experiment
6.1 Experimental Setup
6.1.1 Baseline
6.1.2 Implementation Details
6.2 Experimental Results
6.3 Discussion and Analysis
6.3.1 Ablation Study
6.3.2 Analysis on the Trend of Dev Scores Per Epoch
6.3.3 Analysis on Low Word Overlap Data
7 Case Study
7.1 Practical Case
7.2 The Important of Recall
7.3 The Important of Precision
8 Conclusion and Future Work
8.1 Conclusion
8.2 Future work
Bibliography
Advisor: 蔡宗翰 (Richard Tzong-Han Tsai)
Approval date: 2022-9-23