Establishing an Entity Linking Model for Person Names in Ming Shilu with Automatically Constructed Labeled Data

Name

吳承翰 (Chang-Han Wu)

Graduating Department

Department of Computer Science and Information Engineering, In-service Master Program

Thesis Title

以自動產生之標註資料進行明實錄人名命名實體鏈結
(Establishing an Entity Linking Model for Person Names in Ming Shilu with Automatically Constructed Labeled Data)

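This electronic thesis is authorized for immediate open access.
The open-access electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.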

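Abstract (Chinese)

Named entity linking (NEL) is a research area within natural language
processing (NLP) and an indispensable component of both NLP research and
applications. Improving the accuracy of NEL lays a better foundation for
developing high-performance NLP systems.
The main challenge of NEL is the lack of annotated text, and the problem is
especially severe for classical Chinese texts because many historical figures
share the same name. Annotators must not only be able to read classical
Chinese but also compare the biographical data of every candidate person
against the context of the passage, which in turn makes it harder to study the
relationships among historical figures and their social networks. To address
this problem, and the scarcity of labeled data along with it, this study
proposes a framework that automatically generates training data from fields
such as career history, dates, and associated persons in the China
Biographical Database (CBDB) and Academia Sinica's Name Authority Database,
and then uses a BERT model to disambiguate and link historical person names.
This study uses the Ming Shilu (《明實錄》) as its experimental text. The Ming
Shilu is an officially compiled annalistic history of the Ming dynasty. It
covers about 250 years and fifteen emperors, from Emperor Taizu (Zhu Yuanzhang)
to Emperor Xizong (Zhu Youjiao), and comprises 13 parts and 3,055 volumes,
totaling more than 17 million characters. The text draws on memorials and
rescripts submitted by court offices, supplemented by records of preceding
reigns collected by provincial officials, and records year by year each
emperor's edicts, pardons, and statutes along with major political, economic,
cultural, and ritual events. To date, this study has labeled 8,787 person
names and 257,302 tags, with an accuracy of 92.08%.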
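
The automatic data generation described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the `PersonRecord` fields, the `describe` and `generate_pairs` helpers, the labeling heuristic, and the two same-name records are all hypothetical and chosen only to show how database fields (offices, active years, associates) could yield labeled context/candidate pairs without a human annotator.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PersonRecord:
    """Hypothetical, simplified stand-in for one biographical database entry."""
    person_id: str
    name: str
    offices: List[str]              # official titles from the career record
    active_years: Tuple[int, int]   # (first, last) year of known activity
    associates: List[str]           # names of related persons


def describe(p: PersonRecord) -> str:
    """Flatten the database fields into one candidate description string."""
    return (f"{p.name}; offices: {', '.join(p.offices)}; "
            f"active {p.active_years[0]}-{p.active_years[1]}; "
            f"associates: {', '.join(p.associates)}")


def generate_pairs(context: str, year: int,
                   candidates: List[PersonRecord]) -> List[Tuple[str, str, int]]:
    """Label each (context, candidate) pair automatically.

    Heuristic (an assumption of this sketch): a candidate is a positive
    example when the passage year falls inside the active range AND the
    context mentions one of the candidate's offices or associates.
    """
    pairs = []
    for p in candidates:
        in_time = p.active_years[0] <= year <= p.active_years[1]
        has_evidence = any(term in context for term in p.offices + p.associates)
        pairs.append((context, describe(p), int(in_time and has_evidence)))
    return pairs


# Two made-up people who share one name; neither is a real database entry.
a = PersonRecord("P001", "張三", ["兵部尚書"], (1400, 1430), ["李四"])
b = PersonRecord("P002", "張三", ["知縣"], (1550, 1580), ["王五"])
pairs = generate_pairs("陞張三為兵部尚書", 1420, [a, b])
labels = [label for _, _, label in pairs]
```

Pairs in this shape (context, candidate description, 0/1 label) are a common input format for a sentence-pair classifier, which is one way a BERT model could be fine-tuned for this task.
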

Abstract (English)

NEL plays an important role in both the study and the application of NLP.
Improving the accuracy of NEL lays a foundation for developing
high-performance NLP systems.
The main challenge of NEL is the lack of annotated text, especially for
Classical Chinese, because many historical figures share the same name, which
makes it difficult to study the relationships among historical figures and
their social networks. Our system uses the China Biographical Database Project
(CBDB) and the Ming Qing Biographical Database to generate training data and
then uses a BERT model to disambiguate and link the person names.
This study uses the Ming Shilu as its experimental text. The Ming Shilu is an
official annalistic history of the Ming dynasty in China, chronicling about
250 years and 15 emperors, from Zhu Yuan-Zhang to Zhu You-Jiao. It contains
over 17 million characters in 3,055 volumes and 13 parts. Drawing on the
memorials and rescripts submitted by the imperial ministries, supplemented by
the records of preceding reigns collected by provincial officials, the text
records year by year the edicts, pardons, and laws of each emperor as well as
political, economic, cultural, and ritual events.
In this study, 8,787 names and 257,302 tags were successfully labeled, with
92.08% accuracy.
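
As a rough illustration of the linking and evaluation steps mentioned in this abstract, the sketch below picks the best-scoring candidate for each mention and computes linking accuracy. The scorer here is a simple character-overlap function used as a stand-in; it is not BERT, and in a real system a fine-tuned sentence-pair model would take its place. All names (`char_overlap`, `link`, `accuracy`) and the example data are hypothetical, and the resulting number has no relation to the 92.08% reported above.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, str], float]


def char_overlap(context: str, description: str) -> float:
    """Stand-in scorer: share of description characters also found in context.

    A fine-tuned BERT sentence-pair classifier would replace this function.
    """
    if not description:
        return 0.0
    ctx = set(context)
    return sum(ch in ctx for ch in description) / len(description)


def link(context: str, candidates: Dict[str, str], score: Scorer) -> str:
    """Return the ID of the candidate whose description scores highest."""
    return max(candidates, key=lambda pid: score(context, candidates[pid]))


def accuracy(examples: List[Tuple[str, Dict[str, str], str]],
             score: Scorer) -> float:
    """Share of mentions that are linked to their gold-standard ID."""
    correct = sum(link(ctx, cands, score) == gold
                  for ctx, cands, gold in examples)
    return correct / len(examples)


# Made-up candidate descriptions for two people who share one name.
cands = {"P001": "兵部尚書", "P002": "知縣"}
examples = [
    ("陞張三為兵部尚書", cands, "P001"),
    ("張三任知縣", cands, "P002"),
]
result = accuracy(examples, char_overlap)
```

Keeping the scorer as a parameter means the same `link` and `accuracy` functions can be reused unchanged when the stand-in is swapped for a trained model.
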

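Keywords (Chinese)

★ 命名實體鏈結 (Named Entity Linking)
★ 明實錄 (Ming Shilu)
★ 中國歷代人物傳記資料庫 (China Biographical Database)
★ 人名權威資料庫 (Name Authority Database)
★ 自動產生訓練資料 (Automatically Generated Training Data)
★ BERT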

Keywords (English)

★ Named Entity Linking
★ Ming Shilu
★ China Biographical Database Project
★ Ming Qing Biographical Database
★ Auto-generated Training Data
★ BERT model

Table of Contents

Abstract (Chinese)
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
1.3 Chapter Overview
2 Related Work
2.1 Named Entity Recognition
2.2 Named Entity Linking
2.3 Digital Humanities
3 Model
3.1 Problem Definition
3.2 System Architecture
3.3 Regular Expression Extractor
3.4 Template Annotator
3.4.1 Official Title Processing
3.4.2 Source Processing
3.4.3 Associated Person Processing
3.4.4 Time Processing
3.4.5 Handling of Identical Names
3.5 BERT Model
4 Experimental Methods and Results
4.1 Data Description
4.2 Preprocessing
4.3 Parameter Settings
4.4 Evaluation Method
4.5 Experimental Results
5 Historical Case Analysis
5.1 Preliminary Analysis of the Two Factions' Control of Military Power
5.2 Analysis of the Two Factions among Military Administration Officials
5.3 Analysis of the Two Factions among Military Command Officials
6 Conclusion and Outlook
6.1 Conclusion
6.2 Future Research Directions
References

Advisor

蔡宗翰 (Tzung-Han Tsai)

Approval Date

2021-01-26
