Name 洪滿珍 (Man-Chen Hung)
Thesis title 利用雙重註釋編碼器於中文健康照護實體連結
(Leveraging Dual Gloss Encoders in Chinese Healthcare Entity Linking)
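Abstract (translated from Chinese) Word sense disambiguation is an important and difficult task
in natural language understanding, especially for words in the medical domain, which often carry
several meanings. We propose a Dual Gloss Encoders (DGE) model, built on the BERT transformer,
that links healthcare named entities in Chinese sentences to the multilingual lexical-semantic
network BabelNet to achieve context-aware semantic understanding. Each gloss of a target word to
be disambiguated is taken from BabelNet and encoded as a gloss embedding, which is paired with
the target word embedding to score the candidate senses. Because Chinese entity linking data for
the healthcare domain is lacking, we collected suitable domain-specific words and manually
annotated their glosses in sentences. In total, we have 10,218 sentences containing 40 distinct
target words and 94 distinct semantic glosses. We divided the constructed data into a training
set of 7,109 sentences, a development set of 979 sentences, and a test set of 2,130 sentences.
Experimental results show that the proposed DGE model outperforms three entity linking models,
namely BERTWSD, GlossBERT, and BEM, achieving an F1-score of 97.81%.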
Abstract (English) Word sense disambiguation is an important and difficult task in natural
language understanding, especially for words in the healthcare domain that carry many meanings.
We propose a BERT transformer-based Dual Gloss Encoder (DGE) model that links Chinese healthcare
entities to the multilingual lexical network BabelNet for context-aware semantic understanding.
The target word, together with its context in the original sentence, is encoded to obtain a
target word embedding. Each gloss of the target word is taken from BabelNet and encoded to
obtain a gloss embedding. The target word embedding is paired with each gloss embedding to
calculate disambiguation scores, and the gloss with the highest score is returned as the
predicted gloss for the target word in the given sentence. Because Chinese entity linking data
for the healthcare domain is lacking, we collected suitable domain-specific words and manually
annotated their glosses in sentences. The resulting dataset contains 10,218 sentences covering
40 distinct target words with 94 distinct semantic glosses. It was divided into three mutually
exclusive sets: a training set (7,109 sentences), a development set (979 sentences), and a test
set (2,130 sentences). Experimental results indicate that the proposed DGE model outperforms
three entity linking models, namely BERTWSD, GlossBERT, and BEM, achieving the best F1-score of
97.81%.
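The scoring step described in the abstract (pair the target word embedding with each gloss embedding, then return the highest-scoring gloss) might be sketched as follows. This is a minimal illustration, not the thesis implementation: the abstract does not state the scoring function, so the dot product is an assumption, and the vectors, the gloss identifiers, and the function names `rank_glosses` and `predict_gloss` are hypothetical stand-ins for the outputs of the two BERT-based encoders.

```python
from typing import Dict, List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors (assumed scoring function)."""
    if len(u) != len(v):
        raise ValueError("embedding dimensions must match")
    return sum(a * b for a, b in zip(u, v))


def rank_glosses(
    target_vec: Sequence[float],
    gloss_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every candidate gloss against the target word embedding.

    Returns (gloss_id, score) pairs sorted from highest to lowest score.
    """
    if not gloss_vecs:
        raise ValueError("at least one candidate gloss is required")
    scored = [(gid, dot(target_vec, vec)) for gid, vec in gloss_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def predict_gloss(
    target_vec: Sequence[float],
    gloss_vecs: Dict[str, Sequence[float]],
) -> str:
    """Return the identifier of the highest-scoring gloss."""
    return rank_glosses(target_vec, gloss_vecs)[0][0]


# Toy vectors standing in for encoder outputs (hypothetical values).
target_vec = [0.9, 0.1, 0.0]
gloss_vecs = {
    "gloss_a": [1.0, 0.0, 0.0],
    "gloss_b": [0.0, 1.0, 0.0],
}
ranking = rank_glosses(target_vec, gloss_vecs)
predicted = predict_gloss(target_vec, gloss_vecs)
```

In a real system the two inputs would come from the context encoder and the gloss encoder; keeping the scoring step separate from the encoders means the gloss embeddings for each target word can be computed once and reused across sentences.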
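Keywords (Chinese) ★ entity linking (實體連結)
★ word sense disambiguation (詞義消歧)
★ language transformers (語言轉譯器)
★ natural language understanding (自然語言理解)
★ health informatics (健康資訊學)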
Keywords (English) ★ entity linking
★ word sense disambiguation
★ language transformers
★ natural language understanding
★ health informatics
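Table of contents Abstract i
Acknowledgements iii
Contents iv
List of Tables vi
Chapter 1 Introduction 1
1-1 Research Background 1
1-2 Motivation and Objectives 3
1-3 Chapter Overview 4
Chapter 2 Related Work 5
2-1 Word Sense Disambiguation Datasets 5
2-2 Knowledge-Based Methods 9
2-3 Deep Learning-Based Methods 13
Chapter 3 Methodology 22
3-1 System Architecture 22
3-2 Context-Aware Gloss Encoder 23
3-3 Lexical Gloss Encoder 26
Chapter 4 Experimental Results 27
4-1 Dataset Construction 27
4-2 Experimental Settings 29
4-3 Evaluation Metrics 30
4-4 Model Comparison 31
4-5 Performance Analysis 33
4-6 Error Analysis 36
Chapter 5 Conclusions and Future Work 37
References 38
Appendix 1 Target Word Statistics 48
Appendix 2 Target Word Glosses and Example Sentences 50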
Advisor 李龍豪 (Lung-Hao Lee) Approval date 2022-8-25
