Ingraining Domain Knowledge in Language Models for Chinese Medical Question Intent Classification

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/86915

Title:	強化領域知識語言模型於中文醫療問題意圖分類; Ingraining Domain Knowledge in Language Models for Chinese Medical Question Intent Classification
Author:	Chen, Po-Han (陳柏翰)
Contributor:	Department of Electrical Engineering
Keywords:	domain knowledge extraction;pre-trained language models;encyclopedia;knowledge graph;multi-class classification
Date:	2021-10-27
Upload time:	2021-12-07 13:25:41 (UTC+8)
Publisher:	National Central University
Abstract:	Multi-class text classification aims to automatically assign an input instance to one of a set of pre-defined categories. It supports many applications, such as sentiment analysis, chatbots, question answering systems, e-commerce product categorization, and data filtering. The main objective of this study is to classify unstructured Chinese medical questions into the correct categories. The category information can be treated as a medical knowledge feature that helps a machine understand the semantics of a question and can serve as a foundation for automatic question answering systems. Among recent deep learning approaches, the most widely used model architecture is the Transformer, which effectively captures global semantic information and syntactic structure and achieves good performance on many natural language processing tasks.
We therefore improve three mainstream pre-trained models using a two-stage domain knowledge enhancement mechanism and propose EKG-Transformers (Encyclopedia enhanced pre-training with Knowledge Graph fine-tuning Transformers) for intent classification of Chinese medical questions. During the pre-training phase, we train the language model on hierarchical data collected from a medical encyclopedia, ingraining hierarchical healthcare information such as the symptoms and diagnostic tests of a disease, the precautions and side effects of a treatment, and the usage and dosage of a drug. During the fine-tuning phase, we inject triples from a constructed knowledge graph: the named entities in a word sequence are endowed with a relation network, and the sequence is converted into a sentence graph, so that the model can produce better language representations and classifications for sequences that require external knowledge.
Experimental data came from the Chinese Medical Intent Dataset (CMID), which contains around 12,000 medical questions manually annotated with user intents (4 categories: condition, drug, treatment, and other, with 36 sub-categories beneath them), along with word segmentation and named entity results. Based on the experimental results and error analysis, the proposed EKG-MacBERT model achieved the best Micro F1-score of 74.50%, outperforming the compared models (MacBERT, RoBERTa, BERT, TextCNN, TextRNN, TextGCN, and FastText). In summary, EKG-Transformers offers an effective solution for Chinese medical question intent classification.
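The fine-tuning step described in the abstract, where knowledge graph triples are attached to named entities to turn a word sequence into a sentence graph, can be illustrated with a minimal Python sketch. This record does not describe the thesis's actual construction, so everything here is an assumption made for illustration: the toy graph `TOY_KG`, its triples, the English example tokens, and the helper names `build_sentence_graph` and `flatten` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge graph (not from the thesis):
# head entity -> list of (relation, tail) pairs.
TOY_KG: Dict[str, List[Tuple[str, str]]] = {
    "aspirin": [("treats", "fever"), ("side_effect", "stomach upset")],
    "diabetes": [("symptom", "thirst")],
}

Edge = Tuple[int, str, str]  # (token index, relation, tail)


def build_sentence_graph(
    tokens: List[str],
    kg: Dict[str, List[Tuple[str, str]]],
    max_branches: int = 2,
) -> List[Edge]:
    """Attach up to max_branches (relation, tail) pairs to each token found in kg."""
    edges: List[Edge] = []
    for idx, token in enumerate(tokens):
        for relation, tail in kg.get(token.lower(), [])[:max_branches]:
            edges.append((idx, relation, tail))
    return edges


def flatten(tokens: List[str], edges: List[Edge]) -> List[str]:
    """Linearize the graph: each entity token is followed by its injected triples."""
    by_index: Dict[int, List[Tuple[str, str]]] = {}
    for idx, relation, tail in edges:
        by_index.setdefault(idx, []).append((relation, tail))
    out: List[str] = []
    for idx, token in enumerate(tokens):
        out.append(token)
        for relation, tail in by_index.get(idx, []):
            out.extend([relation, tail])
    return out


# Assumed pre-segmented question; CMID provides segmentation and entity labels.
tokens = ["Can", "aspirin", "cause", "problems"]
edges = build_sentence_graph(tokens, TOY_KG)
linear = flatten(tokens, edges)
print(linear)
```

A real system would also need to keep injected tokens from disturbing the original sentence, for example through position indices or attention masking. How the thesis handles this is not stated in this record.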
Appears in collections:	[Graduate Institute of Electrical Engineering] Theses and Dissertations

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	97

All items in NCUIR are protected by the original copyright.
