Thesis 108521106: Complete Metadata Record

DC Field: Value (Language)
dc.contributor: 電機工程學系 (zh_TW)
dc.creator: 陳柏翰 (zh_TW)
dc.creator: Po-Han Chen (en_US)
dc.date.accessioned: 2021-10-27T07:39:07Z
dc.date.available: 2021-10-27T07:39:07Z
dc.date.issued: 2021
dc.identifier.uri: http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=108521106
dc.contributor.department: 電機工程學系 (zh_TW)
dc.description: 國立中央大學 (zh_TW)
dc.description: National Central University (en_US)
dc.description.abstract: 多分類文本分類旨在自動將輸入實例歸納至預先定義好的分類中,該方法可用於眾多應用情境,例如:情感分析、聊天機器人、問答系統、電商產品分類和過濾資料等。本研究的主要目標為歸納非結構化的中文醫療問題至正確的分類中,我們可以將分類資訊視為醫療知識特徵,有助於機器理解問題語意內涵,並可做為自動問答系統的基礎。近年來,在基於深度學習的方法中,最被廣泛使用的模型架構為轉譯器 (Transformers),這些模型有效地捕獲了廣域語意資訊與結構句法,在許多自然語言處理任務得到好的效能表現。因此,我們以兩階段領域知識強化機制為基礎,改善三種主流預訓練模型,並提出 EKG-Transformers (Encyclopedia enhanced pre-training with Knowledge Graph fine-tuning Transformers) 模型,用於中文醫療問題意圖分類。我們將醫學百科 (Encyclopedia) 蒐集的層級資料訓練於語言模型上,進一步將醫學領域的階層資訊,例如:疾病的症狀與檢測方式、治療方法的注意事項與副作用、藥物的用法與用量等,導入語言模型中;微調時加入建構的知識圖譜 (Knowledge Graph) 三元組,賦予關係網路給字序列中的命名實體,並將字序列轉化成句圖 (Sentence Graph),讓模型在遇到需要知識驅動的序列時,能給予更好的語言表徵及分類。本研究使用了醫療問題意圖分類資料集 (Chinese Medical Intent Dataset, CMID),該資料集歸納出 4 個分類:病症、藥物、治療和其他,與涵蓋於其下的 36 個子分類,總共包含約 12,000 則醫療問題,並標註了分詞與命名實體結果。藉由實驗結果與錯誤分析得知,我們提出的 EKG-MacBERT 模型達到最好的 Micro F1-score 74.50%,比相關研究模型 (MacBERT, RoBERTa, BERT, TextCNN, TextRNN, TextGCN 與 FastText) 表現好,並為中文醫療問題意圖分類提出一個有效的解決方案。 (zh_TW)
dc.description.abstract: Our main research objective is to classify unstructured Chinese medical questions into one of the pre-defined categories. Recently, the most widely used model architecture has been the Transformer, which effectively captures long-range semantic information and syntactic structure, achieving promising results in many natural language processing tasks. We improve three mainstream pre-trained models based on a two-stage domain knowledge enhancement mechanism and propose the EKG-Transformers (Encyclopedia enhanced pre-training with Knowledge Graph fine-tuning Transformers) for user intent classification of Chinese medical questions. During the pre-training phase, we ingrain hierarchical healthcare information, such as the symptoms and diagnoses of a disease, the precautions and side effects of a treatment, and the usage and dosage of a drug, into the language model. During the fine-tuning phase, a word sequence is endowed with a relation network and further converted into a sentence graph by injecting triples related to its named entities from the knowledge graph. Experimental data came from the Chinese Medical Intent Dataset (CMID), which includes manually annotated user intents (in 4 categories and 36 sub-categories), along with word segmentation and named entity annotations, for a total of around 12,000 medical questions. Based on the experiments and error analysis, EKG-MacBERT achieved the best Micro F1-score of 74.50%, outperforming previous models including MacBERT, RoBERTa, BERT, TextCNN, TextRNN, TextGCN, and FastText. In summary, our EKG-Transformers model provides an effective solution to the problem of medical question intent classification. (en_US)
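The abstracts above describe the fine-tuning stage as converting a word sequence into a sentence graph by injecting knowledge-graph triples for the named entities it contains. The sketch below illustrates that idea only; it is not the thesis implementation, and the function name, sample entities, and triples are hypothetical.

```python
# Illustrative sketch of sentence-graph construction (not the thesis code):
# attach knowledge-graph triples to named entities found in a token sequence.

def build_sentence_graph(tokens, entities, kg_triples):
    """tokens: list of words; entities: set of entity strings;
    kg_triples: dict mapping entity -> list of (relation, object) pairs.
    Returns (nodes, edges), where sequential edges link adjacent tokens
    and knowledge edges link an entity token to injected KG objects."""
    nodes = list(tokens)
    edges = []
    # Sequential edges preserve the original word order.
    for i in range(len(tokens) - 1):
        edges.append((i, i + 1, "next"))
    # Inject triples: each KG object becomes a new node, connected to the
    # entity token by an edge labeled with the triple's relation.
    for i, tok in enumerate(tokens):
        if tok in entities:
            for relation, obj in kg_triples.get(tok, []):
                nodes.append(obj)
                edges.append((i, len(nodes) - 1, relation))
    return nodes, edges

# Example: a medical question with one recognized entity, 感冒 (common cold).
tokens = ["感冒", "吃", "什麼", "藥"]
entities = {"感冒"}
kg_triples = {"感冒": [("symptom", "咳嗽"), ("treatment", "休息")]}
nodes, edges = build_sentence_graph(tokens, entities, kg_triples)
```

In the thesis, the resulting graph feeds the Transformer with knowledge-driven context; here the output is just a node list and labeled edge list that a graph-aware encoder could consume.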
dc.subject: 領域知識擷取 (zh_TW)
dc.subject: 預訓練語言模型 (zh_TW)
dc.subject: 百科全書 (zh_TW)
dc.subject: 知識圖譜 (zh_TW)
dc.subject: 多元分類 (zh_TW)
dc.subject: domain knowledge extraction (en_US)
dc.subject: pre-trained language models (en_US)
dc.subject: encyclopedia (en_US)
dc.subject: knowledge graph (en_US)
dc.subject: multi-class classification (en_US)
dc.title: 強化領域知識語言模型於中文醫療問題意圖分類 (zh_TW)
dc.language.iso: zh-TW (zh-TW)
dc.title: Ingraining Domain Knowledge in Language Models for Chinese Medical Question Intent Classification (en_US)
dc.type: 博碩士論文 (zh_TW)
dc.type: thesis (en_US)
dc.publisher: National Central University (en_US)
