Multi-Stack Convolution with Gating Mechanism for Chinese Named Entity Recognition (應用門控機制與多層卷積深度學習模型於中文命名實體辨識之研究)

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/79574

Title:	Multi-Stack Convolution with Gating Mechanism for Chinese Named Entity Recognition (應用門控機制與多層卷積深度學習模型於中文命名實體辨識之研究)
Author:	Chang, Chih-Hao (張智皓)
Contributor:	Department of Computer Science and Information Engineering, In-service Master Program
Keywords:	Deep Learning; Named Entity Recognition; Convolutional Neural Networks; Gating Mechanism
Date:	2018-09-13
Upload time:	2019-04-02 15:03:24 (UTC+8)
Publisher:	National Central University
Abstract:	Traditional machine-learning systems for Chinese named entity recognition (NER) usually rely on large numbers of hand-crafted features extracted from Chinese text, and sometimes on entity-specific keyword dictionaries designed by experts. Linear statistical and probabilistic models then consolidate the important features and derive Chinese semantic rules. This approach has two obvious drawbacks. First, extracting features from large volumes of Chinese text is time-consuming, laborious, and complicated. Second, model quality depends entirely on the discriminative strength of the hand-crafted features. As a result, the semantic ambiguity of Chinese and the presence of unknown words make it difficult to raise precision. English uses spaces to mark word boundaries, whereas Chinese has no explicit word delimiter; yet Chinese characters and words are strongly interdependent, and their meaning varies with context (the same character can carry different senses, and a word can be polysemous). Recognizing Chinese named entities in large corpora is therefore both challenging and promising.
To address these challenges and drawbacks, this study builds a Chinese NER system on a deep learning architecture. First, a deep learning model pre-trains a word-embedding dictionary on a large amount of text through unsupervised learning. The dictionary converts characters and words into numeric vectors, and multiple stacked convolution layers then extract textual features hierarchically. A gating mechanism between layers generalizes the features, so that the feature information contained in the text is extracted automatically without any feature engineering. The aim is to reduce the dependence of NER on hand-crafted features and to remove the need to design Chinese-specific recognition features; the method is applied effectively to recognizing entity types. The training data consist of SIGHAN Bakeoff-3 [1] and online articles collected with a custom crawler program; electronic editions of print newspapers [31] serve as test data and as the benchmark for evaluating each model. In the experiments, the proposed model achieves an F1-measure of 90.76% overall on SIGHAN and 90.42% on the newspaper files.
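The stacked convolution with gating described in the abstract can be illustrated with a minimal sketch. The abstract does not give the gate's formula, so this sketch assumes a GLU-style gate (one convolution multiplied element-wise by the sigmoid of a second convolution), which is a common choice and not necessarily the one used in the thesis. All names here (`conv1d`, `gated_conv_layer`, `init_layer`, `encode`), the dimensions, and the random stand-in embeddings are hypothetical.

```python
import math
import random

def conv1d(seq, weights, bias):
    """Zero-padded 1D convolution over a sequence of vectors (odd kernel size assumed).

    seq:     list of T vectors, each of length d_in
    weights: list of k matrices (one per kernel offset), each d_out x d_in
    bias:    list of d_out numbers
    """
    k = len(weights)
    pad = k // 2
    d_in = len(seq[0])
    zero = [0.0] * d_in
    padded = [zero] * pad + seq + [zero] * pad
    out = []
    for t in range(len(seq)):
        row = list(bias)
        for j in range(k):
            x, w = padded[t + j], weights[j]
            for o in range(len(row)):
                row[o] += sum(w[o][i] * x[i] for i in range(d_in))
        out.append(row)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_conv_layer(seq, params):
    """Assumed GLU-style gate: conv_a(x) * sigmoid(conv_b(x)), element-wise."""
    a = conv1d(seq, params["w_a"], params["b_a"])
    b = conv1d(seq, params["w_b"], params["b_b"])
    return [[ai * sigmoid(bi) for ai, bi in zip(ra, rb)] for ra, rb in zip(a, b)]

def init_layer(rng, k, d_in, d_out):
    """Small random weights for one gated layer (illustrative, untrained)."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    return {
        "w_a": [mat() for _ in range(k)], "b_a": [0.0] * d_out,
        "w_b": [mat() for _ in range(k)], "b_b": [0.0] * d_out,
    }

def encode(seq, layers):
    """Apply the gated convolution layers one after another."""
    for params in layers:
        seq = gated_conv_layer(seq, params)
    return seq

rng = random.Random(0)
embed_dim, hidden_dim, kernel = 4, 3, 3
# Random vectors standing in for pre-trained embeddings of a 5-character sentence.
sentence = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(5)]
layers = [init_layer(rng, kernel, embed_dim, hidden_dim),
          init_layer(rng, kernel, hidden_dim, hidden_dim)]
features = encode(sentence, layers)  # one hidden_dim feature vector per character
```

In a full tagger, each per-character feature vector would feed a classifier that predicts an entity label; that stage, and training, are omitted here.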
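Appears in collections:	[In-service Master Program, Department of Computer Science and Information Engineering] Theses and Dissertations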

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	217

All items in NCUIR are protected by original copyright.
