NCU Institutional Repository: Item 987654321/79580


    Please use the permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/79580


    Title: 基於半監督式學習的網路命名實體辨識模型訓練框架;A Framework for Web NER Model Training based on Semi-supervised Learning
    Author: 周建龍;Chou, Chien-Lung
    Contributor: Department of Computer Science and Information Engineering (資訊工程學系)
    Keywords: Named entity recognition (NER);Semi-supervised learning;Distant supervision;Locality-sensitive hashing (LSH);Tri-training
    Date: 2019-01-31
    Upload time: 2019-04-02 15:03:57 (UTC+8)
    Publisher: National Central University
    Abstract: Named entity recognition (NER) is an important task in natural language understanding because it extracts the key entities (e.g., person, organization, location, date, and number) and objects (e.g., product, song, movie, and activity name) mentioned in texts.
    These entities are essential to numerous text applications, such as those used for analyzing public opinion on social networks, and to the interfaces used to conduct interactive conversations and provide intelligent customer services.

    However, existing natural language processing (NLP) tools (such as the Stanford named entity recognizer) recognize only general named entities, or they require annotated training examples and feature engineering to build a supervised custom model.
    Since not all languages or entity types have publicly available NER tools, constructing a framework for NER model training is essential for information extraction (IE) in low-resource languages and for rare entity types.

    Building a customized NER model often requires a significant amount of time to prepare, annotate, and evaluate the training/testing data, as well as to carry out language-dependent feature engineering.
    Existing studies rely on annotated training data; however, it is quite expensive to obtain large datasets, thus limiting the effectiveness of recognition.
    In this thesis, we examine the problem of developing a framework to prepare a training corpus from the web with known entities for custom NER model training via semi-supervised learning.

    We consider the effectiveness and efficiency problems of automatic labeling and language-independent feature mining to prepare and annotate the training data.
    The major challenge of automatic labeling lies in choosing labeling strategies that avoid the false positive and false negative examples caused by short and long seeds, and in the long labeling time caused by the large corpus and the large set of seed entities.
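
    To make the labeling-strategy issue concrete, the following is a minimal Python sketch of dictionary-based automatic labeling over collected sentences; the longest-match rule, the min_len threshold, and the generic B-ENT/I-ENT tag names are illustrative assumptions, not necessarily the strategy adopted in the thesis.

```python
# Minimal sketch of distant-supervision style automatic labeling:
# mark character spans that exactly match a known seed entity with B/I tags.
# The seed set, the min_len threshold, and the B-ENT/I-ENT tag names are
# illustrative assumptions, not the thesis's actual labeling strategy.

def label_sentence(sentence, seeds, min_len=2):
    """Return one BIO tag per character of `sentence`.

    Seeds shorter than `min_len` are skipped because very short seeds
    tend to match by accident (false positives); very long seeds rarely
    match exactly, so true mentions can stay unlabeled (false negatives).
    """
    tags = ["O"] * len(sentence)
    # Try longer seeds first so that longer mentions win over substrings.
    for seed in sorted(seeds, key=len, reverse=True):
        if len(seed) < min_len:
            continue
        start = sentence.find(seed)
        while start != -1:
            span = range(start, start + len(seed))
            if all(tags[i] == "O" for i in span):  # do not overwrite earlier labels
                tags[start] = "B-ENT"
                for i in span:
                    if i != start:
                        tags[i] = "I-ENT"
            start = sentence.find(seed, start + 1)
    return tags


if __name__ == "__main__":
    seeds = {"周建龍", "國立中央大學"}  # known entities used as seeds (examples)
    sentence = "周建龍於國立中央大學完成論文"
    print(list(zip(sentence, label_sentence(sentence, seeds))))
```

    With hundreds of thousands of seeds and millions of sentences, the naive str.find scan above becomes the bottleneck; this is the efficiency concern raised here, and the locality-sensitive hashing mentioned in the keywords points to one way of speeding up the matching, which the sketch does not attempt.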

    Distant supervision, which uses known entities as keywords and collects search snippets as training data, is not a new idea; however, the efficiency of automatic labeling becomes critical when dealing with a large number of known entities (e.g., 550k) and sentences (e.g., 2M).
    Additionally, to address the issue of language-dependent feature mining for supervised learning, we modify tri-training for sequence labeling and derive a proper initialization for training on large datasets, improving entity recognition performance on a large corpus.
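
    For reference, the generic tri-training loop adapted to a sequence tagger interface can be sketched as follows; the make_tagger factory with fit/predict methods and the whole-sequence agreement test are assumptions for illustration, and the sketch does not reproduce the initialization formula derived in the thesis for large datasets.

```python
import random

def tri_train(make_tagger, labeled, unlabeled, rounds=5, seed=0):
    """Generic tri-training sketch for sequence labeling.

    `labeled` is a list of (sentence, tag_sequence) pairs and `unlabeled`
    is a list of sentences.  `make_tagger()` must return an object with
    fit(list_of_pairs) and predict(sentence) -> tag_sequence; this
    interface is an assumption for illustration only.
    """
    rng = random.Random(seed)

    # Initialize three taggers on bootstrap samples of the labeled data.
    taggers, pools = [], []
    for _ in range(3):
        sample = [rng.choice(labeled) for _ in labeled]
        tagger = make_tagger()
        tagger.fit(sample)
        taggers.append(tagger)
        pools.append(sample)

    for _ in range(rounds):
        additions = [[] for _ in range(3)]
        for sentence in unlabeled:
            predictions = [t.predict(sentence) for t in taggers]
            for i in range(3):
                j, k = (i + 1) % 3, (i + 2) % 3
                # If the other two taggers agree on the whole tag sequence,
                # treat their output as a pseudo-labeled example for tagger i.
                if predictions[j] == predictions[k]:
                    additions[i].append((sentence, predictions[j]))
        # Retrain each tagger on its bootstrap pool plus the new pseudo-labels.
        for i in range(3):
            if additions[i]:
                taggers[i] = make_tagger()
                taggers[i].fit(pools[i] + additions[i])
    return taggers
```

    In the classic tri-training formulation, the pseudo-labeled additions are further filtered using the estimated error rates of the two teaching taggers; the abstract states that the thesis additionally derives a proper initialization for large datasets, which this sketch does not attempt to reproduce.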

    We conduct experiments on five types of entity recognition tasks including Chinese person names, food names, locations, points of interest (POIs), and activity names to demonstrate the improvements with the proposed web NER model construction framework.
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Theses & Dissertations

    Files in This Item:

    File          Description    Size    Format    Views
    index.html                   0 KB    HTML      140


    All items in NCUIR are protected by copyright, with all rights reserved.

