Master's/Doctoral Thesis 100582004: Complete Metadata Record

DC Field    Value    Language
dc.contributor    資訊工程學系    zh_TW
dc.creator    周建龍    zh_TW
dc.creator    Chien-Lung Chou    en_US
dc.date.accessioned    2019-01-31T07:39:07Z
dc.date.available    2019-01-31T07:39:07Z
dc.date.issued    2019
dc.identifier.uri    http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=100582004
dc.contributor.department    資訊工程學系    zh_TW
dc.description    國立中央大學    zh_TW
dc.description    National Central University    en_US
dc.description.abstract    命名實體辨識(NER)是自然語言理解中的一項重要任務,因為它可用來擷取文章中的關鍵實體(人名、地點名稱、組織名稱、日期、數字等)和對象(產品、歌曲、電影、活動名稱等)。這些實體對於眾多相關應用至關重要,例如用於分析社交網絡上公眾意見的意見分析,以及用於進行交互式對話和提供智能客戶服務的智慧型交談系統。然而,現有的自然語言處理(NLP)工具(例如Stanford named entity recognizer)僅可識別一般的命名實體(人名、地點名稱、組織名稱),或者需準備特定格式的標註訓練資料並進行特徵工程,方可透過監督式學習訓練一客製的NER模型。由於並非所有語言或命名實體都具有公開的NER工具可使用,因此構建NER模型訓練框架對於低資源語言或罕見命名實體的擷取至關重要。要構建客製的NER模型,通常需要大量時間來準備、標註和評估訓練/測試資料,以及進行語言相關的特徵工程。現有的研究依賴於帶標註的訓練資料,對大型資料集而言,其準備的時間與人工成本非常昂貴,這也限制了命名實體辨識的有效性。在本論文中,我們研究並開發一個基於半監督式學習的網路命名實體辨識模型訓練框架,利用Web上所收集的大量資料以及已知命名實體,解決為客製NER模型準備訓練語料庫的問題。我們考量自動標記的有效性與效率問題,以及與語言無關的特徵探勘,來準備和標註訓練資料。自動標記的主要挑戰在於標記策略的選擇,以避免由於短種子和長種子而導致的假陽性(false positive)和假陰性(false negative)訓練資料,以及因巨量的語料庫和已知命名實體而導致的標記時間過長。Distant supervision以已知命名實體為關鍵字、收集搜尋片段作為訓練資料,並不是一個新的概念;然而,當已知命名實體(例如550k)和收集到的句子(例如2M)數量龐大時,自動標記訓練資料的效率便至關重要。另一個問題是監督式學習所需的語言相關特徵探勘。此外,我們亦修改了用於序列標記的tri-training,並為大型資料集推導出適當的初始化公式,以提升tri-training於較大資料集上的效能。最後,我們對五種類型的實體辨識任務進行了實驗,包括中式人名、食物名稱、地點名稱、興趣點(POI)和活動名稱的辨識,以證明所提出的Web NER模型構建框架是有效的。    zh_TW
dc.description.abstract    Named entity recognition (NER) is an important task in natural language understanding because it extracts the key entities (e.g., person, organization, location, date, and number) and objects (e.g., product, song, movie, and activity name) mentioned in texts. These entities are essential to numerous text applications, such as those used for analyzing public opinion on social networks, and to the interfaces used to conduct interactive conversations and provide intelligent customer services. However, existing natural language processing (NLP) tools (such as the Stanford named entity recognizer) recognize only general named entities or require annotated training examples and feature engineering for supervised model construction. Since not all languages or entities have public NER support, constructing a framework for NER model training is essential for low-resource language or entity information extraction (IE). Building a customized NER model often requires a significant amount of time to prepare, annotate, and evaluate the training/testing data and to perform language-dependent feature engineering. Existing studies rely on annotated training data; however, obtaining large annotated datasets is expensive, which limits the effectiveness of recognition. In this thesis, we examine the problem of developing a framework that prepares a training corpus from the web with known entities for custom NER model training via semi-supervised learning. We consider the effectiveness and efficiency problems of automatic labeling and language-independent feature mining to prepare and annotate the training data. The major challenge of automatic labeling lies in the choice of labeling strategies to avoid false positive and false negative examples caused by short and long seeds, and to avoid the long labeling time caused by a large corpus and many seed entities. Distant supervision, which collects training sentences from search snippets with known entities as keywords, is not new; however, the efficiency of automatic labeling becomes critical when dealing with a large number of known entities (e.g., 550k) and sentences (e.g., 2M). Additionally, to address language-dependent feature mining for supervised learning, we modify tri-training for sequence labeling and derive a proper initialization for large-dataset training to improve entity recognition performance on a large corpus. We conduct experiments on five types of entity recognition tasks, including Chinese person names, food names, locations, points of interest (POIs), and activity names, to demonstrate the improvements achieved by the proposed web NER model construction framework.    en_US
dc.subject    命名實體辨識    zh_TW
dc.subject    半監督式學習    zh_TW
dc.subject    Distant supervision    zh_TW
dc.subject    局部敏感哈希    zh_TW
dc.subject    Tri-training    zh_TW
dc.subject    Named entity recognition (NER)    en_US
dc.subject    Semi-supervised Learning    en_US
dc.subject    Distant supervision    en_US
dc.subject    Locality-Sensitive Hashing (LSH)    en_US
dc.subject    Tri-training    en_US
dc.title    基於半監督式學習的網路命名實體辨識模型訓練框架    zh_TW
dc.language.iso    zh-TW    zh-TW
dc.title    A Framework for Web NER Model Training based on Semi-supervised Learning    en_US
dc.type    博碩士論文    zh_TW
dc.type    thesis    en_US
dc.publisher    National Central University    en_US
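
The abstracts above describe an automatic labeling step in which known named entities (seeds) are matched against web-collected sentences to produce training data. Below is a minimal, illustrative Python sketch of that distant-supervision idea; the seed list, the FOOD tag, and the character-level BIO scheme are assumptions for demonstration only, not the thesis's actual pipeline, which additionally addresses labeling efficiency at the scale of hundreds of thousands of seeds and millions of sentences.

```python
# Minimal sketch of distant-supervision labeling: mark occurrences of known
# seed entities in raw sentences with BIO tags to produce training examples.
# The seeds, the example sentence, and the FOOD tag are hypothetical.

def bio_label(sentence, seeds, tag="FOOD"):
    """Return (character, label) pairs, tagging seed matches as B-/I-<tag>."""
    labels = ["O"] * len(sentence)
    # Match longer seeds first so a short seed does not split a longer entity;
    # this echoes the short-seed/long-seed labeling issue noted in the abstract.
    for seed in sorted(seeds, key=len, reverse=True):
        start = 0
        while True:
            idx = sentence.find(seed, start)
            if idx == -1:
                break
            span = labels[idx:idx + len(seed)]
            if all(label == "O" for label in span):  # keep earlier matches intact
                labels[idx] = f"B-{tag}"
                for i in range(idx + 1, idx + len(seed)):
                    labels[i] = f"I-{tag}"
            start = idx + 1
    return list(zip(sentence, labels))

if __name__ == "__main__":
    seeds = ["牛肉麵", "滷肉飯"]  # hypothetical known food-name seeds
    sentence = "我昨天在台北吃了牛肉麵和滷肉飯"
    for ch, lab in bio_label(sentence, seeds):
        print(ch, lab)
```

Naive string matching like this becomes the bottleneck the abstract mentions when the seed list and corpus are large, which is why the thesis treats labeling efficiency as a separate problem.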
