Abstract: | In this project, we study the problem of developing a tool that prepares training corpora from the Web via distant learning for building custom named entity recognition (NER) models. NER is an important task in natural language processing (NLP) and understanding, as it extracts the key entities (person, organization, location, date, number, etc.) or concepts (job title, product, event name, etc.) mentioned in text. Existing NLP tools, e.g., Stanford NER and HIT (Harbin Institute of Technology) NER, recognize only commonly used entity types, and building a custom NER model usually takes considerable time for training data preparation, labeling, and evaluation. Since not all languages or entity types have corresponding model support, a tool for NER training data preparation and model construction is essential for low-resource language and entity processing.

Training an NER model usually requires a large amount of labeled text for sequence labeling and language- or domain-dependent feature engineering. For the former, distant learning is a common practice that gathers training sentences from search snippets by querying a search engine with known entities; however, the efficiency of automatic labeling becomes a problem when dealing with large numbers of seeds (e.g., 500K) and sentences (e.g., 3M). The latter concerns the features required by supervised learners: automatically mining useful k-gram feature lexicons is the key to the performance of linear-chain CRF sequence labeling. In the first year, we therefore focus on automatic labeling to address both labeling efficiency and labeling effectiveness. For efficiency, we propose LSH (locality-sensitive hashing) to select potential sentences for a given entity. For effectiveness, we take Chinese character sequence labeling as the basis, apply string alignment to allow approximate matching, and label sentences with seed entities in decreasing order of entity length to avoid nested labels. For feature mining, we consider three feature selection measures based on support, confidence, and their harmonic mean. Meanwhile, the tool provides performance evaluation and error analysis for given test data with respect to entity length, entity type, and corpus source.

In the second year, we move to word-level labeling and include deep neural network (DNN) learning algorithms in addition to CRF. We plan to implement three DNN-based CRFs, namely CONV-CRF, BI-LSTM-CONV-CRF, and ME-CRF, based on both word embeddings and Chinese character embeddings. We will also extend the tool to accept multiple entity lists and build one model for multiple entity type recognition; by combining the training data from multiple NER tasks, we can test the power of DNN-based CRFs and see whether performance drops because the sequence labeling problem becomes more complex, or improves because of the larger amount of training data. Finally, we plan to design the DNN-based CRFs to output entity type distributions as input to other NLP procedures, so as to handle entities with multiple readings (e.g., "Washington Monument" implies both a POI and a person name). |
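To make the LSH selection step concrete, below is a minimal Python sketch assuming MinHash signatures over character bigrams with banded bucketing; the parameters (N_HASHES, BANDS, the n-gram size) and helper names are illustrative assumptions, not the project's actual design. Seed entities are indexed once, and each sentence window is verified only against the seeds whose signatures collide with it, which avoids comparing every sentence against all (e.g., 500K) seeds.

```python
# A minimal sketch of MinHash-based LSH for pairing seed entities with
# candidate sentences. Signatures of similar strings collide in at least
# one band, so each sentence window is checked only against the seeds it
# collides with. All parameter choices here are illustrative assumptions.
import hashlib
from collections import defaultdict

N_HASHES, BANDS = 32, 16       # 2 rows per band -> recall-oriented buckets
ROWS = N_HASHES // BANDS

def ngrams(s, n=2):
    """Character n-grams, a natural unit for Chinese text."""
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}

def minhash(s):
    """Signature = per-key minimum of a keyed hash over the n-grams."""
    grams = ngrams(s)
    return tuple(
        min(int.from_bytes(hashlib.md5(f"{h}:{g}".encode()).digest()[:8], "big")
            for g in grams)
        for h in range(N_HASHES))

def index_seeds(seeds):
    """Bucket every seed entity by each band of its signature."""
    buckets = defaultdict(set)
    for seed in seeds:
        sig = minhash(seed)
        for b in range(BANDS):
            buckets[(b, sig[b * ROWS:(b + 1) * ROWS])].add(seed)
    return buckets

def candidate_seeds(window, buckets):
    """Seeds whose signature collides with this sentence window."""
    sig = minhash(window)
    hits = set()
    for b in range(BANDS):
        hits |= buckets.get((b, sig[b * ROWS:(b + 1) * ROWS]), set())
    return hits

# Usage: slide windows over a sentence and verify only colliding seeds.
buckets = index_seeds(["華盛頓紀念碑", "自由女神像"])
sentence = "華盛頓紀念碑位於美國首都"
for w in range(3, 8):                       # plausible entity lengths
    for i in range(len(sentence) - w + 1):
        for seed in candidate_seeds(sentence[i:i + w], buckets):
            print(i, sentence[i:i + w], "->", seed)
```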
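The approximate labeling step can be sketched in the same spirit. The snippet below is a simplification that scores fixed-length windows with difflib's SequenceMatcher instead of a full string alignment, but it illustrates the two stated ideas: near matches above a similarity threshold are accepted, and seeds are applied in decreasing order of length so a shorter, nested seed never relabels a span already claimed by a longer one. The 0.8 threshold and BIO scheme are assumptions.

```python
# A minimal sketch of approximate-match auto-labeling at the character
# level. Seeds are applied in decreasing length order to avoid nested
# labels; the similarity threshold and tag scheme are assumptions.
from difflib import SequenceMatcher

def best_window(seed, sentence, threshold=0.8):
    """Most similar sentence window of the seed's length (approx. match)."""
    best, score = None, threshold
    for i in range(len(sentence) - len(seed) + 1):
        r = SequenceMatcher(None, seed, sentence[i:i + len(seed)]).ratio()
        if r >= score:
            best, score = (i, i + len(seed)), r
    return best

def label(sentence, seeds, tag="ENT"):
    """Character-level BIO labels; longest seeds first to avoid nesting."""
    labels = ["O"] * len(sentence)
    for seed in sorted(seeds, key=len, reverse=True):
        span = best_window(seed, sentence)
        if span is None:
            continue
        s, e = span
        if any(t != "O" for t in labels[s:e]):   # span already claimed
            continue
        labels[s] = f"B-{tag}"
        for j in range(s + 1, e):
            labels[j] = f"I-{tag}"
    return list(zip(sentence, labels))

# "華盛頓" is nested in "華盛頓紀念碑" and is correctly skipped.
print(label("華盛頓紀念碑位於美國首都", ["華盛頓紀念碑", "華盛頓"]))
```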
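For feature mining, the three measures can be illustrated as follows. The exact definitions of support and confidence for a k-gram are assumptions chosen for this sketch (occurrence within labeled entity spans versus overall occurrence), with the harmonic mean combining the two in the usual F-measure style; the project's actual definitions may differ.

```python
# A minimal sketch of scoring candidate k-gram features by support,
# confidence, and their harmonic mean. "Inside" counts occurrences of a
# k-gram fully within labeled entity spans; definitions are assumptions.
from collections import Counter

def kgrams(chars, k):
    return [tuple(chars[i:i + k]) for i in range(len(chars) - k + 1)]

def score_features(labeled_sents, k=2):
    """labeled_sents: list of (char, BIO-label) pair sequences."""
    inside, total = Counter(), Counter()
    n_entity_grams = 0
    for sent in labeled_sents:
        chars = [c for c, _ in sent]
        tags = [t for _, t in sent]
        for i, g in enumerate(kgrams(chars, k)):
            total[g] += 1
            if all(t != "O" for t in tags[i:i + k]):  # fully inside an entity
                inside[g] += 1
                n_entity_grams += 1
    scores = {}
    for g in inside:
        support = inside[g] / n_entity_grams    # frequency among entity grams
        confidence = inside[g] / total[g]       # how exclusive to entities
        harmonic = 2 * support * confidence / (support + confidence)
        scores[g] = (support, confidence, harmonic)
    return scores
```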
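Finally, a minimal PyTorch sketch of the BI-LSTM-CONV-CRF variant, assuming the third-party pytorch-crf package for the linear-chain CRF layer; the layer sizes, the max-pooled character CNN, and the concatenation of word and character embeddings follow common practice for this architecture and are not the project's specification.

```python
# A minimal sketch of BI-LSTM-CONV-CRF: a character-level CNN produces
# character features, concatenated with word embeddings and fed to a
# BiLSTM whose emissions are decoded by a linear-chain CRF. Dimensions
# are illustrative; the CRF layer is the third-party pytorch-crf package.
import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf

class BiLSTMConvCRF(nn.Module):
    def __init__(self, n_words, n_chars, n_tags,
                 w_dim=100, c_dim=30, c_filters=30, hidden=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, w_dim, padding_idx=0)
        self.char_emb = nn.Embedding(n_chars, c_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(c_dim, c_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(w_dim + c_filters, hidden // 2,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(hidden, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def emissions(self, words, chars):
        # words: (B, T) word ids; chars: (B, T, L) character ids per token
        B, T, L = chars.shape
        c = self.char_emb(chars.view(B * T, L)).transpose(1, 2)  # (B*T, c_dim, L)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values       # pool over chars
        x = torch.cat([self.word_emb(words), c.view(B, T, -1)], dim=-1)
        h, _ = self.lstm(x)
        return self.emit(h)

    def loss(self, words, chars, tags, mask):
        return -self.crf(self.emissions(words, chars), tags, mask=mask)

    def decode(self, words, chars, mask):
        return self.crf.decode(self.emissions(words, chars), mask=mask)
```

The same emission stack can be reused for a CONV-CRF variant by replacing the BiLSTM with stacked convolution layers, which is one way the tool could share code across the three planned models.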