A Tool for Web NER Model Generation Based on Search Snippets of Known Entities

Name

黃雅筠 (Ya-yun Huang)

Department

Department of Computer Science and Information Engineering

Thesis Title

A Tool for Web NER Model Generation Based on Search Snippets of Known Entities
(基於已知名稱搜尋結果的網路實體辨識模型建立工具)

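Related Theses

★ A Study on Recognizing Schedule Invitation Emails and Extracting Irregular Time Expressions
★ Design of the NCUFree Campus Wireless Network Platform and Development of Application Services
★ Design and Implementation of a Semi-structured Data Extraction System for the Internet
★ Mining and Applications of Non-simple Browsing Paths
★ Improvements to Association Rule Mining on Incremental Data
★ Applying the Chi-square Test of Independence to Associative Classification
★ Design and Study of a Chinese Information Extraction System
★ Visualization of Non-numerical Data and Clustering That Combines Subjective and Objective Criteria
★ A Study of Associated Word Groups in Document Summarization
★ Web Page Cleaning: Page Segmentation and Data Region Extraction
★ Design and Study of a Question Answering System Using Sentence Classification and Ranking
★ Efficient Mining of Compact Frequent Sequential Event Patterns in Temporal Databases
★ Application of Axis Arrangement in Star Coordinates to Cluster Visualization
★ A Study on Automatically Generating Web Crawlers from Browsing History
★ A Study on Template and Data Analysis of Dynamic Web Pages
★ A Study on Automating Data Integration of Homogeneous Web Pages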

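The author has agreed to make this electronic thesis openly available immediately.
Electronic full texts that have reached their open-access date are licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.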

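Abstract (Chinese)

In the past, named entity recognition (NER) research focused mainly on person, location, and organization names in formal text such as news reports, and paid comparatively little attention to informal text on the Web. As a result, existing recognition models perform poorly on Web page content, and recognizing named entities in Web pages requires retraining a model. Training a model, however, is costly in both time and labor: it involves preparing a large amount of training data, manually collecting and labeling answers, and, to improve recognition performance, segmenting the data appropriately, unifying symbols, normalizing text, designing features, and preparing dictionaries of known terms. The work is tedious and complex, and it must be repeated for each language and each recognition task. This tool is designed to address that cost. It automatically labels training data using the search results of given known entity names, and combines this with Tri-training, a semi-supervised training method, to generate an NER model. Experiments show that the tool can be applied to named entity recognition in different languages and for different entity types: performance reaches 86.1% for Chinese organization names, 80.3% for Japanese organization names, 83.2% for English organization names, and 84.5% for Chinese location names, a different topic. For longer named entities, recognition of Chinese and English addresses reaches 97.2% and 94.8%, respectively.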
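The automatic labeling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes character-level B/I/O tags and plain exact string matching with longest-name-first priority; the function name `exact_match_label`, the tag scheme, and the overlap rule are illustrative assumptions, not the thesis's actual implementation.

```python
from typing import List, Tuple


def exact_match_label(
    sentence: str, known_entities: List[str], tag: str = "ORG"
) -> List[Tuple[str, str]]:
    """Tag each character of `sentence` as B-<tag>, I-<tag>, or O.

    Hypothetical helper: a span is labeled when it exactly matches one of
    the known entity names (for example, the names used as search queries).
    """
    labels = ["O"] * len(sentence)
    # Try longer names first so a long name is not broken up by a shorter
    # name that happens to be one of its substrings.
    for name in sorted(set(known_entities), key=len, reverse=True):
        if not name:
            continue
        start = sentence.find(name)
        while start != -1:
            end = start + len(name)
            # Skip spans that overlap an earlier (longer) match.
            if all(label == "O" for label in labels[start:end]):
                labels[start] = f"B-{tag}"
                for i in range(start + 1, end):
                    labels[i] = f"I-{tag}"
            start = sentence.find(name, start + 1)
    return list(zip(sentence, labels))
```

Calling `exact_match_label("Visit Acme Corp today", ["Acme", "Acme Corp"])` tags the nine characters of "Acme Corp" as one entity and leaves the rest as `O`. Character-level tags are used here because they work without a word segmenter, which matters for Chinese and Japanese snippets.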

Abstract (English)

Named entity recognition (NER) is of vital importance in information extraction and natural language processing. Current NER systems are trained mainly on journalistic documents such as news articles to extract person names, location names, and organization names. Because they have not been trained on informal text, their performance drops on Web documents, which are noisy and less structured; state-of-the-art NER systems therefore do not work well on Web documents. Users who want to recognize named entities in Web documents have to train a new model, which is labor-intensive and time-consuming. The preparatory work includes preparing a large set of training data, labeling named entities, choosing an appropriate segmentation, unifying symbols, normalizing text, designing features, and preparing dictionaries, and all of it must be repeated for each new language or entity type. In this research, we propose an NER model generation tool for effective Web entity extraction. The tool uses a semi-supervised learning approach that combines automatic labeling with tri-training, making use of unlabeled data and structured resources containing known named entities. Experiments confirm that the tool can be applied to different languages and to various types of named entities. On Chinese organization name extraction, the generated model achieves an F1 score of 86.1% on 38,692 sentences containing 16,241 distinct names. For Japanese organization names, English organization names, Chinese location names, Chinese addresses, and English addresses, the generated models reach F1 scores of 80.3%, 83.2%, 84.5%, 97.2%, and 94.8%, respectively.
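The core idea of tri-training, as generally described, is that an unlabeled example is handed to one classifier as new training data when the other two classifiers agree on its label. The sketch below shows only that agreement step. The full algorithm also checks error-rate conditions and retrains the classifiers in rounds, which is omitted here. The names `pseudo_label_for`, `ends_with_inc`, `starts_upper`, and `has_suffix` are hypothetical, and the toy rule-based classifiers stand in for whatever sequence labelers the thesis actually uses.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Classifier = Callable[[str], Hashable]


def pseudo_label_for(
    target: int, classifiers: Sequence[Classifier], unlabeled: Sequence[str]
) -> List[Tuple[str, Hashable]]:
    """Return (example, label) pairs on which the two classifiers other
    than `target` agree; these become extra training data for `target`."""
    others = [c for i, c in enumerate(classifiers) if i != target]
    if len(others) != 2:
        raise ValueError("tri-training uses exactly three classifiers")
    first, second = others
    new_examples = []
    for x in unlabeled:
        y1, y2 = first(x), second(x)
        if y1 == y2:  # agreement is treated as a confident pseudo-label
            new_examples.append((x, y1))
    return new_examples


# Toy stand-in classifiers (assumptions for illustration only).
def ends_with_inc(s: str) -> str:
    return "ORG" if s.endswith("Inc.") else "O"


def starts_upper(s: str) -> str:
    return "ORG" if s[:1].isupper() else "O"


def has_suffix(s: str) -> str:
    return "ORG" if ("Corp" in s or "Inc." in s) else "O"
```

With `[ends_with_inc, starts_upper, has_suffix]` and the inputs `"Acme Inc."`, `"hello world"`, and `"Globex Corp"`, the first classifier receives all three examples, while the second receives only the first two, because its two peers disagree on `"Globex Corp"`.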

Keywords (Chinese)

★ 命名實體辨識 (Named Entity Recognition)
★ 協同訓練 (Co-Training)
★ Tri-Training

Keywords (English)

★ Named Entity Recognition
★ Co-Training
★ Tri-Training

Table of Contents

Chinese Abstract
English Abstract
Table of Contents
List of Figures
List of Tables
I. INTRODUCTION
1.1. Motivation
1.2. Thesis Organization
II. RELATED WORK
III. SYSTEM ARCHITECTURE
3.1. Data Collection and Automatic Labeling Modules
3.2. String Split and Tagging Module
3.3. Feature Mining Module
3.4. Self-Testing and Tri-Training
IV. EXPERIMENT
4.1. Data Set
4.2. Comparing on High-frequency Tokens Dictionary Size
4.3. The performance on various NER tasks
4.4. The Performance of Manual Generate Dictionary
4.5. The Performance Influence of Self-Testing and Tri-Training
4.6. ExactMatchLabeling and AlignmentLabeling
V. CONCLUSION
Reference

Advisor

張嘉惠 (Chia-Hui Chang)

Approval Date

2015-07-29