A Tool for Web NER Model Generation Based on Search Snippets of Known Entities (基於已知名稱搜尋結果的網路實體辨識模型建立工具)

Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/68855

Title:	A Tool for Web NER Model Generation Based on Search Snippets of Known Entities (基於已知名稱搜尋結果的網路實體辨識模型建立工具)
Author:	Huang, Ya-yun (黃雅筠)
Contributor:	Department of Computer Science and Information Engineering
Keywords:	Named Entity Recognition; Co-Training; Tri-Training
Date:	2015-07-29
Upload time:	2015-09-23 14:45:08 (UTC+8)
Publisher:	National Central University
Abstract:	Named entity recognition (NER) is of vital importance in information extraction and natural language processing. Most NER research trains models on formal documents such as news articles to extract person, location, and organization names; informal Web text has received much less attention. Because these models have not been trained on informal documents, their performance drops on Web documents, which are noisy and less structured, so state-of-the-art NER systems do not work well on Web content. Users who want to recognize named entities in Web documents therefore have to train a new model, which is labor-intensive and time-consuming. The preparatory work includes collecting a large set of training data, labeling named entities by hand, choosing an appropriate segmentation, unifying symbols, normalizing text, designing features, and preparing dictionaries of known terms. All of this work must be repeated for each new language or recognition type. In this research, we propose a NER model generation tool for effective Web entity extraction. The tool takes a semi-supervised approach: it automatically labels training data using the search results of known entity names drawn from structured resources, then applies tri-training, which makes use of unlabeled data, to produce the NER model. Experiments confirm that the tool can be applied to different languages and to various types of named entities.
In Chinese organization name extraction, the generated model achieves an F1 score of 86.1% on 38,692 sentences with 16,241 distinct names. For Japanese organization names, English organization names, Chinese location names, Chinese addresses, and English addresses, the generated models reach F1 scores of 80.3%, 83.2%, 84.5%, 97.2%, and 94.8%, respectively.
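The two ideas named in the abstract, automatic labeling from known entity names and tri-training, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `auto_label` and `agreed_labels`, the character-level B/I/O tag scheme, and the longest-match rule are all assumptions made for illustration. The agreement step is also simplified, since full tri-training additionally checks estimated error rates before accepting new labels.

```python
from typing import List, Sequence, Tuple


def auto_label(snippet: str, known_entities: Sequence[str]) -> List[Tuple[str, str]]:
    """Tag each character of a snippet with B/I/O by longest dictionary match.

    Hypothetical helper: the tagging scheme is an assumption, not the thesis's.
    """
    # Try longer names first so a full name wins over a shorter overlapping entry.
    names = sorted((n for n in known_entities if n), key=len, reverse=True)
    tags = ["O"] * len(snippet)
    i = 0
    while i < len(snippet):
        match = next((n for n in names if snippet.startswith(n, i)), None)
        if match is None:
            i += 1
            continue
        tags[i] = "B"  # first character of the matched entity
        for j in range(i + 1, i + len(match)):
            tags[j] = "I"  # remaining characters of the matched entity
        i += len(match)
    return list(zip(snippet, tags))


def agreed_labels(pred_a: Sequence[str], pred_b: Sequence[str]) -> List[Tuple[int, str]]:
    """Simplified tri-training selection: keep items on which two models agree.

    The returned (index, label) pairs would be added to the third model's
    training set; the error-rate conditions of full tri-training are omitted.
    """
    return [(k, a) for k, (a, b) in enumerate(zip(pred_a, pred_b)) if a == b]


# Example: label one snippet using a two-entry dictionary of known names.
example = auto_label("我在中央大學讀書", ["中央大學", "大學"])
```

In this sketch, each search snippet is labeled without manual annotation, and the three models would then take turns receiving the examples on which the other two agree.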
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Doctoral and Master's Theses

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	441

All items in NCUIR are protected by the original copyright.
