Abstract: | In the past, named entity recognition (NER) research has focused mainly on person, location, and organization names in formal text such as news reports, with comparatively little attention to informal text on the Web. As a result, existing recognition models perform poorly on Web content, and recognizing named entities in Web pages requires retraining the model. However, training a model is very costly in time and labor: it involves preparing a large amount of training data in advance, collecting and labeling answers by hand, and, to improve recognition performance, appropriately segmenting the data, unifying symbols, normalizing text, designing features, and preparing a dictionary of known terms, all of which is tedious and complex. Moreover, this work must be repeated for each new language or recognition topic. This tool is designed to address the excessive effort and time that NER work requires: it labels training data automatically from the search results for given known entity names, and combines this with tri-training, a semi-supervised learning method, to produce an NER model. Experiments confirm that the tool can be applied to named entity recognition in different languages and of different types: performance reaches 86.1% for Chinese organization names, 80.3% for Japanese organization names, and 83.2% for English organization names; 84.5% for Chinese location names, a different topic; and 97.2% and 94.8% for longer named entities such as Chinese and English addresses, respectively.;Named entity recognition (NER) is of vital importance in information extraction and natural language processing. Current NER systems are trained mainly on journalistic documents such as news articles to extract person names, location names, and organization names. Because they have not been trained on informal documents, their performance drops on Web documents, which contain noise and are less structured. State-of-the-art NER systems therefore do not work well on Web documents, and users who want to recognize named entities in Web documents have to train a new model. Training a new model is labor-intensive and time-consuming. The preparatory work includes preparing a large set of training data, labeling named entities, selecting an appropriate segmentation, unifying symbols, normalizing text, designing features, preparing a dictionary, and so on, which makes pre-processing very complicated. Moreover, users need to repeat this work for each language or recognition type. In this research, we propose an NER model generation tool for effective Web entity extraction. We propose a semi-supervised learning approach for NER via automatic labeling and tri-training, which makes use of unlabeled data and structured resources containing known named entities. Experiments confirm that the tool can be applied to different languages and to various types of named entities. 
In the task of Chinese organization name extraction, the generated model achieves an F1 score of 86.1% on 38,692 sentences containing 16,241 distinct names, while the F1 scores for Japanese organization name, English organization name, Chinese location name, Chinese address, and English address recognition reach 80.3%, 83.2%, 84.5%, 97.2%, and 94.8%, respectively. |
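The automatic labeling step mentioned in the abstract can be illustrated with a minimal sketch. It is not the tool's actual implementation: the function name `auto_label`, the BIO tagging scheme, whitespace tokenization, and greedy longest-match lookup are all assumptions made for illustration. It only shows how a list of known entity names could be turned into token-level training labels without manual annotation; the tri-training stage is not covered.

```python
from typing import List, Sequence, Set, Tuple


def auto_label(tokens: Sequence[str],
               known_entities: Set[Tuple[str, ...]],
               tag: str = "ORG") -> List[str]:
    """Assign BIO tags by greedy longest match against known entity names.

    Hypothetical helper: the matching strategy and tag scheme are
    illustrative assumptions, not taken from the original work.
    """
    max_len = max((len(e) for e in known_entities), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first so longer names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in known_entities:
                matched = n
                break
        if matched:
            labels[i] = f"B-{tag}"
            for j in range(i + 1, i + matched):
                labels[j] = f"I-{tag}"
            i += matched
        else:
            i += 1
    return labels


# Toy data: each known name is stored as a tuple of tokens.
known = {("Acme", "Corp"), ("Acme",)}
sentence = "Shares of Acme Corp rose today".split()
labels = auto_label(sentence, known)
print(list(zip(sentence, labels)))
```

Sentences labeled this way could serve as the initial training set, with the remaining unlabeled sentences left for the semi-supervised stage. Chinese or Japanese text would need a character-level or segmenter-based tokenization in place of `split()`.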