dc.description.abstract | Sequence labeling models are widely used in Natural Language Processing (NLP), for tasks such as Named Entity Recognition (NER), Part-Of-Speech (POS) tagging, and word segmentation. NER is an important NLP task because it identifies named entities in unstructured text and classifies them into pre-defined categories such as person names, place names, and organizations.
Most NER research has focused on English data. In English, spaces delimit words and each word carries its own meaning, whereas Chinese has no explicit word delimiters: each character carries different information, and the same character may take on different meanings depending on its position within a word. Moreover, traditional machine learning approaches to Chinese Named Entity Recognition (CNER) mostly rely on statistical methods and use a Conditional Random Field (CRF) to perform the sequence labeling task, so they can only capture local features. Capturing long-range context in Chinese data, determining the correct sense of the current word, and correctly identifying named entities is therefore a challenging and forward-looking task.
To overcome these challenges, this study combines deep learning with a Conditional Random Field to perform Chinese Named Entity Recognition. First, a word vector model is trained to convert characters into numeric representations. The network then stacks a convolutional layer, a bidirectional GRU layer, and a memory layer that integrates external memory holding long-range context information, so that unlike conventional models, which capture only local information, it can exploit rich article-level information. Lexical features are also generated by feature extraction [1], and automatically trained variables in the deep learning model adjust the weights of the word embeddings and lexical features. In addition to long-range contextual information, the model can thus fully exploit the hidden information in an article.
The datasets used in this research are PerNews, which consists of online articles collected with a custom crawler as training data and online news articles as test data, and SIGHAN Bakeoff-3. According to the results, the proposed model achieves 91.67% tagging accuracy on the online social media data, 2.9% higher than the same model without the memory layer; adding the word embeddings and lexical features improves on the basic memory model by a further 6.04%. On the SIGHAN-MSRA dataset, the proposed model also achieves the highest F1-score of 92.45% on location entity recognition and an overall recall of 90.95%. | en_US |
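The CRF layer described above assigns each character a tag by decoding the highest-scoring tag sequence. As a minimal illustrative sketch (not the thesis implementation), the following Python function performs Viterbi decoding over a linear-chain model; the tag set and all scores are hypothetical toy values, not taken from the study.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence for one sentence.

    emissions:   list of {tag: score} dicts, one per character
    transitions: {(prev_tag, tag): score} for every tag pair
    tags:        list of all tags
    """
    # Initialize with the first character's emission scores.
    scores = {t: emissions[0][t] for t in tags}
    backptr = []
    for emit in emissions[1:]:
        new_scores, ptr = {}, {}
        for t in tags:
            # Pick the previous tag maximizing path + transition score.
            best_prev = max(tags, key=lambda p: scores[p] + transitions[(p, t)])
            new_scores[t] = scores[best_prev] + transitions[(best_prev, t)] + emit[t]
            ptr[t] = best_prev
        backptr.append(ptr)
        scores = new_scores
    # Trace back from the best final tag.
    last = max(tags, key=lambda t: scores[t])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path))


# Toy example with a BIO tag set over a three-character name span.
tags = ["B", "I", "O"]
transitions = {("B", "B"): -1, ("B", "I"): 2, ("B", "O"): 0,
               ("I", "B"): -1, ("I", "I"): 1, ("I", "O"): 0,
               ("O", "B"): 1,  ("O", "I"): -2, ("O", "O"): 1}
emissions = [{"B": 3, "I": 0, "O": 1},
             {"B": 0, "I": 3, "O": 1},
             {"B": 0, "I": 2, "O": 2}]
print(viterbi(emissions, transitions, tags))  # ['B', 'I', 'I']
```

In the full model, the emission scores would come from the convolutional/bidirectional-GRU/memory layers rather than being hand-specified, and the transition scores would be learned jointly with the network.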