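Abstract: | Sequence labeling models are widely used in natural language processing, for tasks such as named entity recognition, part-of-speech tagging, and word segmentation. Named Entity Recognition (NER) is an important natural language processing task because it extracts the named entities from unprocessed text and assigns them to predefined categories such as person names, place names, and organizations. Most NER research targets English datasets. In English, words are usually separated by spaces and each word usually carries its own meaning; a Chinese character, by contrast, often carries several kinds of information and can mean different things depending on its position within a word, so Chinese has no explicit word-boundary features. Traditional machine learning approaches to Chinese NER are mostly statistical and use conditional random fields for sequence labeling, which restricts them to features drawn from a small window. Capturing long-range context in Chinese datasets to determine the correct meaning of the current character or word, and thereby identify named entities correctly, is therefore a challenging and forward-looking task. To overcome these challenges, this study applies a deep learning conditional random field to Chinese NER. A word vector model is first trained to convert characters into numeric data; a convolutional layer, a bidirectional GRU layer, and a memory layer that integrates long-range document information then let the model obtain rich and complete document information, rather than only the local information available to earlier approaches. In addition, through feature mining [1] and automatically trained parameters of the deep learning model, the word vectors and lexical features are adjusted automatically, so that beyond long-range document information the model can also make fuller use of the information hidden in the text. The datasets used in this study are PerNews, which uses online articles collected with a custom-built crawler as training data and online news as test data [3], and SIGHAN Bakeoff-3 [2]. Experimental results show that on online social media data the model reaches a tagging accuracy of 91.67%, a substantial improvement of 2.9% over the model without memory; adding word-level embeddings and lexical features yields an improvement of 6.04% over the basic memory model. On SIGHAN-MSRA, the proposed model also achieves the highest location entity recognition performance, 92.45%, and a recall of 90.95%.;Sequence labeling models are widely used in Natural Language Processing (NLP), for example in Named Entity Recognition (NER), Part-Of-Speech (POS) tagging, and word segmentation. NER is an important NLP task because it extracts named entities from unprocessed articles and assigns them to predefined categories such as person name, place name, and organization. Most NER research has focused on English data. In English, spaces usually separate words, and each word has its own meaning. In Chinese, by contrast, each character carries different information and may have different meanings depending on its position within a word, so Chinese has no explicit word delimiters. Traditional machine learning approaches to Chinese Named Entity Recognition (CNER) mostly use statistical methods and apply a Conditional Random Field (CRF) to the sequence labeling task, so they can capture only local features.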
Capturing long-range context information in Chinese datasets, determining the correct meaning of the current word, and correctly identifying named entities is a challenging and forward-looking task. To overcome these challenges, this study uses a deep learning Conditional Random Field to perform the Chinese Named Entity Recognition task. First, a word vector model is trained to convert characters into numeric data. The model then uses a convolutional layer, a bidirectional GRU layer, and a memory layer that integrates external memory containing long-range context information, so that, unlike earlier approaches that capture only local information, it can draw on rich information from the whole article. Feature extraction is also used to generate lexical features [1], and an automatically trained variable of the deep learning model adjusts the weights of the word embeddings and lexical features. In addition to long-range article information, the model can therefore make fuller use of the information hidden in the article. The datasets used in this research are PerNews, which uses online articles collected with a custom crawler as training data and online news articles as test data, and SIGHAN Bakeoff-3. According to the results, the proposed model achieves 91.67% tagging accuracy on the online social media data, a substantial improvement of 2.9% over the model without the memory layer. Adding word embeddings and lexical features yields an increase of 6.04% over the basic memory model. On the SIGHAN-MSRA dataset, the proposed model also achieves the highest F1-score for location name recognition, 92.45%, and an overall recall of 90.95%. |
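The abstract describes a CRF used for sequence labeling on top of the neural layers. As an illustration only, the following minimal sketch shows standard Viterbi decoding for a linear-chain CRF, which picks the highest-scoring tag sequence from per-position tag scores and tag-to-tag transition scores. It is not the thesis implementation: the function name `viterbi_decode`, the three-tag set, and all score values are hypothetical and chosen only to make the example small.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence for a linear-chain CRF.

    emissions[t][j]   : score of tag j at position t (assumed to come from
                        the upstream layers, e.g. BiGRU plus memory)
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    n_tags = len(emissions[0])
    score = list(emissions[0])   # best score of any path ending in each tag
    backptr = []                 # backptr[t-1][j] = best previous tag for j at t
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backptr.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backptr):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Hypothetical tag set and scores for a three-character input.
TAGS = ["O", "B-LOC", "I-LOC"]
emissions = [
    [0.1, 2.0, 0.3],
    [0.2, 0.4, 1.5],
    [1.8, 0.1, 0.9],
]
transitions = [
    # to:  O    B-LOC  I-LOC
    [0.5,  0.2, -5.0],   # from O (O -> I-LOC is strongly discouraged)
    [0.0, -1.0,  1.0],   # from B-LOC
    [0.3, -1.0,  0.5],   # from I-LOC
]

decoded = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
print(decoded)  # ['B-LOC', 'I-LOC', 'O']
```

The transition scores are what allow a CRF layer to penalize invalid label sequences, such as an `I-LOC` tag that directly follows `O`, instead of choosing each character's tag independently.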