A Two-Phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning (應用樣式探勘與機器學習方法於中文未知詞擷取之研究)

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/9589

Title:	A Two-Phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning (應用樣式探勘與機器學習方法於中文未知詞擷取之研究)
Author:	Chieh-Cheng Yang (楊傑程)
Contributor:	Graduate Institute of Computer Science and Information Engineering
Keywords:	classification; machine learning; pattern mining; Chinese unknown word; imbalanced data
Date:	2009-01-09
Upload time:	2009-09-22 11:51:12 (UTC+8)
Publisher:	National Central University Library
Abstract:	Chinese word segmentation is one of the major preprocessing steps in Chinese natural language processing. Unlike text in Western languages, where words are separated by spaces, Chinese text is written without word boundaries, so the goal of segmentation is to insert boundaries at the appropriate positions in an unsegmented input string. Word segmentation faces two major problems: ambiguity and unknown (out-of-vocabulary) words. This thesis focuses on the unknown word problem and addresses it with a two-phase approach: unknown word detection followed by unknown word extraction.
In the detection phase, we apply continuity pattern mining to derive detection rules for known words from a corpus, based on a more complete and diverse set of pattern types. These rules distinguish whether a single Chinese character is a monosyllabic known word or only part of an unknown word.
In the extraction phase, we use machine learning methods for sequential data to recast unknown word extraction from a text sequence as a classification problem, which is then solved with classification algorithms. For each character flagged as a possible unknown word component in the first phase, the classifier decides whether the detected morpheme should be merged with adjacent words to form an unknown word. The features are based on syntactic information, contextual information, and statistical information. Three classification models (2-gram, 3-gram, and 4-gram) are constructed, together with rules that resolve overlaps and conflicts among their outputs. The extraction experiments also apply methods that address the class imbalance in the data.
The experimental data come from the Academia Sinica Balanced Corpus. Academia Sinica's 2003 unknown word study, which combined general rules with hand-crafted rules and was tested on web data, reported an F-measure of 64.8%. With little or no reliance on hand-crafted rules, our approach reaches an F-measure of 65.7% on the balanced corpus, a comparable level of performance. Finally, we verify that the detection phase is a necessary part of the two-phase approach.
Appears in collections:	[Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations

All items in NCUIR are protected by the original copyright.
