A Two-Phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning

DC Field	Value	Language
DC.contributor	資訊工程學系	zh_TW
DC.creator	楊傑程	zh_TW
DC.creator	Chieh-Cheng Yang	en_US
dc.date.accessioned	2009-02-02T07:39:07Z
dc.date.available	2009-02-02T07:39:07Z
dc.date.issued	2009
dc.identifier.uri	http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=955202037
dc.contributor.department	資訊工程學系	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	在中文的自然語言處理中，中文斷詞是一個極為重要的問題。中文的原始呈現方式是未斷詞的格式，並不像歐美語系一樣，每個詞之間都有空白做區隔。中文斷詞的目標，係將輸入之未斷詞的中文字串，在適當位置插入空白以區隔不同詞彙，而中文斷詞裡有兩個主要的問題，“歧義性”及“未知詞”。在本篇論文中，我們主要是針對未知詞的問題進行兩個階段的偵測與擷取。第一階段的未知詞偵測，我們運用了一個樣式探勘(pattern mining)的方法，在語料庫中根據更完整、多樣的樣式型態找尋已知詞的偵測規則，這些規則可以區別出單一中文字元是單音節已知詞，亦或只是未知詞的一部份。第二階段的未知詞擷取，我們使用機器學習(Machine Learning)中的序列資料處理方法，將序列文件中未知詞擷取的問題，轉換為分類的問題，再搭配分類演算法予以解決。經過第一階段辨識出可能的未知詞字元後，再搭配該字元於文章中的上下文資訊與統計資訊，應用於分類模型中，進行是否為未知詞的判斷，實驗資料來自於中研院的平衡語料庫。第二階段的實驗，亦針對資料在分類上的不均衡情形應用相關的解決方法。最後，我們亦驗證兩階段作法中，未知詞偵測(第一階段)的必要性。2003年中研院的未知詞研究，使用一般性規則與人工規則的搭配，針對網路資料進行測試，擷取結果為F-measure 64.8%。本篇論文的研究在不使用任何人工型態規則之情況下，使用平衡語料庫的測試結果，未知詞擷取的表現已達到F-measure 65.7%的水準。	zh_TW
dc.description.abstract	Chinese word segmentation is one of the major preprocessing steps in Chinese text processing. Because written Chinese does not mark word boundaries, the main goal of Chinese word segmentation is to identify words. Word segmentation faces two major problems: ambiguity and unknown (out-of-vocabulary) words. In this thesis, we focus on the Chinese unknown word problem and address it with a two-phase approach: unknown word detection in the first phase and unknown word extraction in the second. In the detection phase, we apply continuity pattern mining to derive a set of rules from a corpus, based on more complete pattern types. These rules distinguish whether a Chinese character is a monosyllabic word or part of an unknown word. In the extraction phase, we use machine learning algorithms to determine whether a detected morpheme should be merged with adjacent words to form an unknown word. Our classification model uses features based on syntactic, contextual, and statistical information. Three classification models (2-gram, 3-gram, and 4-gram) are constructed, with rules to resolve overlap and conflict problems. We use the Academia Sinica Balanced Corpus as our experimental data. Without much assistance from manually crafted rules, our experimental results (F-measure 0.657) are comparable to those reported by Academia Sinica (F-measure 0.648). Finally, we also demonstrate the importance of the detection phase in our two-phase approach.	en_US
DC.subject	分類	zh_TW
DC.subject	機器學習	zh_TW
DC.subject	樣式探勘	zh_TW
DC.subject	中文未知詞	zh_TW
DC.subject	資料不均衡	zh_TW
DC.subject	machine learning	en_US
DC.subject	classification	en_US
DC.subject	imbalanced data	en_US
DC.subject	Chinese unknown word	en_US
DC.subject	pattern mining	en_US
DC.title	應用樣式探勘與機器學習方法於中文未知詞擷取之研究	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	A Two-Phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis 955202037: full metadata record
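
Illustrative note on the abstract: the extraction phase described above asks whether a detected morpheme should be merged with adjacent words, using 2-gram, 3-gram, and 4-gram models. The sketch below shows one way such merge candidates could be enumerated and how a balanced F-measure is conventionally computed. It is not the thesis's implementation: the function names, the window enumeration, the sample tokens, and the choice of the balanced (F1) form are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def merge_candidates(
    tokens: List[str], idx: int, sizes: Sequence[int] = (2, 3, 4)
) -> List[Tuple[int, int, str]]:
    """Enumerate every n-token window (n in sizes) that covers tokens[idx].

    Hypothetical helper: returns (start, end, merged_string) triples,
    with end exclusive. Each triple is one candidate that a classifier
    could accept or reject as an unknown word.
    """
    candidates = []
    for n in sizes:
        # A window of n tokens covering idx may start anywhere from
        # idx - n + 1 up to idx, clipped to the sequence bounds.
        first = max(0, idx - n + 1)
        last = min(idx, len(tokens) - n)
        for start in range(first, last + 1):
            candidates.append((start, start + n, "".join(tokens[start:start + n])))
    return candidates


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall (assumed F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical initial segmentation in which a personal name was
    # split into single characters; index 1 is the detected morpheme.
    sample = ["王", "小", "明", "說", "話"]
    for start, end, text in merge_candidates(sample, 1):
        print(start, end, text)
```

Treating each window as a separate classification instance is what turns extraction over a sequence into an ordinary classification problem; overlapping accepted windows would still need a resolution rule, which this sketch leaves out.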