Thesis 955202037: Full Metadata Record

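DC field | Value | Language
DC.contributor | Department of Computer Science and Information Engineering | zh_TW
DC.creator | 楊傑程 | zh_TW
DC.creator | Chieh-Cheng Yang | en_US
dc.date.accessioned | 2009-2-2T07:39:07Z |
dc.date.available | 2009-2-2T07:39:07Z |
dc.date.issued | 2009 |
dc.identifier.uri | http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=955202037 |
dc.contributor.department | Department of Computer Science and Information Engineering | zh_TW
DC.description | National Central University | zh_TW
DC.description | National Central University | en_US
dc.description.abstract | Word segmentation is a critically important problem in Chinese natural language processing. Chinese text is natively written without word boundaries, unlike Western languages, where words are separated by spaces. The goal of Chinese word segmentation is to insert spaces at appropriate positions in an unsegmented Chinese input string so as to separate the words. Chinese word segmentation has two main problems: ambiguity and unknown words. In this thesis, we address the unknown word problem in two stages, detection and extraction. In the first stage, unknown word detection, we apply a pattern mining method to find detection rules for known words in a corpus, based on more complete and more varied pattern types. These rules distinguish whether a single Chinese character is a monosyllabic known word or merely part of an unknown word. In the second stage, unknown word extraction, we use sequence data processing methods from machine learning to convert the problem of extracting unknown words from sequential text into a classification problem, which we then solve with classification algorithms. After the first stage identifies characters that may belong to unknown words, we combine each character's contextual information in the document with statistical information and apply a classification model to judge whether the candidate is an unknown word. The experimental data come from the Academia Sinica Balanced Corpus. In the second-stage experiments, we also apply methods that address class imbalance in the data. Finally, we verify the necessity of unknown word detection (the first stage) in the two-stage approach. A 2003 Academia Sinica study of unknown words, which combined general rules with manually crafted rules and was tested on web data, reported an extraction F-measure of 64.8%. Without using any manually crafted morphological rules, our approach, tested on the balanced corpus, reaches an unknown word extraction F-measure of 65.7%. | zh_TW
dc.description.abstract | Chinese word segmentation is one of the major preprocessing steps in Chinese text processing. Because original Chinese text lacks word boundaries, the main goal of Chinese word segmentation is to identify words. Word segmentation has two major problems: ambiguity and unknown (out-of-vocabulary) words. In this thesis, we focus on the Chinese unknown word problem. We use a two-phase approach: the first phase detects unknown words and the second phase extracts them. In the detection phase, we apply continuity pattern mining to derive a set of rules from a corpus, based on more complete types of patterns. These rules distinguish whether a Chinese character is a monosyllabic word or part of an unknown word. In the extraction phase, we use machine learning algorithms to determine whether a detected morpheme should be merged with adjacent words to form an unknown word. Our classification model uses features based on syntactic, contextual, and statistical information. Three classification models (2-gram, 3-gram, and 4-gram) are constructed, with rules to resolve overlaps and conflicts among them. We use the Academia Sinica Balanced Corpus as our experimental data. Without much assistance from manually crafted rules, our experimental result (F-measure 0.657) is comparable to the result reported by Academia Sinica (F-measure 0.648). Finally, we also demonstrate the importance of detection in our two-phase approach. | en_US
DC.subject | classification | zh_TW
DC.subject | machine learning | zh_TW
DC.subject | pattern mining | zh_TW
DC.subject | Chinese unknown word | zh_TW
DC.subject | imbalanced data | zh_TW
DC.subject | machine learning | en_US
DC.subject | classification | en_US
DC.subject | imbalanced data | en_US
DC.subject | Chinese unknown word | en_US
DC.subject | pattern mining | en_US
DC.title | A Study on Applying Pattern Mining and Machine Learning Methods to Chinese Unknown Word Extraction | zh_TW
dc.language.iso | zh-TW | zh-TW
DC.title | A two-phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning | en_US
DC.type | Master's/doctoral thesis | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US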
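
The abstract describes the extraction phase as a classification problem over 2-gram, 3-gram, and 4-gram candidates built around characters flagged in the detection phase. The following is a minimal Python sketch of how such candidates might be enumerated. It is an illustration only, not the thesis's implementation: the function name `candidate_ngrams`, the boolean `detected` flags, and the example sentence are all assumptions introduced here, and the record does not specify how candidates or features are actually constructed.

```python
from typing import List, Sequence, Tuple


def candidate_ngrams(
    tokens: List[str],
    detected: List[bool],
    sizes: Sequence[int] = (2, 3, 4),
) -> List[Tuple[int, int, str]]:
    """Enumerate contiguous token windows containing a detected token.

    tokens:   output of an initial segmentation (hypothetical input).
    detected: True where the detection phase flagged the token as a
              possible part of an unknown word (hypothetical input).
    Returns (start, end_exclusive, merged_string) triples, one per window.
    Each triple would then be labeled by a classifier as
    "unknown word" or "not an unknown word".
    """
    if len(tokens) != len(detected):
        raise ValueError("tokens and detected must have the same length")

    candidates = []
    for n in sizes:
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            end = start + n
            # Keep only windows that touch at least one flagged token.
            if any(detected[start:end]):
                candidates.append((start, end, "".join(tokens[start:end])))
    return candidates


# Hypothetical example: a three-character personal name that the initial
# segmentation split into single characters.
example_tokens = ["王", "小", "明", "喜歡", "音樂"]
example_flags = [True, True, True, False, False]
example_candidates = candidate_ngrams(example_tokens, example_flags)
```

In this sketch the 3-gram window covering positions 0 to 3 yields the merged string 王小明, which is the kind of candidate a classifier would need to accept, while overlapping windows such as 小明喜歡 illustrate why the abstract mentions rules for resolving overlaps and conflicts among the three models.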

For questions about this thesis, contact the Outreach Services Division of the National Central University Library, TEL: (03)422-7151 ext. 57407, or by e-mail.