A Two-Phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning

Name

Chieh-Cheng Yang (楊傑程)

Department

Department of Computer Science and Information Engineering

Thesis title

A Two-Phase Approach to Chinese Unknown Word Extraction: Application of Pattern Mining and Machine Learning
(Chinese title: 應用樣式探勘與機器學習方法於中文未知詞擷取之研究)

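This electronic thesis is authorized for immediate open access.
Full text that has reached its open-access date is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.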

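Abstract (Chinese)

Word segmentation is a critical problem in Chinese natural language processing. Chinese text is written unsegmented: unlike European languages, it has no spaces between words. The goal of Chinese word segmentation is to insert boundaries at the appropriate positions in an unsegmented input string so that individual words are separated. Segmentation faces two main problems: ambiguity and unknown words. This thesis addresses the unknown word problem with a two-phase process of detection and extraction. In the first phase, unknown word detection, we apply a pattern mining method to a corpus to derive detection rules for known words from a more complete and diverse set of pattern types. These rules distinguish whether a single Chinese character is a monosyllabic known word or only part of an unknown word. In the second phase, unknown word extraction, we use sequential data processing methods from machine learning to convert the extraction of unknown words from sequential text into a classification problem, which we then solve with classification algorithms. Characters that the first phase flags as possible parts of unknown words are combined with their contextual and statistical information from the document and passed to the classification model, which decides whether they form an unknown word. The experimental data come from the Academia Sinica Balanced Corpus. The second-phase experiments also apply methods that address class imbalance in the data. Finally, we verify that unknown word detection (the first phase) is necessary in the two-phase approach. A 2003 unknown word study by Academia Sinica, which combined general rules with manually crafted rules and was tested on web data, reported an extraction F-measure of 64.8%. Without any manually crafted pattern rules, our approach reaches an unknown word extraction F-measure of 65.7% on the balanced corpus.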
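
The conversion of sequence labeling into classification described above can be illustrated with a sliding window. The following is a minimal sketch, not the thesis's implementation: the function name `sliding_windows`, the `<PAD>` marker, and the `O`/`U` labels are assumptions made for illustration, and the real feature set (contextual and statistical information) is not reproduced here.

```python
from typing import List, Tuple

PAD = "<PAD>"  # hypothetical padding marker for positions outside the sentence


def sliding_windows(
    tokens: List[str], labels: List[str], size: int = 3
) -> List[Tuple[Tuple[str, ...], str]]:
    """Turn one labelled sequence into (window, label) classification instances.

    Each instance pairs the window centred on position i with the label of
    position i, so an ordinary classifier can be trained on the result.
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("size must be a positive odd number")
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    half = size // 2
    padded = [PAD] * half + list(tokens) + [PAD] * half
    return [(tuple(padded[i:i + size]), labels[i]) for i in range(len(tokens))]


# Hypothetical input: "U" marks a token that is part of an unknown word.
for window, label in sliding_windows(["A", "B", "C"], ["O", "U", "O"], size=3):
    print(window, "->", label)
```

With a window size of 3, each token is classified from itself and one neighbour on each side, which matches the window size used in the example listed as Table 2 of the thesis.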

Abstract (English)

Chinese word segmentation is one of the major preprocessing steps in Chinese text processing. Because original Chinese text lacks word boundaries, the main goal of Chinese word segmentation is to identify words. There are two major problems in word segmentation: ambiguity and unknown words (out-of-vocabulary words). In this thesis, we focus on the Chinese unknown word problem. We use a two-phase approach: the first phase detects unknown words and the second phase extracts them. In the detection phase, we apply continuity pattern mining to derive a set of rules from a corpus based on more complete types of patterns. These rules distinguish whether a Chinese character is a monosyllabic word or part of an unknown word. In the extraction phase, we use machine learning algorithms to determine whether a detected morpheme should be merged with adjacent words to form an unknown word. Our classification model uses features based on syntactic, contextual, and statistical information. Three classification models (2-gram, 3-gram, and 4-gram) are constructed, with rules to resolve overlaps and conflicts. We use the Academia Sinica Balanced Corpus as our experimental data. Without much assistance from manually crafted rules, our experimental result (F-measure 0.657) is comparable to the result reported by Academia Sinica (F-measure 0.648). Finally, we also demonstrate the importance of detection in our two-phase approach.
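
To make the 2-gram, 3-gram, and 4-gram models more concrete, the sketch below enumerates every contiguous span of 2, 3, or 4 tokens that contains a detected morpheme. It is an illustration under stated assumptions only: the function name `ngram_candidates` and the span representation are hypothetical, and the thesis's classifiers and its overlap and conflict rules are not reproduced.

```python
from typing import List, Sequence, Tuple


def ngram_candidates(
    tokens: List[str], detected: int, sizes: Sequence[int] = (2, 3, 4)
) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of each n-token window
    that lies inside the sentence and contains the detected position."""
    if not 0 <= detected < len(tokens):
        raise IndexError("detected position is outside the sentence")
    spans = []
    for n in sizes:
        # A window of n tokens contains `detected` when it starts at most
        # n - 1 positions before it and no later than the position itself.
        for start in range(detected - n + 1, detected + 1):
            end = start + n
            if start >= 0 and end <= len(tokens):
                spans.append((start, end))
    return spans


# Hypothetical sentence of four tokens with the morpheme detected at index 1.
tokens = ["A", "B", "C", "D"]
for start, end in ngram_candidates(tokens, detected=1):
    print(tokens[start:end])
```

Each candidate span would then be passed to the classifier of the matching size. Because candidates of different sizes overlap, some rule is needed to choose among spans that are accepted at the same time, which is the role the abstract assigns to its overlap and conflict rules.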

Keywords (Chinese)

★ Classification
★ Machine learning
★ Pattern mining
★ Chinese unknown word
★ Imbalanced data

Keywords (English)

★ machine learning
★ classification
★ imbalanced data
★ Chinese unknown word
★ pattern mining

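Table of Contents

Contents
Contents
List of Tables
List of Figures
1. Introduction
2. Related Work
3. Method
3.1 Unknown Word (Monosyllabic Character) Detection
3.2 Unknown Word Extraction
3.2.1 Sequential Data Processing: the Sliding Window Method
3.2.2 Information Used
3.3 Imbalanced Data
3.4 Choosing How to Merge Unknown Words (N-gram Models)
4. Experiments
4.1 Unknown Word Detection
4.2 Unknown Word Extraction
5. Conclusions and Future Work
6. References
List of Tables
Table 1 Words of an example sentence and their corresponding tags
Table 2 Example of sequential data represented with the sliding window method (window size = 3)
Table 3 Definitions of rule form and accuracy
Table 4 Rules generated from "王(Na) 義氣(VH)" and their accuracy
Table 5 Format of n-gram sliding window data
Table 6 Second-phase training data (one sentence)
Table 7 Classification decisions of the model (3-gram sliding window data example)
Table 8 Composition of the second-phase training data
Table 9 A confusion matrix
Table 10 First-phase experimental results (rules filtered with applicability > 2)
Table 11 Results of filtering rules with accuracy = 0.95 plus applicability
Table 12 Second-phase experimental results (under-sampling with an ensemble method)
Table 13 Individual and overall performance of the 12 classification models
Table 14 Second-phase experimental results (cost-sensitive learning with an ensemble method)
Table 15 Comparison of one-phase and two-phase results
List of Figures
Figure 1 Architecture of the two-phase system (detection and extraction)
Figure 2 Architecture of the unknown word extraction system (including some experimental steps)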

Advisor

Chia-Hui Chang (張嘉惠)

Approval date

2009-2-2
