Chinese Word Segmentation using Specialized HMM


Name

林千翔 (Qian-Xiang Lin)

Department

Department of Computer Science and Information Engineering

Thesis Title

Chinese Word Segmentation using Specialized HMM
(基於特製隱藏式馬可夫模型之中文斷詞研究)



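This electronic thesis is authorized for immediate open access.
Electronic full texts that have reached their release date are licensed only for users to search, read, and print for personal, non-profit academic research purposes.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.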

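Abstract (Chinese)

Word segmentation is a basic yet essential preprocessing step in Chinese natural language processing. The field has been studied for decades and many segmentation algorithms have been proposed, yet research on the problem continues and is receiving growing attention. Recent segmentation systems tend to use statistical machine learning algorithms, such as the hidden Markov model (HMM). However, a standard HMM reaches an F-measure of only about 80% on Chinese word segmentation, so many studies rely on external resources or combine the HMM with other machine learning algorithms. The goal of this study is to raise the accuracy of the HMM with the simplest possible method and without any external resources. Our approach applies the concept of specialization to bring information about segmentation ambiguity and unknown words into the HMM. Without modifying the model's training or testing procedures at all, a two-stage specialization that extends first the observation symbols and then the state symbols substantially improves the segmentation accuracy of the HMM. In the first stage, we combine the maximum matching method (longest word first) with a masking method to bring ambiguity and unknown-word information into the HMM, giving the model more segmentation information to learn from. Experiments show that combining the HMM with this simplest of segmentation methods raises the F-measure from 0.812 to 0.953. In the second stage, we use lexicalization to add new state symbols for high-frequency observation symbols and for observation symbols with high tagging error. Experiments show that this stage improves the system further, raising the F-measure from 0.953 to 0.963.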
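The first-stage idea of extending the observation symbols can be illustrated with a minimal sketch. It assumes forward maximum matching against a toy word list and assumes that each character is paired with the BIES tag (begin, inside, end, single) implied by that matching. The function names, the `max_len` parameter, and the toy lexicon are hypothetical, and the masking step for unknown words is not shown, since the abstract does not give its details.

```python
def forward_maximum_match(sentence, lexicon, max_len=4):
    """Greedy longest-word-first segmentation against a word list."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def to_bies(words):
    """Map each character to B (begin), I (inside), E (end) or S (single)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def extended_observations(sentence, lexicon):
    """Pair every character with the BIES tag implied by maximum matching.

    Assumption: this pairing is only one plausible way to build an
    extended observation symbol; the thesis may encode it differently.
    """
    words = forward_maximum_match(sentence, lexicon)
    return list(zip(sentence, to_bies(words)))


# Hypothetical toy lexicon for illustration only.
toy_lexicon = {"中文", "斷詞", "研究", "研究生"}
print(forward_maximum_match("中文斷詞研究", toy_lexicon))
print(extended_observations("中文斷詞研究", toy_lexicon))
```

Because the extra information lives in the observation symbols, an ordinary HMM trainer and decoder can be reused unchanged, which matches the abstract's constraint of not modifying the training or testing procedures.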

Abstract (English)

The first step in Chinese language processing tasks is word segmentation. Previous studies have proposed various methods for this problem, including heuristic and statistical approaches. The hidden Markov model (HMM) is a statistical machine learning approach that has been applied successfully in many fields, such as POS tagging and shallow parsing. However, we find that a standard HMM achieves an F-measure of only about 80% on Chinese word segmentation. Segmentation ambiguity and unknown words are the two main problems in Chinese word segmentation. In this thesis, we propose a two-stage specialized HMM that incorporates this information into the model. In the first stage, we use the maximum matching heuristic to incorporate segmentation ambiguity and a masking approach to handle unknown-word information. By extending the observation symbols, the proposed M-HMM raises the F-measure from 0.812 to 0.953. In the second stage, we use a lexicalization technique to improve HMM performance further. The idea is to add new state symbols for high-frequency characters and for symbols with high tagging error. Experimental results show that the Lexicalized M-HMM raises the F-measure from 0.953 to 0.963.
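The second-stage idea of adding state symbols can be sketched as a relabeling of the training data. The sketch below assumes that the training data is a list of (character, BIES tag) pairs and that a character is selected purely by a frequency threshold. The function name `lexicalize_states`, the `min_count` parameter, and the `TAG-character` naming are hypothetical, and selection by tagging error is not shown.

```python
from collections import Counter


def lexicalize_states(tagged_sentences, min_count=2):
    """Give frequent characters their own states, e.g. S becomes S-的.

    Assumption: frequency alone decides which characters are lexicalized.
    """
    counts = Counter(
        char for sentence in tagged_sentences for char, _ in sentence
    )
    frequent = {char for char, n in counts.items() if n >= min_count}
    return [
        [
            (char, f"{tag}-{char}" if char in frequent else tag)
            for char, tag in sentence
        ]
        for sentence in tagged_sentences
    ]


# Hypothetical toy training data: (character, BIES tag) pairs.
training = [
    [("我", "S"), ("的", "S"), ("研", "B"), ("究", "E")],
    [("他", "S"), ("的", "S"), ("目", "B"), ("的", "E")],
]
lexicalized = lexicalize_states(training, min_count=3)
print(lexicalized[1])
```

Only the state inventory grows, so the relabeled data can be passed to a standard supervised HMM trainer; mapping a lexicalized state back to its BIES tag after decoding only requires dropping the suffix.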

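Table of Contents

Contents
List of Figures
List of Tables
1. Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Chapter Overview
2. Related Work on Chinese Word Segmentation
2.1 Resolving Ambiguity
2.1.1 Academia Sinica Heuristic Rules
2.1.2 Resolving Overlapping and Combinatorial Ambiguity
2.2 Handling Unknown Words
2.3 Recent Segmentation Research
3. Hidden Markov Models
3.1 Markov Chains
3.2 Theory and Parameters of Hidden Markov Models
3.3 Supervised Training
3.4 Unsupervised Training
3.5 Testing
4. System Architecture
4.1 Maximum Matching (Longest Word First)
4.2 BIES Classification Problem
4.3 Specialized Hidden Markov Model
4.4 M-HMM
4.5 Lexicalized M-HMM
5. Experiments
5.1 Experimental Data and Evaluation
5.2 Experimental Settings and Results
5.2.1 M-HMM Experiments (Experiments 1, 2, and 3)
5.2.4 Lexicalized M-HMM Experiments (Experiments 4 and 5)
6. Conclusion
References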


Advisor

張嘉惠 (Chia-Hui Chang)

Approval Date

2006-7-20
