Abstract (English) |
Chinese word segmentation is one of the major preprocessing steps in Chinese text processing. Because written Chinese does not mark word boundaries, the main goal of word segmentation is to identify words. Segmentation faces two major problems: ambiguity and unknown (out-of-vocabulary) words. In this paper, we focus on the Chinese unknown word problem and solve it with a two-phase approach: the first phase detects unknown words and the second phase extracts them. In the detection phase, we apply continuity pattern mining to derive a set of rules from a corpus, based on a more complete set of pattern types. These rules determine whether a Chinese character is a monosyllabic word or part of an unknown word. In the extraction phase, we use machine learning algorithms to determine whether a detected morpheme should be merged with adjacent words to form an unknown word. The features in our classification model are based on syntactic, contextual, and statistical information. We construct three classification models (2-gram, 3-gram, and 4-gram), with rules to resolve overlap and conflict problems. We use the Academia Sinica Balanced Corpus as our experimental data. Without much help from hand-crafted rules, our experimental results (F-measure 0.657) are comparable to those of Academia Sinica (F-measure 0.648). Finally, we also demonstrate the importance of the detection phase in our two-phase approach.
|