Abstract (English) |
Advanced Natural Language Processing (NLP) tasks today include event extraction, event classification, and automatic text summarization. Most NLP techniques for classical Chinese, however, are still at an early stage, covering basic tasks such as sentence segmentation, word segmentation, and named entity recognition. These basic tasks usually rely on supervised learning, and annotating their training data is time-consuming because few people can read classical Chinese. As a result, advanced NLP for classical Chinese is difficult to develop. The word is the basic unit of most languages, so the accuracy of word segmentation directly affects the performance of advanced NLP applications. We therefore develop a word segmentation system for classical Chinese. Unlike traditional word segmentation methods, our system does not require training data.
This thesis focuses on applying active learning to word segmentation of historical texts, and we apply the algorithm to the MING SHILU. We use active learning because it can significantly reduce annotation effort. We also mitigate a disadvantage of unsupervised models, namely that they need large amounts of data to achieve satisfactory accuracy.
|
References |
1. Kotsiantis, S.B., I. Zaharakis, and P. Pintelas, Supervised machine learning: A review of classification techniques. 2007.
2. Li, S. and C.-R. Huang. Word Boundary Decision with CRF for Chinese Word Segmentation. in PACLIC. 2009.
3. Feng, H., et al. Unsupervised Segmentation of Chinese Corpus Using Accessor Variety. in IJCNLP. 2004. Springer.
4. Jin, Z. and K. Tanaka-Ishii. Unsupervised segmentation of Chinese text by use of branching entropy. in Proceedings of the COLING/ACL on Main conference poster sessions. 2006. Association for Computational Linguistics.
5. Magistry, P. and B. Sagot. Unsupervized word segmentation: the case for mandarin chinese. in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2. 2012. Association for Computational Linguistics.
6. Wang, H., et al., A new unsupervised approach to word segmentation. Computational Linguistics, 2011. 37(3): p. 421-454.
7. Shannon, C., A mathematical theory of communication. Bell System Technical Journal, 1948. 27: p. 379-423, 623-656.
8. Peng, F., F. Feng, and A. McCallum. Chinese segmentation and new word detection using conditional random fields. in Proceedings of the 20th international conference on Computational Linguistics. 2004. Association for Computational Linguistics.
9. Purandare, A. and T. Pedersen. Word sense discrimination by clustering contexts in vector and similarity spaces. in Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004. 2004.
10. Mikolov, T., et al. Distributed representations of words and phrases and their compositionality. in Advances in neural information processing systems. 2013.
11. Mikolov, T., et al., Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
|