Abstract (English) |
The Internet has driven the development of information technology and communication protocols, and the World Wide Web (WWW), an information-sharing model built on top of the Internet, has become a popular platform for information exchange that is not confined by time or space. With many capable search engines now available, WWW users no longer face the problem of how to obtain information from vast amounts of data, but rather how to manage and filter it. Because people generally do not have time to analyze such immense data, data mining, a technology for quickly and effectively extracting requested information from large volumes of data, has become very important. Data mining research has produced many techniques for filtering useful information out of large volumes of data, and document clustering is one of the most important. There are two approaches to document clustering: one clusters by document metadata, and the other by document content. Most previous content-based algorithms focus on document summaries (of a single file or of multiple files) and on word-vector analysis of documents, finding a small number of important keywords on which to cluster. In this study, we categorize popular goods and services and name the categories according to their access counts K(d,i) and the words in the abstracts of those goods and services. First, we segment the Chinese words in the abstract documents of the goods or services and apply a hierarchical agglomerative clustering method, the Ward method, to group the words into themes by their properties and to decide the number of themes K. Second, we adopt the TTM (Temporal Text Mining) Cross-Collection Mixture Model to collect and use dynamic themes, and gather stable words by probability distribution to serve as the vectors for document clustering. This study proposes a novel approach to document clustering.
The approach uses the correlation of words in the document contents, the popularity (access count) of words recognized by users, and the extracted dynamic themes as the features for document clustering. A series of experiments evaluated by F-measure and error rate shows that the algorithm is effective and can improve the accuracy of clustering results.
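To make the F-measure evaluation mentioned above more concrete, the following is a minimal Python sketch. It assumes the class-weighted clustering F-measure that is common in document clustering work: for each reference class, take the best F score over all clusters, then weight by class size. The abstract does not state which variant the study uses, so this is an illustration only; the function name `clustering_f_measure` and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted F-measure of a clustering against reference classes.

    Assumed definition (not confirmed by the abstract):
    F = sum over classes i of (n_i / n) * max over clusters j of F(i, j),
    where F(i, j) is the harmonic mean of precision and recall.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label lists must have the same length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("at least one document is required")

    class_sizes = Counter(true_labels)                # n_i
    cluster_sizes = Counter(cluster_ids)              # n_j
    overlap = Counter(zip(true_labels, cluster_ids))  # n_ij

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue  # no shared documents, F(i, j) is 0
            precision = n_both / n_clu
            recall = n_both / n_cls
            f_score = 2 * precision * recall / (precision + recall)
            best = max(best, f_score)
        total += (n_cls / n) * best
    return total


# Hypothetical toy data: six documents, two reference classes, two clusters.
true_labels = ["food", "food", "food", "travel", "travel", "travel"]
cluster_ids = [0, 0, 1, 1, 1, 1]
score = clustering_f_measure(true_labels, cluster_ids)
print(f"F-measure: {score:.4f}")
```

A score of 1.0 means every reference class is reproduced exactly by some cluster; lower values indicate that classes are split or merged.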
|