Abstract (English) |
In information retrieval, document clustering is a technique that makes retrieving needed information more efficient. With document clustering, one can efficiently manage all kinds of knowledge and information, which makes it a tool for knowledge management.
Traditionally, document clustering is based on document similarity comparison, where each document is represented in the vector space model with terms as the dimensions. In this approach, documents with the same semantic meaning might be classified as dissimilar because they are described with different words. In this research, we integrate concept extraction with the vector space model for document similarity comparison. We first extract concepts from the documents, then build a vector space model for each document with the extracted concepts as the dimensions. Document similarity comparison is then based on this concept-dimensioned vector space model. We expect the concept-based vector space model to enhance document clustering efficiency.
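The difference between the two representations can be illustrated with a minimal sketch. It is not the method used in this research: the abstract does not say how concepts are extracted, so the hand-made `CONCEPT_OF` lookup table, the function names, and the two toy documents below are all assumptions made purely for illustration.

```python
import math
from collections import Counter

# Hypothetical term-to-concept lookup. The abstract does not describe the
# concept extraction step, so this hand-made table only stands in for it.
CONCEPT_OF = {
    "car": "vehicle",
    "automobile": "vehicle",
    "purchase": "buy",
    "buy": "buy",
}


def term_vector(text):
    """Term-based vector: one dimension per distinct word."""
    return Counter(text.lower().split())


def concept_vector(text):
    """Concept-based vector: one dimension per concept.

    Words with no entry in CONCEPT_OF are skipped.
    """
    words = text.lower().split()
    return Counter(CONCEPT_OF[w] for w in words if w in CONCEPT_OF)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two toy documents with the same meaning but no words in common.
doc_a = "buy car"
doc_b = "purchase automobile"

term_sim = cosine(term_vector(doc_a), term_vector(doc_b))
concept_sim = cosine(concept_vector(doc_a), concept_vector(doc_b))

print("term-based similarity:   ", term_sim)
print("concept-based similarity:", concept_sim)
```

In this toy case the two documents share no dimension in the term-based model, while both map to the same two concepts in the concept-based model, which is the situation the paragraph above describes. A real system would replace the lookup table with an actual concept extraction step and would likely weight the dimensions rather than use raw counts.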
We ran experiments on the document clustering effect of the concept-based vector space model. The results show that the concept-based vector space model can perform better than the term-based vector space model. |