Abstract (English)
Clustering analysis is an important task in data mining. Clustering techniques group similar objects together, which helps in organizing and managing them. However, most of these techniques share two shortcomings: (1) they cannot predict the cluster of a new object, and (2) they cannot give a clear semantic description of each cluster.
In [Liu et al., 3], a decision tree called CLTree is built, based on decision trees from classification, to represent a clustering result. That technique uses the same attribute set both for partitioning the dataset and for constructing the decision tree. In practice, however, these two attribute sets may differ. [Lee, 1] proposed an improved technique, the Semantic Tree, which allows different attribute sets for clustering and for partitioning, giving the technique a wider range of applications.
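The core idea shared by these tree-based techniques is to fit a split on cluster labels so that each cluster receives a readable rule. The following is a minimal sketch of that idea using a single-attribute decision stump; it is a hypothetical simplification for illustration, not the actual CLTree or Semantic Tree construction described in [3] and [1].

```python
# Sketch: derive a readable rule that separates two clusters by
# choosing the single-attribute threshold with the fewest
# misclassifications (a one-level decision stump).

def best_split(values, labels):
    """Return (threshold, errors) for the split that best
    separates the cluster labels on one numeric attribute."""
    pairs = sorted(zip(values, labels))
    best = (None, len(labels) + 1)
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        # Assign each side its majority label; count the misfits.
        err = (len(left) - max(left.count(l) for l in set(left)) +
               len(right) - max(right.count(l) for l in set(right)))
        if err < best[1]:
            best = (t, err)
    return best

t, err = best_split([1.0, 1.2, 0.9, 5.1, 5.3], [0, 0, 0, 1, 1])
# Yields a rule of the form: "attribute <= t -> cluster 0, else cluster 1".
```

A full tree repeats this search recursively over all attributes, so each leaf corresponds to a cluster and the path to it reads as a semantic description.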
A drawback of the above two techniques is that both are density-based, i.e., they apply only to numerical attributes. This is fatal when we want to cluster categorical datasets. In this thesis, we develop a new technique based on the k-nearest-neighbor graph, which allows both numerical and categorical attributes. The technique retains the convenience of unsupervised learning while providing the predictive ability of decision trees.
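To make the mixed-attribute setting concrete, the following sketch builds a k-nearest-neighbor graph over records with both numerical and categorical fields. The distance measure here (range-normalized difference for numeric fields plus a 0/1 mismatch for categorical ones, a Gower-style measure) and the brute-force construction are assumptions for illustration; the thesis's exact measure and algorithm may differ.

```python
# Sketch: k-nearest-neighbor graph over mixed-attribute records.

def mixed_distance(a, b, numeric_idx, ranges):
    """Average per-attribute distance: normalized absolute
    difference for numeric fields, 0/1 mismatch otherwise."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_idx:
            r = ranges[i]
            total += abs(x - y) / r if r else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)

def knn_graph(records, k, numeric_idx):
    # Pre-compute the value range of each numeric attribute
    # so numeric differences fall in [0, 1].
    ranges = {}
    for i in numeric_idx:
        col = [r[i] for r in records]
        ranges[i] = max(col) - min(col)
    graph = {}
    for i, a in enumerate(records):
        dists = [(mixed_distance(a, b, numeric_idx, ranges), j)
                 for j, b in enumerate(records) if j != i]
        dists.sort()
        graph[i] = [j for _, j in dists[:k]]  # k nearest neighbors
    return graph

data = [(1.0, 'red'), (1.1, 'red'), (5.0, 'blue'), (5.2, 'blue')]
g = knn_graph(data, k=1, numeric_idx={0})
```

Clusters then correspond to densely connected regions of this graph, so no density estimate over a purely numeric space is needed.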
References
[1] Lee (李育璇), A Clustering Algorithm Capable of Semantic Description (in Chinese), Master's thesis, Graduate Institute of Information Management, National Central University, June 2003.
[2] A.K. Jain, M.N. Murty, and P.J. Flynn, Data clustering: a review, ACM Computing Surveys, 31(3):264-323, 1999.
[3] B. Liu, Y. Xia, and P. Yu, Clustering through decision tree construction, In SIGMOD-00, 2000.
[4] C.H. Cheng, A.W. Fu, and Y. Zhang, Entropy-based subspace clustering for mining numerical data, KDD-99, 84-93, 1999.
[5] F. Giannotti, C. Gozzi and G. Manco, Clustering Transactional Data, SEBD 2001.
[6] G. Karypis, E.-H. Han, and V. Kumar, CHAMELEON: Hierarchical clustering using dynamic modeling, IEEE Computer, 1999.
[7] G. Salton, Automatic text processing: the transformation, analysis and retrieval of information by computer, Addison Wesley, 1989.
[8] H. Ralambondrainy, A Conceptual Version of the K-Means Algorithm, Pattern Recognition Letters, 16, pp.1147-1157, 1995.
[9] S. Hirano, X. Sun, and S. Tsumoto, Comparison of clustering methods for clinical databases, Information Sciences, 159(3-4):155-165, February 2004.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
[11] J.R. Quinlan, C4.5 : Programs for Machine Learning, Morgan Kaufmann, 1993.
[12] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, Clustering algorithms and validity measures, In Proceedings of the Thirteenth International Conference on Scientific and Statistical Database Management (SSDBM'01), pp. 3-22, 2001.
[13] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-Interscience, 2002.
[14] P. Berkhin, Survey of clustering data mining techniques, Technical Report, Accrue Software, 2002.
[15] S. Guha, R. Rastogi, and K. Shim, CURE: An efficient clustering algorithm for large databases, Information Systems, 2001.
[16] S. Guha, R. Rastogi, and K. Shim, ROCK: A clustering algorithm for categorical attributes, Information Systems, 2000.