dc.description.abstract | Clustering algorithms are effective tools for exploring the structure of complex data sets and are therefore of great value in a number of applications. For most clustering algorithms, two crucial problems must be solved:
(1) determining the optimal number of clusters, and
(2) determining the similarity measure by which patterns are assigned to their corresponding clusters.
Estimating the number of clusters in a data set is the so-called cluster validity problem. Conventional approaches to this problem usually involve increasing the number of clusters and/or merging existing clusters, computing certain cluster validity measures in each run, until a partition into the optimal number of clusters is obtained. Since most validity measures assume a certain geometrical structure in cluster shapes, these approaches fail to estimate the correct number of clusters in real data with a large variety of distributions within and between clusters. The second crucial problem faces a similar situation. While it is easy to consider the idea of a data cluster on a rather informal basis, it is very difficult to give a formal and universal definition of a cluster. Most conventional clustering methods assume that patterns having similar locations or constant density form a single cluster. To identify clusters in a data set mathematically, it is usually necessary to first define a measure of similarity or proximity that establishes a rule for assigning patterns to the domain of a particular cluster center. As is to be expected, the measure of similarity is problem dependent; that is, different similarity measures lead to different clustering results.
In this paper, we propose a hierarchical approach to an ART-like clustering algorithm that is able to deal with data consisting of clusters of arbitrary geometrical shape. Combining hierarchical and ART-like clustering is suggested as a natural and feasible solution to the two problems of determining the number of clusters and clustering the data. | en_US |