dc.description.abstract | Enterprises increasingly need to extract knowledge from very large databases. However, such datasets usually contain some noisy data, which is likely to degrade data mining performance. In addition, the knowledge discovery in databases (KDD) process is computationally expensive on large-scale datasets.
Instance selection, a widely used data reduction technique, can filter noisy data out of large datasets. However, many existing instance selection algorithms are too slow to be practical on large datasets. We therefore introduce a novel data preprocessing approach called Representative Data Detection (ReDD), which performs the instance selection step on only a small part of the original dataset. A classifier is then trained to learn the representative data identified by the instance selection step. Finally, the trained classifier is used as a detector to detect the noisy data across the whole original dataset.
The thesis contains two experiments in which IB3, DROP3 and GA are used as the baseline instance selection algorithms. In the first experiment, fifty small-scale datasets are used to evaluate ReDD, with SVM, CART, KNN and Naive Bayes built as detectors for comparison. We find that KNN and CART perform best. In the second experiment, the classification accuracy and execution time of ReDD and the baselines are compared on four large-scale datasets (more than one hundred thousand instances). The results show that ReDD greatly reduces execution time compared with traditional instance selection. Moreover, the accuracy rates of ReDD and the baselines do not differ significantly. | en_US |
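
The three ReDD steps described in the abstract (instance selection on a small sample, training a detector on its output, applying the detector to the full dataset) can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several parts are assumptions made for illustration: the function names (`select_instances`, `redd`) and all parameters are hypothetical, a simple nearest-neighbour editing rule stands in for IB3, DROP3 or GA, the detector is a plain KNN vote, and the detector input is assumed to be the feature vector with the class label appended.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=3, skip=None):
    """Majority vote among the k nearest training points (index `skip` is ignored)."""
    nearest = sorted(
        (i for i in range(len(train_X)) if i != skip),
        key=lambda i: sq_dist(train_X[i], query),
    )[:k]
    votes = [train_y[i] for i in nearest]
    return max(sorted(set(votes)), key=votes.count)

def select_instances(X, y, k=3):
    """Stand-in instance selection (NOT IB3/DROP3/GA): an instance is flagged
    1 (representative) if its k nearest neighbours agree with its label, else 0 (noisy)."""
    return [1 if knn_predict(X, y, X[i], k, skip=i) == y[i] else 0
            for i in range(len(X))]

def redd(X, y, sample_size, select_k=3, detect_k=1, seed=0):
    """Return the indices of the instances in (X, y) that the detector keeps."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(X)), sample_size)   # step 1: small random sample
    Xs = [X[i] for i in idx]
    ys = [y[i] for i in idx]
    flags = select_instances(Xs, ys, select_k)     # instance selection on the sample only
    # Step 2: detector training data = features plus class label (an assumption).
    Ds = [list(x) + [c] for x, c in zip(Xs, ys)]
    # Step 3: the KNN detector labels every instance of the full dataset.
    return [i for i in range(len(X))
            if knn_predict(Ds, flags, list(X[i]) + [y[i]], detect_k) == 1]

if __name__ == "__main__":
    # Two well-separated clusters plus one mislabelled point (toy data).
    X = [(0.01 * i, 0.02 * i) for i in range(50)]
    X += [(10 + 0.01 * i, 10 + 0.02 * i) for i in range(50)]
    y = [0] * 50 + [1] * 50
    X.append((0.1, 0.1))
    y.append(1)
    kept = redd(X, y, sample_size=30)
    print(f"kept {len(kept)} of {len(X)} instances")
```

The time saving claimed in the abstract would come from running the comparatively expensive selection step on `sample_size` instances only, while the full dataset sees just one detector pass. In this sketch the detector is itself a brute-force KNN, so it shows the data flow, not the speed-up.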