dc.description.abstract | Enterprises increasingly need to extract knowledge from very large databases. However, such datasets usually contain a certain amount of noisy data, which is likely to degrade data mining performance. In addition, processing large-scale datasets is usually computationally expensive.
Instance selection, a widely used data reduction approach, can filter noisy data out of large datasets. However, different instance selection algorithms applied to datasets from different domains filter out different data, which is likely to result in over- or under-selection because there is no exact definition of an outlier. The quality of data mining results can therefore be affected. This thesis proposes a new data pre-processing approach, TSHLA (Two-Stage Hybrid Learning Approach), for effective data classification. First, instance selection is performed over a given training dataset to separate it into noisy and non-noisy data, and each subset is used to train its own SVM classifier. Then, KNN is used to compare the similarity of each testing sample to the noisy and non-noisy training data. As a result, the noisy and non-noisy testing sets are identified and fed into their corresponding SVM classifiers for classification.
There are two experimental studies in this thesis, and three instance selection algorithms are used for comparison: IB3, DROP3, and GA. The first study is based on 50 small UCI datasets, and the second on large-scale datasets containing more than 100,000 data samples. In addition, the proposed TSHLA is compared with a baseline without instance selection and with one based on the conventional instance selection approach.
| en_US |
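The two-stage routing described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis implementation: an ENN-style neighbour-agreement filter stands in for IB3, DROP3, or GA, a nearest-centroid classifier stands in for the SVM classifiers, and every name here (`knn_label`, `enn_split`, `NearestCentroid`, `tshla_predict`) and the toy data are hypothetical.

```python
from collections import Counter
from math import dist


def knn_label(x, points, labels, k=3):
    """Majority label among the k points nearest to x (Euclidean distance)."""
    nearest = sorted(range(len(points)), key=lambda i: dist(x, points[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


def enn_split(X, y, k=3):
    """ASSUMPTION: ENN-style stand-in for IB3/DROP3/GA.
    A sample is 'noisy' when its k nearest neighbours disagree with its label."""
    noisy, clean = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        (clean if knn_label(x, rest_X, rest_y, k) == label else noisy).append(i)
    return noisy, clean


class NearestCentroid:
    """ASSUMPTION: simple stand-in for an SVM classifier."""

    def fit(self, X, y):
        groups = {}
        for x, label in zip(X, y):
            groups.setdefault(label, []).append(x)
        self.centroids = {
            label: [sum(col) / len(pts) for col in zip(*pts)]
            for label, pts in groups.items()
        }
        return self

    def predict_one(self, x):
        return min(self.centroids, key=lambda c: dist(x, self.centroids[c]))


def tshla_predict(X_train, y_train, X_test, k=3,
                  select=enn_split, make_clf=NearestCentroid):
    noisy_idx, clean_idx = select(X_train, y_train, k)
    noisy_set = set(noisy_idx)
    # Stage 1: train one classifier per subset (noisy / non-noisy).
    models = {}
    for name, idx in (("noisy", noisy_idx), ("clean", clean_idx)):
        if idx:
            models[name] = make_clf().fit([X_train[i] for i in idx],
                                          [y_train[i] for i in idx])
    # Stage 2: KNN over the subset tags decides which classifier handles
    # each test sample; a tag can only win if its subset is non-empty.
    tags = ["noisy" if i in noisy_set else "clean" for i in range(len(X_train))]
    return [models[knn_label(x, X_train, tags, k)].predict_one(x) for x in X_test]


# Toy data: two clusters plus one point labelled against its neighbourhood.
X_train = [(0, 0), (0, 1), (1, 0), (1, 1),
           (5, 5), (5, 6), (6, 5), (6, 6), (0.5, 0.5)]
y_train = [0, 0, 0, 0, 1, 1, 1, 1, 1]
predictions = tshla_predict(X_train, y_train, [(0.2, 0.2), (5.5, 5.5)])
```

Passing `select` and `make_clf` as parameters keeps the routing logic separate from the particular instance selection algorithm and classifier, so real implementations could be swapped in for the stand-ins.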