dc.description.abstract | Data mining techniques are widely used across many problem domains. However, problems arise when the collected data contain missing values. Analyzing incomplete data is likely to produce biased results, and most data mining algorithms cannot handle such data directly. Recently, many researchers have proposed new imputation methods based on machine learning techniques, either to impute missing values or to refine the imputation process. Their aims are to reduce error rates, achieve high classification accuracy, or identify which method suits a particular kind of data.
In this thesis, I propose an imputation method that uses the class centers of the data to measure similarity. The method is called Class Center based Missing Value Imputation for Incomplete dataset (CCMVI). In study one and study two, CCMVI, statistical imputation (mean/mode), KNN, and SVM are used to impute incomplete datasets of different data types and domains. To avoid the inconsistency that a single split into 90% training data and 10% testing data could introduce, 10-fold cross-validation is employed. Finally, this thesis evaluates the imputation methods by classification accuracy, error rate, and time efficiency.
The experimental results of study one show that CCMVI achieves higher classification accuracy than the machine learning methods SVM and KNN, while its time efficiency is slightly lower than that of statistical (mean/mode) imputation. Overall, the proposed CCMVI method is suitable for both numerical and mixed datasets. However, the experimental results of study two show that CCMVI is not suitable for numerical datasets from the software engineering field. An investigation into the cause finds that the distribution of the data influences the results. | en_US |
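
The abstract does not spell out the CCMVI algorithm, so the following is only a minimal sketch of the general idea of class-center-based imputation, not the thesis's method. It assumes numerical features, a known class label for every record, missing values marked as `None`, and the class center defined as the per-feature mean of the observed values in that class. The function name `class_center_impute` is hypothetical, and the similarity measurement mentioned in the abstract is not modeled here.

```python
from collections import defaultdict


def class_center_impute(rows, labels):
    """Fill None entries with the mean of the same feature within the same class.

    Assumption: this is a simplified stand-in for class-center imputation,
    not the CCMVI procedure described in the thesis.
    """
    sums = defaultdict(float)   # (label, feature index) -> sum of observed values
    counts = defaultdict(int)   # (label, feature index) -> number of observed values

    # First pass: accumulate per-class, per-feature statistics from observed values.
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not None:
                sums[(label, j)] += value
                counts[(label, j)] += 1

    # Second pass: replace each missing value with its class center component.
    filled = []
    for row, label in zip(rows, labels):
        new_row = []
        for j, value in enumerate(row):
            if value is None and counts[(label, j)] > 0:
                value = sums[(label, j)] / counts[(label, j)]
            # If a class has no observed value for this feature, None is kept.
            new_row.append(value)
        filled.append(new_row)
    return filled


if __name__ == "__main__":
    data = [[1.0, None], [3.0, 4.0], [10.0, 20.0], [None, 40.0]]
    classes = ["a", "a", "b", "b"]
    print(class_center_impute(data, classes))
```

Unlike plain mean imputation, which uses one mean per feature over the whole dataset, this variant uses a separate mean per class, which is one reason the distribution of the data within each class can affect the quality of the imputed values.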