衡量資料相似度於遺漏值填補之研究 (A Study on Measuring Data Similarity for Missing Value Imputation)

DC Field	Value	Language
DC.contributor	資訊管理學系 (Department of Information Management)	zh_TW
DC.creator	李妙翎	zh_TW
DC.creator	Miao-Ling Li	en_US
dc.date.accessioned	2017-07-06T07:39:07Z
dc.date.available	2017-07-06T07:39:07Z
dc.date.issued	2017
dc.identifier.uri	http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=104423012
dc.contributor.department	資訊管理學系 (Department of Information Management)	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
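dc.description.abstract	Data mining techniques are increasingly applied across many fields, but missing values can make data impossible to analyze or bias the results, so that mining fails to extract useful information. In recent years, researchers have continued to propose new methods, adopt machine learning algorithms, or improve the workflow of existing imputation methods in order to fill in missing values. The aim is to find imputation methods suited to different domains or data types, or to raise imputation accuracy and reduce the error between predicted values and the original data. This study proposes an imputation method that measures the similarity between data using the data center as a reference (Class Center based Missing Value Imputation for Incomplete dataset, CCMVI). The algorithm is based on statistical methods, takes into account the class each instance belongs to and the similarity between instances, and adjusts the imputed value according to the dispersion of the data. Experiments one and two use datasets of different types and from different domains, with missing values imputed by CCMVI, a statistical method, the k-nearest neighbors (KNN) algorithm, and the support vector machine (SVM) algorithm. Classification accuracy, error, and execution time are then used to measure the effectiveness of each imputation method. Experiment one shows that CCMVI achieves higher classification accuracy than the machine learning algorithms, is slightly slower than the statistical method, and yields error values that differ little from those of SVM. Overall, numerical and mixed-type data are suited to CCMVI imputation. However, the numerical data used in experiment two, which come from the software engineering domain, are not suited to CCMVI. The cause was therefore investigated further, and the distribution of the data was found to affect the choice of imputation method.	zh_TW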
dc.description.abstract	Data mining techniques have been widely applied to problems in many domains. However, problems arise when the collected data contain missing values: using incomplete data is likely to produce biased results, and most data mining algorithms cannot handle such data directly. Recently, many researchers have proposed new imputation methods, applied machine learning techniques to imputation, or modified the imputation process. They aim to find methods that reduce error rates and achieve high classification accuracy, or to identify which methods suit particular kinds of data. In this thesis, I propose an imputation method that measures the similarity between data using class centers as a reference. The method is called Class Center based Missing Value Imputation for Incomplete dataset (CCMVI). In study one and study two, CCMVI, a statistical method (mean/mode imputation), KNN, and SVM are used to impute incomplete datasets of different data types and domains. To avoid inconsistency arising from a single split into 90% training data and 10% testing data, the evaluation is repeated using 10-fold cross-validation. Finally, this thesis uses classification accuracy, error rates, and time efficiency to evaluate the imputation methods. The results of study one show that CCMVI achieves higher classification accuracy than the machine learning methods, SVM and KNN, while its time efficiency is slightly lower than that of the statistical method. Overall, both numerical and mixed datasets are suitable for the proposed CCMVI method. However, the results of study two show that a numerical dataset from the software engineering field is not suitable for CCMVI. Further investigation into the cause finds that the distribution of the data influences the results.	en_US
DC.subject	資料前處理	zh_TW
DC.subject	遺漏值	zh_TW
DC.subject	補值方法	zh_TW
DC.subject	資料相似性	zh_TW
DC.subject	Data Preprocessing	en_US
DC.subject	Missing Value	en_US
DC.subject	Imputation Method	en_US
DC.subject	Data Similarity	en_US
DC.title	衡量資料相似度於遺漏值填補之研究 (A Study on Measuring Data Similarity for Missing Value Imputation)	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis 104423012: full metadata record