A Study of Measuring Data Similarity for Missing Value Imputation (衡量資料相似度於遺漏值填補之研究)

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/74766

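Title:	A Study of Measuring Data Similarity for Missing Value Imputation (衡量資料相似度於遺漏值填補之研究)
Author:	李妙翎; Li, Miao-Ling
Contributor:	Department of Information Management
Keywords:	Data Preprocessing; Missing Value; Imputation Method; Data Similarity
Date:	2017-07-06
Upload time:	2017-10-27 14:38:45 (UTC+8)
Publisher:	National Central University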
Abstract:	Data mining techniques are widely applied across many domains, but missing values can make analysis impossible or bias its results, so that mining fails to extract useful information. Most data mining algorithms cannot handle incomplete data directly. In recent years, researchers have proposed new imputation methods, applied machine learning algorithms, or refined the process of existing imputation methods. Their aims are to find the imputation method suited to a given domain or data type, to raise imputation accuracy, and to reduce the error between imputed values and the original data. This thesis proposes Class Center based Missing Value Imputation for Incomplete dataset (CCMVI), an imputation method that measures the similarity between data using the class center as a reference. CCMVI is based on statistical methods: it takes into account the class each record belongs to and the similarity between records, and it adjusts the imputed value according to the dispersion of the data. In Study One and Study Two, datasets of different types and from different domains are imputed with CCMVI, a statistical method (mean/mode imputation), k-nearest neighbors (KNN), and support vector machines (SVM). To avoid inconsistent results from a single split into 90% training data and 10% testing data, 10-fold cross-validation is used. The imputation methods are evaluated by classification accuracy, error, and execution time. Study One shows that CCMVI achieves higher classification accuracy than the machine learning methods (KNN and SVM), is slightly slower than the statistical method, and yields error close to that of SVM. Overall, CCMVI is suitable for numerical and mixed-type datasets. In Study Two, however, CCMVI proved unsuitable for a numerical dataset from the software engineering domain. Further investigation of the cause found that the distribution of the data affects which imputation method should be chosen.
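The class-center idea in the abstract can be illustrated with a minimal sketch. It is not the thesis's CCMVI algorithm: the abstract does not specify the similarity measure or the dispersion-based adjustment, so both are omitted, and the function name `class_center_impute` and the toy data are assumptions made for illustration. The sketch fills each missing numeric value with the mean of that feature among records of the same class.

```python
from collections import defaultdict
from statistics import mean


def class_center_impute(rows, labels):
    """Fill None entries with the per-class mean of the same feature.

    Simplified illustration only: the similarity measure and the
    dispersion-based adjustment mentioned in the abstract are not
    specified there, so they are left out of this sketch.
    """
    # Collect the observed values for each (class label, feature index) pair.
    observed = defaultdict(list)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not None:
                observed[(label, j)].append(value)

    # Class center: mean of the observed values of a feature within a class.
    centers = {key: mean(values) for key, values in observed.items()}

    # Replace each missing entry with the matching class-center value.
    # A KeyError is raised if a feature has no observed value in a class.
    return [
        [centers[(label, j)] if value is None else value
         for j, value in enumerate(row)]
        for row, label in zip(rows, labels)
    ]


# Hypothetical toy data: two numeric features, two classes.
rows = [[1.0, 2.0], [3.0, None], [10.0, 20.0], [None, 40.0]]
labels = ["a", "a", "b", "b"]
filled = class_center_impute(rows, labels)
```

Class labels must be known for every record, which matches the abstract's statement that the method considers the class each record belongs to. Categorical features would need a mode rather than a mean.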
Appears in Collections:	[Graduate Institute of Information Management] Master's and Doctoral Theses

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	261	View/Open

All items in NCUIR are protected by original copyright.
