

    Please use this persistent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/67584


    Title: The Effect of Instance Selection on Missing Value Imputation
    Authors: 李昀潔; Li, Yun-Jie
    Contributors: Department of Information Management (資訊管理學系)
    Keywords: Data Mining; Instance Selection Methods; Imputation Methods; Machine Learning; Classification
    Date: 2015-06-22
    Uploaded: 2015-07-30 22:41:50 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract (translated from the Chinese): The missing value problem is common in data mining. Whether the cause is a data-entry error or an incorrect data format, missing values prevent the available data from being used effectively to build a suitable classification model. Imputation methods were developed for this problem: they analyze the existing data and fill in suitable values, so that the completed data can be used for model construction.
    However, the available data may not provide effective input for imputation, because they often contain problems of their own, such as noisy data, redundant data, or many unrepresentative instances. To make effective use of the available data, instance selection methods address these problems by keeping only the representative instances. In other words, instance selection applies a series of selection criteria to produce a reduced dataset composed of representative instances, and the imputation method then imputes from this reduced dataset, so that the problems contained in the original data do not degrade the imputation results.
    This thesis investigates the effect of instance selection on missing value imputation. Experiments are conducted on 33 datasets from the UCI repository, grouped into three types (categorical, mixed, and numerical). Three instance selection methods, IB3 (Instance-Based learning), DROP3 (Decremental Reduction Optimization Procedure), and GA (Genetic Algorithm), and three imputation methods, KNNI (K-Nearest Neighbor Imputation), SVM (Support Vector Machine), and MLP (Multilayer Perceptron), are examined to determine under what conditions which of the nine combinations (three instance selection methods paired with three imputation methods) performs best or is most suitable, and whether the combined approach is more effective than imputation alone.
    Based on the results of this study, we suggest that for numerical datasets the pipeline of instance selection followed by imputation is more suitable than imputation alone. Among the instance selection methods, DROP3 is recommended for numerical and mixed datasets, but it is not recommended for categorical datasets with a large number of classes. Between the other two instance selection methods, GA and IB3, we recommend GA, because our experiments show that its instance selection performance is more stable than that of IB3.
    Abstract (original English): In data mining, the collected datasets are usually incomplete; that is, some attribute values are missing. It is difficult to effectively develop a learning model from incomplete datasets. In the literature, missing value imputation is applied to this problem: its aim is to estimate the missing values from the (observed) complete data samples.
    However, some of the complete data may contain some noisy information, which can be regarded as outliers. If these noisy data were used for missing value imputation, the quality of the imputation results would be affected. To solve this problem, we propose to perform instance selection over the complete data before the imputation step. The aim of instance selection is to filter out some unrepresentative data from a given dataset. Therefore, this research focuses on examining the effect of performing instance selection on missing value imputation.
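    The "instance selection before imputation" pipeline described above can be pictured concretely. Below is a minimal sketch in Python, assuming scikit-learn is available: since IB3, DROP3, and GA are not provided by scikit-learn, a simple leave-one-out nearest-neighbour filter stands in for instance selection, and KNNImputer stands in for KNNI; all data and variable names are illustrative only, not the thesis's actual implementation.

```python
# Minimal sketch: instance selection over the complete data, then imputation.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier

def select_instances(X, y, k=3):
    """Keep instances whose label agrees with their k nearest neighbours
    (a crude stand-in for IB3/DROP3/GA instance selection)."""
    keep = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i              # leave instance i out
        knn = KNeighborsClassifier(n_neighbors=k).fit(X[mask], y[mask])
        if knn.predict(X[i:i + 1])[0] == y[i]:     # representative -> keep
            keep.append(i)
    return np.array(keep)

# Synthetic data: complete rows (no NaNs) and incomplete rows (with NaNs).
rng = np.random.default_rng(0)
X_complete = rng.normal(size=(100, 4))
y_complete = (X_complete[:, 0] > 0).astype(int)
X_incomplete = rng.normal(size=(20, 4))
X_incomplete[rng.random((20, 4)) < 0.2] = np.nan

# Step 1: instance selection over the complete data only.
kept = select_instances(X_complete, y_complete)

# Step 2: impute the missing values from the reduced, representative set.
imputer = KNNImputer(n_neighbors=5).fit(X_complete[kept])
X_filled = imputer.transform(X_incomplete)
```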
    The experimental setup is based on 33 UCI datasets, which cover categorical, numerical, and mixed types of data. In addition, three instance selection methods, IB3 (Instance-based learning), DROP3 (Decremental Reduction Optimization Procedure), and GA (Genetic Algorithm), are used for comparison. Similarly, three imputation methods, KNNI (K-Nearest Neighbor Imputation method), SVM (Support Vector Machine), and MLP (Multilayer Perceptron), are also employed individually. The comparative results allow us to understand which combination of instance selection and imputation methods performs best, and whether combining instance selection with missing value imputation is a better choice than performing missing value imputation alone on incomplete datasets.
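    The comparison grid can also be sketched in code. The snippet below continues the previous sketch (reusing select_instances, X_complete, y_complete, and X_incomplete) and uses scikit-learn stand-ins: KNNImputer for KNNI, and IterativeImputer wrapped around SVR and MLPRegressor as rough analogues of SVM- and MLP-based imputation. It only illustrates the structure of the experiment, not the thesis's actual methods or evaluation.

```python
# Structural sketch of the comparison grid: each selection setting ("none"
# = imputation alone) is paired with each imputation method.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

selectors = {"none": None, "simple-filter": select_instances}  # stand-in for IB3/DROP3/GA
imputers = {
    "KNNI": KNNImputer(n_neighbors=5),
    "SVM": IterativeImputer(estimator=SVR(), max_iter=5),
    "MLP": IterativeImputer(estimator=MLPRegressor(max_iter=500), max_iter=5),
}

results = {}
for s_name, select in selectors.items():
    kept = select(X_complete, y_complete) if select else slice(None)
    for i_name, imputer in imputers.items():
        imputer.fit(X_complete[kept])                 # fit on (reduced) complete data
        results[(s_name, i_name)] = imputer.transform(X_incomplete)
```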
    According to the results of this research, we suggest that combining instance selection with imputation is more suitable than using the imputation methods alone over numerical datasets. In particular, the DROP3 instance selection method is more suitable for numerical and mixed datasets, but it is not recommended for categorical datasets, especially when the number of features is large. For the other two instance selection methods, the GA method provides more stable reduction performance than IB3.
    Appears in Collections: [Graduate Institute of Information Management] Master's and Doctoral Theses
