The Effect of Instance Selection on Missing Value Imputation

DC Field	Value	Language
DC.contributor	Department of Information Management	zh_TW
DC.creator	李昀潔	zh_TW
DC.creator	Yun-Jie Li	en_US
dc.date.accessioned	2015-06-22T07:39:07Z
dc.date.available	2015-06-22T07:39:07Z
dc.date.issued	2015
dc.identifier.uri	http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=102423002
dc.contributor.department	Department of Information Management	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
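dc.description.abstract	The missing value problem is pervasive in data mining. Whether it stems from data entry errors or data format errors, missing values prevent the available data from being used effectively to build a suitable classification model. Imputation methods were developed in response: they analyze the existing data and fill in suitable values, giving the modeling step appropriate data to work with. However, the existing data may not support effective imputation, because it often has problems of its own, such as noisy data, redundancy, or many unrepresentative instances. To make effective use of the existing data for imputation, instance selection methods address these problems by selecting the representative data. In other words, instance selection applies a series of selection criteria to produce a reduced dataset composed of representative instances, and the imputation method then works from this reduced dataset, so that problems in the original data do not degrade the imputation. This thesis examines the effect of instance selection on imputation. The experiments use 33 datasets from the UCI repository, grouped into three types (categorical, mixed, and numerical), with three instance selection methods, IB3 (Instance-based learning), DROP3 (Decremental Reduction Optimization Procedure), and GA (Genetic Algorithm), and three imputation methods, KNNI (K-Nearest Neighbor Imputation method), SVM (Support Vector Machine), and MLP (Multilayer Perceptron). The goal is to determine under which conditions which combination (three instance selection methods paired with three imputation methods) is best or most suitable, and whether the combined approach is more effective than imputation alone. Based on the results, we suggest that for numerical datasets the pipeline of instance selection followed by imputation is more suitable than imputation alone. Regarding instance selection, DROP3 is better suited to numerical and mixed datasets, but is not recommended for categorical datasets with a large number of classes. As for GA and IB3, we suggest that GA is more suitable than IB3, because the experiments show that GA's instance selection performance is more stable than IB3's.	zh_TW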
dc.description.abstract	In data mining, collected datasets are often incomplete, that is, they contain missing attribute values, which makes it difficult to develop an effective learning model from them. In the literature, missing value imputation is a common approach to the problem of incomplete datasets. Its aim is to estimate the missing values from the (observed) complete data samples. However, some of the complete data may contain noisy information, which can be regarded as outliers. If these noisy data are used for missing value imputation, the quality of the imputation results suffers. To solve this problem, we propose performing instance selection over the complete data before the imputation step. The aim of instance selection is to filter out unrepresentative data from a given dataset. This research therefore examines the effect of instance selection on missing value imputation. The experiments use 33 UCI datasets, covering categorical, numerical, and mixed data types. Three instance selection methods, IB3 (Instance-based learning), DROP3 (Decremental Reduction Optimization Procedure), and GA (Genetic Algorithm), are compared. Similarly, three imputation methods, KNNI (K-Nearest Neighbor Imputation method), SVM (Support Vector Machine), and MLP (Multilayer Perceptron), are each employed individually. The comparative results allow us to understand which combination of instance selection and imputation methods performs best, and whether combining instance selection with missing value imputation is a better choice than performing missing value imputation alone on incomplete datasets. According to the results, we suggest that, over numerical datasets, combining instance selection methods with imputation methods may be more suitable than using the imputation methods alone. In particular, the DROP3 instance selection method is more suitable for numerical and mixed datasets, but not for categorical datasets, especially when the number of features is large. For the other two instance selection methods, GA provides more stable reduction performance than IB3.	en_US
DC.subject	Data Mining	zh_TW
DC.subject	Instance Selection Methods	zh_TW
DC.subject	Imputation Methods	zh_TW
DC.subject	Machine Learning	zh_TW
DC.subject	Classification Problem	zh_TW
DC.subject	Machine Learning	en_US
DC.subject	Instance Selection Methods	en_US
DC.subject	Imputation Methods	en_US
DC.subject	Classification	en_US
DC.subject	Data Mining	en_US
DC.title	The Effect of Instance Selection on Missing Value Imputation	en_US
dc.language.iso	en_US	en_US
DC.type	Master's and doctoral theses	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Master's/doctoral thesis 102423002: full metadata record