

    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/72067


    Title: The Impact of Feature Selection Pre-processing on Missing Value Imputation (特徵選取前處理於填補遺漏值之影響)
    Authors: 李韋柔;Lee,Wei-Jou
    Contributors: Department of Information Management (資訊管理學系)
    Keywords: Data Mining;Feature Selection;Imputation Methods;Machine Learning;Classification
    Date: 2016-07-05
    Issue Date: 2016-10-13 14:24:24 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: Large volumes of raw data collected from the real world inevitably contain incomplete or poor-quality records, which reduce both the effectiveness and the efficiency of data mining. Two problems degrade data mining performance: missing data, and data containing too many redundant or unrepresentative features. Such data fail to provide valuable information, weaken the experimental results, lower overall efficiency, and raise the cost of experiments. The missing value problem is pervasive in data mining; typical causes include data-entry errors and format errors. Imputation methods were developed to address it: they treat complete records as observations and use them to predict the missing values in incomplete records. Feature selection filters redundant and unrepresentative features out of the data. This study combines feature selection with imputation, examining whether it is beneficial to first refine the data through feature selection and then fill in the missing values.
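    The imputation idea described above (treating complete records as observations and predicting the missing entries from them) can be illustrated with a minimal K-nearest-neighbor imputer in the style of KNNI. This is a pure-Python sketch, not the thesis's actual implementation; the function name `knn_impute` and the toy data are hypothetical:

    ```python
    import math

    def knn_impute(rows, k=3):
        """Fill each None cell from the k nearest fully observed rows:
        distance is Euclidean over the features the incomplete row has,
        and the missing value becomes the neighbors' mean for that column."""
        complete = [r for r in rows if None not in r]
        filled = [list(r) for r in rows]  # copy so the input stays intact
        for row in filled:
            if None not in row:
                continue
            obs = [i for i, v in enumerate(row) if v is not None]
            # rank complete rows by distance over the observed features only
            neighbors = sorted(
                complete,
                key=lambda c: math.sqrt(sum((row[i] - c[i]) ** 2 for i in obs)),
            )[:k]
            for i, v in enumerate(row):
                if v is None:
                    row[i] = sum(n[i] for n in neighbors) / len(neighbors)
        return filled

    rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.0, None]]
    print(knn_impute(rows, k=2)[-1])  # [2.0, 15.0]
    ```

    Here the incomplete row `[2.0, None]` is matched against the complete rows on its observed first feature, and the missing second value becomes the mean over the two nearest neighbors.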
    To investigate the effect of feature selection on imputation, this thesis collects 12 complete datasets from the UCI repository, covering three data types: categorical, mixed, and numerical. To bring the experiments closer to real-world conditions, missing rates of 10%, 20%, 30%, 40%, and 50% are simulated, and the changes and trends under these five rates are examined. Three feature selection techniques are chosen: Genetic Algorithm (GA), Decision Tree (DT), and Information Gain (IG); and three imputation methods: Multilayer Perceptron (MLP), Support Vector Machine (SVM), and K-Nearest Neighbor Imputation (KNNI). These are used to determine under which conditions a given combination of feature selection and imputation is the best or most suitable, and to compare the classification accuracy of the combined approach against imputation alone. The work comprises two studies: Study 1 processes the initial datasets, and Study 2 validates Study 1's findings on high-dimensional datasets.
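    The missing-rate simulation described above amounts to deleting a fixed fraction of cells completely at random from an otherwise complete dataset. A minimal sketch, assuming MCAR-style corruption; `inject_missing` is a hypothetical helper name, not code from the thesis:

    ```python
    import random

    def inject_missing(rows, missing_rate, seed=0):
        """Replace a fraction of cell values with None, missing completely
        at random (MCAR), mirroring a 10%-50% missing-rate simulation."""
        rng = random.Random(seed)
        n_rows, n_cols = len(rows), len(rows[0])
        cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
        n_missing = int(round(missing_rate * len(cells)))
        corrupted = [list(row) for row in rows]  # copy; input stays intact
        for r, c in rng.sample(cells, n_missing):
            corrupted[r][c] = None
        return corrupted

    data = [[1.0, 2.0, 3.0] for _ in range(10)]       # 30 cells
    incomplete = inject_missing(data, missing_rate=0.3)
    n_none = sum(v is None for row in incomplete for v in row)
    print(n_none)  # 9 cells (30% of 30) are now missing
    ```

    In the experiments, each corrupted copy would then be imputed (with or without prior feature selection) and the downstream classification accuracy compared across the five rates.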
    According to the results of Study 1, for mixed datasets, selecting features with GA and then imputing with IBk is informative and has a positive effect; for numerical datasets, selecting features with IG at a 65% feature retention rate and then imputing with any of MLP, IBk, or SVM has a positive effect. Categorical datasets achieve their best accuracy with direct classification, so for them we recommend skipping the feature-selection-then-imputation procedure. We also find that at a 10% missing rate, performing feature selection before imputation outperforms both direct imputation and direct classification. Study 2 yields two findings: first, on high-dimensional datasets, selecting features with DT and then imputing with MLP or IBk is the best-performing combination; second, on high-dimensional datasets the combined approach performs better when the retention rate is below 40%.
    The collection of real-life data frequently contains missing values, and their presence in a dataset can degrade the performance of data mining algorithms. The techniques commonly used to handle missing data are based on imputation methods. Poor quality data negatively influence predictive accuracy; the causes include not only missing values but also redundant and unrepresentative features. Such data degrade the performance of data mining algorithms and increase the cost of research. To address this problem, we propose performing feature selection over the complete data before the imputation step. The aim of feature selection is to filter unrepresentative features out of a given dataset. This research therefore focuses on identifying the best combination of feature selection and imputation methods.
    The experimental setup is based on 12 UCI datasets composed of categorical, numerical, and mixed types of data. The experiments simulate 10%, 20%, 30%, 40%, and 50% missing rates for each training dataset. Three feature selection methods are compared: GA (Genetic Algorithm), DT (Decision Tree), and IG (Information Gain). Similarly, three imputation methods are employed individually: KNNI (K-Nearest Neighbor Imputation), SVM (Support Vector Machine), and MLP (Multilayer Perceptron). The comparative results show which combination of feature selection and imputation methods performs best, and whether combining feature selection with missing value imputation is a better choice than performing imputation alone on incomplete datasets.
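    The IG criterion named above scores each feature by how much it reduces the entropy of the class labels. A minimal pure-Python sketch of that scoring for categorical features (illustrative only; the study's actual implementation is not shown here, and the toy data are hypothetical):

    ```python
    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a label list, in bits."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(feature_values, labels):
        """IG(feature) = H(labels) - H(labels | feature) for one categorical
        feature; higher scores mean the feature better separates the classes."""
        n = len(labels)
        groups = {}
        for v, y in zip(feature_values, labels):
            groups.setdefault(v, []).append(y)
        remainder = sum((len(s) / n) * entropy(s) for s in groups.values())
        return entropy(labels) - remainder

    labels = ["yes", "yes", "no", "no"]
    f1 = ["a", "a", "b", "b"]   # perfectly aligned with the labels
    f2 = ["x", "y", "x", "y"]   # independent of the labels
    print(information_gain(f1, labels))  # 1.0
    print(information_gain(f2, labels))  # 0.0
    ```

    Ranking features by this score and keeping the top fraction (e.g. a 65% retention rate) is the kind of IG-based filtering the experiments combine with the imputation step.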
    According to the results, the combination of the GA feature selection method and the IBk imputation method was significantly better than imputation alone on mixed datasets. Combining the IG feature selection method, retaining 65% of the features, with any of the selected imputation methods gave significantly better classification accuracy on numerical datasets. Performing missing value imputation alone is the better choice on categorical datasets. Performing feature selection before the imputation step yields the best classification accuracy at a 10% missing rate. On high-dimensional datasets, combining the DT feature selection method with the IBk or MLP imputation methods produces the best performance.
    Appears in Collections: [Graduate Institute of Information Management] Master's and Doctoral Theses (資訊管理研究所 博碩士論文)


    All items in NCUIR are protected by copyright, with all rights reserved.
