資料遺漏率、補值法與資料前處理關係之研究; The relationship between missing value, imputation and data pre-processing

Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/61115

Title:	資料遺漏率、補值法與資料前處理關係之研究;The relationship between missing value, imputation and data pre-processing
Authors:	林盈秀;Lin,Ying-Siou
Contributors:	Department of Information Management
Keywords:	data mining;missing values;case deletion;imputation;feature selection;instance selection
Date:	2013-07-01
Issue Date:	2013-08-22 12:12:17 (UTC+8)
Publisher:	National Central University
Abstract:	With the rapid development of information technology, computers can process and store ever larger amounts of data, which makes finding meaningful content in large datasets a central problem in data mining. Collected datasets, however, often contain missing values, and these are likely to degrade data mining performance. The simplest way to handle incomplete data is case deletion, which discards every sample that contains a missing value. It is suitable only when the number of missing values is small; at higher missing rates it discards a large share of the data and affects the mining result. The alternative is imputation, which estimates each missing value from the observed data. Recent studies have concentrated on proposing new imputation methods and on comparing existing methods across datasets. Few address the question of when, during data pre-processing, samples with missing values can simply be ignored or deleted, and none examines whether performing data pre-processing (feature selection or instance selection) before imputation gives better results than imputing directly.
This thesis uses 37 datasets of three main types, numerical, categorical, and mixed, with missing rates from 5% to 50% in 5% intervals. The research is divided into two parts. The first study shows that different types of datasets can tolerate different missing rates, so case deletion can be applied in certain settings without significant performance degradation. A decision tree model is then constructed to extract decision rules relating dataset characteristics (such as the number of samples, the dimensionality, and the data type) to the tolerable missing rate, which help determine when case deletion can be used directly. The second study examines, for the three types of datasets, whether feature selection and instance selection should be applied before imputation. The results show that imputation after instance selection produces better classification performance than imputation alone and than imputation after feature selection. Imputation after feature selection, by contrast, has no positive impact on the imputation result.
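The two handling strategies named in the abstract, evaluated over missing rates from 5% to 50%, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the thesis's procedure: the toy dataset, the function names (`inject_missing`, `case_deletion`, `mean_imputation`), removing cells completely at random, and the use of column-mean imputation as a stand-in for whichever imputation methods the thesis evaluated.

```python
import random
from statistics import mean


def inject_missing(rows, rate, seed=0):
    """Return a copy of rows with a fraction `rate` of cells set to None.

    Assumption: cells are removed completely at random; the abstract does
    not state which missingness mechanism the thesis used.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    chosen = rng.sample(cells, round(rate * len(cells)))
    out = [list(r) for r in rows]  # copy so the input is left untouched
    for i, j in chosen:
        out[i][j] = None
    return out


def case_deletion(rows):
    """Keep only the samples that have no missing value."""
    return [r for r in rows if None not in r]


def mean_imputation(rows):
    """Replace each missing cell with the mean of its column's observed values."""
    col_means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column has no observed value at all.
        col_means.append(mean(observed) if observed else 0.0)
    return [[col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]


# Hypothetical numerical dataset: 100 samples, 3 features.
data = [[float(i), float(i * 2), float(i % 5)] for i in range(100)]

# Missing rates from 5% to 50% in 5% steps, as described in the abstract.
missing_rates = [r / 100 for r in range(5, 55, 5)]

retained = {}
for rate in missing_rates:
    incomplete = inject_missing(data, rate, seed=42)
    retained[rate] = len(case_deletion(incomplete))
    imputed = mean_imputation(incomplete)
    print(f"rate={rate:.2f}  kept by case deletion: {retained[rate]:3d}/100  "
          f"kept by imputation: {len(imputed)}/100")
```

Running the loop shows the trade-off the abstract describes: case deletion keeps fewer complete samples as the missing rate grows, while imputation keeps every sample at the cost of estimated values. A classifier trained on each variant would be needed to compare classification performance, which this sketch does not attempt.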
Appears in Collections:	[Graduate Institute of Information Management] Electronic Thesis & Dissertation

