The relationship between missing value, imputation and data pre-processing

DC field	Value	Language
DC.contributor	Department of Information Management	zh_TW
DC.creator	林盈秀	zh_TW
DC.creator	Ying-Siou Lin	en_US
dc.date.accessioned	2013-07-01T07:39:07Z
dc.date.available	2013-07-01T07:39:07Z
dc.date.issued	2013
dc.identifier.uri	http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=100423032
dc.contributor.department	Department of Information Management	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
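dc.description.abstract	With the rapid development of information technology, the amount of data that computers can process and store keeps growing, and finding meaningful content in large amounts of data is an important topic in data mining. During mining, however, the required data are often missing or incomplete, which degrades mining performance. For pre-processing incomplete data, case deletion is the simplest and most direct method, but it is suitable only when the dataset contains a small number of missing values; when the number is large, case deletion discards much of the data and affects the mining results. The other approach is imputation. Recent research has concentrated on proposing new imputation methods and on comparing different imputation methods across datasets, but few studies answer the question "during data pre-processing, when can samples with missing values be ignored or deleted entirely?", and none examines whether performing data pre-processing (feature selection or instance selection) before imputation gives better results than imputing directly without dimensionality reduction or instance selection. This study uses 37 datasets of three main types, numerical, categorical, and mixed, with missing rates from 5% to 50% at 5% intervals. The research is divided into two parts. The results of Study 1 show that different types of datasets can tolerate different missing rates. In particular, we construct a decision tree model to obtain decision rules that relate dataset characteristics (such as the number of samples, the dimensionality, and the data type) to the tolerable missing rate, which help analysts determine when case deletion can be used directly at a given missing rate. Study 2 uses the three types of datasets (numerical, mixed, and categorical) to assess the effect of feature selection and instance selection on missing value imputation, and whether they are suitable to apply before the imputation stage. The results show that instance selection followed by imputation yields better classification performance than feature selection followed by imputation. In other words, performing feature selection before imputation has no positive effect on imputation.	zh_TW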
dc.description.abstract	With the rapid development of information technology, computers can process and store huge amounts of data, which makes finding useful content in large amounts of data an important data mining problem. However, many datasets collected for data mining contain missing values, which are likely to degrade data mining performance. A common and simple way to handle incomplete data is case deletion, which ignores the data samples with missing values; it is appropriate only when the missing rate is small. Another approach is imputation, for which various algorithms have been proposed. Generally speaking, imputation algorithms estimate missing values by reasoning from the observed data. However, there is no clear answer to the question of when case deletion or imputation should be used for different kinds of datasets. Another open question is whether performing data pre-processing, i.e. feature selection and instance selection, affects the final imputation result. This thesis uses 37 datasets containing categorical, numerical, and mixed types of data, with missing rates from 5% to 50% at 5% intervals for each dataset. The research is divided into two parts. In the first part, the experimental results indicate that there are specific patterns under which case deletion can be applied to different datasets without significant performance degradation. A decision tree model is then constructed to extract rules that recommend when to use case deletion. In the second part, we found that imputation after instance selection can produce better classification performance than imputation alone. However, imputation after feature selection does not have a positive impact on the imputation result.	en_US
DC.subject	資料探勘	zh_TW
DC.subject	資料遺漏	zh_TW
DC.subject	直接刪除法	zh_TW
DC.subject	資料補值	zh_TW
DC.subject	樣本選取	zh_TW
DC.subject	特徵選取	zh_TW
DC.subject	data mining	en_US
DC.subject	missing values	en_US
DC.subject	case deletion	en_US
DC.subject	imputation	en_US
DC.subject	feature selection	en_US
DC.subject	instance selection	en_US
DC.title	資料遺漏率、補值法與資料前處理關係之研究	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	The relationship between missing value, imputation and data pre-processing	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis 100423032: complete metadata record
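
The two ways of handling incomplete data that the abstract compares, case deletion and imputation, can be illustrated with a minimal Python sketch over the stated range of missing rates (5% to 50% in 5% steps). This is not the thesis's code. The record does not say how missing values were generated or which imputation algorithms were used, so the sketch assumes that cells go missing completely at random and uses column-mean imputation as a stand-in. All function names are hypothetical.

```python
import random


def missing_rates():
    """Missing rates from 5% to 50% in 5% steps, as fractions."""
    return [pct / 100 for pct in range(5, 55, 5)]


def inject_missing(rows, rate, rng):
    """Return a copy of `rows` with round(rate * cells) cells set to None.

    Assumption: cells are removed completely at random; the record does
    not describe how the thesis generated its missing values.
    """
    n_cols = len(rows[0])
    n_cells = len(rows) * n_cols
    damaged = [list(row) for row in rows]
    for pos in rng.sample(range(n_cells), round(rate * n_cells)):
        damaged[pos // n_cols][pos % n_cols] = None
    return damaged


def case_deletion(rows):
    """Keep only the rows that contain no missing cell."""
    return [row for row in rows if None not in row]


def mean_imputation(rows):
    """Replace each missing cell with the mean of its column's observed values.

    Assumption: mean imputation is only a stand-in for the unnamed
    imputation methods. A column with no observed value falls back to 0.0.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic numerical data: 100 samples with 4 features each.
    data = [[float(i + j) for j in range(4)] for i in range(100)]
    for rate in missing_rates():
        damaged = inject_missing(data, rate, rng)
        kept = len(case_deletion(damaged))
        print(f"missing rate {rate:.0%}: case deletion keeps {kept} of {len(data)} rows")
```

Running the script shows how quickly case deletion sheds samples as the missing rate grows, whereas imputation always keeps every sample. This is the trade-off behind the abstract's question of when case deletion is still acceptable.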