Name 林盈秀 (Ying-Siou Lin)
Thesis Title 資料遺漏率、補值法與資料前處理關係之研究
(The relationship between missing value, imputation and data pre-processing)
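Abstract (Chinese) With the rapid development of information technology, the volume of data that computers can process and store keeps growing. Finding meaningful content in large amounts of data is an important topic in data mining. During the mining process, however, the required data are inevitably missing or insufficient in places, and these problems degrade mining performance.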
Abstract (English) With the rapid development of information technology, computers can process and store huge amounts of data. As a result, finding useful content in large amounts of data has become an important task in data mining. However, datasets collected for data mining often contain missing values, which are likely to degrade data mining performance.
A common and simple way to handle incomplete data is case deletion, which ignores the data samples that contain missing values; it is suitable when the missing rate is small. Another approach is imputation, for which various algorithms have been proposed. Generally speaking, imputation algorithms estimate the missing values by reasoning from the observed data. However, there is no clear answer to the question of when case deletion or imputation should be used for different kinds of datasets. A second question is whether performing data pre-processing, i.e. feature selection and instance selection, affects the final imputation result.
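The two strategies contrasted above can be sketched as follows. This is only a minimal illustration, not the procedure used in the thesis: it assumes numerical data stored as lists of rows with `None` marking a missing value, and it uses column-mean imputation as a stand-in for the imputation algorithms the abstract refers to. The function names `case_deletion` and `mean_imputation` are hypothetical.

```python
from statistics import mean

def case_deletion(rows):
    """Drop every row that contains at least one missing value (None)."""
    return [row for row in rows if all(v is not None for v in row)]

def mean_imputation(rows):
    """Fill each missing value with the mean of the observed values in its column.

    Assumes every column has at least one observed value; otherwise
    statistics.mean raises StatisticsError.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]

# Toy data (made up for illustration): 4 samples, 2 numerical features.
rows = [
    [1.0, 2.0],
    [None, 4.0],
    [3.0, None],
    [5.0, 6.0],
]

deleted = case_deletion(rows)    # keeps only the two complete rows
imputed = mean_imputation(rows)  # keeps all four rows, with gaps filled
```

On this toy data, case deletion discards half of the samples, while imputation keeps them all at the cost of introducing estimated values; this is the trade-off the abstract asks about.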
This thesis uses 37 datasets containing categorical, numerical, or mixed-type data, with missing rates from 5% to 50% in 5% steps for each dataset. The research is divided into two parts: when case deletion can be used, and how data pre-processing affects imputation. For the first part, the experimental results indicate that specific patterns exist under which case deletion can be applied to different datasets without significant performance degradation. A decision tree model is then constructed to extract useful rules that recommend when to use the case deletion approach. For the second part, we found that imputation after instance selection produces better classification performance than imputation alone, whereas imputation after feature selection does not have a positive impact on the imputation result.
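The missing-rate grid described above (5% to 50% in 5% steps) could be simulated along the following lines. This is a hedged sketch under stated assumptions: the abstract does not say how missing values were injected, so the code assumes cells are removed uniformly at random over the whole data matrix (an MCAR-style mechanism) and that the missing rate is measured per cell. The names `missing_rates` and `inject_missing` are hypothetical.

```python
import random

def missing_rates(start=5, stop=50, step=5):
    """Return the missing rates 5%, 10%, ..., 50% as fractions."""
    return [p / 100 for p in range(start, stop + 1, step)]

def inject_missing(rows, rate, seed=0):
    """Return a copy of rows with a fraction `rate` of cells set to None.

    Assumption: cells are chosen uniformly at random over the whole
    matrix; the thesis may have used a different mechanism.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    n_missing = round(rate * len(cells))
    masked = [list(row) for row in rows]  # copy so the input stays intact
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked

# Toy complete dataset (made up): 10 samples, 10 features.
complete = [[float(i * 10 + j) for j in range(10)] for i in range(10)]

# One incomplete copy of the dataset per missing rate.
incomplete = {rate: inject_missing(complete, rate) for rate in missing_rates()}
```

Each incomplete copy can then be passed to a case deletion or imputation routine, so that classification performance can be compared across the ten missing rates.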
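Keywords (Chinese) ★ data mining (資料探勘)
★ missing data (資料遺漏)
★ case deletion (直接刪除法)
★ data imputation (資料補值)
★ instance selection (樣本選取)
★ feature selection (特徵選取)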
Keywords (English) ★ data mining
★ missing values
★ case deletion
★ imputation
★ feature selection
★ instance selection
Table of Contents Abstract (Chinese)
Abstract
Acknowledgments
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Thesis Organization
Chapter 2 Literature Review
2-1 Missing Data
2-1-1 Missing Completely at Random (MCAR)
2-1-2 Missing at Random (MAR)
2-1-3 Missing Not at Random (MNAR)
2-2 Handling Missing Values
2-2-1 Prior Prevention
2-2-2 Listwise Deletion
2-2-3 Dummy Variable Method
2-2-4 Imputation
2-3 Feature Selection
2-3-1 F-score
2-4 Instance Selection
2-4-1 DROP3
Chapter 3 Research Methodology
3-1 Experimental Framework
3-2 Datasets
3-3 Study 1
3-4 Study 2
3-4-1 Single Imputation
3-4-2 Multiple Imputation
3-4-3 Feature Selection
3-4-4 Instance Selection
Chapter 4 Experimental Results
4-1 Study 1
4-1-1 Results on Categorical Datasets
4-1-2 Results on Numerical Datasets
4-1-3 Results on Mixed-type Datasets
4-1-4 Extracting Decision Rules
4-2 Study 2
4-2-1 Results for Datasets at Specific Missing Rates
4-2-2 Results for Specific Datasets at Different Missing Rates
4-2-3 Extracting Decision Rules
Chapter 5 Conclusions and Future Research Directions
5-1 Conclusions and Contributions
5-2 Future Research Directions and Suggestions
References
Appendix 1
Appendix 2
Appendix 3
Advisor 蔡志豐 (Chih-Fong Tsai) Approval Date 2013-7-1
