Name Chin-Fu Du (杜俊甫)
Department Department of Information Management
Thesis Title Comparative Study of Multiple Imputation and Ensemble Imputation Strategies for Missing Data Handling
(多重填補與集成式填補策略於遺漏資料處理之比較與研究)
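Abstract (Chinese) Against the backdrop of rapidly advancing information technology, missing values are pervasive across all kinds of datasets. They undermine data completeness and adversely affect subsequent data analysis and decision-making. Handling missing values has therefore become an important and challenging task in the data preprocessing stage.
This study selected 25 UCI datasets, covering numerical, categorical, and mixed-type data, and simulated missing rates from 10% to 50% to evaluate five machine learning algorithms under single imputation and multiple imputation (MICE). The study also proposed two MICE-based ensemble imputation methods, a hybrid and a parallel ensemble strategy. Imputation performance was assessed by SVM classification accuracy, root mean square error, mean absolute error, and category accuracy.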
Abstract (English) With the rapid development of information technology, missing values have become common across many kinds of datasets. They compromise data completeness and hamper subsequent data analysis and decision-making. Handling missing values is therefore a crucial and challenging task in data preprocessing.
Missing values are mainly handled by imputation, using either statistical or machine learning techniques, both of which are single imputation methods. More recently, scholars have also adopted multiple imputation. However, few studies compare multiple imputation with machine learning imputation across different datasets and missing rates. In addition, although ensemble learning has improved the predictive accuracy of models, its use in missing value imputation remains under-researched. We therefore analyze the performance of single and multiple imputation methods and explore ensemble learning for missing value imputation.
This study used 25 UCI datasets, covering numerical, categorical, and mixed types, and simulated missing rates from 10% to 50%. Five machine learning algorithms were evaluated under both single imputation and multiple imputation (MICE), and two MICE-based ensemble imputation methods, a hybrid strategy and a parallel strategy, were proposed. Imputation effectiveness was assessed using SVM classification accuracy, RMSE, MAPE, and hit ratio.
The results showed that multiple imputation generally outperformed single imputation. The random forest method performed best on mixed datasets, while the other methods performed slightly worse. The ensemble imputation experiments indicated that both the hybrid and the parallel strategies improved all metrics, although the order in which models are applied in hybrid imputation significantly affected the results. Finally, we provide recommendations for the best combinations of multiple and ensemble imputation as a reference for future researchers.
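The evaluation protocol summarized in the abstract (remove known values at a chosen missing rate, impute them, then compare the imputed values with the held-out truth) can be sketched as below. This is only an illustrative sketch under stated assumptions, not the thesis's implementation: the data are synthetic, the masking is assumed to be completely at random, mean imputation stands in for the MICE and machine learning imputers, and the helper names `mask_mcar`, `impute_mean`, and `rmse_on_missing` are hypothetical.

```python
import math
import random
from statistics import mean


def mask_mcar(column, rate, rng):
    """Copy `column`, setting round(rate * n) uniformly chosen entries to None.

    Assumption: values go missing completely at random (MCAR).
    """
    n_missing = round(rate * len(column))
    missing_idx = set(rng.sample(range(len(column)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(column)]


def impute_mean(column):
    """Fill None entries with the mean of the observed entries.

    Stand-in imputer only; the thesis evaluates MICE and ML-based imputers.
    """
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def rmse_on_missing(truth, masked, imputed):
    """Root mean square error over the cells that were masked out."""
    errors = [(t - p) ** 2 for t, m, p in zip(truth, masked, imputed) if m is None]
    return math.sqrt(sum(errors) / len(errors))


# Synthetic numerical column (hypothetical data, not a UCI dataset).
rng = random.Random(0)
truth = [rng.gauss(50.0, 10.0) for _ in range(200)]

# Sweep the missing rates named in the abstract: 10% to 50%.
results = {}
for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    masked = mask_mcar(truth, rate, rng)
    imputed = impute_mean(masked)
    results[rate] = rmse_on_missing(truth, masked, imputed)
    print(f"missing rate {rate:.0%}: RMSE = {results[rate]:.3f}")
```

Scoring only the masked cells keeps the metric focused on imputation quality, since observed cells are reproduced exactly and would otherwise dilute the error.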
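Keywords (Chinese) ★ 資料探勘 (Data Mining)
★ 遺漏值 (Missing Values)
★ 多重填補 (Multiple Imputation)
★ 機器學習 (Machine Learning)
★ 集成學習 (Ensemble Learning)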
Keywords (English) ★ Data Mining
★ Missing Values
★ Multiple Imputation
★ Machine Learning
★ Ensemble Learning
Table of Contents Abstract (Chinese) i
Abstract (English) ii
Table of Contents iii
List of Figures vi
List of Tables viii
1. Introduction 1
1-1. Research Background 1
1-2. Research Motivation 2
1-3. Research Objectives 3
1-4. Thesis Structure 4
2. Literature Review 6
2-1. Missing Data 6
2-1-1. Missing Completely at Random (MCAR) 6
2-1-2. Missing at Random (MAR) 7
2-1-3. Missing Not at Random (MNAR) 7
2-2. Missing Value Handling 8
2-2-1. Single Imputation 9
2-2-2. Multiple Imputation 13
2-3. Ensemble Learning 14
2-3-1. Sequential Ensemble 14
2-3-2. Parallel Ensemble 15
2-3-3. Ensemble Imputation 16
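3. Methodology 17
3-1. Experimental Framework 17
3-2. Experimental Preparation 17
3-2-1. Hardware and Software 17
3-2-2. Datasets 17
3-3. Experimental Environment and Parameter Settings 19
3-3-1. Regression Imputation 19
3-3-2. Machine Learning Imputation 19
3-3-3. MICE (Multivariate Imputation by Chained Equations) 22
3-3-4. Classifier 22
3-4. Experiment 1 23
3-4-1. Experiment 1 (A) 23
3-4-2. Experiment 1 (B) 24
3-5. Experiment 2 25
3-5-1. Experiment 2 (A) 25
3-5-2. Experiment 2 (B) 26
3-6. Evaluation Metrics 27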
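4. Experimental Results 29
4-1. Experiment 1 Results 29
4-1-1. Numerical Data 29
4-1-2. Categorical Data 37
4-1-3. Mixed Data 42
4-1-4. Summary 51
4-2. Experiment 2 Results 52
4-2-1. Experiment 2 (A) 52
4-2-2. Experiment 2 (B) 63
4-2-3. Summary 75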
5. Conclusion 77
5-1. Summary and Discussion 77
5-2. Future Work 78
6. References 80
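Appendix 1. Detailed Experimental Data for Experiment 1 85
1-1. Numerical Data 85
1-2. Categorical Data 101
1-3. Mixed Data 111
Appendix 2. Detailed Experimental Data for Experiment 2 131
2-1. Numerical Data 131
2-2. Categorical Data 147
2-3. Mixed Data 157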
Advisor Chih-Fong Tsai (蔡志豐) Approval Date 2024-7-22
