Name 彭柏豪 (Po-Hao Peng)   Department Information Management
Thesis Title 基於集成方法的二元分類資料集補值研究
(An Imputation Method based on Ensemble Techniques for Binary Classification Datasets)
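Abstract (Chinese) Prior research shows that imputation methods fall into three broad categories: statistical, machine learning, and deep learning. Because each category suits different situations, this study applies ensemble techniques to the imputation task, combining multiple imputation methods and weighting each according to its suitability for a given scenario in order to produce better imputed values.
The study also analyzes the suitability of the ensemble methods with respect to dataset characteristics. By sample size, Ensemble_acc performs better on both small and large datasets; by feature type, Ensemble_rmse performs better on purely numerical datasets and Ensemble_acc on mixed-type datasets; and by application domain, Ensemble_rmse performs better on medical datasets and Ensemble_acc on credit datasets.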
Abstract (English) Past research generally groups imputation methods into three categories: statistical, machine learning, and deep learning. Each category has contexts in which it works best, so this study applies ensemble techniques to the imputation task. It combines multiple imputation methods and assigns each a weight based on its suitability for different scenarios, with the aim of generating better imputed values.
For the experimental design, this study selects six binary classification datasets from the UCI Machine Learning Repository. Following previous literature, representative methods were chosen for each category: the statistical methods Mean/Mode and MICE; the machine learning methods MissForest and KNN; and the deep learning methods PC-GAIN, HI-VAE, and PMIVAE. PC-GAIN was also modified to form the RC-GAIN method. In total, eight imputation methods were evaluated, and experiments were conducted with SVM, LightGBM, and MLP classifiers.
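The evaluation loop implied by this design (hide known values, impute them, score the result) can be sketched as follows. This is a minimal illustration, not the thesis's procedure: it assumes a completely-at-random (MCAR) missing mechanism, a single numeric column, Mean imputation as the method under test, and RMSE computed only on the hidden entries. The function names and the sample values are hypothetical.

```python
import math
import random


def mask_mcar(values, missing_rate, seed=0):
    """Hide a fraction of entries completely at random (assumed MCAR).

    Returns the masked copy (None marks a missing entry) and the hidden indices.
    """
    rng = random.Random(seed)
    n_missing = round(len(values) * missing_rate)
    hidden = sorted(rng.sample(range(len(values)), n_missing))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def mean_impute(masked):
    """Fill every missing entry with the mean of the observed entries."""
    observed = [v for v in masked if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in masked]


def rmse_on_hidden(truth, imputed, hidden):
    """RMSE between true and imputed values, restricted to the hidden entries."""
    squared = [(truth[i] - imputed[i]) ** 2 for i in hidden]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical numeric column used only for illustration.
column = [5.1, 4.9, 6.2, 5.8, 7.0, 6.4, 5.5, 6.1, 4.7, 5.9]
masked, hidden = mask_mcar(column, missing_rate=0.3)
imputed = mean_impute(masked)
print("hidden indices:", hidden)
print("RMSE on hidden entries:", rmse_on_hidden(column, imputed, hidden))
```

Swapping `mean_impute` for another imputer while keeping the same mask lets several methods be compared on identical missing cells.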
The four better-performing imputation methods (MICE, MissForest, RC-GAIN, and HI-VAE) and the best classifier, LightGBM, were then used to construct an ensemble imputation method. Two performance metrics, RMSE and the accuracy obtained with LightGBM, were used to compute two sets of weights, yielding two ensemble methods: Ensemble_rmse and Ensemble_acc. Experimental results showed that both ensemble methods outperformed the four selected imputation methods across different missing mechanisms and missing rates. Of the two, Ensemble_acc outperformed Ensemble_rmse and was the better imputation method.
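A weighted combination of this kind can be sketched as below. The abstract does not state the weight formulas, so two simple choices are assumed here: weights proportional to inverse RMSE (lower error, higher weight) and weights proportional to accuracy, each normalized to sum to 1. The per-method scores and candidate imputed values are hypothetical and are not results from the thesis.

```python
def inverse_rmse_weights(rmse_by_method):
    """Assumed scheme: weight is proportional to 1 / RMSE, normalized to sum to 1."""
    inverse = {m: 1.0 / r for m, r in rmse_by_method.items()}
    total = sum(inverse.values())
    return {m: v / total for m, v in inverse.items()}


def accuracy_weights(acc_by_method):
    """Assumed scheme: weight is proportional to accuracy, normalized to sum to 1."""
    total = sum(acc_by_method.values())
    return {m: a / total for m, a in acc_by_method.items()}


def ensemble_impute(candidates, weights):
    """Weighted average of each method's imputed value for the same missing cells."""
    n_cells = len(next(iter(candidates.values())))
    return [
        sum(weights[m] * candidates[m][i] for m in candidates)
        for i in range(n_cells)
    ]


# Hypothetical scores for the four selected methods (illustration only).
rmse = {"MICE": 0.20, "MissForest": 0.10, "RC-GAIN": 0.25, "HI-VAE": 0.40}
acc = {"MICE": 0.82, "MissForest": 0.86, "RC-GAIN": 0.84, "HI-VAE": 0.80}

# Hypothetical imputed values each method proposes for three missing cells.
candidates = {
    "MICE": [1.0, 2.0, 3.0],
    "MissForest": [1.2, 2.1, 2.9],
    "RC-GAIN": [0.9, 2.4, 3.2],
    "HI-VAE": [1.5, 1.8, 3.5],
}

weights_rmse = inverse_rmse_weights(rmse)
weights_acc = accuracy_weights(acc)
filled_rmse = ensemble_impute(candidates, weights_rmse)
filled_acc = ensemble_impute(candidates, weights_acc)
print("Ensemble_rmse values:", filled_rmse)
print("Ensemble_acc values:", filled_acc)
```

The sketch covers numeric cells only; categorical features would need a different combination rule, such as a weighted vote.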
The study also analyzed the suitability of the ensemble methods based on dataset characteristics. In the analysis of dataset sizes, Ensemble_acc performed better in both small and large datasets. In the analysis of dataset feature types, Ensemble_rmse performed better in purely numerical datasets, while Ensemble_acc performed better in mixed datasets. Finally, in the application domain analysis, Ensemble_rmse performed better in medical datasets, while Ensemble_acc performed better in credit datasets.
Keywords ★ Machine learning
★ Deep learning
★ Missing value imputation
★ Ensemble learning
Table of Contents Abstract (Chinese)
Abstract (English)
Acknowledgments
Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
2. Literature Review
2-1 Missing Mechanisms
2-2 Imputation Methods
2-2-1 Mean/Mode Imputation
2-2-2 MICE (Multiple Imputation by Chained Equations)
2-2-3 KNN Imputation
2-2-4 MissForest
2-2-5 PC-GAIN (Pseudo-label Conditional GAIN)
2-2-6 RC-GAIN (Real-label Conditional GAIN)
2-2-7 HI-VAE (Heterogeneous-Incomplete VAE)
2-2-8 PMIVAE (Partial Multiple Imputation with VAE)
2-3 Ensemble Methods
2-4 Classifiers
2-4-1 SVM (Support Vector Machine)
2-4-2 LightGBM (Light Gradient Boosting Machine)
2-4-3 MLP (Multi-Layer Perceptron)
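3. Research Methodology
3-1 Datasets
3-2 Data Preprocessing
3-3 Simulated Missing-Value Scenarios
3-4 Evaluation Metrics
3-4-1 RMSE (Root Mean Squared Error)
3-4-2 Accuracy
3-5 Experimental Parameter Settings and Methods
3-6 Experiment 1: Imputation Performance of the Imputation Methods
3-7 Experiment 2: Classification Performance of Classifiers on Imputed Datasets
3-8 Experiment 3: Performance of the Ensemble Imputation Methods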
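4. Experimental Results and Analysis
4-1 Imputation Performance of the Imputation Methods
4-1-1 Performance Analysis of the Imputation Methods
4-1-2 Selecting Imputation Methods
4-2 Classification Performance of Classifiers on Imputed Datasets
4-2-1 Performance Analysis of the Classifiers
4-2-2 Selecting the Best Classifier
4-3 Performance of the Ensemble Imputation Methods
4-3-1 Imputation Quality of the Ensemble Methods
4-3-2 Classification Performance after Imputation with the Ensemble Methods
4-3-3 Suitability of the Ensemble Methods across Dataset Characteristics
4-3-4 Effect of Different Ensemble Approaches on Classification Performance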
5. Conclusion
5-1 Conclusions and Contributions
5-2 Research Limitations
5-3 Future Research and Recommendations
References
Advisor 蘇坤良 (Kuen-Liang Sue)   Review Date 2024-7-29
