Abstract: | Prior research shows that imputation methods fall into three broad categories: statistical, machine learning, and deep learning. Each category suits different scenarios, so this study applies ensemble techniques to the imputation task: it combines multiple imputation methods and weights each one according to its suitability for a given scenario, with the aim of producing better imputed values. For the experimental design, this study uses six binary classification datasets from the UCI Machine Learning Repository. Based on the literature, representative methods were chosen from each category: the statistical methods Mean/Mode and MICE; the machine learning methods MissForest and KNN; and the deep learning methods PC-GAIN, HI-VAE, and PMIVAE. PC-GAIN was further modified to form RC-GAIN, giving eight imputation methods in total, which were evaluated with three classifiers: SVM, LightGBM, and MLP. From these experiments, the study selected the four better-performing imputation methods (MICE, MissForest, RC-GAIN, and HI-VAE) and the best classifier (LightGBM), and used them to construct the ensemble imputation method. 
Two performance metrics, RMSE and the accuracy obtained with LightGBM, were used to compute two sets of weights, yielding two ensemble methods: Ensemble_rmse and Ensemble_acc. Experimental results show that both ensemble methods outperform the four selected imputation methods across different missingness mechanisms and missing rates. Of the two, Ensemble_acc outperforms Ensemble_rmse and is the better imputation method. The study also analyzes how well the ensemble methods suit datasets with different characteristics. By sample size, Ensemble_acc performs better on both small and large datasets. By feature type, Ensemble_rmse performs better on purely numerical datasets, while Ensemble_acc performs better on mixed-type datasets. By application domain, Ensemble_rmse performs better on medical datasets, while Ensemble_acc performs better on credit datasets. |
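
The abstract does not state how the two sets of weights are computed, so the following is only a minimal sketch of one plausible scheme, not the study's actual method. It assumes that RMSE-based weights are proportional to inverse RMSE (lower error, higher weight), that accuracy-based weights are proportional to accuracy, and that the ensemble value is a weighted average of the base imputers' outputs on numeric features. The function names `weights_from_rmse`, `weights_from_accuracy`, and `ensemble_impute` are hypothetical, and the toy numbers are illustrative only.

```python
import numpy as np

def weights_from_rmse(rmse):
    """Assumed rule: weight proportional to 1/RMSE, normalized to sum to 1."""
    inv = 1.0 / np.asarray(rmse, dtype=float)
    return inv / inv.sum()

def weights_from_accuracy(acc):
    """Assumed rule: weight proportional to accuracy, normalized to sum to 1."""
    acc = np.asarray(acc, dtype=float)
    return acc / acc.sum()

def ensemble_impute(X, imputed_list, weights):
    """Blend several imputed copies of X at the missing positions only.

    X            : (n, d) array with np.nan marking missing entries
    imputed_list : list of k fully imputed (n, d) arrays, one per base method
    weights      : length-k weights that sum to 1
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    stacked = np.stack([np.asarray(m, dtype=float) for m in imputed_list])  # (k, n, d)
    blended = np.tensordot(np.asarray(weights, dtype=float), stacked, axes=1)  # (n, d)
    # Observed entries are kept untouched; only missing ones take the blend.
    return np.where(missing, blended, X)

# Toy example with two hypothetical base imputers.
X = np.array([[1.0, np.nan], [np.nan, 4.0]])
imp_a = np.array([[1.0, 2.0], [3.0, 4.0]])
imp_b = np.array([[1.0, 4.0], [5.0, 4.0]])

w_rmse = weights_from_rmse([1.0, 3.0])        # first imputer has lower error
w_acc = weights_from_accuracy([0.80, 0.80])   # equal accuracy gives equal weights

X_rmse = ensemble_impute(X, [imp_a, imp_b], w_rmse)
X_acc = ensemble_impute(X, [imp_a, imp_b], w_acc)
```

Under these assumptions the two weightings differ only in the score that drives them, which mirrors the distinction between Ensemble_rmse and Ensemble_acc. Categorical features would need a different combination rule, such as a weighted vote, which this sketch does not cover.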