dc.description.abstract | Prior research generally groups imputation methods into three categories: statistical, machine learning, and deep learning. Because each category is suited to different contexts, this study applies ensemble techniques to the imputation task. It combines multiple imputation methods and weights each one according to its suitability for a given scenario, with the aim of producing better imputed values.
For the experiments, this study selects six binary classification datasets from the UCI Machine Learning Repository. Based on previous literature, representative methods were chosen for each category: the statistical methods Mean/Mode and MICE; the machine learning methods MissForest and KNN; and the deep learning methods PC-GAIN, HI-VAE, and PMIVAE. PC-GAIN was also modified to form a new method, RC-GAIN, giving eight imputation methods in total. Experiments were conducted with SVM, LightGBM, and MLP classifiers.
The four better-performing imputation methods (MICE, MissForest, RC-GAIN, and HI-VAE) and the best classifier, LightGBM, were then used to construct an ensemble imputation method. Two performance metrics, RMSE and the classification accuracy produced by LightGBM, were used to compute two sets of weights, yielding two ensemble methods: Ensemble_rmse and Ensemble_acc. Experimental results showed that both ensemble methods outperformed the four selected imputation methods across different missingness mechanisms and missing rates. Of the two, Ensemble_acc outperformed Ensemble_rmse and was the better imputation method.
The study also analyzed the suitability of the ensemble methods based on dataset characteristics. In the analysis of dataset size, Ensemble_acc performed better on both small and large datasets. In the analysis of feature types, Ensemble_rmse performed better on purely numerical datasets, while Ensemble_acc performed better on mixed datasets. Finally, in the application domain analysis, Ensemble_rmse performed better on medical datasets, while Ensemble_acc performed better on credit datasets. | en_US |