dc.description.abstract | As information technology has advanced, missing values have become common in many datasets. They reduce data completeness and hamper data analysis and decision-making, so handling missing values is a crucial and challenging task in data preprocessing.
Missing values are mainly handled by imputation, using either statistical or machine learning techniques; both are forms of single imputation. More recently, scholars have adopted multiple imputation. However, few studies compare multiple imputation with machine learning imputation across different datasets and missing rates. In addition, although ensemble learning has improved model prediction accuracy, its use in missing value imputation remains under-researched. We therefore analyze the performance of single and multiple imputation methods and explore ensemble learning for missing value imputation.
This study used 25 UCI datasets of numerical, categorical, and mixed types, with simulated missing rates from 10% to 50%. Five machine learning algorithms were evaluated for single imputation and for multiple imputation by chained equations (MICE), and two MICE-based ensemble imputation methods, using hybrid and parallel strategies, were proposed. Imputation effectiveness was assessed using SVM classification accuracy, RMSE, MAPE, and hit ratio.
The results showed that multiple imputation generally outperformed single imputation, with random forest performing best on mixed datasets and the other methods performing slightly worse. The ensemble imputation experiments indicated that both the hybrid and parallel strategies improved all metrics, although the order in which models were applied in hybrid imputation significantly affected the results. Finally, we provide recommendations for optimal combinations of multiple and ensemble imputation as a reference for future research. | en_US |