dc.description.abstract | Missing Value Imputation (MVI) is an important process in data mining, because missing values can cause serious problems for classification. One of the most serious is that the majority of classification algorithms, such as neural networks and support vector machines, do not work on incomplete datasets. In the medical field, missing values are a common problem: not all possible tests can be performed on every patient, and accidental factors such as human negligence and equipment failure cause further data loss. Missing values not only make tasks such as analysis and prediction more difficult, but can also affect the immediate diagnosis and treatment that patients should receive.
In missing value imputation research, missForest is a very popular imputation method. Although it has been reported to perform better than other known imputation methods, few studies have considered how to optimize it or examined it further. Therefore, this study takes recursive feature elimination (RFE), a feature selection method currently popular in missing value imputation research, combines it with missForest, and proposes a new imputation method, RFE_missForest. We used a total of 10 medical data sets obtained from Kaggle and UCI, simulated missing rates of 10% to 50%, and then compared the imputation quality on continuous and categorical data sets against missForest and three other traditional imputation methods.
Experimental results show that our RFE_missForest algorithm achieves the best performance on both the 3 continuous data sets and the 3 mixed data sets, as measured by both NRMSE (normalized root mean squared error) and PFC (proportion of falsely classified entries). A t-test also confirmed that the difference is statistically significant. | en_US |