dc.description.abstract | Missing values are inevitable in real-world datasets. Without appropriate treatment, most conventional machine learning models cannot handle missing values directly and may produce misleading results. Class imbalance is another critical issue in machine learning and data mining: conventional models tend to ignore the minority class, which degrades classification performance.
Because missing values and class imbalance are both so common, a growing number of studies in recent years have explored how to handle class-imbalanced data with missing values. However, few have considered the situation in which one class, especially the minority class, has more missing values than the other. Moreover, although it is acknowledged that both missing value imputation and data-level approaches may change the distribution of the original data, few studies discuss the order in which the two should be applied to class-imbalanced data with missing values.
To this end, this paper compares the performance of three processing procedures that combine six missing value imputation approaches with SMOTE when the minority class of the training data has more missing values. Besides varying the order of imputation and oversampling, we also propose a procedure that uses only the complete training data subset of the minority class as the basis for creating synthetic minority-class samples.
The experimental results show that, at most missing rates, the choice of processing procedure makes a significant difference in RMSE. Classification performance varies with the missing rate. When the missing rate is less than or equal to 50%, the procedure that imputes first has better classification performance. When the missing rate is higher than 50%, the procedure that imputes first and uses only the complete training data subset of the minority class as the basis for creating synthetic minority-class samples performs significantly better with the random forest classifier. Finally, for future researchers' reference, we recommend the better-performing combinations of processing procedure and missing value imputation approach at each level of missing rate. | en_US |