dc.description.abstract | Through data analysis, enterprises can plan future operations and make crucial decisions, so data and its applications have become increasingly important. However, raw datasets often exhibit the problems of class imbalance and high dimensionality, which commonly arise in fields such as finance and medicine. Class imbalance biases prediction, causing the model to focus mainly on the majority class rather than the minority one. High-dimensional datasets, in turn, increase computational complexity and reduce prediction accuracy because of redundant features.
In this thesis, we propose a new method, called the Oversampling ensemble, to address the class imbalance problem. Three well-known variants of SMOTE are investigated: polynom-fit-SMOTE, ProWSyn, and SMOTE-IPF. The ensemble approaches comprise Parallel and Serial ensembles, where the parallel ensembles include four data combination methods: Random, Center, Cluster Random, and Cluster Center. Experimental results on 58 KEEL datasets show that the Parallel ensembles, especially the Center and Cluster Center methods, outperform the baseline and the single oversampling algorithms. To address class imbalance and high dimensionality together, the parallel ensembles are combined with information gain and embedded Decision Tree feature selection, respectively, on 15 OpenML datasets; the results indicate that the ensemble method again surpasses the baseline and the single algorithms. In addition, appropriate methods are recommended for different imbalance ratios and numbers of features. | en_US |