dc.description.abstract | Real-world data often exhibit class imbalance. In binary classification, class imbalance refers to a situation where one class has significantly more samples than the other, resulting in a skewed distribution. Datasets with skewed distributions typically also exhibit class overlap, small sample sizes, and small disjuncts, so data preprocessing is needed to train models effectively. Without proper handling, classifiers may be biased towards the majority class and ignore the minority class. In many fields, such as medical diagnosis, anomaly detection, and bankruptcy prediction, the minority class is the more valuable one to identify.
Therefore, this paper proposes a novel cluster-based hybrid sampling (CBHS) approach. CBHS applies two different clustering methods to the minority class data to identify subgroups within the minority class. Based on the clustering results, it combines two over-sampling strategies with two under-sampling strategies during data preprocessing to reduce the class imbalance ratio. Models are then trained with three different classifiers. The aim is to explore whether the CBHS approach can more effectively address the three characteristics of skewed distributions and improve classification performance, and to determine the optimal combination of sampling strategies and clustering methods.
The experiments use 40 imbalanced datasets from the KEEL website, validated with 5-fold cross-validation. The area under the ROC curve (AUC) is used as the evaluation metric. Experimental results show that the CBHS approach outperforms the Baseline method and effectively addresses class overlap, small sample sizes, and small disjuncts, thereby better solving the class imbalance problem. Furthermore, combining the highest-AUC CBHS configurations from the three classifiers into an ensemble classifier can further improve AUC, with the VOTE (AP (SWO, LM) + RF) method showing the best performance. | en_US |