Abstract: | Real-world data often suffer from class imbalance. In binary classification, class imbalance means that one class has considerably more samples than the other, so the data follow a skewed distribution. Skewed datasets typically exhibit three characteristics: overlapping, small sample size, and small disjuncts. They therefore require preprocessing before a model can be trained effectively; left untreated, a classifier tends to favor the majority class and neglect the minority class. Yet in many fields, such as medical diagnosis, anomaly detection, and bankruptcy prediction, the minority class is usually the more valuable one. This paper therefore proposes a novel cluster-based hybrid sampling (CBHS) approach. CBHS applies two different clustering methods to the minority class to identify minority subclusters scattered across the data space. Based on the clustering results, it combines two over-sampling strategies with two under-sampling strategies to reduce the imbalance ratio between the majority and minority classes, and then trains models with three different classifiers. 
The aim is to examine whether CBHS handles the three characteristics of skewed distributions more effectively and thereby improves classification performance, and to identify the best choice of sampling strategies and clustering methods. The experiments use 40 binary class-imbalanced datasets from the KEEL website, with 5-fold cross-validation as the validation method and the area under the ROC curve (AUC) as the evaluation metric. The results show that CBHS achieves a higher AUC than the Baseline method, effectively handling overlapping, small sample size, and small disjuncts, and thus addressing the class imbalance problem better. Furthermore, combining the highest-AUC CBHS methods from the three classifiers into a classifier-level ensemble further improves AUC, with the VOTE (AP (SWO, LM) + RF) method performing best. |
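The general idea of cluster-based hybrid sampling can be illustrated with a minimal sketch. It is not the CBHS implementation: the abstract does not name the two clustering methods or the four sampling strategies, so plain k-means, SMOTE-style interpolation within a cluster, and random under-sampling are used here as stand-ins. The function names `kmeans` and `hybrid_sample`, the parameters `k` and `seed`, and the choice of the midpoint between the two class sizes as the balancing target are all assumptions made for illustration.

```python
import random
from math import dist


def kmeans(points, k, rng, iters=20):
    """Plain k-means, a stand-in for the unspecified clustering methods."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # New center = mean of the group; keep the old center if the group is empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return [g for g in groups if g]


def hybrid_sample(majority, minority, k=2, seed=0):
    """Balance two classes by over-sampling the minority inside its clusters
    and randomly under-sampling the majority (assumed stand-in strategies)."""
    rng = random.Random(seed)
    # Assumed target: both classes end up at the midpoint of the two sizes.
    target = (len(majority) + len(minority)) // 2

    clusters = kmeans(minority, k, rng)

    # Over-sampling: interpolate between two members of the same minority
    # cluster, so synthetic points stay inside that subcluster's region.
    synthetic = []
    while len(minority) + len(synthetic) < target:
        cluster = rng.choice(clusters)
        a, b = rng.choice(cluster), rng.choice(cluster)
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))

    # Under-sampling: keep a random subset of the majority class.
    kept = rng.sample(majority, target)
    return kept, minority + synthetic


# Toy data: 90 majority points and two small, separated minority groups.
maj = [(float(i), float(i % 7)) for i in range(90)]
mino = [(100.0 + i, 0.0) for i in range(5)] + [(200.0 + i, 5.0) for i in range(5)]

kept, new_min = hybrid_sample(maj, mino, k=2, seed=0)
print(len(kept), len(new_min))
```

Clustering the minority class before over-sampling keeps each synthetic sample between two members of the same subcluster, which is one plausible way to respect small disjuncts. The resampled data would then be passed to a classifier and scored with AUC under cross-validation, as the abstract describes.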