This thesis conducts experiments on 44 class-imbalanced datasets from KEEL. Within the proposed framework, two clustering methods are paired with three instance selection algorithms to find the best-fitting combination, and four classifiers combined with ensemble learning are used to build the classification models, so as to examine how different classifiers perform under the proposed architecture. The average AUC over five-fold cross-validation serves as the evaluation metric; the results are compared against traditional and ensemble methods from the literature, and the effect of the imbalance ratio on the framework is discussed. The experiments show that the proposed hybrid preprocessing framework outperforms the compared methods under most classification models, and the MLP classifier combined with the Bagging ensemble method performs best, reaching an average AUC of 92%.

The class imbalance problem is an important issue in data mining. A skewed class distribution occurs when the number of examples representing one class is much lower than that of the other classes. Because traditional classifiers maximize overall accuracy, they tend to misclassify most minority-class samples into the majority class. This phenomenon hinders the construction of effective classifiers for the valuable minority class. The problem arises in many real-world applications, such as fault diagnosis, medical diagnosis, and face recognition.
To deal with the class imbalance problem, we propose a two-stage hybrid data preprocessing framework based on clustering and instance selection techniques. The approach filters out noisy data in the majority class and reduces the execution time of classifier training. More importantly, it alleviates the effect of class imbalance and performs very well on the classification task.
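The two-stage idea can be sketched as follows. This is a minimal illustration only: it assumes k-means for the clustering stage and a simple nearest-to-centroid rule as a stand-in for the instance selection algorithms actually evaluated in the thesis; the cluster count and keep ratio are illustrative parameters, not the thesis's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def preprocess_majority(X_maj, n_clusters=5, keep_ratio=0.6, seed=0):
    """Two-stage preprocessing sketch for the majority class:
    (1) cluster the majority-class instances, then
    (2) within each cluster, keep only the instances closest to the
        centroid -- a simple stand-in for the instance selection
        algorithms compared in the thesis."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_maj)
    kept = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        if len(idx) == 0:
            continue
        # Distance of each member to its cluster centroid.
        d = np.linalg.norm(X_maj[idx] - km.cluster_centers_[c], axis=1)
        n_keep = max(1, int(len(idx) * keep_ratio))
        # Drop the farthest points, which are the most likely to be noise.
        kept.extend(idx[np.argsort(d)[:n_keep]])
    return X_maj[np.array(sorted(kept))]
```

The reduced majority class returned here would then be merged with the untouched minority class before classifier training, which is what shrinks both the imbalance ratio and the training time.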
Our experiments use 44 class-imbalanced datasets from KEEL to build four types of classification models: C4.5, k-NN, Naïve Bayes, and MLP. A classifier ensemble algorithm is also employed, and two clustering techniques and three instance selection algorithms are compared in order to find the combination best suited to the proposed method. The experimental results show that the proposed framework outperforms many well-known state-of-the-art approaches in terms of AUC. In particular, the framework combined with a bagging-based MLP ensemble performs best, providing an AUC of 92%.
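As a rough illustration of the best-performing configuration, the sketch below trains a bagging ensemble of MLPs and reports the mean five-fold AUC. The toy imbalanced dataset and all hyperparameters (ensemble size, hidden layer width, iteration cap) are stand-ins, not the settings used in the thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# A small imbalanced toy dataset (roughly 9:1) standing in for a KEEL dataset.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

# Bagging ensemble of MLP base classifiers.
clf = BaggingClassifier(
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=300, random_state=0),
    n_estimators=10,
    random_state=0,
)

# Mean AUC over five-fold cross-validation, the evaluation metric of the thesis.
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"mean 5-fold AUC: {auc:.3f}")
```

In the full experiments, the preprocessing step described above would be applied inside each training fold before fitting the ensemble, so that the test fold remains untouched.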