Thesis 104423021: Detailed Record




Author: Guan-Ting Yao (姚冠廷)    Department: Information Management
Title: 兩階段混合式前處理方法於類別非平衡問題之研究
(A Two-Stage Hybrid Data Preprocessing Approach for the Class Imbalance Problem)
Related Theses
★ Building a sales forecasting model for commercial multifunction printers using data mining techniques
★ Applying data mining techniques to resource allocation prediction: a case study of a computer OEM support unit
★ Applying data mining techniques to flight delay analysis in the airline industry: a case study of Company C
★ Safety control of new products in the global supply chain: a case study of Company C
★ Applying data mining in the semiconductor laser industry: a case study of Company A
★ Applying data mining techniques to predicting the warehouse storage time of air export cargo: a case study of Company A
★ Optimizing YouBike redistribution operations using data mining classification techniques
★ The impact of feature selection on different data types
★ Applying data mining to B2B corporate websites: a case study of Company T
★ Customer investment analysis and recommendations for financial derivatives: integrating clustering and association rule techniques
★ Building a computer-aided diagnosis model for liver ultrasound images using convolutional neural networks
★ An identity recognition system based on convolutional neural networks
★ Comparative error-rate analysis of power-data imputation methods in energy management systems
★ Development of an employee sentiment analysis and management system
★ Data cleaning for the class imbalance problem: a machine learning perspective
★ Applying data mining techniques to passenger self-service check-in analysis: a case study of Airline C
  1. The author has agreed to make this electronic thesis available for open access immediately.
  2. The open-access electronic full text is licensed for personal, non-profit retrieval, reading, and printing for academic research purposes only.
  3. Please comply with the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.

Abstract (Chinese) The class imbalance problem is an important and frequently encountered issue in data mining. It occurs when one class in a dataset has far more samples than another, producing a skewed distribution. In pursuit of high overall classification accuracy, traditional classifiers build prediction models that tend to misclassify minority-class samples as majority-class samples, so no good classification rules can be learned for the precious minority class. The phenomenon is increasingly common in the real world, arising in domains such as medical diagnosis, fault detection, and face recognition.

To address the class imbalance problem, this thesis proposes a data sampling concept based on clustering combined with instance selection, which tries to pick representative samples from the majority class and forms a two-stage hybrid data preprocessing framework. Besides effectively reducing sampling error, lowering the dataset's imbalance ratio, and shortening classifier training time, the framework also improves classification accuracy.

The experiments use 44 class-imbalanced datasets from KEEL. Within the framework, two clustering methods are paired with three instance selection algorithms to find the best combination, and four classifiers, together with ensemble learning, are used to build classification models and examine how different classifiers perform under the framework. Finally, the average AUC over five-fold cross-validation serves as the evaluation metric; the results are compared with the traditional and ensemble methods in the literature, and the effect of the imbalance ratio on the framework is discussed. The experiments show that the proposed hybrid preprocessing framework outperforms the compared methods under most classification models, with the MLP classifier combined with the Bagging ensemble method performing best, reaching an average AUC of 92%.
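
As a rough illustration of the two-stage concept described above, the following Python sketch clusters the majority class with k-means and keeps only the instances nearest each cluster centre as representatives. This nearest-to-centroid rule is a simplified stand-in for the thesis's instance selection algorithms (IB3, DROP3, GA), and scikit-learn plus a synthetic imbalanced dataset are assumptions made purely for the demonstration.

```python
# Illustrative sketch of two-stage, cluster-based under-sampling.
# Stage 1 clusters the majority class; stage 2 keeps representative
# instances per cluster. The nearest-to-centroid rule below is a
# simplified stand-in for IB3/DROP3/GA instance selection.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

def cluster_based_undersample(X_maj, n_clusters=20, keep_per_cluster=5):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_maj)
    kept = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]  # indices of cluster c
        if members.size == 0:
            continue
        dist = np.linalg.norm(X_maj[members] - km.cluster_centers_[c], axis=1)
        kept.extend(members[np.argsort(dist)[:keep_per_cluster]])
    return X_maj[np.array(kept)]

# Synthetic 9:1 imbalanced data; class 0 is the majority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_maj, X_min = X[y == 0], X[y == 1]

X_maj_sel = cluster_based_undersample(X_maj)   # ~100 representatives
X_bal = np.vstack([X_maj_sel, X_min])          # re-balanced training set
y_bal = np.hstack([np.zeros(len(X_maj_sel)), np.ones(len(X_min))])
print(f"imbalance ratio: {len(X_maj) / len(X_min):.1f} -> "
      f"{len(X_maj_sel) / len(X_min):.1f}")
```

Keeping a fixed number of representatives per cluster both lowers the imbalance ratio and preserves the spread of the majority class, which is the motivation for clustering before selection.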
Abstract (English)
The class imbalance problem is an important issue in data mining. A skewed class distribution occurs when the number of examples representing one class is much lower than the number representing the other classes. Traditional classifiers tend to misclassify most samples of the minority class into the majority class because they maximize overall accuracy. This phenomenon limits the construction of effective classifiers for the precious minority class. The problem occurs in many real-world applications, such as fault diagnosis, medical diagnosis, and face recognition.

To deal with the class imbalance problem, this thesis proposes a two-stage hybrid data preprocessing framework based on clustering and instance selection techniques. The approach filters out noisy data in the majority class and reduces the execution time of classifier training. More importantly, it decreases the effect of class imbalance and performs very well in the classification task.

The experiments use 44 class-imbalanced datasets from KEEL to build four types of classification models, namely C4.5, k-NN, Naïve Bayes, and MLP, and a classifier ensemble algorithm is also employed. In addition, two clustering techniques and three instance selection algorithms are examined in order to find the combination best suited to the proposed method. The experimental results show that the proposed framework performs better than many well-known state-of-the-art approaches in terms of AUC. In particular, the framework combined with bagging-based MLP ensemble classifiers performs best, providing an AUC of 92%.
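
A minimal sketch of the evaluation setup described above, under stated assumptions: scikit-learn, a synthetic stand-in for a dataset that has already been re-balanced by the two-stage preprocessing, and a bagging ensemble of MLP classifiers scored by the average AUC over five-fold cross-validation. In a faithful reproduction the preprocessing would be applied inside each training fold rather than beforehand.

```python
# Illustrative sketch of the evaluation: bagged MLPs scored by the
# average AUC over five-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for a dataset already re-balanced by the two-stage preprocessing.
X_bal, y_bal = make_classification(n_samples=200, weights=[0.5, 0.5],
                                   random_state=0)

# Bagging over MLP base classifiers, as in the best-performing model.
clf = BaggingClassifier(MLPClassifier(max_iter=500, random_state=0),
                        n_estimators=10, random_state=0)

# Average AUC over stratified five-fold cross-validation.
auc = cross_val_score(clf, X_bal, y_bal, cv=5, scoring="roc_auc").mean()
print(f"mean 5-fold AUC: {auc:.3f}")
```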
Keywords ★ Class imbalance
★ data mining
★ classification
★ clustering
★ instance selection
Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 The Class Imbalance Problem
2.2 Approaches to the Class Imbalance Problem
2.2.1 Data-Level Approaches
2.2.2 Algorithm-Level Approaches
2.2.3 Cost-Sensitive Approaches
2.3 Evaluation Metrics for the Class Imbalance Problem
2.4 Instance Selection
2.4.1 IB3
2.4.2 DROP3
2.4.3 GA
2.5 Machine Learning Algorithms
2.5.1 Unsupervised Learning Algorithms
2.5.2 Supervised Learning Algorithms
2.5.3 Ensemble Learning
Chapter 3 Research Methodology
3.1 Experimental Framework
3.2 The CBIS Preprocessing Framework
3.2.1 CBIS Stage 1: Data Clustering
3.2.2 CBIS Stage 2: Instance Selection
3.2.3 CBIS Pseudo-code
3.2.4 Applicability Analysis of the CBIS Framework
3.3 Method Validation
3.4 Comparison with Related Frameworks
Chapter 4 Experimental Results
4.1 Experimental Setup
4.1.1 Hardware and Software Configuration
4.1.2 Experimental Datasets
4.2 Experimental Results Using Affinity Propagation Clustering
4.2.1 Analysis Based on the C4.5 Decision Tree
4.2.2 Performance Analysis of Different Classifiers
4.2.3 Accuracy Comparison of Different Instance Selection Methods
4.2.4 Discussion of the Class Imbalance Ratio
4.3 Experimental Results Using K-means Clustering
4.3.1 Analysis Based on the C4.5 Decision Tree Algorithm
4.3.2 Performance Analysis of Different Classifiers
4.3.3 Accuracy Comparison of Different Instance Selection Methods
4.3.4 Discussion of the Class Imbalance Ratio
4.4 Summary of Experiments
Chapter 5 Conclusion
5.1 Conclusions and Contributions
5.2 Future Research Directions and Suggestions
References
Advisor: Chih-Fong Tsai (蔡志豐)    Review date: 2017-07-14
