A Novel Cluster-Based Hybrid Sampling Approach for Class Imbalanced Datasets

Name

Ying-Tung Chen (陳映彤)

Department

Department of Information Management

Thesis title

A Novel Cluster-Based Hybrid Sampling Approach for Class Imbalanced Datasets
(一個基於分群之創新混合採樣法於類別不平衡資料集之應用)

Files

Full text viewable in the system (open after 2029-7-1)

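Abstract (Chinese)

Real-world data often suffer from the class imbalance problem. In binary classification, class imbalance means that one of the two classes has more samples than the other, so that the data follow a skewed distribution. Skewed datasets typically exhibit overlapping, small sample size, and small disjuncts, and they need preprocessing before a model can be trained effectively. Left untreated, a classifier may favor the majority class at prediction time and neglect the minority class, yet in many fields such as medical diagnosis, anomaly detection, and bankruptcy prediction, the minority class is usually the more valuable one.
This thesis therefore proposes a novel cluster-based hybrid sampling method, CBHS (Cluster-Based Hybrid Sampling). It applies two different clustering methods to the minority-class data to find the minority sub-clusters scattered across the data space. Based on the clustering results, it combines two over-sampling strategies with two under-sampling strategies as data preprocessing to lower the imbalance ratio between the majority and minority classes, and it trains models with three different classifiers. The study examines whether CBHS handles the three characteristics of skewed distributions more effectively and improves the final classification performance, and which strategies and clustering methods work best.
The experiments use 40 binary class-imbalanced datasets from the KEEL website, with five-fold cross-validation for validation and the area under the ROC curve (AUC) as the evaluation metric. The results show that CBHS outperforms the Baseline method in AUC and effectively handles the overlapping, small sample size, and small disjuncts of skewed data, addressing the class imbalance problem better. In addition, ensembling at the classifier level the CBHS methods that achieve the highest AUC among the three classifiers further improves classification performance, with VOTE(AP (SWO, LM)+RF) performing best.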

Abstract (English)

Real-world data often exhibit class imbalance. In binary classification, class imbalance means that one class has significantly more samples than the other, resulting in a skewed distribution. Datasets with skewed distributions typically exhibit overlapping, small sample sizes, and small disjuncts, so data preprocessing is needed to train models effectively. Without proper handling, classifiers may be biased towards the majority class and ignore the minority class. Yet in many fields, such as medical diagnosis, anomaly detection, and bankruptcy prediction, the minority class is usually the more valuable one.
Therefore, this paper proposes a novel cluster-based hybrid sampling (CBHS) approach. CBHS uses two different clustering methods to group the minority class data, identifying subgroups within the minority class. Based on the clustering results, it combines two different over-sampling strategies and two different under-sampling strategies for data preprocessing to reduce the class imbalance ratio. Three different classifiers are used to train the models. The aim is to explore whether the CBHS approach can more effectively address the three characteristics of skewed distributions, improve classification performance, and determine the optimal combination of strategies and clustering methods.
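The general idea of cluster-guided hybrid sampling can be illustrated with a minimal sketch. This is not the thesis's CBHS algorithm, whose strategies (abbreviated SWO and LM in the results) are not specified in this record. It assumes a plain K-means written out by hand, SMOTE-like interpolation between two minority points of the same cluster as the over-sampling step, and random removal as the under-sampling step. The function names `kmeans_labels` and `hybrid_sample`, and the choice to meet at the midpoint of the two class sizes, are hypothetical.

```python
import random


def kmeans_labels(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k sampled points serve as initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def hybrid_sample(majority, minority, k=2, seed=0):
    """Over-sample the minority inside its clusters and under-sample the majority.

    Assumption: both classes are brought to the midpoint of their original sizes.
    """
    rng = random.Random(seed)
    target = (len(majority) + len(minority)) // 2

    # Group the minority class into sub-clusters.
    labels = kmeans_labels(minority, k, seed=seed)
    clusters = [[p for p, lab in zip(minority, labels) if lab == c] for c in range(k)]
    clusters = [c for c in clusters if c]

    # Over-sampling: interpolate between two points of the same minority cluster,
    # so synthetic samples stay inside a sub-cluster instead of bridging two of them.
    new_minority = list(minority)
    while len(new_minority) < target:
        cluster = rng.choice(clusters)
        a, b = rng.choice(cluster), rng.choice(cluster)
        t = rng.random()
        new_minority.append(tuple(x + t * (y - x) for x, y in zip(a, b)))

    # Under-sampling: keep a random subset of the majority class.
    new_majority = rng.sample(majority, target)
    return new_majority, new_minority
```

With 90 majority and 10 minority points, this sketch returns 50 points per class. Interpolating within a cluster rather than across the whole minority class is one simple way to respect small disjuncts.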
The experiments use 40 binary imbalanced datasets from the KEEL website, with 5-fold cross-validation as the validation method and the area under the ROC curve (AUC) as the evaluation metric. The results show that the CBHS approach outperforms the Baseline method in AUC, effectively addressing overlapping, small sample sizes, and small disjuncts, and thereby handling the class imbalance problem better. Furthermore, combining the CBHS methods with the highest AUC among the three classifiers into a classifier-level ensemble further improves AUC, with the VOTE (AP (SWO, LM) + RF) method performing best.
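The evaluation protocol described above (5-fold cross-validation, AUC, and a voting ensemble) can be sketched in a few lines. The sketch makes assumptions the abstract does not state: folds are stratified by class, and the vote averages classifier scores (soft voting). The names `roc_auc`, `stratified_folds`, `soft_vote`, and `cross_validated_auc` are hypothetical helpers, and `score_fn` stands in for whatever resampling and classifier pipeline is being evaluated.

```python
import random


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(y, n_folds=5, seed=0):
    """Split indices into n_folds test folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % n_folds].append(i)
    return folds


def soft_vote(score_lists):
    """Average the per-sample scores of several classifiers (assumed soft voting)."""
    return [sum(col) / len(col) for col in zip(*score_lists)]


def cross_validated_auc(y, score_fn, n_folds=5, seed=0):
    """Mean test-fold AUC; score_fn(train_idx, test_idx) returns one score per test index."""
    folds = stratified_folds(y, n_folds, seed)
    aucs = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        scores = score_fn(train_idx, test_idx)
        aucs.append(roc_auc([y[i] for i in test_idx], scores))
    return sum(aucs) / len(aucs)
```

In this layout, any resampling step would run inside `score_fn` on the training indices only, which is a common way to keep synthetic or removed samples from influencing the test folds.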

Keywords (Chinese)

★ 資料探勘 (data mining)
★ 機器學習 (machine learning)
★ 類別不平衡 (class imbalance)
★ 資料重採樣 (data resampling)

Keywords (English)

★ data mining
★ machine learning
★ class imbalance
★ data resampling

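Table of Contents

Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Thesis Organization
Chapter 2 Literature Review
2-1 Characteristics of Class-Imbalanced Data
2-1-1 Overlapping
2-1-2 Small Sample Size
2-1-3 Small Disjuncts
2-2 Handling the Class Imbalance Problem
2-2-1 Data Level
2-2-2 Algorithm Level
2-2-3 Cost-Sensitive Methods
2-3 Clustering Algorithms
2-3-1 Affinity Propagation
2-3-2 K-means
Chapter 3 Research Method
3-1 The CBHS Method
3-1-1 CBHS-U Sub-method
3-1-2 CBHS-O Sub-method
3-1-3 Pseudo-code
3-2 Experimental Framework
3-3 Experimental Procedure
3-3-1 Experiment 1
3-3-2 Experiment 2
3-4 Experimental Environment and Parameter Settings
3-4-1 Experimental Environment
3-4-2 Experimental Parameter Settings
3-5 Experimental Datasets
3-6 Validation Criteria and Evaluation Metrics
3-6-1 Validation Criteria
3-6-2 Evaluation Metrics
Chapter 4 Experimental Results
4-1 Results of Experiment 1
4-1-1 Baseline Results
4-1-2 Results of the CBHS Family of Methods
4-1-3 Performance of the CBHS Family across Classifiers
4-1-4 Analysis of CBHS Strategies
4-1-5 Analysis of the Removal-Ratio Parameter in the CBHS Family
4-1-6 Correlation between the Number of AP Clusters and the Class Imbalance Ratio
4-1-7 Summary of Experiment 1
4-2 Results of Experiment 2
4-2-1 Data-Level Ensemble Results
4-2-2 Classifier-Level Ensemble Results
4-2-3 Summary of Experiment 2
Chapter 5 Conclusion
5-1 Conclusions and Contributions
5-2 Future Research Directions and Suggestions
References
Appendix 1. Detailed Data for Experiment 1
Appendix 2. Detailed Data for Experiment 2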

Advisor

Chih-Fong Tsai (蔡志豐)

Approval date

2024-7-9
