Thesis Record 110423025: Detailed Information




Name: 林家暘 (LIN, JIA-YANG)   Department: Information Management
Thesis Title: 應用資料重採樣與資料離散化方法於類別不平衡問題之研究
(Data Resampling and Discretization Methods for Class Imbalanced Data)
Related Theses:
★ 利用資料探勘技術建立商用複合機銷售預測模型
★ 應用資料探勘技術於資源配置預測之研究-以某電腦代工支援單位為例
★ 資料探勘技術應用於航空業航班延誤分析-以C公司為例
★ 全球供應鏈下新產品的安全控管-以C公司為例
★ 資料探勘應用於半導體雷射產業-以A公司為例
★ 應用資料探勘技術於空運出口貨物存倉時間預測-以A公司為例
★ 使用資料探勘分類技術優化YouBike運補作業
★ 特徵屬性篩選對於不同資料類型之影響
★ 資料探勘應用於B2B網路型態之企業官網研究-以T公司為例
★ 衍生性金融商品之客戶投資分析與建議-整合分群與關聯法則技術
★ 應用卷積式神經網路建立肝臟超音波影像輔助判別模型
★ 基於卷積神經網路之身分識別系統
★ 能源管理系統電能補值方法誤差率比較分析
★ 企業員工情感分析與管理系統之研發
★ 資料淨化於類別不平衡問題: 機器學習觀點
★ 資料探勘技術應用於旅客自助報到之分析—以C航空公司為例
Files: citation export available in EndNote RIS and BibTeX formats. Full text viewable in the system from 2028-07-01.
Abstract (Chinese): In recent years, with the rapid development of artificial intelligence, many industries have invested in related research, using their existing data to develop intelligent applications suited to their own fields. In the real world, however, data naturally tends to be skewed and unevenly distributed owing to various human or environmental factors. This class imbalance problem is widespread across industries and domains, can degrade the intelligent models built for related applications, and has become an important practical issue in recent years. This study therefore applies the data-level oversampling method SMOTE (Synthetic Minority Over-sampling Technique) together with the supervised discretization methods ChiMerge and MDLP to examine how different combinations and orderings of preprocessing steps affect binary class imbalance problems. In addition, to better understand how different resampling methods perform on class imbalance, the study includes several distinct resampling methods, namely SMOTE with different sampling strategies, the undersampling method Tomek Links, and a hybrid of the two, and further investigates how different combinations and orderings of preprocessing steps affect multiclass imbalance problems.
This study uses binary and multiclass datasets from the UCI and KEEL repositories and, by applying different preprocessing steps, compares single and combined preprocessing methods on binary and multiclass imbalance problems, thereby clarifying the applicability of each method and providing effective solutions and recommendations. According to the experimental results, for binary imbalance problems this study recommends the combined "MDLP first, then SMOTE" approach to improve the classification performance of SVM, C4.5, and RF. For multiclass imbalance problems, when time cost is not a concern, the study recommends resampling first and then applying ChiMerge, which yields more robust and accurate results; when data-processing and model computation efficiency matter most, resampling first and then applying MDLP efficiently yields comparably accurate results.
Abstract (English): In recent years, with the boom in artificial intelligence, more organizations have taken the initiative to develop intelligent applications from their existing data, looking to create successful products suited to their business. In reality, however, data tends to be skewed or biased owing to various human or environmental factors. The class imbalance problem is widespread across industries and domains, and it degrades the intelligent models used in related applications, so the issue has become an important practical concern. This study explores the benefits and effects of data preprocessing steps applied in different combinations and orders to address binary class imbalance problems. The preprocessing steps include the oversampling technique SMOTE (Synthetic Minority Over-sampling Technique) and supervised discretization methods such as ChiMerge and MDLP. Additionally, to gain a deeper understanding of how different resampling methods perform in handling class imbalance, this study brings in diverse resampling methods, including SMOTE with different sampling strategies, the undersampling method Tomek Links, and a hybrid combining the two, to further investigate the impact of different preprocessing combinations and orders on multiclass imbalance problems.
This study uses binary and multiclass datasets from the UCI and KEEL repositories to compare the effects of single and combined preprocessing methods on binary and multiclass imbalance problems, thus clarifying the applicability of each method and providing effective solutions and recommendations. According to the experimental results, for binary class imbalance problems the study recommends the combined approach of using MDLP to discretize the features first and then SMOTE to balance the datasets, which improves the classification performance of SVM, C4.5, and RF. For multiclass imbalance problems, when time cost is not a concern, the study recommends resampling to balance the datasets first and then applying ChiMerge to discretize the features, which yields more robust and accurate results. If data-processing and model computation efficiency are paramount, resampling first and then applying MDLP efficiently yields fairly accurate results.
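The SMOTE step at the heart of the recommended "MDLP first, then SMOTE" pipeline creates synthetic minority samples by interpolating between a minority sample and one of its k nearest minority neighbours. The sketch below is an illustrative simplification of that idea only; the function name, parameters, and NumPy implementation are assumptions of ours, not the code used in the thesis.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples (simplified SMOTE sketch).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    k = min(k, n - 1)
    nn = np.argsort(d, axis=1)[:, :k]    # indices of k nearest neighbours

    base = rng.integers(0, n, size=n_new)            # anchor sample per new point
    nbr = nn[base, rng.integers(0, k, size=n_new)]   # one random neighbour each
    gap = rng.random((n_new, 1))                     # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])
```

In the pipeline the thesis recommends for binary problems, the features would already have been discretized (e.g. by MDLP) before this balancing step; a production workflow would typically rely on an established implementation such as imbalanced-learn's `SMOTE` rather than a hand-rolled sketch like this one.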
Keywords (Chinese): ★ 資料前處理 (data preprocessing)
★ 資料重採樣 (data resampling)
★ 資料離散化 (data discretization)
★ 類別不平衡 (class imbalance)
★ 資料探勘 (data mining)
Keywords (English): ★ data preprocessing
★ data resampling
★ data discretization
★ class imbalance
★ data mining
Table of Contents: Abstract (Chinese) i
Abstract (English) ii
Table of Contents iv
List of Figures vi
List of Tables vii
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Motivation 2
1.3 Research Objectives 4
1.3.1 Experiment 1 4
1.3.2 Experiment 2 4
1.4 Thesis Organization 5
Chapter 2 Literature Review 6
2.1 Class Imbalance 6
2.1.1 The Class Imbalance Problem 7
2.1.2 Characteristics of Class Imbalance 7
2.1.3 Solutions to the Class Imbalance Problem 9
2.2 Data Discretization 15
2.2.1 ChiMerge 16
2.2.2 Minimum Description Length Principle (MDLP) 17
2.3 Supervised Learning Classifiers 19
2.3.1 Support Vector Machine (SVM) 19
2.3.2 The C4.5 Classifier 21
2.3.3 Random Forest (RF) 22
2.4 Related Work 23
Chapter 3 Methodology 24
3.1 Experimental Design 24
3.1.1 Experiment 1 24
3.1.2 Experiment 2 28
3.2 Experimental Environment and Datasets 33
3.2.1 Experimental Environment 33
3.2.2 Experimental Datasets 34
3.3 Parameter Settings 36
3.3.1 Resampling Method Parameters 37
3.3.2 Discretization Method Parameters 38
3.3.3 Classifier Parameters 39
3.4 Validation Methods and Evaluation Metrics 40
3.4.1 Validation Methods 40
3.4.2 Evaluation Metrics 41
Chapter 4 Experimental Results 43
4.1 Experiment 1 43
4.1.1 Baseline Results 43
4.1.2 Results of Single Preprocessing Methods 44
4.1.3 Comparison of Single Preprocessing Methods with the Baseline 45
4.1.4 Results of Combined Preprocessing Methods 47
4.1.5 Comparison of Combined Preprocessing Methods with the Preceding Methods 49
4.1.6 Summary of Experiment 1 53
4.2 Experiment 2 55
4.2.1 Baseline Results 56
4.2.2 Results of Single Preprocessing Methods 56
4.2.3 Comparison of Single Preprocessing Methods with the Baseline 59
4.2.4 Results of Combined Preprocessing Methods 63
4.2.5 Comparison of Combined Preprocessing Methods with the Preceding Methods 66
4.2.6 Summary of Experiment 2 72
Chapter 5 Conclusions 74
5.1 Conclusions and Contributions 74
5.2 Future Research Directions and Suggestions 75
References 76
Appendix 1 85
Appendix 2 94
Advisor: 蔡志豐 (Tsai, Chih-Fong)   Date of Approval: 2023-07-27
