Effectiveness Analysis of Deep Tabular Generation-Based Oversampling Method in Credit Risk and Bankruptcy Prediction

Name

Hsin-Wei Chen (陳信瑋)

Department

Department of Information Management

Thesis Title

基於深度表格生成模型的過採樣方法於信用及破產預測領域的效能分析
(Effectiveness Analysis of Deep Tabular Generation-Based Oversampling Method in Credit Risk and Bankruptcy Prediction)

Files

Browse the thesis in the system (open after 2028-6-30)

Abstract (Chinese)

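In credit scoring and bankruptcy prediction, data imbalance is common because data are hard to collect and because of the nature of the domain. To address the problems that imbalance causes for model prediction, the usual remedy is to balance the dataset with traditional oversampling methods based on interpolation. In recent years, as personal data privacy has drawn more attention, techniques have emerged that use generative models to learn the distribution and features of the original dataset and generate synthetic datasets. These techniques let researchers continue their work on synthetic data without leaking personal information. Because the generated samples resemble the original samples in features and distribution, some researchers have tried applying them to the data imbalance problem.
This study uses two representative deep tabular generative models (CopulaGAN and TVAE) as representative deep oversampling methods and compares them with four representative traditional oversampling methods (SMOTE, polynomial-fit-SMOTE, Borderline SMOTE, and ADASYN) on three collected credit datasets and three bankruptcy datasets, to observe how well the six methods suit credit scoring and bankruptcy prediction.
This study finds that TVAE outperforms the other five oversampling methods in credit scoring and bankruptcy prediction. Finally, the study combines the best deep oversampling method in the experiments (TVAE) with the best traditional oversampling method (ADASYN). Overall, applying the deep oversampling method first and then the traditional oversampling method yields a still lower Type II error rate.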
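As an illustration of what "interpolation-based" oversampling means, here is a minimal SMOTE-style sketch in plain Python. It is not the thesis's implementation: the function name `smote_like_oversample`, its parameters, and the toy data are hypothetical. It assumes numeric features and Euclidean distance, and it shows only the core idea of placing a synthetic point on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by linear interpolation.

    Hypothetical helper for illustration only; assumes numeric features.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Interpolate: base + gap * (neighbour - base), with gap in [0, 1).
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two numeric features (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_like_oversample(minority, n_new=6, k=2)
```

Because every synthetic point lies between two existing minority samples, this family of methods cannot produce samples outside the region already covered by the minority class. Deep generative oversamplers instead sample from a learned distribution.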

Abstract (English)

In credit risk and bankruptcy prediction, data imbalance is a common challenge because data are difficult to collect and because of the characteristics of the domain. To address the problems that imbalance causes for model predictions, the usual approach is to balance the dataset with traditional, interpolation-based oversampling methods. In recent years, with growing emphasis on personal data privacy, methods have been developed that use deep generative models to learn the distribution of the original samples and the relationships between their features, and then generate synthetic datasets. This allows researchers to continue their studies without compromising individual privacy. Because the samples generated by such methods have characteristics and distributions similar to those of the original samples, some researchers have applied them to the data imbalance problem. This approach is referred to as deep oversampling.
This study compares two representative deep tabular generative models (CopulaGAN and TVAE) with four representative traditional oversampling methods (SMOTE, Polynomial-fit-SMOTE, Borderline SMOTE, and ADASYN) on three credit risk datasets and three bankruptcy datasets, to observe how the six methods perform in credit risk and bankruptcy prediction. TVAE outperformed the other five oversampling methods in both domains. We further combined the best deep oversampling method (TVAE) with the best traditional oversampling method (ADASYN) and found that, overall, applying deep oversampling first and then the traditional oversampling method leads to an even lower Type II error.
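The Type II error mentioned above can be sketched as follows. This record does not give the thesis's exact definition, so the sketch assumes the common statistical convention: the minority class (default or bankruptcy) is the positive class, and the Type II error rate is the share of actual positives predicted as negative, FN / (FN + TP). The function name and the labels below are hypothetical.

```python
def type_ii_error_rate(y_true, y_pred, positive=1):
    """Return FN / (FN + TP) for the given positive label.

    Assumption: "Type II error" means missed positive (minority) cases.
    The thesis may define it differently.
    """
    # Count actual positives that the classifier missed (false negatives).
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Count actual positives that the classifier caught (true positives).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    if fn + tp == 0:
        raise ValueError("no positive cases in y_true")
    return fn / (fn + tp)


# Made-up labels: 1 = bankrupt/default (minority), 0 = healthy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
rate = type_ii_error_rate(y_true, y_pred)  # 2 missed out of 4 positives
```

Under this assumption, a lower rate means fewer bankrupt or defaulting cases slip through as healthy, which is the kind of error that oversampling the minority class is meant to reduce.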

Keywords (Chinese)

★ 過採樣 (Oversampling)
★ 深度學習 (Deep learning)
★ 不平衡資料集 (Imbalanced dataset)
★ 生成模型 (Generative model)

Keywords (English)

★ Oversampling
★ Deep learning
★ Imbalanced dataset
★ Generative Model

Table of Contents

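Abstract (Chinese) i
Abstract ii
Acknowledgements iii
Table of Contents iv
List of Figures vi
List of Tables viii
1. Introduction 1
1-1 Research Background 1
1-2 Research Motivation 1
1-3 Research Objectives 4
2. Literature Review 5
2-1 Previous Research on Credit Scoring and Bankruptcy Prediction 5
2-2 Current Development of Oversampling Techniques 6
2-2-1 SMOTE (Synthetic Minority Oversampling Technique) 6
2-2-2 Borderline SMOTE 7
2-2-3 ADASYN (Adaptive Synthetic Sampling) 8
2-2-4 Polynomial-fit-SMOTE 8
2-3 Deep Generative Models 9
2-3-1 Variational AutoEncoder (VAE) 9
2-3-2 Generative Adversarial Network (GAN) 10
2-4 Deep Tabular Generative Models 11
2-4-1 TVAE & CTGAN 12
2-4-2 CopulaGAN 13
2-5 Classifiers 14
2-5-1 SVM (Support Vector Machine) 14
2-5-2 LightGBM 15
2-5-3 MLP (Multi-Layer Perceptron) 16
3. Methodology 18
3-1 Datasets 19
3-2 Data Preprocessing 20
3-3 Evaluation Metrics 21
3-3-1 Receiver Operating Characteristic - Area Under Curve (ROC-AUC) 22
3-3-2 Type II Error 23
3-4 Experiment 1: Effectiveness Analysis of Deep Oversampling Methods in Credit Scoring and Bankruptcy Prediction 23
3-5 Experiment 2: Effect Analysis of Combining Deep and Traditional Oversampling Methods 24
3-6 Experimental Parameter Settings and Methods 26
4. Experimental Results and Analysis 28
4-1 Effectiveness Analysis of Deep Oversampling Methods 28
4-1-1 Selecting the Best Classifier 28
4-1-2 Selecting the Best Oversampling Method 34
4-2 Effect Analysis of Combining Deep and Traditional Oversampling Methods 37
4-3-1 Oversampling Analysis in Credit Scoring and Bankruptcy Prediction 43
4-3-2 Oversampling Analysis under Different Imbalance Ratios 45
5. Conclusion 48
5-1 Conclusions and Contributions 48
5-2 Research Limitations and Future Work 50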

Advisor

Kuen-Liang Sue (蘇坤良)

Approval Date

2023-7-24
