Thesis/Dissertation 104423035: Detailed Record




Name 邱子安 (Tzu-An Chiu)   Department Department of Information Management
Thesis Title Comparative analysis of pre-processing methods and classifier ensembles for bankruptcy prediction and credit scoring
(在破產預測與信用評估領域對前處理方式與分類器組合的比較分析)
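Files
  1. The author has authorized immediate open access to this electronic thesis.
  2. Open-access electronic full texts are licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
  3. Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.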

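Abstract (Chinese) As storage media technology has advanced, enterprises no longer need to worry about capacity when storing data as they did in the past, and they keep all of their data for later analysis. This, however, makes the data overly complex, so data pre-processing has become an important issue in data mining. Feature selection and instance selection are two major pre-processing techniques. Past studies have usually focused on only one of them, and studies that discuss both are uncommon. Those that do have used only the genetic algorithm as the feature and instance selection method, without combining or comparing other methods. It is therefore unclear whether combinations of other feature or instance selection methods would outperform the genetic algorithm combination, or whether the order in which feature selection and instance selection are applied affects performance when other methods are used. The purpose of this study is thus to combine several representative feature and instance selection methods in order to examine their relative merits, the effect of their ordering, and whether datasets from the two domains of credit scoring and bankruptcy prediction behave differently. Each domain uses datasets that differ in both the number of variables and the class ratio, with the aim of finding out whether differing dataset characteristics also affect the choice of selection method. The experiments compare several representative classifiers, with the aim of finding, in addition to the best ordering and combination of selection methods, the best classifier or classifier ensemble as a reference for subsequent experiments.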
Abstract (English) With advances in storage media technology, many companies no longer need to consider capacity when storing data, as they did in the past. They now save all of their data for later analysis, but this makes the data too complex for practical use. Data pre-processing has therefore become an important issue in data mining. Feature selection and instance selection are two important pre-processing tasks, but the literature has usually focused on only one of them. The few studies that address both tasks at the same time use only a genetic algorithm for feature and instance selection, so it remains unclear whether other combinations of pre-processing methods perform differently from the genetic algorithm.
The aim of this research is therefore to perform feature selection and instance selection with several representative methods, applied in different orders, and to examine the resulting classification performance in two different domains, namely bankruptcy prediction and credit scoring.
We use datasets with different numbers of features and different class ratios to find out whether the characteristics of a dataset affect the performance of feature or instance selection. We also use several representative classifiers to find out which classifier or classifier ensemble performs best, as a reference for further use.
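The question of ordering raised in the abstract (feature selection before instance selection, or the reverse) can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes a toy dataset produced by a hypothetical `make_toy_data` helper, ranks features by Welch's t-statistic (one of the feature selection methods listed in the table of contents), and uses a simple centroid-distance filter as a stand-in for instance selection; that filter is an assumption for illustration and is not one of the methods the thesis evaluates. All function names and parameters are hypothetical.

```python
import random
import statistics


def welch_t(a, b):
    """Absolute Welch's t-statistic between two samples (unequal variances)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return abs(statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def select_features(X, y, k):
    """Feature selection: keep the k columns with the largest |t| between classes."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((welch_t(pos, neg), j))
    keep = sorted(j for _, j in sorted(scores, reverse=True)[:k])
    return [[row[j] for j in keep] for row in X], keep


def select_instances(X, y, keep_ratio):
    """Stand-in instance selection (an assumption, not a method from the thesis):
    within each class, keep the rows closest to that class's centroid."""
    n_cols = len(X[0])
    kept = []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        centroid = [statistics.mean(X[i][j] for i in idx) for j in range(n_cols)]

        def sq_dist(i):
            return sum((X[i][j] - centroid[j]) ** 2 for j in range(n_cols))

        n_keep = max(2, round(len(idx) * keep_ratio))
        kept.extend(sorted(idx, key=sq_dist)[:n_keep])
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]


def fs_then_is(X, y, k, keep_ratio):
    """Order A: reduce the columns first, then prune rows in the reduced space."""
    X_fs, cols = select_features(X, y, k)
    X_out, y_out = select_instances(X_fs, y, keep_ratio)
    return X_out, y_out, cols


def is_then_fs(X, y, k, keep_ratio):
    """Order B: prune rows in the full space first, then rank the columns."""
    X_is, y_is = select_instances(X, y, keep_ratio)
    X_out, cols = select_features(X_is, y_is, k)
    return X_out, y_is, cols


def make_toy_data(seed, n_per_class=20, n_features=4):
    """Hypothetical toy data: only feature 0 separates the two classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_features)]
            row[0] += 3.0 * label
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data(seed=0)
    for name, pipeline in (("FS -> IS", fs_then_is), ("IS -> FS", is_then_fs)):
        X_red, y_red, cols = pipeline(X, y, k=2, keep_ratio=0.5)
        print(name, "kept columns", cols, "rows", len(X_red))
```

The two pipelines return data of the same shape, but they may keep different rows and columns, because each step sees a different view of the data depending on what ran before it. In an experiment such as the one the abstract describes, each reduced dataset would then be passed to the same classifiers so that the two orders can be compared.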
Keywords (Chinese) ★ Data mining
★ Feature selection
★ Instance selection
★ Classifier ensembles
★ Genetic algorithm
Keywords (English)
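Table of Contents Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
2. Literature Review
2-1 Dataset Differences
2-1-1 Credit and Bankruptcy Datasets
2-1-2 Large and Small Datasets
2-1-3 Balanced and Imbalanced Datasets
2-2 Classifiers
2-2-1 Logistic Regression (LR)
2-2-2 Support Vector Machine (SVM)
2-2-3 Artificial Neural Network (ANN)
2-2-4 Decision Tree (DT)
2-3 Classifier Ensembles
2-3-1 Boosting
2-3-2 Bagging (Bootstrap aggregating)
2-4 Feature Selection (FS)
2-4-1 T-test (Welch's T-test)
2-4-2 Principal Component Analysis (PCA)
2-4-3 Genetic Algorithm (GA)
2-5 Instance Selection (IS)
2-5-1 Genetic Algorithm
2-5-2 Affinity Propagation (AP)
2-5-3 Self-Organizing Map (SOM)
2-6 The Ordering of FS and IS
2-7 Adjustment Methods for Imbalanced Datasets
2-7-1 Random Under-sampling (RU)
2-7-2 Synthetic Minority Over-sampling Technique (SMOTE)
3. Experimental Design
3-1 Experimental Procedure
3-1-1 Main Experimental Procedure
3-1-2 Cross-Validation
3-1-3 Procedure for Comparing Pre-processing Types
3-1-4 Procedure for Comparing Pre-processing Methods
3-1-5 Procedure for Comparing Classifiers
3-2 Datasets
3-3 Parameters, Software, and Equipment
3-4 Evaluation Criteria
3-4-1 Accuracy (ACC)
3-4-2 AUC (Area Under ROC Curve)
3-4-3 Type II Error
3-4-4 Priority Among the Three Metrics
3-5 T-test and P-value
4. Experimental Results and Analysis
4-1 Comparison of Pre-processing Types
4-2 Best Pre-processing Methods for Different Types of Datasets
4-2-1 Best Pre-processing Methods for Credit and Bankruptcy Datasets
4-2-2 Best Pre-processing Methods for Large and Small Datasets
4-2-3 Best Pre-processing Methods for Balanced and Imbalanced Datasets
4-3 Selection of Classification Models
4-3-1 Best Classifiers for Credit and Bankruptcy Datasets
4-3-2 Best Classifiers for Large and Small Datasets
4-3-3 Best Classifiers for Balanced and Imbalanced Datasets
4-4 Best Performance Results
4-5 Comparison Before and After Adjusting Extremely Imbalanced Datasets
4-5-1 Comparison of Ordinary and Extremely Imbalanced Datasets
4-5-2 Effect of Class-Ratio Adjustment on Type II Error and AUC
4-5-3 Best Pre-processing Methods and Best Classifiers for the Adjusted Datasets
5. Conclusions and Future Work
5-1 Conclusions
5-2 Future Work
References
Appendix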
Advisor 蘇坤良 (Kuen-Liang Sue)   Approval Date 2018-1-18