A Study of Classification Techniques on Class-Imbalanced Datasets


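Name

龔健生 (Chien-Shen Kung)

Department

Department of Information Management, In-service Master's Program

Thesis Title

A Study of Classification Techniques on Class-Imbalanced Datasets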



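The author has agreed to make this electronic thesis openly available immediately.
Full text that has reached its open-access date is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, so as to avoid violating the law.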

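Abstract (translated from Chinese)

Most binary classification data generated in real life are imbalanced. Examples include bankruptcy records, rare-disease diagnoses, and casualties caused by accidents. When traditional binary classification algorithms train a classifier, class imbalance often biases the predictions and lowers classification accuracy, and the results tend to favor the majority class. In recent years, scholars and researchers have proposed many solutions to the class imbalance problem, yet no related study has identified which base classifiers are most suitable.
Using the proposed research framework, this study runs experiments on 44 binary classification datasets with different imbalance ratios from the KEEL website, in order to identify the base classifiers best suited to research on the class imbalance problem and to provide a reference for scholars and researchers.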
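The bias toward the majority class described above can be illustrated with a minimal sketch. The 95:5 dataset, the always-majority "classifier", and the function names below are illustrative assumptions, not data or code from the thesis.

```python
def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (binary 0/1 labels)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return max(positives, negatives) / min(positives, negatives)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def minority_recall(y_true, y_pred, minority_label=1):
    """Fraction of minority-class samples that were predicted as minority."""
    minority = [(t, p) for t, p in zip(y_true, y_pred) if t == minority_label]
    hits = sum(1 for t, p in minority if p == minority_label)
    return hits / len(minority)


# Hypothetical dataset: 95 majority samples (label 0), 5 minority samples (label 1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

print(imbalance_ratio(y_true))          # 19.0
print(accuracy(y_true, y_pred))         # 0.95
print(minority_recall(y_true, y_pred))  # 0.0
```

Accuracy looks high even though no minority sample is detected, which is one reason imbalanced-data studies tend to report measures other than plain accuracy.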

Abstract (English)

In daily life, most datasets exhibit the class imbalance problem, in which one class contains a very large number of samples whereas the other contains very few. Examples include bankruptcy records, rare-disease diagnoses, and accidental casualties. When training a classifier, traditional binary classification algorithms produce biased predictions on class-imbalanced datasets, and the results tend to favor the majority class. In recent years, many researchers have proposed solutions to the class imbalance problem.
Unlike related work that proposes novel algorithms to improve the performance of existing classification techniques, this study focuses on identifying the best baseline classifier for the class imbalance problem. The findings offer a guideline for future research, which can compare novel algorithms against the identified baseline classifier.
The experiments use 44 datasets from various domains with different imbalance ratios. Three popular classifiers, namely J48, MLP, and SVM, are constructed and compared, and classifier ensembles built with bagging and boosting are also developed. The results show that bagging-based MLP classifier ensembles perform best in terms of AUC.
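Since the classifiers are compared by AUC, a short sketch of how that measure can be computed may help. It uses the pairwise-ranking interpretation of AUC (the probability that a randomly chosen positive sample is scored above a randomly chosen negative one). The function name and the example scores are hypothetical, and this is not the thesis's own evaluation code.

```python
def auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores.

    Each (positive, negative) pair counts 1 if the positive sample has the
    higher score, 0.5 on a tie, and 0 otherwise.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample from each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced test set: 8 negatives followed by 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A scorer that gives every sample the same score carries no ranking information.
constant_scores = [0.1] * 10
# A scorer that ranks the positives above most of the negatives.
informative_scores = [0.1, 0.2, 0.3, 0.1, 0.2, 0.4, 0.3, 0.6, 0.5, 0.9]

print(auc(y_true, constant_scores))     # 0.5
print(auc(y_true, informative_scores))  # 0.9375
```

Because the value depends only on how positives rank against negatives, it does not reward a model for simply predicting the majority class. The pairwise loop is quadratic in the number of samples, so it is suited to small illustrations only.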

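Keywords (translated from Chinese)

★ Data mining
★ Class imbalance problem
★ Receiver operating characteristic curve
★ Area under the curve

Keywords (English)

★ Data Mining
★ Class Imbalance Problem
★ ROC
★ AUC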

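Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Research Process
Chapter 2 Literature Review
2.1 Introduction to Data Mining
2.1.1 Definition of Data Mining
2.1.2 The Knowledge Discovery / Data Mining Process
2.2 The Class Imbalance Problem
2.3 Supervised Learning
2.3.1 Decision Tree (C4.5)
2.3.2 Neural Network (MLP)
2.3.3 Support Vector Machine (SVM)
2.3.4 AdaBoost
2.3.5 Bagging
2.3.6 K-Fold Cross-Validation
Chapter 3 Research Method
3.1 Research Process
3.2 Dataset Description
3.3 Data Preparation
3.4 Prediction Model Design
Chapter 4 Experimental Results and Analysis
4.1 Experimental Environment
4.2 Evaluation Methods
4.2.1 Confusion Matrix
4.2.2 ROC Curve and Area Under the Curve (AUC)
4.3 Experimental Results and Analysis
4.3.1 Single-Classifier Results
4.3.2 AdaBoost Ensemble Results
4.3.3 Bagging Ensemble Results
4.4 Discussion
Chapter 5 Conclusions and Suggestions
5.1 Conclusions and Contributions
5.2 Future Research
References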


Advisor

蔡志豐

Approval Date

2016-6-6
