Master's/Doctoral Thesis 106453018: Detailed Record




Author: Wu-Lin Tsai (蔡武霖)    Department: Executive Master's Program, Department of Information Management
Thesis title: Data Cleansing for the Class Imbalance Problem: A Machine Learning Perspective
Related theses
★ Building a sales forecasting model for commercial multifunction printers using data mining techniques
★ Applying data mining techniques to resource allocation prediction: a case study of a computer OEM support unit
★ Applying data mining techniques to flight delay analysis in the airline industry: a case study of Company C
★ Security control of new products in the global supply chain: a case study of Company C
★ Data mining in the semiconductor laser industry: a case study of Company A
★ Applying data mining techniques to predicting warehouse storage time of air-export cargo: a case study of Company A
★ Optimizing YouBike rebalancing operations with data mining classification techniques
★ The impact of feature selection on different data types
★ Data mining applied to B2B corporate websites: a case study of Company T
★ Customer investment analysis and recommendations for financial derivatives: integrating clustering and association rule techniques
★ Building a computer-aided liver ultrasound image classification model with convolutional neural networks
★ An identity recognition system based on convolutional neural networks
★ Comparative error-rate analysis of power data imputation methods in energy management systems
★ Development of an employee sentiment analysis and management system
★ Applying data mining techniques to passenger self-service check-in analysis: a case study of Airline C
★ Applying machine learning to predict payment after National Health Insurance arrears reminders
Files: Electronic full text is not available through the system (access permanently restricted)
Abstract (Chinese): Machine learning has drawn renewed attention since the appearance of Google's AlphaGo, which also highlights the importance of data collection. In practice, however, the difficulties and constraints of data collection often yield imbalanced data, which makes classification difficult and inaccurate, because feature selection and imbalance handling (sampling) affect how the classifier learns and classifies in the vector space. This study uses datasets from well-known public repositories and designs two workflows to investigate the class imbalance problem, differing in whether feature selection or sampling is performed first. Five imbalance-handling sampling modules are used (three over-sampling and two under-sampling methods), placed either before or after feature selection; two feature selection modules are used; and both workflows are run with and without normalization. For classification, the two classifiers most commonly applied to class imbalance, the support vector machine (SVM) and the decision tree classifier, are employed. The experiments show that class-imbalanced data should undergo feature selection first and imbalance handling (sampling) second; for small datasets SMOTE over-sampling performs best after feature selection, while for large datasets random under-sampling performs best; PCA is recommended when fewer than 20 features are selected and GA when 20 or more; SVM is the better classifier; and normalization should be skipped for the decision tree but applied for the SVM.
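The recommendations above translate directly into configuration choices. The following is a hypothetical helper (not code from the thesis) that encodes them as rules of thumb; the numeric cutoff between "small" and "large" datasets is an assumption for illustration, since the abstract does not state one.

```python
# Hypothetical helper encoding the abstract's recommendations; illustration only.
def recommend_setup(n_samples: int, n_features: int, classifier: str,
                    small_dataset_cutoff: int = 1000) -> dict:
    """Suggest a workflow configuration based on the reported findings."""
    return {
        # Feature selection: PCA below 20 dimensions, GA at 20 or above.
        "feature_selection": "PCA" if n_features < 20 else "GA",
        # Sampling: SMOTE over-sampling for small data, random under-sampling for large.
        "sampling": "SMOTE" if n_samples < small_dataset_cutoff else "RandomUnderSampler",
        # Ordering: feature selection is applied before imbalance handling (sampling).
        "order": ("feature_selection", "sampling", "classification"),
        # Normalization: apply for SVM, skip for decision trees.
        "normalize": classifier.upper() == "SVM",
    }

# Example: a large 30-dimensional dataset classified with an SVM.
print(recommend_setup(n_samples=50000, n_features=30, classifier="SVM"))
```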
Abstract (English): After the emergence of AlphaGo, machine learning again caught the public eye and showed the essential need for data collection. In reality, however, collected data are often imbalanced owing to the many difficulties and constraints of data collection. Feature selection and imbalance handling (sampling) both affect how a classifier learns and classifies in the vector space, which in turn makes classification difficult and inaccurate. This research uses datasets from well-known public websites and designs two processes to investigate the class imbalance problem, differing in whether feature selection or sampling is placed first. Five imbalance-handling sampling modules are used (three over-sampling and two under-sampling methods), together with two feature selection modules, and each process is run with and without normalization. The two classifiers most commonly used for class imbalance, support vector machines and decision trees, perform the classification. The experiments show that feature selection should be carried out before sampling; SMOTE over-sampling works best for low data volumes, while random under-sampling works best for high data volumes; PCA is recommended when fewer than 20 features are selected and GA when 20 dimensions or more are involved; and the SVM is the better classifier. As for normalization, it should be applied for the support vector machine and omitted for the decision tree.
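As a concrete illustration of the workflow, the sketch below chains the steps in the recommended order (normalization, feature selection, then sampling on the training data only, then an SVM scored by AUC) using scikit-learn and imbalanced-learn. It is a minimal sketch under assumed defaults: the synthetic dataset, the number of retained components, and all parameters are placeholders rather than the thesis's actual datasets or settings, and PCA stands in for the GA-based selector.

```python
# Minimal sketch of the recommended ordering: normalize -> feature selection ->
# sampling (training set only) -> SVM, evaluated by AUC. Placeholder data/settings.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced toy data (roughly 9:1 majority-to-minority ratio).
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Normalization (recommended for SVM, skipped for decision trees).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Feature selection BEFORE sampling (PCA shown; a GA selector would replace this step).
pca = PCA(n_components=10).fit(X_train)
X_train, X_test = pca.transform(X_train), pca.transform(X_test)

# Imbalance handling on the training set only:
# SMOTE over-sampling for small datasets, random under-sampling for large ones.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
# X_res, y_res = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

# Classification with SVM, evaluated by AUC.
clf = SVC(probability=True, random_state=42).fit(X_res, y_res)
print("AUC:", round(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]), 3))
```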
Keywords (Chinese) ★ Machine learning
★ Data mining
★ Class imbalance
★ Sampling
★ Feature selection
Keywords (English) ★ Machine Learning
★ Data Mining
★ Class Imbalanced Problem
★ Sampling
★ Feature Selection
Table of Contents  Chinese Abstract I
Abstract II
Acknowledgements III
Table of Contents IV
List of Tables VI
List of Figures VIII
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Motivation 2
1.3 Research Objectives 3
1.4 Research Contributions 3
1.5 Thesis Organization 4
Chapter 2 Literature Review 6
2.1 Data Preprocessing 6
2.2 Feature Selection 7
2.2.1 Supervised Learning 7
2.2.2 Genetic Algorithm (GA) 8
2.2.3 Unsupervised Learning 10
2.2.4 Principal Component Analysis (PCA) 11
2.3 Class Imbalance 12
2.3.1 Over-Sampling 12
2.3.2 SMOTE 13
2.3.3 Borderline-SMOTE 14
2.3.4 ADASYN 16
2.3.5 Under-Sampling 18
2.3.6 Random Under-Sampling 19
2.3.7 Edited Nearest Neighbours 20
2.4 Support Vector Machine (SVM) 21
2.5 Decision Tree Classifier 22
2.6 Normalization 24
2.7 Recent Related Research 26
Chapter 3 Research Methodology 28
3.1 Research Framework 28
3.1.1 Research Framework Workflow 30
3.2 Data Collection and Preprocessing 32
3.2.1 Data Collection 33
3.2.2 Data Preprocessing Workflow 34
3.3 Feature Selection Models and Types 35
3.3.1 Feature Selection Workflow 36
3.4 Imbalanced Data Handling Models and Types 37
3.5 Classification Models and Types 39
3.6 Evaluation Methods 41
3.6.1 Confusion Matrix 41
3.6.2 Area Under the ROC Curve (AUC) 43
3.7 Summary 44
Chapter 4 System Implementation and Experiments 45
4.1 Experimental Environment 46
4.2 Experimental Design 47
4.2.1 Model Parameter Settings 49
4.3 Analysis of Experimental Results 50
4.3.1 Feature Selection vs. Imbalance Handling 51
4.3.2 Over-Sampling vs. Under-Sampling 54
4.3.3 PCA vs. GA 56
4.3.4 Normalization vs. No Normalization 59
Chapter 5 Conclusions 71
5.1 Research Summary 71
5.2 Suggestions and Future Research Directions 72
References 73
Appendix 81
Advisor: Chih-Fong Tsai (蔡志豐)    Date of approval: 2019-07-03