One Class Classification on Imbalanced Datasets Using Feature Selection and Ensemble Learning

Name

Yi-Ting Tsang (臧怡婷)

Department

Department of Information Management

Thesis Title

One Class Classification on Imbalanced Datasets Using Feature Selection and Ensemble Learning
(單分類方法於類別不平衡資料集之研究－結合特徵選取與集成式學習)

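Related Theses

★ One-Class Classification on Imbalanced Datasets with Missing Value Imputation and Instance Selection
★ Applying Text Mining to Stock Price Prediction: The Relationship Between Traditional Machine Learning and Deep Learning Techniques and Different Financial News Sources
★ Hybrid Preprocessing for the Class Imbalance Problem: Combining Machine Learning and Generative Adversarial Networks
★ Single and Parallel Ensemble Feature Selection Methods for the Multi-Class Imbalance Problem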

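The author has agreed to make this electronic thesis openly available immediately.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the text without authorization.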

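Abstract (translated from Chinese)

Class imbalance is a common problem in real-world datasets. Prior literature addresses it from roughly four directions: the data level, the algorithm level, cost-sensitive methods, and ensemble learning. This study takes the algorithm-level direction and builds prediction models with one-class classification methods, which can learn from the data of a single class. It uses 55 class-imbalanced datasets from the KEEL website and three one-class classification methods: One-class SVM (OCSVM), Isolation forest (IF), and Local outlier factor (LOF). Prior literature also indicates that data pre-processing improves data quality and, in turn, model performance, yet few studies have applied feature selection to binary classification datasets before modelling with one-class methods. This study therefore adds feature selection as a pre-processing step, using one method from each of three categories, wrapper, filter, and embedded: Genetic algorithm (GA), Principal component analysis (PCA), and C4.5 decision tree, respectively. It examines which feature selection method paired with which one-class classification method improves classification, and whether the performance of one-class models is affected by the class imbalance ratio. It further applies ensemble learning, combining several different base classifiers into a final prediction model, to test whether classification performance can be improved further.
The experimental results show that, overall, C4.5 feature selection improves the performance of one-class models. When the datasets are split into low and high imbalance ratio groups, the picture differs. At low ratios, C4.5 feature selection helps OCSVM and IF, but both still fall short of modelling directly with C4.5. At high ratios, GA feature selection helps OCSVM and LOF, C4.5 helps IF, and all three one-class methods beat direct use of C4.5 regardless of the feature selection method; one-class methods are therefore better suited than C4.5 to datasets with high class imbalance ratios. With ensemble learning, the heterogeneous ensemble built from the eight top-ranked models of the earlier experiments reaches the highest AUC, 83.24%.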

Abstract (English)

Class imbalance is common in real-world datasets. The literature addresses it in four main ways: data-level methods, algorithm-level methods, cost-sensitive methods, and ensemble learning. This thesis takes the algorithm-level approach and builds prediction models with one-class classification algorithms, which learn from the data of a single class. The experiments use 55 class-imbalanced datasets from the KEEL dataset repository and compare three one-class classification algorithms: One-Class SVM (OCSVM), Isolation Forest (IF), and Local Outlier Factor (LOF).
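The one-class setting described above can be illustrated with a minimal sketch. It is not the thesis's implementation: `SimpleLOF`, `k_nearest`, and `auc` are hypothetical helper names, the data are synthetic rather than KEEL datasets, and `k=10` is an arbitrary choice. The sketch fits a plain-Python Local Outlier Factor on majority-class samples only, then checks with AUC how well its scores separate held-out majority and minority samples.

```python
import math
import random


def k_nearest(point, data, k, skip=None):
    """Return the k nearest (distance, index) pairs of `point` within `data`."""
    dists = [(math.dist(point, other), j)
             for j, other in enumerate(data) if j != skip]
    return sorted(dists)[:k]


class SimpleLOF:
    """Local Outlier Factor fitted on one class only (novelty-detection use)."""

    def __init__(self, k=10):
        self.k = k

    def fit(self, train):
        self.train = train
        # Neighbours of every training point, excluding the point itself.
        neigh = [k_nearest(p, train, self.k, skip=i) for i, p in enumerate(train)]
        self.k_dist = [n[-1][0] for n in neigh]      # distance to the k-th neighbour
        self.lrd = [self._lrd(n) for n in neigh]
        return self

    def _lrd(self, neighbours):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(self.k_dist[j], d) for d, j in neighbours]
        return len(reach) / sum(reach)

    def score(self, point):
        """LOF score: close to 1 for inliers, larger for outliers."""
        neighbours = k_nearest(point, self.train, self.k)
        lrd_point = self._lrd(neighbours)
        return sum(self.lrd[j] for _, j in neighbours) / (len(neighbours) * lrd_point)


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic, well-separated data (an assumption made purely for illustration).
random.seed(0)
majority = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(220)]
minority = [(random.gauss(6, 0.5), random.gauss(6, 0.5)) for _ in range(20)]

train = majority[:200]              # only majority-class samples are used for training
test = majority[200:] + minority
labels = [0] * 20 + [1] * 20        # 1 marks the minority class

model = SimpleLOF(k=10).fit(train)
test_auc = auc([model.score(p) for p in test], labels)
print(f"AUC = {test_auc:.3f}")
```

The minority class never enters `fit`; it appears only at evaluation time, which is what distinguishes one-class learning from ordinary binary classification.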
Previous research indicates that data pre-processing, such as feature selection, can improve data quality and thus classifier performance. However, few studies have applied feature selection to binary classification datasets before building one-class classifiers. This thesis therefore employs one feature selection method from each of three categories, wrapper, filter, and embedded: Genetic Algorithm (GA), Principal Component Analysis (PCA), and C4.5 decision tree (C4.5), respectively. The first research objective is to determine which combination of feature selection method and one-class classification algorithm performs best, and to examine how the class imbalance ratio relates to the performance of one-class classifiers. The second research objective is to apply ensemble learning, combining several different one-class classifiers, to examine whether one-class classifier ensembles can outperform single one-class classifiers.
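As a rough illustration of wrapper-style feature selection feeding a one-class model, the sketch below runs a small genetic algorithm over feature masks. Everything in it is an assumption made for the example: the function names (`ga_select`, `fitness`, `centroid_scores`), the GA settings, the synthetic data, and in particular the scorer, which is a simple distance-to-centroid stand-in and not OCSVM, IF, or LOF. The fitness function used in the thesis is not stated in this record.

```python
import math
import random


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid_scores(train, test, mask):
    """Stand-in one-class scorer: distance to the majority centroid on selected features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    centre = [sum(row[j] for row in train) / len(train) for j in cols]
    return [math.dist([row[j] for j in cols], centre) for row in test]


def fitness(mask, train, valid, labels):
    if not any(mask):
        return 0.0                      # an empty feature subset is not usable
    return auc(centroid_scores(train, valid, mask), labels)


def ga_select(train, valid, labels, n_features, pop_size=12, generations=15,
              mutation_rate=0.1, rng=None):
    """Search feature masks with selection, single-point crossover, and bit-flip mutation."""
    rng = rng or random.Random(0)
    # Seed with the all-features mask so the result is never worse than using every feature.
    population = [[1] * n_features] + [
        [rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size - 1)]

    def score(mask):
        return fitness(mask, train, valid, labels)

    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: pop_size // 2]
        children = [ranked[0]]          # elitism: keep the best mask unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        population = children
    return max(population, key=score)


# Synthetic data: feature 0 separates the classes, features 1-3 are high-variance noise.
data_rng = random.Random(1)


def sample(shift):
    return [data_rng.gauss(shift, 1)] + [data_rng.gauss(0, 3) for _ in range(3)]


train = [sample(0) for _ in range(100)]                       # majority class only
valid = [sample(0) for _ in range(30)] + [sample(4) for _ in range(10)]
labels = [0] * 30 + [1] * 10

best = ga_select(train, valid, labels, n_features=4)
print("selected mask:", best)
print("AUC with selected features:", fitness(best, train, valid, labels))
print("AUC with all features:", fitness([1, 1, 1, 1], train, valid, labels))
```

Selecting and evaluating on the same validation split, as done here for brevity, gives an optimistic estimate; a separate test split would be needed for a fair comparison.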

The experimental results show that, overall, C4.5 feature selection improves the performance of the one-class classifiers. The picture changes when the datasets are divided into high and low imbalance ratio groups. For datasets with low imbalance ratios, C4.5 feature selection improves the performance of OCSVM and IF, although both still fall short of using C4.5 directly as the classifier. For datasets with high imbalance ratios, GA improves the performance of OCSVM and LOF, whereas C4.5 feature selection improves the performance of IF; whichever feature selection method is used, the three one-class classifiers perform better than using C4.5 directly. With ensemble learning, the heterogeneous classifier ensembles built from the top eight base one-class classifiers outperform the other classifier ensembles and the single one-class classifiers, reaching an AUC of 83.24%.
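One common way to combine heterogeneous detectors is to rescale each detector's anomaly scores and average them; the sketch below shows that idea on hand-made scores. The combination rule used in the thesis is not stated in this record, so min-max averaging is an assumption, and `min_max`, `average_ensemble`, and the score values are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def min_max(scores):
    """Rescale scores to [0, 1] so detectors with different ranges are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def average_ensemble(score_lists):
    """Average the normalised anomaly scores of several base detectors per sample."""
    normalised = [min_max(s) for s in score_lists]
    return [sum(col) / len(col) for col in zip(*normalised)]


# Hand-made anomaly scores from two hypothetical detectors on six samples.
labels = [0, 0, 0, 1, 1, 1]                     # 1 marks the minority class
scores_a = [0.1, 0.2, 0.6, 0.5, 0.8, 0.9]       # misranks one majority/minority pair
scores_b = [30, 70, 10, 90, 60, 80]             # different scale, misranks another pair

combined = average_ensemble([scores_a, scores_b])
print("detector A:", auc(scores_a, labels))
print("detector B:", auc(scores_b, labels))
print("ensemble:  ", auc(combined, labels))
```

In this toy case each detector misranks one pair (AUC 8/9), while the averaged scores rank every minority sample above every majority sample. Averaging helps here because the two detectors make different mistakes, which is the usual argument for heterogeneous ensembles.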

Keywords

★ Class Imbalance
★ One-Class Classification
★ Feature Selection
★ Ensemble Learning
★ Data Mining

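Table of Contents

Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 The Class Imbalance Problem
2.2 Approaches to the Class Imbalance Problem
2.2.1 Algorithm-Level Approaches
2.3 Feature Selection
2.3.1 Genetic Algorithm (GA)
2.3.2 Principal Component Analysis (PCA)
2.3.3 C4.5 Decision Tree (C4.5)
2.4 One-Class Classification
2.4.1 One-Class Support Vector Machine (OCSVM)
2.4.2 Isolation Forest (IF)
2.4.3 Local Outlier Factor (LOF)
2.5 Ensemble Learning
Chapter 3 Research Methodology
3.1 Experimental Framework
3.2 Experimental Parameter Settings
3.2.1 Feature Selection Parameter Settings
3.2.2 One-Class Classifier Parameter Settings
3.3 Validation Procedure and Evaluation Metrics
3.3.1 Validation Procedure
3.3.2 Evaluation Metrics
3.4 Experiment 1
3.4.1 Baseline
3.4.2 Classification After Feature Selection
3.5 Experiment 2
Chapter 4 Experimental Results
4.1 Experimental Setup
4.1.1 Datasets
4.1.2 Computing Environment
4.2 Results of Experiment 1
4.2.1 Baseline
4.2.2 Classification After Feature Selection
4.2.3 Summary of Experiment 1
4.3 Results of Experiment 2
4.3.1 Homogeneous Ensembles
4.3.2 Heterogeneous Ensembles
4.3.3 Summary of Experiment 2
4.4 Overall Summary
Chapter 5 Conclusion
5.1 Conclusions and Contributions
5.2 Future Research Directions and Suggestions
References
Appendix 1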

Advisors

Chih-Fong Tsai (蔡志豐), Kuen-Liang Sue (蘇坤良)

Date of Approval

2020-07-17
