

    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/84019


    Title: One Class Classification on Imbalanced Datasets Using Feature Selection and Ensemble Learning
    Author: Tsang, Yi-Ting
    Contributor: Department of Information Management
    Keywords: Class Imbalance; One-Class Classification; Feature Selection; Ensemble Learning; Data Mining
    Date: 2020-07-17
    Upload time: 2020-09-02 17:55:55 (UTC+8)
    Publisher: National Central University
    Abstract: In real-world datasets, class imbalance is a common problem. In the literature, it can be addressed in four main ways: data-level methods, algorithm-level methods, cost-sensitive methods, and ensemble learning. This thesis focuses on the algorithm level, using one-class classification algorithms, which can build a classifier by learning from single-class data. For the experiments, 55 class-imbalanced datasets from the KEEL dataset repository are used, and three one-class classification algorithms are compared: One-Class SVM (OCSVM), Isolation Forest (IF), and Local Outlier Factor (LOF).
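    The one-class setting described above can be illustrated with a small sketch: each model is trained on majority-class samples only, then scored on a mixed test set with AUC. This is only a minimal illustration using scikit-learn and a hypothetical synthetic dataset; the hyperparameter values (`nu`, `n_neighbors`) are assumptions, not the thesis's settings. Note that LOF needs `novelty=True` to score unseen data.

    ```python
    import numpy as np
    from sklearn.svm import OneClassSVM
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    # Hypothetical imbalanced data: a large "normal" cluster and a small minority cluster.
    X_major = rng.normal(0.0, 1.0, size=(500, 2))
    X_minor = rng.normal(4.0, 1.0, size=(25, 2))

    # One-class training uses majority-class samples only.
    X_train = X_major[:400]
    X_test = np.vstack([X_major[400:], X_minor])
    y_test = np.array([1] * 100 + [0] * 25)   # 1 = majority, 0 = minority

    models = {
        "OCSVM": OneClassSVM(nu=0.05, gamma="scale"),
        "IF": IsolationForest(random_state=0),
        "LOF": LocalOutlierFactor(n_neighbors=20, novelty=True),
    }
    aucs = {}
    for name, model in models.items():
        model.fit(X_train)
        scores = model.decision_function(X_test)  # higher score = more "normal"
        aucs[name] = roc_auc_score(y_test, scores)
    print(aucs)
    ```

    On well-separated synthetic clusters like these, all three models score the minority points as anomalies and the AUC is close to 1; on the real KEEL datasets the separation is of course far weaker.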
    Past research shows that data pre-processing, such as feature selection, can improve the quality of the data and thus the performance of classifiers. Moreover, few studies have examined performing feature selection on binary classification datasets before building one-class classification models. Therefore, one feature selection method from each of the wrapper, filter, and embedded categories is employed: the Genetic Algorithm (GA), Principal Component Analysis (PCA), and the C4.5 decision tree (C4.5), respectively. The first research objective is to find out which combination of one-class classification algorithm and feature selection method performs best, and whether the performance of one-class classifiers is affected by the level of the class imbalance ratio. The second research objective is to apply ensemble learning, combining several different base classifiers into a final prediction model, to examine whether one-class classifier ensembles can further improve on single one-class classifiers.
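    The "feature selection, then one-class modelling" pipeline can be sketched with the PCA (filter-type) option: the reduction is fitted on the majority-class training data only, and the one-class model is built on the reduced features. The synthetic data below, with two informative and eight noise dimensions, is a hypothetical stand-in for a KEEL dataset, and `n_components=2` and `nu=0.05` are illustrative assumptions.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import OneClassSVM
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(1)

    def make_samples(n, shift):
        # Two informative dimensions (higher variance) plus eight noise dimensions.
        informative = rng.normal(shift, 2.0, size=(n, 2))
        noise = rng.normal(0.0, 1.0, size=(n, 8))
        return np.hstack([informative, noise])

    X_train = make_samples(400, 0.0)                      # majority class only
    X_test = np.vstack([make_samples(100, 0.0), make_samples(25, 5.0)])
    y_test = np.array([1] * 100 + [0] * 25)               # 1 = majority, 0 = minority

    # Filter-style pre-processing: PCA is fitted on the training data only,
    # then the one-class model is built on the reduced features.
    pipe = make_pipeline(PCA(n_components=2), OneClassSVM(nu=0.05))
    pipe.fit(X_train)
    auc = roc_auc_score(y_test, pipe.decision_function(X_test))
    print(round(auc, 3))
    ```

    Fitting the reduction inside the pipeline keeps the test data out of the pre-processing step, mirroring how the thesis applies feature selection before one-class model building.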

    The experimental results show that, overall, C4.5 feature selection improves the performance of the one-class classifiers. However, when the datasets are divided into high and low imbalance-ratio groups, the picture changes. For datasets with low class imbalance ratios, C4.5 feature selection improves the performance of OCSVM and IF, but they still do not match using C4.5 directly as the classifier. For datasets with high imbalance ratios, GA feature selection improves OCSVM and LOF, whereas C4.5 feature selection improves IF; moreover, all three one-class classifiers outperform using C4.5 directly regardless of which feature selection method is applied, so one-class methods are better suited than C4.5 to highly imbalanced datasets. After applying ensemble learning, the heterogeneous ensemble built from the top eight base one-class classifiers of the earlier experiments performs best, reaching an AUC of 83.24%.
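    A heterogeneous one-class ensemble of the kind described can be sketched by fusing the scores of several different base classifiers. The abstract does not state the thesis's exact combination rule, so the rank-averaging below is just one common score-fusion scheme, shown on hypothetical synthetic data; rank normalisation is used because the three models produce scores on different scales.

    ```python
    import numpy as np
    from sklearn.svm import OneClassSVM
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(2)
    X_train = rng.normal(0.0, 1.0, size=(400, 2))          # majority class only
    X_test = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                        rng.normal(4.0, 1.0, size=(25, 2))])
    y_test = np.array([1] * 100 + [0] * 25)                # 1 = majority, 0 = minority

    # Heterogeneous base classifiers (the thesis ranks and selects its top
    # performers; here all three are simply combined for illustration).
    bases = [OneClassSVM(nu=0.05),
             IsolationForest(random_state=0),
             LocalOutlierFactor(n_neighbors=20, novelty=True)]

    def rank_normalise(scores):
        # Map scores to [0, 1] by rank so different score scales are comparable.
        ranks = scores.argsort().argsort()
        return ranks / (len(scores) - 1)

    fused = np.zeros(len(X_test))
    for clf in bases:
        clf.fit(X_train)
        fused += rank_normalise(clf.decision_function(X_test))
    fused /= len(bases)

    auc = roc_auc_score(y_test, fused)
    print(round(auc, 3))
    ```

    Averaging rank-normalised scores lets base models that disagree on scale still vote on equal footing, which is the usual motivation for heterogeneous score fusion.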
    Appears in Collections: [Graduate Institute of Information Management] Theses & Dissertations

    Files in this item:

    index.html (0Kb, HTML)


    All items in NCUIR are protected by copyright.

