

    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/89792


    Title: 過採樣集成法於類別不平衡與高維度資料之研究 (Oversampling Ensembles in Class Imbalanced and High Dimensional Data)
    Author: 林欣儀 (Lin, Hsin-Yi)
    Contributor: Department of Information Management
    Keywords: class imbalance; high dimensionality; feature selection; ensemble learning
    Date: 2022-07-13
    Uploaded: 2022-10-04 11:59:52 (UTC+8)
    Publisher: National Central University
    Abstract: Through data analysis, enterprises can plan future operations and support critical decisions, so data and its applications have become increasingly important. Raw datasets, however, often suffer from class imbalance and high dimensionality, two problems common in fields such as finance and medicine. Class imbalance biases prediction: the model focuses on the majority class and neglects the minority class. High dimensional datasets, because of their many redundant features, increase computational complexity and reduce predictive accuracy.
    After surveying the literature on class imbalance and high dimensionality, this thesis proposes a new method for the class imbalance problem: the Oversampling ensemble, which combines three well-known SMOTE variants, polynom-fit-SMOTE, ProWSyn, and SMOTE-IPF. The ensemble takes two forms, the Parallel ensemble and the Serial ensemble, where the Parallel ensemble includes four methods for selecting the generated samples: Random, Center, Cluster Random, and Cluster Center. Experiments on 58 KEEL datasets show that Parallel ensembles significantly outperform the baseline and the single oversampling algorithms, with Center and Cluster Center performing best. For datasets that are both class imbalanced and high dimensional, the Oversampling ensemble is combined with Information Gain (IG) and embedded Decision Tree feature selection; experiments on 15 OpenML datasets show that the ensemble method surpasses the baseline and single algorithms, and appropriate methods are recommended for different imbalance ratios and numbers of features.
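The parallel ensemble described above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: basic SMOTE interpolation stands in for the three variants (polynom-fit-SMOTE, ProWSyn, SMOTE-IPF), only the "Random" selection method is shown, and the function names `smote` and `parallel_ensemble` are illustrative.

```python
import numpy as np

def smote(X_min, n_samples, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between a
    random minority sample and one of its k nearest minority neighbours
    (the basic SMOTE idea)."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_samples):
        i = rng.integers(len(X_min))
        # distances from sample i to all minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]        # k nearest neighbours, skipping self
        j = rng.choice(nn)
        lam = rng.random()                 # interpolation factor in [0, 1]
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

def parallel_ensemble(X_min, n_needed, samplers, rng=None):
    """'Random' combination: pool the synthetic samples produced by several
    oversamplers in parallel, then randomly keep n_needed of them."""
    rng = np.random.default_rng(rng)
    pool = np.vstack([s(X_min, n_needed, rng=rng) for s in samplers])
    keep = rng.choice(len(pool), size=n_needed, replace=False)
    return pool[keep]

# Toy imbalanced data: 50 majority vs 10 minority points in 2-D.
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, (50, 2))
X_min = rng.normal(3.0, 1.0, (10, 2))

# Three stand-in oversamplers (all basic SMOTE here; the thesis uses
# polynom-fit-SMOTE, ProWSyn, and SMOTE-IPF instead).
samplers = [smote, smote, smote]
synthetic = parallel_ensemble(X_min, n_needed=40, samplers=samplers, rng=1)
print(synthetic.shape)  # (40, 2): the minority class now matches the majority
```

The Center and Cluster Center variants reported as best in the thesis would replace the random `keep` step with selection based on distance to the minority class (or cluster) centroid.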
    Appears in Collections: [Graduate Institute of Information Management] Theses & Dissertations

    Files in This Item:

    File        Description   Size   Format   Views
    index.html                0 KB   HTML     52     View/Open


    All items in NCUIR are protected by copyright, with all rights reserved.

