Master's and Doctoral Theses: Full Metadata Record for 109423035

DC Field: Value (Language)

dc.contributor: Department of Information Management (zh_TW)
dc.creator: 林欣儀 (zh_TW)
dc.creator: Hsin-Yi Lin (en_US)
dc.date.accessioned: 2022-07-13T07:39:07Z
dc.date.available: 2022-07-13T07:39:07Z
dc.date.issued: 2022
dc.identifier.uri: http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=109423035
dc.contributor.department: Department of Information Management (zh_TW)
dc.description: National Central University (zh_TW)
dc.description: National Central University (en_US)
dc.description.abstract: By analysing data, enterprises can plan future operations and support decision-making, so the importance and applicability of data keep growing. However, raw data often exhibit class imbalance and high dimensionality, problems that frequently arise in fields such as finance and healthcare. Class imbalance tends to bias prediction, making the model focus on the majority class while ignoring the minority class; high-dimensional datasets, with too many features, increase computational complexity and reduce predictive accuracy. After surveying the literature on solutions to class imbalance and high dimensionality, this thesis proposes a new method for the class-imbalance problem, the Oversampling ensemble, which combines three common SMOTE variants: polynom-fit-SMOTE, ProWSyn, and SMOTE-IPF. Two ensemble schemes are used, the Parallel ensemble and the Serial ensemble, where the Parallel ensemble includes four methods for selecting the generated samples: Random, Center, Cluster Random, and Cluster Center. Experiments on 58 KEEL datasets show that the Parallel ensemble significantly outperforms single algorithms, with Center and Cluster Center performing best. For datasets that are both class-imbalanced and high-dimensional, the proposed Oversampling ensemble is combined with Information Gain (IG) feature selection and the embedded decision-tree feature selection; experiments on 15 OpenML datasets show that this method outperforms single algorithms, and suitable variants are recommended for different imbalance ratios and numbers of features. (zh_TW)
dc.description.abstract: In the field of data analysis, enterprises can make plans for future operations or crucial decisions based on analytical results. Therefore, data and its applications have become more and more important. However, original datasets often exhibit the problems of class imbalance and high dimensionality, which usually occur in fields such as finance and medicine. The class imbalance problem can bias prediction, making the prediction model focus mainly on the majority class instead of the minority one. On the other hand, high-dimensional datasets increase the complexity of the calculation and reduce prediction accuracy because of redundant features. In this thesis, we propose a new method called the Oversampling ensemble, aiming to solve the class imbalance problem. Three well-known variants of SMOTE (polynom-fit-SMOTE, ProWSyn, and SMOTE-IPF) are investigated. The ensemble approaches comprise the Parallel and Serial ensembles, where the Parallel ensembles include four data combination methods: Random, Center, Cluster Random, and Cluster Center. Experimental results on 58 KEEL datasets show that Parallel ensembles outperform the baseline and single oversampling algorithms, with the Center and Cluster Center methods performing best. As for problems that are both class-imbalanced and high-dimensional, Parallel ensembles are combined separately with information gain and embedded decision-tree feature selection on 15 OpenML datasets, and the results indicate that the ensemble method surpasses the baseline and single algorithms. In addition, appropriate methods are recommended for different imbalance ratios and numbers of features. (en_US)
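The Parallel ensemble described in the abstract pools synthetic minority samples from several SMOTE-style oversamplers and then selects among the candidates. As a rough illustration of the underlying mechanics only (not the thesis implementation: the helper names are hypothetical, and a plain SMOTE-style interpolation stands in for polynom-fit-SMOTE, ProWSyn, and SMOTE-IPF), a minimal NumPy sketch of pooling plus "Center" selection:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating each base
    sample toward one of its k nearest minority neighbours (the core
    idea shared by the SMOTE variants named in the abstract)."""
    rng = np.random.default_rng(rng)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours
    base = rng.integers(0, len(X_min), n_new)   # random base points
    nbr = nn[base, rng.integers(0, min(k, len(X_min) - 1), n_new)]
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

def parallel_ensemble_center(oversamplers, X_min, n_new, n_keep, rng=None):
    """Pool candidates from several oversamplers, then keep the n_keep
    candidates closest to the minority-class centroid ('Center' is one
    of the four combination methods named in the abstract)."""
    rng = np.random.default_rng(rng)
    pool = np.vstack([f(X_min, n_new, rng=rng) for f in oversamplers])
    center = X_min.mean(axis=0)
    dist = np.linalg.norm(pool - center, axis=1)
    return pool[np.argsort(dist)[:n_keep]]

# toy minority class in 2-D; three copies of the same oversampler stand
# in for the three distinct SMOTE variants
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = parallel_ensemble_center(
    [smote_like_oversample] * 3, X_min, n_new=10, n_keep=10, rng=0)
print(synthetic.shape)  # (10, 2)
```

Because the synthetic points are convex combinations of minority samples, they stay inside the minority region; the Center rule then biases the kept samples toward the class centroid rather than the class boundary.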
dc.subject: class imbalance (zh_TW)
dc.subject: high dimensionality (zh_TW)
dc.subject: feature selection (zh_TW)
dc.subject: ensemble learning (zh_TW)
dc.subject: class imbalance (en_US)
dc.subject: high dimension (en_US)
dc.subject: feature selection (en_US)
dc.subject: ensemble learning (en_US)
dc.title: A Study of Oversampling Ensembles for Class-Imbalanced and High-Dimensional Data (zh_TW)
dc.language.iso: zh-TW (zh-TW)
dc.title: Oversampling Ensembles in Class Imbalanced and High Dimensional Data (en_US)
dc.type: thesis (zh_TW)
dc.type: thesis (en_US)
dc.publisher: National Central University (en_US)
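For the high-dimensional case, the abstract pairs the oversampling ensemble with Information Gain (IG) feature selection. As a minimal sketch of IG ranking for discrete features (an illustration only, not the thesis code; the function names are assumptions), in NumPy:

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """IG of one discrete feature x with respect to labels y:
    H(y) minus the weighted entropy of y within each value of x."""
    gain = entropy(y)
    for v in np.unique(x):
        mask = x == v
        gain -= mask.mean() * entropy(y[mask])
    return gain

# toy data: feature 0 predicts y perfectly, feature 1 is pure noise
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
print(gains)  # feature 0 has IG = 1.0 bit, feature 1 has IG = 0.0
```

Features are then ranked by IG and only the top-scoring ones are kept before oversampling, which is what makes the combination attractive when the imbalance problem comes with many redundant columns.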
