

    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/89792


    Title: 過採樣集成法於類別不平衡與高維度資料之研究 (Oversampling Ensembles in Class Imbalanced and High Dimensional Data)
    Author: 林欣儀 (Lin, Hsin-Yi)
    Contributor: Department of Information Management
    Keywords: class imbalance; high dimensionality; feature selection; ensemble learning
    Date: 2022-07-13
    Uploaded: 2022-10-04 11:59:52 (UTC+8)
    Publisher: National Central University
    Abstract: Through data analysis, enterprises can plan future operations and support critical decisions, so data and its applications have become increasingly important. Raw datasets, however, often suffer from class imbalance and high dimensionality, two problems common in fields such as finance and medicine. Class imbalance biases prediction: the model focuses on the majority class and neglects the minority class. High dimensional datasets, because of their many redundant features, increase computational complexity and reduce predictive accuracy.
    After surveying the literature on class imbalance and high dimensionality, this thesis proposes a new method for the class imbalance problem: the Oversampling ensemble, which combines three well-known SMOTE variants, polynom-fit-SMOTE, ProWSyn, and SMOTE-IPF. The ensemble takes two forms, the Parallel ensemble and the Serial ensemble, where the Parallel ensemble includes four methods for selecting the generated samples: Random, Center, Cluster Random, and Cluster Center. Experiments on 58 KEEL datasets show that Parallel ensembles significantly outperform the baseline and the single oversampling algorithms, with Center and Cluster Center performing best. For datasets that are both class imbalanced and high dimensional, the Oversampling ensemble is combined with Information Gain (IG) and embedded Decision Tree feature selection; experiments on 15 OpenML datasets show that the ensemble method surpasses the baseline and single algorithms, and appropriate methods are recommended for different imbalance ratios and numbers of features.
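The parallel ensemble described above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: basic SMOTE interpolation stands in for the three variants (polynom-fit-SMOTE, ProWSyn, SMOTE-IPF), only the "Random" selection method is shown, and the function names `smote` and `parallel_ensemble` are illustrative.

```python
import numpy as np

def smote(X_min, n_samples, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between a
    random minority sample and one of its k nearest minority neighbours
    (the basic SMOTE idea)."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_samples):
        i = rng.integers(len(X_min))
        # distances from sample i to all minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]        # k nearest neighbours, skipping self
        j = rng.choice(nn)
        lam = rng.random()                 # interpolation factor in [0, 1]
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

def parallel_ensemble(X_min, n_needed, samplers, rng=None):
    """'Random' combination: pool the synthetic samples produced by several
    oversamplers in parallel, then randomly keep n_needed of them."""
    rng = np.random.default_rng(rng)
    pool = np.vstack([s(X_min, n_needed, rng=rng) for s in samplers])
    keep = rng.choice(len(pool), size=n_needed, replace=False)
    return pool[keep]

# Toy imbalanced data: 50 majority vs 10 minority points in 2-D.
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, (50, 2))
X_min = rng.normal(3.0, 1.0, (10, 2))

# Three stand-in oversamplers (all basic SMOTE here; the thesis uses
# polynom-fit-SMOTE, ProWSyn, and SMOTE-IPF instead).
samplers = [smote, smote, smote]
synthetic = parallel_ensemble(X_min, n_needed=40, samplers=samplers, rng=1)
print(synthetic.shape)  # (40, 2): the minority class now matches the majority
```

The Center and Cluster Center variants reported as best in the thesis would replace the random `keep` step with selection based on distance to the minority class (or cluster) centroid.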
    Appears in Collections: [Graduate Institute of Information Management] Theses & Dissertations

    Files in This Item:

    File        Description   Size   Format   Views
    index.html                0 KB   HTML     52     View/Open


    All items in NCUIR are protected by copyright, with all rights reserved.

