

    Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/81220


    Title: Comparison of Single Feature Selection and Ensemble Feature Selection for High-Dimensional Data (單一與集成特徵選取方法於高維度資料之比較)
    Author: Sung, Ya-Ting (宋亞庭)
    Contributor: Department of Information Management
    Keywords: Data Mining; Feature Selection; Classification; Ensemble Learning; Support Vector Machines
    Date: 2019-07-01
    Upload time: 2019-09-03 15:39:43 (UTC+8)
    Publisher: National Central University
    Abstract: Real-world data often suffer from quality problems such as noise, irrelevant
    data, and excessive volume. Models trained directly on such data are unlikely to be
    effective, so the data must be pre-processed first. Feature selection is a common
    pre-processing method: it removes redundant and irrelevant features and leaves only
    representative ones. Ensemble feature selection applies several different feature
    selection algorithms and combines their selected feature subsets through different
    aggregation methods, which can improve the robustness of feature selection and even
    the classification accuracy. Most existing research on feature selection adopts a
    single method, and few studies address ensemble feature selection. This thesis
    therefore compares single and ensemble feature selection on high-dimensional data
    to find the better combinations of feature selection methods.
    The experiments use three feature selection algorithms of different types: a genetic
    algorithm (GA), the C4.5 decision tree (DT), and principal component analysis (PCA).
    The concepts of sequential and parallel ensembles from ensemble learning are applied
    to form sequential and parallel ensemble feature selection, respectively. The
    methods are evaluated by classification accuracy, F1-score, and execution time.
    On 20 public datasets with dimensions ranging from 44 to 19993, the experimental
    results show that sequential and parallel ensemble feature selection outperform
    single feature selection, and that an ensemble method is the best feature selection
    method for most datasets. The best sequential combination is GA+PCA, and the best
    parallel combination is C4.5∪GA.
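
The difference between the two ensemble styles named in the abstract can be sketched as follows. This is a minimal illustration and not the thesis's implementation: the two selectors below (`select_by_variance`, `select_by_class_gap`) are hypothetical stand-ins for GA, C4.5, and PCA, the toy data and thresholds are invented, and treating every stage as returning a subset of feature indices is an assumption (PCA, for instance, transforms features rather than selecting them). Parallel aggregation is shown as a union, matching the ∪ notation in C4.5∪GA.

```python
from statistics import mean, pvariance

# Toy dataset (invented for illustration): rows are samples, columns are features.
X = [
    [1.0, 5.0, 0.1, 3.0],
    [2.0, 5.0, 0.9, 3.1],
    [3.0, 5.0, 0.2, 2.9],
    [4.0, 5.0, 0.8, 3.0],
]
y = [0, 0, 1, 1]


def column(X, j):
    """Return feature column j as a list."""
    return [row[j] for row in X]


def select_by_variance(X, y, candidates, threshold=0.05):
    """Hypothetical selector A: keep candidates whose variance exceeds the threshold."""
    return {j for j in candidates if pvariance(column(X, j)) > threshold}


def select_by_class_gap(X, y, candidates, threshold=0.05):
    """Hypothetical selector B: keep candidates whose per-class means differ enough."""
    kept = set()
    for j in candidates:
        col = column(X, j)
        m0 = mean(v for v, label in zip(col, y) if label == 0)
        m1 = mean(v for v, label in zip(col, y) if label == 1)
        if abs(m1 - m0) > threshold:
            kept.add(j)
    return kept


def parallel_union(X, y, selectors):
    """Parallel ensemble: every selector sees all features; subsets are unioned."""
    all_features = set(range(len(X[0])))
    selected = set()
    for select in selectors:
        selected |= select(X, y, all_features)
    return selected


def sequential(X, y, selectors):
    """Sequential ensemble: each selector only sees what the previous one kept."""
    candidates = set(range(len(X[0])))
    for select in selectors:
        candidates = select(X, y, candidates)
    return candidates


selectors = [select_by_variance, select_by_class_gap]
print(sorted(parallel_union(X, y, selectors)))
print(sorted(sequential(X, y, selectors)))
```

With these stand-ins the union keeps any feature that at least one selector accepts, while the sequential chain keeps only features that survive every stage, so the sequential result is never larger than the parallel one.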
    Appears in collections: [Graduate Institute of Information Management] Theses and Dissertations


