NCU Institutional Repository (theses and dissertations, past exam papers, journal articles, research projects): Item 987654321/81224


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/81224


    Title: 樣本選取與資料離散化對於分類器效果之影響; Instance Selection and Data Discretization Influence on Classifier’s Performance
    Author: 顏子明; Yen, Tzu-Ming
    Contributor: Department of Information Management
    Keywords: Data pre-processing; instance selection; discretization; continuous value; data mining
    Date: 2019-07-01
    Upload time: 2019-09-03 15:39:49 (UTC+8)
    Publisher: National Central University
    Abstract: Data preprocessing plays a pivotal role in data mining and is the starting point of the whole analysis process. Real-world data vary widely in quality: large datasets often contain noise or continuous-valued attributes that are hard to interpret, and without proper preprocessing these factors distort the analysis results. Earlier literature proposed instance selection, a data sampling approach in which an algorithm filters out a representative subset of the samples. Other studies have shown that discretization, which converts continuous values into discrete ones during preprocessing, can effectively improve the readability of the mined rules and may also improve accuracy. Whether combining instance selection with discretization performs better than either preprocessing technique alone has not yet been examined in the literature.
    This thesis investigates the effect of combining instance selection and discretization in data preprocessing, and how the two should be paired to achieve the best performance. Three instance selection algorithms are used: Instance-Based Learning Algorithm (IB3), Genetic Algorithm (GA), and Decremental Reduction Optimization Procedure (DROP3). Two supervised discretization algorithms are used: Minimum Description Length Principle (MDLP) and the chi-square-based ChiMerge (ChiM). A k-Nearest Neighbor (KNN) classifier is used to evaluate which combination performs best.
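ChiMerge, named above as one of the two supervised discretization algorithms, merges adjacent value intervals whose class distributions look alike under a chi-square test. The following is only a minimal sketch of that idea, not the thesis's implementation: the function names `chi2_adjacent` and `chimerge` are hypothetical, and the stopping rule is simplified to a fixed `max_intervals` count, whereas ChiMerge as usually described stops on a chi-square significance threshold.

```python
from collections import Counter


def chi2_adjacent(a, b, classes):
    """Chi-square statistic of the 2 x |classes| table formed by two
    adjacent intervals, each given as a Counter of class labels."""
    total = sum(a.values()) + sum(b.values())
    chi2 = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            # expected is 0 only when the class is absent from both
            # intervals; that cell contributes nothing, so skip it.
            if expected > 0:
                chi2 += (row[c] - expected) ** 2 / expected
    return chi2


def chimerge(values, labels, max_intervals=3):
    """Bottom-up merging: start with one interval per distinct value and
    repeatedly merge the adjacent pair with the lowest chi-square.
    Simplification (assumption): stop at max_intervals, not at a threshold.
    Returns the lower bound of each remaining interval."""
    classes = sorted(set(labels))
    counts = {}
    for v, y in zip(values, labels):
        counts.setdefault(v, Counter())[y] += 1
    intervals = [(v, counts[v]) for v in sorted(counts)]
    while len(intervals) > max_intervals:
        scores = [
            chi2_adjacent(intervals[i][1], intervals[i + 1][1], classes)
            for i in range(len(intervals) - 1)
        ]
        i = scores.index(min(scores))
        merged = (intervals[i][0], intervals[i][1] + intervals[i + 1][1])
        intervals[i:i + 2] = [merged]
    return [lower for lower, _ in intervals]


# Toy data (made up for illustration): two well-separated classes.
print(chimerge([1, 2, 3, 10, 11, 12], list("aaabbb"), max_intervals=2))
# prints [1, 10]
```

A low chi-square value means the two intervals have similar class proportions, so merging them loses little class information; that is why the pair with the minimum score is merged first.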
    The study uses 10 datasets from UCI and KEEL to examine combinations of instance selection and discretization. The experimental results show that, on average, pairing the DROP3 instance selection algorithm with the MDLP discretization algorithm is the recommended combination, and that performing DROP3 instance selection first and MDLP discretization second gives a notable improvement in average accuracy, which reaches 85.11%.
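The preprocessing order reported above (instance selection first, discretization second, then KNN classification) can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than the thesis's code: Wilson's Edited Nearest Neighbor filter stands in for DROP3, a single entropy-minimizing cut per feature stands in for MDLP (without its MDL stopping test), and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, x, k=1):
    """Majority label among the k rows of train closest to x (Euclidean)."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def enn_select(train, k=3):
    """Stand-in for DROP3: Wilson's Edited Nearest Neighbor, which drops
    every row that its k nearest neighbors would misclassify."""
    kept = []
    for i, (x, y) in enumerate(train):
        others = train[:i] + train[i + 1:]
        if knn_predict(others, x, k) == y:
            kept.append((x, y))
    return kept


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_cut(train, feature):
    """Stand-in for MDLP: the one boundary midpoint that minimizes the
    weighted class entropy of the two sides (None if no boundary exists)."""
    rows = sorted(train, key=lambda r: r[0][feature])
    best_score, best_point = float("inf"), None
    for i in range(1, len(rows)):
        lo, hi = rows[i - 1][0][feature], rows[i][0][feature]
        if lo == hi:
            continue
        left = [y for _, y in rows[:i]]
        right = [y for _, y in rows[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
        if score < best_score:
            best_score, best_point = score, (lo + hi) / 2
    return best_point


def discretize(x, cuts):
    """Map each continuous value to bin 0 (at or below the cut) or bin 1."""
    return tuple(0 if c is None or v <= c else 1 for v, c in zip(x, cuts))


def preprocess_then_fit(train):
    """Order under discussion: select instances, then discretize them."""
    selected = enn_select(train)
    cuts = [best_cut(selected, f) for f in range(len(selected[0][0]))]
    model = [(discretize(x, cuts), y) for x, y in selected]
    return model, cuts


# Toy one-feature data (made up); the row (2.5, "b") plays the noisy sample.
train = [((v,), "a") for v in (1.0, 2.0, 3.0, 4.0)]
train += [((v,), "b") for v in (11.0, 12.0, 13.0, 14.0)]
train.append(((2.5,), "b"))

model, cuts = preprocess_then_fit(train)
print(len(model), cuts)
print(knn_predict(model, discretize((3.5,), cuts), k=1))
```

In this sketch the cut points are learned only from the instances that survive selection, which is what distinguishes this order from discretizing first and selecting afterwards.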
    Appears in collections: [Graduate Institute of Information Management] Theses and Dissertations

    Files in this item:

    File          Description    Size    Format    Views
    index.html                   0Kb     HTML      135


    All items in NCUIR are protected by original copyright.
