NCU Institutional Repository: Item 987654321/81224


    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/81224


    Title: Instance Selection and Data Discretization Influence on Classifier’s Performance (樣本選取與資料離散化對於分類器效果之影響)
    Authors: Yen, Tzu-Ming (顏子明)
    Contributors: Department of Information Management
    Keywords: data pre-processing; instance selection; discretization; continuous value; data mining
    Date: 2019-07-01
    Issue Date: 2019-09-03 15:39:49 (UTC+8)
    Publisher: National Central University
    Abstract: Data preprocessing plays a pivotal role in data mining and is the first step of the analysis process. Real-world data quality varies widely: large datasets often contain noise or continuous attribute values that are hard to interpret, and without proper preprocessing these factors bias the analysis results. Prior work has proposed instance selection, a data-sampling approach that uses algorithms to screen representative samples, and other studies have shown that discretization, which converts continuous values into discrete ones during preprocessing, can improve the readability of the mined rules and may also improve classification accuracy. Whether combining instance selection with discretization yields better performance than either preprocessing technique alone has not yet been examined in the literature.
    This thesis investigates the effect of combining instance selection and discretization during data preprocessing, and how the two should be paired to achieve the best performance. Three instance selection algorithms are considered: the Instance-Based Learning algorithm (IB3), the Genetic Algorithm (GA), and the Decremental Reduction Optimization Procedure (DROP3), together with two supervised discretization algorithms: the Minimum Description Length Principle (MDLP) and ChiMerge (ChiM). The k-Nearest Neighbor (KNN) classifier is used to evaluate the best combination of the two types of techniques.
    The experiments use 10 datasets from UCI and KEEL to compare the pairings of instance selection and discretization. The results show that, on average, DROP3 combined with MDLP is the recommended combination, and that performing DROP3 instance selection first and MDLP discretization afterwards yields the most significant improvement, with an average accuracy of 85.11%.
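    The abstract describes a fixed preprocessing order (instance selection first, then discretization) evaluated with a KNN classifier. The sketch below outlines that pipeline under stated assumptions: drop3_select and mdlp_discretize are hypothetical stand-ins for DROP3 and MDLP implementations (neither is part of scikit-learn), the returned discretizer is assumed to expose a transform method, and the 70/30 split and the value of k are illustrative choices not taken from the thesis. Only KNeighborsClassifier, train_test_split, and accuracy_score are actual scikit-learn APIs.

        from sklearn.model_selection import train_test_split
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.metrics import accuracy_score

        def evaluate_pipeline(X, y, drop3_select, mdlp_discretize, k=1, seed=0):
            """Instance selection first, then discretization, then KNN evaluation.

            drop3_select and mdlp_discretize are hypothetical callables standing in
            for DROP3 and MDLP; k and the 70/30 split are illustrative assumptions.
            """
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.3, random_state=seed, stratify=y)

            # 1) Instance selection on the training portion only (e.g., DROP3),
            #    returning the retained instances and their labels.
            X_sel, y_sel = drop3_select(X_tr, y_tr)

            # 2) Supervised discretization fitted on the selected instances (e.g., MDLP);
            #    the learned cut points are reused to transform the test set.
            discretizer = mdlp_discretize(X_sel, y_sel)
            X_sel_d = discretizer.transform(X_sel)
            X_te_d = discretizer.transform(X_te)

            # 3) Train a KNN classifier on the preprocessed training data and
            #    report accuracy on the held-out test set.
            clf = KNeighborsClassifier(n_neighbors=k).fit(X_sel_d, y_sel)
            return accuracy_score(y_te, clf.predict(X_te_d))

    Fitting both the selector and the discretizer on the training portion only, and reusing the learned cut points on the test set, mirrors the "selection before discretization" ordering that the abstract reports as the best-performing setup.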
    Appears in Collections: [Graduate Institute of Information Management] Electronic Thesis & Dissertation

    Files in This Item:

    File          Size    Format
    index.html    0Kb     HTML


    All items in NCUIR are protected by copyright, with all rights reserved.

