

    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/67584


    Title: The Effect of Instance Selection on Missing Value Imputation
    Authors: Li, Yun-Jie (李昀潔)
    Contributors: Department of Information Management (資訊管理學系)
    Keywords: Data Mining; Instance Selection Methods; Imputation Methods; Machine Learning; Classification
    Date: 2015-06-22
    Issue Date: 2015-07-30 22:41:50 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: In data mining, collected datasets are often incomplete, containing missing attribute values, which makes it difficult to develop an effective learning model. In the literature, missing value imputation is used to address this problem: its aim is to estimate the missing values from the (observed) complete data samples.
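    The imputation idea above can be sketched with a minimal nearest-neighbour imputer in the spirit of KNNI. This is a simplified illustration, not the thesis's exact implementation; `knn_impute` and its defaults are assumptions:

    ```python
    import numpy as np

    def knn_impute(X, k=3):
        """Fill NaNs in each row with the mean of the k nearest complete rows.

        Distances are computed only over the columns observed in the
        incomplete row (a simplified variant of KNN imputation)."""
        X = X.astype(float).copy()
        complete = X[~np.isnan(X).any(axis=1)]          # fully observed rows
        for i, row in enumerate(X):
            miss = np.isnan(row)
            if not miss.any():
                continue
            # Euclidean distance on the observed columns only
            d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
            nearest = complete[np.argsort(d)[:k]]
            X[i, miss] = nearest[:, miss].mean(axis=0)  # neighbour means
        return X

    # toy dataset with one missing entry
    X = np.array([[1.0, 2.0],
                  [1.1, 2.1],
                  [0.9, 1.9],
                  [1.0, np.nan]])
    print(knn_impute(X, k=3))   # the NaN is replaced by the neighbours' mean
    ```

    Each incomplete row is matched against the complete rows using only its observed columns, so no row is discarded merely for having a missing entry.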
    However, some of the complete data may contain noisy information and can be regarded as outliers. If such noisy data were used for missing value imputation, the quality of the imputation results would suffer. To address this problem, we propose performing instance selection over the complete data before the imputation step; the aim of instance selection is to filter out unrepresentative data from a given dataset. This research therefore focuses on examining the effect of performing instance selection on missing value imputation.
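    The select-then-impute idea can be sketched as a simple distance-based noise filter run over the complete data before imputation. Here `select_instances` and its thresholding rule are hypothetical simplifications standing in for the DROP3/IB3/GA procedures the thesis actually studies:

    ```python
    import numpy as np

    def select_instances(C, k=2, z=1.0):
        """Drop complete rows whose mean distance to their k nearest
        neighbours is unusually large (more than z standard deviations
        above the dataset average). A simplified distance-based filter,
        used only as a stand-in for DROP3/IB3/GA selection."""
        D = np.sqrt(((C[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1))
        np.fill_diagonal(D, np.inf)                       # ignore self-distance
        knn_mean = np.sort(D, axis=1)[:, :k].mean(axis=1)
        keep = knn_mean <= knn_mean.mean() + z * knn_mean.std()
        return C[keep]

    # toy complete set: a tight cluster plus one noisy outlier
    C = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [0.1, 0.1], [10.0, 10.0]])
    reduced = select_instances(C, k=1)
    print(reduced.shape)   # → (4, 2): the outlier is filtered out
    ```

    The imputer would then use only the reduced set as its reference data, so noisy instances cannot contaminate the estimated values.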
    The experimental setup uses 33 UCI datasets, comprising categorical, numerical, and mixed types of data. Three instance selection methods are compared: IB3 (Instance-Based learning), DROP3 (Decremental Reduction Optimization Procedure), and GA (Genetic Algorithm). Similarly, three imputation methods are employed individually: KNNI (K-Nearest Neighbor Imputation), SVM (Support Vector Machine), and MLP (Multilayer Perceptron). The comparative results show which combination of instance selection and imputation methods performs best, and whether combining instance selection with missing value imputation is a better choice than performing imputation alone on incomplete datasets.
    According to the results of this research, we suggest that combining an instance selection method with an imputation method is more suitable than imputation alone on numerical datasets. In particular, the DROP3 instance selection method suits numerical and mixed datasets, but it is not recommended for categorical datasets, especially when the number of features is large. Of the other two instance selection methods, GA provides more stable reduction performance than IB3.
    Appears in Collections: [Graduate Institute of Information Management] Master's and Doctoral Theses
