NCU Institutional Repository (中大機構典藏): Item 987654321/85059


    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/85059


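    Title: Effects of Combining Data Discretization and Missing Value Imputation on Classification Problems (original Chinese title: 資料離散化與填補遺漏值之順序與研究)
    Authors: Tsui, Shu-Ching (崔書晴)
    Contributors: Department of Information Management
    Keywords: Data preprocessing; Data discretization; Missing value imputation; Data mining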
    Date: 2021-01-27
    Issue Date: 2021-03-18 17:31:27 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: As technology advances, human activity generates large amounts of data, much of which was previously ignored or was too difficult to collect. Today data carries different meanings for enterprises and individuals alike: it can be a tool for market analysis or a part of personal privacy, and its value can even exceed that of the product itself. Data mining, which analyzes data with various methods to uncover hidden correlations and extractable features that can then be interpreted and applied, has therefore become a very popular technique. It sounds simple, but in practice it runs into many problems, one of which is incomplete raw data, that is, data containing missing values.
    Missing values directly introduce error into the results of data mining and analysis. They may stem from human error when data is entered or from deliberate concealment for various reasons, or from machine causes such as failures while storing data or damaged hardware. As a result, the outcomes of data mining and analysis are often disturbed by missing values, and accuracy falls.
    In addition, the data preprocessing stage often encounters continuous data such as age. Continuous data can make the conditions produced during feature extraction too narrow, so discretization is an important step: it assigns continuous values to categories according to cut points, which smooths the data and reduces the influence of abnormal data (outliers) on the model. Only high-quality data can yield high-quality results.
    There are many methods for handling missing values and for discretization. This study tries both orders: discretizing the data first and then applying various missing value imputation methods, and imputing missing values first and then discretizing. The results are evaluated by accuracy and consolidated to identify the more effective approach.
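The two processing orders compared in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not name specific imputation or discretization methods, so mean imputation, mode imputation, fixed cut points, the sample `ages` values, and all function names below are assumptions chosen only for illustration.

```python
import bisect
from statistics import mean, mode

def discretize(values, cut_points):
    """Map each number to a bin index using sorted cut points; None stays None."""
    return [None if v is None else bisect.bisect_right(cut_points, v)
            for v in values]

def impute_mean(values):
    """Fill missing numeric values (None) with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def impute_mode(labels):
    """Fill missing categorical values (None) with the most frequent observed label."""
    fill = mode(v for v in labels if v is not None)
    return [fill if v is None else v for v in labels]

# Hypothetical age column with two missing entries and one outlier (90).
ages = [22, 24, 25, 27, None, 31, None, 58, 90]
# Hypothetical cut points: bin 0 is <30, bin 1 is 30 to 49, bin 2 is >=50.
cut_points = [30, 50]

# Order A: impute on the continuous scale, then discretize.
impute_then_discretize = discretize(impute_mean(ages), cut_points)

# Order B: discretize first, then impute on the categorical scale.
discretize_then_impute = impute_mode(discretize(ages, cut_points))

print(impute_then_discretize)  # [0, 0, 0, 0, 1, 1, 1, 2, 2]
print(discretize_then_impute)  # [0, 0, 0, 0, 0, 1, 0, 2, 2]
```

In this toy column the outlier pulls the mean up to about 39.6, so Order A places the missing entries in bin 1, while Order B fills them with the most common bin, 0. The sketch only shows that the order can change the preprocessed data; the study itself judges the two orders by classification accuracy.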
    Appears in Collections:[Graduate Institute of Information Management] Electronic Thesis & Dissertation



    All items in NCUIR are protected by copyright, with all rights reserved.
