資料離散化與填補遺漏值之順序與研究;Effects of Combining Data Discretization and Missing Value Imputation on Classification Problems


Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/85059

Title:	資料離散化與填補遺漏值之順序與研究;Effects of Combining Data Discretization and Missing Value Imputation on Classification Problems
Authors:	崔書晴;Tsui, Shu-Ching
Contributors:	Department of Information Management
Keywords:	Data preprocessing; Data discretization; Missing value imputation; Data mining
Date:	2021-01-27
Issue Date:	2021-03-18 17:31:27 (UTC+8)
Publisher:	National Central University
Abstract:	As technology advances, human activity generates ever more data. Data that were once ignored, or were too difficult to collect with the technology of the time, now carry meaning for both enterprises and individuals: they can serve as a tool for market analysis or form part of a person's privacy, and their value can even exceed that of the product itself. Data mining has therefore become a very popular technique. It applies different methods to analyze data, uncover the correlations hidden in them, extract features, and interpret and apply the results. This sounds easy, but in practice many problems arise, one of which is incomplete raw data, that is, data containing missing values. Missing values directly introduce error into the results of data mining and analysis. They may come from human error when data are filled in, from deliberate concealment for various reasons, or from machine faults such as errors during data storage or damaged hardware. As a result, missing values often interfere with data mining and analysis and lower accuracy. In addition, the preprocessing stage frequently involves continuous data such as age. Left continuous, such data can make the conditions derived during feature extraction too narrow. Discretization is therefore an important preprocessing step: it assigns continuous values to categories according to cut points, which smooths the data and reduces the influence of anomalous values or outliers on the model. Only high-quality data can produce high-quality results. Many methods exist for handling missing values and for discretization. This study compares two orders of preprocessing: discretizing the data first and then applying various missing value imputation methods, and imputing missing values first and then discretizing. The results are evaluated by accuracy and summarized to identify the more effective approach.
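The two preprocessing orders compared in the abstract can be illustrated with a minimal sketch. The abstract does not name the specific discretization or imputation methods used in the thesis, so everything below is an assumption chosen for illustration: equal-width binning, mean imputation for continuous values, mode imputation for already-discretized values, and the toy `ages` column.

```python
from collections import Counter
from statistics import mean


def equal_width_bins(values, n_bins, lo, hi):
    """Map each value to a bin index in [0, n_bins - 1]; missing (None) stays None."""
    width = (hi - lo) / n_bins
    binned = []
    for v in values:
        if v is None:
            binned.append(None)
        else:
            # Clamp so that v == hi falls into the last bin.
            binned.append(min(int((v - lo) / width), n_bins - 1))
    return binned


def impute_mean(values):
    """Fill missing continuous values with the mean of the observed ones."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Fill missing categorical values with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical toy column: ages with one missing entry.
ages = [18, 19, 21, 22, None, 70, 75]
LO, HI, BINS = 0, 90, 3  # assumed range and bin count: [0,30), [30,60), [60,90]

# Order A: impute on the continuous scale, then discretize.
impute_then_discretize = equal_width_bins(impute_mean(ages), BINS, LO, HI)

# Order B: discretize first, then impute on the categorical scale.
discretize_then_impute = impute_mode(equal_width_bins(ages, BINS, LO, HI))

print(impute_then_discretize)  # [0, 0, 0, 0, 1, 2, 2]
print(discretize_then_impute)  # [0, 0, 0, 0, 0, 2, 2]
```

In this toy case the missing entry lands in a different bin depending on the order, which is the kind of difference the study sets out to measure. To follow the evaluation described in the abstract, one would train a classifier on each preprocessed version and compare accuracy; that step is omitted here.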
Appears in Collections:	[Graduate Institute of Information Management] Electronic Thesis & Dissertation

