Thesis/Dissertation 100423032: Full Metadata Record

DC field | Value | Language
DC.contributor | Department of Information Management (資訊管理學系) | zh_TW
DC.creator | 林盈秀 | zh_TW
DC.creator | Ying-Siou Lin | en_US
dc.date.accessioned | 2013-7-1T07:39:07Z
dc.date.available | 2013-7-1T07:39:07Z
dc.date.issued | 2013
dc.identifier.uri | http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=100423032
dc.contributor.department | Department of Information Management (資訊管理學系) | zh_TW
DC.description | National Central University (國立中央大學) | zh_TW
DC.description | National Central University | en_US
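dc.description.abstract | With the rapid development of information technology, the volume of data that computers can process and store keeps growing, and finding meaningful content in large amounts of data is an important problem in data mining. During mining, however, the required data are inevitably missing or insufficient in places, and these problems degrade mining performance. For pre-processing incomplete data, case deletion is the simplest and most direct option, but it is suitable only when the dataset contains relatively few missing values; when many values are missing, case deletion discards a large amount of data and affects the mining results. The alternative is imputation. Recent research has concentrated on proposing new imputation methods and on comparing imputation methods across datasets, but few studies answer the question "During data pre-processing, when can samples with missing values simply be ignored or deleted?", and none examine whether "performing data pre-processing (feature selection or instance selection) before imputation yields better results than imputing directly without dimensionality reduction or instance selection." This study uses 37 datasets of three main types, numerical, categorical, and mixed, with missing rates from 5% to 50% in 5% steps. The research has two parts. The results of Study 1 show that different types of datasets tolerate different missing rates. In particular, we build a decision tree model to derive decision rules relating dataset characteristics (such as number of samples, dimensionality, and data type) to the tolerable missing rate, helping analysts determine when case deletion can be used directly at a given missing rate. Study 2 uses the three dataset types (numerical, mixed, and categorical) to assess the effect of feature selection and instance selection on missing value imputation, and whether either is suitable before the imputation stage. The results show that instance selection followed by imputation yields better classification performance than feature selection followed by imputation. In other words, performing feature selection before imputation has no positive effect on imputation. | zh_TW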
dc.description.abstract | With the rapid development of information technology, computers can process and store huge amounts of data, which makes finding useful content in large datasets an important data mining problem. However, many datasets collected for data mining contain missing values, which are likely to degrade mining performance. A common and simple way to handle incomplete data is case deletion, which ignores the samples with missing values and is appropriate when the missing rate is small. Another approach is imputation, for which various methods have been proposed. Generally speaking, imputation algorithms estimate missing values by reasoning from the observed data. However, no study has answered when case deletion or imputation should be used for different kinds of datasets. Another open question is whether performing data pre-processing, i.e., feature selection and instance selection, affects the final imputation result. This thesis uses 37 datasets containing categorical, numerical, and mixed data, with missing rates from 5% to 50% in 5% steps for each dataset. The research is divided into two parts. The experimental results of the first part indicate that there are specific patterns under which case deletion can be applied to different datasets without significant performance degradation. A decision tree model is then constructed to extract rules that recommend when to use case deletion. In the second part, we found that imputation after instance selection produces better classification performance than imputation alone, whereas imputation after feature selection does not have a positive impact on the imputation result. | en_US
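DC.subject | data mining (資料探勘) | zh_TW
DC.subject | missing data (資料遺漏) | zh_TW
DC.subject | case deletion (直接刪除法) | zh_TW
DC.subject | data imputation (資料補值) | zh_TW
DC.subject | instance selection (樣本選取) | zh_TW
DC.subject | feature selection (特徵選取) | zh_TW
DC.subject | data mining | en_US
DC.subject | missing values | en_US
DC.subject | case deletion | en_US
DC.subject | imputation | en_US
DC.subject | feature selection | en_US
DC.subject | instance selection | en_US
DC.title | A Study of the Relationship among Missing Data Rate, Imputation Methods, and Data Pre-processing (資料遺漏率、補值法與資料前處理關係之研究) | zh_TW
dc.language.iso | zh-TW | zh-TW
DC.title | The relationship between missing value, imputation and data pre-processing | en_US
DC.type | Thesis/dissertation (博碩士論文) | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US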

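For questions about this thesis, please contact the Promotion Services Section (推廣服務組) of the National Central University Library, TEL: (03) 422-7151 ext. 57407, or by e-mail.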