Abstract:

In practice, collected data usually contain missing values and noise, both of which are likely to degrade data mining performance. Data pre-processing is therefore necessary before data mining; its aim is to handle missing or abnormal samples in the data set, and two common solutions are "imputation" and "instance selection". Imputation estimates a plausible value for each missing entry by reasoning from the observed (complete) data and fills it into the empty field. Although many imputation algorithms have been proposed in the literature, their outputs rely heavily on those complete (training) data; if some of the complete data contain noise, the quality of the imputation, and of the subsequent data mining results, is directly affected. This thesis therefore proposes four integration processes. In one process, instance selection is executed first to remove noisy samples from the complete (training) data, and imputation is then performed using the reduced training set as its reference, making the imputed values more reliable (Process 2). Conversely, imputation can be employed first to produce a complete training set, after which instance selection filters out redundant or noisy samples from that set, leaving only representative data and thereby improving data mining accuracy (Process 1). To obtain an even more compact and representative set, instance selection is performed a second time over the outputs of Processes 1 and 2, yielding Processes 3 and 4. The experiments are based on 31 different data sets covering categorical, numerical, and mixed data types, with missing rates from 10% to 50% in 10% intervals per data set. Finally, a decision tree model is constructed to extract rules, based on data set characteristics (number of samples, number of attributes, number of classes, data type) and the missing rate, that recommend which integration process to use.
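To make the ordering of the two pre-processing steps concrete, the sketch below outlines Process 1 (impute first, then select instances) and Process 2 (select instances among the complete samples first, then impute). The abstract does not name the specific imputation or instance-selection algorithms, so the k-NN imputer and the k-NN agreement filter used here are illustrative assumptions only, not the methods evaluated in the thesis.

```python
# Minimal sketch of Process 1 and Process 2 under assumed components:
# KNNImputer as the imputation step and a simple k-NN agreement filter
# as a stand-in for the instance-selection step.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier


def select_instances(X, y, k=3):
    """Keep samples whose label agrees with their k nearest neighbours
    (an illustrative stand-in for instance selection)."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    keep = knn.predict(X) == y
    return X[keep], y[keep]


def process_1(X_missing, y):
    """Process 1: imputation first, then instance selection."""
    X_full = KNNImputer(n_neighbors=5).fit_transform(X_missing)
    return select_instances(X_full, y)


def process_2(X_missing, y):
    """Process 2: instance selection on the complete rows first,
    then imputation of the incomplete rows from the reduced set."""
    complete = ~np.isnan(X_missing).any(axis=1)
    X_sel, y_sel = select_instances(X_missing[complete], y[complete])
    # Fit the imputer on the reduced (noise-filtered) complete data only,
    # so the filtered samples are the reference for the estimates.
    imputer = KNNImputer(n_neighbors=5).fit(X_sel)
    X_imp = imputer.transform(X_missing[~complete])
    return np.vstack([X_sel, X_imp]), np.concatenate([y_sel, y[~complete]])
```

Under this sketch, Processes 3 and 4 would simply apply `select_instances` once more to the outputs of `process_1` and `process_2`, respectively, to obtain a still more compact training set.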