Data preprocessing plays a pivotal role in data mining and is the starting point of the entire analysis pipeline. Real-world data vary widely in quality: large samples often contain noise, and continuous attributes can be difficult to interpret. Without appropriate preprocessing, these factors bias the analysis results. Prior work has proposed instance selection, a data sampling approach that uses algorithms to retain only representative samples; other studies have shown that discretization, which converts continuous values into discrete ones, can improve the readability of mined rules and may also improve classification accuracy. Whether combining instance selection with discretization can outperform either technique applied alone has not yet been examined in the literature.

This thesis investigates the effect of combining instance selection and discretization during data preprocessing, and how to pair the two to achieve the best performance. Three instance selection algorithms are considered: the Instance-Based Learning Algorithm (IB3), the Genetic Algorithm (GA), and the Decremental Reduction Optimization Procedure (DROP3), together with two supervised discretization algorithms: the Minimum Description Length Principle (MDLP) and ChiMerge (ChiM). The K-Nearest Neighbor (KNN) classifier is used to evaluate each combination.

Experiments are conducted on 10 datasets from the UCI and KEEL repositories. The results show that DROP3 combined with MDLP is, on average, the most recommended pairing, and that performing DROP3 instance selection first, followed by MDLP discretization, yields the most significant improvement, with an average accuracy of 85.11%.
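The overall pipeline (instance selection, then supervised discretization, then KNN classification) can be sketched as follows. This is a minimal illustration, not the thesis's method: for brevity it substitutes Hart's Condensed Nearest Neighbor for DROP3 and a single entropy-minimizing cut per feature for MDLP's recursive entropy partitioning; the function names and the toy dataset are hypothetical.

```python
from collections import Counter
import math

def one_nn_predict(train, x):
    """1-NN by squared Euclidean distance; train is a list of (features, label)."""
    return min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))[1]

def select_instances(data):
    """Hart's Condensed NN: add an instance only if the current subset
    misclassifies it. A far simpler stand-in for DROP3."""
    subset = [data[0]]
    changed = True
    while changed:
        changed = False
        for inst in data:
            if one_nn_predict(subset, inst[0]) != inst[1]:
                subset.append(inst)
                changed = True
    return subset

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Single entropy-minimizing cut point: a one-level stand-in for
    MDLP's recursive entropy-based partitioning."""
    pairs = sorted(zip(values, labels))
    cut, best = None, float("inf")
    for i in range(1, len(pairs)):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best:
            best, cut = score, (pairs[i - 1][0] + pairs[i][0]) / 2
    return cut

def fit_cuts(data):
    """Learn one cut per feature from the (already selected) training data."""
    n_feat = len(data[0][0])
    return [best_cut([x[f] for x, _ in data], [y for _, y in data])
            for f in range(n_feat)]

def binarize(x, cuts):
    return [int(v > c) for v, c in zip(x, cuts)]

# Toy pipeline: selection first, then discretization, then 1-NN.
data = [([0.1, 1.0], "a"), ([0.2, 0.8], "a"), ([0.3, 0.9], "a"),
        ([0.8, 0.2], "b"), ([0.9, 0.1], "b"), ([1.0, 0.3], "b")]
selected = select_instances(data)
cuts = fit_cuts(selected)
train = [(binarize(x, cuts), y) for x, y in selected]
print(one_nn_predict(train, binarize([0.9, 0.15], cuts)))  # prints "b"
```

Note that the discretization cuts are learned from the reduced set, which mirrors the "selection first, discretization second" ordering the thesis finds most effective; reversing the two stages only requires swapping the order of the `select_instances` and `fit_cuts` calls.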