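Abstract: | Enterprises today rely increasingly on discovering valuable knowledge in large databases and data warehouses. However, the larger a dataset is, the more noisy data it contains; this noise lowers mining accuracy, and the sheer volume of data lengthens the knowledge discovery process.
Instance selection can filter out some of the noise during data preprocessing and is currently the most commonly used data reduction method, yet several of the better-performing instance selection algorithms in the literature have very high time complexity. This study therefore proposes a new data preprocessing procedure, Representative Data Detection (ReDD). Instance selection is first run on only a small portion of the data; a classifier of relatively low complexity then learns the characteristics of the representative data that instance selection retained; and the trained classifier (the detector) is used to detect the outliers in all of the original data, which greatly reduces the time needed for data reduction.
The experiments consist of two parts, and in both the instance selection step is run with three well-performing algorithms: IB3, DROP3 and GA. In the first part, ReDD reduces 50 small datasets with SVM, CART, KNN and Naive Bayes as detectors; KNN and CART give the best detection performance. In the second part, four large datasets (more than 100,000 instances each) are tested with KNN and CART as the ReDD detectors, and accuracy and running time are compared with traditional instance selection. The results show that ReDD saves a large amount of execution time compared with traditional instance selection while its accuracy does not differ noticeably, indicating that ReDD can greatly improve the efficiency of data reduction on large datasets.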
Nowadays, more and more enterprises need to extract knowledge from very large databases. However, these large datasets usually contain a certain amount of noisy data, which is likely to degrade the performance of data mining. In addition, the computational time of the KDD process over large-scale datasets is high.
Instance selection, which is widely used for data reduction, can filter out noisy data from large datasets. However, many existing instance selection algorithms are too time-consuming to apply to large datasets. Therefore, we introduce a novel data preprocessing procedure called Representative Data Detection (ReDD), which performs the instance selection step on only a small part of the original dataset. Then, a classifier is trained to learn the characteristics of the representative data identified by the instance selection step. Afterwards, the trained classifier is used as a detector to detect the noisy data across the whole of the large original dataset.
The thesis contains two experiments, in both of which IB3, DROP3 and GA are used as the baseline instance selection algorithms. In the first experiment, fifty small-scale datasets are used to evaluate ReDD, with SVM, CART, KNN and Naive Bayes constructed as the detectors for comparison. We find that KNN and CART perform the best. In the second experiment, the classification accuracy and execution time of ReDD and the baselines are compared over four large-scale datasets (more than one hundred thousand instances each). The results show that ReDD greatly reduces execution time compared with traditional instance selection. Moreover, the accuracy rates of ReDD and the baselines show no significant difference. |
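The three ReDD steps described in the abstract (instance selection on a small sample, training a detector on the result, applying the detector to the remaining data) can be illustrated with a minimal sketch. It is not the thesis implementation. It makes several assumptions: Wilson's edited nearest neighbour rule stands in for IB3, DROP3 or GA; the detector is a hand-written k-nearest-neighbour classifier; the detector's input is taken to be the feature vector with the numeric class label appended; and the function names (`knn_predict`, `enn_keep_flags`, `redd`, `make_toy_data`) and the 20% sample fraction are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to `query`.

    `train` is a list of (features, label) pairs; distance is squared Euclidean.
    """
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def enn_keep_flags(sample, k=3):
    """Stand-in instance selection (Wilson's edited nearest neighbour).

    An instance is kept (True) when the majority class of its k nearest
    neighbours among the other sample points matches its own class.
    """
    flags = []
    for i, (x, y) in enumerate(sample):
        others = sample[:i] + sample[i + 1:]
        flags.append(knn_predict(others, x, k) == y)
    return flags


def redd(data, sample_fraction=0.2, k=3, seed=0):
    """Return the instances of `data` judged representative.

    Assumes numeric features and numeric class labels.
    """
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    cut = min(len(data), max(k + 1, int(len(data) * sample_fraction)))
    sample = [data[i] for i in indices[:cut]]
    rest = [data[i] for i in indices[cut:]]

    # Step 1: run the (expensive) instance selection on the small sample only.
    flags = enn_keep_flags(sample, k)

    # Step 2: detector training set, input = features + class label,
    # target = kept (True) or removed (False). This encoding is an assumption.
    detector_train = [(list(x) + [y], keep) for (x, y), keep in zip(sample, flags)]

    # Step 3: the cheap detector decides for every remaining instance.
    kept = [inst for inst, keep in zip(sample, flags) if keep]
    for x, y in rest:
        if knn_predict(detector_train, list(x) + [y], k):
            kept.append((x, y))
    return kept


def make_toy_data(n_per_class=100, n_flipped=10, seed=1):
    """Two Gaussian clusters with a few labels flipped to act as noise."""
    rng = random.Random(seed)
    data = []
    for label, centre in ((0, 0.0), (1, 5.0)):
        for _ in range(n_per_class):
            data.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    for i in rng.sample(range(len(data)), n_flipped):
        x, y = data[i]
        data[i] = (x, 1 - y)
    return data


if __name__ == "__main__":
    toy = make_toy_data()
    reduced = redd(toy)
    print(f"original: {len(toy)} instances, kept: {len(reduced)}")
```

In this sketch the costly selection step touches only the sample, and the rest of the data passes through a single detector prediction per instance, which is where the time saving reported in the abstract would be expected to come from.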