Abstract: Enterprises today increasingly need to extract valuable knowledge from very large databases and data warehouses. However, larger databases tend to contain more noisy data, which degrades the accuracy of data mining, and the sheer volume of data also lengthens the knowledge discovery process. Instance selection, the most widely used data reduction approach, can filter out some of this noise during data pre-processing. However, different instance selection algorithms filter out different data over different domain datasets, and because there is no exact definition of an outlier, over selection or under selection frequently occurs, which in turn harms the accuracy of data mining.

This thesis therefore proposes a new data pre-processing procedure, the Two-Stage Hybrid Learning Approach (TSHLA), and applies it to data classification. First, instance selection is performed on the training set, and two individual SVM models are trained: one on the data the instance selection algorithm judges to be noisy and one on the data it judges to be non-noisy. Then, KNN is used to compare the similarity of each testing sample: testing data more similar to the noisy set are classified by the SVM trained on the noisy data, and testing data more similar to the non-noisy set are classified by the SVM trained on the non-noisy data. The aim is to recover effective samples that were mistakenly filtered out as noise; the two sets of predictions are then merged into the final result.

The experiments consist of two parts, and each part evaluates three well-performing instance selection algorithms: IB3, DROP3, and GA. The first part tests TSHLA on 50 small UCI datasets with SVM as the classifier. The second part uses large-scale datasets containing more than 100,000 samples, again with SVM as the classifier, and compares classification accuracy against a baseline without instance selection and against the conventional instance selection approach.
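Since the abstract walks through the TSHLA pipeline step by step, a minimal sketch may make the two stages concrete. This is a hypothetical illustration under assumed interfaces, not the thesis's implementation: it uses scikit-learn, assumes the noisy/non-noisy split (produced in the thesis by IB3, DROP3, or GA) is already given as a boolean mask, and the function name `tshla_predict` is invented for this example.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

def tshla_predict(X_train, y_train, X_test, noisy_mask, k=5):
    """Sketch of the two-stage hybrid learning flow (hypothetical).

    noisy_mask: boolean numpy array marking which training samples the
    instance selection step (e.g. IB3/DROP3/GA in the thesis) flagged
    as noise. Assumes each partition is non-empty and multi-class.
    """
    # Stage 1: train one SVM on the "noisy" partition and another
    # on the "non-noisy" partition of the training set.
    svm_noisy = SVC().fit(X_train[noisy_mask], y_train[noisy_mask])
    svm_clean = SVC().fit(X_train[~noisy_mask], y_train[~noisy_mask])

    # Stage 2: use KNN to decide which partition each testing sample
    # is more similar to, then route it to the matching SVM.
    router = KNeighborsClassifier(n_neighbors=k)
    router.fit(X_train, noisy_mask.astype(int))  # 1 = noisy, 0 = non-noisy
    goes_to_noisy = router.predict(X_test).astype(bool)

    # Merge the two sets of predictions into the final result.
    y_pred = np.empty(len(X_test), dtype=y_train.dtype)
    if goes_to_noisy.any():
        y_pred[goes_to_noisy] = svm_noisy.predict(X_test[goes_to_noisy])
    if (~goes_to_noisy).any():
        y_pred[~goes_to_noisy] = svm_clean.predict(X_test[~goes_to_noisy])
    return y_pred
```

Routing the test set with a KNN vote over the noisy/non-noisy labels is one plausible reading of the "KNN similarity comparison" described above; it lets samples resembling the filtered-out data still be classified, rather than being discarded as in conventional instance selection.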