

    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/48981


    Title: Feature and Instance Selection Using Genetic Algorithms: An Empirical Study (Chinese title: 資料前處理之研究:以基因演算法為例)
    Author: Chi-yuan Chu (朱啟源)
    Contributors: Graduate Institute of Information Management
    Keywords: data mining; feature selection; instance selection; genetic algorithms
    Date: 2011-07-20
    Upload Date: 2012-01-05 15:12:01 (UTC+8)
    Abstract: Feature selection and instance selection are two important data preprocessing steps in data mining: the former aims at removing irrelevant and/or redundant features from a given dataset, and the latter at discarding noisy or erroneous data samples. Genetic algorithms in particular have been widely used for both tasks in related studies. However, the two preprocessing tasks are generally considered separately in the literature, so it remains unclear how performing both feature and instance selection compares with performing either one alone, in terms of both performance and running time. The aim of this study is therefore to perform feature selection and instance selection with genetic algorithms, in different orders, and to examine the resulting classification performance on datasets from different domains. Experimental results on four small-scale and four large-scale datasets, which differ in their numbers of features and data samples, show that performing both feature and instance selection usually makes the classifiers (i.e., support vector machines and k-nearest neighbor) perform slightly worse than feature selection or instance selection alone. However, while the differences in classification accuracy between these preprocessing strategies are not significant, combining feature and instance selection greatly reduces the computational effort of training the classifiers compared with performing either step individually. The characteristics of the datasets are further analyzed to identify which preprocessing strategy suits which kind of dataset in terms of both accuracy and processing time. Considering classification effectiveness and efficiency together, performing both feature and instance selection is the preferred data preprocessing option in data mining.
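    The abstract describes genetic-algorithm-based feature selection wrapped around k-NN and SVM classifiers. Below is a minimal illustrative sketch, not the thesis code: it assumes scikit-learn's breast-cancer dataset and a k-NN wrapper, encodes each candidate feature subset as a binary chromosome, and evolves it with tournament selection, one-point crossover, and bit-flip mutation. Instance selection can be handled the same way by letting the mask index rows (data samples) rather than columns (features).

        import numpy as np
        from sklearn.datasets import load_breast_cancer
        from sklearn.model_selection import cross_val_score
        from sklearn.neighbors import KNeighborsClassifier

        rng = np.random.default_rng(0)
        X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset (assumption)
        n_features = X.shape[1]

        def fitness(mask):
            # Fitness = mean 5-fold cross-validation accuracy of k-NN
            # trained on the features selected by the binary mask.
            if mask.sum() == 0:
                return 0.0
            clf = KNeighborsClassifier(n_neighbors=5)
            return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=5).mean()

        def evolve(pop_size=20, generations=30, p_mut=0.05):
            # Each chromosome is a 0/1 mask over the feature columns.
            pop = rng.integers(0, 2, size=(pop_size, n_features))
            for _ in range(generations):
                scores = np.array([fitness(ind) for ind in pop])

                def pick():
                    # Tournament selection: keep the better of two random parents.
                    i, j = rng.integers(0, pop_size, size=2)
                    return pop[i] if scores[i] >= scores[j] else pop[j]

                children = []
                for _ in range(pop_size):
                    a, b = pick(), pick()
                    cut = rng.integers(1, n_features)        # one-point crossover
                    child = np.concatenate([a[:cut], b[cut:]])
                    flip = rng.random(n_features) < p_mut    # bit-flip mutation
                    child = np.where(flip, 1 - child, child)
                    children.append(child)
                pop = np.array(children)

            scores = np.array([fitness(ind) for ind in pop])
            best = scores.argmax()
            return pop[best], scores[best]

        best_mask, best_acc = evolve()
        print("selected %d of %d features, CV accuracy %.3f"
              % (best_mask.sum(), n_features, best_acc))

    The cross-validated classifier inside the fitness function dominates the running time of such a wrapper search, which is consistent with the abstract's point that reducing the number of features and instances mainly pays off in reduced classifier training effort.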
    Appears in Collections: [Graduate Institute of Information Management] Master's and Doctoral Theses

    Files in This Item:

    File         Description    Size    Format    Views
    index.html                  0 KB    HTML      1661     View/Open


    All items in NCUIR are protected by copyright, with all rights reserved.
