Master's/Doctoral Thesis 104453017 Full Metadata Record

DC Field / Language
DC.contributor Department of Information Management, In-service Master Program zh_TW
DC.creator 歐先弘 zh_TW
DC.creator Hsien-Hung Leo en_US
dc.date.accessioned 2017-08-21T07:39:07Z
dc.date.available 2017-08-21T07:39:07Z
dc.date.issued 2017
dc.identifier.uri http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=104453017
dc.contributor.department Department of Information Management, In-service Master Program zh_TW
DC.description 國立中央大學 zh_TW
DC.description National Central University en_US
dc.description.abstract Feature selection is an important data-preprocessing step in data mining. Given a dataset, its aim is to remove irrelevant or redundant features through feature selection techniques. In the existing literature, no study has compared each class of feature selection method across the three data types (numerical, categorical, and mixed). This study therefore selects three feature selection techniques, Information Gain (IG), Genetic Algorithm (GA), and Decision Tree (DT), and examines classification performance on the three data types both without and with feature selection. Forty real-world datasets from different domains were obtained from UCI, and the experimental results are validated on six classifiers: Support Vector Machines (SVM), K-Nearest Neighbor (KNN), Decision Tree (DT), Artificial Neural Network (ANN), AdaBoost, and Bagging. By comparing accuracy, we aim to identify which feature selection method, on which kind of dataset, improves the performance of which classification algorithm, as a reference for analysts. According to the results, categorical data achieves its best accuracy at the baseline with any single classifier or with AdaBoost, so no further feature selection step is recommended; for categorical data under the Bagging ensemble with KNN as the base classifier, accuracy after DT-based feature selection is better than with the other algorithms. For mixed-type data, feature selection by GA or DT, though not by IG, yields better accuracy than the baseline; likewise, for numerical data, GA or DT, though not IG, outperforms the baseline. Numerical data achieves its best baseline accuracy with MLP, in which case no further feature selection step is needed. For a given data type, once a classifier has been chosen, the feature selection method with the best accuracy in this study can be tried first. zh_TW
dc.description.abstract Feature selection is an important process for pattern recognition applications. Its purpose is to avoid degrading the classifier's performance: the removed feature(s) should be redundant, irrelevant, or of the least possible use. There is no related study comparing different feature selection methods across different data types, such as categorical, numerical, and mixed-type datasets, in terms of classification performance. Therefore, in this thesis, three major feature selection methods are chosen, namely Information Gain (IG), Genetic Algorithm (GA), and Decision Tree (DT), and the research aim is to compare the classification accuracy obtained with these feature selection methods over different types of datasets. We demonstrate this through extensive experiments on 40 real-world datasets from UCI. In addition, six classification techniques are compared, including Support Vector Machines (SVM), K-Nearest Neighbor (KNN), Decision Tree (DT), Artificial Neural Network (ANN), AdaBoost, and Bagging. The experimental results show that the need for feature selection over categorical datasets is not strong; however, Bagging-based KNN and DT can increase performance. For mixed-type and numerical datasets, feature selection with GA and DT performs better. In particular, if MLP is used, there is no need to perform feature selection for numerical datasets. We demonstrate that different feature selection methods can increase the accuracy of some classification models. en_US
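The abstracts above rank features by Information Gain before classification. As a minimal, standalone sketch using only the Python standard library, the following shows the textbook definition of information gain for a single discrete feature (this is the standard formula, not the thesis's actual implementation or parameter settings):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a sequence of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum_v p(X=v) * H(Y | X=v) for a discrete feature X."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional

# Toy illustration: a feature that perfectly separates the classes has IG
# equal to the dataset entropy; an uninformative feature has IG near 0.
labels  = ['yes', 'yes', 'no', 'no']
perfect = ['a', 'a', 'b', 'b']
useless = ['a', 'b', 'a', 'b']
print(information_gain(perfect, labels))  # 1.0
print(information_gain(useless, labels))  # 0.0
```

An IG-based filter then simply keeps the top-k features by this score; GA- and DT-based selection, by contrast, are wrapper/embedded methods that evaluate feature subsets through a model rather than scoring features independently.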
DC.subject Data Mining zh_TW
DC.subject Feature Selection zh_TW
DC.subject Classification Algorithm zh_TW
DC.subject Data Mining en_US
DC.subject Feature Selection en_US
DC.subject Classification Algorithm en_US
DC.title The Impact of Feature Selection on Different Data Types zh_TW
dc.language.iso zh-TW zh-TW
DC.type Thesis/Dissertation zh_TW
DC.type thesis en_US
DC.publisher National Central University en_US
