The Effect of Feature Selection on Different Data Types (特徵屬性篩選對於不同資料類型之影響)

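Name

Hsien-Hung Leo (歐先弘)

Department

Department of Information Management, In-service Master's Program

Thesis Title

The Effect of Feature Selection on Different Data Types (特徵屬性篩選對於不同資料類型之影響)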

Files

Full text in the system: never open to the public

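Abstract (Chinese)

Feature selection is an important data preprocessing step in data mining. Given a dataset, its main purpose is to remove irrelevant or redundant features. The existing literature does not report experiments that pair each category of feature selection method with the three data types (numerical, categorical, and mixed). This study therefore selects three feature selection techniques, Information Gain (IG), Genetic Algorithm (GA), and Decision Tree (DT), and compares classification performance with and without feature selection on datasets of each type. Forty real-world datasets from different domains were obtained from UCI, and the results were validated with the following classifiers: Support Vector Machines (SVM), K-Nearest Neighbor (KNN), Decision Tree (DT), Artificial Neural Network (ANN), AdaBoost, and Bagging. Using classification accuracy as the measure, the study aims to identify which feature selection method improves which classifier on datasets with which characteristics, as a reference for analysts designing experiments.
According to the results, for categorical data the baseline accuracy (no feature selection) is the best whichever single classifier or AdaBoost is used, so feature selection is not recommended. For categorical data under Bagging with KNN as the base classifier, DT feature selection gives higher accuracy than the other feature selection algorithms. For mixed-type data, GA or DT feature selection (but not IG) yields higher accuracy than the baseline. For numerical data, GA or DT feature selection (but not IG) likewise yields higher accuracy than the baseline. For numerical data with MLP, the baseline accuracy is the best, so feature selection is not recommended. Once a classifier has been chosen for a given data type, this study can be consulted to pick the feature selection method with the best accuracy and try it first.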
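As an illustration of the Information Gain (IG) filter named in the abstract, the following is a minimal sketch in plain Python. It is not the thesis's implementation: the function names (`entropy`, `information_gain`, `rank_features`) and the toy data are hypothetical, and it assumes categorical feature values only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the expected label entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels):
    """Return (feature_index, gain) pairs sorted from most to least informative."""
    n_features = len(rows[0])
    gains = [
        (j, information_gain([row[j] for row in rows], labels))
        for j in range(n_features)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: feature 0 determines the class, feature 1 is noise.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["yes", "yes", "no", "no"]
ranking = rank_features(rows, labels)
```

A filter of this kind would keep only the top-ranked features before training a classifier. How many features to keep is a separate choice that this sketch leaves open.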

Abstract (English)

Feature selection is an important step in pattern recognition applications. Its purpose is to prevent the classifier's performance from degrading by removing features that are redundant, irrelevant, or of little use. No related study has compared different feature selection methods across different data types (categorical, numerical, and mixed-type datasets) in terms of classification performance. In this thesis, therefore, three major feature selection methods are chosen, Information Gain (IG), Genetic Algorithm (GA), and Decision Tree (DT), and the aim is to compare the classification accuracy obtained with these methods over different types of datasets. Extensive experiments are conducted on 40 real-world datasets from UCI. In addition, six classification techniques are compared: Support Vector Machines (SVM), K-Nearest Neighbor (KNN), Decision Tree (DT), Artificial Neural Network (ANN), AdaBoost, and Bagging.
The experimental results show that categorical datasets have little need for feature selection. However, for Bagging-based KNN, DT feature selection can improve performance. For mixed-type and numerical datasets, GA and DT perform better. In particular, if MLP is used, there is no need for feature selection on numerical datasets. The results demonstrate that different feature selection methods can increase the accuracy of some classification models.
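The GA-based selection mentioned above is a wrapper approach: candidate feature subsets are scored by the accuracy of a classifier trained on them. The sketch below shows one way this could look, under several assumptions not taken from the thesis: a leave-one-out 1-nearest-neighbour classifier as the fitness function, single-point crossover, bit-flip mutation, elitism, and hypothetical parameter values and synthetic data.

```python
import random


def loo_accuracy(rows, labels, mask):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the masked features."""
    selected = [j for j, keep in enumerate(mask) if keep]
    if not selected:
        return 0.0  # an empty feature subset cannot classify anything
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for k, other in enumerate(rows):
            if k == i:
                continue
            dist = sum((row[j] - other[j]) ** 2 for j in selected)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[k]
        correct += best_label == labels[i]
    return correct / len(rows)


def ga_select(rows, labels, pop_size=12, generations=15, mutation_rate=0.1, seed=0):
    """Search for a binary feature mask that maximises leave-one-out accuracy."""
    rng = random.Random(seed)
    n = len(rows[0])  # assumes at least two features

    def fitness(mask):
        return loo_accuracy(rows, labels, mask)

    # Include the all-features mask so the result is never worse than the baseline.
    population = [tuple([1] * n)] + [
        tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = ranked[:2]  # elitism: carry the two best masks over unchanged
        while len(next_gen) < pop_size:
            p1, p2 = rng.sample(ranked[: pop_size // 2], 2)  # parents from the fitter half
            cut = rng.randint(1, n - 1)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit for bit in child)
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    return best, fitness(best)


# Hypothetical synthetic data: feature 0 tracks the class, features 1 and 2 are noise.
data_rng = random.Random(42)
labels = [i % 2 for i in range(20)]
rows = [
    (label + data_rng.gauss(0, 0.1), data_rng.gauss(0, 5), data_rng.gauss(0, 5))
    for label in labels
]
baseline = loo_accuracy(rows, labels, (1, 1, 1))
best_mask, best_score = ga_select(rows, labels)
```

Comparing `baseline` with `best_score` mirrors the baseline-versus-selected comparison described in the abstract, on a much smaller scale. Because the initial population contains the all-features mask and the best masks are carried over each generation, the returned score cannot fall below the baseline.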

Keywords (Chinese)

★ Data Mining (資料探勘)
★ Feature Selection (特徵屬性篩選)
★ Classification Algorithm (分類演算法)

Keywords (English)

★ Data Mining
★ Feature Selection
★ Classification Algorithm

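Table of Contents

Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Research Process
Chapter 2 Literature Review
2.1 Feature Selection
2.1.1 Wrappers
2.1.2 Embedded
2.1.3 Filters
2.2 Classification Techniques
2.2.1 Decision Tree
2.2.2 Support Vector Machines (SVM)
2.2.3 K-Nearest Neighbor (KNN)
2.2.4 Artificial Neural Network (ANN)
2.2.5 AdaBoost
2.2.6 Bagging
Chapter 3 Research Method
3.1 Research Process
3.2 Datasets
3.3 Data Preparation
3.4 Prediction Model Design
Chapter 4 Experimental Results and Analysis
4.1 Single Classifier Results
4.1.1 Categorical Data
4.1.2 Mixed-Type Data
4.1.3 Numerical Data
4.2 AdaBoost Ensemble Classifier Results
4.2.1 Categorical Data
4.2.2 Mixed-Type Data
4.2.3 Numerical Data
4.3 Bagging Ensemble Classifier Results
4.3.1 Categorical Data
4.3.2 Mixed-Type Data
4.3.3 Numerical Data
4.4 Discussion
4.4.1 Paired-Sample t-Test
Chapter 5 Conclusions and Suggestions
5.1 Conclusions and Contributions
5.2 Research Limitations
5.3 Future Work and Suggestions
References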

Advisor

C.-F. Tsai (蔡志豐)

Approval Date

2017-8-21
