Comparison of Single Feature Selection and Ensemble Feature Selection for High-Dimensional Data

Name

Ya-Ting Sung (宋亞庭)

Department

Department of Information Management

Thesis Title

Comparison of Single Feature Selection and Ensemble Feature Selection for High-Dimensional Data
(單一與集成特徵選取方法於高維度資料之比較)

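Related Theses

★ The Effect of Feature Selection on Data Discretization
★ A Study of Over-sampling Ensemble Methods for Class-Imbalanced and High-Dimensional Data
★ The Effect of Instance Selection and Data Discretization on Classifier Performance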

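This electronic thesis is approved for immediate open access.
Once open, the electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.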

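Abstract (Chinese, translated)

Real-world data often suffer from poor quality: they may contain noise or irrelevant
data, or be excessively large. Building a model directly on such data is likely to yield
poor effectiveness and efficiency, so the data must first be preprocessed. Feature
selection is a common preprocessing method: it removes redundant and irrelevant features
and keeps only the representative ones. Ensemble feature selection applies several
different feature selection algorithms and aggregates the feature subsets they select in
different ways; such an ensemble can improve the robustness of feature selection and may
even improve classification accuracy. Most existing research on feature selection uses a
single feature selection method, and few studies address ensemble feature selection. This
study therefore compares single and ensemble feature selection on high-dimensional data
to find the better combinations of feature selection methods.
This study uses three feature selection algorithms of different types: Genetic Algorithm
(GA), Decision Tree C4.5 (DT), and Principal Components Analysis (PCA). Borrowing the
concepts of sequential and parallel ensembles from ensemble learning, it forms sequential
ensemble feature selection and parallel ensemble feature selection. Classification
accuracy, F1-score, and execution time are used to assess the feature selection methods.
The study uses 20 public datasets whose dimensionality ranges from 44 to 19993.
According to the experimental results, sequential and parallel ensemble feature selection
outperform single feature selection, and for most datasets the best feature selection
method is a sequential or parallel ensemble. The best sequential ensemble is GA+PCA, and
the best parallel ensemble is C4.5∪GA.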
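As an illustration of the three evaluation measures named in the abstract (classification accuracy, F1-score, and execution time), the following minimal sketch computes each one in plain Python. The function names, the binary-label setting, the toy labels, and the use of `time.process_time` to measure CPU time are all assumptions made for this example; the record does not state how the thesis computes them.

```python
import time


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def timed(fn, *args):
    """Run fn(*args) and return (result, CPU seconds used by this process)."""
    start = time.process_time()
    result = fn(*args)
    return result, time.process_time() - start


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))   # 0.75 (6 of 8 labels match)
print(f1_score(y_true, y_pred))   # 0.75 (precision 3/4, recall 3/4)
_, cpu_seconds = timed(sorted, [3, 1, 2])
```

Accuracy alone can look good on imbalanced classes, which is one common reason to report F1 alongside it; the timing helper is what would wrap a feature selection run when comparing methods by cost.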

Abstract (English)

Real-world data often suffer from poor quality, such as noise, irrelevant data, and
excessive volume. Models trained on such data without pre-processing are unlikely to be
effective. Feature selection is a common data pre-processing method: it removes redundant
and irrelevant features, leaving only representative ones. Ensemble feature selection
refers to using multiple different feature selection algorithms and combining their
selected feature subsets through different aggregation methods. It can improve on the
robustness of single feature selection and may even improve classification accuracy.
Current research on feature selection mostly adopts single feature selection, and few
studies discuss ensemble feature selection. The aim of this thesis is therefore to compare
the performance of single feature selection and ensemble feature selection on
high-dimensional data and to find the better combinations of feature selection methods.
In the experiments, three different types of feature selection algorithms are used:
GA (Genetic Algorithm), DT (Decision Tree C4.5), and PCA (Principal Components
Analysis). For ensemble feature selection, the concepts of sequential ensemble and
parallel ensemble from ensemble learning are applied to form sequential ensemble feature
selection and parallel ensemble feature selection, respectively. Classification accuracy,
F1-score, and execution time are used to evaluate the feature selection methods.
Based on 20 public datasets with dimensions ranging from 44 to 19993, the experimental
results show that sequential ensemble feature selection and parallel ensemble feature
selection perform better than single feature selection. For most datasets, the best
feature selection method is a sequential or parallel ensemble. The best combination in
sequential ensemble feature selection is GA+PCA, and the best combination in parallel
ensemble feature selection is C4.5∪GA.
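The difference between the sequential and parallel ensembles described in the abstract can be sketched as follows. This is a simplified illustration, not the thesis's implementation: the two selectors below (`top_k_by_variance` and `top_k_by_class_gap`) are hypothetical stand-ins for GA, C4.5, and PCA, the toy data are invented, and the chaining and union rules are assumptions based only on the notation GA+PCA and C4.5∪GA. In particular, PCA produces transformed components rather than a subset of the original features, so the thesis's sequential stage may differ from the subset chaining shown here.

```python
from statistics import mean, pvariance


def top_k_by_variance(X, y, candidates, k):
    """Stand-in selector A: keep the k candidate columns with the largest variance."""
    score = {j: pvariance([row[j] for row in X]) for j in candidates}
    return sorted(sorted(candidates, key=lambda j: score[j], reverse=True)[:k])


def top_k_by_class_gap(X, y, candidates, k):
    """Stand-in selector B: keep the k columns whose two class means differ most."""
    def gap(j):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        return abs(mean(pos) - mean(neg))
    return sorted(sorted(candidates, key=gap, reverse=True)[:k])


def sequential_ensemble(stages, X, y):
    """Each stage selects only among the features kept by the previous stage."""
    candidates = list(range(len(X[0])))
    for selector, k in stages:
        candidates = selector(X, y, candidates, k)
    return candidates


def parallel_union(stages, X, y):
    """Each stage sees all features; the selected subsets are merged by union."""
    all_features = list(range(len(X[0])))
    chosen = set()
    for selector, k in stages:
        chosen |= set(selector(X, y, all_features, k))
    return sorted(chosen)


# Toy data: 6 samples, 4 features, binary labels (illustration only).
X = [
    [0, 4, 1, 6],
    [10, 4, 1, 0],
    [5, 4, 1, 6],
    [0, 0, 0, 0],
    [10, 0, 0, 3],
    [5, 0, 0, 0],
]
y = [1, 1, 1, 0, 0, 0]

seq = sequential_ensemble([(top_k_by_variance, 3), (top_k_by_class_gap, 2)], X, y)
par = parallel_union([(top_k_by_variance, 2), (top_k_by_class_gap, 2)], X, y)
print(seq)  # [1, 3]
print(par)  # [0, 1, 3]
```

In this sketch the sequential ensemble can only shrink the candidate set at each stage, while the union can return more features than either selector alone, which is one reason subset size and running time are worth comparing alongside accuracy.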

Keywords (Chinese, translated)

★ Data Mining
★ Feature Selection
★ Classification
★ Ensemble Learning
★ Support Vector Machine

Keywords (English)

★ Data Mining
★ Feature Selection
★ Ensemble Learning
★ Classification
★ Support Vector Machines

Table of Contents

Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 Feature Selection
2.1.1 Genetic Algorithm (GA)
2.1.2 Principal Component Analysis (PCA)
2.1.3 Decision Tree C4.5 (DT)
2.2 Ensemble Learning
2.2.1 Ensemble Feature Selection
2.3 Supervised Learning
2.3.1 Support Vector Machine (SVM)
Chapter 3 Research Method
3.1 Experimental Framework
3.2 Experimental Parameter Settings
3.3 Experiment 1
3.3.1 Baseline
3.3.2 Single Feature Selection
3.3.3 Sequential Ensemble Feature Selection
3.3.3.1 Heterogeneous Ensemble
3.3.3.2 Homogeneous Ensemble
3.4 Experiment 2
3.4.1 Parallel Ensemble Feature Selection
3.5 Evaluation Criteria
3.6 Time Complexity
Chapter 4 Experimental Results
4.1 Experimental Setup
4.1.1 Datasets
4.1.2 Computing Environment
4.2 Results of Experiment 1
4.2.1 Comparison of Feature Subset Sizes: Baseline, Single, and Sequential Ensemble Feature Selection
4.2.2 SVM Classifier Results
4.2.2.1 Classification Accuracy
4.2.2.2 F1-score
4.2.2.3 CPU Time
4.2.3 Best Method for Each Dataset by Accuracy and F1-score
4.2.4 Summary of Experiment 1
4.3 Results of Experiment 2
4.3.1 Comparison of Feature Subset Sizes: Baseline and Parallel Ensemble Feature Selection
4.3.2 SVM Classifier Results
4.3.2.1 Classification Accuracy
4.3.2.2 F1-score
4.3.2.3 CPU Time
4.3.3 Best Method for Each Dataset by Accuracy and F1-score
4.3.4 Summary of Experiment 2
4.4 Analysis and Discussion
Chapter 5 Conclusion
5.1 Conclusions and Contributions
5.2 Future Research Directions and Suggestions
References
Appendix 1 Feature Selection Results

Advisors

蔡志豐 (Chih-Fong Tsai), 蘇坤良 (Kuen-Liang Sue)

Date of Approval

2019-7-1
