Feature Selection in Data Discretization (特徵選取於資料離散化之影響)

Name

Yu-Chi CHEN (陳鈺錡)

Department

Department of Information Management

Thesis title

特徵選取於資料離散化之影響
(Feature Selection in Data Discretization)

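This electronic thesis is authorized for immediate open access.
The open-access full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.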

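Abstract (Chinese)

Real-world data are rarely as "clean" as one might expect, so data pre-processing is needed to ensure data quality. Real-world data may be high-dimensional, may contain irrelevant attributes or redundant values, and may include many continuous numeric attributes that are complex and hard to interpret; using such data directly can substantially reduce a model's predictive power. Previous studies show that discretization, a pre-processing method that converts numeric attributes into categorical ones, helps improve model accuracy and efficiency, smooths the data, reduces noise, and avoids overfitting. Feature selection is another pre-processing technique widely used in practice; it reduces computational complexity, identifies representative features, and improves prediction accuracy. Few studies have examined applying discretization and feature selection together, so this thesis investigates the best combination of the two in a data pre-processing workflow.
This study uses four representative discretization methods: Equal-Width Discretization (EWD), Equal-Frequency Discretization (EFD), the Minimum Description Length Principle (MDLP), and ChiMerge (ChiM). It uses three feature selection methods: Genetic Algorithm (GA), Decision Tree C4.5 (DT), and Principal Components Analysis (PCA). It examines how well the discretization and feature selection methods pair with one another and in which order they should be applied. The data come from 10 datasets in the UCI repository, with 8 to 90 dimensions and 2 to 28 classes; results are reported as the average prediction accuracy of the C5.0 and SVM classifiers.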
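As a rough illustration of the two unsupervised methods named above, the following minimal sketch contrasts equal-width and equal-frequency binning on a small skewed attribute. The function names, the sample values, and the choice of three bins are assumptions made for illustration; they are not taken from the thesis, whose parameter settings are not given in this record.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Return the k-1 interior cut points that split [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Return k-1 interior cut points so each bin holds roughly len(values)/k points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def assign_bins(values, edges):
    """Map each value to a bin index; a value equal to a cut point goes to the upper bin."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical skewed attribute: most values are small, one is large.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
ewd = assign_bins(data, equal_width_edges(data, 3))      # [0, 0, 0, 0, 0, 0, 0, 0, 2]
efd = assign_bins(data, equal_frequency_edges(data, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With equal widths, the single outlier pushes the first eight values into one bin and leaves the middle bin empty; with equal frequencies, each bin receives three values.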
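According to the experimental results, MDLP is the best-performing discretization method on average. Performing feature selection before discretization yields higher average prediction accuracy than performing discretization before feature selection. Under the feature-selection-first order, whether SVM or C5.0 is used as the classifier, C4.5 feature selection followed by MDLP discretization is the combination this thesis recommends most, with accuracy reaching 80.1%.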

Abstract (English)

In reality, data are rarely as "clean" as we expect, so data pre-processing is needed to ensure data quality. Several problems must be solved: high-dimensional data may include irrelevant and redundant features (attributes of the data), and data may include many continuous attributes that are hard to understand and explain. Using such "unclean" data directly can decrease a model's predictive performance dramatically.
Previous research shows that discretization reduces and simplifies data, making model learning faster and yielding more accurate and more compact results, while reducing noise that may be present in the data. It can also help avoid overfitting and smooth the data. In addition, feature selection is a common data pre-processing method: it reduces the time complexity of model training and identifies important features, improving the classification accuracy of the model. Few studies currently discuss combining discretization and feature selection in pre-processing. This thesis therefore focuses on the optimal combination of discretization and feature selection for data pre-processing.
The experiments use three popular feature selection methods: GA (Genetic Algorithm), DT (Decision Tree, C4.5), and PCA (Principal Components Analysis). Four methods are used for discretization: EWD (Equal-Width Discretization), EFD (Equal-Frequency Discretization), MDLP (Minimum Description Length Principle), and ChiMerge.
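Of the discretization methods listed, MDLP is the one whose stopping rule is least obvious from its name. The sketch below shows a single split decision in the style of Fayyad and Irani's entropy-based method: pick the cut point with the highest information gain, then accept it only if the gain exceeds the MDL threshold. The full method applies this test recursively to each resulting interval, which is omitted here; the function names and sample data are illustrative assumptions, not the thesis's implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut, gain, left labels, right labels) for the best binary split, or None."""
    pairs = sorted(zip(values, labels))
    ys = [y for _, y in pairs]
    n, base = len(pairs), entropy(ys)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        left, right = ys[:i], ys[i:]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if best is None or gain > best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cut, gain, left, right)
    return best


def mdlp_accepts(labels, gain, left, right):
    """MDL stopping test: accept the cut only if the gain pays for its encoding cost."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right))
    return gain > (log2(n - 1) + delta) / n


# Hypothetical attribute whose values separate the two classes cleanly.
clean_values = [1, 2, 3, 4, 10, 11, 12, 13]
clean_labels = ["a"] * 4 + ["b"] * 4
cut, gain, left, right = best_cut(clean_values, clean_labels)
clean_ok = mdlp_accepts(clean_labels, gain, left, right)

# Hypothetical attribute with alternating classes: no cut is worth its cost.
noisy_labels = ["a", "b", "a", "b"]
_, n_gain, n_left, n_right = best_cut([1, 2, 3, 4], noisy_labels)
noisy_ok = mdlp_accepts(noisy_labels, n_gain, n_left, n_right)
```

In the first case the cut falls midway between 4 and 10 and passes the test; in the second the best available cut is rejected, so the attribute would be left as a single interval.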
To explore the optimal combination of discretization and feature selection, data are drawn from 10 UCI datasets. The data dimensionality ranges from 8 to 90, and the classification problems contain 2 to 28 classes. The comparative results are based on the average accuracy of the C5.0 and SVM classifiers. Our empirical results show that the MDLP discretization method gives the best predictive performance.
To conclude, performing feature selection before discretization gives classifiers higher accuracy than discretization alone. Moreover, regardless of which classifier is used (C5.0 or SVM), feature selection by C4.5 followed by discretization by MDLP is the most recommended combination in this thesis; it achieves an average classification accuracy of 80.1%.
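The question of order can be pictured as composing the same two pre-processing steps in either sequence. The sketch below is only a schematic of that idea: it substitutes equal-width binning for the discretizer and a toy class-mean separation score (binary labels only) for the feature selector, neither of which is the GA, C4.5, PCA, or MDLP procedure evaluated in the thesis. All names and data are hypothetical.

```python
def equal_width(column, k=3):
    """Replace each value with the index of its equal-width bin (0..k-1)."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / k or 1.0  # guard against constant columns
    return [min(int((v - lo) / width), k - 1) for v in column]


def discretize(columns, labels):
    """Discretization step: bin every column independently (labels are unused here)."""
    return {name: equal_width(col) for name, col in columns.items()}


def separation(column, labels):
    """Toy relevance score for binary labels: gap between class means over the column range."""
    classes = sorted(set(labels))
    means = [sum(v for v, y in zip(column, labels) if y == c) / labels.count(c) for c in classes]
    spread = (max(column) - min(column)) or 1.0
    return abs(means[0] - means[1]) / spread


def select_top(columns, labels, m=1):
    """Feature-selection step: keep the m columns with the highest toy score."""
    ranked = sorted(columns, key=lambda name: separation(columns[name], labels), reverse=True)
    return {name: columns[name] for name in ranked[:m]}


def run(columns, labels, steps):
    """Apply the pre-processing steps in the given order."""
    for step in steps:
        columns = step(columns, labels)
    return columns


# Hypothetical data: x1 tracks the class label, x2 does not.
columns = {"x1": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0], "x2": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]}
labels = [0, 0, 0, 1, 1, 1]

fs_then_d = run(columns, labels, [select_top, discretize])  # FS+D
d_then_fs = run(columns, labels, [discretize, select_top])  # D+FS
```

On this toy data both orders keep `x1` and produce the same bins. On real data the two orders can diverge, because in D+FS the selector scores the binned values rather than the raw ones, and in FS+D the discretizer only ever sees the selected features.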

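Keywords (Chinese)

★ 資料離散化 (data discretization)
★ 特徵選取 (feature selection)
★ 資料探勘 (data mining)
★ 分類 (classification)
★ 連續型屬性 (continuous attributes)
★ 機器學習 (machine learning)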

Keywords (English)

★ Discretization
★ Feature Selection
★ Classification
★ Continuous Attributes
★ Data Mining
★ Machine Learning

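Table of Contents

Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 Data Discretization
2.1.1 Unsupervised Discretization Methods
2.1.2 Supervised Discretization Methods
2.2 Feature Selection
2.2.1 Genetic Algorithm (GA)
2.2.2 Decision Tree C4.5 (DT)
2.2.3 Principal Components Analysis (PCA)
2.3 Related Work
Chapter 3 Research Method
3.1 Experimental Framework
3.2 Experimental Parameter Settings and Methods
3.2.1 Discretization Parameter Settings
3.2.2 Feature Selection Parameter Settings
3.2.3 Classifiers
3.3 Experiment 1
3.3.1 Baseline
3.3.2 Discretization Only (D)
3.3.3 Discretization Followed by Feature Selection (D+FS)
3.3.4 Feature Selection Followed by Discretization (FS+D)
3.4 Experiment 2
Chapter 4 Experimental Results
4.1 Experimental Setup
4.1.1 Datasets
4.1.2 Computing Environment
4.1.3 Model Validation Criteria
4.2 Results of Experiment 1
4.2.1 Comparison of Discretization Methods
4.2.2 Results of Discretization Followed by Feature Selection (D+FS)
4.2.3 Results of Feature Selection Followed by Discretization (FS+D)
4.2.4 Effect of the Order of Discretization and Feature Selection
4.2.5 Comparison of Computation Time
4.3 Results of Experiment 2
4.4 Summary of Experiments
Chapter 5 Conclusion
5.1 Conclusions and Contributions
5.2 Future Research Directions and Suggestions
References
Appendix 1
1.1 Experiment 1: SVM Classifier Results
1.2 Experiment 1: C5.0 Classifier Results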

Advisors

Chih-Fong Tsai (蔡志豐), Kuen-Liang Sue (蘇坤良)

Approval date

2018-7-5
