Master's/Doctoral Thesis 106423009: Detailed Record
Author: Tzu-Ming Yen (顏子明)    Department: Information Management
Title: 樣本選取與資料離散化對於分類器效果之影響
(Instance Selection and Data Discretization Influence on Classifier's Performance)
Related Theses:
★ The Influence of Feature Selection on Data Discretization
★ A Study of Oversampling Ensembles for Class-Imbalanced and High-Dimensional Data
★ A Comparison of Single and Ensemble Feature Selection Methods for High-Dimensional Data
Full Text: available in the system after 2024-06-25
Abstract (Chinese): "Data preprocessing" plays a pivotal role in data mining and is the starting point of the entire analysis workflow. Real-world data vary widely in quality; for example, large datasets often contain noise or include continuous numeric attributes that are hard to interpret. Without proper preprocessing, these factors bias the analysis results. Prior work has proposed instance selection, a data-sampling concept in which algorithms screen for representative samples; other studies have shown that applying discretization during preprocessing, converting continuous values into discrete ones, can effectively improve the readability of mined rules and may also raise accuracy. Whether combining instance selection with discretization can ultimately outperform either preprocessing step alone has not yet been examined in the literature.
This thesis investigates the effect of combining instance selection and discretization in data preprocessing, and how to pair them for the best performance. Three instance selection algorithms are considered: the Instance-Based Learning Algorithm (IB3), the Genetic Algorithm (GA), and the Decremental Reduction Optimization Procedure (DROP3), together with two supervised discretization algorithms: the Minimum Description Length Principle (MDLP) and ChiMerge (ChiM). The k-Nearest Neighbor (KNN) method serves as the classifier for evaluating the best combination.
The combinations of instance selection and discretization are examined on 10 datasets from UCI and KEEL. The experimental results show that, on average, pairing the DROP3 instance selection algorithm with the MDLP discretization algorithm is the recommended combination, and that performing DROP3 instance selection first and MDLP discretization second yields the most significant improvement in average accuracy, reaching 85.11%.
摘要(英) "Data Preprocessing" plays a pivotal role in data exploration and is the first step for the analysis process of data mining. In the real world, the quality of the big data is always unclear and uneven. For example, samples in the big data often have noise or continuous type values with low interpretability. These factors will result in inaccurate outcome if not properly pre-processed. In the literature, the concept of data sampling for instance selection had been proposed, which can be used to screen representative samples. Some studies have also shown that using discretization technology to transfer continuous values into discrete ones can effectively improve the readability of analytical exploration rules and may also improve the accuracy rate. Till now, there are no studies to explore the combination of instance selection and discretization, whether it can achieve better performance outcome than the single preprocessing techniques.
This thesis aims to discuss the influence of data preprocessing after combining instance selection and discretization, and how to achieve the optimal performance. In this study, three instance selection algorithms are selected: Instance-Based Learning Algorithm (IB3), Genetic Algorithm (GA), Decremental Reduction Optimization Procedure (DROP3), and two supervised discretization algorithms: Minimum Description Length Principle (MDLP), ChiMerge-based (ChiM). The best combination of the two types of techniques is evaluated by the performance of the K-th Nearest Neighbor (KNN) classifiers.
This study uses the 10 datasets from UCI and KEEL to explore the instance selection and discretization. According to the experimental results, it reveals that the average results of the DROP3 instance selection algorithm combined with the MDLP discretization algorithm is the more recommended combination than others, and the optimal performance can be obtained when the pre-processing of MDLP discretization is performed after the selection by DROP3, the average accuracy is promoted to 85.11%.
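The core experimental variable is the order in which the two preprocessing steps are applied before KNN classification. The sketch below illustrates the two orderings (IS + D and D + IS) under stated assumptions: it uses scikit-learn; select_instances performs only the nearest-neighbour noise filtering with which DROP3 begins, not full DROP3; an unsupervised equal-width KBinsDiscretizer (10 bins, an arbitrary choice) stands in for the supervised MDLP/ChiMerge methods; and stratified 10-fold cross-validation with a 1-NN classifier is assumed rather than taken from the thesis's parameter settings.

```python
# Sketch of the two preprocessing orderings (IS + D vs. D + IS) evaluated
# with a KNN classifier. The preprocessing steps are simplified stand-ins,
# not the thesis's DROP3/MDLP implementations.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import KBinsDiscretizer

def select_instances(X, y):
    """Keep instances whose nearest neighbour shares their label
    (only the noise-filtering step that DROP3 starts from)."""
    keep = np.ones(len(y), dtype=bool)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i                 # leave instance i out
        nn = KNeighborsClassifier(n_neighbors=1).fit(X[mask], y[mask])
        keep[i] = nn.predict(X[i:i + 1])[0] == y[i]
    return X[keep], y[keep]

def evaluate(X, y, order="IS+D", n_splits=10, seed=0):
    """Average KNN accuracy of one ordering (X, y: NumPy arrays)."""
    accs = []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in folds.split(X, y):
        X_tr, y_tr, X_te, y_te = X[tr], y[tr], X[te], y[te]
        # Equal-width binning learned on the training fold only; the thesis
        # uses supervised MDLP/ChiMerge here instead.
        disc = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")
        if order == "IS+D":        # instance selection first, then discretize
            X_tr, y_tr = select_instances(X_tr, y_tr)
            X_tr, X_te = disc.fit_transform(X_tr), disc.transform(X_te)
        else:                      # "D+IS": discretize first, then select
            X_tr, X_te = disc.fit_transform(X_tr), disc.transform(X_te)
            X_tr, y_tr = select_instances(X_tr, y_tr)
        knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
        accs.append(knn.score(X_te, y_te))
    return float(np.mean(accs))
```

Comparing evaluate(X, y, order="IS+D") with evaluate(X, y, order="D+IS") reproduces the shape of the comparison reported above; substituting real DROP3 and MDLP implementations for the stand-ins would recover the thesis's actual pipelines.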
Keywords (Chinese) ★ data preprocessing
★ instance selection
★ data discretization
★ continuous values
★ data mining
Keywords (English) ★ Data pre-processing
★ instance selection
★ discretization
★ continuous value
★ data mining
Table of Contents:
Abstract (Chinese) i
Abstract (English) ii
Table of Contents iii
List of Figures v
List of Tables vii
Chapter 1: Introduction 1
1.1 Research Background 1
1.2 Research Motivation 2
1.3 Research Objectives 4
1.4 Thesis Organization 5
Chapter 2: Literature Review 6
2.1 Instance Selection 6
2.1.1 Genetic Algorithm (GA) 8
2.1.2 Decremental Reduction Optimization Procedure (DROP3) 15
2.1.3 Instance-Based Learning Algorithm (IB3) 17
2.2 Data Discretization 19
2.2.1 Minimum Description Length Principle (MDLP) 20
2.2.2 ChiMerge 22
Chapter 3: Research Methodology 23
3.1 Experimental Framework 23
3.2 Method Validation 24
3.3 Experimental Parameter Settings and Evaluation Metrics 25
3.3.1 Discretization Parameter Settings 25
3.3.2 Instance Selection Parameter Settings 26
3.3.3 Evaluation Metrics 26
3.3.4 KNN Classifier Parameter Settings 26
3.4 Experimental Procedure 28
3.4.1 Baseline 28
3.4.2 Instance Selection Followed by Discretization (IS + D) 30
3.4.3 Discretization Followed by Instance Selection (D + IS) 33
Chapter 4: Experimental Results 36
4.1 Experimental Setup 36
4.1.1 Datasets 36
4.1.2 Hardware Environment 37
4.1.3 Software 38
4.2 Experimental Results 39
4.2.1 Baseline Results 39
4.2.2 Results of Instance Selection Followed by Discretization (IS + D) 41
4.2.3 Results of Discretization Followed by Instance Selection (D + IS) 53
4.2.4 Effect of the Order of Instance Selection and Discretization 65
4.2.5 Comparison of Computation Time 69
4.3 Summary of Experiments 70
Chapter 5: Conclusions 75
5.1 Conclusions and Contributions 75
5.2 Future Research Directions and Suggestions 77
References 79
Advisors: Chih-Fong Tsai (蔡志豐), Kuen-Liang Sue (蘇坤良)    Date of Approval: 2019-07-01