Graduate Thesis 107423011 Detailed Record




Author: Pei-Qi Liao (廖珮祺)    Department: Information Management
Thesis Title: Instance Selection Methods in Multi-Class Classification Datasets: All versus All, One versus All, and One versus One
(樣本選取方法於多分類資料集之影響:多對多、一對多與一對一)
Related Theses
★ Building a sales forecasting model for commercial multifunction printers using data mining techniques
★ Applying data mining techniques to resource allocation forecasting: the case of a computer OEM support unit
★ Applying data mining techniques to flight delay analysis in the airline industry: the case of Company C
★ Security control of new products in the global supply chain: the case of Company C
★ Applying data mining in the semiconductor laser industry: the case of Company A
★ Applying data mining techniques to predicting warehouse storage time of air export cargo: the case of Company A
★ Optimizing YouBike rebalancing operations with data mining classification techniques
★ The impact of feature selection on different data types
★ Applying data mining to B2B corporate websites: the case of Company T
★ Customer investment analysis and recommendations for financial derivatives: integrating clustering and association rule techniques
★ Building a computer-aided diagnostic model for liver ultrasound images using convolutional neural networks
★ An identity recognition system based on convolutional neural networks
★ Comparative error analysis of electric power imputation methods in energy management systems
★ Development of an employee sentiment analysis and management system
★ Data cleaning for the class imbalance problem: a machine learning perspective
★ Applying data mining techniques to the analysis of passenger self check-in: the case of Airline C
Files: full text available in the system after 2025-09-01.
Abstract (Chinese) The era of big data has arrived. When turning these data into useful information, a model trained without proper pre-processing may be affected by noise in the data, which lowers its predictive ability. Previous research has shown that instance selection methods can effectively pick out the representative instances in a dataset and improve the performance and accuracy of the model. Few of these studies, however, discuss whether different processing strategies can improve the effectiveness of instance selection when the dataset is multi-class. This thesis therefore investigates how first applying the multi-class classification processing methods proposed in this study to a multi-class dataset and then performing instance selection affects model construction.
  This study proposes three multi-class classification processing methods for multi-class datasets: All versus All (AvA), One versus All (OvA), and One versus One (OvO), combined with three instance selection methods: the instance-based learning algorithm (IB3), the decremental reduction optimization procedure (DROP3), and the genetic algorithm (GA). Support vector machines (SVM) and the k-nearest neighbors classification algorithm (KNN) serve as classifiers to evaluate which combination builds the best model. In the second stage of the experiments, feature selection is added to examine how feature selection combined with instance selection under the multi-class classification processing affects the trained model.
  The experiments use 20 multi-class datasets of different types from UCI and KEEL and run the different combinations of multi-class classification processing and instance selection methods. The results show that the OvO processing combined with the DROP3 instance selection algorithm, under a KNN classifier, obtains the best average result: compared with a KNN model built without instance selection, the AUC is improved by 6.6%.
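The genetic-algorithm-based instance selection named in the abstract above can be illustrated with a minimal sketch: each chromosome is a bit mask over the training instances, and the fitness of a mask is the validation accuracy of a 1-NN classifier trained only on the selected instances. This is an assumption-laden illustration rather than the thesis implementation: the dataset (Iris), the use of a held-out validation split for fitness, the absence of elitism, and all GA parameters (population size, generations, crossover and mutation rates) are illustrative choices only.

    # Minimal GA-based instance selection sketch (illustrative assumptions, not the thesis code).
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def ga_instance_selection(X_tr, y_tr, X_val, y_val, pop=20, gens=30,
                              cx_rate=0.8, mut_rate=0.01, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X_tr)
        population = rng.integers(0, 2, size=(pop, n))        # one bit per training instance

        def fitness(mask):
            if mask.sum() < 2:                                 # guard against near-empty selections
                return 0.0
            knn = KNeighborsClassifier(n_neighbors=1)
            knn.fit(X_tr[mask == 1], y_tr[mask == 1])
            return knn.score(X_val, y_val)                     # fitness = validation accuracy

        for _ in range(gens):
            scores = np.array([fitness(m) for m in population])
            idx = rng.integers(0, pop, size=(pop, 2))          # binary tournament selection
            winners = np.where(scores[idx[:, 0]] >= scores[idx[:, 1]], idx[:, 0], idx[:, 1])
            parents = population[winners]
            children = parents.copy()
            for i in range(0, pop - 1, 2):                     # one-point crossover on parent pairs
                if rng.random() < cx_rate:
                    cut = int(rng.integers(1, n))
                    children[i, cut:], children[i + 1, cut:] = parents[i + 1, cut:], parents[i, cut:]
            flips = rng.random(children.shape) < mut_rate      # bit-flip mutation
            children[flips] = 1 - children[flips]
            population = children

        best = max(population, key=fitness)
        return best == 1                                       # boolean mask of retained instances

    if __name__ == "__main__":
        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        X, y = load_iris(return_X_y=True)
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                                    random_state=0, stratify=y)
        keep = ga_instance_selection(X_tr, y_tr, X_val, y_val)
        print(f"kept {keep.sum()} of {len(keep)} training instances")

A KNN or SVM model would then be trained only on X_tr[keep], y_tr[keep]; the same bit-mask encoding extends naturally to feature selection by letting each bit switch a feature on or off instead of an instance.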
Abstract (English) The era of big data has arrived. When turning these data into useful information, if they have not gone through proper pre-processing, the noise in the data may reduce the predictive ability of the trained model. Previous research has shown that instance selection methods can effectively select representative data from datasets and improve the performance and accuracy of the model. This research, however, rarely discusses whether any processing methods can improve the effectiveness of instance selection when the datasets are multi-class. Therefore, this thesis discusses the impact of combining the multi-class classification methods proposed in this research with instance selection methods on multi-class datasets.
  This study proposes three methods for multi-class classification processing of multi-class datasets: All versus All (AvA), One versus All (OvA), and One versus One (OvO), combined with three instance selection methods: the instance-based learning algorithm 3 (IB3), the decremental reduction optimization procedure 3 (DROP3), and the genetic algorithm (GA). Support vector machines (SVM) and the k-nearest neighbors classification algorithm (KNN) are used as classifiers to evaluate which combination performs best. In the second stage of the study, we add feature selection to find out how feature selection and instance selection interact under the multi-class classification methods.
  This study uses 20 different types of multi-class datasets from UCI and KEEL and goes through the different combinations of multi-class classification methods and instance selection methods. The empirical results show that the combination of the multi-class classification method OvO with the instance selection method DROP3, under the KNN classifier, obtains the best average results. Compared with the baseline results without instance selection, the AUC is improved by 6.6%.
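As a companion to the result quoted above, the overall pipeline (OvO decomposition of the multi-class training set, instance selection inside each binary subset, KNN with pairwise voting, and an AUC comparison against a baseline KNN trained on the unreduced data) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: it uses the Iris dataset and a simple Wilson-editing reduction as a stand-in for DROP3, and all parameter values are arbitrary.

    # Minimal OvO + instance selection + KNN pipeline sketch (illustrative, not the thesis code).
    from itertools import combinations
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import roc_auc_score

    def edit_instances(X, y, k=3):
        """Wilson-editing style reduction: drop instances whose class label disagrees with
        the majority of their k nearest neighbours (a simplified stand-in for DROP3)."""
        knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        neigh = knn.kneighbors(X, n_neighbors=k + 1, return_distance=False)[:, 1:]  # skip self
        keep = np.array([np.bincount(y[idx]).argmax() == y[i] for i, idx in enumerate(neigh)])
        return X[keep], y[keep]

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
    classes = np.unique(y_tr)

    # OvO decomposition: one reduced binary subset and one KNN per class pair.
    pair_models = {}
    for a, b in combinations(classes, 2):
        mask = np.isin(y_tr, [a, b])
        X_ab, y_ab = edit_instances(X_tr[mask], y_tr[mask])
        pair_models[(a, b)] = KNeighborsClassifier(n_neighbors=3).fit(X_ab, y_ab)

    def pairwise_vote_scores(X):
        """Each binary model votes for one of its two classes; normalised votes act as scores."""
        votes = np.zeros((len(X), len(classes)))
        for (a, b), model in pair_models.items():
            pred = model.predict(X)
            votes[np.arange(len(X)), np.searchsorted(classes, pred)] += 1
        return votes / votes.sum(axis=1, keepdims=True)

    # Baseline: plain KNN on all training instances, without instance selection.
    baseline = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
    auc_base = roc_auc_score(y_te, baseline.predict_proba(X_te), multi_class="ovo")
    auc_ovo = roc_auc_score(y_te, pairwise_vote_scores(X_te), multi_class="ovo")
    print(f"baseline AUC = {auc_base:.3f}   OvO + instance selection AUC = {auc_ovo:.3f}")

In the thesis itself, DROP3, IB3, or GA would take the place of edit_instances, SVM could replace KNN, and the 20 UCI/KEEL datasets would replace Iris; the sketch only shows where each component plugs into the OvO workflow.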
Keywords ★ Data pre-processing
★ Instance selection
★ Feature selection
★ Multi-class dataset
★ Data mining
Table of Contents
Abstract (Chinese) i
Abstract (English) ii
Table of Contents iii
List of Figures v
List of Tables vii
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Motivation 2
1.3 Research Objectives 3
1.4 Thesis Organization 4
Chapter 2 Literature Review 5
2.1 Instance Selection 5
2.1.1 Instance-Based Learning Algorithm (IB3) 7
2.1.2 Decremental Reduction Optimization Procedure (DROP3) 9
2.1.3 Genetic Algorithm (GA) 11
2.2 Feature Selection 17
2.2.1 Genetic Algorithm (GA) 20
2.3 Supervised Learning Classification Models 21
2.3.1 Supervised Learning 21
2.3.2 Support Vector Machine (SVM) 21
2.3.3 K-Nearest Neighbors Classification Algorithm (KNN) 25
Chapter 3 Experimental Methods 26
3.1 Experimental Framework 26
3.2 Multi-Class Classification Processing 28
3.2.1 AvA (All versus All) 28
3.2.2 OvA (One versus All) 29
3.2.3 OvO (One versus One) 30
3.3 Experimental Procedure 31
3.3.1 Baseline 31
3.3.2 Experiment 1 31
3.3.3 Experiment 2 34
3.4 Method Validation 35
3.5 Experimental Parameter Settings 36
3.5.1 Instance Selection 36
3.5.2 Feature Selection 37
3.5.3 Classifiers for Model Building 38
Chapter 4 Experimental Results 39
4.1 Experimental Preparation 39
4.2 Experimental Results 41
4.2.1 Experiment 1: IS 41
4.2.2 Experiment 2-1: IS+FS 50
4.2.3 Experiment 2-2: FS+IS 59
4.3 Overall Discussion 68
4.3.1 Comparison of Experimental Results 68
4.3.2 Instance Selection Reduction Rates 74
4.3.3 Computation Time Comparison 77
4.4 Summary of Experiments 79
Chapter 5 Conclusions 81
5.1 Conclusions and Contributions 81
5.2 Future Research Directions and Suggestions 83
References 85
Advisor: Chih-Fong Tsai (蔡志豐)    Date of Approval: 2021-02-19