Master's Thesis 111423027: Detailed Record
Author: 柳成彥 (LIU CHENG-YAN)    Department: Information Management
Thesis Title: Feature and Instance Selection in One-versus-All and One-versus-One Multi-class Classification Methods
(特徵與樣本選取於一對多與一對一之多分類資料處理方法之研究)
Related Theses
★ Building a sales forecasting model for commercial multifunction printers using data mining techniques
★ Applying data mining techniques to resource allocation prediction: a case study of a computer OEM support unit
★ Applying data mining techniques to flight delay analysis in the airline industry: a case study of Company C
★ Security control of new products in the global supply chain: a case study of Company C
★ Data mining in the semiconductor laser industry: a case study of Company A
★ Applying data mining techniques to predicting warehouse storage time of air export cargo: a case study of Company A
★ Optimizing YouBike rebalancing operations using data mining classification techniques
★ The effect of feature selection on different data types
★ Data mining applied to B2B corporate websites: a case study of Company T
★ Customer investment analysis and recommendation for financial derivatives: integrating clustering and association rule techniques
★ Building a liver ultrasound image-assisted diagnosis model using convolutional neural networks
★ An identity recognition system based on convolutional neural networks
★ Comparative error analysis of power-data imputation methods in energy management systems
★ Development of an employee sentiment analysis and management system
★ Data cleaning for the class imbalance problem: a machine learning perspective
★ Data mining techniques applied to passenger self-check-in analysis: a case study of Airline C
Files: Full text viewable in the system after 2029-07-01 (embargoed)
Abstract (Chinese): With the rapid advance of information technology, the volume of data in every field has grown explosively. If this flood of data is not properly preprocessed, the noise it contains affects model building and degrades classification performance. Many studies have confirmed that feature selection and instance selection, which screen a dataset for its important features and instances, can effectively improve classifier performance and model accuracy. However, past research has seldom discussed whether multi-class datasets call for different processing methods to improve performance: multi-class decomposition has been studied for instance selection, but not yet for feature selection. This study therefore investigates the effect on model building of applying a multi-class decomposition method to a multi-class dataset before performing feature selection.
This study uses the one-versus-all (OvA) and one-versus-one (OvO) decomposition techniques to split multi-class data into binary subproblems and pairs them with feature selection methods drawn from the three major categories (filter, wrapper, and embedded). Support vector machines (SVM) and the k-nearest neighbors (KNN) algorithm serve as classifiers in the search for the best experimental combination. In the second stage of the experiments, instance selection is added to examine how the order of instance selection and decomposition-based feature selection affects classification performance.
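To make the decomposition step concrete, the following is a minimal Python sketch, not the thesis code, of how OvA and OvO split a multi-class dataset into binary subproblems; `X` and `y` are assumed to be a generic feature matrix and label vector.

```python
# Minimal sketch of the OvA and OvO decompositions (illustrative only).
import numpy as np

def ova_splits(X, y):
    """One-versus-All: one binary subproblem per class (class c vs. the rest)."""
    for c in np.unique(y):
        yield c, X, (y == c).astype(int)

def ovo_splits(X, y):
    """One-versus-One: one binary subproblem per unordered pair of classes."""
    classes = np.unique(y)
    for i, ci in enumerate(classes):
        for cj in classes[i + 1:]:
            mask = np.isin(y, [ci, cj])  # keep only instances of the two classes
            yield (ci, cj), X[mask], (y[mask] == ci).astype(int)
```

For k classes, OvA yields k subproblems over all instances, while OvO yields k(k-1)/2 smaller two-class subproblems; feature selection can then be run on each subproblem independently.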
The experiments use 15 multi-class datasets from UCI and Feature Selection @ ASU. The results show that XGBoost feature selection under OvO decomposition, with the per-pair feature sets merged by union, achieves the best average performance with the KNN classifier, improving AUC by 2.9% over the baseline without feature selection.
Abstract (English): With the rapid advancement of information technology, data volume across various fields has exploded. Without proper preprocessing, this excess data can introduce noise, negatively impacting model performance and reducing classifier effectiveness. Studies have shown that feature selection and instance selection can significantly enhance classifier performance and model accuracy by retaining only the important features and instances. However, there has been limited discussion on whether different data processing methods for multi-class datasets can further enhance performance. While multi-class processing methods for instances have been explored, feature-focused research is lacking. This study investigates the impact on model building of applying multi-class decomposition to a multi-class dataset before feature selection.
We utilize One-versus-All (OvA) and One-versus-One (OvO) techniques in multi-class classification for data splitting, combined with three major types of feature selection methods (filter, wrapper, and embedded). Support Vector Machine (SVM) and K-Nearest Neighbors (KNN) classification algorithms are used to explore optimal combinations. In the second stage of the study, we incorporate instance selection methods to examine how the order of instance selection and decomposition-based feature selection affects classifier performance.
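The two orderings examined in this stage can be sketched as follows. This is a hedged illustration rather than the thesis pipeline: ENN is borrowed from the imbalanced-learn package as a stand-in for the instance-selection step, and `select_features` is a hypothetical placeholder for any of the filter, wrapper, or embedded methods above.

```python
# Sketch of the two processing orders compared in the second-stage experiments:
# instance selection before feature selection (IS->FS) and the reverse (FS->IS).
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def select_features(X, y, k=10):
    """Hypothetical placeholder: a filter-style selector (mutual information)."""
    return SelectKBest(mutual_info_classif, k=min(k, X.shape[1])).fit(X, y)

def is_then_fs(X, y):
    # Edit noisy instances first (imblearn's ENN edits non-minority classes
    # by default), then select features on the cleaned data.
    X_r, y_r = EditedNearestNeighbours(n_neighbors=3).fit_resample(X, y)
    fs = select_features(X_r, y_r)
    return fs.transform(X_r), y_r

def fs_then_is(X, y):
    # Select features first, then edit instances in the reduced feature space.
    fs = select_features(X, y)
    return EditedNearestNeighbours(n_neighbors=3).fit_resample(fs.transform(X), y)
```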
This study uses 15 multi-class datasets from UCI and Feature Selection @ ASU. Our results show that employing the XGBoost feature selection algorithm with OvO decomposition and taking the union of the selected features achieved the best average results under the KNN classifier. Compared to the baseline without feature selection, the AUC improved by 2.9%.
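For concreteness, below is a hedged sketch of the best-performing configuration reported above: XGBoost importances as the selection criterion on each OvO subproblem, the per-pair selections merged by union, and a KNN classifier evaluated with macro AUC. The dataset (scikit-learn's wine data), the train/test split, and the top-k cutoff per pair are illustrative assumptions, not the thesis settings.

```python
# Sketch of OvO decomposition + XGBoost feature selection + union + KNN.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

k = 5                                    # assumed top-k features per class pair
selected = set()
classes = np.unique(y_tr)
for i, ci in enumerate(classes):         # one binary subproblem per class pair
    for cj in classes[i + 1:]:
        mask = np.isin(y_tr, [ci, cj])
        xgb = XGBClassifier(n_estimators=100, eval_metric="logloss")
        xgb.fit(X_tr[mask], (y_tr[mask] == ci).astype(int))
        selected |= set(np.argsort(xgb.feature_importances_)[-k:])  # union

cols = sorted(selected)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, cols], y_tr)
proba = knn.predict_proba(X_te[:, cols])
print("macro AUC:", roc_auc_score(y_te, proba, multi_class="ovr", average="macro"))
```

A baseline for comparison would fit the same KNN on all features and compare the two AUC values.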
Keywords
★ Data pre-processing
★ feature selection
★ instance selection
★ multi-class classification
★ multi-class dataset
★ data mining
Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2: Literature Review
2.1 Multi-class Decomposition Methods
2.2 SVM Classifier
2.2.1 AVA
2.2.2 OVO
2.2.3 OVA
2.3 Feature Selection
2.3.1 Filter Methods
2.3.2 Wrapper Methods
2.3.3 Embedded Methods
2.4 Instance Selection
2.4.1 DROP3
2.4.2 ENN
Chapter 3: Research Method and Experimental Design
3.1 Experimental Framework
3.1.1 Experimental Procedure
3.1.2 Baseline
3.1.3 Experiment 1
3.1.4 Experiment 2
3.2 Experimental Setup
3.2.1 Hardware and Software Configuration
3.2.2 Packages and Parameter Settings
3.2.2.1 Feature Selection
3.2.2.2 Instance Selection
3.2.3 Datasets
3.2.4 Evaluation Metrics
Chapter 4: Experimental Results
4.1 Experiment 1
4.1.1 Results of Experiment 1
4.1.1.1 Filter Results
4.1.1.2 Wrapper Results
4.1.1.3 Embedded Results
4.1.2 Summary of Experiment 1
4.2 Experiment 2
4.2.1 Results of Experiment 2
4.2.1.1 Experiment 2: ISFS Instance Reduction Rate
4.2.1.2 Experiment 2: Filter
4.2.1.3 Experiment 2: Wrapper
4.2.1.4 Experiment 2: Embedded
4.2.2 Summary of Experiment 2
4.3 Overall Comparison
4.3.1 Comparison of Experimental Results
4.3.2 Comparison of Feature Reduction Rates
Chapter 5: Conclusion
5.1 Summary and Contributions
5.2 Future Research Directions and Suggestions
References
Advisor: 蔡志豐 (Chih-Fong Tsai)    Date Approved: 2024-07-03