An Imputation Method based on Ensemble Techniques for Binary Classification Datasets

Name

Po-Hao Peng (彭柏豪)

Department

Department of Information Management

Thesis Title

An Imputation Method based on Ensemble Techniques for Binary Classification Datasets
(基於集成方法的二元分類資料集補值研究)

Files

Full text viewable in the system (open from 2029-6-30 onward)

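Abstract (Chinese, translated)

Prior research shows that imputation methods fall into three broad categories: statistical, machine learning, and deep learning. Because each category suits different scenarios, this study applies ensemble techniques to the imputation task: it combines multiple imputation methods and assigns each a weight according to how well it suits a given scenario, with the aim of producing better imputed values.
For the experimental design, this study uses six binary classification datasets from the UCI Machine Learning Repository. Guided by the literature review, it selects imputation methods from each category: the statistical methods Mean/Mode and MICE, the machine learning methods MissForest and KNN, and the deep learning methods PC-GAIN, HI-VAE, and PMIVAE. It also adapts PC-GAIN into a variant called RC-GAIN, giving eight imputation methods in total, and runs the experiments with three classifiers: SVM, LightGBM, and MLP.
The experiments identify the four best-performing imputation methods (MICE, MissForest, RC-GAIN, and HI-VAE) and the best classifier (LightGBM), which are then used to build the ensemble imputation method. Two performance metrics, RMSE and the Accuracy obtained with LightGBM, yield two sets of weights and therefore two ensemble methods: Ensemble_rmse and Ensemble_acc. The results show that both ensemble methods outperform the four selected imputation methods across the missingness mechanisms and missing rates tested. Of the two, Ensemble_acc outperforms Ensemble_rmse and is the better imputation method.
The study also analyzes the suitability of the ensemble methods with respect to dataset characteristics. By sample size, Ensemble_acc performs better on both small and large datasets. By feature type, Ensemble_rmse performs better on purely numerical datasets, whereas Ensemble_acc performs better on mixed-type datasets. By application domain, Ensemble_rmse performs better on medical datasets, whereas Ensemble_acc performs better on credit datasets.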

Abstract (English)

Past research shows that imputation methods can generally be categorized into three types: statistical, machine learning, and deep learning. Each type is suited to particular contexts, so this study applies ensemble techniques to imputation tasks. It aims to combine multiple imputation methods and assign appropriate weights based on each method's suitability for different scenarios, thereby generating superior imputed values.
In terms of experimental design, this study selects six binary classification datasets from the UCI Machine Learning Repository. Based on previous literature, representative methods for each category were selected: the statistical methods Mean/Mode and MICE; the machine learning methods MissForest and KNN; and the deep learning methods PC-GAIN, HI-VAE, and PMIVAE. The PC-GAIN method was also adjusted to form the RC-GAIN method. In total, eight imputation methods were used, and experiments were conducted using the SVM, LightGBM, and MLP classifiers.
The study selected the four best-performing imputation methods (MICE, MissForest, RC-GAIN, and HI-VAE) and the best classifier, LightGBM, to construct an ensemble imputation method. Two performance metrics, RMSE and the Accuracy obtained with LightGBM, were used to calculate two sets of weights, producing two ensemble methods: Ensemble_rmse and Ensemble_acc. Experimental results showed that both ensemble methods outperformed the four selected imputation methods under different missingness mechanisms and missing rates. Of the two, Ensemble_acc outperformed Ensemble_rmse and was the better imputation method.
The study also analyzed the suitability of the ensemble methods based on dataset characteristics. In the analysis of dataset size, Ensemble_acc performed better on both small and large datasets. In the analysis of feature types, Ensemble_rmse performed better on purely numerical datasets, while Ensemble_acc performed better on mixed-type datasets. Finally, in the application domain analysis, Ensemble_rmse performed better on medical datasets, while Ensemble_acc performed better on credit datasets.
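
The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract does not state how the weights are derived from RMSE or Accuracy, so the two schemes below (weights proportional to inverse RMSE, and weights proportional to accuracy, each normalized to sum to 1) are assumptions, as are the function names and the toy numbers. The sketch also covers numeric features only; categorical features would need something like weighted voting instead of averaging.

```python
from math import sqrt

EPS = 1e-12  # guards against division by zero if a method has RMSE 0


def rmse(imputed, truth):
    """Root mean squared error over the held-out (artificially masked) entries."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(imputed, truth)) / len(truth))


def weights_from_rmse(rmses):
    """Assumed scheme: weight proportional to 1/RMSE, normalized to sum to 1."""
    inverse = [1.0 / (r + EPS) for r in rmses]
    total = sum(inverse)
    return [v / total for v in inverse]


def weights_from_accuracy(accuracies):
    """Assumed scheme: weight proportional to accuracy, normalized to sum to 1."""
    total = sum(accuracies)
    return [a / total for a in accuracies]


def ensemble_impute(candidates, weights):
    """Weighted average of numeric imputations.

    candidates[m][i] is the value method m proposes for missing cell i.
    """
    return [
        sum(w * values[i] for w, values in zip(weights, candidates))
        for i in range(len(candidates[0]))
    ]


# Toy numbers for illustration only; they are not results from the thesis.
truth = [2.0, 4.0, 6.0]
candidates = {
    "MICE": [2.5, 3.5, 6.5],
    "MissForest": [2.1, 4.2, 5.8],
    "RC-GAIN": [1.0, 5.0, 7.0],
    "HI-VAE": [2.4, 4.4, 6.4],
}
w_rmse = weights_from_rmse([rmse(v, truth) for v in candidates.values()])
filled = ensemble_impute(list(candidates.values()), w_rmse)
print(dict(zip(candidates, w_rmse)), filled)
```

An accuracy-based variant would pass per-method classifier accuracies to `weights_from_accuracy` and reuse `ensemble_impute` unchanged. In practice the weights would be estimated on artificially masked entries whose true values are known and then applied to the genuinely missing cells.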

Keywords (English)

★ Machine learning
★ Deep learning
★ Missing value imputation
★ Ensemble learning

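Table of Contents

Abstract (Chinese)
Abstract (English)
Acknowledgments
Contents
List of Figures
List of Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
2. Literature Review
2-1 Missingness Mechanisms
2-2 Imputation Methods
2-2-1 Mean/Mode Imputation
2-2-2 MICE (Multiple Imputation by Chained Equations)
2-2-3 KNN Imputation
2-2-4 MissForest
2-2-5 PC-GAIN (Pseudo-label Conditional GAIN)
2-2-6 RC-GAIN (Real-label Conditional GAIN)
2-2-7 HI-VAE (Heterogeneous-Incomplete VAE)
2-2-8 PMIVAE (Partial Multiple Imputation with VAE)
2-3 Ensemble Methods
2-4 Classifiers
2-4-1 SVM (Support Vector Machine)
2-4-2 LightGBM (Light Gradient Boosting Machine)
2-4-3 MLP (Multi-Layer Perceptron)
3. Methodology
3-1 Datasets
3-2 Data Preprocessing
3-3 Simulated Missing-Value Scenarios
3-4 Evaluation Metrics
3-4-1 RMSE (Root Mean Squared Error)
3-4-2 Accuracy
3-5 Experimental Parameter Settings and Methods
3-6 Experiment 1: Imputation Performance of the Imputation Methods
3-7 Experiment 2: Classification Performance of the Classifiers on Imputed Datasets
3-8 Experiment 3: Performance of the Ensemble Imputation Methods
4. Experimental Results and Analysis
4-1 Imputation Performance of the Imputation Methods
4-1-1 Performance Analysis of the Imputation Methods
4-1-2 Selection of Imputation Methods
4-2 Classification Performance of the Classifiers on Imputed Datasets
4-2-1 Performance Analysis of the Classifiers
4-2-2 Selection of the Best Classifier
4-3 Performance of the Ensemble Imputation Methods
4-3-1 Imputation Quality of the Ensemble Methods
4-3-2 Classification Performance After Imputation with the Ensemble Methods
4-3-3 Suitability of the Ensemble Methods Across Dataset Characteristics
4-3-4 Effect of Different Ensemble Approaches on Classification Performance
5. Conclusion
5-1 Conclusions and Contributions
5-2 Research Limitations
5-3 Future Research and Recommendations
References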

Advisor

Kuen-Liang Sue (蘇坤良)

Approval Date

2024-7-29
