Master's/Doctoral Thesis 111423012 Detailed Record




Name 杜俊甫 (Chin-Fu Du)   Department Information Management
Thesis Title 多重填補與集成式填補策略於遺漏資料處理之比較與研究
(Comparative Study of Multiple Imputation and Ensemble Imputation Strategies for Missing Data Handling)
File Full text can be browsed in the system (open after 2029-7-1)
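Abstract (translated from Chinese) Against the backdrop of rapidly advancing information technology, missing values are pervasive across datasets of all kinds. They undermine data completeness and adversely affect subsequent data analysis and decision-making. Handling missing values is therefore an important and challenging task in the data preprocessing stage.
The most common way to handle missing values today is to impute them. Common approaches include single imputation methods, namely statistical imputation and machine learning imputation, as well as the multiple imputation methods that scholars have adopted in recent years. However, the literature has rarely examined the differences between multiple imputation and machine learning imputation, particularly across different types of datasets and different missing rates. Moreover, ensemble learning has flourished in recent years and has been shown to improve model prediction accuracy, yet literature on its use for missing value imputation remains comparatively scarce. This thesis therefore aims to analyze the performance of single and multiple imputation under different models and to explore the application of ensemble learning to missing value imputation.
This study selected 25 UCI datasets, covering numerical, categorical, and mixed data, and simulated missing rates from 10% to 50% to evaluate five machine learning algorithms under single imputation and multiple imputation (MICE). In addition, the study proposes two MICE-based ensemble imputation methods, a hybrid and a parallel ensemble strategy. Imputation quality is evaluated by SVM classification accuracy, root mean squared error, mean absolute error, and categorical accuracy.
The experiments show that on numerical and categorical datasets, multiple imputation clearly outperforms single imputation on every evaluation metric. On mixed datasets, every multiple imputation method except random forest performs slightly worse than single imputation. Taking the comparison between multiple and single imputation as a whole, random forest emerges as the best method. The ensemble imputation experiments show that both the hybrid and the parallel imputation methods yield improvements on every evaluation metric. For the hybrid method, the order in which the models are applied produces some differences in the results, which calls for attention in practice. Finally, the study offers recommended multiple imputation and ensemble imputation strategies for different needs, as a reference for future researchers.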
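The evaluation protocol described above (hide a fraction of known cells, impute them, and score the imputed values against the hidden truth) can be sketched as follows. This is a minimal illustration, not the thesis's code: it assumes values are removed completely at random, uses column-mean filling as a stand-in single imputation method, runs on synthetic numeric data rather than UCI datasets, and every function name is hypothetical.

```python
import math
import random


def mask_completely_at_random(rows, rate, seed=0):
    """Copy `rows`, set a fraction `rate` of cells to None, and return the masked positions."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    masked = set(rng.sample(cells, round(rate * len(cells))))
    holed = [[None if (i, j) in masked else v for j, v in enumerate(row)]
             for i, row in enumerate(rows)]
    return holed, sorted(masked)


def mean_impute(holed):
    """Stand-in single imputation: fill each gap with the mean of its column's observed values."""
    means = []
    for j in range(len(holed[0])):
        observed = [row[j] for row in holed if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in holed]


def rmse_and_mae(truth, imputed, masked):
    """Score only the cells that were hidden, since observed cells are untouched."""
    errors = [imputed[i][j] - truth[i][j] for i, j in masked]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


# Synthetic complete data (assumption: three numeric columns, 200 rows).
data_rng = random.Random(42)
data = [[data_rng.gauss(0, 1), data_rng.gauss(5, 2), data_rng.gauss(-3, 0.5)]
        for _ in range(200)]

results = {}
for rate in (0.1, 0.2, 0.3, 0.4, 0.5):
    holed, masked = mask_completely_at_random(data, rate, seed=1)
    filled = mean_impute(holed)
    results[rate] = rmse_and_mae(data, filled, masked)
    print(f"missing rate {rate:.0%}: RMSE={results[rate][0]:.3f} MAE={results[rate][1]:.3f}")
```

Any other imputer can be dropped in place of `mean_impute` as long as it returns a fully filled table of the same shape, which keeps the comparison across methods and missing rates on equal footing.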
Abstract (English) With the progress of information technology, missing values have become common in various datasets. This affects data completeness and hampers data analysis and decision-making. Therefore, handling missing values is a crucial and challenging task in data preprocessing.
The main approach to handling missing values is imputation. Statistical and machine learning techniques are both single imputation methods, and scholars have more recently adopted multiple imputation methods. However, limited research compares multiple imputation and machine learning imputation across different datasets and missing rates. Additionally, while ensemble learning has improved model prediction accuracy, its use in missing value imputation is under-researched. Therefore, we aim to analyze the performance of single and multiple imputation methods and explore ensemble learning in missing value imputation.
This study used 25 UCI datasets, including numerical, categorical, and mixed types, simulating missing rates from 10% to 50%. Five machine learning algorithms were evaluated for single and multiple (MICE) imputation, and two ensemble imputation methods based on MICE, hybrid and parallel strategies, were proposed. Imputation effectiveness was assessed using SVM classification accuracy, RMSE, MAPE, and Hit Ratio.
Results showed that multiple imputation clearly outperformed single imputation on numerical and categorical datasets. On mixed datasets, random forest was the only multiple imputation method that did not slightly underperform single imputation, and random forest was the best method overall. Ensemble imputation experiments indicated that both hybrid and parallel strategies improved all metrics, though the order in which models are applied in hybrid imputation noticeably affected the results. Finally, we provide recommendations for choosing multiple and ensemble imputation strategies, as a reference for future researchers.
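To make the single versus multiple imputation distinction concrete, here is a rough NumPy sketch of chained-equations imputation in the spirit of MICE. It is an illustration under stated assumptions, not the thesis's implementation: linear regression is the only per-column model (the record does not name the five algorithms evaluated), the data are synthetic, averaging the completed datasets is a simplification used only to score cell-level error, and `chained_regression_impute` is a hypothetical helper.

```python
import numpy as np


def chained_regression_impute(X, n_iter=10, noise=True, seed=0):
    """Chained-equations imputation with linear regressions.

    X is a 2-D float array with np.nan marking missing cells.
    With noise=True, residual-scaled Gaussian noise is added to each
    prediction so that different seeds give different completed datasets.
    """
    rng = np.random.default_rng(seed)
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            # Regress column j on the other columns, fitting on rows where j is observed.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            coef, *_ = np.linalg.lstsq(design[~miss_j], filled[~miss_j, j], rcond=None)
            pred = design[miss_j] @ coef
            if noise:
                resid = filled[~miss_j, j] - design[~miss_j] @ coef
                pred = pred + rng.normal(0.0, resid.std(), size=pred.shape)
            filled[miss_j, j] = pred
    return filled


# Synthetic correlated data (assumption) with 30% of cells hidden at random.
rng = np.random.default_rng(42)
n = 300
a = rng.normal(size=n)
b = 2.0 * a + rng.normal(scale=0.3, size=n)
c = a - b + rng.normal(scale=0.3, size=n)
truth = np.column_stack([a, b, c])
mask = rng.random(truth.shape) < 0.3
X = truth.copy()
X[mask] = np.nan

# Single imputation: one deterministic completed dataset.
single = chained_regression_impute(X, noise=False)
# Multiple imputation: several stochastic completed datasets.
draws = [chained_regression_impute(X, noise=True, seed=s) for s in range(5)]
pooled = np.mean(draws, axis=0)  # simplification: average cells across draws


def rmse(estimate):
    """Root mean squared error over the hidden cells only."""
    return float(np.sqrt(np.mean((estimate[mask] - truth[mask]) ** 2)))


print("single imputation RMSE:", round(rmse(single), 3))
print("pooled multiple imputation RMSE:", round(rmse(pooled), 3))
```

In a full multiple imputation analysis, each completed dataset would normally be analyzed separately and the resulting estimates pooled, rather than averaging the imputed cells as done here.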
Keywords ★ Data Mining
★ Missing Values
★ Multiple Imputation
★ Machine Learning
★ Ensemble Learning
Table of Contents Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
1. Introduction
1-1. Research Background
1-2. Research Motivation
1-3. Research Objectives
1-4. Thesis Structure
2. Literature Review
2-1. Missing Data
2-1-1. Missing Completely at Random (MCAR)
2-1-2. Missing at Random (MAR)
2-1-3. Missing Not at Random (MNAR)
2-2. Handling Missing Values
2-2-1. Single Imputation
2-2-2. Multiple Imputation
2-3. Ensemble Learning
2-3-1. Sequential Ensemble
2-3-2. Parallel Ensemble
2-3-3. Ensemble Imputation
3. Research Methodology
3-1. Experimental Framework
3-2. Experimental Preparation
3-2-1. Hardware and Software
3-2-2. Datasets
3-3. Experimental Environment and Parameter Settings
3-3-1. Regression Imputation
3-3-2. Machine Learning Imputation
3-3-3. MICE (Multivariate Imputation by Chained Equations)
3-3-4. Classifier
3-4. Experiment 1
3-4-1. Experiment 1(A)
3-4-2. Experiment 1(B)
3-5. Experiment 2
3-5-1. Experiment 2(A)
3-5-2. Experiment 2(B)
3-6. Evaluation Metrics
4. Experimental Results
4-1. Experiment 1 Results
4-1-1. Numerical Data
4-1-2. Categorical Data
4-1-3. Mixed Data
4-1-4. Summary
4-2. Experiment 2 Results
4-2-1. Experiment 2(A)
4-2-2. Experiment 2(B)
4-2-3. Summary
5. Conclusion
5-1. Summary and Discussion
5-2. Future Work
6. References
Appendix 1. Detailed Data for Experiment 1
1-1. Numerical Data
1-2. Categorical Data
1-3. Mixed Data
Appendix 2. Detailed Data for Experiment 2
2-1. Numerical Data
2-2. Categorical Data
2-3. Mixed Data
Advisor 蔡志豐 (Chih-Fong Tsai)   Approval Date 2024-7-22

For questions about this thesis, please contact the Extension Services Section of the National Central University Library, TEL: (03)422-7151 ext. 57407.