Missing Value Imputation Techniques Considering Feature Selection and Random Forest


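Name

林彥呈(Yen-Cheng Lin)

Department

Department of Information Management

Thesis Title

Missing Value Imputation Techniques Considering Feature Selection and Random Forest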

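Related Theses

★ A transaction prediction model for residential sales in the real estate brokerage industry: the case of real estate broker Company S
★ Building a machine learning model to predict customer complaint categories using text mining
★ Building a customer repurchase rate prediction model with machine learning: the case of a handmade soap ingredients e-commerce website
★ Building a stock price prediction model with machine learning: the case of the Taiwan stock market
★ Building a financial distress prediction model with machine learning methods: the case of Taiwanese listed and OTC companies
★ A data mining prediction model for ex-dividend price recovery: the case of listed companies in the Taiwan stock market
★ A study on optimizing next-generation firewall rules using data mining
★ Applying data mining to identify relevant new information in electronic medical record text
★ Applying deep learning to post-market drug surveillance: a Twitter text classification task
★ Building an atrial fibrillation prediction model for stroke patients using electronic medical records and data mining
★ Abbreviation disambiguation in electronic medical records and a one-to-many classification task
★ Improving personalized outfit recommendation with Meta-path and attention mechanisms
★ Building an underwriting risk prediction model with machine learning: the case of Company A
★ Fan lifetime prediction using big data analytics: the case of Company X
★ Building a prediction model for post-stroke pneumonia using text mining and deep learning
★ Using text mining to analyze how review features affect the helpfulness of reviews of experience goods: the case of IMDb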

Files

Full text viewable in the system (open after 2026-9-1)

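Abstract (Chinese)

Missing value imputation (MVI) is an important step in data analysis, because most machine learning methods, such as neural networks and support vector machines, cannot be applied to incomplete datasets, and skipping this step can lead to serious classification errors. In the medical field, not every possible test can be performed on every patient, and accidental factors such as human error and equipment failure add further gaps, so missing values are a common problem. They make analysis and prediction harder for practitioners and also affect the timely diagnosis and treatment that patients should receive.

In imputation research, missForest is a popular method. Although its performance has been shown to exceed that of other known imputation methods, few studies have tried to optimize it or examine it further. This study therefore takes recursive feature elimination (RFE), a feature selection method currently popular in imputation research, and combines it with missForest to propose a new imputation method, RFE_missForest. Using 10 medical datasets obtained from Kaggle and UCI, with simulated missing rates of 10% to 50%, the method is compared with missForest and three other traditional imputation methods on imputation quality for continuous and categorical variables.

The results show that the proposed RFE_missForest achieves the best performance, in terms of both NRMSE and PFC, on 3 continuous datasets and 3 mixed datasets, outperforming the other 4 existing imputation methods with statistically significant differences.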
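The evaluation outlined in this abstract, removing a fraction of known values and then scoring the imputed values with NRMSE and PFC, can be sketched roughly as follows. This is a minimal illustration and not the thesis code. It assumes that values are removed completely at random and that NRMSE and PFC follow the definitions commonly used with missForest (squared error normalized by the variance of the true values, and the share of wrongly imputed categories, both computed over the removed entries only). The function names `mask_mcar`, `nrmse`, and `pfc` are hypothetical.

```python
import numpy as np

def mask_mcar(shape, rate, rng):
    """Boolean mask marking each entry as missing with probability `rate`.
    Assumption: values go missing completely at random (MCAR)."""
    return rng.random(shape) < rate

def nrmse(x_true, x_imputed, mask):
    """Normalized RMSE over the masked (artificially removed) continuous entries."""
    diff = x_true[mask] - x_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2) / np.var(x_true[mask])))

def pfc(c_true, c_imputed, mask):
    """Proportion of falsely classified entries among the masked categorical entries."""
    return float(np.mean(c_true[mask] != c_imputed[mask]))

# Usage: remove about 30% of a synthetic continuous table,
# fill the gaps with column means, and score the result.
rng = np.random.default_rng(0)
x_true = rng.normal(size=(2000, 5))
mask = mask_mcar(x_true.shape, 0.30, rng)
x_missing = np.where(mask, np.nan, x_true)
x_imputed = np.where(mask, np.nanmean(x_missing, axis=0), x_true)
print("NRMSE of mean imputation:", nrmse(x_true, x_imputed, mask))
```

Lower is better for both metrics. Scoring only the removed entries keeps the observed values, which every method reproduces exactly, from diluting the comparison.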

Abstract (English)

Missing value imputation (MVI) is an important step in data mining, because most classification algorithms, such as neural networks and support vector machines, do not work on incomplete datasets, and ignoring missing values can cause serious classification errors. In the medical field, not all possible tests can be performed on every patient, and accidental factors such as human negligence and equipment failure also interfere, so missing values are a common problem. They not only make tasks such as analysis and prediction more difficult, but also affect the timely diagnosis and treatment that patients should receive.
In the field of missing value imputation, missForest is a very popular method. Although its performance has been shown to be better than that of other known imputation methods, few studies have considered optimizing it or examining it further. This study therefore combined recursive feature elimination (RFE), a feature selection method currently popular in missing value imputation research, with missForest to propose a new imputation method, RFE_missForest. We used a total of 10 medical datasets obtained from Kaggle and UCI, simulated missing rates of 10% to 50%, and compared the imputation quality for continuous and categorical variables against missForest and three other traditional imputation methods.
Experimental results show that RFE_missForest achieves the best performance, in both NRMSE and PFC, on 3 continuous datasets and 3 mixed datasets. A paired t-test confirmed that the differences are statistically significant.
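The abstract does not say how RFE is wired into missForest, so the sketch below shows only one plausible arrangement, for continuous data, and should not be read as the RFE_missForest algorithm itself. It assumes scikit-learn is available and wraps each per-column random forest in `sklearn.feature_selection.RFE`, so that every column is predicted from a reduced set of the other columns. The function name `rfe_rf_impute` and the parameters `n_keep`, `max_iter`, and `n_trees` are hypothetical. The column order and stopping rule follow the usual missForest description: visit columns from fewest to most missing values, and stop when the change between successive imputations first increases.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

def rfe_rf_impute(X, n_keep=2, max_iter=5, n_trees=30, seed=0):
    """Iterative random-forest imputation with RFE-selected predictors.
    Assumes every column has at least one observed value and X has >= 2 columns."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    # Initial guess: fill each gap with the mean of the observed values in its column.
    X_imp = np.where(miss, np.nanmean(X, axis=0), X)
    # Visit only columns that have gaps, from fewest to most missing values.
    order = [j for j in np.argsort(miss.sum(axis=0)) if miss[:, j].any()]
    prev_diff = np.inf
    for _ in range(max_iter):
        X_old = X_imp.copy()
        for j in order:
            obs = ~miss[:, j]
            others = np.delete(np.arange(X.shape[1]), j)
            forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
            # RFE repeatedly drops the least important predictor until n_keep remain.
            selector = RFE(forest, n_features_to_select=min(n_keep, len(others)))
            selector.fit(X_imp[obs][:, others], X_imp[obs, j])
            X_imp[miss[:, j], j] = selector.predict(X_imp[miss[:, j]][:, others])
        # Relative change between successive imputations (continuous variables).
        diff = np.sum((X_imp - X_old) ** 2) / np.sum(X_imp ** 2)
        if diff > prev_diff:
            X_imp = X_old  # keep the imputation from before the increase
            break
        prev_diff = diff
    return X_imp

# Usage on synthetic, strongly correlated columns with about 20% of values removed.
rng = np.random.default_rng(0)
base = rng.normal(size=(60, 1))
X_full = np.hstack([base + 0.1 * rng.normal(size=(60, 1)) for _ in range(4)])
X_missing = X_full.copy()
X_missing[rng.random(X_full.shape) < 0.2] = np.nan
X_filled = rfe_rf_impute(X_missing, n_keep=2, max_iter=3)
```

Running RFE inside each column model is a design choice made here for simplicity. Selecting features once up front, or handling categorical columns with a classifier, would be equally reasonable variants that the abstract does not rule out.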

Keywords

★ missing value imputation
★ random forest
★ feature selection

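Table of Contents

Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
Chapter 2 Literature Review
2.1 Missing Data Mechanisms
2.2 Handling Missing Values
Chapter 3 Methodology
3.1 Recursive Feature Elimination
3.2 RFE_missForest
Chapter 4 Experimental Design and Evaluation
4.1 Datasets
4.2 Experimental Procedure
4.3 Algorithm Parameter Settings
4.4 Evaluation Metrics
Chapter 5 Experimental Results
5.1 Numerical Data
5.2 Mixed Data
5.3 Paired t-Test for the Difference in Means
5.4 Summary
Chapter 6 Conclusion
6.1 Contributions
6.2 Limitations
6.3 Future Work
References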

Advisor

胡雅涵(Ya-Han Hu)

Approval Date

2021-8-17
