Thesis Record 107423018 — Detailed Information




Author: Shu-Ching Tsui (崔書晴)    Department: Information Management
Thesis Title: A Study on the Order of Data Discretization and Missing Value Imputation
(English title: Effects of Combining Data Discretization and Missing Value Imputation on Classification Problems)
Related theses
★ Building a sales forecasting model for commercial multifunction printers using data mining techniques
★ Applying data mining techniques to resource allocation prediction: the case of a computer OEM support unit
★ Applying data mining to flight delay analysis in the airline industry: the case of Company C
★ Safety control of new products in the global supply chain: the case of Company C
★ Data mining in the semiconductor laser industry: the case of Company A
★ Applying data mining to predicting storage time of air export cargo: the case of Company A
★ Optimizing YouBike bike-redistribution operations with data mining classification techniques
★ The impact of feature selection on different data types
★ A study of data mining applied to B2B corporate websites: the case of Company T
★ Customer investment analysis and recommendations for financial derivatives: integrating clustering and association rule techniques
★ Building a computer-aided liver ultrasound image classification model with convolutional neural networks
★ An identity recognition system based on convolutional neural networks
★ Comparative error analysis of power imputation methods for energy management systems
★ Development of an employee sentiment analysis and management system
★ Data cleaning for class imbalance problems: a machine learning perspective
★ Data mining analysis of passenger self-service check-in: the case of Airline C
Files: full text not available through the system (permanently restricted access)
Abstract (Chinese): As technology advances, human activity generates vast amounts of data. Data that used to be overlooked, or was too difficult to collect, now carries new meaning for both enterprises and individuals: it can serve as a tool for market analysis or form part of personal privacy, and its value can even exceed that of the product itself. Data mining — analyzing data with different methods to uncover hidden correlations and extractable features, and then interpreting and applying them — has therefore become a very popular technique. It sounds easy, but in practice many problems arise; one of them is incomplete raw data, that is, missing values.
Missing values directly introduce error into data mining and analysis results. They may stem from human mistakes when entering data, from deliberate concealment for various reasons, or from machine-related causes such as failures during data storage or damaged hardware. As a result, data mining and analysis are often disturbed by missing values, and accuracy drops accordingly.
In addition, the data preprocessing stage often encounters continuous attributes such as age. Continuous attributes can make feature extraction conditions overly narrow, so discretization is an important step: it assigns continuous values to categories according to different cut points, smoothing the data and reducing the influence of abnormal values on the model. Only high-quality data can yield high-quality results.
Many methods exist for handling missing values and for discretization. This study compares discretizing the data before imputing missing values against imputing first and then discretizing, evaluates the results by classification accuracy, and identifies the more effective ordering.
Abstract (English): As technology improves day by day, data that used to be ignored or was difficult to gather has taken on a brand-new meaning, for individuals and enterprises alike. Data can be a tool for analyzing a market or a part of personal privacy; moreover, it has become more valuable than the product itself. Thus data mining — analyzing data in many different ways, trying to find correlations among records and put them to use — has become very popular. It sounds easy, but in practice it faces many difficulties. One of them is the incompleteness of data, that is, data containing missing values.
Missing values directly introduce error into analysis outcomes. They may be caused by human error or by malfunctioning machines, for example a failure while saving data or broken hardware. Consequently, the outcomes of data mining and analysis are often interfered with by missing values.
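Traditional statistical imputation typically replaces numeric gaps with the column mean and categorical gaps with the mode. A minimal pandas sketch of this idea (the column names and values are made up for illustration, not taken from the thesis):

```python
import pandas as pd

# Toy dataset with missing entries (NaN/None); columns are illustrative.
df = pd.DataFrame({
    "age": [25.0, None, 47.0, 33.0],   # numeric attribute
    "gender": ["F", "M", None, "M"],   # categorical attribute
})

# Mean imputation for the numeric column, mode imputation for the categorical one.
df["age"] = df["age"].fillna(df["age"].mean())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
print(df)
```

After this step the table contains no missing cells, at the cost of pulling every filled value toward the most typical one — which is exactly why the thesis also examines model-based imputers such as kNN and CART.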
Furthermore, data may contain continuous variables, such as age. Continuous variables can lead to overly narrow conditions during analysis, so discretization is an important preprocessing stage: it divides a continuous variable into categories at different cut points and, depending on the method, reduces the influence of abnormal data or outliers. Only high-quality data can produce high-quality outcomes.
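As a hedged illustration of discretization by cut points, here is minimal unsupervised equal-width binning with pandas. The thesis studies supervised methods (MDLP, ChiMerge), for which `pd.cut` is only a simple stand-in; the ages and labels below are made up:

```python
import pandas as pd

# Toy continuous "age" attribute (illustrative values).
ages = pd.Series([18, 22, 25, 31, 38, 45, 52, 60, 67, 73])

# Equal-width discretization: 4 bins => 3 interior cut points.
binned = pd.cut(ages, bins=4, labels=["young", "adult", "middle", "senior"])
print(binned.value_counts().sort_index())
```

Supervised discretizers differ from this sketch only in how the cut points are chosen — for example, MDLP places them to minimize class entropy rather than at equal-width positions.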
There are many methods for handling missing values and for implementing discretization. This study tries discretizing first versus imputing missing values first, and evaluates both orderings by classification accuracy to determine which is better.
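One of the two orderings (impute, then discretize) can be sketched as a scikit-learn pipeline. `SimpleImputer` and `KBinsDiscretizer` are stand-ins for the imputers and discretizers actually studied (MDLP and ChiMerge are not in scikit-learn), and the dataset, missing rate, and parameters are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.svm import SVC

# Load a complete dataset and inject missing values at random (~10% rate).
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

# Order A: impute first, then discretize, then classify with an SVM.
pipe_a = make_pipeline(
    SimpleImputer(strategy="mean"),
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform"),
    SVC(kernel="rbf"),
)
acc_a = cross_val_score(pipe_a, X_missing, y, cv=5).mean()
print(f"impute -> discretize accuracy: {acc_a:.3f}")
```

The reverse ordering (discretize, then impute) cannot be expressed with `KBinsDiscretizer`, which rejects NaN inputs; a discretizer that tolerates missing values would be needed, which is part of why the ordering question merits the empirical comparison carried out in the thesis.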
Keywords (Chinese)
★ Data preprocessing (資料前處理)
★ Data discretization (資料離散化)
★ Missing value imputation (遺漏值填補)
★ Data mining (資料探勘)
Keywords (English)
Table of Contents
Abstract (Chinese)
Abstract (English)
Contents
List of Figures
List of Tables
Chapter 1  Introduction
1.1  Research Background
1.2  Research Motivation
1.3  Research Objectives
1.4  Thesis Organization
Chapter 2  Literature Review
2.1  Data Discretization
2.1.1  Minimum Description Length Principle (MDLP)
2.1.2  ChiMerge
2.2  Missing Data
2.2.1  Missingness Mechanisms
2.2.2  Imputation Methods
2.3  Classifiers
2.3.1  Support Vector Machines (SVM)
2.3.2  Decision Tree C4.5
2.4  Related Work
Chapter 3  Research Method
3.1  Experimental Framework
3.1.1  Experimental Environment
3.2  Experimental Procedure
3.2.1  Dataset Preprocessing
3.2.2  Discretization
3.2.3  Traditional Statistical Methods for Missing Value Imputation
3.2.4  k-Nearest Neighbors (kNN) for Missing Value Imputation
3.2.5  Classification and Regression Trees (CART) for Missing Value Imputation
Chapter 4  Experimental Results
4.1  SVM Accuracy Analysis
4.1.1  Baseline Results
4.1.2  Results of Single Preprocessing Methods
4.1.3  Results of Combined Preprocessing Methods
4.1.4  Discussion of the Methods
4.2  C4.5 Accuracy Analysis
4.2.1  Baseline Results
4.2.2  Results of Single Preprocessing Methods
4.2.3  Results of Combined Preprocessing Methods
4.2.4  Discussion of the Methods
4.3  Significance Tests of the Methods
Chapter 5  Conclusion
5.1  Conclusions and Contributions
5.2  Future Research Directions and Suggestions
References
Appendix
Advisor: 蔡志豐 (Chih-Fong Tsai)    Date of approval: 2021-01-27

For questions about this thesis, please contact the Promotion Services Division of the National Central University Library, TEL: (03)422-7151 ext. 57407, or by e-mail. — Privacy Policy Statement