The Relationship Between Missing Rate, Imputation Methods, and Data Pre-processing

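Name

林盈秀 (Ying-Siou Lin)

Department

Department of Information Management

Thesis title

The Relationship Between Missing Rate, Imputation Methods, and Data Pre-processing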
(The relationship between missing value, imputation and data pre-processing)

Files

Full text: never open to the public

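Abstract (Chinese)

With the rapid development of information technology, the volume of data that computers can process and store keeps growing. Finding meaningful content in large amounts of data is an important topic in data mining. During mining, however, the required data are often missing or incomplete, and these problems degrade mining performance.
For pre-processing incomplete data, case deletion is the simplest and most direct approach, but it is suitable only when the dataset contains a relatively small number of missing values. When the number of missing values is large, case deletion discards a large amount of data and affects the mining results. The alternative is imputation. Recent studies have concentrated on proposing new imputation methods and on comparing imputation methods across datasets, but few address the question "During data pre-processing, when can samples with missing values simply be ignored or deleted?", and none examine whether performing data pre-processing (feature selection or instance selection) before imputation yields better results than imputing directly without dimensionality reduction or instance selection.
This study uses 37 datasets of three main types, numerical, categorical, and mixed, with missing rates from 5% to 50% at 5% intervals. The research is divided into two parts. The experimental results of Study 1 show that different types of datasets can tolerate different missing rates. In particular, we build a decision tree model to extract decision rules relating dataset characteristics (such as number of samples, dimensionality, and data type) to the tolerable missing rate, which helps analysts determine when case deletion can be used directly at a given missing rate.
In Study 2, the three types of datasets (numerical, mixed, and categorical) are used to assess the effect of feature selection and instance selection on missing value imputation, and to determine whether either is suitable as a step before imputation. The results show that instance selection followed by imputation yields better classification performance than feature selection followed by imputation. In other words, performing feature selection before imputation has no positive effect on imputation.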
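
The missing-rate grid and case deletion described above can be illustrated with a minimal sketch. The helper names (`missing_rates`, `inject_missing`, `case_deletion`) are hypothetical, and the code assumes that values are removed completely at random at the cell level; this record does not state how the thesis actually injected missing values.

```python
import random


def missing_rates(start=5, stop=50, step=5):
    """Missing rates as fractions: 0.05, 0.10, ..., 0.50."""
    return [p / 100 for p in range(start, stop + 1, step)]


def inject_missing(rows, rate, seed=0):
    """Return a copy of `rows` with a fraction `rate` of all cells set to None.

    Assumption: cells are chosen uniformly at random (MCAR, cell level).
    """
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    n_missing = round(rate * len(cells))
    damaged = [list(r) for r in rows]  # copy so the input stays intact
    for i, j in rng.sample(cells, n_missing):
        damaged[i][j] = None
    return damaged


def case_deletion(rows):
    """Keep only the rows that contain no missing value."""
    return [r for r in rows if all(v is not None for v in r)]


if __name__ == "__main__":
    # Toy table: 100 rows, 3 numerical features (illustrative data only).
    table = [[float(i), float(i * 2), float(i % 3)] for i in range(100)]
    for rate in missing_rates():
        kept = case_deletion(inject_missing(table, rate))
        print(f"missing rate {rate:.2f}: {len(kept)} of {len(table)} rows kept")
```

With a 5% cell-level missing rate on a three-feature table, case deletion can already discard up to 15% of the rows, which illustrates why deletion becomes costly as the missing rate grows.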

Abstract (English)

With the rapid development of information technology, computers can process and store huge amounts of data, which makes finding useful content in large datasets an important data mining task. However, many datasets collected for data mining contain missing values, which are likely to degrade data mining performance.
For incomplete data, a common and simple approach is case deletion, which ignores the data samples with missing values; it is appropriate only when the missing rate is small. Another approach is imputation, for which various methods have been proposed. Generally speaking, imputation algorithms estimate the missing values by reasoning from the observed data. However, it remains unanswered when case deletion or imputation should be used for different kinds of datasets. A second open question is whether performing data pre-processing, i.e. feature selection and instance selection, affects the final imputation result.
This thesis uses 37 datasets containing categorical, numerical, and mixed data, with missing rates from 5% to 50% at 5% intervals for each dataset. The research is divided into two parts. In the first, the experimental results indicate that there are specific patterns under which case deletion can be applied to different datasets without significant performance degradation. A decision tree model is then constructed to extract rules that recommend when to use case deletion. In the second, we find that imputation after instance selection produces better classification performance than imputation alone. However, imputation after feature selection does not have a positive impact on the imputation result.
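
The ordering "instance selection, then imputation" reported above might look like the following sketch. It is not the thesis's implementation: the table of contents names DROP3 as the instance selection method, whereas this example swaps in the simpler Wilson edited nearest neighbor (ENN) rule, applies it to complete rows only, and fills missing cells with column means taken from the retained rows. All function names are hypothetical.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two complete numerical rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def enn_select(X, y, k=3):
    """Wilson's ENN: keep a sample only if the majority label of its
    k nearest neighbors (itself excluded) matches its own label."""
    keep = []
    for i, xi in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        neighbors = sorted(others, key=lambda j: sq_dist(xi, X[j]))[:k]
        majority = Counter(y[j] for j in neighbors).most_common(1)[0][0]
        if majority == y[i]:
            keep.append(i)
    return keep


def mean_impute(rows, reference):
    """Fill None cells with column means computed from `reference` rows."""
    n_cols = len(reference[0])
    means = [sum(r[j] for r in reference) / len(reference) for j in range(n_cols)]
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def select_then_impute(X, y, k=3):
    """Assumed pipeline: ENN on complete rows, then mean imputation of the
    incomplete rows using only the retained complete rows."""
    complete = [i for i, r in enumerate(X) if all(v is not None for v in r)]
    incomplete = [i for i, r in enumerate(X) if any(v is None for v in r)]
    Xc = [X[i] for i in complete]
    yc = [y[i] for i in complete]
    kept = enn_select(Xc, yc, k)
    reference = [Xc[i] for i in kept]
    filled = mean_impute([X[i] for i in incomplete], reference)
    X_out = reference + filled
    y_out = [yc[i] for i in kept] + [y[i] for i in incomplete]
    return X_out, y_out
```

Running instance selection first means that noisy or mislabeled complete rows no longer contribute to the values used for filling, which is one plausible reason the ordering could help; the record itself does not explain the mechanism.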

Keywords (Chinese)

★ data mining
★ missing data
★ case deletion
★ data imputation
★ instance selection
★ feature selection

Keywords (English)

★ data mining
★ missing values
★ case deletion
★ imputation
★ feature selection
★ instance selection

Table of Contents

Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Thesis Organization
Chapter 2 Literature Review
2-1 Missing Data
2-1-1 Missing completely at random (MCAR)
2-1-2 Missing at random (MAR)
2-1-3 Missing not at random (MNAR)
2-2 Handling Missing Values
2-2-1 Prevention
2-2-2 Listwise deletion
2-2-3 Dummy variable method
2-2-4 Imputation
2-3 Feature selection
2-3-1 F-score
2-4 Instance selection
2-4-1 DROP3
Chapter 3 Research Method
3-1 Experimental Framework
3-2 Datasets
3-3 Study 1
3-4 Study 2
3-4-1 Single imputation
3-4-2 Multiple imputation
3-4-3 Feature selection
3-4-4 Instance selection
Chapter 4 Experimental Results
4-1 Study 1
4-1-1 Results for categorical datasets
4-1-2 Results for numerical datasets
4-1-3 Results for mixed datasets
4-1-4 Extracting decision rules
4-2 Study 2
4-2-1 Results across datasets at specific missing rates
4-2-2 Results for specific datasets at different missing rates
4-2-3 Extracting decision rules
Chapter 5 Conclusions and Future Research Directions
5-1 Conclusions and Contributions
5-2 Future Research Directions and Suggestions
References
Appendix 1
Appendix 2
Appendix 3

Advisor

蔡志豐 (Chih-Fong Tsai)

Approval date

2013-7-1
