Abstract (English)
With the progress of information technology, people benefit from efficient data collection and its related applications. In addition, as the number and size of online databases grow rapidly, the ability to retrieve useful information from these large databases effectively and efficiently is becoming more important. This has become the central research issue of data mining.
Data mining is the process of applying a variety of statistical analysis or machine learning techniques to large amounts of data in order to extract the hidden values of features and their relevance to various applications. It helps people derive novel knowledge from past experience, so that they can make decisions or forecast trends. However, the retrieval process involves some problems that should be considered, such as missing values.
Missing values can be briefly defined as attribute values that are absent from a chosen dataset. For example, when registering on a website, users have to fill in several fields sequentially, such as "Name" and "Birthday". However, for various reasons, such as data entry errors or deliberate concealment of information, some values may be lost during this process, leaving the data incomplete or erroneous. Moreover, missing values can reduce the efficiency and accuracy of data mining results. Therefore, people apply various methods to impute missing values, and supervised learning algorithms are one of the common approaches to the missing value imputation problem.
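To make the idea concrete, the following is a minimal sketch of supervised imputation: the incomplete attribute is treated as the prediction target, a model is trained on the complete records, and its predictions fill in the missing entries. The use of scikit-learn and a decision tree here is an illustrative assumption, not the thesis's actual implementation.

```python
# Minimal sketch (not the thesis code): impute a missing attribute by
# treating it as the target of a supervised learner trained on the
# complete rows. scikit-learn is an assumed library choice.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy dataset: each row is a record; the last column has a missing entry.
X = np.array([
    [25, 50000, 0],       # complete row
    [32, 64000, 1],       # complete row
    [41, 58000, 0],       # complete row
    [29, 61000, np.nan],  # missing value to impute
])

features, target = X[:, :2], X[:, 2]   # views into X
complete = ~np.isnan(target)

# Train on complete rows, then predict the missing entries in place.
model = DecisionTreeClassifier().fit(features[complete], target[complete])
target[~complete] = model.predict(features[~complete])
print(X)  # the np.nan entry is now a predicted class label
```

The same pattern generalizes to any of the learners compared below: only the model class changes, while the "incomplete attribute as target" framing stays the same.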
In this thesis, I conduct experiments to compare the efficiency and accuracy of five well-known supervised learning algorithms, namely Bayes, SVM, MLP, CART, and k-NN, over categorical, numerical, and mixed types of datasets. This allows us to identify which imputation method performs better for which data type and at which missing rate. The experimental results show that the CART method is the best choice for missing value imputation: it not only requires relatively less imputation time, but also enables the classifier to provide higher classification accuracy.
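The comparison protocol described above can be sketched as follows: inject missing values at a chosen rate, impute them with each candidate method, and record both the imputation time and the downstream classification accuracy. This sketch covers only two of the five methods and assumes scikit-learn and the Iris dataset purely for illustration; the thesis's actual datasets, tools, and parameters may differ.

```python
# Hedged sketch of the experimental comparison: missing-rate injection,
# per-method imputation timing, and downstream classification accuracy.
import time
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor      # CART-style imputer
from sklearn.neighbors import KNeighborsRegressor   # k-NN imputer
from sklearn.svm import SVC                         # downstream classifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

missing_rate = 0.2                      # e.g. corrupt 20% of one attribute
col = 0                                 # attribute to corrupt
mask = rng.random(len(X)) < missing_rate

for name, imputer in [("CART", DecisionTreeRegressor()),
                      ("k-NN", KNeighborsRegressor())]:
    Xi = X.copy()
    start = time.perf_counter()
    other = np.delete(Xi, col, axis=1)              # remaining attributes
    imputer.fit(other[~mask], Xi[~mask, col])       # learn from complete rows
    Xi[mask, col] = imputer.predict(other[mask])    # fill missing entries
    elapsed = time.perf_counter() - start

    # Downstream accuracy: train/test a classifier on the imputed data.
    Xtr, Xte, ytr, yte = train_test_split(Xi, y, random_state=0)
    acc = SVC().fit(Xtr, ytr).score(Xte, yte)
    print(f"{name}: imputation time {elapsed:.4f}s, accuracy {acc:.3f}")
```

Sweeping `missing_rate` over several values and repeating the loop for all five learners yields the kind of comparison summarized in the abstract's conclusion.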