A Study on Measuring Data Similarity for Missing Value Imputation

Name

李妙翎(Miao-Ling Li)

Department

Department of Information Management

Thesis Title

A Study on Measuring Data Similarity for Missing Value Imputation

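This electronic thesis is authorized for immediate open access.
Electronic full texts that have reached their open-access date are licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid breaking the law.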

Abstract (Chinese)

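Data mining techniques are increasingly applied across many fields, but missing values can make analysis impossible or bias the results, so that mining fails to extract useful information. In recent years, researchers have continued to propose new methods, adopt machine learning algorithms, or improve the workflow of existing imputation methods in order to fill in missing values. The goal is to find the imputation methods suited to different domains or data types, or to raise imputation accuracy and reduce the error between the imputed values and the original data.
This study proposes an imputation algorithm that measures the similarity between data using the data center as the reference, Class Center based Missing Value Imputation for Incomplete dataset (CCMVI). It is built on statistical methods, takes into account the class each record belongs to and the similarity between records, and adjusts the imputed value according to the dispersion of the data. In Experiment 1 and Experiment 2, datasets of different types and from different domains are selected, and their missing values are imputed with CCMVI, a statistical method, the K-nearest neighbors (KNN) algorithm, and the support vector machine (SVM) algorithm. Finally, classification accuracy, error, and execution time are used to measure the effectiveness of each imputation method.
Experiment 1 shows that CCMVI achieves higher classification accuracy than the machine learning algorithms, is slightly slower than the statistical method, and yields error values that differ little from those of SVM. Overall, CCMVI is suitable for numerical and mixed-type data. However, the numerical data used in Experiment 2, which come from the software engineering domain, are not suited to CCMVI. Further investigation of the cause found that the distribution of the data affects the choice of imputation method.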
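
The class-center idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's CCMVI algorithm: the abstract does not specify how the imputed value is adjusted for dispersion, so that step is omitted here. The function names, the use of `None` to mark a missing value, and the toy data are all assumptions made for illustration. The sketch fills each missing numerical value with the mean of that feature within the record's class, and measures similarity to the class center with Euclidean distance.

```python
import math


def class_centers(rows, labels):
    """Return {label: per-feature mean}, ignoring missing (None) entries."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        s = sums.setdefault(label, [0.0] * len(row))
        c = counts.setdefault(label, [0] * len(row))
        for j, value in enumerate(row):
            if value is not None:
                s[j] += value
                c[j] += 1
    # A feature with no observed value in a class has no center (None).
    return {
        label: [total / n if n else None
                for total, n in zip(sums[label], counts[label])]
        for label in sums
    }


def impute_with_centers(rows, labels):
    """Replace each missing value with its class center for that feature."""
    centers = class_centers(rows, labels)
    return [
        [centers[label][j] if value is None else value
         for j, value in enumerate(row)]
        for row, label in zip(rows, labels)
    ]


def distance_to_center(row, center):
    """Euclidean distance between a complete record and a class center."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row, center)))


# Hypothetical toy data: two classes, two numerical features.
rows = [[1.0, 2.0], [3.0, None], [10.0, 20.0], [None, 22.0]]
labels = ["a", "a", "b", "b"]
filled = impute_with_centers(rows, labels)
```

With this toy input, the missing second feature of the class "a" record is filled with 2.0 and the missing first feature of the class "b" record with 10.0. A dispersion-aware method would go further and use the distances returned by `distance_to_center` to decide whether the filled value should be moved.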

Abstract (English)

Data mining technology has been widely applied to problems in many domains. However, problems arise when the collected data contain missing values: using incomplete data is likely to produce biased results, and most data mining algorithms cannot handle such data directly. Recently, many researchers have proposed new imputation methods, applied machine learning techniques to imputation, or modified the imputation process. They aim to reduce error rates, achieve high classification accuracy, or identify which methods suit particular kinds of data.
In this thesis, I propose an imputation method that uses the class center of the data as the basis for measuring similarity. The method is called Class Center based Missing Value Imputation for Incomplete dataset (CCMVI). In Study 1 and Study 2, CCMVI, a statistical method (mean/mode imputation), KNN, and SVM are used to impute incomplete datasets of different data types and domains. To avoid results that depend on a single split into 90% training data and 10% testing data, 10-fold cross-validation is employed. Finally, this thesis uses classification accuracy, error rates, and time efficiency to evaluate the imputation methods.
The results of Study 1 show that CCMVI achieves higher classification accuracy than the machine learning methods, SVM and KNN, while its time efficiency is slightly lower than that of the statistical method. Overall, both numerical and mixed datasets are suitable for the proposed CCMVI method. However, the results of Study 2 show that the numerical data from the software engineering field are not suitable for CCMVI. An investigation into the cause found that the distribution of the data influences the results.
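
The two error measures listed in the table of contents, RMSE and MAPE, can be sketched as follows, using their standard textbook definitions. The function names and the sample values are assumptions for illustration and are not taken from the thesis. Here `actual` stands for the original values that were removed and `predicted` for the values an imputation method filled in.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between original and imputed values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    MAPE is undefined when an original value is zero, so this sketch
    raises an error in that case instead of guessing a convention.
    """
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(actual)


# Hypothetical values: two removed entries and their imputed estimates.
actual = [2.0, 4.0]
predicted = [1.0, 5.0]
error_rmse = rmse(actual, predicted)  # sqrt((1 + 1) / 2) = 1.0
error_mape = mape(actual, predicted)  # 100 * (0.5 + 0.25) / 2 = 37.5
```

RMSE is expressed in the units of the feature and penalizes large deviations more heavily, while MAPE is scale-free, which is presumably why the two are reported side by side.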

Keywords (Chinese)

★ Data Preprocessing
★ Missing Value
★ Imputation Method
★ Data Similarity

Keywords (English)

★ Data Preprocessing
★ Missing Value
★ Imputation Method
★ Data Similarity

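Table of Contents

Abstract (Chinese)
Abstract
Acknowledgments
Contents
List of Figures
List of Tables
List of Appendix Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Thesis Organization
2. Literature Review
2-1 Introduction to Missing Values
2-1-1 Missing Completely at Random (MCAR)
2-1-2 Missing at Random (MAR)
2-1-3 Not Missing at Random (NMAR)
2-2 Missing Value Imputation Methods
2-2-1 Single Imputation
2-2-2 Multiple Imputation
2-3 Measuring Data Similarity
2-3-1 Euclidean Distance
2-3-2 Manhattan Distance
2-3-3 Cosine Angle Distance
3. Research Method and Design
3-1 Experimental Framework
3-2 Experimental Datasets
3-2-1 Experiment 1: CCMVI and Other Imputation Methods Applied to Open UCI Datasets from Various Domains
3-2-2 Experiment 2: CCMVI and Other Imputation Methods Applied to Software Defect Prediction Datasets in Software Engineering
3-3 Experiment 1: CCMVI and Other Imputation Methods Applied to Open UCI Datasets from Various Domains
3-3-1 CCMVI Algorithm
3-3-2 Baseline
3-3-3 Support Vector Machine (SVM) Imputation
3-3-4 K-Nearest Neighbors (KNN) Imputation
3-4 Experiment 2: CCMVI and Other Imputation Methods Applied to Software Defect Prediction Datasets in Software Engineering
3-5 Experimental Validation
3-5-1 Classification Accuracy
3-5-2 Time Efficiency
3-5-3 Root Mean Square Error (RMSE)
3-5-4 Mean Absolute Percentage Error (MAPE)
3-5-5 t-Test: Paired Test of the Difference Between Population Means
4. Experimental Results
4-1 Experimental Setup
4-1-1 Hardware
4-1-2 Software
4-2 Experiment 1 Results
4-2-1 Classification Accuracy Analysis
4-2-2 Time Efficiency Analysis
4-2-3 Root Mean Square Error (RMSE) Analysis
4-2-4 Mean Absolute Percentage Error (MAPE) Analysis
4-2-5 t-Test: Paired Test of the Difference Between Population Means
4-2-6 Summary of Experiment 1
4-3 Experiment 2 Results
4-3-1 Classification Accuracy Analysis
4-3-2 Time Efficiency Analysis
4-3-3 Root Mean Square Error (RMSE) Analysis
4-3-4 Mean Absolute Percentage Error (MAPE) Analysis
4-3-5 t-Test: Paired Test of the Difference Between Population Means
4-3-6 Summary of Experiment 2
5. Conclusion
5-1 Summary and Discussion
5-2 Contributions and Future Research Directions
References
Appendix 1. Detailed Data for Experiment 1
1-1 Classification Accuracy
1-2 Root Mean Square Error (RMSE)
1-3 Mean Absolute Percentage Error (MAPE)
Appendix 2. Detailed Data for Experiment 2
2-1 Classification Accuracy
2-2 Root Mean Square Error (RMSE)
2-3 Mean Absolute Percentage Error (MAPE)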

Advisor

蔡志豐(Chih-Fong Tsai)

Approval Date

2017-7-6
