Master's/Doctoral Thesis 108423021 Detailed Record




Name: Jia-Rong Zhong (鍾家蓉)    Department: Information Management
Thesis Title: A Study of Deep Learning Algorithms for Missing Value Imputation
(Deep Learning in Missing Value Imputation)
Related Theses
★ Building a Sales Forecasting Model for Commercial Multifunction Printers Using Data Mining Techniques
★ Applying Data Mining Techniques to Resource Allocation Prediction: The Case of a Computer OEM Support Unit
★ Applying Data Mining Techniques to Flight Delay Analysis in the Airline Industry: The Case of Company C
★ Security Control of New Products in the Global Supply Chain: The Case of Company C
★ Data Mining in the Semiconductor Laser Industry: The Case of Company A
★ Applying Data Mining Techniques to Predicting Warehouse Dwell Time of Air Export Cargo: The Case of Company A
★ Optimizing YouBike Rebalancing Operations Using Data Mining Classification Techniques
★ The Impact of Feature Selection on Different Data Types
★ Data Mining for B2B Corporate Websites: The Case of Company T
★ Customer Investment Analysis and Recommendations for Financial Derivatives: Integrating Clustering and Association Rule Techniques
★ Building a Computer-Aided Liver Ultrasound Image Classification Model Using Convolutional Neural Networks
★ An Identity Recognition System Based on Convolutional Neural Networks
★ Comparative Error-Rate Analysis of Power Imputation Methods in Energy Management Systems
★ Development of an Employee Sentiment Analysis and Management System
★ Data Cleaning for Class Imbalance Problems: A Machine Learning Perspective
★ Applying Data Mining Techniques to Passenger Self-Service Check-In Analysis: The Case of Airline C
  1. This electronic thesis is approved for immediate open access.
  2. The open-access electronic full text is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research.
  3. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese): With the rapid development of information technology, people can quickly collect large amounts of diverse data, and advances in computing power have made data mining techniques increasingly mature. During data collection, however, missing data are almost unavoidable. Without appropriate pre-processing, such incomplete data often degrade data mining performance and reduce accuracy. Traditional statistical imputation methods and machine learning imputation methods exist, but the literature has not examined how effective deep learning is for missing value imputation. Moreover, data discretization can reduce the influence of outliers on prediction results and improve model stability, yet no existing study has investigated how the order in which data discretization and missing value imputation are performed affects prediction accuracy on incomplete data. This thesis therefore analyzes the performance of various imputation methods under different models and, in combination with discretization techniques, examines how the order of data discretization and missing value imputation affects model classification accuracy.
This study proposes using two deep learning algorithms, the Deep MultiLayer Perceptron (DMLP) and the Deep Belief Network (DBN), to build missing value imputation models, and compares them with existing statistical and machine learning imputation models. In addition, two discretization techniques, the Minimum Description Length Principle (MDLP) and ChiMerge (ChiM), are combined with the deep learning imputation models in the experiments. Finally, SVM classification accuracy is used to measure the effectiveness of each imputation method.
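As a rough illustration of the imputation idea, the following minimal sketch trains a neural network on the complete rows of a dataset to predict an incomplete feature, then fills the missing entries with its predictions. It assumes scikit-learn's MLPRegressor as a stand-in for the thesis's DMLP; the layer sizes and toy data are illustrative, not the settings used in the thesis.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # toy dataset with 4 numeric features
X[rng.random(200) < 0.2, 3] = np.nan     # inject ~20% missing values in feature 3

target = 3
observed = ~np.isnan(X[:, target])       # rows where the incomplete feature is present
others = [c for c in range(X.shape[1]) if c != target]

# Train on the complete rows: predict the incomplete feature from the others.
net = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
net.fit(X[observed][:, others], X[observed, target])

# Fill the missing entries with the network's predictions.
X[~observed, target] = net.predict(X[~observed][:, others])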
The experimental results show that the deep learning imputation methods perform well across different data types, especially on numeric and mixed datasets, where DMLP and DBN outperform the baseline by 14.70% and 15.88%, and by 8.71% and 7.96%, respectively; imputing the missing values of an incomplete dataset thus increases classification accuracy. When discretization is added for numeric data, the combinations using MDLP outperform all other combinations, regardless of whether discretization is performed before or after imputation. In particular, MDLP discretization followed by DMLP imputation, and MDLP discretization followed by DBN imputation, achieve classification accuracies slightly higher than DMLP imputation alone, by 0.74% and 0.52%, and higher than the ChiM baseline by 2.94% and 2.72%. This shows that the pairing of discretization technique and deep learning algorithm affects accuracy.
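The order comparison can be sketched compactly as below, again with hedged stand-ins: equal-width binning replaces MDLP/ChiMerge, mean imputation replaces the deep-learning imputers, and the Iris data with artificially injected missing values is a toy assumption; only the discretize-then-impute versus impute-then-discretize structure mirrors the experiments.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan    # inject ~10% missing values

def discretize(A, bins=5):
    # NaN-tolerant equal-width binning (illustrative stand-in for MDLP/ChiMerge).
    out = A.copy()
    for j in range(A.shape[1]):
        col, ok = A[:, j], ~np.isnan(A[:, j])
        edges = np.linspace(col[ok].min(), col[ok].max(), bins + 1)
        out[ok, j] = np.digitize(col[ok], edges[1:-1])
    return out

def impute(A):
    # Mean imputation (illustrative stand-in for the DMLP/DBN imputers).
    return SimpleImputer(strategy="mean").fit_transform(A)

# Compare both processing orders by SVM classification accuracy.
for name, Xp in [("impute then discretize", discretize(impute(X))),
                 ("discretize then impute", impute(discretize(X)))]:
    print(name, cross_val_score(SVC(), Xp, y, cv=5).mean())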
Abstract (English): With the evolution of information technology, people can easily collect large amounts of diverse data. Consequently, data mining has been widely applied in many industries. However, it is unavoidable that collected data usually contain missing values. If these missing data are not handled appropriately, the data mining results will be affected and the accuracy of the learning models may be degraded. In the related literature, missing value imputation based on statistical analysis and machine learning techniques has shown its applicability to incomplete data problems, but very few studies examine the imputation performance of deep learning techniques. In addition, data discretization may further reduce the influence of outliers and increase the stability of models. Therefore, this thesis compares the performance of various imputation models, including deep neural networks based on the Deep MultiLayer Perceptron (DMLP) and the Deep Belief Network (DBN). Moreover, it examines the performance of the two possible orders of combining data imputation and discretization. In particular, the Minimum Description Length Principle (MDLP) and ChiMerge (ChiM) are used as the discretizers.
The experimental results show that the deep neural networks outperform the other imputation methods, especially on numeric and mixed datasets. For numeric datasets, the accuracies of DMLP and DBN are higher than the baseline by 14.70% and 15.88%, respectively, and for mixed datasets by 8.71% and 7.96%. Furthermore, combinations of deep neural networks with MDLP discretization outperform the other combinations regardless of the order in which the two steps are performed. In particular, the classification accuracies of MDLP + DMLP and MDLP + DBN are slightly higher than imputation with DMLP alone, by 0.74% and 0.52% respectively, and higher than the ChiM baseline by 2.94% and 2.72%. The experiments therefore show that performance depends on the chosen discretizer and deep learning algorithm.
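Since ChiMerge is the one discretizer above with a compact classical definition (Kerber, 1992), a minimal sketch may help: adjacent intervals whose class distributions are most alike (lowest chi-square statistic) are merged repeatedly until the smallest statistic exceeds a threshold. The threshold value and the toy data below are illustrative assumptions.

import numpy as np

def chi2_stat(a, b):
    # Chi-square statistic for the 2 x k class-count table of two adjacent intervals.
    table = np.vstack([a, b]).astype(float)
    expected = table.sum(1, keepdims=True) * table.sum(0) / table.sum()
    ok = expected > 0
    return ((table - expected)[ok] ** 2 / expected[ok]).sum()

def chimerge(x, y, threshold=4.6):
    # threshold is an illustrative choice; in practice it is a chi-square
    # critical value for (number of classes - 1) degrees of freedom.
    classes, values = np.unique(y), np.unique(x)
    counts = [np.array([np.sum((x == v) & (y == c)) for c in classes])
              for v in values]                 # one initial interval per value
    cuts = list(values)                        # lower bound of each interval
    while len(counts) > 1:
        stats = [chi2_stat(counts[i], counts[i + 1])
                 for i in range(len(counts) - 1)]
        i = int(np.argmin(stats))
        if stats[i] > threshold:               # all neighbors differ enough; stop
            break
        counts[i] = counts[i] + counts[i + 1]  # merge the most similar pair
        del counts[i + 1]
        del cuts[i + 1]
    return cuts

# Toy usage: one numeric feature with binary labels correlated with it.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = (x + rng.normal(scale=0.5, size=300) > 0).astype(int)
print(chimerge(x, y))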
Keywords (Chinese)
★ Data Mining
★ Deep Learning
★ Data Discretization
★ Missing Value
★ Data Pre-processing
Keywords (English)
★ Data Mining
★ Deep Learning
★ Data Discretization
★ Missing Value
★ Data Pre-processing
Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Tables
List of Figures
List of Appendix Tables
1. Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Thesis Organization
2. Literature Review
2-1 Missing Data
2-1-1 Missing Completely at Random (MCAR)
2-1-2 Missing at Random (MAR)
2-1-3 Missing Not at Random (MNAR)
2-2 Missing Value Imputation
2-2-1 Traditional Statistical Imputation Methods
2-2-2 Machine Learning Methods for Missing Value Imputation
2-3 Deep Learning Algorithms
2-3-1 Deep MultiLayer Perceptron (DMLP)
2-3-2 Deep Belief Network (DBN)
2-4 Data Discretization
2-4-1 Minimum Description Length Principle (MDLP)
2-4-2 ChiMerge (ChiM)
3. Experimental Methods and Design
3-1 Experimental Framework
3-2 Experimental Environment
3-2-1 Hardware and Software
3-2-2 Datasets
3-3 Experimental Parameter Settings
3-3-1 Traditional Statistical Imputation Methods
3-3-2 Machine Learning Algorithms
3-3-3 Deep Learning Algorithms
3-3-4 Discretization Algorithms
3-3-5 Classifier and Evaluation Criteria
3-4 Experimental Procedure
3-4-1 Experiment 1
3-4-2 Experiment 2
4. Experimental Results
4-1 Results of Experiment 1
4-1-1 Categorical Data
4-1-2 Numeric Data
4-1-3 Mixed Data
4-1-4 Summary
4-2 Results of Experiment 2
4-2-1 Discretization before Imputation vs. Imputation before Discretization
4-2-2 Summary
5. Conclusion
5-1 Summary and Discussion
5-2 Contributions and Future Work
References
Appendix 1: Detailed Classification Accuracy Results
1-1 Categorical Datasets
1-2 Numeric Datasets
1-3 Mixed Datasets
Appendix 2: Detailed Results of Combining Deep Learning Imputation with Discretization
2-1 Numeric Datasets
Advisor: Chih-Fong Tsai (蔡志豐)    Date of Approval: 2021-07-09