NCU Institutional Repository - Item 987654321/86560


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/86560


    Title: Deep Learning in Missing Value Imputation
    Author: Zhong, Jia-Rong
    Contributors: Department of Information Management
    Keywords: Data Mining; Deep Learning; Data Discretization; Missing Value; Data Pre-processing
    Date: 2021-07-09
    Date uploaded: 2021-12-07 12:58:18 (UTC+8)
    Publisher: National Central University
    Abstract: With the rapid development of information technology, people can collect large and varied amounts of data, and improvements in computing power have made data mining techniques increasingly mature and widely applied across industries. However, collected data inevitably contain missing values. If these data are not properly pre-processed, incomplete data often degrade data mining performance and reduce the accuracy of learning models. The existing literature covers traditional statistical imputation and machine learning imputation, but very few studies examine how well deep learning performs at missing value imputation. Furthermore, data discretization can reduce the influence of outliers on prediction results and improve model stability, yet no prior work examines how the order in which data discretization and missing value imputation are performed affects prediction accuracy when data are missing. This thesis therefore analyzes the performance of various imputation methods under different models and, in combination with discretization techniques, investigates how the order of discretization and imputation affects model prediction accuracy.
    This study builds missing value imputation models from two deep learning algorithms, the Deep MultiLayer Perceptron (DMLP) and the Deep Belief Network (DBN), and compares them with existing statistical and machine learning imputation models. Two discretization techniques, the Minimum Description Length Principle (MDLP) and ChiMerge (ChiM), are also combined with the deep learning imputation models. Finally, SVM classification accuracy is used to measure the effectiveness of each imputation method.
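    To make the evaluation setup concrete, the following is a minimal sketch, not the thesis code, of neural-network imputation judged by SVM classification accuracy. It uses scikit-learn's MLPRegressor as a simplified stand-in for the DMLP/DBN imputers described above; the synthetic data, missing rate, and layer sizes are illustrative assumptions.

```python
# Sketch: impute each column with an MLP regressor, then score the imputed
# data by SVM classification accuracy (the evaluation used in the thesis).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy numeric dataset with class labels (assumption; the thesis uses real datasets).
X = rng.normal(size=(500, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Inject 20% missing values completely at random.
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.2] = np.nan

def mlp_impute(X_in, hidden=(64, 32), max_iter=500):
    """Predict each column's missing entries from the other columns with an
    MLP regressor, training on rows where that column is observed
    (a provisional mean fill is used on the predictor side)."""
    X_out = X_in.copy()
    col_means = np.nanmean(X_in, axis=0)
    X_filled = np.where(np.isnan(X_in), col_means, X_in)
    for j in range(X_in.shape[1]):
        missing = np.isnan(X_in[:, j])
        if not missing.any():
            continue
        other = np.delete(X_filled, j, axis=1)
        model = MLPRegressor(hidden_layer_sizes=hidden, max_iter=max_iter,
                             random_state=0)
        model.fit(other[~missing], X_in[~missing, j])
        X_out[missing, j] = model.predict(other[missing])
    return X_out

X_imputed = mlp_impute(X_miss)

# Imputation quality is judged indirectly through downstream classification.
acc = cross_val_score(SVC(), X_imputed, y, cv=5, scoring="accuracy").mean()
print(f"SVM accuracy after MLP imputation: {acc:.3f}")
```

    The same accuracy measurement applies to whichever imputer fills the missing entries, so statistical, machine learning, and deep learning imputers can be compared on an equal footing.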
    The experimental results show that the deep learning imputation methods outperform the other imputation methods across different data types, especially on numeric and mixed datasets: DMLP and DBN exceed the Baseline by 14.70% and 15.88% on numeric datasets and by 8.71% and 7.96% on mixed datasets, respectively, showing that imputing an incomplete dataset can raise its classification accuracy. When discretization is added for numeric data, combinations using MDLP outperform the other combinations regardless of whether discretization or imputation is performed first. In particular, MDLP discretization followed by DMLP imputation, and MDLP discretization followed by DBN imputation, achieve classification accuracies slightly higher than DMLP imputation alone by 0.74% and 0.52%, and higher than the Baseline with ChiM by 2.94% and 2.72%, respectively. The choice of discretizer and deep learning algorithm therefore affects the resulting accuracy.
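    The ordering question above can be stated as two small pipelines. The sketch below is an illustration only: scikit-learn's KBinsDiscretizer stands in for MDLP/ChiMerge (neither ships with scikit-learn), and simple mean / most-frequent filling stands in for the deep learning imputers; only the discretize-then-impute versus impute-then-discretize logic is the point.

```python
# Sketch: compare (A) discretize first, then impute vs. (B) impute first,
# then discretize, scoring each with SVM classification accuracy.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.2] = np.nan   # 20% missing at random

def discretize_then_impute(X_in, n_bins=5):
    """Bin observed values per column first, then fill missing bins
    with each column's most frequent bin."""
    X_disc = np.full_like(X_in, np.nan)
    for j in range(X_in.shape[1]):
        obs = ~np.isnan(X_in[:, j])
        kb = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
        X_disc[obs, j] = kb.fit_transform(X_in[obs, j].reshape(-1, 1)).ravel()
    return SimpleImputer(strategy="most_frequent").fit_transform(X_disc)

def impute_then_discretize(X_in, n_bins=5):
    """Fill missing values first, then bin every column."""
    X_imp = SimpleImputer(strategy="mean").fit_transform(X_in)
    kb = KBinsDiscretizer(n_bins=n_bins, encode="ordinal", strategy="quantile")
    return kb.fit_transform(X_imp)

for name, pipeline in [("discretize -> impute", discretize_then_impute),
                       ("impute -> discretize", impute_then_discretize)]:
    acc = cross_val_score(SVC(), pipeline(X_miss), y, cv=5).mean()
    print(f"{name}: SVM accuracy = {acc:.3f}")
```

    Swapping in an MDLP or ChiMerge discretizer and a DMLP or DBN imputer would reproduce the ordering comparison reported above.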
    Appears in Collections: [Graduate Institute of Information Management] Theses and Dissertations

    Files in This Item:

    File         Description    Size    Format    Views
    index.html                  0 KB    HTML      82       View/Open


    All items in NCUIR are protected by copyright, with all rights reserved.

