Deep Learning in Missing Value Imputation

DC field	Value	Language
DC.contributor	資訊管理學系	zh_TW
DC.creator	鍾家蓉	zh_TW
DC.creator	Jia-Rong Zhong	en_US
dc.date.accessioned	2021-07-09T07:39:07Z
dc.date.available	2021-07-09T07:39:07Z
dc.date.issued	2021
dc.identifier.uri	http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=108423021
dc.contributor.department	資訊管理學系	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	隨著資訊科技快速的發展，人們能快速的收集到各式各樣且大量的資料，而電腦運算能力的效能提升，也使得資料探勘 (Data Mining) 的技術日趨成熟。但在收集資料的過程，難免會遇到資料遺漏 (Data Missing) 的情況，若沒有將這些資料經過適當的前處理，這些不完整的資料往往會導致資料探勘的效能不佳，進而造成準確度的降低。近年來有傳統的統計補值法與機器學習補值法，但現有的文獻中並無探討深度學習對於遺漏值填補效用。再者，資料離散化 (Data Discretization) 能降低離群值對預測結果的干擾，提高模型的穩定性，但是現有文獻並無探討面對資料遺漏時，執行資料離散化與遺漏值填補的順序性對於資料預測之正確率影響之相關研究。因此本論文欲探討與分析各種補值法在不同模型下的表現，以及搭配離散化的技術，探討資料離散化與遺漏值填補的順序性，對於模型預測正確率之影響。本研究提出以深度學習演算法的深度類神經網路 (Deep MultiLayer Perceptron, DMLP) 與深度信念網路 (Deep Belief Network, DBN) 用於建置遺漏值填補的模型並與現有的統計分析與機器學習補值模型進行比較，此外本研究也加入了最小描述長度原則 (Minimum Description Length Principle, MDLP) 與卡方分箱法 (ChiMerge, ChiM) 這兩種離散化技術去搭配前述提到的深度學習補值模型進行實驗，最後利用 SVM 分類正確率作為衡量補值方法的成效。根據實驗結果可以觀察出在面對不同類型的資料時深度學習補值法的表現都較為優異，尤其在數值型與混合型資料集，DMLP與DBN分別勝過Baseline 14.70% 與15.88% 以及8.71% 與7.96%，可以發現不完整的資料集經過遺漏值填補能增加其正確率。而針對數值型資料加入離散化後，可以發現搭配MDLP不管是先離散化後補值，還是先補值後離散化，相較下都優於其他搭配組合，其中，先使用MDLP離散化後使用DMLP補值以及先用MDLP離散化後使用DBN補值的分類正確率小贏過單純使用深度學習補值的DMLP 0.74% 與0.52% 且勝過Baseline中使用ChiM的結果2.94% 與2.72%，可以發現離散化技術與深度學習演算法的搭配會影響其正確率。	zh_TW
dc.description.abstract	With the rapid development of information technology, people can easily collect large amounts of varied data, and data mining has been widely adopted in many industries. However, collected data almost inevitably contain missing values. If these missing data are not handled appropriately, data mining results suffer and the accuracy of learning models may degrade. In the related literature, missing value imputation based on statistical analysis and machine learning techniques has proven applicable to incomplete data problems. However, very few studies examine the imputation performance of deep learning techniques. In addition, data discretization can reduce the influence of outliers and increase model stability, yet the existing literature does not examine how the order in which discretization and imputation are applied affects prediction accuracy. Therefore, this thesis compares the performance of various imputation models, including deep neural networks based on the Deep MultiLayer Perceptron (DMLP) and the Deep Belief Network (DBN), against existing statistical and machine learning imputation models. It also examines the effect of the order in which data imputation and discretization are combined, using the Minimum Description Length Principle (MDLP) and ChiMerge (ChiM) as the discretizers, and uses SVM classification accuracy to measure the effectiveness of each imputation method. The experimental results show that deep neural networks outperform the other imputation methods, especially for numeric and mixed datasets: the accuracies of DMLP and DBN exceed the baseline by 14.70% and 15.88%, respectively, on numeric datasets, and by 8.71% and 7.96% on mixed datasets. Furthermore, when discretization is added for numeric data, combinations that use MDLP outperform the other combinations regardless of the order in which discretization and imputation are applied. In particular, the classification accuracies of MDLP + DMLP and MDLP + DBN (discretization first, then imputation) are slightly higher than that of DMLP imputation alone, by 0.74% and 0.52%, respectively, and higher than the Baseline (ChiM) by 2.94% and 2.72%, respectively. The experiments therefore show that the choice of discretizer and deep learning algorithm affects performance.	en_US
DC.subject	資料探勘	zh_TW
DC.subject	深度學習	zh_TW
DC.subject	資料離散化	zh_TW
DC.subject	遺漏值	zh_TW
DC.subject	資料前處理	zh_TW
DC.subject	Data Mining	en_US
DC.subject	Deep Learning	en_US
DC.subject	Data Discretization	en_US
DC.subject	Missing Value	en_US
DC.subject	Data pre-processing	en_US
DC.title	深度學習演算法於遺漏值填補之研究	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Deep Learning in Missing Value Imputation	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis 108423021: full metadata record
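
The abstract's question about the order of imputation and discretization can be illustrated with a minimal Python sketch. It is not the thesis's implementation: mean and mode filling stand in for the DMLP and DBN imputers, equal-width binning stands in for MDLP and ChiMerge, the SVM evaluation step is omitted, and every function name and the toy column are hypothetical.

```python
from collections import Counter
from statistics import mean


def equal_width_bins(values, n_bins=3):
    """Map each observed value to a bin index; missing values (None) stay None.

    Equal-width binning is an assumed stand-in for MDLP / ChiMerge.
    """
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 on a constant column

    def to_bin(v):
        # Clamp so the maximum value lands in the last bin.
        return min(int((v - lo) / width), n_bins - 1)

    return [None if v is None else to_bin(v) for v in values]


def impute_mean(values):
    """Numeric stand-in imputer: fill None with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Categorical stand-in imputer: fill None with the most frequent value."""
    fill = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def impute_then_discretize(values, n_bins=3):
    """Order A: fill the raw numeric column first, then bin it."""
    return equal_width_bins(impute_mean(values), n_bins)


def discretize_then_impute(values, n_bins=3):
    """Order B: bin the observed values first, then fill the missing bin labels."""
    return impute_mode(equal_width_bins(values, n_bins))


# Hypothetical toy column with one missing entry.
column = [1.0, 1.2, 1.1, None, 9.0, 8.0]

print(impute_then_discretize(column))  # [0, 0, 0, 1, 2, 2]
print(discretize_then_impute(column))  # [0, 0, 0, 0, 2, 2]
```

On this toy column the two orders assign the missing entry to different bins, which is the kind of difference that can change downstream classification accuracy. The sketch shows only that the order can matter; it makes no claim about which order is better.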