Thesis/Dissertation 104423012: Full Metadata Record

DC Field | Value | Language
DC.contributor | Department of Information Management | zh_TW
DC.creator | 李妙翎 | zh_TW
DC.creator | Miao-Ling Li | en_US
dc.date.accessioned | 2017-7-6T07:39:07Z |
dc.date.available | 2017-7-6T07:39:07Z |
dc.date.issued | 2017 |
dc.identifier.uri | http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=104423012 |
dc.contributor.department | Department of Information Management | zh_TW
DC.description | 國立中央大學 | zh_TW
DC.description | National Central University | en_US
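dc.description.abstract | Data mining techniques are increasingly applied across many fields, but missing values can make analysis impossible or bias its results, so that mining fails to extract useful information. In recent years, researchers have continually proposed new methods, adopted machine learning algorithms, or improved the workflow of existing imputation methods to fill in missing values. The goal is to find imputation methods suited to different domains or data types, or to raise imputation accuracy and reduce the error between predicted values and the original data. This study proposes Class Center based Missing Value Imputation for Incomplete dataset (CCMVI), an imputation method that measures similarity between records using class centers as the reference. It is grounded in statistical methods, takes into account the class each record belongs to and the similarity between records, and adjusts the imputed values according to the dispersion of the data. In Experiment 1 and Experiment 2, datasets of different types and from different domains are selected, and missing values are imputed with CCMVI, statistical methods, the K-nearest neighbors (KNN) algorithm, and the support vector machine (SVM) algorithm. Classification accuracy, error, and execution time are then used to measure the effectiveness of each imputation method. Experiment 1 shows that CCMVI achieves higher classification accuracy than the machine learning algorithms, is slightly slower than the statistical methods, and yields error that differs little from that of SVM. Overall, CCMVI suits numerical and mixed-type data. However, the numerical data used in Experiment 2, a dataset from the software engineering domain, is not suited to CCMVI. Investigating the cause further shows that the distribution of the data affects the choice of imputation method. | zh_TW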
dc.description.abstract | Data mining techniques are widely used across many domains. Collected data often contain missing values, however, and most data mining algorithms cannot handle incomplete data directly; analyses run on such data are likely to produce biased results. In recent years, researchers have proposed new imputation methods, applied machine learning techniques, or refined the imputation process itself. Their aims are to reduce error rates, raise classification accuracy, or identify which method suits a particular kind of data. This thesis proposes an imputation method that uses class centers to measure the similarity between data records, called Class Center based Missing Value Imputation for Incomplete dataset (CCMVI). In Study 1 and Study 2, CCMVI, statistical imputation (mean/mode), KNN, and SVM are used to impute incomplete datasets of different data types and domains. To avoid the inconsistency that a single split into 90% training data and 10% testing data could introduce, the evaluation is repeated with 10-fold cross-validation. The imputation methods are then compared on classification accuracy, error rate, and time efficiency. The results of Study 1 show that CCMVI yields higher classification accuracy than the machine learning methods (SVM and KNN), while its time efficiency is slightly lower than that of statistical imputation. Overall, the proposed CCMVI method suits both numerical and mixed datasets. However, the results of Study 2 show that CCMVI is not suitable for a numerical dataset from the software engineering field. Further investigation of the cause found that the distribution of the data influences the results. | en_US
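DC.subject | Data Preprocessing | zh_TW
DC.subject | Missing Value | zh_TW
DC.subject | Imputation Method | zh_TW
DC.subject | Data Similarity | zh_TW
DC.subject | Data Preprocessing | en_US
DC.subject | Missing Value | en_US
DC.subject | Imputation Method | en_US
DC.subject | Data Similarity | en_US
DC.title | 衡量資料相似度於遺漏值填補之研究 (A Study on Measuring Data Similarity for Missing Value Imputation) | zh_TW
dc.language.iso | zh-TW | zh-TW
DC.type | Thesis/Dissertation | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US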
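
The abstract describes CCMVI only at a high level: imputation based on class centers, with further adjustments for record similarity and data dispersion that the record does not specify. As a rough illustration of the class-center idea alone, the sketch below fills each missing numeric value with the mean of that feature among records of the same class. This is plain class-conditional mean imputation, not the thesis's CCMVI algorithm; the function name `class_center_impute`, the use of `None` to mark missing values, and the fallback to the overall feature mean are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def class_center_impute(rows, labels):
    """Fill None entries with the per-class mean of the same feature.

    Hypothetical helper for illustration only; it does not reproduce
    the similarity or dispersion adjustments mentioned in the abstract.
    """
    # Collect observed values per (class, feature) and per feature overall.
    by_class = defaultdict(list)
    overall = defaultdict(list)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not None:
                by_class[(label, j)].append(value)
                overall[j].append(value)

    imputed = []
    for row, label in zip(rows, labels):
        new_row = []
        for j, value in enumerate(row):
            if value is not None:
                new_row.append(value)
            elif by_class[(label, j)]:
                # Class center: mean of this feature within the record's class.
                new_row.append(mean(by_class[(label, j)]))
            elif overall[j]:
                # Assumed fallback: overall feature mean when the class has no observed value.
                new_row.append(mean(overall[j]))
            else:
                raise ValueError(f"feature {j} has no observed values")
        imputed.append(new_row)
    return imputed


rows = [[1.0, 2.0], [3.0, None], [10.0, 20.0], [None, 24.0]]
labels = ["a", "a", "b", "b"]
filled = class_center_impute(rows, labels)
```

Here the missing second feature of the second record is filled from class "a" and the missing first feature of the fourth record from class "b", so values from the other class do not influence either fill. Categorical features would need a mode rather than a mean, which this sketch leaves out.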
