Thesis 103423019: Complete Metadata Record

DC field | Value | Language
dc.contributor | 資訊管理學系 (Department of Information Management) | zh_TW
dc.creator | 李韋柔 | zh_TW
dc.creator | Wei-Jou Lee | en_US
dc.date.accessioned | 2016-07-05T07:39:07Z |
dc.date.available | 2016-07-05T07:39:07Z |
dc.date.issued | 2016 |
dc.identifier.uri | http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=103423019 |
dc.contributor.department | 資訊管理學系 (Department of Information Management) | zh_TW
dc.description | 國立中央大學 (National Central University) | zh_TW
dc.description | National Central University | en_US
dc.description.abstract | zh_TW (translated from the original Chinese):

Raw data collected from the real world inevitably include incomplete or low-quality records, which reduce both the effectiveness and the efficiency of data mining. Two causes degrade data mining performance: missing data, and data containing too many redundant or unrepresentative features. Such data provide no useful information; they lower the quality of the experimental results, reduce the overall efficiency of the experiments, and raise their cost. The missing value problem is pervasive in data mining, arising for example from data-entry errors or incorrect data formats. Imputation methods were developed for this problem: they treat the complete records as observations and use them to predict the missing values in the incomplete records. Feature selection filters redundant and unrepresentative features out of the data. This study combines feature selection with imputation, examining whether it is appropriate to first refine the data through feature selection and then impute the missing values.

To investigate the effect of feature selection on imputation, 12 complete datasets of three types (categorical, mixed, and numerical) were collected from the UCI repository. To approximate real-world conditions, missing rates of 10%, 20%, 30%, 40%, and 50% were simulated, and the behavior and trends under these five rates were examined. Three feature selection techniques, the genetic algorithm (GA), decision tree (DT), and information gain (IG), and three imputation methods, the multilayer perceptron (MLP), support vector machine (SVM), and K-nearest neighbor imputation (KNNI), were chosen to determine under which conditions a combination of feature selection and imputation is the best or most suitable, and to compare the classification accuracy of the combined methods against imputation alone. The work comprises two studies: Study 1 on the initial datasets, and Study 2 validating Study 1's results on high-dimensional datasets.

According to the results of Study 1, on mixed datasets, GA feature selection followed by IBk imputation is informative and has a positive effect; on numerical datasets, IG feature selection at a 65% retention rate followed by any of the MLP, IBk, or SVM imputation methods has a positive effect. Categorical datasets achieve their best accuracy with direct classification, so we do not recommend the selection-then-imputation pipeline for them. We also found that at a 10% missing rate, performing feature selection before imputation outperforms both direct imputation and direct classification. Study 2 yields two findings: first, on high-dimensional datasets, DT feature selection followed by MLP or IBk imputation is the best-performing combination; second, on high-dimensional datasets the combined methods perform better when the retention rate is below 40%.
dc.description.abstract | en_US:

Real-life data collections frequently contain missing values, and their presence in a dataset can degrade the performance of data mining algorithms. The most commonly used techniques for handling missing data are based on imputation methods. Poor-quality data negatively influence predictive accuracy; such data contain not only missing values but also redundant and unrepresentative features. Data of this kind degrade the performance of data mining algorithms and increase the cost of research. To address this problem, we propose performing feature selection over the complete data before the imputation step. The aim of feature selection is to filter unrepresentative features out of a given dataset. This research therefore focuses on identifying the best combination of feature selection and imputation methods.

The experimental setup is based on 12 UCI datasets composed of categorical, numerical, and mixed types of data. The experiments simulate 10%, 20%, 30%, 40%, and 50% missing rates for each training dataset. Three feature selection methods, the genetic algorithm (GA), decision tree (DT), and information gain (IG), are compared, and three imputation methods, K-nearest neighbor imputation (KNNI), the support vector machine (SVM), and the multilayer perceptron (MLP), are each employed individually. The comparative results show which combination of feature selection and imputation methods performs best, and whether combining feature selection with missing value imputation is a better choice than performing imputation alone on incomplete datasets.

According to the results, the combination of the GA feature selection method and the IBk imputation method was significantly better than imputation alone over mixed datasets. The combinations of the IG feature selection method, retaining 65% of the features, with any of the selected imputation methods achieve significantly better classification accuracy over numerical datasets. Performing missing value imputation alone is the better choice over categorical datasets. Performing feature selection before the imputation step gives the best classification accuracy at a 10% missing rate. On high-dimensional datasets, combining the DT feature selection method with the IBk or MLP imputation methods produces the best performance. | en_US
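The selection-then-imputation pipeline described in the abstract can be sketched in plain Python. This is a minimal toy illustration, assuming IG ranks discrete features on the complete data, a retention rate keeps only the top-ranked features, and KNNI fills missing cells from the k most similar complete rows; all function names and the toy data are ours, not the thesis's actual GA/DT/IG or MLP/SVM/KNNI implementations.

```python
import math

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(column, labels):
    """IG of a discrete feature column with respect to the class labels."""
    groups = {}
    for x, y in zip(column, labels):
        groups.setdefault(x, []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def select_features(rows, labels, retain=0.65):
    """Keep the top `retain` fraction of feature indices, ranked by IG."""
    n_feat = len(rows[0])
    gains = [information_gain([r[j] for r in rows], labels) for j in range(n_feat)]
    ranked = sorted(range(n_feat), key=lambda j: gains[j], reverse=True)
    return sorted(ranked[:max(1, round(retain * n_feat))])

def knn_impute(rows, k=3):
    """Fill None cells with the mean of that cell over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    out = []
    for r in rows:
        if None not in r:
            out.append(list(r))
            continue
        # Distance is computed over the observed cells of the incomplete row only.
        dist = lambda c: sum((a - b) ** 2 for a, b in zip(r, c) if a is not None)
        nearest = sorted(complete, key=dist)[:k]
        out.append([v if v is not None else sum(c[j] for c in nearest) / k
                    for j, v in enumerate(r)])
    return out
```

In the study's terms, `select_features` plays the role of the IG step at a given retention rate, and `knn_impute` the role of KNNI; a classifier would then be trained on the imputed, reduced dataset to measure accuracy.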
dc.subject | 資料探勘 (Data Mining) | zh_TW
dc.subject | 特徵選取 (Feature Selection) | zh_TW
dc.subject | 補值法 (Imputation Methods) | zh_TW
dc.subject | 機器學習 (Machine Learning) | zh_TW
dc.subject | 分類問題 (Classification) | zh_TW
dc.subject | Machine Learning | en_US
dc.subject | Feature Selection Methods | en_US
dc.subject | Imputation Methods | en_US
dc.subject | Classification | en_US
dc.subject | Data Mining | en_US
dc.title | 特徵選取前處理於填補遺漏值之影響 (The Effect of Feature Selection Preprocessing on Missing Value Imputation) | zh_TW
dc.language.iso | zh-TW | zh-TW
dc.type | 博碩士論文 (Thesis) | zh_TW
dc.type | thesis | en_US
dc.publisher | National Central University | en_US
