Graduate Thesis 106423009: Full Metadata Record

DC Field    Value    Language
dc.contributor    資訊管理學系    zh_TW
dc.creator    顏子明    zh_TW
dc.creator    Tzu-Ming Yen    en_US
dc.date.accessioned    2019-07-01T07:39:07Z
dc.date.available    2019-07-01T07:39:07Z
dc.date.issued    2019
dc.identifier.uri    http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=106423009
dc.contributor.department    資訊管理學系    zh_TW
dc.description    國立中央大學    zh_TW
dc.description    National Central University    en_US
dc.description.abstract    「資料前處理」在資料探勘中扮演舉足輕重的角色,也是整個分析流程的起點。真實世界中的資料品質參差不齊,例如:大樣本的資料往往帶有雜訊(Noise),或是包含判讀性低的連續型數值,若是沒有經過適當的前處理,這些因素都會造成分析結果有所誤差。在過去的文獻中,有學者提出樣本選取(Instance Selection)的資料取樣概念,能夠透過演算法篩選具有代表性的樣本;也有研究顯示,前處理時運用離散化(Discretization)將連續型數值轉換成離散型,能夠有效提高探勘規則的可讀性,同時也可能提升正確率。若是將樣本選取與離散化結合,是否能夠獲得比單一前處理更佳的表現,目前尚未有文獻做出這方面的探討。本論文欲探討樣本選取與離散化結合後進行資料前處理的影響,以及如何搭配才能達到最佳表現。本研究選用了三種樣本選取演算法:基於樣本學習演算法(Instance-Based Learning Algorithm, IB3)、基因演算法(Genetic Algorithm, GA)、遞減式縮減最佳化程序(Decremental Reduction Optimization Procedure, DROP3),以及兩種監督式離散化演算法:最短描述長度原則(Minimum Description Length Principle, MDLP)、基於卡方的分箱(ChiMerge, ChiM),並以最近鄰居法(K-Nearest Neighbor, KNN)作為分類器來評估搭配的最佳組合。本研究以UCI與KEEL上的10種資料集,進行樣本選取與離散化搭配的探討。根據實驗結果,以DROP3樣本選取演算法搭配MDLP離散化演算法所得到的平均結果為較推薦之組合,並且以先進行DROP3樣本選取、後進行MDLP離散化的前處理,能夠得到較顯著提升的平均正確率,其正確率達85.11%。    zh_TW
dc.description.abstract"Data Preprocessing" plays a pivotal role in data exploration and is the first step for the analysis process of data mining. In the real world, the quality of the big data is always unclear and uneven. For example, samples in the big data often have noise or continuous type values with low interpretability. These factors will result in inaccurate outcome if not properly pre-processed. In the literature, the concept of data sampling for instance selection had been proposed, which can be used to screen representative samples. Some studies have also shown that using discretization technology to transfer continuous values into discrete ones can effectively improve the readability of analytical exploration rules and may also improve the accuracy rate. Till now, there are no studies to explore the combination of instance selection and discretization, whether it can achieve better performance outcome than the single preprocessing techniques. This thesis aims to discuss the influence of data preprocessing after combining instance selection and discretization, and how to achieve the optimal performance. In this study, three instance selection algorithms are selected: Instance-Based Learning Algorithm (IB3), Genetic Algorithm (GA), Decremental Reduction Optimization Procedure (DROP3), and two supervised discretization algorithms: Minimum Description Length Principle (MDLP), ChiMerge-based (ChiM). The best combination of the two types of techniques is evaluated by the performance of the K-th Nearest Neighbor (KNN) classifiers. This study uses the 10 datasets from UCI and KEEL to explore the instance selection and discretization. According to the experimental results, it reveals that the average results of the DROP3 instance selection algorithm combined with the MDLP discretization algorithm is the more recommended combination than others, and the optimal performance can be obtained when the pre-processing of MDLP discretization is performed after the selection by DROP3, the average accuracy is promoted to 85.11%. en_US
dc.subject    資料前處理    zh_TW
dc.subject    樣本選取    zh_TW
dc.subject    資料離散化    zh_TW
dc.subject    連續型數值    zh_TW
dc.subject    資料探勘    zh_TW
dc.subject    Data pre-processing    en_US
dc.subject    instance selection    en_US
dc.subject    discretization    en_US
dc.subject    continuous value    en_US
dc.subject    data mining    en_US
dc.title    樣本選取與資料離散化對於分類器效果之影響    zh_TW
dc.language.iso    zh-TW    zh-TW
dc.title    Instance Selection and Data Discretization Influence on Classifier’s Performance    en_US
dc.type    博碩士論文    zh_TW
dc.type    thesis    en_US
dc.publisher    National Central University    en_US
