    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/61106


    Title: A Study of Instance Selection and Representative Data Detection (樣本選取與代表性資料偵測之研究)
    Author: Hung, Chia-Wen (洪嘉彣)
    Contributor: Department of Information Management
    Keywords: knowledge discovery in databases; data reduction; instance selection; outlier detection; time complexity
    Date: 2013-07-05
    Upload time: 2013-08-22 12:12:03 (UTC+8)
    Publisher: National Central University
    Abstract: Enterprises increasingly rely on extracting valuable knowledge from very large databases and data warehouses. However, the larger a dataset is, the more noisy data it tends to contain. This noise lowers the accuracy of data mining, and the sheer volume of data increases the time required by the knowledge discovery in databases (KDD) process.
    Instance selection, the most widely used approach to data reduction, can filter out noisy data during preprocessing. However, the better-performing instance selection algorithms in the literature have high time complexity, which limits their use on large datasets. This thesis therefore proposes a new data preprocessing process called Representative Data Detection (ReDD). ReDD performs instance selection on only a small part of the original dataset. A classifier of comparatively low complexity is then trained to learn the characteristics of the representative data identified by the instance selection step. Finally, the trained classifier serves as a detector that identifies the noisy data (outliers) in the entire original dataset, which greatly reduces the time needed for data reduction.
    The thesis contains two experiments, both of which use IB3, DROP3, and GA as the instance selection algorithms. In the first experiment, ReDD is evaluated on fifty small-scale datasets with SVM, CART, KNN, and Naive Bayes as detectors; KNN and CART perform best. In the second experiment, ReDD with KNN and CART as detectors is compared with traditional instance selection on four large-scale datasets (more than one hundred thousand instances each) in terms of classification accuracy and execution time. The results show that ReDD takes far less execution time than traditional instance selection, while the accuracy of the two approaches does not differ significantly.
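The three ReDD steps described in the abstract (instance selection on a small sample, training a detector, screening the full dataset) can be illustrated with a minimal pure-Python sketch. It is not the thesis's implementation. Several parts are assumptions made for illustration: Wilson's edited nearest neighbour rule stands in for IB3, DROP3, and GA; the detector is a 1-nearest-neighbour classifier whose input is the feature vector with a heavily weighted class label appended; and the function names (`knn_predict`, `enn_flags`, `redd`, `make_toy_data`), the `LABEL_WEIGHT` constant, and the toy data are all hypothetical.

```python
from collections import Counter

LABEL_WEIGHT = 1000.0  # assumption: makes a label mismatch dominate the distance


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points nearest to x (squared Euclidean)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def enn_flags(X, y, k=3):
    """Stand-in instance selection (Wilson's edited nearest neighbour rule).

    Returns True for instances whose label matches the vote of their k nearest
    neighbours (treated as representative) and False otherwise (treated as noise).
    """
    flags = []
    for i in range(len(X)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        flags.append(knn_predict(rest_X, rest_y, X[i], k) == y[i])
    return flags


def redd(X, y, sample_idx, k_detector=1):
    """Run selection on a small sample, then let a k-NN detector screen all data."""
    # Step 1: instance selection on the small sample only.
    sub_X = [X[i] for i in sample_idx]
    sub_y = [y[i] for i in sample_idx]
    flags = enn_flags(sub_X, sub_y)
    # Step 2: detector training set maps (features + weighted label) to a flag.
    train = [xs + [LABEL_WEIGHT * ys] for xs, ys in zip(sub_X, sub_y)]
    # Step 3: keep every original instance the detector calls representative.
    return [i for i in range(len(X))
            if knn_predict(train, flags, X[i] + [LABEL_WEIGHT * y[i]], k_detector)]


def make_toy_data(n=600, noise_every=9):
    """Two well-separated 1-D clusters; every `noise_every`-th label is flipped."""
    X, y = [], []
    for i in range(n):
        cls = i % 2
        X.append([1000.0 * cls + i // 2])
        y.append(1 - cls if i % noise_every == 0 else cls)
    return X, y


X, y = make_toy_data()
sample_idx = list(range(0, len(X), 5))  # selection only sees 1 in 5 instances
kept = redd(X, y, sample_idx)
print(len(X), len(sample_idx), len(kept))
```

In this sketch the quadratic-cost selection step runs on 120 of the 600 instances, and the detector then screens all 600. On this deliberately easy toy data the detector removes exactly the instances whose labels were flipped; real datasets would not separate this cleanly.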
    Appears in Collections: [Graduate Institute of Information Management] Master's and Doctoral Theses

    Files in this item:

    File          Description   Size   Format   Views
    index.html                  0Kb    HTML     775


    All items in NCUIR are protected by original copyright.
