Master's/Doctoral Thesis 100423035: Detailed Record




Author 洪嘉彣 (Chia-Wen Hung)   Graduate Department Department of Information Management
Thesis Title A Study of Instance Selection and Representative Data Detection
(樣本選取與代表性資料偵測之研究)
Files Full text is permanently restricted (永不開放).
Abstract (Chinese) Enterprises today rely increasingly on extracting valuable knowledge from huge databases and data warehouses. The larger the dataset, however, the more noisy data it contains; this noise lowers mining accuracy, and the sheer data volume also lengthens the knowledge discovery process.
Although instance selection, currently the most common data reduction method, can filter out some of this noise during data preprocessing, the better-performing instance selection algorithms in the literature suffer from very high time complexity. This study therefore proposes a new data preprocessing process, ReDD (Representative Data Detection): instance selection is first applied to only a small portion of the data; a classifier of relatively low complexity then learns the characteristics of the representative data chosen in that step; finally, the trained classifier (the detector) identifies the outliers contained in the entire original dataset. This greatly reduces the time required for data reduction.
The experiments consist of two parts, each testing three high-performing algorithms, IB3, DROP3, and GA, in the instance selection step. The first part uses ReDD to reduce 50 small datasets, with SVM, CART, KNN, and Naive Bayes as detectors; KNN and CART prove to be the best-performing detectors. The second part tests four large datasets (over 100,000 instances each) with KNN and CART as the detectors of the ReDD model, comparing accuracy and execution time against traditional instance selection. The results show that ReDD saves a large amount of execution time compared with traditional instance selection, with no significant difference in accuracy, demonstrating that ReDD can greatly improve the efficiency of data reduction on large datasets.
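The instance selection step described above can be illustrated with a much simpler algorithm than IB3, DROP3, or GA: Wilson's edited-nearest-neighbour rule (Wilson, 1972), which discards an instance when the majority of its k nearest neighbours disagree with its label. The sketch below is illustrative only, not the thesis's code; the function name and the toy data are invented.

```python
import math
from collections import Counter

def enn_select(points, labels, k=3):
    """Wilson's Edited Nearest Neighbour: keep an instance only when the
    majority label among its k nearest neighbours agrees with its own."""
    kept = []
    for i, p in enumerate(points):
        # distances to every other instance, paired with that instance's label
        dists = sorted((math.dist(p, q), labels[j])
                       for j, q in enumerate(points) if j != i)
        majority = Counter(lab for _, lab in dists[:k]).most_common(1)[0][0]
        if majority == labels[i]:
            kept.append(i)
    return kept

# Two clean clusters plus one mislabeled point planted at index 8
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (5, 5), (5, 6), (6, 5), (6, 6),
       (0.5, 0.5)]
labs = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b']
print(enn_select(pts, labs))  # [0, 1, 2, 3, 4, 5, 6, 7] -- index 8 is filtered out
```

The mislabeled point sits inside the 'a' cluster, so all three of its nearest neighbours vote 'a' against its 'b' label and it is removed; every clean point survives the vote.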
Abstract (English) Nowadays, more and more enterprises need to extract knowledge from very large databases. However, such large datasets usually contain a certain amount of noisy data, which is likely to degrade the performance of data mining. In addition, the computational time of the KDD process over large-scale datasets is considerable.
Instance selection, the most widely used method for data reduction, can filter out noisy data from large datasets. However, many existing instance selection algorithms are limited in handling large datasets in terms of time efficiency. Therefore, we introduce a novel data preprocessing process called Representative Data Detection (ReDD), which needs only a small part of the original dataset to perform the instance selection step. A classifier is then trained to learn the representative data identified by the instance selection step. Afterwards, the trained classifier is used as a detector to detect all the noisy data in the large original dataset.
The thesis contains two experiments in which IB3, DROP3, and GA are used as the baseline instance selection algorithms. In the first experiment, fifty small-scale datasets are used to evaluate ReDD, with SVM, CART, KNN, and Naive Bayes constructed as the detectors for comparison. We find that KNN and CART perform best. In the second experiment, the classification accuracy and execution time of ReDD and the baselines are compared over four large-scale datasets (more than one hundred thousand instances each). The results show that ReDD saves a large amount of execution time compared to traditional instance selection, while the accuracy rates of ReDD and the baselines show no significant difference.
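The three-step ReDD flow — select on a subsample, learn the kept/removed flags, then sweep the full dataset with the trained detector — can be sketched in a few lines. This is a hypothetical stdlib-only reconstruction, not the thesis's implementation: an edited-nearest-neighbour vote stands in for IB3/DROP3/GA in step 1, a plain k-NN vote stands in for the detector, and the names `redd`, `knn_label`, and the fixed subsample indices are all invented for illustration.

```python
import math
from collections import Counter

def knn_label(train, x, k):
    """Majority label of x's k nearest neighbours; train = [(point, label), ...]."""
    dists = sorted((math.dist(p, x), lab) for p, lab in train)
    return Counter(lab for _, lab in dists[:k]).most_common(1)[0][0]

def redd(points, labels, sample_idx, k_select=3, k_detect=1):
    """ReDD sketch: instance-select only a small subsample, then train a
    cheap k-NN detector on the keep/noise flags and sweep the full dataset."""
    # Step 1: edited-nearest-neighbour selection, run on the subsample only
    flags = []
    for i in sample_idx:
        others = [(points[j], labels[j]) for j in sample_idx if j != i]
        vote = knn_label(others, points[i], k_select)
        flags.append('keep' if vote == labels[i] else 'noise')
    # Step 2: the subsample's keep/noise flags become training labels
    detector_train = [(points[i], f) for i, f in zip(sample_idx, flags)]
    # Step 3: the trained detector decides keep/noise for every original instance
    return [i for i, p in enumerate(points)
            if knn_label(detector_train, p, k_detect) == 'keep']

# Two clean clusters plus one mislabeled outlier at index 20
pts = [(x, 0) for x in range(10)] + [(x, 100) for x in range(10)] + [(50, 90)]
labs = ['a'] * 10 + ['b'] * 10 + ['a']
sub = list(range(0, 20, 2)) + [20]   # the "small part" handed to step 1
kept = redd(pts, labs, sub)
print(kept)  # indices 0..19 survive; the planted outlier at index 20 is removed
```

Only the subsample pays the cost of the selection step; the full pass is a cheap classifier sweep, which mirrors the time savings the abstract reports. In practice the subsample would be drawn at random rather than fixed as here.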
Keywords (Chinese) ★ 知識發掘 (knowledge discovery in databases)
★ 資料精簡 (data reduction)
★ 樣本選取 (instance selection)
★ 離群值偵測 (outlier detection)
★ 時間複雜度 (time complexity)
Keywords (English) ★ knowledge discovery in databases
★ data reduction
★ instance selection
★ outlier detection
★ time complexity
Table of Contents Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 Instance Selection
2.1.1 Overview of Instance Selection
2.1.2 Genetic Algorithms
2.1.3 DROP3
2.1.4 IB3
2.2 Outlier Detection
2.2.1 Outlier Detection Techniques
2.3 Supervised Classification Models
2.3.1 Supervised Learning
2.3.2 Support Vector Machines
2.3.3 Classification and Regression Trees
2.3.4 Naive Bayes
2.3.5 K-Nearest Neighbors
Chapter 3 ReDD: Representative Data Detection Method
3.1 The ReDD Process
3.2 The Baseline Process
3.3 Discussion and Analysis
Chapter 4 Experimental Results
4.1 Experiment 1
4.1.1 Design of Experiment 1
4.1.1.1 Datasets
4.1.1.2 Classifiers
4.1.1.3 Validation
4.1.2 Results of Experiment 1
4.2 Experiment 2
4.2.1 Design of Experiment 2
4.2.1.1 Datasets
4.2.1.2 Validation
4.2.2 Results of Experiment 2
4.3 Sensitivity Analysis
Chapter 5 Conclusion
5.1 Conclusions and Contributions
5.2 Future Research Directions
References
Appendix
References Chinese-language references
林嘉彣, (2009), “CANN: An Intrusion Detection System Combining Cluster Centers and Nearest Neighbors.” Master's Thesis, National Chung Cheng University.
English-language references
Aha, D.W., Kibler, D., and Albert, M.K., (1991), “Instance-based learning algorithms.” Machine Learning, vol. 6, no.1, pp.37-66.
Baker, J.E., (1987), “Reducing bias and inefficiency in the selection algorithm.” Proceedings of the Second International Conference on Genetic Algorithms, L. Erlbaum Associates, Hillsdale, MA, pp. 14-21.
Barnett, V. and Lewis, T., (1994), “Outliers in statistical data.” 3rd Edition, John Wiley & Sons.
Ben-Gal, I., (2005), “Outlier detection.” In: Maimon, O. and Rokach, L. (Eds.), Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic Publishers, ISBN 0-387-24435-2.
Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J., (1984), “Classification and Regression Trees.” Wadsworth, Belmont, CA.
Chandola, V., Banerjee, A., and Kumar, V., (2009), “Anomaly detection: a survey.” ACM Computing Surveys, vol. 41, no. 3, article 15.
Cano, J.R., Herrera, F., and Lozano, M., (2003), “Using evolutionary algorithms as instance selection for data reduction: an experimental study.” IEEE Transactions on Evolutionary Computation, vol. 7, no. 6, pp. 561-575.
Cover, T.M., and Hart, P.E., (1967), “Nearest neighbor pattern classification.” IEEE Transactions on Information Theory, vol. 13, pp. 21-27.
Devijver, P. A., Kittler, J., (1982), “Pattern Recognition: A Statistical Approach.” Prentice-Hall, London, GB.
De Jong, K.A., (1975), “An Analysis of the Behavior of a Class of Genetic Adaptive Systems.” Ph.D. Thesis, Department of Computer and Communication Sciences, University of Michigan.
Duda, R.O., Hart, P.E., and Stork, D.G., (2001), “Pattern Classification.” 2nd Edition, John Wiley, New York.
Edgeworth, F. Y., (1887), “On discordant observations.” Philosophical Magazine 23, 5, 364-375.
García, S., Derrac, J., Cano, J.R., and Herrera, F., (2012), “Prototype Selection for Nearest Neighbor Classification: Taxonomy and Empirical Study.” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3.
Gates, G.W., (1972), “The Reduced Nearest Neighbor Rule.” IEEE Transactions on Information Theory, vol. 18, pp. 431-433.
Gen, M., and Cheng, R., (1997), “Genetic Algorithm and Engineering Design.” John Wiley and Sons.
Goldberg, D.E., (1989), “Genetic Algorithms in Search, Optimization, and Machine Learning.” Addison Wesley.
Hawkins, D., (1980), “Identification of Outliers.” Chapman and Hall.
Herrera, F., Lozano, M., and Verdegay, J.L., (1998), “Tackling Real-Coded Genetic Algorithms: Operators and Tools for Behavioural Analysis.” Artificial Intelligence Review, vol.12, pp.265-319.
Hodge, V.J. and Austin, J., (2004), “A survey of outlier detection methodologies.” Artificial Intelligence Review, vol. 22, pp. 85-126.
Holland, J.H., (1975), “Adaptation in Natural and Artificial Systems.” The University of Michigan Press.
Jang, J.R., Sun, C.T., and Mizutani, E., (1997), “Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence.” Prentice Hall, Inc. Upper Saddle River, NJ 07458.
Jankowski, N. and Grochowski, M., (2004), “Comparison of instances selection algorithms I: algorithms survey.” International Conference on Artificial Intelligence and Soft Computing, pp. 598-603.
Kohavi, R., (1995), “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.” Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Vol. 2, pp.1137-1145.
Kotsiantis, S.B., Kanellopoulos, D., and Pintelas, P.E., (2006), “Data Preprocessing for Supervised Learning.” International Journal of Computer Science, vol. 1, ISSN 1306-4428.
Kuncheva, L.I., and Sánchez, J.S., (2008), “Nearest Neighbour Classifiers for Streaming Data with Delayed Labelling.” Proceedings of the Eighth IEEE International Conference on Data Mining.
Li, X.-B. and Jacob, V.S., (2008), “Adaptive data reduction for large-scale transaction data.” European Journal of Operational Research, vol. 188, no. 3, pp. 910-924.
Liu, H., Shah, S., and Jiang, W., (2004), “On-line outlier detection and data cleaning.” Computers and Chemical Engineering, 28, 1635–1647.
Meyer, D., (2012), “Support Vector Machines: The Interface to libsvm in Package e1071.” Technische Universität Wien, Austria.
Mitchell, T., (1997), “Machine Learning.” McGraw Hill, New York.
Olvera-López, J.A., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., and Kittler, J., (2010), “A review of instance selection methods.” Artificial Intelligence Review, vol. 34, pp. 133-143.
Pyle, D., (1999), “Data Preparation for Data Mining.” Morgan Kaufmann.
Reeves, C. R., (1999), “Foundations of Genetic Algorithms.” Morgan Kaufmann Publishers.
Reinartz, T., (2002), “A unifying view on instance selection.” Data Mining and Knowledge Discovery, vol. 6, pp. 191-210.
Roiger, R.J. and Geatz, M.W., (2003), “Data Mining: A Tutorial-Based Primer.” Addison-Wesley.
Ritter, G.L., Woodruff, H.B., Lowry, S.R., and Isenhour, T.L., (1975), “An algorithm for a selective nearest neighbor decision rule.” IEEE Transactions on Information Theory, vol. 21, pp. 665-669.
Rousseeuw, P. and Leroy, A., (1996), “Robust Regression and Outlier Detection.” 3rd Edition, John Wiley & Sons.
Sikora, R., and Piramuthu, S., (2007), “Framework for efficient feature selection in genetic algorithm based data mining.” European Journal of Operational Research, vol. 180, no. 2, pp. 723-737.
Sipser, M., (2006), “Introduction to the Theory of Computation.” Course Technology Inc. ISBN 0-619-21764-2.
Syswerda, G., (1989), “Uniform Crossover in Genetic Algorithms.” In Proceedings of the Third International Conference on Genetic Algorithms, J. Schaffer (ed.), Morgan Kaufmann, 2-9.
Tan, P.N., Steinbach, M., and Kumar, V., (2006), “Introduction to Data Mining.” Addison Wesley.
Vapnik, V.N., (1995), “The Nature of Statistical Learning Theory.” Springer, New York.
Williams, B. K., Nichols, J. D., and Conroy, M. J., (2002), “Analysis and management of animal populations.” London: Academic Press.
Wilson, D., (1972), “Asymptotic properties of nearest neighbor rules using edited data.” IEEE Transactions on Systems, Man, and Cybernetics, vol. 2, pp. 408-421.
Wilson, D.R., and Martinez, T.R., (2000), “Reduction techniques for instance-based learning algorithms.” Machine Learning, vol.38, pp.257-286.
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z.H., Steinbach, M., Hand, D.J., and Steinberg, D., (2008), “Top 10 algorithms in data mining.” Knowledge and Information Systems, vol. 14, pp. 1-37.
Advisor 蔡志豐 (Chih-Fong Tsai)   Approval Date 2013-07-05
