Thesis/Dissertation 101423034: Detailed Record




Author 游孟綸 (Mon-loon You)   Department Department of Information Management
Thesis title A Two-Stage Hybrid Learning Approach for Effective Pattern Classification
(兩階段混合學習法於資料分類之研究)
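Related theses (titles translated from Chinese)
★ Building a Sales Forecasting Model for Commercial Multifunction Printers Using Data Mining Techniques
★ Applying Data Mining Techniques to Resource Allocation Prediction: A Case Study of a Support Unit at a Computer Contract Manufacturer
★ Applying Data Mining Techniques to Flight Delay Analysis in the Airline Industry: A Case Study of Company C
★ Security Control of New Products in a Global Supply Chain: A Case Study of Company C
★ Data Mining Applied to the Semiconductor Laser Industry: A Case Study of Company A
★ Applying Data Mining Techniques to Predict the Warehouse Storage Time of Air Export Cargo: A Case Study of Company A
★ Optimizing YouBike Rebalancing Operations Using Data Mining Classification Techniques
★ The Effect of Feature Selection on Different Data Types
★ Data Mining Applied to B2B Corporate Websites: A Case Study of Company T
★ Customer Investment Analysis and Recommendation for Financial Derivatives: Integrating Clustering and Association Rule Techniques
★ A Biological Genetic Algorithm: A Case Study of Personnel Allocation at Evacuation Shelters and Relief Supply Distribution Planning
★ Image Classification with the Bag-of-Words Model Based on Keypoint Selection
★ Candlestick Chart Mining for Stock Prediction
★ A Block-Based Pseudo-Relevance Feedback Algorithm: The Case of Image Retrieval
★ A Study of Instance Selection and Representative Data Detection
★ A Study of the Relationship among Missing Data Rates, Imputation Methods, and Data Preprocessing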
Files Full text cannot be browsed through the system (never open to the public)
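Abstract (translated from Chinese) Enterprises today often need to discover valuable knowledge in very large databases and data warehouses. The larger the database, however, the more noisy data it contains. This noise lowers the accuracy of data mining, and the sheer volume of data lengthens the knowledge discovery process.
Instance selection can filter out some of this noise during data preprocessing and is currently the most commonly used data reduction method. However, different instance selection algorithms select different data, and over selection or under selection often occurs, which in turn affects the accuracy of data mining. This study therefore proposes a new data preprocessing procedure (TSHLA, Two-Stage Hybrid Learning Approach) and applies it to data classification. Instance selection is first performed on the training set, and separate SVM models are trained on the data that the instance selection algorithm judges to be noisy and on the data it judges to be non-noisy. Each testing sample is then compared for similarity using KNN: testing samples that are more similar to the noisy data are classified by the model trained on the noisy set, and testing samples that are more similar to the non-noisy data are classified by the model trained on the non-noisy set. The aim is to recover samples in the noisy group that were filtered out but are in fact useful. The two sets of predictions are then merged into the final result.
The experiments are in two parts. In both, the instance selection step is carried out with three well-performing algorithms: IB3, DROP3, and GA. The first part tests TSHLA on 50 small datasets, using SVM as the classifier. The second part uses large datasets (more than 100,000 records), again with SVM as the classifier, and compares accuracy against conventional instance selection methods.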
Abstract (English) Nowadays, more and more enterprises need to extract knowledge from very large databases. However, these large datasets usually contain a certain amount of noisy data, which is likely to degrade data mining performance. In addition, processing large-scale datasets usually takes a long time.
Instance selection, a widely used data reduction approach, can filter out noisy data from large datasets. However, different instance selection algorithms applied to datasets from different domains filter out different data, which is likely to result in over selection or under selection since there is no exact definition of outliers. Thus, the quality of data mining results can be affected. Therefore, this thesis proposes a new data pre-processing approach (TSHLA, Two-Stage Hybrid Learning Approach) for effective data classification. First, instance selection is performed over a given training dataset to separate it into noisy and non-noisy data, and each subset is used to train its own SVM classifier. Then, KNN is used to measure the similarity of each testing sample to the two subsets. As a result, the noisy and non-noisy testing sets are identified and fed into their corresponding SVM classifiers for classification.
There are two experimental studies in this thesis, and three instance selection algorithms are used for comparison: IB3, DROP3, and GA. The first study is based on 50 small UCI datasets, and the second on large-scale datasets containing more than 100,000 data samples. In addition, the proposed TSHLA is compared with a baseline without instance selection and with one based on the conventional instance selection approach.
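The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: scikit-learn is assumed as the library, the edited nearest neighbour rule stands in for IB3, DROP3, or GA (none of which ship with scikit-learn), and the function names `enn_noise_mask`, `fit_tshla`, and `predict_tshla`, the value `k=3`, and the RBF kernel are hypothetical choices made for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors
from sklearn.svm import SVC


def enn_noise_mask(X, y, k=3):
    """Stand-in instance selection (edited nearest neighbour rule).

    Returns a boolean mask that is True where a sample's k nearest
    neighbours (the sample itself excluded) vote for a different class.
    Assumes y holds non-negative integer class labels.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # With no query argument, kneighbors() leaves each point out of its own neighbours.
    neighbours = nn.kneighbors(return_distance=False)
    majority = np.array([np.bincount(row).argmax() for row in y[neighbours]])
    return majority != y


def fit_tshla(X, y, k=3):
    """Stage 1: split the training set, then train one SVM per subset."""
    noisy = enn_noise_mask(X, y, k)
    models = {}
    for flag in (False, True):
        part = noisy == flag
        # SVC cannot be fitted on a single class, so such a subset gets no model.
        if np.unique(y[part]).size >= 2:
            models[flag] = SVC(kernel="rbf", gamma="scale").fit(X[part], y[part])
    if not models:
        raise ValueError("neither subset contains two classes")
    # The router is a KNN classifier whose target is the noisy/non-noisy flag.
    router = KNeighborsClassifier(n_neighbors=k).fit(X, noisy.astype(int))
    return router, models


def predict_tshla(router, models, X_test):
    """Stage 2: route each testing sample to the SVM of the subset it resembles."""
    route = router.predict(X_test).astype(bool)
    y_pred = np.empty(len(X_test), dtype=int)
    for flag in (False, True):
        rows = route == flag
        if rows.any():
            # Fall back to the other model when this subset had none.
            model = models.get(flag, models.get(not flag))
            y_pred[rows] = model.predict(X_test[rows])
    return y_pred


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
    y = np.repeat([0, 1], 100)
    y[:10], y[100:110] = 1, 0  # inject label noise into both classes
    router, models = fit_tshla(X, y)
    print(predict_tshla(router, models, X[:5]))
```

Routing with a KNN classifier trained on the noisy/non-noisy flag is one straightforward reading of "KNN similarity comparison"; the thesis may measure similarity differently.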
Keywords ★ data mining (資料探勘)
★ instance selection (樣本選取)
★ data reduction (資料縮減)
★ machine learning (機器學習)
★ support vector machines (支援向量機)
Table of Contents
Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables

Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 Instance Selection
2.1.1 Introduction to Instance Selection
2.1.2 Genetic Algorithm (GA)
2.1.3 DROP3
2.1.4 IB3
2.2 Machine Learning
2.2.1 Supervised Learning
2.2.2 Support Vector Machine (SVM)
Chapter 3 The TSHLA Method
3.1 Experimental Framework
3.2.1 TSHLA Experimental Procedure
3.2.2 TSHLA Pseudo-code
3.3 Baseline Procedure
3.4 Baseline 2 Procedure
3.4 Discussion and Analysis
Chapter 4 Experimental Results
4.1 Experiment 1
4.1.1 Datasets
4.1.2 Validation
4.1.3 Results of Experiment 1
4.2 Experiment 2
4.2.1 Datasets
4.2.2 Validation
4.2.3 Results of Experiment 2
Chapter 5 Conclusion
5.1 Conclusions and Contributions
5.2 Future Research Directions and Suggestions
References
Appendix
Advisor 蔡志豐 (Chih-Feng Tsai)   Approval date 2014-7-9

For questions about this thesis, please contact the Extension Services Section of the National Central University Library, TEL: (03)422-7151 ext. 57407.