Support Vector Oriented Instance Selection for Text Classification

Name

Che-wei Chang (張哲瑋)

Department

Department of Information Management

Thesis Title

Support Vector Oriented Instance Selection for Text Classification
(針對文字分類的支援向量導向樣本選取)

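This electronic thesis is authorized for immediate open access.
The open-access full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.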

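Abstract (translated from Chinese)

Instance selection is a technique in the field of data mining, yet despite today's continually growing data volumes, few studies focus on it. This study proposes SVOIS, an instance selection algorithm developed from the concept of the Support Vector Machine (SVM).
SVOIS performs instance selection specifically for text classification, and it is compared with several well-known instance selection algorithms: ENN, IB3, ICF, and DROP3. The choice of classifiers also differs from those methods: this thesis uses not only k-NN but also SVM, a binary classifier, as a basis for comparison. Training an SVM often takes a long time, and that time grows with the number of instances, so we expect SVOIS not only to benefit SVM but possibly also to help k-NN more than the other instance selection algorithms do.
Finally, experiments are conducted on binary-class text datasets, and the other algorithms are also implemented for comparison, to verify that SVOIS outperforms them. The results show that, on text datasets, SVOIS yields higher classification accuracy after instance selection than the other algorithms and also reduces the amount of data.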

Abstract (English)

Because the volume of online information is increasing rapidly, instance selection has become one of the major techniques for managing text data. In this paper, a novel instance selection method, Support Vector Oriented Instance Selection (SVOIS), is proposed for text classification.
SVOIS attempts to find the support vectors in the original feature space through a linear regression plane, where the instances selected as support vectors must satisfy two criteria. First, the distance between an original instance and its class center must be smaller than a pre-defined value. The instances fulfilling this criterion are regarded as the regression data and are used to identify a regression plane. The second criterion is based on the distances between the regression data and the regression plane, which are analogous to the margin of an SVM. In particular, these distances must be larger than a pre-defined value, and the regression data fulfilling this criterion are called support vectors and are used for classifier training and classification. More specifically, these two types of distances should be neither so long that all instances are selected nor so short that very few support vectors remain.
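The two criteria above can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say how the regression plane is fitted, so this sketch assumes an ordinary least-squares fit of the class label (-1 or +1) on the features, and the names `svois_sketch`, `center_radius`, and `plane_margin` are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_plane(X, y):
    """Least-squares fit of y ~ w.x + b via the normal equations (assumed fit)."""
    rows = [list(x) + [1.0] for x in X]
    d = len(rows[0])
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    Aty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    coef = solve_linear(AtA, Aty)
    return coef[:-1], coef[-1]

def svois_sketch(X, y, center_radius, plane_margin):
    """Return indices of instances that pass both criteria."""
    # Criterion 1: keep instances close enough to their own class center.
    centers = {c: centroid([x for x, t in zip(X, y) if t == c]) for c in set(y)}
    regression_data = [i for i, (x, t) in enumerate(zip(X, y))
                       if euclidean(x, centers[t]) < center_radius]
    # Fit the regression plane w.x + b = 0 on the surviving instances.
    w, b = fit_plane([X[i] for i in regression_data],
                     [y[i] for i in regression_data])
    norm = math.sqrt(sum(v * v for v in w))
    # Criterion 2: keep regression data farther from the plane than the margin.
    return [i for i in regression_data
            if abs(sum(wi * xi for wi, xi in zip(w, X[i])) + b) / norm > plane_margin]

# Toy two-class data; the last point is an outlier of class +1.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [4, 0], [5, 0], [4, 1], [5, 1], [10, 10]]
y = [-1, -1, -1, -1, 1, 1, 1, 1, 1]
selected = svois_sketch(X, y, center_radius=3.0, plane_margin=2.0)
```

On this toy data the first criterion discards the outlier, and the second keeps only the instances whose distance to the fitted plane exceeds the margin. Both thresholds are illustrative values, not ones taken from the thesis.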
In particular, this paper compares SVOIS with four state-of-the-art algorithms: ENN, IB3, ICF, and DROP3. The experimental results on the TechTC-100 dataset show that SVOIS allows SVM and k-NN to provide similar or better classification accuracy than the baseline without instance selection, and that it also outperforms the state-of-the-art algorithms in terms of effectiveness and efficiency.
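For context on the baselines, ENN (Edited Nearest Neighbor) is the simplest of the four to state: an instance is dropped when its class disagrees with the majority class of its k nearest neighbors. The sketch below is a generic illustration of that rule, not the code used in the thesis; the function name `enn`, the default `k=3`, and the Euclidean distance are assumptions.

```python
import math
from collections import Counter

def enn(X, y, k=3):
    """Return indices of instances kept by Edited Nearest Neighbor editing."""
    keep = []
    for i, xi in enumerate(X):
        # Rank all other instances by Euclidean distance to instance i.
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(xi, X[j]))
        # Majority class among the k nearest neighbors.
        majority = Counter(y[j] for j in others[:k]).most_common(1)[0][0]
        # Keep the instance only if its neighbors agree with its label.
        if majority == y[i]:
            keep.append(i)
    return keep

# Two clusters plus one instance labeled "b" sitting inside cluster "a".
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [5, 5], [5, 6], [6, 5], [6, 6],
     [0.5, 0.5]]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
kept = enn(X, y, k=3)
```

Here the mislabeled-looking instance at index 8 is removed while both clusters are kept intact, which shows that ENN is a noise filter and reduces the data only modestly.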

Keywords

★ machine learning
★ support vector machines
★ text classification
★ data reduction
★ instance selection

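Table of Contents

Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Background
1.2 Research Motivation and Objectives
1.3 Research Scope
1.4 Research Contributions
1.5 Thesis Organization
Chapter 2 Literature Review
2.1 Instance Selection
2.2 Edited Nearest Neighbor
2.3 Instance-Based Learning
2.4 Iterative Case Filtering
2.5 Decremental Reduction Optimization Procedure
2.6 Discussion
Chapter 3 Support Vector Oriented Instance Selection
3.1 Stage 1
3.2 Stage 2
3.3 Stage 3
3.4 Stage 4
3.5 Detailed SVOIS Algorithm
3.6 SVOIS in the Multi-Class Case
Chapter 4 Experimental Results
4.1 Experimental Design
4.2 Results
Chapter 5 Conclusion
5.1 Research Contributions
5.2 Future Work
References
Appendix 1
Appendix 2
Appendix 3
Appendix 4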

Advisors

Chun-shien Li, Chih-Fong Tsai
(李俊賢、蔡志豐)

Approval Date

2011-7-22
