樣本選取 (instance selection) 在資料探勘領域的一門技術,但是對於現今持續增長的資料量,卻很少人著重在樣本選取,而本研究提出了一個基於支援向量機 (Support Vector Machine, SVM)概念發展出的一個樣本選取演算法稱為SVOIS。 而且是針對於文字分類上進行樣本選取,此外也與幾個有名的樣本選取演算法ENN、IB3、ICF和DROP3這些演算法進行比較。在分類器的選擇上,也較這些方法不同,本篇論文不只有使用k-NN這個作為分類器,還有使用一個二分類的分類器支援向量機SVM作為分類器的比較依據;因為對於SVM而言,在訓練的時候時常需要花費很長的時間,而且時間是隨著樣本的增加而增長,所以我們認為SVOIS不只會對SVM有所幫助,還可能會對於k-NN有較其他樣本選取演算法更有幫助。 最後,透過實驗二分類的文字資料集來進行實驗,也分別實作出其他這個演算法來進行比較,以驗證SVOIS是較其他樣本選取演算法來的佳。實驗結果也發現,SVOIS針對在文字資料集上樣本選取後的正確率較其他演算法來的高,也能改善其資料量。 Since the number and size of online information are increasing rapidly, instance selection has become one of the major techniques for managing text data. In this paper, a novel instance selection method, namely Support Vector Oriented Instance Selection (SVOIS) is proposed for text classification. SVOIS attempts to find the support vectors in the original feature space through a linear regression plane, where the instances to be selected as the support vectors need to satisfy two criteria. The first one is that the distances between the original instances and their class centers need to be smaller than a pre-defined value. Then, the instances fulfilling this criterion are regarded as the regression data in order to identify a regression plane. The second criterion is based on the distances between the regression data and the regression plane, which is like the margin of SVM. In particular, these distances need to be larger than a pre-defined value, and the regression data fulfilling this criterion are called support vectors for classifier training and classification. More specifically, these two types of distances should not be neither too long to make all instances to be selected, nor too short leading to very few support vectors. In particular, this paper compares SVOIS with four state-of-the-art algorithms, which are ENN, IB3, ICF, and DROP3. The experimental results over the TechTC-100 dataset show that SVOIS can allow SVM and k-NN provide similar or better classification accuracy than the baseline without instance selection and it also outperforms the state-of-the-art algorithms in terms of effectiveness and efficiency.