dc.description.abstract |
With the progress of technology along with the tide of big data, the importance of ”information” has gradually been valued by people. Therefore, many scholars began to dive into the field of data mining, looking forward to find the value behind numerous data and come up with innovative usages. Such as, but not limited to, using classifiers to discriminate the categories of articles and so on. However, for a classifier, a more comprehensive training data will come to a better result. When building a classifier, we label the data in manual, since articles and paragraphs come in different length, the cost varies widely for doing so.
This study focuses on using unsupervised learning to select samples while giving each data a selection cost to limit the total cost of the final selection. In this thesis, by using Bisecting K-means and Hierarchical Clustering algorithm, the data are selected by two ways, best points and best points under cost considerations. These training materials then are modeled by five different classifiers to measure the classification of classifiers that were established by the selected data.
Finally, the experimental results show that compared with random selecting, the 5 different classifiers each show better strengths in different areas when classifiers are established. Using the method mentioned in this thesis can result in selecting better quality and representative data from unlabeled data while not exceeding the budget. If these data are handed to experts for labeling, the labeling cost will drop significantly and come out in a better result. | en_US |