A Study of Sample Selection Using Unsupervised Learning Under Cost Constraints

DC Field	Value	Language
DC.contributor	資訊管理學系	zh_TW
DC.creator	張詒鈞	zh_TW
DC.creator	Yi-Jun Zhang	en_US
dc.date.accessioned	2017-08-17T07:39:07Z
dc.date.available	2017-08-17T07:39:07Z
dc.date.issued	2017
dc.identifier.uri	http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=104423010
dc.contributor.department	資訊管理學系	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	隨著科技的進步及大數據的浪潮，「資料」的重要性及實用性逐漸被人們所看重，因此許多的學者開始著墨於資料探勘領域，期待在眾多資料中找出其背後的價值並產生出許多相關應用，如使用分類器預測文章的所屬類別等。然而，對於分類器而言，若其訓練資料越能代表整體資料，則會使其所得到的訓練結果越好，而在分類器建立過程，會將訓練資料以人為方式貼上所屬標籤，但由於文章有長有短，並不是每筆資料貼上標籤所花費的成本都相同。而本研究著重於在成本下以非監督式學習進行樣本選取之過程，在實驗中給予每筆資料其挑選成本，並限制訓練資料最終所挑選之總成本，而本論文使用了Bisecting K-means及Hierarchical Clustering兩種演算法，並以最佳點及成本考量下最佳點兩種方法去挑選資料，將這些訓練資料透過五種不同的分類器進行建模，來衡量所挑選資料所建立的分類器之分類結果。最終在實驗結果證明本論文所提出之方法在五種不同分類器中，與隨機挑選法相比而言，所得資料在建立分類器模型時，皆有其相對表現較好之方法，而透過本論文之方法，可以在成本限制下，從尚未擁有類別標籤的資料中選出較具代表性的資料，若將這些資料交給專家進行類別標示，即可訓練出更好的分類模型，大幅的降低類別標示的成本。	zh_TW
dc.description.abstract	With advances in technology and the rise of big data, the importance and practical value of data have become widely recognized. Many researchers have therefore turned to data mining, hoping to uncover the value hidden in large data sets and to build applications on it, such as using classifiers to predict the category of an article. For a classifier, the more representative the training data are of the whole data set, the better the training result. Building a classifier requires labeling the training data manually, and because articles vary in length, the labeling cost is not the same for every instance. This study focuses on selecting samples with unsupervised learning under a cost constraint: each instance is assigned a selection cost, and the total cost of the selected training data is limited. The thesis uses two algorithms, Bisecting K-means and Hierarchical Clustering, and selects data in two ways: by best point and by best point under cost considerations. The selected training data are then used to build models with five different classifiers in order to measure the classification performance obtained from the selected data. The experimental results show that, for each of the five classifiers, at least one of the proposed methods performs better than random selection. With the proposed methods, more representative data can be selected from unlabeled data without exceeding the cost limit. If these data are handed to experts for labeling, a better classification model can be trained and the labeling cost can be reduced substantially.	en_US
DC.subject	文件分類	zh_TW
DC.subject	非監督式樣本選取	zh_TW
DC.subject	成本	zh_TW
DC.subject	Document classification	en_US
DC.subject	Unsupervised instance selection	en_US
DC.subject	Cost	en_US
DC.title	在成本限制下，以非監督式學習進行樣本選取之研究	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis 104423010: Full Metadata Record