A Study of Instance Selection via Unsupervised Learning under Cost Constraints

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/74818

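Title:	A Study of Instance Selection via Unsupervised Learning under Cost Constraints (在成本限制下，以非監督式學習進行樣本選取之研究)
Author:	Zhang, Yi-Jun (張詒鈞)
Contributors:	Department of Information Management
Keywords:	Document classification; Unsupervised instance selection; Cost
Date:	2017-08-17
Uploaded:	2017-10-27 14:40:56 (UTC+8)
Publisher:	National Central University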
Abstract:	With advances in technology and the rise of big data, the importance and usefulness of data are increasingly recognized. Many researchers have therefore turned to data mining, hoping to uncover the value hidden in large data collections and to build applications on it, such as using classifiers to predict the category of an article. For a classifier, the more representative the training data are of the whole data set, the better the training result. When a classifier is built, the training data are labeled manually, and because articles vary in length, the cost of labeling is not the same for every instance. This study focuses on selecting samples by unsupervised learning under a cost constraint: in the experiments, each instance is assigned a selection cost, and the total cost of the selected training data is limited. The thesis uses two clustering algorithms, Bisecting K-means and Hierarchical Clustering, and two selection methods, best points and best points under cost considerations. The selected training data are then used to build models with five different classifiers in order to measure the classification performance of the resulting classifiers. The experimental results show that, compared with random selection, for each of the five classifiers at least one of the proposed methods yields a better-performing model.
With the proposed method, more representative instances can be selected from unlabeled data without exceeding the cost limit. If these instances are handed to experts for labeling, a better classification model can be trained and the cost of labeling is greatly reduced.
Appears in Collections:	[Graduate Institute of Information Management] Theses and Dissertations

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	383	View/Open

All items in NCUIR are protected by original copyright.
