Master's/Doctoral Thesis 104423010: Full Metadata Record

DC Field | Value | Language
DC.contributor | 資訊管理學系 (Department of Information Management) | zh_TW
DC.creator | 張詒鈞 | zh_TW
DC.creator | Yi-Jun Zhang | en_US
dc.date.accessioned | 2017-8-17T07:39:07Z
dc.date.available | 2017-8-17T07:39:07Z
dc.date.issued | 2017
dc.identifier.uri | http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=104423010
dc.contributor.department | 資訊管理學系 (Department of Information Management) | zh_TW
DC.description | 國立中央大學 | zh_TW
DC.description | National Central University | en_US
dc.description.abstract | 隨著科技的進步及大數據的浪潮,「資料」的重要性及實用性逐漸被人們所看重,因此許多的學者開始著墨於資料探勘領域,期待在眾多資料中找出其背後的價值並產生出許多相關應用,如使用分類器預測文章的所屬類別等。然而,對於分類器而言,若其訓練資料越能代表整體資料,則會使其所得到的訓練結果越好,而在分類器建立過程,會將訓練資料以人為方式貼上所屬標籤,但由於文章有長有短,並不是每筆資料貼上標籤所花費的成本都相同。 而本研究著重於在成本下以非監督式學習進行樣本選取之過程,在實驗中給予每筆資料其挑選成本,並限制訓練資料最終所挑選之總成本,而本論文使用了Bisecting K-means及Hierarchical Clustering兩種演算法,並以最佳點及成本考量下最佳點兩種方法去挑選資料,將這些訓練資料透過五種不同的分類器進行建模,來衡量所挑選資料所建立的分類器之分類結果。 最終在實驗結果證明本論文所提出之方法在五種不同分類器中,與隨機挑選法相比而言,所得資料在建立分類器模型時,皆有其相對表現較好之方法,而透過本論文之方法,可以在成本限制下,從尚未擁有類別標籤的資料中選出較具代表性的資料,若將這些資料交給專家進行類別標示,即可訓練出更好的分類模型,大幅的降低類別標示的成本。 | zh_TW
dc.description.abstract | With the progress of technology and the rise of big data, the importance and practical value of "data" have gradually gained recognition. Many scholars have therefore turned to the field of data mining, hoping to uncover the value hidden in large volumes of data and to develop related applications, such as using classifiers to predict the category of an article. For a classifier, the more representative the training data are of the whole data set, the better the training result. When a classifier is built, the training data are labeled manually, and because articles vary in length, the cost of labeling differs from one instance to another. This study focuses on selecting samples with unsupervised learning under a cost constraint: each instance is assigned a selection cost, and the total cost of the selected training data is limited. Two clustering algorithms, Bisecting K-means and Hierarchical Clustering, are used, and data are selected in two ways: best points, and best points under cost considerations. The selected training data are then used to build models with five different classifiers in order to evaluate the classification results obtained from the selected data. The experimental results show that, for each of the five classifiers, at least one of the proposed methods performs better than random selection. With the proposed methods, more representative data can be selected from unlabeled data without exceeding the cost limit. If these data are then handed to experts for labeling, a better classification model can be trained and the labeling cost can be reduced significantly. | en_US
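DC.subject | 文件分類 | zh_TW
DC.subject | 非監督式樣本選取 | zh_TW
DC.subject | 成本 | zh_TW
DC.subject | Document classification | en_US
DC.subject | Unsupervised instance selection | en_US
DC.subject | Cost | en_US
DC.title | 在成本限制下,以非監督式學習進行樣本選取之研究 [English translation: A Study of Instance Selection with Unsupervised Learning under Cost Constraints] | zh_TW
dc.language.iso | zh-TW | zh-TW
DC.type | 博碩士論文 | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US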
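
Note: the abstract describes selecting training instances from clusters under a total labeling budget. The following is only a minimal sketch of that idea, not the thesis's actual procedure. It assumes the cluster assignments have already been produced (for example by Bisecting K-means or hierarchical clustering), treats the "best point" as the member closest to its cluster centroid, and uses a hypothetical score `(1 + distance) * cost` for the cost-aware variant. The function name `select_under_budget`, its parameters, and the scoring rule are illustrative assumptions; none of them come from this record.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def select_under_budget(vectors, costs, clusters, budget, cost_aware=True):
    """Pick at most one representative per cluster without exceeding budget.

    vectors  : list of feature vectors, one per instance
    costs    : labeling cost of each instance (assumed known in advance)
    clusters : list of lists of instance indices (assumed precomputed)
    budget   : upper limit on the total cost of the selected instances
    """
    selected, spent = [], 0.0
    # Visit larger clusters first so they are represented before the
    # budget runs out (an assumption made for this sketch).
    for members in sorted(clusters, key=len, reverse=True):
        center = centroid([vectors[i] for i in members])

        def score(i):
            d = dist(vectors[i], center)
            # Hypothetical cost-aware score: close AND cheap is better.
            return (1.0 + d) * costs[i] if cost_aware else d

        # Take the best-scoring member that still fits in the budget.
        for i in sorted(members, key=score):
            if spent + costs[i] <= budget:
                selected.append(i)
                spent += costs[i]
                break
    return selected, spent


if __name__ == "__main__":
    vectors = [(0, 0), (0, 2), (0, 1), (10, 10), (10, 12)]
    costs = [1, 1, 5, 2, 4]
    clusters = [[0, 1, 2], [3, 4]]
    print(select_under_budget(vectors, costs, clusters, 6, cost_aware=False))
    print(select_under_budget(vectors, costs, clusters, 6, cost_aware=True))
```

In this toy input, the distance-only variant spends most of the budget on the single most central but expensive instance, while the cost-aware variant can afford one representative from each cluster.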

For questions about this thesis, please contact the Extension Services Section of the National Central University Library, TEL: (03)422-7151 ext. 57407, or by e-mail.