Selecting Data for Labeling under Cost/Resource Constraint


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/82308

Title:	Selecting Data for Labeling under Cost/Resource Constraint
Author:	陳彥良
Contributor:	Department of Information Management, National Central University
Keywords:	Classification; Unsupervised instance selection; Cost; Resource; Classifiers
Date:	2020-01-13
Uploaded:	2020-01-13 14:39:27 (UTC+8)
Publisher:	Ministry of Science and Technology
Abstract:	With advances in technology and the rise of big data, data is increasingly valued, and many researchers have turned to data mining to uncover the value hidden in large data sets and to build applications such as classifiers that predict the class of a data item. Two factors matter when selecting data to label for training a classifier. First, the more representative the selected training data are of the full data set, the better the trained classifier tends to perform. Second, unlabeled training data must be labeled manually by experts, and the cost or resources required often differ from one item to the next. This project therefore asks how, under a limit on total cost or total resource usage, to build a training set with a method that considers both the representativeness of the data and the cost/resources needed to label them, so that the classifier built from the labeled data achieves the best classification results. We make three different assumptions about the cost/resource constraint, which lead to three different problems. This is a three-year project, and each year addresses one of the problems below.
1. Every item has the same labeling cost. The problem becomes: when only K items can be labeled, which K items should be selected so that the resulting classifier has the best classification accuracy?
2. Items have different labeling costs. Let c_i be the labeling cost of item i and C the limit on total labeling cost. The problem becomes: which items should be selected, with total labeling cost not exceeding C, so that the resulting classifier has the best classification accuracy?
3. Labeling an item consumes different amounts of several resources (for example time, money, space, or instruments), and each resource has its own usage limit. The problem becomes: which items should be selected, without exceeding the limit on any resource, so that the resulting classifier has the best classification accuracy?
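
The three problems in the abstract differ only in the constraint placed on the selected set, so they can be expressed through one feasibility check. The following is a minimal Python sketch under stated assumptions, not the project's method: the per-item `scores` (a stand-in for "representativeness"), the `greedy_select` heuristic, and all names are hypothetical and introduced only for illustration.

```python
from typing import List, Sequence


def is_feasible(selected: Sequence[int],
                costs: Sequence[Sequence[float]],
                limits: Sequence[float]) -> bool:
    """Check that the selected items stay within every resource limit.

    costs[i][j] is the amount of resource j needed to label item i,
    limits[j] is the cap on resource j (both hypothetical inputs).
    """
    totals = [0.0] * len(limits)
    for i in selected:
        for j, amount in enumerate(costs[i]):
            totals[j] += amount
    return all(t <= lim for t, lim in zip(totals, limits))


def greedy_select(scores: Sequence[float],
                  costs: Sequence[Sequence[float]],
                  limits: Sequence[float]) -> List[int]:
    """Illustrative greedy heuristic (an assumption, not the project's algorithm).

    Items are visited in decreasing order of score per unit of summed cost,
    and an item is kept whenever adding it leaves the selection feasible.
    """
    order = sorted(range(len(scores)),
                   key=lambda i: scores[i] / max(sum(costs[i]), 1e-12),
                   reverse=True)
    chosen: List[int] = []
    for i in order:
        if is_feasible(chosen + [i], costs, limits):
            chosen.append(i)
    return chosen


scores = [0.9, 0.5, 0.4, 0.1]  # hypothetical representativeness scores

# Problem 1: uniform cost, at most K = 2 items (one resource, unit cost each).
pick_k = greedy_select(scores, [[1], [1], [1], [1]], [2])

# Problem 2: item i costs c_i, total cost must not exceed C = 3.
pick_budget = greedy_select(scores, [[4], [1], [1], [1]], [3])

# Problem 3: two resources (say, time and money), each with its own limit.
pick_multi = greedy_select(scores, [[1, 3], [1, 1], [2, 1], [1, 1]], [3, 3])
```

Problem 1 is the special case with a single resource, unit costs, and limit K; Problem 2 uses a single resource with costs c_i and limit C; Problem 3 uses one cost column per resource. The greedy rule is only a placeholder for whatever selection method is actually used, and it carries no optimality guarantee.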
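Relation:	Science and Technology Policy Research and Information Center, National Applied Research Laboratories
Appears in Collections:	[Department of Information Management] Research Projects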

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	259

All items in NCUIR are protected by the original copyright.
