Data augmentation is one of the most common techniques for improving downstream task performance when training data is insufficient. Compared with image data, however, there is little agreement on how to augment text. General-purpose transformation rules such as flipping, rotation, and cropping are easy to define for images, whereas reordering the contents of a passage can easily change its original meaning. In this study, we propose a flexible augmentation framework, SDA (Semantic-based Data Augmentation), which aims to improve performance on text classification tasks. SDA uses the existing labeled documents to sample, from a large pool of external unlabeled documents, those that are semantically similar to them, and adds each sampled document to the training set under the same label as the labeled document it resembles. Our experiments demonstrate the usefulness of semantically similar unlabeled text for downstream classification. Within the same framework, we use advanced text representations trained with different training objectives. We first investigate how well each representation method captures semantics and then evaluate the effect of adding different quantities of augmented samples to the training set. SDA is conceptually simple yet highly effective at improving downstream classification performance. On six of seven classification datasets, it clearly outperforms other common augmentation methods. Moreover, SDA not only outperforms these augmentation baselines, but also achieves classification performance comparable to that obtained by adding real labeled documents to the training set.
Through experiments and analysis, we show that SDA can be applied to improve classifier performance on a wide range of classification tasks, such as sentiment analysis and opinion polarity detection, even when training documents are severely limited.
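
The core idea of labeling semantically similar unlabeled documents can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes document embeddings have already been computed by some representation method, it replaces the sampling step with a simple top-k cosine-similarity selection, and the function name `augment_by_similarity` and the parameter `k` are hypothetical.

```python
import numpy as np


def augment_by_similarity(labeled_vecs, labels, unlabeled_vecs, k=2):
    """Return sorted (unlabeled_index, pseudo_label) pairs.

    Assumption: each row is a non-zero embedding of one document.
    For every labeled document, the k most similar unlabeled documents
    (by cosine similarity) receive that document's label.
    """
    labeled = np.asarray(labeled_vecs, dtype=float)
    unlabeled = np.asarray(unlabeled_vecs, dtype=float)

    # L2-normalise rows so that a dot product equals cosine similarity.
    labeled = labeled / np.linalg.norm(labeled, axis=1, keepdims=True)
    unlabeled = unlabeled / np.linalg.norm(unlabeled, axis=1, keepdims=True)

    sims = labeled @ unlabeled.T  # shape: (n_labeled, n_unlabeled)

    chosen = {}  # unlabeled index -> (label, similarity)
    for i, label in enumerate(labels):
        top_k = np.argsort(-sims[i])[:k]  # indices of the k highest similarities
        for j in top_k:
            j = int(j)
            # If several labeled documents select the same unlabeled one,
            # keep the label of the closest match.
            if j not in chosen or sims[i, j] > chosen[j][1]:
                chosen[j] = (label, sims[i, j])

    return sorted((j, label) for j, (label, _) in chosen.items())


if __name__ == "__main__":
    # Toy 2-D "embeddings" used purely for illustration.
    labeled = [[1.0, 0.0], [0.0, 1.0]]
    labels = ["pos", "neg"]
    unlabeled = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.6]]
    print(augment_by_similarity(labeled, labels, unlabeled, k=1))
```

The selected documents, together with their assigned labels, would then be appended to the original training set before the classifier is trained. Varying `k` corresponds to adding different quantities of augmented samples.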