Diversified Data Augmentation for Class Imbalance Datasets and Small Datasets on Text Classification

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/86670

Title:	Diversified Data Augmentation for Class Imbalance Datasets and Small Datasets on Text Classification
Author:	Lin, Hsiao-Fu (林筱芙)
Contributor:	Department of Information Management
Keywords:	diversified text generation; DPPs; data augmentation; text classification
Date:	2021-08-19
Upload Time:	2021-12-07 13:06:16 (UTC+8)
Publisher:	National Central University
Abstract:	Text generation is an important task in natural language processing (NLP). Text generation models fall into two categories: maximum likelihood estimation (MLE)-based models and generative adversarial network (GAN)-based models. However, MLE-based models tend to overproduce high-frequency words and repeat sentences, while GAN-based models suffer from mode collapse. Recent work has proposed models that alleviate these problems and encourage text generators to produce more diverse and interesting sentences. Separately, determinantal point processes (DPPs) are an important probabilistic model for diversity in machine learning and deep learning. Earlier studies have used DPPs to improve model diversity in many deep learning applications, such as extractive summarization, recommender systems, mini-batch selection for SGD, and image generation.
Therefore, this study embeds DPPs into VAE and SeqGAN to perform diversified text generation, and measures performance with several diversity metrics (reverse perplexity, distinct n-gram, and TF cosine similarity). We also apply the DPP-based text generation models to the downstream task of text classification under class imbalance or with small training sets. We compare DPP-VAE and DPP-SeqGAN with other data augmentation models (VAE, SeqGAN, EDA, GPT-2, IRL) and examine the correlation between diversity and classification performance, to investigate whether more diverse generated text helps the classifier train better. The experimental results show that DPPs help the vanilla VAE and SeqGAN generate more diverse data, with better scores on the diversity metrics. DPP-VAE even achieves the best results on the long-text datasets. We also find that, although the final results are still not as good as directly undersampling the majority class to balance the classes, diverse generated data does improve classification performance in the class-imbalance scenario. In that scenario, distinct n-gram and TF cosine similarity correlate well with the classification metrics. However, the data augmentation models provide no significant help in the small-dataset scenario, where diversity scores show little correlation with classification performance. We believe that, in the small-dataset scenario, generated text that preserves the class label (within-class data) benefits text classification more than diverse generated text does.
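
Two of the diversity metrics named in the abstract, distinct n-gram and TF cosine similarity, can be illustrated with a minimal sketch. The definitions below are common ones and are an assumption here: the thesis may tokenize, normalize, or aggregate differently. The function names (`distinct_n`, `tf_cosine`, `mean_pairwise_tf_cosine`) and the whitespace tokenization are hypothetical choices made for illustration only.

```python
from collections import Counter
from itertools import combinations
import math


def distinct_n(sentences, n=2):
    """Ratio of unique n-grams to all n-grams in a set of generated sentences.

    Assumption: whitespace tokenization. Higher values mean more diverse text.
    """
    ngrams = []
    for sentence in sentences:
        tokens = sentence.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def tf_cosine(a, b):
    """Cosine similarity between the raw term-frequency vectors of two sentences."""
    tf_a, tf_b = Counter(a.split()), Counter(b.split())
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pairwise_tf_cosine(sentences):
    """Average TF cosine similarity over all sentence pairs.

    Lower values mean the sentences share fewer terms, i.e. more diversity.
    """
    pairs = list(combinations(sentences, 2))
    if not pairs:
        return 0.0
    return sum(tf_cosine(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy generated samples, invented for this example.
    repetitive = ["the cat sat", "the cat sat", "the cat sat"]
    varied = ["the cat sat", "a dog ran", "birds fly south"]
    for name, samples in [("repetitive", repetitive), ("varied", varied)]:
        print(name, distinct_n(samples, n=1), mean_pairwise_tf_cosine(samples))
```

Under these definitions a more diverse augmentation set scores higher on distinct n-gram and lower on mean pairwise TF cosine similarity, which is the direction in which the two metrics would be compared against classification performance.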
Appears in Collections:	[Graduate Institute of Information Management] Theses and Dissertations

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	93

All items in NCUIR are protected by original copyright.
