Data Augmentation for Imbalanced Classification with Synthetic Text (人工合成文本之資料增益於不平衡文字分類問題)

DC field	Value	Language
DC.contributor	資訊管理學系	zh_TW
DC.creator	黃軍儒	zh_TW
DC.creator	Chun-Ru Huang	en_US
dc.date.accessioned	2020-07-16T07:39:07Z
dc.date.available	2020-07-16T07:39:07Z
dc.date.issued	2020
dc.identifier.uri	http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=107423005
dc.contributor.department	資訊管理學系	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	類別不平衡問題會因為各類別分布的高度不平均而產生。在現實生活中，不平衡文字分類任務時常發生，而文本分類器通常因為缺乏次要類別訓練數據而過度擬合於主要類別，導致在次要類別的分類表現不佳。因此在本論文中，我們提出用各種不同的文字生成模型(MLE, SeqGAN, VAE, GPT-2)生成合成文本，並且資料增益在次要類別上。在我們的實驗中，我們將探討合成文本和真實資料在資料增益上的差距表現，以及比較合成文本與傳統的採樣方法、同義詞替換之方法之間的有效性，不同的文字表達法也將會被納入我們的觀察當中。從我們的結果顯示，基於文字生成模型生成的合成文本用於資料增益可以解決類別不平衡的文字分類問題以及缺乏次要類別資料的問題。我們發現我們所提出的方法比先前的過採樣方法(如SMOTE)及同義詞替換方法的表現來的好。再者，我們採用長文本及短文本這兩種角度觀察，發現不同的文字生成模型會依據其輸入的資料量大小及文本的長度，其增益的表現會有所不同。	zh_TW
dc.description.abstract	Class imbalance arises when class distributions are heavily skewed, and it is common in real-world text classification tasks. Text classifiers usually underperform on minority classes because of a lack of training data, which is undesirable, especially when the minority classes are the ones of interest. We propose applying different text generation models (MLE, SeqGAN, VAE, GPT-2) to generate synthetic text for data augmentation of minority classes. In our experiments, we evaluate the effectiveness of synthetic text against a traditional sampling method, a synonym replacement method, and real-world text in terms of classification performance. Various text representations are also discussed. Our results show that synthetic text produced by text generation models can, when used for data augmentation, address both class imbalance and the shortage of minority-class data. Our approach outperforms a previous oversampling method (SMOTE) and the synonym replacement method. We also find that the text generation models perform differently depending on dataset size and sentence length.	en_US
DC.subject	自然語言生成	zh_TW
DC.subject	類別不平衡	zh_TW
DC.subject	文字分類	zh_TW
DC.subject	資料增益	zh_TW
DC.subject	Natural Language Generation	en_US
DC.subject	class imbalance	en_US
DC.subject	text classification	en_US
DC.subject	data augmentation	en_US
DC.title	人工合成文本之資料增益於不平衡文字分類問題	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Data Augmentation for Imbalanced Classification with Synthetic Text	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis 107423005: complete metadata record