DC Field | Value | Language |
DC.contributor | 資訊管理學系 | zh_TW |
DC.creator | 黃軍儒 | zh_TW |
DC.creator | Chun-Ru Huang | en_US |
dc.date.accessioned | 2020-07-16T07:39:07Z | |
dc.date.available | 2020-07-16T07:39:07Z | |
dc.date.issued | 2020 | |
dc.identifier.uri | http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=107423005 | |
dc.contributor.department | 資訊管理學系 | zh_TW |
DC.description | 國立中央大學 | zh_TW |
DC.description | National Central University | en_US |
dc.description.abstract | 類別不平衡問題會因為各類別分布的高度不平均而產生。在現實生活中,不平衡文字分類任務時常發生,而文本分類器通常因為缺乏次要類別訓練數據而過度擬合於主要類別,導致在次要類別的分類表現不佳。
因此在本論文中,我們提出用各種不同的文字生成模型(MLE, SeqGAN, VAE, GPT-2)生成合成文本,並且資料增益在次要類別上。在我們的實驗中,我們將探討合成文本和真實資料在資料增益上的差距表現,以及比較合成文本與傳統的採樣方法、同義詞替換之方法之間的有效性,不同的文字表達法也將會被納入我們的觀察當中。
從我們的結果顯示,基於文字生成模型生成的合成文本用於資料增益可以解決類別不平衡的文字分類問題以及缺乏次要類別資料的問題。我們發現我們所提出的方法比先前的過採樣方法(如SMOTE)及同義詞替換方法的表現來的好。
再者,我們採用長文本及短文本這兩種角度觀察,發現不同的文字生成模型會依據其輸入的資料量大小及文本的長度,其增益的表現會有所不同。 | zh_TW |
dc.description.abstract | Class imbalance arises when class distributions are heavily skewed. It is common in real-world text classification tasks. Text classifiers usually underperform on minority classes because of a lack of training data, which is undesirable, especially when the minority classes are the ones of interest.
We propose applying different text generation models (MLE, SeqGAN, VAE, GPT-2) to generate synthetic text for data augmentation of minority classes. In our experiments, we evaluate the effectiveness of synthetic text against a traditional sampling method, a synonym replacement method, and real-world text in terms of classification performance. Various text representations are also discussed.
Our results show that synthetic text produced by text generation models can be used for data augmentation to address both class imbalance and insufficient minority-class data. Our approach outperforms a previous oversampling method (SMOTE) and the synonym replacement method. We also find that different text generation models perform differently depending on dataset size and sentence length. | en_US |
DC.subject | 自然語言生成 | zh_TW |
DC.subject | 類別不平衡 | zh_TW |
DC.subject | 文字分類 | zh_TW |
DC.subject | 資料增益 | zh_TW |
DC.subject | Natural Language Generation | en_US |
DC.subject | class imbalance | en_US |
DC.subject | text classification | en_US |
DC.subject | data augmentation | en_US |
DC.title | 人工合成文本之資料增益於不平衡文字分類問題 | zh_TW |
dc.language.iso | zh-TW | zh-TW |
DC.title | Data Augmentation for Imbalanced Classification with Synthetic Text | en_US |
DC.type | 博碩士論文 | zh_TW |
DC.type | thesis | en_US |
DC.publisher | National Central University | en_US |