Thesis/Dissertation 107423005: Full Metadata Record

DC Field | Value | Language
DC.contributor | 資訊管理學系 | zh_TW
DC.creator | 黃軍儒 | zh_TW
DC.creator | Chun-Ru Huang | en_US
dc.date.accessioned | 2020-7-16T07:39:07Z |
dc.date.available | 2020-7-16T07:39:07Z |
dc.date.issued | 2020 |
dc.identifier.uri | http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=107423005 |
dc.contributor.department | 資訊管理學系 | zh_TW
DC.description | 國立中央大學 | zh_TW
DC.description | National Central University | en_US
dc.description.abstract | 類別不平衡問題會因為各類別分布的高度不平均而產生。在現實生活中,不平衡文字分類任務時常發生,而文本分類器通常因為缺乏次要類別訓練數據而過度擬合於主要類別,導致在次要類別的分類表現不佳。 因此在本論文中,我們提出用各種不同的文字生成模型(MLE, SeqGAN, VAE, GPT-2)生成合成文本,並且資料增益在次要類別上。在我們的實驗中,我們將探討合成文本和真實資料在資料增益上的差距表現,以及比較合成文本與傳統的採樣方法、同義詞替換之方法之間的有效性,不同的文字表達法也將會被納入我們的觀察當中。 從我們的結果顯示,基於文字生成模型生成的合成文本用於資料增益可以解決類別不平衡的文字分類問題以及缺乏次要類別資料的問題。我們發現我們所提出的方法比先前的過採樣方法(如SMOTE)及同義詞替換方法的表現來的好。 再者,我們採用長文本及短文本這兩種角度觀察,發現不同的文字生成模型會依據其輸入的資料量大小及文本的長度,其增益的表現會有所不同。 | zh_TW
dc.description.abstract | Class imbalance arises when class distributions are heavily skewed, and it is common in real-world text classification tasks. Text classifiers usually underperform on minority classes because of a lack of training data, which is undesirable, especially when the minority classes are the ones of interest. We propose applying different text generation models (MLE, SeqGAN, VAE, GPT-2) to generate synthetic text for data augmentation on minority classes. In our experiments, we evaluate the effectiveness of synthetic text against a traditional sampling method, a synonym replacement method, and real-world text in terms of classification performance. Various text representations are also examined. Our results show that using synthetic text from text generation models for data augmentation can address both the problem of class imbalance and the problem of insufficient minority-class data. We found that our approach outperforms a previous oversampling method (SMOTE) and the synonym replacement method. We also found that different text generation models perform differently depending on the dataset size and sentence length. | en_US
DC.subject | 自然語言生成 | zh_TW
DC.subject | 類別不平衡 | zh_TW
DC.subject | 文字分類 | zh_TW
DC.subject | 資料增益 | zh_TW
DC.subject | Natural Language Generation | en_US
DC.subject | class imbalance | en_US
DC.subject | text classification | en_US
DC.subject | data augmentation | en_US
DC.title | 人工合成文本之資料增益於不平衡文字分類問題 | zh_TW
dc.language.iso | zh-TW | zh-TW
DC.title | Data Augmentation for Imbalanced Classification with Synthetic Text | en_US
DC.type | 博碩士論文 | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US
