Thesis 108423039: Detailed Record




Author Hsiao-Fu Lin (林筱芙)   Department Information Management
Title Diversified Data Augmentation for Class Imbalance Datasets and Small Datasets on Text Classification
(多樣化的資料增益於類別不平衡及小資料集問題下的文本分類任務)
Files Full text available for browsing in the system after 2026-7-21
Abstract Text generation is an important task in natural language processing (NLP). Text generation models fall into two categories: maximum likelihood estimation (MLE)-based models and generative adversarial network (GAN)-based models. MLE-based models tend to overproduce high-frequency words and repeat sentences, while GAN-based models suffer from mode collapse. Recent studies have proposed models that alleviate these problems and encourage text generators to produce more diverse and interesting sentences.
Determinantal point processes (DPPs) are an important class of probabilistic models for diversity in machine learning and deep learning. Previous studies have used DPPs to improve model diversity in many deep learning applications, such as extractive summarization, recommender systems, mini-batch selection for SGD, and image generation.
This study therefore embeds DPPs into VAE and SeqGAN to perform diversified text generation, and measures performance with several diversity metrics (reverse perplexity, distinct n-gram, and TF cosine similarity). We also apply the DPP-based text generation models to the downstream task of text classification under class imbalance or with small training sets. We compare DPP-VAE and DPP-SeqGAN with other data augmentation models (VAE, SeqGAN, EDA, GPT-2, IRL) and examine the correlation between diversity and classification performance, to investigate whether more diverse generated text helps the classifier train better.
The experimental results show that DPPs help the vanilla VAE and SeqGAN generate more diverse data, with better scores on the diversity metrics. DPP-VAE even achieves the best results on the long-text datasets. We also find that, although the final results are not as good as those obtained by directly undersampling the majority class to balance the classes, diverse generated data does improve classification performance in the class imbalance scenario. In that scenario, distinct n-gram and TF cosine similarity correlate well with the classification metrics. In the small-dataset scenario, however, the augmentation models provide no significant benefit, and diversity shows little correlation with classification performance. We believe that, in the small-dataset scenario, generated text that preserves the class label benefits text classification more than diverse generated text does.
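Two of the diversity metrics named in the abstract can be illustrated with a short sketch. This record does not give the exact definitions used in the thesis, so the code below rests on assumptions: whitespace tokenization, the common distinct-n definition (unique n-grams divided by total n-grams), and TF cosine similarity taken as the mean pairwise cosine between raw term-frequency vectors. The function names and sample sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def distinct_n(sentences, n):
    """Unique n-grams divided by total n-grams (higher = more diverse)."""
    ngrams = []
    for sentence in sentences:
        tokens = sentence.split()  # assumption: whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def tf_cosine(a, b):
    """Cosine similarity between raw term-frequency vectors of two sentences."""
    ta, tb = Counter(a.split()), Counter(b.split())
    dot = sum(ta[w] * tb[w] for w in ta.keys() & tb.keys())
    norm = sqrt(sum(v * v for v in ta.values())) * sqrt(sum(v * v for v in tb.values()))
    return dot / norm if norm else 0.0


def mean_pairwise_tf_cosine(sentences):
    """Average similarity over all sentence pairs (lower = more diverse)."""
    pairs = list(combinations(sentences, 2))
    return sum(tf_cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


# Hypothetical generated samples: one repetitive set, one varied set.
repetitive = ["the movie was good", "the movie was good", "the movie was fine"]
varied = ["the movie was good", "a dull and slow plot", "great acting throughout"]

for name, samples in (("repetitive", repetitive), ("varied", varied)):
    print(name, distinct_n(samples, 1), mean_pairwise_tf_cosine(samples))
```

Under these assumptions the varied set scores higher on distinct-1 and lower on mean pairwise TF cosine similarity than the repetitive set, which is the direction a more diverse generator would be expected to move both metrics.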
Keywords ★ diversified text generation
★ determinantal point processes (DPPs)
★ data augmentation
★ text classification
Table of Contents Chinese Abstract
Abstract
Acknowledgements
List of Figures
List of Tables
1. Introduction
1.1. Background
1.2. Motivation
1.3. Objectives
1.4. Thesis Organization
2. Related Works
2.1. Diversified Text Generation
2.1.1 Seq2Seq-MMI
2.1.2 Diversity-Promoting GAN (DP-GAN)
2.1.3 Inverse Reinforcement Learning (IRL)
2.1.4 Diversity Regularized Autoencoders (DRAE)
2.2. Determinantal Point Processes (DPPs)
2.2.1 Generative Determinantal Point Processes (GDPP)
2.3. Data augmentation in NLP
2.3.1 EDA
2.3.2 GPT-2
2.3.3 Generative model
2.4. Evaluation metrics
2.4.1 Diversity evaluation metrics
2.4.2 Classification evaluation metrics
2.5. Chapter Summary
3. Methodology
3.1. Datasets
3.2. Experimental settings
3.2.1 Preprocessing
3.2.2 Model setting
3.3. Experiment design
3.3.1 Experiment 1: DPP-VAE, DPP-SeqGAN
3.3.2 Experiment 2: apply DPP-VAE and DPP-SeqGAN on Data Augmentation
4. Experiment Results
4.1. Experiment 1: DPP-VAE, DPP-SeqGAN
4.1.1 Results
4.1.2 Question 1 answering
4.2. Experiment 2: apply DPP-VAE and DPP-SeqGAN on Data Augmentation
4.2.1 Results of class imbalance scenario
4.2.2 Results of small datasets scenario
4.2.3 Question 2 answering
5. Conclusion
5.1. Contributions
5.2. Limitations
5.3. Future work
Reference
Advisor Shih-Wen Ke (柯士文)   Approval Date 2021-8-19