Data Augmentation for Imbalanced Classification with Synthetic Text

Name

Chun-Ru Huang (黃軍儒)

Department

Department of Information Management

Thesis Title

Data Augmentation for Imbalanced Classification with Synthetic Text
(Original Chinese title: 人工合成文本之資料增益於不平衡文字分類問題)


Files

Full text available for browsing in the system after 2025-7-10

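Abstract (Chinese)

The class imbalance problem arises when the distribution across classes is highly uneven. Imbalanced text classification tasks occur frequently in practice, and text classifiers tend to overfit the major class because training data for the minor class is scarce, which leads to poor classification performance on the minor class.
In this thesis, we therefore propose generating synthetic text with several text generation models (MLE, SeqGAN, VAE, GPT-2) and using it to augment the minor class. In our experiments, we examine the performance gap between synthetic text and real data when each is used for augmentation, and we compare the effectiveness of synthetic text with traditional sampling methods and synonym replacement. Different text representations are also included in our observations.
Our results show that augmenting data with synthetic text produced by text generation models can address class-imbalanced text classification as well as the shortage of minor-class data. We find that the proposed approach outperforms earlier oversampling methods (such as SMOTE) and synonym replacement.
Furthermore, examining both long and short texts, we find that the augmentation performance of the different text generation models varies with the amount of input data and the length of the text.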
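The augmentation step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: `augment_minor_class` and `make_resampling_generator` are hypothetical names, and the stand-in generator simply resamples real minor-class texts (plain random oversampling). In the thesis's setting, the `generate` callable would instead wrap a trained text generation model such as MLE, SeqGAN, VAE, or GPT-2.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple


def augment_minor_class(
    texts: List[str],
    labels: List[int],
    generate: Callable[[int], List[str]],
) -> Tuple[List[str], List[int]]:
    """Add generated samples to the rarest class until it matches the largest class."""
    counts = Counter(labels)
    minor_label = min(counts, key=counts.get)
    # Number of extra minor-class samples needed to reach balance.
    deficit = max(counts.values()) - counts[minor_label]
    synthetic = generate(deficit)[:deficit]
    return texts + synthetic, labels + [minor_label] * len(synthetic)


def make_resampling_generator(
    minor_texts: List[str], seed: int = 0
) -> Callable[[int], List[str]]:
    """Hypothetical stand-in for a text generation model.

    It only draws existing minor-class texts at random, which is
    random oversampling rather than true text generation.
    """
    rng = random.Random(seed)
    return lambda n: [rng.choice(minor_texts) for _ in range(n)]


# Toy data: four major-class (0) texts and one minor-class (1) text.
texts = ["good", "fine", "great", "nice", "awful"]
labels = [0, 0, 0, 0, 1]

generator = make_resampling_generator(["awful"])
aug_texts, aug_labels = augment_minor_class(texts, labels, generator)
print(sorted(Counter(aug_labels).items()))  # [(0, 4), (1, 4)]
```

Passing the generator as a callable keeps the balancing logic independent of the model, so the same function could be reused to compare several generators on equal terms.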

Abstract (English)

Class imbalance exists when class distributions are heavily skewed. It is common in real-world text classification tasks. Text classifiers usually underperform on minor classes because of a lack of training data, which is undesirable, especially when the minor classes are the ones of interest.
We propose applying different text generation models (MLE, SeqGAN, VAE, GPT-2) to generate synthetic text for data augmentation on minor classes. In our experiments, we evaluate the effectiveness of synthetic text against a traditional sampling method, a synonym replacement method, and real-world text in terms of classification performance. Various text representations are also discussed.
Our results show that synthetic text produced by text generation models, when used for data augmentation, can address both the class imbalance problem and the problem of insufficient minority data. We found that our approach performs better than a previous oversampling method (SMOTE) and a synonym replacement method. We also found that different text generation models perform differently depending on dataset size and sentence length.
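The thesis's table of contents lists G-mean and the minor and major F1 scores as its evaluation metrics. The sketch below shows one common way to compute them, assuming G-mean is taken as the geometric mean of per-class recall (for two classes, the square root of sensitivity times specificity). The exact definitions used in the thesis are not given in this record, and the function names are illustrative.

```python
import math
from typing import Sequence


def recall(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    actual = sum(t == label for t in y_true)
    return tp / actual if actual else 0.0


def precision(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """Fraction of samples predicted as `label` whose true class is `label`."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    return tp / predicted if predicted else 0.0


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """Harmonic mean of precision and recall for a single class."""
    p = precision(y_true, y_pred, label)
    r = recall(y_true, y_pred, label)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def g_mean(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Geometric mean of the recall of every class present in y_true."""
    classes = sorted(set(y_true))
    recalls = [recall(y_true, y_pred, c) for c in classes]
    return math.prod(recalls) ** (1 / len(recalls))


# Toy example: class 0 is the major class, class 1 the minor class.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

major_f1 = f1_for_class(y_true, y_pred, 0)  # precision 3/4, recall 3/4
minor_f1 = f1_for_class(y_true, y_pred, 1)  # precision 1/2, recall 1/2
score = g_mean(y_true, y_pred)              # sqrt(3/4 * 1/2)
```

Unlike plain accuracy, G-mean drops to zero if any class is never recognised, which makes it more informative when the minor class is the one of interest.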

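Keywords (Chinese)

★ 自然語言生成 (natural language generation)
★ 類別不平衡 (class imbalance)
★ 文字分類 (text classification)
★ 資料增益 (data augmentation)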

Keywords (English)

★ Natural Language Generation
★ class imbalance
★ text classification
★ data augmentation

Table of Contents

Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of figures
List of tables
1. Introduction
1.1. Research Overview
1.2. Research Motivation
1.3. Research Objective
1.4. Thesis Structure
2. Related Work
2.1. Class Imbalance in Text Classification
2.1.1. Oversampling
2.1.2. Under-sampling
2.1.3. Synthetic Minority Over-sampling Technique (SMOTE)
2.2. Data Augmentation
2.2.1. Synonym replacement
2.2.2. Back-translation
2.2.3. MixMatch
2.2.4. Data Augmentation for non-text data
2.3. Text generation
2.3.1. Maximum likelihood estimation
2.3.2. Sequence Generative Adversarial Nets
2.3.3. Variational Auto-Encoders
2.3.4. GPT-2
2.4. Word representation
2.4.1. TF-IDF
2.4.2. Word2vec
2.4.3. GloVe
2.5. Evaluation Metrics
2.5.1. G-mean
2.5.2. Minor F1 score, Major F1 score
3. Experiment
3.1. Overview
3.2. Datasets
3.2.1. Quora Insincere Question
3.2.2. Twitter Sentiment analysis
3.2.3. Amazon Fine Food Review
3.2.4. Toxic comment
3.2.5. Yelp sentiment analysis
3.2.6. SST-2
3.3. Experimental Settings
3.3.1. Preprocessing
3.3.2. Generative models and Classifier model settings
3.4. Experiment Design
3.4.1. Experiment Ⅰ: Text generation for data augmentation
3.4.2. Compare with other data augmentation methods (SMOTE, EDA) and real text
3.4.3. Text classification
3.4.4. Experiment Ⅱ: Balanced an imbalance dataset
4. Results
4.1. Experiment Ⅰ
4.1.1. Short text dataset
4.1.2. Long text dataset
4.1.3. Answer our Questions
4.2. Experiment Ⅱ
4.2.1. Short text dataset
4.2.2. Long text dataset
4.2.3. Answer Questions
5. Conclusion
5.1. Summary
5.2. Contribution
5.3. Limitations
5.4. Future work
6. Reference

Advisor

Shih-Wen Ke (柯士文)

Approval Date

2020-7-16
