dc.description.abstract | Text generation is an important task in NLP. Text generative models can be divided into two categories: maximum likelihood estimation (MLE)-based models and generative adversarial network (GAN)-based models. However, MLE-based models still suffer from overproducing high-frequency words and repeating sentences, while GAN-based models suffer from mode collapse. Recently, several studies have proposed models to alleviate these problems, encouraging text generative models to produce diverse and interesting sentences.
On the other hand, Determinantal Point Processes (DPPs) are among the most important probability models for capturing diversity in machine learning and deep learning. Past studies have applied DPPs to many deep learning applications to improve model diversity, such as extractive summarization, recommender systems, mini-batch selection for SGD, and image generation.
Therefore, this study embeds DPPs into VAE and SeqGAN to perform diversified text generation and uses various diversity evaluation metrics (reverse perplexity, distinct n-gram, and TF cosine similarity) to measure performance. Additionally, we apply the DPP-based text generative models to the downstream task of text classification under class-imbalance or small-dataset scenarios. We compare DPP-VAE and DPP-SeqGAN with other data augmentation models (VAE, SeqGAN, EDA, GPT-2, and IRL) and examine the correlation between diversity and classification performance, further investigating whether diverse generated data can benefit classifier training.
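The distinct n-gram metric mentioned above can be sketched as follows (a minimal illustration; the tokenization and the choice of n used in the study are assumptions here):

```python
def distinct_n(sentences, n):
    """Ratio of unique n-grams to total n-grams across generated sentences.
    Higher values indicate more diverse generated text."""
    total, unique = 0, set()
    for sent in sentences:
        tokens = sent.split()  # assumed whitespace tokenization
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# Repeated sentences contribute no new n-grams, lowering the score
print(distinct_n(["the cat sat", "the cat sat"], 1))  # 0.5
```

TF cosine similarity is computed analogously over term-frequency vectors of sentence pairs, with lower average similarity indicating higher diversity.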
From the experimental results, we show that DPPs help the vanilla VAE and SeqGAN generate more diverse data, achieving better scores on the diversity evaluation metrics; DPP-VAE even achieves the best results on long-text datasets. Additionally, we find that although the final results are not as good as those obtained by directly downsampling the majority class to balance the training data across classes, diverse generated data can indeed bring a positive impact in the class-imbalance scenario, yielding better classification performance. Distinct n-gram and TF cosine similarity correlate well with the classification evaluation metrics in the class-imbalance scenario. However, the benefit of these data augmentation models is not significant in the small-dataset scenario, and the diversity scores show no correlation with classification performance. We conjecture that, compared with diverse generated data, within-class generated data brings a greater benefit to the text classification task in the small-dataset scenario. | en_US |