Abstract: Class imbalance arises in many text classification applications, such as text polarity classification, spam detection, and topic classification. Resampling techniques are commonly used to address class imbalance, but addressing it effectively requires considering several interacting factors. In this study, we investigate the effect of different text representations (TF-IDF, Word2Vec, ELMo, and BERT), a resampling technique (SMOTE), and a generative technique (VAE) on classification performance across various class imbalance ratios. We also evaluate how different classifiers perform in combination with these techniques. From the experimental results, we derive a general recommendation for handling class imbalance in text classification: the combination of ELMo, SMOTE, and SVM works well for imbalanced text datasets, whereas for larger training sets the combination of TF-IDF, SMOTE, and SVM tends to be more suitable. We find that the factors involved in handling imbalanced text data interact with one another: data representation, synthetic (oversampling) method, generative method, classifier, class imbalance ratio, and training-set size all influence the outcome. Moreover, when classifiers are trained on synthetic versus generated data, SMOTE still outperforms VAE, and the combination of TF-IDF, SMOTE, and SVM can even surpass training on the original data. In this work, we adopt TF-IDF and the embedding methods as data representations, focus on SMOTE and VAE, compare the results of synthetic and generated data against the original data, and additionally examine the influence of class imbalance ratio and training-set size.
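To make the SMOTE step concrete, the following is a minimal sketch of SMOTE-style oversampling as used in the recommended pipelines: synthetic minority samples are generated by interpolating between a minority point and one of its nearest minority-class neighbours. The feature vectors and all numeric values below are purely illustrative toy data (not from the thesis experiments); a real experiment would apply this to TF-IDF or embedding vectors of actual documents.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_oversample(X_min, n_new, k=2):
    """Generate n_new synthetic minority samples, SMOTE-style:
    pick a minority point, pick one of its k nearest minority
    neighbours, and sample a random point on the segment between them."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from sample i to all other minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf  # exclude the point itself
        neighbours = np.argsort(d)[:k]
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy minority-class feature vectors (illustrative TF-IDF-like values).
X_minority = np.array([[0.9, 0.1, 0.0],
                       [0.8, 0.2, 0.1],
                       [0.7, 0.0, 0.3]])

X_new = smote_oversample(X_minority, n_new=5)
print(X_new.shape)  # (5, 3): five new 3-dimensional synthetic samples
```

Each synthetic sample lies on a line segment between two existing minority samples, which is why SMOTE enlarges the minority region without simply duplicating points; in practice one would use a library implementation (e.g. imbalanced-learn's SMOTE) rather than this sketch.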