The Effectiveness of Multifaceted Approach to Class Imbalance Text Classification (探討使用多面向方法在文字不平衡資料集之分類問題影響)

DC field	Value	Language
DC.contributor	資訊管理學系	zh_TW
DC.creator	陳芃諭	zh_TW
DC.creator	Peng-Yu Chen	en_US
dc.date.accessioned	2020-07-20T07:39:07Z
dc.date.available	2020-07-20T07:39:07Z
dc.date.issued	2020
dc.identifier.uri	http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=107423059
dc.contributor.department	資訊管理學系	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	文字類別不平衡任務在許多情境與應用常常出現，例如: 垃圾郵件偵測、文本分類任務...等。處理類別不平衡問題時，往往都會採用重採樣方法(resampling techniques)，然而，處理類別不平衡問題時，需要考量到採納不同面向方法所帶來的影響。在本論文，我們觀察了不同面向對於文字不平衡資料集在分類上所帶來的影響，例如: 不同種的資料表示法(TF-IDF, Word2Vec, ELMo 以及 BERT), 重採樣方法(SMOTE)以及生成方法(VAE)在不同的類別不平衡比例。我們也納入多種分類器與上述方法做組合搭配，觀察差異為何。從實驗結果來看，我們可以推薦一個較佳的組合方法處理文字類別不平衡的資料集。ELMo, SMOTE和SVM會是適合處理文字不平衡資料集，然而當資料集的資料量越大時，TF-IDF, SMOTE和SVM會是較佳的組合結果。我們發現在處理文字不平衡資料集時，資料表示法、合成方法、生成方法、分類器、類別不平衡比例與資料量大小都是會互相影響。此外，比較分類器訓練在合成資料或是生成資料時，SMOTE的結果會比VAE來的較好，甚至在TF-IDF, SMOTE以及SVM此組合可以超越真實資料的結果。本論文中，我們採納TF-IDF和其他embedding方法，並且關注在SMOTE與VAE，以及比較合成資料、生成資料與原始資料。我們甚至觀察不同的類別不平衡比例與資料量大小所帶來的影響。	zh_TW
dc.description.abstract	Class imbalance arises in many text classification applications, such as text polarity classification, spam detection, and topic classification. Resampling techniques are commonly used to deal with class imbalance, but addressing the problem effectively takes a multifaceted approach. In this study, we investigate the effectiveness of different text representations (TF-IDF, Word2Vec, ELMo, and BERT), a resampling technique (SMOTE), and a generative technique (VAE) across various class imbalance ratios. We also evaluate how different classifiers perform with these techniques. From the experimental results, we derive a general recommendation for dealing with class imbalance in text classification: the combination of ELMo, SMOTE, and SVM is suitable for imbalanced text datasets, whereas for larger training sets the combination of TF-IDF, SMOTE, and SVM can be more suitable. We find that the factors involved in handling an imbalanced text dataset (data representation, synthetic method, generative method, classifier, class imbalance ratio, and training data size) all affect one another. Moreover, when classifiers are trained on synthetic data (SMOTE) versus generated data (VAE), SMOTE outperforms VAE, and the combination of TF-IDF, SMOTE, and SVM can even surpass the results obtained with the original data. In summary, we use TF-IDF and embedding methods as data representations, focus on SMOTE and VAE, compare the results on synthetic and generated data with those on the original data, and also consider the class imbalance ratio and the training data size.	en_US
DC.subject	類別不平衡	zh_TW
DC.subject	文字分類	zh_TW
DC.subject	SMOTE	zh_TW
DC.subject	機器學習	zh_TW
DC.subject	深度學習	zh_TW
DC.subject	class imbalance	en_US
DC.subject	text classification	en_US
DC.subject	SMOTE	en_US
DC.subject	machine learning	en_US
DC.subject	deep learning	en_US
DC.title	探討使用多面向方法在文字不平衡資料集之分類問題影響	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	The Effectiveness of Multifaceted Approach to Class Imbalance Text Classification	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Thesis 107423059: full metadata record