Thesis 107423059 Full Metadata Record

DC Field Value Language
DC.contributor 資訊管理學系 zh_TW
DC.creator 陳芃諭 zh_TW
DC.creator Peng-Yu Chen en_US
dc.date.accessioned 2020-7-20T07:39:07Z
dc.date.available 2020-7-20T07:39:07Z
dc.date.issued 2020
dc.identifier.uri http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=107423059
dc.contributor.department 資訊管理學系 zh_TW
DC.description 國立中央大學 zh_TW
DC.description National Central University en_US
dc.description.abstract 文字類別不平衡任務在許多情境與應用常常出現,例如: 垃圾郵件偵測、文本分類任務...等。處理類別不平衡問題時,往往都會採用重採樣方法(resampling techniques),然而,處理類別不平衡問題時,需要考量到採納不同面向方法所帶來的影響。在本論文,我們觀察了不同面向對於文字不平衡資料集在分類上所帶來的影響,例如: 不同種的資料表示法(TF-IDF, Word2Vec, ELMo 以及 BERT), 重採樣方法(SMOTE)以及生成方法(VAE)在不同的類別不平衡比例。我們也納入多種分類器與上述方法做組合搭配,觀察差異為何。 從實驗結果來看,我們可以推薦一個較佳的組合方法處理文字類別不平衡的資料集。ELMo, SMOTE和SVM會是適合處理文字不平衡資料集,然而當資料集的資料量越大時,TF-IDF, SMOTE和SVM會是較佳的組合結果。 我們發現在處理文字不平衡資料集時,資料表示法、合成方法、生成方法、分類器、類別不平衡比例與資料量大小都是會互相影響。此外,比較分類器訓練在合成資料或是生成資料時,SMOTE的結果會比VAE來的較好,甚至在TF-IDF, SMOTE以及SVM此組合可以超越真實資料的結果。 本論文中,我們採納TF-IDF和其他embedding方法,並且關注在SMOTE與VAE,以及比較合成資料、生成資料與原始資料。我們甚至觀察不同的類別不平衡比例與資料量大小所帶來的影響。 zh_TW
dc.description.abstract Class imbalance arises in many text classification applications, such as text polarity classification, spam detection, and topic classification. Resampling techniques are commonly used to deal with class imbalance, but addressing the problem effectively takes a multifaceted approach. In this study, we investigate the effectiveness of different text representations (TF-IDF, Word2Vec, ELMo, and BERT), a resampling technique (SMOTE), and a generative technique (VAE) under various class imbalance ratios. We also evaluate how different classifiers perform with these techniques. From the experimental results, we derive a general recommendation for dealing with class imbalance in text classification: the combination of ELMo, SMOTE, and SVM is suitable for imbalanced text datasets, but as the training set grows larger, the combination of TF-IDF, SMOTE, and SVM may be more suitable. We find that the factors involved in handling a class-imbalanced dataset (data representation, synthetic method, generative method, classifier, class imbalance ratio, and training data size) all affect one another. Moreover, when classifiers trained on synthetic data are compared with those trained on generated data, SMOTE outperforms VAE; the combination of TF-IDF, SMOTE, and SVM can even surpass the results obtained with the original data. In this study, we use TF-IDF and embedding methods as data representations, focus on SMOTE and VAE, and compare the results on synthetic and generated data with those on the original data. We also consider the class imbalance ratio and the training data size as factors. en_US
DC.subject 類別不平衡 zh_TW
DC.subject 文字分類 zh_TW
DC.subject SMOTE zh_TW
DC.subject 機器學習 zh_TW
DC.subject 深度學習 zh_TW
DC.subject class imbalance en_US
DC.subject text classification en_US
DC.subject SMOTE en_US
DC.subject machine learning en_US
DC.subject deep learning en_US
DC.title 探討使用多面向方法在文字不平衡資料集之分類問題影響 zh_TW
dc.language.iso zh-TW zh-TW
DC.title The Effectiveness of Multifaceted Approach to Class Imbalance Text Classification en_US
DC.type 博碩士論文 zh_TW
DC.type thesis en_US
DC.publisher National Central University en_US
