

    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/84036


    Title: 探討使用多面向方法在文字不平衡資料集之分類問題影響;The Effectiveness of Multifaceted Approach to Class Imbalance Text Classification
    Authors: 陳芃諭;Chen, Peng-Yu
    Contributors: Department of Information Management
    Keywords: class imbalance;text classification;SMOTE;machine learning;deep learning
    Date: 2020-07-20
    Issue Date: 2020-09-02 17:57:43 (UTC+8)
    Publisher: National Central University
    Abstract: 文字類別不平衡任務在許多情境與應用常常出現,例如: 垃圾郵件偵測、文本分類任務...等。處理類別不平衡問題時,往往都會採用重採樣方法(resampling techniques),然而,處理類別不平衡問題時,需要考量到採納不同面向方法所帶來的影響。在本論文,我們觀察了不同面向對於文字不平衡資料集在分類上所帶來的影響,例如: 不同種的資料表示法(TF-IDF, Word2Vec, ELMo 以及 BERT), 重採樣方法(SMOTE)以及生成方法(VAE)在不同的類別不平衡比例。我們也納入多種分類器與上述方法做組合搭配,觀察差異為何。
    從實驗結果來看,我們可以推薦一個較佳的組合方法處理文字類別不平衡的資料集。ELMo, SMOTE和SVM會是適合處理文字不平衡資料集,然而當資料集的資料量越大時,TF-IDF, SMOTE和SVM會是較佳的組合結果。
    我們發現在處理文字不平衡資料集時,資料表示法、合成方法、生成方法、分類器、類別不平衡比例與資料量大小都是會互相影響。此外,比較分類器訓練在合成資料或是生成資料時,SMOTE的結果會比VAE來的較好,甚至在TF-IDF, SMOTE以及SVM此組合可以超越真實資料的結果。
    本論文中,我們採納TF-IDF和其他embedding方法,並且關注在SMOTE與VAE,以及比較合成資料、生成資料與原始資料。我們甚至觀察不同的類別不平衡比例與資料量大小所帶來的影響。
    ;Class imbalance is present in many text classification applications, for example, text polarity classification, spam detection, topic classification and so on. Resampling techniques are commonly used to deal with class imbalance problems. However, it takes a multifaceted approach to effectively address the class imbalance problems. In this study, we investigate the effectiveness of different text representations (TF-IDF, Word2Vec, ELMo and BERT), resampling techniques (SMOTE) and generative techniques (VAE) on various class imbalance ratios. We also evaluate how different classifiers perform with these techniques.
    From the experiment results, we can devise a general recommendation for dealing with class imbalance in text classification. The combination of ELMo, SMOTE and SVM is suitable for dealing with the imbalance dataset. However, as the larger training data set is, the combination of TF-IDF, SMOTE and SVM could be more suitable.
    We find that the perspectives of dealing with the class imbalance dataset are affected to each other, like data representation, synthetic method, generative method, classifiers, class imbalance ratio and the training data size. Besides, comparing that the classifiers are trained with the synthetic data and generative data, SMOTE still outperforms than VAE. Even the result of the combination of TF-IDF, SMOTE and SVM can surpass the original data.
    In our study, we take TF-IDF and the embedding methods be the data representation in the experiment, and focus on SMOTE and VAE, also compare the result of synthetic data and generative data with original data. Even considering the class imbalance and training data size to be one of the perspectives in our study.
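    The oversampling technique the abstract centers on, SMOTE, synthesizes new minority-class samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below implements that core interpolation step in NumPy for illustration; the function name, toy data, and parameters are ours, not the thesis's (which would use a standard implementation such as imbalanced-learn's).

    ```python
    import numpy as np

    def smote_oversample(X_min, n_new, k=1, rng=None):
        """Core SMOTE step: for each new sample, pick a random minority
        point and interpolate toward one of its k nearest minority
        neighbours by a random factor in [0, 1)."""
        rng = np.random.default_rng(rng)
        new = []
        for _ in range(n_new):
            i = rng.integers(len(X_min))
            # Distances from point i to every other minority point.
            d = np.linalg.norm(X_min - X_min[i], axis=1)
            d[i] = np.inf  # exclude the point itself
            neighbours = np.argsort(d)[:k]
            j = rng.choice(neighbours)
            gap = rng.random()  # interpolation factor
            new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
        return np.array(new)

    # Toy 2-D minority class (imagine TF-IDF vectors reduced to 2 dims).
    X_min = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
    synthetic = smote_oversample(X_min, n_new=4, k=2, rng=0)
    print(synthetic.shape)  # (4, 2)
    ```

    Because each synthetic point lies on a segment between two real minority points, SMOTE stays inside the minority class's local region of feature space, which is why it pairs naturally with the TF-IDF and embedding representations studied here.
    
    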
    Appears in Collections: [Graduate Institute of Information Management] Master's and Doctoral Theses
