NCU Institutional Repository — Item 987654321/86649


    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/86649


    Title: 透過弱擴增與強擴增輔助半監督式學習中的文本分類;Assisting Text Classification in Semi-supervised Learning through Weak and Strong Augmentation
    Authors: 何逸家;He, Yi-Jia
    Contributors: 資訊管理學系;Department of Information Management
    Keywords: 半監督式學習;一致性訓練;資料擴增;文本分類;自然語言處理;Semi-supervised Learning;Consistency Training;Data Augmentation;Text Classification;Natural Language Processing
    Date: 2021-08-11
    Issue Date: 2021-12-07 13:04:35 (UTC+8)
    Publisher: 國立中央大學;National Central University
    Abstract: Semi-supervised learning uses unlabeled data to improve a deep learning model's performance and has shown outstanding results in the low-data regime in recent research. The prevailing approach to exploiting unlabeled data is consistency training paired with a suitable artificial label as the training target. In natural language processing, however, current consistency-training designs are not rigorous enough and the artificial labels are not meaningful enough: the model cannot extract sufficient information from the unlabeled data, easily overfits the labeled data, and can even be harmed by poor-quality artificial labels. This work therefore builds a more rigorous consistency-training process that pairs weakly augmented and strongly augmented views of unlabeled data with a confidence threshold (confidence-based masking), and additionally mixes labeled and unlabeled data through a mixing augmentation so that the model better avoids overfitting the labeled data. With only 10 labeled examples per class, the proposed approach achieves 87.88% accuracy on the AG NEWS text classification dataset, 1.58% above the current state of the art, and 67.3% accuracy on Yahoo! Answers, 3.5% above the current state of the art.
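    The confidence-masked consistency training described in the abstract follows the familiar weak/strong-augmentation pattern (as in FixMatch). Below is a minimal PyTorch-style sketch of that unlabeled-data loss, not the author's released code: `model` is assumed to map token-id batches to class logits, `weak_batch` and `strong_batch` are assumed to be the two augmented views of the same unlabeled texts, and the 0.95 threshold is a placeholder rather than the thesis's reported value.

    import torch
    import torch.nn.functional as F

    def unlabeled_loss(model, weak_batch, strong_batch, threshold=0.95):
        # Pseudo-label each example from its weakly augmented view;
        # no gradients flow through the labeling step.
        with torch.no_grad():
            weak_probs = F.softmax(model(weak_batch), dim=-1)
            confidence, pseudo_labels = weak_probs.max(dim=-1)
            # Confidence-based masking: keep only confident pseudo-labels.
            mask = (confidence >= threshold).float()

        # Train the strongly augmented view toward the pseudo-labels.
        strong_logits = model(strong_batch)
        per_example = F.cross_entropy(strong_logits, pseudo_labels,
                                      reduction="none")
        return (per_example * mask).mean()

    The mixing of labeled and unlabeled data that the abstract also mentions would add a supervised cross-entropy term and interpolate examples from both sets (MixUp-style); that step is omitted from this sketch.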
    Appears in Collections: [Graduate Institute of Information Management] Electronic Thesis & Dissertation

    Files in This Item:

    File: index.html (HTML, 0 KB)


    All items in NCUIR are protected by copyright, with all rights reserved.

