NCU Institutional Repository (中大機構典藏) - theses, past exam papers, journal articles, and research projects: Item 987654321/86649


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/86649


    Title: Assisting Text Classification in Semi-supervised Learning via Weak and Strong Augmentation (透過弱擴增與強擴增輔助半監督式學習中的文本分類)
    Author: He, Yi-Jia (何逸家)
    Contributor: Department of Information Management
    Keywords: Semi-supervised Learning; Consistency Training; Data Augmentation; Text Classification; Natural Language Processing
    Date: 2021-08-11
    Uploaded: 2021-12-07 13:04:35 (UTC+8)
    Publisher: National Central University
    Abstract: Semi-supervised learning uses unlabeled data to improve model performance and has shown strong results in the low-data regime in current research. The prevailing approach to exploiting unlabeled data is consistency training with a suitable artificial (pseudo) label as the training target. In natural language processing, however, current consistency-training designs are not rigorous enough and the artificial labels are not sufficiently meaningful: the model fails to extract enough information from the unlabeled data, tends to overfit the labeled data, and can even be harmed by poor-quality artificial labels. This work therefore constructs a more rigorous consistency-training process using weakly and strongly augmented unlabeled data together with confidence-based masking, and additionally mixes labeled and unlabeled data through mixed augmentation so that the model better avoids overfitting the labeled data. With only 10 labeled examples per class, the proposed method achieves 87.88% accuracy on the AG NEWS text classification dataset (1.58% above current methods) and 67.3% accuracy on Yahoo! Answers (3.5% above current methods).
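The core mechanism the abstract describes (pseudo-labels from weakly augmented inputs, confidence-based masking, and a consistency loss on strongly augmented predictions) can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the thesis's actual implementation; the function name, threshold value, and toy logits are all illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(weak_logits, strong_logits, threshold=0.95):
    """Consistency loss with confidence-based masking.

    Pseudo-labels are taken from predictions on weakly augmented
    inputs; only examples whose maximum class probability exceeds
    `threshold` contribute. The loss is the cross-entropy of the
    strongly augmented predictions against those pseudo-labels.
    """
    weak_probs = softmax(weak_logits)             # treated as fixed targets
    pseudo = weak_probs.argmax(axis=-1)           # hard pseudo-labels
    mask = weak_probs.max(axis=-1) >= threshold   # confidence mask
    strong_log_probs = np.log(softmax(strong_logits) + 1e-12)
    per_example = -strong_log_probs[np.arange(len(pseudo)), pseudo]
    # Average over confidently pseudo-labeled examples only.
    return float((per_example * mask).sum() / max(mask.sum(), 1))

# Toy example: the first unlabeled example is confidently predicted
# (class 0) and contributes to the loss; the second is uncertain and
# is masked out by the threshold.
weak = np.array([[5.0, 0.0, 0.0], [1.0, 0.9, 0.8]])
strong = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]])
loss = consistency_loss(weak, strong, threshold=0.9)
```

Raising the threshold trades coverage for pseudo-label quality: fewer unlabeled examples pass the mask, but those that do carry more reliable targets, which is the abstract's motivation for confidence-based masking.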
    Appears in Collections: [Graduate Institute of Information Management] Theses & Dissertations

    Files in This Item:

    index.html (HTML, 0Kb)


    All items in NCUIR are protected by copyright.
