NCU Institutional Repository (中大機構典藏) - theses, past exam papers, journal articles, and research projects: Item 987654321/86649


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/86649


    Title: Assisting Text Classification in Semi-supervised Learning via Weak and Strong Augmentation (透過弱擴增與強擴增輔助半監督式學習中的文本分類)
    Author: He, Yi-Jia (何逸家)
    Contributor: Department of Information Management
    Keywords: Semi-supervised Learning; Consistency Training; Data Augmentation; Text Classification; Natural Language Processing
    Date: 2021-08-11
    Uploaded: 2021-12-07 13:04:35 (UTC+8)
    Publisher: National Central University
    Abstract: Semi-supervised learning uses unlabeled data to improve model performance and has shown strong results in the low-data regime in current research. The prevailing approach to exploiting unlabeled data is consistency training with a suitable artificial (pseudo) label as the training target. In natural language processing, however, current consistency-training designs are not rigorous enough and the artificial labels are not sufficiently meaningful: the model fails to extract enough information from the unlabeled data, tends to overfit the labeled data, and can even be harmed by poor-quality artificial labels. This work therefore constructs a more rigorous consistency-training process using weakly and strongly augmented unlabeled data together with confidence-based masking, and additionally mixes labeled and unlabeled data through mixed augmentation so that the model better avoids overfitting the labeled data. With only 10 labeled examples per class, the proposed method achieves 87.88% accuracy on the AG NEWS text classification dataset (1.58% above current methods) and 67.3% accuracy on Yahoo! Answers (3.5% above current methods).
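The core mechanism the abstract describes (pseudo-labels from weakly augmented inputs, confidence-based masking, and a consistency loss on strongly augmented predictions) can be sketched as follows. This is a minimal NumPy illustration of the general technique, not the thesis's actual implementation; the function name, threshold value, and toy logits are all illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(weak_logits, strong_logits, threshold=0.95):
    """Consistency loss with confidence-based masking.

    Pseudo-labels are taken from predictions on weakly augmented
    inputs; only examples whose maximum class probability exceeds
    `threshold` contribute. The loss is the cross-entropy of the
    strongly augmented predictions against those pseudo-labels.
    """
    weak_probs = softmax(weak_logits)             # treated as fixed targets
    pseudo = weak_probs.argmax(axis=-1)           # hard pseudo-labels
    mask = weak_probs.max(axis=-1) >= threshold   # confidence mask
    strong_log_probs = np.log(softmax(strong_logits) + 1e-12)
    per_example = -strong_log_probs[np.arange(len(pseudo)), pseudo]
    # Average over confidently pseudo-labeled examples only.
    return float((per_example * mask).sum() / max(mask.sum(), 1))

# Toy example: the first unlabeled example is confidently predicted
# (class 0) and contributes to the loss; the second is uncertain and
# is masked out by the threshold.
weak = np.array([[5.0, 0.0, 0.0], [1.0, 0.9, 0.8]])
strong = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]])
loss = consistency_loss(weak, strong, threshold=0.9)
```

Raising the threshold trades coverage for pseudo-label quality: fewer unlabeled examples pass the mask, but those that do carry more reliable targets, which is the abstract's motivation for confidence-based masking.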
    Appears in Collections: [Graduate Institute of Information Management] Theses & Dissertations

    Files in This Item:

    index.html (HTML, 0Kb)


    All items in NCUIR are protected by copyright.
