Complete metadata record for thesis 108423062

DC Field | Value | Language
dc.contributor | 資訊管理學系 | zh_TW
dc.creator | 何逸家 | zh_TW
dc.creator | Yi-Jia He | en_US
dc.date.accessioned | 2021-08-11T07:39:07Z |
dc.date.available | 2021-08-11T07:39:07Z |
dc.date.issued | 2021 |
dc.identifier.uri | http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=108423062 |
dc.contributor.department | 資訊管理學系 | zh_TW
dc.description | 國立中央大學 | zh_TW
dc.description | National Central University | en_US
dc.description.abstract | 半監督式學習(Semi-supervised Learning)有效地使用無標籤資料來提升模型的表現,在當前研究中展現了其在少量標籤資料困境(Low-data Regime)下的卓越表現;當前使用無標籤資料的方法主要為一致性訓練(Consistency Training)並搭配合適的人工標籤作為訓練目標,然而在自然語言處理中,當前方法的一致性訓練的設計仍不夠嚴謹,且人工標籤也不夠具有意義,使模型不僅無法學習到足夠的無標籤資料資訊,導致容易在標籤資料上過度學習(Overfitting),甚至會因為品質差的人工標籤而對模型造成負面影響;因此本研究提出以弱擴增(Weak Augmentation)、強擴增(Strong Augmentation)後的無標籤資料並搭配閾值,建立更嚴謹的一致性訓練過程,也透過混合擴增結合使用標籤資料與無標籤資料,讓模型更好地避免在標籤資料上過度學習,而實驗結果證實本研究提出方法在僅使用每個類別各10筆標籤資料的情況下,於AG NEWS文本分類資料集上取得87.88%的準確率,高於當前方法1.58%,並在Yahoo! Answers資料集上取得67.3%的準確率,高於當前方法3.5%。 | zh_TW
dc.description.abstract | Semi-supervised learning effectively uses unlabeled data to improve a deep learning model's performance, and recent research has shown its strength in the low-data regime. The prevailing way to use unlabeled data is consistency training with a suitable artificial label as the training target. In natural language processing, however, current consistency-training designs are not rigorous enough and the training targets are not meaningful enough: the model cannot extract sufficient information from unlabeled data, easily overfits the labeled data, and can even be harmed by poor-quality training targets. This work therefore builds a more rigorous consistency-training process that pairs weakly and strongly augmented unlabeled data with confidence-based masking, and additionally mixes labeled and unlabeled data so that both can be used together, helping the model avoid overfitting on the labeled data. With only 10 labeled examples per class, our approach outperforms current methods on two text classification benchmarks, reaching 87.88% accuracy on AG NEWS (1.58% above the current state of the art) and 67.3% accuracy on Yahoo! Answers (3.5% above). | en_US
dc.subject | 半監督式學習 | zh_TW
dc.subject | 一致性訓練 | zh_TW
dc.subject | 資料擴增 | zh_TW
dc.subject | 文本分類 | zh_TW
dc.subject | 自然語言處理 | zh_TW
dc.subject | Semi-supervised Learning | en_US
dc.subject | Consistency Training | en_US
dc.subject | Data Augmentation | en_US
dc.subject | Text Classification | en_US
dc.subject | Natural Language Processing | en_US
dc.title | 透過弱擴增與強擴增輔助半監督式學習中的文本分類 | zh_TW
dc.language.iso | zh-TW | zh-TW
dc.type | 博碩士論文 | zh_TW
dc.type | thesis | en_US
dc.publisher | National Central University | en_US
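The core mechanism described in the abstract (a pseudo-label taken from a weakly augmented view, used as the training target for the strongly augmented view, and kept only when its confidence clears a threshold) can be sketched roughly as below. This is a hypothetical FixMatch-style illustration, not the thesis code: the function names, the 0.95 threshold, and the toy logits are all assumptions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_consistency_loss(weak_logits, strong_logits, threshold=0.95):
    """Pseudo-label each example from its weakly augmented view, score the
    strongly augmented view with cross-entropy against that pseudo-label,
    and keep only examples whose confidence clears the threshold."""
    probs = softmax(weak_logits)
    confidence = probs.max(axis=-1)     # max class probability per example
    pseudo = probs.argmax(axis=-1)      # hard pseudo-label per example
    mask = confidence >= threshold      # confidence-based masking
    strong_probs = softmax(strong_logits)
    nll = -np.log(strong_probs[np.arange(len(pseudo)), pseudo] + 1e-12)
    return (nll * mask).mean(), mask

# Toy logits for two unlabeled examples: the first view is confident,
# the second is not, so only the first contributes to the loss.
weak = np.array([[5.0, 0.0, 0.0],
                 [0.3, 0.2, 0.1]])
strong = np.array([[4.0, 0.1, 0.0],
                   [0.2, 0.3, 0.1]])
loss, mask = masked_consistency_loss(weak, strong)
print(mask.tolist())  # [True, False]
```

In a real training loop this quantity would be a differentiable loss in a deep learning framework, added to the supervised cross-entropy on the labeled batch; the numpy version above only shows how the mask filters low-confidence pseudo-labels out of the consistency term.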
