Complete metadata record for thesis 108423062

DC Field | Value | Language
dc.contributor | 資訊管理學系 | zh_TW
dc.creator | 何逸家 | zh_TW
dc.creator | Yi-Jia He | en_US
dc.date.accessioned | 2021-08-11T07:39:07Z |
dc.date.available | 2021-08-11T07:39:07Z |
dc.date.issued | 2021 |
dc.identifier.uri | http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=108423062 |
dc.contributor.department | 資訊管理學系 | zh_TW
dc.description | 國立中央大學 | zh_TW
dc.description | National Central University | en_US
dc.description.abstract | 半監督式學習(Semi-supervised Learning)有效地使用無標籤資料來提升模型的表現,在當前研究中展現了其在少量標籤資料困境(Low-data Regime)下的卓越表現;當前使用無標籤資料的方法主要為一致性訓練(Consistency Training)並搭配合適的人工標籤作為訓練目標,然而在自然語言處理中,當前方法的一致性訓練的設計仍不夠嚴謹,且人工標籤也不夠具有意義,使模型不僅無法學習到足夠的無標籤資料資訊,導致容易在標籤資料上過度學習(Overfitting),甚至會因為品質差的人工標籤而對模型造成負面影響;因此本研究提出以弱擴增(Weak Augmentation)、強擴增(Strong Augmentation)後的無標籤資料並搭配閾值,建立更嚴謹的一致性訓練過程,也透過混合擴增結合使用標籤資料與無標籤資料,讓模型更好地避免在標籤資料上過度學習,而實驗結果證實本研究提出方法在僅使用每個類別各10筆標籤資料的情況下,於AG NEWS文本分類資料集上取得87.88%的準確率,高於當前方法1.58%,並在Yahoo! Answers資料集上取得67.3%的準確率,高於當前方法3.5%。 | zh_TW
dc.description.abstract | Semi-supervised learning effectively uses unlabeled data to improve a deep learning model's performance, and recent research has shown its strength in the low-data regime. The prevailing way to use unlabeled data is consistency training with a suitable artificial label as the training target. In natural language processing, however, current consistency-training designs are not rigorous enough and the training targets are not meaningful enough: the model cannot extract sufficient information from unlabeled data, easily overfits the labeled data, and can even be harmed by poor-quality training targets. This work therefore builds a more rigorous consistency-training process that pairs weakly and strongly augmented unlabeled data with confidence-based masking, and additionally mixes labeled and unlabeled data so that both can be used together, helping the model avoid overfitting on the labeled data. With only 10 labeled examples per class, our approach outperforms current methods on two text classification benchmarks, reaching 87.88% accuracy on AG NEWS (1.58% above the current state of the art) and 67.3% accuracy on Yahoo! Answers (3.5% above). | en_US
dc.subject | 半監督式學習 | zh_TW
dc.subject | 一致性訓練 | zh_TW
dc.subject | 資料擴增 | zh_TW
dc.subject | 文本分類 | zh_TW
dc.subject | 自然語言處理 | zh_TW
dc.subject | Semi-supervised Learning | en_US
dc.subject | Consistency Training | en_US
dc.subject | Data Augmentation | en_US
dc.subject | Text Classification | en_US
dc.subject | Natural Language Processing | en_US
dc.title | 透過弱擴增與強擴增輔助半監督式學習中的文本分類 | zh_TW
dc.language.iso | zh-TW | zh-TW
dc.type | 博碩士論文 | zh_TW
dc.type | thesis | en_US
dc.publisher | National Central University | en_US
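The core mechanism described in the abstract (a pseudo-label taken from a weakly augmented view, used as the training target for the strongly augmented view, and kept only when its confidence clears a threshold) can be sketched roughly as below. This is a hypothetical FixMatch-style illustration, not the thesis code: the function names, the 0.95 threshold, and the toy logits are all assumptions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_consistency_loss(weak_logits, strong_logits, threshold=0.95):
    """Pseudo-label each example from its weakly augmented view, score the
    strongly augmented view with cross-entropy against that pseudo-label,
    and keep only examples whose confidence clears the threshold."""
    probs = softmax(weak_logits)
    confidence = probs.max(axis=-1)     # max class probability per example
    pseudo = probs.argmax(axis=-1)      # hard pseudo-label per example
    mask = confidence >= threshold      # confidence-based masking
    strong_probs = softmax(strong_logits)
    nll = -np.log(strong_probs[np.arange(len(pseudo)), pseudo] + 1e-12)
    return (nll * mask).mean(), mask

# Toy logits for two unlabeled examples: the first view is confident,
# the second is not, so only the first contributes to the loss.
weak = np.array([[5.0, 0.0, 0.0],
                 [0.3, 0.2, 0.1]])
strong = np.array([[4.0, 0.1, 0.0],
                   [0.2, 0.3, 0.1]])
loss, mask = masked_consistency_loss(weak, strong)
print(mask.tolist())  # [True, False]
```

In a real training loop this quantity would be a differentiable loss in a deep learning framework, added to the supervised cross-entropy on the labeled batch; the numpy version above only shows how the mask filters low-confidence pseudo-labels out of the consistency term.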
