Thesis/Dissertation 108522046: Full Metadata Record

DC Field | Value | Language
DC.contributor | Department of Computer Science and Information Engineering | zh_TW
DC.creator | 李正倫 | zh_TW
DC.creator | Zheng-Lun Li | en_US
dc.date.accessioned | 2021-9-30T07:39:07Z
dc.date.available | 2021-9-30T07:39:07Z
dc.date.issued | 2021
dc.identifier.uri | http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=108522046
dc.contributor.department | Department of Computer Science and Information Engineering | zh_TW
DC.description | 國立中央大學 | zh_TW
DC.description | National Central University | en_US
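dc.description.abstract | Factual consistency is a critical and difficult problem in abstractive summarization and has attracted considerable attention in recent years. However, prior work has focused on factual consistency in English summaries, while the factual consistency of Chinese summaries has not yet been evaluated or studied. We study an aspect in which Chinese differs from English, namely word segmentation (tokenization). Most current Chinese pre-trained models use the same tokenization scheme as BERT, which in practice is very close to plain character-level tokenization. By training Chinese BART models with different Chinese word segmentation packages and fine-tuning them on the LCSTS Chinese summarization dataset, we confirm that tokenization affects not only the conventional ROUGE score but also factual consistency. In addition, considering the differences in word usage between Simplified and Traditional Chinese, we build TWNSum, a weakly supervised abstractive summarization dataset of Taiwan news, by extracting summaries with the simple LEAD method and filtering them with a factual consistency evaluation. This shows that generating an abstractive summarization dataset from a large corpus of unlabeled news is feasible. | zh_TW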
dc.description.abstract | Hallucination is a critical and difficult problem in abstractive summarization and has received increasing attention in recent years. However, hallucination in some languages, specifically Chinese, remains unexplored. We experiment with a step that is particular to Chinese modeling, namely tokenization, to determine its effect on hallucination in abstractive summarization. In English, tokenization is rarely singled out for separate experiments because of the characteristics of the language. In Chinese, current models use either character-level tokenization or something close to it, such as the BERT tokenizer. By applying different Chinese tokenizers to the BART model, we confirm that the tokenizer affects both the ROUGE score and the faithfulness of the model. Moreover, considering the difference between Traditional Chinese and Simplified Chinese tokenizers, we create the Taiwan Weakly supervised News Summarization dataset (TWNSum) using the simple LEAD method and hallucination-evaluation filtering. TWNSum also shows that creating an abstractive summarization dataset from a large amount of unlabeled news with a weakly supervised method is feasible. | en_US
DC.subject | Abstractive Summarization | zh_TW
DC.subject | Pre-trained Model | zh_TW
DC.subject | Chinese Word Segmentation | zh_TW
DC.subject | Factual Consistency | zh_TW
DC.subject | Abstractive Summarization | en_US
DC.subject | Pre-trained Model | en_US
DC.subject | Tokenization | en_US
DC.subject | Hallucination | en_US
DC.title | Evaluating the Factual Consistency of Chinese Summarization and Investigating the Effect of Word Segmentation on It | zh_TW
dc.language.iso | zh-TW | zh-TW
DC.title | Does the Tokenization Influence the Faithfulness? Evaluation of Hallucinations for Chinese Abstractive Summarization | en_US
DC.type | Master's/doctoral thesis | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US
