NCU Institutional Repository: Item 987654321/93132


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/93132


Title: 基於生成資料集和進一步預訓練之百科問答系統 (Retrieval-based Question-Answering System based on Generated Dataset and Further Pretraining)
Author: 馮智詮 (Feng, Zhi-Quan)
Contributor: Department of Computer Science and Information Engineering
Keywords: Deep Learning; Natural Language Processing; Document Retrieval; Machine Reading Comprehension; Question Answering System
Date: 2023-07-19
Upload time: 2024-09-19 16:43:58 (UTC+8)
Publisher: National Central University
Abstract: In recent years, with the rapid development of natural language processing, a wide range of pretraining algorithms for Transformer-based[1] neural language models have been developed, along with accompanying datasets and strong training results, from early models such as BERT[2] and RoBERTa[3] to later ones such as DPR[4]. These include dual-encoder (DSSM) document retrievers for retrieval-based open-domain question answering, as well as neural language models for the downstream task of machine reading comprehension. Such systems have seen many implementations and improvements over the past few years, but the Chinese question-answering domain addressed by this study often lacks a large-scale open dataset that closely matches the retrieval task, comparable to the English PAQ[5] dataset, for training both the dual-encoder retriever and the reading comprehension model. This study therefore uses a generative model to build a large-scale passage-question dataset from an open-source Chinese news corpus, and uses that dataset to strengthen the system's document retrieval capability and the model's reading comprehension ability. Concretely, the system consists of three main parts.
The first part is data collection: the mT5[6] pretrained model generates the required dataset, QNews, and the generated data are then cleaned to retain well-formed questions and passages of appropriate length.
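The abstract does not give the exact generation or cleaning rules, but the step can be pictured with a minimal Python sketch using HuggingFace transformers. The checkpoint name "qg-checkpoint", the prompt prefix, and the length thresholds below are illustrative assumptions, not details from the thesis.

```python
# Hypothetical sketch of the QNews generation step: feed a news passage to an
# mT5 question-generation model and keep only outputs that pass simple checks.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("qg-checkpoint")  # assumed QG fine-tune

def generate_question(passage: str) -> str | None:
    inputs = tokenizer("generate question: " + passage, return_tensors="pt",
                       truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
    question = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Assumed cleaning rule: keep questions of reasonable length that end
    # with an ASCII or fullwidth question mark; drop everything else.
    if 5 <= len(question) <= 60 and question.endswith(("?", "？")):
        return question
    return None
```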
The second part uses the passage-question pairs in QNews for domain-matched retrieval pretraining of the dual-encoder (DSSM) model, improving its retrieval performance.
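The abstract does not state the pretraining objective; a DPR[4]-style contrastive loss with in-batch negatives is a common choice for dual encoders and is assumed in the sketch below. The encoder checkpoint and sequence lengths are placeholders.

```python
# Sketch of dual-encoder retrieval pretraining on QNews passage-question pairs.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
question_encoder = AutoModel.from_pretrained("bert-base-chinese")
passage_encoder = AutoModel.from_pretrained("bert-base-chinese")

def encode(encoder, texts):
    batch = tok(texts, padding=True, truncation=True, max_length=256,
                return_tensors="pt")
    # Use the [CLS] vector as the text embedding (one common convention).
    return encoder(**batch).last_hidden_state[:, 0]

def in_batch_loss(questions: list[str], passages: list[str]) -> torch.Tensor:
    q = encode(question_encoder, questions)   # (B, H)
    p = encode(passage_encoder, passages)     # (B, H)
    scores = q @ p.T                          # (B, B) dot-product similarities
    # Each question's own passage sits on the diagonal; the other passages
    # in the batch serve as negatives.
    labels = torch.arange(scores.size(0))
    return F.cross_entropy(scores, labels)
```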
The third part further pretrains the reading comprehension model on a length-sampled subset of QNews, applying constraints that keep the change in the model's parameters within a bounded range.
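The abstract says only that parameter drift is constrained; one standard realization, assumed here, is an L2 penalty that anchors each parameter to its pretrained value (in the spirit of L2-SP regularization). The coefficient is illustrative.

```python
# Sketch of anchored further pretraining: the penalty term pulls parameters
# back toward a frozen snapshot of the pretrained weights.
import torch

def anchored_loss(model: torch.nn.Module, task_loss: torch.Tensor,
                  anchor: dict[str, torch.Tensor], coeff: float = 0.01) -> torch.Tensor:
    penalty = sum((p - anchor[name]).pow(2).sum()
                  for name, p in model.named_parameters() if p.requires_grad)
    return task_loss + coeff * penalty

# Take the snapshot once, before further pretraining begins:
# anchor = {n: p.detach().clone() for n, p in model.named_parameters()}
```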
Through these three steps, this study aims to mitigate, to a certain extent, the mismatch between the data format of the dual-encoder's pretraining task and that of its downstream task in traditional retrieval-based open-domain encyclopedia question-answering systems, and to improve the performance of neural language models on the downstream reading comprehension task.
Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations

Files in This Item:

File          Description    Size    Format    Views
index.html                   0Kb     HTML      11


All items in NCUIR are protected by copyright, with all rights reserved.

