In recent years, with the rapid development of natural language processing, a wide range of pretraining algorithms for Transformer[1]-based neural language models have been developed, together with accompanying datasets and strong trained models, from early models such as BERT[2] and RoBERTa[3] to later ones such as DPR[4]. These include dual-tower (DSSM) document retrievers for retrieval-based open-domain question answering and neural language models for the downstream task of reading comprehension. Although such systems have seen many implementations and improvements over the past few years, the Chinese question answering domain addressed in this study often lacks a large-scale open dataset that closely matches the retrieval task, comparable to the English PAQ[5] dataset, for training both the dual-tower retriever and the reader. This study therefore uses a generative model to build a large-scale passage-question dataset from an open-source Chinese news corpus, and uses this dataset to strengthen the system's text retrieval capability and the reader's comprehension ability.

Specifically, the system consists of three main parts. The first part is data collection: the MT5[6] pretrained model is used to generate the required dataset, QNews, and the generated data are then cleaned so that only reasonably well-formed questions and passages of suitable length are retained. The second part performs domain-matched retrieval pretraining of the dual-tower model on the passage-question pairs in QNews to improve its retrieval performance. The third part further pretrains the reader on a length-sampled version of QNews, while constraining the change in model parameters to a limited range.

Through these three steps, this study aims to reduce, to a certain extent, the mismatch between the data format of the dual-tower model's pretraining task and that of its downstream task in a traditional retrieval-based open-domain encyclopedia question answering system, and to improve the performance of the neural language model on the downstream reading comprehension task.
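As a concrete illustration of the first part, the following is a minimal sketch of how passage-question pairs of the kind collected in QNews could be generated and filtered with an MT5-style model through the Hugging Face transformers library; the checkpoint name, decoding settings, and length thresholds are illustrative assumptions rather than the exact configuration used in this study.

from transformers import MT5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "google/mt5-base"  # assumed base checkpoint before question-generation fine-tuning
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = MT5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def generate_questions(passage: str, num_questions: int = 3) -> list[str]:
    """Generate candidate questions for one news passage."""
    inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        max_length=64,
        num_beams=4,
        num_return_sequences=num_questions,
        early_stopping=True,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

def keep_pair(passage: str, question: str) -> bool:
    """Cleaning rule: keep only well-formed questions and passages whose
    length falls in a reasonable range (thresholds are assumed values)."""
    return (
        question.endswith(("?", "？"))
        and 5 <= len(question) <= 50
        and 50 <= len(passage) <= 1000
    )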
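For the second part, the sketch below illustrates the in-batch-negative contrastive objective commonly used to pretrain dual-tower (DPR-style) retrievers on question-passage pairs; the encoder backbone and the hyperparameters shown are assumptions, not the settings of this study.

import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

ENCODER_NAME = "bert-base-chinese"  # assumed backbone for both towers
tokenizer = BertTokenizerFast.from_pretrained(ENCODER_NAME)
question_encoder = BertModel.from_pretrained(ENCODER_NAME)
passage_encoder = BertModel.from_pretrained(ENCODER_NAME)

def encode(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    # Use the [CLS] vector as the dense representation.
    return encoder(**batch).last_hidden_state[:, 0]

def in_batch_negative_loss(questions, passages):
    """Contrastive loss: each question is paired with its own passage,
    and all other passages in the batch serve as negatives."""
    q = encode(question_encoder, questions)   # (B, H)
    p = encode(passage_encoder, passages)     # (B, H)
    scores = q @ p.T                          # (B, B) similarity matrix
    labels = torch.arange(scores.size(0))     # the diagonal holds the positives
    return F.cross_entropy(scores, labels)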
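For the third part, the abstract only states that the reader's parameter changes are kept within a limited range during further pretraining; one common way to realize such a constraint is an L2 penalty toward the initial weights, sketched below under that assumption (the penalty coefficient is likewise assumed).

import torch

def parameter_drift_penalty(model, initial_params, weight=0.01):
    """L2 penalty on the distance between current and initial parameters."""
    penalty = torch.tensor(0.0)
    for name, param in model.named_parameters():
        penalty = penalty + torch.sum((param - initial_params[name]) ** 2)
    return weight * penalty

# Usage: snapshot the weights before further pretraining, then add the penalty
# to the reading-comprehension / further-pretraining loss at every step.
# initial_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# loss = task_loss + parameter_drift_penalty(model, initial_params)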