Thesis Title 基於預訓練模型與再評分機制之開放領域中文問答系統
(Open Domain Chinese Question Answering System based on Pre-training Model and Retrieval Reranking)
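Abstract (Chinese) In recent years, research in natural language processing has gradually shifted toward large pre-trained language models, and open-domain question answering systems are no exception. Large pre-trained language models give question answering systems strong comprehension and answer extraction capabilities, but their enormous parameter counts make inference slow, and in real applications the amount of content the model must process is not fixed, which leads to a poor user experience. This thesis proposes a Chinese open-domain question answering system that adds a reranking mechanism: articles headed for the question answering model are split into paragraphs and filtered at the semantic level. This supplies the semantic information that traditional retrievers lack, and it also effectively reduces and controls the number of paragraphs entering the question answering model, which lowers the chance that the model is misled and greatly improves the system's response time.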
Abstract (English) In recent years, research in natural language processing has gradually shifted toward large-scale pre-trained language models, and open-domain question answering is no exception. These models bring strong comprehension and answer extraction capabilities to question answering systems, but their large parameter counts make inference slow. In addition, the amount of content the model must process in real applications is not fixed, which leads to a poor user experience. This thesis proposes a Chinese open-domain question answering system that incorporates a reranking mechanism. Articles destined for the question answering (Q&A) model are split into paragraphs and filtered at the semantic level. This provides semantic information that traditional document retrievers lack, and it effectively reduces and controls the number of paragraphs entering the Q&A model, which lowers the chance of the Q&A model being misled and greatly improves the system's response time.
Questions posed to an open-domain question answering system are not limited to a specific domain, so in real applications the system will inevitably encounter many samples that were not seen during training. The question answering model therefore needs strong generalization ability to perform well. Moreover, users often phrase their questions colloquially, which differs from the more regular question format found in training datasets. This thesis therefore proposes a set of methods for Chinese question answering, covering both the processing of training data and the training procedure. The data processing adjusts and combines existing datasets to broaden the range of question types the model can handle. The training procedure adjusts the sample length during training to improve adaptability to different types of questions. Together, these methods improve the model's generalization ability and its handling of colloquial questions, which in turn improves the accuracy of its answers.
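As a rough illustration of the reranking step the abstract describes (split retrieved articles into paragraphs, score each paragraph against the question, and pass only a fixed number on to the Q&A model), a minimal sketch might look like the following. The function names, the `top_k` parameter, and the sample texts are all hypothetical. In particular, `toy_encode` is only a character-bigram placeholder standing in for a learned dual encoder, so it captures lexical overlap rather than the semantic information the thesis aims for.

```python
import math
from collections import Counter
from typing import List, Tuple


def split_paragraphs(article: str) -> List[str]:
    """Split an article into non-empty paragraphs, one per line."""
    return [p.strip() for p in article.split("\n") if p.strip()]


def toy_encode(text: str) -> Counter:
    """Placeholder encoder: sparse vector of character-bigram counts.

    ASSUMPTION: this stands in for a learned question/paragraph encoder.
    It is purely lexical and is not the encoder used in the thesis.
    """
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank(question: str, articles: List[str], top_k: int = 2) -> List[Tuple[float, str]]:
    """Score every paragraph against the question and keep the top_k best.

    Returns (score, paragraph) pairs sorted by descending score, so the
    downstream reader always receives at most top_k paragraphs.
    """
    q_vec = toy_encode(question)
    paragraphs = [p for article in articles for p in split_paragraphs(article)]
    scored = [(cosine(q_vec, toy_encode(p)), p) for p in paragraphs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical retrieved articles; the first one contains two paragraphs.
    retrieved = [
        "玉山是台灣最高的山,海拔3952公尺。\n台北是台灣的首都。",
        "富士山是日本最高的山。",
    ]
    for score, paragraph in rerank("台灣最高的山是哪一座", retrieved, top_k=2):
        print(f"{score:.3f}  {paragraph}")
```

Capping the output at `top_k` paragraphs is what bounds the reader's workload regardless of how many articles the retriever returns. Replacing `toy_encode` with a trained encoder would be needed to get the semantic filtering the abstract refers to.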
Keywords (Chinese) ★ Question answering system
★ Open domain
★ Open-domain question answering system
★ Retrieval reranking
★ Pre-training
Keywords (English) ★ Question Answering System
★ Open-domain
★ Open-domain Question Answering System
★ Retrieval reranking
★ Pre-training
Table of Contents Chinese Abstract i
Abstract ii
Contents iv
List of Figures vi
List of Tables vii
Chapter 1 Introduction 1
1.1 Background 1
1.2 Research Motivation and Objectives 2
1.3 Research Method and Chapter Overview 3
Chapter 2 Related Work and Literature Review 4
2.1 Term Frequency–Inverse Document Frequency (TF-IDF) 4
2.2 Transformer 5
2.2.1 Self-Attention 7
2.2.2 Multi-Head Attention 8
2.2.3 Positional Encoding (PE) 9
2.2.4 Comparison of Time Complexity 11
2.3 Bidirectional Encoder Representations from Transformers (BERT) 12
2.3.1 Masked Language Model (MLM) Pre-training 15
2.3.2 Next Sentence Prediction (NSP) Pre-training 16
2.3.3 Fine-tuning BERT 17
2.3.4 Analysis of Common Pre-trained Language Model Architectures 19
2.4 Two-Stage Open-Domain Question Answering System 24
2.4.1 Document Retriever 25
2.4.2 Document Reader 26
2.4.3 Distantly Supervised Data Generation 29
2.5 Dual-Encoder Retriever 29
Chapter 3 Chinese Open-Domain Question Answering System Based on Pre-trained Model and Retrieval Reranking 33
3.1 RoBERTa-Based Chinese Question Answering Model 33
3.2 Retrieval Reranking Based on Dual-Encoder Architecture 37
3.3 Reranker Pre-trained with the Inverse Cloze Task (ICT) 42
3.4 Chinese Question Answering Dataset Augmentation 44
3.5 Chinese Open-Domain Question Answering System Based on Pre-trained Model and Retrieval Reranking 46
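Chapter 4 Experimental Results and Discussion 49
4.1 Experimental Equipment 49
4.2 Datasets 50
4.2.1 Reading Comprehension Datasets for Extractive Question Answering 50
4.2.2 Reading Comprehension Datasets for General Question Answering 53
4.3 Experiments and Discussion 55
4.3.1 RoBERTa-Based Reader for Chinese Question Answering 55
4.3.2 Dual-Encoder-Based Reranker 63
4.3.3 Chinese Open-Domain Question Answering System Based on Pre-trained Model and Retrieval Reranking 66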
Chapter 5 Conclusion and Future Directions 71
References 72
Advisor 王家慶 (Jia-Ching Wang) Approval Date 2021-8-24
