Graduate Thesis 108522098: Detailed Record




Author 陳大富 (Ta-Fu Chen)   Department Department of Computer Science and Information Engineering
Title 基於預訓練模型與再評分機制之開放領域中文問答系統
(Open Domain Chinese Question Answering System based on Pre-training Model and Retrieval Reranking)
Related theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Pre-processing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ Semantics-Based Public Opinion Analysis System
★ Design and Applications of a High-Quality Spoken Narration System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Speeded-Up Robust Features
★ Personalized Collaborative Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet for Face Detection
★ Trend Prediction of Financial Products
★ Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning Methods to Predict Alzheimer's Disease Progression and Post-Surgery Survival for Stroke
Full text Not released (permanently restricted)
Abstract (Chinese) In recent years, research in natural language processing has gradually shifted toward large pre-trained language models, and open-domain question answering systems are no exception. Large pre-trained language models give question answering systems strong comprehension and answer-extraction capabilities, but their huge number of parameters also makes inference slow, and because the amount of content the model must process varies in real applications, the user experience suffers. This thesis proposes a Chinese open-domain question answering system that adds a reranking mechanism: the articles destined for the question answering model are instead handled at the paragraph level and filtered semantically. This not only supplies the semantic information that traditional retrievers lack, but also effectively reduces and controls the number of paragraphs entering the question answering model, lowering the chance that the model is misled and greatly improving the system's response time.
In open-domain question answering the scope of questions is not restricted to a particular domain, so in practice the system inevitably encounters many samples never seen during training; the question answering model therefore needs very good generalization ability to perform well. Moreover, the questions users pose are often colloquial, which differs somewhat from the comparatively well-formed questions in the training datasets. This thesis therefore proposes a set of methods for Chinese question answering that covers both the processing of the training data and the way the model is trained. The data processing adjusts and combines existing datasets to broaden the range of question types the model can accept, while the training procedure varies the sample length during training to improve the model's adaptability to inputs of different lengths. Together these methods improve the model's generalization, make it more tolerant of colloquial questions, and thereby raise the accuracy of the answers it gives.
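The abstract mentions varying the sample length during training but does not spell out the mechanism. One common way to expose an extractive reader to contexts of different lengths, sketched below purely as an assumed illustration in Python with the Hugging Face transformers library, is to split each long question/context pair into overlapping fixed-length windows at tokenization time; the checkpoint name, max_length, and stride below are illustrative assumptions, not the settings used in this thesis.

# Sketch: turn one long training example into several fixed-length, overlapping windows.
# The checkpoint, max_length, and stride values are illustrative assumptions only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")

def make_windows(question, context, max_length=384, stride=128):
    """Tokenize question + context into windows of at most max_length tokens."""
    return tokenizer(
        question,
        context,
        max_length=max_length,
        truncation="only_second",        # truncate only the context, never the question
        stride=stride,                   # overlap between consecutive context windows
        return_overflowing_tokens=True,  # emit one feature per window
        return_offsets_mapping=True,     # needed to map answer characters to token positions
    )

enc = make_windows("台灣最高的山是哪一座?", "玉山是台灣最高的山,位於南投與嘉義交界。" * 100)
print(len(enc["input_ids"]), "windows produced from one example")

Varying max_length across runs, or mixing windows of several sizes in one training set, is one plausible reading of the "adjust the sample length" idea described above.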
Abstract (English) In recent years, research in natural language processing has gradually shifted to large-scale pre-trained language models, and open-domain question answering systems are no exception. Large-scale pre-trained language models bring powerful comprehension and answer-extraction capabilities to question answering systems, but their huge number of parameters slows inference, and because the amount of content the model must handle varies in actual applications, the user experience suffers. This thesis proposes a Chinese open-domain question answering system that incorporates a reranking mechanism: articles destined for the question answering (Q&A) model are split into paragraphs and screened at the semantic level. This not only provides the semantic information that traditional document retrievers lack, but also effectively reduces and controls the number of paragraphs entering the Q&A model, reducing the possibility that the model is misled and greatly improving the system's response speed.
The scope of questions in an open-domain question answering system is not limited to a specific domain, so in actual use the system inevitably encounters many samples never seen during training; the question answering model therefore needs very good generalization ability to perform well. In addition, the questions users ask are often colloquial, which differs somewhat from the more regular question format found in the training datasets. This thesis therefore proposes a set of methods for Chinese question answering covering both the processing of training data and the training procedure. The data processing adjusts and combines existing datasets to broaden the range of question types the model can handle, and the training procedure varies the sample length during training to improve adaptability to questions and contexts of different lengths. Through these methods, the model's generalization ability improves, colloquial questions are handled more gracefully, and the accuracy of the answers increases.
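The retrieve, rerank, and read flow that both abstracts describe can be summarized in a short sketch. The Python code below is an assumption-laden illustration rather than the thesis implementation: the Hugging Face checkpoints (hfl/chinese-roberta-wwm-ext standing in for the ICT-pretrained dual encoder, uer/roberta-base-chinese-extractive-qa standing in for the fine-tuned RoBERTa reader) and the plain [CLS] dot-product scoring are placeholders for the components described in Chapter 3.

# Minimal sketch of the retrieve -> rerank -> read flow (placeholder models,
# not the checkpoints trained in this thesis).
import torch
from transformers import AutoModel, AutoTokenizer, pipeline

ENCODER_NAME = "hfl/chinese-roberta-wwm-ext"              # assumed stand-in dual encoder
READER_NAME = "uer/roberta-base-chinese-extractive-qa"    # assumed stand-in extractive reader

tokenizer = AutoTokenizer.from_pretrained(ENCODER_NAME)
q_encoder = AutoModel.from_pretrained(ENCODER_NAME)       # question tower
p_encoder = AutoModel.from_pretrained(ENCODER_NAME)       # paragraph tower
reader = pipeline("question-answering", model=READER_NAME)

def encode(texts, encoder):
    """Encode a batch of texts into their [CLS] vectors."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]                    # (batch, hidden)

def answer(question, documents, top_k=5):
    # 1. Work at the paragraph level rather than the article level.
    paragraphs = [p.strip() for doc in documents for p in doc.split("\n") if p.strip()]
    # 2. Rerank paragraphs by dot-product similarity with the question embedding.
    q_vec = encode([question], q_encoder)                 # (1, hidden)
    p_vecs = encode(paragraphs, p_encoder)                # (N, hidden)
    scores = (p_vecs @ q_vec.T).squeeze(1)                # (N,)
    keep = scores.topk(min(top_k, len(paragraphs))).indices.tolist()
    # 3. Only the top-k paragraphs reach the reader, so reader cost stays bounded.
    candidates = [reader(question=question, context=paragraphs[i]) for i in keep]
    return max(candidates, key=lambda c: c["score"])

Because top_k fixes how many paragraphs the reader sees, reader latency stays bounded no matter how many documents the first-stage retriever returns, which is the response-time benefit the abstracts point to.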
Keywords (Chinese) ★ 問答系統 (question answering system)
★ 開放領域 (open domain)
★ 開放領域問答系統 (open-domain question answering system)
★ 檢索再評分 (retrieval reranking)
★ 預訓練 (pre-training)
Keywords (English) ★ Question Answering System
★ Open-domain
★ Open-domain Question Answering System
★ Retrieval reranking
★ Pre-training
Table of Contents
Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
1.1  Background
1.2  Research Motivation and Objectives
1.3  Research Methods and Chapter Overview
Chapter 2  Related Work and Literature Review
2.1  Term Frequency-Inverse Document Frequency (TF-IDF)
2.2  Transformer
2.2.1  Self-Attention
2.2.2  Multi-Head Attention
2.2.3  Positional Encoding (PE)
2.2.4  Comparison of Time Complexity
2.3  Bidirectional Encoder Representations from Transformers (BERT)
2.3.1  Masked Language Model (MLM) Pre-training
2.3.2  Next Sentence Prediction (NSP) Pre-training
2.3.3  Fine-tuning BERT
2.3.4  Analysis of Common Pre-trained Language Model Architectures
2.4  Two-Stage Open-Domain Question Answering System
2.4.1  Document Retriever
2.4.2  Document Reader
2.4.3  Distantly Supervised Data Generation
2.5  Dual-Encoder Retriever
Chapter 3  Open-Domain Chinese Question Answering System Based on a Pre-trained Model and Retrieval Reranking
3.1  Chinese Question Answering Model Based on RoBERTa
3.2  Retrieval Reranking Based on a Dual-Encoder Architecture
3.3  Reranker Pre-trained with the Inverse Cloze Task (ICT)
3.4  Chinese Question Answering Dataset Augmentation
3.5  Open-Domain Chinese Question Answering System Based on a Pre-trained Model and Retrieval Reranking
Chapter 4  Experimental Results and Discussion
4.1  Experimental Equipment
4.2  Datasets
4.2.1  Reading Comprehension Datasets for Extractive Question Answering
4.2.2  Reading Comprehension Datasets for General Question Answering
4.3  Experiments and Discussion
4.3.1  RoBERTa-Based Reader for Chinese Question Answering
4.3.2  Dual-Encoder-Based Reranker
4.3.3  Open-Domain Chinese Question Answering System Based on a Pre-trained Model and Retrieval Reranking
Chapter 5  Conclusions and Future Directions
References
Advisor Jia-Ching Wang (王家慶)   Date of Approval 2021-08-24
