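Abstract: | Open information extraction aims to convert unstructured sentences into triples of the form (entity 1, relation, entity 2). Taking the sentence “神經醯胺能夠修復皮脂膜及減緩乾燥” (“Ceramide can repair the sebum film and relieve dryness”) as an example, an open information extraction model extracts the two triples (神經醯胺, 修復, 皮脂膜) and (神經醯胺, 減緩, 乾燥), that is, (ceramide, repair, sebum film) and (ceramide, relieve, dryness). Such triples can be visualized as a knowledge graph and serve as the basis for knowledge inference in question answering systems. Within this research area, we propose a pipelined language transformers model named CHOIE (Chinese Healthcare Open Information Extraction), which focuses on information extraction in the Chinese healthcare domain. CHOIE uses the well-performing pretrained language model RoBERTa as its backbone, combines it with different neural network models for feature extraction, and adds a classifier on top. This study treats the task as two phases: it first extracts all relations in the triples, and then, centering on each relation, finds entity 1 and entity 2 to complete the triple. Because publicly available, manually annotated Chinese datasets are lacking, we used a web crawler to collect healthcare articles and manually annotated the entities and relations. The resulting triples fall into four types: simple relations, single overlaps, multiple overlaps, and complicated relations. The experimental results and error analysis show that the proposed CHOIE pipelined language transformers model achieves the best performance on three open information extraction evaluation metrics, Exact Match (F1: 0.848), Contain Match (F1: 0.913), and Token Level Match (F1: 0.925), outperforming the existing information extraction models Multi2OIE, SpanOIE, and RNNOIE.;Open Information Extraction (OIE) aims to extract triples of the form (Argument-1, Relation, Argument-2) from unstructured natural language text. For example, an open IE system may extract the triples (Ceramide, repair, sebum) and (Ceramide, relieve, dryness) from the sentence “Ceramide can repair the sebum and relieve the dryness”. The extracted triples can be visualized as part of a knowledge graph, which may support knowledge inference in question answering systems. In this study, we propose a pipelined language transformers model called CHOIE (Chinese Healthcare Open Information Extraction). It uses a pipeline of RoBERTa transformers and different neural networks for feature extraction to extract triples. We treat Chinese open information extraction as a two-phase task: we first extract all the relations in a given sentence, and then find the arguments of each relation. Because no manually annotated dataset is publicly available, we constructed a Chinese OIE dataset in the healthcare domain. We first crawled articles from websites that provide healthcare information. After pre-processing, we split the remaining text into sentences and randomly selected a subset of the sentences for manual annotation. 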
Finally, our constructed dataset can be further categorized into four distinct groups: simple relations, single overlaps, multiple overlaps, and complicated relations. Based on the experimental results and error analysis, our proposed CHOIE model achieved the best performance on three evaluation metrics, Exact Match (F1: 0.848), Contain Match (F1: 0.913), and Token Level Match (F1: 0.925), outperforming the existing Multi2OIE, SpanOIE, and RNNOIE models. |
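The triple format and the Exact Match metric named in the abstract can be illustrated with a minimal Python sketch. It is not the CHOIE implementation: the function names `build_graph` and `exact_match_f1` are hypothetical, the predicted triples are made-up illustration data, and the scoring assumes that a predicted triple counts as correct only when all three elements are identical to a gold triple, which the abstract does not spell out.

```python
from collections import defaultdict

def build_graph(triples):
    """Group (argument-1, relation, argument-2) triples into an adjacency map.

    Each argument-1 becomes a node whose outgoing edges are
    (relation, argument-2) pairs, i.e. a tiny knowledge graph.
    """
    graph = defaultdict(list)
    for arg1, relation, arg2 in triples:
        graph[arg1].append((relation, arg2))
    return dict(graph)

def exact_match_f1(predicted, gold):
    """F1 over triples, assuming a hit requires an identical triple (assumption)."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_set)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)

# Gold triples taken from the example sentence in the abstract.
gold = [("Ceramide", "repair", "sebum"), ("Ceramide", "relieve", "dryness")]
# Hypothetical model output with one wrong argument, for illustration only.
predicted = [("Ceramide", "repair", "sebum"), ("Ceramide", "relieve", "skin")]

graph = build_graph(gold)
score = exact_match_f1(predicted, gold)
print(graph)
print(score)
```

Here one of the two predicted triples matches a gold triple exactly, so precision and recall are both 0.5. Contain Match and Token Level Match are more lenient criteria whose precise definitions are not given in the abstract, so they are left out of the sketch.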