Schema Mining and Information Extraction for PDF Documents (基於資料結構探勘 PDF 文本資訊擷取系統之設計與開發)

DC Field	Value	Language
DC.contributor	資訊工程學系在職專班	zh_TW
DC.creator	彭綉雯	zh_TW
DC.creator	Hsiu-Wen Peng	en_US
dc.date.accessioned	2024-04-26T07:39:07Z
dc.date.available	2024-04-26T07:39:07Z
dc.date.issued	2024
dc.identifier.uri	http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=109552015
dc.contributor.department	資訊工程學系在職專班	zh_TW
DC.description	國立中央大學	zh_TW
DC.description	National Central University	en_US
dc.description.abstract	網路上充斥著大量以 PDF 儲存的資訊，例如裁判書、財務報告、入學簡章等。對於許多應用服務而言，往往需要將其轉成結構化格式以方便後續的應用。一般說來，我們需要以人工的方式進行資料結構的定義，並依據定義好的資料結構進行資料擷取，進而訓練模型，這是十分消耗人力及時間成本的，因此如何有效率的定義資料結構，且準確的擷取資料，將是本文研究的主要課題。本文結合資料探勘與資料擷取兩個任務，開發了一套互動式的線上學習資料擷取系統。前者透過 PrefixSpan 的技術可以幫助使用者找出目標文件的 Pattern，讓使用者能有效率的定義目標文件的資料結構；後者則是採用傳統機器學習的有限狀態傳感機 (Finite-state transducer, FST)，系統可以透過少量的標記資料，依據資料結構的定義來學習提取規則，並經由這些提取規則完成資料擷取任務。由於資料探勘時會挖掘出過多 Pattern，因此我們透過排除項目（如：去除文件中的頁碼或行號資訊等）的判斷來減少 Pattern 數量，並對不同文件格式類型作進一步的分析。而在資料擷取的任務中，我們實作兩種 LLM 擷取方法：LangChain 及 ChatGPT-QA。實驗結果顯示 LangChain 擷取效能優於 ChatGPT-QA，平均 F1 Score 分別為 0.77 及 0.63。另外，我們也針對兩種不同標記方法：人工標記及 LangChain 標記，以評估 LangChain 是否能達到取代人工標記的目標，透過使用 FST 進行資料擷取的實驗結果呈現 LangChain 並不能取代人工標記，其人工標記與 LangChain 標記的平均 F1 Score 分別為 0.91 及 0.70。	zh_TW
dc.description.abstract	The internet holds a large amount of information stored in PDF format, such as court judgments, financial reports, and admission brochures. Many applications and services need to convert this information into structured formats for subsequent use. Typically, this involves manually defining data structures, extracting data according to those structures, and then training models, which is extremely labor-intensive and time-consuming. How to efficiently define data structures and accurately extract data is therefore the main focus of this study. This thesis combines two tasks, data mining and data extraction, to develop an interactive online-learning data extraction system. The former uses the PrefixSpan technique to help users find patterns in target documents, allowing them to efficiently define the data structure of those documents. The latter adopts a finite-state transducer (FST), a traditional machine learning approach, which learns extraction rules from a small amount of labeled data according to the defined data structure and completes the data extraction task through these rules. Since data mining may uncover too many patterns, we reduce the number of patterns by excluding items (such as page numbers or line numbers) and further analyze different document format types. In the data extraction task, we implemented two LLM extraction methods: LangChain and ChatGPT-QA. Experimental results show that LangChain outperforms ChatGPT-QA in extraction performance, with average F1 scores of 0.77 and 0.63, respectively. Additionally, we compared two labeling methods, manual labeling and LangChain labeling, to evaluate whether LangChain can replace manual labeling. The results of FST-based data extraction show that LangChain cannot replace manual labeling: the average F1 scores are 0.91 for manual labeling and 0.70 for LangChain labeling.	en_US
DC.subject	序列模式挖掘	zh_TW
DC.subject	上下文學習	zh_TW
DC.subject	線上學習	zh_TW
DC.subject	大型語言模型	zh_TW
DC.subject	Sequential pattern mining	en_US
DC.subject	In-context Learning	en_US
DC.subject	Online Learning	en_US
DC.subject	Large Language Model	en_US
DC.title	基於資料結構探勘 PDF 文本資訊擷取系統之設計與開發	zh_TW
dc.language.iso	zh-TW	zh-TW
DC.title	Schema Mining and Information Extraction for PDF Documents	en_US
DC.type	博碩士論文	zh_TW
DC.type	thesis	en_US
DC.publisher	National Central University	en_US

Full metadata record for thesis 109552015