Thesis/Dissertation 109552015: Full Metadata Record

DC field | Value | Language
DC.contributor | 資訊工程學系在職專班 | zh_TW
DC.creator | 彭綉雯 | zh_TW
DC.creator | Hsiu-Wen Peng | en_US
dc.date.accessioned | 2024-4-26T07:39:07Z |
dc.date.available | 2024-4-26T07:39:07Z |
dc.date.issued | 2024 |
dc.identifier.uri | http://ir.lib.ncu.edu.tw:444/thesis/view_etd.asp?URN=109552015 |
dc.contributor.department | 資訊工程學系在職專班 | zh_TW
DC.description | 國立中央大學 | zh_TW
DC.description | National Central University | en_US
dc.description.abstract | 網路上充斥著大量以 PDF 儲存的資訊,例如裁判書、財務報告、入學簡章等。對於許多應用服務而言,往往需要將其轉成結構化格式以方便後續的應用。一般說來,我們需要以人工的方式進行資料結構的定義,並依據定義好的資料結構進行資料擷取,進而訓練模型,這是十分消耗人力及時間成本的,因此如何有效率的定義資料結構,且準確的擷取資料,將是本文研究的主要課題。 本文結合資料探勘與資料擷取兩個任務,開發了一套互動式的線上學習資料擷取系統。前者透過 PrefixSpan 的技術可以幫助使用者找出目標文件的 Pattern,讓使用者能有效率的定義目標文件的資料結構;後者則是採用傳統機器學習的有限狀態傳感機 (Finite-state transducer, FST),系統可以透過少量的標記資料,依據資料結構的定義來學習提取規則,並經由這些提取規則完成資料擷取任務。 由於資料探勘時會挖掘出過多 Pattern,因此我們透過排除項目(如:去除文件中的頁碼或行號資訊... 等) 的判斷來減少 Pattern 數量,並對不同文件格式類型作進一步的分析。而在資料擷取的任務中,我們實作兩種 LLM 擷取方法:LangChain 及 ChatGPT-QA。實驗結果顯示 LangChain 擷取效能優於 ChatGPT-QA,平均 F1 Score 分別為 0.77 及 0.63。另外,我們也針對兩種不同標記方法:人工標記及 LangChain 標記,以評估 LangChain 是否能達到取代人工標記的目標,透過使用 FST 進行資料擷取的實驗結果呈現 LangChain 並不能取代人工標記,其人工標記與 LangChain 標記的平均 F1 Score 分別為 0.91 及 0.70。 | zh_TW
dc.description.abstract | The internet is flooded with information stored in PDF format, such as court judgments, financial reports, and admission brochures. Many applications and services need to convert this information into a structured format for subsequent use. Typically, this involves manually defining data structures and extracting data according to them in order to train models, which is extremely labor-intensive and time-consuming. How to define data structures efficiently and extract data accurately is therefore the main focus of this study. This thesis combines two tasks, data mining and data extraction, to develop an interactive online-learning data extraction system. The former uses the PrefixSpan technique to help users find patterns in target documents, so that they can efficiently define the data structure of those documents. The latter adopts a finite-state transducer (FST), a traditional machine learning approach, which learns extraction rules from a small amount of labeled data according to the defined data structure and completes the data extraction task with these rules. Because data mining may uncover too many patterns, we reduce their number by excluding items (such as page numbers or line numbers) and further analyze different document format types. For the data extraction task, we implemented two LLM-based extraction methods: LangChain and ChatGPT-QA. Experimental results show that LangChain outperforms ChatGPT-QA, with average F1 scores of 0.77 and 0.63, respectively. Additionally, we compared two labeling methods, manual labeling and LangChain labeling, to evaluate whether LangChain can replace manual labeling. The results of FST-based data extraction show that it cannot: the average F1 scores are 0.91 with manual labels and 0.70 with LangChain labels. | en_US
DC.subject | 序列模式挖掘 | zh_TW
DC.subject | 上下文學習 | zh_TW
DC.subject | 線上學習 | zh_TW
DC.subject | 大型語言模型 | zh_TW
DC.subject | Sequential pattern mining | en_US
DC.subject | In-context Learning | en_US
DC.subject | Online Learning | en_US
DC.subject | Large Language Model | en_US
DC.title | 基於資料結構探勘 PDF 文本資訊擷取系統之設計與開發 | zh_TW
dc.language.iso | zh-TW | zh-TW
DC.title | Schema Mining and Information Extraction for PDF Documents | en_US
DC.type | 博碩士論文 | zh_TW
DC.type | thesis | en_US
DC.publisher | National Central University | en_US
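
The abstract in this record describes using PrefixSpan to find recurring patterns in target documents and excluding items such as page numbers to keep the number of patterns down. As a rough illustration only, and not the thesis's actual implementation, the idea can be sketched in Python. The sketch assumes each document has already been reduced to a sequence of single tokens; the token names (`court`, `case_no`, `page_no`, and so on), the sample `docs`, and the `min_support` values are all hypothetical.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return {pattern tuple: support} for every frequent sequential pattern.

    Simplified PrefixSpan: each sequence element is a single item, and
    support is the number of sequences that contain the pattern as a
    (not necessarily contiguous) subsequence.
    """
    results = {}

    def mine(prefix, projected):
        # projected holds (sequence index, start offset) pairs, one per
        # sequence that still contains the current prefix.
        support = defaultdict(set)
        for seq_id, start in projected:
            for item in sequences[seq_id][start:]:
                support[item].add(seq_id)

        for item, seq_ids in sorted(support.items()):
            if len(seq_ids) < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = len(seq_ids)

            # Project the database: keep only the suffix that follows the
            # first occurrence of the item in each sequence.
            next_projected = []
            for seq_id, start in projected:
                try:
                    pos = sequences[seq_id].index(item, start)
                except ValueError:
                    continue
                next_projected.append((seq_id, pos + 1))
            mine(pattern, next_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Hypothetical token sequences standing in for three parsed PDF documents.
docs = [
    ["court", "case_no", "date", "ruling"],
    ["court", "page_no", "case_no", "ruling"],
    ["court", "case_no", "page_no", "date", "ruling"],
]

# Patterns that appear in all three documents.
patterns = prefixspan(docs, min_support=3)

# Excluding an item (here the hypothetical "page_no" token) before mining
# shrinks the pattern set at a lower support threshold.
excluded = {"page_no"}
filtered_docs = [[tok for tok in doc if tok not in excluded] for doc in docs]
all_patterns = prefixspan(docs, min_support=2)
filtered_patterns = prefixspan(filtered_docs, min_support=2)
```

In this sketch the exclusion step is applied as a plain filter before mining; how the thesis system decides which items to exclude is not stated in this record.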
