Schema Mining and Information Extraction for PDF Documents

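Name

彭綉雯 (Hsiu-Wen Peng)

Department

Department of Computer Science and Information Engineering (in-service master's program)

Thesis title

Schema Mining and Information Extraction for PDF Documents
(original title: 基於資料結構探勘 PDF 文本資訊擷取系統之設計與開發)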

Files

Full text available for browsing in the system after 2024-12-31.

Abstract

A large amount of information on the internet is stored in PDF format, such as court judgments, financial reports, and admission brochures. Many applications and services need to convert this information into a structured format for subsequent use. Typically, this means manually defining a data structure, extracting data according to that structure, and then training a model, which is labor-intensive and time-consuming. This study therefore focuses on how to define data structures efficiently and extract data accurately.

This thesis combines two tasks, data mining and data extraction, to develop an interactive, online-learning data extraction system. For the former, the PrefixSpan algorithm helps users find patterns in the target documents so that they can define the documents' data structure efficiently. For the latter, the system uses a finite-state transducer (FST), a traditional machine learning approach, which learns extraction rules from a small amount of labeled data according to the defined data structure and then applies those rules to complete the extraction task.
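To make the pattern-mining step more concrete, the following is a minimal sketch of PrefixSpan-style frequent sequential pattern mining. It is not the thesis's implementation: the function name `prefixspan`, the `min_support` parameter, and the idea of representing each document as a list of token labels are illustrative assumptions, and the sketch handles only sequences of single items.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return {pattern (tuple): support} for all frequent sequential patterns.

    Simplified PrefixSpan: every sequence is a list of single items, and a
    projected database is stored as (sequence index, start position) pairs.
    """
    results = {}

    def mine(prefix, projected):
        # For each item, record the first position at which it occurs in the
        # suffix of every projected sequence (one entry per sequence).
        first_pos = defaultdict(dict)  # item -> {sequence index: position}
        for seq_idx, start in projected:
            seq = sequences[seq_idx]
            for pos in range(start, len(seq)):
                first_pos[seq[pos]].setdefault(seq_idx, pos)

        for item, occurrences in sorted(first_pos.items()):
            support = len(occurrences)  # number of sequences containing item
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            # Project the database: keep what follows the first occurrence.
            mine(pattern, [(i, p + 1) for i, p in occurrences.items()])

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Hypothetical token-label sequences, one per document.
docs = [
    ["TITLE", "CASE_NO", "DATE", "BODY"],
    ["TITLE", "DATE", "BODY"],
    ["TITLE", "CASE_NO", "BODY"],
]
frequent = prefixspan(docs, min_support=3)
```

With these toy sequences and `min_support=3`, only `("TITLE",)`, `("BODY",)`, and `("TITLE", "BODY")` are frequent, which hints at a shared skeleton that a user could confirm as the document schema.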

Because data mining can uncover too many patterns, we reduce their number by excluding certain items (for example, page numbers and line numbers) and further analyze different document format types. For the data extraction task, we implemented two LLM-based extraction methods: LangChain and ChatGPT-QA. Experimental results show that LangChain outperforms ChatGPT-QA, with average F1 scores of 0.77 and 0.63, respectively. We also compared two labeling methods, manual labeling and LangChain labeling, to evaluate whether LangChain can replace manual labeling. Using FST for data extraction, the average F1 scores were 0.91 with manual labels and 0.70 with LangChain labels, which indicates that LangChain labeling cannot replace manual labeling.
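As a reminder of how the reported metric works, the sketch below computes precision, recall, and F1 for one document and then averages F1 over several documents. The matching criterion used in the thesis is not stated in this record, so exact matching on (field, value) pairs is an assumption, as are the function names and the sample values.

```python
def f1_score(predicted, gold):
    """F1 over sets of (field, value) pairs, using exact matching (assumed)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def average_f1(pairs):
    """Macro-average: the mean of the per-document F1 scores."""
    scores = [f1_score(pred, gold) for pred, gold in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical extraction results for two documents.
gold_1 = {("case_no", "113-1"), ("date", "2024-01-01")}
pred_1 = {("case_no", "113-1"), ("date", "2024-01-02")}  # one field is wrong
gold_2 = {("case_no", "113-2")}
pred_2 = {("case_no", "113-2")}  # perfect extraction

mean_f1 = average_f1([(pred_1, gold_1), (pred_2, gold_2)])
```

Here the first document has precision and recall of 0.5 each, so its F1 is 0.5; the second scores 1.0, and the macro-average is 0.75.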

Keywords

★ Sequential pattern mining
★ In-context learning
★ Online learning
★ Large language model

Table of Contents

Chinese Abstract
Abstract
Contents
List of Figures
List of Tables
1. Introduction
1-1 Motivation and Objectives
1-2 Contributions
2. Related Work
2-1 Email Information
2-1-1 Mailparser
2-1-2 Parsio
2-2 Web Page Information
2-2-1 Octoparse
2-2-2 ParseHub
2-2-3 Mozenda
2-2-4 Web Scraper
2-3 PDF File Information
2-3-1 Single Extraction Type
2-3-2 Packages and Tools for Developers
2-3-3 Data Extraction Platforms
3. PDFEX System Architecture
3-1 Design Philosophy
3-2 System Architecture
3-3 Module 1: Schema Mining
3-4 Module 2: Text Extraction
3-5 Rule Generalization
4. Experiments and Discussion
4-1 Datasets
4-2 Schema Mining
4-3 Text Extraction
4-3-1 Comparing the Extraction Performance of Different LLM-Based Methods
4-3-2 Comparing FST Extraction Performance with Different Labeling Methods
4-4 Evaluation Method
4-5 Analysis of Extraction Failures
5. Conclusion and Future Work
References
Appendix 1
A-1 Dataset Examples
A-2 Extraction Results from Tests on Other Platforms

Advisor

張嘉惠 (Chia-Hui Chang)

Approval date

2024-4-26
