Abstract (English) |
A large amount of information on the internet is stored in PDF format, such as court judgments, financial reports, and admission brochures. Many applications and services need to convert this information into structured formats for subsequent use. Typically, this involves manually defining data structures and then extracting data according to those structures to train models, which is extremely labor-intensive and time-consuming. How to define data structures efficiently and extract data accurately is therefore the main focus of this study.
This paper combines two tasks, data mining and data extraction, to develop an interactive data extraction system based on online learning. The data mining task uses the PrefixSpan technique to help users find patterns in target documents, so that they can efficiently define the data structure of those documents. The data extraction task adopts a finite-state transducer (FST), a traditional machine learning approach, which learns extraction rules from the defined data structure and a small amount of labeled data, and then completes the extraction task by applying these rules.
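To illustrate the pattern-mining step, the following is a minimal sketch of PrefixSpan in Python. It is not the system's implementation: it assumes each document has already been converted into a sequence of single tokens, and the token names (`TITLE`, `DATE`, `PARTY`, `RULING`), the function name `prefixspan`, and the `min_support` threshold are hypothetical choices made only for this example. The original algorithm also handles itemsets within a sequence element, which this sketch omits.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return {pattern: support} for every sequential pattern that occurs
    (as a subsequence) in at least `min_support` of the input sequences."""
    results = {}

    def mine(prefix, projected):
        # `projected` is the projected database for `prefix`: a list of
        # (sequence index, start position) pairs, one per sequence that
        # contains the prefix, pointing just past its earliest match.
        first_pos = defaultdict(dict)  # item -> {sequence index: first position}
        for seq_idx, start in projected:
            seq = sequences[seq_idx]
            for pos in range(start, len(seq)):
                item = seq[pos]
                if seq_idx not in first_pos[item]:
                    first_pos[item][seq_idx] = pos

        for item, occurrences in first_pos.items():
            support = len(occurrences)  # number of distinct sequences
            if support >= min_support:
                pattern = prefix + (item,)
                results[pattern] = support
                # Grow the pattern using only the suffixes after `item`.
                mine(pattern, [(i, p + 1) for i, p in occurrences.items()])

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Hypothetical toy input: three documents encoded as token sequences.
docs = [
    ["TITLE", "DATE", "PARTY", "RULING"],
    ["TITLE", "PARTY", "RULING"],
    ["TITLE", "DATE", "RULING"],
]

patterns = prefixspan(docs, min_support=3)
# patterns == {("TITLE",): 3, ("TITLE", "RULING"): 3, ("RULING",): 3}
```

Lowering `min_support` to 2 on the same toy input yields 11 patterns instead of 3, which hints at why the number of mined patterns can grow quickly as the threshold is relaxed.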
Because data mining may uncover too many patterns, we reduce their number by excluding items such as page numbers and line numbers, and we further analyze different document format types. For the data extraction task, we also implemented two LLM-based extraction methods: LangChain and ChatGPT-QA. Experimental results show that LangChain outperforms ChatGPT-QA in extraction performance, with average F1 scores of 0.77 and 0.63, respectively. Additionally, we evaluated whether LangChain can replace manual labeling by comparing FST data extraction under two labeling methods: manual labeling and LangChain labeling. The results show that LangChain cannot replace manual labeling: FST extraction achieves an average F1 score of 0.91 with manual labeling and 0.70 with LangChain labeling. |