NCU Institutional Repository (中大機構典藏) — Item 987654321/98598


    Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98598


    Title: Design and Implementation of a Personalized Knowledge Base Retrieval System with a Multimodal Retrieval-Augmented Generation Architecture
    Authors: 陳志昇;Chen, Zhi-Sheng
    Contributors: Department of Computer Science and Information Engineering
    Keywords: Retrieval-Augmented Generation; Large Language Model; Multimodality
    Date: 2025-08-19
    Issue Date: 2025-10-17 12:59:01 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: Large language models (LLMs) often suffer from hallucinations in knowledge question answering, making it difficult for them to accurately cite a user's personal information or document content. Retrieval-Augmented Generation (RAG) improves LLM accuracy by retrieving external knowledge during response generation, but most existing RAG methods focus only on text and cannot fully exploit the visual information contained in multimodal personal knowledge bases.
    To address this challenge, this study designs and implements a personalized knowledge base retrieval system built on a multimodal RAG architecture. It combines locally deployed semantic retrieval and generation models: the multilingual embedding model BGE-M3 converts document text and queries into high-dimensional vectors for dense retrieval of relevant passages; the BGE-reranker-v2-m3 cross-encoder re-ranks the initial results to improve retrieval precision; and the multimodal large model Qwen-2.5-VL generates the final answer from both the retrieved text passages and the related images.
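The two-stage retrieval described above (dense retrieval, then cross-encoder re-ranking) can be sketched as follows. This is a minimal, hypothetical illustration: the real system uses BGE-M3 embeddings and the BGE-reranker-v2-m3 cross-encoder, which are replaced here by trivial stand-in scoring functions so the control flow is runnable without any models.

```python
import math

def embed(text):
    """Stand-in for BGE-M3: a tiny bag-of-letters vector (illustrative only)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dense_retrieve(query, chunks, top_k=3):
    """Stage 1: rank all chunks by embedding similarity to the query."""
    qv = embed(query)
    scored = sorted(((cosine(qv, embed(c)), c) for c in chunks), reverse=True)
    return [c for _, c in scored[:top_k]]

def rerank(query, candidates, top_k=2):
    """Stage 2 stand-in for the cross-encoder: score query and chunk jointly
    (here: word overlap) and keep only the best candidates."""
    qwords = set(query.lower().split())
    scored = sorted(((len(qwords & set(c.lower().split())), c) for c in candidates),
                    reverse=True)
    return [c for _, c in scored[:top_k]]

chunks = [
    "RAG retrieves external knowledge before generation",
    "embedding models map text to dense vectors",
    "the weather forecast predicts rain tomorrow",
]
query = "how does RAG retrieve knowledge"
hits = rerank(query, dense_retrieve(query, chunks))
print(hits[0])
```

The design point the thesis relies on is that the cheap dense stage narrows the corpus to a small candidate set, so the expensive joint query-chunk scoring of the cross-encoder only runs on a few passages.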
    The system is orchestrated with LangChain and runs its models locally via Ollama to preserve personal data privacy. Experiments show that it effectively extracts text and image information from user PDFs in both Chinese and English to answer complex questions, reducing hallucinations and providing more accurate and richer answers. This work discusses the technical design and advantages of each system component, compares the system with related research, and demonstrates the feasibility and benefits of applying multimodal retrieval and generation to personalized knowledge bases.
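The generation step hands the reranked text passages and the related page images to a vision-language model. A hypothetical sketch of how such a request might be assembled for a Qwen-2.5-VL model served locally through Ollama is shown below; the message shape (a `content` string plus a base64 `images` list) follows Ollama's chat API, but the model tag, prompt wording, and helper name are this sketch's assumptions, not the thesis author's code. The payload is only built, not sent.

```python
import base64

def build_multimodal_prompt(question, text_chunks, image_bytes_list):
    """Pack retrieved chunks and images into one Ollama-style chat request.
    Illustrative only: names and prompt wording are assumptions."""
    context = "\n\n".join(f"[chunk {i + 1}] {c}" for i, c in enumerate(text_chunks))
    return {
        "model": "qwen2.5-vl",  # assumed local model tag
        "messages": [{
            "role": "user",
            "content": (
                "Answer using only the context below.\n\n"
                f"{context}\n\nQuestion: {question}"
            ),
            # Ollama's chat API takes images as base64-encoded strings
            "images": [base64.b64encode(b).decode() for b in image_bytes_list],
        }],
    }

payload = build_multimodal_prompt(
    "What does Figure 2 show?",
    ["Figure 2 plots retrieval precision against corpus size."],
    [b"\x89PNG fake image bytes"],
)
print(payload["model"])
```

In the deployed system this request would be sent to the local Ollama server (e.g. via its `/api/chat` endpoint), so neither the documents nor the images ever leave the user's machine.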
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

    Files in This Item:

    File: index.html (0 Kb, HTML)


    All items in NCUIR are protected by copyright, with all rights reserved.
