NCU Institutional Repository — Item 987654321/98191


    Please use this permanent URL to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98191


    Title: Evaluating Large Language Models for Multi-Document News Summarization: Architecture Design, Prompting Strategies, and Cost-Effectiveness
    Author: Fu, Meng-Chun (傅孟淳)
    Contributor: Department of Computer Science and Information Engineering
    Keywords: Large Language Models; Multi-document Summarization; Prompt Engineering; Cost-effectiveness; LLM-based Evaluation
    Date: 2025-07-01
    Uploaded: 2025-10-17 12:28:26 (UTC+8)
    Publisher: National Central University
    Abstract: In recent years, the number and volume of news reports have grown explosively,
    making it increasingly important to effectively digest multiple news articles and
    generate summaries. Moreover, with the latest large language models (LLMs) such as
    GPT-4o relaxing token limitations, it is worth reconsidering whether traditional multi-
    stage summarization methods still hold an advantage, and whether different prompt
    engineering techniques can further improve summary quality.
    This study utilizes large language models (LLMs) for multi-document news
    summarization and public opinion analysis. By crawling sports news published within
    the past 24 hours, we analyze trending topics and sentiment trends, and ultimately
    generate summaries that incorporate the results of public opinion analysis. We evaluate
    the summarization performance of four models: GPT-4o, Llama-3.3-70B, Mixtral-
    8x7B, and Gemma2-9B, and compare single-stage and multi-stage summarization
    methods. In terms of prompt engineering, we design and test various strategies,
    including Simple prompts, Few-shot learning, Chain-of-Thought (CoT), and Instruct
    prompts, to explore their effects on summary quality.
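The single-stage versus multi-stage comparison described above can be sketched as follows. This is a minimal sketch, not the thesis's implementation: `call_llm` is a hypothetical stand-in for any of the four models' chat APIs, stubbed here so the control flow runs offline.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical model call; a real version would query an LLM API
    (GPT-4o, Llama-3.3-70B, Mixtral-8x7B, or Gemma2-9B)."""
    return f"summary({len(prompt)} chars)"

def single_stage_summary(articles: list[str]) -> str:
    # Single stage: concatenate all articles into one prompt, feasible
    # now that models such as GPT-4o accept very long contexts.
    prompt = "Summarize the following news articles:\n\n" + "\n---\n".join(articles)
    return call_llm(prompt)

def multi_stage_summary(articles: list[str]) -> str:
    # Multi stage (map-reduce): summarize each article individually,
    # then summarize the partial summaries into a final summary.
    partials = [call_llm("Summarize this article:\n" + a) for a in articles]
    return call_llm("Combine these partial summaries:\n" + "\n".join(partials))
```

The prompting strategies compared in the study (Simple, Few-shot, CoT, Instruct) would vary only the prompt text passed to `call_llm`, leaving the pipeline structure unchanged.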
    To further improve the cost-effectiveness of summary evaluation, we adopt LLM-
    based automatic evaluation methods and compare them with human expert assessments.
    We also examine whether summarization research remains necessary in the context of
    tools like NotebookLM becoming available.
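The LLM-based automatic evaluation the study compares against human expert assessment follows the common "LLM-as-judge" pattern. The rubric dimensions and the `judge_call` stub below are illustrative assumptions, not the thesis's exact criteria.

```python
import json

def judge_call(prompt: str) -> str:
    """Hypothetical judge-model call; stubbed with a fixed JSON response.
    A real version would send the prompt to an evaluator LLM."""
    return json.dumps({"coherence": 4, "coverage": 5, "faithfulness": 4})

def evaluate_summary(source: str, summary: str) -> dict:
    # Ask the judge model to score the summary on a 1-5 rubric and
    # return machine-readable JSON, which is far cheaper per summary
    # than recruiting human expert annotators.
    prompt = (
        "Rate the summary against the source on a 1-5 scale for "
        "coherence, coverage, and faithfulness. Reply in JSON.\n\n"
        f"SOURCE:\n{source}\n\nSUMMARY:\n{summary}"
    )
    return json.loads(judge_call(prompt))
```

The cost-effectiveness comparison then reduces to API cost per evaluated summary versus expert annotation cost, under whatever rubric is chosen.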
    The results show that single-stage summarization outperforms multi-stage
    summarization in terms of quality, and Few-shot prompts significantly enhance
    semantic consistency and the preservation of key information. Among the LLMs tested,
    Llama-3.3-70B performs better than GPT-4o in the news summarization task,
    demonstrating that open-source models can compete with commercial ones in specific
    application scenarios. Additionally, NotebookLM’s performance on multi-document
    news summarization is inferior to the summaries generated by our proposed method,
    indicating that in certain tasks, further in-depth research is still required. Furthermore,
    in the application of public opinion analysis, LLMs are effective in identifying news
    topics and analyzing textual sentiment, thereby contributing to higher-quality
    summarization.
    Appears in collections: [Graduate Institute of Computer Science and Information Engineering] Theses & Dissertations

    Files in this item:

    File: index.html    Size: 0Kb    Format: HTML    Views: 25


    All items in NCUIR are protected by copyright, with all rights reserved.

