Abstract: In recent years, the number and volume of news reports have grown explosively, making it increasingly important to effectively digest multiple news articles and generate summaries. Moreover, with the latest large language models (LLMs) such as GPT-4o relaxing token limits, it is worth reconsidering whether traditional multi-stage summarization methods still hold an advantage, and whether different prompt engineering techniques can further improve summary quality. This study uses LLMs for multi-document news summarization and public opinion analysis. By crawling sports news published within the past 24 hours, we analyze trending topics and sentiment trends, and ultimately generate summaries that incorporate the results of the public opinion analysis. We evaluate the summarization performance of four models, GPT-4o, Llama-3.3-70B, Mixtral-8x7B, and Gemma2-9B, and compare single-stage and multi-stage summarization methods. For prompt engineering, we design and test several strategies, including Simple prompts, Few-shot learning, Chain-of-Thought (CoT), and Instruct prompts, to examine their effects on summary quality.
To further improve the cost-effectiveness of summary evaluation, we adopt LLM-based automatic evaluation methods and compare their cost with that of human expert assessments. We also examine whether summarization research remains necessary now that tools such as NotebookLM are available. The results show that single-stage summarization outperforms multi-stage summarization in quality, and that Few-shot prompts significantly enhance semantic consistency and the preservation of key information. Among the LLMs tested, Llama-3.3-70B outperforms GPT-4o on the news summarization task, demonstrating that open-source models can compete with commercial ones in specific application scenarios. Additionally, NotebookLM's performance on multi-document news summarization is inferior to that of the summaries generated by our method, indicating that certain tasks still require deeper research. Furthermore, in the public opinion analysis application, LLMs effectively identify news topics and analyze textual sentiment, contributing to higher-quality summaries.
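The few-shot prompting strategy compared above can be illustrated with a minimal sketch. This is not the thesis's actual prompt: the instruction wording and the demonstration pair are hypothetical, and a real run would send the resulting string to one of the evaluated models.

```python
# Illustrative sketch of few-shot prompt construction for multi-document
# news summarization; the demonstration pairs below are invented examples.

EXAMPLES = [
    ("Article: Team A won 3-1. Article: Fans celebrated downtown.",
     "Team A's 3-1 victory sparked celebrations downtown."),
]

def build_few_shot_prompt(articles, examples=EXAMPLES):
    """Concatenate demonstration pairs, then the target articles,
    ending with a bare 'Summary:' cue for the model to complete."""
    parts = ["Summarize the following news articles into one coherent summary."]
    for source, summary in examples:
        parts.append(f"{source}\nSummary: {summary}")
    joined = " ".join(f"Article: {a}" for a in articles)
    parts.append(f"{joined}\nSummary:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(["Player B set a league record last night."])
print(prompt)
```

The same scaffold covers the other strategies tested: a Simple prompt drops the examples, while a CoT variant would add an intermediate reasoning instruction before the final summary cue.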