PagePilot: A Multimodal Automated Web Assistant Based on Multi-Agent Architecture

Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98326

Title:	PagePilot: A Multimodal Automated Web Assistant Based on Multi-Agent Architecture
Authors:	葉季儒;Yeh, Chi-Ju
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Automation; Web Automation; Large Language Model Agent; Multi-Agent
Date:	2025-07-25
Issue Date:	2025-10-17 12:38:03 (UTC+8)
Publisher:	National Central University
Abstract:	With advances in the reasoning and multimodal analysis capabilities of large language models (LLMs), many tasks can now be automated, such as solving mathematical problems, controlling computers, and operating web pages. Existing tools such as browser-use and Manus can browse the web on a user's behalf and handle tasks such as online shopping and information search. However, current automation tools struggle with long web pages, large volumes of articles, and tasks that require complex operations. They are prone to navigation errors, visual alignment errors, and hallucination, all of which hinder automated operation. Automated agents therefore still need further research, in particular deeper optimization for web page structure, to address these problems. Taking WebVoyager, an automated web operation system, as a reference, we propose PagePilot, which integrates visual input from the web page with source code information as the input to the LLM agent. PagePilot operates web pages using visual methods, supplemented with key information extracted from the page source code, to improve its performance on information retrieval tasks.
Additionally, the system incorporates optimizations such as dynamic loading and an observer agent: the former simulates user mouse scrolling to load more content, while the latter provides rollback when an operation goes wrong. Experiments show that these improvements mitigate the control problems described above and raise task completion rates. On the WebVoyager and GAIA datasets, PagePilot achieves task completion rates of 76% and 57%, respectively, significantly surpassing the WebVoyager (65%, 27%) and GPT-4 (32%, 18%) baselines, while also greatly reducing the number of actions required. We also constructed a task dataset drawn from mind2web and a Chinese-language web dataset. Even on these more complex datasets, the system reaches 52% and 70%, respectively. Human evaluation and LLM-based evaluation yield similar results, indicating that the system performs well on information retrieval tasks. According to the ablation study, the architecture improves the task completion rate by 30% while reducing the number of action steps by 9%, establishing a new benchmark for automated web control.
Overall, we propose a web automation control system based on a Multi-Agent architecture. Through a novel combination of visual and source code inputs, together with an architecture deeply optimized for web control, it substantially improves task completion rates while reducing the number of operation steps. We also introduce an evaluation dataset based on Chinese web pages to validate the feasibility of automated control on Chinese-language websites. We hope that these methods and resources will benefit the field of web automation and promote further research in this area.
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation
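
The abstract names two mechanisms without detailing them: dynamic loading (scrolling to reveal more content) and an observer agent that rolls back after an erroneous operation. The sketch below is only an illustration of those two ideas under simplifying assumptions and is not the thesis's implementation. It uses no browser: `ToyPage`, `load_dynamically`, `run_with_observer`, and the `looks_wrong` check are all hypothetical names invented here, and the "page" is just a list revealed in batches.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ToyPage:
    """Hypothetical stand-in for a lazily loaded web page (not a real browser)."""
    all_items: List[str]
    batch_size: int = 2
    visible: List[str] = field(default_factory=list)

    def scroll(self) -> int:
        """Reveal the next batch of items and return how many were added."""
        start = len(self.visible)
        new_items = self.all_items[start:start + self.batch_size]
        self.visible.extend(new_items)
        return len(new_items)


def load_dynamically(page: ToyPage, max_scrolls: int = 10) -> int:
    """Scroll until a scroll adds nothing new or the scroll budget runs out.

    Returns the number of scrolls performed.
    """
    scrolls = 0
    while scrolls < max_scrolls:
        scrolls += 1
        if page.scroll() == 0:
            break
    return scrolls


def run_with_observer(
    start: str,
    actions: List[Callable[[str], str]],
    looks_wrong: Callable[[str], bool],
) -> Tuple[List[str], int]:
    """Apply actions in order; undo any step the observer flags as an error.

    States are plain strings here (assumed to be URL-like); a real system
    would snapshot browser state instead.
    """
    history = [start]
    rollbacks = 0
    for act in actions:
        history.append(act(history[-1]))  # the acting agent moves forward
        if looks_wrong(history[-1]):      # the observer inspects the new state
            history.pop()                 # roll back to the last good state
            rollbacks += 1
    return history, rollbacks


if __name__ == "__main__":
    page = ToyPage(all_items=["a", "b", "c", "d", "e"])
    print(load_dynamically(page), page.visible)

    steps = [lambda s: s + "/search", lambda s: s + "/404", lambda s: s + "/results"]
    print(run_with_observer("home", steps, lambda s: s.endswith("/404")))
```

Keeping the observer separate from the acting step means the error check can be swapped out (for example, for an LLM-based judgment) without touching the action loop. Whether PagePilot is structured this way is not stated in the abstract.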

