NCU Institutional Repository: Item 987654321/98507


    Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98507


    Title: A Study on the Application of Multimodal Large Language Model for Autonomous Task Planning and Execution in Simulated Robots
    Authors: 唐崇祐;Tang, Chung-Yu
    Contributors: Department of Computer Science and Information Engineering (資訊工程學系)
    Keywords: Multimodal Language Model; Task Planning; Natural Language Instruction; Atomic Action Sequence; Semantic Reasoning
    Date: 2025-08-04
    Issue Date: 2025-10-17 12:51:43 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: With recent breakthroughs in semantic understanding and reasoning by Multimodal Large Language Models (MLLMs), applying them effectively to real-world robot task planning and execution has become a key challenge in intelligent robotics. Although systems such as GR00T have demonstrated strong multimodal integration and manipulation capabilities, their high computational cost and hardware requirements make them difficult for most research institutions to replicate and deploy. To address this, we propose a general-purpose task planning framework based on lightweight language models, integrating commonsense knowledge-graph reasoning, a semantic understanding module, and the Phi-4-mini-reasoning model to translate natural language instructions into structured, executable atomic action sequences for a simulated robot.

    The proposed system accepts high-level commands phrased in abstract language and, using structured semantics and information about operable objects, infers and generates logically coherent action sequences. We design experiments covering representative tasks such as classification, stacking, navigation, and visual recognition to validate the language model's abilities in high-level semantic understanding, spatial reasoning, and instruction-to-action translation. Experimental results show that with appropriate prompt design and semantic constraints, the system not only executes tasks reliably but also matches the decomposition performance of GPT-4. Notably, for mid-sized models such as GPT-3.5 and Phi-3.5-mini, Chain-of-Thought and Few-shot strategies significantly improve task success rates. Overall, our findings demonstrate that the framework effectively supports language-driven autonomous task execution even under limited-resource conditions.
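    The instruction-to-action pipeline the abstract describes — grounding a high-level command against the set of operable objects and decomposing it into an atomic action sequence — can be sketched as follows. This is a minimal toy illustration only: the dataclass, function, and action names are assumptions for exposition, a rule-based stand-in for the LLM planner, not the thesis's actual API.

```python
# Hypothetical sketch of the planning step described in the abstract:
# a high-level instruction plus the list of operable objects is turned
# into a sequence of atomic actions (pick/place primitives).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Action:
    verb: str                       # atomic primitive, e.g. "pick", "place"
    target: str                     # object the action operates on
    destination: Optional[str] = None

def plan(instruction: str, operable_objects: List[str]) -> List[Action]:
    """Toy rule-based stand-in for the language-model planner:
    stack every block mentioned in the scene onto the table."""
    blocks = [o for o in operable_objects if "block" in o]
    surface = next(o for o in operable_objects if "table" in o)
    seq: List[Action] = []
    for b in blocks:
        seq.append(Action("pick", b))
        seq.append(Action("place", b, surface))
    return seq

# Example: an abstract command grounded against two blocks and a table.
actions = plan("stack all the blocks on the table",
               ["red_block", "blue_block", "table"])
```

    In the actual framework, the rule-based body of `plan` would be replaced by a prompted language-model call constrained (via Chain-of-Thought and Few-shot examples, per the abstract) to emit only such atomic primitives.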
    Appears in Collections:[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

    Files in This Item:

    File: index.html (0 Kb, HTML)


    All items in NCUIR are protected by copyright, with all rights reserved.
