With recent breakthroughs in semantic understanding and reasoning by Multimodal Large Language Models (MLLMs), applying them effectively to real-world robot task planning and execution has become a key challenge in intelligent robotics. Although systems such as GR00T have demonstrated strong multimodal integration and manipulation capabilities, their high computational cost and hardware requirements make them difficult for most research institutions to replicate and deploy. To address this, we propose a general-purpose task planning framework built on lightweight language models. It integrates commonsense knowledge graph reasoning, a semantic understanding module, and the Phi-4-mini-reasoning model to translate natural language instructions into structured, executable atomic action sequences for execution by a simulated robot.
The proposed system accepts high-level commands phrased in abstract language and, using structured semantics and information about operable objects, infers and generates logically coherent action sequences. We design experiments covering representative tasks (classification, stacking, navigation, and visual recognition) to validate the language model's ability in high-level semantic understanding, spatial reasoning, and instruction-to-action translation. Experimental results show that, with appropriate prompt design and semantic constraints, the system not only executes tasks reliably but also matches the task decomposition performance of GPT-4. Notably, for mid-sized models such as GPT-3.5 and Phi-3.5-mini, Chain-of-Thought and few-shot prompting significantly improve task success rates. Overall, our findings demonstrate that the framework effectively supports language-driven autonomous task execution even under limited-resource conditions.