Question answering (QA) systems are now widely used across natural language processing applications; restaurants and transit stations, for example, often have their own QA systems or chatbots. In this research, we propose a workflow for building a multimodal question answering system that helps people complete instruction-giving tasks. We demonstrate our work on Meccanoid, a personal robot developed by SPIN MASTER. When users encounter a problem while assembling Meccanoid, they ask our system, and the system provides the most suitable guidance as a solution.

An FAQ is a list of frequently asked questions and their answers on a particular topic. FAQs are common in customer service and in product operation manuals. In this paper, we present a complete workflow for transferring the knowledge in an FAQ into a question answering system for the Meccanoid robot assembly task, covering data collection, user intent definition, and intent classification. Furthermore, we introduce a multimodal architecture to address the bottleneck that traditional single-modality systems may encounter. The experimental results show that combining visual and textual context improves intent classification performance. The proposed workflow should generalize to other domains depending on the requester's needs, and we hope it can contribute to the field of smart manufacturing.