In recent years, deep learning has been widely applied to robotics, and the interaction between a robot's vision and language has emerged as a particularly challenging research problem that calls for new breakthroughs. Many studies in this area are evaluated on the ALFRED (Action Learning From Realistic Environments and Directives) benchmark, in which a robot must carry out everyday indoor household tasks by following natural-language instructions. This thesis argues that equipping a robot with both visual semantic understanding and language semantic understanding improves its inference ability. We propose a novel method, VSGM (Visual Semantic Graph Memory), which uses a semantic graph representation to obtain better visual image features and strengthen the robot's visual understanding. Using prior knowledge and a scene graph generation network, the objects detected in the image, their attributes, and the predicted relations among them are converted into a graph-based representation provided to the robot; the detected objects are also projected onto a top-down egocentric map. Finally, the object features most relevant to the current task are extracted by a graph neural network. The proposed method is validated in the ALFRED environment, where adding VSGM to the model improves the task success rate by 6 to 10%.
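The sketch below illustrates, under stated assumptions, the kind of graph neural network step the abstract describes: detected-object nodes and their predicted relations are aggregated into task-relevant object features. It is a minimal illustration in plain PyTorch; the class name SemanticGraphEncoder, the feature dimensions, and the toy adjacency matrix are hypothetical and not taken from the thesis implementation.

```python
# Minimal sketch (assumed names/dimensions) of graph-based object-feature
# extraction over a semantic graph, in the spirit of the method above.
import torch
import torch.nn as nn


class SemanticGraphEncoder(nn.Module):
    """One graph-convolution layer over detected-object nodes.

    Each node carries an object feature vector (e.g., from a detector);
    the adjacency matrix encodes object-object relations predicted by a
    scene graph generation network.
    """

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, in_dim); adj: (N, N). Add self-loops so each node
        # keeps its own feature, then mean-aggregate over neighbours.
        adj = adj + torch.eye(adj.size(0), device=adj.device)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        h = (adj @ node_feats) / deg
        return torch.relu(self.linear(h))


if __name__ == "__main__":
    # 5 detected objects with 256-d features and a sparse relation graph
    # (e.g., "mug on table" links node 0 and node 1).
    feats = torch.randn(5, 256)
    adj = torch.zeros(5, 5)
    adj[0, 1] = adj[1, 0] = 1.0
    encoder = SemanticGraphEncoder(256, 128)
    task_feats = encoder(feats, adj)  # (5, 128) task-relevant node features
    print(task_feats.shape)
```

In practice these per-object features would be combined with the top-down egocentric map and the language instruction before action prediction; that fusion step is omitted here.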