Many studies have investigated representation learning for specific domains such as natural language processing and computer vision. Text can be viewed as a kind of representation that stands for a particular object; in other words, natural language and images may share the same meaning. Plenty of prior work has combined text and images for tasks such as image captioning, visual question answering, and image-to-text retrieval. However, the shared representation between multiple languages and an image is seldom discussed. Hence, in this study, we propose an encoder-decoder architecture to learn shared representations for inter- and intra-modality data in a supervised way. Within this framework, we regard the latent space vector produced by the encoder as the shared representation, since it is learned from both modalities to capture their shared semantics.

We further analyze the shared representations learned with our architecture. By visualizing them against single-modality representations, we demonstrate that our shared representations do learn from both image-modality and text-modality data. We also examine other factors that affect shared representation learning. We find that including synonyms in the training text leads to a more distinct and condensed distribution of shared representations for each class, while preserving the ability to reconstruct images and improving generality in generating text vectors. When training with an additional language, the shared representations can still be correctly converted into the original images and the corresponding text vectors, and their distribution exhibits the same distinct and condensed characteristic observed when adding synonyms. Lastly, we investigate the scalability of our shared representation learning process and discuss the limitations of this approach.
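As a concrete illustration of the kind of architecture described above, the following is a minimal PyTorch sketch of a supervised multimodal encoder-decoder whose latent vector serves as the shared representation. It is not the thesis's actual implementation: all module names, layer sizes, and the loss formulation are hypothetical assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SharedRepModel(nn.Module):
    """Hypothetical sketch: modality-specific encoders map an image vector and
    a text vector into one shared latent space; modality-specific decoders map
    a latent vector back to each modality."""

    def __init__(self, img_dim=784, txt_dim=300, latent_dim=128):
        super().__init__()
        self.img_encoder = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(),
                                         nn.Linear(256, latent_dim))
        self.txt_encoder = nn.Sequential(nn.Linear(txt_dim, 256), nn.ReLU(),
                                         nn.Linear(256, latent_dim))
        self.img_decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                         nn.Linear(256, img_dim))
        self.txt_decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                         nn.Linear(256, txt_dim))

    def forward(self, img, txt):
        z_img = self.img_encoder(img)  # latent from the image modality
        z_txt = self.txt_encoder(txt)  # latent from the text modality
        # Cross- and self-reconstruction: each latent vector is decoded into
        # both modalities, so the latent space must carry shared semantics.
        return {
            "img_from_img": self.img_decoder(z_img),
            "txt_from_img": self.txt_decoder(z_img),
            "img_from_txt": self.img_decoder(z_txt),
            "txt_from_txt": self.txt_decoder(z_txt),
            "z_img": z_img,
            "z_txt": z_txt,
        }

def loss_fn(out, img, txt):
    """Assumed supervised objective: reconstruction targets come from the
    paired (image, text) sample, plus a term pulling the two latent vectors
    of a pair together so both modalities share one representation."""
    mse = nn.functional.mse_loss
    return (mse(out["img_from_img"], img) + mse(out["img_from_txt"], img)
            + mse(out["txt_from_txt"], txt) + mse(out["txt_from_img"], txt)
            + mse(out["z_img"], out["z_txt"]))

# Usage example with random stand-in data for one paired sample batch.
model = SharedRepModel()
img = torch.randn(32, 784)   # e.g. flattened images
txt = torch.randn(32, 300)   # e.g. pretrained word/sentence vectors
loss = loss_fn(model(img, txt), img, txt)
loss.backward()
```

Under this sketch, adding synonyms or a second language would simply contribute extra text vectors paired with the same image, which is one plausible way the setup could yield the more condensed per-class latent distributions reported above.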