Visual odometry (VO) is a relatively unexplored field compared with object detection, classification, or object tracking. It is one of the most important modules in Simultaneous Localization and Mapping (SLAM): its purpose is to incrementally estimate the camera motion between adjacent frames, and it is mainly applied to autonomous mobile robots, drones, and other related fields.

Traditional VO methods require each module to be carefully designed and tightly coupled with the others in order to perform well. With the development of machine learning, however, many vision tasks have achieved major breakthroughs. Previous work on sequence-to-sequence problems typically relied on long short-term memory (LSTM) networks, but the recently proposed Transformer has made an even greater breakthrough: its self-attention mechanism removes the constraint that RNNs cannot be computed in parallel, and it has quickly become a highly popular model that has been applied in many different fields.

This paper focuses on how to exploit the properties of the Transformer to improve VO. We take advantage of the parallelism of self-attention to process stacked consecutive images and thereby capture the contextual relationship between preceding and subsequent frames. In addition, we use conditional positional encoding to overcome the fixed-length limitation of absolute/relative positional encodings. Finally, our experiments show how our method yields these improvements.
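To make the positional-encoding point concrete, the sketch below shows one way conditional positional encoding can be realized: instead of looking up a fixed-size table of absolute or relative embeddings, a positional encoding generator derives the encoding from the tokens themselves with a depthwise convolution, in the spirit of CPVT (Chu et al.). This is a minimal illustration under assumed names (`PositionalEncodingGenerator`, `dim`, `kernel_size` are hypothetical), not the exact module used in this work.

```python
import torch
import torch.nn as nn


class PositionalEncodingGenerator(nn.Module):
    """Conditional positional encoding via a depthwise convolution.

    The encoding is computed from the token feature map itself, so it
    adapts to any input length/resolution, unlike a fixed-size table of
    absolute or relative positional embeddings.
    """

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise conv (groups == channels); the zero padding gives
        # border tokens a positional cue, making the encoding
        # translation-aware without any fixed-length table.
        self.proj = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h * w, dim) -- a ViT-style token sequence.
        b, n, c = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        # Residual connection: tokens plus their conditional encoding.
        feat = self.proj(feat) + feat
        return feat.flatten(2).transpose(1, 2)  # back to (b, h*w, dim)


if __name__ == "__main__":
    peg = PositionalEncodingGenerator(dim=64)
    # The same module handles different spatial resolutions, which is
    # exactly the fixed-length limitation it is meant to remove.
    for h, w in [(8, 8), (12, 20)]:
        x = torch.randn(2, h * w, 64)
        print(peg(x, h, w).shape)  # (2, h*w, 64)
```

In a Transformer-based VO pipeline, such a module would typically sit after the first encoder block, so that the per-frame tokens of the stacked image sequence carry position information regardless of how many frames or patches are fed in.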