Detailed Record for Thesis 109522117
Name: Zi-Xiu Lan (藍子修)    Department: Computer Science and Information Engineering
Thesis Title: Visual Odometry Based on the Transformer Method
Related Theses
★ A Grouping Mechanism Based on Social Relationships in edX Online Discussion Forums
★ A 3D Visualized Facebook Interaction System Built with Kinect
★ An Assessment System for Smart Classrooms Built with Kinect
★ An Intelligent Urban Route Planning Mechanism for Mobile Device Applications
★ Dynamic Texture Transfer Based on Analysis of Key Momentum Correlations
★ A Seam Carving System That Preserves Straight-Line Structures in Images
★ A Community Recommendation Mechanism Built on an Open Online Community Learning Environment
★ System Design of an Interactive Situated Learning Environment for English as a Foreign Language
★ An Emotional Color Transfer Mechanism Based on Skin-Color Preservation
★ A Gesture Recognition Framework for Virtual Keyboards
★ Error Analysis of the Fractional-Power Grey Generating Prediction Model and Development of a Computer Toolbox
★ Real-Time Human Skeleton Motion Construction Using Inertial Sensors
★ Real-Time 3D Modeling Based on Multiple Cameras
★ A Grouping Mechanism for Genetic Algorithms Based on Complementarity and Social Network Analysis
★ A Virtual Instrument Performance System with Real-Time Hand Tracking
★ A Real-Time Virtual Instrument Performance System Based on Neural Networks
Files: The full text can be browsed in the system (available after 2024-7-14)
Abstract (Chinese): Compared with object detection, classification, or object tracking, visual odometry (VO) is a relatively niche field. It is one of the most important modules in Simultaneous Localization and Mapping (SLAM), and its purpose is to incrementally estimate the camera motion between adjacent frames. Its main applications are autonomous mobile robots, drones, and related areas.
Traditional VO methods require every module to be carefully designed and tightly coupled with the others in order to perform well. With the development of machine learning, however, many vision tasks have made major breakthroughs with its help. Research on sequence-to-sequence problems used to rely on long short-term memory (LSTM) models, but the Transformer, proposed in recent years, has achieved an even greater breakthrough: its self-attention removes the restriction that RNNs cannot compute in parallel, and it has rapidly become a very popular machine learning model that has been applied in many different fields.
This thesis centers on how to exploit the properties of the Transformer to improve the VO problem. We use the parallel-processing advantage of self-attention to handle stacked consecutive images and thereby obtain the contextual relationships between preceding and following frames. In addition, we use conditional positional encoding to overcome the fixed-length drawback of absolute/relative positional encodings. Finally, the experiments show how our method produces its improvements.
Abstract (English): Visual odometry (VO) is a relatively under-explored field compared with object detection, classification, or object tracking. It is one of the most important modules in Simultaneous Localization and Mapping (SLAM); its purpose is to incrementally estimate camera motion between adjacent frames. It is mainly used in autonomous mobile robots, drones, and other related fields.
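For context, the incremental formulation is standard in the VO literature: each estimated inter-frame motion is a rigid-body transform, and the global pose of frame $k$ is recovered by chaining the relative estimates,

\[ T_k = T_{k-1}\, T_{k-1,k}, \qquad T_{k-1,k} = \begin{bmatrix} R & t \\ \mathbf{0}^{\top} & 1 \end{bmatrix} \in SE(3), \]

so an error in any single relative estimate propagates to every later pose; this accumulated drift is what VO methods are ultimately judged on.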
Traditional VO methods need each module to be carefully designed and tightly coupled with the others to achieve good performance. However, with the development of machine learning, many vision tasks have achieved major breakthroughs with its help. Previous studies on sequence-to-sequence problems usually used long short-term memory (LSTM) models. The Transformer, proposed in recent years, has made an even bigger breakthrough: its self-attention removes the limitation that RNNs cannot compute in parallel, and it has quickly become a very popular machine learning model that has been applied in many different fields.
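The self-attention in question is the scaled dot-product attention of the Transformer [18]. Because it scores every query-key pair in one matrix product, an entire sequence is processed at once rather than token by token as in an RNN:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \]

where $Q$, $K$, and $V$ are the query, key, and value matrices projected from the input tokens and $d_k$ is the key dimension.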
This thesis focuses on how to use the characteristics of the Transformer to improve the VO problem. We take advantage of the parallel processing of self-attention to handle stacked consecutive images and obtain the contextual relationship between preceding and subsequent frames. Additionally, we use conditional positional encoding to address the fixed-length limitation of absolute/relative positional encoding. Finally, we show in experiments how our method yields improvements.
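The full text is not yet open (see above), so the following is only a minimal PyTorch sketch of the pipeline the abstract describes, not the thesis's actual implementation; all module and parameter names (TransformerVO, PEG, dim, and so on) are hypothetical. The conditional positional encoding follows the depthwise-convolution positional encoding generator of [19], adapted here to the temporal axis:

import torch
import torch.nn as nn

class PEG(nn.Module):
    # Positional encoding generator in the spirit of [19]: a depthwise
    # convolution whose output is conditioned on the input itself, so it
    # imposes no fixed maximum sequence length.
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                     # x: (batch, seq_len, dim)
        return x + self.proj(x.transpose(1, 2)).transpose(1, 2)

class TransformerVO(nn.Module):
    def __init__(self, dim=512, heads=8, layers=6):
        super().__init__()
        # Toy per-frame feature extractor; a real system would use a much
        # deeper CNN (e.g. a FlowNet-style encoder [13] over frame pairs).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.peg = PEG(dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.pose_head = nn.Linear(dim, 6)    # 3 translation + 3 rotation

    def forward(self, frames):                # frames: (batch, seq_len, 3, H, W)
        b, t = frames.shape[:2]
        tokens = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        tokens = self.peg(tokens)             # conditional positional encoding
        ctx = self.encoder(tokens)            # all frames attended in parallel
        return self.pose_head(ctx[:, 1:])     # pose of frame i relative to i-1

model = TransformerVO()
rel_poses = model(torch.randn(2, 5, 3, 192, 640))  # shape: (2, 4, 6)

Because the encoder sees the whole window in one pass, inter-frame context is captured in parallel, which is exactly the property the abstract contrasts with LSTM-based pipelines such as DeepVO [11].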
Keywords: ★ Visual odometry (視覺里程計)
Table of Contents
1 Introduction 1
2 Related Work 3
2.1 Simultaneous Localization and Mapping 3
2.1.1 SLAM Methods 4
2.1.2 Visual Odometry Methods 6
2.2 Methods based on Geometry 9
2.2.1 Feature-based Method 10
2.2.2 Optical Flow/Direct Method 13
2.2.3 Comparison of Feature-based and Direct Method 16
2.3 Methods based on Learning 18
2.3.1 Supervised Learning 18
2.3.2 Unsupervised Learning 21
2.3.3 Comparison of Supervised and Unsupervised Learning 26
3 Proposed Method 27
3.1 Overview 27
3.2 Feature Extraction 28
3.3 Attention based Contextual Enhancement 29
3.4 Sequential Modeling 33
3.5 Loss Function 35
4 Experiments 37
4.1 KITTI Datasets 37
4.2 Training and Data Pre-processing 37
4.3 Experimental Results 38
5 Conclusion 42
6 References 43
References
[1] W. Hess, D. Kohler, H. Rapp, and D. Andor, "Real-Time Loop Closure in 2D LIDAR SLAM", in IEEE International Conference on Robotics and Automation (ICRA), 2016.
[2] Jakob Engel, Thomas Schöps, and Daniel Cremers, "LSD-SLAM: Large-Scale Direct Monocular SLAM", in European Conference on Computer Vision (ECCV), pp. 834-849, 2014.
[3] https://commons.wikimedia.org/wiki/File:6DOF.svg (Six degrees of freedom)
[4] https://docs.opencv.org/3.4/df/d0c/tutorial_py_fast.html (FAST Algorithm)
[5] Georg Klein and David Murray, "Parallel Tracking and Mapping for Small AR Workspaces", in IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2007.
[6] Raul Mur-Artal, J. M. M. Montiel, and Juan D. Tardos, "ORB-SLAM: A Versatile and Accurate Monocular SLAM System", in IEEE Transactions on Robotics, pp. 1147-1163, 2015.
[7] Tong Qin, Peiliang Li, and Shaojie Shen, "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator", in IEEE Transactions on Robotics, pp. 1004-1020, 2018.
[8] Jakob Engel, Vladlen Koltun, and Daniel Cremers, "Direct Sparse Odometry", in IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 611-625, 2018.
[9] Christian Forster, Matia Pizzoli, and Davide Scaramuzza, "SVO: Fast Semi-Direct Monocular Visual Odometry", in IEEE International Conference on Robotics and Automation (ICRA), 2014.
[10] https://vision.in.tum.de/research/vslam/lsdslam (Difference to keypoint-based methods)
[11] Sen Wang, Ronald Clark, Hongkai Wen, and Niki Trigoni, "DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks", in IEEE International Conference on Robotics and Automation (ICRA), 2017.
[12] Jian Jiao, Jichao Jiao, Yaokai Mo, Weilun Liu, and Zhongliang Deng, "MagicVO: End-to-End Monocular Visual Odometry through Deep Bi-directional Recurrent Convolutional Neural Network", in Computer Vision and Pattern Recognition (CVPR), 2019.
[13] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox, "FlowNet: Learning Optical Flow with Convolutional Networks", in IEEE International Conference on Computer Vision (ICCV), 2015.
[14] Ruihao Li, Sen Wang, Zhiqiang Long, and Dongbing Gu, "UnDeepVO: Monocular Visual Odometry through Unsupervised Deep Learning", in IEEE International Conference on Robotics and Automation (ICRA), 2018.
[15] Yasin Almalioglu, Muhamad Risqi U. Saputra, Pedro P. B. de Gusmao, Andrew Markham, and Niki Trigoni, "GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with Generative Adversarial Networks", in IEEE International Conference on Robotics and Automation (ICRA), 2019.
[16] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu, "Spatial Transformer Networks", arXiv:1506.02025, 2015.
[17] Richard A. Newcombe, Steven J. Lovegrove, and Andrew J. Davison, "DTAM: Dense Tracking and Mapping in Real-Time", in IEEE International Conference on Computer Vision (ICCV), 2011.
[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention Is All You Need", arXiv:1706.03762, 2017.
[19] Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen, "Conditional Positional Encodings for Vision Transformers", arXiv:2102.10882, 2021.
Advisor: 施國琛 (Timothy K. Shih)    Date Approved: 2022-8-8