Master's/Doctoral Thesis 110523039 — Complete Metadata Record

DC field: value [language]
dc.contributor: 通訊工程學系 (Department of Communication Engineering) [zh_TW]
dc.creator: 王品灃 [zh_TW]
dc.creator: Pin-Feng Wang [en_US]
dc.date.accessioned: 2023-07-19T07:39:07Z
dc.date.available: 2023-07-19T07:39:07Z
dc.date.issued: 2023
dc.identifier.uri: http://ir.lib.ncu.edu.tw:88/thesis/view_etd.asp?URN=110523039
dc.contributor.department: 通訊工程學系 (Department of Communication Engineering) [zh_TW]
dc.description: 國立中央大學 (National Central University) [zh_TW]
dc.description: National Central University [en_US]
dc.description.abstract: In single object tracking, trackers built on hierarchical Vision Transformer (ViT) architectures usually perform worse than those using a plain ViT, and the architectures in the literature all differ from one another, with no general-purpose network architecture. This thesis proposes a general hierarchical network architecture (HyperXTrack), the first to adopt a backbone network architecture as the interaction network for the tracking task, while also incorporating spatio-temporal context: the spatial context is multi-scale information, and the temporal context provides historical information. HyperXTrack performs global and local spatial interaction, with interaction complexity linear in the image resolution. Each HyperXTrack block first matches fine texture features and then performs interactive matching over the appearance contour of the whole object. The interaction backbone adopts the attention mechanism proposed in this thesis, together with the classic stacking rule of applying convolution before attention. Finally, this thesis proposes a lightweight re-pretraining strategy: starting from pre-trained MaxViT weights, the network with modified interaction operations is re-trained for only one epoch, after which its parameters can be transferred to downstream tasks. Experimental results show that the proposed HyperXTrack surpasses OSTrack on the GOT-10k dataset (AO of 75% versus 71%), and that its hierarchical architecture needs only 30M parameters to surpass OSTrack's 93M-parameter ViT architecture. [zh_TW]
dc.description.abstract: In single object tracking, trackers built on hierarchical Vision Transformer (ViT) architectures usually perform worse than those using a plain ViT. At the same time, the network architectures of state-of-the-art trackers are all distinct, so there is no general-purpose network architecture. This thesis presents HyperXTrack, the first architecture to apply a backbone network as the interaction network in visual tracking. In addition, the proposed backbone interacts with spatio-temporal context, where the spatial context is multi-scale information and the temporal context provides historical information. HyperXTrack performs global and local spatial interaction, and its computational complexity is linear in the image resolution. After correlating local texture features, it interacts over the contour of the entire object. The interaction backbone adopts the proposed attention mechanism and the classic stacking rule in which convolutions are applied before the attention mechanism. Finally, this thesis proposes a lightweight re-pretraining strategy: after modifying the interaction operations of the existing MaxViT network, the pre-trained MaxViT weights are reused and re-pretrained for only one epoch, after which the network can be transferred to downstream tasks. Experimental results show that HyperXTrack achieves an AO of 71.8% on the GOT-10k dataset, surpassing OSTrack's 71%, and that its hierarchical architecture needs only 30M parameters to surpass the 93M-parameter OSTrack ViT architecture. [en_US]
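The abstract above attributes the linear-complexity global and local spatial interaction to attention computed within local windows and within a strided global grid, in the MaxViT style that HyperXTrack builds on. A minimal NumPy sketch of that partitioning idea follows; the function names and the single-head, q = k = v simplification are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def block_partition(x, p):
    # (H, W, C) -> (num_windows, p*p, C): non-overlapping p x p local windows,
    # used for fine local-texture interaction.
    H, W, C = x.shape
    x = x.reshape(H // p, p, W // p, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, C)

def grid_partition(x, g):
    # (H, W, C) -> (num_groups, g*g, C): each group holds a g x g grid of
    # tokens strided across the whole image, giving sparse global interaction.
    H, W, C = x.shape
    x = x.reshape(g, H // g, g, W // g, C)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, C)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # tokens: (B, N, C); plain single-head attention per group (q = k = v here
    # for brevity). Each group has a fixed N = p*p tokens, so total cost grows
    # only with the number of groups, i.e. linearly in image resolution.
    q = k = v = tokens
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1]))
    return attn @ v

x = np.random.rand(8, 8, 4)                     # toy 8x8 feature map, 4 channels
local = self_attention(block_partition(x, 4))   # local (window) interaction
global_ = self_attention(grid_partition(x, 4))  # sparse global (grid) interaction
print(local.shape, global_.shape)               # (4, 16, 4) (4, 16, 4)
```

Because attention is quadratic only within each fixed-size group of 16 tokens, doubling the image area doubles the number of groups rather than quadrupling the attention cost, which is the linearity claim in the abstract.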
dc.subject: 單目標追蹤 (single object tracking) [zh_TW]
dc.subject: 階層式 (hierarchical) [zh_TW]
dc.subject: 重新預訓練 (re-pretraining) [zh_TW]
dc.subject: 視覺轉換器 (Vision Transformer) [zh_TW]
dc.subject: 模板更新策略 (template update strategy) [zh_TW]
dc.subject: single object tracking [en_US]
dc.subject: hierarchical [en_US]
dc.subject: re-pretraining [en_US]
dc.subject: vision Transformer [en_US]
dc.subject: template update strategy [en_US]
dc.title: 視覺追蹤的多尺度視覺基礎網路 (Multi-Scale Vision Foundation Networks for Visual Tracking) [zh_TW]
dc.language.iso: zh-TW [zh-TW]
dc.title: Multi-Scale Vision Foundation Networks for Visual Tracking [en_US]
dc.type: 博碩士論文 (thesis) [zh_TW]
dc.type: thesis [en_US]
dc.publisher: National Central University [en_US]
