Abstract: | Deep neural networks have revolutionized the field of computer vision, leading to significant advancements in single object tracking tasks. However, these networks still encounter challenges in dynamic environments where target objects undergo appearance changes and occlusions. Additionally, maintaining consistent tracking over extended periods, especially in the presence of similar-looking background objects, remains a major challenge. The core difficulty in single object tracking arises from the frequent variations a target's appearance can undergo throughout a video sequence. These variations, such as changes in aspect ratio, scale, and pose, can significantly impact the robustness of trackers. Additionally, occlusions by other objects and cluttered backgrounds further complicate the process of maintaining a consistent track. To address these challenges, this dissertation proposes a novel tracking architecture that leverages the combined strengths of a temporal convolutional network (TCN), an attention mechanism, and a spatial-temporal memory network.
The TCN component plays a critical role by capturing temporal dependencies within the video sequence. This enables the model to learn how an object's appearance evolves over time, yielding greater resilience to short-term appearance changes. Incorporating an attention mechanism offers a twofold benefit. First, it reduces the computational complexity of the model by allowing it to focus on the most relevant regions of the frame given the current context; this is particularly advantageous in scenarios with cluttered backgrounds or multiple similar objects. Second, the attention mechanism directs the model's focus toward informative features that are critical for tracking the target object. The final component, the spatial-temporal memory network, leverages the power of long-term memory. This network stores historical information about the target object, including its appearance and motion patterns, and the stored information serves as a reference point for the tracker, allowing it to better adapt to target deformations and occlusions. By effectively combining these three elements, our proposed architecture aims to achieve superior tracking performance compared with existing methods. The effectiveness of our approach is validated through extensive evaluations on several benchmark datasets, including GOT-10K, OTB2015, UAV123, and VOT2018. Our model achieves a state-of-the-art average overlap (AO) of 67.5% on the GOT-10K dataset, a 72.1% success score (AUC) on OTB2015, a 65.8% success score (AUC) on UAV123, and 59.0% accuracy on the VOT2018 dataset. The results highlight the superior tracking capability of the proposed approach in single object tracking tasks, demonstrating its potential to address the challenges posed by appearance variations and prolonged tracking scenarios.
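The three ideas above can be illustrated in miniature: a causal dilated convolution that sees only current and past frames (the TCN), scaled dot-product attention that re-weights candidate features by relevance, and a fixed-capacity memory bank that stores historical target features and is read by similarity. This is a minimal NumPy sketch under assumed shapes and names (`causal_conv1d`, `attention`, `MemoryBank` are all illustrative, not the dissertation's actual implementation):

```python
import numpy as np

def causal_conv1d(x, w, dilation=1):
    """TCN idea: dilated 1-D convolution over time, left-padded so the
    output at frame t depends only on frames <= t. x: (T, C); w: (K, C)."""
    T, C = x.shape
    K = w.shape[0]
    pad = (K - 1) * dilation
    xp = np.vstack([np.zeros((pad, C)), x])
    out = np.zeros(T)
    for t in range(T):
        taps = xp[t + pad - np.arange(K) * dilation]  # current and past frames only
        out[t] = np.sum(taps * w)
    return out

def attention(query, keys, values):
    """Attention idea: softmax over query-key similarity focuses the model
    on the most relevant entries; weights sum to 1."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values, weights

class MemoryBank:
    """Memory-network idea: store historical target features (appearance
    over time) and read them back by similarity to the current query."""
    def __init__(self, capacity=10):
        self.capacity = capacity
        self.slots = []

    def write(self, feat):
        self.slots.append(feat / (np.linalg.norm(feat) + 1e-8))
        self.slots = self.slots[-self.capacity:]  # keep the most recent entries

    def read(self, query):
        q = query / (np.linalg.norm(query) + 1e-8)
        keys = np.stack(self.slots)
        out, _ = attention(q, keys, keys)  # similarity-weighted recall
        return out
```

The causal padding is what lets the temporal convolution model how appearance evolves without peeking at future frames, and the bounded memory capacity mirrors the trade-off between long-term reference information and storage cost.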
This research contributes to the advancement of tracking systems by offering a robust and adaptive solution that combines attention and memory dynamics to enhance tracking accuracy and robustness in complex real-world scenarios. |