dc.description.abstract | Deep neural networks have revolutionized the field of computer vision, leading to significant advancements in single object tracking. However, these networks still struggle in dynamic environments where target objects undergo appearance changes and occlusions. Maintaining consistent tracking over extended periods, especially in the presence of similar-looking background objects, is also difficult. The core difficulty in single object tracking arises from the frequent variations a target's appearance can undergo throughout the video sequence. These variations, such as changes in aspect ratio, scale, and pose, can significantly impact the robustness of trackers. Occlusions by other objects and cluttered backgrounds further complicate the process of maintaining a consistent track.
To address these challenges, this dissertation proposes a novel tracking architecture that leverages the combined strengths of a temporal convolutional network (TCN), an attention mechanism, and a spatial-temporal memory network. The TCN component plays a critical role by capturing temporal dependencies within the video sequence. This enables the model to learn how an object's appearance evolves over time, resulting in greater resilience to short-term appearance changes. Incorporating an attention mechanism offers a two-fold benefit. Firstly, it reduces the computational complexity of the model by enabling it to focus on the most relevant regions of the frame based on the current context. This is particularly advantageous in scenarios with cluttered backgrounds or multiple similar objects present. Secondly, the attention mechanism directs the model's focus towards informative features that are critical for tracking the target object. The final component, the spatial-temporal memory network, leverages the power of long-term memory. This network stores historical information about the target object, including its appearance and motion patterns. This stored information serves as a reference point for the tracker, allowing it to better adapt to target deformations and occlusions. By effectively combining these three elements, our proposed architecture aims to achieve superior tracking performance compared to existing methods.
The effectiveness of our approach is validated through extensive evaluations on several benchmark datasets, including GOT-10K, OTB2015, UAV123, and VOT2018. Our model achieves a state-of-the-art average overlap (AO) of 67.5% on the GOT-10K dataset, a success score (AUC) of 72.1% on OTB2015, a success score (AUC) of 65.8% on UAV123, and an accuracy of 59.0% on the VOT2018 dataset.
The results highlight the superior tracking capabilities of our proposed approach in single object tracking tasks, demonstrating its potential to address the challenges posed by appearance variations and prolonged tracking scenarios. This research contributes to the advancement of tracking systems by offering a robust and adaptive solution that combines attention and memory dynamics to enhance tracking accuracy and robustness in complex real-world scenarios. | en_US |