Abstract: Multiple object tracking (MOT) is a popular research topic in machine learning, and its greatest challenge is maintaining stability when pedestrians overlap. Most solutions extract appearance or motion features from the preceding frames to measure the similarity between objects in the current frame and those in earlier frames, and then match all objects with the Hungarian algorithm as a post-processing step. In early 2021, MIT, Facebook, Google, and others brought the Transformer, a popular architecture from natural language processing, into object tracking; its accuracy surpassed existing models and triggered a wave of follow-up research. Although the Transformer attracted many adopters, the large training datasets and memory it requires remain a heavy burden for many researchers. In mid-2021, the first end-to-end Transformer-based tracking model was published. Its accuracy is not the best, but its architecture is simple: the data association step, which previously had to be designed by hand, is built into the architecture, reducing the errors introduced by hand-crafted functions and presenting the pipeline more intuitively. However, this end-to-end design also couples object detection with the subsequent object tracking, so the two strongly constrain each other. In this paper, we propose a new idea: we feed the output of a YOLOv5 model into the Transformer as auxiliary input to improve model stability during training. This not only speeds up convergence, but also allows fewer stacked Transformer layers, lowering the GPU memory requirement so that users with a single GPU can train the model easily.
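The frame-to-frame association mentioned above is usually formulated as a bipartite matching problem. The following is only a minimal sketch of that background step, not code from this work: the cosine-distance cost and the gating threshold `max_cost` are illustrative assumptions, and the matching itself uses the Hungarian algorithm as implemented by SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate(prev_feats: np.ndarray, curr_feats: np.ndarray, max_cost: float = 0.7):
    """Match current detections to existing tracks by appearance similarity.

    prev_feats: (M, D) L2-normalized embeddings of existing tracks.
    curr_feats: (N, D) L2-normalized embeddings of current detections.
    Returns a list of (track_idx, detection_idx) pairs.
    """
    # Cosine distance as the association cost (lower = more similar).
    cost = 1.0 - prev_feats @ curr_feats.T          # (M, N)

    # Hungarian algorithm: minimum-cost bipartite matching.
    rows, cols = linear_sum_assignment(cost)

    # Reject matches whose cost exceeds the gating threshold.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tracks = rng.normal(size=(3, 128))
    tracks /= np.linalg.norm(tracks, axis=1, keepdims=True)
    dets = rng.normal(size=(4, 128))
    dets /= np.linalg.norm(dets, axis=1, keepdims=True)
    print(associate(tracks, dets))
```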
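The idea of letting YOLOv5 outputs assist the Transformer can be pictured roughly as below. This is an illustrative sketch only, not the model described in this work: the linear box embedding, the decoder depth, and the way detections are turned into queries are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class DetectionGuidedDecoder(nn.Module):
    """Sketch: embed YOLOv5 boxes as queries for a shallow Transformer decoder."""

    def __init__(self, d_model: int = 256, nhead: int = 8, num_layers: int = 2):
        super().__init__()
        # Project (x, y, w, h, confidence) detections into query embeddings.
        self.box_embed = nn.Linear(5, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        # Fewer stacked layers than a detect-from-scratch decoder would need.
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, detections: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        """detections: (B, N, 5) YOLOv5 boxes + scores; image_features: (B, L, d_model)."""
        queries = self.box_embed(detections)            # (B, N, d_model)
        # Detection-derived queries attend to the image feature memory.
        return self.decoder(tgt=queries, memory=image_features)


if __name__ == "__main__":
    model = DetectionGuidedDecoder()
    dets = torch.rand(1, 10, 5)          # placeholder YOLOv5 detections
    feats = torch.rand(1, 400, 256)      # placeholder backbone features
    print(model(dets, feats).shape)      # torch.Size([1, 10, 256])
```

In this kind of sketch, the detector output anchors the queries to plausible object locations, which is the intuition behind needing fewer decoder layers and less GPU memory.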