dc.description.abstract | Multiple object tracking is a popular research topic in machine learning. The biggest challenge is keeping object identities stable when objects overlap. Most solutions extract appearance or motion features from previous frames, measure the similarity between objects in the current frame and those in previous frames, and then match objects with the Hungarian algorithm.
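The matching step described above can be sketched with SciPy's linear_sum_assignment. The cosine-distance cost and the gating threshold below are illustrative assumptions, not the exact formulation of any particular tracker.

```python
# Minimal sketch of similarity-based matching with the Hungarian algorithm.
# The cosine-distance cost and the 0.5 gating threshold are illustrative
# assumptions, not taken from a specific tracker.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracks(track_feats, det_feats, max_cost=0.5):
    """Match existing track features to current-frame detection features."""
    # Normalize so that a dot product gives cosine similarity.
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    cost = 1.0 - t @ d.T                      # cosine distance, shape (T, D)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    # Keep only matches whose cost is below the gating threshold.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]

# Example: 3 tracks vs. 4 detections with 128-dim appearance features.
rng = np.random.default_rng(0)
matches = match_tracks(rng.normal(size=(3, 128)), rng.normal(size=(4, 128)))
print(matches)
```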
At the beginning of 2021, research groups at MIT, Facebook, Google, and others brought the Transformer, a popular architecture from natural language processing, into object tracking. Although this sparked a wave of enthusiasm and attracted many researchers to adopt the architecture, the large training datasets and GPU memory that Transformers require are not friendly to researchers or scholars with limited resources.
In mid-2021, the first end-to-end tracking model built on a Transformer was published. Although its accuracy is not the best, its architecture is simple: the data association that used to be designed by hand is learned inside the model, which reduces the need for manual design and makes the pipeline more intuitive. However, object detection and object tracking hinder each other because of the model's design. In this paper, we propose a new idea: use the output of a YOLOv5 model as additional input to assist the Transformer and increase the model's stability during training. This not only speeds up convergence but also reduces the number of stacked Transformer layers, lowering the GPU memory required so that users with a single GPU can train the model easily. | en_US |
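A minimal sketch of the proposed idea, assuming PyTorch: detections from a YOLOv5 model (boxes plus confidence) are embedded and used as decoder queries over image features, so the Transformer can stay shallow. The feature dimensions, layer count, and the detection embedding below are hypothetical choices for illustration, not the thesis's exact configuration.

```python
# Hedged sketch: feed YOLOv5 detection outputs to a shallow Transformer decoder.
# Dimensions, layer counts, and the detection embedding are illustrative
# assumptions; they do not reproduce the exact model in the thesis.
import torch
import torch.nn as nn

class DetectionGuidedTracker(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        # Embed each detection (x, y, w, h, confidence) into a query vector.
        self.det_embed = nn.Linear(5, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        # Few stacked layers: the detector has already localized the objects.
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.track_head = nn.Linear(d_model, 4)  # refined box per tracked object

    def forward(self, image_feats, detections):
        # image_feats: (B, N_tokens, d_model) flattened backbone features
        # detections:  (B, N_det, 5) YOLOv5 boxes + confidence for this frame
        queries = self.det_embed(detections)
        out = self.decoder(tgt=queries, memory=image_feats)
        return self.track_head(out)

# Toy forward pass with random features and 10 dummy detections.
model = DetectionGuidedTracker()
feats = torch.randn(1, 400, 256)   # e.g. a 20x20 feature map, flattened
dets = torch.rand(1, 10, 5)        # normalized boxes + confidence
print(model(feats, dets).shape)    # torch.Size([1, 10, 4])
```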