| Abstract: | Multi-object pedestrian tracking plays a critical role in intelligent surveillance and machine vision applications. In practical deployments, however, systems are often constrained by bandwidth and storage, so images or features must be compressed before transmission, which noticeably degrades tracking performance. The conventional JDE (Joint Detection and Embedding) architecture performs detection and re-ID feature extraction within a single network, but its Darknet-53 backbone is computationally expensive and tends to suffer from a high missed-detection rate (FN) in complex scenes, a problem that feature compression further amplifies. In this thesis, we redesign the detection and embedding branches of JDE on top of the YOLO11 detection architecture and propose a multi-object tracking framework named YOLO11-JDE. The proposed model integrates a lightweight backbone and neck with multi-scale feature fusion capability, and adopts data augmentation strategies such as Mosaic to improve detection of small objects and occluded targets. 
Experimental results show that, in the uncompressed setting, YOLO11-JDE achieves higher MOTA and noticeably lower FN than the original JDE model across multiple sequences, indicating that the proposed architecture performs better in both detection and tracking. Subsequently, this thesis deploys YOLO11-JDE within a machine vision feature coding system (Feature Coding for Machine, FCTM), in which intermediate feature maps from the detection network are compressed and transmitted, then reconstructed at the decoder side for subsequent tracking. Through experiments at different compression qualities, this work analyzes the impact of FCTM compression on the tracking performance of YOLO11-JDE and quantitatively compares MOTA, FN, FP, and IDsw to characterize the degradation caused by feature compression. We further compare YOLO11-JDE with the original JDE under identical compression conditions, and the results show that YOLO11-JDE maintains better tracking robustness both before and after compression. In summary, this thesis demonstrates that replacing the original JDE with YOLO11-JDE as the core multi-object tracker effectively improves tracking performance under a machine vision feature coding system. The analyses also clarify how FCTM compression affects multi-object tracking architectures, providing a reference for designing future tracking networks and compensation mechanisms that are more robust to feature compression. |
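
The abstract compares MOTA, FN, FP, and IDsw. As a reading aid, the sketch below shows how MOTA is conventionally derived from the other three counts under the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDsw) / GT. The function name `mota` and all counts are hypothetical illustrations, not values or code from the thesis.

```python
def mota(fn: int, fp: int, idsw: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy under the CLEAR MOT definition.

    fn      -- missed ground-truth boxes (false negatives), summed over frames
    fp      -- spurious predicted boxes (false positives), summed over frames
    idsw    -- identity switches, summed over frames
    num_gt  -- total number of ground-truth boxes over all frames
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fn + fp + idsw) / num_gt


# Hypothetical counts, chosen only to illustrate the arithmetic.
baseline = mota(fn=300, fp=40, idsw=20, num_gt=1000)   # 1 - 360/1000
fewer_fn = mota(fn=120, fp=40, idsw=20, num_gt=1000)   # 1 - 180/1000

print(f"baseline MOTA: {baseline:.2f}")
print(f"MOTA with fewer misses: {fewer_fn:.2f}")
```

Because FN enters the numerator directly, lowering the number of missed detections while FP and IDsw stay fixed raises MOTA by the same fraction of the ground-truth count, which is why a drop in FN and a rise in MOTA tend to be reported together.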