| Abstract: | In recent years, the demand for machine vision tasks has grown rapidly, shifting the role of video coding from solely enhancing visual presentation quality to simultaneously serving the analytical needs of artificial-intelligence models. This has led to the emergence of the Video Coding for Machines (VCM) concept, which aims to establish a unified bitstream that serves both human perception and machine interpretation. Extending this idea, Feature Coding for Machines (FCM) emphasizes compressing and transmitting features extracted by deep neural networks, reducing bandwidth while maintaining downstream task performance. To facilitate standardization in this field, MPEG introduced the FCM Test Model (FCTM), which provides a complete processing pipeline covering feature extraction, compression, decoding, and task-level evaluation. However, the JDE (Joint Detection and Embedding) model commonly used as the frontend in existing FCTM frameworks shares a single feature branch between its detection and re-identification (ReID) embedding heads.
Under high compression, this design tends to suffer from insufficient semantic representation and unstable cross-frame identity association, limiting overall tracking performance. To address this issue, this study replaces JDE with CSTrack as the frontend feature extractor within FCTM. Through its Cross-Correlation Network (CCN) and Scale-Aware Attention Network (SAAN) modules, CSTrack employs stronger semantic modeling and cross-feature reinforcement, producing representations that are both more discriminative and more compressible. In multi-object tracking (MOT) scenarios in particular, CSTrack effectively reduces identity switches (IDSW), false negatives (FN), and false positives (FP), thereby improving MOTA. Experiments confirm that, with the standard FCTM pipeline otherwise unchanged, substituting CSTrack for JDE as the frontend feature extractor yields an average MOTA improvement of approximately 7.53% on the HiEve dataset, significantly enhancing tracking performance under compression and demonstrating CSTrack's feasibility and advantages as a feature-coding frontend. |
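As context for the metrics named above: MOTA aggregates the three error types the abstract cites (false negatives, false positives, and identity switches) against the total number of ground-truth objects, per the standard CLEAR MOT definition, so reducing any of the three directly raises the score:

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t}
```

Here $t$ indexes frames and $\mathrm{GT}_t$ is the number of ground-truth objects in frame $t$; MOTA can be negative when the error counts exceed the object count.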
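The split-inference structure described in the abstract (frontend feature extraction, feature compression, decoding, then the task backend) can be sketched minimally as below. This is an illustrative toy, not the actual FCTM or CSTrack API: the class and function names, the per-row-mean "feature", and the uniform scalar quantizer are all assumptions standing in for the real networks and codec.

```python
# Toy sketch of an FCTM-style split pipeline. All names here are
# hypothetical; a real deployment would wrap JDE or CSTrack as the
# frontend and use the FCTM codec instead of scalar quantization.
import numpy as np


def quantize(feat: np.ndarray, step: float = 0.5) -> np.ndarray:
    """Uniform scalar quantization standing in for feature encoding."""
    return np.round(feat / step).astype(np.int16)


def dequantize(q: np.ndarray, step: float = 0.5) -> np.ndarray:
    """Inverse of quantize(); stands in for feature decoding."""
    return q.astype(np.float32) * step


class Frontend:
    """Placeholder feature extractor (the part this study swaps out)."""

    def extract(self, frame: np.ndarray) -> np.ndarray:
        # Toy "feature": per-row means, just so the codec has a payload.
        return frame.mean(axis=1)


def run_pipeline(frontend: Frontend, frame: np.ndarray,
                 step: float = 0.5) -> np.ndarray:
    feat = frontend.extract(frame)       # frontend: feature extraction
    bitstream = quantize(feat, step)     # encoder: compress for transmission
    recon = dequantize(bitstream, step)  # decoder: reconstruct at the server
    return recon                         # the task backend would consume this


frame = np.arange(12, dtype=np.float32).reshape(3, 4)
recon = run_pipeline(Frontend(), frame)
```

The study's intervention corresponds to replacing `Frontend` (JDE) with a stronger extractor (CSTrack) while leaving `quantize`/`dequantize` (the FCTM codec) and the backend untouched; reconstruction error is bounded by half the quantization step, so more compressible features lose less task-relevant information.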