Reliable 3D object detection from cameras alone remains challenging in autonomous driving, chiefly due to depth uncertainty, sensor-mounting deviations, and motion blur. In this work, we introduce ProjectionFormerBEV, a camera-only bird's-eye-view (BEV) representation framework. Our approach first lifts image features into pseudo-voxel features and fuses them through a Voxel Cross-Attention module, improving spatial alignment. We further introduce a self-attention mechanism across frames, allowing the model to adapt its predictions to scene changes and to mitigate the effects of motion blur. We evaluate the approach on the nuScenes-mini benchmark, where ProjectionFormerBEV achieves an NDS of 0.0517 and an mAP of 0.0071, corresponding to relative improvements of 11.7% and 74.6% over a comparable baseline. The model also remains stable when some frames are missing, the input freezes briefly, or camera positions shift slightly. Qualitative results further show that the predicted bounding boxes align more closely with the ground truth in both BEV and camera views. Taken together, these findings suggest that ProjectionFormerBEV offers a practical approach to camera-based 3D perception with potential for deployment in real driving scenarios.
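The two attention mechanisms named above can be illustrated with plain scaled dot-product attention: BEV cell queries attending over pseudo-voxel features (cross-attention), and per-frame features attending over each other (temporal self-attention). The NumPy sketch below is a minimal illustration under assumed shapes and names (`cross_attention`, a 4-query, 6-voxel toy example); it is not the thesis's actual architecture, which presumably uses learned projections and multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # queries: (Nq, d) BEV cell queries
    # keys/values: (Nk, d) pseudo-voxel features
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (Nq, Nk) similarity
    weights = softmax(scores, axis=-1)       # attention over voxels
    return weights @ values                  # fused BEV features, (Nq, d)

# toy example: 4 BEV queries attend over 6 pseudo-voxel features of dim 8
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
kv = rng.standard_normal((6, 8))
fused = cross_attention(q, kv, kv)           # (4, 8)

# temporal self-attention is the same operation applied across frames:
frames = rng.standard_normal((3, 8))         # features from 3 frames
temporal = cross_attention(frames, frames, frames)  # (3, 8)
```

In a full model the queries, keys, and values would each pass through learned linear projections before attention, and the fused BEV features would feed a detection head.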