Reliable 3D object detection from cameras alone remains challenging in autonomous driving, chiefly due to depth uncertainty, sensor-mounting deviations, and motion blur. In this work, we introduce ProjectionFormerBEV, a camera-only bird's-eye-view (BEV) representation framework. Our approach first lifts image features into pseudo-voxel features and fuses them through a Voxel Cross-Attention module, improving spatial alignment. We further introduce a self-attention mechanism across frames, allowing the model to adapt its predictions to scene changes and to mitigate the effects of motion blur. We evaluate the approach on the nuScenes-mini benchmark, where ProjectionFormerBEV achieves an NDS of 0.0517 and an mAP of 0.0071, corresponding to relative improvements of 11.7% and 74.6% over a comparable baseline. The model also remains stable when some frames are missing, the input freezes briefly, or camera positions shift slightly. Qualitative results further show that the predicted bounding boxes align more closely with the ground truth in both BEV and camera views. Taken together, these findings suggest that ProjectionFormerBEV offers a practical approach to camera-based 3D perception with potential for deployment in real driving scenarios.
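The two attention mechanisms named above can be illustrated with plain scaled dot-product attention: BEV cell queries attending over pseudo-voxel features (cross-attention), and per-frame features attending over each other (temporal self-attention). The NumPy sketch below is a minimal illustration under assumed shapes and names (`cross_attention`, a 4-query, 6-voxel toy example); it is not the thesis's actual architecture, which presumably uses learned projections and multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # queries: (Nq, d) BEV cell queries
    # keys/values: (Nk, d) pseudo-voxel features
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (Nq, Nk) similarity
    weights = softmax(scores, axis=-1)       # attention over voxels
    return weights @ values                  # fused BEV features, (Nq, d)

# toy example: 4 BEV queries attend over 6 pseudo-voxel features of dim 8
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
kv = rng.standard_normal((6, 8))
fused = cross_attention(q, kv, kv)           # (4, 8)

# temporal self-attention is the same operation applied across frames:
frames = rng.standard_normal((3, 8))         # features from 3 frames
temporal = cross_attention(frames, frames, frames)  # (3, 8)
```

In a full model the queries, keys, and values would each pass through learned linear projections before attention, and the fused BEV features would feed a detection head.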