Facial expression recognition has become an important research direction in computer vision and pattern recognition, playing a key role in applications such as human-computer interaction, affective computing, mental health assessment, and intelligent surveillance. Nevertheless, the task faces significant challenges, including inter-class similarity, intra-class variation, and class imbalance; these challenges are especially severe in unconstrained, in-the-wild environments, where they degrade the accuracy and stability of recognition systems. To address these problems, we propose a novel multi-scale cross-modal fusion approach that integrates image features with facial landmark information for expression recognition. In addition, we employ targeted data augmentation and a specialized loss function to mitigate class imbalance. Experimental results show that our approach achieves significant improvements on the widely used AffectNet and RAF-DB benchmark datasets, surpassing existing state-of-the-art facial expression recognition models.
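
The abstract does not specify the form of the "specialized loss function" used against class imbalance. As a minimal, hedged sketch of one common choice in this setting, the following PyTorch snippet implements a class-weighted focal loss; the class name ClassBalancedFocalLoss, the gamma value, and the example weights are illustrative assumptions, not the authors' actual formulation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ClassBalancedFocalLoss(nn.Module):
        # Hypothetical illustration: combines per-class re-weighting with the
        # focal term (1 - p_t)^gamma so that rare classes and hard examples
        # contribute more to the gradient. Not the paper's confirmed loss.
        def __init__(self, class_weights, gamma=2.0):
            super().__init__()
            # class_weights: tensor of shape (num_classes,), e.g. derived
            # from inverse class frequencies in the training set.
            self.register_buffer("class_weights", class_weights)
            self.gamma = gamma

        def forward(self, logits, targets):
            # logits: (batch, num_classes); targets: (batch,) integer labels.
            log_probs = F.log_softmax(logits, dim=-1)
            probs = log_probs.exp()
            # Probability and log-probability assigned to the true class.
            pt = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
            log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
            w = self.class_weights[targets]
            # Down-weight easy examples, up-weight under-represented classes.
            loss = -w * (1.0 - pt) ** self.gamma * log_pt
            return loss.mean()

    # Example usage with an 8-class expression label set (AffectNet-style);
    # the weight values below are placeholders.
    weights = torch.tensor([0.5, 1.0, 1.5, 2.0, 2.0, 3.0, 1.0, 2.5])
    criterion = ClassBalancedFocalLoss(weights)
    logits = torch.randn(4, 8)
    targets = torch.tensor([0, 3, 5, 7])
    print(criterion(logits, targets))

Re-weighting of this kind directly targets the class-imbalance issue the abstract highlights: majority expressions (e.g., neutral, happy) dominate in-the-wild datasets, so an unweighted cross-entropy objective tends to under-fit minority expressions such as fear or disgust.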