Asynchronous Video Interviews (AVIs) present difficulties in detecting brief, context-dependent impression management (IM) behaviors. We propose SAMFNet, a Segment-Aware Multi-Modal Fusion Network that leverages Multiple Instance Learning (MIL) to identify localized IM behaviors without requiring strict temporal alignment. SAMFNet integrates text, audio, facial expressions, heart rate variability (HRV), and eye-movement cues, enabling robust fusion of behavioral signals. The model is trained on a real-world dataset of 121 applicants collected from an operational AVI platform, and we employ leave-one-out cross-validation (LOOCV) for reliable evaluation under limited-data conditions. SAMFNet achieves accuracies of 92% (honest self-promotion), 82% (honest defensive), 94% (deceptive image creation), and 84% (deceptive image protection) across the four IM categories. Compared to HireNet's 74.9% accuracy on overall candidate-adequacy prediction, SAMFNet demonstrates superior performance in fine-grained IM detection. The framework is non-invasive, scalable, and suitable for practical AVI applications.
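To make the segment-aware MIL formulation concrete, the sketch below treats each interview as a bag of per-segment multimodal embeddings and uses gated attention pooling to aggregate them into interview-level IM predictions, so only a few segments need to carry the evidence for a label. This is a minimal illustrative sketch, not SAMFNet's published implementation: the embedding dimensions, module names, per-segment summation fusion, and the gated-attention pooling variant (in the style of Ilse et al., 2018) are all assumptions.

```python
# Illustrative sketch of segment-aware MIL fusion. Dimensions, names, and the
# pooling variant are assumptions, not SAMFNet's actual architecture.
import torch
import torch.nn as nn

class SegmentMILFusion(nn.Module):
    def __init__(self, modality_dims=(768, 128, 64, 16, 32), hidden=256):
        super().__init__()
        # One projection per modality: text, audio, face, HRV, eye movement.
        self.projs = nn.ModuleList(nn.Linear(d, hidden) for d in modality_dims)
        # Gated attention pooling over segments.
        self.attn_V = nn.Linear(hidden, 128)
        self.attn_U = nn.Linear(hidden, 128)
        self.attn_w = nn.Linear(128, 1)
        # One logit per IM category (four categories in the paper).
        self.head = nn.Linear(hidden, 4)

    def forward(self, segments):
        # segments: list of per-modality tensors, each (num_segments, dim_m).
        # Each modality is already pooled to one vector per segment, so only
        # coarse segment boundaries are shared; no frame-level alignment
        # across modalities is required.
        fused = torch.stack([p(x) for p, x in zip(self.projs, segments)]).sum(0)
        # Attention weights select the segments that carry IM evidence.
        a = self.attn_w(torch.tanh(self.attn_V(fused)) *
                        torch.sigmoid(self.attn_U(fused)))
        weights = torch.softmax(a, dim=0)       # (num_segments, 1)
        bag = (weights * fused).sum(dim=0)      # interview-level embedding
        return self.head(bag)                   # logits for the 4 IM labels

# Usage: one interview as a bag of 12 segments across five modalities.
model = SegmentMILFusion()
bag = [torch.randn(12, d) for d in (768, 128, 64, 16, 32)]
logits = model(bag)  # shape (4,), one logit per IM category
```

With MIL, each of the 121 interviews is a single training bag, which is consistent with the abstract's use of LOOCV: one applicant's bag is held out per fold while the remaining 120 are used for training.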