Action recognition is one of the more fundamental areas of computer vision research. Because it supports a wide range of downstream applications, it remains a technique that requires continual refinement. As deep learning develops, methods for image recognition keep improving, and these techniques can also be carried over to action recognition to increase its accuracy and robustness. This thesis therefore focuses on applying parts of several recently proposed methods to an existing base model and modifying its architecture to optimize it. We take the SlowFast network proposed by Facebook AI Research as the base model, draw on the way the Actor-Context-Actor Relation Network (ACAR-Net) processes high-level features, and propose the Informative Feature with Residual Self-Attention module (IFRSA). We then replace some of its convolution layers with the separable convolution introduced in MobileNet, yielding a lightweight version, Lightweight IFRSA (LIFRSA), and further replace the self-attention in IFRSA with a two-stage decoupling self-attention, yielding the Lightweight Informative Feature with Residual Decoupling Self-Attention (LIFRDeSA). The experimental results show that the proposed method improves the accuracy of the base model while also accounting for the computational resources it requires, giving a lightweight architecture with higher accuracy.

Action recognition aims to detect and classify the actions of one or more people in a video. Because it connects to many different fields and enables numerous applications, the accuracy of this fundamental task is important for the research built on top of it. In this thesis we therefore focus on improving the accuracy of previous work while also reducing its computational cost. Our base model is the SlowFast network, a former state of the art. Drawing on the high-level feature extraction concept of the Actor-Context-Actor Relation Network (ACAR-Net), we propose the Informative Feature with Residual Self-Attention module (IFRSA). Because its computational cost is very high, we first replace some of the convolutions in this module with the separable convolution introduced in MobileNet. Second, we substitute decoupling self-attention for the self-attention layer, and present the Lightweight Informative Feature with Residual Decoupling Self-Attention (LIFRDeSA). Experiments on the AVA dataset show that the LIFRDeSA module improves the baseline's accuracy while keeping the computational cost in check: the proposed model is more accurate than the baseline, and the added components are very lightweight.
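The abstract names two standard building blocks: the depthwise separable convolution from MobileNet and a self-attention layer wrapped in a residual connection. The sketch below is not the thesis code; the actual IFRSA/LIFRDeSA modules are more involved. It only illustrates, under the assumption of a PyTorch implementation, the commonly used form of these two generic components, and the class names SeparableConv2d and ResidualSelfAttention are illustrative labels of our own rather than names taken from the thesis.

    # Hypothetical PyTorch sketch of the two generic building blocks mentioned above.
    import torch
    import torch.nn as nn

    class SeparableConv2d(nn.Module):
        """MobileNet-style depthwise separable convolution: a per-channel
        (depthwise) convolution followed by a 1x1 pointwise convolution."""

        def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
            super().__init__()
            # groups=in_channels restricts each filter to a single input channel.
            self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                       padding=padding, groups=in_channels, bias=False)
            # The 1x1 convolution then mixes information across channels.
            self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)

        def forward(self, x):
            return self.pointwise(self.depthwise(x))

    class ResidualSelfAttention(nn.Module):
        """Generic self-attention layer with a residual (skip) connection."""

        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, x):                      # x: (batch, tokens, dim)
            attended, _ = self.attn(x, x, x)
            return self.norm(x + attended)         # residual connection

    if __name__ == "__main__":
        x = torch.randn(1, 64, 32, 32)             # dummy feature map
        standard = nn.Conv2d(64, 128, 3, padding=1, bias=False)
        separable = SeparableConv2d(64, 128)
        n_params = lambda m: sum(p.numel() for p in m.parameters())
        # Same output shape, far fewer parameters (~73.7k vs ~8.8k).
        print(standard(x).shape, separable(x).shape)
        print(n_params(standard), n_params(separable))

        tokens = torch.randn(1, 49, 128)           # e.g. a flattened 7x7 feature map
        print(ResidualSelfAttention(128)(tokens).shape)

Swapping a 3x3 convolution for its separable counterpart keeps the output shape but cuts the parameter and multiply-accumulate counts by roughly the kernel area, which is the kind of saving that motivates the lightweight LIFRSA and LIFRDeSA variants.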