Abstract: | Over the past several years, numerous computer vision applications have made remarkable progress, particularly in human activity analysis. Human action recognition, which aims to automatically examine and recognize the actions taking place in a video, has been widely used in many applications. This thesis presents a comprehensive survey of deep learning-based approaches and techniques for human action recognition, focusing on three main learning strategies: supervised learning, self-supervised learning, and semi-supervised learning. For each learning strategy, we introduce efficient approaches to knowledge distillation-based action recognition and improvements to knowledge distillation itself. Specifically, for supervised learning, we propose a lightweight network architecture, (2+1)D ShuffleNet. We also introduce two self-knowledge distillation-based approaches that improve the generalization and performance of the student network without requiring a large and expensive teacher network. For self-supervised learning, we present a novel pretext task for action recognition. Finally, for semi-supervised learning, we propose an efficient approach based on mutual learning. Experimental results show that these approaches not only achieve state-of-the-art performance but also improve on several other metrics, including model size, computational cost, training time, and running time. |