Compared with other fields such as object detection and object tracking, Lifelong Learning is a relatively less-explored and emerging research area. Its goal is to give neural networks the ability to learn continually, as humans do, and to exploit previously acquired knowledge so that future tasks can be learned more easily and with better performance.

Lifelong Learning can be divided into single-head and multi-head settings, which differ in whether the task identity is available at test time. In single-head Lifelong Learning the task identity is not given, so the classifier must be designed to avoid the classification bias caused by imbalanced data. In multi-head Lifelong Learning the task identity is available at test time, so there is no task-classification bias to handle; research in this subfield therefore focuses on optimizing the model so that it occupies as little memory as possible while achieving the best performance.

This thesis focuses on single-head Lifelong Learning and adopts a knowledge distillation strategy to avoid catastrophic forgetting. Previous distillation methods mainly distill the distribution of the output layer, or use the Euclidean distance between intermediate-layer features as a training loss. We improve the Euclidean-distance approach: the intermediate features are passed through an average pooling layer to avoid the training difficulties caused by overly complex features, and a fully connected layer is added as an output layer, forming a branch network. The outputs of this branch network are used together with the outputs of the main network to transfer information about features at different scales, thereby achieving better performance.
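To make the branch distillation idea concrete, the following is a minimal PyTorch sketch, assuming the branch head consists of global average pooling followed by a single fully connected layer attached to an intermediate feature map, and that both the frozen old model (teacher) and the current model (student) expose such features. The names `BranchHead`, `distillation_loss`, and `total_distill_loss`, as well as the temperature value, are illustrative assumptions rather than the exact implementation in this thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BranchHead(nn.Module):
    """Hypothetical branch head: global average pooling followed by a
    fully connected output layer, attached to an intermediate feature map."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse spatial dimensions
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.pool(feat).flatten(1)        # (N, C, H, W) -> (N, C)
        return self.fc(x)                     # branch logits


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    """Soft-target KL distillation between two logit tensors."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T


def total_distill_loss(student_out: torch.Tensor,
                       teacher_out: torch.Tensor,
                       branch_student: torch.Tensor,
                       branch_teacher: torch.Tensor) -> torch.Tensor:
    """Combine distillation on the main-network outputs with distillation
    on the branch-network outputs so that features at different scales
    contribute to knowledge transfer."""
    loss_main = distillation_loss(student_out, teacher_out)
    loss_branch = distillation_loss(branch_student, branch_teacher)
    return loss_main + loss_branch
```

In this sketch, averaging over the spatial dimensions reduces each feature map to a single vector, lowering the feature complexity before distillation, and matching the soft branch outputs rather than the raw activations avoids forcing an exact Euclidean match between high-dimensional features.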