Abstract: | In recent years, deep learning algorithms, accelerated by GPUs and other convolution-acceleration hardware, have brought significant improvements to deep neural networks across a wide range of tasks. In basic image preprocessing, image segmentation, face recognition, speech recognition, and other areas, they are gradually replacing traditional algorithms, which shows that the rise of neural networks is driving advances throughout artificial intelligence. In 3D skeleton estimation, traditional approaches either attach sensors to the body or predict joints with random forests; the drawbacks are the need for additional equipment or the insufficient accuracy of random forests. Deep learning methods can instead predict the skeleton from an RGB or RGB-D camera without any wearable device, which has prompted considerable recent research on improving model accuracy. This work takes RGB images as input and proposes a multi-task learning approach based on 2D/3D heatmaps to train a single-stage 3D hand skeleton prediction network. A single backbone network outputs the 2D and 3D heatmaps simultaneously, and sharing the convolutional weights avoids repeated computation across tasks. We argue that the joints of a single finger are spatially continuous. Therefore, instead of the usual practice of predicting one joint per heatmap, we predict five joints in one heatmap (that is, the joints of the same finger are predicted in the same heatmap) and use these heatmaps as features to predict separate 3D heatmaps for the left and right hands. The location of the maximum value in each 3D heatmap gives the (x, y, z) coordinates of the target joint. Because large hand datasets are mostly collected in laboratories, we also propose a hand segmentation technique that improves on the basic encoder-decoder architecture. The hands are segmented out of the dataset images and composited onto a variety of landscape photographs, in order to train a network that generalizes better and is not restricted to the dataset backgrounds. |
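The coordinate read-out described in the abstract (taking the location of the maximum of a 3D heatmap as the joint position) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `heatmap_to_xyz`, the use of NumPy, and the (depth, height, width) array layout are all assumptions made for this example.

```python
import numpy as np


def heatmap_to_xyz(heatmap):
    """Return the (x, y, z) voxel indices of the maximum of a 3D heatmap.

    Assumes the heatmap is laid out as (depth, height, width), i.e. (z, y, x).
    This layout is an assumption for illustration only.
    """
    flat_index = np.argmax(heatmap)  # index of the peak in the flattened array
    z, y, x = np.unravel_index(flat_index, heatmap.shape)
    return int(x), int(y), int(z)


# Toy 3D heatmap with a single peak placed at z=3, y=5, x=7.
heatmap = np.zeros((8, 16, 16), dtype=np.float32)
heatmap[3, 5, 7] = 1.0

xyz = heatmap_to_xyz(heatmap)
print(xyz)
```

The indices returned here are in heatmap (voxel) units; mapping them back to image or camera coordinates depends on the heatmap resolution and depth range, which the abstract does not specify.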