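Abstract: | Real-time, accurate hand gesture recognition (HGR) is an important and convenient human-machine interface. In this study, we implement two static loose hand gesture recognition (LHGR) systems, one based on geometric relational features of the hand and the other on deep learning. A loose gesture is one in which the degree of finger bending, the palm orientation, and the wrist bending angle are allowed to vary widely. The LHGR system based on geometric relational features captures depth images with a depth camera and uses both color and depth data to make recognition more accurate. Gesture recognition is usually divided into three stages: hand detection, feature extraction, and gesture classification. Our methods improve the performance of all three stages. In the hand detection stage, we propose a dynamic ROI estimation method and a wrist-cutting method that conform to the characteristics of the human hand. In the feature extraction stage, we use local features, global features, and depth coding to construct more reliable geometric relational features. In the gesture classification stage, we use a three-layer classifier, comprising finger counting, finger name matching, and code comparison, to distinguish 16 gestures. Finally, the result is adjusted by an adaptive decision algorithm, which makes the recognition output more stable. Traditional HGR methods often rely on complex and strict decision conditions to obtain better results; our method instead uses looser criteria to judge the relative geometric relationships between each finger and the palm and then classifies the gesture according to those relationships, so it tolerates loose gestures better. Convolutional neural networks (CNNs) can extract gesture features that adapt to many kinds of variation and, given sufficient samples, can overcome factors such as lighting and shadow changes, blur and noise, and hand rotation. Our deep-learning-based LHGR system uses two independent input branches that read the color image and the depth image separately. The two branches first learn low-level color and depth features individually and are then merged to learn high-level RGBD features; this suppresses the effect of imprecise pixel alignment between the color and depth images and also reduces the number of model parameters. In addition, we use multi-resolution features in the final gesture classification, which gives stronger recognition of smaller, more distant, and blurrier gestures. In the training stage, we train our CNN model on a loose-gesture dataset containing many kinds of variation, so that the CNN can recognize loose gestures. In the experiments, we compare the results of several CNN architectures; our proposed model achieves the highest mAP, 0.997333. Besides combining color and depth images effectively and efficiently, our method also recognizes lower-quality images better (even though the training data lacks such images), still reaching an mAP of 0.662222 on the 10×10 image dataset. As described above, the proposed method is robust to gesture scaling and rotation and accepts lower-resolution images as input, so the proposed CNN model is well suited to our LHGR system. ;A fast and accurate hand gesture recognition (HGR) system is an important and convenient means of human-computer interaction (HCI). In this paper, we propose two loose hand gesture recognition (LHGR) systems, one using a cascade classifier with geometric relational features and the other using a multi-resolution convolutional neural network. "Loose" means that the system tolerates wide variation in the bending of the fingers, the direction of the palm, and the bending angle of the wrist. The LHGR system based on geometric relational features uses a depth camera; it maintains high accuracy in real-time processing while allowing users to pose loose gestures. 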
An HGR system usually operates in three stages: hand detection, feature extraction, and gesture classification. The proposed method improves all three stages. In the hand detection stage, we propose a dynamic ROI estimation method and a wrist-cutting method that conform to the characteristics of the human hand. In the feature extraction stage, we construct more reliable geometric relational features from local features, global features, and depth coding. In the gesture classification stage, we use a three-layer classifier consisting of finger counting, finger name matching, and code comparison to classify 16 hand gestures. Finally, the output is adjusted by an adaptive decision algorithm. A convolutional neural network (CNN) can extract gesture features that adapt to many kinds of variation; given sufficient training samples, it can overcome lighting and shadow changes, blur and noise, hand rotation, and other factors. The proposed deep-learning-based LHGR system has two input paths, one for color images and one for depth maps. The two paths first learn low-level features separately; these are then concatenated to learn high-level RGBD features. This design suppresses the effect of inaccurate pixel alignment between the color and depth images and also reduces the number of model parameters. In addition, we use multi-resolution features to classify the hand gestures; the proposed model is therefore better at recognizing smaller, more distant, and blurrier gestures. In the training stage, we train the proposed CNN model on a dataset containing many variations of loose hand gestures so that the CNN learns to classify loose gestures. In the experiments, we compare the proposed CNN model with several other CNN architectures; the proposed model achieves the highest mAP, 0.997333. 
The proposed method not only combines color and depth images effectively and efficiently, but also achieves better accuracy on lower-quality images (even though the training dataset lacks such images): the mAP is still 0.662222 on the 10×10 image dataset. As described above, the proposed method is robust to gesture scaling and rotation and also accepts lower-resolution images as input. Therefore, the proposed CNN model is well suited to the LHGR system. |
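
The three-layer gesture classifier named in the abstract (finger counting, then finger name matching, then code comparison) can be pictured as a cascade that narrows a candidate set at each layer. The Python sketch below is illustrative only and is not the authors' implementation: the `GestureTemplate` type, the `TEMPLATES` table, the gesture labels, and the one-character depth codes are all hypothetical, and the abstract does not specify how fingers are detected or how depth codes are built.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class GestureTemplate:
    """Hypothetical gesture description used only for this sketch."""
    label: str
    fingers: frozenset   # names of the extended fingers
    depth_code: str      # placeholder for a depth-coding result


# Hypothetical template table; the real system distinguishes 16 gestures.
TEMPLATES = [
    GestureTemplate("fist", frozenset(), "0"),
    GestureTemplate("point", frozenset({"index"}), "0"),
    GestureTemplate("victory", frozenset({"index", "middle"}), "0"),
    GestureTemplate("crossed", frozenset({"index", "middle"}), "1"),
    GestureTemplate(
        "open_palm",
        frozenset({"thumb", "index", "middle", "ring", "little"}),
        "0",
    ),
]


def classify(fingers: Iterable[str], depth_code: str) -> Optional[str]:
    """Return a gesture label, or None if no template matches."""
    observed = frozenset(fingers)

    # Layer 1: finger counting keeps templates with the same finger count.
    candidates = [t for t in TEMPLATES if len(t.fingers) == len(observed)]

    # Layer 2: finger name matching keeps templates with the same fingers.
    candidates = [t for t in candidates if t.fingers == observed]
    if len(candidates) == 1:
        return candidates[0].label

    # Layer 3: code comparison separates gestures that share the same fingers.
    for template in candidates:
        if template.depth_code == depth_code:
            return template.label
    return None
```

For example, `classify({"index", "middle"}, "1")` passes the first two layers with two candidates and is resolved only by the code comparison. Ordering the layers from cheapest to most specific is one plausible reading of why a cascade suits real-time use.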