Master's/Doctoral Thesis 111521136: Detailed Record




Name: Yu-Lun Hsieh (謝友倫)    Department: Department of Electrical Engineering
Thesis Title: Application of Deep Learning Based Dynamic Gesture Recognition to Contactless Ordering System
(基於深度學習之動態手勢辨識應用於非接觸式點餐系統)
Related Theses:
★ A Study on Independent Component Analysis for Separating Sound Signals in Real Environments
★ Segmentation and 3D Gray-Level Interpolation of Oral MRI Images
★ Design of a Digital Peak Expiratory Flow Monitoring System for Asthma
★ Effects of Combining a Cochlear Implant with a Hearing Aid on Mandarin Speech Recognition
★ Simulation of Mandarin Speech Recognition Performance with Advanced Combined Encoding Strategies for Cochlear Implants: An Analysis Combined with Hearing Aids
★ An fMRI Study of the Neural Correlates of Mandarin Speech Production
★ Constructing a 3D Biomechanical Tongue Model with the Finite Element Method
★ Construction of an MRI-Based 3D Tongue Atlas
★ A Simulation Study of the Relationship Between Calcium Oxalate Concentration Changes in Renal Tubules and Calcium Oxalate Stones
★ Automatic Segmentation of Tongue Structures in Oral MRI Images
★ A Study on the Electrical Matching of Microwave Output Windows
★ Development of a Software-Based Hearing Aid Simulation Platform: Noise Reduction
★ Development of a Software-Based Hearing Aid Simulation Platform: Feedback Cancellation
★ Simulating the Effects of Cochlear Implant Channel Number, Stimulation Rate, and Binaural Hearing on Mandarin Speech Recognition in Noisy Environments
★ Using Artificial Neural Networks to Study the Neural Correlates of Mandarin Tone Production
★ Construction of Computer-Simulated Physiological Systems for Teaching
Files: Full text viewable in the repository system after 2027-09-01.
Abstract (Chinese): In recent years, the COVID-19 pandemic has fundamentally transformed how people live and interact around the world, and the demand for contactless technologies has risen sharply in order to safeguard public health. Gesture recognition, an intuitive mode of human-computer interaction, has therefore become all the more important. If deep learning-based image recognition, paired with a readily available web camera, can recognize dynamic gestures with sufficient accuracy and efficiency, it can be used for contactless interaction, reducing physical contact and effectively lowering the risk of virus transmission.

This study uses an RGB web camera together with MediaPipe Hands for hand detection and static gesture recognition, and a Decouple + Recouple deep learning network to learn 27 defined dynamic gestures. Human action recognition datasets are applied to pre-train the model, and the effects of varying the dataset size and the model configuration are compared. Finally, a self-service ordering interface is built and integrated into a Real-Time scenario that simulates the actual ordering workflow, yielding a contactless system in which the control function mapped to each gesture can be customized.

Hand detection reaches an average detection confidence of 99%. Dynamic gesture recognition achieves an overall average accuracy above 95% and an average per-gesture F1-score of 95%, and attains an overall average accuracy above 93% even on a very small custom dataset. In Real-Time recognition, the system averages roughly 0.4 seconds per execution cycle with a gesture prediction time of 0.27 seconds and a correct recognition rate of 94.07%, demonstrating high stability, accuracy, and recognition speed that support the practicality and further development of dynamic gesture recognition in contactless applications.
Abstract (English): In recent years, the COVID-19 pandemic has fundamentally transformed people's lifestyles and interaction modes worldwide. To ensure public health safety, the demand for contactless technologies has surged. Gesture recognition, as an intuitive form of human-computer interaction, has become increasingly important. If deep learning-based image recognition, combined with readily available web cameras, can recognize dynamic gestures with a sufficient level of accuracy and efficiency, it can be applied to contactless interactions, thereby reducing contact and effectively lowering the risk of virus transmission.

This study uses RGB web cameras in combination with MediaPipe Hands for hand detection and static gesture recognition, and employs a Decouple + Recouple deep learning network to learn 27 defined dynamic gestures. Human action recognition datasets are used for model pre-training. By adjusting the dataset size and different model configurations, we compare their respective differences. Finally, we develop a self-service ordering interface and integrate it into a Real-Time scenario to simulate the actual ordering process, achieving a contactless system with customizable control functions corresponding to each gesture.
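
The capture-and-detect stage described above can be outlined in code. The following is a minimal sketch, assuming OpenCV webcam input and the published MediaPipe Hands Python API; the 16-frame clip length and the commented-out classifier call are illustrative assumptions, not the thesis's actual configuration.

```python
import collections

import cv2
import mediapipe as mp

# MediaPipe Hands in video mode: detects hands and returns 21 landmarks each.
hands = mp.solutions.hands.Hands(
    static_image_mode=False,
    max_num_hands=1,
    min_detection_confidence=0.5,
)

# Sliding window of recent frames to feed a dynamic-gesture classifier.
frame_buffer = collections.deque(maxlen=16)  # clip length is an assumption

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame_bgr = cap.read()
    if not ok:
        break
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = hands.process(frame_rgb)        # hand detection + landmarks
    if result.multi_hand_landmarks:
        frame_buffer.append(frame_rgb)       # buffer only frames with a hand
    if len(frame_buffer) == frame_buffer.maxlen:
        clip = list(frame_buffer)
        # gesture_id = gesture_model.predict(clip)  # hypothetical dynamic-gesture net
        frame_buffer.clear()
    cv2.imshow("hand detection", frame_bgr)
    if cv2.waitKey(1) & 0xFF == ord("q"):    # press q to quit
        break

cap.release()
hands.close()
cv2.destroyAllWindows()
```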

For hand detection, we achieve an average detection confidence of 99%. In terms of dynamic gesture recognition, we attain an overall average recognition accuracy exceeding 95%, and an average F1-score of 95% for individual gestures. On a very small custom dataset, we achieve an overall average accuracy exceeding 93%. In Real-Time recognition, the average execution time per operation is approximately 0.4 seconds, with a gesture prediction time of 0.27 seconds. The correct recognition rate stands at 94.07%, showcasing the system's high stability, accuracy, and excellent recognition speed, making dynamic gesture recognition highly practical and beneficial for contactless applications.
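
The accuracy and per-gesture F1-scores reported above are standard multi-class metrics. The short sketch below shows one way they could be computed with scikit-learn, using dummy label arrays and a hypothetical classifier stub to illustrate how a per-prediction latency figure can be timed.

```python
import time

from sklearn.metrics import accuracy_score, classification_report

# Dummy stand-ins for the test-set ground truth and model predictions.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 1]

print("overall accuracy:", accuracy_score(y_true, y_pred))
# Per-class precision, recall, and F1-score, plus macro/weighted averages.
print(classification_report(y_true, y_pred, digits=4))

def predict_fn(clip):
    """Hypothetical stand-in for the dynamic-gesture classifier."""
    return 0

# Timing a single prediction call, as one way to obtain a
# per-gesture prediction-time figure.
t0 = time.perf_counter()
gesture_id = predict_fn(None)
print(f"prediction time: {time.perf_counter() - t0:.4f} s")
```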
Keywords:
★ Deep Learning
★ Image Recognition
★ Dynamic Gestures
★ Real-Time
★ Contactless
Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
 1.1 Research Motivation
 1.2 Introduction to Gesture Recognition
  1.2.1 Static Gesture Recognition
  1.2.2 Dynamic Gesture Recognition
 1.3 Literature Review
  1.3.1 Development of Traditional Gesture Recognition Techniques
  1.3.2 Development of Deep Learning Neural Network Techniques
 1.4 Research Objectives
 1.5 Thesis Organization
Chapter 2: Deep Learning Neural Networks
 2.1 Convolutional Neural Network (CNN)
 2.2 Loss Function
 2.3 Multilayer Perceptron (MLP)
 2.4 Transformer
 2.5 MediaPipe
  2.5.1 MediaPipe Hands
 2.6 Decouple + Recouple Network Architecture
  2.6.1 Decoupled Spatial Representation Learning Network (DSN)
  2.6.2 Decoupled Temporal Representation Learning Network (DTN)
  2.6.3 Recoupling Spatiotemporal Representation
 2.7 Conclusion
Chapter 3: Research Methods and System Architecture
 3.1 Experimental Equipment
 3.2 Datasets
  3.2.1 NTU RGB+D 60
  3.2.2 Jester
  3.2.3 Data Preparation
 3.3 Transfer Learning
 3.4 System Architecture and Workflow
 3.5 Conclusion
Chapter 4: Results and Discussion
 4.1 Evaluation Metrics for Multi-Class Recognition
 4.2 Hand Detection Results and Evaluation
 4.3 Decouple + Recouple Gesture Recognition Results and Evaluation
  4.3.1 Model Training Results and Evaluation
  4.3.2 Model Testing Results and Evaluation
  4.3.3 Model Visualization Results
  4.3.4 Real-Time Recognition Results and Evaluation
 4.4 Contactless Ordering System
 4.5 Conclusion
Chapter 5: Conclusions and Future Work
 5.1 Conclusions
 5.2 Future Work
References
References
Ansari, M. A., & Singh, D. K. (2019). An approach for human machine interaction using dynamic hand gesture recognition. In 2019 IEEE Conference on Information and Communication Technology (pp. 1–6).
Biswas, K. K., & Basu, S. K. (2011). Gesture recognition using Microsoft Kinect®. In The 5th International Conference on Automation, Robotics and Applications (pp. 100–103).
Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6299–6308).
De Smedt, Q., Wannous, H., & Vandeborre, J.-P. (2016). Skeleton-based dynamic hand gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 1–9).
Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2625–2634).
Du, Y., Fu, Y., & Wang, L. (2015). Skeleton based action recognition with convolutional neural network. In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR) (pp. 579–583).
Kim, J.-H., Thang, N. D., & Kim, T.-S. (2009). 3-D hand motion tracking and gesture recognition using a data glove. In 2009 IEEE International Symposium on Industrial Electronics (pp. 1013–1018).
Kopuklu, O., Kose, N., Gunduz, A., & Rigoll, G. (2019). Resource efficient 3D convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
Li, G., Wu, H., Jiang, G., Xu, S., & Liu, H. (2018). Dynamic gesture recognition in the internet of things. IEEE Access, 7, 23713–23724.
Liu, H., Wang, Y., Zhou, A., He, H., Wang, W., Wang, K., … Ma, H. (2020). Real-time arm gesture recognition in smart home scenarios via millimeter wave sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(4), 1–28.
Lu, W., Tong, Z., & Chu, J. (2016). Dynamic hand gesture recognition with leap motion controller. IEEE Signal Processing Letters, 23(9), 1188–1192.
Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., … others (2019). MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.
Materzynska, J., Berger, G., Bax, I., & Memisevic, R. (2019). The Jester dataset: A large-scale video dataset of human gestures. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S., & Kautz, J. (2016). Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4207–4215).
Pan, S. J., & Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779–788).
Reifinger, S., Wallhoff, F., Ablassmeier, M., Poitschke, T., & Rigoll, G. (2007). Static and dynamic hand-gesture recognition for augmented reality applications. In Human-Computer Interaction. HCI Intelligent Multimodal Interaction Environments: 12th International Conference, HCI International 2007, Beijing, China, July 22-27, 2007, Proceedings, Part III 12 (pp. 728–737).
Roh, M.-C., & Lee, S.-W. (2015). Human gesture recognition using a simplified dynamic Bayesian network. Multimedia Systems, 21(6), 557–568.
Shahroudy, A., Liu, J., Ng, T.-T., & Wang, G. (2016). NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1010–1019).
Shan, C. (2010). Gesture control for consumer electronics. Multimedia Interaction and Intelligent User Interfaces: Principles, Methods and Applications, 107–128.
Shin, S., & Kim, W.-Y. (2020). Skeleton-based dynamic hand gesture recognition using a part-based GRU-RNN for gesture-based interface. IEEE Access, 8, 50236–50243.
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems, 27.
Singh, A. K., Kumbhare, V. A., & Arthi, K. (2021). Real-time human pose detection and recognition using MediaPipe. In International Conference on Soft Computing and Signal Processing (pp. 145–154).
Sohn, M.-K., Lee, S.-H., Kim, D.-J., Kim, B., & Kim, H. (2012). A comparison of 3D hand gesture recognition using dynamic time warping. In Proceedings of the 27th Conference on Image and Vision Computing New Zealand (pp. 418–422).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2818–2826).
Thaman, B., Cao, T., & Caporusso, N. (2022). Face mask detection using MediaPipe FaceMesh. In 2022 45th Jubilee International Convention on Information, Communication and Electronic Technology (MIPRO) (pp. 378–382).
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4489–4497).
Truong, V. N., Yang, C.-K., & Tran, Q.-V. (2016). A translator for American Sign Language to text and speech. In 2016 IEEE 5th Global Conference on Consumer Electronics (pp. 1–2).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wang, T., Qian, X., He, F., Hu, X., Cao, Y., & Ramani, K. (2021). GesturAR: An authoring system for creating freehand interactive augmented reality applications. In The 34th Annual ACM Symposium on User Interface Software and Technology (pp. 552–567).
Wu, J., Sun, L., & Jafari, R. (2016). A wearable system for recognizing American Sign Language in real-time using IMU and surface EMG sensors. IEEE Journal of Biomedical and Health Informatics, 20(5), 1281–1290.
Xu, D., Chen, Y.-L., Lin, C., Kong, X., & Wu, X. (2012). Real-time dynamic gesture recognition system based on depth perception for robot navigation. In 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO) (pp. 689–694).
Yang, L., Huang, J., Feng, T., Hong-An, W., & Guo-Zhong, D. (2019). Gesture interaction in virtual reality. Virtual Reality & Intelligent Hardware, 1(1), 84–112.
Yang, Z., Li, Y., Chen, W., & Zheng, Y. (2012). Dynamic hand gesture recognition using hidden Markov models. In 2012 7th International Conference on Computer Science & Education (ICCSE) (pp. 360–365).
Zhang, F., Bazarevsky, V., Vakunov, A., Tkachenka, A., Sung, G., Chang, C.-L., & Grundmann, M. (2020). MediaPipe Hands: On-device real-time hand tracking. arXiv preprint arXiv:2006.10214.
Zhang, W., Wang, J., & Lan, F. (2020). Dynamic hand gesture recognition based on short-term sampling neural networks. IEEE/CAA Journal of Automatica Sinica, 8(1), 110–120.
Zhang, X., Chen, X., Li, Y., Lantz, V., Wang, K., & Yang, J. (2011). A framework for hand gesture recognition based on accelerometer and EMG sensors. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 41(6), 1064–1076.
Zhang, X., & Wu, X. (2019). Robotic control of dynamic and static gesture recognition. In 2019 2nd World Conference on Mechanical Engineering and Intelligent Manufacturing (WCMEIM) (pp. 474–478).
Zhou, B., Wang, P., Wan, J., Liang, Y., Wang, F., Zhang, D., … Jin, R. (2022). Decoupling and recoupling spatiotemporal representation for RGB-D-based motion recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 20154–20163).
Advisor: Chao-Min Wu (吳炤民)    Date of Approval: 2024-07-25