Thesis 110522093: Detailed Record




Name 蔡允齊 (Yun-Chi Tsai)   Department Computer Science and Information Engineering
Thesis Title 擷取有效畫面域與時間域資訊進行深度學習手語辨識
(Enhancing Deep-Learning Sign Language Recognition through Effective Spatial and Temporal Information Extraction)
Related Theses
★ Implementation of a Cross-Platform Wireless Heart-Rate Analysis System Based on Qt
★ A Mechanism for Transmitting Additional Information over VoIP
★ Detection of Transition Effects Related to Highlight Scenes in Sports Broadcasts
★ Video/Image Content Authentication Based on Vector Quantization
★ A Baseball Highlight Extraction System Based on Transition-Effect Detection and Content Analysis
★ Image/Video Content Authentication Based on Visual Feature Extraction
★ Foreground Object Detection and Tracking in Moving Surveillance Videos Using Dynamic Background Compensation
★ Adaptive Digital Watermarking for H.264/AVC Video Content Authentication
★ A Baseball Highlight Extraction and Classification System
★ A Real-Time Multi-Camera Tracking System Using H.264/AVC Features
★ Preceding Vehicle Detection on Highways Using Implicit Shape Models
★ Video Copy Detection Based on Temporal and Spatial Feature Extraction
★ In-Vehicle Video Coding Combining Digital Watermarking and Region-of-Interest Bit-Rate Control
★ H.264/AVC Video Encryption/Decryption and Digital Watermarking for Digital Rights Management
★ A News Video Analysis System Based on Text and Anchorperson Detection
★ H.264/AVC Video Content Authentication Based on Digital Watermarking
Files  Browse the thesis in the system (available after 2025-7-25)
Abstract (Chinese) Automatic sign language recognition based on deep learning requires a large amount of video data for model training, yet producing and collecting sign language videos is time-consuming and tedious, and small or insufficiently diverse datasets limit the accuracy of recognition models. This study proposes effective spatial and temporal data extraction methods for sign language recognition, aiming to expand the limited sign language video data into a larger and more diverse training set through reasonable augmentation. These data, serving as inputs to deep learning networks, can be paired with relatively simple architectures such as 3D-ResNet, so that considerable recognition performance can be achieved without resorting to complex or resource-intensive network architectures. Our spatial data extraction uses skeletons obtained with MediaPipe, hand-region shapes or masks, and motion optical flow; these three kinds of data can serve as the three-channel input commonly adopted by earlier 3D-ResNet models, but unlike conventional RGB input, each of them has distinct characteristics that make feature extraction more effective. Temporal data extraction selects more meaningful frames by computing and determining keyframes, thereby realizing different frame selection strategies. The proposed spatial and temporal data can be further augmented to simulate various hand sizes, gesture speeds, shooting angles, and so on, which greatly helps expand the dataset and increase its diversity. Experimental results show that our method yields a significant improvement in recognition accuracy on commonly used American Sign Language datasets.
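As a rough illustration of the spatial extraction described above, the sketch below builds a three-channel frame from a MediaPipe Holistic skeleton rendering, a hand-region mask, and dense optical-flow magnitude. It assumes OpenCV and MediaPipe's Holistic solution are available; the function names, rendering choices, and flow parameters are illustrative assumptions rather than the thesis's actual implementation.

```python
# Minimal sketch: per-frame three-channel input (skeleton / hand mask / flow magnitude).
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic
mp_drawing = mp.solutions.drawing_utils

def hand_mask(landmarks, shape):
    """Fill the convex hull of one hand's landmarks as a binary mask."""
    h, w = shape
    mask = np.zeros((h, w), dtype=np.uint8)
    if landmarks is not None:
        pts = np.array([[int(p.x * w), int(p.y * h)] for p in landmarks.landmark],
                       dtype=np.int32)
        cv2.fillConvexPoly(mask, cv2.convexHull(pts), 255)
    return mask

def build_three_channel_frame(prev_gray, frame_bgr, holistic):
    """Return (HxWx3 array [skeleton, hand mask, flow magnitude], current gray frame)."""
    h, w = frame_bgr.shape[:2]
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))

    # Channel 1: pose and hand skeletons drawn on a black canvas.
    skel = np.zeros((h, w, 3), dtype=np.uint8)
    for lms, conns in [(results.pose_landmarks, mp_holistic.POSE_CONNECTIONS),
                       (results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS),
                       (results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS)]:
        if lms is not None:
            mp_drawing.draw_landmarks(skel, lms, conns)
    skel = cv2.cvtColor(skel, cv2.COLOR_BGR2GRAY)

    # Channel 2: union of both hand-region masks.
    hands = cv2.bitwise_or(hand_mask(results.left_hand_landmarks, (h, w)),
                           hand_mask(results.right_hand_landmarks, (h, w)))

    # Channel 3: dense optical-flow magnitude between consecutive frames.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    mag = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    return np.dstack([skel, hands, mag]), gray

# Usage (per clip): holistic = mp_holistic.Holistic(static_image_mode=False),
# then call build_three_channel_frame on consecutive frames, carrying `gray` forward.
```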
Abstract (English) Automatic sign language recognition based on deep learning requires a large amount of video data for model training. However, creating and collecting sign language videos is time-consuming and tedious, and limited or insufficiently diverse datasets restrict the accuracy of sign language recognition models. In this study, we propose effective spatial and temporal data extraction methods for sign language recognition. The goal is to augment the limited sign language video data into a larger and more diverse training dataset. The augmented data, used as inputs to deep learning networks, can be paired with simpler architectures such as 3D-ResNet, achieving considerable sign language recognition performance without the need for complex or resource-intensive network structures.
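For concreteness, here is a minimal sketch of pairing such clips with a simple 3D-ResNet, using torchvision's r3d_18 as a stand-in for the 3D-ResNet mentioned above; the class count and clip shape are illustrative assumptions.

```python
# Hedged sketch: a small 3D-ResNet consuming (batch, channels, frames, height, width) clips.
import torch
import torchvision

num_classes = 100                        # e.g., a 100-gloss ASL subset (assumption)
model = torchvision.models.video.r3d_18()            # randomly initialized backbone
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

# The three channels would carry skeleton / hand-mask / optical-flow maps instead of RGB.
clip = torch.randn(2, 3, 16, 112, 112)
logits = model(clip)                     # -> shape (2, num_classes)
```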
Our spatial data extraction employs three types of data: skeletons obtained with MediaPipe, hand-region shapes or masks, and optical flow. These three data types can serve as the three-channel input commonly adopted by earlier 3D-ResNet models; unlike conventional RGB input, however, each of them carries distinct characteristics that make feature extraction more effective. For temporal data extraction, we determine keyframes to capture more meaningful visual information, thereby enabling different frame selection strategies.
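One plausible keyframe-selection criterion, scoring frames by mean optical-flow magnitude and keeping the k highest-motion frames in temporal order, might look like the following sketch; the scoring function and k are illustrative assumptions, not necessarily the strategy used in the thesis.

```python
# Hedged sketch of motion-based keyframe selection.
import cv2
import numpy as np

def select_keyframes(frames_bgr, k=16):
    """frames_bgr: list of HxWx3 BGR frames. Returns indices of up to k keyframes."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames_bgr]
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(grays[:-1], grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        scores.append(float(mag.mean()))
    top = np.argsort(scores)[-k:]          # k highest-motion frames
    return sorted(int(i) for i in top)     # keep temporal order
```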
The proposed spatial and temporal data extraction methods facilitate data augmentation, which simulates various hand sizes, gesture speeds, shooting angles, etc. The strategy significantly contributes to expanding the dataset and increasing its diversity. Experimental results demonstrate that our approach significantly improves the recognition accuracy for commonly used American Sign Language datasets.
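A minimal sketch of the kinds of augmentation mentioned above, rotation and scaling to mimic shooting angle and hand size plus temporal resampling to mimic gesture speed, assuming OpenCV; the parameter values and function names are illustrative assumptions.

```python
# Hedged sketch of spatial and temporal clip augmentation.
import cv2
import numpy as np

def augment_clip(frames, scale=1.1, angle=5.0, speed=1.25):
    """frames: list of HxW(xC) arrays. Returns a rotated, scaled, resampled frame list."""
    h, w = frames[0].shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)  # rotate + scale
    out = [cv2.warpAffine(f, M, (w, h)) for f in frames]

    # Temporal resampling: index the clip at a different rate to mimic
    # faster (speed > 1) or slower (speed < 1) signing.
    idx = np.clip(np.arange(0, len(out), speed).astype(int), 0, len(out) - 1)
    return [out[i] for i in idx]
```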
Keywords (Chinese) ★ Sign language recognition (手語辨識)
★ Keyframe (關鍵幀)
★ Deep learning (深度學習)
Keywords (English)
Table of Contents Abstract (Chinese) i
Abstract ii
Table of Contents iv
List of Figures vii
List of Tables ix
1. Introduction 1
1-1 Research Background and Motivation 1
1-2 Research Contributions 2
1-3 Thesis Organization 3
2. Related Work 4
2-1 Sign Language Recognition 4
2-2 American Sign Language Datasets 4
2-3 Related Methods 5
2-3-1 Traditional Methods 5
2-3-2 Deep Learning 6
3. Method 13
3-1 Preprocessing 13
3-1-1 RGB to LSO 13
3-1-2 MediaPipe Error Detection 13
3-1-3 Facial Features (Face Landmarks) 15
3-2 Model Architecture 19
3-3 Data Augmentation 20
3-3-1 Spatial Domain 20
3-3-2 Temporal Domain 21
4. Experimental Results 29
4-1 Experimental Environment 29
4-2 Experimental Results 29
4-2-1 Baseline 29
4-2-2 Dataset Reorganization 30
4-2-3 Face Landmarks 31
4-2-4 Temporal Domain Tests 31
5. Conclusions and Future Work 36
5-1 Conclusions 36
5-2 Future Work 36
6. References 37
References [1] Y.-J. Chen, "Suitable Data Input for Deep-Learning-Based Sign Language Recognition with a Small Training Dataset," National Central University, CSIE, 2022. [Online]. Available: https://hdl.handle.net/11296/4ybeup
[2] D. Li, C. Rodriguez, X. Yu, and H. Li, "Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison," in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 1459-1469.
[3] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," Advances in neural information processing systems, vol. 27, 2014.
[4] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.
[5] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: a large video database for human motion recognition," in 2011 International Conference on Computer Vision (ICCV), 2011, pp. 2556-2563.
[6] H. Luqman, "An Efficient Two-Stream Network for Isolated Sign Language Recognition Using Accumulative Video Motion," IEEE Access, vol. 10, pp. 93785-93798, 2022.
[7] A. A. I. Sidig, H. Luqman, S. Mahmoud, and M. Mohandes, "KArSL: Arabic Sign Language Database," ACM Trans. Asian Low-Resour. Lang. Inf. Process., vol. 20, no. 1, p. Article 14, 2021, doi: 10.1145/3423420.
[8] J. Donahue et al., "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2625-2634.
[9] L. Hu, L. Gao, and W. Feng, "Self-Emphasizing Network for Continuous Sign Language Recognition," arXiv preprint arXiv:2211.17081, 2022.
[10] J. Wang et al., "Deep high-resolution representation learning for visual recognition," IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 10, pp. 3349-3364, 2020.
[11] N. C. Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden, "Neural sign language translation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7784-7793.
[12] H. Zhou, W. Zhou, W. Qi, J. Pu, and H. Li, "Improving sign language translation with monolingual data by sign back-translation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1316-1325.
[13] K. Hara, H. Kataoka, and Y. Satoh, "Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6546-6555.
[14] L. Smaira, J. Carreira, E. Noland, E. Clancy, A. Wu, and A. Zisserman, "A short note on the kinetics-700-2020 human action dataset," arXiv preprint arXiv:2010.10864, 2020.
[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255.
[16] W. Du, Y. Wang, and Y. Qiao, "Rpan: An end-to-end recurrent pose-attention network for action recognition in videos," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 3725-3734.
[17] M. Boháček and M. Hrúz, "Sign pose-based transformer for word-level sign language recognition," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 182-191.
[18] Z. Zhou, V. W. Tam, and E. Y. Lam, "SIGNBERT: a Bert-based deep learning framework for continuous sign language recognition," IEEE Access, vol. 9, pp. 161669-161682, 2021.
[19] Z. Liu et al., "Video Swin Transformer," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3202-3211.
Advisor 蘇柏齊   Date of Approval 2023-7-28
