References
[1] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In
Proceedings of the IEEE international conference on computer vision,
pages 2961–2969, 2017.
[2] S. Jetley, N. A. Lord, N. Lee, and P. H. Torr. Learn to pay attention.
arXiv preprint arXiv:1804.02391, 2018.
[3] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In
Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 7132–7141, 2018.
[4] O. Koller, J. Forster, and H. Ney. Continuous sign language recognition:
Towards large vocabulary statistical recognition systems handling
multiple signers. Computer Vision and Image Understanding,
141:108–125, Dec. 2015.
[5] J. Pu, W. Zhou, J. Zhang, and H. Li. Sign language recognition based
on trajectory modeling with HMMs. In International Conference on
Multimedia Modeling, pages 686–697. Springer, 2016.
[6] L. Lamberti and F. Camastra. Real-time hand gesture recognition
using a color glove. In International Conference on Image Analysis
and Processing, pages 365–373. Springer, 2011.
[7] L.-J. Kau, W.-L. Su, P.-J. Yu, and S.-J. Wei. A real-time portable
sign language translation system. In 2015 IEEE 58th International
Midwest Symposium on Circuits and Systems (MWSCAS), pages 1–4.
IEEE, 2015.
[8] L. Jing, E. Vahdani, M. Huenerfauth, and Y. Tian. Recognizing
American Sign Language manual signs from RGB-D videos. arXiv
preprint arXiv:1906.02851, 2019.
[9] D.-Y. Huang, W.-C. Hu, and S.-H. Chang. Vision-based hand gesture
recognition using PCA+Gabor filters and SVM. In 2009 Fifth International
Conference on Intelligent Information Hiding and Multimedia
Signal Processing, pages 1–4. IEEE, 2009.
[10] K. Pearson. LIII. On lines and planes of closest fit to systems of points
in space. The London, Edinburgh, and Dublin Philosophical Magazine
and Journal of Science, 2(11):559–572, 1901.
[11] C. Cortes and V. Vapnik. Support-vector networks. Machine learning,
20(3):273–297, 1995.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 770–778, 2016.
[13] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3D CNNs
retrace the history of 2D CNNs and ImageNet? In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages
6546–6555, 2018.
[14] M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches
to attention-based neural machine translation. arXiv preprint
arXiv:1508.04025, 2015.
[15] D. Britz, A. Goldie, M.-T. Luong, and Q. Le. Massive exploration
of neural machine translation architectures. arXiv preprint
arXiv:1703.03906, 2017.
[16] J. Cheng, L. Dong, and M. Lapata. Long short-term memory-networks
for machine reading. arXiv preprint arXiv:1601.06733,
2016.
[17] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation
by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473, 2014.
[18] P. Sermanet, A. Frome, and E. Real. Attention for fine-grained categorization.
arXiv preprint arXiv:1412.7054, 2014.
[19] X. Liu, T. Xia, J. Wang, Y. Yang, F. Zhou, and Y. Lin. Fully
convolutional attention networks for fine-grained recognition. arXiv
preprint arXiv:1603.06765, 2016.
[20] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition
with visual attention. arXiv preprint arXiv:1412.7755, 2014.
[21] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov,
R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption
generation with visual attention. In International conference
on machine learning, pages 2048–2057. PMLR, 2015.
[22] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models
of visual attention. arXiv preprint arXiv:1406.6247, 2014.
[23] R. S. Sutton. Learning to predict by the methods of temporal differences.
Machine learning, 3(1):9–44, 1988.
[24] L. Wright. Ranger - a synergistic optimizer.
https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer, 2019.
[25] D. R. Cox. The regression analysis of binary sequences. Journal
of the Royal Statistical Society: Series B (Methodological), 20(2):
215–232, 1958.
[26] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature
hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 580–587, 2014.
[27] R. Girshick. Fast R-CNN. In Proceedings of the IEEE international
conference on computer vision, pages 1440–1448, 2015.
[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time
object detection with region proposal networks. arXiv preprint
arXiv:1506.01497, 2015.
[29] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look
once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–
788, 2016.
[30] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro,
G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,
A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray,
C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar,
P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals,
P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow:
Large-scale machine learning on heterogeneous systems, 2015.
URL https://www.tensorflow.org/. Software available from tensorflow.org.