Master's/Doctoral Thesis 109522601: Detailed Record




Name: 費群安 (Arda Satata Fitriajie)    Department: Computer Science and Information Engineering
Thesis Title: 以多特徵神經網路實現連續手語識別
(Realizing Sign Language Recognition using Multi-Feature Neural Network)
Related Theses
★ 基於edX線上討論板社交關係之分組機制
★ 利用Kinect建置3D視覺化之Facebook互動系統
★ 利用 Kinect建置智慧型教室之評量系統
★ 基於行動裝置應用之智慧型都會區路徑規劃機制
★ 基於分析關鍵動量相關性之動態紋理轉換
★ 基於保護影像中直線結構的細縫裁減系統
★ 建基於開放式網路社群學習環境之社群推薦機制
★ 英語作為外語的互動式情境學習環境之系統設計
★ 基於膚色保存之情感色彩轉換機制
★ 一個用於虛擬鍵盤之手勢識別框架
★ 分數冪次型灰色生成預測模型誤差分析暨電腦工具箱之研發
★ 使用慣性傳感器構建即時人體骨架動作
★ 基於多台攝影機即時三維建模
★ 基於互補度與社群網路分析於基因演算法之分組機制
★ 即時手部追蹤之虛擬樂器演奏系統
★ 基於類神經網路之即時虛擬樂器演奏系統
  1. The author has agreed to make this electronic thesis available immediately.
  2. The open-access electronic full text is authorized only for personal, non-profit retrieval, reading, and printing by users for the purpose of academic research.
  3. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.

Abstract (Chinese) Given an RGB video stream, our goal is to correctly recognize the signs involved in continuous sign language recognition (CSLR). Although the number of deep learning methods proposed in this area keeps growing, most of them concentrate on RGB features alone, whether the full-frame image or the details of the hands and face. This scarcity of information in the CSLR training process severely limits their ability to learn multiple features from the input video frames. Multi-feature networks have become fairly common, since current computing power no longer prevents us from scaling up the network size. In this thesis we therefore study deep learning networks and apply a multi-feature technique, aiming to extend and improve the current continuous sign language recognition task. Specifically, the additional feature included in this study is the keypoint feature, which is much lighter than the image feature when the two are compared. The results of this study show that, on the two most popular CSLR datasets, Phoenix2014 and Chinese Sign Language, adding the keypoint feature as an extra modality improves the recognition rate, that is, it lowers the word error rate (WER).
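The word error rate referred to above is the edit distance between the recognized gloss sequence and the reference gloss sequence, normalized by the reference length. A minimal illustrative sketch follows (Python; this is not code from the thesis, and the sample gloss sequences are hypothetical).

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    # Levenshtein edit distance over words, computed by dynamic programming.
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i                       # delete every reference word
    for j in range(len(hypothesis) + 1):
        d[0][j] = j                       # insert every hypothesis word
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(reference)][len(hypothesis)] / max(len(reference), 1)

# Hypothetical example: one gloss deleted out of three gives WER = 1/3.
print(word_error_rate("REGEN MORGEN NORD".split(), "REGEN NORD".split()))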
Abstract (English) Given RGB video streams, we aim to correctly recognize signs for continuous sign language recognition (CSLR). Although an increasing number of deep learning methods have been proposed in this area, most of them focus only on RGB features, either the full-frame image or the details of the hands and face. The scarcity of information in the CSLR training process heavily constrains their capability to learn multiple features within the input video frames. Multi-feature networks have become quite common, since current computing power no longer limits us from scaling up the network size. Thus, in this thesis we build a deep learning network and apply a multi-feature technique, aiming to improve the current state of the art in continuous sign language recognition. In detail, the additional feature included in this research is the keypoint feature, which is much lighter than the image feature when the two are compared. The results of this research show that adding a keypoint feature as an extra modality increases the recognition rate, or equivalently decreases the word error rate (WER), on the two most popular CSLR datasets: Phoenix2014 and Chinese Sign Language.
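To make the multi-feature pipeline described above concrete, below is a minimal sketch, assuming a PyTorch setup: one stream extracts full-frame RGB features with a 2D CNN, a second stream embeds 2D keypoints with an MLP, the two are fused per frame, passed through a BiLSTM, and trained with CTC loss. The backbone choice, layer sizes, 133 whole-body keypoints, and all names below are illustrative assumptions rather than the exact architecture of the thesis.

import torch
import torch.nn as nn
import torchvision.models as models  # torchvision >= 0.13 assumed for weights=None

class MultiFeatureCSLR(nn.Module):
    def __init__(self, num_glosses, num_keypoints=133, kp_dim=256, hidden=512):
        super().__init__()
        # Full-frame stream: ResNet-18 backbone with the classification head removed.
        resnet = models.resnet18(weights=None)
        self.frame_stream = nn.Sequential(*list(resnet.children())[:-1])
        # Keypoint stream: flatten the (x, y) coordinates and embed them with an MLP.
        self.kp_stream = nn.Sequential(
            nn.Linear(num_keypoints * 2, kp_dim), nn.ReLU(),
            nn.Linear(kp_dim, kp_dim), nn.ReLU())
        # Temporal module: BiLSTM over the fused per-frame features.
        self.bilstm = nn.LSTM(512 + kp_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        # Per-frame gloss logits; index 0 is reserved for the CTC blank symbol.
        self.classifier = nn.Linear(2 * hidden, num_glosses + 1)

    def forward(self, frames, keypoints):
        # frames: (B, T, 3, H, W); keypoints: (B, T, num_keypoints, 2)
        B, T = frames.shape[:2]
        f = self.frame_stream(frames.flatten(0, 1)).flatten(1).view(B, T, -1)
        k = self.kp_stream(keypoints.flatten(2))
        fused, _ = self.bilstm(torch.cat([f, k], dim=-1))
        return self.classifier(fused).log_softmax(-1)  # (B, T, num_glosses + 1)

# CTC loss expects (T, B, C) log-probabilities; the lengths describe each sequence.
# model = MultiFeatureCSLR(num_glosses=1000)  # hypothetical gloss vocabulary size
# log_probs = model(frames, keypoints).transpose(0, 1)
# loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)

Training with CTC lets the network learn from unsegmented gloss sequences, so no frame-level alignment of sign boundaries is required.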
Keywords (Chinese) ★ 圖像處理
★ 視頻處理
★ 連續手語識別
★ 手勢識別
★ 關鍵點
Keywords (English) ★ Image Processing
★ Video Processing
★ Continuous Sign Language Recognition
★ Gesture Recognition
★ Keypoint
Table of Contents
Abstract
摘要
List of Contents
List of Figures
List of Tables
Chapter 1. Introduction
1.1 General Introduction
1.2 Objective of Research
1.3 Scope of the Study
1.4 Thesis Outline
Chapter 2. Literature Review
2.1 Isolated Sign Language Recognition
2.2 Continuous Sign Language Recognition
2.3 Keypoint-based Action Recognition
2.4 Convolutional Neural Network
2.5 Bidirectional LSTM Networks
2.6 Connectionist Temporal Classification
2.7 Multi-features Approach
2.8 Self-Attention
Chapter 3. Research Method
3.1 Framework Overview
3.2 Dataset
3.2.1 Phoenix2014
3.2.2 Chinese Sign Language (CSL-100)
3.3 Data Pre-processing
3.3.1 Data Augmentation
3.3.1.1 Random Crop
3.3.1.2 Horizontal Flip
3.3.1.3 Random Temporal Scaling
3.3.2 Key-point Extraction
3.4 Spatial Module
3.4.1 Full Frame Feature
3.4.2 Keypoint Feature
3.5 Temporal Module
3.6 Sequence Learning
3.7 Evaluation Metric
3.8 Loss Function
3.9 Self-Attention
3.9.1 Spatial Attention
3.9.2 Early Temporal Attention
3.9.3 Proposed Late Temporal Attention
Chapter 4. Experiment Result & Discussion
4.1 Experiment Settings
4.2 Experiment on Input Streams
4.3 Experiment on Attention Module
4.4 Experiment on Proposed Model
4.4.1 Quantitative Result
4.4.2 Qualitative Result
Chapter 5. Conclusion and Discussion
5.1 Conclusion
5.2 Discussion & Future Works
References
Advisor: 施國琛 (Prof. Timothy K. Shih)    Date of Approval: 2022-07-25
