Abstract (English) |
According to 2021 statistics from Taiwan's Ministry of Health and Welfare, about 1,198,000 people in Taiwan hold disability certificates, roughly 5% of the total population, and 125,764 of them have a hearing impairment. Because hearing loss from childhood makes spoken pronunciation and language learning difficult, hearing-impaired people often rely on sign language as their main means of communication. Yet when sign language users watch TV news, election debates, live press conferences, and other media that depend heavily on hearing, they can usually only follow the subtitles. Only a few government-organized public events, such as election debates and regular epidemic-prevention press conferences, provide a sign language interpreter, who renders the speaker's spoken content into sign language so that sign language users can understand it more easily. Because sign language interpreters remain scarce, however, they can be deployed on only a few occasions. How to give the hearing-impaired the same experience as hearing audiences has therefore become a major issue for modern media.
This research combines deep-learning technologies from two major fields, natural language processing and gesture recognition, to develop a system that performs sign language interpretation in real time and renders the signs through a virtual character. First, a 3D gesture recognition model converts videos of individual sign language words into a gesture data set, as sketched below.
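The following is a minimal sketch of how such a gesture data set might be built with MediaPipe Hands [6]. The single-hand setting, the file name, and the dictionary storage are illustrative assumptions, not the thesis's actual pipeline.

# Sketch: extract per-frame 3D hand landmarks from a sign-word video.
import cv2
import mediapipe as mp
import numpy as np

def video_to_landmarks(video_path):
    """Return an array of shape (frames, 21, 3) of MediaPipe hand landmarks."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=1) as hands:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
            result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.multi_hand_landmarks:
                hand = result.multi_hand_landmarks[0]
                frames.append([(p.x, p.y, p.z) for p in hand.landmark])
    cap.release()
    return np.array(frames)

# One entry of the gesture data set: sign word -> landmark sequence, e.g.
# gesture_dataset["謝謝"] = video_to_landmarks("thank_you.mp4")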
At run time, a third-party speech recognition service transcribes the user's speech into Chinese sentences; a minimal sketch of this step follows.
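The sketch below assumes the Google Cloud Speech-to-Text service cited in [2] is the third-party service in question; the credential setup, the 16 kHz mono PCM audio format, and the function name are illustrative assumptions.

# Sketch: transcribe raw 16 kHz 16-bit mono PCM audio into Chinese text.
from google.cloud import speech

def transcribe(pcm_bytes):
    client = speech.SpeechClient()  # requires Google Cloud credentials
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="cmn-Hant-TW",  # Mandarin (Traditional, Taiwan)
    )
    audio = speech.RecognitionAudio(content=pcm_bytes)
    response = client.recognize(config=config, audio=audio)
    return "".join(r.alternatives[0].transcript for r in response.results)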
A natural language processing model then converts each Chinese sentence into a sequence of sign language words, and every word in the sequence is matched against the gesture data set. The matched gestures are passed to the avatar, which performs them, and all stages are connected into a complete user-facing system for real-time sign language interpretation. The word-matching step is sketched below.
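The matching step might look like the following sketch, which assumes a Sentence-BERT-style encoder [4], [27] scores each predicted sign word against the words that have recorded gestures; the model name and the three-word vocabulary are placeholders, not the thesis's actual choices.

# Sketch: map a predicted sign word to its closest gesture-data-set entry.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

vocab = ["謝謝", "你好", "再見"]  # placeholder: sign words with recorded gestures
vocab_emb = model.encode(vocab, convert_to_tensor=True)

def match_sign_word(word):
    emb = model.encode(word, convert_to_tensor=True)
    scores = util.cos_sim(emb, vocab_emb)[0]  # cosine similarity to each entry
    return vocab[int(scores.argmax())]

# e.g. match_sign_word("感謝") returns "謝謝" if those two words embed most similarly.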
In addition, this study experimented with and applied a variety of signal smoothing techniques to mitigate the temporal jitter that is common in gesture recognition, so that the virtual character's signing more closely resembles that of a real person; one such filter is sketched below.
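One smoothing technique cited in this work is the 1€ filter [8], which adapts its cutoff frequency to the signal's speed: it smooths heavily when a landmark is nearly still and introduces little lag when the landmark moves quickly. A minimal per-coordinate sketch, with illustrative parameter values:

# Sketch: 1€ filter for one landmark coordinate sampled at a fixed rate.
import math

class OneEuroFilter:
    def __init__(self, freq=30.0, min_cutoff=1.0, beta=0.007, d_cutoff=1.0):
        self.freq = freq              # sampling rate in Hz
        self.min_cutoff = min_cutoff  # baseline cutoff; lower = smoother at rest
        self.beta = beta              # speed coefficient; higher = less lag in motion
        self.d_cutoff = d_cutoff      # cutoff for the derivative estimate
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        # Smoothing factor of an exponential filter with the given cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * self.freq)

    def __call__(self, x):
        if self.x_prev is None:
            self.x_prev = x
            return x
        # Estimate and smooth the signal's speed.
        dx = (x - self.x_prev) * self.freq
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Fast motion raises the cutoff (less lag); slow motion lowers it (less jitter).
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat

# e.g. f = OneEuroFilter(freq=30.0); smoothed = [f(x) for x in raw_x_series]
|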
References |
[1] 統計處. “身心障礙統計專區.” (Jul. 2021), [Online]. Available: https://dep.mohw.gov.tw/dos/cp-5224-62359-113.html (visited on 06/09/2022).
[2] “Speech-to-Text: 自動語音辨識 | Cloud 語音轉文字,” [Online]. Available: https://cloud.google.com/speech-to-text?hl=zh-tw (visited on 05/19/2022).
[3] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” arXiv, Tech. Rep. arXiv:1706.03762, 2017.
[4] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese
BERT-Networks,” arXiv, Tech. Rep. arXiv:1908.10084, Aug. 2019.
[5] A. Juliani, V. Berges, E. Vckay, et al., “Unity: A general platform for intelligent agents,”
CoRR, vol. abs/1809.02627, 2018.
[6] F. Zhang, V. Bazarevsky, A. Vakunov, et al., “MediaPipe Hands: On-device Real-time
Hand Tracking,” arXiv, Tech. Rep. arXiv:2006.10214, Jun. 2020.
[7] “卡爾曼濾波.” (Aug. 2021), [Online]. Available: https://zh.wikipedia.org/w/index.php?title=%E5%8D%A1%E5%B0%94%E6%9B%BC%E6%BB%A4%E6%B3%A2&oldid=67182863 (visited on 06/09/2022).
[8] G. Casiez, N. Roussel, and D. Vogel, “1€ Filter: A Simple Speed-based Low-pass Filter for Noisy Input in Interactive Systems,” in Proceedings of the Conference on Human Factors in Computing Systems (CHI ’12), May 2012, pp. 2527–2530.
[9] “台灣手語.” (Nov. 2021).
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding,” arXiv, Tech. Rep. arXiv:1810.04805,
May 2019.
[11] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a ‘Siamese’ time delay neural network,” in Advances in Neural Information Processing Systems, vol. 6, Morgan-Kaufmann, 1993.
[12] N. Kasukurthi, B. Rokad, S. Bidani, and D. A. Dennisan, “American Sign Language
Alphabet Recognition using Deep Learning,” arXiv, Tech. Rep. arXiv:1905.05487, May
2019.
[13] “美國手語.” (Dec. 2020), [Online]. Available: https://zh.wikipedia.org/w/index.php?title=美國手語&oldid=63043570 (visited on 06/29/2022).
[14] S. He, “Research of a Sign Language Translation System Based on Deep Learning,” in
2019 International Conference on Artificial Intelligence and Advanced Manufacturing
(AIAM), Oct. 2019, pp. 392–396.
[15] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “OpenPose: Realtime multi-person 2D pose estimation using part affinity fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172–186, 2021.
[16] S. A. Ehssan Aly, A. Hassanin, and S. Bekhet, “ESLDL: An integrated deep learning model for Egyptian sign language recognition,” in 2021 3rd Novel Intelligent and Leading Emerging Sciences Conference (NILES), 2021, pp. 331–335.
[17] W. Cheng, J. H. Park, and J. H. Ko, “HandFoldingNet: A 3D hand pose estimation network using multiscale-feature guided folding of a 2D hand skeleton,” CoRR, vol. abs/2108.05545, 2021.
[18] U. Iqbal, P. Molchanov, T. M. Breuel, J. Gall, and J. Kautz, “Hand pose estimation via
latent 2.5d heatmap regression,” CoRR, vol. abs/1804.09534, 2018.
[19] M. Boulares and M. Jemni, “Mobile sign language translation system for deaf community,”
in Proceedings of the International Cross-Disciplinary Conference on Web Accessibility,
ser. W4A ’12, Lyon, France: Association for Computing Machinery, 2012.
[20] S. Stoll, N. C. Camgoz, S. Hadfield, and R. Bowden, “Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks,” International Journal of Computer Vision, vol. 128, no. 4, pp. 891–908, Apr. 2020.
[21] T. Luz. “Using AI for Sign Language Translation.” (Mar. 2020), [Online]. Available: https://www.youtube.com/watch?v=N0Vm0LXmcU4 (visited on 06/29/2022).
[22] Hand Talk Translator, Apps on Google Play.
[23] 孫聖然. “北京冬奧|央視推AI手語主播助聽障人士觀賽 適應快語速識專有詞.” (Feb. 2022), [Online]. Available: https://www.hk01.com/即時中國/732025/北京冬奧-央視推ai手語主播助聽障人士觀賽-適應快語速識專有詞 (visited on 05/18/2022).
[24] “STSbenchmark,” stswiki.
[25] Huertas97, “Multilingual-STSB,” Mar. 2022.
[26] 張榮興. “實用臺灣手語教材,” [Online]. Available: https://www.books.com.tw/products/0010882503 (visited on 06/25/2022).
[27] “SentenceTransformers Documentation,” [Online]. Available: https://www.sbert.net/ (visited on 06/26/2022). |