Abstract (English)
As a novel technology, VR/AR has important applications in education, entertainment, and scenario simulation. VR can provide an experience comparable to a real spatial environment, which makes it a valuable tool for simulating medical surgery, military training, and even psychological counseling. Manufacturing, construction, and tourism can also be dramatically transformed with the help of VR and AR. For example, VR makes it easy to implement remote monitoring of factories and virtual tours of tourist attractions, and it can be applied to building information models for design simulation, collaborative editing, and cost estimation of construction projects. AR, in turn, superimposes virtual objects on real scenes; it can overlay equipment operation procedures, maintenance SOPs, pipeline maps, orientation guidance, and item history information in space, bringing great convenience to production operations, equipment maintenance, fire rescue, sightseeing guidance, and more.
In the virtual world, human facial expressions are extremely important. Among the external cues humans perceive, the face occupies a disproportionate share of the brain's processing capacity; the brain even has a dedicated area for processing faces in visual signals. If the virtual rendering of a face is not realistic enough, it easily breaks the VR user's sense of immersion, and the intended effect of VR/AR cannot be achieved. It is therefore well worth investing resources in realistic facial models for virtual characters.
Existing facial capture technology can use image information and various sensors to reconstruct a person's face in the virtual world. The technology is mature and has been widely used in major animation, game, and film productions. However, the equipment required to capture faces with existing methods is expensive. In many situations such resources are not available, and the bandwidth for transmitting data is even more limited. In these situations, a technique that uses deep learning to analyze the text and the corresponding emotion in the audio, and to reconstruct and synthesize the facial features and motion meshes the virtual character should have, can come in handy.
Building on a real-time facial model synthesis system proposed in prior work, this thesis uses a lightweight Transformer model to analyze the speaker's mouth shapes in real time while consuming few resources, to infer the emotions implied by the tone of voice, and to adjust other parts of the face model accordingly, such as the eyebrows, eyes, and cheeks.
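To make the pipeline described above concrete, the sketch below shows one plausible way to wire such a system: per-frame audio features feed a small Transformer encoder, a frame-wise head predicts mouth/viseme blendshape weights, and a pooled emotion vector conditions the upper-face blendshapes (eyebrows, eyes, cheeks). This is a minimal illustrative sketch, not the thesis implementation; all module names, dimensions, and the blendshape split are assumptions, and positional encoding is omitted for brevity.

```python
# Minimal sketch (assumed design, not the thesis implementation): audio
# features -> lightweight Transformer encoder -> mouth blendshapes per frame
# plus an emotion vector that modulates upper-face blendshapes.
import torch
import torch.nn as nn


class AudioToBlendshapes(nn.Module):
    def __init__(self, feat_dim=80, d_model=128, n_heads=4, n_layers=2,
                 n_mouth=24, n_upper=16, n_emotions=8):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=256,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Frame-wise mouth/viseme blendshape weights.
        self.mouth_head = nn.Linear(d_model, n_mouth)
        # Utterance-level emotion logits from mean-pooled features.
        self.emotion_head = nn.Linear(d_model, n_emotions)
        # Emotion-conditioned upper-face blendshape weights (brows, eyes, cheeks).
        self.upper_head = nn.Linear(d_model + n_emotions, n_upper)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) audio features, e.g. mel-spectrogram
        # frames or wav2vec 2.0 embeddings projected to feat_dim.
        x = self.encoder(self.input_proj(feats))      # (batch, time, d_model)
        mouth = torch.sigmoid(self.mouth_head(x))     # (batch, time, n_mouth)
        pooled = x.mean(dim=1)                        # (batch, d_model)
        emotion = torch.softmax(self.emotion_head(pooled), dim=-1)
        # Broadcast the emotion vector over time to condition the upper face.
        emo_t = emotion.unsqueeze(1).expand(-1, x.size(1), -1)
        upper = torch.sigmoid(self.upper_head(torch.cat([x, emo_t], dim=-1)))
        return mouth, upper, emotion


if __name__ == "__main__":
    model = AudioToBlendshapes()
    dummy = torch.randn(1, 100, 80)  # 100 frames of 80-dim mel features
    mouth, upper, emotion = model(dummy)
    print(mouth.shape, upper.shape, emotion.shape)
```

A small encoder of this shape (two layers, 128-dimensional) is the kind of configuration a lightweight, real-time Transformer would use; the actual feature extractor, blendshape set, and emotion categories would follow the thesis's own design.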
[23] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” arXiv, Jun. 03, 2021. Accessed: Sep. 21, 2022. [Online]. Available: http://arxiv.org/abs/2010.11929
[24] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers.” arXiv, May 28, 2020. Accessed: Sep. 21, 2022. [Online]. Available: http://arxiv.org/abs/2005.12872
[25] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers.” arXiv, Oct. 28, 2021. Accessed: Sep. 21, 2022. [Online]. Available: http://arxiv.org/abs/2105.15203
[26] Y. Fan, Z. Lin, J. Saito, W. Wang, and T. Komura, “FaceFormer: Speech-Driven 3D Facial Animation with Transformers.” arXiv, Mar. 16, 2022. Accessed: Sep. 18, 2022. [Online]. Available: http://arxiv.org/abs/2112.05329
[27] S. Mehta and M. Rastegari, “MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer.” arXiv, Mar. 04, 2022. Accessed: Sep. 20, 2022. [Online]. Available: http://arxiv.org/abs/2110.02178
[28] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” arXiv, Oct. 22, 2020. Accessed: Sep. 18, 2022. [Online]. Available: http://arxiv.org/abs/2006.11477 |