Abstract (English)
As a novel technology, VR/AR has important applications in education, entertainment, and scenario simulation. VR can provide an experience comparable to a real spatial environment, which makes it a valuable tool for simulating medical surgery, military training, and even psychological counseling. Manufacturing, construction, and tourism can also be dramatically transformed with the help of VR and AR. For example, VR makes it easy to implement remote monitoring of factories and virtual tours of tourist attractions, and it can be applied to building information models for design simulation, collaborative editing, and cost estimation of construction projects. AR, in turn, superimposes virtual objects on real scenes; it can overlay equipment operation procedures, maintenance SOPs, pipeline maps, orientation guidance, and item history information in space, bringing great convenience to production operations, equipment maintenance, fire rescue, sightseeing guidance, and more.
In the virtual world, human facial expressions are extremely important. Among the external cues humans perceive, the face occupies a disproportionate share of the brain's processing capacity; the brain even has a dedicated area for processing faces in visual signals. If the virtual rendering of a face is not realistic enough, it easily breaks the VR user's sense of immersion, and the intended effect of VR/AR cannot be achieved. It is therefore well worth investing resources in realistic facial models for virtual characters.
Existing facial capture technology can use image information and various sensors to reconstruct a person's face in the virtual world. The technology is mature and has been widely used in major animation, game, and film productions. However, the equipment required to capture faces with existing methods is expensive. In many situations such resources are not available, and the bandwidth for transmitting data is even more limited. In these situations, a technique that uses deep learning to analyze the text and the corresponding emotion in the audio, and to reconstruct and synthesize the facial features and motion meshes the virtual character should have, can come in handy.
Building on a real-time facial model synthesis system proposed in prior work, this thesis uses a lightweight Transformer model to analyze the speaker's mouth shapes in real time while consuming few resources, to infer the emotions implied by the tone of voice, and to adjust other parts of the face model accordingly, such as the eyebrows, eyes, and cheeks.
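To make the pipeline described above concrete, the sketch below shows one plausible way to wire such a system: per-frame audio features feed a small Transformer encoder, a frame-wise head predicts mouth/viseme blendshape weights, and a pooled emotion vector conditions the upper-face blendshapes (eyebrows, eyes, cheeks). This is a minimal illustrative sketch, not the thesis implementation; all module names, dimensions, and the blendshape split are assumptions, and positional encoding is omitted for brevity.

```python
# Minimal sketch (assumed design, not the thesis implementation): audio
# features -> lightweight Transformer encoder -> mouth blendshapes per frame
# plus an emotion vector that modulates upper-face blendshapes.
import torch
import torch.nn as nn


class AudioToBlendshapes(nn.Module):
    def __init__(self, feat_dim=80, d_model=128, n_heads=4, n_layers=2,
                 n_mouth=24, n_upper=16, n_emotions=8):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=256,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Frame-wise mouth/viseme blendshape weights.
        self.mouth_head = nn.Linear(d_model, n_mouth)
        # Utterance-level emotion logits from mean-pooled features.
        self.emotion_head = nn.Linear(d_model, n_emotions)
        # Emotion-conditioned upper-face blendshape weights (brows, eyes, cheeks).
        self.upper_head = nn.Linear(d_model + n_emotions, n_upper)

    def forward(self, feats):
        # feats: (batch, time, feat_dim) audio features, e.g. mel-spectrogram
        # frames or wav2vec 2.0 embeddings projected to feat_dim.
        x = self.encoder(self.input_proj(feats))      # (batch, time, d_model)
        mouth = torch.sigmoid(self.mouth_head(x))     # (batch, time, n_mouth)
        pooled = x.mean(dim=1)                        # (batch, d_model)
        emotion = torch.softmax(self.emotion_head(pooled), dim=-1)
        # Broadcast the emotion vector over time to condition the upper face.
        emo_t = emotion.unsqueeze(1).expand(-1, x.size(1), -1)
        upper = torch.sigmoid(self.upper_head(torch.cat([x, emo_t], dim=-1)))
        return mouth, upper, emotion


if __name__ == "__main__":
    model = AudioToBlendshapes()
    dummy = torch.randn(1, 100, 80)  # 100 frames of 80-dim mel features
    mouth, upper, emotion = model(dummy)
    print(mouth.shape, upper.shape, emotion.shape)
```

A small encoder of this shape (two layers, 128-dimensional) is the kind of configuration a lightweight, real-time Transformer would use; the actual feature extractor, blendshape set, and emotion categories would follow the thesis's own design.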
[23] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” arXiv, Jun. 03, 2021. Accessed: Sep. 21, 2022. [Online]. Available: http://arxiv.org/abs/2010.11929
[24] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers.” arXiv, May 28, 2020. Accessed: Sep. 21, 2022. [Online]. Available: http://arxiv.org/abs/2005.12872
[25] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers.” arXiv, Oct. 28, 2021. Accessed: Sep. 21, 2022. [Online]. Available: http://arxiv.org/abs/2105.15203
[26] Y. Fan, Z. Lin, J. Saito, W. Wang, and T. Komura, “FaceFormer: Speech-Driven 3D Facial Animation with Transformers.” arXiv, Mar. 16, 2022. Accessed: Sep. 18, 2022. [Online]. Available: http://arxiv.org/abs/2112.05329
[27] S. Mehta and M. Rastegari, “MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer.” arXiv, Mar. 04, 2022. Accessed: Sep. 20, 2022. [Online]. Available: http://arxiv.org/abs/2110.02178
[28] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” arXiv, Oct. 22, 2020. Accessed: Sep. 18, 2022. [Online]. Available: http://arxiv.org/abs/2006.11477 |