| Abstract (English) |
This study presents a framework for constructing an upper-body avatar based on GaussianAvatars-style 3D Gaussian Splatting that is driven by voice and supports dynamic clothing changes, using only a small number of input images. The proposed system integrates image synthesis, speech processing, and real-time rendering modules into a multimodal generation pipeline to achieve high-fidelity, interactive virtual humans. Starting from a single frontal portrait, a head synthesizer generates multi-view facial images, which are then reconstructed into a continuous-viewpoint 3D representation via Gaussian Splatting. Voice interaction is enabled through automatic speech recognition (ASR) and text-to-speech (TTS) modules, which drive realistic lip synchronization and expression dynamics. For clothing manipulation, a conditional image-to-image translation model performs seamless virtual outfit try-on. The system features low data requirements, fast rendering, and visually coherent results, making it suitable for applications such as digital humans, remote interaction, virtual fitting, and immersive marketing. Experimental results demonstrate that the system can generate temporally consistent, speech-synchronized 3D avatars from as few as one to three images, while supporting diverse outfit changes with high visual realism. These findings confirm the feasibility and practicality of the proposed multimodal human avatar system. |
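As a minimal illustration of the splatting-based rendering the abstract relies on, 3D Gaussian Splatting [1] colors each pixel by compositing depth-sorted Gaussian contributions front to back with alpha blending. The sketch below shows only that per-pixel blending rule; the function name and structure are illustrative assumptions, not the thesis implementation:

```python
# Illustrative sketch of per-pixel front-to-back alpha compositing
# as used in 3D Gaussian Splatting [1]:
#   C = sum_i c_i * alpha_i * prod_{j < i} (1 - alpha_j)
# (colors/alphas are assumed to be already depth-sorted, nearest first)

def composite(colors, alphas):
    """Blend depth-sorted Gaussian contributions into one RGB pixel."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for c, a in zip(colors, alphas):
        w = a * transmittance          # this Gaussian's blending weight
        pixel = [p + w * ci for p, ci in zip(pixel, c)]
        transmittance *= (1.0 - a)
        if transmittance < 1e-4:       # early termination, as in [1]
            break
    return pixel
```

Because blending is a simple weighted sum over sorted primitives, it maps well to the GPU rasterizer and underpins the fast rendering claimed above.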
| References |
[1] Kerbl, Bernhard, et al. "3D Gaussian splatting for real-time radiance field rendering." ACM Transactions on Graphics (TOG) 42.4 (2023): 1-14.
[2] Qian, Shenhan, et al. "GaussianAvatars: Photorealistic head avatars with rigged 3D Gaussians." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024.
[3] Deng, Yu, et al. "Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data." arXiv preprint arXiv:2311.18729 (2023).
[4] Jiang, Boyuan, et al. "FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on." arXiv preprint arXiv:2411.10499 (2024).
[5] Kirschstein, Tobias, et al. "NeRSemble: Multi-view radiance field reconstruction of human heads." ACM Transactions on Graphics (TOG) 42.4 (2023): 1-14.
[6] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems (NeurIPS). 2017.
[7] Li, Tianye, et al. "Learning a model of facial shape and expression from 4D scans." ACM Transactions on Graphics (TOG) 36.6 (2017): 194:1-194:17.
[8] Mildenhall, Ben, et al. "NeRF: Representing scenes as neural radiance fields for view synthesis." Communications of the ACM 65.1 (2021): 99-106.
[9] Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." Interspeech. 2017.
[10] Oord, Aaron van den, et al. "WaveNet: A generative model for raw audio." arXiv preprint arXiv:1609.03499 (2016).
[11] Hughes, Timothy, and Keir Mierle. "Recurrent neural networks for voice activity detection." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2013.
[12] Pavlakos, Georgios, et al. "Expressive body capture: 3D hands, face, and body from a single image." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
[13] Thies, Justus, et al. "Face2Face: Real-time face capture and reenactment of RGB videos." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
[14] Chan, Caroline, et al. "Everybody Dance Now." Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2019.
[15] Wu, Guanjun, et al. "4D Gaussian Splatting for Real-Time Dynamic Scene Rendering." arXiv preprint arXiv:2310.08528 (2023).
[16] Cherry, E. Colin. "Some experiments on the recognition of speech, with one and with two ears." The Journal of the Acoustical Society of America 25.5 (1953): 975-979.
[17] Zhang, Richard, et al. "The unreasonable effectiveness of deep features as a perceptual metric." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.
[18] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems (NeurIPS) 25 (2012).
[19] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[20] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size." arXiv preprint arXiv:1602.07360 (2016).
|