Graduate Thesis 110522156: Detailed Record




Author: 陳明威 (Ming-Wei Chen)    Graduate Department: Department of Computer Science and Information Engineering
Thesis Title (Chinese): 圖片建置聲音驅動之 3D 高斯潑濺立體半身模型及換衣功能
Thesis Title (English): Image-Based Construction of a Voice-Driven 3D Gaussian Splatting Upper-Body Model with Virtual Garment Transfer
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Preprocessing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Applications of a High-Quality Spoken Narration System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Speeded-Up Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ Applying RetinaNet to Face Detection
★ Trend Prediction for Financial Instruments
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Market Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
Full Text: not publicly released (access permanently restricted); available for browsing within the library system only
Abstract (Chinese): This thesis proposes a framework that constructs a voice-driven 3D Gaussian Splatting upper-body avatar with dynamic clothing-change capability from only a small number of input images. The system fuses multimodal generation techniques, integrating image synthesis, speech processing, and real-time rendering modules into a highly realistic interactive virtual-avatar framework. First, a head-image synthesis model derives multi-view images from a single frontal portrait, and the 3D Gaussian Splatting technique of GaussianAvatars then reconstructs an upper-body model viewable from continuous viewpoints. Next, speech recognition and text-to-speech models drive the avatar to produce synchronized lip shapes and facial expressions. For clothing changes, a conditional image-to-image translation model provides visually consistent virtual try-on. The overall system offers low-resource modeling, strong real-time performance, and high visual fidelity, and can be widely applied to virtual humans, remote interaction, digital twins, and immersive marketing. Experimental results show that even with very few input images (1 to 3), the method still generates stable, continuous, and well speech-synchronized 3D upper-body avatars, and supports flexible visual transfer across multiple outfits, confirming the feasibility and practicality of the system for multimodal interactive applications.
Abstract (English): This study presents a novel framework for constructing an upper-body avatar based on GaussianAvatars-style 3D Gaussian Splatting, driven by voice and capable of dynamic clothing changes, using only a small number of input images. The proposed system leverages multimodal generation techniques by integrating image synthesis, speech processing, and real-time rendering modules to achieve high-fidelity, interactive virtual humans. Starting from a frontal portrait, a Head Synthesizer is employed to generate multi-view facial images, which are then reconstructed into a continuous-viewpoint 3D representation using Gaussian Splatting. Voice interaction is enabled through automatic speech recognition (ASR) and text-to-speech (TTS) modules, driving realistic lip-sync and expression dynamics. For clothing manipulation, a conditional image-to-image translation model is applied to perform seamless virtual outfit try-on. The system features low data requirements, fast rendering, and visually coherent results, making it suitable for applications such as digital humans, remote interaction, virtual fitting, and immersive marketing. Experimental results demonstrate that the system can generate temporally consistent and speech-synchronized 3D avatars from as few as 1 to 3 images, while supporting diverse outfit changes with high visual realism. These findings confirm the feasibility and practicality of the proposed multimodal human avatar system.
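The abstract describes a four-stage pipeline: multi-view head synthesis from a single portrait, 3D Gaussian Splatting reconstruction, speech-driven lip-sync animation, and conditional garment transfer. The sketch below only illustrates how such a pipeline could be orchestrated stage by stage; every function is a hypothetical placeholder standing in for the corresponding module (Portrait4D-style view synthesis, GaussianAvatars training, ASR/TTS-driven animation, FitDiT-style try-on), not the thesis code or any library's actual API.

```python
# Minimal orchestration sketch of the avatar pipeline described in the abstract.
# All functions below are hypothetical placeholders, not real APIs.

from dataclasses import dataclass
from typing import List


@dataclass
class AvatarAssets:
    multiview_images: List[str]  # synthesized multi-view head images
    gaussian_model: str          # path to the optimized 3D Gaussian model
    animation_frames: List[str]  # speech-synchronized rendered frames
    tryon_render: str            # render after garment transfer


def synthesize_head_views(frontal_portrait: str, num_views: int = 16) -> List[str]:
    """Stand-in for the head synthesizer that derives multi-view images
    from a single frontal portrait (Portrait4D-style, offline step)."""
    return [f"views/view_{i:02d}.png" for i in range(num_views)]


def train_gaussian_avatar(view_images: List[str]) -> str:
    """Stand-in for GaussianAvatars-style 3D Gaussian Splatting optimization
    over the synthesized views; returns the trained splat model path."""
    return "avatar/gaussians.ply"


def animate_with_speech(gaussian_model: str, audio_path: str, fps: int = 25) -> List[str]:
    """Stand-in for the voice-driven stage: ASR/TTS output and audio timing
    drive lip shapes and expressions, producing rendered frames."""
    num_frames = 2 * fps  # pretend the input audio clip is two seconds long
    return [f"frames/frame_{i:04d}.png" for i in range(num_frames)]


def transfer_garment(body_render: str, garment_image: str) -> str:
    """Stand-in for FitDiT-style conditional image-to-image virtual try-on."""
    return "avatar/tryon.png"


def build_avatar(frontal_portrait: str, garment_image: str, audio_path: str) -> AvatarAssets:
    """Run the four stages in order and collect the produced artifacts."""
    views = synthesize_head_views(frontal_portrait)
    model = train_gaussian_avatar(views)
    frames = animate_with_speech(model, audio_path)
    tryon = transfer_garment(frames[0], garment_image)
    return AvatarAssets(views, model, frames, tryon)


if __name__ == "__main__":
    assets = build_avatar("inputs/portrait.jpg", "inputs/shirt.png", "inputs/greeting.wav")
    print(f"{len(assets.multiview_images)} views, model at {assets.gaussian_model}, "
          f"{len(assets.animation_frames)} animation frames, try-on at {assets.tryon_render}")
```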
Keywords (Chinese)
★ 3D 高斯潑濺 (3D Gaussian Splatting)
★ 多模態生成 (multimodal generation)
★ 虛擬人像 (virtual avatar)
★ 聲音驅動 (voice-driven)
★ 虛擬試衣 (virtual try-on)
Keywords (English) ★ 3D Gaussian Splatting
★ multimodal generation
★ virtual avatar
★ voice-driven animation
★ virtual try-on
Table of Contents
Chinese Abstract
Abstract
Table of Contents
Chapter 1: Introduction
1.1 Background
1.2 Motivation and Objectives
1.3 Research Methods and Chapter Overview
Chapter 2: Related Work and Literature Review
2.1 3D Gaussian Splatting
2.2 GaussianAvatars
2.3 Portrait4D
2.4 FitDiT
2.5 Text-to-Speech (TTS) and Voice Activity Detection (VAD)
Chapter 3: System Architecture and Functionality
3.1 Input and Front-End Processing
3.2 Head Image Synthesis (Head Synthesizer)
3.3 Clothing-Change Image Synthesis
3.4 3D Reconstruction and Gaussian Splatting Rendering
3.5 Speech Synchronization and Animation Generation
Chapter 4: System Implementation Results and Discussion
4.1 Data and Preprocessing
4.2 Main Technical Implementation
4.2.1 GaussianAvatars 3D Gaussian Splatting for Sparse-Image Modeling
4.2.2 Alignment of Portrait4D-Generated Images with NeRSemble Dataset Input Images
4.2.3 FitDiT Garment Transfer and Geometric Alignment
4.3 Experimental Results and Discussion
4.4 Evaluation of Generated Images
Chapter 5: Conclusions and Future Directions
References
References
[1] Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Transactions on Graphics (TOG) 42.4 (2023): 1-14.
[2] Qian, Shenhan, et al. "GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024.
[3] Deng, Yu, et al. "Portrait4D: Learning One-Shot 4D Head Avatar Synthesis Using Synthetic Data." arXiv preprint arXiv:2311.18729 (2023).
[4] Jiang, Boyuan, et al. "FitDiT: Advancing the Authentic Garment Details for High-Fidelity Virtual Try-On." arXiv preprint arXiv:2411.10499 (2024).
[5] Kirschstein, Tobias, et al. "NeRSemble: Multi-View Radiance Field Reconstruction of Human Heads." ACM Transactions on Graphics (TOG) 42.4 (2023): 1-14.
[6] Vaswani, Ashish, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems (NeurIPS). 2017.
[7] Li, Tianye, et al. "Learning a Model of Facial Shape and Expression from 4D Scans." ACM Transactions on Graphics (TOG) 36.6 (2017): Article 194.
[8] Mildenhall, Ben, et al. "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis." Communications of the ACM 65.1 (2021): 99-106.
[9] Wang, Yuxuan, et al. "Tacotron: Towards End-to-End Speech Synthesis." Interspeech. 2017.
[10] Oord, Aaron van den, et al. "WaveNet: A Generative Model for Raw Audio." arXiv preprint arXiv:1609.03499 (2016).
[11] Hughes, Thad, and Keir Mierle. "Recurrent Neural Networks for Voice Activity Detection." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2013.
[12] Pavlakos, Georgios, et al. "Expressive Body Capture: 3D Hands, Face, and Body from a Single Image." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
[13] Thies, Justus, et al. "Face2Face: Real-Time Face Capture and Reenactment of RGB Videos." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
[14] Chan, Caroline, et al. "Everybody Dance Now." Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2019.
[15] Wu, Guanjun, et al. "4D Gaussian Splatting for Real-Time Dynamic Scene Rendering." arXiv preprint arXiv:2310.08528 (2023).
[16] Cherry, E. Colin. "Some Experiments on the Recognition of Speech, with One and with Two Ears." The Journal of the Acoustical Society of America 25.5 (1953): 975-979.
[17] Zhang, Richard, et al. "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2018.
[18] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." Advances in Neural Information Processing Systems (NeurIPS) 25 (2012).
[19] Simonyan, Karen, and Andrew Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition." arXiv preprint arXiv:1409.1556 (2014).
[20] Iandola, Forrest N., et al. "SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5 MB Model Size." arXiv preprint arXiv:1602.07360 (2016).
Advisor: Jia-Ching Wang (王家慶)    Date of Approval: 2025-08-29
