This study presents a framework for constructing a voice-driven, upper-body 3D Gaussian Splatting avatar with dynamic clothing changes from only a small number of input images. The proposed system leverages multimodal generation techniques, integrating image synthesis, speech processing, and real-time rendering modules into a high-fidelity, interactive virtual-human framework. Starting from a single frontal portrait, a head-synthesis model derives multi-view facial images, which are then reconstructed into a continuous-viewpoint 3D upper-body representation using the 3D Gaussian Splatting technique of GaussianAvatars. Voice interaction is enabled through automatic speech recognition (ASR) and text-to-speech (TTS) modules, which drive synchronized lip movements and expression dynamics. For clothing manipulation, a conditional image-to-image translation model performs visually consistent virtual outfit try-on. The overall system combines low-resource modeling, real-time performance, and high visual realism, making it suitable for applications such as digital humans, remote interaction, digital twins, virtual fitting, and immersive marketing. Experimental results show that, with as few as 1 to 3 input images, the system still generates stable, temporally consistent, and well speech-synchronized 3D upper-body avatars, while supporting flexible visual transitions across multiple outfits. These findings confirm the feasibility and practicality of the proposed multimodal avatar system for interactive applications.
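The following is a minimal sketch of the four-stage pipeline described above, assuming hypothetical module and function names (synthesize_multiview, reconstruct_gaussian_avatar, change_outfit, drive_with_speech); the abstract does not specify an API, so this only illustrates the stage ordering, not the actual implementation.

```python
# Pipeline sketch under stated assumptions: every function below is a
# hypothetical placeholder, not the paper's actual code. Stage order follows
# the abstract: multi-view head synthesis -> GaussianAvatars-style 3DGS
# reconstruction -> conditional outfit translation -> speech-driven rendering.
from typing import List


def synthesize_multiview(frontal_photos: List[bytes], n_views: int = 16) -> List[bytes]:
    """Hypothetical head synthesizer: expand 1-3 frontal photos into multi-view images."""
    raise NotImplementedError


def reconstruct_gaussian_avatar(views: List[bytes]) -> object:
    """Hypothetical GaussianAvatars-style 3D Gaussian Splatting reconstruction."""
    raise NotImplementedError


def change_outfit(avatar: object, garment_image: bytes) -> object:
    """Hypothetical conditional image-to-image translation for virtual try-on."""
    raise NotImplementedError


def drive_with_speech(avatar: object, audio_path: str) -> List[bytes]:
    """Hypothetical ASR/TTS lip-sync driver; returns speech-synchronized rendered frames."""
    raise NotImplementedError


def build_and_animate(photos: List[bytes], garment: bytes, audio_path: str) -> List[bytes]:
    views = synthesize_multiview(photos)           # 1. multi-view head synthesis
    avatar = reconstruct_gaussian_avatar(views)    # 2. 3DGS upper-body reconstruction
    avatar = change_outfit(avatar, garment)        # 3. outfit change
    return drive_with_speech(avatar, audio_path)   # 4. speech-driven rendering
```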