This study presents a framework for constructing a voice-driven, upper-body 3D Gaussian Splatting avatar with dynamic clothing changes from only a small number of input images. The proposed system leverages multimodal generation techniques, integrating image synthesis, speech processing, and real-time rendering modules into a high-fidelity, interactive virtual-human framework. Starting from a single frontal portrait, a head-synthesis model derives multi-view facial images, which are then reconstructed into a continuous-viewpoint 3D upper-body representation using the 3D Gaussian Splatting technique of GaussianAvatars. Voice interaction is enabled through automatic speech recognition (ASR) and text-to-speech (TTS) modules, which drive synchronized lip movements and expression dynamics. For clothing manipulation, a conditional image-to-image translation model performs visually consistent virtual outfit try-on. The overall system combines low-resource modeling, real-time performance, and high visual realism, making it suitable for applications such as digital humans, remote interaction, digital twins, virtual fitting, and immersive marketing. Experimental results show that, with as few as 1 to 3 input images, the system still generates stable, temporally consistent, and well speech-synchronized 3D upper-body avatars, while supporting flexible visual transitions across multiple outfits. These findings confirm the feasibility and practicality of the proposed multimodal avatar system for interactive applications.
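The following is a minimal sketch of the four-stage pipeline described above, assuming hypothetical module and function names (synthesize_multiview, reconstruct_gaussian_avatar, change_outfit, drive_with_speech); the abstract does not specify an API, so this only illustrates the stage ordering, not the actual implementation.

```python
# Pipeline sketch under stated assumptions: every function below is a
# hypothetical placeholder, not the paper's actual code. Stage order follows
# the abstract: multi-view head synthesis -> GaussianAvatars-style 3DGS
# reconstruction -> conditional outfit translation -> speech-driven rendering.
from typing import List


def synthesize_multiview(frontal_photos: List[bytes], n_views: int = 16) -> List[bytes]:
    """Hypothetical head synthesizer: expand 1-3 frontal photos into multi-view images."""
    raise NotImplementedError


def reconstruct_gaussian_avatar(views: List[bytes]) -> object:
    """Hypothetical GaussianAvatars-style 3D Gaussian Splatting reconstruction."""
    raise NotImplementedError


def change_outfit(avatar: object, garment_image: bytes) -> object:
    """Hypothetical conditional image-to-image translation for virtual try-on."""
    raise NotImplementedError


def drive_with_speech(avatar: object, audio_path: str) -> List[bytes]:
    """Hypothetical ASR/TTS lip-sync driver; returns speech-synchronized rendered frames."""
    raise NotImplementedError


def build_and_animate(photos: List[bytes], garment: bytes, audio_path: str) -> List[bytes]:
    views = synthesize_multiview(photos)           # 1. multi-view head synthesis
    avatar = reconstruct_gaussian_avatar(views)    # 2. 3DGS upper-body reconstruction
    avatar = change_outfit(avatar, garment)        # 3. outfit change
    return drive_with_speech(avatar, audio_path)   # 4. speech-driven rendering
```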