NCU Institutional Repository (theses and dissertations, past exam papers, journal articles, and research projects): Item 987654321/94402


    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/94402


    Title: 利用潛藏一致性模型實現高效影片生成應用於語意驅動音樂生成系統;Efficient Video Generation with Latent Consistency Models for a Text-Driven Music Generation System
    Authors: 陳丕中;Chen, Pi-Jhong
    Contributors: 人工智慧國際碩士學位學程;International Graduate Program in Artificial Intelligence
    Keywords: 生成式AI;大型語言模型;LLM Agent;影片擴散模型;潛藏一致性模型;多模態生成;Generative AI;Large Language Model;LLM Agent;Video Latent Diffusion Model;Latent Consistency Model;Multimodal Generation
    Date: 2024-07-13
    Issue Date: 2024-10-09 14:40:41 (UTC+8)
    Publisher: 國立中央大學;National Central University
    Abstract: Many music streaming platforms are actively experimenting with automatically creating diverse works from text, but existing techniques fall clearly short in linking music with animation: they struggle to accurately reflect the distinctive elements and emotions of specific cultures, and often do little to convey the musical context. To address this, we employ a Large Generative Pre-trained Model (LGPM) and a Video Latent Diffusion Model (video LDM), both of which have shown considerable potential for technological innovation. The core of our system is a semantics-driven music and animation generation module that, given a user's text prompt, generates culturally distinctive music and a corresponding animation.

    The LLM analyzes and understands the user's natural-language input and uses it to guide the theme and emotional tone of both the music and the animation, ensuring that the generated content reflects the user's intent and stylistic requirements; our experiments show that this LLM-driven analysis significantly improves the thematic and emotional accuracy of the generated content. After a reinforcement-learning-based music generation module produces music matching the user's needs, the video LDM generates an animation matching the music's style, turning its abstract emotion and tension into concrete imagery. We also focus on improving the visual quality of the animations, particularly their temporal coherence and the reduction of visual distortions. To further optimize the quality and efficiency of the generated animations, we integrate a Latent Consistency Model (LCM), which reduces the number of steps needed to generate animation keyframes from 20 to 4 while maintaining high visual quality.

    This research not only advances the practicality of AI-driven music video generation but also opens new directions for future research in the field. Our system significantly improves the connection between music and animation and more accurately reflects users' cultural and emotional needs, which is crucial for promoting the expression and preservation of cultural diversity.
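    The staged pipeline described in the abstract (LLM prompt analysis, then music generation, then music-conditioned animation) can be sketched as follows. This is a minimal illustration with hypothetical stub functions; in the actual system the stubs would be an LLM, the reinforcement-learning music module, and the video LDM.

    ```python
    # Sketch of the semantics-driven generation pipeline. All three stage
    # functions are hypothetical stand-ins, not the thesis implementation.
    from dataclasses import dataclass

    @dataclass
    class CreativeBrief:
        theme: str
        emotion: str

    def analyze_prompt(prompt: str) -> CreativeBrief:
        """Stand-in for the LLM stage: extract theme and emotional tone."""
        emotion = "joyful" if "festival" in prompt else "calm"
        return CreativeBrief(theme=prompt, emotion=emotion)

    def generate_music(brief: CreativeBrief) -> str:
        """Stand-in for the RL-based music generation module."""
        return f"music<{brief.theme}|{brief.emotion}>"

    def generate_animation(brief: CreativeBrief, music: str) -> str:
        """Stand-in for the video-LDM stage, conditioned on the music."""
        return f"animation<{music}|{brief.emotion}>"

    def run_pipeline(prompt: str) -> tuple[str, str]:
        """Prompt -> brief -> music -> music-conditioned animation."""
        brief = analyze_prompt(prompt)
        music = generate_music(brief)
        return music, generate_animation(brief, music)
    ```

    The key design point the sketch captures is that the brief extracted by the LLM conditions both downstream stages, and the animation stage additionally consumes the generated music, which is how the system keeps music and animation aligned.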
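    The 20-to-4 step reduction from LCM distillation can be made concrete with a toy count of denoiser evaluations, under the assumption that each sampling step costs one network call. The loop body below is a dummy update, not a real denoiser; only the call counts are meaningful.

    ```python
    # Toy illustration of the LCM speed-up: fewer sampling steps means
    # proportionally fewer denoiser evaluations per keyframe.
    def sample(x: float, num_steps: int) -> tuple[float, int]:
        """Run a dummy sampling loop and count denoiser evaluations."""
        evaluations = 0
        for _ in range(num_steps):
            x = 0.5 * x  # stand-in for one denoising update of the latent
            evaluations += 1
        return x, evaluations

    _, diffusion_evals = sample(1.0, 20)  # standard video-LDM schedule
    _, lcm_evals = sample(1.0, 4)         # LCM-distilled schedule
    ```

    With 20 versus 4 steps, the distilled sampler performs 5x fewer denoiser evaluations per keyframe, which is where the efficiency gain reported above comes from.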
    Appears in Collections: [International Graduate Program in Artificial Intelligence] Electronic Thesis & Dissertation

    Files in This Item:

    File: index.html    Size: 0 Kb    Format: HTML


    All items in NCUIR are protected by copyright, with all rights reserved.

