Master's Thesis 111527002: Detailed Record




Name Pi-Jhong Chen (陳丕中)   Graduate Program International Master Program in Artificial Intelligence
Thesis Title Efficient Video Generation with Latent Consistency Models for a Text-Driven Music System
(Original title: 利用潛藏一致性模型實現高效影片生成應用於語意驅動音樂生成系統)
Related Theses
★ A Grouping Mechanism Based on Social Relationships in edX Online Discussion Forums ★ A 3D Visualized Facebook Interaction System Built with Kinect
★ An Assessment System for Smart Classrooms Built with Kinect ★ An Intelligent Metropolitan Route Planning Mechanism for Mobile Applications
★ Dynamic Texture Transfer Based on Analysis of Key Motion Correlations ★ A Detection Technique for Seam Modifications in JPEG Images
★ A Seam Carving System that Preserves Straight-Line Structures in Images ★ A Community Recommendation Mechanism Built on an Open Online Community Learning Environment
★ System Design of an Interactive Situated Learning Environment for English as a Foreign Language ★ An Emotional Color Transfer Mechanism with Skin-Color Preservation
★ A Gesture-based Presentation System for Smart Classroom using Kinect ★ A Gesture Recognition Framework for Virtual Keyboards
★ Error Analysis of Fractional-Power Grey Prediction Models and Development of a Computer Toolbox ★ Real-Time Human Skeleton Motion Construction Using Inertial Sensors
★ Real-Time 3D Modeling Based on Multiple Cameras ★ A Genetic-Algorithm Grouping Mechanism Based on Complementarity and Social Network Analysis
Files Full text viewable in the repository after 2026-07-05.
Abstract (Chinese) Many music streaming platforms are actively experimenting with generating diverse works automatically from text, but existing techniques fall clearly short when it comes to linking music with animation: they struggle to accurately reflect the distinctive elements and emotions of specific cultures, and do little to convey the musical context. To address this problem, we adopt a Large Generative Pre-trained Model (LGPM) and a Video Latent Diffusion Model (video LDM), two techniques that have demonstrated strong potential for technological innovation. The core of our system is a semantics-driven music and animation generation module that, given a user's text prompt, generates culturally distinctive music together with a matching animation.

The LLM analyzes and understands the user's natural-language input and uses it to guide the theme and emotional tone of the music and animation, ensuring that the generated content accurately reflects the user's intent and stylistic requirements. After a reinforcement-learning-based music generation module produces a musical score that meets the user's needs, the video LDM generates an animation matching the style of the music, turning abstract musical emotion and tension into concrete imagery. We also focus on improving the visual quality of the animation, particularly its temporal coherence and the reduction of visual distortion. To further optimize the quality and efficiency of the generated animation, we integrate a Latent Consistency Model (LCM), which cuts the number of steps needed to generate an animation keyframe from 20 to 4 while maintaining high visual quality.
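To make the flow concrete, the sketch below outlines the three stages just described: LLM-based semantic analysis, RL-based music generation, and video LDM animation. Every name and interface in it (SceneSpec, analyze_prompt, generate_music, generate_animation) is a hypothetical placeholder standing in for the thesis modules, with stub bodies marking where the real models would plug in.

    # Hypothetical pipeline sketch; all names and stubs are illustrative,
    # not the thesis implementation.
    from dataclasses import dataclass

    @dataclass
    class SceneSpec:
        theme: str         # cultural/semantic theme extracted by the LLM
        mood: str          # emotional tone guiding both music and visuals
        style_prompt: str  # text prompt handed to the video model

    def analyze_prompt(user_text: str) -> SceneSpec:
        # Stage 1: an LLM (LLaMA in this work) parses free-form user text
        # into a structured spec; stubbed here with a trivial heuristic.
        return SceneSpec(theme=user_text, mood="serene",
                         style_prompt=user_text + ", cinematic animation")

    def generate_music(spec: SceneSpec) -> str:
        # Stage 2: an RL-based composer would produce music matching the
        # spec; the stub returns the path the rendered piece would use.
        return "music_" + spec.mood + ".mid"

    def generate_animation(spec: SceneSpec, keyframes: int = 16) -> list:
        # Stage 3: a video LDM renders keyframes; with LCM each keyframe
        # takes 4 denoising steps instead of 20.
        return ["frame_%03d.png" % i for i in range(keyframes)]

    if __name__ == "__main__":
        spec = analyze_prompt("a lantern festival by the river")
        print(generate_music(spec), generate_animation(spec)[:3])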

This research not only improves the practicality of AI music-video generation but also suggests new directions for future work in related fields. Our system significantly strengthens the connection between music and animation and more accurately reflects users' cultural and emotional needs, which matters for promoting the expression and preservation of cultural diversity.
Abstract (English) Although existing music generation platforms are capable of autonomously creating diverse musical compositions, they frequently fail to integrate music with animation effectively, particularly in accurately reflecting specific cultural attributes and emotions. To address this issue, we employ Large Generative Pre-trained Models (LGPM) and Video Latent Diffusion Models (video LDM), both of which have shown considerable potential for technological innovation. At the heart of our system is a semantically driven module for generating music and animations, which produces culturally distinctive tracks and corresponding animations from user text prompts.

Our experiments demonstrate that the enhanced capability of Large Language Models (LLMs) to analyze and understand natural language significantly improves the thematic and emotional accuracy of the generated content. Additionally, we focused on enhancing the visual quality of animations, particularly in terms of dynamic coherence and reducing visual distortions. To further optimize the quality and efficiency of generated animations, we integrated Latent Consistency Models (LCMs), which significantly reduce the steps required for generating keyframes from 20 to 4 while maintaining high visual quality.
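As an illustration of this step reduction (a sketch, not the authors' code), an LCM can be dropped into a standard Stable Diffusion pipeline via the Hugging Face diffusers library; the base model and LCM-LoRA checkpoint IDs below are assumptions chosen for the example, not the thesis configuration.

    # Sketch: 4-step keyframe sampling with an LCM scheduler + LCM-LoRA.
    # Checkpoint IDs are public examples, not the thesis configuration.
    import torch
    from diffusers import DiffusionPipeline, LCMScheduler

    pipe = DiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # assumed base model
        torch_dtype=torch.float16,
    ).to("cuda")

    # Swap in the LCM scheduler and attach distilled LCM-LoRA weights.
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

    # 4 steps instead of 20; LCM favors low guidance (about 1.0-2.0)
    # because classifier-free guidance is distilled into the model.
    image = pipe(
        prompt="a serene ink-wash mountain landscape, animated style",
        num_inference_steps=4,
        guidance_scale=1.5,
    ).images[0]
    image.save("keyframe.png")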

This research not only advances the practicality of AI-driven music video generation technologies but also opens new directions for future research in the field. Our system significantly improves the connectivity between music and animations, and more accurately reflects users' cultural and emotional needs, which is crucial for promoting the expression and preservation of cultural diversity.
Keywords (Chinese) ★ Generative AI
★ Large Language Model
★ LLM Agent
★ Video Diffusion Model
★ Latent Consistency Model
★ Multimodal Generation
Keywords (English) ★ Generative AI
★ Large Language Model
★ LLM Agent
★ video Latent Diffusion Model
★ Latent Consistency Model
★ Multimodal Generation
Table of Contents
Chinese Abstract i
English Abstract ii
Acknowledgements iii
Table of Contents iv
I Introduction 1
1-1 Research Background and Motivation 1
1-2 Research Objectives 3
II Related Work 4
2-1 AI Agent 4
2-2 Building Agents Based on LLM 4
2-2-1 LLM Agent 5
2-3 Sentence-BERT 7
2-3-1 Pooling Techniques in SBERT 7
2-3-2 Loss Function Adaptation for SBERT 8
2-4 LLaMA 9
2-4-1 Pre-normalization and RMS Norm 9
2-4-2 SwiGLU 10
2-4-3 Rotary Position Embedding (RoPE) 10
2-5 Reinforcement Learning for Music Generation 12
2-5-1 System Architecture 12
2-5-2 Hierarchical Recurrent Neural Network (Bar Profile) 14
2-5-3 PPO with LSTM Architecture 15
2-6 Stable Diffusion 17
2-6-1 Latent Diffusion Model 17
2-6-2 Latent Video Diffusion Models 20
2-7 Latent Consistency Models 21
2-7-1 Consistency Distillation 22
2-7-2 One-Stage Guided Distillation 22
2-7-3 Accelerating Distillation 24
III Method 26
3-1 System Architecture Overview 26
3-2 Semantic Analysis Module 28
3-2-1 LLaMA 29
3-2-2 Enhanced Video Generation Module 32
IV Experiments 35
4-1 Experimental Environment Setup 35
4-2 Evaluation of Semantic Augmentation Module Effects 35
4-3 Comparison of Video Model Performance at Different Frame Rates 37
4-3-1 Differences in Motion Smoothness Across Frame Rates 37
4-3-2 Comparison of Motion Modules: AnimateDiff vs. Video Latent Diffusion Model 40
4-3-3 Comparison of Text Prompts: Evaluating the Performance of LDM and LCM using CLIP 43
V Conclusion 46
References 49
References [1] Stuart J Russell and Peter Norvig, Artificial intelligence: a modern approach, Pearson, 2016.
[2] Richard S Sutton and Andrew G Barto, Reinforcement learning: An introduction, MIT press, 2018.
[3] Yuxi Li, “Deep reinforcement learning: An overview,” arXiv preprint arXiv:1701.07274, 2017.
[4] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[5] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[6] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.
[7] OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[8] Markus Schlosser, “Agency,” https://plato.stanford.edu/archives/win2019/entries/agency/, 2019.
[9] Steven Pinker, The language instinct: How the mind creates language, Penguin UK, 2003.
[10] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur, “Recurrent neural network based language model.,” in Interspeech. Makuhari, 2010, pp. 1045–1048.
[11] Alex Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, 2013.
[12] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[13] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen, “Deberta: Decodingenhanced bert with disentangled attention,” arXiv preprint arXiv:2006.03654, 2020.
[14] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al., “Palm: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
[15] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[16] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[17] Noah Shinn, Beck Labash, and Ashwin Gopinath, “Reflexion: an autonomous agent with dynamic memory and self-reflection,” arXiv preprint arXiv:2303.11366, 2023.
[18] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su, “Llm-planner: Few-shot grounded planning for embodied agents with large language models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2998–3009.
[19] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al., “The rise and potential of large language model based agents: A survey,” arXiv preprint arXiv:2309.07864, 2023.
[20] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem, “Camel: Communicative agents for ‘mind’ exploration of large scale language model society,” 2023.
[21] Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun, “Communicative agents for software development,” arXiv preprint arXiv:2307.07924, 2023.
[22] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[23] Nils Reimers and Iryna Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” arXiv preprint arXiv:1908.10084, 2019.
[24] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al., “Training compute-optimal large language models,” arXiv preprint arXiv:2203.15556, 2022.
[25] Chien-Hao Huang, “Combining deep supervised learning and reinforcement learning for music melody generation,” Master’s thesis, National Central University, June 2023.
[26] Jian Wu, Changran Hu, Yulong Wang, Xiaolin Hu, and Jun Zhu, “A hierarchical recurrent neural network for symbolic melody generation,” IEEE transactions on cybernetics, vol. 50, no. 6, pp. 2749–2757, 2019.
[27] Delong Huang and Fei Guo, “Multiplicity of periodic bouncing solutions for generalized impact hamiltonian systems,” Boundary Value Problems, vol. 2019, no. 1, pp. 57, 2019.
[28] Minh-Ngoc Tran and YoungHan Kim, “Concurrent service auto-scaling for knative resource quota-based serverless system,” Future Generation Computer Systems, 2024.
[29] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695.
[30] Prafulla Dhariwal and Alexander Nichol, “Diffusion models beat GANs on image synthesis,” Advances in neural information processing systems, vol. 34, pp. 8780–8794, 2021.
[31] Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
[32] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi, “Image super-resolution via iterative refinement,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 4, pp. 4713–4726, 2022.
[33] Mehdi Mirza and Simon Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
[34] Kihyuk Sohn, Honglak Lee, and Xinchen Yan, “Learning structured output representation using deep conditional generative models,” Advances in neural information processing systems, vol. 28, 2015.
[35] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 2015, pp. 234–241.
[36] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al., “Perceiver IO: A general architecture for structured inputs & outputs,” arXiv preprint arXiv:2107.14795, 2021.
[37] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira, “Perceiver: General perception with iterative attention,” in International conference on machine learning. PMLR, 2021, pp. 4651–4664.
[38] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis, “Align your latents: High-resolution video synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22563–22575.
[39] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
[40] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao, “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.
[41] Jonathan Ho and Tim Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022.
[42] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever, “Consistency models,” arXiv preprint arXiv:2303.01469, 2023.
[43] Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans, “On distillation of guided diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14297–14306.
[44] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine, “Elucidating the design space of diffusion-based generative models,” Advances in neural information processing systems, vol. 35, pp. 26565–26577, 2022.
[45] Jiaming Song, Chenlin Meng, and Stefano Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
[46] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu, “DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps,” Advances in Neural Information Processing Systems, vol. 35, pp. 5775–5787, 2022.
[47] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu, “DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models,” arXiv preprint arXiv:2211.01095, 2022.
[48] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
Advisor Kuo-Chen Shih (施國琛)   Approval Date 2024-07-13
