Music and Body Movement Generation Based on the Musical Instrument Digital Interface (MIDI)


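Name	魏宇圻 (Yu-Chi Wei)	Department	Computer Science and Information Engineering
Thesis title	Music and Body Movement Generation Based on the Musical Instrument Digital Interface (MIDI)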
Files	Full text in the system: never open to the public
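Abstract (Chinese)	Music is one of humanity's significant achievements in the arts, and whether computers can carry out such creative acts has been studied by computer scientists for decades. This thesis covers two topics: emotion-based music generation, and the generation of performance body movements from music.
Generative models have made great progress over the past several years. One goal of generative modeling is to capture the characteristic features of the data and to produce realistic samples that are hard to distinguish from real data, and generative models have long been applied to music generation. One branch of this field pursues a non-symbolic approach that attempts to generate an audio signal directly. This makes the task more challenging, because the data space of raw audio is very high-dimensional and the amount of feature data to be modeled is very large. Earlier work has had some success in generating audio directly, for example models that produce piano pieces in the raw audio domain or the spectrogram domain. The key difficulty is that modeling raw audio involves extremely long-range dependencies, which makes learning the high-level semantics of a piece computationally challenging.
With limited resources and equipment, a more reasonable and effective approach is symbolic music generation that targets the piano roll. The generated music uses symbols to define the timing, pitch, and dynamics of each note in the performance, as well as the instrument used. This approach maps high-complexity, high-dimensional audio information into a lower-dimensional symbolic space, which makes modeling easier. To keep the generated notes from becoming too fragmented, most studies discretize time and allow a sixteenth note as the smallest time unit.
When listening to music, we can sense the emotion expressed by the melody even when there are no lyrics. Broadly speaking, major keys tend to express positivity and happiness, while minor keys lean toward negativity and sadness. Building on the author's own view and on commonly held views, the 24 keys are mapped to corresponding emotion words; a questionnaire survey indicates that this table has a certain degree of reliability. The music generation results are evaluated both subjectively and objectively.
The music generation part of this thesis uses deep learning generative models that take an existing song excerpt as the opening. With models trained on different keys, the expectation is to generate melodies with similar emotional expression that continue the excerpt into a complete piece.
In addition, the presentation of a musical work can arguably be divided into three aspects: correctness of the melody, correctness of the rhythm, and interpretive performance on stage. For a machine, the first two goals may be relatively easy to achieve, but learning musical interpretation and performance technique as a human does, that is, embodying emotion, is arguably a goal and a challenge that modern artificial intelligence is still pursuing.
With deep learning developing rapidly, model-based music generation has seen many breakthroughs, from MuseGAN, Music Transformer, and Jukebox (which uses an auto-encoder) to GPT, and artificial intelligence models are maturing at the task of imitating human composition. This prompts a further idea: if model-generated music were interpreted by a human performer, it would surely show richer and more precise emotional and performance expression than MIDI or audio alone, so can a model learn to "perform"? This task is of crucial value in interpreting musical works, and this new topic is intended to bring new possibilities to the field of music information retrieval.
In the body movement generation part of this thesis, the inputs are MIDI (symbolic form), audio (non-symbolic form), and the combination of the two. The target data are the body movements of human violin performance, with 34 body joints captured using 3D motion capture. The different forms of music data are preprocessed and used as training data to train a model that generates the body movements of violin performance. Objective and subjective evaluations are conducted to compare which form of music data helps improve the model's generation results.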
Abstract (English)	Even when a piece of music has no lyrics, listeners can still sense the emotions implicit in its melody. The mood of a piece is closely related to its key: for example, C major in most circumstances conveys happiness, and A minor is often used to express sadness. This thesis focuses on generating music with specific emotions, using deep learning generative models. Models trained on different modes are expected to generate melodies with similar emotional expression and ultimately to produce complete compositions. Both objective and subjective evaluations are used to show that the generated music has learned the emotional expression implied in music pieces. Additionally, a musical work can be considered in terms of three main aspects: melodic accuracy, rhythmic precision, and interpretive performance on stage. The first two goals may be relatively easy for machines to achieve, but learning interpretive or performance skills as humans do touches on emotional expression and remains a goal and a challenge for modern artificial intelligence. With the rapid development of deep learning, there have been breakthroughs in generating music with models such as MuseGAN, Music Transformer, Jukebox (which uses auto-encoders), and GPT, and artificial intelligence models are steadily maturing at the task of music composition. This leads to a further idea: if music generated by a model is interpreted by a human performer, it will likely exhibit richer and more accurate emotional and performance expression than MIDI or audio alone. Can a model learn to "perform"? This task holds significant value for interpreting musical works, and this new topic is intended to bring new possibilities to the field of music information retrieval.
In the body movement generation part, the inputs are MIDI data (symbolic form), audio data (non-symbolic form), and a combination of the two. The target data consist of the body movements of human violin performance, with a total of 34 body joints captured using 3D motion capture technology. The different forms of music data are preprocessed and used as training data to generate body movements for violin performance. Both objective and subjective evaluations are conducted to compare which format of music data improves the model's generation quality.
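The symbolic representation described in the abstract (notes defined by timing, pitch, and dynamics, with time quantized to a sixteenth-note grid) can be illustrated with a small sketch. This is not the thesis's preprocessing code: the function name `notes_to_piano_roll`, the note tuple layout, and the choice of four grid steps per quarter-note beat are assumptions made only for illustration, and a single instrument track is assumed.

```python
import numpy as np

STEPS_PER_BEAT = 4   # assumption: one quarter-note beat = 4 sixteenth-note steps
NUM_PITCHES = 128    # MIDI pitch numbers 0-127


def notes_to_piano_roll(notes, num_steps):
    """Build a (time step x pitch) piano roll from a list of notes.

    Each note is a hypothetical tuple (onset_beats, duration_beats, pitch, velocity).
    Onsets and durations are snapped to the sixteenth-note grid, and every
    note is kept at least one grid step long.
    """
    roll = np.zeros((num_steps, NUM_PITCHES), dtype=np.uint8)
    for onset, duration, pitch, velocity in notes:
        start = int(round(onset * STEPS_PER_BEAT))
        length = max(1, int(round(duration * STEPS_PER_BEAT)))
        end = min(start + length, num_steps)
        roll[start:end, pitch] = velocity  # cell value stores the dynamics
    return roll


example_notes = [
    (0.0, 1.0, 60, 90),    # C4, quarter note
    (1.0, 0.5, 64, 80),    # E4, eighth note
    (1.52, 0.23, 67, 70),  # G4, slightly off-grid; snapped to one grid step
]
piano_roll = notes_to_piano_roll(example_notes, num_steps=16)
```

Each row of `piano_roll` is one sixteenth-note step and each column is one MIDI pitch, so one bar of 4/4 becomes a 16 x 128 array instead of tens of thousands of audio samples. A multi-instrument piece would need one such array per instrument.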
Keywords (Chinese)	★ Music information retrieval ★ Music generation ★ Body movement generation	Keywords (English)
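Table of contents	Chapter 1. Introduction
1.1 Basic Music Concepts
1.1.1 Overview of Music Concepts
1.1.2 Music Terminology
1.2 Research Objectives
Chapter 2. Music Generation: Methods
2.1 Datasets Used for Music Generation
2.2 Music Generation Model Architecture
2.2.1 Encoder
2.2.2 Decoder
2.2.3 Multi-head self-attention
2.3 Music Generation Inference Pipeline
2.4 Music Generation Experimental Setup
2.4.1 Detailed Training Parameter Settings
2.4.2 Dataset Preprocessing
Chapter 3. Music Generation: Results
3.1 Evaluation of Music Generation
3.1.1 Objective Evaluation
3.1.1.1 Relative measurement
3.1.1.2 Pairwise cross validation
3.1.1.3 Kernel density estimation
3.4.3 Kullback-Leibler divergence and overlapped area
3.1.2 Subjective Evaluation
Chapter 4. Body Movement Generation: Methods
4.1 Datasets Used for Body Movement Generation
4.2 Body Movement Generation Model Architecture
4.2.1 Long Short-Term Memory Model
4.3 Body Movement Generation Experimental Setup
4.3.1 Detailed Training Parameter Settings
4.3.2 Dataset Preprocessing
4.3.3 Loading the Training Dataset
4.3.4 Use of the Validation Dataset
4.4 Body Movement Generation Inference Pipeline
Chapter 5. Body Movement Generation: Results
5.1 Evaluation of Body Movement Generation
5.1.1 Objective Evaluation
5.1.2 Subjective Evaluation
Chapter 6. Conclusions for Music Generation and Body Movement Generation
6.1 Music Generation: Conclusions
6.2 Music Generation: Future Work
6.3 Body Movement Generation: Conclusions
6.4 Body Movement Generation: Future Work
References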
References	[1] Dong, H.-W., Hsiao, W.-Y., Yang, L.-C., & Yang, Y.-H. (2018). MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
[2] Huang, C. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N. M., Dai, A. M., Hoffman, M. D., Dinculescu, M., & Eck, D. (2019). Music Transformer: Generating Music with Long-Term Structure. ICLR.
[3] Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., & Sutskever, I. (2020). Jukebox: A Generative Model for Music. ArXiv, abs/2005.00341.
[4] Yu, Y., & Canales, S. (2021). Conditional LSTM-GAN for Melody Generation from Lyrics. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 17, 1–20.
[5] Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-Attention with Relative Position Representations. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 464–468.
[6] Kao, H.-K., & Su, L. (2020). Temporally Guided Music-to-Body-Movement Generation. ArXiv, abs/2009.08015.
[7] Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., & Malik, J. (2019). Learning Individual Styles of Conversational Gesture. IEEE Conference on Computer Vision and Pattern Recognition, 3497–3506.
[8] Li, J., Yin, Y., Chu, H., Zhou, Y., Wang, T., Fidler, S., & Li, H. (2020). Learning to Generate Diverse Dance Motions with Transformer. ArXiv, abs/2008.08171.
[9] Wu, Y.-K., Chiu, C.-Y., & Yang, Y.-H. (2022). JUKEDRUMMER: Conditional Beat-Aware Audio-Domain Drum Accompaniment Generation via Transformer VQ-VAE. ArXiv, abs/2210.06007.
[10] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[11] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. ArXiv, abs/1706.03762.
[12] Zeng, A., Chen, M., Zhang, L., & Xu, Q. (2022). Are Transformers Effective for Time Series Forecasting? ArXiv, abs/2205.13504.
[13] Loulié, É. (1988-90). Mélanges sur la musique; règles de composition; notes et extraits. Bibliothèque nationale de France, Département des Manuscrits, ark:/12148/btv1b525178077 (p. 13r-13v).
[14] Steblin, R. (1983). A History of Key Characteristics in the 18th and Early 19th Centuries. ISBN: 9781580460415.
Advisors	王家慶 (Jia-Ching Wang), 蘇黎 (Li Su)	Approval date	2023-8-17

Master's and doctoral theses: detailed record for 110522059