MIDI-Based Music and Body Movement Generation


Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/92816

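Title:	MIDI-Based Music and Body Movement Generation
Author:	Wei, Yu-Chi (魏宇圻)
Contributor:	Department of Computer Science and Information Engineering
Keywords:	music information retrieval; music generation; body movement generation
Date:	2023-08-17
Upload time:	2023-10-04 16:11:14 (UTC+8)
Publisher:	National Central University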
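Abstract:	Music is an achievement of human art that cannot be overlooked, and whether computers can carry out this kind of creative behavior is a question computer scientists have studied continuously for decades. This thesis covers two topics: emotion-based music generation, and the generation of performance body movements from music.

Generative models have made great progress over the past several years. One goal of generative modeling is to capture the characteristic features of the data and to generate realistic data that is hard to tell apart from real data, and generative models have long been applied to music generation. One research branch in this field pursues a non-symbolic approach, attempting to generate an audio signal directly. This makes the task more challenging, because the raw audio data space is very high-dimensional and the amount of feature data to be modeled is very large. Past research has had some success in generating audio directly, for example with models that produce piano pieces in the raw audio domain or the spectrogram domain. The key difficulty is that modeling raw audio involves extremely long-range dependencies, which makes learning the high-level semantics of a piece computationally challenging.

With limited resources and equipment, a more reasonable and effective approach is the symbolic one: music generation that targets the piano roll. The generated music uses symbols to define, for each note in the performance, its timing, pitch, dynamics, and instrument. This approach converts highly complex, high-dimensional audio information into a lower-dimensional space through symbols, which makes modeling easier. To keep the generated notes from becoming too fragmented, most studies discretize time and allow nothing shorter than a sixteenth note.

When listening to music, we can sense the emotion expressed by the melody even without lyrics. Broadly speaking, major keys tend to express positivity and happiness, while minor keys lean toward negativity and sadness. Building on the author's own view and on commonly held views, this thesis compiles corresponding emotion words for the 24 major and minor keys; a questionnaire survey indicates that this table has a certain degree of reliability. The music generation results are evaluated both subjectively and objectively.

The music generation part of this thesis uses deep learning generative models that take an existing song excerpt as the opening. Models trained on different keys are expected to generate melodies with similar emotional expression and to continue the excerpt into a complete piece.

In addition, the presentation of a musical work can perhaps be divided into three key aspects: first the correctness of the melody, then the correctness of the rhythm, and finally interpretive performance on stage. For a machine, the first two goals may be relatively easy to achieve, but learning musical interpretation and performance technique as a human does, and embodying the emotional dimension, is arguably a goal and a challenge that modern artificial intelligence is pursuing.

With deep learning developing rapidly, there have been many breakthroughs in using models to generate music. From MuseGAN, Music Transformer, and Jukebox (which uses an auto-encoder) to GPT, the technology of AI models imitating human composition is steadily maturing. This leads to a further idea: if music generated by a model is interpreted by a human performer, it will necessarily show richer and more precise emotion and performance expression than MIDI or audio, so can a model learn to "perform"? This task is bound to be of crucial value in interpreting musical works, and this new topic is hoped to bring new possibilities to the field of music information retrieval.

In the music-to-body-movement part of this thesis, the inputs are MIDI (symbolic form), audio (non-symbolic form), and the two combined. The target data are the body movements of humans performing on the violin, with 34 body joints captured using 3D motion capture. The different forms of music data are preprocessed and used as training data to train models that generate the body movements of playing violin pieces. Objective and subjective evaluations are also used in the experiments to compare which form of music data helps improve the model's generation results.

It is well known that even when a piece of music has no lyrics, we can still sense the emotions implicit in its melody. The mood of a piece is closely related to its key. For example, C major conveys happiness under most circumstances, and A minor is often used to express sadness. 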
This thesis focuses on generating music with specific emotions, using deep learning generative models. Models trained on different modes are expected to generate melodies with similar emotional expression, ultimately creating complete compositions. Both objective and subjective evaluations are used to show that the generated music has learned the emotional expression implied in the music pieces. Additionally, a musical work can be judged on three main aspects: melodic accuracy, rhythmic precision, and interpretive performance on stage. The first two goals might be relatively easy for machines, but learning interpretive and performance skills as humans do touches on emotional expression and remains a goal and a challenge for modern artificial intelligence. With the rapid development of deep learning, there have been breakthroughs in generating music using models such as MuseGAN, Music Transformer, Jukebox (which uses auto-encoders), and GPT, and artificial intelligence models are steadily maturing at the task of composing music. This leads to a further idea: if music generated by a model is interpreted by a human performer, it will likely exhibit richer and more accurate emotional and performance expression than MIDI or audio alone. Can models learn to "perform"? This task holds significant value for interpreting musical works, and this new concept is hoped to bring new possibilities to the field of music information retrieval. In the body movement generation part, the inputs are MIDI (symbolic form), audio (non-symbolic form), and a combination of the two. The target data consist of the body movements of human violin performance, with a total of 34 body joints captured using 3D motion capture technology. 
Different forms of music data are preprocessed and used as training data to generate body movements for violin performance. Both objective and subjective evaluations are conducted to compare which form of music data improves the model's generation quality.
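
The symbolic representation described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `Note` class, the `to_piano_roll` function, and the choice to store velocity in a time-by-pitch grid for a single instrument track are assumptions made for illustration. The only details taken from the abstract are that each note carries timing, pitch, and dynamics, and that the time grid is no finer than a sixteenth note.

```python
from dataclasses import dataclass

STEPS_PER_BEAT = 4   # sixteenth-note grid: 4 steps per quarter-note beat
NUM_PITCHES = 128    # MIDI pitch numbers run from 0 to 127


@dataclass
class Note:
    """Hypothetical note record (not from the thesis)."""
    start_beats: float     # onset time, in quarter-note beats
    duration_beats: float  # length, in quarter-note beats
    pitch: int             # MIDI pitch number, 0-127
    velocity: int          # MIDI velocity (dynamics), 1-127


def to_piano_roll(notes, total_beats):
    """Return a [time_step][pitch] grid of velocities (0 means silent)."""
    n_steps = int(round(total_beats * STEPS_PER_BEAT))
    roll = [[0] * NUM_PITCHES for _ in range(n_steps)]
    for note in notes:
        # Snap the onset to the nearest sixteenth-note step.
        start = int(round(note.start_beats * STEPS_PER_BEAT))
        # Every note occupies at least one step, so nothing on the grid
        # is shorter than a sixteenth note.
        length = max(1, int(round(note.duration_beats * STEPS_PER_BEAT)))
        for step in range(start, min(start + length, n_steps)):
            roll[step][note.pitch] = note.velocity
    return roll


if __name__ == "__main__":
    # A quarter-note middle C followed by a very short E.
    notes = [Note(0.0, 1.0, 60, 80), Note(1.1, 0.1, 64, 100)]
    roll = to_piano_roll(notes, total_beats=2)
    print(len(roll), "time steps,", len(roll[0]), "pitches")
```

A separate grid per instrument would cover the instrument dimension mentioned in the abstract; that layout is likewise an assumption.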
Appears in collections:	[Graduate Institute of Computer Science and Information Engineering] Master's and Doctoral Theses

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	89	View/Open

All items in NCUIR are protected by the original copyright.
