Abstract (English)
It is well known that even when a music piece contains no lyrics, we can still sense the implicit emotions in its melody. The mood of a piece is closely related to its key: for example, C major under most circumstances conveys happiness, while A minor is often used to express sadness. This thesis focuses on generating music with specific emotions.
The purpose of this thesis is to apply deep learning generative models to music generation. Models trained on different modes are expected to generate melodies with the corresponding emotional expressions, ultimately producing complete compositions.
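To make the mode-conditioning idea concrete, the following is a minimal sketch assuming a PyTorch token-based setup; the class name ModeConditionedMelodyModel, the vocabulary size, and all dimensions are illustrative assumptions rather than the architecture actually used in this thesis:

import torch
import torch.nn as nn

class ModeConditionedMelodyModel(nn.Module):
    # Hypothetical sketch: an LSTM over note tokens, conditioned on a
    # mode label (e.g. 0 = major, 1 = minor) broadcast over every step.
    def __init__(self, vocab_size=128, embed_dim=64, hidden_dim=256, num_modes=2):
        super().__init__()
        self.note_embed = nn.Embedding(vocab_size, embed_dim)
        self.mode_embed = nn.Embedding(num_modes, embed_dim)
        self.lstm = nn.LSTM(embed_dim * 2, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)  # next-note logits

    def forward(self, notes, mode):
        # notes: (batch, time) note tokens; mode: (batch,) mode labels
        x = self.note_embed(notes)
        m = self.mode_embed(mode)[:, None, :].expand_as(x)
        h, _ = self.lstm(torch.cat([x, m], dim=-1))
        return self.head(h)

Training one model per mode, or a single model with a mode embedding as sketched here, is a common way to steer generated melodies toward major- or minor-key emotional expression.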
Both objective and subjective evaluations are used to show that the generated music has learned the emotional expressions implied in the training pieces.
Additionally, the performance of a musical work can be judged on three main aspects: melodic accuracy, rhythmic precision, and interpretive performance on stage. The first two may be relatively easy for machines to achieve, but learning interpretive and performance skills the way humans do touches on the realm of emotional expression, and remains a goal and challenge for modern artificial intelligence.
With the rapid development of deep learning, there have been breakthroughs in music generation with models such as MuseGAN, Music Transformer, Jukebox (based on autoencoders), and GPT, and artificial intelligence models are progressively maturing at the task of music composition. This leads to a further idea: if the music generated by a model were interpreted by a human performer, it would likely exhibit richer and more accurate emotional and performative expression than raw MIDI or audio output. Can models learn to "perform"? This task undoubtedly holds significant value for the interpretation of musical works, and the concept raises hopes of bringing new possibilities to the field of music information retrieval.
For body-movement generation, both MIDI and audio are used as input data: MIDI in symbolic form, audio in non-symbolic form, and a combination of the two as a third input format. The target data consist of the body movements of human violin performance, with a total of 34 body joints captured using 3D motion-capture technology. The different forms of music data are preprocessed and used as training data to generate body movements for violin performance, and both objective and subjective evaluations are conducted to compare experimentally which format of music data most improves the model's generation quality.
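For illustration, here is a minimal sketch of the music-to-movement mapping described above, assuming per-frame features in PyTorch; MusicToMotion, feat_dim, and the recurrent backbone are assumptions for exposition rather than the thesis's actual model, and only the 34-joint output comes from the text:

import torch
import torch.nn as nn

NUM_JOINTS = 34  # body joints captured by the 3D motion-capture system

class MusicToMotion(nn.Module):
    # Hypothetical sketch: a recurrent model mapping per-frame music
    # features to 3D joint positions for violin performance.
    def __init__(self, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True,
                          bidirectional=True)
        self.head = nn.Linear(hidden_dim * 2, NUM_JOINTS * 3)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim); feat_dim depends on the input
        # format (piano-roll bins for MIDI, mel bins for audio, or their
        # sum for the combined symbolic + non-symbolic input)
        h, _ = self.rnn(feats)
        return self.head(h).view(feats.size(0), feats.size(1), NUM_JOINTS, 3)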
References
[1] Dong, H.-W., Hsiao, W.-Y., Yang, L.-C., & Yang, Y.-H. (2018). MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
[2] Huang, C.A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N.M., Dai, A.M., Hoffman, M.D., Dinculescu, M., & Eck, D. (2019). Music Transformer: Generating Music with Long-Term Structure. ICLR.
[3] Dhariwal, P., Jun, H., Payne, C., Kim, J.W., Radford, A., & Sutskever, I. (2020). Jukebox: A Generative Model for Music. ArXiv, abs/2005.00341.
[4] Yu, Y., & Canales, S. (2021). Conditional LSTM-GAN for Melody Generation from Lyrics. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 17, 1–20.
[5] Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-Attention with Relative Position Representations. Proceedings of NAACL-HLT 2018, Volume 2 (Short Papers), 464–468.
[6] Kao, H.-K., & Su, L. (2020). Temporally Guided Music-to-Body-Movement Generation. ArXiv, abs/2009.08015.
[7] Ginosar, S., Bar, A., Kohavi, G., Chan, C., Owens, A., & Malik, J. (2019). Learning Individual Styles of Conversational Gesture. IEEE Conference on Computer Vision and Pattern Recognition, 3497–3506.
[8] Li, J., Yin, Y., Chu, H., Zhou, Y., Wang, T., Fidler, S., & Li, H. (2020). Learning to Generate Diverse Dance Motions with Transformer. ArXiv, abs/2008.08171.
[9] Wu, Y.-K., Chiu, C.-Y., & Yang, Y.-H. (2022). JukeDrummer: Conditional Beat-Aware Audio-Domain Drum Accompaniment Generation via Transformer VQ-VAE. ArXiv, abs/2210.06007.
[10] Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[11] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. ArXiv, abs/1706.03762.
[12] Zeng, A., Chen, M., Zhang, L., & Xu, Q. (2022). Are Transformers Effective for Time Series Forecasting? ArXiv, abs/2205.13504.
[13] Loulié, É. (1688–90). Mélanges sur la musique; règles de composition; notes et extraits. Bibliothèque nationale de France, Département des Manuscrits, ark:/12148/btv1b525178077 (fols. 13r–13v).
[14] Steblin, R. (1983). A History of Key Characteristics in the 18th and Early 19th Centuries. ISBN 9781580460415.