Name: 劉洮語 (Chao-Yu Liu)
Graduate Institute: Graduate Institute of Software Engineering
Thesis Title: GPT-2 及 CycleGAN 生成江南風格音樂打擊節奏 (Generate Jiangnan Percussion Rhythm with GPT-2 and CycleGAN)
Full Text: available in the online system after 2026-7-3
Abstract (Chinese)
Since the rise of automatic music generation, we have witnessed a steady series of advances. From early neural networks such as DNNs and CNNs to today's GANs and LSTMs, each technique has opened new possibilities for music composition. Recently, the application of GPT-2 (Generative Pre-trained Transformer 2) has drawn particular attention. GPT-2 is a pre-trained language model based on the Transformer architecture; originally developed for natural language processing tasks, it has in recent years been extended to music generation. Compared with other techniques, GPT-2 offers clear advantages: it can be trained on large amounts of music data and thereby better capture musical structure and style, its pre-training makes the generated music segments more fluent and natural while showing greater creativity and diversity, and it scales well to music generation tasks of different types and styles. However, music generation with GPT-2 also faces challenges: the model may carry inherent biases or an incomplete understanding of music, and the generated music may lack emotional expression or creativity, requiring further post-processing and adjustment.
On the other hand, CycleGAN (Cycle-Consistent Generative Adversarial Network) has also come to prominence in music generation. CycleGAN uses generative adversarial networks to translate between unpaired images and introduces a cycle-consistency loss to keep the generated content coherent and faithful. In music generation this technique is particularly suited to converting one musical style into another; for example, CycleGAN can learn to generate music segments with specific stylistic traits, which is especially effective for preserving traditional music. Compared with GPT-2, CycleGAN is stronger at capturing and maintaining stylistic consistency, but it can be somewhat weaker in creativity and diversity. Combining the strengths of GPT-2 and CycleGAN can therefore yield a more balanced result, producing music segments that are natural and creative while retaining the characteristics of the original style. In this study, we investigated using both GPT-2 and CycleGAN to automatically generate the percussion rhythms that accompany the main melodies of Chinese Jiangnan silk and bamboo (sizhu) music. We extracted music segments from traditional Jiangnan silk and bamboo pieces and used these data to train a GPT-2 model that generates distinctive percussion rhythms. In parallel, we used CycleGAN, which learns transformations between musical styles, to generate percussion rhythms that conform to the Jiangnan silk and bamboo style. Compared with GPT-2, CycleGAN excels at capturing stylistic features and keeping the musical structure consistent, but it can be somewhat weaker in creativity and diversity. We found that GPT-2 has the advantage in generating diverse and innovative rhythms, whereas CycleGAN is better at generating coherent, stylized rhythms that preserve the original style. Combining the strengths of both can further improve the quality of the generated percussion rhythms, offering a new path toward the automated composition of Chinese classical music.
Abstract (English)
Since the emergence of automatic music generation technology, we have witnessed a series of advances. From early neural networks like DNN and CNN to recent developments in GANs and LSTMs, each technique has brought new possibilities to music composition.
Recently, the application of GPT-2 (Generative Pre-trained Transformer 2) has garnered particular attention. GPT-2 is a pre-trained language model based on the Transformer architecture; initially used for natural language processing tasks, it has in recent years been extended to music generation. Compared to other techniques, GPT-2 offers significant advantages, such as the ability to train on large amounts of music data and thereby better understand musical structure and style. Its pre-training enables it to generate music segments that are more fluent and natural and that show higher creativity and diversity in composition. Additionally, GPT-2 exhibits good scalability and can be applied to various types and styles of music generation tasks. However, using GPT-2 for music generation also faces challenges, such as inherent biases in the model or an incomplete understanding of music, so the generated music can lack emotional expression or creativity and may require further post-processing and adjustment.
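To make the language-model view of rhythm generation concrete, here is a minimal sketch, not taken from this thesis, of how quantized percussion events could be treated as tokens and continued by a GPT-2-style model via the Hugging Face transformers library; the tiny event vocabulary, model size, and sampling settings are illustrative assumptions.

```python
# Minimal sketch (assumptions throughout): percussion events as "words" for a GPT-2-style model.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Hypothetical symbolic vocabulary: bar/beat markers plus a few percussion strokes.
VOCAB = ["PAD", "BAR", "BEAT", "REST", "drum_hit", "clapper_hit", "gong_hit"]
stoi = {tok: i for i, tok in enumerate(VOCAB)}

config = GPT2Config(
    vocab_size=len(VOCAB),  # tiny vocabulary, for illustration only
    n_positions=256,        # maximum sequence length in tokens
    n_embd=128, n_layer=4, n_head=4,
)
model = GPT2LMHeadModel(config)  # in practice this would be trained on Jiangnan percussion sequences

# Seed with the opening of a rhythm and sample a continuation token by token.
seed = torch.tensor([[stoi["BAR"], stoi["drum_hit"], stoi["BEAT"], stoi["clapper_hit"]]])
out = model.generate(seed, max_length=32, do_sample=True, top_k=5, pad_token_id=stoi["PAD"])
print([VOCAB[i] for i in out[0].tolist()])
```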
On the other hand, the application of CycleGAN (Cycle-Consistent Generative Adversarial Network) has also emerged in the field of music generation. CycleGAN uses generative adversarial networks for unpaired image-to-image translation and ensures the coherence and authenticity of the generated content by introducing a cycle-consistency loss. This technique is particularly useful in music generation for transforming one musical style into another: CycleGAN can learn to generate music segments with specific stylistic features, which is especially effective for preserving the character of traditional music. Compared to GPT-2, CycleGAN demonstrates advantages in capturing and maintaining stylistic consistency, but it may be slightly lacking in creativity and diversity.
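The cycle-consistency idea mentioned above can be illustrated with a short PyTorch sketch; this is not the thesis implementation, and the stand-in generators, pianoroll tensor shapes, and loss weighting are assumptions made only for the example.

```python
# Minimal sketch of CycleGAN's cycle-consistency loss on pianoroll-like tensors (assumed shapes).
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Stand-in generator mapping a pianoroll of one style to the other style."""
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, channels, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

G_AB, G_BA = TinyGenerator(), TinyGenerator()  # style A -> B and style B -> A
l1 = nn.L1Loss()

real_a = torch.rand(8, 1, 128, 64)  # batch of style-A pianorolls (pitch x time), random placeholders
real_b = torch.rand(8, 1, 128, 64)  # batch of style-B pianorolls

# Cycle consistency: translating A -> B -> A (and B -> A -> B) should reconstruct the input,
# which is what keeps the translated material coherent with the source.
cycle_loss = l1(G_BA(G_AB(real_a)), real_a) + l1(G_AB(G_BA(real_b)), real_b)
# In full training this term is added (with a weight) to the adversarial losses of both directions.
print(cycle_loss.item())
```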
Therefore, combining the strengths of GPT-2 and CycleGAN can achieve a more balanced result in music generation, producing natural and creative music segments while preserving the original stylistic features. In this study, we explored the use of both GPT-2 and CycleGAN to automatically generate percussion rhythms corresponding to the main melodies of Chinese Jiangnan silk and bamboo music. We extracted music segments from traditional Jiangnan silk and bamboo pieces and used these data to train the GPT-2 model to generate unique percussion rhythms.
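A preprocessing step of this kind might look like the following sketch; it is only an assumed illustration, using the pretty_midi library, a hypothetical file name, and a quarter-of-a-beat grid, of how extracted melody segments could be quantized into event sequences before model training.

```python
# Minimal sketch (assumed pipeline): quantize a MIDI melody segment onto a beat grid.
import pretty_midi

def melody_to_grid(midi_path, steps_per_beat=4):
    """Return sorted (grid_step, pitch) events for the first instrument track."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    seconds_per_beat = 60.0 / pm.estimate_tempo()      # rough tempo estimate
    step = seconds_per_beat / steps_per_beat           # seconds per grid step
    events = [(int(round(note.start / step)), note.pitch)
              for note in pm.instruments[0].notes]
    return sorted(events)

# Each (grid step, pitch) pair can then be mapped to a token id for GPT-2 training,
# or rasterized into a pianoroll image for the CycleGAN branch.
print(melody_to_grid("jiangnan_segment.mid")[:10])   # hypothetical file name
```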
Additionally, we employed CycleGAN, which generates percussion rhythms that match the Jiangnan silk
and bamboo style by learning transformations between different musical styles. Compared to
GPT-2, CycleGAN performs well in capturing stylistic features and maintaining consistency in
musical structure, but may be slightly lacking in creativity and diversity. We found that GPT-2
has advantages in generating diverse and innovative rhythms, while CycleGAN excels at producing coherent, stylized rhythms that preserve the original style features. Combining the
strengths of both can further improve the quality of percussion rhythm generation, providing a
new path for the automation of Chinese classical music composition.
Keywords (Chinese)
★ 打擊節奏 (Percussion Rhythm)
★ GPT-2
★ CycleGAN
★ 江南 (Jiangnan)
Keywords (English)
★ GPT-2
★ CycleGAN
★ Percussion Rhythm
★ Jiangnan
Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
1. Introduction
1-1 Motivation of the research
1-2 Research purpose and contributions
2. Related Work
2-1 Music Generation and Simulation
2-1-1 Music Accompaniment Generation
2-1-2 Simulation of Music Instruments
2-2 Music Representation
2-2-1 Digital Audio Representation
2-2-2 Waveform Representation
2-2-3 Note-based Representation
2-2-4 Pianoroll Image Representation
2-3 TCN
2-3-1 TCN Application
2-3-2 TCN Music Application
2-4 GPT-2
2-4-1 GPT-2 Application
2-4-2 GPT-2 Music Application
2-4-3 GPT-2 Music Rhythm Application
2-5 CycleGAN
3. Method
3-1 Data Collecting
3-2 Data Preprocessing
3-3 Model Used
3-3-1 TCNs (Temporal Convolutional Networks)
3-3-2 GPT-2
3-4 TripleGAN
3-4-1 Generator Design
4. Experiments
4-1 TCN Result
4-2 GPT-2 Result
4-2-1 Training
4-2-2 Experiments Result
4-3 TripleGAN Result
4-3-1 Training
4-3-2 Experiments Result
4-4 Evaluation
4-4-1 DTW
4-4-2 Comparison with Original Scores
4-4-3 Expert Listening Evaluation
5. Conclusion
Reference
Advisor: 施國琛 (Guo-Chen Shih)
Review Date: 2024-7-17