Thesis Record 110522160 (Detailed Information)




Name: 劉秉澤 (BING-ZE LIU)    Department: Computer Science and Information Engineering
Thesis Title: Research on Applying Diffusion Model Decoders in Timbre Transformation Systems: A Case Study of Erhu Timbre
Related Theses
★ A Grouping Mechanism Based on Social Relationships in edX Online Discussion Boards
★ A 3D Visualized Facebook Interaction System Built with Kinect
★ A Kinect-Based Assessment System for Smart Classrooms
★ An Intelligent Urban Route-Planning Mechanism for Mobile Device Applications
★ Dynamic Texture Transfer Based on Key Motion Correlation Analysis
★ A Seam-Carving System That Preserves Straight-Line Structures in Images
★ A Community Recommendation Mechanism Built on an Open Online Community Learning Environment
★ System Design of an Interactive Situated Learning Environment for English as a Foreign Language
★ An Emotional Color Transfer Mechanism with Skin-Color Preservation
★ A Gesture Recognition Framework for Virtual Keyboards
★ Error Analysis of the Fractional-Power Grey Generating Prediction Model and Development of a Computer Toolbox
★ Real-Time Human Skeleton Motion Construction Using Inertial Sensors
★ Real-Time 3D Reconstruction from Multiple Cameras
★ A Genetic-Algorithm Grouping Mechanism Based on Complementarity and Social Network Analysis
★ A Virtual Instrument Performance System Based on Real-Time Hand Tracking
★ A Real-Time Virtual Instrument Performance System Based on Neural Networks
Files: Full text viewable in the system after 2026-07-05
Abstract (Chinese) In this study, we propose a timbre transfer model based on the Diffusion architecture, designed to convert musical pieces performed by various instruments into erhu renditions. The model extracts the pitch and loudness features of a piece with a Pitch Encoder and a Loudness Encoder, and feeds these features as conditions into a Diffusion Model-based Decoder to generate high-quality erhu-timbre music. In the experiments, we systematically evaluate the model's performance in terms of Pitch Accuracy, Cosine Similarity, and Fréchet Audio Distance. The results show that the model reaches a pitch accuracy of 95% to 96%, and that the generated erhu timbre is close to real erhu performances. In addition, ablation experiments confirm the importance of the Loudness Encoder, ensuring that the model correctly produces silent waveforms for silent inputs. This study demonstrates the potential of Diffusion-based timbre transfer models for music generation and offers new directions for future research on music generation and timbre transfer.
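For context, the Fréchet Audio Distance mentioned above has a standard closed form (general background on the metric, not a formula quoted from this thesis): Gaussians \(\mathcal{N}(\mu_r, \Sigma_r)\) and \(\mathcal{N}(\mu_g, \Sigma_g)\) are fitted to embeddings of real and generated audio, and

    \mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{tr}\bigl(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\bigr)

Lower is better: a small FAD means the embedding statistics of the generated erhu audio are close to those of real erhu recordings.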
Abstract (English) In this study, we propose a timbre transfer model based on the Diffusion architecture, which aims to convert musical pieces performed by various instruments into erhu performances. Our model uses a Pitch Encoder and a Loudness Encoder to extract the pitch and loudness features of the music; these features are then fed as conditioning inputs to the Diffusion Model-based Decoder to generate high-quality erhu-timbre music. In the experimental section, we systematically evaluated the model's performance using Pitch Accuracy, Cosine Similarity, and Fréchet Audio Distance. The results show that our model achieved a high pitch accuracy of 95% to 96% and that the generated erhu timbre closely matches real erhu performances. Furthermore, ablation experiments confirmed the importance of the Loudness Encoder, ensuring that the model correctly generates silent waveforms when given silent inputs. This study demonstrates the potential of Diffusion architecture-based timbre transfer models in the field of music generation, providing new insights for future research in music generation and timbre transfer.
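To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of how per-frame pitch and loudness features can condition a diffusion-style decoder. Every class here (PitchEncoder, LoudnessEncoder, DiffusionDecoder) is a placeholder invented for illustration under assumed tensor shapes; it is not the thesis's actual architecture.

import torch
import torch.nn as nn

class PitchEncoder(nn.Module):
    # Placeholder: embeds a per-frame pitch track (Hz) into condition vectors.
    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(1, dim)

    def forward(self, f0):                  # f0: (batch, frames)
        return self.proj(f0.unsqueeze(-1))  # -> (batch, frames, dim)

class LoudnessEncoder(nn.Module):
    # Placeholder: embeds a per-frame loudness track (dB) the same way.
    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(1, dim)

    def forward(self, loudness):
        return self.proj(loudness.unsqueeze(-1))

class DiffusionDecoder(nn.Module):
    # Placeholder: predicts the noise (or velocity) for one denoising step,
    # conditioned on the concatenated pitch/loudness features.
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim + 1, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, x_t, t, cond):        # x_t: (batch, frames) noisy frames
        t_feat = t.expand(*x_t.shape).unsqueeze(-1)  # broadcast the timestep
        return self.net(torch.cat([cond, t_feat], dim=-1)).squeeze(-1)

# Toy conditional denoising step on fake feature tracks.
f0 = torch.rand(2, 100) * 440 + 80        # fake pitch track in Hz
loudness = torch.rand(2, 100) * -60       # fake loudness track in dB
cond = torch.cat([PitchEncoder()(f0), LoudnessEncoder()(loudness)], dim=-1)
eps_hat = DiffusionDecoder()(torch.randn(2, 100), torch.tensor(0.5), cond)

The point of this design is that timbre is carried entirely by the decoder's learned weights, while pitch and loudness stay pinned to the source performance through the condition vectors.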
Keywords (Chinese) ★ Diffusion Model
★ Timbre Transfer
★ Pitch Encoder
★ Loudness Encoder
★ FAD
Keywords (English) ★ Diffusion
★ Timbre Change
★ Pitch Encoder
★ Loudness Encoder
★ FAD
Table of Contents
Chinese Abstract i
English Abstract ii
Table of Contents iii
I. Introduction 1
II. Related Work 3
III. Background 5
3-1 Diffusion Model 5
3-1-1 DDPM 5
3-1-2 DDIM 6
3-1-3 V-Diffusion 6
3-2 U-Net 8
3-3 CREPE 10
3-4 Fourier Transform 11
3-4-1 FFT 11
3-4-2 STFT 12
3-4-3 Mel spectrogram 12
3-5 Loudness 15
IV. Method 17
4-1 Architecture Overview 17
4-2 Pitch Encoder 19
4-2-1 Frequency Tokenizer 19
4-2-2 Pitch Embedding 20
4-3 Loudness Encoder 22
4-3-1 Vector Quantizer 23
4-4 Diffusion Decoder 24
V. Experiments 26
5-1 Dataset 26
5-2 Training 26
5-3 Evaluation 26
5-3-1 Pitch Accuracy 27
5-3-2 VGG Feature Extractor 28
5-3-3 Cosine-Similarity 29
5-3-4 Fréchet Audio Distance 32
5-3-5 Using PCA for visualization 34
5-3-6 How the Loudness Encoder Affects Silent Waveforms 39
VI. Conclusion 42
References 44
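The evaluation entries above (5-3-1 through 5-3-4) revolve around pitch fidelity and embedding distances. As a rough, hypothetical illustration of how a frame-level pitch accuracy such as the reported 95% to 96% can be scored, here is a small Python sketch; the 50-cent tolerance and the zero-means-unvoiced convention are assumptions for this example, not the thesis's exact protocol.

import numpy as np

def pitch_accuracy(f0_src, f0_out, tol_cents=50.0):
    """Fraction of voiced frames where the output pitch lies within
    tol_cents of the source pitch. f0 arrays are per-frame Hz values;
    0.0 marks an unvoiced frame (assumed convention)."""
    voiced = (f0_src > 0) & (f0_out > 0)
    if not voiced.any():
        return 0.0
    cents = 1200.0 * np.abs(np.log2(f0_out[voiced] / f0_src[voiced]))
    return float(np.mean(cents <= tol_cents))

# Toy example: the output tracks the source except one detuned frame.
src = np.array([220.0, 440.0, 0.0, 330.0])
out = np.array([221.0, 440.0, 0.0, 392.0])  # last frame is ~298 cents off
print(pitch_accuracy(src, out))             # -> 0.666...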
Advisor: 施國琛 (GUO-CHEN SHI)    Approval Date: 2024-07-12