References
[1] Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,”
Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
[2] Jiaming Song, Chenlin Meng, and Stefano Ermon, “Denoising diffusion implicit models,”
arXiv preprint arXiv:2010.02502, 2020.
[3] Tim Salimans and Jonathan Ho, “Progressive distillation for fast sampling of diffusion
models,” arXiv preprint arXiv:2202.00512, 2022.
[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-Net: Convolutional networks
for biomedical image segmentation,” in Medical image computing and computer-assisted
intervention–MICCAI 2015: 18th international conference, Munich, Germany, October
5-9, 2015, proceedings, part III 18. Springer, 2015, pp. 234–241.
[5] Flavio Schneider, “ArchiSound: Audio generation with diffusion,” arXiv preprint
arXiv:2301.13267, 2023.
[6] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello, “CREPE: A convolutional
representation for pitch estimation,” in 2018 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 161–165.
[7] E Oran Brigham and RE Morrow, “The fast Fourier transform,” IEEE Spectrum, vol. 4, no.
12, pp. 63–70, 1967.
[8] Ilya Loshchilov and Frank Hutter, “Decoupled weight decay regularization,” arXiv
preprint arXiv:1711.05101, 2017.
[9] Jordi Pons and Xavier Serra, “musicnn: Pre-trained convolutional neural networks for
music audio tagging,” arXiv preprint arXiv:1909.06654, 2019.
[10] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen,
R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., “CNN
architectures for large-scale audio classification,” in 2017 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.
[11] Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi, “Fréchet audio distance: A metric for evaluating music enhancement algorithms,” arXiv preprint
arXiv:1812.08466, 2018.
[12] Rayhane Mama, Marc S. Tyndel, Hashiam Kadhim, Cole Clifford, and Ragavan Thurairatnam, “NWT: Towards natural audio-to-video generation with representation learning,”
2021.
[13] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi,
“SoundStream: An end-to-end neural audio codec,” 2021.
[14] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan, “Neural audio synthesis of musical notes with WaveNet
autoencoders,” in International Conference on Machine Learning. PMLR, 2017, pp. 1068–
1077.
[15] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros, “Unpaired image-to-image
translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232.
[16] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo,
“StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE conference on computer vision and pattern recognition,
2018, pp. 8789–8797.
[17] Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, and Douglas
Eck, “Counterpoint by convolution,” arXiv preprint arXiv:1903.07227, 2019.
[18] Eric Grinstein, Ngoc QK Duong, Alexey Ozerov, and Patrick Pérez, “Audio style transfer,” in 2018 IEEE international conference on acoustics, speech and signal processing
(ICASSP). IEEE, 2018, pp. 586–590.
[19] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya
Sutskever, “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341,
2020.
[20] Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin,
Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al.,
“AudioLM: A language modeling approach to audio generation,” IEEE/ACM Transactions
on Audio, Speech, and Language Processing, 2023.
[21] Jonathan Ho and Tim Salimans, “Classifier-free diffusion guidance,” arXiv preprint
arXiv:2207.12598, 2022.
[22] Diederik P Kingma and Max Welling, “Auto-encoding variational Bayes,” arXiv preprint
arXiv:1312.6114, 2013.
[23] Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, et al., “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, vol. 12, 2016.
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in
neural information processing systems, vol. 30, 2017.
[25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–
10695.
[26] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro, “Diffwave: A
versatile diffusion model for audio synthesis,” arXiv preprint arXiv:2009.09761, 2020.
[27] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi,
“SoundStream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 30, pp. 495–507, 2021.