References
[1] B. Sisman, J. Yamagishi, S. King and H. Li, “An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 132-157, 2021.
[2] B. Atal and M. Schroeder, “Predictive coding of speech signals and subjective error criteria,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 3, pp. 247-254, June 1979.
[3] P. Mermelstein, “Distance measures for speech recognition, psychological and instrumental,” Pattern Recognition and Artificial Intelligence, vol. 116, pp. 374–388, 1976.
[4] Y. Zhang, Z. Ou and M. Hasegawa-Johnson, “Improvement of Probabilistic Acoustic Tube model for speech decomposition,” 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Florence, Italy, pp. 7929-7933, 2014.
[5] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, and H. Banno, “TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation,” 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 3933-3936, 2008.
[6] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014.
[7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets,” Advances in Neural Information Processing Systems, NIPS 2014, Montreal, Canada, pp. 2672-2680, December 2014.
[8] C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA, pp. 1–6. IEEE, 2016.
[9] C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” Proc. Interspeech 2017, pp. 3364–3368, 2017.
[10] W. Huang, H. Hwang, Y. Peng, Y. Tsao, and H. Wang, “Voice conversion based on cross-domain features using variational auto encoders,” 2018 11th International Symposium on Chinese Spoken Language Processing, ISCSLP, pp. 51–55. IEEE, 2018.
[11] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “ACVAE-VC: Non-Parallel Voice Conversion with Auxiliary Classifier Variational Autoencoder,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 9, pp. 1432–1443, September 2019.
[12] J. Chou, C. Yeh, H. Lee, and L. Lee, “Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations,” Proc. Interspeech 2018, pp. 501–505, 2018.
[13] A. Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, pp. 6309–6318, 2017.
[14] Y. Gao, R. Singh, and B. Raj, “Voice impersonation using generative adversarial networks,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 2506–2510, IEEE, 2018.
[15] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks,” 2018 IEEE Spoken Language Technology Workshop, SLT, pp. 266–273. IEEE, 2018.
[16] T. Kaneko and H. Kameoka, “Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks,” ArXiv, abs/1711.11293, 2017.
[17] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8789–8797, 2018.
[18] T. Kaneko and H. Kameoka, “CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks,” 2018 26th European Signal Processing Conference, EUSIPCO, Rome, Italy, pp. 2100–2104, IEEE, 2018.
[19] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “CycleGAN-VC2: Improved CycleGAN-Based Non-Parallel Voice Conversion,” 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Brighton, UK, pp. 6820-6824, 2019.
[20] W. Huang, H. Luo, H. Hwang, C. Lo, Y. Peng, Y. Tsao, and H. Wang, “Unsupervised Representation Disentanglement Using Cross Domain Features and Adversarial Learning in Variational Autoencoder Based Voice Conversion,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 4, no. 4, pp. 468-479, Aug. 2020.
[21] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion,” Proc. Interspeech 2019, pp. 679–683, 2019.
[22] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction,” Speech Communication, vol. 27, no. 3-4, pp. 187-207, April 1999.
[23] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,” IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877-1884, July 2016.
[24] A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” 9th ISCA Speech Synthesis Workshop, SSW2016, Sunnyvale, USA, Sep. 2016.
[25] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A Flow-based Generative Network for Speech Synthesis,” IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, UK, pp. 3617-3621, May 2019.
[26] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter-based waveform model for statistical parametric speech synthesis,” IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, UK, pp. 5916-5920, May 2019.
[27] K. Kumar, R. Kumar, T. de Boissière, L. Gestin, W.Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. Courville, “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,” Advances in Neural Information Processing Systems, NeurIPS 2019, pp. 14881-14892, 2019.
[28] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis,” 34th International Conference on Neural Information Processing Systems, NIPS'20, Curran Associates Inc., Red Hook, NY, USA, Article 1428, pp. 17022–17033, 2020.
[29] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[30] K. Qian, Y. Zhang, S. Chang, X. Yang, and M.A. Hasegawa-Johnson, “AutoVC: Zero-shot voice style transfer with only autoencoder loss,” International Conference on Machine Learning, pp. 5210–5219, 2019.
[31] K. Qian, Z. Jin, M.A. Hasegawa-Johnson, and G.J. Mysore, “F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder,” 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 6284–6288, IEEE, 2020.
[32] K. Qian, Y. Zhang, S. Chang, D. Cox, and M.A. Hasegawa-Johnson, “Unsupervised speech decomposition via triple information bottleneck,” Proceedings of the 37th International Conference on Machine Learning, pp. 7836–7846, 2020.
[33] C.H. Chan, K. Qian, Y. Zhang, and M.A. Hasegawa-Johnson, “SpeechSplit2.0: Unsupervised speech disentanglement for voice conversion without tuning autoencoder bottlenecks,” 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, pp. 6332–6336, 2022.
[34] N. Jaitly and G. Hinton, “Vocal Tract Length Perturbation (VTLP) improves speech recognition,” ICML Workshop on Deep Learning for Audio, Speech and Language, 2013.
[35] S. Yang, M. Tantrawenith, H. Zhuang, Z. Wu, A. Sun, J. Wang, N. Cheng, H. Tang, X. Zhao, J. Wang, and H.M. Meng, “Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion,” Proc. Interspeech 2022, pp. 2553-2557, 2022.
[36] P. Cheng, W. Hao, S. Dai, J. Liu, Z. Gan, and L. Carin, “CLUB: A contrastive log-ratio upper bound of mutual information,” Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, ser. Proceedings of Machine Learning Research, vol. 119, pp. 1779–1788, PMLR, 2020.
[37] J. Chou and H. Lee, “One-shot voice conversion by separating speaker and content representations with instance normalization,” Proc. Interspeech 2019, pp. 664–668, 2019.
[38] L. Zhang, R. Li, S. Wang, L. Deng, J. Liu, Y. Ren, J. He, R. Huang, J. Zhu, X. Chen, and Z. Zhao, “M4Singer: A Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus,” 36th Conference on Neural Information Processing Systems Datasets and Benchmarks Track, NeurIPS 2022, 2022.
[39] R. Huang, F. Chen, Y. Ren, J. Liu, C. Cui, and Z. Zhao, “Multi-Singer: Fast Multi-Singer Singing Voice Vocoder with A Large-Scale Corpus,” Proceedings of the 29th ACM International Conference on Multimedia, MM '21, Association for Computing Machinery, New York, NY, USA, pp. 3945–3954, 2021.
[40] S. Bai, J.Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” ArXiv, abs/1803.01271, 2018.