References
[1] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada, Apr. 2014.
[2] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural Discrete Representation Learning,” 31st International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 6309-6318, Dec. 2017.
[3] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Autoencoding Beyond Pixels Using a Learned Similarity Metric,” 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, pp. 1558-1566, Jun. 2016.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” 27th International Conference on Neural Information Processing Systems (NIPS 2014), Montreal, Canada, pp. 2672-2680, Dec. 2014.
[5] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks,” IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, pp. 2242-2251, Oct. 2017.
[6] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, pp. 8789-8797, Jun. 2018.
[7] T. Kaneko and H. Kameoka, “CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks,” 26th European Signal Processing Conference (EUSIPCO 2018), Rome, Italy, pp. 2100-2104, Sep. 2018.
[8] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng, “Phonetic Posteriorgrams for Many-to-One Voice Conversion without Parallel Data Training,” IEEE International Conference on Multimedia and Expo (ICME 2016), Seattle, WA, USA, pp. 1-6, Jul. 2016.
[9] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring Speech Representations Using a Pitch-Adaptive Time-Frequency Smoothing and an Instantaneous-Frequency-Based F0 Extraction,” Speech Communication, vol. 27, no. 3-4, pp. 187-207, Apr. 1999.
[10] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications,” IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877-1884, Jul. 2016.
[11] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” 9th ISCA Speech Synthesis Workshop (SSW 2016), Sunnyvale, CA, USA, Sep. 2016.
[12] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A Flow-Based Generative Network for Speech Synthesis,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), Brighton, UK, pp. 3617-3621, May 2019.
[13] X. Wang, S. Takaki, and J. Yamagishi, “Neural Source-Filter-Based Waveform Model for Statistical Parametric Speech Synthesis,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), Brighton, UK, pp. 5916-5920, May 2019.
[14] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. Courville, “MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis,” Advances in Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, pp. 14881-14892, Dec. 2019.
[15] M. Morise, H. Kawahara, and H. Katayose, “Fast and Reliable F0 Estimation Method Based on the Period Extraction of Vocal Fold Vibration of Singing Voice and Speech,” AES 35th International Conference, London, UK, Feb. 2009.
[16] M. Morise, “CheapTrick, a Spectral Envelope Estimator for High-Quality Speech Synthesis,” Speech Communication, vol. 67, pp. 1-7, Mar. 2015.
[17] M. Morise, “PLATINUM: A Method to Extract Excitation Signals for Voice Synthesis System,” Acoustical Science and Technology, vol. 33, no. 2, pp. 123-125, Mar. 2012.
[18] P. J. Werbos, “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences,” Ph.D. thesis, Harvard University, 1974.
[19] K. Fukushima, “Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position,” Biological Cybernetics, vol. 36, no. 4, pp. 193-202, 1980.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning Applied to Document Recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, pp. 770-778, Jun. 2016.
[22] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language Modeling with Gated Convolutional Networks,” 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, pp. 933-941, Aug. 2017.
[23] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[24] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, pp. 1874-1883, Jun. 2016.
[25] M. Mirza and S. Osindero, “Conditional Generative Adversarial Nets,” arXiv:1411.1784 [cs.LG], Nov. 2014.
[26] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein Generative Adversarial Networks,” 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, pp. 214-223, Aug. 2017.
[27] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “CycleGAN-VC2: Improved CycleGAN-Based Non-parallel Voice Conversion,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019), Brighton, UK, pp. 6820-6824, May 2019.
[28] J. Kim, M. Kim, H. Kang, and K. Lee, “U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation,” 8th International Conference on Learning Representations (ICLR 2020), Apr. 2020.
[29] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, pp. 5967-5976, Jul. 2017.
[30] R. Ferro, N. Obin, and A. Roebel, “CycleGAN Voice Conversion of Spectral Envelopes Using Adversarial Weights,” 28th European Signal Processing Conference (EUSIPCO 2020), Amsterdam, Netherlands, pp. 406-410, Jan. 2021.
[31] K. Zhou, B. Sisman, and H. Li, “Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data,” Odyssey 2020: The Speaker and Language Recognition Workshop, Tokyo, Japan, pp. 230-237, Nov. 2020.
[32] Z. Du, K. Zhou, B. Sisman, and H. Li, “Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN,” Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2020), Auckland, New Zealand, pp. 507-513, Dec. 2020.