References
[1] R. A. J. Clark, K. Richmond, and S. King, “Multisyn: Open-domain unit selection for the Festival speech synthesis system,” Speech Communication, vol. 49, no. 4, pp. 317–330, 2007.
[2] H. Zen, K. Tokuda, and A. W. Black, “Statistical parametric speech synthesis,” Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
[3] T. Merritt, J. Latorre, and S. King, “Attributing modeling errors in HMM synthesis by stepping gradually from natural to modelled speech,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4220–4224.
[4] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, “Speech synthesis based on hidden Markov models,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1234–1252, 2013.
[5] Z.-H. Ling, S.-Y. Kang, H. Zen, A. Senior, M. Schuster, X.-J. Qian, H. M. Meng, and L. Deng, “Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends,” IEEE Signal Processing Magazine, vol. 32, no. 3, pp. 35–52, 2015.
[6] H. Zen, “Acoustic modeling in statistical parametric speech synthesis - from HMM to LSTM-RNN,” in Proc. MLSLP, 2015, invited paper.
[7] T. Weijters and J. Thole, “Speech synthesis with artificial neural networks,” in Proc. Int. Conf. on Neural Networks, 1993, pp. 1764–1769.
[8] G. Cawley and P. Noakes, “LSP speech synthesis using backpropagation networks,” in Proc. Third Int. Conf. on Artificial Neural Networks, 1993, pp. 291–294.
[9] C. Tuerk and T. Robinson, “Speech synthesis using artificial neural networks trained on cepstral coefficients,” in Proc. European Conference on Speech Communication and Technology (Eurospeech), 1993, pp. 4–7.
[10] M. Riedi, “A neural-network-based model of segmental duration for speech synthesis,” in Proc. European Conference on Speech Communication and Technology (Eurospeech), 1995, pp. 599–602.
[11] O. Karaali, G. Corrigan, N. Massey, C. Miller, O. Schnurr, and A. Mackie, “A high quality text-to-speech system composed of multiple neural networks,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, 1998, pp. 1237–1240.
[12] Z.-H. Ling, L. Deng, and D. Yu, “Modeling spectral envelopes using Restricted Boltzmann Machines and Deep Belief Networks for statistical parametric speech synthesis,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2129–2139, 2013.
[13] S. Kang, X. Qian, and H. Meng, “Multi-distribution deep belief network for speech synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 8012–8016.
[14] S. Kang and H. Meng, “Statistical parametric speech synthesis using weighted multi-distribution deep belief network,” in Proc. Interspeech, 2014, pp. 1959–1963.
[15] H. Zen and A. Senior, “Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 3844–3848.
[16] B. Uria, I. Murray, S. Renals, and C. Valentini, “Modelling acoustic feature dependencies with artificial neural networks: Trajectory-RNADE,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4465–4469.
[17] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 7962–7966.
[18] H. Lu, S. King, and O. Watts, “Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis,” in Proc. 8th ISCA Speech Synthesis Workshop (SSW), 2013, pp. 281–285.
[19] Y. Qian, Y. Fan, W. Hu, and F. K. Soong, “On the training aspects of deep neural network (DNN) for parametric TTS synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2014, pp. 3829–3833.
[20] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King, “Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4460–4464.
[21] K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “The effect of neural networks in statistical parametric speech synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4455–4459.
[22] O. Watts, G. E. Henter, T. Merritt, Z. Wu, and S. King, “From HMMs to DNNs: where do the improvements come from?” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2016.
[23] C. Valentini-Botinhao, Z. Wu, and S. King, “Towards minimum perceptual error training for DNN-based speech synthesis,” in Proc. Interspeech, 2015, pp. 869–873.
[24] Z. Wu and S. King, “Minimum trajectory error training for deep neural networks, combined with stacked bottleneck features,” in Proc. Interspeech, 2015, pp. 309–313.
[25] Y. Fan, Y. Qian, F. K. Soong, and L. He, “Sequence generation error (SGE) minimization based deep neural networks training for text-to-speech synthesis,” in Proc. Interspeech, 2015, pp. 864–868.
[26] Z. Wu and S. King, “Improving trajectory modelling for DNN-based speech synthesis by using stacked bottleneck features and minimum generation error training,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.
[27] Y. Fan, Y. Qian, F. Xie, and F. K. Soong, “TTS synthesis with bidirectional LSTM based recurrent neural networks,” in Proc. Interspeech, 2014, pp. 1964–1968.
[28] H. Zen and H. Sak, “Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 4470–4474.
[29] Z. Wu, O. Watts, and S. King, “Merlin: An open source neural network speech synthesis system,” in Proc. 9th ISCA Speech Synthesis Workshop (SSW), 2016.
[30] Z. Wu and S. King, “Investigating gated recurrent neural networks for speech synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2016.
[31] SPTK official website, http://sp-tk.sourceforge.net/
[32] T. Merritt, R. A. J. Clark, Z. Wu, J. Yamagishi, and S. King, “Deep neural network-guided unit selection synthesis,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2016.
[33] Q. Hu, Z. Wu, K. Richmond, J. Yamagishi, Y. Stylianou, and R. Maia, “Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning,” in Proc. Interspeech, 2015, pp. 854–858.
[34] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877–1884, 2016.
[35] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.
[36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[38] Festival official download page, http://festvox.org/packed/festival/2.4/
[39] Training data download link provided by Merlin, http://104.131.174.95/slt_arctic_full_data.zip
[40] Onmyoji official website, https://www.onmyojigame.com/#2
[41] Merlin-related discussion thread, https://github.com/CSTR-Edinburgh/merlin/issues/18
[42] Commercially available audio storybook, http://shopping.windmill.com.tw/product.php?product_num=10155936