Thesis 107521604: Detailed Record




Author: Tran Dang Khoa (陳登國)    Department: Electrical Engineering
Thesis Title: 應用於語者驗證之雙序列門控注意力單元架構
(Dual-Sequences Gated Attention Unit Architecture for Speaker Verification)
Related theses
★ Low-memory hardware design for real-time SIFT feature extraction
★ Real-time face detection and face recognition for an access control system
★ An autonomous vehicle with real-time automatic following
★ Lossless compression algorithm and its implementation for multi-lead ECG signals
★ Offline customizable voice and speaker wake-word system with embedded implementation
★ Wafer map defect classification and its embedded system implementation
★ Speech densely connected convolutional networks for small-footprint keyword spotting
★ G2LGAN: data augmentation for imbalanced datasets applied to wafer map defect classification
★ Algorithm design techniques to compensate for the finite precision of multiplierless digital filters
★ Design and implementation of a programmable Viterbi decoder
★ Low-cost vector rotator silicon IP design based on extended elementary-angle CORDIC
★ Analysis and architecture design of a JPEG2000 still-image coding system
★ Low-power turbo decoder for communication systems
★ Platform-based design for multimedia communication
★ Design and implementation of a digital watermarking system for MPEG encoders
★ Algorithm development for video error concealment with data-reuse considerations
  1. Access permission for this electronic thesis: the author has agreed to immediate open access.
  2. The open-access electronic full text is licensed only for personal, non-commercial retrieval, reading, and printing for the purpose of academic research.
  3. Please comply with the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.

Abstract (Chinese): In this thesis, we propose a variant of the GRU structure called the Dual-Sequences Gated Attention Unit (DS-GAU), in which the statistics pooling of each TDNN layer of the x-vector baseline is computed and passed through the DS-GAU layer, aggregating more information from the different temporal contexts of the input features during frame-level training. The proposed architecture is trained on the VoxCeleb2 dataset, and the resulting feature vector is called the DSGAU-vector. We evaluate it on the VoxCeleb1 dataset and the Speakers in the Wild (SITW) dataset and compare the experimental results with the x-vector baseline system. The results show that the proposed method achieves relative EER improvements of up to 11.6%, 7.9%, and 7.6% over the x-vector baseline on the VoxCeleb1 dataset.
Abstract (English): In this thesis, we present a variant of the GRU architecture called the Dual-Sequences Gated Attention Unit (DS-GAU), in which the statistics pooling of each TDNN layer of the x-vector baseline is computed and passed through the DS-GAU layer, aggregating more information from the varying temporal contexts of the input features during frame-level training. Our proposed architecture was trained on the VoxCeleb2 dataset, and the resulting feature vector is referred to as a DSGAU-vector. We evaluated it on the VoxCeleb1 dataset and the Speakers in the Wild (SITW) dataset and compared the experimental results with the x-vector baseline system. The results show that our proposed method achieved up to 11.6%, 7.9%, and 7.6% relative improvement in EER over the x-vector baseline on the VoxCeleb1 dataset.
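To illustrate the idea described in the abstract, the following is a minimal PyTorch sketch of layer-wise statistics pooling over TDNN outputs whose pooled vectors are merged by a GRU-style gate. All module names (TDNNWithLayerwisePooling, GatedPoolingUnit, statistics_pooling), layer sizes, and the exact gating form are illustrative assumptions for this sketch, not the thesis implementation.

# Minimal sketch (not the thesis code): statistics pooling over the outputs of
# several TDNN (dilated 1-D conv) layers, with the pooled vectors combined by a
# GRU-style gated unit. Layer sizes and the gating form are illustrative.
import torch
import torch.nn as nn


def statistics_pooling(frames: torch.Tensor) -> torch.Tensor:
    """Concatenate the per-utterance mean and standard deviation over time.

    frames: (batch, channels, time) -> (batch, 2 * channels)
    """
    mean = frames.mean(dim=2)
    std = frames.std(dim=2)
    return torch.cat([mean, std], dim=1)


class GatedPoolingUnit(nn.Module):
    """GRU-like gate that merges a new pooled vector into a running state."""

    def __init__(self, dim: int):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)
        self.candidate = nn.Linear(2 * dim, dim)

    def forward(self, state: torch.Tensor, pooled: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([state, pooled], dim=1)
        z = torch.sigmoid(self.update(joint))        # update gate
        h = torch.tanh(self.candidate(joint))        # candidate state
        return (1 - z) * state + z * h


class TDNNWithLayerwisePooling(nn.Module):
    """Stack of dilated 1-D convs (TDNN); each layer's output is pooled and gated."""

    def __init__(self, feat_dim: int = 30, hidden: int = 64, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = feat_dim
        for k in range(num_layers):
            self.layers.append(
                nn.Conv1d(in_ch, hidden, kernel_size=3, dilation=k + 1, padding=k + 1)
            )
            in_ch = hidden
        self.gate = GatedPoolingUnit(2 * hidden)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim, time), e.g. a batch of MFCC sequences
        x = feats
        state = None
        for conv in self.layers:
            x = torch.relu(conv(x))
            pooled = statistics_pooling(x)           # (batch, 2 * hidden)
            state = pooled if state is None else self.gate(state, pooled)
        return state                                 # utterance-level embedding


if __name__ == "__main__":
    model = TDNNWithLayerwisePooling()
    mfcc = torch.randn(4, 30, 200)                   # 4 utterances, 200 frames each
    print(model(mfcc).shape)                          # torch.Size([4, 128])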
Keywords (Chinese): ★ Dual-Sequences Gated Attention Unit architecture for speaker verification    Keywords (English):
Table of Contents
1 Introduction 1
1.1 Motivations 1
1.2 Thesis Organization 2
2 Background 4
2.1 Time-delay neural networks (TDNN) 5
2.2 Baseline x-vector system 5
2.3 Extension topology of x-vector 7
2.3.1 Extended-TDNN (E-TDNN) 7
2.3.2 Factorized TDNN (F-TDNN) 8
2.4 DSP-vector system 9
2.4.1 DSP-vector structure 9
2.4.2 DSP-LSTM architecture 11
3 Dual-Sequences Gated Attention Unit (DS-GAU) 14
3.1 Dual-Sequences Gated Attention Unit (DS-GAU) Vector Network Architecture 14
3.2 Dual-Sequences Gated Attention Unit (DS-GAU) 14
3.2.1 Recurrent Attention Unit (RAU) 14
3.2.2 Gated Attention Unit (GAU) 17
3.2.3 Dual-Sequences Gated Attention Unit (DS-GAU) 19
4 Experimental Setups and Results 23
4.1 Data preparation 23
4.1.1 Dataset preparation and metrics 23
4.1.2 Pre-processing speaker features 25
4.1.3 Backend classifier 25
4.2 Experimental results 25
4.2.1 Evaluation on VoxCeleb1 dataset 26
4.2.2 Evaluation on SITW dataset 29
5 Conclusion and Future Recommendations 36
6 References 38
Advisor: Tsung-Han Tsai (蔡宗漢)    Date of Approval: 2021-01-27
