A High-Performance Neural Network SoC for x-vector based End-to-End Speaker Verification


Name

江孟叡 (Meng-Jui Chiang)

Department

Department of Electrical Engineering

Thesis title

A High-Performance Neural Network SoC for x-vector based End-to-End Speaker Verification

Full text: open for browsing in the system after 2026-3-1

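Abstract (translated from Chinese)

Over the past few years, using neural networks to identify speakers from their voices has become increasingly common. Among these approaches, the x-vector neural network shows stronger noise robustness and is usually more accurate than earlier methods such as the Gaussian mixture model (GMM) and the support vector machine (SVM). This thesis presents a system-on-chip (SoC) consisting of a RISC-V CPU and a neural network accelerator module for x-vector-based speaker verification (SV). Because the model contains a large number of parameters, the x-vector network is processed in three steps (size reduction, pruning, and compression) to ensure real-time operation on edge devices with limited computing resources. We focus on optimizing the data flow under sparsity. Compared with the conventional sparse matrix compression format Compressed Sparse Row (CSR), the proposed Binary pointer Compressed Sparse Row (BPCSR) format substantially reduces computation latency and avoids the load-balancing problem that sparsity causes among the processing elements (PEs). On the hardware side, we further design the neural network accelerator module, which stores the compressed parameters and computes the x-vector network, while the RISC-V CPU handles the remaining computation, such as feature extraction and the classifier. The speaker verification system was tested on the VoxCeleb dataset with 1,251 different test speakers and achieved an accuracy above 95%. Finally, the SoC was synthesized in a TSMC 90 nm process; its area is 15.5 mm² and its power is 97.88 mW. The chip's functionality was also verified on the Andes ADP-XC7K160 FPGA platform: audio is captured through a microphone, external I/O such as general-purpose input/output (GPIO) and a universal asynchronous receiver/transmitter (UART) handles user interaction, and the result is shown on a text display, realizing complete end-to-end speaker verification.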
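The abstract names pruning as one of the three model-processing steps but does not state which pruning criterion is used. As a minimal sketch of the general idea, assuming simple magnitude-based pruning (a common choice, not confirmed by this record), the function below zeroes the smallest-magnitude weights of a layer. The function name `magnitude_prune` and the example weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero roughly the `sparsity` fraction of weights with the smallest magnitude.

    ASSUMPTION: magnitude pruning is used here only for illustration; the
    thesis record does not say which criterion the author applied.
    Ties at the threshold may remove slightly more than the requested fraction.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to remove
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]  # largest magnitude that still gets pruned
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


# Hypothetical 2x3 weight matrix; prune half of the weights.
layer = [[0.9, -0.05, 0.3],
         [0.01, -0.7, 0.2]]
pruned = magnitude_prune(layer, 0.5)
achieved = sum(w == 0 for row in pruned for w in row) / 6
print(pruned, achieved)
```

The resulting zeros are what a sparse storage format can then skip, which is why pruning and compression appear together in the processing flow.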

Abstract (English)

Using neural networks to recognize a speaker's identity from speech has become popular in the last few years. Among these methods, the x-vector network is more robust to noise and usually achieves higher accuracy than earlier methods such as the Gaussian mixture model (GMM) and the support vector machine (SVM). This work presents a system-on-chip (SoC) composed of a RISC-V CPU and a neural network accelerator module for x-vector-based speaker verification (SV). Because the model has a large number of parameters, the x-vector network is processed in three steps (size reduction, pruning, and compression) so that it meets real-time latency and can be implemented on edge devices. We focus on optimizing the data flow under sparsity. Compared with the conventional sparse matrix compression format, compressed sparse row (CSR), the proposed binary pointer compressed sparse row (BPCSR) format significantly reduces latency and avoids load imbalance among the processing elements (PEs). We further design the neural network accelerator module, which stores the compressed parameters and computes the x-vector network, while the RISC-V CPU handles the remaining computation, such as feature extraction and the classifier. The system was tested on the VoxCeleb dataset with 1,251 test speakers and achieved over 95% accuracy. Finally, we synthesized the chip in TSMC 90 nm technology; it occupies 15.5 mm² and consumes 97.88 mW during real-time identification. The chip was also verified on the Andes ADP-XC7K160 FPGA platform, which takes audio input from a microphone and interacts with users through external I/O such as general-purpose input/output (GPIO) and a universal asynchronous receiver/transmitter (UART). The results are shown on a text display, achieving complete end-to-end speaker verification.
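For readers unfamiliar with the baseline format, the following is a minimal sketch of conventional CSR storage and a sparse matrix-vector product. It illustrates only the standard CSR baseline that the abstract compares against, not the proposed BPCSR format, whose layout is not described in this record. The function names (`to_csr`, `csr_matvec`, `nonzeros_per_row`) and the example matrix are illustrative assumptions.

```python
def to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to the three CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0:
                values.append(w)   # nonzero weight
                col_idx.append(j)  # its column position
        row_ptr.append(len(values))  # end offset of this row in `values`
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = W @ x, touching only the stored nonzeros."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def nonzeros_per_row(row_ptr):
    """Multiply-accumulate count per row, i.e. the work each row requires."""
    return [row_ptr[r + 1] - row_ptr[r] for r in range(len(row_ptr) - 1)]


# Hypothetical pruned 4x4 weight matrix.
weights = [
    [0, 2, 0, 0],
    [1, 0, 3, 4],
    [0, 0, 0, 0],
    [0, 5, 0, 0],
]
values, col_idx, row_ptr = to_csr(weights)
y = csr_matvec(values, col_idx, row_ptr, [1, 1, 1, 1])
work = nonzeros_per_row(row_ptr)
print(values, col_idx, row_ptr, y, work)
```

If each row were naively assigned to its own processing element, the uneven per-row counts in `work` would leave some elements idle while others are still computing. This is the kind of load imbalance the abstract says the proposed format avoids; how BPCSR achieves that is covered in the thesis itself.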

Keywords

★ Neural network
★ System-on-chip (SoC)
★ Speaker verification

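Table of Contents

Abstract (Chinese)
Abstract
1. Introduction
1.1 Research Background and Motivation
1.2 Thesis Organization
2. Literature Review
2.1 Speaker Verification Algorithms
2.2 Speaker Verification Hardware
2.3 Speech Processing SoCs
2.4 Sparse Neural Network Accelerators
3. Overview of the Proposed Verification Flow
3.1 Overall Verification Flow
3.2 Feature Extraction
3.3 x-vector Neural Network
3.4 Comparator (Scorer)
4. Neural Network Optimization and Data Scheduling
4.1 Model Size Reduction
4.2 Pruning
4.3 Quantization
4.4 Processing Element Load Balancing
4.5 Proposed Binary Pointer Compressed Sparse Row Storage Format
4.6 Data Pre-scheduling for PEs
5. Hardware Architecture Design
5.1 Overall Architecture
5.2 Embedded CPU
5.3 Neural Network Computation Module
5.4 Processing Element
6. Experimental Results
6.1 Experimental Setup
6.2 Accuracy and Computation
6.3 Chip Design and Area Analysis
6.4 FPGA Verification
6.5 Comparison
7. Conclusion
References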

Advisor

蔡宗漢

Approval date

2023-3-14
