Latent Prosody Analysis for Robust Speaker Identification

Name

Zi-He Chen (陳子和)

Department

Department of Electrical Engineering

Thesis title

Latent Prosody Analysis for Robust Speaker Identification
(利用韻律訊息之強健性語者辨識)

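Access to this electronic thesis: the author has agreed to immediate open access.
Electronic full texts that have reached open access are licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid violating the law.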

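Abstract (Chinese)

Over the public telephone network, speaker identification (SID) systems commonly face two problems: handset mismatch and insufficient test speech. To improve the robustness of SID, we propose a framework that fuses low-level acoustic information with high-level prosodic information. Latent prosody analysis (LPA) is used to measure the distance between the prosody models of different speakers, and the scores of the acoustic model (GMM) and the prosody model are fused to obtain the final identification result. LPA borrows the concept of information retrieval to turn the SID problem into a full-text retrieval problem, and realizes robust prosody-based speaker identification in three steps: (1) prosodic tokenization, (2) latent prosody analysis (LPA), and (3) speaker retrieval.
The experiments use the Handset TIMIT (HTIMIT) corpus. To validate the proposed method, nine different handsets are used in turn as the unseen handset in a leave-one-out manner. With the conventional maximum likelihood a priori handset knowledge interpolation (ML-AKI) method as the baseline, the results show that the speaker identification rate of the proposed method is better than that of the conventional pitch-GMM and prosody bi-gram modeling methods, and that robustness improves for both seen and unseen handsets.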
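The retrieval view of SID described in this abstract can be illustrated with a minimal Python sketch. It treats each speaker as a "document" of prosodic keywords, weights the keyword counts with TF-IDF, and retrieves the closest speaker by cosine similarity. Everything specific here is an assumption made for illustration and is not taken from the thesis: the bi-gram keyword parser, the smoothed IDF formula, the cosine measure, the function names, and the toy state sequences. The projection into a latent prosody space, which is the core of LPA, is deliberately left out to keep the sketch short.

```python
import math
from collections import Counter

def keywords(states, n=2):
    """Parse a prosodic state sequence into n-gram 'keywords' (hypothetical parser)."""
    return [tuple(states[i:i + n]) for i in range(len(states) - n + 1)]

def tfidf_vectors(docs):
    """docs maps speaker -> list of keywords. Returns (speaker -> sparse TF-IDF vector, idf)."""
    counts = {spk: Counter(kws) for spk, kws in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many speakers' data each keyword occurs.
    df = Counter(kw for c in counts.values() for kw in c)
    # Smoothed IDF (an assumed variant; the thesis may weight differently).
    idf = {kw: math.log((1 + n_docs) / (1 + d)) + 1.0 for kw, d in df.items()}
    vecs = {}
    for spk, c in counts.items():
        total = sum(c.values())
        vecs[spk] = {kw: (k / total) * idf[kw] for kw, k in c.items()}
    return vecs, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_states, vecs, idf):
    """Return the enrolled speaker whose vector is closest to the query utterance."""
    c = Counter(keywords(query_states))
    total = sum(c.values())
    # Keywords never seen in training get zero weight.
    q = {kw: (k / total) * idf.get(kw, 0.0) for kw, k in c.items()} if total else {}
    return max(vecs, key=lambda spk: cosine(q, vecs[spk]))

# Toy data: made-up prosodic state labels, not real HTIMIT output.
train = {
    "spk_a": keywords([0, 1, 2, 0, 1, 2, 0, 1]),
    "spk_b": keywords([2, 2, 1, 0, 2, 2, 1, 0]),
}
vecs, idf = tfidf_vectors(train)
best = retrieve([0, 1, 2, 0, 1], vecs, idf)
print(best)
```

In a fuller version, the TF-IDF matrix would first be reduced to a low-dimensional latent space, and both enrolled speakers and test utterances would be compared there rather than in the raw keyword space.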

Abstract (English)

Handsets that are not seen in the training phase (unseen handsets) are a significant source of performance degradation for speaker identification (SID) applications in the telecommunication environment. This thesis proposes a novel latent prosody analysis (LPA) approach that automatically extracts the most discriminative prosodic cues to assist conventional spectral feature-based SID. The idea behind LPA is to transform the SID problem into a full-text document retrieval-like task via (1) prosodic contour tokenization, (2) latent prosody analysis, and (3) speaker retrieval. Experimental results on the phonetically balanced, read-speech handset-TIMIT (HTIMIT) database show that fusing the LPA prosodic feature-based SID system with the maximum likelihood a priori handset knowledge interpolation (ML-AKI) spectral feature-based SID system outperforms both the pitch and energy Gaussian mixture model (Pitch-GMM) and the bi-gram of the prosodic state (bi-gram) counterparts, both when all handsets are counted and when only unseen handsets are counted.
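The fusion of prosodic and spectral SID scores mentioned in the abstract can be pictured with a small Python sketch. The abstract does not state how the scores are combined, so this is only one plausible scheme under stated assumptions: per-utterance z-normalization of each score set followed by a weighted sum. The weight `alpha`, the function names, and all score values are hypothetical.

```python
import statistics

def znorm(scores):
    """Normalize one system's per-speaker scores to zero mean and unit variance."""
    values = list(scores.values())
    mu = statistics.mean(values)
    sd = statistics.pstdev(values) or 1.0  # guard against identical scores
    return {spk: (s - mu) / sd for spk, s in scores.items()}

def fuse(spectral, prosodic, alpha=0.7):
    """Weighted sum of normalized scores; alpha is an assumed, untuned weight."""
    s, p = znorm(spectral), znorm(prosodic)
    fused = {spk: alpha * s[spk] + (1 - alpha) * p[spk] for spk in s}
    return max(fused, key=fused.get), fused

# Made-up scores for one test utterance (not results from the thesis).
spectral = {"spk_a": -41.0, "spk_b": -40.5, "spk_c": -45.0}  # e.g. GMM log-likelihoods
prosodic = {"spk_a": 0.92, "spk_b": 0.35, "spk_c": 0.30}     # e.g. retrieval similarities
winner, fused = fuse(spectral, prosodic)
print(winner)
```

With these toy numbers the spectral system alone narrowly prefers `spk_b`, while the fused score selects `spk_a`, which is the kind of correction that prosodic evidence is meant to provide when the spectral scores are distorted by a mismatched handset.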

Keywords (Chinese)

★ Speaker identification (語者辨識)
★ Prosodic information (韻律訊息)

Keywords (English)

★ speaker identification
★ prosodic information

Table of Contents

Chapter 1. Introduction
1.1. Background
1.2. Outline of this Dissertation
Chapter 2. Latent Prosody Analysis
2.1. Introduction
2.2. Tokenization
2.2.1. Inter-syllable Prosodic Feature Extraction
2.2.2. Automatic Prosodic State Labeler
2.2.3. Prosodic Keyword Parser
2.3. Latent Prosody Analysis
2.3.1. Construction of Prosodic Keyword-Speaker Co-occurrence Matrix
2.3.2. Term Frequency and Inverse Document Frequency Method
2.3.3. Construction of the Latent Prosody Space of Speakers
2.4. Speaker Retrieval
2.5. Fusion of Prosodic and Spectral Feature-based SID Scores
Chapter 3. Cluster-Based LPA
3.1. Introduction
3.2. Cluster-Based LPA Method
3.3. Fusion of CD-LPA and CI-LPA SID Scores
3.4. Fusion of LPA and Spectral Feature-based SID Scores
Chapter 4. Experiments
4.1. The HTIMIT Database
4.2. Experiment Conditions
4.2.1. Training, Test and Extra Training Sets
4.2.2. ML-AKI Spectral Feature-based SID Baseline
4.2.3. Pitch-GMM and Bi-gram Prosodic Feature-based SID Baselines
4.3. Experimental Results
4.3.1. Spectral Feature-based SID Baseline
4.3.2. Fusion of Spectral and Prosodic Feature-based SID systems
Chapter 5. Analysis and Discussions
5.1. The Properties of LPA
5.1.1. Automatic Prosodic State Labeling
5.1.2. Constructed Latent Prosody Space of Speakers
5.1.3. Speaker Entropy and the Constructed Latent Prosody Space
5.2. Discussions on Experimental Results
5.2.1. Sensitiveness to Telephone Handset and Speaker Gender
5.2.2. Contribution of Different Prosodic Features to LPA
5.3. Potentiality of Applying LPA to other Speaking Styles or Languages
Chapter 6. Conclusions and Future works
6.1. Conclusions
6.2. Future Works
REFERENCES

Advisors

Yuan-Fu Liao (廖元甫), Yau-Tarng Juang (莊堯棠)

Approval date

2007-6-20