Master's and Doctoral Theses: Detailed Record for 985202060




Name  Zhen-yu Gu (辜振禹)        Department  Computer Science and Information Engineering (資訊工程學系)
Thesis Title  New Segmentation Method and Acoustical Features for Unsupervised Audio Change Detection
(Chinese title: 應用於非監督式音訊轉換偵測之新型方法及特徵參數)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Preprocessing
★ Application and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Speeded-Up Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ Applying RetinaNet to Face Detection
★ Financial Product Trend Prediction
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
Files  Electronic full text accessible only via the system (never open to the public)
Abstract (Chinese)  Audio segmentation can be divided into two parts: speech segmentation and environmental sound segmentation. The goal is to cut an audio stream into multiple segments, each of which contains only a single speaker or a single environmental sound.
For speech segmentation, this thesis proposes a new concept that recasts traditional speaker change detection as a speaker verification problem. To cope with insufficient training data, a support vector machine (SVM) is adopted to train the models. Because SVM training is time-consuming, we first use the simpler generalized likelihood ratio (GLR) in a first stage to locate candidate change points, and then confirm them in a second stage with the proposed SVM-based adjacent window similarity algorithm, which reduces the computation time. Experimental results show that the proposed segmentation method outperforms the conventional Bayesian information criterion (BIC) algorithm.
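For reference, the standard single-Gaussian forms of the GLR distance and the ΔBIC test for a candidate change point between adjacent windows X and Y (with pooled window Z, sample covariances Σ, frame counts N, feature dimension d, and penalty weight λ) are sketched below; the thesis may use variants of these expressions.

```latex
% Standard single-Gaussian GLR distance and Delta-BIC test for a candidate
% change point between adjacent windows X and Y with pooled window Z
% (the thesis may use variants of these expressions).
d_{\mathrm{GLR}}(X,Y) = \frac{N_Z}{2}\log\lvert\Sigma_Z\rvert
                      - \frac{N_X}{2}\log\lvert\Sigma_X\rvert
                      - \frac{N_Y}{2}\log\lvert\Sigma_Y\rvert

\Delta\mathrm{BIC}(X,Y) = d_{\mathrm{GLR}}(X,Y)
  - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N_Z,
  \qquad \text{a change is hypothesized when } \Delta\mathrm{BIC} > 0.
```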
For the acoustic features, we adopt Mel-frequency cepstral coefficients (MFCC) for speech. Environmental sounds vary much more widely, so we propose non-uniform scale-frequency map features, which decompose the audio signal with the matching pursuit algorithm. Experimental results on environmental sound segmentation show that the proposed features are more noise-robust and more discriminative than MFCC.
Abstract (English)  Audio segmentation can be divided into two categories: speech segmentation and environmental sound segmentation. It divides an audio stream into multiple segments, each containing only one speaker or one environmental sound.
For speaker segmentation, this thesis proposes a new concept that turns the traditional speaker change detection problem into a speaker verification problem. To solve the problem of insufficient training data, we use a support vector machine (SVM) to train the speaker models. Because SVM training is computationally expensive, we adopt a two-stage search strategy. In the first stage, the generalized likelihood ratio (GLR) is used to find candidate change points. In the second stage, the candidates are confirmed by the proposed SVM-based adjacent window similarity criterion. Experimental results show that the proposed SVM-based adjacent window similarity criterion outperforms the conventional Bayesian information criterion (BIC).
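The two-stage strategy described above can be illustrated with a minimal sketch, assuming scikit-learn's SVC, per-frame acoustic features (e.g. MFCC) already extracted, and illustrative window sizes and thresholds; the adjacent-window similarity score below is only a simple separability proxy, not the thesis's exact criterion.

```python
# Minimal sketch of the two-stage search (not the thesis code).
# Stage 1: a GLR distance between adjacent windows proposes candidate change points.
# Stage 2: an SVM trained on the two adjacent windows scores how separable they are;
# a low cross-window similarity is taken as a confirmed speaker change.
# Window sizes, feature extraction, and thresholds are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

def glr_distance(x, y):
    """Log-GLR distance between two feature windows of shape (frames, dims)."""
    z = np.vstack([x, y])
    logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    return 0.5 * (len(z) * logdet(z) - len(x) * logdet(x) - len(y) * logdet(y))

def adjacent_window_similarity(x, y):
    """Train an SVM to separate the two windows; high separability suggests
    two different speakers, so similarity = 1 - accuracy (illustrative proxy)."""
    clf = SVC(kernel="rbf", gamma="scale")
    data = np.vstack([x, y])
    labels = np.r_[np.zeros(len(x)), np.ones(len(y))]
    clf.fit(data, labels)
    return 1.0 - clf.score(data, labels)

def detect_changes(features, win=200, hop=50, glr_th=50.0, sim_th=0.2):
    """features: (num_frames, dim) array of per-frame acoustic features."""
    changes = []
    for t in range(win, len(features) - win, hop):
        left, right = features[t - win:t], features[t:t + win]
        if glr_distance(left, right) > glr_th:                    # stage 1: cheap candidate test
            if adjacent_window_similarity(left, right) < sim_th:  # stage 2: SVM confirmation
                changes.append(t)
    return changes
```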
Considering the acoustic features, we use Mel-frequency cepstral coefficients (MFCC) for speaker segmentation. For environmental sounds, we propose a feature set based on a non-uniform scale-frequency map (SFM). These features are obtained by decomposing an audio signal with the matching pursuit algorithm. Experimental results demonstrate that the proposed non-uniform SFM-based feature set is more noise-robust than MFCC in environmental sound segmentation.
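As a rough illustration of the matching pursuit decomposition mentioned above, the sketch below greedily selects atoms from a simple Gabor-style dictionary and records their (scale, frequency) indices, which is one way a scale-frequency map could be populated; the dictionary layout and parameters are illustrative assumptions rather than the thesis's construction.

```python
# Illustrative matching pursuit over a Gabor-style dictionary (not the thesis
# implementation). Each selected atom carries a scale and a frequency; counting
# the selected (scale, frequency) pairs sketches how a scale-frequency map
# could be filled. Dictionary parameters are assumptions.
import numpy as np

def gabor_atom(n, scale, freq, center):
    """Unit-norm Gaussian-windowed cosine atom of length n."""
    t = np.arange(n)
    g = np.exp(-0.5 * ((t - center) / scale) ** 2) * np.cos(2 * np.pi * freq * t)
    return g / (np.linalg.norm(g) + 1e-12)

def build_dictionary(n, scales=(32, 64, 128, 256), freqs=np.linspace(0.01, 0.45, 12)):
    """Grid of atoms over assumed scales, normalized frequencies, and centers."""
    atoms, params = [], []
    for s in scales:
        for f in freqs:
            for c in range(0, n, n // 8):
                atoms.append(gabor_atom(n, s, f, c))
                params.append((s, f))
    return np.array(atoms), params

def matching_pursuit(x, atoms, params, n_iter=30):
    """Greedy MP: pick the atom best correlated with the residual, subtract its
    contribution, and record its (scale, frequency) index."""
    residual = x.astype(float).copy()
    selected = []
    for _ in range(n_iter):
        corr = atoms @ residual
        k = int(np.argmax(np.abs(corr)))
        residual -= corr[k] * atoms[k]
        selected.append(params[k])
    return selected  # (scale, frequency) pairs feeding a scale-frequency histogram
```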
Keywords (Chinese)  ★ 語者切割 (speaker segmentation)
★ 語者轉換偵測 (speaker change detection)
Keywords (English)  ★ speaker segmentation
★ speaker change detection
Table of Contents
Abstract in Chinese
Abstract in English
Acknowledgments
Contents
List of Figures
List of Tables
Explanation of Symbols
Chapter 1  Introduction
  1-1  Motivation
  1-2  Research Background and Purpose
  1-3  Thesis Outline
Chapter 2  Related Works
  2-1  Speech Feature Extraction
  2-2  Strategy of Searching Change Points
  2-3  Related Research Methods
Chapter 3  Speaker Change Detection
  3-1  Support Vector Machine (SVM)
  3-2  K-Means Algorithm
  3-3  Speaker Change Detection Algorithms
  3-4  Experimental Results
Chapter 4  Environmental Sound Change Detection
  4-1  Non-Uniform Scale-Frequency Map
  4-2  SFM Descriptors
  4-3  Experimental Results
Chapter 5  Conclusion
Advisor  Jia-ching Wang (王家慶)        Date of Approval  2011-08-23