Master's/Doctoral Thesis 985202030 — Detailed Record




Name: Min-Kang Tsai (蔡旻剛)    Department: Computer Science and Information Engineering
Thesis Title: Non-uniform Scale-Frequency Map for Environmental Sound Recognition
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Preprocessing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Applications of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Accelerated Robust Features
★ A Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ Applying RetinaNet to Face Detection
★ Trend Prediction for Financial Products
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ Research on End-to-End Mandarin Speech Synthesis
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
Full Text: Access permanently restricted (never released)
Abstract (Chinese) This thesis proposes a novel feature extraction technique for environmental sound recognition called the non-uniform scale-frequency map. For each frame, the matching pursuit algorithm is used to select significant atoms from a Gabor dictionary. Ignoring phase and position information, the scale and frequency of each selected atom are used to construct a scale-frequency map. Principal component analysis and linear discriminant analysis are then applied to the scale-frequency map to produce the final feature vector.
For environmental sound recognition, a segment-level multiclass support vector machine (SVM) is employed. In the experiments, a 17-class sound database is used; the results show that the proposed method achieves 86.47% accuracy and clearly outperforms other time-frequency features.
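To make the per-frame step above concrete, here is a minimal NumPy sketch of matching pursuit over a small Gabor dictionary, accumulating each selected atom's scale and frequency into a map while discarding its phase and position. The dictionary layout (dyadic scales, a uniform frequency grid, coarse position sampling) and the map resolution are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

def gabor_atom(n, scale, freq, pos, phase):
    """Unit-norm Gabor atom: a Gaussian window modulated by a cosine."""
    t = np.arange(n)
    g = np.exp(-np.pi * ((t - pos) / scale) ** 2) * np.cos(2 * np.pi * freq * t + phase)
    return g / (np.linalg.norm(g) + 1e-12)

def matching_pursuit_sfm(frame, scales, freqs, n_atoms=5):
    """Greedy matching pursuit; |coefficient| is accumulated into a
    (scale, frequency) map, ignoring each atom's position and phase."""
    n = len(frame)
    residual = frame.astype(float).copy()
    # Illustrative dictionary: every (scale, frequency) pair at a few
    # positions and two phases. A practical MP would be far lazier.
    dictionary = [(si, fi, gabor_atom(n, s, f, pos, ph))
                  for si, s in enumerate(scales)
                  for fi, f in enumerate(freqs)
                  for pos in range(0, n, max(1, n // 4))
                  for ph in (0.0, np.pi / 2)]
    sfm = np.zeros((len(scales), len(freqs)))
    for _ in range(n_atoms):
        # Pick the atom best correlated with the current residual.
        _, si, fi, g = max(((abs(residual @ g), si, fi, g)
                            for si, fi, g in dictionary), key=lambda x: x[0])
        coef = residual @ g
        residual -= coef * g      # remove the atom's contribution
        sfm[si, fi] += abs(coef)  # keep only scale and frequency
    return sfm

# Example: a 256-sample frame, four dyadic scales, eight frequency bins.
frame = np.random.randn(256)
sfm = matching_pursuit_sfm(frame, scales=[16, 32, 64, 128],
                           freqs=np.linspace(0.02, 0.45, 8))
```

Per the abstract, the maps from all frames would then be reduced with PCA and LDA before segment-level classification.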
In addition, a novel feature extraction technique called the SFM descriptor is proposed for speech emotion recognition. For each frame, atoms are again selected with the matching pursuit algorithm and a scale-frequency map is constructed; descriptor features are then extracted from each scale-frequency map. The proposed SFM descriptor is combined with the non-uniform SFM and MFCC features and fed to the classifier. For speech emotion recognition, an utterance-level multiclass SVM is employed. In the experiments, a 7-class emotional speech database is used, and the recognition rate reaches 73.96%.
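The feature combination just described can be pictured as per-frame concatenation. Below is a hedged sketch assuming librosa for MFCC extraction; `compute_sfm_descriptor` is a hypothetical placeholder for the descriptor computation, which the abstract does not specify.

```python
import numpy as np
import librosa

def fused_features(y, sr, compute_sfm_descriptor, frame_len=512, hop=256):
    """Concatenate per-frame MFCCs with an SFM-based descriptor."""
    # 13 MFCCs per frame, using the same framing as the descriptor.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame_len, hop_length=hop)
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    n = min(mfcc.shape[1], frames.shape[1])
    desc = np.stack([compute_sfm_descriptor(frames[:, i]) for i in range(n)],
                    axis=1)
    return np.vstack([mfcc[:, :n], desc])  # shape: (13 + D, n_frames)
```

Utterance-level classification would then pool these frame vectors (for example, by averaging) before the multiclass SVM.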
Abstract (English) In this study, we present a novel feature extraction technique called the non-uniform scale-frequency map for environmental sound recognition. For each audio frame, we use the matching pursuit algorithm to select important atoms from a Gabor dictionary. Ignoring phase and position information, we extract the scale and frequency of the selected atoms to construct a scale-frequency map. Principal component analysis (PCA) and linear discriminant analysis (LDA) are then applied to the scale-frequency map, generating a 16-dimensional feature vector. In the recognition phase, a segment-level multiclass support vector machine (SVM) is employed. Experiments are carried out on a 17-class sound database, and the results show that the proposed approach achieves an accuracy of 86.47%. A performance comparison with other time-frequency features demonstrates the superiority of the proposed feature.
In addition, we present a novel feature extraction technique called the SFM descriptor for emotional speech. For each frame, we again use the matching pursuit algorithm to select atoms and construct a scale-frequency map, from which descriptor features are extracted. The proposed SFM descriptor is combined with the non-uniform SFM and MFCC features and fed into a multiclass SVM. In the recognition phase, a file-level multiclass SVM is employed. Experiments are carried out on a 7-class emotional speech database, and a recognition rate of 73.96% is achieved.
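The classification back end named in both abstracts (PCA, then LDA, then a multiclass SVM) maps naturally onto scikit-learn. The sketch below is a stand-in, not the thesis's code: the RBF kernel and the random toy data are assumptions, while the 16-dimensional LDA output follows from the 17 classes (LDA yields at most n_classes - 1 dimensions).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1700, 64))     # stand-in: flattened scale-frequency maps
y = rng.integers(0, 17, size=1700)  # 17 environmental sound classes

clf = make_pipeline(
    PCA(n_components=32),                         # decorrelate and compress
    LinearDiscriminantAnalysis(n_components=16),  # at most n_classes - 1 = 16 dims
    SVC(kernel="rbf"),                            # one-vs-one multiclass by default
)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy on the toy data
```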
Keywords (Chinese) ★ matching pursuit
★ non-uniform scale-frequency map
★ environmental sound recognition
★ Gabor function
★ feature extraction
Keywords (English) ★ Gabor function
★ Nonuniform scale-frequency map
★ matching pursuit
★ feature extraction
★ environmental sound classification
Table of Contents
Chapter 1 Preface
1-1 Introduction
1-2 Motivation
1-3 Method Construction
Chapter 2 Literature Review
2-1 Audio Feature Extraction Approaches
2-2 Time-Domain Features
2-3 Frequency-Domain Features
2-4 Mel-Frequency Cepstral Coefficients
2-5 Time-Frequency Features
Chapter 3 Support Vector Machine
3-1 Separable Case
3-2 Non-Separable Case
3-3 Non-Linear Case
3-4 Multiclass Classification
Chapter 4 Proposed Method
4-1 Dimension Reduction Methods
4-2 Matching Pursuit Algorithm
4-3 Gabor Dictionary
4-4 Non-uniform Scale-Frequency Map
Chapter 5 SFM Descriptor for Emotional Speech
5-1 Emotional Features
5-2 SFM Descriptor
Chapter 6 Experiment Results
6-1 Environmental Sound Database and Emotional Sound Database
6-2 Comparison of Uniform and Non-uniform Bands
6-3 The Purpose of Applying PCA and LDA
6-4 Recognition Results of the Non-uniform SFM at Different SNR Levels
6-5 Emotional Speech Recognition
Chapter 7 Conclusion
References
References
[1] L. Lu, H.-J. Zhang, and H. Jiang, "Content analysis for audio classification and segmentation," IEEE Trans. Speech and Audio Processing, vol. 10, no. 7, pp. 504–516, Oct. 2002.
[2] J.-C. Wang, J.-F. Wang, K. W. He, and C.-S. Hsu, "Environmental sound classification using hybrid SVM/KNN classifier and MPEG-7 audio low-level descriptor," in Proc. Int. Joint Conf. Neural Networks, Vancouver, British Columbia, Canada, July 2006, pp. 1731–1735.
[3] S. Chu, S. Narayanan, C.-C. J. Kuo, and M. J. Mataric, "Where am I? Scene recognition for mobile robots using audio features," in Proc. IEEE Int. Conf. Multimedia and Expo, Toronto, Ontario, Canada, July 2006, pp. 885–888.
[4] J. Huang, "Spatial auditory processing for a hearing robot," in Proc. IEEE Int. Conf. Multimedia and Expo, Lausanne, Switzerland, vol. 2, Sep. 2002, pp. 253–256.
[5] E. Wold, T. Blum, D. Keislar, and J. Wheaton, "Content-based classification, search, and retrieval of audio," IEEE MultiMedia, vol. 3, no. 3, pp. 27–36, 1996.
[6] J. T. Foote, "Content-based retrieval of music and audio," in Proc. SPIE Conf. Multimedia Storage and Archiving Systems II, Dallas, Texas, USA, Nov. 1997, pp. 138–147.
[7] V. Peltonen, J. Tuomi, A. Klapuri, J. Huopaniemi, and T. Sorsa, "Computational auditory scene recognition," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Orlando, Florida, USA, May 2002, pp. 1941–1944.
[8] S. Z. Li, "Content-based audio classification and retrieval using the nearest feature line method," IEEE Trans. Speech and Audio Processing, vol. 8, no. 5, pp. 619–625, Sep. 2000.
[9] G. Guo and S. Z. Li, "Content-based audio classification and retrieval by support vector machines," IEEE Trans. Neural Networks, vol. 14, no. 1, pp. 209–215, Jan. 2003.
[10] J. Zheng, G. Wei, and C. Yang, "Modified local discriminant bases and its application in audio feature extraction," in Proc. Int. Forum on Information Technology and Application, Chengdu, China, May 2009, pp. 42–52.
[11] H. M. Hadi, M. Y. Mashor, M. S. Mohamed, and K. B. Tat, "Classification of heart sounds using wavelets and neural networks," in Proc. 5th Int. Conf. Electrical Engineering, Computing Science and Automatic Control, Mexico, Nov. 2008, pp. 177–180.
[12] S. P. Ebenezer, "Classification of acoustic emissions using modified matching pursuit," EURASIP Journal on Applied Signal Processing, pp. 347–357, 2004.
[13] K. Umapathy, S. Krishnan, and S. Jimaa, "Multigroup classification of audio signals using time-frequency parameters," IEEE Trans. Multimedia, vol. 7, no. 2, pp. 308–315, Apr. 2005.
[14] S. Chu, S. Narayanan, and C.-C. J. Kuo, "Environmental sound recognition with time-frequency audio features," IEEE Trans. Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1142–1158, Aug. 2009.
[15] K. Umapathy and S. Krishnan, "Sub-dictionary selection using local discriminant bases algorithm for signal classification," in Proc. Canadian Conf. Electrical and Computer Engineering, Canada, vol. 4, May 2004, pp. 2001–2004.
[16] K. Umapathy and S. Krishnan, "A signal classification approach using time-width vs frequency band sub-energy distributions," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Philadelphia, Pennsylvania, USA, vol. 5, Mar. 2005, pp. 477–480.
[17] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3397–3415, Dec. 1993.
[18] H.-C. Wang, Speech Signal Processing (語音訊號處理), rev. ed., Chuan Hwa Book Co., Taipei County, Taiwan, 2007.
[19] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[20] J. Shlens, "A tutorial on principal component analysis," Systems Neurobiology Laboratory, ver. 3.01, Apr. 2005.
[21] A. M. Martinez and A. C. Kak, "PCA versus LDA," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228–233, Feb. 2001.
[22] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, 1990.
[23] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41, pp. 603–623, 2003.
[24] H. Teager and S. Teager, "Evidence for nonlinear production mechanisms in the vocal tract," in Speech Production and Speech Modelling, NATO Advanced Study Institute Series, vol. 55, pp. 241–261, 1990.
Advisor: Jia-Ching Wang (王家慶)    Date of Approval: 2011-8-23