Thesis 102522609: Detailed Record




Name: Andri Santoso (沈安迪)    Department: Computer Science and Information Engineering
Thesis title (Chinese): 聲學場景分類運用卷積神經網絡
Thesis title (English): Acoustic scene classification using self-determination convolutional neural network
Related theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Pre-processing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Dictation System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Accelerated Robust Features
★ Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ Face Detection with RetinaNet
★ Trend Prediction for Financial Products
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Post-Surgery Survival of Stroke Patients
Files: Full text is not available for browsing in the system (access permanently restricted)
Abstract (Chinese): Automatic scene classification is a popular topic in machine learning research. Many studies focus on vision-based automatic scene classification, while comparatively few use audio as the basis for scene classification. An audio-based scene classification system, also called acoustic scene classification, analyzes input audio data and automatically classifies the environmental scene in which the sound was recorded. When visual information is unavailable, acoustic scene classification can be regarded as an extension of vision-based scene classification. As long as audio information can be obtained, an acoustic scene classification system can classify the scene, so the task can be viewed as a form of machine hearing. Several methods have been proposed for acoustic scene classification. In recent years, a growing number of studies have applied computer vision techniques to the analysis of acoustic events, and research on deep learning has also attracted considerable attention, delivering outstanding results in many fields. This thesis proposes a deep-learning-based method for the acoustic scene classification problem.
Abstract (English): Automatic scene classification is an active topic in machine learning research. While much work has focused on vision-based approaches, relatively little attention has been paid to automatic scene classification using audio. Audio-based scene classification, also known as acoustic scene classification (ASC), analyzes input audio data to automatically identify the environment in which the sound was recorded. ASC can serve as an alternative to vision-based scene classification when the performance of a visual classifier is compromised: as long as the sound can be heard, a practical ASC system can still classify the scene, so the occlusion problem that affects vision-based approaches can be sidestepped. A number of approaches have been proposed for audio-based scene classification, and in recent years there has been increasing interest in adapting techniques from computer vision to audio analysis. Moreover, deep learning has attracted much attention and has produced promising results in many fields. In this thesis, the ASC problem is addressed with a deep-learning-based approach. Several ASC systems, including the proposed system, are implemented and discussed in the experiments, and the results show that the proposed system outperforms the other systems discussed in this thesis.
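To make the kind of pipeline described above concrete, the following is a minimal illustrative sketch of a frame/patch-based ASC system: log-mel spectrogram patches are classified by a small CNN and the per-patch scores are averaged to obtain a file-level decision. This is not the thesis's SD-CNN; the library choices (librosa, PyTorch), the network layout, the 15-class label count, and the file name scene_example.wav are assumptions made only for illustration.

# Minimal, illustrative frame/patch-based ASC sketch (not the thesis's SD-CNN).
# Assumes librosa and PyTorch; the 15-class setting and file path are hypothetical.
import librosa
import numpy as np
import torch
import torch.nn as nn

N_MELS, PATCH_FRAMES, N_CLASSES = 40, 64, 15

def audio_to_patches(path, sr=22050):
    """Load audio, compute a log-mel spectrogram, and slice it into fixed-size patches."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_MELS)
    logmel = librosa.power_to_db(mel)
    # Per-file statistical normalization (zero mean, unit variance).
    logmel = (logmel - logmel.mean()) / (logmel.std() + 1e-8)
    patches = [logmel[:, i:i + PATCH_FRAMES]
               for i in range(0, logmel.shape[1] - PATCH_FRAMES + 1, PATCH_FRAMES)]
    return torch.tensor(np.stack(patches), dtype=torch.float32).unsqueeze(1)

class SmallASCNet(nn.Module):
    """Two convolutional blocks followed by a linear classifier over scene classes."""
    def __init__(self, n_classes=N_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * (N_MELS // 4) * (PATCH_FRAMES // 4), n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# File-level decision: average the per-patch class scores, then take the argmax.
model = SmallASCNet().eval()
patches = audio_to_patches("scene_example.wav")  # hypothetical input file
with torch.no_grad():
    scene_id = model(patches).mean(dim=0).argmax().item()

In practice the network would be trained with a cross-entropy loss on labeled scene recordings; the sketch only shows the feature extraction, patching, and score-averaging structure that frame-based ASC systems share.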
Keywords (Chinese): ★ acoustic scene classification using convolutional neural networks (聲學場景分類運用卷積神經網絡)
Keywords (English): ★ acoustic scene classification
★ deep learning
★ audio processing
Table of contents
Abstract (Chinese)
Abstract (English)
Acknowledgement
Contents
List of figures
List of tables
Description of symbols
1. Introduction
  1.1 Background
  1.2 Motivation
  1.3 Organization
2. Acoustic scene classification
  2.1 Audio features
    2.1.1 Mel frequency cepstral coefficient
    2.1.2 Spectrogram
    2.1.3 Statistical normalization
  2.2 Classification method
    2.2.1 Gaussian mixture model
    2.2.2 Hidden Markov model
    2.2.3 Support vector machine
    2.2.4 Deep neural network
3. Deep neural network architectures
  3.1 Deep belief network
  3.2 Convolutional neural network
  3.3 Recurrent neural network
4. Methodology
  4.1 The file-based approaches for ASC
    4.1.1 File-based approach using SVM
    4.1.2 File-based approach using MLP
    4.1.3 File-based approach using CNN
    4.1.4 File-based approach using RNN
  4.2 The frame-based approaches for ASC
    4.2.1 Frame-based approach using GMM
    4.2.2 Frame-based approach using MLP
    4.2.3 Frame-based approach using CNN
    4.2.4 Frame-based approach using NIN-CNN
  4.3 Self-determination convolutional neural network
    4.3.1 What problem does SD-CNN solve
    4.3.2 The architecture of proposed systems
5. Experimental setup
  5.1 Dataset description
  5.2 Experimental setup
  5.3 Evaluation
6. Experimental results
  6.1 The evaluation for file-based approaches
    6.1.1 Result of file-based approach using SVM
    6.1.2 Result of file-based approach using MLP
    6.1.3 Result of file-based approach using CNN
    6.1.4 Result of file-based approach using RNN
    6.1.5 Comparison of all file-based approaches
  6.2 The evaluation for frame-based approaches
    6.2.1 Result of frame-based approach using GMM
    6.2.2 Result of frame-based approach using MLP
    6.2.3 Result of frame-based approach using CNN
    6.2.4 Result of frame-based approach using NIN-CNN
    6.2.5 Result of frame-based approach using SD-CNN
    6.2.6 Result of frame-based approach using SD-NIN-CNN
    6.2.7 Comparison of all frame-based approaches
    6.2.8 Comparison of conventional CNN and SD-CNN
7. Conclusions
References
Advisor: Jia-Ching Wang (王家慶)    Date of approval: 2016-08-26