Thesis Record 104522608: Detailed Information




Name  Dang Thi Thuy An (鄧氏陲殷)    Department  Computer Science and Information Engineering
Thesis Title  基於深度學習之聲音辨識及偵測
(Sound Classification and Detection Using Deep Learning)
Related Theses
★ Single and Multi-Label Environmental Sound Recognition with Gaussian Process
★ Embedded System Implementation of Beamforming and Audio Pre-processing
★ Applications and Design of Speech Synthesis and Voice Conversion
★ A Semantics-Based Public Opinion Analysis System
★ Design and Application of a High-Quality Narration System
★ Calcaneal Fracture Recognition and Detection in CT Images Using Deep Learning and Accelerated Robust Features
★ Personalized Collaborative-Filtering Clothing Recommendation System Based on a Style Vector Space
★ RetinaNet Applied to Face Detection
★ Financial Product Trend Prediction
★ A Study on Integrating Deep Learning Methods to Predict Age and Aging-Related Genes
★ End-to-End Speech Synthesis for Mandarin Chinese
★ Application and Improvement of ORB-SLAM2 on the ARM Architecture
★ Deep-Learning-Based Trend Prediction for Exchange-Traded Funds
★ Exploring the Correlation Between Financial News and Financial Trends
★ Emotional Speech Analysis Based on Convolutional Neural Networks
★ Using Deep Learning to Predict Alzheimer's Disease Progression and Stroke Surgery Survival
File  Full-text access: permanently restricted (never released)
Abstract (Chinese)  In this study, we develop various deep learning models for acoustic scene classification (ASC) and sound event detection (SED) in real-world environments. We build our models on the respective strengths of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for audio signal processing: CNNs offer an efficient way to extract spatial information from multidimensional data, while RNNs are strong at learning temporally ordered data. Our experiments are conducted on the three development datasets of the DCASE 2017 challenge, namely the acoustic scene dataset, the rare sound event dataset, and the polyphonic sound event dataset. To avoid overfitting, we apply several data augmentation techniques, such as setting input values to zero with a given probability, adding Gaussian noise, and changing the loudness of the audio.
The proposed methods outperform the baselines on all three DCASE 2017 challenge datasets. The accuracy of acoustic scene classification improves by 7.2% relative to the baseline. For rare sound event detection, our method achieves an average error rate of 0.26 and an F-score of 85.9%, versus 0.53 and 72.7% for the baseline. For polyphonic sound event detection, our method improves the error rate to 0.59, compared with 0.69 for the baseline.
Abstract (English)
In this work, we develop various deep learning models to perform acoustic scene classification (ASC) and sound event detection (SED) in real-life environments. In particular, we take advantage of both convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for audio signal processing: our proposed models are constructed from these two types of network. CNNs provide an effective way to capture spatial information in multidimensional data, while RNNs are powerful in learning temporal sequential data. We conduct experiments on three development datasets from the DCASE 2017 challenge, including the acoustic scene dataset, the rare sound event dataset, and the polyphonic sound event dataset. To reduce the overfitting problem, since the data are limited, we employ several data augmentation techniques, such as setting input values to zero with a given probability, adding Gaussian noise, and changing sound loudness.
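As an illustration of the augmentation step, which is the most concretely described part of the method here, the following is a minimal NumPy sketch of the three techniques named above. The function name, the parameter values, and the assumption that augmentation is applied to a spectrogram-like feature matrix are our own and are not taken from the thesis.

import numpy as np

def augment_features(features, drop_prob=0.1, noise_std=0.01,
                     gain_db_range=(-6.0, 6.0), rng=None):
    # Illustrative augmentation of a 2-D feature matrix (time frames x frequency bins).
    if rng is None:
        rng = np.random.default_rng()
    out = np.array(features, dtype=float)  # work on a float copy

    # 1) "Interrupt" inputs: set each value to zero with a given probability.
    out[rng.random(out.shape) < drop_prob] = 0.0

    # 2) Add Gaussian noise.
    out += rng.normal(0.0, noise_std, size=out.shape)

    # 3) Change loudness: apply a random gain, expressed here in decibels and
    #    applied multiplicatively, which suits linear-magnitude features; for
    #    log-scale features an additive offset would play the same role.
    gain_db = rng.uniform(*gain_db_range)
    out *= 10.0 ** (gain_db / 20.0)
    return out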
The proposed methods outperform the DCASE 2017 challenge baselines on all three datasets. The accuracy of acoustic scene classification improves by 7.2% in comparison with the baseline. For rare sound event detection, we report an average error rate of 0.26 and an F-score of 85.9%, compared to 0.53 and 72.7% for the baseline. For polyphonic sound event detection, our method obtains a slight improvement with an error rate of 0.59, versus 0.69 for the baseline.
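To make the two reported figures concrete, the sketch below spells out the segment-based metrics commonly used for DCASE sound event detection: the F-score computed from total true positives, false positives, and false negatives, and the error rate ER = (S + D + I) / N, where S, D, and I are per-segment substitutions, deletions, and insertions and N is the number of reference events. This is our own illustration under the assumption of binary segment-by-class activity matrices, not code from the thesis, which in practice would rely on the challenge's official evaluation tools.

import numpy as np

def segment_based_metrics(reference, prediction):
    # reference, prediction: binary matrices of shape (segments, event classes).
    ref = np.asarray(reference, dtype=bool)
    pred = np.asarray(prediction, dtype=bool)

    tp = np.sum(ref & pred)
    fp = np.sum(~ref & pred)
    fn = np.sum(ref & ~pred)

    # Per-segment counts used by the error rate.
    fn_seg = np.sum(ref & ~pred, axis=1)   # missed events in each segment
    fp_seg = np.sum(~ref & pred, axis=1)   # spurious events in each segment
    subs = np.minimum(fn_seg, fp_seg)      # a miss paired with a spurious event is a substitution
    dels = fn_seg - subs                   # remaining misses are deletions
    ins = fp_seg - subs                    # remaining spurious events are insertions
    n_ref = np.sum(ref)

    f_score = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) > 0 else 0.0
    error_rate = (np.sum(subs) + np.sum(dels) + np.sum(ins)) / n_ref if n_ref > 0 else 0.0
    return f_score, error_rate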
Keywords (Chinese) ★ Deep learning
★ CNNs
★ RNNs
★ Scene classification
★ Sound event detection
Keywords (English) ★ Deep learning
★ CNNs
★ RNNs
★ scene classification
★ sound event detection
Table of Contents
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Aim and Objective 3
1.3 Thesis Overview 4
Chapter 2 Deep Learning 5
2.1 Neural Network: Definitions and Basics 6
2.2 Convolutional Neural Network 15
2.2.1 Convolutional layer 16
2.2.2 Pooling layer 17
2.2.3 Fully-connected layer 18
2.3 Recurrent neural network 18
2.4 Long Short-Term Memory 22
2.5 Gated Recurrent Units 24
2.6 Bidirectional Recurrent Neural Networks 24
Chapter 3 Sound classification and detection problem 27
3.1 Previous works 27
3.2 Audio feature extraction 29
Chapter 4 Proposed methods 31
4.1 Audio scene classification 31
4.1.1 Feature Extraction 31
4.1.2 Network Architectures 31
4.2 Sound event detection 33
4.2.1 Feature extraction 33
4.2.2 Data augmentation 33
4.2.3 Network Architecture 34
Chapter 5 Experiments 38
5.1 Dataset 38
5.1.1 Acoustic scene classification dataset 38
5.1.2 Sound event detection dataset 38
5.2 Metric 39
5.3 Baselines 41
5.4 Results 41
5.4.1 Acoustic scene classification 41
5.4.2 Sound event detection 44
Chapter 6 Conclusions 48
References 49
References


[1] Alex Graves, Abdel-rahman Mohamed, Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2013, pp. 6645–6649.
[2] Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS) 2012, pp. 1097–1105.
[3] Alexis Conneau, Holger Schwenk, Loïc Barrault, Yann Lecun, “Very Deep Convolutional Networks for Text Classification,” in arXiv:1606.01781, 2016.
[4] Ian Goodfellow, Yoshua Bengio, Aaron Courville, "Deep Learning," 2015.
[5] S. Chu, S. Narayanan, and C.-C. Kuo, “Environmental sound recognition with time-frequency audio features,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1142–1158, Aug. 2009
[6] C. Mydlarz, J. Salamon, and J. P. Bello, "The implementation of low cost urban acoustic monitoring devices," Applied Acoustics, in press, 2016.
[7] Foggia, P.; Petkov, N.; Saggese, A.; Strisciuglio, N.; Vento, “Reliable detection of audio events in highly noisy environments,” Pattern Recognit. Lett. 2015, 65, 22–28.
[8] Guyot, P.; Pinquier, J.; Valero, X.; Alias, “Two-step detection of water sound events for the diagnostic and monitoring of dementia,” In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), San Jose, CA, USA, 15–19 July 2013; pp. 1–6.
[9] Stowell, D.; Clayton, "Acoustic Event Detection for Multiple Overlapping Similar Sources," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 18–21 October 2015.
[10] Clavel, C.; Ehrette, T.; Richard, “Events Detection for an Audio-Based Surveillance System,” In Proceedings of the IEEE International Conference on Multimedia and Expo, Amsterdam, the Netherlands, 6 July 2005;
[11] J. Nam, Z. Hyung, and K. Lee, “Acoustic scene classification using sparse feature learning and selective max-pooling by event detection,” IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, 2013.
[12] T. Heittola, A. Mesaros, A. J. Eronen, and T. Virtanen, “Context-dependent sound event detection”, EURASIP Journal on Audio, Speech, and Music Processing, 1:1–13, 2013.
[13] J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, “The PASCAL CHiME speech separation and recognition challenge,” Comput. Speech Language, vol. 27, no. 3, pp. 621–633, 2012.
[14] Yoonchang Han and Kyogu Lee. “Acoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing on 08-July-2016.
[15] Daniele Barchiesi; Dimitrios Giannoulis; Dan Stowell; Mark D. Plumbley, “Acoustic Scene Classification: Classifying environments from the sounds they produce,” IEEE Signal Processing Magazine, vol. 32, pp. 16-34, 2015.
[16] Annamaria Mesaros; Toni Heittola; Tuomas Virtanen, "TUT database for acoustic scene classification and sound event detection," 2016 24th European Signal Processing Conference (EUSIPCO), 2016.
[17] Victor Bisot; Romain Serizel; Slim Essid; Gaël Richard. “Acoustic scene classification with matrix factorization for unsupervised feature learning,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[18] Dan Stowell; Dimitrios Giannoulis; Emmanouil Benetos; Mathieu Lagrange; Mark D. Plumbley, “Detection and Classification of Acoustic Scenes and Events,” IEEE Transactions on Multimedia, 2015.
[19] Huy Phan, Philipp Koch, Fabrice Katzberg, Marco Maass, Radoslaw Mazur, Alfred Mertins. “Audio Scene Classification with Deep Recurrent Neural Networks,” 2017.
[20] A. Rakotomamonjy and G. Gasso, “Histogram of gradients of time-frequency representations for audio scene classification,” IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 142–153, 2015.
[21] T. H. Vu and J.-C. Wang, “Acoustic scene and event recognition using recurrent neural networks,” Detection and Classification of Acoustic Scenes and Events 2016, Tech. Rep., 2016.
[22] A. Mesaros, T. Heittola, A. Eronen, and T. Virtanen, “Acoustic event detection in real-life recordings,” in Proc. 18th Eur. Signal Process. Conf., Aalborg, Denmark, Aug. 2010, pp. 1267–1271.
[23] Toni Heittola, Annamaria Mesaros, Tuomas Virtanen, Moncef Gabbouj, “Supervised model training for overlapping sound events based on unsupervised source separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2013, pp. 8677–8681.
[24] T. Heittola, A. Mesaros, A. J. Eronen, and T. Virtanen, “Context-dependent sound event detection,” EURASIP J. Audio, Speech, Music Process., vol. 1, pp. 1–13, 2013.
[25] Satoshi Innami, Hiroyuki Kasai, “NMF-based environmental sound source separation using time-variant gain features,” in Computers & Mathematics with Applications, vol. 64, no. 5, pp. 1333–1342, 2012.
[26] Annamaria Mesaros, Toni Heittola, Onur Dikmen, Tuomas Virtanen, “Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015, pp. 606–618.
[27] Emre Cakir, Toni Heittola, Heikki Huttunen, Tuomas Virtanen, “Polyphonic Sound Event Detection Using Multi Label Deep Neural Networks,” in IEEE International Joint Conference on Neural Networks (IJCNN) 2015.
[28] E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, “Multi-label vs. combined single-label sound event detection with deep neural networks,” in Proc. 23rd Eur. Signal Process. Conf., Nice, France, Aug. 2015, pp. 2551–2555.
[29] G. Parascandolo, H. Huttunen, and T. Virtanen, “Recurrent neural networks for polyphonic sound event detection in real life recordings,” in 2016 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 6440–6444.
[30] E. Cakir, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, “Convolutional recurrent neural networks for polyphonic sound event detection,” in IEEE/ACM TASLP Special Issue on Sound Scene and Event Analysis, 2017.
[31] Y. Bengio. Learning deep architectures for AI, in Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[32] G. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, 18:1527–1554, 2006.
[33] Simon Haykin, "Neural Networks and Learning Machines," 3rd ed., Pearson Education, Upper Saddle River, 2009.
[34] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, “Learning internal representations by error propagation,” tech. rep., DTIC Document, 1985.
[35] Paul J Werbos, "Generalization of backpropagation with application to a recurrent gas market model," in Neural Networks, vol. 1, no. 4, pp. 339–356, 1988.
[36] Yann LeCun, Yoshua Bengio, “Convolutional networks for images, speech, and time series,” in The handbook of brain theory and neural networks, vol. 3361, no. 10, 1995.
[37] Boris Teodorovich Polyak, “Some methods of speeding up the convergence of iteration methods,” in USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.
[38] John Duchi, Elad Hazan, Yoram Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” in The Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
[39] Matthew D Zeiler, "ADADELTA: An adaptive learning rate method," in arXiv:1212.5701, 2012.
[40] Yurii Nesterov, "A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)," in Doklady AN SSSR, vol. 269, no. 3, pp. 543–547, 1983.
[41] Tijmen Tieleman, Geoffrey Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” in COURSERA: Neural Networks for Machine Learning, vol. 4, 2012.
[42] Diederik Kingma, Jimmy Ba, “Adam: A method for stochastic optimization,” in arXiv:1412.6980, 2014.
[43] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” in arXiv:1502.01852, 2015.
[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition”, Proceedings of the IEEE 86(11): 2278–2324, 1998.
[45] Zeiler, M. D. and Fergus, “Visualizing and understanding convolutional networks,” Published in Proc. ECCV, 2014.
[46] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, “Going Deeper with Convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[47] K. Simonyan, A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in arXiv technical report, 2014.
[48] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 – 778, 2016.
[49] Hinton, G. E. and Sejnowski, T. J., “Learning and relearning in Boltzmann machines,” in Parallel Distributed Processing, vol. 1, pp. 282–317. MIT Press, Cambridge, 1986.
[50] Michael I Jordan, "Attractor dynamics and parallelism in a connectionist sequential machine," 1986.
[51] Jeffrey L Elman, "Finding structure in time," in Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[52] Alex Waibel, “Modular construction of time-delay neural networks for speech recognition,” in Neural Computation, vol. 1, no. 1, pp. 39–46, 1989.
[53] Sepp Hochreiter, Jürgen Schmidhuber, “Long short-term memory,” in Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[54] Kyunghyun Cho, Bart Merriënboer, Dzmitry Bahdanau, Yoshua Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in arXiv:1409.1259, 2014.
[55] Bengio, Y., Simard, P., and Frasconi, P., “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, 5(2), 157–166, 1994.
[56] Mikolov, T., “Statistical Language Models based on Neural Networks,” Ph.D. thesis, Brno University of Technology, 2012.
[57] Razvan Pascanu, Tomas Mikolov, Yoshua Bengio, “On the difficulty of training Recurrent Neural Networks,” in arXiv:1211.5063, 2013.
[58] Quoc V Le, Navdeep Jaitly, Geoffrey E Hinton, “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units,” in arXiv preprint arXiv:1504.00941, 2015.
[59] Nicolas Boulanger-Lewandowski, Yoshua Bengio, Pascal Vincent, “Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription,” in arXiv preprint arXiv: 1206.6392, 2012.
[60] Luca Pasa, Alessandro Sperduti, “Pre-training of Recurrent Neural Networks via Linear Autoencoders,” in Advances in Neural Information Processing Systems (NIPS), pp. 3572–3580, 2014.
[61] Mike Schuster, Kuldip K Paliwal, “Bidirectional recurrent neural networks,” in IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[62] Alex Graves, Jürgen Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” in Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[63] Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, Jürgen Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, 2009.
[64] Alex Graves, "Generating sequences with recurrent neural networks," in arXiv:1308.0850, 2013.
[65] Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, “Show and tell: A neural image caption generator,” in arXiv:1411.4555, 2014.
[66] Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals, “Recurrent neural network regularization,” in arXiv:1409.2329, 2014.
[67] Ilya Sutskever, Oriol Vinyals, Quoc VV Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems (NIPS), pp. 3104–3112, 2014.
[68] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. D. Plumbley, “Detection and classification of acoustic scenes and events: An ieee aasp challenge,” in Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on. IEEE, 2013, pp. 1–4.
[69] B. E. Kingsbury, N. Morgan, and S. Greenberg, “Robust speech recognition using the modulation spectrogram,” Speech communication, vol. 25, no. 1, pp. 117–132, 1998.
[70] C. Nadeu, D. Macho, and J. Hernando, “Time and frequency filtering of filter-bank energies for robust hmm speech recognition,” Speech Communication, vol. 34, no. 1, pp. 93–114, 2001.
[71] S. Molau, M. Pitz, R. Schluter, and H. Ney, “Computing mel-frequency cepstral coefficients on the power spectrum,” in Acoustics, Speech, and Signal Processing, vol. 1, pp. 73–76, 2001.
[72] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[73] M. Valenti, A. Diment, G. Parascandolo, S. Squartini, and T. Virtanen, “DCASE 2016 acoustic scene classification using convolutional neural networks,” in Proc. Workshop Detection Classif. Acoust. Scenes Events, Sep. 2016, pp. 95–99.
[74] Y. Han and K. Lee, “Convolutional neural network with multiple-width frequency-delta data augmentation for acoustic scene classification,” DCASE 2016 Challenge, Tech. Rep., Sep. 2016.
[75] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, “CP-JKU submissions for DCASE-2016: A hybrid approach using binaural i-vectors and deep convolutional neural networks,” DCASE 2016 Challenge, Tech. Rep., Sep. 2016.
[76] S. Adavanne, G. Parascandolo, P. Pertilä, T. Heittola, and T. Virtanen, "Sound event detection in multichannel audio using spatial and harmonic features," IEEE Detection and Classification of Acoustic Scenes and Events Workshop, 2016.
[77] Huy Phan, Philipp Koch, Fabrice Katzberg, Marco Maass, Radoslaw Mazur, and Alfred Mertins, "Audio scene classification with deep recurrent neural networks," in arXiv:1703.04770v2, 2017.
[78] Yanmin Qian, Philip C Woodland, “Very deep convolutional neural networks for robust speech recognition” in arXiv:1610.00277v1, 2016.
[79] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015.
[80] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R., “Dropout: a simple way to prevent neural networks from overfitting,” Machine Learning Res. 15, 1929–1958 (2014).
[81] http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-acoustic-scene-classification.
[82] http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-rare-sound-event-detection.
[83] http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-sound-event-detection-in-real-life-audio.
Advisor  Jia-Ching Wang (王家慶)    Date of Approval  2017-7-28