Thesis 105525002: Detailed Record




Author: Yu-Kai Lin (林郁凱)    Graduate Program: Graduate Institute of Software Engineering
Thesis Title: The Applications and Improvements of Deep Neural Networks in Environmental Sound Recognition (深度類神經網路於環境音偵測之應用與改良)
Related Theses
★ A Q-Learning-Based Swarm Intelligence Algorithm and Its Applications
★ Development of Rehabilitation Systems for Children with Developmental Delays
★ Comparing Teacher Assessment and Peer Assessment from the Perspective of Cognitive Style: From English Writing to Game Making
★ A Diabetic Nephropathy Prediction Model Based on Laboratory Test Values
★ A Neural-Network-Based White Blood Cell Classification System
★ Design of a Fuzzy-Neural-Network-Based Classifier for Remote Sensing Images
★ A Hybrid Clustering Algorithm
★ Development of Assistive Devices for People with Disabilities
★ A Study of Fingerprint Classifiers
★ A Study of Backlit Image Compensation and Color Quantization
★ An Application of Neural Networks to Business Income Tax Audit Case Selection
★ A New Online Learning System and Its Application to Tax Audit Case Selection
★ An Eye-Tracking System and Its Applications to Human-Computer Interfaces
★ Data Visualization Combining Swarm Intelligence and Self-Organizing Maps
★ Development of a Pupil-Tracking System for Human-Computer Interface Applications for People with Disabilities
★ An Artificial-Immune-System-Based Online Learning Neuro-Fuzzy System and Its Applications
  1. The electronic full text of this thesis is authorized for immediate open access.
  2. The open-access electronic full text is authorized only for individual, non-profit retrieval, reading, and printing for purposes of academic research.
  3. Please comply with the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast the work without authorization.

Abstract (Chinese): Neural networks have achieved excellent results in sound recognition, and many different acoustic features have been tried as network inputs for training and recognition. However, testing whether a neural network can extract acoustic features on its own when given the raw audio signal as input remains a challenge. This thesis improves on existing raw-signal network architectures: using deeper networks, it successfully improves the analysis of raw signal inputs, and it investigates appropriate parameter settings through a spectrogram-like conversion. The proposed 1d-2d network reaches an accuracy of 73.55% on ESC-50.
In addition, this thesis proposes a feature-fusion network architecture that uses the properties of global pooling layers to form a more flexible way of combining networks. With this approach, it successfully combines two networks, one taking the raw signal and the other taking log-mel spectrogram coefficients as input. With these inputs, our proposed ParallelNet achieves 81.55% recognition accuracy on ESC-50, reaching human-level performance.
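The abstract mentions log-mel spectrogram coefficients as one of the two network inputs. For reference, a minimal sketch of extracting such a feature with librosa follows; the sampling rate, FFT size, hop length, and number of mel bands are illustrative assumptions, not the settings reported in the thesis.

```python
# A minimal sketch of computing a log-mel spectrogram input with librosa.
# The parameter values below are illustrative assumptions, not the thesis's.
import librosa
import numpy as np

def log_mel(path: str, sr: int = 44100, n_mels: int = 128) -> np.ndarray:
    """Return a log-scaled mel spectrogram of shape (n_mels, frames)."""
    y, sr = librosa.load(path, sr=sr)             # load and resample the clip
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=n_mels
    )
    return librosa.power_to_db(mel)               # log compression (dB scale)
```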
Abstract (English): Neural networks have achieved excellent results in sound recognition, and many kinds of acoustic features have been tried as training inputs. However, whether a neural network can efficiently extract features directly from the raw audio signal remains in doubt. This study improves on the raw-signal-input networks of previous research: with deeper architectures the raw signal is analyzed more effectively, and we examine several kinds of network settings. With a spectrogram-like conversion, our network reaches an accuracy of 73.55% on the open audio dataset ESC-50.
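The "spectrogram-like conversion" can be made concrete with a short sketch. The following PyTorch code is a minimal illustration under assumed layer sizes: a stack of strided 1D convolutions maps the raw waveform to a channel-by-time representation, which is then reshaped and processed by 2D convolutions as if it were a spectrogram. It is not the thesis's exact 1d-2d network configuration.

```python
# A hedged PyTorch sketch of the 1D-to-2D idea; all layer sizes are
# placeholder assumptions, not the configuration reported in the thesis.
import torch
import torch.nn as nn

class RawSignal1d2dNet(nn.Module):
    def __init__(self, n_classes: int = 50):     # ESC-50 has 50 classes
        super().__init__()
        # 1D stage: strided convolutions over the raw waveform produce a
        # (channels x time) map, a learned, spectrogram-like representation.
        self.front_end = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=2),
            nn.BatchNorm1d(32), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=16, stride=2),
            nn.BatchNorm1d(64), nn.ReLU(),
        )
        # 2D stage: treat that map as a one-channel image and analyze it
        # with ordinary 2D convolutions, as one would a spectrogram.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # global pooling -> (B, 64, 1, 1)
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, samples), e.g. a 5-second clip at 44.1 kHz
        z = self.front_end(x)                     # (B, 64, T')
        z = z.unsqueeze(1)                        # (B, 1, 64, T'): a 2D "image"
        z = self.backbone(z).flatten(1)           # (B, 64)
        return self.classifier(z)
```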
In addition, this study proposes a network architecture that can combine different networks fed with different features. With the help of global pooling, a flexible fusion scheme is integrated into the network. Our experiments successfully combine two networks that take different audio features as input: the raw audio signal and the log-mel spectrogram. With these settings, our proposed ParallelNet reaches an accuracy of 81.55% on ESC-50, which matches human recognition performance.
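The flexibility that global pooling provides can likewise be sketched in code: because each branch ends in a global pooling layer, branches with entirely different input shapes (a raw waveform and a log-mel spectrogram) reduce to fixed-length vectors that can simply be concatenated before the classifier. The branch definitions below are hypothetical stand-ins, not ParallelNet's actual layers.

```python
# A minimal sketch of global-pooling-based feature fusion; the branches are
# hypothetical placeholders, not the thesis's ParallelNet configuration.
import torch
import torch.nn as nn

def raw_branch() -> nn.Sequential:
    # 1D branch over the raw waveform, ending in global average pooling.
    return nn.Sequential(
        nn.Conv1d(1, 32, kernel_size=64, stride=4), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1),                  # -> (B, 32, 1)
    )

def mel_branch() -> nn.Sequential:
    # 2D branch over the log-mel spectrogram, ending in global average pooling.
    return nn.Sequential(
        nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),                  # -> (B, 64, 1, 1)
    )

class ParallelFusion(nn.Module):
    """Concatenate globally pooled branch outputs, then classify."""
    def __init__(self, n_classes: int = 50):      # ESC-50 has 50 classes
        super().__init__()
        self.raw, self.mel = raw_branch(), mel_branch()
        self.classifier = nn.Linear(32 + 64, n_classes)

    def forward(self, wav: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
        a = self.raw(wav).flatten(1)              # (B, 32), any clip length
        b = self.mel(mel).flatten(1)              # (B, 64), any spectrogram size
        return self.classifier(torch.cat([a, b], dim=1))

# Example shapes: a 5 s clip at 44.1 kHz and a 128-band log-mel spectrogram.
# logits = ParallelFusion()(torch.randn(8, 1, 220500), torch.randn(8, 1, 128, 431))
```

Any pair of backbones whose outputs pass through global pooling can be fused this way regardless of their input formats, which is what makes the combination scheme flexible.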
Keywords (Chinese): ★ Deep Neural Networks (深度神經網路)
★ Convolutional Neural Networks (卷積神經網路)
★ Environmental Sound Recognition (環境音偵測)
★ Feature Fusion (特徵融合)
Keywords (English): ★ Deep Neural Network
★ Convolutional Neural Network
★ Environmental Sound Recognition
★ Feature Combination
Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Content
List of Figures
List of Tables
1. Introduction
2. Background
2.1 Related Works of Environmental Sound Recognition
2.2 Review of Neural Networks
2.2.1 Feed-Forward Neural Networks
2.2.2 Convolutional Neural Networks
2.2.3 Convolutional Layers
2.2.4 Activation Layers
2.2.5 Pooling Layers
2.2.6 Fully Connected Layers
2.2.7 Loss Functions
2.2.8 Model Initialization
2.2.9 Batch Normalization
3. Methods Development
3.1 Data Sets
3.2 Data Preprocessing
3.3 Data Augmentations
3.4 Network Customizations
3.4.1 Network Configuration
3.4.2 Network Parallelization
4. Results and Discussion
4.1 Experiment Setup
4.2 The Architecture of the 1D Network
4.2.1 Frame Size
4.2.2 Network Depth
4.2.3 Number of Filters
4.3 The Architecture of the 2D Network
4.3.1 Kernel Shapes
4.4 The Parallel Network
4.4.1 Data Augmentation
4.4.2 The Effect of Pre-training
4.5 Network Conclusion
5. Conclusion and Perspectives
Bibliography
References
[1] J. Chen, A. H. Kam, J. Zhang, N. Liu and L. Shue, "Bathroom activity monitoring based on sound," in International Conference on Pervasive Computing, 2005.
[2] F. Weninger and B. Schuller, "Audio recognition in the wild: Static and dynamic classification on a real-world database of animal vocalizations," in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference, 2011.
[3] C. Clavel, T. Ehrette and G. Richard, "Events detection for an audio-based surveillance system," in Multimedia and Expo, 2005. ICME 2005. IEEE International Conference, 2005.
[4] M. Bugalho, J. Portelo, I. Trancoso, T. Pellegrini and A. Abad, "Detecting audio events for semantic video search," in Tenth Annual Conference of the International Speech Communication Association, 2009.
[5] A.-r. Mohamed, G. Hinton and G. Penn, "Understanding how deep belief networks perform acoustic modelling," in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference, 2012.
[6] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson and O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[7] H. Lee, P. Pham, Y. Largman and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Advances in neural information processing systems, 2009.
[8] A. Van den Oord, S. Dieleman and B. Schrauwen, "Deep content-based music recommendation," in Advances in neural information processing systems, 2013.
[9] V. Peltonen, J. Tuomi, A. Klapuri, J. Huopaniemi and T. Sorsa, "Computational auditory scene recognition," in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference, 2002.
[10] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[11] Y.-T. Peng, C.-Y. Lin, M.-T. Sun and K.-C. Tsai, "Healthcare audio event classification using hidden Markov models and hierarchical hidden Markov models," in Multimedia and Expo, 2009. ICME 2009. IEEE International Conference, 2009.
[12] B. Elizalde, A. Kumar, A. Shah, R. Badlani, E. Vincent, B. Raj and I. Lane, "Experiments on the DCASE challenge 2016: Acoustic scene classification and sound event detection in real life recording," arXiv preprint arXiv:1607.06706, 2016.
[13] J.-C. Wang, J.-F. Wang, K. W. He and C.-S. Hsu, "Environmental sound classification using hybrid SVM/KNN classifier and MPEG-7 audio low-level descriptor," in Neural Networks, 2006. IJCNN '06. International Joint Conference, 2006.
[14] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012.
[15] K. J. Piczak, "Environmental sound classification with convolutional neural networks," in Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop, 2015.
[16] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange and M. D. Plumbley, "Detection and classification of acoustic scenes and events," IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733-1746, 2015.
[17] "DCASE 2017 Workshop," [Online]. Available: http://www.cs.tut.fi/sgn/arg/dcase2017/. [Accessed 30 June 2017].
[18] Y. Aytar, C. Vondrick and A. Torralba, "Soundnet: Learning sound representations from unlabeled video," Advances in Neural Information Processing Systems, pp. 892-900, 2016.
[19] W. Dai, C. Dai, S. Qu, J. Li and S. Das, "Very deep convolutional neural networks for raw waveforms," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference, 2017.
[20] M. Lin, Q. Chen and S. Yan, "Network in network," arXiv preprint arXiv:1312.4400, 2013.
[21] Y. Tokozume and T. Harada, "Learning environmental sounds with end-to-end convolutional neural network," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference, 2017.
[22] Y. Tokozume, Y. Ushiku and T. Harada, "Learning from Between-class Examples for Deep Sound Recognition," in ICLR 2018 Conference, 2018.
[23] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, pp. 386-408, 1958.
[24] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, p. 533, 1986.
[25] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[26] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot and others, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[27] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010.
[28] K. He, X. Zhang, S. Ren and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in Proceedings of the IEEE international conference on computer vision, 2015.
[29] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[30] K. J. Piczak, "ESC: Dataset for environmental sound classification," in Proceedings of the 23rd ACM international conference on Multimedia, 2015.
[31] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference, 2009.
[32] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European conference on computer vision, 2014.
[33] J. Salamon and J. P. Bello, "Deep convolutional neural networks and data augmentation for environmental sound classification," IEEE Signal Processing Letters, vol. 24, no. 3, pp. 279-283, 2017.
[34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[35] Y. Nesterov, "Gradient methods for minimizing composite functions," CORE Discussion Paper, 2007.
[36] V. Boddapati, A. Petef, J. Rasmusson and L. Lundberg, "Classifying environmental sounds using image recognition networks," Procedia Computer Science, vol. 112, pp. 2048-2056, 2017.
[37] K. Simonyan, A. Vedaldi and A. Zisserman, "Deep inside convolutional networks: Visualising image classification models and saliency maps," arXiv preprint arXiv:1312.6034, 2013.
[38] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European conference on computer vision, 2014.
[39] J. Salamon, C. Jacoby and J. P. Bello, "A dataset and taxonomy for urban sound research," in Proceedings of the 22nd ACM international conference on Multimedia, 2014.
Advisor: Mu-Chun Su (蘇木春)    Date of Approval: 2018-08-23