Asymmetric Kernel Convolutional Neural Network for Acoustic Scenes Classification

Name

Yu-Chi Wu (伍聿旂)

Department

Department of Communication Engineering

Thesis title

Asymmetric Kernel Convolutional Neural Network for Acoustic Scenes Classification
(非對稱摺積神經網路之聲音場景分類)

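Access

This electronic thesis is authorized for immediate open access.
The open-access full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.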

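Abstract (Chinese)

As people pursue convenience, we use computers to learn and understand things familiar to humans, and we hope that computers can recognize their own environment by analyzing sound. The IEEE Audio and Acoustic Signal Processing (AASP) challenge on Detection and Classification of Acoustic Scenes and Events (DCASE), first held in 2013, sparked a wave of interest in acoustic scene classification (ASC) and was the first step toward unified ASC datasets and evaluation methods. The second edition, DCASE2016, followed in 2016.
This thesis uses a convolutional neural network (CNN), a deep learning method, for ASC. The input to the CNN is a spectrogram, which carries both time-domain and frequency-domain information. We therefore assume that the data vary by different amounts along the time and frequency axes and use an elongated convolution kernel, the asymmetric kernel proposed in this thesis (as opposed to the conventional square, symmetric kernel), together with normalization during training to speed up training. Although the current trend is toward wider and deeper networks for better classification, we find that the proposed architecture, a two-layer CNN without pre-training, already outperforms the fifth-place entry of DCASE2016.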
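The asymmetric-kernel idea can be illustrated with a minimal pure-Python sketch. It assumes a toy spectrogram laid out with frequency bins as rows and time frames as columns. The kernel sizes (5 x 1, 1 x 5, 3 x 3) and the helper name `conv2d_valid` are hypothetical choices for illustration only; the abstract does not state the thesis's actual kernel dimensions or the axis along which the kernel is elongated.

```python
def conv2d_valid(spectrogram, kernel):
    """Slide `kernel` over `spectrogram` with stride 1 and no padding."""
    rows, cols = len(spectrogram), len(spectrogram[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(rows - k_rows + 1):
        out_row = []
        for c in range(cols - k_cols + 1):
            acc = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    acc += spectrogram[r + i][c + j] * kernel[i][j]
            out_row.append(acc)
        out.append(out_row)
    return out


# Hypothetical toy input: 6 frequency bins (rows) x 8 time frames (columns).
spectrogram = [[float(f + t) for t in range(8)] for f in range(6)]

# Averaging kernels; the sizes are illustrative, not the thesis's values.
tall_kernel = [[1.0 / 5.0] for _ in range(5)]        # 5 x 1: spans frequency only
wide_kernel = [[1.0 / 5.0] * 5]                      # 1 x 5: spans time only
square_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]  # 3 x 3: conventional square

for name, k in [("tall", tall_kernel), ("wide", wide_kernel), ("square", square_kernel)]:
    out = conv2d_valid(spectrogram, k)
    print(name, "output shape:", (len(out), len(out[0])))
```

With the 6 x 8 toy input, the tall kernel produces a 2 x 8 feature map, the wide kernel a 6 x 4 map, and the square kernel a 4 x 6 map. A rectangular kernel thus covers more context along one axis than the other, which is the property the asymmetric kernel relies on.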

Abstract (English)

The Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge has been held three times. The first DCASE Challenge was held in 2013, and DCASE2016 was the second. The reason IEEE Audio and Acoustic Signal Processing (AASP) held the second challenge three years later was to introduce a brand-new dataset and to unify the rules for acoustic scene classification (ASC).
In this work, we use the ASC dataset from DCASE2016 and propose an Asymmetric Kernel Convolutional Neural Network (AKCNN), whose kernel shape differs from the traditional square kernel. The width and height of the kernel are unequal, that is, the kernel is rectangular. The proposed network also uses weight normalization (WN) to shorten training, because WN makes the training loss and testing accuracy converge earlier. Best of all, WN also helps increase ASC accuracy. The results show that AKCNN achieves an accuracy of 86.7%. Ranked against the entries in the DCASE2016 ASC Challenge, this score is better than that of the fifth-place entry.
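Weight normalization, as introduced by Salimans and Kingma (2016), reparameterizes a weight vector as w = g * v / ||v||, so that the length of w is controlled by the scalar g and its direction by v. The following is a minimal sketch of that reparameterization only. The function name `weight_norm` and the numeric values are hypothetical, and it does not show how the thesis applies WN inside its CNN.

```python
import math


def weight_norm(v, g):
    """Return w = g * v / ||v||, so that the length of w equals |g|."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


v = [3.0, 4.0]  # hypothetical direction parameters, ||v|| = 5
g = 2.0         # hypothetical scale parameter
w = weight_norm(v, g)
print("w =", w)
print("||w|| =", math.sqrt(sum(x * x for x in w)))

# Rescaling v leaves w unchanged: only the direction of v matters.
w_scaled = weight_norm([10.0 * x for x in v], g)
print("w from rescaled v =", w_scaled)
```

Because the scale and the direction are learned as separate parameters, the length of w stays fixed by g however large v grows, which is the decoupling the reparameterization is designed to provide.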

Keywords (Chinese)

★ Computational auditory scene analysis
★ Acoustic scene classification
★ Deep learning
★ Convolutional neural network

Keywords (English)

★ Computational Auditory Scene Analysis
★ Acoustic scenes classification
★ Deep learning
★ Convolutional neural network

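Table of Contents

Abstract (Chinese)
Abstract
Acknowledgments
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1-1 Research Motivation and Background
1-2 Thesis Organization
Chapter 2 Acoustic Scene Classification
2-1 History of Acoustic Scene Classification
2-1-1 The 2013 Detection and Classification of Acoustic Scenes and Events Challenge
2-1-2 The 2016 and 2017 Detection and Classification of Acoustic Scenes and Events Challenges
2-2 Features for Acoustic Scene Classification
2-2-1 Log-Mel Spectrogram
2-2-2 Mel-Frequency Cepstral Coefficients
Chapter 3 Neural Networks and Deep Learning
3-1 Artificial Neural Networks
3-1-1 History of Artificial Neural Networks
3-1-2 Backpropagation Algorithm
3-2 Deep Learning
3-2-1 Deep Neural Networks
3-2-2 Convolutional Neural Networks
3-3 Normalization for Accelerating Training
3-3-1 Batch Normalization
3-3-2 Weight Normalization
Chapter 4 Proposed Architecture
4-1 Data Preprocessing
4-1-1 Feature Extraction
4-1-2 Data Normalization
4-1-3 Data Segmentation and Stacking
4-2 Convolutional Neural Network Architecture
4-2-1 Training Phase
4-2-2 Testing Phase
Chapter 5 Experiments and Analysis
5-1 Experimental Environment and Dataset
5-2 Parameter Selection Experiments
5-3 Comparison and Analysis of Experimental Results
Chapter 6 Conclusion and Future Work
References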

Advisor

Pao-Chi Chang (張寶基)

Approval date

2017-7-26
