References
[1] R. Stiefelhagen, K. Bernardin, R. Bowers, R. T. Rose, M. Michel, and J. Garofolo, “The CLEAR 2007 Evaluation,” in Multimodal Technologies for Perception of Humans, R. Stiefelhagen, R. Bowers, and J. Fiscus, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 3–34.
[2] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange and M. D. Plumbley, “Detection and Classification of Acoustic Scenes and Events,” in IEEE Transactions on Multimedia, vol. 17, no. 10, pp. 1733-1746, Oct. 2015.
[3] D. Li and S. E. Levinson, “A linear phase unwrapping method for binaural sound source localization on a robot,” Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No.02CH37292), 2002, pp. 19-23 vol.1.
[4] S. Mischie and G. Gășpăresc, “On Using ReSpeaker Mic Array 2.0 for speech processing algorithms,” 2020 International Symposium on Electronics and Telecommunications (ISETC), 2020, pp. 1-4.
[5] M. Binelli, A. Venturi, A. Amendola, and A. Farina, “Experimental analysis of spatial properties of the sound field inside a car employing a spherical microphone array,” in Audio Eng. Soc. (AES) Conv., Audio Engineering Society, 2011.
[6] C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,” in IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, Aug. 1976.
[7] M. S. Brandstein and H. F. Silverman, “A robust method for speech signal time-delay estimation in reverberant rooms,” 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1997, pp. 375-378 vol.1.
[8] D. H. Johnson and D. E. Dudgeon, “Array Signal Processing: Concepts and Techniques,” Englewood Cliffs, NJ: Prentice-Hall, 1993.
[9] J. P. Burg, “Maximum entropy spectral analysis,” in Proceedings of the 37th Annual International Meeting, Society of Exploration Geophysicists, Oklahoma City, OK, USA, 31 October 1967.
[10] J. Capon, “High-resolution frequency-wavenumber spectrum analysis,” in Proceedings of the IEEE, vol. 57, no. 8, pp. 1408-1418, Aug. 1969.
[11] R. Schmidt, “Multiple emitter location and signal parameter estimation,” in IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, March 1986.
[12] K. Youssef, S. Argentieri and J. Zarader, “A learning-based approach to robust binaural sound localization,” 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2013, pp. 2927-2932.
[13] M. J. Aminoff, F. Boller, and D. F. Swaab, “The Human Auditory System: Fundamental Organization and Clinical Disorders,” 2015, Internet resource.
[14] X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng and H. Li, “A learning-based approach to direction of arrival estimation in noisy and reverberant environments,” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 2814-2818.
[15] R. Roden, N. Moritz, S. Gerlach, S. Weinzierl and S. Goetze, “On sound source localization of speech signals using deep neural networks,” Proc. Deutsche Jahrestagung Akustik (DAGA), pp. 1510-1513, 2015.
[16] D. Krause, A. Politis and K. Kowalczyk, “Feature Overview for Joint Modeling of Sound Event Detection and Localization Using a Microphone Array,” 2020 28th European Signal Processing Conference (EUSIPCO), 2021, pp. 31-35.
[17] S. Adavanne, A. Politis and T. Virtanen, “Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network,” Proc. European Signal Processing Conference (EUSIPCO), 2018.
[18] S. S. Mane, S. G. Mali and S. P. Mahajan, “Localization of Steady Sound Source and Direction Detection of Moving Sound Source using CNN,” 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2019, pp. 1-6.
[19] J. Rouat, “Computational auditory scene analysis: Principles, algorithms and applications (D. Wang and G. J. Brown, Eds.; 2006) [book review],” IEEE Trans. Neural Netw., vol. 19, no. 1, Jan. 2008.
[20] G.J. Zapata-Zapata, J.D. Arias-Londoño, J.F. Vargas-Bonilla and J.R. Orozco-Arroyave, “On-line signature verification using Gaussian Mixture Models and small-sample learning strategies,” Revista Facultad de Ingeniería Universidad de Antioquia, vol. 79, pp. 86-97, 2016.
[21] G. Xuan, W. Zhang and P. Chai, “EM algorithms of Gaussian Mixture Model and Hidden Markov Model,” Proc. 2001 Int. Conference on Image Processing (ICIP), vol. 1, pp. 145-148, 2001.
[22] X. Zhou, X. Zhuang, M. Liu, H. Tang, M. Hasegawa-Johnson, T. Huang, “HMM-based acoustic event detection with AdaBoost feature selection,” In Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007. Springer, Berlin, Germany; 2008:345-353.
[23] J. Vavrek, M. Pleva, J. Juhar, “Acoustic events detection with support vector machines,” In Electrical Engineering and Informatics, Proceedings of the Faculty of Electrical Engineering and Informatics of the Technical University of Košice, September 2010, Košice, pp. 796-801, ISBN 978-80-553-0460-1, 2010.
[24] J. Schröder, F. X. Nsabimana, J. Rennies, D. Hollosi and S. Goetze, “Automatic detection of relevant acoustic events in kindergarten noisy environments,” Proc. Deutsche Jahrestagung für Akustik, pp. 1525-1528, Mar. 2015.
[25] T. Heittola, A. Mesaros, A. Eronen and T. Virtanen, “Context-dependent sound event detection,” EURASIP J. Audio Speech Music Process., vol. 2013, 2013.
[26] E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen and T. Virtanen, “Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1291-1303, June 2017.
[27] Y. Li, M. Liu, K. Drossos and T. Virtanen, “Sound Event Detection Via Dilated Convolutional Recurrent Neural Networks,” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 286-290.
[28] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” Proc. of ICLR, pp. 1-13, 2016.
[29] P. S. Tan, K. M. Lim, C. P. Lee and C. H. Tan, “Acoustic Event Detection with MobileNet and 1D-Convolutional Neural Network,” 2020 IEEE 2nd International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), 2020, pp. 1-6.
[30] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[31] I. Aizenberg, N. Aizenberg, C. Butakov and E. Farberov, “Image recognition on the neural network based on multi-valued neurons,” Proceedings 15th International Conference on Pattern Recognition. ICPR-2000, 2000, pp. 989-992 vol.2.
[32] W. S. McCulloch and W. Pitts, “A Logical Calculus of the Ideas Immanent in Nervous Activity,” Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[33] K. Fukushima, “Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position,” Biological Cybernetics, vol. 36, pp. 193-202, 1980.
[34] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[35] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, pp. 533–536, 1986.
[36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[37] J. Chung, C. Gulcehre, K. Cho and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” Dec. 2014, [online] Available: http://arxiv.org/abs/1412.3555.
[38] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv preprint arXiv:1602.07360, 2016.
[39] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” CoRR, abs/1409.0473, 2014.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017.
[41] Y. Cao, T. Iqbal, Q. Kong, Y. Zhong, W. Wang, and M. D. Plumbley, “Event-Independent Network for Polyphonic Sound Event Localization and Detection,” DCASE 2020 Workshop, November 2020.
[42] A. Politis, S. Adavanne, and T. Virtanen, “A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection,” In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE2020). November 2020.
[43] A. Mesaros, S. Adavanne, A. Politis, T. Heittola, and T. Virtanen, “Joint measurement of localization and detection of sound events,” In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, Oct. 2019.
[44] K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778.