The Application of Deep Learning in Microphone Array Single-Source Tracking Systems

Name

彭冠銘 (Kuan-Ming PENG)

Department

Department of Electrical Engineering

Thesis Title

The Application of Deep Learning in Microphone Array Single-Source Tracking Systems
(深度學習應用於麥克風陣列單聲源追蹤系統)

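Related Theses

★ A study of independent component analysis for sound signal separation in real environments
★ Segmentation and 3D gray-level interpolation of oral magnetic resonance images
★ Design of a digital asthma peak expiratory flow monitoring system
★ Effect of combining a cochlear implant with a hearing aid on Mandarin speech recognition rates
★ Simulation of Mandarin speech recognition performance with the advanced combination encoder strategy for cochlear implants: analysis of combined hearing aid use
★ A functional MRI study of the neural correlates of Mandarin speech production
★ Constructing a 3D mechanical model of the tongue using the finite element method
★ Construction of an MRI-based 3D tongue atlas
★ A simulation study of the relationship between calcium oxalate concentration changes in renal tubules and calcium oxalate stones
★ Automatic segmentation of tongue structures in oral MR images
★ A study of electrical matching of microwave output windows
★ Development of a software-based hearing aid simulation platform: noise cancellation
★ Development of a software-based hearing aid simulation platform: feedback cancellation
★ Simulated effects of cochlear implant channel number, stimulation rate, and binaural hearing on Mandarin speech recognition in noisy environments
★ Using neural networks to study the neural correlates of Mandarin tone production
★ Construction of computer-simulated physiological systems for teaching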

Files

Full text can be browsed in the system after 2027-9-1.

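Abstract (Chinese)

Since the COVID-19 pandemic, demand for remote video conferencing has risen sharply, and with it demand for remote video products. As technology has advanced, new products that assist meetings have continued to appear, making remote meetings more efficient. Whether a meeting is being recorded or held remotely, participants often have to check that the speaker is within the camera's field of view, which lowers meeting efficiency. Applying a sound source tracking system to modern meeting scenarios could therefore improve meeting quality and efficiency.
This study uses a microphone array together with a camera to build a sound source tracking device suited to meeting scenarios. Using Python and the LOCATA dataset of real recordings, various microphone array geometries were analyzed with respect to localization accuracy, required computation time, and compact size, and an octahedral microphone array was finally selected. The array is combined with three classical sound source localization algorithms, Minimum Power Distortionless Response (MPDR), Steered Response Power Phase Transform (SRP-PHAT), and Multiple Signal Classification (MUSIC), and with three deep-learning-based localization algorithms, Cross3D, IcoDOA, and Neural-SRP, which retain good performance in reverberant and noisy conditions.
The study simulates and analyzes indoor reverberation and noise, two conditions unfavorable to sound source localization, and also takes into account that real-time tracking requires an algorithm fast enough not to introduce delay. IcoDOA and Neural-SRP both keep the localization error within 10 degrees for SNR from 5 dB to 30 dB and for RT60 from 0.2 s to 1 s, but IcoDOA has the best per-frame computation time, averaging only 2.067 ms per frame.
The final system therefore uses the octahedral microphone array with the IcoDOA algorithm. In a simulated meeting scenario with a single source playing a speech signal, the source stays within the camera frame 91.11% of the time; when the source plays music, it stays within the frame 87.77% of the time.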

Abstract (English)

Since the outbreak of the COVID-19 pandemic, the demand for remote video conferencing has surged, driving up the need for various remote video products. With technological advancements, numerous auxiliary products have been continuously introduced to make remote meetings more efficient. Whether a meeting is being recorded or held remotely, participants often need to confirm that the speaker is within the camera’s frame, which lowers meeting efficiency. Applying a sound source tracking system to modern meeting scenarios can improve the quality and efficiency of meetings.
This study pairs a microphone array with a camera to build a sound source tracking device suited to meeting scenarios. Using Python and the LOCATA dataset, which was recorded under real-world conditions, various microphone array geometries were analyzed for localization accuracy, computation time, and compactness, and an octahedral microphone array was selected. The array was combined with three commonly used sound source localization algorithms, Minimum Power Distortionless Response (MPDR), Steered Response Power Phase Transform (SRP-PHAT), and Multiple Signal Classification (MUSIC), and with three deep-learning-based localization algorithms that maintain good performance in reverberant and noisy environments: Cross3D, IcoDOA, and Neural-SRP.
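As an illustration of the classical family named above, SRP-PHAT is built from PHAT-weighted cross-correlations between microphone pairs. The following is a minimal sketch of that pairwise building block (GCC-PHAT delay estimation) in NumPy. It is not the thesis's implementation; the function name `gcc_phat` and its parameters are hypothetical and chosen only for this example.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (seconds) of `sig` relative to `ref` with GCC-PHAT.

    Hypothetical helper for illustration only; `fs` is the sampling rate in Hz
    and `max_tau` optionally limits the search to |delay| <= max_tau seconds.
    """
    n = len(sig) + len(ref)                 # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)              # cross-power spectrum
    cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)           # generalized cross-correlation

    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index 0 corresponds to lag -max_shift
    # and index max_shift corresponds to lag 0.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs
```

A full SRP-PHAT search would, for each candidate direction, sum these pairwise correlations at the lags that direction implies for the array geometry and pick the direction with the highest total. A call such as `gcc_phat(mic2, mic1, fs=16000, max_tau=0.001)` returns the estimated arrival-time difference between two channels.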
The study simulates and analyzes indoor reverberation and noise, two conditions that are unfavorable for sound source localization. It also considers the need for fast computation to avoid delays in real-time sound source tracking. Both IcoDOA and Neural-SRP kept localization errors within 10 degrees for signal-to-noise ratios (SNR) from 5 dB to 30 dB and for reverberation times (RT60) from 0.2 s to 1 s. IcoDOA, however, had the shortest computation time, averaging only 2.067 milliseconds per frame.
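The localization errors above are angular. The sketch below shows one common way to compute a root-mean-square angular error between estimated and reference directions of arrival, assuming directions are given as azimuth and elevation in degrees. The exact definition used in the thesis is not given in this record, so the function names and the conventions here are assumptions made for illustration.

```python
import numpy as np

def sph_to_unit(azimuth_deg, elevation_deg):
    """Convert azimuth/elevation in degrees to unit vectors of shape (..., 3)."""
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    return np.stack(
        (np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)),
        axis=-1,
    )

def angular_error_deg(est, ref):
    """Angle in degrees between matching unit vectors in `est` and `ref`."""
    cos_angle = np.clip(np.sum(est * ref, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))

def rms_angular_error_deg(est, ref):
    """Root-mean-square of the per-frame angular errors, in degrees."""
    err = angular_error_deg(est, ref)
    return float(np.sqrt(np.mean(err ** 2)))
```

Measuring the angle between unit vectors avoids the wrap-around problem that arises when azimuths such as 359 degrees and 1 degree are subtracted directly.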
With the octahedral microphone array and the IcoDOA algorithm, the final system kept the sound source within the camera’s field of view 91.11% of the time in a simulated meeting scenario with a single source playing a speech signal, and 87.77% of the time when the source played music.
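A figure such as "within the camera's field of view 91.11% of the time" can be read as the fraction of frames in which the source direction falls inside the camera's viewing angle. The sketch below computes such a ratio for the horizontal plane only. The record does not state how the thesis measured it, so the function names, the per-frame azimuth inputs, and the `fov_deg` value are all hypothetical.

```python
import numpy as np

def wrap_deg(angle_deg):
    """Wrap angles in degrees to the interval [-180, 180)."""
    return (np.asarray(angle_deg, dtype=float) + 180.0) % 360.0 - 180.0

def in_view_ratio(source_az_deg, camera_az_deg, fov_deg):
    """Fraction of frames where the source lies inside the horizontal field of view.

    source_az_deg: true source azimuth per frame (degrees)
    camera_az_deg: camera pointing azimuth per frame (degrees)
    fov_deg: full horizontal field of view of the camera (assumed value)
    """
    offset = wrap_deg(np.asarray(source_az_deg) - np.asarray(camera_az_deg))
    return float(np.mean(np.abs(offset) <= fov_deg / 2.0))
```

For example, `in_view_ratio(source_az, camera_az, fov_deg=60.0)` counts a frame as "in view" when the source is at most 30 degrees off the camera axis.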

Keywords (Chinese)

★ Microphone array
★ Sound source tracking

Keywords (English)

Table of Contents

Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Research Motivation
1.2 Literature Review
1.2.1 Development of Sound Source Localization Techniques
1.2.2 Deep Learning for Sound Source Localization
1.3 Research Objectives
1.4 Thesis Organization
Chapter 2 Sound Source Localization Theory
2.1 Microphone Array Signal Processing Model
2.2 Sound Source Localization Algorithms
2.2.1 Minimum Power Distortionless Response
2.2.2 Steered Response Power Phase Transform
2.2.3 Multiple Signal Classification
2.3 Deep-Learning-Based Sound Source Localization Algorithms
2.3.1 Cross3D Algorithm
2.3.2 IcoDOA Algorithm
2.3.3 Neural-SRP Algorithm
2.3.4 Conclusion
Chapter 3 Research Methods
3.1 Datasets
3.1.1 Simulated Dataset
3.1.2 LOCATA Dataset
3.2 Model Training Environment and Parameters
3.3 Microphone Array
3.4 Hardware
3.4.1 MEMS Microphone
3.4.2 Microphone Array
3.4.3 Signal Acquisition Device
3.4.4 Servo Motor
3.4.5 Camera
3.4.6 System Architecture
3.4.7 Conclusion
Chapter 4 Experimental Results
4.1 Evaluation Metrics and Parameters
4.1.1 Signal-to-Noise Ratio
4.1.2 Reverberation Time
4.1.3 Root Mean Square Angular Error
4.2 Experiment 1: Comparison of Algorithms for Static Sound Source Localization
4.3 Experiment 2: Comparison of Algorithms for Dynamic Sound Source Tracking
4.4 Experiment 3: Comparison of Computation Time
4.5 Experiment 4: Real-Time Sound Source Tracking
4.6 Conclusion
Chapter 5 Results and Future Work
5.1 Conclusion
5.2 Future Work
References

References

Adavanne, S., Politis, A., Nikunen, J., & Virtanen, T. (2019). Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1), 34-48.
Adavanne, S., Politis, A., & Virtanen, T. (2018). Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network. In 2018 26th European Signal Processing Conference (EUSIPCO) (pp. 1462-1466).
Bai, M. R., Kung, F.-J., & Tao, C.-S. (2022). Tracking of moving sources in a reverberant environment using evolutionary algorithms. IEEE Access, 10, 107563-107574.
Diaz-Guerra, D., Miguel, A., & Beltran, J. R. (2021a). gpuRIR: A Python library for room impulse response simulation with GPU acceleration. Multimedia Tools and Applications, 80(4), 5653-5671.
Diaz-Guerra, D., Miguel, A., & Beltran, J. R. (2021b). Robust sound source tracking using SRP-PHAT and 3D convolutional neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 300-311.
Diaz-Guerra, D., Miguel, A., & Beltran, J. R. (2023). Direction of arrival estimation of sound sources using icosahedral CNNs. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, 313-321.
Grinstein, E., Hicks, C. M., van Waterschoot, T., Brookes, M., & Naylor, P. A. (2024). The Neural-SRP method for universal robust multi-source tracking. IEEE Open Journal of Signal Processing, 5, 19-28.
Jang, Y., Kim, J., & Kim, J. (2015). The development of the vehicle sound source localization system. In 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) (pp. 1241-1244).
Knapp, C., & Carter, G. (1976). The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4), 320-327.
Löllmann, H. W., Evers, C., Schmidt, A., Mellmann, H., Barfuss, H., Naylor, P. A., & Kellermann, W. (2018). The LOCATA challenge data corpus for acoustic source localization and tracking. In 2018 IEEE 10th Sensor Array and Multichannel Signal Processing Workshop (SAM) (pp. 410-414).
Ribeiro, F., Zhang, C., Florêncio, D. A., & Ba, D. E. (2010). Using reverberation to improve range and elevation discrimination for small array sound source localization. IEEE Transactions on Audio, Speech, and Language Processing, 18(7), 1781-1792.
Roy, R., Paulraj, A., & Kailath, T. (1986). Estimation of signal parameters via rotational invariance techniques - ESPRIT. In MILCOM 1986 - IEEE Military Communications Conference: Communications-Computers: Teamed for the 90’s (Vol. 3, pp. 41.6.1-41.6.5).
Schmidt, R. (1986). Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 34(3), 276-280.
Sun, H., Yang, P., Zu, L., & Xu, Q. (2011). A far field sound source localization system for rescue robot. In 2011 International Conference on Control, Automation and Systems Engineering (CASE) (pp. 1-4).
Varanasi, V., Gupta, H., & Hegde, R. M. (2020). A deep learning framework for robust DOA estimation using spherical harmonic decomposition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 1248-1259.
Velázquez, I. M., Ren, Y., Haneda, Y., & Meana, H. M. P. (2021). A fusion method based on class rotations for DNN-DOA estimation on spherical microphone array. In 2021 29th European Signal Processing Conference (EUSIPCO) (pp. 885-889).
Wang, L., & Cavallaro, A. (2022). Deep-learning-assisted sound source localization from a flying drone. IEEE Sensors Journal, 22(21), 20828-20838.
Wang, Z., Zou, W., Su, H., Guo, Y., & Li, D. (2023). Multiple sound source localization exploiting robot motion and approaching control. IEEE Transactions on Instrumentation and Measurement, 72, 1-16.
Yin, J., & Verhelst, M. (2023). CNN-based robust sound source localization with SRP-PHAT for the extreme edge. ACM Transactions on Embedded Computing Systems, 22(3), 1-27.

Advisor

吳炤民 (Chao-Min Wu)

Approval Date

2024-7-25
