dc.description.abstract | Since the outbreak of the COVID-19 pandemic, demand for remote video conferencing has surged, driving up the need for remote video products. As the technology has advanced, numerous auxiliary products have been introduced to make remote meetings more efficient. One common problem is keeping the speaker within the camera’s frame: when the speaker moves out of view, both meeting recordings and live remote conferences become less effective. Applying a sound source tracking system to modern meeting scenarios can improve the quality and efficiency of meetings.
This study pairs a microphone array with a camera to build a sound source tracking device suited to meeting scenarios. Several microphone array geometries were compared in Python on the LOCATA dataset, which was recorded under real-world conditions, using localization accuracy, computation time, and compactness as criteria. An octahedral microphone array was selected. This array was then evaluated with three commonly used sound source localization algorithms, namely Minimum Power Distortionless Response (MPDR), Steered Response Power Phase Transform (SRP-PHAT), and Multiple Signal Classification (MUSIC), and with three deep learning-enhanced localization algorithms that maintain good performance in reverberant and noisy environments: Cross3D, IcoDOA, and Neural-SRP.
The study simulates and analyzes indoor reverberation and noise, both of which degrade sound source localization. It also considers computation speed, since slow algorithms introduce delay in real-time sound source tracking. Both IcoDOA and Neural-SRP kept localization errors within 10 degrees for signal-to-noise ratios (SNR) from 5 dB to 30 dB and reverberation times (RT60) from 0.2 s to 1 s. IcoDOA had the shortest computation time, averaging only 2.067 milliseconds per frame.
The final system therefore uses the octahedral microphone array with the IcoDOA algorithm. In a simulated real meeting scenario with a single sound source playing a speech signal, it keeps the source within the camera’s field of view 91.11% of the time. With a music source in the same kind of scenario, it keeps the source within the field of view 87.77% of the time. | en_US |