Proceedings ArticleDOI

A neural network based algorithm for speaker localization in a multi-room environment

TL;DR
A Speaker Localization algorithm based on Neural Networks for multi-room domestic scenarios is proposed; it outperforms the reference algorithm, providing an average localization error (RMSE) of 525 mm against 1465 mm.
Abstract
A Speaker Localization algorithm based on Neural Networks for multi-room domestic scenarios is proposed in this paper. The approach is fully data-driven and employs a Neural Network fed by GCC-PHAT (Generalized Cross Correlation Phase Transform) Patterns, computed from the microphone signals, to determine the speaker position in the room under analysis. In particular, we deal with a multi-room case study, in which the acoustic scene of each room is influenced by sounds emitted in the other rooms. The algorithm is tested on the home-recorded DIRHA dataset, which provides multiple wall and ceiling microphone signals for each room. In particular, we focus on the speaker localization problem in two distinct neighbouring rooms. We assume the presence of an Oracle multi-room Voice Activity Detector (VAD) in our experiments. A three-stage optimization procedure has been adopted to find the best network configuration and GCC-PHAT Pattern combination. Moreover, an algorithm based on Time Difference of Arrival (TDOA), recently proposed in the literature for the addressed application context, has been considered as a term of comparison. As a result, the proposed algorithm outperforms the reference one, providing an average localization error, expressed in terms of RMSE, of 525 mm against 1465 mm. Finally, we also assess the algorithm's performance when a real VAD, recently proposed by some of the authors, is used. Even though a degradation of localization capability is registered (an average RMSE of 770 mm), a remarkable improvement over the state-of-the-art performance is still obtained.
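The input representation described in the abstract is the GCC-PHAT pattern computed for each microphone pair. The sketch below shows one common way to compute such a pattern with NumPy; the function name, the FFT length, and the number of lags retained around zero delay are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def gcc_phat(sig, ref, n_lags=51, eps=1e-12):
    """Minimal GCC-PHAT pattern for one microphone pair (illustrative sketch)."""
    n = sig.shape[0] + ref.shape[0]          # length of the linear cross-correlation
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + eps             # PHAT weighting: keep only phase information
    cc = np.fft.irfft(cross, n=n)
    half = n_lags // 2
    # Re-centre on zero lag and keep n_lags values around it
    return np.concatenate((cc[-half:], cc[:half + 1]))
```

Patterns like this, computed over several microphone pairs and concatenated, can then be used as the input vector of a neural network that regresses the speaker coordinates.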


Citations
Journal ArticleDOI

Deep Learning for Audio Signal Processing

TL;DR: Speech, music, and environmental sound processing are considered side by side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and the potential for cross-fertilization between areas.
Journal ArticleDOI

Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks

TL;DR: The proposed convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3-D) space is generic and applicable to any array structure, and is robust to unseen DOA values, reverberation, and low-SNR scenarios.
Journal ArticleDOI

Direction-of-Arrival Estimation Based on Deep Neural Networks With Robustness to Array Imperfections

TL;DR: A deep neural network framework is proposed to address the DOA estimation problem, so as to obtain good adaptation to array imperfections and enhanced generalization to unseen scenarios. Simulations show that the proposed method performs satisfactorily in both generalization and imperfection adaptation.
Journal ArticleDOI

Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained With Noise Signals

TL;DR: The proposed convolutional neural network based supervised learning method for estimating the direction of arrival (DOA) of multiple speakers is shown to adapt to unseen acoustic conditions and to be robust to unseen noise types.
Proceedings ArticleDOI

Direction of Arrival Estimation for Multiple Sound Sources Using Convolutional Recurrent Neural Network

TL;DR: In this paper, a deep neural network was proposed to estimate the directions of arrival (DOA) of multiple sound sources in anechoic, matched and unmatched reverberant conditions.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
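For context, the Adam update summarized above combines bias-corrected estimates of the first and second moments of the gradient. A minimal single-parameter sketch follows; the default hyperparameters match the paper, while the function and variable names are assumptions for illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment (uncentred variance) estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction of the moment estimates
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```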
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
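To illustrate the gated cell state that implements the "constant error carousel" mentioned above, a minimal LSTM cell step is sketched below. The forget gate shown here was added in work following the original 1997 paper, and all names and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. Shapes: W (4H, D), U (4H, H), b (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:H])                 # input gate
    f = sigmoid(z[H:2 * H])            # forget gate (later addition to the 1997 model)
    o = sigmoid(z[2 * H:3 * H])        # output gate
    g = np.tanh(z[3 * H:])             # candidate cell update
    c = f * c_prev + i * g             # cell state: the "constant error carousel"
    h = o * np.tanh(c)                 # hidden output
    return h, c
```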
Proceedings Article

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
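As a rough illustration of the normalization described above, a training-mode batch-normalization step over a mini-batch might look as follows. This NumPy sketch omits the running statistics used at inference time; names are assumptions.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta            # learnable scale and shift
```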
Journal ArticleDOI

Learning representations by back-propagating errors

TL;DR: Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector, which helps to represent important features of the task domain.
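The weight-adjustment idea in the summary above amounts to a gradient step on an error measure. The toy example below uses a single linear layer with a squared-error loss; the names and the learning rate are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def sgd_step(W, x, target, lr=0.01):
    """One gradient-descent weight update for a single linear layer."""
    y = W @ x                        # forward pass
    error = y - target               # actual output minus desired output
    grad_W = np.outer(error, x)      # gradient of 0.5 * ||y - target||**2 w.r.t. W
    return W - lr * grad_W           # adjust weights to reduce the error measure
```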