Multi-scale aggregation of phase information for reducing computational cost of CNN based DOA estimation
20 Nov 2018
TL;DR: The authors propose systematic dilations of the convolution filters in each CNN layer to expand the receptive field of the filters and thereby reduce the computational cost of CNN-based DOA estimation.
Abstract: In a recent work on direction-of-arrival (DOA) estimation of multiple speakers with convolutional neural networks (CNNs), the phase component of the short-time Fourier transform (STFT) coefficients of the microphone signals is given as input, and small filters are used to learn the phase relations between neighboring microphones. Due to this chosen filter size, $M-1$ convolution layers are required to achieve the best performance for a microphone array with $M$ microphones. For arrays with a large number of microphones, this requirement leads to a high computational cost, making the method practically infeasible. In this work, we propose to use systematic dilations of the convolution filters in each of the convolution layers of the previously proposed CNN to expand the receptive field of the filters and so reduce the computational cost of the method. Different strategies for expanding the receptive field of the filters for a specific microphone array are explored. Experimental analysis of the different strategies shows that an aggressive expansion strategy yields a considerable reduction in computational cost, while a relatively gradual expansion of the receptive field gives the best DOA estimation performance together with a reduction in computational cost.
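The layer-count argument in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a filter that spans 2 microphones along the array axis (which is consistent with the stated $M-1$ layers, but is not given in the abstract), stride 1, and no padding. The three dilation schedules and the names `receptive_field` and `layers_needed` are hypothetical and are not the strategies evaluated in the paper.

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in microphones) of stacked stride-1 convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def layers_needed(num_mics, schedule, kernel_size=2):
    """Count layers until the receptive field spans the whole array.

    schedule(i) returns the requested dilation of layer i (0-based).
    The dilation is clipped so the filter still fits inside the array.
    """
    dilations = []
    while receptive_field(dilations, kernel_size) < num_mics:
        remaining = num_mics - receptive_field(dilations, kernel_size)
        requested = schedule(len(dilations))
        dilations.append(max(1, min(requested, remaining // (kernel_size - 1))))
    return len(dilations)


M = 8  # hypothetical array size
no_dilation = layers_needed(M, lambda i: 1)      # dilation 1 everywhere
gradual = layers_needed(M, lambda i: i + 1)      # requested dilations 1, 2, 3, ...
aggressive = layers_needed(M, lambda i: 2 ** i)  # requested dilations 1, 2, 4, ...

print(no_dilation, gradual, aggressive)  # 7 4 3
```

Under these assumptions the undilated network needs $M-1 = 7$ layers for 8 microphones, while the dilated schedules reach the same receptive field in 4 and 3 layers. The sketch only counts layers; it says nothing about estimation accuracy.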
Citations
•
TL;DR: In this article, a survey on deep learning methods for single and multiple sound source localization is presented, where the authors provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects.
Abstract: This article is a survey on deep learning methods for single and multiple sound source localization. We are particularly interested in sound source localization in indoor/domestic environment, where reverberation and diffuse noise are present. We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. This way, an interested reader can easily comprehend the vast panorama of the deep learning-based sound source localization methods. Tables summarizing the literature survey are provided at the end of the paper for a quick search of methods with a given set of target characteristics.
37 citations
••
TL;DR: In this article, the authors show that training with speech or music signals produces a relative improvement in DoA accuracy for a variety of audio classes across 16 acoustic conditions and 9 DoAs, amounting to an average improvement of around 17% and 19% respectively when compared to training with spectrally flat random signals.
Abstract: Convolutional Neural Networks (CNNs) are a popular choice for estimating Direction of Arrival (DoA) without explicitly estimating delays between multiple microphones. The CNN method first optimises unknown filter weights (of a CNN) by using observations and ground-truth directional information. This trained CNN is then used to predict incident directions given test observations. Most existing methods train using spectrally-flat random signals and test using speech. In this paper, which focuses on single source DoA estimation, we find that training with speech or music signals produces a relative improvement in DoA accuracy for a variety of audio classes across 16 acoustic conditions and 9 DoAs, amounting to an average improvement of around 17% and 19% respectively when compared to training with spectrally flat random signals. This improvement is also observed in scenarios in which the speech and music signals are synthesised using, for example, a Generative Adversarial Network (GAN). When the acoustic environments during test and training are similar and reverberant, training a CNN with speech outperforms Generalized Cross Correlation (GCC) methods by about 125%. When the test conditions are different, a CNN performs comparably. This paper takes a step towards answering open questions in the literature regarding the nature of the signals used during training, as well as the amount of data required for estimating DoA using CNNs.
13 citations
•
01 Jan 2021
1 citation
••
16 May 2022
TL;DR: In this article, a feed-forward neural network is used to estimate the direction of arrival of moving sound sources, with Short Time Fourier Transform input features, and different hyperparameters are tested to determine which combination results in better direction-of-arrival detection for moving sources.
Abstract: Sound source localization is an important task for several applications and the use of deep learning for this task has recently become a popular research topic. While nearly all previous work has focused on static sound sources, in this paper we evaluate the performance of a deep learning classification system for localization of moving sound sources and we evaluate the effect of different hyperparameters and acoustic conditions. A feedforward neural network is used to estimate the direction of arrival of moving sound sources, with Short Time Fourier Transform input features. Diverse synthetic datasets are generated to represent different acoustic conditions, and hyperparameters are tested to determine which combination results in better direction-of-arrival detection for moving sources. We evaluate the performance of the different combinations in terms of precision and recall, in a multi-class multi-label classification framework, and we find that (1) the number of frequency bins and the reverberation time have a significant effect for localizing high-speed sources, and (2) precision and recall decay slowly at low speeds while dropping sharply at high speeds.
1 citation
••
TL;DR: In this paper, the authors evaluate the performance of a deep learning classification system for localization of moving sound sources and show that a temporal convolutional neural network can outperform both recurrent and feed-forward networks for moving sound source detection.
Abstract: Sound source localization is an important task for several applications and the use of deep learning for this task has recently become a popular research topic. While a number of previous works have focused on static sound sources, in this paper we evaluate the performance of a deep learning classification system for localization of moving sound sources. In particular, we evaluate the effect of key parameters at the levels of feature extraction (e.g., STFT parameters) and model training (e.g., neural network architectures). We evaluate the performance of different settings in terms of precision and F-score, in a multi-class multi-label classification framework. In our previous work on localization of moving sound sources, we investigated feedforward neural networks under different acoustic conditions and STFT parameters, and showed that the presence of some reverberation in the training dataset can help in achieving better detection of the direction of arrival of the sources. In this paper, we extend the work to show that (1) window size does not affect the performance for static sources but strongly affects the performance for moving sources, (2) sequence length has a significant effect on the performance of recurrent architectures, and (3) a temporal convolutional neural network can outperform both recurrent and feedforward networks for moving sound sources.
References
••
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviewing deep supervised learning, unsupervised learning, reinforcement learning and evolutionary computation, and indirect search for short programs encoding deep and large networks.
14,635 citations
•
30 Apr 2016
TL;DR: This work develops a new convolutional network module that is specifically designed for dense prediction, and shows that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems.
Abstract: State-of-the-art models for semantic segmentation are based on adaptations of convolutional networks that had originally been designed for image classification. However, dense prediction and image classification are structurally different. In this work, we develop a new convolutional network module that is specifically designed for dense prediction. The presented module uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution. The architecture is based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage. We show that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems. In addition, we examine the adaptation of image classification networks to dense prediction and show that simplifying the adapted network can increase accuracy.
5,566 citations
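The exponential expansion of the receptive field described in the abstract above can be illustrated numerically. This is a minimal 1-D sketch, assuming 3-tap all-ones filters, zero padding, and dilations 1, 2, 4. The helper `dilated_conv1d` is a hypothetical name written for this example and is not taken from the referenced work. An impulse is pushed through the stack, and the width of the non-zero response after each layer equals the receptive field at that depth.

```python
import numpy as np


def dilated_conv1d(x, w, dilation):
    """Same-size 1-D dilated correlation with zero padding (odd-length w)."""
    k = len(w)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    return np.array([
        sum(w[j] * xp[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])


x = np.zeros(31)
x[15] = 1.0      # unit impulse in the middle of the signal
w = np.ones(3)   # all-ones filter so no taps cancel

spans = []
y = x
for d in (1, 2, 4):  # dilation doubles at every layer
    y = dilated_conv1d(y, w, d)
    spans.append(int(np.count_nonzero(y)))

print(spans)  # [3, 7, 15]
```

With the dilation doubling at each layer, the span grows as 3, 7, 15, roughly doubling per layer, while the output keeps the input length.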
••
19 Apr 2015
TL;DR: It is shown that acoustic models trained on LibriSpeech give lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Abstract: This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models. We show that acoustic models trained on LibriSpeech give lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself. We are also releasing Kaldi scripts that make it easy to build these systems.
4,770 citations
••
01 Sep 2018
TL;DR: In this paper, a deep neural network was proposed to estimate the directions of arrival (DOA) of multiple sound sources in anechoic, matched and unmatched reverberant conditions.
Abstract: This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision, and of generating an SPS with a high signal-to-noise ratio.
191 citations
••
20 Mar 2016
TL;DR: This paper describes sound source localization (SSL) based on deep neural networks (DNNs) using discriminative training and indicates that the method outperformed the naive DNN-based SSL by 20 points in terms of the block-level accuracy.
Abstract: This paper describes sound source localization (SSL) based on deep neural networks (DNNs) using discriminative training. A naive DNN for SSL can be configured as follows: the input is the frequency-domain feature used in other SSL methods, and the structure of the DNN is a fully-connected network using real numbers. The training fails because this network structure loses two important properties, i.e., the orthogonality of sub-bands and the intensity and time information stored in complex numbers. We solved these two problems by 1) integrating directional information at each sub-band hierarchically, and 2) designing a directional activator that can treat the complex numbers at each sub-band. Our experiments indicated that our method outperformed the naive DNN-based SSL by 20 points in terms of the block-level accuracy.
153 citations