Multi-scale aggregation of phase information for reducing computational cost of CNN based DOA estimation

doi:10.23919/EUSIPCO.2019.8903176

Proceedings Article•DOI•

Multi-scale aggregation of phase information for reducing computational cost of CNN based DOA estimation

Soumitro Chakrabarty¹, Emanuel A. P. Habets¹•Institutions (1)

University of Erlangen-Nuremberg¹

20 Nov 2018

TL;DR: The authors propose systematic dilations of the convolution filters in each CNN layer to expand the filters' receptive field and thereby reduce the computational cost of CNN-based DOA estimation.

Abstract: In a recent work on direction-of-arrival (DOA) estimation of multiple speakers with convolutional neural networks (CNNs), the phase component of short-time Fourier transform (STFT) coefficients of the microphone signal is given as input and small filters are used to learn the phase relations between neighboring microphones. Due to this chosen filter size, $M-1$ convolution layers are required to achieve the best performance for a microphone array with $M$ microphones. For arrays with a large number of microphones, this requirement leads to a high computational cost, making the method practically infeasible. In this work, we propose to use systematic dilations of the convolution filters in each of the convolution layers of the previously proposed CNN to expand the receptive field of the filters and thereby reduce the computational cost of the method. Different strategies for expanding the receptive field of the filters for a specific microphone array are explored. Experimental analysis of the different strategies shows that an aggressive expansion strategy results in a considerable reduction in computational cost, while a relatively gradual expansion of the receptive field exhibits the best DOA estimation performance along with a reduction in computational cost.
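The layer count in the abstract can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a filter spanning two neighboring microphones (kernel size 2 along the microphone axis, stride 1), which is consistent with the stated $M-1$ layers but is not given in the abstract. The array size and the dilation schedules are hypothetical and are not the strategies evaluated in the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in microphones) of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def layers_without_dilation(num_mics, kernel_size=2):
    """Smallest number of undilated layers whose field spans the whole array."""
    layers = 0
    while receptive_field(kernel_size, [1] * layers) < num_mics:
        layers += 1
    return layers


M = 8  # hypothetical array size (assumption)

print(layers_without_dilation(M))        # 7, i.e. M - 1 layers
print(receptive_field(2, [1, 2, 4]))     # 8: an aggressive schedule, 3 layers
print(receptive_field(2, [1, 1, 2, 3]))  # 8: a more gradual schedule, 4 layers
```

Under these assumptions, both dilated schedules cover all eight microphones with fewer layers than the undilated network, which is the source of the cost reduction the abstract describes.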

Citations

Posted Content•

A Survey of Sound Source Localization with Deep Learning Methods.

Pierre-Amaury Grumiaux, Srdan Kitic, Laurent Girin, Alexandre Guérin¹•Institutions (1)

Grenoble Institute of Technology¹

08 Sep 2021-arXiv: Sound

TL;DR: This survey covers deep learning methods for single and multiple sound source localization and provides an exhaustive topography of the neural-based localization literature, organized according to several aspects.

Abstract: This article is a survey on deep learning methods for single and multiple sound source localization. We are particularly interested in sound source localization in indoor/domestic environment, where reverberation and diffuse noise are present. We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. This way, an interested reader can easily comprehend the vast panorama of the deep learning-based sound source localization methods. Tables summarizing the literature survey are provided at the end of the paper for a quick search of methods with a given set of target characteristics.

37 citations

Journal Article•DOI•

On Improved Training of CNN for Acoustic Source Localisation

Elizabeth Vargas¹, James R. Hopgood², Keith Brown¹, Kartic Subr²•Institutions (2)

Heriot-Watt University¹, University of Edinburgh²

08 Jan 2021-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: In this article, the authors show that training with speech or music signals produces a relative improvement in DoA accuracy for a variety of audio classes across 16 acoustic conditions and 9 DoAs, amounting to an average improvement of around 17% and 19% respectively when compared to training with spectrally flat random signals.

Abstract: Convolutional Neural Networks (CNNs) are a popular choice for estimating Direction of Arrival (DoA) without explicitly estimating delays between multiple microphones. The CNN method first optimises unknown filter weights (of a CNN) by using observations and ground-truth directional information. This trained CNN is then used to predict incident directions given test observations. Most existing methods train using spectrally-flat random signals and test using speech. In this paper, which focuses on single source DoA estimation, we find that training with speech or music signals produces a relative improvement in DoA accuracy for a variety of audio classes across 16 acoustic conditions and 9 DoAs, amounting to an average improvement of around 17% and 19% respectively when compared to training with spectrally flat random signals. This improvement is also observed in scenarios in which the speech and music signals are synthesised using, for example, a Generative Adversarial Network (GAN). When the acoustic environments during test and training are similar and reverberant, training a CNN with speech outperforms Generalized Cross Correlation (GCC) methods by about 125%. When the test conditions are different, a CNN performs comparably. This paper takes a step towards answering open questions in the literature regarding the nature of the signals used during training, as well as the amount of data required for estimating DoA using CNNs.

13 citations

DOI•

Improvement of learning-based methods for localization of multiple sound sources

Saulius Sakavičius

01 Jan 2021

1 citation

Proceedings Article•DOI•

Direction of Arrival Estimation of Moving Sound Sources using Deep Learning

16 May 2022

TL;DR: In this article, a feed-forward neural network with short-time Fourier transform input features is used to estimate the direction of arrival of moving sound sources, and different hyperparameters are tested to determine which combination gives better direction-of-arrival detection for moving sources.

Abstract: Sound source localization is an important task for several applications and the use of deep learning for this task has recently become a popular research topic. While nearly all previous work has focused on static sound sources, in this paper we evaluate the performance of a deep learning classification system for localization of moving sound sources and we evaluate the effect of different hyperparameters and acoustic conditions. A feedforward neural network is used to estimate the direction of arrival of moving sound sources, with Short Time Fourier Transform input features. Diverse synthetic datasets are generated to represent different acoustic conditions, and hyperparameters are tested to determine which combination results in better direction-of-arrival detection for moving sources. We evaluate the performance of the different combinations in terms of precision and recall, in a multi-class multi-label classification framework, and we find that (1) the number of frequency bins and the reverberation time have a significant effect for localizing high-speed sources, and (2) precision and recall decay slowly at low speeds while dropping sharply at high speeds.

1 citation

Journal Article•DOI•

Characterization of Moving Sound Sources Direction-of-Arrival Estimation Using Different Deep Learning Architectures

01 Jan 2023-IEEE Transactions on Instrumentation and Measurement

TL;DR: In this paper, the authors evaluate the performance of a deep learning classification system for localization of moving sound sources and show that a temporal convolutional neural network can outperform both recurrent and feed-forward networks for moving sound sources.

Abstract: Sound source localization is an important task for several applications and the use of deep learning for this task has recently become a popular research topic. While a number of previous works have focused on static sound sources, in this paper we evaluate the performance of a deep learning classification system for localization of moving sound sources. In particular, we evaluate the effect of key parameters at the levels of feature extraction (e.g., STFT parameters) and model training (e.g., neural network architectures). We evaluate the performance of different settings in terms of precision and F-score, in a multi-class multi-label classification framework. In our previous work on localization of moving sound sources, we investigated feedforward neural networks under different acoustic conditions and STFT parameters, and showed that the presence of some reverberation in the training dataset can help in achieving better detection of the direction of arrival of the sources. In this paper, we extend the work to show that (1) window size does not affect the performance for static sources but strongly affects the performance for moving sources, (2) sequence length has a significant effect on the performance of recurrent architectures, and (3) a temporal convolutional neural network can outperform both recurrent and feedforward networks for moving sound sources.

References

Journal Article•DOI•

Deep learning in neural networks

Jürgen Schmidhuber¹•Institutions (1)

University of Lugano¹

01 Jan 2015-Neural Networks

TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, and reviews deep supervised learning, unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

14,635 citations

Proceedings Article•

Multi-Scale Context Aggregation by Dilated Convolutions

Fisher Yu¹, Vladlen Koltun²•Institutions (2)

Princeton University¹, Intel²

30 Apr 2016

TL;DR: This work develops a new convolutional network module that is specifically designed for dense prediction, and shows that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems.

Abstract: State-of-the-art models for semantic segmentation are based on adaptations of convolutional networks that had originally been designed for image classification. However, dense prediction and image classification are structurally different. In this work, we develop a new convolutional network module that is specifically designed for dense prediction. The presented module uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution. The architecture is based on the fact that dilated convolutions support exponential expansion of the receptive field without loss of resolution or coverage. We show that the presented context module increases the accuracy of state-of-the-art semantic segmentation systems. In addition, we examine the adaptation of image classification networks to dense prediction and show that simplifying the adapted network can increase accuracy.
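The claim that dilated convolutions expand the receptive field exponentially without losing resolution can be illustrated in one dimension. The following is a minimal NumPy sketch, not the module from the paper: the helper `dilated_conv1d`, the 3-tap all-ones filter, the signal length, and the doubling dilation schedule are all assumptions chosen for illustration.

```python
import numpy as np


def dilated_conv1d(x, w, dilation=1):
    """Zero-padded ('same') 1-D cross-correlation with a dilated kernel, stride 1."""
    k = len(w)
    span = (k - 1) * dilation            # distance between first and last tap
    left = span // 2
    padded = np.pad(x, (left, span - left))
    return np.array([
        sum(w[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])


x = np.zeros(33)
x[16] = 1.0                  # unit impulse in the middle of the signal
w = np.ones(3)               # hypothetical 3-tap filter

y = x
for d in (1, 2, 4, 8):       # dilation doubles at every layer
    y = dilated_conv1d(y, w, d)

print(len(y))                # 33: the output keeps the input resolution
print(np.count_nonzero(y))   # 31 = 2**5 - 1 samples reached after 4 layers
```

Four layers with three taps each reach 31 samples here, whereas the same four layers without dilation would reach only 9.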

5,566 citations

Proceedings Article•DOI•

Librispeech: An ASR corpus based on public domain audio books

Vassil Panayotov¹, Guoguo Chen¹, Daniel Povey¹, Sanjeev Khudanpur¹•Institutions (1)

Johns Hopkins University¹

19 Apr 2015

TL;DR: It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

Abstract: This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models. We show that acoustic models trained on LibriSpeech give lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself. We are also releasing Kaldi scripts that make it easy to build these systems.

4,770 citations

Proceedings Article•DOI•

Direction of Arrival Estimation for Multiple Sound Sources Using Convolutional Recurrent Neural Network

Sharath Adavanne¹, Archontis Politis², Tuomas Virtanen¹•Institutions (2)

Tampere University of Technology¹, Aalto University²

01 Sep 2018

TL;DR: This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources and evaluates it in anechoic, matched and unmatched reverberant conditions.

Abstract: This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision and generate SPS with high signal-to-noise ratio.

191 citations

Proceedings Article•DOI•

Sound source localization based on deep neural networks with directional activate function exploiting phase information

Ryu Takeda¹, Kazunori Komatani¹•Institutions (1)

Osaka University¹

20 Mar 2016

TL;DR: This paper describes sound source localization (SSL) based on deep neural networks (DNNs) using discriminative training and indicates that the method outperformed the naive DNN-based SSL by 20 points in terms of the block-level accuracy.

Abstract: This paper describes sound source localization (SSL) based on deep neural networks (DNNs) using discriminative training. A naive DNN for SSL can be configured as follows. The input is the frequency-domain feature used in other SSL methods, and the structure of the DNN is a fully-connected network using real numbers. The training fails because this network structure loses two important properties, i.e., the orthogonality of sub-bands and the intensity and time information saved in complex numbers. We solved these two problems by 1) integrating directional information at each sub-band hierarchically, and 2) designing a directional activator that could treat the complex numbers at each sub-band. Our experiments indicated that our method outperformed the naive DNN-based SSL by 20 points in terms of the block-level accuracy.

153 citations