Citation for the published version: Vargas, E., Hopgood, J. R., Brown, K. & Subr, K. (2021), 'On Improved Training of CNN for Acoustic Source Localisation', IEEE Transactions on Audio, Speech and Language Processing, vol. 29, pp. 720-732. DOI: 10.1109/TASLP.2021.3049337 (peer-reviewed version).

On Improved Training of CNN for Acoustic Source
Localisation
Elizabeth Vargas, James R. Hopgood, Member, IEEE, Keith Brown, and Kartic Subr

E. Vargas and K. Brown are with the Institute of Sensors, Signals, and Systems, Heriot-Watt University. Email: elizabeth.vargas@hw.ac.uk. James R. Hopgood is with the Institute of Digital Communications, School of Engineering, University of Edinburgh. K. Subr is with the Institute of Perception, Action and Behaviour, University of Edinburgh.
Abstract—Convolutional Neural Networks (CNNs) are a popular choice for estimating Direction of Arrival (DoA) without explicitly estimating delays between multiple microphones. The CNN method first optimises unknown filter weights (of a CNN) by using observations and ground-truth directional information. This trained CNN is then used to predict incident directions given test observations. Most existing methods train using spectrally-flat random signals and test using speech. In this paper, which focuses on single source DoA estimation, we find that training with speech or music signals produces a relative improvement in DoA accuracy for a variety of audio classes across 16 acoustic conditions and 9 DoAs, amounting to an average improvement of around 17% and 19% respectively when compared to training with spectrally flat random signals. This improvement is also observed in scenarios in which the speech and music signals are synthesised using, for example, a Generative Adversarial Network (GAN). When the acoustic environments during test and training are similar and reverberant, training a CNN with speech outperforms Generalized Cross Correlation (GCC) methods by about 125%. When the test conditions are different, a CNN performs comparably. This paper takes a step towards answering open questions in the literature regarding the nature of the signals used during training, as well as the amount of data required for estimating DoA using CNNs.

Index Terms—Microphone Arrays, Direction of Arrival, Neural Networks, Convolutional Neural Network (CNN), Generative Adversarial Network (GAN)
I. INTRODUCTION
Estimation of the Direction of Arrival (DoA), or spatial
direction from which a sound is emitted, is an important and
well-studied problem in Acoustic Source Localisation (ASL)
with applications in numerous domains [15], [44]. The advent
of smart assistants (e.g. Amazon Echo, Google Home, Apple
HomePod) [6], equipped with arrays of microphones, has
facilitated the generation of large datasets and has motivated
research into the use of data-driven methods for DoA estimation. In particular, learning via Deep Neural Network (DNN) architectures, deployed effectively for computer vision applications [26] and audio processing [53], is emerging as an effective tool for ASL [10].
Traditional methods for performing ASL have been widely
studied in the literature [4], the most common of which
are: (i) Time Difference of Arrival (TDoA)-based approaches,
which normally employ Generalized Cross-Correlation (GCC)
methods [25], [47], [48]; (ii) beamforming-based approaches, including the well-known Steered Response Power (SRP) [30],
[33], which solve directly for the most likely source posi-
tion among a grid of candidate locations; and (iii) MUltiple SIgnal Classification (MUSIC) [42], [46], which uses signal subspaces to estimate multiple DoAs. More modern
approaches include the use of learning-based methods in ASL,
focused on feature extraction and classifiers [23], [27]. Neural
networks have been applied to various problems related to
ASL including speaker localisation using a robot [44], [45],
passive underwater sensing [15], antennas [31] and acous-
tic emission localisation on a pipeline [21]. Chakrabarty et
al. [8] perform single source localisation by treating ASL
as a classification problem, where the discretised DoA cor-
responds to a class, which they solve using a CNN. This
method has been extended to multiple sources [10] using a
flat spectral uncorrelated random process to train the network.
CNNs combined with Long Short-Term Memory (LSTM) [29]
have been shown to be useful for estimating DoA by using
Generalized Cross-Correlation Phase Transform (GCC-PHAT)
as input data. Some approaches use neural networks to perform
pre-processing such as time-frequency (TF) masking [36],
[51], [52] or denoising and dereverberation [49].
Despite the widespread use of CNNs in applications related
to ASL, numerous questions regarding the quality and quantity
of the training data remain unanswered. In [1], [2], data from
different sound classes is randomly used for both training
and testing, while in [34] the authors propose a method of
data augmentation for the task of room classification from
reverberant speech using a GAN. In [40], deep CNN and data
augmentation are used for environmental sound classification.
On the other hand, Pons et al. [37] use few training samples
(from 1 to 100) per class to train an event and acoustic scene
classifier. It is important to study the impact of training data for a CNN that estimates DoA, as this will help to generalise the use of deep learning methods in ASL without restricting the test data to the same audio class used during training.
In this paper, we test the impact of various sound classes
for training on the accuracy of single source DoA estimation.
We hypothesise and show that using speech and music data
for training will provide more accurate DoA estimation than using noise, as is done in the current literature [8], [10]. Our reasoning is that speech and music data contain more relevant spectral information, which helps the CNN learn the room acoustics much better than white noise does.
is that using real speech data augmented with synthetic speech
data (using GAN-based methods) performs best for a wide
range of test audio classes and different incident directions.
Our main findings and novel contributions in this work are that:
- training with speech data, rather than flat spectral noise, produces an average relative improvement of 3% in the accuracy of DoA estimates for test speech signals, and 17% when the test signals belong to one of three other classes: speech, children playing and street music, across 16 acoustic conditions and 9 DoAs in both cases;
- training with music data from a dataset produces an average relative improvement of 19% in DoA estimation accuracy across 16 acoustic conditions and 9 DoAs, compared to training with flat spectral noise;
- synthetic speech data generated using a state-of-the-art GAN [13], which can be generated automatically, is as effective in training as using real human speech;
- compared with GCC methods, a CNN trained with speech is 125% more accurate when the test and training environments have similar reverberation, and comparable when the reverberation levels are different.
The article is organised as follows. We review state-of-the-art DoA estimation using Neural Networks (NNs) in Section II. Section III gives details of our proposed approach for
training the CNN to estimate DoA. We present our evaluation
in Section IV, in which we compare our training methodology
against related state-of-the-art approaches. In Section V we
discuss the results of our experiments. Section VI concludes
our work and states future directions for research.
II. RELATED WORK
This section reviews existing NN-based DoA estimation work for ASL. We discuss the training data used in each work, and at the end of the section we highlight how our work differs from previous ones.
DoA methods are subdivided depending on whether they estimate the DoA for a single source or for multiple sources. Since our contributions are oriented towards the estimation of a single source, we focus our review of the literature on single-source approaches.
The use of planar arrays is very common in single-source
DoA estimation. In [44], for instance, the authors train a
DNN to localise sources using a microphone array embedded
on a humanoid robot. Localisation is presented as a binary
classification problem, in which the algorithm returns either
1 or 0, depending on the existence (or not) of a source in a given direction. The main contributions arising from this work are the use of a directional activator, similar to MUSIC, and the use of this activator to treat complex numbers (from the
spectrogram) at each sub-band. The evaluation was performed
using real data from a Japanese dataset as training and testing
sets (with different data used for each set), and accuracy
computed for 72 different DoAs and frames of 200ms. The
main limitation of this work is that the DNN is unable to localise sources located at positions that do not appear in the training set. The authors propose a new approach to overcome
these limitations in [45], using unsupervised learning together
with a parameter adaptation layer and early cessation of the
parameter updates. These changes result in improvements for
some of the DoA angles, but in a deterioration for others.
A similar approach is presented by Chakrabarty et al. [8],
where phase information of the Short-Time Fourier Transform
(STFT) coefficients is used together with a single-class clas-
sifier to train a CNN that outputs the DoA of a group of
signals from a microphone array. The DoA is modelled as a single-class classification problem, in which the classes are 37 different angles (DoAs) at 5° intervals. The network is trained with synthetic data and tested with speech signals from the TIMIT dataset. The results are presented as frame-level accuracy: that is, the proportion of frames for which the DoA is correctly classified, similar to [44]. Since this article is the basis for our work, Section III-A discusses it in further detail. In [29], the authors use a CNN combined with an LSTM to estimate DoA. The main contribution of [29] is its adaptability to a
change in microphone array configuration and the use of a very
small amount of data, since the network uses GCC-PHAT as
the input, rather than the spectrogram as in previous cases [8],
[44].
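Since several of the approaches above build on GCC-PHAT features, a minimal NumPy sketch of pairwise GCC-PHAT delay estimation is included below for reference. It is not taken from any of the cited implementations; the function name and parameters are illustrative.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None, interp=1):
    """Estimate the time delay between two microphone signals using the
    PHAT-weighted generalized cross-correlation.

    Returns the delay in seconds and the cross-correlation curve."""
    n = len(x1) + len(x2)                       # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    R = X1 * np.conj(X2)
    R /= np.abs(R) + 1e-12                      # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / float(interp * fs)
    return tau, cc
```

For a far-field source and a known microphone spacing $d$, the delay maps to a DoA via $\theta = \arccos(c\,\tau/d)$, where $c$ is the speed of sound; CNN-based methods such as [8] avoid this explicit delay-estimation step.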
A number of approaches use an NN as a pre-processing step, including [51], in which the authors use a
Bidirectional Long Short Term Memory (BLSTM) for time-
frequency (TF) masking to arrive at a clean phase TDoA
estimation. They use this to improve conventional Cross-
correlation (CC), beamforming, and subspace-based algo-
rithms for ASL. They perform experiments with a binaural
setup, judging the estimation as accurate when the error is
within 5 degrees. This approach is extended in [52] where
the DoA is calculated directly using monaural spectral in-
formation for mask estimation during training, and therefore
this approach could be extended to different microphone
configurations. Similarly, in [36], the authors use a CNN to predict a time-frequency (TF) mask for emphasising the direct-path speech signal in time-varying interference. This approach is applied in combination with SRP to estimate the DoA. The main limitation is that it only works on the same audio class as the training set, while the main assumption is that there is a single dominant interferer alongside the target of interest.
The experiments were conducted using speech (English for
training and Japanese for testing) mixed with everyday sounds
(office printer background or household noise) to train and test
the NN for both static and moving speech sources. Wang et al. [49] propose the use of an Acoustic Vector Sensor (AVS) to estimate DoA, in conjunction with a network for denoising and dereverberation. The authors' hypothesis is that clean features are better classified than unclean ones; therefore they use a DNN for Signal Denoising and Dereverberation (DNN-SDD), which maps noisy and reverberant speech features to their clean versions and uses them as input for a DNN
that calculates DoA. The method is evaluated in small-sized
microphone arrays, with the Mean Absolute Error (MAE) and
Root Mean Square Error (RMSE) used as evaluation metrics.
There are some works that describe ASL using NNs in
planar arrays for very specific applications. In [15], the authors present an application of CNNs for DoA estimation with passive underwater sensors, a technique that uses cepstrograms and generalized cross-correlograms as input to estimate range and bearing. The network is trained using real, multi-channel acoustic recordings of a surface vessel in a shallow water environment. Another

application is presented in [31], in which DoA estimation
using a DNN is applied to antennas. The main contributions of the work in [31] are a proposed end-to-end DNN for general (not only acoustic) DoA estimation, the use of an autoencoder for pre-processing, and training with various outputs of a given array so that the network is robust to array imperfections. The authors train and test their approach on simulated data and use MUSIC as a baseline for comparison. Finally, [21] presents an application to acoustic emission localisation on a pipeline, where emissions are generated when energy is released within a material. The experiments showed an accuracy of 97% and an execution time of 0.963 milliseconds.
In general, we summarise that the literature on deep neural networks, as applied to ASL, is focused on creating neural network architectures and methodologies that generalise the
following:
- Room Acoustic Conditions: The network's goal is to be robust to new acoustic conditions, such as noise and reverberation, different from those used during training. One of the clearest examples is [8], in which the network is trained and tested with different room sizes and reverberation times. Moreover, [52] test their pre-processing TF mask in various noisy and reverberant environments. Perotin et al. [35] train their NN on a large variety of simulated rooms and test it on unseen rooms. The main limitation of these approaches is their assumption that both the training and test data belong to the same audio class.
- Source Locations: The objective is to be able to estimate source locations different from those present in the training set. In [8], the authors considered the influence of source-array distance in their experiments. Similarly, [35] evaluated their algorithm on DoAs that lie anywhere on the sphere rather than on the same discrete grid used for training.
- Microphone Configuration: The NN should be able to be tested on any microphone configuration, independent of the one(s) it was trained with. This is partially achieved in [29], in which the authors use GCC-PHAT as the input to the NN; therefore the microphone configurations for training and testing could be different, provided the inter-microphone distance is the same. A better generalisation is presented in [52], in which the NN uses monaural information; however, this is only for TF mask estimation as a pre-processing step, rather than for DoA estimation directly.
Even though the literature covers a lot of work on generalising the learning process, there is a gap in the efforts to generalise the nature of the training data. The closest effort is presented in [2], in which the authors use various data classes for training and testing the network; however, they limit their work to using the same audio class for training and testing. Accordingly, this paper focuses on studying the impact of the quality and quantity of training data on DoA estimation. Studying this impact will help to generalise the use of deep learning methods in ASL without restricting the test data to the same audio class used during training.
III. METHODOLOGY
A. Baseline: DoA estimation using CNNs trained with spectrally flat random noise
The focus of this work is on analysing the impact of training data; therefore we use an existing architecture [8] and follow the methodology presented in this section for training and testing.
The CNN, initially proposed in [8] and used in [9]–[11], is
based on a standard CNN [17] architecture. These networks
typically consist of a set of “convolution layers”, which act
as filters on the input, resulting in the set of features that the
network learns. The convolution is followed by an activation
layer, operating point-wise over each element of the feature
map. Later on, a pooling operation is applied to reduce the
feature map. In the final step, the fully connected layers
aggregate information from all different positions to perform
classification.
In this particular application, the authors use the CNN
architecture presented in [8], which has the following char-
acteristics:
- The CNN treats the phase of the STFT as an image: the input is a matrix of size M by K, where M is the number of microphones and K is the frequency resolution of the STFT. It is important to note that the input is a single time frame of the total signal per training data point, as opposed to the entire STFT.
- The CNN uses rectified linear units (ReLU) as the activation function.
- The CNN does not have any pooling layers, since pooling decreases the performance of the network.
- The last layer uses a softmax activation function to perform classification.
- The network is trained using the Adam optimiser [24], with a learning rate of 0.001, for 5 epochs, using categorical cross-entropy as the loss function.
- The output of the CNN is the posterior probability of the input belonging to each of 37 DoA classes (discrete values from 0° to 180°, in steps of 5°).
We tested the performance of this network to have a baseline
for comparison. Fig. 1 illustrates this. It also presents the
results of the sample experiments available in [7].
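For concreteness, the sketch below assembles a network with these properties in Keras. It is a minimal sketch, not the authors' exact model: the number of convolutional layers, the filter counts, kernel sizes and dense-layer width are assumptions, while the per-frame phase input of size M by K, the ReLU activations, the absence of pooling, the 37-class softmax output and the Adam (learning rate 0.001), categorical cross-entropy, 5-epoch training settings follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

M, K, N_CLASSES = 4, 129, 37   # 4-mic array; 129 bins assumed for a 256-point STFT; 37 DoA classes

model = models.Sequential([
    layers.Input(shape=(M, K, 1)),                  # phase map of a single STFT time frame
    layers.Conv2D(64, (2, 2), activation="relu"),   # layer/filter counts are assumptions
    layers.Conv2D(64, (2, 2), activation="relu"),
    layers.Flatten(),                               # no pooling layers, as in the baseline
    layers.Dense(512, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),  # posterior over 37 DoA classes (0-180 deg, 5 deg steps)
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(phase_frames, doa_labels, epochs=5)     # per-frame training for 5 epochs
```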
B. Acoustic conditions
Four microphones arranged in a linear array were used.
The training and testing conditions, summarised in Table I, are the same as those described in [8], to aid comparison. Moreover, the signals (16 kHz sampling frequency) were transformed using the STFT with a window of size 256 and an overlap of 129. Although the inter-microphone
distance is the same for both training and test, the arrays
are positioned in different locations within the rooms. The
training data is composed of 5.6 million frames, including cases in which the input combined real and synthetic data, guaranteeing a fair comparison among the training data variations.

The test data is composed of 100 audio files per audio
class (see Section III-D). The test signals are generated by
convolving these audio files with Room Impulse Responses
(RIRs) for 9 different DoAs, the same as those established
in the baseline: 30°, 45°, 60°, 75°, 90°, 105°, 120°, 135° and 150°. The RIR simulation is performed using the Image Source
Method (ISM) [3]. The noise on the test signals is uncorrelated
additive white Gaussian noise (that is, independent at each
microphone), added using the ISM simulator from [28].
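As an illustration of this pipeline, the sketch below convolves a mono test file with pre-computed RIRs (one per microphone, e.g. produced by an ISM simulator for a chosen room and DoA), adds independent white Gaussian noise at a target SNR, and extracts the per-frame STFT phase maps used as CNN input. It is a simplified stand-in for the simulator of [28]; the function name and SNR handling are assumptions, while the STFT settings follow the values quoted above.

```python
import numpy as np
from scipy.signal import fftconvolve, stft

def make_test_frames(source, rirs, fs=16000, snr_db=20, rng=None):
    """Simulate microphone signals for one source/DoA and return the
    per-frame STFT phase maps with shape (frames, M, K)."""
    rng = rng or np.random.default_rng()
    mics = np.stack([fftconvolve(source, h) for h in rirs])        # (M, T): one reverberant channel per RIR
    noise_power = np.mean(mics ** 2) / (10 ** (snr_db / 10))
    mics = mics + rng.normal(scale=np.sqrt(noise_power), size=mics.shape)  # independent noise per mic

    # STFT settings from Section III-B: 256-sample window, overlap of 129, 16 kHz signals
    _, _, Z = stft(mics, fs=fs, nperseg=256, noverlap=129)         # Z has shape (M, K, frames)
    return np.transpose(np.angle(Z), (2, 0, 1))                    # one (M, K) phase image per frame
```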
C. Training audio classes
We used two different audio classes to train the CNN: speech and music. For each of these classes we used several variations to produce the data (see Sections III-C1 and III-C2), either by drawing on existing datasets or by synthesising the sounds.
1) Speech: Six different types of speech training data are
used, in order to improve the DoA estimation accuracy of
existing CNN architectures in different audio classes. The
methods used for generating the training data are as follows:
1) Speech (TIMIT) Data from the TIMIT dataset [16], containing recordings of 630 speakers across 8 major dialects of American English, each reading phonetically rich sentences. The dataset was originally designed as a database of speech data for acoustic-phonetic studies, as well as for the development and evaluation of automatic speech recognition systems. This dataset includes silent frames, usually when the speaker pauses between words, where there is little signal energy. We do not remove these frames; for silent frames, the target label is defined to be the same as for the rest of the frames, since we assume single static sources.
2) Speech and Voice Activity Detector (VAD) (TIMIT+VAD) The TIMIT speech data is pre-processed using a VAD [43], a speech-processing technique used to detect the presence or absence of human speech. In this case, silent frames were detected using the VAD and removed from the signal before training the NN.
In general, a VAD algorithm consists of three steps: first,
there is a noise reduction stage; then, some features are
extracted from a section of the signal (which is what is
described here as a frame); and, finally, a classification
technique is applied in order to evaluate whether the
frame contains speech or not. In the classification step,
the algorithm proposed in [5] is employed, using an
implementation available in [43]. The authors use end-
point detection to determine where speech begins and
ends, and also to determine a speech threshold for initial
estimation of silent frames. Moreover, they compute the zero-crossing rate in the vicinity of the endpoints, that is, the rate at which successive signal samples have different algebraic signs. If frames above the initial threshold show considerable changes in zero-crossing rate, the endpoints are re-assigned to the points at which the changes take place. The parameters used in [43] (and in this manuscript) are an energy threshold of 0.0012 and a zero-crossing-rate threshold of 1.5.
As a result, when a VAD is applied to the TIMIT data
used for training, silent frames represent 26.47% of the
total number of frames.
3) Synthetic Speech (BSAR) Synthetic speech signal, mod-
elled by using a Block Stationary Autoregressive (BSAR)
process [14]. Eq. (1) illustrates how the signal $s_t$ is modelled: $s_t$ is partitioned into $M$ contiguous blocks, with block $i$ beginning at sample $t_i$; $e_t$ denotes the excitation process with variance $\sigma_i^2$:
$$s_t = \sum_{q=1}^{Q_i} b_i(q)\, s_{t-q} + e_t, \qquad e_t \sim \mathcal{N}(\mu, \sigma_i^2) \qquad (1)$$
The rationale for using this model is to investigate the effect of a training signal with well-structured but time-varying spectral characteristics; a minimal simulation sketch of this process appears after this list.
4) GAN Speech (GAN-TIMIT) Synthetic speech signal
generated using an implementation of a GAN, known
as WaveGAN [13], trained with TIMIT speech data.
WaveGAN is a machine learning algorithm based on
GANs, which uses real (recorded) audio samples to learn
to synthesise raw waveform audio. The implementation
provided by the authors is capable of learning up to 4
seconds of audio at 16 kHz. GANs, originally proposed
in [18], are composed of two NNs: a discriminator, D,
and a generator, G. D is trained to determine whether
an example is real or not (i.e. if it is realistic enough
to resemble the signal that it is trying to synthesise)
using training data, while G is trained to try to fool the
discriminator into thinking its output is real. Therefore,
G is trained to minimise and D is trained to maximise
the value function. Eq. (2) gives this value function, $V(D, G)$: $P_X$ denotes the probability distribution of the real training examples $x$, $P_Z$ the prior distribution over the generator's latent input $z$, and $\mathbb{E}_{x \sim P_X}[f(x)]$ the expectation of $f(x)$ with respect to $P_X$. The generator commonly uses randomised input as its initial seed. More details about GANs can be found in the original publication [18].
$$V(D, G) = \mathbb{E}_{x \sim P_X}[\log D(x)] + \mathbb{E}_{z \sim P_Z}[\log(1 - D(G(z)))] \qquad (2)$$
The approach proposed in [13] is based on the two-dimensional deep convolutional GAN (DCGAN) proposed in [38] for image synthesis. The authors bootstrap DCGAN to work on spectrograms, proposing an approach called SpecGAN. They also propose a waveform-domain approach called WaveGAN, which flattens the DCGAN architecture to work in one dimension. In addition, they increased the stride factor for all convolutions, removed batch normalisation from the generator and discriminator, and trained using the WGAN-GP [19] strategy.
5) GAN Speech (GAN-SC09) Synthetic speech signal gen-
erated using WaveGAN [13], trained with Speech Com-
mands Zero through Nine (SC09) data.
6) GAN for Speech Data Augmentation (TIMIT+GAN-TIMIT) Half of the data is from Speech (TIMIT), while the other half is synthetically generated using WaveGAN; no VAD is used.
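As referenced in variation 3 above, the following is a minimal sketch of simulating a BSAR signal according to Eq. (1). The block lengths, AR orders $Q_i$, coefficients $b_i(q)$ and excitation parameters are placeholders; the paper does not specify the values used, so they must be chosen (or fitted to speech) by the user.

```python
import numpy as np

def simulate_bsar(block_lengths, coeffs, sigmas, mu=0.0, rng=None):
    """Simulate a Block Stationary Autoregressive (BSAR) signal per Eq. (1):
    within block i the signal follows an AR(Q_i) model with coefficients
    b_i(q) and Gaussian excitation e_t ~ N(mu, sigma_i^2).

    block_lengths : block lengths in samples, one per block
    coeffs        : list of 1-D arrays, coeffs[i][q - 1] = b_i(q)
    sigmas        : excitation standard deviations, one per block
    """
    rng = rng or np.random.default_rng()
    s = []
    for n_i, b_i, sigma_i in zip(block_lengths, coeffs, sigmas):
        for _ in range(n_i):
            past = s[::-1][:len(b_i)]                        # s_{t-1}, s_{t-2}, ...
            ar_part = sum(bq * sq for bq, sq in zip(b_i, past))
            s.append(ar_part + rng.normal(mu, sigma_i))      # AR prediction + excitation e_t
    return np.asarray(s)

# Example: two blocks with different AR(2) coefficients and excitation variances
# x = simulate_bsar([8000, 8000], [np.array([1.2, -0.5]), np.array([0.4, 0.3])], [0.1, 0.3])
```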

References
- D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. International Conference on Learning Representations (ICLR), 2015.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
- I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems (NIPS), 2014.
- R. O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986.
- A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv:1511.06434, 2015.