Citation for the published version: Vargas, E., Hopgood, J. R., Brown, K. & Subr, K. (2021), 'On Improved Training of CNN for Acoustic Source Localisation', IEEE Transactions on Audio, Speech and Language Processing, vol. 29, pp. 720-732. DOI: 10.1109/TASLP.2021.3049337 (peer-reviewed version).

On Improved Training of CNN for Acoustic Source
Localisation
Elizabeth Vargas, James R. Hopgood, Member, IEEE, Keith Brown, and Kartic Subr

E. Vargas and K. Brown are with the Institute of Sensors, Signals, and Systems, Heriot-Watt University. Email: elizabeth.vargas@hw.ac.uk. James R. Hopgood is with the Institute of Digital Communications, School of Engineering, University of Edinburgh. K. Subr is with the Institute of Perception, Action and Behaviour, University of Edinburgh.
Abstract—Convolutional Neural Networks (CNNs) are a popular choice for estimating Direction of Arrival (DoA) without explicitly estimating delays between multiple microphones. The CNN method first optimises unknown filter weights (of a CNN) by using observations and ground-truth directional information. This trained CNN is then used to predict incident directions given test observations. Most existing methods train using spectrally-flat random signals and test using speech. In this paper, which focuses on single source DoA estimation, we find that training with speech or music signals produces a relative improvement in DoA accuracy for a variety of audio classes across 16 acoustic conditions and 9 DoAs, amounting to an average improvement of around 17% and 19% respectively when compared to training with spectrally flat random signals. This improvement is also observed in scenarios in which the speech and music signals are synthesised using, for example, a Generative Adversarial Network (GAN). When the acoustic environments during test and training are similar and reverberant, training a CNN with speech outperforms Generalized Cross Correlation (GCC) methods by about 125%. When the test conditions are different, a CNN performs comparably. This paper takes a step towards answering open questions in the literature regarding the nature of the signals used during training, as well as the amount of data required for estimating DoA using CNNs.

Index Terms—Microphone Arrays, Direction of Arrival, Neural Networks, Convolutional Neural Network (CNN), Generative Adversarial Network (GAN)
I. INTRODUCTION
Estimation of the Direction of Arrival (DoA), or spatial
direction from which a sound is emitted, is an important and
well-studied problem in Acoustic Source Localisation (ASL)
with applications in numerous domains [15], [44]. The advent
of smart assistants (e.g. Amazon Echo, Google Home, Apple
HomePod) [6], equipped with arrays of microphones, has
facilitated the generation of large datasets and has motivated
research into the use of data-driven methods for DoA estimation. In particular, learning via Deep Neural Network (DNN) architectures, deployed effectively for computer vision applications [26] and audio processing [53], is emerging as an effective tool for ASL [10].
Traditional methods for performing ASL have been widely
studied in the literature [4], the most common of which
are: (i) Time Difference of Arrival (TDoA)-based approaches,
which normally employ Generalized Cross-Correlation (GCC)
methods [25], [47], [48]; (ii) beamforming-based approaches, including the well-known Steered Response Power (SRP) [30],
[33], which solve directly for the most likely source posi-
tion among a grid of candidate locations; and (iii) MUltiple SIgnal Classification (MUSIC) [42], [46], which uses signal subspaces to estimate multiple DoAs. More modern
approaches include the use of learning-based methods in ASL,
focused on feature extraction and classifiers [23], [27]. Neural
networks have been applied to various problems related to
ASL including speaker localisation using a robot [44], [45],
passive underwater sensing [15], antennas [31] and acous-
tic emission localisation on a pipeline [21]. Chakrabarty et
al. [8] perform single source localisation by treating ASL
as a classification problem, where the discretised DoA cor-
responds to a class, which they solve using a CNN. This
method has been extended to multiple sources [10] using a
flat spectral uncorrelated random process to train the network.
CNNs combined with Long Short-Term Memory (LSTM) [29]
have been shown to be useful for estimating DoA by using
Generalized Cross-Correlation Phase Transform (GCC-PHAT)
as input data. Some approaches use neural networks to perform
pre-processing such as time-frequency (TF) masking [36],
[51], [52] or denoising and dereverberation [49].
Despite the widespread use of CNNs in applications related
to ASL, numerous questions regarding the quality and quantity
of the training data remain unanswered. In [1], [2], data from
different sound classes is randomly used for both training
and testing, while in [34] the authors propose a method of
data augmentation for the task of room classification from
reverberant speech using a GAN. In [40], deep CNN and data
augmentation are used for environmental sound classification.
On the other hand, Pons et al. [37] use few training samples
(from 1 to 100) per class to train an event and acoustic scene
classifier. It is important to study the impact of training data for a CNN that estimates DoA, as this will help to generalise the use of deep learning methods in ASL without restricting the test data to the same audio class used during training.
In this paper, we test the impact of various sound classes
for training on the accuracy of single source DoA estimation.
We hypothesise and show that using speech and music data
for training will provide more accurate DoA estimation than using noise, as is done in the current literature [8], [10]. Our reasoning is that speech and music data contain more relevant spectral information, which helps the CNN learn the room acoustics much better than white noise does.
is that using real speech data augmented with synthetic speech
data (using GAN-based methods) performs best for a wide
range of test audio classes and different incident directions.
Our main findings and novel contributions in this work are that:
- training with speech data, rather than flat spectral noise, produces an average relative improvement of 3% in the accuracy of DoA estimates for test speech signals, and 17% when the test signals belong to one of three other classes: speech, children playing and street music, across 16 acoustic conditions and 9 DoAs in both cases;
- training with music data from a dataset produces an average relative improvement of 19% in DoA estimation accuracy across 16 acoustic conditions and 9 DoAs, compared to training with flat spectral noise;
- synthetic speech data generated using a state-of-the-art GAN [13], which can be generated automatically, is as effective in training as using real human speech;
- compared with GCC methods, a CNN trained with speech is 125% more accurate when the test and training environments have similar reverberation, and comparable when the reverberation levels are different.
The article is organised as follows. We review state-of-the-art DoA estimation using Neural Networks (NNs) in Section II. Section III gives details of our proposed approach for
training the CNN to estimate DoA. We present our evaluation
in Section IV, in which we compare our training methodology
against related state-of-the-art approaches. In Section V we
discuss the results of our experiments. Section VI concludes
our work and states future directions for research.
II. RELATED WORK
This section reviews existing NN-based DoA estimation work for ASL. We discuss the training data used in each work, and at the end of the section we highlight how our work differs from previous ones.
DoA methods are subdivided depending on whether they estimate the DoA for a single source or for multiple sources. Since our contributions are oriented towards the estimation of a single source, we focus our review of the literature on single-source approaches.
The use of planar arrays is very common in single-source
DoA estimation. In [44], for instance, the authors train a
DNN to localise sources using a microphone array embedded
on a humanoid robot. Localisation is presented as a binary
classification problem, in which the algorithm returns either
1 or 0, depending on the existence (or not) of a source in a given direction. The main contributions arising from this work are the use of a directional activator, similar to MUSIC, and the use of this activator to treat complex numbers (from the
spectrogram) at each sub-band. The evaluation was performed
using real data from a Japanese dataset as training and testing
sets (with different data used for each set), and accuracy
computed for 72 different DoAs and frames of 200ms. The
main limitation of this work is that the DNN is unable to localise sources located at positions that do not appear in the training set. The authors propose a new approach to overcome
these limitations in [45], using unsupervised learning together
with a parameter adaptation layer and early cessation of the
parameter updates. These changes result in improvements for
some of the DoA angles, but in a deterioration for others.
A similar approach is presented by Chakrabarty et al. [8],
where phase information of the Short-Time Fourier Transform
(STFT) coefficients is used together with a single-class clas-
sifier to train a CNN that outputs the DoA of a group of
signals from a microphone array. The DoA is modelled as a single-class classification problem, in which the classes are 37 different angles (DoAs) at 5° intervals. The network is trained with synthetic data and tested with speech signals from the TIMIT dataset. The results are presented as frame-level accuracy: that is, the proportion of frames for which the DoA is correctly classified, similar to [44]. Since this article is the basis for our work, Section III-A discusses it in further detail. In [29], the authors use a CNN combined with an LSTM to estimate DoA. The main contribution of [29] is its adaptability to a
change in microphone array configuration and the use of a very
small amount of data, since the network uses GCC-PHAT as
the input, rather than the spectrogram as in previous cases [8],
[44].
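Since several of the approaches above build on GCC-PHAT features, a minimal NumPy sketch of pairwise GCC-PHAT delay estimation is included below for reference. It is not taken from any of the cited implementations; the function name and parameters are illustrative.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None, interp=1):
    """Estimate the time delay between two microphone signals using the
    PHAT-weighted generalized cross-correlation.

    Returns the delay in seconds and the cross-correlation curve."""
    n = len(x1) + len(x2)                       # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    R = X1 * np.conj(X2)
    R /= np.abs(R) + 1e-12                      # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(R, n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / float(interp * fs)
    return tau, cc
```

For a far-field source and a known microphone spacing $d$, the delay maps to a DoA via $\theta = \arccos(c\,\tau/d)$, where $c$ is the speed of sound; CNN-based methods such as [8] avoid this explicit delay-estimation step.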
A number of approaches use an NN as a pre-processing step, including [51], in which the authors use a
Bidirectional Long Short Term Memory (BLSTM) for time-
frequency (TF) masking to arrive at a clean phase TDoA
estimation. They use this to improve conventional Cross-
correlation (CC), beamforming, and subspace-based algo-
rithms for ASL. They perform experiments with a binaural
setup, judging the estimation as accurate when the error is
within 5 degrees. This approach is extended in [52] where
the DoA is calculated directly using monaural spectral in-
formation for mask estimation during training, and therefore
this approach could be extended to different microphone
configurations. Similarly, in [36], the authors use a CNN to predict a time-frequency (TF) mask for emphasising the direct-path speech signal in time-varying interference. This approach is applied in combination with SRP to estimate the DoA. The main limitation is that it only works on the same audio class as the training set, while the main assumption is that there is a single dominant interferer alongside the target of interest.
The experiments were conducted using speech (English for
training and Japanese for testing) mixed with everyday sounds
(office printer background or household noise) to train and test
the NN for both static and moving speech sources. Wang et al. [49] propose the use of an Acoustic Vector Sensor (AVS) to estimate DoA, in conjunction with a network for denoising and dereverberation. The authors' hypothesis is that clean features are better classified than unclean ones; therefore they use a DNN for Signal Denoising and Dereverberation (DNN-SDD), which maps noisy and reverberant speech features to their clean versions and uses them as input for a DNN
that calculates DoA. The method is evaluated in small-sized
microphone arrays, with the Mean Absolute Error (MAE) and
Root Mean Square Error (RMSE) used as evaluation metrics.
There are some works that describe ASL using NNs in
planar arrays for very specific applications. In [15], the authors present an application of CNNs for DoA estimation with passive underwater sensors, a technique that uses cepstrograms and generalized cross-correlograms as input to estimate range and bearing. The network is trained using real, multi-channel acoustic recordings of a surface vessel in a shallow water environment. Another

application is presented in [31], in which DoA estimation
using a DNN is applied to antennas. The main contributions of the work in [31] are a proposed end-to-end DNN for general (not only acoustic) DoA estimation, the use of an autoencoder for pre-processing, and training with various outputs of a given array so that the network is robust to array imperfections. The authors train and test their approach on simulated data and use MUSIC as a baseline for comparison. Finally, [21] presents an application to acoustic emission localisation on a pipeline, where emissions are generated when energy is released within a material. The experiments showed an accuracy of 97% and an execution time of 0.963 milliseconds.
In general, we summarise that the literature on deep neural networks, as applied to ASL, is focused on creating neural network architectures and methodologies that generalise the
following:
- Room Acoustic Conditions: The network's goal is to be robust to new acoustic conditions, such as noise and reverberation, different from those used during training. One of the clearest examples is [8], in which the network is trained and tested with different room sizes and reverberation times. Moreover, [52] test their pre-processing TF mask in various noisy and reverberant environments. Perotin et al. [35] train their NN on a large variety of simulated rooms and test it on unseen rooms. The main limitation of these approaches is their assumption that both the training and test data belong to the same audio class.
- Source Locations: The objective is to be able to estimate source locations different from those present in the training set. In [8], the authors considered the influence of source-array distance in their experiments. Similarly, [35] evaluated their algorithm on DoAs that lie anywhere on the sphere rather than on the same discrete grid used for training.
- Microphone Configuration: The NN should be able to be tested on any microphone configuration, independent of the one(s) it was trained with. This is partially achieved in [29], in which the authors use GCC-PHAT as the input to the NN; therefore the microphone configurations for training and testing could be different, provided the inter-microphone distance is the same. A better generalisation is presented in [52], in which the NN uses monaural information; however, this is only for TF mask estimation as a pre-processing step, rather than for DoA estimation directly.
Even though the literature covers a lot of work on generalising the learning process, there is a gap in the efforts to generalise the nature of the training data. The closest effort is presented in [2], in which the authors use various data classes for training and testing the network; however, they limit their work to using the same audio class for training and testing. Accordingly, this paper focuses on studying the impact of the quality and quantity of training data on DoA estimation. Studying this impact will help to generalise the use of deep learning methods in ASL without restricting the test data to the same audio class used during training.
III. METHODOLOGY
A. Baseline: DoA estimation using CNNs trained with spectrally flat random noise
The focus of this work is on analysing the impact of training data; therefore we use an existing architecture [8] and follow the methodology presented in this section for training and testing.
The CNN, initially proposed in [8] and used in [9]–[11], is
based on a standard CNN [17] architecture. These networks
typically consist of a set of “convolution layers”, which act
as filters on the input, resulting in the set of features that the
network learns. The convolution is followed by an activation
layer, operating point-wise over each element of the feature
map. Later on, a pooling operation is applied to reduce the
feature map. In the final step, the fully connected layers
aggregate information from all different positions to perform
classification.
In this particular application, the authors use the CNN
architecture presented in [8], which has the following char-
acteristics:
- The CNN treats the phase of the STFT as an image: the input is a matrix of size M by K, where M is the number of microphones and K is the frequency resolution of the STFT. It is important to note that the input is a single time frame of the total signal per training data point, as opposed to the entire STFT.
- The CNN uses rectified linear units (ReLU) as the activation function.
- The CNN does not have any pooling layers, since pooling decreases the performance of the network.
- The last layer uses a softmax activation function to perform classification.
- The network is trained using the Adam optimiser [24], with a learning rate of 0.001, for 5 epochs, using categorical cross-entropy as the loss function.
- The output of the CNN is the posterior probability of the input belonging to each of 37 DoA classes (discrete values from 0° to 180°, in steps of 5°).
We tested the performance of this network to have a baseline
for comparison. Fig. 1 illustrates this. It also presents the
results of the sample experiments available in [7].
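For concreteness, the sketch below assembles a network with these properties in Keras. It is a minimal sketch, not the authors' exact model: the number of convolutional layers, the filter counts, kernel sizes and dense-layer width are assumptions, while the per-frame phase input of size M by K, the ReLU activations, the absence of pooling, the 37-class softmax output and the Adam (learning rate 0.001), categorical cross-entropy, 5-epoch training settings follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

M, K, N_CLASSES = 4, 129, 37   # 4-mic array; 129 bins assumed for a 256-point STFT; 37 DoA classes

model = models.Sequential([
    layers.Input(shape=(M, K, 1)),                  # phase map of a single STFT time frame
    layers.Conv2D(64, (2, 2), activation="relu"),   # layer/filter counts are assumptions
    layers.Conv2D(64, (2, 2), activation="relu"),
    layers.Flatten(),                               # no pooling layers, as in the baseline
    layers.Dense(512, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),  # posterior over 37 DoA classes (0-180 deg, 5 deg steps)
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(phase_frames, doa_labels, epochs=5)     # per-frame training for 5 epochs
```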
B. Acoustic conditions
Four microphones arranged in a linear array were used.
The training and testing conditions, summarised in Table I, are the same as those described in [8], to aid comparison. Moreover, the signals (16 kHz sampling frequency) were transformed using the STFT with a window of size 256 and an overlap of 129. Although the inter-microphone
distance is the same for both training and test, the arrays
are positioned in different locations within the rooms. The
training data is composed of 5.6 million frames, including cases in which the input combined real and synthetic data, guaranteeing a fair comparison among the training data variations.

The test data is composed of 100 audio files per audio
class (see Section III-D). The test signals are generated by
convolving these audio files with Room Impulse Responses
(RIRs) for 9 different DoAs, the same as those established
in the baseline: 30°, 45°, 60°, 75°, 90°, 105°, 120°, 135° and 150°. The RIR simulation is performed using the Image Source
Method (ISM) [3]. The noise on the test signals is uncorrelated
additive white Gaussian noise (that is, independent at each
microphone), added using the ISM simulator from [28].
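As an illustration of this pipeline, the sketch below convolves a mono test file with pre-computed RIRs (one per microphone, e.g. produced by an ISM simulator for a chosen room and DoA), adds independent white Gaussian noise at a target SNR, and extracts the per-frame STFT phase maps used as CNN input. It is a simplified stand-in for the simulator of [28]; the function name and SNR handling are assumptions, while the STFT settings follow the values quoted above.

```python
import numpy as np
from scipy.signal import fftconvolve, stft

def make_test_frames(source, rirs, fs=16000, snr_db=20, rng=None):
    """Simulate microphone signals for one source/DoA and return the
    per-frame STFT phase maps with shape (frames, M, K)."""
    rng = rng or np.random.default_rng()
    mics = np.stack([fftconvolve(source, h) for h in rirs])        # (M, T): one reverberant channel per RIR
    noise_power = np.mean(mics ** 2) / (10 ** (snr_db / 10))
    mics = mics + rng.normal(scale=np.sqrt(noise_power), size=mics.shape)  # independent noise per mic

    # STFT settings from Section III-B: 256-sample window, overlap of 129, 16 kHz signals
    _, _, Z = stft(mics, fs=fs, nperseg=256, noverlap=129)         # Z has shape (M, K, frames)
    return np.transpose(np.angle(Z), (2, 0, 1))                    # one (M, K) phase image per frame
```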
C. Training audio classes
We used two different audio classes to train the CNN: speech and music. For each of these classes we used several variations to produce the data (see Sections III-C1 and III-C2), either by drawing on existing datasets or by synthesising the sounds.
1) Speech: Six different types of speech training data are
used, in order to improve the DoA estimation accuracy of
existing CNN architectures in different audio classes. The
methods used for generating the training data are as follows:
1) Speech (TIMIT) Data from the TIMIT dataset [16], containing recordings of 630 speakers across 8 major dialects of American English, each reading phonetically rich sentences. The dataset was originally designed as a database of speech data for acoustic-phonetic studies, as well as for the development and evaluation of automatic speech recognition systems. This dataset includes silent frames, usually when the speaker pauses between words, where there is little signal energy. We do not remove these frames; for silent frames, the target label is defined to be the same as for the rest of the frames, since we assume single static sources.
2) Speech and Voice Activity Detector (VAD) (TIMIT+VAD) The TIMIT speech data is pre-processed using a VAD [43], a speech-processing technique used to detect the presence or absence of human speech. In this case, silent frames were detected using the VAD and removed from the signal before training the NN.
In general, a VAD algorithm consists of three steps: first,
there is a noise reduction stage; then, some features are
extracted from a section of the signal (which is what is
described here as a frame); and, finally, a classification
technique is applied in order to evaluate whether the
frame contains speech or not. In the classification step,
the algorithm proposed in [5] is employed, using an
implementation available in [43]. The authors use end-
point detection to determine where speech begins and
ends, and also to determine a speech threshold for initial
estimation of silent frames. Moreover, they compute the zero-crossing rate in the vicinity of the endpoints, that is, the rate at which successive signal samples have different algebraic signs. If frames above the initial threshold show considerable changes in zero-crossing rate, the endpoints are re-assigned to the points at which the changes take place. The parameters used in [43] (and in this manuscript) are an energy threshold of 0.0012 and a zero-crossing-rate threshold of 1.5.
As a result, when a VAD is applied to the TIMIT data
used for training, silent frames represent 26.47% of the
total number of frames.
3) Synthetic Speech (BSAR) Synthetic speech signal, mod-
elled by using a Block Stationary Autoregressive (BSAR)
process [14]. Eq. (1) illustrates how the signal $s_t$ is modelled: $s_t$ is partitioned into $M$ contiguous blocks, with block $i$ beginning at sample $t_i$; $e_t$ denotes the excitation process with variance $\sigma_i^2$:
$$s_t = \sum_{q=1}^{Q_i} b_i(q)\, s_{t-q} + e_t, \qquad e_t \sim \mathcal{N}(\mu, \sigma_i^2) \qquad (1)$$
The rationale for using this model is to investigate the effect of a training signal with well-structured but time-varying spectral characteristics; a minimal simulation sketch of this process appears after this list.
4) GAN Speech (GAN-TIMIT) Synthetic speech signal
generated using an implementation of a GAN, known
as WaveGAN [13], trained with TIMIT speech data.
WaveGAN is a machine learning algorithm based on
GANs, which uses real (recorded) audio samples to learn
to synthesise raw waveform audio. The implementation
provided by the authors is capable of learning up to 4
seconds of audio at 16 kHz. GANs, originally proposed
in [18], are composed of two NNs: a discriminator, D,
and a generator, G. D is trained to determine whether
an example is real or not (i.e. if it is realistic enough
to resemble the signal that it is trying to synthesise)
using training data, while G is trained to try to fool the
discriminator into thinking its output is real. Therefore,
G is trained to minimise and D is trained to maximise
the value function. Eq. (2) gives this value function, $V(D, G)$: $P_X$ denotes the probability distribution of the real training examples $x$, $P_Z$ the prior distribution over the generator's latent input $z$, and $\mathbb{E}_{x \sim P_X}[f(x)]$ the expectation of $f(x)$ with respect to $P_X$. The generator commonly uses randomised input as its initial seed. More details about GANs can be found in the original publication [18].
$$V(D, G) = \mathbb{E}_{x \sim P_X}[\log D(x)] + \mathbb{E}_{z \sim P_Z}[\log(1 - D(G(z)))] \qquad (2)$$
The approach proposed in [13] is based on the two-dimensional deep convolutional GAN (DCGAN) proposed in [38] for image synthesis. The authors bootstrap DCGAN to work on spectrograms, proposing an approach called SpecGAN. They also propose a waveform-domain approach called WaveGAN, which flattens the DCGAN architecture to work in one dimension. In addition, they increased the stride factor for all convolutions, removed batch normalisation from the generator and discriminator, and trained using the WGAN-GP [19] strategy.
5) GAN Speech (GAN-SC09) Synthetic speech signal gen-
erated using WaveGAN [13], trained with Speech Com-
mands Zero through Nine (SC09) data.
6) GAN for Speech Data Augmentation (TIMIT+GAN-TIMIT) Half of the data is from Speech (TIMIT), while the other half is synthetically generated using WaveGAN; no VAD is used.
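As referenced in variation 3 above, the following is a minimal sketch of simulating a BSAR signal according to Eq. (1). The block lengths, AR orders $Q_i$, coefficients $b_i(q)$ and excitation parameters are placeholders; the paper does not specify the values used, so they must be chosen (or fitted to speech) by the user.

```python
import numpy as np

def simulate_bsar(block_lengths, coeffs, sigmas, mu=0.0, rng=None):
    """Simulate a Block Stationary Autoregressive (BSAR) signal per Eq. (1):
    within block i the signal follows an AR(Q_i) model with coefficients
    b_i(q) and Gaussian excitation e_t ~ N(mu, sigma_i^2).

    block_lengths : block lengths in samples, one per block
    coeffs        : list of 1-D arrays, coeffs[i][q - 1] = b_i(q)
    sigmas        : excitation standard deviations, one per block
    """
    rng = rng or np.random.default_rng()
    s = []
    for n_i, b_i, sigma_i in zip(block_lengths, coeffs, sigmas):
        for _ in range(n_i):
            past = s[::-1][:len(b_i)]                        # s_{t-1}, s_{t-2}, ...
            ar_part = sum(bq * sq for bq, sq in zip(b_i, past))
            s.append(ar_part + rng.normal(mu, sigma_i))      # AR prediction + excitation e_t
    return np.asarray(s)

# Example: two blocks with different AR(2) coefficients and excitation variances
# x = simulate_bsar([8000, 8000], [np.array([1.2, -0.5]), np.array([0.4, 0.3])], [0.1, 0.3])
```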

References
- D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. International Conference on Learning Representations (ICLR), 2015.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
- I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems (NIPS), 2014.
- R. O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986.
- A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv:1511.06434, 2015.