On Improved Training of CNN for Acoustic Source Localisation
Citations
A Survey of Sound Source Localization with Deep Learning Methods
Estimation of Azimuth and Elevation for Multiple Acoustic Sources Using Tetrahedral Microphone Arrays and Convolutional Neural Networks
Signal-Aware Direction-of-Arrival Estimation Using Attention Mechanisms
Differentiable Tracking-Based Training of Deep Learning Sound Source Localizers
References
Adam: A Method for Stochastic Optimization
ImageNet Classification with Deep Convolutional Neural Networks
Generative Adversarial Nets
Multiple emitter location and signal parameter estimation
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Frequently Asked Questions (14)
Q2. What future works have the authors mentioned in the paper "On Improved Training of CNN for Acoustic Source Localisation"?
Another avenue for future work could be the extension of their work for multiple simultaneous sources.
Q3. At what amount of training data does the DoA estimation accuracy decrease?
For speech, reducing the amount of training data below 25% decreases the overall DoA estimation accuracy, with a significant drop at 5%.
Q4. Why is the decrease in performance not attributed to silent frames in the training data?
Since both the speech and music training data contain a high proportion of silent frames (around one quarter, 26%), the decrease in performance cannot be due to a lack of silent frames in the training data.
Q5. What is the main hypothesis of the experiment?
Their main hypothesis is that using speech for training the CNN will provide accurate results that outperform those obtained with the baseline.
Q6. What type of data are used to improve the DoA estimation accuracy of existing CNN architectures?
Six different types of speech training data are used in order to improve the DoA estimation accuracy of existing CNN architectures across different audio classes.
Q7. How did the authors train the CNN using the WGAN-GP strategy?
They increased the stride factor for all convolutions, removed batch normalisation from the generator and discriminator, and trained using the WGAN-GP [19] strategy.
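The WGAN-GP critic objective referred to above combines a Wasserstein term with a gradient penalty that encourages the critic to be 1-Lipschitz. A minimal NumPy sketch using a toy linear critic, for which the input gradient is simply the weight vector, so the penalty has a closed form (the linear critic, the penalty weight of 10 and the toy Gaussian data are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def critic_loss_wgan_gp(w, real, fake, lam=10.0):
    """WGAN-GP critic loss for a toy linear critic D(x) = x @ w.

    For a linear critic the gradient with respect to the input is w
    everywhere, so the gradient penalty reduces to lam * (||w|| - 1)^2
    at every interpolated point (a simplification for illustration).
    """
    d_real = real @ w                       # critic scores on real samples
    d_fake = fake @ w                       # critic scores on generated samples
    # Wasserstein term: the critic tries to separate real from fake
    wasserstein = d_fake.mean() - d_real.mean()
    # Gradient penalty keeps the critic approximately 1-Lipschitz
    grad_norm = np.linalg.norm(w)
    penalty = lam * (grad_norm - 1.0) ** 2
    return wasserstein + penalty

rng = np.random.default_rng(0)
real = rng.normal(2.0, 1.0, size=(64, 4))   # toy "real" batch
fake = rng.normal(0.0, 1.0, size=(64, 4))   # toy "generated" batch
w = np.ones(4) / 2.0                        # ||w|| = 1, so the penalty vanishes
loss = critic_loss_wgan_gp(w, real, fake)
```

With a neural-network critic the gradient is obtained by automatic differentiation at points interpolated between real and generated samples; the linear case above only illustrates how the penalty term enters the loss.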
Q8. What is the way to train a CNN?
Future work includes the use of transfer learning techniques in order to use simulated environments for training the CNN and test using data from real scenarios.
Q9. What is the hypothesis behind the use of music for training the CNN?
Their hypothesis in this case is that using music for training will provide accurate results, outperforming those of the baseline, though not as robust as those obtained with speech.
Q10. What is the reason why the CNN trained with music performs better on speech data?
The CNN trained with music performs better on speech data than the CNN trained with speech because it performs well for all DoAs, while the speech-trained CNN fails for 30° and 150°.
Q11. What are the methods used for generating the training data?
The methods used for generating the training data are as follows: 1) Speech (TIMIT): data from the TIMIT dataset [16], which contains recordings of 630 speakers from 8 major dialects of American English reading phonetically rich sentences.
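Turning single-channel material such as TIMIT into DoA-labelled examples can be sketched with a far-field two-microphone model, where the inter-microphone delay for a source at angle θ is d·cos(θ)/c. This is only a simplified illustration: the 0.1 m spacing, 16 kHz sampling rate, integer-sample delay and the absence of reverberation are assumptions, not the paper's simulation setup:

```python
import numpy as np

def tdoa_for_doa(theta_deg, mic_spacing=0.1, c=343.0):
    """Far-field time difference of arrival (seconds) between two mics
    separated by `mic_spacing` metres, for a source at `theta_deg`
    (0 degrees = endfire, 90 degrees = broadside)."""
    return mic_spacing * np.cos(np.deg2rad(theta_deg)) / c

def make_pair(source, theta_deg, fs=16000, mic_spacing=0.1, c=343.0):
    """Build a two-channel training example by delaying `source` by the
    integer sample lag closest to the true TDOA (fractional delays and
    room reverberation are omitted in this sketch)."""
    lag = int(round(tdoa_for_doa(theta_deg, mic_spacing, c) * fs))
    mic1 = source
    mic2 = np.roll(source, lag)             # circularly shifted copy
    return np.stack([mic1, mic2]), lag
```

A broadside source (θ = 90°) arrives at both microphones simultaneously, while an endfire source (θ = 0°) produces the maximum delay of d/c, here about 0.29 ms or 5 samples at 16 kHz.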
Q12. How does the accuracy of the GCC-PHAT function compare to other methods?
For 0.3 s, however, GAN clearly outperforms GCC, especially for DoAs 30°, 45°, 135° and 150°, where the accuracy improves 16× on average.
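The GCC-PHAT baseline abbreviated as GCC above estimates the time difference of arrival from the peak of a phase-transform-weighted cross-correlation, computed by whitening the cross-power spectrum by its magnitude so that only phase information contributes. A minimal NumPy sketch (the signal length and the white-noise test signal are illustrative assumptions):

```python
import numpy as np

def gcc_phat(sig, ref, fs=1):
    """Estimate the delay (in seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = len(sig) + len(ref)                 # zero-pad to avoid wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)                  # cross-power spectrum
    R /= np.abs(R) + 1e-15                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)               # generalised cross-correlation
    max_shift = n // 2
    # reorder so that lag 0 sits at index `max_shift`
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    delay = np.argmax(np.abs(cc)) - max_shift
    return delay / fs

rng = np.random.default_rng(1)
ref = rng.normal(size=1024)                 # white-noise reference channel
sig = np.roll(ref, 5)                       # delayed copy: true lag = 5 samples
```

Here `gcc_phat(sig, ref)` recovers the 5-sample lag; in an array the estimated delay between a microphone pair is then mapped to a DoA through the array geometry.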
Q13. Why do the authors think that synthetic speech generators are more effective?
The authors conjecture that this is due to synthetic speech generators being able to generate accurate speech samples, while current music generators are simple and usually focus on one instrument.
Q14. How does the accuracy obtained with synthetic speech compare with that of a BSAR model?
The authors observed that generating synthetic speech with WaveGAN yields about a 15% relative improvement in accuracy over methods such as synthesis using a BSAR model, across 16 acoustic conditions and 9 DoAs.