Open Access · Posted Content

ESResNet: Environmental Sound Classification Based on Visual Domain Models

TL;DR
This work presents a model based on simple log-power Short-Time Fourier Transform (STFT) spectrograms that is inherently compatible with mono and stereo sound inputs and outperforms all previously known approaches in a fair comparison.
Abstract
Environmental Sound Classification (ESC) is an active research area in the audio domain and has seen a lot of progress in the past years. However, many of the existing approaches achieve high accuracy by relying on domain-specific features and architectures, making it harder to benefit from advances in other fields (e.g., the image domain). Additionally, some of the past successes have been attributed to a discrepancy in how results are evaluated (i.e., on unofficial splits of the UrbanSound8K (US8K) dataset), distorting the overall progression of the field. The contribution of this paper is twofold. First, we present a model that is inherently compatible with mono and stereo sound inputs. Our model is based on simple log-power Short-Time Fourier Transform (STFT) spectrograms and combines them with several well-known approaches from the image domain (i.e., ResNet, Siamese-like networks and attention). We investigate the influence of cross-domain pre-training and architectural changes, and evaluate our model on standard datasets. We find that our model outperforms all previously known approaches in a fair comparison, achieving accuracies of 97.0% (ESC-10), 91.5% (ESC-50) and 84.2% / 85.4% (US8K mono / stereo). Second, we provide a comprehensive overview of the actual state of the field by differentiating several previously reported results on the US8K dataset between official and unofficial splits. For better reproducibility, our code (including any re-implementations) is made available.
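As a rough illustration of the pipeline sketched in the abstract, the snippet below computes a log-power STFT spectrogram with PyTorch and feeds it to an ImageNet-pretrained ResNet-50 with a replaced classification head. It is a minimal sketch, not the authors' ESResNet implementation: the Siamese-like splitting and attention blocks are omitted, and the FFT size, hop length and class count are illustrative assumptions.

```python
# Minimal sketch (not the authors' exact ESResNet code): log-power STFT
# spectrograms fed to an ImageNet-pretrained ResNet-50 whose final layer is
# replaced for the 50 classes of ESC-50. All hyper-parameters are illustrative.
import torch
import torchvision


def log_power_stft(wave, n_fft=1024, hop=256, eps=1e-10):
    """wave: (channels, samples) -> log-power spectrogram (channels, freq, time)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True)
    return torch.log(spec.abs() ** 2 + eps)


# ImageNet-pretrained backbone (older torchvision versions use pretrained=True)
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 50)   # ESC-50 has 50 classes

wave = torch.randn(1, 5 * 44100)            # 5 s of mono audio at 44.1 kHz
spec = log_power_stft(wave)                 # (1, 513, T)
x = spec.unsqueeze(0).repeat(1, 3, 1, 1)    # tile to 3 channels for the RGB stem
logits = model(x)                           # (1, 50)
```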


Citations
Posted Content

Rethinking CNN Models for Audio Classification

TL;DR: Shows that standard ImageNet-pretrained deep CNN models can serve as strong baseline networks for audio classification, and presents qualitative results, obtained by visualizing gradients, of what the CNNs learn from the spectrograms.
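As a generic illustration of the gradient visualization mentioned above, the following sketch computes a plain input-gradient saliency map over a spectrogram-shaped input; it is a standard technique under assumed shapes, not necessarily the paper's exact procedure.

```python
# Hedged sketch: input-gradient saliency over a spectrogram "image".
# The model and input shapes are placeholders, not the paper's setup.
import torch
import torchvision


def saliency(model, spec, target_class):
    """spec: (1, C, F, T) batch; returns |d logit / d input| as a (1, F, T) map."""
    spec = spec.clone().requires_grad_(True)
    logits = model(spec)
    logits[0, target_class].backward()
    return spec.grad.abs().amax(dim=1)      # collapse channels into one heat map


model = torchvision.models.resnet18(weights=None)
x = torch.randn(1, 3, 128, 256)             # stand-in spectrogram
heat = saliency(model, x, target_class=0)   # (1, 128, 256)
```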
Posted Content

CLAR: Contrastive Learning of Auditory Representations

TL;DR: By combining the proposed training methods and using substantially less labeled data, the CLAR framework achieves significantly better prediction performance than a purely supervised approach and converges faster with better representations.
Journal Article

Diverse ocean noise classification using deep learning

TL;DR: Presents a Convolutional Neural Network-based ocean noise classification and recognition system capable of classifying vocalizations of cetaceans, fishes, and marine invertebrates, as well as anthropogenic, natural, and unidentified ocean sounds from passive acoustic ocean noise recordings.
Proceedings Article

Multi-View Audio And Music Classification

TL;DR: Proposes a multi-view learning approach for audio and music classification consisting of four sub-networks, each handling one input type; the embeddings learned by the sub-networks are then concatenated to form a multi-view embedding for classification, similar to a simple concatenation network.
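A minimal sketch of the concatenation pattern described above, assuming four placeholder sub-networks over flattened inputs; the sub-network architectures and the concrete input views in the paper differ.

```python
# Hedged sketch: one sub-network per input view, embeddings concatenated
# before the classifier. Dimensions and branch depth are placeholders.
import torch
import torch.nn as nn


class MultiViewClassifier(nn.Module):
    def __init__(self, view_dims, embed_dim=128, num_classes=50):
        super().__init__()
        # one small sub-network per view (e.g. different time-frequency representations)
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(d, embed_dim), nn.ReLU()) for d in view_dims
        )
        self.head = nn.Linear(embed_dim * len(view_dims), num_classes)

    def forward(self, views):                # views: list of (B, d_i) tensors
        embeddings = [branch(v) for branch, v in zip(self.branches, views)]
        return self.head(torch.cat(embeddings, dim=1))


model = MultiViewClassifier(view_dims=[1024, 1024, 1024, 1024])
views = [torch.randn(8, 1024) for _ in range(4)]
logits = model(views)                        # (8, 50)
```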
Journal Article

CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification

TL;DR: Finds an intriguing interaction between two very different models: CNN and AST models are good teachers for each other; when either is used as the teacher and the other is trained as the student via knowledge distillation, the student's performance noticeably improves and, in many cases, exceeds that of the teacher.
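A minimal sketch of standard knowledge distillation, which the summary above refers to; the temperature, loss weighting and training setup are generic assumptions rather than the paper's exact configuration.

```python
# Hedged sketch: blend cross-entropy on ground truth with a KL term that pulls
# the student towards the teacher's temperature-softened outputs.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1.0 - alpha) * soft


student_logits = torch.randn(8, 50, requires_grad=True)
teacher_logits = torch.randn(8, 50)
labels = torch.randint(0, 50, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```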
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: Proposes a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and won first place on the ILSVRC 2015 classification task.
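A minimal sketch of the identity-shortcut idea behind residual learning, y = F(x) + x, in the common "basic block" form; it does not reproduce the full original architectures.

```python
# Hedged sketch of a basic residual block: the convolutions learn a residual
# F(x) that is added back onto the identity shortcut.
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)            # y = F(x) + x


y = BasicBlock(64)(torch.randn(1, 64, 32, 32))   # shape preserved: (1, 64, 32, 32)
```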
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
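A minimal sketch of one Adam update using the defaults from the paper (beta1 = 0.9, beta2 = 0.999, eps = 1e-8); in practice one would use a framework's built-in optimizer, e.g. torch.optim.Adam.

```python
# Hedged sketch of a single Adam step: adaptive estimates of the first and
# second moments of the gradient, with bias correction.
import numpy as np


def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad             # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2        # second-moment estimate
    m_hat = m / (1 - b1 ** t)                # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v


theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 1001):                     # minimize a toy quadratic
    grad = 2 * (theta - np.array([1.0, -2.0, 0.5]))
    theta, m, v = adam_step(theta, grad, m, v, t)
```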
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: Trains a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieving state-of-the-art performance on ImageNet classification.
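A compact sketch of the layer layout summarized above (five convolutional layers, some followed by max-pooling, then three fully-connected layers and a 1000-way softmax), using the single-GPU channel sizes common in re-implementations and assuming 224x224 inputs; the original used a two-GPU split plus details such as local response normalization and dropout.

```python
# Hedged sketch of an AlexNet-like stack; channel sizes follow common
# single-GPU re-implementations, not the original two-GPU layout.
import torch
import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, 11, stride=4, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                   # logits for the 1000-way softmax
)

logits = alexnet_like(torch.randn(1, 3, 224, 224))   # (1, 1000)
```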
Proceedings Article

Attention is All you Need

TL;DR: Proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
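A minimal sketch of scaled dot-product attention, the core operation of the architecture summarized above: softmax(QK^T / sqrt(d_k)) V.

```python
# Hedged sketch of scaled dot-product attention (single call, no masking,
# no multi-head projections).
import math
import torch


def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, heads, seq, d_k) tensors."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v


q = k = v = torch.randn(2, 8, 16, 64)
out = scaled_dot_product_attention(q, k, v)   # (2, 8, 16, 64)
```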
Proceedings Article

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.