Open Access · Posted Content

ESResNet: Environmental Sound Classification Based on Visual Domain Models

TL;DR
This work presents a model based on simple log-power Short-Time Fourier Transform (STFT) spectrograms that is inherently compatible with mono and stereo sound inputs and outperforms all previously known approaches in a fair comparison.
Abstract
Environmental Sound Classification (ESC) is an active research area in the audio domain and has seen a lot of progress in the past years. However, many of the existing approaches achieve high accuracy by relying on domain-specific features and architectures, making it harder to benefit from advances in other fields (e.g., the image domain). Additionally, some of the past successes have been attributed to a discrepancy in how results are evaluated (i.e., on unofficial splits of the UrbanSound8K (US8K) dataset), distorting the overall progression of the field. The contribution of this paper is twofold. First, we present a model that is inherently compatible with mono and stereo sound inputs. Our model is based on simple log-power Short-Time Fourier Transform (STFT) spectrograms and combines them with several well-known approaches from the image domain (i.e., ResNet, Siamese-like networks and attention). We investigate the influence of cross-domain pre-training and architectural changes, and evaluate our model on standard datasets. We find that our model outperforms all previously known approaches in a fair comparison, achieving accuracies of 97.0% (ESC-10), 91.5% (ESC-50) and 84.2% / 85.4% (US8K mono / stereo). Second, we provide a comprehensive overview of the actual state of the field by differentiating several previously reported results on the US8K dataset between official and unofficial splits. For better reproducibility, our code (including any re-implementations) is made available.
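As a rough illustration of the pipeline sketched in the abstract, the snippet below computes a log-power STFT spectrogram with PyTorch and feeds it to an ImageNet-pretrained ResNet-50 with a replaced classification head. It is a minimal sketch, not the authors' ESResNet implementation: the Siamese-like splitting and attention blocks are omitted, and the FFT size, hop length and class count are illustrative assumptions.

```python
# Minimal sketch (not the authors' exact ESResNet code): log-power STFT
# spectrograms fed to an ImageNet-pretrained ResNet-50 whose final layer is
# replaced for the 50 classes of ESC-50. All hyper-parameters are illustrative.
import torch
import torchvision


def log_power_stft(wave, n_fft=1024, hop=256, eps=1e-10):
    """wave: (channels, samples) -> log-power spectrogram (channels, freq, time)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True)
    return torch.log(spec.abs() ** 2 + eps)


# ImageNet-pretrained backbone (older torchvision versions use pretrained=True)
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 50)   # ESC-50 has 50 classes

wave = torch.randn(1, 5 * 44100)            # 5 s of mono audio at 44.1 kHz
spec = log_power_stft(wave)                 # (1, 513, T)
x = spec.unsqueeze(0).repeat(1, 3, 1, 1)    # tile to 3 channels for the RGB stem
logits = model(x)                           # (1, 50)
```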


Citations
Posted Content

Rethinking CNN Models for Audio Classification

TL;DR: Shows that standard ImageNet-pretrained deep CNN models can serve as strong baseline networks for audio classification, and presents qualitative results, obtained by visualizing gradients, of what the CNNs learn from the spectrograms.
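As a generic illustration of the gradient visualization mentioned above, the following sketch computes a plain input-gradient saliency map over a spectrogram-shaped input; it is a standard technique under assumed shapes, not necessarily the paper's exact procedure.

```python
# Hedged sketch: input-gradient saliency over a spectrogram "image".
# The model and input shapes are placeholders, not the paper's setup.
import torch
import torchvision


def saliency(model, spec, target_class):
    """spec: (1, C, F, T) batch; returns |d logit / d input| as a (1, F, T) map."""
    spec = spec.clone().requires_grad_(True)
    logits = model(spec)
    logits[0, target_class].backward()
    return spec.grad.abs().amax(dim=1)      # collapse channels into one heat map


model = torchvision.models.resnet18(weights=None)
x = torch.randn(1, 3, 128, 256)             # stand-in spectrogram
heat = saliency(model, x, target_class=0)   # (1, 128, 256)
```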
Posted Content

CLAR: Contrastive Learning of Auditory Representations

TL;DR: By combining the proposed training methods and using substantially less labeled data, the CLAR framework achieves significantly better prediction performance than a purely supervised approach and converges faster with better representations.
Journal Article

Diverse ocean noise classification using deep learning

TL;DR: Presents a Convolutional Neural Network-based ocean noise classification and recognition system capable of classifying vocalizations of cetaceans, fishes, and marine invertebrates, as well as anthropogenic, natural, and unidentified ocean sounds from passive acoustic ocean noise recordings.
Proceedings Article

Multi-View Audio And Music Classification

TL;DR: Proposes a multi-view learning approach for audio and music classification consisting of four sub-networks, each handling one input type; the embeddings learned by the sub-networks are then concatenated to form a multi-view embedding for classification, similar to a simple concatenation network.
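A minimal sketch of the concatenation pattern described above, assuming four placeholder sub-networks over flattened inputs; the sub-network architectures and the concrete input views in the paper differ.

```python
# Hedged sketch: one sub-network per input view, embeddings concatenated
# before the classifier. Dimensions and branch depth are placeholders.
import torch
import torch.nn as nn


class MultiViewClassifier(nn.Module):
    def __init__(self, view_dims, embed_dim=128, num_classes=50):
        super().__init__()
        # one small sub-network per view (e.g. different time-frequency representations)
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(d, embed_dim), nn.ReLU()) for d in view_dims
        )
        self.head = nn.Linear(embed_dim * len(view_dims), num_classes)

    def forward(self, views):                # views: list of (B, d_i) tensors
        embeddings = [branch(v) for branch, v in zip(self.branches, views)]
        return self.head(torch.cat(embeddings, dim=1))


model = MultiViewClassifier(view_dims=[1024, 1024, 1024, 1024])
views = [torch.randn(8, 1024) for _ in range(4)]
logits = model(views)                        # (8, 50)
```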
Journal Article

CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification

TL;DR: Finds an intriguing interaction between two very different models: CNN and AST models are good teachers for each other; when either is used as the teacher and the other is trained as the student via knowledge distillation, the student's performance noticeably improves and, in many cases, exceeds that of the teacher.
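A minimal sketch of standard knowledge distillation, which the summary above refers to; the temperature, loss weighting and training setup are generic assumptions rather than the paper's exact configuration.

```python
# Hedged sketch: blend cross-entropy on ground truth with a KL term that pulls
# the student towards the teacher's temperature-softened outputs.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1.0 - alpha) * soft


student_logits = torch.randn(8, 50, requires_grad=True)
teacher_logits = torch.randn(8, 50)
labels = torch.randint(0, 50, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```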
References
Proceedings Article

Deep Residual Learning for Image Recognition

TL;DR: Proposes a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and won first place on the ILSVRC 2015 classification task.
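A minimal sketch of the identity-shortcut idea behind residual learning, y = F(x) + x, in the common "basic block" form; it does not reproduce the full original architectures.

```python
# Hedged sketch of a basic residual block: the convolutions learn a residual
# F(x) that is added back onto the identity shortcut.
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)            # y = F(x) + x


y = BasicBlock(64)(torch.randn(1, 64, 32, 32))   # shape preserved: (1, 64, 32, 32)
```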
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
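A minimal sketch of one Adam update using the defaults from the paper (beta1 = 0.9, beta2 = 0.999, eps = 1e-8); in practice one would use a framework's built-in optimizer, e.g. torch.optim.Adam.

```python
# Hedged sketch of a single Adam step: adaptive estimates of the first and
# second moments of the gradient, with bias correction.
import numpy as np


def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad             # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2        # second-moment estimate
    m_hat = m / (1 - b1 ** t)                # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v


theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 1001):                     # minimize a toy quadratic
    grad = 2 * (theta - np.array([1.0, -2.0, 0.5]))
    theta, m, v = adam_step(theta, grad, m, v, t)
```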
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: Trains a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax, achieving state-of-the-art performance on ImageNet classification.
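A compact sketch of the layer layout summarized above (five convolutional layers, some followed by max-pooling, then three fully-connected layers and a 1000-way softmax), using the single-GPU channel sizes common in re-implementations and assuming 224x224 inputs; the original used a two-GPU split plus details such as local response normalization and dropout.

```python
# Hedged sketch of an AlexNet-like stack; channel sizes follow common
# single-GPU re-implementations, not the original two-GPU layout.
import torch
import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 64, 11, stride=4, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                   # logits for the 1000-way softmax
)

logits = alexnet_like(torch.randn(1, 3, 224, 224))   # (1, 1000)
```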
Proceedings Article

Attention is All you Need

TL;DR: Proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
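A minimal sketch of scaled dot-product attention, the core operation of the architecture summarized above: softmax(QK^T / sqrt(d_k)) V.

```python
# Hedged sketch of scaled dot-product attention (single call, no masking,
# no multi-head projections).
import math
import torch


def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, heads, seq, d_k) tensors."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v


q = k = v = torch.randn(2, 8, 16, 64)
out = scaled_dot_product_attention(q, k, v)   # (2, 8, 16, 64)
```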
Proceedings Article

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.