Open Access Proceedings ArticleDOI

Slow-Fast Auditory Streams for Audio Recognition

TLDR
In this article, a two-stream convolutional network for audio recognition is proposed, which operates on time-frequency spectrogram inputs and achieves state-of-the-art results on both VGG-Sound and EPIC-KITCHENS-100 datasets.
Abstract
We propose a two-stream convolutional network for audio recognition that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity while the Fast pathway operates at a fine-grained temporal resolution. We showcase the importance of our two-stream proposal on two diverse datasets: VGG-Sound and EPIC-KITCHENS-100, and achieve state-of-the-art results on both.
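The abstract's two-pathway design can be illustrated with some simple shape bookkeeping. This is a hedged sketch, not the paper's exact configuration: the names `alpha`, `beta`, and `slow_channels`, and the specific ratios, are illustrative assumptions; the paper defines its own strides and channel widths.

```python
# Hypothetical shape bookkeeping for a Slow-Fast auditory network.
# alpha (temporal stride of the Slow pathway) and beta (channel ratio)
# are illustrative assumptions, not the paper's exact values.

def pathway_shapes(time_frames, freq_bins, alpha=4, beta=8, slow_channels=64):
    """Return (channels, time, freq) feature shapes for each pathway.

    The Slow pathway subsamples time by `alpha` but has `beta`x more
    channels; the Fast pathway keeps full temporal resolution with a
    thin channel dimension.
    """
    slow = (slow_channels, time_frames // alpha, freq_bins)
    fast = (slow_channels // beta, time_frames, freq_bins)
    return slow, fast

def lateral_connect(slow, fast, alpha=4):
    """Sketch of a lateral connection: stride the Fast time axis by
    `alpha` so the pathways align, then concatenate along channels."""
    assert fast[1] // alpha == slow[1], "time axes must align after striding"
    return (slow[0] + fast[0], slow[1], slow[2])
```

For a 128-frame, 80-bin spectrogram this gives a Slow feature of shape (64, 32, 80) and a Fast feature of shape (8, 128, 80), fused laterally into (72, 32, 80).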


Citations
Journal ArticleDOI

ImageBind: One Embedding Space To Bind Them All

TL;DR: ImageBind as discussed by the authors learns a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data - and shows that not all combinations of paired data are necessary to train such an embedding: image-paired data alone is sufficient to bind the modalities together.
Proceedings ArticleDOI

Contrastive Audio-Visual Masked Autoencoder

TL;DR: The Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) is proposed by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation.
Proceedings ArticleDOI

Wav2CLIP: Learning Robust Audio Representations from Clip

TL;DR: Wav2CLIP as mentioned in this paper distills audio representations from Contrastive Language-Image Pre-training (CLIP), projecting audio into a shared embedding space with images and text and enabling multimodal applications such as zero-shot classification and cross-modal retrieval.
Proceedings ArticleDOI

Towards Learning Universal Audio Representations

TL;DR: In this paper, the authors introduce a holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains and provide a thorough empirical study of recent sound representation learning systems on that benchmark.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting networks won 1st place in the ILSVRC 2015 classification task.
Book ChapterDOI

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

TL;DR: Temporal Segment Networks (TSN) as discussed by the authors combine a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video, which obtains the state-of-the-art performance on the datasets of HMDB51 and UCF101.
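The sparse temporal sampling the TSN summary describes can be sketched in a few lines. This is an illustrative sketch of the idea only; the function names and the segmental-average consensus are assumptions, not TSN's exact implementation.

```python
import random

def sparse_sample(num_frames, num_segments=3, seed=None):
    """TSN-style sparse sampling (illustrative sketch): split the video
    into `num_segments` equal chunks and draw one frame index from each,
    so samples cover the whole action rather than one local window."""
    rng = random.Random(seed)
    seg_len = num_frames // num_segments
    return [i * seg_len + rng.randrange(seg_len) for i in range(num_segments)]

def segmental_consensus(snippet_scores):
    """Average per-snippet class scores into a video-level prediction,
    giving the whole-video supervision the summary mentions."""
    n = len(snippet_scores)
    return [sum(col) / n for col in zip(*snippet_scores)]
```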
Proceedings ArticleDOI

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

TL;DR: This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
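The masking that SpecAugment applies to the feature inputs can be sketched as follows. This is a minimal sketch under assumptions: the mask-size caps are illustrative, and the paper's full method also includes time warping, which is omitted here.

```python
import random

def spec_augment(spec, max_t=10, max_f=8, seed=None):
    """Minimal SpecAugment-style sketch: zero out one random block of
    consecutive time steps and one block of frequency channels in a
    spectrogram given as a list of rows (one row per time step)."""
    rng = random.Random(seed)
    out = [row[:] for row in spec]        # copy so the input is untouched
    T, F = len(out), len(out[0])
    # time mask: t consecutive frames starting at t0
    t = rng.randrange(min(max_t, T) + 1)
    t0 = rng.randrange(T - t + 1)
    for i in range(t0, t0 + t):
        out[i] = [0.0] * F
    # frequency mask: f consecutive channels starting at f0
    f = rng.randrange(min(max_f, F) + 1)
    f0 = rng.randrange(F - f + 1)
    for row in out:
        row[f0:f0 + f] = [0.0] * f
    return out
```

Because the masks are drawn fresh for every call, the same utterance yields different training inputs each epoch, which is what makes this a data augmentation rather than a fixed preprocessing step.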
Proceedings ArticleDOI

SlowFast Networks for Video Recognition

TL;DR: This work presents SlowFast networks for video recognition, which achieves strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by the SlowFast concept.
Proceedings ArticleDOI

A Closer Look at Spatiotemporal Convolutions for Action Recognition

TL;DR: In this article, a new spatio-temporal convolutional block "R(2+1)D" was proposed, which achieved state-of-the-art performance on Sports-1M, Kinetics, UCF101, and HMDB51.
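The "(2+1)D" factorization the summary refers to replaces a full t x d x d 3D convolution with a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal one, choosing the intermediate channel count M so the parameter budget roughly matches the 3D convolution (this is the matching rule from the R(2+1)D paper; bias terms are ignored in this sketch).

```python
def conv3d_params(t, d, n_in, n_out):
    """Weights in a full t x d x d 3D convolution (bias ignored)."""
    return t * d * d * n_in * n_out

def r2plus1d_params(t, d, n_in, n_out):
    """Weights in the factorized (2+1)D block: a 1 x d x d spatial conv
    into M intermediate channels, then a t x 1 x 1 temporal conv, with
    M chosen to approximately match the 3D conv's parameter count."""
    m = (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)
    return d * d * n_in * m + t * m * n_out
```

With matched parameters, the factorized block adds an extra nonlinearity between the spatial and temporal convolutions, which the paper credits for part of its accuracy gain.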