Open Access Proceedings ArticleDOI

Slow-Fast Auditory Streams for Audio Recognition

TLDR
In this article, a two-stream convolutional network for audio recognition is proposed, which operates on time-frequency spectrogram inputs and achieves state-of-the-art results on both VGG-Sound and EPIC-KITCHENS-100 datasets.
Abstract
We propose a two-stream convolutional network for audio recognition that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity while the Fast pathway operates at a fine-grained temporal resolution. We showcase the importance of our two-stream proposal on two diverse datasets: VGG-Sound and EPIC-KITCHENS-100, and achieve state-of-the-art results on both.
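The abstract's two-pathway design can be illustrated with some simple shape bookkeeping. This is a hedged sketch, not the paper's exact configuration: the names `alpha`, `beta`, and `slow_channels`, and the specific ratios, are illustrative assumptions; the paper defines its own strides and channel widths.

```python
# Hypothetical shape bookkeeping for a Slow-Fast auditory network.
# alpha (temporal stride of the Slow pathway) and beta (channel ratio)
# are illustrative assumptions, not the paper's exact values.

def pathway_shapes(time_frames, freq_bins, alpha=4, beta=8, slow_channels=64):
    """Return (channels, time, freq) feature shapes for each pathway.

    The Slow pathway subsamples time by `alpha` but has `beta`x more
    channels; the Fast pathway keeps full temporal resolution with a
    thin channel dimension.
    """
    slow = (slow_channels, time_frames // alpha, freq_bins)
    fast = (slow_channels // beta, time_frames, freq_bins)
    return slow, fast

def lateral_connect(slow, fast, alpha=4):
    """Sketch of a lateral connection: stride the Fast time axis by
    `alpha` so the pathways align, then concatenate along channels."""
    assert fast[1] // alpha == slow[1], "time axes must align after striding"
    return (slow[0] + fast[0], slow[1], slow[2])
```

For a 128-frame, 80-bin spectrogram this gives a Slow feature of shape (64, 32, 80) and a Fast feature of shape (8, 128, 80), fused laterally into (72, 32, 80).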


Citations
Journal ArticleDOI

ImageBind: One Embedding Space To Bind Them All

TL;DR: ImageBind as discussed by the authors learns a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data - and shows that not all combinations of paired data are necessary to train such an embedding: image-paired data alone is sufficient to bind the modalities together.
Proceedings ArticleDOI

Contrastive Audio-Visual Masked Autoencoder

TL;DR: The Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) is proposed by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation.
Proceedings ArticleDOI

Wav2CLIP: Learning Robust Audio Representations from Clip

TL;DR: Wav2CLIP as mentioned in this paper distills audio representations from Contrastive Language-Image Pre-training (CLIP), projecting audio into a shared embedding space with images and text and enabling multimodal applications such as zero-shot classification and cross-modal retrieval.
Proceedings ArticleDOI

Towards Learning Universal Audio Representations

TL;DR: In this paper, the authors introduce a holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains and provide a thorough empirical study of recent sound representation learning systems on that benchmark.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting networks won 1st place in the ILSVRC 2015 classification task.
Book ChapterDOI

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

TL;DR: Temporal Segment Networks (TSN) as discussed by the authors combine a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video, which obtains the state-of-the-art performance on the datasets of HMDB51 and UCF101.
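The sparse temporal sampling the TSN summary describes can be sketched in a few lines. This is an illustrative sketch of the idea only; the function names and the segmental-average consensus are assumptions, not TSN's exact implementation.

```python
import random

def sparse_sample(num_frames, num_segments=3, seed=None):
    """TSN-style sparse sampling (illustrative sketch): split the video
    into `num_segments` equal chunks and draw one frame index from each,
    so samples cover the whole action rather than one local window."""
    rng = random.Random(seed)
    seg_len = num_frames // num_segments
    return [i * seg_len + rng.randrange(seg_len) for i in range(num_segments)]

def segmental_consensus(snippet_scores):
    """Average per-snippet class scores into a video-level prediction,
    giving the whole-video supervision the summary mentions."""
    n = len(snippet_scores)
    return [sum(col) / n for col in zip(*snippet_scores)]
```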
Proceedings ArticleDOI

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

TL;DR: This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
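The masking that SpecAugment applies to the feature inputs can be sketched as follows. This is a minimal sketch under assumptions: the mask-size caps are illustrative, and the paper's full method also includes time warping, which is omitted here.

```python
import random

def spec_augment(spec, max_t=10, max_f=8, seed=None):
    """Minimal SpecAugment-style sketch: zero out one random block of
    consecutive time steps and one block of frequency channels in a
    spectrogram given as a list of rows (one row per time step)."""
    rng = random.Random(seed)
    out = [row[:] for row in spec]        # copy so the input is untouched
    T, F = len(out), len(out[0])
    # time mask: t consecutive frames starting at t0
    t = rng.randrange(min(max_t, T) + 1)
    t0 = rng.randrange(T - t + 1)
    for i in range(t0, t0 + t):
        out[i] = [0.0] * F
    # frequency mask: f consecutive channels starting at f0
    f = rng.randrange(min(max_f, F) + 1)
    f0 = rng.randrange(F - f + 1)
    for row in out:
        row[f0:f0 + f] = [0.0] * f
    return out
```

Because the masks are drawn fresh for every call, the same utterance yields different training inputs each epoch, which is what makes this a data augmentation rather than a fixed preprocessing step.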
Proceedings ArticleDOI

SlowFast Networks for Video Recognition

TL;DR: This work presents SlowFast networks for video recognition, which achieves strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by the SlowFast concept.
Proceedings ArticleDOI

A Closer Look at Spatiotemporal Convolutions for Action Recognition

TL;DR: In this article, a new spatio-temporal convolutional block "R(2+1)D" was proposed, which achieved state-of-the-art performance on Sports-1M, Kinetics, UCF101, and HMDB51.
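The "(2+1)D" factorization the summary refers to replaces a full t x d x d 3D convolution with a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal one, choosing the intermediate channel count M so the parameter budget roughly matches the 3D convolution (this is the matching rule from the R(2+1)D paper; bias terms are ignored in this sketch).

```python
def conv3d_params(t, d, n_in, n_out):
    """Weights in a full t x d x d 3D convolution (bias ignored)."""
    return t * d * d * n_in * n_out

def r2plus1d_params(t, d, n_in, n_out):
    """Weights in the factorized (2+1)D block: a 1 x d x d spatial conv
    into M intermediate channels, then a t x 1 x 1 temporal conv, with
    M chosen to approximately match the 3D conv's parameter count."""
    m = (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)
    return d * d * n_in * m + t * m * n_out
```

With matched parameters, the factorized block adds an extra nonlinearity between the spatial and temporal convolutions, which the paper credits for part of its accuracy gain.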