Slow-Fast Auditory Streams for Audio Recognition
Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen
pp. 855-859
TL;DR: In this article, a two-stream convolutional network for audio recognition is proposed, which operates on time-frequency spectrogram inputs and achieves state-of-the-art results on both the VGG-Sound and EPIC-KITCHENS-100 datasets.

Abstract:
We propose a two-stream convolutional network for audio recognition that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity while the Fast pathway operates at a fine-grained temporal resolution. We showcase the importance of our two-stream proposal on two diverse datasets, VGG-Sound and EPIC-KITCHENS-100, and achieve state-of-the-art results on both.
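A minimal sketch of the two-pathway input split described above, assuming a (time x freq) log-mel spectrogram and a speed ratio of 4 (the function name and the ratio are illustrative, not taken from the paper; the full model additionally gives the Slow pathway higher channel capacity and fuses the streams through lateral connections):

```python
import numpy as np

def slowfast_inputs(spectrogram, alpha=4):
    """Split a log-mel spectrogram (time x freq) into two pathway inputs.

    The Slow pathway sees a temporally subsampled view (stride `alpha`),
    while the Fast pathway keeps every frame at full temporal resolution.
    """
    slow = spectrogram[::alpha]  # coarse temporal resolution
    fast = spectrogram           # fine-grained temporal resolution
    return slow, fast

spec = np.random.randn(400, 128)   # 400 frames x 128 mel bins
slow, fast = slowfast_inputs(spec)
print(slow.shape, fast.shape)      # (100, 128) (400, 128)
```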
Citations
Journal Article
ImageBind: One Embedding Space To Bind Them All
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra
TL;DR: ImageBind learns a joint embedding across six modalities: images, text, audio, depth, thermal, and IMU data. It shows that not all combinations of paired data are necessary to train such an embedding; image-paired data alone is sufficient to bind the modalities together.
Proceedings Article
Contrastive Audio-Visual Masked Autoencoder
Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass
TL;DR: The Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) is proposed by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation.
Journal Article
Overview of behavior recognition based on deep learning
Proceedings Article
Wav2CLIP: Learning Robust Audio Representations from CLIP
TL;DR: Wav2CLIP distills audio representations from Contrastive Language-Image Pre-training (CLIP), projecting audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification and cross-modal retrieval.
Proceedings Article
Towards Learning Universal Audio Representations
TL;DR: The authors introduce a holistic audio representation evaluation suite (HARES) spanning 12 downstream tasks across audio domains and provide a thorough empirical study of recent sound representation learning systems on that benchmark.
References
Proceedings Article
Deep Residual Learning for Image Recognition
TL;DR: The authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; it won 1st place on the ILSVRC 2015 classification task.
Book Chapter
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
TL;DR: Temporal Segment Networks (TSN) combine a sparse temporal sampling strategy with video-level supervision to enable efficient and effective learning over the whole action video, obtaining state-of-the-art performance on the HMDB51 and UCF101 datasets.
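The sparse sampling idea can be sketched in a few lines: divide the video into equal-length segments and pick one representative frame from each. This is a deterministic centre-frame variant of TSN's per-segment random sampling, and the function name is ours:

```python
def sparse_segment_indices(num_frames, num_segments=3):
    """Pick one representative frame index from each of `num_segments`
    equal-length chunks of a `num_frames`-long video (centre frame;
    TSN samples randomly within each segment during training)."""
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]

print(sparse_segment_indices(30, 3))  # [5, 15, 25]
```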
Proceedings Article
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
TL;DR: This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients); it achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
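Two of SpecAugment's transforms, frequency masking and time masking, amount to zeroing random bands of the spectrogram. A minimal NumPy sketch (mask widths and the single-mask-per-axis choice are illustrative; the paper also uses time warping and multiple masks):

```python
import numpy as np

def spec_augment(spec, freq_mask=10, time_mask=20, rng=None):
    """Zero out one random frequency band and one random time band
    of a (time x freq) spectrogram, as in SpecAugment's masking."""
    rng = rng or np.random.default_rng(0)
    spec = spec.copy()
    t, f = spec.shape
    f0 = rng.integers(0, f - freq_mask + 1)   # random band start
    spec[:, f0:f0 + freq_mask] = 0.0          # frequency masking
    t0 = rng.integers(0, t - time_mask + 1)
    spec[t0:t0 + time_mask, :] = 0.0          # time masking
    return spec
```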
Proceedings Article
SlowFast Networks for Video Recognition
TL;DR: This work presents SlowFast networks for video recognition, which achieve strong performance for both action classification and detection in video, with the large improvements pinpointed as contributions of the SlowFast concept.
Proceedings Article
A Closer Look at Spatiotemporal Convolutions for Action Recognition
TL;DR: In this article, a new spatiotemporal convolutional block, "R(2+1)D", is proposed, which achieves state-of-the-art performance on Sports-1M, Kinetics, UCF101, and HMDB51.
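The (2+1)D block factorises a t x d x d 3D convolution into a d x d spatial convolution followed by a t x 1 x 1 temporal one, choosing the intermediate channel count M so the parameter counts match. The arithmetic is easy to check (the function name is ours; the formula follows the paper's parameter-matching design choice):

```python
import math

def r2plus1d_mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate channels M that make a (2+1)D factorisation
    (d x d spatial conv to M channels, then t x 1 temporal conv)
    match the parameter count of a full t x d x d 3D convolution."""
    return math.floor(t * d * d * n_in * n_out / (d * d * n_in + t * n_out))

m = r2plus1d_mid_channels(64, 64)          # M = 144 for a 64->64 block
params_3d = 3 * 3 * 3 * 64 * 64            # full 3D conv
params_2p1d = 3 * 3 * 64 * m + 3 * m * 64  # spatial + temporal convs
print(m, params_3d, params_2p1d)           # 144 110592 110592
```

Matching parameters lets the factorised block be compared to the 3D one at equal capacity, while inserting an extra nonlinearity between the two convolutions.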
Related Papers (5)
Using the Multi-Stream Approach for Continuous Audio-Visual Speech Recognition
Stéphane Dupont, Juergen Luettin +1 more