Audio Set: An ontology and human-labeled dataset for audio events

doi:10.1109/ICASSP.2017.7952261

Proceedings Article

- pp 776-780

TLDR

This paper describes the creation of Audio Set, a large-scale dataset of manually annotated audio events that aims to bridge the gap in data availability between image and audio research and to stimulate the development of high-performance audio event recognizers.

Abstract:

Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets - principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
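The abstract's two structural ideas, a hierarchical ontology of classes and labels attached to 10-second segments, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the class names are placeholders rather than the actual 632-class ontology, `video_id` and the helper names are hypothetical, and the sketch assumes each class has a single parent, which the abstract does not state.

```python
from dataclasses import dataclass, field

# Illustrative parent -> children map. The names are placeholders,
# not the real 632-class Audio Set ontology.
ONTOLOGY = {
    "Sounds": ["Animal", "Music"],
    "Animal": ["Dog"],
    "Dog": ["Bark"],
    "Music": [],
    "Bark": [],
}

SEGMENT_SECONDS = 10  # segment length stated in the abstract


def build_parent_index(ontology):
    """Invert the parent -> children map into child -> parent."""
    return {
        child: parent
        for parent, children in ontology.items()
        for child in children
    }


def ancestors(label, parent_of):
    """Return ancestors from the immediate parent up to the root."""
    chain = []
    while label in parent_of:
        label = parent_of[label]
        chain.append(label)
    return chain


@dataclass
class Segment:
    video_id: str   # hypothetical identifier
    start_s: float  # segment start time in seconds
    labels: set = field(default_factory=set)

    @property
    def end_s(self):
        return self.start_s + SEGMENT_SECONDS


def expand_labels(segment, parent_of):
    """Return the segment's labels plus all of their ancestors."""
    expanded = set(segment.labels)
    for label in segment.labels:
        expanded.update(ancestors(label, parent_of))
    return expanded


parent_of = build_parent_index(ONTOLOGY)
seg = Segment(video_id="example_video", start_s=30.0, labels={"Bark"})
print(seg.end_s, sorted(expand_labels(seg, parent_of)))
```

Expanding a label to its ancestors is one possible way to exploit the hierarchy (a segment labeled with a specific class also counts as an instance of its broader classes); whether the dataset itself applies such propagation is not stated in the abstract.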

Citations

Journal Article

Continual lifelong learning with neural networks: A review.

German Ignacio Parisi, +4 more

- 01 May 2019 -

Neural Networks

TL;DR: This review critically summarizes the main challenges linked to lifelong learning for artificial learning systems and compares existing neural network approaches that alleviate, to different extents, catastrophic forgetting.

Proceedings Article

CNN architectures for large-scale audio classification

Shawn Hershey, +12 more

TL;DR: In this paper, the authors used various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels.

Journal Article

Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey

Longlong Jing, +1 more

- 01 Nov 2021 -

IEEE Transactions on Pattern Analysis an...

TL;DR: This survey provides an extensive review of deep learning-based self-supervised methods, a subset of unsupervised learning methods, for learning general image and video features from large-scale unlabeled data without any human-annotated labels.

Proceedings Article

Look, Listen and Learn

Relja Arandjelovic, +1 more

TL;DR: The video itself contains a valuable, but so far untapped, source of information: the correspondence between the visual and the audio streams. The paper introduces a novel “Audio-Visual Correspondence” learning task that makes use of it.

Book Chapter

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Andrew Owens, +1 more

TL;DR: In this paper, the authors argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and they propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.

References

Proceedings Article

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

TL;DR: The authors propose a residual learning framework that eases the training of networks substantially deeper than those used previously; the approach won 1st place in the ILSVRC 2015 classification task.

Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

TL;DR: The authors achieve state-of-the-art performance with a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

Proceedings Article

Going deeper with convolutions

Christian Szegedy, +8 more

TL;DR: Inception is a deep convolutional neural network architecture that set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

Journal Article

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, +11 more

- 01 Dec 2015 -

International Journal of Computer Vision

TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection covering hundreds of object categories and millions of images. It has been run annually from 2010 to the present, attracting participation from more than fifty institutions.

Journal Article

WordNet: a lexical database for English

George A. Miller

- 01 Nov 1995 -

Communications of The ACM

TL;DR: WordNet provides a more effective combination of traditional lexicographic information and modern computing, and is an online lexical database designed for use under program control.

Related Papers (5)

Deep Residual Learning for Image Recognition

Adam: A Method for Stochastic Optimization

ImageNet: A large-scale hierarchical image database

Very Deep Convolutional Networks for Large-Scale Image Recognition

Attention is All you Need