Open Access Proceedings Article DOI

AVA Active Speaker: An Audio-Visual Dataset for Active Speaker Detection

TL;DR
The AVA Active Speaker dataset (AVA-ActiveSpeaker) as discussed by the authors contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible.
Abstract
Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
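The jointly trained model itself is only summarized above, but a minimal sketch helps make the idea concrete: two small encoders for the face crops and the audio, frame-level fusion, and a recurrent layer for temporal integration over the face track. All layer sizes, input shapes, and the use of a GRU are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (not the authors' architecture): a jointly trained
# audio-visual active speaker model with temporal integration via a GRU.
# Input shapes and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AVActiveSpeaker(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Visual branch: encodes a track of face crops (T, 3, 64, 64).
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Audio branch: encodes per-frame log-mel slices (T, 1, 40, 20).
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Temporal integration over the face track, then a per-frame score.
        self.gru = nn.GRU(64 + 64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, faces, mels):
        # faces: (B, T, 3, 64, 64); mels: (B, T, 1, 40, 20)
        B, T = faces.shape[:2]
        v = self.visual(faces.flatten(0, 1)).view(B, T, -1)
        a = self.audio(mels.flatten(0, 1)).view(B, T, -1)
        h, _ = self.gru(torch.cat([v, a], dim=-1))
        return torch.sigmoid(self.head(h)).squeeze(-1)  # (B, T) speaking probability

model = AVActiveSpeaker()
probs = model(torch.randn(2, 16, 3, 64, 64), torch.randn(2, 16, 1, 40, 20))
```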


Citations
Posted Content

Self-Supervised Learning of Audio-Visual Objects from Video

TL;DR: This work introduces a model that uses attention to localize and group sound sources and optical flow to aggregate information over time; it significantly outperforms other self-supervised approaches and obtains performance competitive with methods that use supervised face detection.
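As a rough illustration of the attention-based localization described above (not the paper's actual model), a clip-level audio embedding can attend over a spatial visual feature map to produce a localization heatmap and an attention-pooled visual object; the single attention head and dimensions below are assumptions.

```python
# Rough sketch of audio-to-visual attention for sound source localization
# (illustrative only; shapes and the single attention head are assumptions).
import torch
import torch.nn.functional as F

def localize(audio_emb, visual_feats):
    # audio_emb: (B, C) clip-level audio embedding
    # visual_feats: (B, C, H, W) spatial visual feature map
    B, C, H, W = visual_feats.shape
    keys = visual_feats.flatten(2)                       # (B, C, H*W)
    scores = torch.einsum('bc,bcn->bn', audio_emb, keys) / C ** 0.5
    attn = F.softmax(scores, dim=-1)                     # where the sound "looks"
    heatmap = attn.view(B, H, W)                         # localization map
    grouped = torch.einsum('bn,bcn->bc', attn, keys)     # attention-pooled visual object
    return heatmap, grouped

heatmap, obj = localize(torch.randn(2, 64), torch.randn(2, 64, 14, 14))
```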
Posted Content DOI

Rescaling Egocentric Vision

TL;DR: This paper introduces EPIC-KITCHENS-100, the largest annotated egocentric dataset (100 hrs, 20M frames, 90K actions) of wearable videos capturing long-term unscripted activities in 45 environments, using a novel annotation pipeline that allows denser and more complete annotations of fine-grained actions.
Posted Content

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

TL;DR: This paper provides a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets; and objective functions.
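A hedged sketch of the recipe most systems covered by this survey share: fuse per-frame acoustic and visual features and predict a time-frequency mask over the noisy spectrogram. The layer choices, shapes, and the GRU fusion below are assumptions for illustration only.

```python
# Illustrative audio-visual speech enhancement sketch: fuse acoustic and
# visual features, predict a time-frequency mask, apply it to the noisy
# magnitude spectrogram. Shapes and layers are assumptions.
import torch
import torch.nn as nn

class AVEnhancer(nn.Module):
    def __init__(self, n_freq=257, vis_dim=128, hidden=256):
        super().__init__()
        self.fuse = nn.GRU(n_freq + vis_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, visual_emb):
        # noisy_mag: (B, T, n_freq) magnitude spectrogram of the noisy mixture
        # visual_emb: (B, T, vis_dim) per-frame lip/face embeddings
        h, _ = self.fuse(torch.cat([noisy_mag, visual_emb], dim=-1))
        return noisy_mag * self.mask(h)   # masked (enhanced) magnitudes

enhanced = AVEnhancer()(torch.rand(2, 100, 257), torch.randn(2, 100, 128))
```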
Proceedings Article DOI

A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation

TL;DR: This work builds MovieScenes, a large-scale video dataset containing 21K annotated scene segments from 150 movies, and proposes a local-to-global scene segmentation framework that integrates multi-modal information across three levels: clip, segment, and movie.
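To make the local-to-global idea concrete, here is a toy sketch (not the paper's multi-modal framework): candidate scene boundaries are scored by local dissimilarity between neighbouring clip windows, then filtered with a movie-level (global) threshold. The window size and thresholding rule are assumptions.

```python
# Toy sketch of local-to-global scene segmentation: local clip similarity
# scores candidate boundaries; a movie-level statistic sets the threshold.
# Purely illustrative, not the paper's actual framework.
import numpy as np

def scene_boundaries(clip_feats, window=4):
    # clip_feats: (N, D) one embedding per shot/clip, in temporal order
    feats = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    scores = []
    for i in range(window, len(feats) - window):
        before = feats[i - window:i].mean(axis=0)   # local context before the cut
        after = feats[i:i + window].mean(axis=0)    # local context after the cut
        scores.append(1.0 - before @ after)         # dissimilarity across the cut
    scores = np.array(scores)
    thresh = scores.mean() + scores.std()           # global, movie-level threshold
    return [i + window for i, s in enumerate(scores) if s > thresh]

print(scene_boundaries(np.random.rand(50, 64)))
```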
Proceedings Article DOI

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

TL;DR: In this paper, the authors propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration; it achieves 3.5% and 2.2% improvements over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and the Columbia ASD dataset, respectively.
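A small sketch of the long-term part of this idea, assuming frame-level audio-visual features are already available: self-attention over the whole face track re-scores each frame with long-term context. Dimensions and the transformer configuration are illustrative, not TalkNet's.

```python
# Sketch of the short-term + long-term split: frame-level audio-visual
# embeddings (short-term) are re-scored with self-attention over the whole
# track (long-term). Dimensions are assumptions.
import torch
import torch.nn as nn

class LongTermASD(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, av_frames):
        # av_frames: (B, T, dim) short-term audio-visual features per frame
        ctx = self.temporal(av_frames)                     # long-term temporal context
        return torch.sigmoid(self.head(ctx)).squeeze(-1)   # (B, T) speaking probability

scores = LongTermASD()(torch.randn(2, 200, 128))
```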
References
Posted Content

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes, and large-scale geo-localization.
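For concreteness, a sketch of the MobileNet building block, a depthwise separable convolution, with the width multiplier alpha (one of the two global hyper-parameters) shrinking channel counts to trade accuracy for latency; this is illustrative, not the reference implementation.

```python
# Sketch of a depthwise separable convolution block with a width multiplier.
# Illustrative only.
import torch
import torch.nn as nn

def depthwise_separable(c_in, c_out, stride=1, alpha=1.0):
    c_in, c_out = int(c_in * alpha), int(c_out * alpha)
    return nn.Sequential(
        # Depthwise: one 3x3 filter per input channel (groups=c_in).
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        # Pointwise: 1x1 convolution to mix channels.
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

block = depthwise_separable(64, 128, stride=2, alpha=0.5)  # alpha=0.5 halves channels
out = block(torch.randn(1, 32, 56, 56))                    # 64 * 0.5 = 32 input channels
```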
Proceedings Article DOI

A convolutional neural network cascade for face detection

TL;DR: This work proposes a cascade architecture built on convolutional neural networks (CNNs) with very powerful discriminative capability, while maintaining high performance, and introduces a CNN-based calibration stage after each of the detection stages in the cascade.
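The cascade logic can be sketched independently of the network details: each cheap detection stage rejects most candidate windows early, and a calibration network adjusts every surviving box before the next stage. The detector and calibrator below are stand-in callables, not the paper's CNNs.

```python
# Control-flow sketch of a CNN cascade for face detection: early rejection by
# detection stages, followed by a calibration step that refines each surviving
# box. The nets themselves are stand-in callables; only the cascade logic is shown.
def cascade_detect(windows, stages, thresholds):
    # windows: list of candidate face boxes
    # stages: list of (detect_net, calib_net) pairs, cheapest first
    survivors = windows
    for (detect_net, calib_net), thr in zip(stages, thresholds):
        kept = []
        for box in survivors:
            score = detect_net(box)          # detection stage: face / non-face score
            if score < thr:
                continue                     # early rejection keeps the cascade fast
            kept.append(calib_net(box))      # calibration stage: refine box position/scale
        survivors = kept
    return survivors
```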
Book Chapter DOI

Out of Time: Automated Lip Sync in the Wild

TL;DR: The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video.
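One common way to estimate such an offset with a two-stream model is to embed short video and audio windows separately and pick the temporal offset at which the embeddings are closest; the sketch below assumes such embeddings are already computed and is not the paper's exact procedure.

```python
# Rough sketch of audio-video synchronisation scoring: slide the audio stream
# over a range of offsets and pick the offset with the smallest mean embedding
# distance. The per-window embeddings are assumed to come from a two-stream model.
import numpy as np

def best_offset(video_embs, audio_embs, max_offset=10):
    # video_embs, audio_embs: (T, D) per-window embeddings from the two streams
    best, best_dist = 0, float('inf')
    for off in range(-max_offset, max_offset + 1):
        lo, hi = max(0, off), min(len(video_embs), len(audio_embs) + off)
        if hi - lo < 1:
            continue
        d = np.linalg.norm(video_embs[lo:hi] - audio_embs[lo - off:hi - off], axis=1).mean()
        if d < best_dist:
            best, best_dist = off, d
    return best   # estimated audio-video offset, in windows

off = best_offset(np.random.rand(100, 256), np.random.rand(100, 256))
```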
Proceedings Article

Audio Vision: Using Audio-Visual Synchrony to Locate Sounds

TL;DR: This work develops a system that searches for regions of the visual landscape that correlate highly with the acoustic signals, tags them as likely to contain an acoustic source, and presents results on a speaker localization task.
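A toy sketch of the synchrony idea: correlate the audio energy envelope with per-region visual change over time and tag highly correlated regions as likely sound sources. The grid partition and plain Pearson correlation are assumptions for illustration, not the paper's exact method.

```python
# Toy audio-visual synchrony map: correlate per-region frame-to-frame change
# with the change in audio energy; high values suggest a likely sound source.
import numpy as np

def synchrony_map(frames, audio_energy, grid=8):
    # frames: (T, H, W) grayscale video; audio_energy: (T,) per-frame energy
    T, H, W = frames.shape
    motion = np.abs(np.diff(frames, axis=0))              # (T-1, H, W) visual change
    a = np.diff(audio_energy)                             # (T-1,) audio change
    corr = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            patch = motion[:, i*H//grid:(i+1)*H//grid, j*W//grid:(j+1)*W//grid]
            v = patch.reshape(len(patch), -1).mean(axis=1)
            corr[i, j] = np.corrcoef(v, a)[0, 1]          # audio-visual correlation
    return corr

cmap = synchrony_map(np.random.rand(50, 64, 64), np.random.rand(50))
```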