Open Access Proceedings Article DOI

AVA Active Speaker: An Audio-Visual Dataset for Active Speaker Detection

TL;DR
The AVA Active Speaker dataset (AVA-ActiveSpeaker) as discussed by the authors contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible.
Abstract
Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
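The jointly trained model itself is only summarized above, but a minimal sketch helps make the idea concrete: two small encoders for the face crops and the audio, frame-level fusion, and a recurrent layer for temporal integration over the face track. All layer sizes, input shapes, and the use of a GRU are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (not the authors' architecture): a jointly trained
# audio-visual active speaker model with temporal integration via a GRU.
# Input shapes and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AVActiveSpeaker(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Visual branch: encodes a track of face crops (T, 3, 64, 64).
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Audio branch: encodes per-frame log-mel slices (T, 1, 40, 20).
        self.audio = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Temporal integration over the face track, then a per-frame score.
        self.gru = nn.GRU(64 + 64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, faces, mels):
        # faces: (B, T, 3, 64, 64); mels: (B, T, 1, 40, 20)
        B, T = faces.shape[:2]
        v = self.visual(faces.flatten(0, 1)).view(B, T, -1)
        a = self.audio(mels.flatten(0, 1)).view(B, T, -1)
        h, _ = self.gru(torch.cat([v, a], dim=-1))
        return torch.sigmoid(self.head(h)).squeeze(-1)  # (B, T) speaking probability

model = AVActiveSpeaker()
probs = model(torch.randn(2, 16, 3, 64, 64), torch.randn(2, 16, 1, 40, 20))
```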


Citations
Posted Content

Self-Supervised Learning of Audio-Visual Objects from Video

TL;DR: This work introduces a model that uses attention to localize and group sound sources and optical flow to aggregate information over time; it significantly outperforms other self-supervised approaches and obtains performance competitive with methods that use supervised face detection.
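As a rough illustration of the attention-based localization described above (not the paper's actual model), a clip-level audio embedding can attend over a spatial visual feature map to produce a localization heatmap and an attention-pooled visual object; the single attention head and dimensions below are assumptions.

```python
# Rough sketch of audio-to-visual attention for sound source localization
# (illustrative only; shapes and the single attention head are assumptions).
import torch
import torch.nn.functional as F

def localize(audio_emb, visual_feats):
    # audio_emb: (B, C) clip-level audio embedding
    # visual_feats: (B, C, H, W) spatial visual feature map
    B, C, H, W = visual_feats.shape
    keys = visual_feats.flatten(2)                       # (B, C, H*W)
    scores = torch.einsum('bc,bcn->bn', audio_emb, keys) / C ** 0.5
    attn = F.softmax(scores, dim=-1)                     # where the sound "looks"
    heatmap = attn.view(B, H, W)                         # localization map
    grouped = torch.einsum('bn,bcn->bc', attn, keys)     # attention-pooled visual object
    return heatmap, grouped

heatmap, obj = localize(torch.randn(2, 64), torch.randn(2, 64, 14, 14))
```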
Posted Content DOI

Rescaling Egocentric Vision

TL;DR: This paper introduces EPIC-KITCHENS-100, the largest annotated egocentric dataset (100 hrs, 20M frames, 90K actions) of wearable videos capturing long-term unscripted activities in 45 environments, using a novel annotation pipeline that allows denser and more complete annotations of fine-grained actions.
Posted Content

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

TL;DR: This paper provides a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets; and objective functions.
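A hedged sketch of the recipe most systems covered by this survey share: fuse per-frame acoustic and visual features and predict a time-frequency mask over the noisy spectrogram. The layer choices, shapes, and the GRU fusion below are assumptions for illustration only.

```python
# Illustrative audio-visual speech enhancement sketch: fuse acoustic and
# visual features, predict a time-frequency mask, apply it to the noisy
# magnitude spectrogram. Shapes and layers are assumptions.
import torch
import torch.nn as nn

class AVEnhancer(nn.Module):
    def __init__(self, n_freq=257, vis_dim=128, hidden=256):
        super().__init__()
        self.fuse = nn.GRU(n_freq + vis_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, visual_emb):
        # noisy_mag: (B, T, n_freq) magnitude spectrogram of the noisy mixture
        # visual_emb: (B, T, vis_dim) per-frame lip/face embeddings
        h, _ = self.fuse(torch.cat([noisy_mag, visual_emb], dim=-1))
        return noisy_mag * self.mask(h)   # masked (enhanced) magnitudes

enhanced = AVEnhancer()(torch.rand(2, 100, 257), torch.randn(2, 100, 128))
```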
Proceedings Article DOI

A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation

TL;DR: This work builds MovieScenes, a large-scale video dataset containing 21K annotated scene segments from 150 movies, and proposes a local-to-global scene segmentation framework that integrates multi-modal information across three levels: clip, segment, and movie.
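To make the local-to-global idea concrete, here is a toy sketch (not the paper's multi-modal framework): candidate scene boundaries are scored by local dissimilarity between neighbouring clip windows, then filtered with a movie-level (global) threshold. The window size and thresholding rule are assumptions.

```python
# Toy sketch of local-to-global scene segmentation: local clip similarity
# scores candidate boundaries; a movie-level statistic sets the threshold.
# Purely illustrative, not the paper's actual framework.
import numpy as np

def scene_boundaries(clip_feats, window=4):
    # clip_feats: (N, D) one embedding per shot/clip, in temporal order
    feats = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    scores = []
    for i in range(window, len(feats) - window):
        before = feats[i - window:i].mean(axis=0)   # local context before the cut
        after = feats[i:i + window].mean(axis=0)    # local context after the cut
        scores.append(1.0 - before @ after)         # dissimilarity across the cut
    scores = np.array(scores)
    thresh = scores.mean() + scores.std()           # global, movie-level threshold
    return [i + window for i, s in enumerate(scores) if s > thresh]

print(scene_boundaries(np.random.rand(50, 64)))
```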
Proceedings Article DOI

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

TL;DR: In this paper, the authors propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration; it achieves 3.5% and 2.2% improvements over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and the Columbia ASD dataset, respectively.
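A small sketch of the long-term part of this idea, assuming frame-level audio-visual features are already available: self-attention over the whole face track re-scores each frame with long-term context. Dimensions and the transformer configuration are illustrative, not TalkNet's.

```python
# Sketch of the short-term + long-term split: frame-level audio-visual
# embeddings (short-term) are re-scored with self-attention over the whole
# track (long-term). Dimensions are assumptions.
import torch
import torch.nn as nn

class LongTermASD(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, av_frames):
        # av_frames: (B, T, dim) short-term audio-visual features per frame
        ctx = self.temporal(av_frames)                     # long-term temporal context
        return torch.sigmoid(self.head(ctx)).squeeze(-1)   # (B, T) speaking probability

scores = LongTermASD()(torch.randn(2, 200, 128))
```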
References
Posted Content

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes, and large-scale geo-localization.
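For concreteness, a sketch of the MobileNet building block, a depthwise separable convolution, with the width multiplier alpha (one of the two global hyper-parameters) shrinking channel counts to trade accuracy for latency; this is illustrative, not the reference implementation.

```python
# Sketch of a depthwise separable convolution block with a width multiplier.
# Illustrative only.
import torch
import torch.nn as nn

def depthwise_separable(c_in, c_out, stride=1, alpha=1.0):
    c_in, c_out = int(c_in * alpha), int(c_out * alpha)
    return nn.Sequential(
        # Depthwise: one 3x3 filter per input channel (groups=c_in).
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        # Pointwise: 1x1 convolution to mix channels.
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

block = depthwise_separable(64, 128, stride=2, alpha=0.5)  # alpha=0.5 halves channels
out = block(torch.randn(1, 32, 56, 56))                    # 64 * 0.5 = 32 input channels
```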
Proceedings Article DOI

A convolutional neural network cascade for face detection

TL;DR: This work proposes a cascade architecture built on convolutional neural networks (CNNs) with very powerful discriminative capability, while maintaining high performance, and introduces a CNN-based calibration stage after each of the detection stages in the cascade.
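The cascade logic can be sketched independently of the network details: each cheap detection stage rejects most candidate windows early, and a calibration network adjusts every surviving box before the next stage. The detector and calibrator below are stand-in callables, not the paper's CNNs.

```python
# Control-flow sketch of a CNN cascade for face detection: early rejection by
# detection stages, followed by a calibration step that refines each surviving
# box. The nets themselves are stand-in callables; only the cascade logic is shown.
def cascade_detect(windows, stages, thresholds):
    # windows: list of candidate face boxes
    # stages: list of (detect_net, calib_net) pairs, cheapest first
    survivors = windows
    for (detect_net, calib_net), thr in zip(stages, thresholds):
        kept = []
        for box in survivors:
            score = detect_net(box)          # detection stage: face / non-face score
            if score < thr:
                continue                     # early rejection keeps the cascade fast
            kept.append(calib_net(box))      # calibration stage: refine box position/scale
        survivors = kept
    return survivors
```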
Book Chapter DOI

Out of Time: Automated Lip Sync in the Wild

TL;DR: The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video.
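One common way to estimate such an offset with a two-stream model is to embed short video and audio windows separately and pick the temporal offset at which the embeddings are closest; the sketch below assumes such embeddings are already computed and is not the paper's exact procedure.

```python
# Rough sketch of audio-video synchronisation scoring: slide the audio stream
# over a range of offsets and pick the offset with the smallest mean embedding
# distance. The per-window embeddings are assumed to come from a two-stream model.
import numpy as np

def best_offset(video_embs, audio_embs, max_offset=10):
    # video_embs, audio_embs: (T, D) per-window embeddings from the two streams
    best, best_dist = 0, float('inf')
    for off in range(-max_offset, max_offset + 1):
        lo, hi = max(0, off), min(len(video_embs), len(audio_embs) + off)
        if hi - lo < 1:
            continue
        d = np.linalg.norm(video_embs[lo:hi] - audio_embs[lo - off:hi - off], axis=1).mean()
        if d < best_dist:
            best, best_dist = off, d
    return best   # estimated audio-video offset, in windows

off = best_offset(np.random.rand(100, 256), np.random.rand(100, 256))
```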
Proceedings Article

Audio Vision: Using Audio-Visual Synchrony to Locate Sounds

TL;DR: This work develops a system that searches for regions of the visual landscape that correlate highly with the acoustic signals, tags them as likely to contain an acoustic source, and presents results on a speaker localization task.
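A toy sketch of the synchrony idea: correlate the audio energy envelope with per-region visual change over time and tag highly correlated regions as likely sound sources. The grid partition and plain Pearson correlation are assumptions for illustration, not the paper's exact method.

```python
# Toy audio-visual synchrony map: correlate per-region frame-to-frame change
# with the change in audio energy; high values suggest a likely sound source.
import numpy as np

def synchrony_map(frames, audio_energy, grid=8):
    # frames: (T, H, W) grayscale video; audio_energy: (T,) per-frame energy
    T, H, W = frames.shape
    motion = np.abs(np.diff(frames, axis=0))              # (T-1, H, W) visual change
    a = np.diff(audio_energy)                             # (T-1,) audio change
    corr = np.zeros((grid, grid))
    for i in range(grid):
        for j in range(grid):
            patch = motion[:, i*H//grid:(i+1)*H//grid, j*W//grid:(j+1)*W//grid]
            v = patch.reshape(len(patch), -1).mean(axis=1)
            corr[i, j] = np.corrcoef(v, a)[0, 1]          # audio-visual correlation
    return corr

cmap = synchrony_map(np.random.rand(50, 64, 64), np.random.rand(50))
```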