AVA Active Speaker: An Audio-Visual Dataset for Active Speaker Detection
Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew C. Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru
pp. 4492–4496
TLDR
The AVA Active Speaker dataset (AVA-ActiveSpeaker) contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible.
Abstract
Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human-labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.
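The abstract's two key ingredients, combining audio and visual evidence and integrating over multiple frames, can be illustrated with a minimal sketch. This is not the paper's jointly trained deep network; it is a hedged late-fusion toy in which per-frame scores from two hypothetical single-modality models are combined and smoothed over a temporal window:

```python
import numpy as np

# Hypothetical per-frame scores from independent audio and visual models
# (names and shapes are illustrative; the paper's actual model is a jointly
# trained deep network, not this late-fusion sketch).
rng = np.random.default_rng(0)
T = 100                                  # frames in one face track
visual_score = rng.random(T)             # e.g. mouth-motion evidence in [0, 1]
audio_score = rng.random(T)              # e.g. speech-activity evidence in [0, 1]

def fuse_and_smooth(v, a, w_v=0.5, w_a=0.5, window=9):
    """Combine modalities per frame, then integrate over a temporal window."""
    fused = w_v * v + w_a * a            # simple convex combination
    kernel = np.ones(window) / window    # moving average = temporal integration
    smoothed = np.convolve(fused, kernel, mode="same")
    return smoothed > 0.5                # per-frame speaking / not-speaking

labels = fuse_and_smooth(visual_score, audio_score)
print(labels.shape)  # one boolean decision per frame: (100,)
```

The smoothing window stands in for the temporal integration the abstract credits with a significant gain; in the real model that role is played by learned recurrent or convolutional temporal layers.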
Citations
Posted Content
Self-Supervised Learning of Audio-Visual Objects from Video
TL;DR: This work introduces a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time, and significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection.
Posted Content
Rescaling Egocentric Vision
Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray
TL;DR: This paper introduces EPIC-KITCHENS-100, the largest annotated egocentric dataset - 100 hrs, 20M frames, 90K actions - of wearable videos capturing long-term unscripted activities in 45 environments, using a novel annotation pipeline that allows denser and more complete annotations of fine-grained actions.
Posted Content
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
TL;DR: This paper provides a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets; and objective functions.
Proceedings Article
A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation
TL;DR: This work builds a large-scale video dataset MovieScenes, which contains 21K annotated scene segments from 150 movies, and proposes a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie.
Proceedings Article
Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection
TL;DR: In this paper, the authors propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration, achieving 3.5% and 2.2% improvements over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and the Columbia ASD dataset, respectively.
References
Posted Content
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Andrew Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, M. Andreetto, Hartwig Adam
TL;DR: This work introduces two simple global hyper-parameters that efficiently trade off between latency and accuracy and demonstrates the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.
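The two global hyper-parameters mentioned in the MobileNets TL;DR are the width multiplier α (thinning every layer's channel count) and the resolution multiplier ρ (shrinking the input resolution). Their effect on a depthwise-separable layer's multiply-add cost can be sketched directly from the cost formula in the MobileNets paper (the concrete layer sizes below are illustrative):

```python
def separable_conv_cost(dk, m, n, df, alpha=1.0, rho=1.0):
    """Multiply-add count for one depthwise-separable conv layer with
    width multiplier alpha and resolution multiplier rho:
    dk*dk*(alpha*m)*(rho*df)^2 + (alpha*m)*(alpha*n)*(rho*df)^2."""
    m, n = alpha * m, alpha * n          # thinned input/output channels
    df = rho * df                        # reduced feature-map resolution
    depthwise = dk * dk * m * df * df    # depthwise 3x3 filtering
    pointwise = m * n * df * df          # 1x1 channel mixing
    return depthwise + pointwise

# Illustrative layer: 3x3 kernels, 512->512 channels, 14x14 feature map.
full = separable_conv_cost(3, 512, 512, 14)
# alpha=0.5 halves channels; rho~0.714 corresponds to 224 -> 160 input.
slim = separable_conv_cost(3, 512, 512, 14, alpha=0.5, rho=0.714)
print(slim / full)  # cost drops to roughly an eighth of the full layer
```

Because the pointwise term dominates and scales with α², halving the width alone already cuts cost near fourfold, which is the latency/accuracy trade-off the two hyper-parameters expose.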
Proceedings Article
A convolutional neural network cascade for face detection
TL;DR: This work proposes a cascade architecture built on convolutional neural networks (CNNs) with very powerful discriminative capability, while maintaining high performance, and introduces a CNN-based calibration stage after each of the detection stages in the cascade.
Book Chapter
Out of Time: Automated Lip Sync in the Wild
Joon Son Chung, Andrew Zisserman
TL;DR: The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video.
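The core idea behind determining audio-video synchronisation can be sketched as a sliding-offset search: given per-frame embeddings from a (hypothetical) audio network and mouth-crop network, pick the temporal offset that minimises the mean embedding distance. The synthetic embeddings below stand in for learned ones:

```python
import numpy as np

# Synthetic stand-ins for learned embeddings: the "audio" track is the
# "video" track delayed by 3 frames plus a little noise.
rng = np.random.default_rng(1)
video_emb = rng.standard_normal((50, 128))
audio_emb = np.roll(video_emb, 3, axis=0) + 0.01 * rng.standard_normal((50, 128))

def estimate_offset(v, a, max_shift=10):
    """Return the shift (in frames) that best aligns audio to video."""
    best, best_dist = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        shifted = np.roll(a, -s, axis=0)          # candidate alignment
        d = np.linalg.norm(v - shifted, axis=1).mean()
        if d < best_dist:
            best, best_dist = s, d
    return best

print(estimate_offset(video_emb, audio_emb))  # → 3
```

In the actual lip-sync work the embeddings come from a two-stream network trained so that synchronised audio and mouth motion lie close together; this sketch only shows the offset-search step on top of such embeddings.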
Proceedings Article
Audio Vision: Using Audio-Visual Synchrony to Locate Sounds
TL;DR: A system is developed that searches for regions of the visual scene that correlate highly with the acoustic signal and tags them as likely to contain an acoustic source; results are presented on a speaker localization task.
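The synchrony-based localization idea above can be illustrated with a toy correlation map: correlate the audio energy with each pixel's intensity change over time and flag the pixel that moves with the sound. The data here is synthetic, and the original work uses a more principled measure than plain Pearson correlation:

```python
import numpy as np

# Toy scene: one 8x8 pixel region whose motion tracks the audio signal.
rng = np.random.default_rng(2)
T, H, W = 200, 8, 8
audio = rng.standard_normal(T)               # audio energy per frame
video = rng.standard_normal((T, H, W))       # per-pixel intensity change
video[:, 3, 4] += 2.0 * audio                # plant the "sound source" at (3, 4)

def correlation_map(a, v):
    """Pearson correlation between the audio track and every pixel track."""
    a = (a - a.mean()) / a.std()
    v = (v - v.mean(axis=0)) / v.std(axis=0)
    return np.einsum("t,thw->hw", a, v) / len(a)

r = correlation_map(audio, video)
print(np.unravel_index(np.argmax(r), r.shape))  # the planted source pixel
```

The peak of the correlation map recovers the region most synchronised with the audio, which is exactly the tagging step the TL;DR describes.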