Open Access Proceedings Article

Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching

TL;DR: In this article, the authors introduce CNN architectures for both binary and multi-way cross-modal face and audio matching, compare dynamic and static testing, and use human testing as a baseline to calibrate the difficulty of the task.
Abstract
We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. In this paper we study this, and a number of related cross-modal tasks, aimed at answering the question: how much can we infer from the voice about the face, and vice versa? We study this task "in the wild", employing the datasets that are now publicly available for face recognition from static images (VGGFace) and speaker identification from audio (VoxCeleb). These provide training and testing scenarios for both static and dynamic testing of cross-modal matching. We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching; (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available); and (iii) we use human testing as a baseline to calibrate the difficulty of the task. We show that a CNN can indeed be trained to solve this task in both the static and dynamic scenarios, and is even well above chance on 10-way classification of the face given the voice. The CNN matches human performance on easy examples (e.g. different gender across faces) but exceeds human performance on more challenging examples (e.g. faces with the same gender, age and nationality).
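
To make the binary task concrete, here is a minimal sketch of one plausible network shape: a CNN voice encoder over a spectrogram, a shared CNN face encoder applied to both candidate faces, and a small classifier over the concatenated embeddings that picks the matching face. All layer sizes, input shapes, and names (Encoder, BinaryMatcher) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Small CNN encoder; the exact filter and channel sizes here are illustrative."""
    def __init__(self, in_channels, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

class BinaryMatcher(nn.Module):
    """Given one voice clip and two candidate faces, predict which face is the speaker."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.voice_enc = Encoder(in_channels=1, embed_dim=embed_dim)  # log-mel spectrogram input
        self.face_enc = Encoder(in_channels=3, embed_dim=embed_dim)   # RGB faces, one shared encoder
        self.classifier = nn.Sequential(
            nn.Linear(3 * embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 2),  # softmax over the two candidate faces
        )

    def forward(self, voice, face_a, face_b):
        v = self.voice_enc(voice)
        fa = self.face_enc(face_a)
        fb = self.face_enc(face_b)
        return self.classifier(torch.cat([v, fa, fb], dim=1))

# Toy usage: a batch of 4 spectrograms with their candidate face pairs.
logits = BinaryMatcher()(torch.randn(4, 1, 64, 100),
                         torch.randn(4, 3, 112, 112),
                         torch.randn(4, 3, 112, 112))
print(logits.shape)  # torch.Size([4, 2])
```

Because the two face streams share one encoder, the classifier can only decide by relating each face embedding to the voice embedding, which is the point of the task.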



Citations
Proceedings Article

VoxCeleb2: Deep Speaker Recognition

TL;DR: This article presents VoxCeleb2, a large-scale audio-visual speaker recognition dataset containing over a million utterances from over 6,000 speakers.
Journal Article

VoxCeleb: Large-Scale Speaker Verification in the Wild

TL;DR: This work introduces a very large-scale audio-visual dataset collected from open-source media with a fully automated pipeline, and develops and compares CNN architectures with various aggregation methods and training loss functions that can effectively recognise identities from voice under a range of conditions.
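
As an illustration of the aggregation step such a pipeline needs, below is a minimal masked temporal-average-pooling module that collapses variable-length frame-level features into a single utterance embedding. The module name and tensor shapes are assumptions, and the paper also evaluates learned alternatives to simple averaging.

```python
import torch
import torch.nn as nn

class TemporalAveragePool(nn.Module):
    """Aggregate frame-level features (B, T, D) into one utterance embedding (B, D),
    ignoring padded frames; one of the simplest aggregation strategies."""
    def forward(self, frames, lengths=None):
        if lengths is None:
            return frames.mean(dim=1)
        # Build a (B, T) mask that is 1 for real frames and 0 for padding.
        mask = (torch.arange(frames.size(1), device=frames.device)[None, :]
                < lengths[:, None]).float()
        return (frames * mask.unsqueeze(-1)).sum(dim=1) / lengths.unsqueeze(-1).float()

# Toy usage: two utterances, the second padded from 30 to 50 frames.
pooled = TemporalAveragePool()(torch.randn(2, 50, 512), torch.tensor([50, 30]))
print(pooled.shape)  # torch.Size([2, 512])
```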
Journal Article

Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks

TL;DR: This paper provides a comprehensive survey of recent progress in knowledge distillation (KD) methods together with the student-teacher (S-T) frameworks typically used for vision tasks, and systematically analyzes the research status of KD in vision applications.
Proceedings Article

The Sound of Motions

TL;DR: Quantitative and qualitative evaluations show that, compared with previous models that rely on visual appearance cues, the proposed motion-based system improves performance in separating musical instrument sounds.
Proceedings Article

Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

TL;DR: This article shows that the emotional content of speech correlates with the facial expression of the speaker, and that this emotion recognition capability can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation.
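
Cross-modal distillation of this kind is commonly implemented as a temperature-scaled KL loss, in which a speech "student" matches the softened emotion predictions of a face "teacher" on the same video. The sketch below uses that standard formulation; the temperature, class count, and function name are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the student's and teacher's softened class distributions.
    Scaling by T^2 keeps gradient magnitudes comparable across temperatures."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy usage: a batch of 8 clips with, say, 7 emotion classes.
loss = cross_modal_distillation_loss(torch.randn(8, 7), torch.randn(8, 7))
print(loss.item())
```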
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small (3x3) convolution filters, and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
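
The core pattern behind that depth is easy to show: stacks of 3x3 convolutions separated by 2x2 max-pooling, repeated until the network reaches 16-19 weight layers. The helper below is an illustrative sketch; channel counts and the function name are placeholders.

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """A stack of 3x3 convolutions followed by 2x2 max-pooling; repeating such
    blocks with growing channel counts yields the 16-19 layer configurations."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Two stacked 3x3 convolutions cover the receptive field of one 5x5 conv
# with fewer parameters and an extra nonlinearity in between.
block = vgg_block(3, 64, n_convs=2)
```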
Book Chapter

Visualizing and Understanding Convolutional Networks

TL;DR: A novel visualization technique is introduced that gives insight into the function of intermediate feature layers and the operation of the classifier in large convolutional network models; used in a diagnostic role, it helps find architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
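
The paper's deconvnet projects activations back to pixel space; as a much simpler first step toward inspecting intermediate feature layers, the sketch below captures feature maps with a PyTorch forward hook on a stock torchvision VGG-16 (a stand-in backbone, not the model from the paper).

```python
import torch
import torchvision.models as models

model = models.vgg16(weights=None).eval()  # randomly initialized stand-in backbone
activations = {}

def save_activation(name):
    # Forward hooks receive (module, inputs, output); we stash the output tensor.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# features[5] is the first conv layer of the second block in torchvision's VGG-16.
model.features[5].register_forward_hook(save_activation("conv_block2"))
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
print(activations["conv_block2"].shape)  # torch.Size([1, 128, 112, 112])
```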
Proceedings Article

Two-Stream Convolutional Networks for Action Recognition in Videos

TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
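
The late-fusion idea can be shown in miniature: one stream sees a single RGB frame, the other a stack of optical-flow fields (2 channels per flow frame), and their class scores are averaged. The tiny backbones, the 10-frame flow stack, and the 101-class output below are placeholders, not the paper's actual networks.

```python
import torch
import torch.nn as nn

class TwoStream(nn.Module):
    """Late-fusion two-stream sketch: a spatial net for appearance and a
    temporal net for motion, with class scores averaged at the end."""
    def __init__(self, num_classes=101, flow_stack=10):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 7, stride=2, padding=3), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, num_classes),
            )
        self.spatial = stream(3)                # appearance from a single RGB frame
        self.temporal = stream(2 * flow_stack)  # motion from stacked x/y flow fields

    def forward(self, rgb, flow):
        return (self.spatial(rgb) + self.temporal(flow)) / 2  # late fusion

# Toy usage: 2 clips, each with one RGB frame and 10 stacked flow fields.
scores = TwoStream()(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
print(scores.shape)  # torch.Size([2, 101])
```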
Proceedings Article

DeepFace: Closing the Gap to Human-Level Performance in Face Verification

TL;DR: This work revisits both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derives a face representation from a nine-layer deep neural network.