Open Access Proceedings Article

Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching

TL;DR: In this article, the authors introduce CNN architectures for both binary and multi-way cross-modal face and audio matching, compare dynamic and static testing, and use human testing as a baseline to calibrate the difficulty of the task.
Abstract
We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. In this paper we study this, and a number of related cross-modal tasks, aimed at answering the question: how much can we infer from the voice about the face, and vice versa? We study this task "in the wild", employing the datasets that are now publicly available for face recognition from static images (VGGFace) and speaker identification from audio (VoxCeleb). These provide training and testing scenarios for both static and dynamic testing of cross-modal matching. We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching; (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available); and (iii) we use human testing as a baseline to calibrate the difficulty of the task. We show that a CNN can indeed be trained to solve this task in both the static and dynamic scenarios, and is even well above chance on 10-way classification of the face given the voice. The CNN matches human performance on easy examples (e.g. different gender across faces) but exceeds human performance on more challenging examples (e.g. faces with the same gender, age and nationality).
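
To make the binary task concrete, here is a minimal sketch of one plausible network shape: a CNN voice encoder over a spectrogram, a shared CNN face encoder applied to both candidate faces, and a small classifier over the concatenated embeddings that picks the matching face. All layer sizes, input shapes, and names (Encoder, BinaryMatcher) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Small CNN encoder; the exact filter and channel sizes here are illustrative."""
    def __init__(self, in_channels, embed_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, embed_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

class BinaryMatcher(nn.Module):
    """Given one voice clip and two candidate faces, predict which face is the speaker."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.voice_enc = Encoder(in_channels=1, embed_dim=embed_dim)  # log-mel spectrogram input
        self.face_enc = Encoder(in_channels=3, embed_dim=embed_dim)   # RGB faces, one shared encoder
        self.classifier = nn.Sequential(
            nn.Linear(3 * embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 2),  # softmax over the two candidate faces
        )

    def forward(self, voice, face_a, face_b):
        v = self.voice_enc(voice)
        fa = self.face_enc(face_a)
        fb = self.face_enc(face_b)
        return self.classifier(torch.cat([v, fa, fb], dim=1))

# Toy usage: a batch of 4 spectrograms with their candidate face pairs.
logits = BinaryMatcher()(torch.randn(4, 1, 64, 100),
                         torch.randn(4, 3, 112, 112),
                         torch.randn(4, 3, 112, 112))
print(logits.shape)  # torch.Size([4, 2])
```

Because the two face streams share one encoder, the classifier can only decide by relating each face embedding to the voice embedding, which is the point of the task.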



Citations
Proceedings Article

VoxCeleb2: Deep Speaker Recognition

TL;DR: This article presents VoxCeleb2, a large-scale audio-visual speaker recognition dataset containing over a million utterances from over 6,000 speakers.
Journal Article

VoxCeleb: Large-Scale Speaker Verification in the Wild

TL;DR: This work introduces a very large-scale audio-visual dataset collected from open-source media with a fully automated pipeline, and develops and compares CNN architectures with various aggregation methods and training loss functions that can effectively recognise identities from voice under a range of conditions.
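
As an illustration of the aggregation step such a pipeline needs, below is a minimal masked temporal-average-pooling module that collapses variable-length frame-level features into a single utterance embedding. The module name and tensor shapes are assumptions, and the paper also evaluates learned alternatives to simple averaging.

```python
import torch
import torch.nn as nn

class TemporalAveragePool(nn.Module):
    """Aggregate frame-level features (B, T, D) into one utterance embedding (B, D),
    ignoring padded frames; one of the simplest aggregation strategies."""
    def forward(self, frames, lengths=None):
        if lengths is None:
            return frames.mean(dim=1)
        # Build a (B, T) mask that is 1 for real frames and 0 for padding.
        mask = (torch.arange(frames.size(1), device=frames.device)[None, :]
                < lengths[:, None]).float()
        return (frames * mask.unsqueeze(-1)).sum(dim=1) / lengths.unsqueeze(-1).float()

# Toy usage: two utterances, the second padded from 30 to 50 frames.
pooled = TemporalAveragePool()(torch.randn(2, 50, 512), torch.tensor([50, 30]))
print(pooled.shape)  # torch.Size([2, 512])
```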
Journal Article

Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks

TL;DR: This paper provides a comprehensive survey of recent progress in knowledge distillation (KD) methods together with the student-teacher (S-T) frameworks typically used for vision tasks, and systematically analyzes the research status of KD in vision applications.
Proceedings Article

The Sound of Motions

TL;DR: Quantitative and qualitative evaluations show that, compared with previous models that rely on visual appearance cues, the proposed motion-based system improves performance in separating musical instrument sounds.
Proceedings Article

Emotion Recognition in Speech using Cross-Modal Transfer in the Wild

TL;DR: This article shows that the emotional content of speech correlates with the facial expression of the speaker, and that this emotion recognition capability can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation.
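
Cross-modal distillation of this kind is commonly implemented as a temperature-scaled KL loss, in which a speech "student" matches the softened emotion predictions of a face "teacher" on the same video. The sketch below uses that standard formulation; the temperature, class count, and function name are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the student's and teacher's softened class distributions.
    Scaling by T^2 keeps gradient magnitudes comparable across temperatures."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy usage: a batch of 8 clips with, say, 7 emotion classes.
loss = cross_modal_distillation_loss(torch.randn(8, 7), torch.randn(8, 7))
print(loss.item())
```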
References
Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small (3x3) convolution filters, and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
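
The core pattern behind that depth is easy to show: stacks of 3x3 convolutions separated by 2x2 max-pooling, repeated until the network reaches 16-19 weight layers. The helper below is an illustrative sketch; channel counts and the function name are placeholders.

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, n_convs):
    """A stack of 3x3 convolutions followed by 2x2 max-pooling; repeating such
    blocks with growing channel counts yields the 16-19 layer configurations."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch,
                             kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Two stacked 3x3 convolutions cover the receptive field of one 5x5 conv
# with fewer parameters and an extra nonlinearity in between.
block = vgg_block(3, 64, n_convs=2)
```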
Book Chapter

Visualizing and Understanding Convolutional Networks

TL;DR: A novel visualization technique is introduced that gives insight into the function of intermediate feature layers and the operation of the classifier in large convolutional network models; used in a diagnostic role, it helps find architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
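
The paper's deconvnet projects activations back to pixel space; as a much simpler first step toward inspecting intermediate feature layers, the sketch below captures feature maps with a PyTorch forward hook on a stock torchvision VGG-16 (a stand-in backbone, not the model from the paper).

```python
import torch
import torchvision.models as models

model = models.vgg16(weights=None).eval()  # randomly initialized stand-in backbone
activations = {}

def save_activation(name):
    # Forward hooks receive (module, inputs, output); we stash the output tensor.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# features[5] is the first conv layer of the second block in torchvision's VGG-16.
model.features[5].register_forward_hook(save_activation("conv_block2"))
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
print(activations["conv_block2"].shape)  # torch.Size([1, 128, 112, 112])
```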
Proceedings Article

Two-Stream Convolutional Networks for Action Recognition in Videos

TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
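
The late-fusion idea can be shown in miniature: one stream sees a single RGB frame, the other a stack of optical-flow fields (2 channels per flow frame), and their class scores are averaged. The tiny backbones, the 10-frame flow stack, and the 101-class output below are placeholders, not the paper's actual networks.

```python
import torch
import torch.nn as nn

class TwoStream(nn.Module):
    """Late-fusion two-stream sketch: a spatial net for appearance and a
    temporal net for motion, with class scores averaged at the end."""
    def __init__(self, num_classes=101, flow_stack=10):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 7, stride=2, padding=3), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, num_classes),
            )
        self.spatial = stream(3)                # appearance from a single RGB frame
        self.temporal = stream(2 * flow_stack)  # motion from stacked x/y flow fields

    def forward(self, rgb, flow):
        return (self.spatial(rgb) + self.temporal(flow)) / 2  # late fusion

# Toy usage: 2 clips, each with one RGB frame and 10 stacked flow fields.
scores = TwoStream()(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
print(scores.shape)  # torch.Size([2, 101])
```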
Proceedings Article

DeepFace: Closing the Gap to Human-Level Performance in Face Verification

TL;DR: This work revisits both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derives a face representation from a nine-layer deep neural network.