Open Access · Posted Content

Disjoint Mapping Network for Cross-modal Matching of Voices and Faces

TL;DR: It is shown empirically that DIMNet is able to achieve better performance than other current methods, with the additional benefits of being conceptually simpler and less data-intensive.
Abstract
We propose a novel framework, called Disjoint Mapping Network (DIMNet), for cross-modal biometric matching, in particular of voices and faces. Unlike existing methods, DIMNet does not explicitly learn the joint relationship between the modalities. Instead, it learns a shared representation for the different modalities by mapping each of them individually to their common covariates. These shared representations can then be used to find correspondences between the modalities. We show empirically that DIMNet achieves better performance than other current methods, with the additional benefits of being conceptually simpler and less data-intensive.
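The idea of mapping each modality independently to common covariates can be sketched as follows. This is an illustrative, untrained NumPy toy, not the paper's actual architecture: the dimensions, the weight names (`W_voice`, `W_face`, `W_cls`), and the ReLU encoders are all assumptions. It shows the key structural point: two per-modality encoders feed one shared covariate classifier, so each modality can be trained on its own labels (no voice-face pairs needed), and matching at test time reduces to comparing embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): voice/face feature dims, shared embedding dim,
# and number of classes of a common covariate (e.g. subject identity).
D_VOICE, D_FACE, D_EMB, N_IDS = 40, 64, 16, 5

# Independent per-modality encoders mapping into one shared embedding space.
W_voice = rng.standard_normal((D_VOICE, D_EMB)) * 0.1
W_face = rng.standard_normal((D_FACE, D_EMB)) * 0.1
# A single classifier, shared across modalities, that predicts the common
# covariate from the embedding. Each modality is trained against it
# separately, so the joint voice-face relationship is never modeled directly.
W_cls = rng.standard_normal((D_EMB, N_IDS)) * 0.1

def embed(x, W):
    # Linear map followed by ReLU into the shared space.
    return np.maximum(x @ W, 0.0)

def covariate_logits(emb):
    # Shared covariate classifier applied to either modality's embedding.
    return emb @ W_cls

def match_score(voice, face):
    # At test time, correspondences are found by comparing embeddings,
    # here via cosine similarity.
    ev, ef = embed(voice, W_voice), embed(face, W_face)
    return ev @ ef / (np.linalg.norm(ev) * np.linalg.norm(ef) + 1e-8)

voice = rng.standard_normal(D_VOICE)
face = rng.standard_normal(D_FACE)
print(covariate_logits(embed(voice, W_voice)).shape)  # (5,)
print(match_score(voice, face))
```

In a real system the encoders would be deep networks trained with a cross-entropy loss on the covariate labels of each modality's own dataset; the sketch above only fixes random weights to show the data flow.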


Citations
Posted Content

Deep Audio-Visual Learning: A Survey

TL;DR: A comprehensive survey of recent audio-visual learning development can be found in this article, where the authors divide the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning.
Journal ArticleDOI

Deep Audio-visual Learning: A Survey

TL;DR: A comprehensive survey of recent audio-visual learning development is provided, dividing the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning.
Posted Content

Speech2Face: Learning the Face Behind a Voice

TL;DR: In this paper, a deep neural network is trained using millions of natural Internet/YouTube videos of people speaking to learn voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity.
Proceedings Article

Face Reconstruction from Voice using Generative Adversarial Networks

TL;DR: This paper addresses the challenge posed by a subtask of voice profiling - reconstructing someone's face from their voice by proposing a simple but effective computational framework based on generative adversarial networks (GANs).
Proceedings Article

From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech

TL;DR: This work explores generating the human face from voice alone, using only audio-visual data without any human-labeled annotations, by proposing a multi-modal learning framework that links the inference stage and the generation stage.
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network achieving state-of-the-art performance on ImageNet classification is presented; it consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Proceedings Article

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
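The per-batch normalization that this work introduces can be sketched as a minimal NumPy function. This is a simplification: the actual layer also maintains running statistics for inference and learns `gamma` and `beta` by backpropagation.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch dimension,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 3.0 + 5.0          # mini-batch of 32, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))   # each feature's mean is approximately 0
print(y.std(axis=0))    # each feature's std is approximately 1
```

Keeping each layer's inputs normalized in this way is what reduces the "internal covariate shift" the title refers to, allowing much higher learning rates and hence the 14x reduction in training steps reported above.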
Journal Article

Visualizing Data using t-SNE

TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Journal ArticleDOI

An Introduction to the Bootstrap

Scott D. Grimshaw
01 Aug 1995
TL;DR: Statistical theory attacks the problem from both ends: it provides optimal methods for finding a real signal in a noisy background, and it also provides strict checks against the over-interpretation of random patterns.
Proceedings Article

The Kaldi Speech Recognition Toolkit

TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.