Open Access · Posted Content

Disjoint Mapping Network for Cross-modal Matching of Voices and Faces

TL;DR: It is shown empirically that DIMNet is able to achieve better performance than other current methods, with the additional benefits of being conceptually simpler and less data-intensive.
Abstract
We propose a novel framework, called Disjoint Mapping Network (DIMNet), for cross-modal biometric matching, in particular of voices and faces. Unlike existing methods, DIMNet does not explicitly learn the joint relationship between the modalities. Instead, it learns a shared representation for the different modalities by mapping each of them individually to their common covariates. These shared representations can then be used to find correspondences between the modalities. We show empirically that DIMNet achieves better performance than other current methods, with the additional benefits of being conceptually simpler and less data-intensive.
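The idea of mapping each modality independently to common covariates can be sketched as follows. This is an illustrative, untrained NumPy toy, not the paper's actual architecture: the dimensions, the weight names (`W_voice`, `W_face`, `W_cls`), and the ReLU encoders are all assumptions. It shows the key structural point: two per-modality encoders feed one shared covariate classifier, so each modality can be trained on its own labels (no voice-face pairs needed), and matching at test time reduces to comparing embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): voice/face feature dims, shared embedding dim,
# and number of classes of a common covariate (e.g. subject identity).
D_VOICE, D_FACE, D_EMB, N_IDS = 40, 64, 16, 5

# Independent per-modality encoders mapping into one shared embedding space.
W_voice = rng.standard_normal((D_VOICE, D_EMB)) * 0.1
W_face = rng.standard_normal((D_FACE, D_EMB)) * 0.1
# A single classifier, shared across modalities, that predicts the common
# covariate from the embedding. Each modality is trained against it
# separately, so the joint voice-face relationship is never modeled directly.
W_cls = rng.standard_normal((D_EMB, N_IDS)) * 0.1

def embed(x, W):
    # Linear map followed by ReLU into the shared space.
    return np.maximum(x @ W, 0.0)

def covariate_logits(emb):
    # Shared covariate classifier applied to either modality's embedding.
    return emb @ W_cls

def match_score(voice, face):
    # At test time, correspondences are found by comparing embeddings,
    # here via cosine similarity.
    ev, ef = embed(voice, W_voice), embed(face, W_face)
    return ev @ ef / (np.linalg.norm(ev) * np.linalg.norm(ef) + 1e-8)

voice = rng.standard_normal(D_VOICE)
face = rng.standard_normal(D_FACE)
print(covariate_logits(embed(voice, W_voice)).shape)  # (5,)
print(match_score(voice, face))
```

In a real system the encoders would be deep networks trained with a cross-entropy loss on the covariate labels of each modality's own dataset; the sketch above only fixes random weights to show the data flow.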


Citations
Posted Content

Deep Audio-Visual Learning: A Survey

TL;DR: A comprehensive survey of recent audio-visual learning development can be found in this article, where the authors divide the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning.
Journal ArticleDOI

Deep Audio-visual Learning: A Survey

TL;DR: A comprehensive survey of recent audio-visual learning development is provided, dividing the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning.
Posted Content

Speech2Face: Learning the Face Behind a Voice

TL;DR: In this paper, a deep neural network is trained using millions of natural Internet/YouTube videos of people speaking to learn voice-face correlations that allow it to produce images that capture various physical attributes of the speakers such as age, gender and ethnicity.
Proceedings Article

Face Reconstruction from Voice using Generative Adversarial Networks

TL;DR: This paper addresses the challenge posed by a subtask of voice profiling - reconstructing someone's face from their voice by proposing a simple but effective computational framework based on generative adversarial networks (GANs).
Proceedings Article

From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech

TL;DR: This work explores generating the human face from voice alone, using only audio-visual data without any human-labeled annotations, by proposing a multi-modal learning framework that links the inference stage and the generation stage.
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network achieving state-of-the-art performance on ImageNet classification is presented; it consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Proceedings Article

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
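The per-batch normalization that this work introduces can be sketched as a minimal NumPy function. This is a simplification: the actual layer also maintains running statistics for inference and learns `gamma` and `beta` by backpropagation.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch dimension,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 3.0 + 5.0          # mini-batch of 32, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))   # each feature's mean is approximately 0
print(y.std(axis=0))    # each feature's std is approximately 1
```

Keeping each layer's inputs normalized in this way is what reduces the "internal covariate shift" the title refers to, allowing much higher learning rates and hence the 14x reduction in training steps reported above.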
Journal Article

Visualizing Data using t-SNE

TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Journal ArticleDOI

An Introduction to the Bootstrap

Scott D. Grimshaw
01 Aug 1995
TL;DR: Statistical theory attacks the problem from both ends: it provides optimal methods for finding a real signal in a noisy background, and it also provides strict checks against the over-interpretation of random patterns.
Proceedings Article

The Kaldi Speech Recognition Toolkit

TL;DR: The design of Kaldi is described, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state transducers, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.