Open Access, Proceedings Article

Multimodal Deep Learning

TLDR
This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address them. It also demonstrates cross-modality feature learning, in which better features for one modality can be learned when multiple modalities are present at feature learning time.
Abstract
Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross-modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time. Furthermore, we show how to learn a shared representation between modalities and evaluate it on a unique task, where the classifier is trained with audio-only data but tested with video-only data and vice versa. Our models are validated on the CUAVE and AVLetters datasets on audio-visual speech classification, demonstrating the best published visual speech classification results on AVLetters and effective shared representation learning.
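
The abstract does not spell out a training procedure, so the following is only an illustrative sketch of one way the training data for a shared audio-video representation could be prepared. It assumes an autoencoder that takes a concatenated audio+video vector as input: each example is presented with both modalities, with video zeroed out, and with audio zeroed out, while the reconstruction target always contains both modalities. The function name `make_bimodal_training_set` and the equal three-way split are assumptions made for illustration.

```python
def make_bimodal_training_set(audio, video):
    """Build (inputs, targets) lists for a hypothetical bimodal autoencoder.

    audio, video: equal-length sequences of per-example feature vectors.
    Each example yields three inputs (both modalities, audio only,
    video only); the target is always the full audio+video vector, so a
    model trained on these pairs has to infer the missing modality.
    """
    if len(audio) != len(video):
        raise ValueError("audio and video must contain the same number of examples")

    inputs, targets = [], []
    for a, v in zip(audio, video):
        a, v = list(a), list(v)
        full = a + v                          # both modalities present
        audio_only = a + [0.0] * len(v)       # video features zeroed out
        video_only = [0.0] * len(a) + v       # audio features zeroed out
        for masked in (full, audio_only, video_only):
            inputs.append(list(masked))
            targets.append(list(full))        # always reconstruct both
    return inputs, targets


if __name__ == "__main__":
    # Toy feature vectors: 2 audio dimensions and 3 video dimensions.
    audio = [[1.0, 2.0], [3.0, 4.0]]
    video = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]
    inputs, targets = make_bimodal_training_set(audio, video)
    for x, y in zip(inputs, targets):
        print(x, "->", y)
```

Under this setup, a representation that reconstructs both modalities from either one alone is a natural candidate for the audio-trained, video-tested evaluation the abstract describes, although the sketch stops short of training any model.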


Citations
Journal Article

3-D Fully Convolutional Networks for Multimodal Isointense Infant Brain Image Segmentation

TL;DR: A novel 3-D multimodal fully convolutional network (FCN) architecture is proposed for segmentation of isointense phase brain MR images and it is demonstrated that carefully integrating coarse and dense feature maps can considerably improve the segmentation performance.
Posted Content

Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos

TL;DR: In this article, a new deep autoencoder-based shared-specific feature factorization network is proposed to separate input multimodal signals into a hierarchy of components, along with a structured sparsity learning machine that uses mixed norms to apply regularization within components and group selection between them for better classification performance.
Journal Article

Effective multi-modal retrieval based on stacked auto-encoders

TL;DR: This paper proposes an effective mapping mechanism based on deep learning (i.e., stacked auto-encoders) for multi-modal retrieval that achieves significant improvement in search accuracy over the state-of-the-art methods.
Journal Article

Cross-Modal Attention With Semantic Consistence for Image–Text Matching

TL;DR: The proposed CASC is a joint framework that performs cross-modal attention for local alignment and multilabel prediction for global semantic consistence. It extracts semantic labels directly from the available sentence corpus without additional labor cost, which provides a global similarity constraint for the aggregated region-word similarity obtained by the local alignment.
Proceedings Article

Learning Robust Visual-Semantic Embeddings

TL;DR: An end-to-end learning framework is presented that extracts more robust multi-modal representations across domains, and a novel technique of unsupervised-data adaptation inference is introduced to construct more comprehensive embeddings for both labeled and unlabeled data.