Open Access Proceedings Article

Multimodal Deep Learning

TLDR
This work presents a series of tasks for multimodal learning, shows how to train deep networks that learn features to address these tasks, and demonstrates cross-modality feature learning, where better features for one modality can be learned if multiple modalities are present at feature-learning time.
Abstract
Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time. Furthermore, we show how to learn a shared representation between modalities and evaluate it on a unique task, where the classifier is trained with audio-only data but tested with video-only data and vice-versa. Our models are validated on the CUAVE and AVLetters datasets on audio-visual speech classification, demonstrating best published visual speech classification on AVLetters and effective shared representation learning.
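As a rough illustration of the shared-representation idea in the abstract, the sketch below trains a toy linear bimodal autoencoder in which two synthetic "audio" and "video" views feed one shared code that must reconstruct both modalities. All specifics here (linear layers, layer sizes, learning rate, synthetic data) are illustrative assumptions, not the paper's architecture; the paper itself trains deep networks on real audio-visual speech.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired data: "audio" and "video" views generated from a common
# 5-dim latent factor, so a genuinely shared representation exists.
z = rng.normal(size=(200, 5))
audio = z @ rng.normal(size=(5, 20)) / np.sqrt(5)
video = z @ rng.normal(size=(5, 30)) / np.sqrt(5)

# Linear bimodal autoencoder: both modalities feed one shared code,
# and that single code must reconstruct both modalities.
Wa = 0.1 * rng.normal(size=(20, 8))
Wv = 0.1 * rng.normal(size=(30, 8))
Ua = 0.1 * rng.normal(size=(8, 20))
Uv = 0.1 * rng.normal(size=(8, 30))
lr = 0.05 / len(audio)

def losses():
    code = audio @ Wa + video @ Wv
    return np.mean((code @ Ua - audio) ** 2) + np.mean((code @ Uv - video) ** 2)

init_loss = losses()
for _ in range(500):
    code = audio @ Wa + video @ Wv           # shared representation
    ea = code @ Ua - audio                   # audio reconstruction error
    ev = code @ Uv - video                   # video reconstruction error
    g_code = ea @ Ua.T + ev @ Uv.T           # backprop through both decoders
    Ua -= lr * code.T @ ea
    Uv -= lr * code.T @ ev
    Wa -= lr * audio.T @ g_code
    Wv -= lr * video.T @ g_code

final_loss = losses()
```

Because both decoders backpropagate into the same code, each modality's encoder is shaped by the reconstruction error of the other modality, which is the cross-modality effect the abstract describes.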


Citations
Dissertation

Geometry and Uncertainty in Deep Learning for Computer Vision

Alex Kendall
TL;DR: This thesis presents end-to-end deep learning architectures for a number of core computer vision problems (scene understanding, camera pose estimation, stereo vision, and video semantic segmentation) and introduces ideas from probabilistic modelling and Bayesian deep learning to understand uncertainty in computer vision models.
Proceedings Article

Towards 3D object detection with bimodal deep Boltzmann machines over RGBD imagery

TL;DR: This work proposes a cross-modality deep learning framework based on deep Boltzmann machines for 3D scene object detection, and demonstrates that by learning cross-modality features from RGBD data, it is possible to capture their joint information to reinforce detector training in the individual modalities.
Posted Content

Y^2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences

TL;DR: This work proposes Y^2Seq2Seq, a view-based model, to learn cross-modal representations by joint reconstruction and prediction of view and word sequences, and bridges the semantic meaning embedded in the two modalities by two coupled 'Y'-like sequence-to-sequence structures.
Journal Article

SPARCNet: A Hardware Accelerator for Efficient Deployment of Sparse Convolutional Networks

TL;DR: The proposed SPARCNet, a hardware accelerator for efficient deployment of SPARse Convolutional NETworks, aims to enable deploying networks in embedded, resource-bound settings, both by exploiting efficient forms of parallelism inherent in convolutional layers and by applying the proposed sparsification and approximation techniques.
Posted Content

Deep Partial Multi-View Learning.

TL;DR: Proposes Cross Partial Multi-View Networks (CPM-Nets), a framework for multi-view representation learning that aims to fully and flexibly take advantage of multiple partial views.
References
Proceedings Article

Histograms of oriented gradients for human detection

TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
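The HOG descriptor summarized in this entry can be sketched in a few lines: compute gradient orientations, accumulate a magnitude-weighted orientation histogram per cell, and normalize. The toy image, cell size, and bin count below are illustrative defaults, and block normalization is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(3)
img = rng.random((32, 32))                   # toy grayscale image

# Gradient magnitude and unsigned orientation via centered differences.
gy, gx = np.gradient(img)
mag = np.hypot(gx, gy)
ang = np.rad2deg(np.arctan2(gy, gx)) % 180   # orientation in [0, 180)

# One 9-bin orientation histogram per 8x8 cell, magnitude-weighted.
cell, n_bins = 8, 9
bins = np.minimum((ang / (180 / n_bins)).astype(int), n_bins - 1)
H, W = img.shape
hog = np.zeros((H // cell, W // cell, n_bins))
for i in range(H // cell):
    for j in range(W // cell):
        b = bins[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
        m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
        np.add.at(hog[i, j], b, m)           # accumulate weighted votes

# L2-normalize each cell's histogram (full block normalization omitted).
hog /= np.linalg.norm(hog, axis=-1, keepdims=True) + 1e-9
descriptor = hog.ravel()                     # 4 * 4 * 9 = 144-dim feature
```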
Journal Article

Reducing the Dimensionality of Data with Neural Networks

TL;DR: Describes an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool for reducing the dimensionality of data.
Journal Article

A fast learning algorithm for deep belief nets

TL;DR: A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
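The greedy layer-wise scheme summarized here can be sketched with two stacked Bernoulli RBMs trained by one-step contrastive divergence (CD-1): train the first layer on the data, then train the second on the first layer's hidden probabilities. The toy data, layer sizes, epoch count, and learning rate below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Toy binary data: noisy copies of three prototype bit patterns.
protos = rng.random((3, 12)) < 0.5
data = protos[rng.integers(0, 3, size=300)]
data = np.logical_xor(data, rng.random(data.shape) < 0.05).astype(float)

def train_rbm(v, n_hidden, epochs=300, lr=0.1):
    """Train one Bernoulli RBM with CD-1; return parameters and hidden probs."""
    n, d = v.shape
    W = 0.01 * rng.normal(size=(d, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(d)
    for _ in range(epochs):
        ph0 = sigmoid(v @ W + b)                    # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + c)                 # one Gibbs step back down
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + b)                   # negative phase
        W += lr * (v.T @ ph0 - v1.T @ ph1) / n      # CD-1 updates
        b += lr * (ph0 - ph1).mean(axis=0)
        c += lr * (v - v1).mean(axis=0)
    return W, b, c, sigmoid(v @ W + b)

# Greedy layer-wise stacking: layer 1 on the data, then layer 2 on
# layer 1's hidden activations, one layer at a time.
W1, b1, c1, h1 = train_rbm(data, 8)
W2, b2, c2, h2 = train_rbm(h1, 4)

# Mean-field reconstruction error of the first layer.
recon = sigmoid(sigmoid(data @ W1 + b1) @ W1.T + c1)
err = np.mean((recon - data) ** 2)
```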
Proceedings Article

Extracting and composing robust features with denoising autoencoders

TL;DR: This work introduces and motivates a new training principle for unsupervised learning of a representation, based on the idea of making the learned representations robust to partial corruption of the input pattern.
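The training principle summarized here, reconstructing the clean input from a partially corrupted copy, can be sketched with a one-layer linear denoising autoencoder. The masking-noise level, layer sizes, learning rate, and synthetic data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data with low-dimensional structure the denoiser can exploit.
z = rng.normal(size=(300, 4))
x = z @ rng.normal(size=(4, 16)) / np.sqrt(4)

# One-layer linear denoising autoencoder: zero a random fraction of each
# input (partial corruption), then reconstruct the *clean* input.
W = 0.1 * rng.normal(size=(16, 6))
U = 0.1 * rng.normal(size=(6, 16))
lr, p_drop = 0.05 / len(x), 0.3

def clean_loss():
    return np.mean(((x @ W) @ U - x) ** 2)

init = clean_loss()
for _ in range(800):
    xc = x * (rng.random(x.shape) > p_drop)   # corrupted input
    h = xc @ W                                # code computed from corruption
    e = h @ U - x                             # error against the clean target
    g_h = e @ U.T                             # backprop into the code
    U -= lr * h.T @ e
    W -= lr * xc.T @ g_h

final = clean_loss()
```

Training the code on corrupted inputs while scoring reconstruction against the clean target is what forces the learned representation to be robust to the corruption.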
Journal Article

Hearing lips and seeing voices

TL;DR: The study reported here demonstrates a previously unrecognised influence of vision upon speech perception: when shown a film of a young woman's talking head in which repeated utterances of the syllable [ba] had been dubbed onto lip movements for [ga], normal adults reported hearing [da].