Open Access Proceedings Article DOI

Deep Multimodal Representation Learning from Temporal Data

TL;DR
The proposed CorrRNN model is validated via experimentation on two different tasks: video- and sensor-based activity classification, and audio-visual speech recognition; its robustness, effectiveness, and state-of-the-art performance on multiple datasets are demonstrated.
Abstract
In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications. When the available modalities consist of time series data such as video, audio and sensor signals, it becomes imperative to consider their temporal structure during the fusion process. In this paper, we propose the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion model for fusing multiple input modalities that are inherently temporal in nature. Key features of our proposed model include: (i) simultaneous learning of the joint representation and temporal dependencies between modalities, (ii) use of multiple loss terms in the objective function, including a maximum correlation loss term to enhance learning of cross-modal information, and (iii) the use of an attention model to dynamically adjust the contribution of different input modalities to the joint representation. We validate our model via experimentation on two different tasks: video- and sensor-based activity classification, and audio-visual speech recognition. We empirically analyze the contributions of different components of the proposed CorrRNN model, and demonstrate its robustness, effectiveness and state-of-the-art performance on multiple datasets.
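To make the fusion scheme concrete, here is a minimal PyTorch sketch of the ideas the abstract describes: one GRU encoder per modality, a softmax attention gate that dynamically weights each modality's contribution to the joint representation, and a loss term that maximizes the correlation between the two encodings. All class, variable, and dimension names are illustrative assumptions rather than the authors' implementation, and the paper's full objective also includes reconstruction terms that are omitted here.

```python
# Hedged sketch of the CorrRNN ideas: two GRU encoders, attention-weighted
# fusion, and a maximum-correlation loss. Names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrRNNSketch(nn.Module):
    def __init__(self, dim_a, dim_b, hidden=128):
        super().__init__()
        self.enc_a = nn.GRU(dim_a, hidden, batch_first=True)
        self.enc_b = nn.GRU(dim_b, hidden, batch_first=True)
        # Attention scores deciding each modality's contribution.
        self.attn = nn.Linear(2 * hidden, 2)
        self.fuse = nn.Linear(hidden, hidden)

    def forward(self, x_a, x_b):
        # x_a: (batch, time, dim_a); x_b: (batch, time, dim_b)
        _, h_a = self.enc_a(x_a)            # final state: (1, batch, hidden)
        _, h_b = self.enc_b(x_b)
        h_a, h_b = h_a.squeeze(0), h_b.squeeze(0)
        # Softmax attention weights dynamically adjust each modality's share.
        w = F.softmax(self.attn(torch.cat([h_a, h_b], dim=-1)), dim=-1)
        joint = self.fuse(w[:, :1] * h_a + w[:, 1:] * h_b)
        return h_a, h_b, joint

def correlation_loss(h_a, h_b, eps=1e-8):
    """Negative mean Pearson correlation between the two modality encodings,
    computed across the batch; minimizing it maximizes cross-modal correlation."""
    a = h_a - h_a.mean(dim=0)
    b = h_b - h_b.mean(dim=0)
    corr = (a * b).sum(dim=0) / (a.norm(dim=0) * b.norm(dim=0) + eps)
    return -corr.mean()
```

In a training loop this loss would be combined with the task loss (and, per the abstract, reconstruction terms) as a weighted sum; the weighting is a hyperparameter not specified here.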



Citations
Proceedings Article DOI

Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks

TL;DR: This work uses self-supervision to learn a compact and multimodal representation of sensory inputs, which can then be used to improve the sample efficiency of policy learning in deep reinforcement learning algorithms.
Journal Article DOI

State representation learning for control: An overview.

TL;DR: This survey covers recent state-of-the-art work on state representation learning (SRL), reviewing SRL methods that involve interaction with the environment, their implementations, and their applications in robotic control tasks (simulated or real).
Book Chapter DOI

Audio-Visual Event Localization in Unconstrained Videos

TL;DR: This work develops an audio-guided visual attention mechanism to explore audio-visual correlations, a dual multimodal residual network (DMRN) to fuse information over the two modalities, and an audio-visual distance learning network to handle cross-modality localization.
Journal Article DOI

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

TL;DR: A technical review of available models and learning methods for multimodal intelligence, focusing on the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities.
Proceedings Article DOI

Multi-modal Learning from Unpaired Images: Application to Multi-organ Segmentation in CT and MRI

TL;DR: Results demonstrate that sharing information across modalities can particularly improve performance on variable structures such as the spleen, and show that multi-modal learning improves overall accuracy over modality-specific training.
References
Proceedings Article DOI

Going deeper with convolutions

TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
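As an illustration of the architecture this summary refers to, below is a hedged PyTorch sketch of a GoogLeNet-style Inception block: parallel 1x1, 3x3, and 5x5 convolution branches plus a pooled branch, concatenated along the channel dimension, with 1x1 bottleneck convolutions to limit the channel count. The channel sizes and the omission of activations and auxiliary classifiers are simplifications, not the paper's exact configuration.

```python
# Sketch of an Inception-style block; parameter names are illustrative.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c3r, kernel_size=1),          # bottleneck reduce
            nn.Conv2d(c3r, c3, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(
            nn.Conv2d(in_ch, c5r, kernel_size=1),
            nn.Conv2d(c5r, c5, kernel_size=5, padding=2))
        self.bp = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1))

    def forward(self, x):
        # Concatenate the four branches along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], 1)
```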
Proceedings Article

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
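A minimal PyTorch sketch of the additive (soft-search) attention idea this summary describes, with illustrative names rather than the paper's notation: each decoder state is scored against every encoder state, and the softmax-weighted sum of encoder states forms a context vector for predicting the next target word.

```python
# Hedged sketch of additive ("soft-search") attention; names are illustrative.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=64):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim)
        self.W_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, time, enc_dim)
        scores = self.v(torch.tanh(
            self.W_enc(enc_states) + self.W_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)         # (batch, time, 1)
        context = (weights * enc_states).sum(dim=1)    # (batch, enc_dim)
        return context, weights.squeeze(-1)
```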
Proceedings Article DOI

Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation

TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
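The joint training described here amounts to minimizing token-level cross-entropy, which maximizes the conditional log-likelihood of the target sequence given the source. Below is a hedged sketch of one such training step, assuming PyTorch and a hypothetical `model` whose forward pass returns per-token logits (for example, the sequence-to-sequence sketch given after the last reference below).

```python
# Sketch of one joint encoder-decoder training step; names are illustrative.
import torch.nn.functional as F

def nll_step(model, optimizer, src, tgt_in, tgt_out):
    logits = model(src, tgt_in)                        # (batch, time, vocab)
    # Cross-entropy over all target tokens = negative conditional log-likelihood.
    loss = F.cross_entropy(logits.flatten(0, 1), tgt_out.flatten())
    optimizer.zero_grad()
    loss.backward()        # gradients flow through encoder and decoder jointly
    optimizer.step()
    return loss.item()
```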
Posted Content

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: In this paper, the authors propose to use a soft-searching model to find the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Proceedings Article

Sequence to Sequence Learning with Neural Networks

TL;DR: The authors used a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.
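A minimal PyTorch sketch of the fixed-vector sequence-to-sequence idea summarized here, with illustrative sizes and teacher forcing; the original work's source-reversal trick and beam-search decoding are omitted.

```python
# Hedged seq2seq sketch: encode the source into the LSTM's final state,
# then decode the target conditioned on that state. Sizes are illustrative.
import torch
import torch.nn as nn

class Seq2SeqSketch(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        # The whole source sequence is compressed into the final (h, c) state...
        _, state = self.encoder(self.src_emb(src))
        # ...and the decoder generates the target conditioned on that state
        # (teacher forcing: tgt_in is the shifted gold target sequence).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)  # logits: (batch, time, tgt_vocab)
```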