Open Access Proceedings Article DOI

Deep Multimodal Representation Learning from Temporal Data

TL;DR
The proposed CorrRNN model is validated via experimentation on two different tasks: video- and sensor-based activity classification, and audio-visual speech recognition; its robustness, effectiveness, and state-of-the-art performance on multiple datasets are demonstrated.
Abstract
In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications. When the available modalities consist of time series data such as video, audio and sensor signals, it becomes imperative to consider their temporal structure during the fusion process. In this paper, we propose the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion model for fusing multiple input modalities that are inherently temporal in nature. Key features of our proposed model include: (i) simultaneous learning of the joint representation and temporal dependencies between modalities, (ii) use of multiple loss terms in the objective function, including a maximum correlation loss term to enhance learning of cross-modal information, and (iii) the use of an attention model to dynamically adjust the contribution of different input modalities to the joint representation. We validate our model via experimentation on two different tasks: video- and sensor-based activity classification, and audio-visual speech recognition. We empirically analyze the contributions of different components of the proposed CorrRNN model, and demonstrate its robustness, effectiveness and state-of-the-art performance on multiple datasets.
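To make the fusion scheme concrete, here is a minimal PyTorch sketch of the ideas the abstract describes: one GRU encoder per modality, a softmax attention gate that dynamically weights each modality's contribution to the joint representation, and a loss term that maximizes the correlation between the two encodings. All class, variable, and dimension names are illustrative assumptions rather than the authors' implementation, and the paper's full objective also includes reconstruction terms that are omitted here.

```python
# Hedged sketch of the CorrRNN ideas: two GRU encoders, attention-weighted
# fusion, and a maximum-correlation loss. Names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrRNNSketch(nn.Module):
    def __init__(self, dim_a, dim_b, hidden=128):
        super().__init__()
        self.enc_a = nn.GRU(dim_a, hidden, batch_first=True)
        self.enc_b = nn.GRU(dim_b, hidden, batch_first=True)
        # Attention scores deciding each modality's contribution.
        self.attn = nn.Linear(2 * hidden, 2)
        self.fuse = nn.Linear(hidden, hidden)

    def forward(self, x_a, x_b):
        # x_a: (batch, time, dim_a); x_b: (batch, time, dim_b)
        _, h_a = self.enc_a(x_a)            # final state: (1, batch, hidden)
        _, h_b = self.enc_b(x_b)
        h_a, h_b = h_a.squeeze(0), h_b.squeeze(0)
        # Softmax attention weights dynamically adjust each modality's share.
        w = F.softmax(self.attn(torch.cat([h_a, h_b], dim=-1)), dim=-1)
        joint = self.fuse(w[:, :1] * h_a + w[:, 1:] * h_b)
        return h_a, h_b, joint

def correlation_loss(h_a, h_b, eps=1e-8):
    """Negative mean Pearson correlation between the two modality encodings,
    computed across the batch; minimizing it maximizes cross-modal correlation."""
    a = h_a - h_a.mean(dim=0)
    b = h_b - h_b.mean(dim=0)
    corr = (a * b).sum(dim=0) / (a.norm(dim=0) * b.norm(dim=0) + eps)
    return -corr.mean()
```

In a training loop this loss would be combined with the task loss (and, per the abstract, reconstruction terms) as a weighted sum; the weighting is a hyperparameter not specified here.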



Citations
Proceedings Article DOI

Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks

TL;DR: This work uses self-supervision to learn a compact and multimodal representation of sensory inputs, which can then be used to improve the sample efficiency of policy learning in deep reinforcement learning algorithms.
Journal Article DOI

State representation learning for control: An overview.

TL;DR: This survey covers recent state-of-the-art work on state representation learning (SRL), reviewing SRL methods that involve interaction with the environment, their implementations, and their applications in robotic control tasks (simulated or real).
Book Chapter DOI

Audio-Visual Event Localization in Unconstrained Videos

TL;DR: This work develops an audio-guided visual attention mechanism to explore audio-visual correlations, a dual multimodal residual network (DMRN) to fuse information over the two modalities, and an audio-visual distance learning network to handle cross-modality localization.
Journal Article DOI

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications

TL;DR: A technical review of available models and learning methods for multimodal intelligence, focusing on the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities.
Proceedings Article DOI

Multi-modal Learning from Unpaired Images: Application to Multi-organ Segmentation in CT and MRI

TL;DR: Results demonstrate that sharing information across modalities can particularly improve performance on variable structures such as the spleen, and show that multi-modal learning improves overall accuracy over modality-specific training.
References
Proceedings Article DOI

Going deeper with convolutions

TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
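As an illustration of the architecture this summary refers to, below is a hedged PyTorch sketch of a GoogLeNet-style Inception block: parallel 1x1, 3x3, and 5x5 convolution branches plus a pooled branch, concatenated along the channel dimension, with 1x1 bottleneck convolutions to limit the channel count. The channel sizes and the omission of activations and auxiliary classifiers are simplifications, not the paper's exact configuration.

```python
# Sketch of an Inception-style block; parameter names are illustrative.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c3r, kernel_size=1),          # bottleneck reduce
            nn.Conv2d(c3r, c3, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(
            nn.Conv2d(in_ch, c5r, kernel_size=1),
            nn.Conv2d(c5r, c5, kernel_size=5, padding=2))
        self.bp = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1))

    def forward(self, x):
        # Concatenate the four branches along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], 1)
```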
Proceedings Article

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
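A minimal PyTorch sketch of the additive (soft-search) attention idea this summary describes, with illustrative names rather than the paper's notation: each decoder state is scored against every encoder state, and the softmax-weighted sum of encoder states forms a context vector for predicting the next target word.

```python
# Hedged sketch of additive ("soft-search") attention; names are illustrative.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=64):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim)
        self.W_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, time, enc_dim)
        scores = self.v(torch.tanh(
            self.W_enc(enc_states) + self.W_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)         # (batch, time, 1)
        context = (weights * enc_states).sum(dim=1)    # (batch, enc_dim)
        return context, weights.squeeze(-1)
```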
Proceedings Article DOI

Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation

TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
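The joint training described here amounts to minimizing token-level cross-entropy, which maximizes the conditional log-likelihood of the target sequence given the source. Below is a hedged sketch of one such training step, assuming PyTorch and a hypothetical `model` whose forward pass returns per-token logits (for example, the sequence-to-sequence sketch given after the last reference below).

```python
# Sketch of one joint encoder-decoder training step; names are illustrative.
import torch.nn.functional as F

def nll_step(model, optimizer, src, tgt_in, tgt_out):
    logits = model(src, tgt_in)                        # (batch, time, vocab)
    # Cross-entropy over all target tokens = negative conditional log-likelihood.
    loss = F.cross_entropy(logits.flatten(0, 1), tgt_out.flatten())
    optimizer.zero_grad()
    loss.backward()        # gradients flow through encoder and decoder jointly
    optimizer.step()
    return loss.item()
```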
Posted Content

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: In this paper, the authors propose to use a soft-searching model to find the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Proceedings Article

Sequence to Sequence Learning with Neural Networks

TL;DR: The authors used a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.
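A minimal PyTorch sketch of the fixed-vector sequence-to-sequence idea summarized here, with illustrative sizes and teacher forcing; the original work's source-reversal trick and beam-search decoding are omitted.

```python
# Hedged seq2seq sketch: encode the source into the LSTM's final state,
# then decode the target conditioned on that state. Sizes are illustrative.
import torch
import torch.nn as nn

class Seq2SeqSketch(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        # The whole source sequence is compressed into the final (h, c) state...
        _, state = self.encoder(self.src_emb(src))
        # ...and the decoder generates the target conditioned on that state
        # (teacher forcing: tgt_in is the shifted gold target sequence).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec_out)  # logits: (batch, time, tgt_vocab)
```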