Deep Multimodal Representation Learning from Temporal Data
Xitong Yang, Palghat S. Ramesh, Radha Chitta, Sriganesh Madhvanath, Edgar A. Bernal, Jiebo Luo
- 2017, pp. 5066-5074
TL;DR: The proposed CorrRNN model is validated via experimentation on two different tasks: video- and sensor-based activity classification, and audio-visual speech recognition; its robustness, effectiveness and state-of-the-art performance on multiple datasets are demonstrated.
Abstract: In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications. When the available modalities consist of time series data such as video, audio and sensor signals, it becomes imperative to consider their temporal structure during the fusion process. In this paper, we propose the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion model for fusing multiple input modalities that are inherently temporal in nature. Key features of our proposed model include: (i) simultaneous learning of the joint representation and temporal dependencies between modalities, (ii) use of multiple loss terms in the objective function, including a maximum correlation loss term to enhance learning of cross-modal information, and (iii) the use of an attention model to dynamically adjust the contribution of different input modalities to the joint representation. We validate our model via experimentation on two different tasks: video- and sensor-based activity classification, and audio-visual speech recognition. We empirically analyze the contributions of different components of the proposed CorrRNN model, and demonstrate its robustness, effectiveness and state-of-the-art performance on multiple datasets.
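The maximum correlation loss term in (ii) can be illustrated with a minimal sketch: treating the representations of two modalities as equal-length vectors, the term rewards high cross-modal Pearson correlation by minimizing its negation. This is an illustrative simplification in plain Python (the function names are ours), not the paper's exact formulation over recurrent hidden states:

```python
import math

def correlation(x, y):
    """Pearson correlation between two equal-length representation vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_loss(x, y):
    """Loss term that *maximizes* cross-modal correlation by minimizing its negation."""
    return -correlation(x, y)
```

Perfectly correlated modality representations yield the minimum loss of -1; in training, this term would be combined with the model's reconstruction losses.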
Citations
Proceedings Article
Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks
Michelle A. Lee, Yuke Zhu, Krishnan Srinivasan, Parth Shah, Silvio Savarese, Li Fei-Fei, Animesh Garg, Jeannette Bohg
TL;DR: This work uses self-supervision to learn a compact and multimodal representation of sensory inputs, which can then be used to improve the sample efficiency of the policy learning of deep reinforcement learning algorithms.
Journal Article
State representation learning for control: An overview.
TL;DR: This survey aims at covering the state-of-the-art on state representation learning in the most recent years by reviewing different SRL methods that involve interaction with the environment, their implementations and their applications in robotics control tasks (simulated or real).
Book Chapter
Audio-Visual Event Localization in Unconstrained Videos
TL;DR: An audio-guided visual attention mechanism to explore audio-visual correlations, a dual multimodal residual network (DMRN) to fuse information over the two modalities, and an audio-visual distance learning network to handle the cross-modality localization are developed.
Journal Article
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications
TL;DR: A technical review of available models and learning methods for multimodal intelligence, focusing on the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities.
Proceedings Article
Multi-modal Learning from Unpaired Images: Application to Multi-organ Segmentation in CT and MRI
Vanya V. Valindria, Nick Pawlowski, Martin Rajchl, Ioannis Lavdas, Eric O. Aboagye, Andrea Rockall, Daniel Rueckert, Ben Glocker
TL;DR: Results demonstrate that information across modalities can in particular improve performance on varying structures such as the spleen, and show that multi-modal learning can improve overall accuracy over modality-specific training.
References
Proceedings Article
Going deeper with convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Proceedings Article
Neural Machine Translation by Jointly Learning to Align and Translate
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
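The soft-search mechanism summarized above can be sketched as a softmax-weighted sum over encoder states. This is a simplified illustration, not the paper's exact alignment model: the dot-product scoring below is an assumption of ours, whereas Bahdanau et al. use a small feed-forward alignment network.

```python
import math

def soft_attention(query, states):
    """Soft (differentiable) search over source positions.

    query:  list[float], decoder state used to score relevance
    states: list[list[float]], one encoder state per source position
    Returns the context vector (attention-weighted sum of states).
    """
    # Dot-product relevance score for each source position (a simplification).
    scores = [sum(q * s for q, s in zip(query, st)) for st in states]
    # Numerically stable softmax turns scores into soft alignment weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # nonnegative, sums to 1
    # Context vector: weighted combination of encoder states.
    dim = len(states[0])
    return [sum(w * st[d] for w, st in zip(weights, states)) for d in range(dim)]
```

Because the weights are a softmax rather than a hard selection, every source position contributes a little, and the whole operation stays differentiable end to end.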
Proceedings Article
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio
TL;DR: In this paper, the encoder and decoder of the RNN Encoder-Decoder model are jointly trained to maximize the conditional probability of a target sequence given a source sequence.
Proceedings Article
Sequence to Sequence Learning with Neural Networks
TL;DR: The authors used a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.