Open Access Journal Article (DOI)

Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis

TLDR
A novel model, the Interaction Canonical Correlation Network (ICCN), is proposed, which learns correlations between all three modes via deep canonical correlation analysis (DCCA) and the proposed embeddings are tested on several benchmark datasets and against other state-of-the-art multimodal embedding algorithms.
Abstract
Multimodal language analysis often considers relationships between features based on text and those based on acoustical and visual properties. Text features typically outperform non-text features in sentiment analysis or emotion recognition tasks, in part because the text features are derived from advanced language models or word embeddings trained on massive data sources, while audio and video features are human-engineered and comparatively underdeveloped. Given that the text, audio, and video describe the same utterance in different ways, we hypothesize that multimodal sentiment analysis and emotion recognition can be improved by learning (hidden) correlations between features extracted from the outer product of text and audio (we call this text-based audio) and analogous text-based video. This paper proposes a novel model, the Interaction Canonical Correlation Network (ICCN), to learn such multimodal embeddings. ICCN learns correlations between all three modes via deep canonical correlation analysis (DCCA), and the proposed embeddings are then tested on several benchmark datasets and against other state-of-the-art multimodal embedding algorithms. Empirical results and ablation studies confirm the effectiveness of ICCN in capturing useful information from all three views.
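
To make the abstract's two core ideas concrete, the sketch below (PyTorch assumed; module names, dimensions, and architecture are illustrative and not the paper's implementation) builds a "text-based audio" and a "text-based video" view from outer products of utterance-level features, and scores the two views with a DCCA-style correlation loss.

```python
import torch
import torch.nn as nn

class OuterProductFusion(nn.Module):
    """Compress the outer product of a text vector and an audio (or video)
    vector into a fixed-size 'text-based' embedding (illustrative MLP)."""
    def __init__(self, text_dim: int, other_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Flatten(),
            nn.Linear(text_dim * other_dim, out_dim),
            nn.ReLU(),
        )

    def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # (batch, text_dim) x (batch, other_dim) -> (batch, text_dim, other_dim)
        outer = torch.einsum("bi,bj->bij", text, other)
        return self.proj(outer)

def dcca_loss(h1: torch.Tensor, h2: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Negative sum of canonical correlations between two views
    (a standard DCCA-style objective, simplified for illustration)."""
    n = h1.size(0)
    h1 = h1 - h1.mean(dim=0, keepdim=True)
    h2 = h2 - h2.mean(dim=0, keepdim=True)
    s11 = h1.t() @ h1 / (n - 1) + eps * torch.eye(h1.size(1), device=h1.device)
    s22 = h2.t() @ h2 / (n - 1) + eps * torch.eye(h2.size(1), device=h2.device)
    s12 = h1.t() @ h2 / (n - 1)
    # Whiten both views via Cholesky factors; the singular values of the
    # whitened cross-covariance are the canonical correlations.
    l1 = torch.linalg.cholesky(s11)
    l2 = torch.linalg.cholesky(s22)
    t = torch.linalg.solve(l1, s12)          # L1^{-1} S12
    t = torch.linalg.solve(l2, t.t()).t()    # L1^{-1} S12 L2^{-T}
    return -torch.linalg.svdvals(t).sum()

# Usage sketch: feature dimensions are placeholders, not the paper's.
text, audio, video = torch.randn(256, 768), torch.randn(256, 74), torch.randn(256, 47)
text_audio = OuterProductFusion(768, 74, 128)(text, audio)   # text-based audio
text_video = OuterProductFusion(768, 47, 128)(text, video)   # text-based video
loss = dcca_loss(text_audio, text_video)
```

In the paper's setting, the correlation objective ties the text-based audio and text-based video views together, and the learned embeddings are then passed to a downstream sentiment or emotion classifier.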



Citations
Posted Content

MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis

TL;DR: A novel framework, MISA, is proposed that projects each modality into two distinct subspaces, one modality-invariant and one modality-specific; together these representations provide a holistic view of the multimodal data and are fused for the task predictions.
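
A minimal sketch of the two-subspace idea this summary describes, assuming PyTorch; the encoder shapes and the weight-sharing choice are illustrative guesses, not MISA's published architecture.

```python
import torch
import torch.nn as nn

class TwoSubspaceProjector(nn.Module):
    """Map one modality's utterance vector into a modality-invariant
    (shared) and a modality-specific (private) representation."""
    def __init__(self, in_dim: int, hidden_dim: int, shared_head: nn.Module):
        super().__init__()
        self.shared_head = shared_head                       # reused across modalities
        self.align = nn.Linear(in_dim, hidden_dim)           # per-modality input adapter
        self.private = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Tanh())

    def forward(self, x: torch.Tensor):
        return self.shared_head(self.align(x)), self.private(x)

# One shared head reused by all modalities encourages invariance; the six
# resulting vectors (shared + private per modality) are fused for prediction.
shared = nn.Sequential(nn.Linear(64, 64), nn.Tanh())
projectors = {m: TwoSubspaceProjector(d, 64, shared)
              for m, d in {"text": 768, "audio": 74, "video": 47}.items()}
```

In practice such a split is usually trained with additional similarity and difference regularizers between the shared and private representations; those losses are omitted from this sketch.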
Journal Article (DOI)

Deep multi-view learning methods: a review

TL;DR: In this article, a comprehensive review of deep multi-view learning (MVL) is presented from two perspectives: MVL methods within the deep learning scope, and deep MVL extensions of traditional methods; the authors also identify open challenges to inform future research directions.
Proceedings Article (DOI)

Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis

TL;DR: The Bi-Bimodal Fusion Network (BBFN) is proposed, which performs fusion (relevance increment) and separation (difference increment) on pairwise modality representations.
Posted Content

Learning Modality-Specific Representations with Self-Supervised Multi-Task Learning for Multimodal Sentiment Analysis

TL;DR: A label generation module based on a self-supervised learning strategy is designed to acquire independent unimodal supervisions, and the multimodal and unimodal tasks are jointly trained to learn the consistency and differences across modalities.
Journal Article (DOI)

Multimodal Emotion Recognition With Transformer-Based Self Supervised Feature Fusion

TL;DR: This work introduces a novel Transformer- and attention-based fusion mechanism that combines multimodal self-supervised learning (SSL) features and achieves state-of-the-art results for multimodal emotion recognition.