Journal Article

End-to-End Learning for Multimodal Emotion Recognition in Video With Adaptive Loss

TLDR
In this article, a lightweight deep architecture of approximately 1 MB is proposed for emotion recognition in video through the interaction of visual, audio, and language information in an end-to-end learning manner, with three key points: a lightweight feature extractor, an attention strategy, and an adaptive loss.
Abstract
This work presents an approach for emotion recognition in video through the interaction of visual, audio, and language information in an end-to-end learning manner with three key points: 1) a lightweight feature extractor, 2) an attention strategy, and 3) an adaptive loss. We propose a lightweight deep architecture of approximately 1 MB for feature extraction, the most crucial part of an emotion recognition system. The relationship of features along the time dimension is explored with a temporal convolutional network (TCN) instead of an RNN-based architecture, to exploit parallelism and avoid the vanishing-gradient problem. The attention strategy adjusts the knowledge of the temporal networks along the time dimension and learns each modality's contribution to the final result. The interaction between the modalities is also investigated by training with an adaptive objective function that adjusts the network's gradients. Experimental results on a large-scale dataset for emotion recognition of Korean speakers demonstrate the superiority of our method when the attention mechanism and adaptive loss are employed during training.
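The abstract's three ingredients map naturally onto a small PyTorch sketch: dilated causal convolutions for the TCN, a softmax-scored fusion layer for the modality attention, and a re-weighted objective for the adaptive loss. Everything below, including module names, dimensions, the pooling, and the exact weighting scheme, is an illustrative assumption, not the authors' released implementation.

```python
# Minimal sketch of the three ingredients above; all names, dimensions, and
# the weighting scheme are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBlock(nn.Module):
    """One dilated causal conv block of a temporal convolutional network (TCN)."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # left-pad so the conv stays causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        out = self.conv(F.pad(x, (self.pad, 0)))
        return F.relu(out) + x                         # residual connection

class AttentionFusion(nn.Module):
    """Learns a scalar weight per modality and fuses their pooled features."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                          # feats: list of (batch, dim)
        stacked = torch.stack(feats, dim=1)            # (batch, n_modalities, dim)
        weights = torch.softmax(self.score(stacked), dim=1)
        return (weights * stacked).sum(dim=1)          # (batch, dim)

class MultimodalTCN(nn.Module):
    def __init__(self, dim=64, n_classes=7):
        super().__init__()
        self.tcns = nn.ModuleList(
            nn.Sequential(TemporalBlock(dim, dilation=1), TemporalBlock(dim, dilation=2))
            for _ in range(3))                         # visual, audio, language streams
        self.fusion = AttentionFusion(dim)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, visual, audio, text):            # each: (batch, dim, time)
        feats = [tcn(x).mean(dim=-1) for tcn, x in zip(self.tcns, (visual, audio, text))]
        return self.classifier(self.fusion(feats))

def adaptive_loss(logits_per_branch, target, alphas):
    # One plausible reading of an "adaptive" objective (an assumption): per-branch
    # cross-entropy terms re-weighted by learnable coefficients `alphas`
    # (an nn.Parameter), softmax-normalised so the weights stay on a simplex.
    weights = torch.softmax(alphas, dim=0)
    losses = torch.stack([F.cross_entropy(l, target) for l in logits_per_branch])
    return (weights * losses).sum()
```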


Citations
Proceedings Article

Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition

TL;DR: In this paper, a multi-attention-based depthwise separable convolutional model for speech emotion feature extraction is proposed, which reduces feature redundancy by separating the convolution into spatial-wise (depthwise) and channel-wise (pointwise) operations.
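For reference, this is the standard building block the summary alludes to, sketched in PyTorch: a per-channel spatial convolution followed by a 1x1 channel-mixing convolution. The shapes and layer names are illustrative assumptions.

```python
# Sketch of a depthwise separable convolution (shapes are assumptions).
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # groups=in_ch -> each input channel is filtered independently (spatial-wise)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # 1x1 conv mixes information across channels (channel-wise)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```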
Proceedings Article

Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

TL;DR: This work focuses on unsupervised feature learning for Multimodal Emotion Recognition (MER) with discrete emotions, using text, audio, and vision as modalities; the end-to-end approach builds on a contrastive loss between pairwise modalities and requires no backbones pre-trained on an emotion recognition task, a first in the MER literature.
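A minimal sketch of a modality-pairwise contrastive objective in the spirit of this summary: for each pair of modalities, embeddings of the same sample are pulled together and mismatched samples pushed apart, InfoNCE-style. The symmetric formulation and the temperature value are assumptions, not the paper's exact loss.

```python
# Pairwise contrastive loss sketch; temperature and symmetry are assumptions.
import itertools
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(embeddings, temperature=0.1):
    """embeddings: dict such as {'text': (B, D), 'audio': (B, D), 'vision': (B, D)}."""
    total, n_terms = 0.0, 0
    for a, b in itertools.combinations(embeddings, 2):
        za = F.normalize(embeddings[a], dim=1)
        zb = F.normalize(embeddings[b], dim=1)
        logits = za @ zb.t() / temperature             # (B, B) similarity matrix
        targets = torch.arange(za.size(0), device=za.device)
        # symmetric cross-entropy: the i-th row/column should match the i-th sample
        total += F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)
        n_terms += 2
    return total / n_terms
```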
References
Proceedings Article

MobileNetV2: Inverted Residuals and Linear Bottlenecks

TL;DR: MobileNetV2, as described in this paper, is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers, and the intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity.
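An illustrative PyTorch rendering of that block: expand with a 1x1 conv, filter with a lightweight depthwise conv, project back to the thin bottleneck with a linear (activation-free) 1x1 conv, and shortcut between the thin ends. The expansion factor and the stride-1, equal-channel case are the usual MobileNetV2 defaults, stated here as assumptions.

```python
# Inverted residual block sketch (expansion factor and shapes assumed).
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),        # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),              # depthwise filter
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),        # linear bottleneck
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)                               # shortcut between thin layers
```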
Proceedings Article

Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph

TL;DR: This paper introduces CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), the largest dataset for sentiment analysis and emotion recognition to date, and uses a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG), which is highly interpretable and achieves competitive performance compared to the previous state of the art.
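A much-simplified, hedged sketch of dynamic fusion in the spirit of the DFG: each subset of modalities gets a fused vertex, and a learned sigmoid "efficacy" gates how much that vertex contributes. The real DFG wires these vertices into a graph; this flat version only illustrates the gating idea and is an assumption, not the paper's implementation.

```python
# Flat, gated subset-fusion sketch; NOT the actual Dynamic Fusion Graph.
import itertools
import torch
import torch.nn as nn

class GatedSubsetFusion(nn.Module):
    def __init__(self, dim, modalities=('text', 'audio', 'vision')):
        super().__init__()
        self.subsets = [s for r in range(1, len(modalities) + 1)
                        for s in itertools.combinations(modalities, r)]
        self.mix = nn.ModuleDict({'_'.join(s): nn.Linear(dim * len(s), dim)
                                  for s in self.subsets})
        self.gate = nn.ModuleDict({'_'.join(s): nn.Linear(dim, 1)
                                   for s in self.subsets})

    def forward(self, feats):                   # feats: dict name -> (B, dim)
        out = 0.0
        for s in self.subsets:
            v = self.mix['_'.join(s)](torch.cat([feats[m] for m in s], dim=1))
            out = out + torch.sigmoid(self.gate['_'.join(s)](v)) * v   # gated vertex
        return out                              # (B, dim) fused representation
```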
Proceedings Article

Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Networks

TL;DR: In this article, a 3D Convolutional Neural Network (CNN) for facial expression recognition in videos is proposed, consisting of 3D Inception-ResNet layers followed by an LSTM unit, which together extract the spatial relations within facial images as well as the temporal relations between different frames in the video.
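A hedged sketch of the pattern described: 3D convolutions capture spatio-temporal structure within clips, and an LSTM models relations across the resulting sequence. The layer sizes are illustrative; the paper's actual 3D Inception-ResNet blocks are elided here.

```python
# 3D CNN + LSTM sketch; layer sizes are assumptions, Inception-ResNet elided.
import torch.nn as nn

class Conv3DLSTM(nn.Module):
    def __init__(self, n_classes=7, hidden=128):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((4, 1, 1)),    # -> (B, 32, 4, 1, 1)
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, clip):                    # clip: (B, 3, T, H, W)
        f = self.conv3d(clip).flatten(2)        # (B, 32, 4)
        f = f.transpose(1, 2)                   # (B, 4, 32), time-major for the LSTM
        _, (h, _) = self.lstm(f)
        return self.head(h[-1])                 # classify from the last hidden state
```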
Proceedings Article

End-to-End Speech Emotion Recognition Using Deep Neural Networks

TL;DR: This model, trained end-to-end, comprises a Convolutional Neural Network that extracts features from the raw signal, with a 2-layer Long Short-Term Memory stacked on top to capture the contextual information in the data.
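A minimal sketch of that pipeline: 1-D convolutions learn features directly from the raw waveform, and a 2-layer LSTM adds temporal context. Filter counts, strides, and the classification head are assumptions, not the paper's configuration.

```python
# Raw-waveform CNN + 2-layer LSTM sketch; filter counts and strides assumed.
import torch.nn as nn

class RawSpeechEmotionNet(nn.Module):
    def __init__(self, n_classes=4, hidden=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=16), nn.ReLU(),  # ~5 ms frames at 16 kHz
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, wav):                     # wav: (B, 1, n_samples) raw signal
        f = self.features(wav).transpose(1, 2)  # (B, T', 128)
        _, (h, _) = self.lstm(f)
        return self.head(h[-1])                 # last layer's final hidden state
```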
Proceedings Article

Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning.

TL;DR: A speech emotion recognition (SER) method using end-to-end (E2E) multitask learning with self-attention is proposed to deal with several known issues; it outperforms state-of-the-art methods and improves overall accuracy.
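A hedged sketch of self-attention pooling with multitask heads: attention weights summarise the frame sequence, and two heads share the pooled representation. The auxiliary task and the loss weighting are assumptions for illustration, not details taken from the paper.

```python
# Self-attention pooling + multitask heads sketch (auxiliary task assumed).
import torch
import torch.nn as nn

class SelfAttentiveMultitask(nn.Module):
    def __init__(self, dim=128, n_emotions=4, n_aux=2):
        super().__init__()
        self.attn = nn.Linear(dim, 1)
        self.emotion_head = nn.Linear(dim, n_emotions)
        self.aux_head = nn.Linear(dim, n_aux)   # hypothetical auxiliary task head

    def forward(self, frames):                  # frames: (B, T, dim) encoded speech
        w = torch.softmax(self.attn(frames), dim=1)   # (B, T, 1) attention weights
        pooled = (w * frames).sum(dim=1)              # attention-weighted mean over time
        return self.emotion_head(pooled), self.aux_head(pooled)

def multitask_loss(emo_logits, aux_logits, emo_y, aux_y, aux_weight=0.3):
    # Weighted sum of per-task losses; the 0.3 weight is an assumption.
    ce = nn.functional.cross_entropy
    return ce(emo_logits, emo_y) + aux_weight * ce(aux_logits, aux_y)
```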