Journal Article

End-to-End Learning for Multimodal Emotion Recognition in Video With Adaptive Loss

TLDR
In this article, a lightweight deep architecture of approximately 1 MB is proposed for emotion recognition in video through the interaction of visual, audio, and language information in an end-to-end learning manner, with three key points: a lightweight feature extractor, an attention strategy, and an adaptive loss.
Abstract
This work presents an approach for emotion recognition in video through the interaction of visual, audio, and language information in an end-to-end learning manner with three key points: 1) a lightweight feature extractor, 2) an attention strategy, and 3) an adaptive loss. We propose a lightweight deep architecture of approximately 1 MB for feature extraction, the most crucial part of an emotion recognition system. The relationship of features along the time dimension is explored with a temporal convolutional network (TCN) instead of an RNN-based architecture, to exploit parallelism and avoid the vanishing-gradient problem. The attention strategy adjusts the knowledge of the temporal networks along the time dimension and learns each modality's contribution to the final result. The interaction between the modalities is also investigated by training with an adaptive objective function that adjusts the network's gradients. Experimental results on a large-scale dataset for emotion recognition of Korean speakers demonstrate the superiority of our method when the attention mechanism and adaptive loss are employed during training.
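The abstract's three ingredients map naturally onto a small PyTorch sketch: dilated causal convolutions for the TCN, a softmax-scored fusion layer for the modality attention, and a re-weighted objective for the adaptive loss. Everything below, including module names, dimensions, the pooling, and the exact weighting scheme, is an illustrative assumption, not the authors' released implementation.

```python
# Minimal sketch of the three ingredients above; all names, dimensions, and
# the weighting scheme are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBlock(nn.Module):
    """One dilated causal conv block of a temporal convolutional network (TCN)."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # left-pad so the conv stays causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        out = self.conv(F.pad(x, (self.pad, 0)))
        return F.relu(out) + x                         # residual connection

class AttentionFusion(nn.Module):
    """Learns a scalar weight per modality and fuses their pooled features."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                          # feats: list of (batch, dim)
        stacked = torch.stack(feats, dim=1)            # (batch, n_modalities, dim)
        weights = torch.softmax(self.score(stacked), dim=1)
        return (weights * stacked).sum(dim=1)          # (batch, dim)

class MultimodalTCN(nn.Module):
    def __init__(self, dim=64, n_classes=7):
        super().__init__()
        self.tcns = nn.ModuleList(
            nn.Sequential(TemporalBlock(dim, dilation=1), TemporalBlock(dim, dilation=2))
            for _ in range(3))                         # visual, audio, language streams
        self.fusion = AttentionFusion(dim)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, visual, audio, text):            # each: (batch, dim, time)
        feats = [tcn(x).mean(dim=-1) for tcn, x in zip(self.tcns, (visual, audio, text))]
        return self.classifier(self.fusion(feats))

def adaptive_loss(logits_per_branch, target, alphas):
    # One plausible reading of an "adaptive" objective (an assumption): per-branch
    # cross-entropy terms re-weighted by learnable coefficients `alphas`
    # (an nn.Parameter), softmax-normalised so the weights stay on a simplex.
    weights = torch.softmax(alphas, dim=0)
    losses = torch.stack([F.cross_entropy(l, target) for l in logits_per_branch])
    return (weights * losses).sum()
```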


Citations
Proceedings Article

Audio-Visual Domain Adaptation Feature Fusion for Speech Emotion Recognition

TL;DR: In this paper, a multi-attention-based depthwise separable convolutional model for speech emotion feature extraction is proposed, which reduces feature redundancy by separating the convolution into spatial-wise (depthwise) and channel-wise (pointwise) operations.
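For reference, this is the standard building block the summary alludes to, sketched in PyTorch: a per-channel spatial convolution followed by a 1x1 channel-mixing convolution. The shapes and layer names are illustrative assumptions.

```python
# Sketch of a depthwise separable convolution (shapes are assumptions).
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # groups=in_ch -> each input channel is filtered independently (spatial-wise)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # 1x1 conv mixes information across channels (channel-wise)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```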
Proceedings Article

Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

TL;DR: This work focuses on unsupervised feature learning for Multimodal Emotion Recognition (MER) with discrete emotions, using text, audio, and vision as modalities; the end-to-end approach builds on a contrastive loss between pairwise modalities and requires no backbones pre-trained on an emotion recognition task, a first in the MER literature.
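A minimal sketch of a modality-pairwise contrastive objective in the spirit of this summary: for each pair of modalities, embeddings of the same sample are pulled together and mismatched samples pushed apart, InfoNCE-style. The symmetric formulation and the temperature value are assumptions, not the paper's exact loss.

```python
# Pairwise contrastive loss sketch; temperature and symmetry are assumptions.
import itertools
import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(embeddings, temperature=0.1):
    """embeddings: dict such as {'text': (B, D), 'audio': (B, D), 'vision': (B, D)}."""
    total, n_terms = 0.0, 0
    for a, b in itertools.combinations(embeddings, 2):
        za = F.normalize(embeddings[a], dim=1)
        zb = F.normalize(embeddings[b], dim=1)
        logits = za @ zb.t() / temperature             # (B, B) similarity matrix
        targets = torch.arange(za.size(0), device=za.device)
        # symmetric cross-entropy: the i-th row/column should match the i-th sample
        total += F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)
        n_terms += 2
    return total / n_terms
```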
References
Proceedings Article

MobileNetV2: Inverted Residuals and Linear Bottlenecks

TL;DR: MobileNetV2, as described in this paper, is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers, and the intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity.
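An illustrative PyTorch rendering of that block: expand with a 1x1 conv, filter with a lightweight depthwise conv, project back to the thin bottleneck with a linear (activation-free) 1x1 conv, and shortcut between the thin ends. The expansion factor and the stride-1, equal-channel case are the usual MobileNetV2 defaults, stated here as assumptions.

```python
# Inverted residual block sketch (expansion factor and shapes assumed).
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),        # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),              # depthwise filter
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),        # linear bottleneck
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)                               # shortcut between thin layers
```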
Proceedings Article

Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph

TL;DR: This paper introduces CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), the largest dataset for sentiment analysis and emotion recognition to date, and uses a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG), which is highly interpretable and achieves competitive performance compared to the previous state of the art.
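A much-simplified, hedged sketch of dynamic fusion in the spirit of the DFG: each subset of modalities gets a fused vertex, and a learned sigmoid "efficacy" gates how much that vertex contributes. The real DFG wires these vertices into a graph; this flat version only illustrates the gating idea and is an assumption, not the paper's implementation.

```python
# Flat, gated subset-fusion sketch; NOT the actual Dynamic Fusion Graph.
import itertools
import torch
import torch.nn as nn

class GatedSubsetFusion(nn.Module):
    def __init__(self, dim, modalities=('text', 'audio', 'vision')):
        super().__init__()
        self.subsets = [s for r in range(1, len(modalities) + 1)
                        for s in itertools.combinations(modalities, r)]
        self.mix = nn.ModuleDict({'_'.join(s): nn.Linear(dim * len(s), dim)
                                  for s in self.subsets})
        self.gate = nn.ModuleDict({'_'.join(s): nn.Linear(dim, 1)
                                   for s in self.subsets})

    def forward(self, feats):                   # feats: dict name -> (B, dim)
        out = 0.0
        for s in self.subsets:
            v = self.mix['_'.join(s)](torch.cat([feats[m] for m in s], dim=1))
            out = out + torch.sigmoid(self.gate['_'.join(s)](v)) * v   # gated vertex
        return out                              # (B, dim) fused representation
```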
Proceedings Article

Facial Expression Recognition Using Enhanced Deep 3D Convolutional Neural Networks

TL;DR: In this article, a 3D Convolutional Neural Network (CNN) for facial expression recognition in videos is proposed, consisting of 3D Inception-ResNet layers followed by an LSTM unit, which together extract the spatial relations within facial images as well as the temporal relations between different frames in the video.
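A hedged sketch of the pattern described: 3D convolutions capture spatio-temporal structure within clips, and an LSTM models relations across the resulting sequence. The layer sizes are illustrative; the paper's actual 3D Inception-ResNet blocks are elided here.

```python
# 3D CNN + LSTM sketch; layer sizes are assumptions, Inception-ResNet elided.
import torch.nn as nn

class Conv3DLSTM(nn.Module):
    def __init__(self, n_classes=7, hidden=128):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((4, 1, 1)),    # -> (B, 32, 4, 1, 1)
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, clip):                    # clip: (B, 3, T, H, W)
        f = self.conv3d(clip).flatten(2)        # (B, 32, 4)
        f = f.transpose(1, 2)                   # (B, 4, 32), time-major for the LSTM
        _, (h, _) = self.lstm(f)
        return self.head(h[-1])                 # classify from the last hidden state
```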
Proceedings Article

End-to-End Speech Emotion Recognition Using Deep Neural Networks

TL;DR: This model, trained end-to-end, comprises a Convolutional Neural Network that extracts features from the raw signal, with a 2-layer Long Short-Term Memory stacked on top to capture the contextual information in the data.
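A minimal sketch of that pipeline: 1-D convolutions learn features directly from the raw waveform, and a 2-layer LSTM adds temporal context. Filter counts, strides, and the classification head are assumptions, not the paper's configuration.

```python
# Raw-waveform CNN + 2-layer LSTM sketch; filter counts and strides assumed.
import torch.nn as nn

class RawSpeechEmotionNet(nn.Module):
    def __init__(self, n_classes=4, hidden=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=16), nn.ReLU(),  # ~5 ms frames at 16 kHz
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, wav):                     # wav: (B, 1, n_samples) raw signal
        f = self.features(wav).transpose(1, 2)  # (B, T', 128)
        _, (h, _) = self.lstm(f)
        return self.head(h[-1])                 # last layer's final hidden state
```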
Proceedings Article

Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning.

TL;DR: A speech emotion recognition (SER) method using end-to-end (E2E) multitask learning with self-attention is proposed to deal with several known issues; it outperforms state-of-the-art methods and improves overall accuracy.
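A hedged sketch of self-attention pooling with multitask heads: attention weights summarise the frame sequence, and two heads share the pooled representation. The auxiliary task and the loss weighting are assumptions for illustration, not details taken from the paper.

```python
# Self-attention pooling + multitask heads sketch (auxiliary task assumed).
import torch
import torch.nn as nn

class SelfAttentiveMultitask(nn.Module):
    def __init__(self, dim=128, n_emotions=4, n_aux=2):
        super().__init__()
        self.attn = nn.Linear(dim, 1)
        self.emotion_head = nn.Linear(dim, n_emotions)
        self.aux_head = nn.Linear(dim, n_aux)   # hypothetical auxiliary task head

    def forward(self, frames):                  # frames: (B, T, dim) encoded speech
        w = torch.softmax(self.attn(frames), dim=1)   # (B, T, 1) attention weights
        pooled = (w * frames).sum(dim=1)              # attention-weighted mean over time
        return self.emotion_head(pooled), self.aux_head(pooled)

def multitask_loss(emo_logits, aux_logits, emo_y, aux_y, aux_weight=0.3):
    # Weighted sum of per-task losses; the 0.3 weight is an assumption.
    ce = nn.functional.cross_entropy
    return ce(emo_logits, emo_y) + aux_weight * ce(aux_logits, aux_y)
```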