Open Access · Proceedings Article · DOI

Multimodal Speech Emotion Recognition Using Audio and Text

Abstract
Speech emotion recognition is a challenging task, and extensive reliance has been placed on models that use audio features in building well-performing classifiers. In this paper, we propose a novel deep dual recurrent encoder model that utilizes text data and audio signals simultaneously to obtain a better understanding of speech data. As emotional dialogue is composed of sound and spoken content, our model encodes the information from audio and text sequences using dual recurrent neural networks (RNNs) and then combines the information from these sources to predict the emotion class. This architecture analyzes speech data from the signal level to the language level, and it thus utilizes the information within the data more comprehensively than models that focus on audio features. Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods in assigning data to one of four emotion categories (i.e., angry, happy, sad and neutral) when the model is applied to the IEMOCAP dataset, as reflected by accuracies ranging from 68.8% to 71.8%.
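The dual recurrent encoder idea above can be sketched in a few lines: one RNN encodes the audio feature sequence, a second encodes the word-embedding sequence, and their final hidden states are fused for classification. The following is a minimal NumPy illustration, not the authors' implementation; the GRU cell, the feature dimensions (39-dim audio frames, 100-dim word vectors), and fusion by simple concatenation are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_params(input_dim, hidden_dim, rng):
    # One weight matrix per gate, acting on [input; previous hidden].
    shape = (hidden_dim, input_dim + hidden_dim)
    return {g: rng.normal(0, 0.1, shape) for g in ("z", "r", "h")}

def gru_encode(seq, params, hidden_dim):
    """Run a GRU over seq (time x features); return the final hidden state."""
    h = np.zeros(hidden_dim)
    for x in seq:
        xh = np.concatenate([x, h])
        z = sigmoid(params["z"] @ xh)  # update gate
        r = sigmoid(params["r"] @ xh)  # reset gate
        h_tilde = np.tanh(params["h"] @ np.concatenate([x, r * h]))
        h = (1 - z) * h + z * h_tilde
    return h

hidden = 16
audio_feats = rng.normal(size=(50, 39))   # placeholder audio frames (assumed dims)
text_embeds = rng.normal(size=(12, 100))  # placeholder word embeddings (assumed dims)

# Encode each modality with its own recurrent encoder.
h_audio = gru_encode(audio_feats, gru_params(39, hidden, rng), hidden)
h_text = gru_encode(text_embeds, gru_params(100, hidden, rng), hidden)

# Fuse the two modalities by concatenation, then classify into the four
# emotion categories (angry, happy, sad, neutral).
fused = np.concatenate([h_audio, h_text])
W_out = rng.normal(0, 0.1, (4, 2 * hidden))
logits = W_out @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (4,)
```

The key design point is that the signal-level (audio) and language-level (text) encoders are trained jointly, so the classifier sees both sources at once rather than audio features alone.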


Citations
Journal Article · DOI

Speech emotion recognition with deep convolutional neural networks

TL;DR: A new architecture extracts mel-frequency cepstral coefficients, chromagram, mel-scale spectrogram, Tonnetz representation, and spectral contrast features from sound files and feeds them to a one-dimensional convolutional neural network for emotion identification, using samples from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Berlin EMO-DB datasets.
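A common way to assemble these five feature types for a 1D CNN is to average each representation over time and concatenate the means into a single input vector. The sketch below uses random placeholder arrays in place of real audio analysis (which would typically come from a library such as librosa); the dimensions (40 MFCCs, 12 chroma bins, 128 mel bands, 7 contrast bands, 6 Tonnetz dimensions) are typical choices, not necessarily the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames = 100  # placeholder number of analysis frames

# Placeholder per-frame features; in practice each array would be computed
# from the waveform of one sound file.
features = {
    "mfcc": rng.normal(size=(40, n_frames)),     # mel-frequency cepstral coeffs
    "chroma": rng.normal(size=(12, n_frames)),   # chromagram (12 pitch classes)
    "mel": rng.normal(size=(128, n_frames)),     # mel-scale spectrogram
    "contrast": rng.normal(size=(7, n_frames)),  # spectral contrast bands
    "tonnetz": rng.normal(size=(6, n_frames)),   # tonal centroid features
}

# Average each feature over time and concatenate into one fixed-length
# input vector for the 1D CNN (40 + 12 + 128 + 7 + 6 = 193 values).
x = np.concatenate([f.mean(axis=1) for f in features.values()])
print(x.shape)  # (193,)
```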
Journal Article · DOI

MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach

TL;DR: This work proposes an end-to-end, real-time SER model that processes original speech signals directly for emotion recognition, using a lightweight dilated CNN architecture that implements a multi-learning trick (MLT) approach.
Proceedings Article · DOI

Speech Emotion Recognition Using Multi-hop Attention Mechanism

TL;DR: A framework that exploits acoustic information in tandem with lexical data: two bi-directional long short-term memory (BLSTM) networks obtain hidden representations of the utterance, and an attention mechanism, referred to as multi-hop, is trained to automatically infer the correlation between the modalities.
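The multi-hop idea can be sketched as follows: one hop attends from a text-side query over the audio-side hidden states to produce a context vector, and that context then refines the query for the next hop. This is a simplified NumPy illustration, not the paper's model; the dot-product scoring function and the way the hops are chained here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_hop(query, keys):
    """One attention hop: score each key against the query (dot product,
    an assumed scoring function) and return the weighted sum of keys."""
    scores = keys @ query
    weights = softmax(scores)
    return weights @ keys, weights

hidden = 32
audio_states = rng.normal(size=(60, hidden))  # BLSTM states over audio frames
query = rng.normal(size=hidden)               # e.g. final text-side hidden state

# Two hops: the context from hop 1 feeds into the query for hop 2, letting
# the model iteratively refine the cross-modal correlation.
context, w1 = attention_hop(query, audio_states)
context, w2 = attention_hop(context + query, audio_states)
print(context.shape)  # (32,)
```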
Proceedings Article · DOI

Speech Emotion Recognition with Dual-Sequence LSTM Architecture

TL;DR: This work proposes a new dual-level model that predicts emotions from both MFCC features and mel-spectrograms produced from raw audio signals, achieving performance comparable to multimodal models that leverage textual information as well as audio signals.
Journal Article · DOI

A Comprehensive Review of Speech Emotion Recognition Systems

TL;DR: In this article, the authors identify and synthesize recent literature on the varied design components and methodologies of speech emotion recognition systems, providing readers with a state-of-the-art understanding of this active research topic.
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: A deep convolutional neural network consisting of five convolutional layers, some followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax achieved state-of-the-art performance on the ImageNet classification task.
Proceedings Article · DOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context-window methods, and produces a vector space with meaningful substructure.
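GloVe's weighted least-squares cost for one co-occurrence pair (i, j) with count X_ij is f(X_ij) * (w_i · w̃_j + b_i + b̃_j − log X_ij)², with the weighting function f capping frequent pairs at 1. The sketch below applies one SGD update to a single pair; the learning rate, initialization, and vocabulary size are placeholder choices for illustration, not the paper's training setup.

```python
import numpy as np

rng = np.random.default_rng(3)
dim, vocab = 50, 1000
W = rng.normal(0, 0.1, (vocab, dim))        # word vectors
W_tilde = rng.normal(0, 0.1, (vocab, dim))  # context vectors
b = np.zeros(vocab)                          # word biases
b_tilde = np.zeros(vocab)                    # context biases

def weight_fn(x, x_max=100.0, alpha=0.75):
    # Down-weights rare co-occurrences; caps frequent ones at 1.
    return (x / x_max) ** alpha if x < x_max else 1.0

def sgd_step(i, j, x_ij, lr=0.05):
    """One SGD update on the GloVe cost for co-occurrence count x_ij.
    The constant factor 2 from the squared term is folded into lr."""
    f = weight_fn(x_ij)
    diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(x_ij)
    grad_wi = f * diff * W_tilde[j]  # compute both gradients before updating
    grad_wj = f * diff * W[i]
    W[i] = W[i] - lr * grad_wi
    W_tilde[j] = W_tilde[j] - lr * grad_wj
    b[i] -= lr * f * diff
    b_tilde[j] -= lr * f * diff
    return f * diff ** 2  # this pair's contribution to the cost J

cost_before = sgd_step(5, 7, x_ij=20.0)
cost_after = weight_fn(20.0) * (W[5] @ W_tilde[7] + b[5] + b_tilde[7]
                                - np.log(20.0)) ** 2
print(cost_after < cost_before)
```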
Proceedings Article

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and it is proposed to extend it by allowing the model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Posted Content

Empirical evaluation of gated recurrent neural networks on sequence modeling

TL;DR: Advanced recurrent units that implement a gating mechanism, the long short-term memory (LSTM) unit and the recently proposed gated recurrent unit (GRU), are evaluated on sequence modeling tasks; the GRU is found to be comparable to the LSTM.
Proceedings Article · DOI

Effective Approaches to Attention-based Neural Machine Translation

TL;DR: A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.
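The distinction between the two approaches is simply which source positions are scored: global attention scores all of them, while local attention scores only a window of width 2D+1 around an aligned position p. A minimal NumPy sketch, using dot-product scoring (one of the scoring functions the paper examines) and a fixed p for illustration (the paper also considers predicting p from the decoder state):

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden, src_len = 32, 40
source_states = rng.normal(size=(src_len, hidden))  # one encoder state per source word
target_state = rng.normal(size=hidden)              # current decoder state

# Global attention: always attend to all source words.
global_weights = softmax(source_states @ target_state)
global_context = global_weights @ source_states

# Local attention: attend only inside a window of width 2D+1 centered on
# an aligned position p.
p, D = 20, 4
window = source_states[p - D : p + D + 1]
local_weights = softmax(window @ target_state)
local_context = local_weights @ window

print(global_weights.shape, local_weights.shape)  # (40,) (9,)
```

The local variant trades a small alignment-modeling cost for much cheaper attention on long source sentences, since only 2D+1 positions are scored per target word.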