Journal Article•DOI•

LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework

Martin Wöllmer¹, Moritz Kaiser¹, Florian Eyben¹, Björn Schuller¹, Gerhard Rigoll¹ - Show less +1 more•Institutions (1)

01 Feb 2013-Image and Vision Computing (Butterworth-Heinemann)-Vol. 31, Iss: 2, pp 153-163

TL;DR: Comparing the results with the recognition scores of all Audiovisual Sub-Challenge participants, it is found that the proposed LSTM-based technique leads to the best average recognition performance that has been reported for this task so far.

read less

About: This article is published in Image and Vision Computing.The article was published on 2013-02-01 and is currently open access. It has received 281 citations till now. The article focuses on the topics: Affective computing & Context model.

...read moreread less

Summary (3 min read)

Jump to: [1. Introduction] – [2. The SEMAINE Database] – [3.1. Audio Feature Extraction] – [3.2. Linguistic and Non-Linguistic Feature Extraction] – [3.3. Visual Feature Extraction] – [3.3.1. Baseline Video Feature Extractor] – [3.3.2. Proposed Visual Feature Extraction Method] – [4. Classification] – [5. Experiments and Results] – [5.2. Experimental Settings] and [5.3. Results and Discussion]

1. Introduction

As speech recognition systems have matured over the last decades, automatic emotion recognition (AER) can be seen as going one step further in the design of natural, intuitive, and humanlike computer interfaces.
Currently, the authors are observing a shift from modeling prototypical emotional categories such as anger or happiness to viewing human affect in a continuous orthogonal way by defining emotional dimensions including for example arousal and valence.
Apart from preliminary experiments using facial marker information as additional input modality [13] and a recent study on subject dependent recognition of arousal and valence [27], LSTM architectures have hardly been applied for audiovisual emotion recognition.
The audio feature extraction front-end applied in their study is based on their open-source toolkit openSMILE [28] which is able to extract large sets of prosodic, spectral, and voice quality low-level descriptors (LLD) combined with various statistical functionals in real-time.

2. The SEMAINE Database

The freely available audiovisual SEMAINE corpus1 [14] was recorded to study natural social signals that occur in conversations between humans and artificially intelligent agents.
The scenario used during the creation of the database is called the Sensitive Artificial Listener (SAL).
Both, the user and the operator were recorded from a frontal view by both a greyscale camera and a color camera.
As the number of character conversations varies between recordings, the number of sessions is different per set:.
For the challenge, the originally continuous affective dimensions arousal, expectation, power, and valence were redefined as binary classification tasks by testing at every frame whether they are above or below average.

3.1. Audio Feature Extraction

The authors acoustic feature extraction approach is based on a large set of low-level descriptors and derivatives of LLD combined with suited statistical functionals to capture speech dynamics within a word.
All features and functionals are computed using their on-line audio analysis toolkit openSMILE [28].
Details on the LLD and functionals are given in Tables 2 and 3, respectively.
One example for a LLD/functional combination that contains no information is ‘minimum pitch’ which is always zero.

3.2. Linguistic and Non-Linguistic Feature Extraction

Linguistic features are extracted using the SEMAINE 3.0 ASR system [4].
It applies openSMILE as front-end to extract 13 Mel-Frequency Cepstral Coefficients (MFCC) together with first and second order temporal derivatives every 10 ms (window size 25 ms).
All of these corpora contain spontaneous, conversational, and partly emotional speech.
The phoneme HMMs consist of three states with 16 Gaussian mixtures per state.
Typically, one (key)word is detected for every audio chunk (which correspond to single words), however the recognizer is not restricted to detect exactly one word, thus insertions and deletions are possible.

3.3. Visual Feature Extraction

Generally, a large variety of purely visual emotion recognition systems has been presented in recent years, including combinations of Local Binary Patterns and Support Vector Machines [41], methods based on deformed grids and SVMs [42], Haar-like features modeled via AdaBoost [43], approaches using Gabor filters and non-negative matrix factorization [44], and variable-intensity models [45].
Glodek et al. [35] use Gabor filters to extract video features.
Subsequently, the face is cut out and rotated so that it is upright, before the optical flow with respect to the previous frame is computed.
Compared to [29], their method is faster and also extracts head tilt in addition to facial movement features.
Furthermore, unlike the Audio/Visual Emotion Challenge baseline video feature extractor [6] which is based on dense local appearance descriptors, their approach does not rely on correct eye detection.

3.3.1. Baseline Video Feature Extractor

The baseline video feature extractor for the 2011 Audio/Visual Emotion Challenge [6] works as follows:.
First, the face position is detected by a Viola Jones face detector which computes a squared window containing the face.
Once the two eyes are detected, the image can be rotated by angle α so that the eyes lie on a horizontal line.
Uniform Local Binary Patterns (LBP) [55] are used as dense local appearance descriptors.
Consisting of eight binary comparisons per pixel, they are fast to compute.

3.3.2. Proposed Visual Feature Extraction Method

In order to compute the visual low-level features applied in their proposed LSTM-based audiovisual emotion recognition framework the authors go through the steps depicted in the block diagram in Figure 3.
Each of the three components of the HSV color model has 20 bins in the histogram.
Subsequently, the face is tracked with a camshift tracker [57] which takes the probability image as input.
The following functionals are applied to frame-based video features: arithmetic mean (for delta coefficients: arithmetic mean of absolute values), standard deviation, 5% percentile, 95% percentile, and range of 5% and 95% percentile.
In Figure 5, the shading of the facial regions indicates the importance of the features corresponding to the respective region.

4. Classification

Widely used classifiers operating on static word- or turn-level feature vectors are, e. g., Support Vector Machines or Multilayer Perceptrons.
To exploit context between successive speech segments for improved audiovisual emotion recognition, this study considers recurrent neural network architectures which take into account past observations by cyclic connections in the network’s hidden layer.
Each memory block consists of one or more memory cells and multiplicative input, output, and forget gates.
The initial version of the LSTM architecture proposed in [18] contained only input and output gates to enable an architecture that can store and access activations via gate activations.

5. Experiments and Results

All experiments are carried out on the Audiovisual Sub-Challenge task as described in Section 2.
The task is to discriminate between high and low arousal, expectation, power, and valence.
As the class distribution in the training set is relatively well balanced, the official challenge measure is weighted accuracy, i. e., the recognition rates of the individual classes weighted by the class distribution.
This imbalence holds in particular for the Audio and Audio-Visual Sub-Challenge as they consider word-level modeling rather than frame-based recognition.

5.2. Experimental Settings

The authors investigate the performance of both, bidirectional LSTMs and unidirectional LSTM networks for fully incremental on-line audiovisual affect recognition.
The number of input nodes corresponds to the number of different features per speech segment and the number of output nodes corresponds to the number of target classes, i. e., the authors used two output nodes representing high and low arousal, expectation, power, and valence, respectively.
All networks were trained using a learning rate of 10−5.
To validate whether better recognition performance can be obtained when changing the number of memory blocks, the authors evaluated hidden layer sizes of between 80 and 160 memory blocks on the development set.
The resulting number of variables that need to be estimated during network training is equivalent to the number of weights in the network, e. g., an LSTM network that processes the full feature set consisting of acoustic, linguistic, and video information has 2 094 210 weights.

5.3. Results and Discussion

Table 5 shows both, weighted accuracies (WA) and unweighted accuracies (UA) obtained when training on the training set of the 2011 Audio/Visual Emotion Challenge and testing on the development set.
The performance of the different feature groups (acoustic, linguistic, video) heavily depends on the considered emotional dimension.
The classification of expectation seems to benefit from including visual information as the best WA (67.6 %) is reached for LSTM networks applying late fusion of audio and video modalities.
For recognition based on video only, CFS leads to a remarkable performance gain, increasing the average WA from 60.4 % to 65.8 % for unidirectional LSTM networks.
Networks were trained on the training and development set.

Did you find this useful? Give us your feedback

Figures (15)

Table 2: 31 acoustic low-level descriptors (LLD).

Table 3: Set of all 42 functionals. 1not applied to delta coefficient contours. 2for delta coefficients the mean of only positive values is applied, otherwise the arithmetic mean is applied. 3not applied to voicing related LLD.

Figure 6: LSTM memory block consisting of one memory cell: the input, output, and forget gates collect activations from inside and outside the block which control the cell through multiplicative units (depicted as small circles); input, output, and forget gate scale input, output, and internal state respectively; fi, fg, and fo denote activation functions; the recurrent connection of fixed weight 1.0 maintains the internal state.

Table 6: Development set of the Audiovisual Sub-Challenge; CFS feature selection: weighted accuracies (WA) and unweighted accuracies (UA) for the discrimination of high and low arousal, expectation, power, and valence using acoustic (A), linguistic (L), and video (V) features combined with different classifiers. LF: late fusion; the best weighted accuracies for each emotional dimension are highlighted.

Figure 5: Importance of facial regions for video feature extraction according to the ranking-based information gain attribute evaluation algorithm implemented in the Weka toolkit [58]. Information gain is evaluated for each emotional dimension. The shading of the facial regions indicates the importance of the features corresponding to the respective region.

Table 5: Development set of the Audiovisual Sub-Challenge; no feature selection: weighted accuracies (WA) and unweighted accuracies (UA) for the discrimination of high and low arousal, expectation, power, and valence using acoustic (A), linguistic (L), and video (V) features combined with different classifiers. LF: late fusion; the best weighted accuracies for each emotional dimension are highlighted.

Table 4: Approaches proposed for the Video Sub-Challenge of the 2011 Audio/Visual Emotion Challenge.

Table 1: Overview of the SEMAINE database as used for the 2011 Audio/Visual Emotion Challenge [6].

Table 8: Statistical significance of the average performance difference between the audio-based classification approaches denoted in the column and the approaches in the table header (evaluations on test set of the Audiovisual Sub-Challenge); ‘-’: not significant; ‘o’ significant at 0.1 level; ‘+’: significant at 0.05 level; ‘++’: significant at 0.001 level. Significance levels are computed according to the z-test described in [67].

Figure 7: System architecture for early fusion of acoustic, linguistic, and video features.

Figure 4: Example for optical flow computation: Between the frames there is a substantial change in the mouth region.

Figure 3: Basic steps for the computation of the low-level visual features.

Figure 1: Examples for low and high arousal, expectation, power, and valence.

Figure 2: Series of word-level screenshots of a user together with the corresponding valence annotation.

Table 7: Test set of the Audiovisual Sub-Challenge: weighted accuracies (WA) and unweighted accuracies (UA) for the discrimination of high and low arousal, expectation, power, and valence using acoustic (A), linguistic (L), and video (V) features combined with different classifiers. LF: late fusion; the best weighted accuracies for each emotional dimension are highlighted.

Frequently Asked Questions (18)

Q1. What are the contributions mentioned in the paper "Lstm-modeling of continuous emotions in an audiovisual affect recognition framework" ?

This article presents their recent advances in assessing dimensional representations of emotion, such as arousal, expectation, power, and valence, in an audiovisual human-computer interaction scenario. Building on previous studies which demonstrate that long-range context modeling tends to increase accuracies of emotion recognition, the authors propose a fully automatic audiovisual recognition approach based on Long Short-Term Memory ( LSTM ) modeling of word-level audio and video features. The authors apply the same acoustic features as used in the challenge baseline system whereas visual features are computed via a novel facial movement feature extractor. Comparing their results with the recognition scores of all Audiovisual Sub-Challenge participants, the authors find that the proposed LSTM-based technique leads to the best average recognition performance that has been reported for this task so far.

Q2. What future works have the authors mentioned in the paper "Lstm-modeling of continuous emotions in an audiovisual affect recognition framework" ?

However, the considered scenario reflects realistic conditions in natural interactions and thus highlights the need for further research in the area of affective computing in order to get closer to the human performance in judging emotions. Their future research in the area of video feature extraction will include the application of multi-camera input to be more robust to head rotations. The authors plan to combine the facial movements of the 2D camera sequences to predict 3D movement. Another possibility to increase recognition performance is to allow asynchronities between audio and video, e. g., by applying hybrid fusion techniques like asynchronous HMMs [ 69 ] or multi-dimensional dynamic time warping [ 48 ].

Q3. What are the functionals for the video features?

The following functionals are applied to frame-based video features: arithmetic mean (for delta coefficients: arithmetic mean of absolute values), standard deviation, 5% percentile, 95% percentile, and range of 5% and 95% percentile.

Q4. What is the function used to map the sequence of video features to a single vector?

In order to map the sequence of frame-based video features to a single vector describing the word-unit, statistical functionals are applied to the frame-based video features and their first order delta coefficients.

Q5. What is the acoustic feature extraction approach?

Their acoustic feature extraction approach is based on a large set of low-level descriptors and derivatives of LLD combined with suited statistical functionals to capture speech dynamics within a word.

Q6. How long does it take to compute the low-level features?

The computation of the low-level features takes 50 ms per frame for a C++ implementation on a 2.4 GHz Intel i5 processor with 4 GB RAM.

Q7. What is the official challenge measure for the LSTM?

As the class distribution in the training set is relatively well balanced, the official challenge measure is weighted accuracy, i. e., the recognition rates of the individual classes weighted by the class distribution.

Q8. What is the reason for the introduction of context-sensitivity in emotion classification frameworks?

Human emotions tend to evolve slowly over time which motivates the introduction of some form of context-sensitivity in emotion classification frameworks.

Q9. What are the functions used to ensure a similar dimensionality of the video feature vector and?

Fewer functionals as for audio features are used to ensure a similar dimensionality of the video feature vector and the audio feature vector.

Q10. What is the approach towards achieving acceptable recognition performance even in challenging conditions?

One approach towards reaching acceptable recognition performance even in challenging conditions is the modeling of contextual information.

Q11. What is the classification framework for long-range context modeling?

Among various classification frameworks that are able to exploit turn-level context, so-called Long Short-Term Memory (LSTM) networks [18] tend to be best suited for long-range context modeling in emotion recognition.

Q12. How many training epochs did the LSTM network achieve?

According to optimizations on the development set, the number of training epochs was 60 for networks classifying arousal and 30 for all other networks.

Q13. How is the dimensionality of the features computed?

By employing uniform LBPs instead of full LBPs and aggregating the LBP operator responses in histograms taken over regions of the face, the dimensionality of the features is rather low (59 dimensions per image block).

Q14. What is the WA for arousal?

For arousal, the best WA of 68.5 % is obtained for acoustic features only, whichis in line with previous studies showing that audio is the most important modality for assessing arousal [13].

Q15. What is the probability of a facial pixel in the current image?

For each pixel I(x, y) in the current image the probability of a facial pixel can be approximated byP f (x, y) = M(IH(x, y), IS (x, y), IV (x, y))N , (2)with N being the number of template pixels that have been used to create the histogram.

Q16. How many sessions are intended for this sub-challenge?

their test set consists only of the sessions that are intended for this sub-challenge, meaning only 10 out of the 32 test sessions.

Q17. What is the WA for LSTMs?

the classification of expectation seems to benefit from including visual information as the best WA (67.6 %) is reached for LSTM networks applying late fusion of audio and video modalities.

Q18. How can the authors improve the recognition performance of the audio/video emotion challenge?

To obtain the best possible recognition performance, future studies should also investigate which feature-classifier combinations lead to the best results, e. g., by combining the proposed LSTM framework with other audio or video features proposed for the 2011 Audio/Visual Emotion Challenge.

LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework

Summary (3 min read)

1. Introduction

2. The SEMAINE Database

3.1. Audio Feature Extraction

3.2. Linguistic and Non-Linguistic Feature Extraction

3.3. Visual Feature Extraction

3.3.1. Baseline Video Feature Extractor

3.3.2. Proposed Visual Feature Extraction Method

4. Classification

5. Experiments and Results

5.2. Experimental Settings

5.3. Results and Discussion

Figures (15)

Citations

Cites background from "LSTM-Modeling of continuous emotion..."

References

"LSTM-Modeling of continuous emotion..." refers background or methods in this paper

"LSTM-Modeling of continuous emotion..." refers methods in this paper

"LSTM-Modeling of continuous emotion..." refers methods in this paper

"LSTM-Modeling of continuous emotion..." refers methods in this paper

Related Papers (5)

Frequently Asked Questions (18)

Q1. What are the contributions mentioned in the paper "Lstm-modeling of continuous emotions in an audiovisual affect recognition framework" ?

Q2. What future works have the authors mentioned in the paper "Lstm-modeling of continuous emotions in an audiovisual affect recognition framework" ?

Q3. What are the functionals for the video features?

Q4. What is the function used to map the sequence of video features to a single vector?

Q5. What is the acoustic feature extraction approach?

Q6. How long does it take to compute the low-level features?

Q7. What is the official challenge measure for the LSTM?

Q8. What is the reason for the introduction of context-sensitivity in emotion classification frameworks?

Q9. What are the functions used to ensure a similar dimensionality of the video feature vector and?

Q10. What is the approach towards achieving acceptable recognition performance even in challenging conditions?

Q11. What is the classification framework for long-range context modeling?

Q12. How many training epochs did the LSTM network achieve?

Q13. How is the dimensionality of the features computed?

Q14. What is the WA for arousal?

Q15. What is the probability of a facial pixel in the current image?

Q16. How many sessions are intended for this sub-challenge?

Q17. What is the WA for LSTMs?

Q18. How can the authors improve the recognition performance of the audio/video emotion challenge?

Trending Questions (1)