Open Access Journal ArticleDOI

LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework

TLDR
Comparing the results with the recognition scores of all Audiovisual Sub-Challenge participants, it is found that the proposed LSTM-based technique leads to the best average recognition performance that has been reported for this task so far.
About
This article was published in Image and Vision Computing on 2013-02-01 and is currently open access. It has received 281 citations to date. The article focuses on the topics: Affective computing & Context model.


Citations
Journal ArticleDOI

Deep learning architecture using rough sets and rough neural networks

Yasser F. Hassan
03 Apr 2017
TL;DR: The objective of this work is to propose a model for deep rough set theory that uses more than one decision table and approximates these tables to a classification system, i.e., a novel framework of deep learning based on multi-decision tables.
Proceedings ArticleDOI

Ballistocardiogram Based Person Identification and Authentication Using Recurrent Neural Networks

TL;DR: The results show that BCG carries individual information and can be used as a biometric for person identification and authentication, and that the fusion of multiple cardiac signals, such as a BCG-ECG combination, may have great potential for person identification.
Proceedings Article

Towards Generic Models of Player Experience

TL;DR: This paper proposes generic models of user experience in the computer games domain, investigates the modelling mechanism's ability to generalise over the two datasets, and examines whether generic features of player behaviour can be defined and used to boost the modelling performance.
Proceedings ArticleDOI

Improved Active Speaker Detection Based on Optical Flow

TL;DR: A robust active speaker detection model is proposed by incorporating dense optical flow to strengthen the visual representation of facial motion; the results demonstrate that optical flow can improve the performance of neural networks when combined with raw pixels and the audio signal.
Proceedings ArticleDOI

Deep neural networks for anger detection from real life speech data

TL;DR: This paper extensively evaluates the deep networks on a large real-life speech corpus of 26,970 utterances with utterance-level labels collected from a German voice portal, finding that the proposed neural networks significantly outperform traditional modelling algorithms for speech anger detection.
References
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Book

Data Mining: Practical Machine Learning Tools and Techniques

TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Journal ArticleDOI

The WEKA data mining software: an update

TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Journal ArticleDOI

Multiresolution gray-scale and rotation invariant texture classification with local binary patterns

TL;DR: A generalized gray-scale and rotation invariant operator presentation that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution and presents a method for combining multiple operators for multiresolution analysis.
Journal ArticleDOI

Robust Real-Time Face Detection

TL;DR: In this paper, a face detection framework capable of processing images extremely rapidly while achieving high detection rates is described; the detector runs at roughly 15 frames per second on conventional desktop hardware.
Frequently Asked Questions (18)
Q1. What are the contributions mentioned in the paper "Lstm-modeling of continuous emotions in an audiovisual affect recognition framework" ?

This article presents their recent advances in assessing dimensional representations of emotion, such as arousal, expectation, power, and valence, in an audiovisual human-computer interaction scenario. Building on previous studies which demonstrate that long-range context modeling tends to increase accuracies of emotion recognition, the authors propose a fully automatic audiovisual recognition approach based on Long Short-Term Memory (LSTM) modeling of word-level audio and video features. The authors apply the same acoustic features as used in the challenge baseline system, whereas visual features are computed via a novel facial movement feature extractor. Comparing their results with the recognition scores of all Audiovisual Sub-Challenge participants, the authors find that the proposed LSTM-based technique leads to the best average recognition performance that has been reported for this task so far.

However, the considered scenario reflects realistic conditions in natural interactions and thus highlights the need for further research in the area of affective computing in order to get closer to human performance in judging emotions. Their future research in the area of video feature extraction will include the application of multi-camera input to be more robust to head rotations. The authors plan to combine the facial movements of the 2D camera sequences to predict 3D movement. Another possibility to increase recognition performance is to allow asynchrony between audio and video, e.g., by applying hybrid fusion techniques like asynchronous HMMs [69] or multi-dimensional dynamic time warping [48].

The following functionals are applied to frame-based video features: arithmetic mean (for delta coefficients: arithmetic mean of absolute values), standard deviation, 5% percentile, 95% percentile, and the range between the 5% and 95% percentiles.

In order to map the sequence of frame-based video features to a single vector describing the word-unit, statistical functionals are applied to the frame-based video features and their first order delta coefficients. 

Their acoustic feature extraction approach is based on a large set of low-level descriptors (LLDs) and their derivatives, combined with suitable statistical functionals to capture speech dynamics within a word.
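
To make the word-level functional mapping described in the three statements above concrete, here is a minimal NumPy sketch; it is an illustration, not the authors' openSMILE-based implementation, and the frame count and feature dimensionality are made up for the example.

```python
import numpy as np

def word_level_functionals(frames: np.ndarray) -> np.ndarray:
    """Map a (num_frames, num_features) sequence of frame-based features
    to a single word-level vector via statistical functionals: arithmetic
    mean (absolute mean for the deltas), standard deviation, 5% and 95%
    percentiles, and their range."""
    # First-order delta coefficients (frame-to-frame differences).
    deltas = np.diff(frames, axis=0)

    def functionals(x: np.ndarray, use_abs_mean: bool) -> np.ndarray:
        mean = np.mean(np.abs(x), axis=0) if use_abs_mean else np.mean(x, axis=0)
        std = np.std(x, axis=0)
        p5 = np.percentile(x, 5, axis=0)
        p95 = np.percentile(x, 95, axis=0)
        return np.concatenate([mean, std, p5, p95, p95 - p5])

    return np.concatenate([functionals(frames, use_abs_mean=False),
                           functionals(deltas, use_abs_mean=True)])

# Example: 30 video frames with 59-dimensional features each yield one
# fixed-length vector for the word unit.
word_vector = word_level_functionals(np.random.rand(30, 59))
print(word_vector.shape)  # (590,) = 59 features x 5 functionals x 2 streams
```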

The computation of the low-level features takes 50 ms per frame for a C++ implementation on a 2.4 GHz Intel i5 processor with 4 GB RAM. 

As the class distribution in the training set is relatively well balanced, the official challenge measure is weighted accuracy, i.e., the recognition rates of the individual classes weighted by the class distribution.
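
To illustrate the measure, the sketch below computes weighted accuracy as class-wise recall weighted by the class priors (which reduces to overall accuracy); the label arrays are invented for the example.

```python
import numpy as np

def weighted_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Recognition rates of the individual classes, weighted by the
    class distribution; equivalent to overall accuracy."""
    classes, counts = np.unique(y_true, return_counts=True)
    priors = counts / len(y_true)
    recalls = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    return float(np.sum(priors * recalls))

# Toy binary arousal labels (0 = low, 1 = high); values are illustrative.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
print(weighted_accuracy(y_true, y_pred))  # 0.75, same as plain accuracy
```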

Human emotions tend to evolve slowly over time which motivates the introduction of some form of context-sensitivity in emotion classification frameworks. 

Fewer functionals than for the audio features are used, to ensure a similar dimensionality of the video and audio feature vectors.

One approach towards reaching acceptable recognition performance even in challenging conditions is the modeling of contextual information. 

Among various classification frameworks that are able to exploit turn-level context, so-called Long Short-Term Memory (LSTM) networks [18] tend to be best suited for long-range context modeling in emotion recognition. 
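
As an illustration of such a network, the following is a minimal PyTorch sketch of an LSTM sequence classifier; it is not the authors' implementation, and the feature dimensionality (590 word-level features) and binary class count are assumptions for the example.

```python
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    """Classifies each word-level feature vector in a sequence while the
    recurrent memory cells carry long-range emotional context forward."""
    def __init__(self, num_features=590, hidden_size=128, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(num_features, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, sequence_length, num_features)
        outputs, _ = self.lstm(x)          # context-aware hidden states
        return self.classifier(outputs)    # per-word class logits

model = EmotionLSTM()
words = torch.randn(1, 25, 590)            # one dialogue turn of 25 words
logits = model(words)
print(logits.shape)                        # torch.Size([1, 25, 2])
```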

According to optimizations on the development set, the number of training epochs was 60 for networks classifying arousal and 30 for all other networks. 

Because uniform LBPs are employed instead of full LBPs and the LBP operator responses are aggregated in histograms taken over regions of the face, the dimensionality of the features is rather low (59 dimensions per image block).
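
A minimal sketch of this step using scikit-image is shown below; the 32x32 block size is an assumption for illustration.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def block_lbp_histogram(block: np.ndarray) -> np.ndarray:
    """59-bin histogram of uniform LBP codes for one gray-scale image
    block: 58 uniform patterns plus one bin for all non-uniform ones."""
    # 'nri_uniform' = non-rotation-invariant uniform patterns, computed
    # over 8 neighbours on a circle of radius 1, giving codes in [0, 58].
    codes = local_binary_pattern(block, P=8, R=1, method="nri_uniform")
    hist, _ = np.histogram(codes, bins=59, range=(0, 59))
    return hist / hist.sum()  # normalize so blocks of any size compare

# Example: one 32x32 block cut from a gray-scale face image (illustrative).
block = (np.random.rand(32, 32) * 255).astype(np.uint8)
print(block_lbp_histogram(block).shape)  # (59,)
```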

For arousal, the best WA of 68.5 % is obtained for acoustic features only, which is in line with previous studies showing that audio is the most important modality for assessing arousal [13].

For each pixel $I(x, y)$ in the current image, the probability of a facial pixel can be approximated by

$$P_f(x, y) = \frac{M(I_H(x, y), I_S(x, y), I_V(x, y))}{N}, \qquad (2)$$

with $N$ being the number of template pixels that have been used to create the histogram.
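
A minimal sketch of Eq. (2) using OpenCV and NumPy, assuming the histogram M is built from a patch of template skin pixels (the bin counts and inputs are placeholders):

```python
import numpy as np
import cv2

def skin_probability_map(image_bgr, template_bgr, bins=(32, 32, 32)):
    """Approximates Eq. (2): P_f(x, y) = M(I_H, I_S, I_V) / N, where M is
    an HSV histogram built from N template skin pixels."""
    template = cv2.cvtColor(template_bgr, cv2.COLOR_BGR2HSV).reshape(-1, 3)
    n = len(template)  # N: number of template pixels in the histogram
    # M: 3-D histogram over OpenCV's HSV ranges, H in [0,180), S,V in [0,256)
    m, _ = np.histogramdd(template.astype(float), bins=bins,
                          range=((0, 180), (0, 256), (0, 256)))
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).reshape(-1, 3).astype(int)
    # Map every pixel's (H, S, V) triple to its histogram bin and read M.
    h = np.minimum(hsv[:, 0] * bins[0] // 180, bins[0] - 1)
    s = np.minimum(hsv[:, 1] * bins[1] // 256, bins[1] - 1)
    v = np.minimum(hsv[:, 2] * bins[2] // 256, bins[2] - 1)
    return (m[h, s, v] / n).reshape(image_bgr.shape[:2])
```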

Their test set consists only of the sessions that are intended for this sub-challenge, meaning only 10 of the 32 test sessions.

The classification of expectation seems to benefit from including visual information, as the best WA (67.6 %) is reached for LSTM networks applying late fusion of audio and video modalities.

To obtain the best possible recognition performance, future studies should also investigate which feature-classifier combinations lead to the best results, e.g., by combining the proposed LSTM framework with other audio or video features proposed for the 2011 Audio/Visual Emotion Challenge.

Trending Questions (1)
What is LSTM?

LSTM stands for Long Short-Term Memory, which is a type of recurrent neural network (RNN) that can model long-range dependencies in sequential data.