Open Access Journal ArticleDOI

LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework

TLDR
Comparing the results with the recognition scores of all Audiovisual Sub-Challenge participants, it is found that the proposed LSTM-based technique leads to the best average recognition performance that has been reported for this task so far.
About
This article was published in Image and Vision Computing on 2013-02-01 and is currently open access. It has received 281 citations to date. The article focuses on the topics: Affective computing & Context model.


Citations
Journal ArticleDOI

Deep learning architecture using rough sets and rough neural networks

Yasser F. Hassan
03 Apr 2017
TL;DR: The objective of this work is to propose a model for deep rough set theory that uses more than one decision table and approximates these tables to a classification system, i.e., a novel framework of deep learning based on multi-decision tables.
Proceedings ArticleDOI

Ballistocardiogram Based Person Identification and Authentication Using Recurrent Neural Networks

TL;DR: The results show that BCG carries individual information and can be used as a biometric for person identification and authentication, and that the fusion of multiple cardiac signals, such as a BCG-ECG combination, may have great potential for person identification.
Proceedings Article

Towards Generic Models of Player Experience

TL;DR: This paper proposes generic models of user experience in the computer games domain, investigates the modelling mechanism's ability to generalise over the two datasets, and examines whether generic features of player behaviour can be defined and used to boost the modelling performance.
Proceedings ArticleDOI

Improved Active Speaker Detection Based on Optical Flow

TL;DR: A robust active speaker detection model is proposed by incorporating dense optical flow to strengthen the visual representation of facial motion; the results demonstrate that optical flow can improve the performance of neural networks when combined with raw pixels and the audio signal.
Proceedings ArticleDOI

Deep neural networks for anger detection from real life speech data

TL;DR: This paper extensively evaluates the deep networks on a large real-life speech corpus of 26,970 utterances with utterance-level labels collected from a German voice portal, finding that the proposed neural networks significantly outperform traditional modelling algorithms for speech anger detection.
References
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Book

Data Mining: Practical Machine Learning Tools and Techniques

TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Journal ArticleDOI

The WEKA data mining software: an update

TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Journal ArticleDOI

Multiresolution gray-scale and rotation invariant texture classification with local binary patterns

TL;DR: A generalized gray-scale and rotation invariant operator presentation that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution and presents a method for combining multiple operators for multiresolution analysis.
Journal ArticleDOI

Robust Real-Time Face Detection

TL;DR: In this paper, a face detection framework capable of processing images extremely rapidly while achieving high detection rates is described; the detector runs at roughly 15 frames per second on conventional desktop hardware.
Frequently Asked Questions (18)
Q1. What are the contributions mentioned in the paper "Lstm-modeling of continuous emotions in an audiovisual affect recognition framework" ?

This article presents their recent advances in assessing dimensional representations of emotion, such as arousal, expectation, power, and valence, in an audiovisual human-computer interaction scenario. Building on previous studies which demonstrate that long-range context modeling tends to increase accuracies of emotion recognition, the authors propose a fully automatic audiovisual recognition approach based on Long Short-Term Memory (LSTM) modeling of word-level audio and video features. The authors apply the same acoustic features as used in the challenge baseline system, whereas visual features are computed via a novel facial movement feature extractor. Comparing their results with the recognition scores of all Audiovisual Sub-Challenge participants, the authors find that the proposed LSTM-based technique leads to the best average recognition performance that has been reported for this task so far.

However, the considered scenario reflects realistic conditions in natural interactions and thus highlights the need for further research in the area of affective computing in order to get closer to human performance in judging emotions. Their future research in the area of video feature extraction will include the application of multi-camera input to be more robust to head rotations. The authors plan to combine the facial movements of the 2D camera sequences to predict 3D movement. Another possibility to increase recognition performance is to allow asynchrony between audio and video, e.g., by applying hybrid fusion techniques like asynchronous HMMs [69] or multi-dimensional dynamic time warping [48].

The following functionals are applied to frame-based video features: arithmetic mean (for delta coefficients: arithmetic mean of absolute values), standard deviation, 5% percentile, 95% percentile, and the range between the 5% and 95% percentiles.

In order to map the sequence of frame-based video features to a single vector describing the word-unit, statistical functionals are applied to the frame-based video features and their first order delta coefficients. 

Their acoustic feature extraction approach is based on a large set of low-level descriptors (LLDs) and their derivatives, combined with suitable statistical functionals to capture speech dynamics within a word.
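
To make the word-level functional mapping described in the three statements above concrete, here is a minimal NumPy sketch; it is an illustration, not the authors' openSMILE-based implementation, and the frame count and feature dimensionality are made up for the example.

```python
import numpy as np

def word_level_functionals(frames: np.ndarray) -> np.ndarray:
    """Map a (num_frames, num_features) sequence of frame-based features
    to a single word-level vector via statistical functionals: arithmetic
    mean (absolute mean for the deltas), standard deviation, 5% and 95%
    percentiles, and their range."""
    # First-order delta coefficients (frame-to-frame differences).
    deltas = np.diff(frames, axis=0)

    def functionals(x: np.ndarray, use_abs_mean: bool) -> np.ndarray:
        mean = np.mean(np.abs(x), axis=0) if use_abs_mean else np.mean(x, axis=0)
        std = np.std(x, axis=0)
        p5 = np.percentile(x, 5, axis=0)
        p95 = np.percentile(x, 95, axis=0)
        return np.concatenate([mean, std, p5, p95, p95 - p5])

    return np.concatenate([functionals(frames, use_abs_mean=False),
                           functionals(deltas, use_abs_mean=True)])

# Example: 30 video frames with 59-dimensional features each yield one
# fixed-length vector for the word unit.
word_vector = word_level_functionals(np.random.rand(30, 59))
print(word_vector.shape)  # (590,) = 59 features x 5 functionals x 2 streams
```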

The computation of the low-level features takes 50 ms per frame for a C++ implementation on a 2.4 GHz Intel i5 processor with 4 GB RAM. 

As the class distribution in the training set is relatively well balanced, the official challenge measure is weighted accuracy, i.e., the recognition rates of the individual classes weighted by the class distribution.
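
To illustrate the measure, the sketch below computes weighted accuracy as class-wise recall weighted by the class priors (which reduces to overall accuracy); the label arrays are invented for the example.

```python
import numpy as np

def weighted_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Recognition rates of the individual classes, weighted by the
    class distribution; equivalent to overall accuracy."""
    classes, counts = np.unique(y_true, return_counts=True)
    priors = counts / len(y_true)
    recalls = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    return float(np.sum(priors * recalls))

# Toy binary arousal labels (0 = low, 1 = high); values are illustrative.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
print(weighted_accuracy(y_true, y_pred))  # 0.75, same as plain accuracy
```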

Human emotions tend to evolve slowly over time which motivates the introduction of some form of context-sensitivity in emotion classification frameworks. 

Fewer functionals than for the audio features are used, to ensure a similar dimensionality of the video and audio feature vectors.

One approach towards reaching acceptable recognition performance even in challenging conditions is the modeling of contextual information. 

Among various classification frameworks that are able to exploit turn-level context, so-called Long Short-Term Memory (LSTM) networks [18] tend to be best suited for long-range context modeling in emotion recognition. 
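
As an illustration of such a network, the following is a minimal PyTorch sketch of an LSTM sequence classifier; it is not the authors' implementation, and the feature dimensionality (590 word-level features) and binary class count are assumptions for the example.

```python
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    """Classifies each word-level feature vector in a sequence while the
    recurrent memory cells carry long-range emotional context forward."""
    def __init__(self, num_features=590, hidden_size=128, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(num_features, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, sequence_length, num_features)
        outputs, _ = self.lstm(x)          # context-aware hidden states
        return self.classifier(outputs)    # per-word class logits

model = EmotionLSTM()
words = torch.randn(1, 25, 590)            # one dialogue turn of 25 words
logits = model(words)
print(logits.shape)                        # torch.Size([1, 25, 2])
```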

According to optimizations on the development set, the number of training epochs was 60 for networks classifying arousal and 30 for all other networks. 

Because uniform LBPs are employed instead of full LBPs and the LBP operator responses are aggregated in histograms taken over regions of the face, the dimensionality of the features is rather low (59 dimensions per image block).
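
A minimal sketch of this step using scikit-image is shown below; the 32x32 block size is an assumption for illustration.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def block_lbp_histogram(block: np.ndarray) -> np.ndarray:
    """59-bin histogram of uniform LBP codes for one gray-scale image
    block: 58 uniform patterns plus one bin for all non-uniform ones."""
    # 'nri_uniform' = non-rotation-invariant uniform patterns, computed
    # over 8 neighbours on a circle of radius 1, giving codes in [0, 58].
    codes = local_binary_pattern(block, P=8, R=1, method="nri_uniform")
    hist, _ = np.histogram(codes, bins=59, range=(0, 59))
    return hist / hist.sum()  # normalize so blocks of any size compare

# Example: one 32x32 block cut from a gray-scale face image (illustrative).
block = (np.random.rand(32, 32) * 255).astype(np.uint8)
print(block_lbp_histogram(block).shape)  # (59,)
```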

For arousal, the best WA of 68.5 % is obtained for acoustic features only, which is in line with previous studies showing that audio is the most important modality for assessing arousal [13].

For each pixel $I(x, y)$ in the current image, the probability of a facial pixel can be approximated by

$$P_f(x, y) = \frac{M(I_H(x, y), I_S(x, y), I_V(x, y))}{N}, \qquad (2)$$

with $N$ being the number of template pixels that have been used to create the histogram.
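
A minimal sketch of Eq. (2) using OpenCV and NumPy, assuming the histogram M is built from a patch of template skin pixels (the bin counts and inputs are placeholders):

```python
import numpy as np
import cv2

def skin_probability_map(image_bgr, template_bgr, bins=(32, 32, 32)):
    """Approximates Eq. (2): P_f(x, y) = M(I_H, I_S, I_V) / N, where M is
    an HSV histogram built from N template skin pixels."""
    template = cv2.cvtColor(template_bgr, cv2.COLOR_BGR2HSV).reshape(-1, 3)
    n = len(template)  # N: number of template pixels in the histogram
    # M: 3-D histogram over OpenCV's HSV ranges, H in [0,180), S,V in [0,256)
    m, _ = np.histogramdd(template.astype(float), bins=bins,
                          range=((0, 180), (0, 256), (0, 256)))
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).reshape(-1, 3).astype(int)
    # Map every pixel's (H, S, V) triple to its histogram bin and read M.
    h = np.minimum(hsv[:, 0] * bins[0] // 180, bins[0] - 1)
    s = np.minimum(hsv[:, 1] * bins[1] // 256, bins[1] - 1)
    v = np.minimum(hsv[:, 2] * bins[2] // 256, bins[2] - 1)
    return (m[h, s, v] / n).reshape(image_bgr.shape[:2])
```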

Their test set consists only of the sessions that are intended for this sub-challenge, meaning only 10 of the 32 test sessions.

The classification of expectation seems to benefit from including visual information, as the best WA (67.6 %) is reached for LSTM networks applying late fusion of audio and video modalities.

To obtain the best possible recognition performance, future studies should also investigate which feature-classifier combinations lead to the best results, e.g., by combining the proposed LSTM framework with other audio or video features proposed for the 2011 Audio/Visual Emotion Challenge.

Trending Questions (1)
What is LSTM?

LSTM stands for Long Short-Term Memory, which is a type of recurrent neural network (RNN) that can model long-range dependencies in sequential data.