Journal ArticleDOI

LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework

01 Feb 2013-Image and Vision Computing (Butterworth-Heinemann)-Vol. 31, Iss: 2, pp 153-163
TL;DR: Comparing the results with the recognition scores of all Audiovisual Sub-Challenge participants, it is found that the proposed LSTM-based technique leads to the best average recognition performance that has been reported for this task so far.
About: This article is published in Image and Vision Computing. The article was published on 2013-02-01 and is currently open access. It has received 281 citations to date. The article focuses on the topics: Affective computing & Context model.

Summary (3 min read)

1. Introduction

  • As speech recognition systems have matured over the last decades, automatic emotion recognition (AER) can be seen as going one step further in the design of natural, intuitive, and humanlike computer interfaces.
  • Currently, the authors are observing a shift from modeling prototypical emotional categories such as anger or happiness to describing human affect along continuous, orthogonal emotional dimensions such as arousal and valence.
  • Apart from preliminary experiments using facial marker information as additional input modality [13] and a recent study on subject dependent recognition of arousal and valence [27], LSTM architectures have hardly been applied for audiovisual emotion recognition.
  • The audio feature extraction front-end applied in their study is based on their open-source toolkit openSMILE [28] which is able to extract large sets of prosodic, spectral, and voice quality low-level descriptors (LLD) combined with various statistical functionals in real-time.

2. The SEMAINE Database

  • The freely available audiovisual SEMAINE corpus1 [14] was recorded to study natural social signals that occur in conversations between humans and artificially intelligent agents.
  • The scenario used during the creation of the database is called the Sensitive Artificial Listener (SAL).
  • Both the user and the operator were recorded from a frontal view by both a greyscale camera and a color camera.
  • As the number of character conversations varies between recordings, the number of sessions differs per set.
  • For the challenge, the originally continuous affective dimensions arousal, expectation, power, and valence were redefined as binary classification tasks by testing at every frame whether they are above or below average.
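
As an illustration of this binarization step, a minimal Python sketch is given below; it is not the challenge organizers' tooling, and the toy arousal trace is invented for the example:

```python
import numpy as np

def binarize_dimension(ratings):
    """Map a continuous affective dimension (one rating per frame)
    to binary high/low labels by thresholding at its average."""
    ratings = np.asarray(ratings, dtype=float)
    threshold = ratings.mean()                 # the "average" decision boundary
    return (ratings > threshold).astype(int)   # 1 = above average, 0 = below

# Toy arousal trace sampled frame by frame
arousal = [0.1, 0.4, -0.2, 0.6, 0.0]
print(binarize_dimension(arousal))             # -> [0 1 0 1 0]
```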

3.1. Audio Feature Extraction

  • The authors' acoustic feature extraction approach is based on a large set of low-level descriptors and derivatives of LLD combined with suited statistical functionals to capture speech dynamics within a word.
  • All features and functionals are computed using their on-line audio analysis toolkit openSMILE [28].
  • Details on the LLD and functionals are given in Tables 2 and 3, respectively.
  • One example of an LLD/functional combination that contains no information is ‘minimum pitch’, which is always zero.
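
The actual features are computed with openSMILE, whose configuration is not reproduced here; the NumPy sketch below only illustrates the general pattern of collapsing frame-wise LLDs into one word-level vector via statistical functionals (the functional subset shown is an assumption):

```python
import numpy as np

def apply_functionals(lld_frames):
    """lld_frames: (n_frames, n_lld) array with one LLD vector per frame.
    Returns a single fixed-length feature vector describing the word."""
    x = np.asarray(lld_frames, dtype=float)
    functionals = [
        x.mean(axis=0),                     # arithmetic mean
        x.std(axis=0),                      # standard deviation
        x.max(axis=0) - x.min(axis=0),      # range
        np.percentile(x, 95, axis=0),       # 95% percentile
        np.percentile(x, 5, axis=0),        # 5% percentile
    ]
    return np.concatenate(functionals)

word_llds = np.random.randn(80, 4)           # e.g. 80 frames of 4 toy LLDs
print(apply_functionals(word_llds).shape)    # -> (20,)
```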

3.2. Linguistic and Non-Linguistic Feature Extraction

  • Linguistic features are extracted using the SEMAINE 3.0 ASR system [4].
  • It applies openSMILE as front-end to extract 13 Mel-Frequency Cepstral Coefficients (MFCC) together with first and second order temporal derivatives every 10 ms (window size 25 ms).
  • All of the corpora used to train the ASR system contain spontaneous, conversational, and partly emotional speech.
  • The phoneme HMMs consist of three states with 16 Gaussian mixtures per state.
  • Typically, one (key)word is detected for every audio chunk (chunks correspond to single words); however, the recognizer is not restricted to detecting exactly one word, so insertions and deletions are possible.
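
The MFCC front-end described above is implemented with openSMILE in the paper; as a rough, illustrative stand-in, the same 13 MFCCs plus first and second order deltas (10 ms hop, 25 ms window) could be computed with librosa, assuming a 16 kHz mono recording named utterance.wav:

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

hop = int(0.010 * sr)    # 10 ms frame shift
win = int(0.025 * sr)    # 25 ms analysis window

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop, n_fft=win)
delta1 = librosa.feature.delta(mfcc)             # first order temporal derivatives
delta2 = librosa.feature.delta(mfcc, order=2)    # second order temporal derivatives

features = np.vstack([mfcc, delta1, delta2])     # 39 coefficients per frame
print(features.shape)                            # (39, n_frames)
```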

3.3. Visual Feature Extraction

  • Generally, a large variety of purely visual emotion recognition systems has been presented in recent years, including combinations of Local Binary Patterns and Support Vector Machines [41], methods based on deformed grids and SVMs [42], Haar-like features modeled via AdaBoost [43], approaches using Gabor filters and non-negative matrix factorization [44], and variable-intensity models [45].
  • Glodek et al. [35] use Gabor filters to extract video features.
  • Subsequently, the face is cut out and rotated so that it is upright, before the optical flow with respect to the previous frame is computed.
  • Compared to [29], their method is faster and also extracts head tilt in addition to facial movement features.
  • Furthermore, unlike the Audio/Visual Emotion Challenge baseline video feature extractor [6] which is based on dense local appearance descriptors, their approach does not rely on correct eye detection.
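
The crop-derotate-flow step described above can be sketched with OpenCV's dense Farnebäck optical flow; the paper does not specify this particular flow algorithm or these parameters, so the snippet is illustrative only:

```python
import cv2
import numpy as np

def face_motion(prev_face, curr_face):
    """prev_face, curr_face: cropped, upright-rotated grayscale face images
    of identical size. Returns the average horizontal/vertical displacement."""
    flow = cv2.calcOpticalFlowFarneback(prev_face, curr_face, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow[..., 0].mean(), flow[..., 1].mean()

# Toy example with two random "face" crops of the same size
prev = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
curr = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
print(face_motion(prev, curr))
```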

3.3.1. Baseline Video Feature Extractor

  • The baseline video feature extractor for the 2011 Audio/Visual Emotion Challenge [6] works as follows.
  • First, the face position is detected by a Viola-Jones face detector, which returns a square window containing the face.
  • Once the two eyes are detected, the image can be rotated by angle α so that the eyes lie on a horizontal line.
  • Uniform Local Binary Patterns (LBP) [55] are used as dense local appearance descriptors.
  • Consisting of eight binary comparisons per pixel, they are fast to compute.
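
For orientation, the baseline pipeline (Viola-Jones face detection, eye-based derotation, uniform LBP histograms over image blocks) can be approximated with OpenCV and scikit-image; the cascade file, block grid, and helper names below are assumptions rather than the challenge baseline code:

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def rotate_upright(image, left_eye, right_eye):
    """Rotate by angle alpha so the eye centers lie on a horizontal line."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    alpha = np.degrees(np.arctan2(ry - ly, rx - lx))
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    M = cv2.getRotationMatrix2D(center, alpha, 1.0)
    return cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))

def uniform_lbp_features(gray_face, grid=(4, 4)):
    """Uniform LBP histogram (59 bins) per image block, concatenated."""
    lbp = local_binary_pattern(gray_face, P=8, R=1, method="nri_uniform")
    h, w = gray_face.shape
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            block = lbp[i * h // grid[0]:(i + 1) * h // grid[0],
                        j * w // grid[1]:(j + 1) * w // grid[1]]
            hist, _ = np.histogram(block, bins=59, range=(0, 59), density=True)
            feats.append(hist)
    return np.concatenate(feats)

# Usage: faces = face_cascade.detectMultiScale(gray_frame, 1.1, 5); then crop the
# square face window, derotate it, and describe it with uniform_lbp_features.
```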

3.3.2. Proposed Visual Feature Extraction Method

  • In order to compute the visual low-level features applied in their proposed LSTM-based audiovisual emotion recognition framework, the authors go through the steps depicted in the block diagram in Figure 3.
  • Each of the three components of the HSV color model has 20 bins in the histogram.
  • Subsequently, the face is tracked with a camshift tracker [57] which takes the probability image as input.
  • The following functionals are applied to frame-based video features: arithmetic mean (for delta coefficients: arithmetic mean of absolute values), standard deviation, 5% percentile, 95% percentile, and range of 5% and 95% percentile.
  • In Figure 5, the shading of the facial regions indicates the importance of the features corresponding to the respective region.
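
The skin-probability and CAMShift tracking stage described above can be sketched with OpenCV; the joint 20x20x20-bin HSV histogram and the remaining parameters are assumptions for illustration:

```python
import cv2

def make_face_histogram(face_bgr):
    """Build an HSV histogram (20 bins per channel) from a template face patch."""
    hsv = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [20, 20, 20],
                        [0, 180, 0, 256, 0, 256])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    return hist

def camshift_step(frame_bgr, hist, track_window):
    """Back-project the histogram to a facial-probability image and run one
    CAMShift update; the rotated rectangle also carries the head tilt angle."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    prob = cv2.calcBackProject([hsv], [0, 1, 2], hist,
                               [0, 180, 0, 256, 0, 256], scale=1)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
    rot_rect, track_window = cv2.CamShift(prob, track_window, criteria)
    return rot_rect, track_window
```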

4. Classification

  • Widely used classifiers operating on static word- or turn-level feature vectors are, e. g., Support Vector Machines or Multilayer Perceptrons.
  • To exploit context between successive speech segments for improved audiovisual emotion recognition, this study considers recurrent neural network architectures which take into account past observations by cyclic connections in the network’s hidden layer.
  • Each memory block consists of one or more memory cells and multiplicative input, output, and forget gates.
  • The initial version of the LSTM architecture proposed in [18] contained only input and output gates to enable an architecture that can store and access activations via gate activations.
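
The authors' implementation is not reproduced here, but the overall idea of this section — a (B)LSTM that maps a sequence of word-level feature vectors to per-word high/low decisions — can be sketched in PyTorch; the feature and hidden-layer sizes are placeholders:

```python
import torch
import torch.nn as nn

class WordLevelLSTM(nn.Module):
    """Sequence labeller: one word-level feature vector in, one high/low logit pair out."""
    def __init__(self, n_features, n_hidden=120, n_classes=2, bidirectional=False):
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_hidden, batch_first=True,
                            bidirectional=bidirectional)
        out_dim = n_hidden * (2 if bidirectional else 1)
        self.out = nn.Linear(out_dim, n_classes)

    def forward(self, x):                 # x: (batch, n_words, n_features)
        h, _ = self.lstm(x)               # hidden state for every word in context
        return self.out(h)                # (batch, n_words, n_classes) logits

model = WordLevelLSTM(n_features=100)     # 100 is a placeholder feature dimension
logits = model(torch.randn(1, 25, 100))   # a toy sequence of 25 words
print(logits.shape)                       # torch.Size([1, 25, 2])
```

Setting bidirectional=True gives the BLSTM variant, which additionally exploits future context and is therefore not suited for fully incremental on-line recognition.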

5. Experiments and Results

  • All experiments are carried out on the Audiovisual Sub-Challenge task as described in Section 2.
  • The task is to discriminate between high and low arousal, expectation, power, and valence.
  • As the class distribution in the training set is relatively well balanced, the official challenge measure is weighted accuracy, i. e., the recognition rates of the individual classes weighted by the class distribution.
  • This relatively balanced class distribution holds in particular for the Audio and Audiovisual Sub-Challenges, as they consider word-level modeling rather than frame-based recognition.
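
To make the evaluation measure concrete, the following sketch computes weighted accuracy (per-class recognition rates weighted by the class distribution, i.e., overall accuracy) and unweighted accuracy (the plain mean of per-class recalls); the toy labels are invented:

```python
import numpy as np

def weighted_unweighted_accuracy(y_true, y_pred):
    """Return (WA, UA) for a binary or multi-class labelling."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    priors = np.array([np.mean(y_true == c) for c in classes])
    wa = float(np.sum(priors * recalls))   # equals np.mean(y_true == y_pred)
    ua = float(np.mean(recalls))
    return wa, ua

print(weighted_unweighted_accuracy([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))  # (0.6, ~0.58)
```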

5.2. Experimental Settings

  • The authors investigate the performance of both bidirectional LSTMs and unidirectional LSTM networks for fully incremental on-line audiovisual affect recognition.
  • The number of input nodes corresponds to the number of different features per speech segment and the number of output nodes corresponds to the number of target classes, i. e., the authors used two output nodes representing high and low arousal, expectation, power, and valence, respectively.
  • All networks were trained using a learning rate of 10⁻⁵.
  • To validate whether better recognition performance can be obtained when changing the number of memory blocks, the authors evaluated hidden layer sizes between 80 and 160 memory blocks on the development set.
  • The resulting number of variables that need to be estimated during network training is equivalent to the number of weights in the network, e. g., an LSTM network that processes the full feature set consisting of acoustic, linguistic, and video information has 2 094 210 weights.
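
A hedged sketch of such a training configuration in PyTorch is shown below; the learning rate of 10⁻⁵, the two output nodes, and the hidden-layer range follow the text, while the optimizer choice, loss, and toy data are assumptions (the authors trained with their own toolkit):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=120, batch_first=True)  # 120 blocks, within 80-160
readout = nn.Linear(120, 2)                                        # two output nodes (high/low)
params = list(lstm.parameters()) + list(readout.parameters())
optimizer = torch.optim.SGD(params, lr=1e-5)                       # learning rate 10^-5
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(8, 25, 100)     # 8 toy sequences of 25 word-level vectors
labels = torch.randint(0, 2, (8, 25))  # high/low target per word

for epoch in range(60):                # 60 epochs for arousal, 30 for the other dimensions
    optimizer.zero_grad()
    hidden, _ = lstm(features)
    logits = readout(hidden)
    loss = loss_fn(logits.reshape(-1, 2), labels.reshape(-1))
    loss.backward()
    optimizer.step()

print(sum(p.numel() for p in params))  # number of trainable weights in this toy network
```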

5.3. Results and Discussion

  • Table 5 shows both weighted accuracies (WA) and unweighted accuracies (UA) obtained when training on the training set of the 2011 Audio/Visual Emotion Challenge and testing on the development set.
  • The performance of the different feature groups (acoustic, linguistic, video) heavily depends on the considered emotional dimension.
  • The classification of expectation seems to benefit from including visual information as the best WA (67.6 %) is reached for LSTM networks applying late fusion of audio and video modalities.
  • For recognition based on video only, CFS leads to a remarkable performance gain, increasing the average WA from 60.4 % to 65.8 % for unidirectional LSTM networks.
  • Networks were trained on the training and development set.
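
The late fusion of audio and video modalities mentioned above can be illustrated by combining the class posteriors of the unimodal networks per word; the equal weighting below is an assumption, not the fusion rule reported in the paper:

```python
import numpy as np

def late_fusion(audio_posteriors, video_posteriors, w_audio=0.5):
    """Weighted sum of per-word class posteriors from two unimodal classifiers,
    followed by an argmax decision (0 = low, 1 = high)."""
    fused = (w_audio * np.asarray(audio_posteriors)
             + (1.0 - w_audio) * np.asarray(video_posteriors))
    return fused.argmax(axis=-1)

audio = [[0.7, 0.3], [0.4, 0.6]]   # toy posteriors for two words
video = [[0.2, 0.8], [0.3, 0.7]]
print(late_fusion(audio, video))   # -> [1 1]
```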


Citations
Proceedings ArticleDOI
07 Jun 2015
TL;DR: To model the video as an ordered sequence of frames, the authors propose a recurrent neural network that uses Long Short-Term Memory (LSTM) cells connected to the output of the underlying CNN.
Abstract: Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.6% vs. 88.0%) and without additional optical flow information (82.6% vs. 73.0%).

2,066 citations

Journal ArticleDOI
TL;DR: This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.
Abstract: Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.

1,945 citations

Journal ArticleDOI
TL;DR: This first of its kind, comprehensive literature review of the diverse field of affective computing focuses mainly on the use of audio, visual and text information for multimodal affect analysis, and outlines existing methods for fusing information from different modalities.

969 citations

Posted Content
TL;DR: This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.
Abstract: Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.6% vs. 88.0%) and without additional optical flow information (82.6% vs. 72.8%).

496 citations


Cites background from "LSTM-Modeling of continuous emotion..."

  • ...For this reason, LSTMs yield state-of-the-art results in handwriting recognition [8, 10], speech recognition [9, 7], phoneme detection [5], emotion detection [25], segmentation of meetings and events [18], and evaluating programs [27]....


References
Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations


"LSTM-Modeling of continuous emotion..." refers background or methods in this paper

  • ...The initial version of the LSTM architecture proposed in [18] contained only input and output gates to enable an architecture that can store and access activations via gate activations....


  • ...It is based on the Long Short-Term Memory principle originally introduced in [18] and improved in [61]....


  • ...Long Short-Term Memory (LSTM) networks [18] tend to be best suited for long-range context modeling in emotion recognition....


Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. *Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects *Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods *Includes downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks-in an updated, interactive interface. Algorithms in toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization

20,196 citations


"LSTM-Modeling of continuous emotion..." refers methods in this paper

  • ...Networks processing video data only are based on a video feature set reduced via CFS, whereas for all other networks, we did not apply CFS....


  • ...However, for recognition based on video only, CFS leads to a remarkable performance gain, increasing the average WA from 60.4 % to 65.8 % for unidirectional LSTM networks....


  • ...To investigate whether a smaller feature space leads to better recognition performance, we repeated all evaluations on the development set applying a Correlation based Feature Subset Selection (CFS) [66] for each modality combination....


  • ...For most settings, CFS does not significantly improve the average weighted accuracy....


Journal ArticleDOI
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on Source-Forge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

19,603 citations


"LSTM-Modeling of continuous emotion..." refers methods in this paper

  • ...Importance was evaluated employing the ranking-based information gain attribute evaluation algorithm implemented in the Weka toolkit [58]....


  • ...Figure 5: Importance of facial regions for video feature extraction according to the ranking-based information gain attribute evaluation algorithm implemented in the Weka toolkit [58]....


Journal ArticleDOI
TL;DR: A generalized gray-scale and rotation invariant operator presentation that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution and presents a method for combining multiple operators for multiresolution analysis.
Abstract: Presents a theoretically very simple, yet efficient, multiresolution approach to gray-scale and rotation invariant texture classification based on local binary patterns and nonparametric discrimination of sample and prototype distributions. The method is based on recognizing that certain local binary patterns, termed "uniform," are fundamental properties of local image texture and their occurrence histogram is proven to be a very powerful texture feature. We derive a generalized gray-scale and rotation invariant operator presentation that allows for detecting the "uniform" patterns for any quantization of the angular space and for any spatial resolution and presents a method for combining multiple operators for multiresolution analysis. The proposed approach is very robust in terms of gray-scale variations since the operator is, by definition, invariant against any monotonic transformation of the gray scale. Another advantage is computational simplicity as the operator can be realized with a few operations in a small neighborhood and a lookup table. Experimental results demonstrate that good discrimination can be achieved with the occurrence statistics of simple rotation invariant local binary patterns.

14,245 citations


"LSTM-Modeling of continuous emotion..." refers methods in this paper

  • ...Uniform Local Binary Patterns (LBP) [55] are used as dense local appearance descriptors....


  • ...By employing uniform LBPs instead of full LBPs and aggregating the LBP operator responses in histograms taken over regions of the face, the dimensionality of the features is rather low (59 dimensions per image block)....


Journal ArticleDOI
TL;DR: In this paper, a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates is described, running at about 15 frames per second on a conventional desktop.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

13,037 citations

Frequently Asked Questions (18)
Q1. What are the contributions mentioned in the paper "LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework"?

This article presents their recent advances in assessing dimensional representations of emotion, such as arousal, expectation, power, and valence, in an audiovisual human-computer interaction scenario. Building on previous studies which demonstrate that long-range context modeling tends to increase accuracies of emotion recognition, the authors propose a fully automatic audiovisual recognition approach based on Long Short-Term Memory ( LSTM ) modeling of word-level audio and video features. The authors apply the same acoustic features as used in the challenge baseline system whereas visual features are computed via a novel facial movement feature extractor. Comparing their results with the recognition scores of all Audiovisual Sub-Challenge participants, the authors find that the proposed LSTM-based technique leads to the best average recognition performance that has been reported for this task so far. 

However, the considered scenario reflects realistic conditions in natural interactions and thus highlights the need for further research in the area of affective computing in order to get closer to the human performance in judging emotions. Their future research in the area of video feature extraction will include the application of multi-camera input to be more robust to head rotations. The authors plan to combine the facial movements of the 2D camera sequences to predict 3D movement. Another possibility to increase recognition performance is to allow asynchrony between audio and video, e.g., by applying hybrid fusion techniques like asynchronous HMMs [69] or multi-dimensional dynamic time warping [48].

The following functionals are applied to frame-based video features: arithmetic mean (for delta coefficients: arithmetic mean of absolute values), standard deviation, 5% percentile, 95% percentile, and range of 5% and 95% percentile. 

In order to map the sequence of frame-based video features to a single vector describing the word-unit, statistical functionals are applied to the frame-based video features and their first order delta coefficients. 

Their acoustic feature extraction approach is based on a large set of low-level descriptors and derivatives of LLD combined with suited statistical functionals to capture speech dynamics within a word. 

The computation of the low-level features takes 50 ms per frame for a C++ implementation on a 2.4 GHz Intel i5 processor with 4 GB RAM. 

As the class distribution in the training set is relatively well balanced, the official challenge measure is weighted accuracy, i. e., the recognition rates of the individual classes weighted by the class distribution. 

Human emotions tend to evolve slowly over time which motivates the introduction of some form of context-sensitivity in emotion classification frameworks. 

Fewer functionals than for the audio features are used to ensure a similar dimensionality of the video feature vector and the audio feature vector.

One approach towards reaching acceptable recognition performance even in challenging conditions is the modeling of contextual information. 

Among various classification frameworks that are able to exploit turn-level context, so-called Long Short-Term Memory (LSTM) networks [18] tend to be best suited for long-range context modeling in emotion recognition. 

According to optimizations on the development set, the number of training epochs was 60 for networks classifying arousal and 30 for all other networks. 

By employing uniform LBPs instead of full LBPs and aggregating the LBP operator responses in histograms taken over regions of the face, the dimensionality of the features is rather low (59 dimensions per image block). 

For arousal, the best WA of 68.5 % is obtained for acoustic features only, which is in line with previous studies showing that audio is the most important modality for assessing arousal [13].

For each pixel I(x, y) in the current image, the probability of a facial pixel can be approximated by P_f(x, y) = M(I_H(x, y), I_S(x, y), I_V(x, y)) / N (2), with N being the number of template pixels that have been used to create the histogram.

Their test set consists only of the sessions that are intended for this sub-challenge, meaning only 10 out of the 32 test sessions.

The classification of expectation seems to benefit from including visual information, as the best WA (67.6 %) is reached for LSTM networks applying late fusion of audio and video modalities.

To obtain the best possible recognition performance, future studies should also investigate which feature-classifier combinations lead to the best results, e. g., by combining the proposed LSTM framework with other audio or video features proposed for the 2011 Audio/Visual Emotion Challenge. 

Trending Questions (1)
What is LSTM?

LSTM stands for Long Short-Term Memory, which is a type of recurrent neural network (RNN) that can model long-range dependencies in sequential data.