Journal ArticleDOI

A review of affective computing

TL;DR: This first-of-its-kind, comprehensive literature review of the diverse field of affective computing focuses mainly on the use of audio, visual and text information for multimodal affect analysis, and outlines existing methods for fusing information from different modalities.
About: This article is published in Information Fusion. The article was published on 2017-09-01 and is currently open access. It has received 969 citations to date. The article focuses on the topics: Affective computing & Modality (human–computer interaction).

Summary

1. Introduction

  • Affective computing is an emerging field of research that aims to enable intelligent systems to recognize, feel, infer and interpret human emotions.
  • Video opinions, on the other hand, provide multimodal data in terms of vocal and visual modality.
  • The vocal modulations of opinions and facial expressions in the visual data, along with textual data, can provide important cues to better identify true affective states of the opinion holder.
  • To date, most of the research work in this field has focused on multimodal emotion recognition using visual and aural information.
  • By combining these modalities, a multimodal system can compensate for incomplete information that would otherwise hinder the decision process.

2. Affective Computing

  • Before discussing the literature on unimodal and multimodal approaches to affect recognition, the authors introduce the notion of ‘affect’ and ‘affect taxonomy’.
  • While sentiment has a fixed taxonomy bounded by positive, negative and neutral classes, the taxonomy for emotions is diverse.
  • Instead, it has been argued that emotions are primarily social constructs; hence, a social level of analysis is necessary to truly understand the nature of emotions.
  • Such a categorical taxonomy usually does not enable operations between emotion categories, e.g., for studying compound emotions.

3. Available Datasets

  • As the primary purpose of this paper is to inform readers on recent advances in multimodal affect recognition, in this section the authors describe widely-used datasets for multimodal emotion recognition and sentiment analysis.
  • The authors do not cover unimodal datasets, for example facial expression recognition from image datasets (e.g., CK+), as they are outside the scope of the paper.
  • To curate the latter, subjects were provided affect-related scripts and asked to act.
  • It is observed that such datasets can suffer from inaccurate actions by subjects, leading to corrupted samples or inaccurate information for the training dataset.
  • An utterance-level labeling scheme is particularly important for tracking the emotion and sentiment dynamics of the subject’s mindset in a video.

3.1. Datasets for Multimodal Sentiment Analysis

  • Available datasets for multimodal sentiment analysis have mostly been collected from product reviews available on different online video sharing platforms, e.g., YouTube.
  • The first publicly available dataset for tri-modal sentiment analysis was developed by combining the visual, audio and textual modalities.
  • The 47 videos in the dataset were further annotated with one of three sentiment labels: positive, negative or neutral.
  • This annotation task led to 13 positively, 12 negatively and 22 neutrally labeled videos.
  • For qualitative and statistical analysis of the dataset, the authors used polarized words in text, ‘smile’ and ‘look away’ in visual, and pauses and pitch in aural modality, as the main features.

3.2. Datasets for Multimodal Emotion Recognition

  • The authors describe the datasets currently available for multimodal emotion recognition below.
  • The database provides naturalistic clips of pervasive emotions across multiple modalities, together with the labels that best describe them.
  • Activation and evaluation are dimensions that are known to discriminate effectively between emotional states.
  • The characters are Prudence, who is even-tempered and sensible; Poppy, who is happy and outgoing; Spike, who is angry and confrontational; and Obadiah, who is sad and depressive.
  • The recorded dyadic sessions were later manually segmented at utterance level (defined as continuous segments when one of the actors was actively speaking).

4. Unimodal Features for Affect Recognition

  • Unimodal systems act as the primary building blocks for a well-performing multimodal framework.
  • The authors describe the literature on unimodal affect analysis, focusing primarily on the visual, audio and textual modalities.
  • The following section focuses on multimodal fusion.
  • This particularly benefits readers, who can refer to this section for the unimodal affect analysis literature, while the following section explains how to integrate the outputs of unimodal systems, with the final goal of developing a multimodal affect analysis framework.

4.1. Visual Modality

  • Facial expressions are primary cues for understanding emotions and sentiments.
  • Across the ages of people involved, and the nature of conversations, facial expressions are the primary channel for forming an impression of the subject’s present state of mind.
  • The authors present various studies on the use of visual features for multimodal affect analysis.

4.1.1. Facial Action Coding System

  • These resources use combinations of facial action units (AUs) to specify emotions (a small lookup sketch follows this list).
  • Transient features are observed only at the time of facial expressions, such as contraction of the corrugator muscle that produces vertical furrows between the eyebrows.
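As an illustration of how AU combinations map to emotions, the following minimal Python sketch checks detected AUs against prototypical combinations. The AU sets used here (e.g., happiness as AU6 + AU12) follow commonly cited EMFACS-style prototypes and are illustrative assumptions rather than a table taken from the reviewed paper.

```python
# Minimal sketch: matching detected Action Units (AUs) against
# commonly cited prototypical AU combinations (illustrative values,
# not reproduced from the reviewed paper).

PROTOTYPICAL_AUS = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
    "anger":     {4, 5, 7, 23},  # brow lowerer + lid/lip tighteners
}

def match_emotions(detected_aus):
    """Return emotions whose prototypical AU set is fully present."""
    detected = set(detected_aus)
    return [emotion for emotion, aus in PROTOTYPICAL_AUS.items()
            if aus.issubset(detected)]

if __name__ == "__main__":
    print(match_emotions([6, 12, 25]))  # -> ['happiness']
```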

4.1.2. Main Facial Expression Recognition Techniques

  • These methods are also termed differential methods, as they are derived from a Taylor-series approximation (see the optical-flow sketch after this list).
  • These models are used mainly to enhance automatic analysis of images under noisy or cluttered environments.
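A minimal sketch of such a differential (optical-flow) computation, assuming OpenCV is installed; the video path is a placeholder, and the reviewed systems may use different flow formulations.

```python
# Minimal sketch: dense optical flow between two consecutive frames,
# a typical differential motion cue for facial-expression videos.
import cv2
import numpy as np

cap = cv2.VideoCapture("face_clip.mp4")   # placeholder path
ok1, frame1 = cap.read()
ok2, frame2 = cap.read()
cap.release()

if ok1 and ok2:
    gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)
    # Farneback dense flow: one 2-D displacement vector per pixel.
    # Positional args: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(gray1, gray2, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Simple motion descriptors that could feed an affect classifier.
    print("mean motion magnitude:", float(np.mean(magnitude)))
```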

4.1.3. Extracting Temporal Features from Videos

  • Most of those methods do not work well for videos, as they do not model temporal information.
  • An important facet in video-based methods is maintaining accurate tracking throughout the video sequence.

4.1.4. Body Gestures

  • Though most research works have concentrated on facial feature extraction for emotion and sentiment analysis, there are some contributions based on features extracted from body gestures.
  • Research in psychology suggests that body gestures provide a significant source of features for emotion and sentiment recognition.
  • Some of the extracted motion cues to understand the subject’s temporal profile included: initial and final slope of the main peak, ratio between the maximum value and the duration of the main peak, ratio between the absolute maximum and the biggest following relative maximum, centroid of the energy, symmetry index, shift index of the main peak, and number of peaks.

4.1.5. New Era: Deep Learning to Extract Visual Features

  • In the last two sections, the authors described the use of handcrafted feature extraction from a visual modality and mathematical models for facial expression analysis.
  • Inspired by the recent success of deep learning, emotion and sentiment analysis tasks have also been enhanced by the adoption of deep learning algorithms, e.g., convolutional neural networks (CNNs); a minimal feature-extraction sketch follows this list.
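A minimal sketch of CNN-based visual feature extraction, assuming PyTorch and torchvision are available; the network, weights and image path are placeholders, not the specific models used in the reviewed works.

```python
# Minimal sketch: using a pretrained CNN as a frame-level feature extractor.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet-18 as a stand-in backbone (torchvision >= 0.13 weights API).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()      # drop the classifier head, keep 512-d features
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("face_frame.jpg").convert("RGB")   # placeholder frame
with torch.no_grad():
    features = model(preprocess(img).unsqueeze(0))  # shape: (1, 512)
print(features.shape)
```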

4.2. Audio Modality

  • Similar to text and visual feature analysis, emotion and sentiment analysis through audio features has specific components.
  • Other features that have been used by some researchers include formants, Mel-frequency cepstral coefficients (MFCCs), pauses, Teager-energy-operator-based features, log frequency power coefficients (LFPC) and linear prediction cepstral coefficients (LPCC); see the acoustic-feature sketch after this list.
  • This feature (spectral flux) is usually calculated as the Euclidean distance between two successive normalized spectra.
  • Discrete feeling states are defined as emotions that are spontaneous, uncontrollable or, in other words, universal emotions.
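A minimal sketch of frame-level acoustic feature extraction, assuming the librosa library; the audio path and parameter values are placeholders rather than the exact settings used in the reviewed studies.

```python
# Minimal sketch: acoustic features commonly used for affect recognition.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)          # placeholder audio file

# 13 MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Frame-level pitch (fundamental frequency) via the YIN algorithm.
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)

# Spectral flux: Euclidean distance between consecutive normalized spectra.
S = np.abs(librosa.stft(y))
S = S / (np.linalg.norm(S, axis=0, keepdims=True) + 1e-8)
flux = np.linalg.norm(np.diff(S, axis=1), axis=0)

print(mfcc.shape, f0.shape, flux.shape)
```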

4.2.1. Local Features vs. Global Features

  • Audio features for affect classification are likewise divided into local features and global features.
  • The common approach to analyze audio modality is to segment each audio/utterance into either overlapped or non-overlapped segments and examine them.
  • Global features are the most commonly used features in the literature.
  • There are some drawbacks to computing global features, as some of them are only useful for detecting high-arousal affective states, e.g., anger and disgust.
  • Global features also lack temporal information and the dependence between two segments in an utterance (a sketch contrasting frame-level and global statistics follows this list).
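The distinction can be sketched as follows: local features are computed per frame or segment, while global features are statistics over the whole utterance, which discards ordering. The array shapes below are assumptions for illustration.

```python
# Minimal sketch: collapsing frame-level (local) features into a single
# utterance-level (global) descriptor; temporal order is lost by design.
import numpy as np

def global_statistics(frame_features):
    """frame_features: assumed array of shape (n_frames, n_dims)."""
    return np.concatenate([
        frame_features.mean(axis=0),
        frame_features.std(axis=0),
        frame_features.min(axis=0),
        frame_features.max(axis=0),
    ])

frames = np.random.randn(120, 13)          # e.g., 120 frames of 13 MFCCs
print(global_statistics(frames).shape)     # -> (52,)
```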

4.2.3. Audio Features Extraction Using Deep Networks

  • As for computer vision, deep learning is also gaining increasing attention in audio classification research.
  • The authors trained a CNN on the features extracted from all time frames.
  • These types of models are usually incapable of modeling temporal information.
  • A possible research question is whether deep networks can likewise be used for automatic feature extraction from aural data.

4.3. Textual Modality

  • The authors present the state of the art of both emotion recognition and sentiment analysis from text.
  • While initially the use of knowledge bases was more popular for the identification of emotions and polarity in text, sentiment analysis researchers have recently been making increasing use of statistics-based approaches, with a special focus on supervised statistical methods.

4.3.1. Single- vs. Cross-domain

  • The authors first incorporated domain-independent words to aid the clustering process and then exploited the resulting clusters to reduce the gap between the domain-specific words of the two domains.
  • Sentiment sensitivity was obtained by including documents’ sentiment labels into the context vector.
  • At the time of training and testing, this sentiment thesaurus was used to expand the feature vector.

4.3.2. Use of Linguistic Patterns

  • Whilst machine learning methods, for supervised training of the sentiment analysis system, are predominant in literature, a number of unsupervised methods such as linguistic patterns can also be found.
  • Their technique exploited a set of linguistic rules on connectives (‘and’, ‘or’, ‘but’, ‘either/or’, ‘neither/nor’) to identify sentiment words and their orientations (a toy implementation of the conjunction rule is sketched after this list).
  • When negations are implicit, i.e., cannot be recognized by an explicit negation identifier, sarcasm detection also needs to be considered.
  • The authors used a seven-layer deep convolutional neural network to tag each word in opinionated sentences as either an aspect or a non-aspect word.
  • They also developed a set of linguistic patterns for the same purpose and combined them with the neural network.
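A toy sketch of the conjunction rule described above: words joined by ‘and’ are assumed to share orientation, while ‘but’ flips it. The seed lexicon and regular expression are illustrative simplifications, not the rule set of the cited work.

```python
# Toy sketch: propagating sentiment orientation across conjunctions.
import re

seed_lexicon = {"good": +1, "bad": -1}

def propagate(sentence, lexicon):
    lexicon = dict(lexicon)
    # Lookahead keeps the second word available for the next match.
    for w1, conj, w2 in re.findall(r"(\w+)\s+(and|but)\s+(?=(\w+))", sentence.lower()):
        if w1 in lexicon and w2 not in lexicon:
            lexicon[w2] = lexicon[w1] if conj == "and" else -lexicon[w1]
        elif w2 in lexicon and w1 not in lexicon:
            lexicon[w1] = lexicon[w2] if conj == "and" else -lexicon[w2]
    return lexicon

print(propagate("the camera is good and sturdy but pricey", seed_lexicon))
# -> {'good': 1, 'bad': -1, 'sturdy': 1, 'pricey': -1}
```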

4.3.3. Bag of Words versus Bag of Concepts

  • Text representation is a key task for any text classification framework.
  • According to their study, the use of contextual semantic features along with the BoW model can be very useful for semantic text classification (see the sketch after this list).
  • The analysis at concept level is intended to infer the semantic and affective information associated with natural language opinions and, hence, to enable a comparative fine-grained aspect-based sentiment analysis.
  • Rather than gathering isolated opinions about a whole item (e.g., iPhone7), users are generally more interested in comparing different products according to their specific features (e.g., iPhone7’s vs Galaxy S7’s touchscreen), or even sub-features (e.g., fragility of iPhone7’s vs Galaxy S7’s touchscreen).
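A minimal sketch of the two representations; the hand-made concept list stands in for a commonsense resource such as SenticNet, and the string-matching “concept parser” is a deliberate oversimplification.

```python
# Minimal sketch: bag-of-words vs. a toy bag-of-concepts representation.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the touchscreen is very fragile",
        "the touchscreen survived a small drop"]

# Bag of words: every surface token becomes its own dimension.
bow = CountVectorizer().fit(docs)
print(sorted(bow.vocabulary_))

# Bag of concepts: multiword expressions are kept as single semantic units.
concepts = ["fragile_touchscreen", "small_drop"]

def to_concepts(text):
    text = text.replace("touchscreen is very fragile", "fragile_touchscreen")
    text = text.replace("small drop", "small_drop")
    return [c for c in concepts if c in text]

print([to_concepts(d) for d in docs])   # -> [['fragile_touchscreen'], ['small_drop']]
```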

4.3.4. Contextual Subjectivity

  • In their work, subjective expressions were first labeled, with the goal of classifying the contextual sentiment of the given expressions.
  • The authors employed a supervised learning approach based on two steps: first, determining whether the expression is subjective or objective; second, determining whether the subjective expression is positive, negative or neutral (see the two-stage sketch below).
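A minimal two-stage sketch of this scheme using scikit-learn; the training sentences are toy placeholders, so the predictions are only illustrative.

```python
# Minimal sketch: step 1 decides subjective vs. objective,
# step 2 assigns polarity to subjective expressions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

subj_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
subj_clf.fit(["i love this phone", "terrible battery life",
              "the phone weighs 170 grams", "it was released in march"],
             ["subjective", "subjective", "objective", "objective"])

pol_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
pol_clf.fit(["i love this phone", "amazing screen",
             "terrible battery life", "awful keyboard"],
            ["positive", "positive", "negative", "negative"])

def classify(expression):
    # Stage 1: subjectivity gate; Stage 2: polarity for subjective text.
    if subj_clf.predict([expression])[0] == "objective":
        return "objective"
    return pol_clf.predict([expression])[0]

print(classify("terrible keyboard"))   # output depends on the toy training data
```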

4.3.5. New Era of NLP: Emergence of Deep Learning

  • Deep-learning architectures and algorithms have already made impressive advances in fields such as computer vision and pattern recognition.
  • Following this trend, recent NLP research is now increasingly focusing on the use of new deep learning methods.
  • Alternative approaches have exploited the fact that many short n-grams are neutral while longer phrases are well distributed among positive and negative subjective sentence classes.
  • Recursive neural networks predict the sentiment class at each node in the parse tree and attempt to capture the negation and its scope in the entire sentence.
  • In the standard configuration, each word is represented as a vector, and a parent node’s vector is computed once the vectors of its children are available (a schematic composition sketch follows this list).
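A schematic sketch of this bottom-up composition over a binary parse tree; the weights are random placeholders and the word vectors are stand-ins, so only the control flow, not the predictions, is meaningful.

```python
# Schematic sketch: recursive composition with a per-node sentiment score.
import numpy as np

rng = np.random.default_rng(0)
DIM, CLASSES = 8, 3                       # e.g., negative / neutral / positive
W = rng.standard_normal((DIM, 2 * DIM))   # composition weights (untrained)
Ws = rng.standard_normal((CLASSES, DIM))  # per-node sentiment classifier

def embed(word):
    return rng.standard_normal(DIM)       # stand-in word vectors

def compose(tree):
    """tree is either a word (str) or a (left_subtree, right_subtree) pair."""
    if isinstance(tree, str):
        vec = embed(tree)
    else:
        left, right = (compose(subtree) for subtree in tree)
        vec = np.tanh(W @ np.concatenate([left, right]))   # parent from children
    scores = Ws @ vec                      # sentiment logits at this node
    print(tree, "->", int(np.argmax(scores)))
    return vec

compose((("not", "good"), ("at", "all")))
```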

5. Multimodal Affect Recognition

  • Multimodal affect analysis has attracted considerable attention in the field of affective computing.
  • In the previous section, the authors discussed state-of-the-art methods that use the visual, audio or text modality alone for affect recognition.
  • In this section, the authors discuss approaches for solving the multimodal affect recognition problem.

5.1. Information Fusion Techniques

  • Multimodal affect recognition can be seen as the fusion of information from different modalities.
  • Hybrid fusion combines both feature-level and decision-level fusion methods.
  • As the name suggests, rule-based fusion combines multimodal information using statistical rule-based methods such as linear weighted fusion, majority voting and custom-defined rules (a minimal sketch of feature-level and decision-level fusion follows this list).
  • Various methods used under this category include SVMs, Bayesian inference, Dempster-Shafer theory, dynamic Bayesian networks, neural networks and maximum entropy models.
  • Thus, for systems with non-linear characteristics, an extended Kalman filter is used.
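A minimal sketch contrasting feature-level and decision-level fusion; the feature dimensions, class posteriors and modality weights are assumed placeholders.

```python
# Minimal sketch: feature-level (early) vs. decision-level (late) fusion.
import numpy as np

audio_feat = np.random.randn(64)
visual_feat = np.random.randn(128)
text_feat = np.random.randn(300)

# Feature-level fusion: one joint representation for a single classifier.
joint = np.concatenate([audio_feat, visual_feat, text_feat])   # shape (492,)

# Decision-level fusion: linear weighted fusion of per-modality class scores.
audio_probs  = np.array([0.2, 0.3, 0.5])
visual_probs = np.array([0.1, 0.6, 0.3])
text_probs   = np.array([0.3, 0.3, 0.4])
weights = np.array([0.2, 0.3, 0.5])       # assumed modality reliabilities

fused = weights[0] * audio_probs + weights[1] * visual_probs + weights[2] * text_probs
print("joint feature dim:", joint.shape[0], "| fused decision:", int(np.argmax(fused)))
```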

5.2.1. Multimodal Sentiment Analysis

  • FACS and AUs were used as visual features and openEAR was used for extracting acoustic, prosodic features.
  • The combination of these features was then fed to an SVM for fusion, and 74.66% accuracy was obtained.
  • A simple bag-of-words representation was used for the text features.
  • Audio-visual features were fed to a bidirectional LSTM for early feature-level fusion, and an SVM was used to obtain the class label for the textual modality.

5.2.2. Multimodal Emotion Recognition

  • The endpoint of the audio-visual segment was then set to the frame including the offset, after crossing back to a non-negative valence value.
  • Sinks were used in the feature extractor, feeding data to different classifiers (K-Nearest Neighbor, Bayes and Support-Vector based classification and regression using the freely available LibSVM).
  • This emotion lexicon was then used to generate a vector feature representation for each utterance.
  • There were several acoustic features used, ranging from jitter and shimmer for negative emotions to intensity and voicing statistics per frame.

5.2.3. Other Multimodal Cognitive Research

  • The SimSensei Kiosk was developed in such a way that the user feels comfortable talking and sharing information, thus providing clinicians with an automatic assessment of psychological distress in a person.
  • The approach was executed by extraction of audio, visual and textual features to capture affect and semantics in the audio-video content and sentiment in the viewers’ comments.
  • The research employed both feature-level and decision-level fusion methods.

6. Available APIs

  • The authors list 20 popular APIs for emotion recognition from photos, videos, text and speech.
  • It focuses mainly on marketers and writers to improve their content on the basis of emotional insights.
  • It is a facial expression analysis software for analyzing universal emotions in addition to neutral and contempt.
  • This API is used to detect emotions such as happy, sad, angry, surprise, disgust, scared and neutral from faces.
  • Sentic API is a free API for emotion recognition and sentiment analysis, providing semantics and sentics associated with 50,000 commonsense concepts in 40 different languages.

7. Discussion

  • Timely surveys are fundamental to any field of research.
  • The authors not only discuss the state of the art but also collate available datasets and illustrate key steps involved in a multimodal affect analysis framework.
  • The authors have covered around 100 papers in their study.
  • The authors describe some of their major findings from this survey.

7.1. Major Findings

  • This is due to the need to mine information from the growing number of videos posted on social media, and to advances in human-computer interaction agents.
  • In particular, there has been a growing interest in using deep learning techniques and a number of fusion methods.
  • It can be seen that from 2010 onwards, text modality has been considered in many research works on multimodal affect analysis.

7.2. Future Directions

  • As this survey paper has demonstrated, there are significant research challenges outstanding in this multi-disciplinary field.
  • With the advent of deep learning research, it is now an open question whether to use deep features or low-level manually extracted features for video classification.
  • If multimodal systems could model inter-person emotional dependency, this would lead to major advances in multimodal affect research.


Citations
Proceedings ArticleDOI
01 Jul 2017
TL;DR: An LSTM-based model is proposed that enables utterances to capture contextual information from their surroundings in the same video, thus aiding the classification process; the method shows a 5-10% performance improvement over the state of the art and generalizes robustly.
Abstract: Multimodal sentiment analysis is a developing area of research, which involves the identification of sentiments in videos. Current research considers utterances as independent entities, i.e., ignores the interdependencies and relations among the utterances of a video. In this paper, we propose a LSTM-based model that enables utterances to capture contextual information from their surroundings in the same video, thus aiding the classification process. Our method shows 5-10% performance improvement over the state of the art and high robustness to generalizability.

570 citations


Cites background or methods from "A review of affective computing"

  • ...Poria et al. (Poria et al., 2015, 2016d, 2017b) extracted audio, visual and textual features using convolutional neural network (CNN); concatenated those features and employed multiple kernel learning (MKL) for final sentiment classification....

    [...]

  • ...Thus, a combination of text and video data helps to create a more robust emotion and sentiment analysis model (Poria et al., 2017a)....

    [...]

  • ...As pointed out by Poria et al. (Poria et al., 2017a), acted dataset like IEMOCAP can suffer from biased labeling and incorrect acting which can further cause the poor generalizability of the models trained on the acted datasets....

    [...]

Proceedings ArticleDOI
01 Sep 2017
TL;DR: In this article, a Tensor Fusion Network is proposed to model intra-modality and inter-modality dynamics for multimodal sentiment analysis.
Abstract: Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Networks, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.

532 citations

Proceedings Article
26 Apr 2018
TL;DR: A novel solution to targeted aspect-based sentiment analysis, which tackles the challenges of both aspect- based sentiment analysis and targeted sentiment analysis by exploiting commonsense knowledge by augmenting the LSTM network with a hierarchical attention mechanism.
Abstract: Analyzing people’s opinions and sentiments towards certain aspects is an important task of natural language understanding. In this paper, we propose a novel solution to targeted aspect-based sentiment analysis, which tackles the challenges of both aspect-based sentiment analysis and targeted sentiment analysis by exploiting commonsense knowledge. We augment the long short-term memory (LSTM) network with a hierarchical attention mechanism consisting of a target-level attention and a sentence-level attention. Commonsense knowledge of sentiment-related concepts is incorporated into the end-to-end training of a deep neural network for sentiment classification. In order to tightly integrate the commonsense knowledge into the recurrent encoder, we propose an extension of LSTM, termed Sentic LSTM. We conduct experiments on two publicly released datasets, which show that the combination of the proposed attention architecture and Sentic LSTM can outperform state-of-the-art methods in targeted aspect sentiment tasks.

491 citations


Cites background from "A review of affective computing"

  • ...Sentiment analysis is a branch of affective computing research (Poria et al. 2017) that aims to classify text into either positive or negative, but sometimes also neutral (Chaturvedi et al. 2017)....

    [...]

  • ...Sentiment analysis is a branch of affective computing research (Poria et al. 2017) that aims to classify text into either positive or negative, but sometimes also neutral (Chaturvedi et al....

    [...]

Journal ArticleDOI
TL;DR: This paper provides a detailed survey of popular deep learning models that are increasingly applied in sentiment analysis and presents a taxonomy of sentiment analysis, which highlights the power of deep learning architectures for solving sentiment analysis problems.
Abstract: Social media is a powerful source of communication among people to share their sentiments in the form of opinions and views about any topic or article, which results in an enormous amount of unstructured information. Business organizations need to process and study these sentiments to investigate data and to gain business insights. Hence, to analyze these sentiments, various machine learning, and natural language processing-based approaches have been used in the past. However, deep learning-based methods are becoming very popular due to their high performance in recent times. This paper provides a detailed survey of popular deep learning models that are increasingly applied in sentiment analysis. We present a taxonomy of sentiment analysis and discuss the implications of popular deep learning architectures. The key contributions of various researchers are highlighted with the prime focus on deep learning approaches. The crucial sentiment analysis tasks are presented, and multiple languages are identified on which sentiment analysis is done. The survey also summarizes the popular datasets, key features of the datasets, deep learning model applied on them, accuracy obtained from them, and the comparison of various deep learning models. The primary purpose of this survey is to highlight the power of deep learning architectures for solving sentiment analysis problems.

385 citations


Cites background or methods from "A review of affective computing"

  • ...Sentiment analysis using BRNN is reported in Chen et al. (2017b), Baktha and Tripathy (2017), Poria et al. (2017b) and Wang et al. (2018a)....

    [...]

  • ...…detection (Chaturvedi et al. 2018) aspect extraction (Rana and Cheah 2016), social context (Sánchez-rada and Iglesias 2019), multimodal analysis (Poria et al. 2017a), information fusion (Balazs and Velásquez 2016), different languages and genre in sentiment analysis (Rani and Kumar 2019),…...

    [...]

Journal ArticleDOI
TL;DR: In this article, the authors divide the concepts and essential techniques necessary for realizing the Metaverse into three components (i.e., hardware, software, and contents), rather than taking a marketing or hardware approach, to conduct a comprehensive analysis.
Abstract: Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is based on the social value of Generation Z that online and offline selves are not different. With the technological development of deep learning-based high-precision recognition models and natural generation models, Metaverse is being strengthened with various factors, from mobile-based always-on access to connectivity with reality using virtual currency. The integration of enhanced social activities and neural-net methods requires a new definition of Metaverse suitable for the present, different from the previous Metaverse. This paper divides the concepts and essential techniques necessary for realizing the Metaverse into three components (i.e., hardware, software, and contents) and three approaches (i.e., user interaction, implementation, and application) rather than marketing or hardware approach to conduct a comprehensive analysis. Furthermore, we describe essential methods based on three components and techniques to Metaverse’s representative Ready Player One, Roblox, and Facebook research in the domain of films, games, and studies. Finally, we summarize the limitations and directions for implementing the immersive Metaverse as social influences, constraints, and open challenges.

313 citations

References
Proceedings Article
03 Dec 2012
TL;DR: State-of-the-art image classification performance was achieved by a deep convolutional neural network (DCNN) consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Posted Content
TL;DR: This paper proposed two novel model architectures for computing continuous vector representations of words from very large data sets, and the quality of these representations is measured in a word similarity task and the results are compared to the previously best performing techniques based on different types of neural networks.
Abstract: We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

20,077 citations

Journal ArticleDOI
TL;DR: A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
Abstract: We show how to use "complementary priors" to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.

15,055 citations

Proceedings ArticleDOI
Yoon Kim1
25 Aug 2014
TL;DR: The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification, and are proposed to allow for the use of both task-specific and static vectors.
Abstract: We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.

9,776 citations

Book
01 Jan 1872
TL;DR: The Expression of the Emotions in Man and Animals Introduction to the First Edition and Discussion Index, by Phillip Prodger and Paul Ekman.
Abstract: Acknowledgments List of Illustrations Figures Plates Preface to the Anniversary Edition by Paul Ekman Preface to the Third Edition by Paul Ekman Preface to the Second Edition by Francis Darwin Introduction to the Third Edition by Paul Ekman The Expression of the Emotions in Man and Animals Introduction to the First Edition 1. General Principles of Expression 2. General Principles of Expression -- continued 3. General Principles of Expression -- continued 4. Means of Expression in Animals 5. Special Expressions of Animals 6. Special Expressions of Man: Suffering and Weeping 7. Low Spirits, Anxiety, Grief, Dejection, Despair 8. Joy, High Spirits, Love, Tender Feelings, Devotion 9. Reflection - Meditation - Ill-temper - Sulkiness - Determination 10. Hatred and Anger 11. Disdain - Contempt - Disgust - Guilt - Pride, Etc. - Helplessness - Patience - Affirmation and Negation 12. Surprise - Astonishment - Fear - Horror 13. Self-attention - Shame - Shyness - Modesty: Blushing 14. Concluding Remarks and Summary Afterword, by Paul Ekman APPENDIX I: Charles Darwin's Obituary, by T. H. Huxley APPENDIX II: Changes to the Text, by Paul Ekman APPENDIX III: Photography and The Expression of the Emotions, by Phillip Prodger APPENDIX IV: A Note on the Orientation of the Plates, by Phillip Prodger and Paul Ekman APPENDIX V: Concordance of Illustrations, by Phillip Prodger APPENDIX VI: List of Head Words from the Index to the First Edition NOTES NOTES TO THE COMMENTARIES INDEX

9,342 citations

Frequently Asked Questions (9)
Q1. What contributions have the authors mentioned in the paper "A review of affective computing: from unimodal analysis to multimodal fusion"?

This is the primary motivation behind their first-of-its-kind, comprehensive literature review of the diverse field of affective computing. Furthermore, existing literature surveys lack a detailed discussion of the state of the art in multimodal affect analysis frameworks, which this review aims to address. In this paper, the authors focus mainly on the use of audio, visual and text information for multimodal affect analysis, since around 90% of the relevant literature appears to cover these three modalities. As part of this review, the authors carry out an extensive study of different categories of state-of-the-art fusion techniques, followed by a critical analysis of potential performance improvements with multimodal analysis compared to unimodal analysis. A comprehensive overview of these two complementary fields aims to form the building blocks for readers, to better understand this challenging and exciting research field.

One important area of future research is to investigate novel approaches for advancing our understanding of the temporal dependency between utterances, i.e., the effect of the utterance at time t on the utterance at time t+1. Progress in text classification research can play a major role in the future of multimodal affect analysis research. Future research should focus on answering this question. The use of deep learning for multimodal fusion can also be an important direction for future work.

The primary advantage of analyzing videos over textual analysis, for detecting emotions and sentiments from opinions, is the surplus of behavioral cues. 

For acoustic features, low-level acoustic features were extracted at frame level on each utterance and used to generate feature representation of the entire dataset, using the OpenSMILE toolkit. 

Whilst machine learning methods, for supervised training of the sentiment analysis system, are predominant in literature, a number of unsupervised methods such as linguistic patterns can also be found. 

Across the ages of people involved, and the nature of conversations, facial expressions are the primary channel for forming an impression of the subject’s present state of mind. 

The results on uncontrolled recordings (i.e., speech downloaded from a video-sharing website) revealed that the feature adaptation scheme significantly improved the unweighted and weighted accuracies of the emotion recognition system. 

In their literature survey, the authors have found more than 90% of studies reported visual modality as superior to audio and other modalities. 

To accommodate research in audio-visual fusion, the audio and video signals were synchronized with an accuracy of 25 microseconds.