Journal ArticleDOI

A review of affective computing

TL;DR: This first-of-its-kind, comprehensive literature review of the diverse field of affective computing focuses mainly on the use of audio, visual and text information for multimodal affect analysis, and outlines existing methods for fusing information from different modalities.
About: This article is published in Information Fusion. The article was published on 2017-09-01 and is currently open access. It has received 969 citations to date. The article focuses on the topics: Affective computing & Modality (human–computer interaction).

Summary

1. Introduction

  • Affective computing is an emerging field of research that aims to enable intelligent systems to recognize, feel, infer and interpret human emotions.
  • Video opinions, on the other hand, provide multimodal data in terms of vocal and visual modality.
  • The vocal modulations of opinions and facial expressions in the visual data, along with textual data, can provide important cues to better identify true affective states of the opinion holder.
  • To date, most of the research work in this field has focused on multimodal emotion recognition using visual and aural information.
  • By combining these modalities, a multimodal system can compensate for incomplete information that would otherwise hinder the decision process.

2. Affective Computing

  • Before discussing the literature on unimodal and multimodal approaches to affect recognition, the authors introduce the notion of ‘affect’ and ‘affect taxonomy’.
  • While sentiment has a fixed taxonomy bounded by positive, negative and neutral classes, the taxonomy for emotions is diverse.
  • Instead, it has been argued that emotions are primarily social constructs; hence, a social level of analysis is necessary to truly understand the nature of emotions.
  • Such a categorical taxonomy usually does not enable operations between emotion categories, e.g., for studying compound emotions.

3. Available Datasets

  • As the primary purpose of this paper is to inform readers on recent advances in multimodal affect recognition, in this section the authors describe widely-used datasets for multimodal emotion recognition and sentiment analysis.
  • The authors do not cover unimodal datasets, for example facial expression recognition from image datasets (e.g., CK+), as they are outside the scope of the paper.
  • To curate the latter, subjects were provided affect-related scripts and asked to act.
  • It is observed that such datasets can suffer from inaccurate actions by subjects, leading to corrupted samples or inaccurate information for the training dataset.
  • An utterance-level labeling scheme is particularly important for tracking the emotion and sentiment dynamics of the subject’s mindset in a video.

3.1. Datasets for Multimodal Sentiment Analysis

  • Available datasets for multimodal sentiment analysis have mostly been collected from product reviews available on different online video sharing platforms, e.g., YouTube.
  • The first publicly available dataset for tri-modal sentiment analysis was developed by combining the visual, audio and textual modalities.
  • The 47 videos in the dataset were further annotated with one of three sentiment labels: positive, negative or neutral.
  • This annotation task led to 13 positively, 12 negatively and 22 neutrally labeled videos.
  • For qualitative and statistical analysis of the dataset, the authors used polarized words in text, ‘smile’ and ‘look away’ in visual, and pauses and pitch in aural modality, as the main features.

3.2. Datasets for Multimodal Emotion Recognition

  • The authors describe the datasets currently available for multimodal emotion recognition below.
  • The database provides naturalistic clips of pervasive emotions across multiple modalities, together with the labels that best describe them.
  • Activation and evaluation are dimensions that are known to discriminate effectively between emotional states.
  • The characters are Prudence, who is even-tempered and sensible; Poppy, who is happy and outgoing; Spike, who is angry and confrontational; and Obadiah, who is sad and depressive.
  • The recorded dyadic sessions were later manually segmented at utterance level (defined as continuous segments when one of the actors was actively speaking).

4. Unimodal Features for Affect Recognition

  • Unimodal systems act as the primary building blocks for a well-performing multimodal framework.
  • The authors describe the literature on unimodal affect analysis, focusing primarily on the visual, audio and textual modalities.
  • The following section focuses on multimodal fusion.
  • This particularly benefits readers, who can refer to this section for the unimodal affect analysis literature, while the following section explains how to integrate the outputs of unimodal systems, with the final goal of developing a multimodal affect analysis framework.

4.1. Visual Modality

  • Facial expressions are primary cues for understanding emotions and sentiments.
  • Across the ages of people involved, and the nature of conversations, facial expressions are the primary channel for forming an impression of the subject’s present state of mind.
  • The authors present various studies on the use of visual features for multimodal affect analysis.

4.1.1. Facial Action Coding System

  • These resources use combinations of facial action units (AUs) to specify emotions (a small lookup sketch follows this list).
  • Transient features are observed only at the time of facial expressions, such as contraction of the corrugator muscle that produces vertical furrows between the eyebrows.
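As an illustration of how AU combinations map to emotions, the following minimal Python sketch checks detected AUs against prototypical combinations. The AU sets used here (e.g., happiness as AU6 + AU12) follow commonly cited EMFACS-style prototypes and are illustrative assumptions rather than a table taken from the reviewed paper.

```python
# Minimal sketch: matching detected Action Units (AUs) against
# commonly cited prototypical AU combinations (illustrative values,
# not reproduced from the reviewed paper).

PROTOTYPICAL_AUS = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
    "anger":     {4, 5, 7, 23},  # brow lowerer + lid/lip tighteners
}

def match_emotions(detected_aus):
    """Return emotions whose prototypical AU set is fully present."""
    detected = set(detected_aus)
    return [emotion for emotion, aus in PROTOTYPICAL_AUS.items()
            if aus.issubset(detected)]

if __name__ == "__main__":
    print(match_emotions([6, 12, 25]))  # -> ['happiness']
```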

4.1.2. Main Facial Expression Recognition Techniques

  • These methods are also termed differential methods, as they are derived from a Taylor-series approximation (see the optical-flow sketch after this list).
  • These models are used mainly to enhance automatic analysis of images under noisy or cluttered environments.
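A minimal sketch of such a differential (optical-flow) computation, assuming OpenCV is installed; the video path is a placeholder, and the reviewed systems may use different flow formulations.

```python
# Minimal sketch: dense optical flow between two consecutive frames,
# a typical differential motion cue for facial-expression videos.
import cv2
import numpy as np

cap = cv2.VideoCapture("face_clip.mp4")   # placeholder path
ok1, frame1 = cap.read()
ok2, frame2 = cap.read()
cap.release()

if ok1 and ok2:
    gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    gray2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)
    # Farneback dense flow: one 2-D displacement vector per pixel.
    # Positional args: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(gray1, gray2, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Simple motion descriptors that could feed an affect classifier.
    print("mean motion magnitude:", float(np.mean(magnitude)))
```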

4.1.3. Extracting Temporal Features from Videos

  • Most of those methods do not work well for videos, as they do not model temporal information.
  • An important facet in video-based methods is maintaining accurate tracking throughout the video sequence.

4.1.4. Body Gestures

  • Though most research works have concentrated on facial feature extraction for emotion and sentiment analysis, there are some contributions based on features extracted from body gestures.
  • Research in psychology suggests that body gestures provide a significant source of features for emotion and sentiment recognition.
  • Some of the extracted motion cues to understand the subject’s temporal profile included: initial and final slope of the main peak, ratio between the maximum value and the duration of the main peak, ratio between the absolute maximum and the biggest following relative maximum, centroid of the energy, symmetry index, shift index of the main peak, and number of peaks.

4.1.5. New Era: Deep Learning to Extract Visual Features

  • In the last two sections, the authors described the use of handcrafted feature extraction from a visual modality and mathematical models for facial expression analysis.
  • Inspired by the recent success of deep learning, emotion and sentiment analysis tasks have also been enhanced by the adoption of deep learning algorithms, e.g., convolutional neural networks (CNNs); a minimal feature-extraction sketch follows this list.
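A minimal sketch of CNN-based visual feature extraction, assuming PyTorch and torchvision are available; the network, weights and image path are placeholders, not the specific models used in the reviewed works.

```python
# Minimal sketch: using a pretrained CNN as a frame-level feature extractor.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet-18 as a stand-in backbone (torchvision >= 0.13 weights API).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()      # drop the classifier head, keep 512-d features
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("face_frame.jpg").convert("RGB")   # placeholder frame
with torch.no_grad():
    features = model(preprocess(img).unsqueeze(0))  # shape: (1, 512)
print(features.shape)
```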

4.2. Audio Modality

  • Similar to text and visual feature analysis, emotion and sentiment analysis through audio features has specific components.
  • Other features that have been used by some researchers include formants, Mel-frequency cepstral coefficients (MFCCs), pauses, Teager-energy-operator-based features, log frequency power coefficients (LFPC) and linear prediction cepstral coefficients (LPCC); see the acoustic-feature sketch after this list.
  • This feature (spectral flux) is usually calculated as the Euclidean distance between two successive normalized spectra.
  • Discrete feeling states are defined as emotions that are spontaneous, uncontrollable or, in other words, universal emotions.
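A minimal sketch of frame-level acoustic feature extraction, assuming the librosa library; the audio path and parameter values are placeholders rather than the exact settings used in the reviewed studies.

```python
# Minimal sketch: acoustic features commonly used for affect recognition.
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)          # placeholder audio file

# 13 MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Frame-level pitch (fundamental frequency) via the YIN algorithm.
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)

# Spectral flux: Euclidean distance between consecutive normalized spectra.
S = np.abs(librosa.stft(y))
S = S / (np.linalg.norm(S, axis=0, keepdims=True) + 1e-8)
flux = np.linalg.norm(np.diff(S, axis=1), axis=0)

print(mfcc.shape, f0.shape, flux.shape)
```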

4.2.1. Local Features vs. Global Features

  • Audio features for affect classification are likewise divided into local features and global features.
  • The common approach to analyze audio modality is to segment each audio/utterance into either overlapped or non-overlapped segments and examine them.
  • Global features are the most commonly used features in the literature.
  • There are some drawbacks to computing global features, as some of them are only useful for detecting high-arousal affective states, e.g., anger and disgust.
  • Global features also lack temporal information and the dependence between two segments in an utterance (a sketch contrasting frame-level and global statistics follows this list).
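The distinction can be sketched as follows: local features are computed per frame or segment, while global features are statistics over the whole utterance, which discards ordering. The array shapes below are assumptions for illustration.

```python
# Minimal sketch: collapsing frame-level (local) features into a single
# utterance-level (global) descriptor; temporal order is lost by design.
import numpy as np

def global_statistics(frame_features):
    """frame_features: assumed array of shape (n_frames, n_dims)."""
    return np.concatenate([
        frame_features.mean(axis=0),
        frame_features.std(axis=0),
        frame_features.min(axis=0),
        frame_features.max(axis=0),
    ])

frames = np.random.randn(120, 13)          # e.g., 120 frames of 13 MFCCs
print(global_statistics(frames).shape)     # -> (52,)
```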

4.2.3. Audio Features Extraction Using Deep Networks

  • As for computer vision, deep learning is also gaining increasing attention in audio classification research.
  • The authors trained a CNN on the features extracted from all time frames.
  • These types of models are usually incapable of modeling temporal information.
  • A possible research question is whether deep networks can likewise be used for automatic feature extraction from aural data.

4.3. Textual Modality

  • The authors present the state of the art of both emotion recognition and sentiment analysis from text.
  • While initially the use of knowledge bases was more popular for the identification of emotions and polarity in text, sentiment analysis researchers have recently been making increasing use of statistics-based approaches, with a special focus on supervised statistical methods.

4.3.1. Single- vs. Cross-domain

  • The authors first incorporated domain-independent words to aid the clustering process and then exploited the resulting clusters to reduce the gap between the domain-specific words of the two domains.
  • Sentiment sensitivity was obtained by including documents’ sentiment labels into the context vector.
  • At the time of training and testing, this sentiment thesaurus was used to expand the feature vector.

4.3.2. Use of Linguistic Patterns

  • Whilst machine learning methods, for supervised training of the sentiment analysis system, are predominant in literature, a number of unsupervised methods such as linguistic patterns can also be found.
  • Their technique exploited a set of linguistic rules on connectives (‘and’, ‘or’, ‘but’, ‘either/or’, ‘neither/nor’) to identify sentiment words and their orientations (a toy implementation of the conjunction rule is sketched after this list).
  • When negations are implicit, i.e., cannot be recognized by an explicit negation identifier, sarcasm detection also needs to be considered.
  • The authors used a seven-layer deep convolutional neural network to tag each word in opinionated sentences as either an aspect or a non-aspect word.
  • They also developed a set of linguistic patterns for the same purpose and combined them with the neural network.
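A toy sketch of the conjunction rule described above: words joined by ‘and’ are assumed to share orientation, while ‘but’ flips it. The seed lexicon and regular expression are illustrative simplifications, not the rule set of the cited work.

```python
# Toy sketch: propagating sentiment orientation across conjunctions.
import re

seed_lexicon = {"good": +1, "bad": -1}

def propagate(sentence, lexicon):
    lexicon = dict(lexicon)
    # Lookahead keeps the second word available for the next match.
    for w1, conj, w2 in re.findall(r"(\w+)\s+(and|but)\s+(?=(\w+))", sentence.lower()):
        if w1 in lexicon and w2 not in lexicon:
            lexicon[w2] = lexicon[w1] if conj == "and" else -lexicon[w1]
        elif w2 in lexicon and w1 not in lexicon:
            lexicon[w1] = lexicon[w2] if conj == "and" else -lexicon[w2]
    return lexicon

print(propagate("the camera is good and sturdy but pricey", seed_lexicon))
# -> {'good': 1, 'bad': -1, 'sturdy': 1, 'pricey': -1}
```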

4.3.3. Bag of Words versus Bag of Concepts

  • Text representation is a key task for any text classification framework.
  • According to their study, the use of contextual semantic features along with the BoW model can be very useful for semantic text classification (see the sketch after this list).
  • The analysis at concept level is intended to infer the semantic and affective information associated with natural language opinions and, hence, to enable a comparative fine-grained aspect-based sentiment analysis.
  • Rather than gathering isolated opinions about a whole item (e.g., iPhone7), users are generally more interested in comparing different products according to their specific features (e.g., iPhone7’s vs Galaxy S7’s touchscreen), or even sub-features (e.g., fragility of iPhone7’s vs Galaxy S7’s touchscreen).
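A minimal sketch of the two representations; the hand-made concept list stands in for a commonsense resource such as SenticNet, and the string-matching “concept parser” is a deliberate oversimplification.

```python
# Minimal sketch: bag-of-words vs. a toy bag-of-concepts representation.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the touchscreen is very fragile",
        "the touchscreen survived a small drop"]

# Bag of words: every surface token becomes its own dimension.
bow = CountVectorizer().fit(docs)
print(sorted(bow.vocabulary_))

# Bag of concepts: multiword expressions are kept as single semantic units.
concepts = ["fragile_touchscreen", "small_drop"]

def to_concepts(text):
    text = text.replace("touchscreen is very fragile", "fragile_touchscreen")
    text = text.replace("small drop", "small_drop")
    return [c for c in concepts if c in text]

print([to_concepts(d) for d in docs])   # -> [['fragile_touchscreen'], ['small_drop']]
```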

4.3.4. Contextual Subjectivity

  • In their work, subjective expressions were first labeled, with the goal of classifying the contextual sentiment of the given expressions.
  • The authors employed a supervised learning approach based on two steps: first, determining whether the expression is subjective or objective; second, determining whether the subjective expression is positive, negative or neutral (see the two-stage sketch below).
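A minimal two-stage sketch of this scheme using scikit-learn; the training sentences are toy placeholders, so the predictions are only illustrative.

```python
# Minimal sketch: step 1 decides subjective vs. objective,
# step 2 assigns polarity to subjective expressions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

subj_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
subj_clf.fit(["i love this phone", "terrible battery life",
              "the phone weighs 170 grams", "it was released in march"],
             ["subjective", "subjective", "objective", "objective"])

pol_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
pol_clf.fit(["i love this phone", "amazing screen",
             "terrible battery life", "awful keyboard"],
            ["positive", "positive", "negative", "negative"])

def classify(expression):
    # Stage 1: subjectivity gate; Stage 2: polarity for subjective text.
    if subj_clf.predict([expression])[0] == "objective":
        return "objective"
    return pol_clf.predict([expression])[0]

print(classify("terrible keyboard"))   # output depends on the toy training data
```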

4.3.5. New Era of NLP: Emergence of Deep Learning

  • Deep-learning architectures and algorithms have already made impressive advances in fields such as computer vision and pattern recognition.
  • Following this trend, recent NLP research is now increasingly focusing on the use of new deep learning methods.
  • Alternative approaches have exploited the fact that many short n-grams are neutral while longer phrases are well distributed among positive and negative subjective sentence classes.
  • Recursive neural networks predict the sentiment class at each node in the parse tree and attempt to capture the negation and its scope in the entire sentence.
  • In the standard configuration, each word is represented as a vector, and a parent node’s vector is computed once the vectors of its children are available (a schematic composition sketch follows this list).
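A schematic sketch of this bottom-up composition over a binary parse tree; the weights are random placeholders and the word vectors are stand-ins, so only the control flow, not the predictions, is meaningful.

```python
# Schematic sketch: recursive composition with a per-node sentiment score.
import numpy as np

rng = np.random.default_rng(0)
DIM, CLASSES = 8, 3                       # e.g., negative / neutral / positive
W = rng.standard_normal((DIM, 2 * DIM))   # composition weights (untrained)
Ws = rng.standard_normal((CLASSES, DIM))  # per-node sentiment classifier

def embed(word):
    return rng.standard_normal(DIM)       # stand-in word vectors

def compose(tree):
    """tree is either a word (str) or a (left_subtree, right_subtree) pair."""
    if isinstance(tree, str):
        vec = embed(tree)
    else:
        left, right = (compose(subtree) for subtree in tree)
        vec = np.tanh(W @ np.concatenate([left, right]))   # parent from children
    scores = Ws @ vec                      # sentiment logits at this node
    print(tree, "->", int(np.argmax(scores)))
    return vec

compose((("not", "good"), ("at", "all")))
```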

5. Multimodal Affect Recognition

  • Multimodal affect analysis has attracted considerable attention in the field of affective computing.
  • In the previous section, the authors discussed state-of-the-art methods that use the visual, audio or text modality alone for affect recognition.
  • In this section, the authors discuss approaches for solving the multimodal affect recognition problem.

5.1. Information Fusion Techniques

  • Multimodal affect recognition can be seen as the fusion of information from different modalities.
  • Hybrid fusion combines both feature-level and decision-level fusion methods.
  • As the name suggests, rule-based fusion combines multimodal information using statistical rule-based methods such as linear weighted fusion, majority voting and custom-defined rules (a minimal sketch of feature-level and decision-level fusion follows this list).
  • Various methods used under this category include SVMs, Bayesian inference, Dempster-Shafer theory, dynamic Bayesian networks, neural networks and maximum entropy models.
  • Thus, for systems with non-linear characteristics, an extended Kalman filter is used.
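A minimal sketch contrasting feature-level and decision-level fusion; the feature dimensions, class posteriors and modality weights are assumed placeholders.

```python
# Minimal sketch: feature-level (early) vs. decision-level (late) fusion.
import numpy as np

audio_feat = np.random.randn(64)
visual_feat = np.random.randn(128)
text_feat = np.random.randn(300)

# Feature-level fusion: one joint representation for a single classifier.
joint = np.concatenate([audio_feat, visual_feat, text_feat])   # shape (492,)

# Decision-level fusion: linear weighted fusion of per-modality class scores.
audio_probs  = np.array([0.2, 0.3, 0.5])
visual_probs = np.array([0.1, 0.6, 0.3])
text_probs   = np.array([0.3, 0.3, 0.4])
weights = np.array([0.2, 0.3, 0.5])       # assumed modality reliabilities

fused = weights[0] * audio_probs + weights[1] * visual_probs + weights[2] * text_probs
print("joint feature dim:", joint.shape[0], "| fused decision:", int(np.argmax(fused)))
```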

5.2.1. Multimodal Sentiment Analysis

  • FACS and AUs were used as visual features and openEAR was used for extracting acoustic, prosodic features.
  • The combination of these features was then fed to an SVM for fusion, and 74.66% accuracy was obtained.
  • A simple bag-of-words representation was used for the text features.
  • Audio-visual features were fed to a bidirectional LSTM for early feature-level fusion, and an SVM was used to obtain the class label for the textual modality.

5.2.2. Multimodal Emotion Recognition

  • The endpoint of the audio-visual segment was then set to the frame including the offset, after crossing back to a non-negative valence value.
  • Sinks were used in the feature extractor, feeding data to different classifiers (K-Nearest Neighbor, Bayes and Support-Vector based classification and regression using the freely available LibSVM).
  • This emotion lexicon was then used to generate a vector feature representation for each utterance.
  • There were several acoustic features used, ranging from jitter and shimmer for negative emotions to intensity and voicing statistics per frame.

5.2.3. Other Multimodal Cognitive Research

  • The SimSensei Kiosk was developed in such a way that the user feels comfortable talking and sharing information, thus providing clinicians with an automatic assessment of psychological distress in a person.
  • The approach was executed by extraction of audio, visual and textual features to capture affect and semantics in the audio-video content and sentiment in the viewers’ comments.
  • The research employed both feature-level and decision-level fusion methods.

6. Available APIs

  • The authors list 20 popular APIs for emotion recognition from photos, videos, text and speech.
  • It focuses mainly on marketers and writers to improve their content on the basis of emotional insights.
  • It is a facial expression analysis software for analyzing universal emotions in addition to neutral and contempt.
  • This API is used to detect emotions such as happy, sad, angry, surprise, disgust, scared and neutral from faces.
  • Sentic API is a free API for emotion recognition and sentiment analysis, providing semantics and sentics associated with 50,000 commonsense concepts in 40 different languages.

7. Discussion

  • Timely surveys are fundamental to any field of research.
  • The authors not only discuss the state of the art but also collate available datasets and illustrate key steps involved in a multimodal affect analysis framework.
  • The authors have covered around 100 papers in their study.
  • The authors describe some of their major findings from this survey.

7.1. Major Findings

  • This is due to the need to mine information from the growing number of videos posted on social media, and to advances in human-computer interaction agents.
  • In particular, there has been a growing interest in using deep learning techniques and a number of fusion methods.
  • It can be seen that from 2010 onwards, text modality has been considered in many research works on multimodal affect analysis.

7.2. Future Directions

  • As this survey paper has demonstrated, there are significant research challenges outstanding in this multi-disciplinary field.
  • With the advent of deep learning research, it is now an open question whether to use deep features or low-level manually extracted features for video classification.
  • If multimodal systems could model inter-person emotional dependency, this would lead to major advances in multimodal affect research.


Citations
Proceedings ArticleDOI
01 Jul 2017
TL;DR: An LSTM-based model is proposed that enables utterances to capture contextual information from their surroundings in the same video, thus aiding the classification process; the method shows a 5-10% performance improvement over the state of the art and generalizes robustly.
Abstract: Multimodal sentiment analysis is a developing area of research, which involves the identification of sentiments in videos. Current research considers utterances as independent entities, i.e., ignores the interdependencies and relations among the utterances of a video. In this paper, we propose a LSTM-based model that enables utterances to capture contextual information from their surroundings in the same video, thus aiding the classification process. Our method shows 5-10% performance improvement over the state of the art and high robustness to generalizability.

570 citations


Cites background or methods from "A review of affective computing"

  • ...Poria et al. (Poria et al., 2015, 2016d, 2017b) extracted audio, visual and textual features using convolutional neural network (CNN); concatenated those features and employed multiple kernel learning (MKL) for final sentiment classification....

    [...]

  • ...Thus, a combination of text and video data helps to create a more robust emotion and sentiment analysis model (Poria et al., 2017a)....

    [...]

  • ...As pointed out by Poria et al. (Poria et al., 2017a), acted dataset like IEMOCAP can suffer from biased labeling and incorrect acting which can further cause the poor generalizability of the models trained on the acted datasets....

    [...]

Proceedings ArticleDOI
01 Sep 2017
TL;DR: In this article, a Tensor Fusion Network is proposed to model intra-modality and inter-modality dynamics for multimodal sentiment analysis.
Abstract: Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Networks, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.

532 citations

Proceedings Article
26 Apr 2018
TL;DR: A novel solution to targeted aspect-based sentiment analysis, which tackles the challenges of both aspect- based sentiment analysis and targeted sentiment analysis by exploiting commonsense knowledge by augmenting the LSTM network with a hierarchical attention mechanism.
Abstract: Analyzing people’s opinions and sentiments towards certain aspects is an important task of natural language understanding. In this paper, we propose a novel solution to targeted aspect-based sentiment analysis, which tackles the challenges of both aspect-based sentiment analysis and targeted sentiment analysis by exploiting commonsense knowledge. We augment the long short-term memory (LSTM) network with a hierarchical attention mechanism consisting of a target-level attention and a sentence-level attention. Commonsense knowledge of sentiment-related concepts is incorporated into the end-to-end training of a deep neural network for sentiment classification. In order to tightly integrate the commonsense knowledge into the recurrent encoder, we propose an extension of LSTM, termed Sentic LSTM. We conduct experiments on two publicly released datasets, which show that the combination of the proposed attention architecture and Sentic LSTM can outperform state-of-the-art methods in targeted aspect sentiment tasks.

491 citations


Cites background from "A review of affective computing"

  • ...Sentiment analysis is a branch of affective computing research (Poria et al. 2017) that aims to classify text into either positive or negative, but sometimes also neutral (Chaturvedi et al. 2017)....

    [...]

  • ...Sentiment analysis is a branch of affective computing research (Poria et al. 2017) that aims to classify text into either positive or negative, but sometimes also neutral (Chaturvedi et al....

    [...]

Journal ArticleDOI
TL;DR: This paper provides a detailed survey of popular deep learning models that are increasingly applied in sentiment analysis and presents a taxonomy of sentiment analysis, which highlights the power of deep learning architectures for solving sentiment analysis problems.
Abstract: Social media is a powerful source of communication among people to share their sentiments in the form of opinions and views about any topic or article, which results in an enormous amount of unstructured information. Business organizations need to process and study these sentiments to investigate data and to gain business insights. Hence, to analyze these sentiments, various machine learning, and natural language processing-based approaches have been used in the past. However, deep learning-based methods are becoming very popular due to their high performance in recent times. This paper provides a detailed survey of popular deep learning models that are increasingly applied in sentiment analysis. We present a taxonomy of sentiment analysis and discuss the implications of popular deep learning architectures. The key contributions of various researchers are highlighted with the prime focus on deep learning approaches. The crucial sentiment analysis tasks are presented, and multiple languages are identified on which sentiment analysis is done. The survey also summarizes the popular datasets, key features of the datasets, deep learning model applied on them, accuracy obtained from them, and the comparison of various deep learning models. The primary purpose of this survey is to highlight the power of deep learning architectures for solving sentiment analysis problems.

385 citations


Cites background or methods from "A review of affective computing"

  • ...Sentiment analysis using BRNN is reported in Chen et al. (2017b), Baktha and Tripathy (2017), Poria et al. (2017b) and Wang et al. (2018a)....

    [...]

  • ...…detection (Chaturvedi et al. 2018) aspect extraction (Rana and Cheah 2016), social context (Sánchez-rada and Iglesias 2019), multimodal analysis (Poria et al. 2017a), information fusion (Balazs and Velásquez 2016), different languages and genre in sentiment analysis (Rani and Kumar 2019),…...

    [...]

Journal ArticleDOI
TL;DR: In this article, the authors divide the concepts and essential techniques necessary for realizing the Metaverse into three components (i.e., hardware, software, and contents), rather than taking a marketing or hardware approach, to conduct a comprehensive analysis.
Abstract: Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is based on the social value of Generation Z that online and offline selves are not different. With the technological development of deep learning-based high-precision recognition models and natural generation models, Metaverse is being strengthened with various factors, from mobile-based always-on access to connectivity with reality using virtual currency. The integration of enhanced social activities and neural-net methods requires a new definition of Metaverse suitable for the present, different from the previous Metaverse. This paper divides the concepts and essential techniques necessary for realizing the Metaverse into three components (i.e., hardware, software, and contents) and three approaches (i.e., user interaction, implementation, and application) rather than marketing or hardware approach to conduct a comprehensive analysis. Furthermore, we describe essential methods based on three components and techniques to Metaverse’s representative Ready Player One, Roblox, and Facebook research in the domain of films, games, and studies. Finally, we summarize the limitations and directions for implementing the immersive Metaverse as social influences, constraints, and open challenges.

313 citations

References
Proceedings Article
03 Dec 2012
TL;DR: State-of-the-art image classification performance was achieved by a deep convolutional neural network (DCNN) consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Posted Content
TL;DR: This paper proposed two novel model architectures for computing continuous vector representations of words from very large data sets, and the quality of these representations is measured in a word similarity task and the results are compared to the previously best performing techniques based on different types of neural networks.
Abstract: We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

20,077 citations

Journal ArticleDOI
TL;DR: A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
Abstract: We show how to use "complementary priors" to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.

15,055 citations

Proceedings ArticleDOI
Yoon Kim1
25 Aug 2014
TL;DR: The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification, and are proposed to allow for the use of both task-specific and static vectors.
Abstract: We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.

9,776 citations

Book
01 Jan 1872
TL;DR: The Expression of the Emotions in Man and Animals Introduction to the First Edition and Discussion Index, by Phillip Prodger and Paul Ekman.
Abstract: Acknowledgments List of Illustrations Figures Plates Preface to the Anniversary Edition by Paul Ekman Preface to the Third Edition by Paul Ekman Preface to the Second Edition by Francis Darwin Introduction to the Third Edition by Paul Ekman The Expression of the Emotions in Man and Animals Introduction to the First Edition 1. General Principles of Expression 2. General Principles of Expression -- continued 3. General Principles of Expression -- continued 4. Means of Expression in Animals 5. Special Expressions of Animals 6. Special Expressions of Man: Suffering and Weeping 7. Low Spirits, Anxiety, Grief, Dejection, Despair 8. Joy, High Spirits, Love, Tender Feelings, Devotion 9. Reflection - Meditation - Ill-temper - Sulkiness - Determination 10. Hatred and Anger 11. Disdain - Contempt - Disgust - Guilt - Pride, Etc. - Helplessness - Patience - Affirmation and Negation 12. Surprise - Astonishment - Fear - Horror 13. Self-attention - Shame - Shyness - Modesty: Blushing 14. Concluding Remarks and Summary Afterword, by Paul Ekman APPENDIX I: Charles Darwin's Obituary, by T. H. Huxley APPENDIX II: Changes to the Text, by Paul Ekman APPENDIX III: Photography and The Expression of the Emotions, by Phillip Prodger APPENDIX IV: A Note on the Orientation of the Plates, by Phillip Prodger and Paul Ekman APPENDIX V: Concordance of Illustrations, by Phillip Prodger APPENDIX VI: List of Head Words from the Index to the First Edition NOTES NOTES TO THE COMMENTARIES INDEX

9,342 citations

Frequently Asked Questions (9)
Q1. What contributions have the authors mentioned in the paper "A review of affective computing: from unimodal analysis to multimodal fusion"?

This is the primary motivation behind their first-of-its-kind, comprehensive literature review of the diverse field of affective computing. Furthermore, existing literature surveys lack a detailed discussion of the state of the art in multimodal affect analysis frameworks, which this review aims to address. In this paper, the authors focus mainly on the use of audio, visual and text information for multimodal affect analysis, since around 90% of the relevant literature appears to cover these three modalities. As part of this review, the authors carry out an extensive study of different categories of state-of-the-art fusion techniques, followed by a critical analysis of potential performance improvements with multimodal analysis compared to unimodal analysis. A comprehensive overview of these two complementary fields aims to form the building blocks for readers, to better understand this challenging and exciting research field.

One important area of future research is to investigate novel approaches for advancing our understanding of the temporal dependency between utterances, i.e., the effect of the utterance at time t on the utterance at time t+1. Progress in text classification research can play a major role in the future of multimodal affect analysis research. Future research should focus on answering this question. The use of deep learning for multimodal fusion can also be an important direction for future work.

The primary advantage of analyzing videos over textual analysis, for detecting emotions and sentiments from opinions, is the surplus of behavioral cues. 

For acoustic features, low-level acoustic features were extracted at frame level on each utterance and used to generate feature representation of the entire dataset, using the OpenSMILE toolkit. 

Whilst machine learning methods, for supervised training of the sentiment analysis system, are predominant in literature, a number of unsupervised methods such as linguistic patterns can also be found. 

Across the ages of people involved, and the nature of conversations, facial expressions are the primary channel for forming an impression of the subject’s present state of mind. 

The results on uncontrolled recordings (i.e., speech downloaded from a video-sharing website) revealed that the feature adaptation scheme significantly improved the unweighted and weighted accuracies of the emotion recognition system. 

In their literature survey, the authors have found more than 90% of studies reported visual modality as superior to audio and other modalities. 

To accommodate research in audio-visual fusion, the audio and video signals were synchronized with an accuracy of 25 microseconds.