
Showing papers in "IEEE Transactions on Affective Computing in 2019"


Journal ArticleDOI
TL;DR: A survey of the neurophysiological research performed from 2009 to 2016 is presented, providing a comprehensive overview of the existing works in emotion recognition using EEG signals, and a set of good practice recommendations that researchers must follow to achieve reproducible, replicable, well-validated and high-quality results.
Abstract: Emotions have an important role in daily life, not only in human interaction, but also in decision-making processes, and in the perception of the world around us. Due to the recent interest shown by the research community in establishing emotional interactions between humans and computers, identifying the emotional state of the former has become a need. This can be achieved through multiple measures, such as subjective self-reports, autonomic and neurophysiological measurements. In recent years, Electroencephalography (EEG) has received considerable attention from researchers, since it can provide a simple, cheap, portable, and easy-to-use solution for identifying emotions. In this paper, we present a survey of the neurophysiological research performed from 2009 to 2016, providing a comprehensive overview of the existing works in emotion recognition using EEG signals. We focus our analysis on the main aspects involved in the recognition process (e.g., subjects, features extracted, classifiers), and compare the works along these aspects. From this analysis, we propose a set of good practice recommendations that researchers must follow to achieve reproducible, replicable, well-validated and high-quality results. We intend this survey to be useful for the research community working on emotion recognition through EEG signals, and in particular for those entering this field of research, since it offers a structured starting point.

640 citations
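To make the surveyed pipeline stages concrete, the following is a minimal sketch of a typical EEG emotion recognition pipeline (band-power features per channel followed by an SVM classifier); the data, band definitions, and classifier settings are illustrative assumptions, not any specific system reviewed above.

```python
# Minimal EEG emotion-recognition pipeline sketch: band-power features + SVM.
# Assumes `trials` is an array of shape (n_trials, n_channels, n_samples) at `fs` Hz
# and `labels` holds one emotion label per trial (toy data below).
import numpy as np
from scipy.signal import welch
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def band_power_features(trial, fs):
    """Average spectral power in each frequency band, per channel."""
    freqs, psd = welch(trial, fs=fs, nperseg=fs * 2, axis=-1)
    feats = []
    for lo, hi in BANDS.values():
        idx = (freqs >= lo) & (freqs < hi)
        feats.append(psd[:, idx].mean(axis=-1))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
fs = 128
trials = rng.standard_normal((40, 32, fs * 10))   # toy EEG: 40 trials, 32 channels, 10 s
labels = rng.integers(0, 3, size=40)              # toy labels: 3 emotion classes

X = np.stack([band_power_features(t, fs) for t in trials])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print("CV accuracy:", cross_val_score(clf, X, labels, cv=5).mean())
```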


Journal ArticleDOI
TL;DR: The experimental results indicate that stable patterns of electroencephalogram (EEG) over time for emotion recognition exhibit consistency across sessions; the lateral temporal areas activate more for positive emotions than negative emotions in beta and gamma bands; and the neural patterns of neutral emotions have higher alpha responses at parietal and occipital sites.
Abstract: In this paper, we investigate stable patterns of electroencephalogram (EEG) over time for emotion recognition using a machine learning approach. Up to now, various findings of activated patterns associated with different emotions have been reported. However, their stability over time has not been fully investigated yet. In this paper, we focus on identifying EEG stability in emotion recognition. We systematically evaluate the performance of various popular feature extraction, feature selection, feature smoothing and pattern classification methods with the DEAP dataset and a newly developed dataset called SEED for this study. Discriminative Graph regularized Extreme Learning Machine with differential entropy features achieves the best average accuracies of 69.67 and 91.07 percent on the DEAP and SEED datasets, respectively. The experimental results indicate that stable patterns exhibit consistency across sessions; the lateral temporal areas activate more for positive emotions than negative emotions in beta and gamma bands; the neural patterns of neutral emotions have higher alpha responses at parietal and occipital sites; and for negative emotions, the neural patterns have significant higher delta responses at parietal and occipital sites and higher gamma responses at prefrontal sites. The performance of our emotion recognition models shows that the neural patterns are relatively stable within and between sessions.

511 citations
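The differential entropy (DE) feature mentioned above has a simple closed form when a band-pass filtered EEG segment is modeled as Gaussian: DE = 0.5 ln(2πe σ²). Below is a minimal sketch of computing DE per frequency band for one channel; the bands, filter order, and sampling rate are illustrative assumptions, and the graph-regularized extreme learning machine classifier is not reproduced.

```python
# Sketch of the differential-entropy (DE) feature: for a band-pass filtered EEG
# segment modeled as Gaussian, DE reduces to 0.5 * ln(2*pi*e*variance).
import numpy as np
from scipy.signal import butter, sosfiltfilt

def differential_entropy(segment):
    """DE of a 1-D signal under a Gaussian assumption."""
    return 0.5 * np.log(2 * np.pi * np.e * np.var(segment))

def de_per_band(signal, fs, bands=((4, 8), (8, 13), (13, 30), (30, 45))):
    """Band-pass filter the signal and compute one DE value per band."""
    feats = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        feats.append(differential_entropy(sosfiltfilt(sos, signal)))
    return np.array(feats)

fs = 200                                                 # illustrative sampling rate
x = np.random.default_rng(0).standard_normal(fs * 4)     # toy 4-second channel
print(de_per_band(x, fs))
```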


Journal ArticleDOI
TL;DR: In this paper, the authors collected, annotated, and prepared for public distribution a new database of facial emotions in the wild (called AffectNet), which contains more than 1,000,000 facial images collected from the Internet by querying three major search engines using 1,250 emotion-related keywords in six different languages.
Abstract: Automated affective computing in the wild setting is a challenging problem in computer vision. Existing annotated databases of facial expressions in the wild are small and mostly cover discrete emotions (aka the categorical model). There are very limited annotated facial databases for affective computing in the continuous dimensional model (e.g., valence and arousal). To meet this need, we collected, annotated, and prepared for public distribution a new database of facial emotions in the wild (called AffectNet). AffectNet contains more than 1,000,000 facial images collected from the Internet by querying three major search engines using 1,250 emotion-related keywords in six different languages. About half of the retrieved images were manually annotated for the presence of seven discrete facial expressions and the intensity of valence and arousal. AffectNet is by far the largest database of facial expression, valence, and arousal in the wild, enabling research in automated facial expression recognition in two different emotion models. Two baseline deep neural networks are used to classify images in the categorical model and predict the intensity of valence and arousal. Various evaluation metrics show that our deep neural network baselines can perform better than conventional machine learning methods and off-the-shelf facial expression recognition systems.

432 citations
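The two baselines described, categorical expression classification and continuous valence/arousal prediction, can share a single backbone with two output heads. The PyTorch sketch below illustrates that structure with a deliberately small, hypothetical CNN; it is not the architecture used for the AffectNet baselines.

```python
# Sketch of a two-head baseline: one head classifies 7 discrete expressions,
# the other regresses valence and arousal. The backbone is a small illustrative
# CNN, not the networks used in the AffectNet paper.
import torch
import torch.nn as nn

class TwoHeadBaseline(nn.Module):
    def __init__(self, n_classes=7):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.cls_head = nn.Linear(64, n_classes)   # categorical model
        self.va_head = nn.Linear(64, 2)            # valence, arousal in [-1, 1]

    def forward(self, x):
        h = self.backbone(x)
        return self.cls_head(h), torch.tanh(self.va_head(h))

model = TwoHeadBaseline()
images = torch.randn(8, 3, 224, 224)               # toy batch of face crops
logits, va = model(images)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 7, (8,))) \
     + nn.MSELoss()(va, torch.rand(8, 2) * 2 - 1)
loss.backward()
```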


Journal ArticleDOI
TL;DR: This paper systematically reviews all components of such systems (pre-processing, feature extraction and machine coding of facial actions) and summarises the existing FACS-coded facial expression databases.
Abstract: As one of the most comprehensive and objective ways to describe facial expressions, the Facial Action Coding System (FACS) has recently received significant attention. Over the past 30 years, extensive research has been conducted by psychologists and neuroscientists on various aspects of facial expression analysis using FACS. Automating FACS coding would make this research faster and more widely applicable, opening up new avenues to understanding how we communicate through facial expressions. Such an automated process can also potentially increase the reliability, precision and temporal resolution of coding. This paper provides a comprehensive survey of research into machine analysis of facial actions. We systematically review all components of such systems: pre-processing, feature extraction and machine coding of facial actions. In addition, the existing FACS-coded facial expression databases are summarised. Finally, challenges that have to be addressed to make automatic facial action analysis applicable in real-life situations are extensively discussed. There are two underlying motivations for us to write this survey paper: the first is to provide an up-to-date review of the existing literature, and the second is to offer some insights into the future of machine recognition of facial actions: what are the challenges and opportunities that researchers in the field face.

257 citations


Journal ArticleDOI
TL;DR: The impact of stress on multiple bodily responses is surveyed, with emphasis put on the efficiency, robustness and consistency of biosignal features across the current state of knowledge in stress detection.
Abstract: This review investigates the effects of psychological stress on the human body measured through biosignals. When a potentially threatening stimulus is perceived, a cascade of physiological processes occurs mobilizing the body and nervous system to confront the imminent threat and ensure effective adaptation. Biosignals that can be measured reliably in relation to such stressors include physiological (EEG, ECG, EDA, EMG) and physical measures (respiratory rate, speech, skin temperature, pupil size, eye activity). A fundamental objective in this area of psychophysiological research is to establish reliable biosignal indices that reveal the underlying physiological mechanisms of the stress response. Motivated by the lack of comprehensive guidelines on the relationship between the multitude of biosignal features used in the literature and their corresponding behaviour during stress, this paper surveys the impact of stress on multiple bodily responses. Emphasis is put on the efficiency, robustness and consistency of biosignal features across the current state of knowledge in stress detection. Multimodal biosignal analysis and modelling methods for deriving accurate stress correlates are also explored. This paper aims to provide a comprehensive review of biosignal patterns arising under stress conditions and reliable practical guidelines towards more efficient detection of stress.

243 citations
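As a concrete illustration of the surveyed feature families, the sketch below computes two commonly reported stress correlates: RMSSD (a heart-rate variability measure derived from successive ECG RR intervals) and the mean electrodermal activity level. The data and the direction-of-change comments are illustrative of trends reported in the literature, not results from this survey.

```python
# Sketch of two widely used stress-related biosignal features:
# RMSSD (heart-rate variability from successive RR intervals) and mean EDA level.
import numpy as np

def rmssd(rr_intervals_ms):
    """Root mean square of successive RR-interval differences (ms)."""
    diffs = np.diff(rr_intervals_ms)
    return np.sqrt(np.mean(diffs ** 2))

def mean_eda(eda_signal_us):
    """Mean skin-conductance level in microsiemens."""
    return float(np.mean(eda_signal_us))

# Toy data: RR intervals around 800 ms and a noisy EDA trace sampled at 4 Hz.
rng = np.random.default_rng(0)
rr = 800 + rng.normal(0, 30, size=120)             # roughly 2 minutes of beats
eda = 2.0 + 0.3 * rng.standard_normal(4 * 60 * 4)  # 4 minutes of EDA

print(f"RMSSD: {rmssd(rr):.1f} ms (lower values are often reported under stress)")
print(f"Mean EDA: {mean_eda(eda):.2f} uS (higher levels are often reported under stress)")
```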


Journal ArticleDOI
TL;DR: The proposed approach combines machine learning algorithms to retrieve recordings conveying balanced emotional content with a cost-effective annotation process using crowdsourcing, which makes it possible to build a large-scale speech emotion database.
Abstract: The lack of a large, natural emotional database is one of the key barriers to translating results on speech emotion recognition in controlled conditions into real-life applications. Collecting emotional databases is expensive and time demanding, which limits the size of existing corpora. Current approaches used to collect spontaneous databases tend to provide unbalanced emotional content, which is dictated by the given recording protocol (e.g., positive for colloquial conversations, negative for discussions or debates). The size and speaker diversity are also limited. This paper proposes a novel approach to effectively build a large, naturalistic emotional database with balanced emotional content, reduced cost and reduced manual labor. It relies on existing spontaneous recordings obtained from audio-sharing websites. The proposed approach combines machine learning algorithms to retrieve recordings conveying balanced emotional content with a cost-effective annotation process using crowdsourcing, which makes it possible to build a large-scale speech emotion database. This approach provides natural emotional renditions from multiple speakers, with different channel conditions and balanced emotional content, which is difficult to obtain with alternative data collection protocols.

189 citations
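The retrieval step can be approximated by scoring candidate recordings with an existing emotion classifier and forwarding an equal number of top-scoring clips per emotion to crowdsourced annotation. The sketch below simulates the classifier posteriors with random numbers; the models, thresholds, and selection sizes used in the paper are not reproduced.

```python
# Sketch of the retrieval idea: score candidate clips with a pre-trained emotion
# model (simulated here), then keep an equal number of top-scoring clips per
# emotion so that the set sent to crowdsourcing is balanced.
import numpy as np

EMOTIONS = ["happy", "angry", "sad", "neutral"]
rng = np.random.default_rng(0)

# Toy stand-in for classifier posteriors over 1,000 downloaded clips.
scores = rng.dirichlet(np.ones(len(EMOTIONS)), size=1000)

def balanced_selection(scores, per_emotion=50):
    """Pick the `per_emotion` clips with the highest posterior for each emotion."""
    selected = {}
    for e_idx, emotion in enumerate(EMOTIONS):
        ranked = np.argsort(scores[:, e_idx])[::-1]
        selected[emotion] = ranked[:per_emotion].tolist()
    return selected

batch = balanced_selection(scores)
print({e: len(ids) for e, ids in batch.items()})   # clips forwarded to annotators
```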


Journal ArticleDOI
TL;DR: A new spatio-temporal feature representation learning method for FER that is robust to expression intensity variations is proposed; it achieved higher recognition rates on both datasets compared to state-of-the-art methods.
Abstract: Facial expression recognition (FER) is increasingly gaining importance in various emerging affective computing applications. In practice, achieving accurate FER is challenging due to the large amount of inter-personal variation, such as expression intensity variations. In this paper, we propose a new spatio-temporal feature representation learning method for FER that is robust to expression intensity variations. The proposed method utilizes representative expression-states (e.g., onset, apex and offset of expressions) which can be specified in facial sequences regardless of the expression intensity. The characteristics of facial expressions are encoded in two parts in this paper. In the first part, spatial image characteristics of the representative expression-state frames are learned via a convolutional neural network. Five objective terms are proposed to improve the expression class separability of the spatial feature representation. In the second part, temporal characteristics of the spatial feature representation from the first part are learned with a long short-term memory (LSTM) network over the facial expression sequence. Comprehensive experiments have been conducted on a deliberate expression dataset (MMI) and a spontaneous micro-expression dataset (CASME II). Experimental results showed that the proposed method achieved higher recognition rates on both datasets compared to the state-of-the-art methods.

185 citations
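The two-part encoding, a CNN over representative expression-state frames followed by an LSTM over their sequence, can be sketched as follows in PyTorch. The frame encoder, dimensions, and number of classes are illustrative assumptions; the paper's five objective terms are not included.

```python
# Sketch of the two-part encoding: a small CNN extracts spatial features from
# representative expression-state frames, and an LSTM models their temporal order.
import torch
import torch.nn as nn

class SpatioTemporalFER(nn.Module):
    def __init__(self, n_classes=6, feat_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(                      # illustrative frame encoder
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)
        self.cls = nn.Linear(64, n_classes)

    def forward(self, frames):                         # frames: (B, T, 1, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)
        return self.cls(h_n[-1])

model = SpatioTemporalFER()
clip = torch.randn(4, 3, 1, 96, 96)    # e.g., onset / apex / offset frames
print(model(clip).shape)               # (4, 6) expression logits
```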


Journal ArticleDOI
TL;DR: This paper presents a multimodal emotion recognition system, which is based on the analysis of audio and visual cues, and defines the current state-of-the-art in all three databases.
Abstract: This paper presents a multimodal emotion recognition system, which is based on the analysis of audio and visual cues. From the audio channel, Mel-Frequency Cepstral Coefficients, Filter Bank Energies and prosodic features are extracted. For the visual part, two strategies are considered. First, facial landmarks’ geometric relations, i.e., distances and angles, are computed. Second, we summarize each emotional video into a reduced set of key-frames, which are taught to visually discriminate between the emotions. In order to do so, a convolutional neural network is applied to key-frames summarizing videos. Finally, confidence outputs of all the classifiers from all the modalities are used to define a new feature space to be learned for final emotion label prediction, in a late fusion/stacking fashion. The experiments conducted on the SAVEE, eNTERFACE’05, and RML databases show significant performance improvements by our proposed system in comparison to current alternatives, defining the current state-of-the-art in all three databases.

166 citations
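The late fusion/stacking step, in which per-modality classifier confidences form a new feature space for a meta-classifier, can be sketched with scikit-learn as below. The audio and visual feature extractors are replaced by synthetic features, and for brevity the meta-classifier is trained on the same training predictions; in practice held-out folds would be used to avoid leakage.

```python
# Sketch of late fusion by stacking: each modality's classifier produces class
# probabilities, and a meta-classifier is trained on the concatenated confidences.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, n_classes = 600, 6
y = rng.integers(0, n_classes, size=n)
X_audio = rng.standard_normal((n, 40)) + y[:, None] * 0.1    # toy modality features
X_video = rng.standard_normal((n, 60)) + y[:, None] * 0.1

Xa_tr, Xa_te, Xv_tr, Xv_te, y_tr, y_te = train_test_split(
    X_audio, X_video, y, test_size=0.3, random_state=0)

audio_clf = LogisticRegression(max_iter=1000).fit(Xa_tr, y_tr)
video_clf = LogisticRegression(max_iter=1000).fit(Xv_tr, y_tr)

def stack(Xa, Xv):
    """Concatenate per-modality class probabilities into a new feature space."""
    return np.hstack([audio_clf.predict_proba(Xa), video_clf.predict_proba(Xv)])

meta = LogisticRegression(max_iter=1000).fit(stack(Xa_tr, Xv_tr), y_tr)
print("fused accuracy:", meta.score(stack(Xa_te, Xv_te), y_te))
```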


Journal ArticleDOI
TL;DR: A Graph Convolutional Broad Network (GCB-net) is designed for EEG emotion recognition and a broad learning system (BLS) is applied to enhance its features; the experimental results show the robust classification ability of GCB-net and BLS.
Abstract: In recent years, emotion recognition has become a research focus in the area of artificial intelligence. EEG data can be analyzed much more efficiently by applying graph-based algorithms or models. In this work, a Graph Convolutional Broad Network (GCB-net) was designed for exploring the deeper-level information of graph-structured data. It uses a graph convolutional layer to extract features of the graph-structured input and stacks multiple regular convolutional layers to extract relatively abstract features. The final concatenation utilizes the broad concept, which preserves the outputs of all hierarchical layers and allows the model to search features in broad spaces. For comparison, two individual experiments were conducted to examine the efficiency of the proposed GCB-net on the SJTU emotion EEG dataset (SEED) and the DREAMER dataset, respectively. On SEED, compared with other state-of-the-art methods, GCB-net achieved better accuracy (reaching 94.24%) on the DE feature of the all-frequency band. On the DREAMER dataset, GCB-net performed better than other models with the same setting. Furthermore, GCB-net reached high accuracies of 81.95%, 84.28% and 84.35% on the Valence, Arousal and Dominance dimensions, respectively, by working with BLS. The experimental results show the robust classification ability of GCB-net and BLS in EEG emotion recognition.

160 citations
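The graph convolutional layer at the core of GCB-net propagates electrode features over an adjacency graph. The sketch below shows one standard graph convolution step with a symmetrically normalized adjacency matrix; the electrode graph is a random toy example, and neither the stacked broad structure nor the BLS is reproduced.

```python
# Sketch of a single graph-convolutional layer over EEG electrodes:
# features are propagated through a symmetrically normalized adjacency matrix.
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        a_hat = adjacency + torch.eye(adjacency.shape[0])      # add self-loops
        deg = a_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        self.register_buffer("a_norm", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                  # x: (batch, n_electrodes, in_dim)
        return torch.relu(self.linear(self.a_norm @ x))

n_electrodes, feat_dim = 62, 5             # e.g., 62 channels, 5 DE band features
adj = (torch.rand(n_electrodes, n_electrodes) > 0.8).float()
adj = ((adj + adj.T) > 0).float()          # symmetric toy electrode graph
layer = GraphConvLayer(feat_dim, 32, adj)
x = torch.randn(8, n_electrodes, feat_dim)
print(layer(x).shape)                      # (8, 62, 32)
```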


Journal ArticleDOI
TL;DR: A discriminative spatiotemporal local binary pattern based on an integral projection is proposed to resolve the problems of STLBP for micro-expression recognition, together with a new Laplacian-based feature selection method for increasing the discrimination of micro-expressions.
Abstract: Recently, there has been increasing interest in inferring micro-expressions from facial image sequences. Due to the subtle facial movement of micro-expressions, feature extraction has become an important and critical issue for spontaneous facial micro-expression recognition. Recent works used spatiotemporal local binary patterns (STLBP) for micro-expression recognition and considered dynamic texture information to represent face images. However, they miss the shape attribute of face images. In addition, they extract the spatiotemporal features from the global face regions while ignoring the discriminative information between two micro-expression classes. The above-mentioned problems seriously limit the application of STLBP to micro-expression recognition. In this paper, we propose a discriminative spatiotemporal local binary pattern based on an integral projection to resolve the problems of STLBP for micro-expression recognition. First, we revisit integral projection for preserving the shape attribute of micro-expressions by using robust principal component analysis. Furthermore, the revisited integral projection is incorporated with local binary patterns across the spatial and temporal domains. Specifically, we extract novel spatiotemporal features incorporating shape attributes into spatiotemporal texture features. To increase the discrimination of micro-expressions, we propose a new Laplacian-based feature selection method to extract the discriminative information for facial micro-expression recognition. Intensive experiments are conducted on three publicly available micro-expression databases: CASME, CASME II and SMIC. We compare our method with state-of-the-art algorithms. Experimental results demonstrate that our proposed method achieves promising performance for micro-expression recognition.

133 citations


Journal ArticleDOI
TL;DR: The review outlines methods and algorithms for visual feature extraction, dimensionality reduction, decision methods for classification and regression approaches, as well as different fusion strategies, for automatic depression assessment utilizing visual cues alone or in combination with vocal or verbal cues.
Abstract: Automatic depression assessment based on visual cues is a rapidly growing research domain. The present exhaustive review of existing approaches as reported in over sixty publications during the last ten years focuses on image processing and machine learning algorithms. Visual manifestations of depression, various procedures used for data collection, and existing datasets are summarized. The review outlines methods and algorithms for visual feature extraction, dimensionality reduction, decision methods for classification and regression approaches, as well as different fusion strategies. A quantitative meta-analysis of reported results, relying on performance metrics robust to chance, is included, identifying general trends and key unresolved issues to be considered in future studies of automatic depression assessment utilizing visual cues alone or in combination with vocal or verbal cues.

Journal ArticleDOI
TL;DR: This paper explores the temporal features associated with facial micro-movements and proposes fuzzy histogram of optical flow orientation (FHOFO) features for recognition of micro-expressions and discusses the effect of inclusion and exclusion of the motion magnitudes during FHOFO feature extraction.
Abstract: In high-stake situations, micro-expressions reveal the hidden emotions of a person, which has potential applications in many areas. The recognition of such short-lived subtle expressions is a challenging task. The literature proposes several spatio-temporal features to encode the subtle changes on the face during a micro-expression. The spatial changes are almost indistinguishable as the facial appearance does not change appreciably. However, these changes possess a temporal pattern. This paper explores the temporal features associated with facial micro-movements and proposes fuzzy histogram of optical flow orientation (FHOFO) features for recognition of micro-expressions. FHOFO constructs suitable angular histograms from optical flow vector orientations using histogram fuzzification to encode the temporal pattern for classifying micro-expressions. We also discuss the effect of including or excluding the motion magnitudes during FHOFO feature extraction. Repeated experiments on publicly available databases demonstrate that the performance of FHOFO is consistent and close to, or at times even better than, state-of-the-art techniques.
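The core of FHOFO, an orientation histogram of dense optical flow built with soft (fuzzy) binning, can be sketched as below using a simple triangular membership over neighbouring bins; the paper's exact membership functions and block layout are not reproduced, and the frames are toy data.

```python
# Sketch of a fuzzy orientation histogram over dense optical flow: each flow
# vector's angle contributes to its two nearest bins with triangular weights,
# and the motion magnitudes can optionally be used as weights.
import numpy as np
import cv2

def fuzzy_flow_histogram(prev_gray, next_gray, n_bins=8, use_magnitude=False):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    angles = np.arctan2(flow[..., 1], flow[..., 0]) % (2 * np.pi)
    mags = np.linalg.norm(flow, axis=-1)
    pos = angles / (2 * np.pi) * n_bins                 # fractional bin position
    lower = np.floor(pos).astype(int) % n_bins
    upper = (lower + 1) % n_bins
    w_upper = pos - np.floor(pos)                       # triangular memberships
    w_lower = 1.0 - w_upper
    if use_magnitude:
        w_lower, w_upper = w_lower * mags, w_upper * mags
    hist = np.zeros(n_bins)
    np.add.at(hist, lower.ravel(), w_lower.ravel())
    np.add.at(hist, upper.ravel(), w_upper.ravel())
    return hist / (hist.sum() + 1e-8)

rng = np.random.default_rng(0)
f1 = rng.integers(0, 255, (64, 64), dtype=np.uint8)     # toy frames
f2 = np.roll(f1, 1, axis=1)                             # small horizontal motion
print(fuzzy_flow_histogram(f1, f2))
```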

Journal ArticleDOI
TL;DR: The proposed method, denoted by R2G-STNN, consists of spatial and temporal neural network models with regional to global hierarchical feature learning process to learn discriminative spatial-temporal EEG features.
Abstract: In this paper, we propose a novel electroencephalograph (EEG) emotion recognition method inspired by neuroscience with respect to the brain's response to different emotions. The proposed method, denoted R2G-STNN, consists of spatial and temporal neural network models with a regional-to-global hierarchical feature learning process to learn discriminative spatial-temporal EEG features. To learn the spatial features, a bidirectional long short-term memory (BiLSTM) network is adopted to capture the intrinsic spatial relationships of EEG electrodes within and between brain regions. Considering that different brain regions play different roles in EEG emotion recognition, a region-attention layer is also introduced into the R2G-STNN model to learn a set of weights to strengthen or weaken the contributions of brain regions. Based on the spatial feature sequences, a BiLSTM is adopted to learn both regional and global spatial-temporal features, and the features are fed into a classifier layer for learning emotion-discriminative features, in which a domain discriminator working cooperatively with the classifier is used to decrease the domain shift between training and testing data. Finally, to evaluate the proposed method, we conduct both subject-dependent and subject-independent EEG emotion recognition experiments on the SEED database, and the experimental results show that the proposed method achieves state-of-the-art performance.

Journal ArticleDOI
TL;DR: The state of the art of pain recognition technology is assessed and guidance is provided for researchers to help advance the field, identifying underexplored areas such as chronic pain and connections to treatments, as well as promising opportunities for continued advances.
Abstract: Automated tools for pain assessment have great promise but have not yet become widely used in clinical practice. In this survey paper, we review the literature that proposes and evaluates automatic pain recognition approaches, and discuss challenges and promising directions for advancing this field. Prior to that, we give an overview on pain mechanisms and responses, discuss common clinically used pain assessment tools, and address shared datasets and the challenge of validation in the context of pain recognition.

Journal ArticleDOI
TL;DR: This work applies novel deep-learning-based methods to various bio-sensing and video data of four publicly available multi-modal emotion datasets and proposes a new technique towards identifying salient brain regions corresponding to various affective states.
Abstract: In recent years, the use of bio-sensing signals such as electroencephalogram (EEG), electrocardiogram (ECG) etc. have garnered interest towards applications in affective computing. The parallel trend of deep learning has led to a huge leap in performance towards solving various vision-based research problems such as object detection. Yet, these advances in deep learning have not adequately translated into bio-sensing research. This work applies novel deep-learning-based methods to various bio-sensing and video data of four publicly available multi-modal emotion datasets. For each dataset, we first individually evaluate the emotion-classification performance obtained by each modality. We then evaluate the performance obtained by fusing the features from these modalities. We show that our algorithms outperform the results reported by other studies for emotion/valence/arousal/liking classification on DEAP and MAHNOB-HCI datasets and set up benchmarks for the newer AMIGOS and DREAMER datasets. We also evaluate the performance of our algorithms by combining the datasets and by using transfer learning to show that the proposed method overcomes the inconsistencies between the datasets. Hence, we do a thorough analysis on multi-modal affective data from more than 120 subjects and 2,800 trials. Finally, utilizing a convolution-deconvolution network, we propose a new technique towards identifying salient brain regions corresponding to various affective states.

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed an end-to-end deep learning based attention and relation learning framework for AU detection with only AU labels, where multi-scale features shared by each AU are learned firstly, and then both channel-wise and spatial attentions are adaptively learned to select and extract AU-related local features.
Abstract: Attention mechanism has recently attracted increasing attentions in the field of facial action unit (AU) detection. By finding the region of interest of each AU with the attention mechanism, AU-related local features can be captured. Most of the existing attention based AU detection works use prior knowledge to predefine fixed attentions or refine the predefined attentions within a small range, which limits their capacity to model various AUs. In this paper, we propose an end-to-end deep learning based attention and relation learning framework for AU detection with only AU labels, which has not been explored before. In particular, multi-scale features shared by each AU are learned firstly, and then both channel-wise and spatial attentions are adaptively learned to select and extract AU-related local features. Moreover, pixel-level relations for AUs are further captured to refine spatial attentions so as to extract more relevant local features. Without changing the network architecture, our framework can be easily extended for AU intensity estimation. Extensive experiments show that our framework (i) soundly outperforms the state-of-the-art methods for both AU detection and AU intensity estimation on the challenging BP4D, DISFA, FERA 2015 and BP4D+ benchmarks, (ii) can adaptively capture the correlated regions of each AU, and (iii) also works well under severe occlusions and large poses.
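The adaptively learned channel-wise and spatial attentions can be illustrated with a squeeze-and-excitation-style channel gate followed by a 1x1-convolution spatial map, as in the generic sketch below; it is not the paper's full attention and relation learning framework, and all dimensions are assumptions.

```python
# Sketch of channel-wise and spatial attention over a feature map, in the spirit
# of selecting AU-related local features; a generic illustration, not the paper's
# exact attention or pixel-level relation modules.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_gate = nn.Sequential(              # squeeze-and-excitation style
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(              # 1x1 conv attention map
            nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid(),
        )

    def forward(self, x):                               # x: (B, C, H, W)
        c = self.channel_gate(x).unsqueeze(-1).unsqueeze(-1)
        x = x * c                                       # re-weight channels
        s = self.spatial_gate(x)                        # (B, 1, H, W) attention
        return x * s, s

att = ChannelSpatialAttention(64)
feat = torch.randn(2, 64, 28, 28)
out, spatial_map = att(feat)
print(out.shape, spatial_map.shape)
```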

Journal ArticleDOI
TL;DR: This work investigates the influence of corpus, domain, and gender on the cross-corpus generalizability of emotion recognition systems using a multi-task learning approach and finds that incorporating variability caused by corpus, Domain, and Gender through multi- task learning outperforms approaches that treat the tasks as either identical or independent.
Abstract: There is growing interest in emotion recognition due to its potential in many applications. However, a pervasive challenge is the presence of data variability caused by factors such as differences across corpora, speaker’s gender, and the “domain” of expression (e.g., whether the expression is spoken or sung). Prior work has addressed this challenge by combining data across corpora and/or genders, or by explicitly controlling for these factors. In this work, we investigate the influence of corpus, domain, and gender on the cross-corpus generalizability of emotion recognition systems. We use a multi-task learning approach, where we define the tasks according to these factors. We find that incorporating variability caused by corpus, domain, and gender through multi-task learning outperforms approaches that treat the tasks as either identical or independent. Domain is a larger differentiating factor than gender for multi-domain data. When considering only the speech domain, gender and corpus are similarly influential. Defining tasks by gender is more beneficial than by either corpus or corpus and gender for valence, while the opposite holds for activation. On average, cross-corpus performance increases with the number of training corpora. The results demonstrate that effective cross-corpus modeling requires that we understand how emotion expression patterns change as a function of non-emotional factors.
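The multi-task setup, with tasks defined by corpus, domain, and gender, can be sketched as a shared encoder with one prediction head per task. The task names, input dimensionality, and layer sizes below are illustrative assumptions, not the configuration used in the paper.

```python
# Sketch of multi-task emotion recognition: a shared feature encoder with one
# prediction head per task (e.g., tasks defined by corpus, domain, or gender),
# so that shared structure is learned while task-specific variability is kept.
import torch
import torch.nn as nn

class MultiTaskEmotionModel(nn.Module):
    def __init__(self, in_dim, task_names, n_outputs=1):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                    nn.Linear(128, 64), nn.ReLU())
        self.heads = nn.ModuleDict(
            {t: nn.Linear(64, n_outputs) for t in task_names})

    def forward(self, x, task):
        return self.heads[task](self.shared(x))

tasks = ["corpusA_spoken_female", "corpusA_spoken_male", "corpusB_sung_female"]
model = MultiTaskEmotionModel(in_dim=88, task_names=tasks)   # hypothetical feature size
x = torch.randn(16, 88)
valence_pred = model(x, task="corpusA_spoken_female")        # one regression output
print(valence_pred.shape)
```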

Journal ArticleDOI
TL;DR: The Multimodal Human-Human-Robot-Interaction (MHHRI) dataset as mentioned in this paper was proposed to study personality simultaneously in human-human interactions and human-robot interactions and its relationship with engagement.
Abstract: In this paper we introduce a novel dataset, the Multimodal Human-Human-Robot-Interactions (MHHRI) dataset, with the aim of studying personality simultaneously in human-human interactions (HHI) and human-robot interactions (HRI) and its relationship with engagement. Multimodal data was collected during a controlled interaction study where dyadic interactions between two human participants and triadic interactions between two human participants and a robot took place with interactants asking a set of personal questions to each other. Interactions were recorded using two static and two dynamic cameras as well as two biosensors, and meta-data was collected by having participants fill in two types of questionnaires, for assessing their own personality traits and their perceived engagement with their partners (self labels) and for assessing personality traits of the other participants partaking in the study (acquaintance labels). As a proof of concept, we present baseline results for personality and engagement classification. Our results show that (i) trends in personality classification performance remain the same with respect to the self and the acquaintance labels across the HHI and HRI settings; (ii) for extroversion, the acquaintance labels yield better results as compared to the self labels; (iii) in general, multi-modality yields better performance for the classification of personality traits.

Journal ArticleDOI
Peng Song
TL;DR: A novel transfer linear subspace learning (TLSL) framework to learn a common feature subspace for source and target datasets for cross-corpus speech emotion recognition is presented, and two kinds of TLSL approaches are proposed, called transfer unsupervised linear subspace learning (TULSL) and transfer supervised linear subspace learning (TSLSL), and the corresponding solutions for the optimization problems are provided.
Abstract: Speech emotion recognition has received an increasing interest in recent years, which is often conducted on the assumption that speech utterances in training and testing datasets are obtained under the same conditions. However, in reality, this assumption does not hold as the speech data are often collected from different devices or environments. Hence, there exists discrepancy between the training and testing data, which will have an adverse effect on recognition performance. In this paper, we examine the problem of cross-corpus speech emotion recognition. To address it, we present a novel transfer linear subspace learning (TLSL) framework to learn a common feature subspace for source and target datasets. In TLSL, a nearest neighbor graph algorithm is used to measure the similarity between different corpora, and a feature grouping strategy is introduced to divide the emotional features into two categories, i.e., high transferable part (HTP) versus low transferable part (LTP). To explore the proposed TLSL with different scenarios, we propose two kinds of TLSL approaches, called transfer unsupervised linear subspace learning (TULSL) and transfer supervised linear subspace learning (TSLSL), and provide the corresponding solutions for the optimization problems. Extensive experiments on several benchmark datasets validate the effectiveness of TLSL for cross-corpus speech emotion recognition.

Journal ArticleDOI
TL;DR: A novel paradigm for online emotion classification is provided, which exploits both audio and visual modalities and produces a responsive prediction when the system is confident enough, and is evaluated against other state-of-the-art models.
Abstract: The advancement of Human-Robot Interaction (HRI) drives research into the development of advanced emotion identification architectures that fathom audio-visual (A-V) modalities of human emotion. State-of-the-art methods in multi-modal emotion recognition mainly focus on the classification of complete video sequences, leading to systems with no online capabilities. Such techniques are capable of predicting emotions only when the videos are concluded, thus restricting their applicability in practical scenarios. The paper at hand provides a novel paradigm for online emotion classification, which exploits both audio and visual modalities and produces a responsive prediction when the system is confident enough. We propose two deep Convolutional Neural Network (CNN) models for extracting emotion features, one for each modality, and a Deep Neural Network (DNN) for their fusion. In order to capture the temporal quality of human emotion in interactive scenarios, we train in cascade a Long Short-Term Memory (LSTM) layer and a Reinforcement Learning (RL) agent that monitors the speaker and decides when to stop feature extraction and make the final prediction. The comparison of our results on two publicly available A-V emotional datasets, RML and BAUM-1s, against other state-of-the-art models demonstrates the beneficial capabilities of our work.
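The online behaviour, emitting a prediction as soon as the system is confident enough, can be approximated by accumulating per-frame class probabilities and stopping at a confidence threshold; the simple rule below stands in for the paper's RL stopping agent, and the per-frame probabilities are toy data.

```python
# Sketch of responsive online prediction: accumulate per-frame class probabilities
# and stop as soon as the running estimate is confident enough. A threshold rule
# stands in for the paper's reinforcement-learning stopping agent.
import numpy as np

def online_predict(frame_probs, threshold=0.8):
    """frame_probs: iterable of per-frame class-probability vectors."""
    running = None
    for t, p in enumerate(frame_probs, start=1):
        running = p if running is None else running + p
        posterior = running / running.sum()
        if posterior.max() >= threshold:                # confident: answer early
            return int(posterior.argmax()), t
    return int(posterior.argmax()), t                   # fall back to full sequence

rng = np.random.default_rng(0)
T, n_classes, true_class = 50, 6, 2
probs = rng.dirichlet(np.ones(n_classes), size=T)
probs[:, true_class] += 0.4                             # toy evidence for class 2
probs /= probs.sum(axis=1, keepdims=True)

label, frames_used = online_predict(probs)
print(f"predicted class {label} after {frames_used} of {T} frames")
```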

Journal ArticleDOI
TL;DR: The MorpheuS music generation system, presented, has the ability to generate polyphonic pieces with a given tension profile and long- and short-term repeated pattern structures and is particularly useful in a game or film music context.
Abstract: Automatic music generation systems have gained in popularity and sophistication as advances in cloud computing have enabled large-scale complex computations such as deep models and optimization algorithms on personal devices. Yet, they still face an important challenge, that of long-term structure, which is key to conveying a sense of musical coherence. We present the MorpheuS music generation system designed to tackle this problem. MorpheuS’ novel framework has the ability to generate polyphonic pieces with a given tension profile and long- and short-term repeated pattern structures. A mathematical model for tonal tension quantifies the tension profile and state-of-the-art pattern detection algorithms extract repeated patterns in a template piece. An efficient optimization metaheuristic, variable neighborhood search, generates music by assigning pitches that best fit the prescribed tension profile to the template rhythm while hard constraining long-term structure through the detected patterns. This ability to generate affective music with specific tension profile and long-term structure is particularly useful in a game or film music context. Music generated by the MorpheuS system has been performed live in concerts.
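The optimization step, assigning pitches so that the generated piece follows a prescribed tension profile, can be illustrated with a toy variable neighborhood search over a short pitch sequence. The "tension" proxy, neighborhoods, and target profile below are deliberate simplifications of MorpheuS's tonal-tension model and pattern constraints.

```python
# Toy variable neighborhood search: assign pitches to a fixed rhythm so that a
# crude "tension" proxy (absolute melodic interval size) tracks a target profile.
# Illustrative only; MorpheuS uses a tonal-tension model and pattern constraints.
import random

random.seed(0)
LENGTH, PITCH_RANGE = 16, range(60, 73)                  # 16 notes, C4..C5 (MIDI)
target_tension = [abs(i - LENGTH / 2) / (LENGTH / 2) for i in range(LENGTH)]

def cost(pitches):
    tension = [abs(b - a) / 12 for a, b in zip(pitches, pitches[1:])] + [0.0]
    return sum((t - g) ** 2 for t, g in zip(tension, target_tension))

def neighbor(pitches, k):
    """k-th neighborhood: re-draw k random positions."""
    new = list(pitches)
    for i in random.sample(range(LENGTH), k):
        new[i] = random.choice(PITCH_RANGE)
    return new

best = [random.choice(PITCH_RANGE) for _ in range(LENGTH)]
for _ in range(2000):
    for k in (1, 2, 3):                                  # switch neighborhood sizes
        cand = neighbor(best, k)
        if cost(cand) < cost(best):
            best = cand
            break                                        # restart from smallest move
print("cost:", round(cost(best), 3), "pitches:", best)
```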

Journal ArticleDOI
TL;DR: Results show that gameplay data from strategy games can be used to predict various personality features and to classify a player, after a gameplay session, into one of the two profiles.
Abstract: Computer games provide an ideal test bed to collect and study data related to human behavior using a virtual environment with real-world-like features. Studies of individual players' actions in a gaming session and how these correlate with their real-life personality have the potential to reveal great insights in the field of affective computing. This study profiles players using data collected from strategy games. This is done by taking into account the gameplay and the associations between the personality traits and the subjects playing the game. This study uses two benchmark strategy game datasets, namely StarCraft and World of Warcraft. In addition, the study also uses Age of Empires II game data collected from 50 participants. The IPIP-NEO-120 personality test is conducted with these participants to evaluate them on the Big-Five personality traits. The three datasets are profiled using four clustering techniques. The results identify two clusters in each of these datasets. The quality of cluster formation is also evaluated through cluster evaluation indices. Using the clustering results, classifiers are then trained to classify a player, after a gameplay session, into one of the two profiles. Results show that gameplay can be used to predict various personality features using strategy game data.
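The profiling pipeline, clustering gameplay features into two player profiles and then training a classifier to assign players to a profile after a gameplay session, can be sketched with scikit-learn. The gameplay features below are synthetic; the study itself uses StarCraft, World of Warcraft, and Age of Empires II data with four clustering techniques.

```python
# Sketch of the profiling pipeline: cluster gameplay features into two player
# profiles, then train a classifier to assign a new player to one of them.
# Synthetic features only; the real study uses strategy-game telemetry.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy gameplay features: e.g., actions per minute, units built, aggression ratio.
gameplay = np.vstack([rng.normal(0, 1, (100, 6)), rng.normal(2, 1, (100, 6))])

X = StandardScaler().fit_transform(gameplay)
profiles = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("profile-prediction accuracy:",
      cross_val_score(clf, X, profiles, cv=5).mean())
```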

Journal ArticleDOI
TL;DR: A review of existing vision-based approaches for apparent personality trait recognition can be found in this paper, where the authors describe seminal and cutting edge works on the subject, discussing and comparing their distinctive features and limitations.
Abstract: Personality analysis has been widely studied in psychology, neuropsychology, and signal processing fields, among others. Over the past few years, it has also become an attractive research area in visual computing. From the computational point of view, speech and text have by far been the most considered cues of information for analyzing personality. However, recently there has been an increasing interest from the computer vision community in analyzing personality from visual data. Recent computer vision approaches are able to accurately analyze human faces, body postures and behaviors, and use this information to infer apparent personality traits. Because of the overwhelming research interest in this topic, and of the potential impact that this sort of method could have on society, we present in this paper an up-to-date review of existing vision-based approaches for apparent personality trait recognition. We describe seminal and cutting-edge works on the subject, discussing and comparing their distinctive features and limitations. Future avenues of research in the field are identified and discussed. Furthermore, aspects of subjectivity in data labeling/evaluation, as well as current datasets and challenges organized to push research in the field, are reviewed.

Journal ArticleDOI
TL;DR: DepecheMood++ as discussed by the authors is an extension of an existing and widely used emotion lexicon for English and a novel version of the lexicon, targeting Italian, which can be used to boost performance on datasets and tasks of varying degree of domain-specificity.
Abstract: Several lexica for sentiment analysis have been developed; while most of these come with word polarity annotations (e.g., positive/negative), attempts at building lexica for finer-grained emotion analysis (e.g., happiness, sadness) have recently attracted significant attention. They are often exploited as a building block for developing emotion recognition learning models, and/or used as baselines to which the performance of the models can be compared. In this work, we contribute two new resources, that we call DepecheMood++ (DM++): a) an extension of an existing and widely used emotion lexicon for English; and b) a novel version of the lexicon, targeting Italian. Furthermore, we show how simple techniques can be used, both in supervised and unsupervised experimental settings, to boost performance on datasets and tasks of varying degree of domain-specificity. Also, we report an extensive comparative analysis against other available emotion lexica and state-of-the-art supervised approaches, showing that DepecheMood++ emerges as the best-performing non-domain-specific lexicon in unsupervised settings. We also observe that simple learning models on top of DM++ can provide more challenging baselines. We finally introduce embedding-based methodologies to perform a) vocabulary expansion to address data scarcity and b) vocabulary porting to new languages in case training data is not available.
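Two of the uses described, unsupervised emotion scoring with a word-emotion lexicon and embedding-based vocabulary expansion, can be sketched as follows. The lexicon rows and word vectors are tiny toy placeholders, not the DepecheMood++ resources or the embeddings used in the paper.

```python
# Sketch of (a) unsupervised emotion scoring with a word-emotion lexicon and
# (b) vocabulary expansion by copying scores from embedding-space neighbours.
# The lexicon rows and embeddings below are toy placeholders, not DepecheMood++.
import numpy as np

EMOTIONS = ["joy", "sadness", "fear"]
lexicon = {                                   # word -> scores over EMOTIONS
    "wonderful": np.array([0.80, 0.10, 0.10]),
    "funeral":   np.array([0.05, 0.85, 0.10]),
    "threat":    np.array([0.05, 0.15, 0.80]),
}
embeddings = {                                # toy 3-d word vectors
    "wonderful": np.array([0.9, 0.1, 0.0]),
    "funeral":   np.array([0.0, 0.9, 0.2]),
    "threat":    np.array([0.1, 0.2, 0.9]),
    "menace":    np.array([0.1, 0.3, 0.85]),  # out-of-lexicon word to be added
}

def score_text(tokens):
    """Average the lexicon rows of the tokens that are covered."""
    rows = [lexicon[t] for t in tokens if t in lexicon]
    return dict(zip(EMOTIONS, np.mean(rows, axis=0))) if rows else None

def expand(word, k=1):
    """Give an out-of-lexicon word the average scores of its k nearest neighbours."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = sorted(lexicon, key=lambda w: -cos(embeddings[word], embeddings[w]))
    lexicon[word] = np.mean([lexicon[w] for w in sims[:k]], axis=0)

expand("menace")
print(score_text(["menace", "funeral"]))
```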

Journal ArticleDOI
TL;DR: In terms of emotion classification rate, the proposed region switching based classification approach shows significant improvement in comparison to the approach of processing the entire active speech region, and it outperforms other state-of-the-art approaches on all three databases.
Abstract: In this work, a novel region switching based classification method is proposed for speech emotion classification using vowel-like regions (VLRs) and non-vowel-like regions (non-VLRs). In the literature, normally the entire active speech region is processed for emotion classification. A few studies have been performed on segmented sound units, such as syllables, phones, vowels, consonants and voiced regions, for speech emotion classification. This work presents a detailed analysis of the emotion information contained independently in segmented VLRs and non-VLRs. The proposed region switching based method is implemented by choosing the features of either VLRs or non-VLRs for each emotion. The VLRs are detected by identifying hypothesized VLR onset and end points. Segmentation of non-VLRs is done by using the knowledge of VLRs and active speech regions. The performance is evaluated using the EMODB, IEMOCAP and FAU AIBO databases. Experimental results show that both the VLRs and non-VLRs contain emotion-specific information. In terms of emotion classification rate, the proposed region switching based classification approach shows significant improvement in comparison to the approach of processing the entire active speech region, and it outperforms other state-of-the-art approaches on all three databases.

Journal ArticleDOI
TL;DR: This paper surveys new methods for stress assessment, focusing especially on those that are suited for the workplace: one of today's major sources of stress.
Abstract: The topic of stress is nowadays a very important one, not only in research but in social life in general. People are increasingly aware of this problem and its consequences at several levels: health, social life, work, quality of life, etc. This has resulted in a significant increase in the search for devices and applications to measure and manage stress in real time. Recent technological and scientific evolution fosters this interest with the development of new methods and approaches. In this paper we survey these new methods for stress assessment, focusing especially on those that are suited for the workplace: one of today's major sources of stress. We contrast them with more traditional methods and compare them between themselves, evaluating nine characteristics. Given the diversity of methods that exist nowadays, this work facilitates the stakeholders' decision towards which one to use, based on how much their organization values aspects such as privacy, accuracy, cost-effectiveness or intrusiveness.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a multiscale deep convolutional long short-term memory (LSTM) framework for spontaneous speech emotion recognition, where a deep CNN model was used to learn segment-level features on the basis of the created image-like three channels of spectrograms.
Abstract: Recently, emotion recognition in real sceneries such as in the wild has attracted extensive attention in affective computing, because existing spontaneous emotions in real sceneries are more challenging and difficult to identify than other emotions. Motivated by the diverse effects of different lengths of audio spectrograms on emotion identification, this paper proposes a multiscale deep convolutional long short-term memory (LSTM) framework for spontaneous speech emotion recognition. Initially, a deep convolutional neural network (CNN) model is used to learn deep segment-level features on the basis of the created image-like three channels of spectrograms. Then, a deep LSTM model is adopted on the basis of the learned segment-level CNN features to capture the temporal dependency among all divided segments in an utterance for utterance-level emotion recognition. Finally, different emotion recognition results, obtained by combining CNN with LSTM at multiple lengths of segment-level spectrograms, are integrated by using a score-level fusion strategy. Experimental results on two challenging spontaneous emotional datasets, i.e., the AFEW5.0 and BAUM-1s databases, demonstrate the promising performance of the proposed method, outperforming state-of-the-art methods.

Journal ArticleDOI
TL;DR: This is the first work which statistically demonstrates that taking into account negation significantly improves the polarity classification of Spanish tweets, and can greatly improve the accuracy of the final system.
Abstract: Polarity classification is a well-known Sentiment Analysis task. However, most research has been oriented towards developing supervised or unsupervised systems without paying much attention to certain linguistic phenomena such as negation. In this paper we focus on this specific issue in order to demonstrate that dealing with negation can improve the final system. Although there are some studies of negation detection, most of them deal with English documents. In contrast, our study is focused on the scope of negation in Spanish Sentiment Analysis. Thus, we have built an unsupervised polarity classification system based on integrating external knowledge. In order to evaluate the influence of negation, we have implemented a specific module for negation detection by applying several rules. The system has been tested with and without considering negation, using a corpus of tweets written in Spanish. The results obtained reveal that the treatment of negation can greatly improve the accuracy of the final system. Moreover, we have carried out a comprehensive statistical study to support our approach. To the best of our knowledge, this is the first work which statistically demonstrates that taking negation into account significantly improves the polarity classification of Spanish tweets.
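A minimal illustration of the negation-handling idea: lexicon words that fall inside the scope of a negation cue have their polarity inverted before the tweet's overall polarity is computed. The cue list, fixed-width scope, and tiny lexicon below are simplified stand-ins for the paper's rule set and external knowledge.

```python
# Sketch of rule-based negation handling in Spanish polarity classification:
# words inside the scope of a negation cue have their lexicon polarity inverted.
# The cue list, fixed-width scope and tiny lexicon are simplified stand-ins.
NEGATION_CUES = {"no", "nunca", "jamás", "ni", "sin"}
POLARITY = {"buena": 1, "bueno": 1, "excelente": 1, "feliz": 1,
            "malo": -1, "terrible": -1, "triste": -1}
SCOPE = 3                                        # words following a cue

def polarity(tweet):
    tokens = tweet.lower().split()
    score, negate_left = 0, 0
    for tok in tokens:
        if tok in NEGATION_CUES:
            negate_left = SCOPE                  # open a negation scope
            continue
        value = POLARITY.get(tok, 0)
        score += -value if negate_left > 0 else value
        negate_left = max(0, negate_left - 1)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("la película no es buena"))       # negated positive -> negative
print(polarity("el servicio no es malo"))        # negated negative -> positive
print(polarity("una tarde excelente y feliz"))   # positive
```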

Journal ArticleDOI
TL;DR: A two-stage method is proposed for recognizing facial expressions given a sequence of images and the performance of the temporal model is better than that of the single architecture of the 2017 EmotiW challenge winner on the AFEW 7.0 dataset.
Abstract: Emotion recognition is indispensable in human-machine interaction systems. It comprises locating facial regions of interest in images and classifying them into one of seven classes: angry, disgust, fear, happy, neutral, sad, and surprise. Despite several breakthroughs in image classification, particularly in facial expression recognition, this research area is still challenging, as sampling in the wild is a demanding task. In this study, a two-stage method is proposed for recognizing facial expressions given a sequence of images. At the first stage, all face regions are extracted in each frame, and essential information that would be helpful and related to human emotion is obtained. Then, the extracted features from the previous step are considered temporal data and are assigned to one of the seven basic emotions. In addition, a study of multi-level features is conducted in a convolutional neural network for facial expression recognition. Moreover, various network connections are introduced to improve the classification task. By combining the proposed network connections, superior results are obtained compared to state-of-the-art methods on the FER2013 dataset. Furthermore, the performance of our temporal model is better than that of the single architecture of the 2017 EmotiW challenge winner on the AFEW 7.0 dataset.

Journal ArticleDOI
TL;DR: This work demonstrates how pitch can be used to improve estimates of emotion from the upper face, and how this estimate can be combined with emotion estimates from the lower face and speech in a multimodal classification system.
Abstract: Emotion is an essential part of human interaction. Automatic emotion recognition can greatly benefit human-centered interactive technology, since extracted emotion can be used to understand and respond to user needs. However, real-world emotion recognition faces a central challenge when a user is speaking: facial movements due to speech are often confused with facial movements related to emotion. Recent studies have found that the use of phonetic information can reduce speech-related variability in the lower face region. However, methods to differentiate upper face movements due to emotion and due to speech have been underexplored. This gap leads us to the proposal of the Informed Segmentation and Labeling Approach (ISLA). ISLA uses speech signals that alter the dynamics of the lower and upper face regions. We demonstrate how pitch can be used to improve estimates of emotion from the upper face, and how this estimate can be combined with emotion estimates from the lower face and speech in a multimodal classification system. Our emotion classification results on the IEMOCAP and SAVEE datasets show that ISLA improves overall classification performance. We also demonstrate how emotion estimates from different modalities correlate with each other, providing insights into the differences between posed and spontaneous expressions.