
Showing papers in "IEEE Transactions on Affective Computing in 2015"


Journal ArticleDOI
TL;DR: An automated learning-free facial landmark detection technique has been proposed, which achieves performance comparable to that of other state-of-the-art landmark detection methods, yet requires significantly less execution time.
Abstract: Extraction of discriminative features from salient facial patches plays a vital role in effective facial expression recognition. The accurate detection of facial landmarks improves the localization of the salient patches on face images. This paper proposes a novel framework for expression recognition by using appearance features of selected facial patches. A few prominent facial patches, depending on the position of facial landmarks, are extracted which are active during emotion elicitation. These active patches are further processed to obtain the salient patches which contain discriminative features for classification of each pair of expressions, thereby selecting different facial patches as salient for different pairs of expression classes. A one-against-one classification method is adopted using these features. In addition, an automated learning-free facial landmark detection technique has been proposed, which achieves performance comparable to that of other state-of-the-art landmark detection methods, yet requires significantly less execution time. The proposed method is found to perform consistently well across different resolutions, hence providing a solution for expression recognition in low resolution images. Experiments on the CK+ and JAFFE facial expression databases show the effectiveness of the proposed system.

452 citations
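A minimal sketch of the one-against-one classification scheme described in the abstract above, assuming appearance features have already been extracted from the salient facial patches. The array shapes, the RBF kernel, and the six-class label set are illustrative assumptions, not the authors' configuration; scikit-learn's SVC handles the pairwise (one-against-one) training internally.

```python
# Sketch only: one-against-one expression classification from per-patch
# appearance features (synthetic stand-ins for the paper's patch features).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_patches, feat_dim = 300, 19, 59            # hypothetical sizes
X = rng.normal(size=(n_samples, n_patches * feat_dim))  # stacked patch features
y = rng.integers(0, 6, size=n_samples)                  # six expression classes

clf = SVC(kernel="rbf", decision_function_shape="ovo")  # pairwise (one-vs-one) scheme
print(cross_val_score(clf, X, y, cv=5).mean())
```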


Journal ArticleDOI
TL;DR: Experimental results show that the proposed Fourier parameter (FP) features are effective in identifying various emotional states in speech signals and improve the recognition rates over the methods using Mel frequency cepstral coefficient features.
Abstract: Recently, studies have been performed on harmony features for speech emotion recognition. It is found in our study that the first- and second-order differences of harmony features also play an important role in speech emotion recognition. Therefore, we propose a new Fourier parameter model using the perceptual content of voice quality and the first- and second-order differences for speaker-independent speech emotion recognition. Experimental results show that the proposed Fourier parameter (FP) features are effective in identifying various emotional states in speech signals. They improve the recognition rates over the methods using Mel frequency cepstral coefficient (MFCC) features by 16.2, 6.8 and 16.6 points on the German database (EMODB), Chinese language database (CASIA) and Chinese elderly emotion database (EESDB). In particular, when combining FP with MFCC, the recognition rates can be further improved on the aforementioned databases by 17.5, 10 and 10.5 points, respectively.

328 citations
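The role of first- and second-order differences can be illustrated with a short, hedged sketch: here librosa's MFCCs stand in for the base features (the paper's Fourier parameter features are derived differently), and delta and delta-delta frames are stacked onto them, mirroring the FP + MFCC fusion idea.

```python
# Illustrative only: append first- and second-order differences (deltas)
# to frame-level features; MFCCs are used as stand-in base features.
import numpy as np
import librosa

sr = 16000
y = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr * 2) / sr)  # 2 s synthetic tone
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # base features (13 x T)
d1 = librosa.feature.delta(mfcc, order=1)                    # first-order difference
d2 = librosa.feature.delta(mfcc, order=2)                    # second-order difference
frame_feats = np.vstack([mfcc, d1, d2])                      # 39-dimensional frames
print(frame_feats.shape)
```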


Journal ArticleDOI
TL;DR: A large video database, namely LIRIS-ACCEDE, is proposed, which consists of 9,800 good quality video excerpts with a large content diversity and provides four experimental protocols and a baseline for prediction of emotions using a large set of both visual and audio features.
Abstract: Research in affective computing requires ground truth data for training and benchmarking computational models for machine-based emotion understanding. In this paper, we propose a large video database, namely LIRIS-ACCEDE, for affective content analysis and related applications, including video indexing, summarization or browsing. In contrast to existing datasets with very few video resources and limited accessibility due to copyright constraints, LIRIS-ACCEDE consists of 9,800 good quality video excerpts with a large content diversity. All excerpts are shared under Creative Commons licenses and can thus be freely distributed without copyright issues. Affective annotations were achieved using crowdsourcing through a pair-wise video comparison protocol, thereby ensuring that annotations are fully consistent, as testified by a high inter-annotator agreement, despite the large diversity of raters' cultural backgrounds. In addition, to enable fair comparison and to track the progress of future affective computational models, we further provide four experimental protocols and a baseline for prediction of emotions using a large set of both visual and audio features. The dataset (the video clips, annotations, features and protocols) is publicly available at: http://liris-accede.ec-lyon.fr/.

270 citations


Journal ArticleDOI
TL;DR: DECAF, a multimodal data set for decoding user physiological responses to affective multimedia content, is presented, along with a detailed analysis of the correlations between participants' self-assessments and their physiological responses and single-trial classification results for valence, arousal and dominance, with performance evaluation against existing data sets.
Abstract: In this work, we present DECAF, a multimodal data set for decoding user physiological responses to affective multimedia content. Different from data sets such as DEAP [15] and MAHNOB-HCI [31], DECAF contains (1) brain signals acquired using the Magnetoencephalogram (MEG) sensor, which requires little physical contact with the user's scalp and consequently facilitates naturalistic affective response, and (2) explicit and implicit emotional responses of 30 participants to 40 one-minute music video segments used in [15] and 36 movie clips, thereby enabling comparisons between the EEG versus MEG modalities as well as movie versus music stimuli for affect recognition. In addition to MEG data, DECAF comprises synchronously recorded near-infra-red (NIR) facial videos, horizontal Electrooculogram (hEOG), Electrocardiogram (ECG), and trapezius-Electromyogram (tEMG) peripheral physiological responses. To demonstrate DECAF's utility, we present (i) a detailed analysis of the correlations between participants' self-assessments and their physiological responses and (ii) single-trial classification results for valence, arousal and dominance, with performance evaluation against existing data sets. DECAF also contains time-continuous emotion annotations for movie clips from seven users, which we use to demonstrate dynamic emotion prediction.

257 citations


Journal ArticleDOI
TL;DR: This paper reports on how emotional states elicited by affective sounds can be effectively recognized by means of estimates of Autonomic Nervous System (ANS) dynamics.
Abstract: This paper reports on how emotional states elicited by affective sounds can be effectively recognized by means of estimates of Autonomic Nervous System (ANS) dynamics. Specifically, emotional states are modeled as a combination of arousal and valence dimensions according to the well-known circumplex model of affect, whereas the ANS dynamics is estimated through standard and nonlinear analysis of heart rate variability (HRV) exclusively, which is derived from the electrocardiogram (ECG). In addition, lagged Poincare plots of the HRV series were also taken into account. The affective sounds were gathered from the International Affective Digitized Sound System and grouped into four different levels of arousal (intensity) and two levels of valence (unpleasant and pleasant). A group of 27 healthy volunteers was presented with these standardized stimuli while ECG signals were continuously recorded. Then, those HRV features showing significant changes (p < 0.05 in statistical tests) between the arousal and valence dimensions were used as the input to an automatic classification system for the recognition of the four classes of arousal and two classes of valence. Experimental results demonstrated that a quadratic discriminant classifier, tested through a leave-one-subject-out procedure, was able to achieve a recognition accuracy of 84.72 percent on the valence dimension and 84.26 percent on the arousal dimension.

160 citations
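A hedged sketch of the evaluation protocol named above (a quadratic discriminant classifier tested with leave-one-subject-out), using synthetic stand-ins for the HRV features; the feature count and the number of trials per subject are assumptions, not the paper's setup.

```python
# Leave-one-subject-out evaluation of a quadratic discriminant classifier
# on placeholder "HRV" features for a two-class valence task.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
n_subjects, trials_per_subject, n_features = 27, 8, 12
X = rng.normal(size=(n_subjects * trials_per_subject, n_features))  # synthetic HRV features
y = rng.integers(0, 2, size=len(X))                 # valence: unpleasant / pleasant
groups = np.repeat(np.arange(n_subjects), trials_per_subject)

scores = cross_val_score(QuadraticDiscriminantAnalysis(), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print(scores.mean())   # chance level on synthetic data; ~84-85 percent reported in the paper
```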


Journal ArticleDOI
TL;DR: A general framework for video affective content analysis is proposed, which includes video content, emotional descriptors, and users' spontaneous nonverbal responses, as well as the relationships between the three.
Abstract: Video affective content analysis has been an active research area in recent decades, since emotion is an important component in the classification and retrieval of videos. Video affective content analysis can be divided into two approaches: direct and implicit. Direct approaches infer the affective content of videos directly from related audiovisual features. Implicit approaches, on the other hand, detect affective content from videos based on an automatic analysis of a user's spontaneous response while consuming the videos. This paper first proposes a general framework for video affective content analysis, which includes video content, emotional descriptors, and users' spontaneous nonverbal responses, as well as the relationships between the three. Then, we survey current research in both direct and implicit video affective content analysis, with a focus on direct video affective content analysis. Lastly, we identify several challenges in this field and put forward recommendations for future research.

158 citations


Journal ArticleDOI
TL;DR: A large-scale analysis of facial responses to video content measured over the Internet and their relationship to marketing effectiveness demonstrates a reliable and generalizable system for predicting ad effectiveness automatically from facial responses without a need to elicit self-report responses from the viewers.
Abstract: Billions of online video ads are viewed every month. We present a large-scale analysis of facial responses to video content measured over the Internet and their relationship to marketing effectiveness. We collected over 12,000 facial responses from 1,223 people to 170 ads from a range of markets and product categories. The facial responses were automatically coded frame-by-frame. Collection and coding of these 3.7 million frames would not have been feasible with traditional research methods. We show that detected expressions are sparse but that aggregate responses reveal rich emotion trajectories. By modeling the relationship between the facial responses and ad effectiveness, we show that ad liking can be predicted accurately (ROC AUC = 0.85) from webcam facial responses. Furthermore, the prediction of a change in purchase intent is possible (ROC AUC = 0.78). Ad liking is shown by eliciting expressions, particularly positive expressions. Driving purchase intent is more complex than just making viewers smile: peak positive responses that are immediately preceded by a brand appearance are more likely to be effective. The results presented here demonstrate a reliable and generalizable system for predicting ad effectiveness automatically from facial responses without a need to elicit self-report responses from the viewers. In addition we can gain insight into the structure of effective ads.

123 citations


Journal ArticleDOI
TL;DR: The challenges in developing an automatic mood analysis system are identified, three models based on the attributes in the study are proposed, and an 'in the wild' image-based database is collected.
Abstract: The recent advancement of social media has given users a platform to socially engage and interact with a larger population. Millions of images and videos are being uploaded everyday by users on the web from different events and social gatherings. There is an increasing interest in designing systems capable of understanding human manifestations of emotional attributes and affective displays. As images and videos from social events generally contain multiple subjects, it is an essential step to study these groups of people. In this paper, we study the problem of happiness intensity analysis of a group of people in an image using facial expression analysis. A user perception study is conducted to understand various attributes, which affect a person’s perception of the happiness intensity of a group. We identify the challenges in developing an automatic mood analysis system and propose three models based on the attributes in the study. An ‘in the wild’ image-based database is collected. To validate the methods, both quantitative and qualitative experiments are performed and applied to the problem of shot selection, event summarisation and album creation. The experiments show that the global and local attributes defined in the paper provide useful information for theme expression analysis, with results close to human perception results.

108 citations
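As a loose illustration of combining per-face estimates with group-level attributes, the hypothetical snippet below weights individual happiness intensities by relative face size; the actual global and local attributes and their weighting in the paper are richer than this.

```python
# Hypothetical group-level happiness score: per-face intensities weighted
# by a simple prominence attribute (relative face area).
import numpy as np

def group_happiness(intensities, face_areas):
    """Weighted mean of per-face happiness intensities (0..5 scale assumed)."""
    w = np.asarray(face_areas, dtype=float)
    w /= w.sum()                      # larger, more prominent faces count more
    return float(np.dot(w, intensities))

print(group_happiness([4.0, 2.5, 3.0], face_areas=[9000, 2500, 4000]))
```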


Journal ArticleDOI
TL;DR: A time-shift that maximizes the mutual information between the expressive behaviors and the time-continuous annotations is proposed, which is implemented by making different assumptions about the evaluators' reaction lag.
Abstract: An appealing scheme to characterize expressive behaviors is the use of emotional dimensions such as activation (calm versus active) and valence (negative versus positive). These descriptors offer many advantages to describe the wide spectrum of emotions. Due to the continuous nature of fast-changing expressive vocal and gestural behaviors, it is desirable to continuously track these emotional traces, capturing subtle and localized events (e.g., with FEELTRACE). However, time-continuous annotations introduce challenges that affect the reliability of the labels. In particular, an important issue is the evaluators' reaction lag caused by observing, appraising, and responding to the expressive behaviors. An empirical analysis demonstrates that this delay varies from 1 to 6 seconds, depending on the annotator, expressive dimension, and actual behaviors. Our experiments show accuracy improvements even with fixed delays (1-3 seconds). This paper proposes to compensate for this reaction lag by finding the time-shift that maximizes the mutual information between the expressive behaviors and the time-continuous annotations. The approach is implemented by making different assumptions about the evaluators' reaction lag. The benefits of compensating for the delay are demonstrated with emotion classification experiments. On average, the classifiers trained with facial and speech features show more than 7 percent relative improvements over baseline classifiers trained and tested without shifting the time-continuous annotations.

105 citations
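A simplified sketch of the lag-compensation idea, under the assumption that both the behavioral feature and the annotation are one-dimensional and uniformly sampled: shift the annotation over candidate lags and keep the shift that maximizes mutual information between the discretized signals. The bin count and lag range are arbitrary choices for illustration.

```python
# Find the annotator reaction lag by maximizing mutual information between
# a behavioral feature and the shifted time-continuous annotation.
import numpy as np
from sklearn.metrics import mutual_info_score

def best_lag(feature, annotation, max_lag, n_bins=10):
    """Lag (in samples) whose shift gives the highest MI with the feature."""
    def discretize(x):
        return np.digitize(x, np.histogram_bin_edges(x, bins=n_bins))
    f = discretize(feature)
    scores = {}
    for lag in range(max_lag + 1):           # annotation assumed to trail the behavior
        a = discretize(annotation[lag:])
        scores[lag] = mutual_info_score(f[:len(a)], a)
    return max(scores, key=scores.get)

rng = np.random.default_rng(2)
behavior = rng.normal(size=1000)
annotation = np.roll(behavior, 40) + 0.1 * rng.normal(size=1000)  # simulated 40-sample lag
print(best_lag(behavior, annotation, max_lag=100))                # expected: about 40
```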


Journal ArticleDOI
TL;DR: These findings are important for affective computing because they suggest people's decisions might be influenced differently according to whether they believe emotional expressions shown in computers are being generated by algorithms or humans.
Abstract: Recent research in perception and theory of mind reveals that people show different behavior and lower activation of brain regions associated with mentalizing (i.e., the inference of others' mental states) when engaged in decision making with computers, when compared to humans. These findings are important for affective computing because they suggest people's decisions might be influenced differently according to whether they believe emotional expressions shown in computers are being generated by algorithms or humans. To test this, we had people engage in a social dilemma (Experiment 1) or negotiation (Experiment 2) with virtual humans that were either perceived to be agents (i.e., controlled by computers) or avatars (i.e., controlled by humans). The results showed that such perceptions have a deep impact on people's decisions: in Experiment 1, people cooperated more with virtual humans that showed cooperative facial displays (e.g., joy after mutual cooperation) than competitive displays (e.g., joy when the participant was exploited), but the effect was stronger with avatars (d = .601) than with agents (d = .360); in Experiment 2, people conceded more to angry than neutral virtual humans but, again, the effect was much stronger with avatars (d = 1.162) than with agents (d = .066). Participants also showed less anger towards avatars and formed more positive impressions of avatars when compared to agents.

61 citations


Journal ArticleDOI
TL;DR: It is found that more social interactions correlate with higher emotion influence between two users, and the influence of negative emotions is stronger than positive ones.
Abstract: We study emotion influence in large image social networks. We focus on users' emotions reflected by images that they have uploaded and social influence that plays a role in changing users' emotions. We first verify the existence of emotion influence in the image networks, and then propose a probabilistic factor graph based emotion influence model to answer the question of "who influences whom". Employing a real network from Flickr as the basis in our empirical study, we evaluate the effectiveness of different factors in the proposed model with in-depth data analysis. The learned influence is fundamental for social network analysis and can be applied to many applications. We consider using the influence to help predict users' emotions, and our experiments show that it can significantly improve the prediction accuracy (by 3.0-26.2 percent) over several alternative methods such as Naive Bayesian, SVM (Support Vector Machine) or a traditional graph model. We further examine the behavior of the emotion influence model, and find that more social interactions correlate with higher emotion influence between two users, and that the influence of negative emotions is stronger than that of positive ones.

Journal ArticleDOI
TL;DR: It is shown that while multiple facial expression cues have significant correlation with several of the Big-Five traits, they are only able to significantly predict Extraversion impressions with moderate values of R2.
Abstract: Social video sites where people share their opinions and feelings are increasing in popularity. The face is known to reveal important aspects of human psychological traits, so the understanding of how facial expressions relate to personal constructs is a relevant problem in social media. We present a study of the connections between automatically extracted facial expressions of emotion and impressions of Big-Five personality traits in YouTube vlogs (i.e., video blogs). We use the Computer Expression Recognition Toolbox (CERT) system to characterize users of conversational vlogs. From CERT temporal signals corresponding to instantaneously recognized facial expression categories, we propose and derive four sets of behavioral cues that characterize face statistics and dynamics in a compact way. The cue sets are first used in a correlation analysis to assess the relevance of each facial expression of emotion with respect to Big-Five impressions obtained from crowd-observers watching vlogs, and also as features for automatic personality impression prediction. Using a dataset of 281 vloggers, the study shows that while multiple facial expression cues have significant correlation with several of the Big-Five traits, they are only able to significantly predict Extraversion impressions with moderate values of R^2.

Journal ArticleDOI
TL;DR: A novel generative model called acoustic emotion Gaussians (AEG), which treats the affective content of music as a (soft) probability distribution in the valence-arousal space and parameterizes it with a Gaussian mixture model (GMM).
Abstract: Modeling the association between music and emotion has been considered important for music information retrieval and affective human computer interaction. This paper presents a novel generative model called acoustic emotion Gaussians (AEG) for computational modeling of emotion. Instead of assigning a music excerpt with a deterministic (hard) emotion label, AEG treats the affective content of music as a (soft) probability distribution in the valence-arousal space and parameterizes it with a Gaussian mixture model (GMM). In this way, the subjective nature of emotion perception is explicitly modeled. Specifically, AEG employs two GMMs to characterize the audio and emotion data. The fitting algorithm of the GMM parameters makes the model learning process transparent and interpretable. Based on AEG, a probabilistic graphical structure for predicting the emotion distribution from music audio data is also developed. A comprehensive performance study over two emotion-labeled datasets demonstrates that AEG offers new insights into the relationship between music and emotion (e.g., to assess the “affective diversity” of a corpus) and represents an effective means of emotion modeling. Readers can easily implement AEG via the publicly available codes. As the AEG model is generic, it holds the promise of analyzing any signal that carries affective or other highly subjective information.
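The core representational idea above, a soft distribution over the valence-arousal plane rather than a hard label, can be sketched with a plain scikit-learn GMM. The annotation values, two components, and full covariances below are illustrative choices, not the AEG learning procedure itself.

```python
# Represent one excerpt's affective content as a GMM over (valence, arousal)
# annotations instead of a single deterministic label.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
va = rng.normal(loc=[0.3, 0.6], scale=0.15, size=(40, 2))  # 40 raters' (valence, arousal)

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(va)
print(gmm.means_)                        # soft "centers" of the excerpt's affect
print(gmm.score_samples([[0.3, 0.6]]))   # log-density at a query point in VA space
```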

Journal ArticleDOI
TL;DR: It is suggested that angular displacement, angular velocity and their coordination between mothers and infants are strongly related to age-appropriate emotion challenge and attention to head movement can deepen the understanding of emotion communication.
Abstract: We investigated the dynamics of head movement in mothers and infants during an age-appropriate, well-validated emotion induction, the Still Face paradigm. In this paradigm, mothers and infants play normally for 2 minutes (Play) followed by 2 minutes in which the mothers remain unresponsive (Still Face), and then two minutes in which they resume normal behavior (Reunion). Participants were 42 ethnically diverse 4-month-old infants and their mothers. Mother and infant angular displacement and angular velocity were measured using the CSIRO head tracker. In male but not female infants, angular displacement increased from Play to Still-Face and decreased from Still Face to Reunion. Infant angular velocity was higher during Still-Face than Reunion with no differences between male and female infants. Windowed cross-correlation suggested changes in how infant and mother head movements are associated, revealing dramatic changes in direction of association. Coordination between mother and infant head movement velocity was greater during Play compared with Reunion. Together, these findings suggest that angular displacement, angular velocity and their coordination between mothers and infants are strongly related to age-appropriate emotion challenge. Attention to head movement can deepen our understanding of emotion communication.
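A minimal sketch of windowed cross-correlation between two synchronized head-movement series; the window length, step, and lag range (in frames) are arbitrary choices here, not the study's settings, and the series are simulated.

```python
# Windowed cross-correlation: peak Pearson correlation and its lag per window.
import numpy as np

def windowed_xcorr(a, b, win=120, step=30, max_lag=15):
    results = []
    for start in range(0, len(a) - win + 1, step):
        wa, wb = a[start:start + win], b[start:start + win]
        best_r, best_lag = 0.0, 0
        for lag in range(-max_lag, max_lag + 1):
            if lag >= 0:
                r = np.corrcoef(wa[:win - lag], wb[lag:])[0, 1]
            else:
                r = np.corrcoef(wa[-lag:], wb[:win + lag])[0, 1]
            if abs(r) > abs(best_r):
                best_r, best_lag = r, lag
        results.append((best_r, best_lag))   # (peak correlation, lag) per window
    return results

rng = np.random.default_rng(4)
mother = rng.normal(size=1000)                                   # e.g. angular velocity
infant = 0.6 * np.concatenate([np.zeros(5), mother[:-5]]) + rng.normal(size=1000)
print(windowed_xcorr(mother, infant)[:3])                        # lag of about +5 expected
```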

Journal ArticleDOI
TL;DR: Experimental results demonstrate the benefits of MDT to predict time-varying musical emotions, and the proposed method for music retrieval based on emotion dynamics outperforms retrieval methods based on acoustic features.
Abstract: Musical signals have rich temporal information not only at the physical level but also at the emotion level. The listeners may wish to find music excerpts that have similar sequence patterns of musical emotions to given excerpts. Most state-of-the-art systems for emotion-based music retrieval concentrate on static analysis of musical emotions, and ignore dynamic analysis and modeling of musical emotions over time. This paper presents a novel approach to perform music retrieval based on time-varying musical emotion dynamics. A three-dimensional musical emotion model, Resonance-Arousal-Valence (RAV), is used, and emotions of a piece of music are represented by musical emotion dynamics in a time series. A multiple dynamic textures (MDT) model is proposed to model music and emotion dynamics over time, and the expectation maximization (EM) algorithm, along with Kalman filtering and smoothing, is used to estimate model parameters. Two smoothing methods for robust modeling, Rauch-Tung-Striebel (RTS) and minimum-variance smoothing (MVS), are investigated and compared to find an optimal solution to enhance prediction. To find similar sequence patterns of musical emotions, subsequence dynamic time warping (DTW) for emotion dynamics matching is presented. Experimental results demonstrate the benefits of MDT in predicting time-varying musical emotions, and our proposed method for music retrieval based on emotion dynamics outperforms retrieval methods based on acoustic features.
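The retrieval step can be illustrated with a bare-bones subsequence DTW over emotion-dynamics vectors. This toy version uses Euclidean frame distances and synthetic Resonance-Arousal-Valence trajectories, and omits the MDT prediction stage entirely.

```python
# Subsequence DTW: locate where a short query of emotion-dynamics vectors
# best matches inside a longer sequence (free start and end in the sequence).
import numpy as np

def subsequence_dtw(query, sequence):
    """Return (cost, end_index) of the best match of `query` in `sequence`."""
    n, m = len(query), len(sequence)
    dist = np.linalg.norm(query[:, None, :] - sequence[None, :, :], axis=2)
    D = np.full((n, m), np.inf)
    D[0, :] = dist[0, :]                      # the match may start anywhere
    for i in range(1, n):
        D[i, 0] = dist[i, 0] + D[i - 1, 0]
        for j in range(1, m):
            D[i, j] = dist[i, j] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    end = int(np.argmin(D[-1]))
    return float(D[-1, end]), end

rng = np.random.default_rng(5)
seq = rng.normal(size=(200, 3))               # synthetic RAV trajectory over time
query = seq[80:100] + 0.05 * rng.normal(size=(20, 3))
print(subsequence_dtw(query, seq))            # end index should be near 99
```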

Journal ArticleDOI
TL;DR: A novel framework is proposed for recognizing arousal levels by integrating low-level audio-visual features derived from video content with the human brain's functional activity in response to videos, measured by functional magnetic resonance imaging (fMRI).
Abstract: As an indicator of emotion intensity, arousal is a significant clue for users to find the content they are interested in. Hence, effective techniques for video arousal recognition are highly required. In this paper, we propose a novel framework for recognizing arousal levels by integrating low-level audio-visual features derived from video content and the human brain's functional activity in response to videos measured by functional magnetic resonance imaging (fMRI). At first, a set of audio-visual features which have been demonstrated to be correlated with video arousal is extracted. Then, the fMRI-derived features that convey the brain activity of comprehending videos are extracted based on a number of brain regions of interest (ROIs) identified by a universal brain reference system. Finally, these two sets of features are integrated to learn a joint representation by using a multimodal deep Boltzmann machine (DBM). The learned joint representation can be utilized as the feature for training classifiers. Due to the fact that fMRI scanning is expensive and time-consuming, our DBM fusion model has the ability to predict the joint representation of the videos without fMRI scans. The experimental results on a video benchmark demonstrated the effectiveness of our framework and the superiority of the integrated features.

Journal ArticleDOI
TL;DR: By combining the proposed features, together with the modulation spectral analysis of MFCC and statistical descriptors of short-term timbre features, this new feature set outperforms previous approaches with statistical significance.
Abstract: In recent years, many short-term timbre and long-term modulation features have been developed for content-based music classification. However, two operations in modulation analysis are likely to smooth out useful modulation information, which may degrade classification performance. To deal with this problem, this paper proposes the use of a two-dimensional representation of acoustic frequency and modulation frequency to extract joint acoustic frequency and modulation frequency features. Long-term joint frequency features, such as acoustic-modulation spectral contrast/valley (AMSC/AMSV), acoustic-modulation spectral flatness measure (AMSFM), and acoustic-modulation spectral crest measure (AMSCM), are then computed from the spectra of each joint frequency subband. By combining the proposed features, together with the modulation spectral analysis of MFCC and statistical descriptors of short-term timbre features, this new feature set outperforms previous approaches with statistical significance.
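As a small illustration of the kind of descriptor involved, the snippet below computes a spectral flatness measure (geometric over arithmetic mean), the building block behind the AMSFM name, on an ordinary magnitude spectrum rather than the joint acoustic-modulation subband representation used in the paper.

```python
# Spectral flatness: near 0 for a tonal spectrum, much higher for noise.
import numpy as np

def spectral_flatness(power_spectrum, eps=1e-12):
    p = np.asarray(power_spectrum) + eps
    return np.exp(np.mean(np.log(p))) / np.mean(p)   # geometric / arithmetic mean

tone = np.abs(np.fft.rfft(np.sin(2 * np.pi * 440 * np.arange(2048) / 16000))) ** 2
noise = np.abs(np.fft.rfft(np.random.default_rng(8).normal(size=2048))) ** 2
print(spectral_flatness(tone), spectral_flatness(noise))
```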

Journal ArticleDOI
TL;DR: This work further investigates these online populations through the contents of not only their posts but also their comments, and finds that all three features (topics, language styles and affective information) are significantly different between Autism and Control, and between autism Personal and Community blogs.
Abstract: The Internet has provided an ever increasingly popular platform for individuals to voice their thoughts, and like-minded people to share stories. This unintentionally leaves characteristics of individuals and communities, which are often difficult to be collected in traditional studies. Individuals with autism are such a case, in which the Internet could facilitate even more communication given its social-spatial distance being a characteristic preference for individuals with autism. Previous studies examined the traces left in the posts of online autism communities (Autism) in comparison with other online communities (Control). This work further investigates these online populations through the contents of not only their posts but also their comments. We first compare the Autism and Control blogs based on three features: topics, language styles and affective information. The autism groups are then further examined, based on the same three features, by looking at their personal (Personal) and community (Community) blogs separately. Machine learning and statistical methods are used to discriminate blog contents in both cases. All three features are found to be significantly different between Autism and Control, and between autism Personal and Community. These features also show good indicative power in prediction of autism blogs in both personal and community settings.

Journal ArticleDOI
TL;DR: This paper designs a systematic and quantitative framework and proposes an algorithm called multi-emotion similarity preserving embedding (ME-SPE), which is extended to a bilinear version to adapt to second-order music signals and shows good performance on two standard music emotion datasets.
Abstract: Music can convey and evoke powerful emotions. This amazing ability has not only fascinated the general public but also attracted researchers from different fields to discover the relationship between music and emotion. Psychologists have indicated that some specific characteristics of rhythm, harmony, and melody can evoke certain kinds of emotions. Those hypotheses are based on real-life experience and verified by psychological paradigms with human subjects. Aiming at the same target, this paper intends to design a systematic and quantitative framework and answer three questions of wide interest: 1) what are the intrinsic features embedded in music signals that essentially evoke human emotions; 2) to what extent these features influence human emotions; and 3) whether the findings from computational models are consistent with the existing research results from psychology. We formulate these tasks as a multi-label dimensionality reduction problem and propose an algorithm called multi-emotion similarity preserving embedding (ME-SPE). To adapt to second-order music signals, we extend ME-SPE to its bilinear version. The proposed techniques show good performance on two standard music emotion datasets. Moreover, they demonstrate some interesting results for further research in this interdisciplinary topic.

Journal ArticleDOI
TL;DR: A novel fusion method is introduced that utilizes the outputs of individual classifiers that are trained using multi-dimensional inputs with multiple temporal lengths to demonstrate the utility of the multimodal-multitemporal approach.
Abstract: Earlier studies have shown that certain emotional characteristics are best observed at different analysis-frame lengths. When features of multiple modalities are extracted, it is reasonable to believe that different temporal lengths would better model the underlying characteristics that result from different emotions. In this study, we examine the use of such differing timescales in constructing emotion classifiers. A novel fusion method is introduced that utilizes the outputs of individual classifiers that are trained using multi-dimensional inputs with multiple temporal lengths. We used the IEMOCAP database, which contains audiovisual information of 10 subjects in dyadic interaction settings. The classification task was performed over three emotional dimensions: valence, activation, and dominance. The results demonstrate the utility of the multimodal-multitemporal approach. Statistically significant improvements in accuracy are seen in all three dimensions when compared with unimodal-unitemporal classifiers.

Journal ArticleDOI
TL;DR: Results show that recognition rates of the Random Forest model approach human rating levels, and classification comparisons and feature importance analyses indicate an improvement in recognition of social laughter when localized features and nonlinear models are used.
Abstract: Despite its importance in social interactions, laughter remains little studied in affective computing. Intelligent virtual agents are often blind to users’ laughter and unable to produce convincing laughter themselves. Respiratory, auditory, and facial laughter signals have been investigated but laughter-related body movements have received less attention. The aim of this study is threefold. First, to probe human laughter perception by analyzing patterns of categorisations of natural laughter animated on a minimal avatar. Results reveal that a low dimensional space can describe perception of laughter “types”. Second, to investigate observers’ perception of laughter (hilarious, social, awkward, fake, and non-laughter) based on animated avatars generated from natural and acted motion-capture data. Significant differences in torso and limb movements are found between animations perceived as laughter and those perceived as non-laughter. Hilarious laughter also differs from social laughter. Different body movement features were indicative of laughter in sitting and standing avatar postures. Third, to investigate automatic recognition of laughter to the same level of certainty as observers’ perceptions. Results show recognition rates of the Random Forest model approach human rating levels. Classification comparisons and feature importance analyses indicate an improvement in recognition of social laughter when localized features and nonlinear models are used.
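A hedged sketch of the recognition setup named above, a Random Forest over body-movement descriptors with a feature-importance readout; the feature dimensionality, class set, and synthetic data are placeholders, not the study's motion-capture features.

```python
# Random Forest classification of laughter type from placeholder body-movement
# descriptors, plus the kind of feature-importance inspection mentioned above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(250, 30))                 # torso/limb movement descriptors (synthetic)
y = rng.integers(0, 4, size=250)               # e.g. hilarious / social / awkward / fake

rf = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
print(rf.fit(X, y).feature_importances_[:5])   # which descriptors matter most
```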

Journal ArticleDOI
TL;DR: In the tasks of music emotion annotation and retrieval, experimental results show that the proposed MER system outperforms state-of-the-art systems in terms of F-score and mean average precision.
Abstract: This study proposes a novel multi-label music emotion recognition (MER) system. An emotion cannot be defined clearly in the real world because the classes of emotions are usually considered overlapping. Accordingly, this study proposes an MER system that is based on a hierarchical Dirichlet process mixture model (HDPMM), whose components can be shared between models of each emotion. Moreover, the HDPMM is improved by adding a discriminant factor to the proposed system based on the concept of linear discriminant analysis. The proposed system represents an emotion using weighting coefficients that are related to a global set of components. In addition, three methods are proposed to compute the weighting coefficients of testing data, and the weighting coefficients are used to determine whether or not the testing data contain certain emotional content. In the tasks of music emotion annotation and retrieval, experimental results show that the proposed MER system outperforms state-of-the-art systems in terms of F-score and mean average precision.

Journal ArticleDOI
TL;DR: The present results indicate that for healthy individuals, there are indeed measurable and consistent relations between physiology and personality and physiological indicators of personality may ultimately be of value as predictors of stress resiliency.
Abstract: High extraversion and conscientiousness and low neuroticism predict successful performance during and after stressful conditions. We investigated whether these personality factors are linked to stress sensitivity and to baseline physiology. Stress was induced through negative feedback on gaming performance. Stress sensitivity was determined as the difference in baseline physiological variables (skin conductance, heart rate and heart rate variability) before and after performing the game, as well as the difference in subjectively reported stress. While physiological results suggest that the game indeed induced stress, subjective reports do not. Possibly due to a low level of experienced stress, stress sensitivity (as indicated by the difference in heart rate) only correlates with conscientiousness and not with extraversion or neuroticism. The baseline measurements show the expected correlations between extraversion and both heart rate and heart rate variability, negative and positive respectively. The negative correlation between neuroticism and skin conductance is opposite to what we expected. While the exact mechanisms are not clear yet, the present results indicate that for healthy individuals, there are indeed measurable and consistent relations between physiology and personality. Hence, physiological indicators of personality may ultimately be of value as predictors of stress resiliency.

Journal ArticleDOI
TL;DR: The results demonstrate that there exist consistent patterns underlying emotion evaluation, even given incongruence, positioning UMEME as an important new tool for understanding emotion perception.
Abstract: Emotion is central to communication; it colors our interpretation of events and social interactions. Emotion expression is generally multimodal, modulating our facial movement, vocal behavior, and body gestures. The method through which this multimodal information is integrated and perceived is not well understood. This knowledge has implications for the design of multimodal classification algorithms, affective interfaces, and even mental health assessment. We present a novel data set designed to support research into the emotion perception process, the University of Michigan Emotional McGurk Effect Data set (UMEME). UMEME has a critical feature that differentiates it from currently existing data sets; it contains not only emotionally congruent stimuli (emotionally matched faces and voices), but also emotionally incongruent stimuli (emotionally mismatched faces and voices). The inclusion of emotionally complex and dynamic stimuli provides an opportunity to study how individuals make assessments of emotion content in the presence of emotional incongruence, or emotional noise. We describe the collection, annotation, and statistical properties of the data and present evidence illustrating how audio and video interact to result in specific types of emotion perception. The results demonstrate that there exist consistent patterns underlying emotion evaluation, even given incongruence, positioning UMEME as an important new tool for understanding emotion perception.

Journal ArticleDOI
TL;DR: Through user studies, it was shown that machine-suggested comments were accepted by users for online posting in 90 percent of completed user sessions, while very favorable results were also observed in various dimensions (plausibility, preference, and realism) when assessing the quality of the generated image comments.
Abstract: We present a general framework and working system for predicting likely affective responses of the viewers in the social media environment after an image is posted online. Our approach emphasizes a mid-level concept representation, in which the intended affects of the image publisher are characterized by a large pool of visual concepts (termed PACs) detected from image content directly instead of textual metadata, evoked viewer affects are represented by concepts (termed VACs) mined from online comments, and statistical methods are used to model the correlations between these two types of concepts. We demonstrate the utility of such approaches by developing an end-to-end Assistive Comment Robot application, which further includes components for multi-sentence comment generation, interactive interfaces, and relevance feedback functions. Through user studies, we showed that machine-suggested comments were accepted by users for online posting in 90 percent of completed user sessions, while very favorable results were also observed in various dimensions (plausibility, preference, and realism) when assessing the quality of the generated image comments.

Journal ArticleDOI
TL;DR: Taking for granted the effectiveness of emotion recognition algorithms, a model for estimating mood from a known sequence of punctual emotions is proposed and results indicate that emotion annotations, continuous in time and value, facilitate mood estimation, as opposed to discrete emotion annotations scattered randomly within the video timespan.
Abstract: A smart environment designed to adapt to a user's affective state should be able to decipher unobtrusively that user's underlying mood. Great effort has been devoted to automatic punctual emotion recognition from visual input. Conversely, little has been done to recognize longer-lasting affective states, such as mood. Taking for granted the effectiveness of emotion recognition algorithms, we propose a model for estimating mood from a known sequence of punctual emotions. To validate our model experimentally, we rely on the human annotations of two well-established databases: the VAM and the HUMAINE. We perform two analyses: the first serves as a proof of concept and tests whether punctual emotions cluster around the mood in the emotion space. The results indicate that emotion annotations, continuous in time and value, facilitate mood estimation, as opposed to discrete emotion annotations scattered randomly within the video timespan. The second analysis explores factors that account for the mood recognition from emotions, by examining how individual human coders perceive the underlying mood of a person. A moving average function with exponential discount of the past emotions achieves mood prediction accuracy above 60 percent, which is higher than the chance level and higher than mutual human agreement.
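The reported moving-average idea can be sketched in a few lines, assuming a sequence of punctual valence values is available; the discount factor below is an arbitrary example, not the value used in the paper.

```python
# Exponentially discounted moving average of punctual emotions as a crude
# mood estimate: recent emotions weigh more than older ones.
import numpy as np

def mood_estimate(valence_sequence, discount=0.9):
    weights = discount ** np.arange(len(valence_sequence))[::-1]  # most recent weight = 1
    return float(np.dot(weights, valence_sequence) / weights.sum())

emotions = [0.2, 0.1, -0.3, -0.5, -0.4]   # punctual valence over time
print(mood_estimate(emotions))             # leans negative, as the recent emotions do
```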

Journal ArticleDOI
TL;DR: A characteristic relation between skewness and kurtosis of aesthetic score distributions in a massive photo aesthetics dataset generated from online voting is reported, supporting the necessity of a consensus property, in addition to the preference used so far, for accurate modeling of the aesthetic evaluation process in the human mind.
Abstract: This paper reports a characteristic relation between skewness and kurtosis of aesthetic score distributions in a massive photo aesthetics dataset generated from online voting. Analysis results reveal an unexpectedly wide range of kurtosis in the mediocre photo group, asymmetric consensus, the 4/3 power-law regime at both extremes, and a tag-specific relation in the skewness-kurtosis plane. From the human cognition perspective on affective content analysis, these patterns are interpreted as supporting the necessity of a consensus property, in addition to the preference used so far, for accurate modeling of the aesthetic evaluation process in the human mind. To explain the observed patterns, we propose a new computational model of a dynamic system based on the interaction between multiple attractors. Characteristic patterns in response time and consensus are predicted from the proposed model and observed in experiments with human subjects for model validation.
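The basic measurement behind this analysis, per-photo skewness and kurtosis of the voter score distribution, looks roughly like the following on simulated 1-10 votes; scipy reports excess kurtosis by default, and the vote counts and scale here are assumptions.

```python
# Per-photo skewness and kurtosis of simulated aesthetic score distributions.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(6)
votes_per_photo = [rng.integers(1, 11, size=200) for _ in range(5)]  # 1-10 scale

for votes in votes_per_photo:
    print(skew(votes), kurtosis(votes))   # kurtosis here is excess kurtosis
```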

Journal ArticleDOI
TL;DR: HapFACS, free software and an API developed to provide the affective computing community with a resource that produces static and dynamic facial expressions for three-dimensional speaking characters, is described, and results of multiple experiments are discussed.
Abstract: With the growing number of researchers interested in modeling the inner workings of affective social intelligence, the need for tools to easily model its associated expressions has emerged. The goal of this article is two-fold: 1) we describe HapFACS, a free software and API that we developed to provide the affective computing community with a resource that produces static and dynamic facial expressions for three-dimensional speaking characters; and 2) we discuss results of multiple experiments that we conducted in order to scientifically validate our facial expressions and head animations in terms of the widely accepted Facial Action Coding System (FACS) standard, and its Action Units (AU). The result is that users, without any 3D-modeling nor computer graphics expertise, can animate speaking virtual characters with FACS-based realistic facial expression animations, and embed these expressive characters in their own application(s). The HapFACS software and API can also be used for generating repertoires of realistic FACS-validated facial expressions, useful for testing emotion expression generation theories.

Journal ArticleDOI
TL;DR: A probabilistic model built on a Bayesian network is described that relates the empathy perceived by observers to how the gaze and facial expressions co-occur within a pair of participants.
Abstract: This paper presents a research framework for understanding the empathy that arises between people while they are conversing. By focusing on the process by which empathy is perceived by other people, this paper aims to develop a computational model that automatically infers perceived empathy from participant behavior. To describe such perceived empathy objectively, we introduce the idea of using the collective impressions of external observers. In particular, we focus on the fact that the perception of other’s empathy varies from person to person, and take the standpoint that this individual difference itself is an essential attribute of human communication for building, for example, successful human relationships and consensus. This paper describes a probabilistic model of the process that we built based on the Bayesian network, and that relates the empathy perceived by observers to how the gaze and facial expressions of participants co-occur between a pair. In this model, the probability distribution represents the diversity of observers’ impression, which reflects the individual differences in the schema when perceiving others’ empathy from their behaviors, and the ambiguity of the behaviors. Comprehensive experiments demonstrate that the inferred distributions are similar to those made by observers.

Journal ArticleDOI
TL;DR: This paper analytically investigates the effect of the source angular position on the listener's emotional state, modeled in the well-established valence/arousal affective space, using an annotated sound events dataset built from binaural processed versions of the available International Affective Digitized Sound (IADS) sound events library.
Abstract: Emotion recognition from sound signals represents an emerging field of recent research. Although many existing works focus on emotion recognition from music, there seems to be a relative scarcity of research on emotion recognition from general sounds. One of the key characteristics of sound events is the sound source spatial position, i.e. the location of the source relative to the acoustic receiver. Existing studies that aim to investigate the relation of the latter source placement and the elicited emotions are limited to distance, front and back spatial localization and/or specific emotional categories. In this paper we analytically investigate the effect of the source angular position on the listener's emotional state, modeled in the well-established valence/arousal affective space. Towards this aim, we have developed an annotated sound events dataset using binaural processed versions of the available International Affective Digitized Sound (IADS) sound events library. All subjective affective annotations were obtained using the Self Assessment Manikin (SAM) approach. Preliminary results obtained by processing these annotation scores are likely to indicate a systematic change in the listener affective state as the sound source angular position changes. This trend is more obvious when the sound source is located outside of the visible field of the listener.