
Showing papers in "IEEE Transactions on Affective Computing in 2018"


Journal ArticleDOI
TL;DR: A newly developed spontaneous micro-facial movement dataset with diverse participants, coded using the Facial Action Coding System; the proposed Adaptive Baseline Threshold detection outperforms the state of the art with a recall of 0.91, and the dataset can become a new standard for micro-movement data.
Abstract: Micro-facial expressions are spontaneous, involuntary movements of the face when a person experiences an emotion but attempts to hide their facial expression, most likely in a high-stakes environment. Recently, research in this field has grown in popularity; however, publicly available datasets of micro-expressions have limitations due to the difficulty of naturally inducing spontaneous micro-expressions. Other issues include lighting, low resolution and low participant diversity. We present a newly developed spontaneous micro-facial movement dataset with diverse participants, coded using the Facial Action Coding System. The experimental protocol addresses the limitations of previous datasets, including eliciting emotional responses from stimuli tailored to each participant. Dataset evaluation was completed by running preliminary experiments to classify micro-movements from non-movements. Results were obtained using a selection of spatio-temporal descriptors and machine learning. We further evaluate the dataset on emerging methods of feature difference analysis and propose an Adaptive Baseline Threshold that uses individualised neutral expression to improve the performance of micro-movement detection. In contrast to machine learning approaches, we outperform the state of the art with a recall of 0.91. The outcomes show the dataset can become a new standard for micro-movement data, with future work expanding on data representation and analysis.
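For readers who want a feel for the feature-difference idea, the sketch below implements a generic version: frame-wise feature histograms are compared against a frame k steps earlier, and the detection threshold is derived from the participant's own neutral footage. The chi-square distance, the interval k and the mean-plus-c-standard-deviations rule are illustrative assumptions, not the paper's exact formulation.

```python
# Generic feature-difference spotting with an individualised (adaptive)
# baseline threshold; distances, k and the threshold rule are assumptions.
import numpy as np

def chi_square_diff(a, b, eps=1e-8):
    """Chi-square distance between two feature histograms."""
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

def feature_differences(features, k=5):
    """Difference between each frame and the frame k steps earlier."""
    return np.array([chi_square_diff(features[i], features[i - k])
                     for i in range(k, len(features))])

def adaptive_baseline_threshold(neutral_features, k=5, c=3.0):
    """Threshold derived from a participant's own neutral footage."""
    d = feature_differences(neutral_features, k)
    return d.mean() + c * d.std()

def spot_micro_movements(test_features, threshold, k=5):
    d = feature_differences(test_features, k)
    return np.where(d > threshold)[0] + k   # frame indices flagged as movement

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    neutral = rng.random((200, 59))          # e.g., 59-bin LBP histograms per frame
    test = rng.random((300, 59))
    test[150:160] += 2.0                     # inject a synthetic "movement"
    thr = adaptive_baseline_threshold(neutral)
    print(spot_micro_movements(test, thr))
```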

353 citations


Journal ArticleDOI
TL;DR: Experimental results cumulatively confirm that personality differences are better revealed while comparing user responses to emotionally homogeneous videos, and above-chance recognition is achieved for both affective and personality dimensions.
Abstract: We present ASCERTAIN—a multimodal database for implicit personality and affect recognition using commercial physiological sensors. To our knowledge, ASCERTAIN is the first database to connect personality traits and emotional states via physiological responses. ASCERTAIN contains big-five personality scales and emotional self-ratings of 58 users along with their Electroencephalogram (EEG), Electrocardiogram (ECG), Galvanic Skin Response (GSR) and facial activity data, recorded using off-the-shelf sensors while viewing affective movie clips. We first examine relationships between users’ affective ratings and personality scales in the context of prior observations, and then study linear and non-linear physiological correlates of emotion and personality. Our analysis suggests that the emotion-personality relationship is better captured by non-linear rather than linear statistics. We finally attempt binary emotion and personality trait recognition using physiological features. Experimental results cumulatively confirm that personality differences are better revealed while comparing user responses to emotionally homogeneous videos, and above-chance recognition is achieved for both affective and personality dimensions.
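A minimal sketch of the kind of linear versus non-linear association analysis mentioned above, on synthetic data; the choice of Pearson correlation, Spearman correlation and mutual information as the contrasting statistics is an assumption for illustration, not the paper's exact analysis.

```python
# Contrast a linear statistic with non-linear ones on a synthetic
# physiology-personality pair; data and variable names are illustrative.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)
gsr_feature = rng.normal(size=500)                                 # e.g., a GSR statistic
extraversion = np.tanh(gsr_feature) + 0.3 * rng.normal(size=500)   # non-linear link

r_lin, _ = pearsonr(gsr_feature, extraversion)
r_rank, _ = spearmanr(gsr_feature, extraversion)
mi = mutual_info_regression(gsr_feature.reshape(-1, 1), extraversion)[0]
print(f"Pearson r = {r_lin:.2f}, Spearman rho = {r_rank:.2f}, MI = {mi:.2f}")
```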

329 citations


Journal ArticleDOI
TL;DR: The first automatic ME analysis system (MESR), which can spot and recognize MEs from spontaneous video data, is proposed; the method outperforms humans in the ME recognition task by a large margin and achieves performance comparable to humans at the very challenging task of spotting and then recognizing spontaneous MEs.
Abstract: Micro-expressions (MEs) are rapid, involuntary facial expressions which reveal emotions that people do not intend to show. Studying MEs is valuable as recognizing them has many important applications, particularly in forensic science and psychotherapy. However, analyzing spontaneous MEs is very challenging due to their short duration and low intensity. Automatic ME analysis includes two tasks: ME spotting and ME recognition. For ME spotting, previous studies have focused on posed rather than spontaneous videos. For ME recognition, the performance of previous studies is low. To address these challenges, we make the following contributions: (i) We propose the first method for spotting spontaneous MEs in long videos (by exploiting feature difference contrast). This method is training free and works on arbitrary unseen videos. (ii) We present an advanced ME recognition framework, which outperforms previous work by a large margin on two challenging spontaneous ME databases (SMIC and CASMEII). (iii) We propose the first automatic ME analysis system (MESR), which can spot and recognize MEs from spontaneous video data. Finally, we show our method outperforms humans in the ME recognition task by a large margin, and achieves comparable performance to humans at the very challenging task of spotting and then recognizing spontaneous MEs.

298 citations


Journal ArticleDOI
TL;DR: A real-time movie-induced emotion recognition system for identifying an individual's emotional states through the analysis of brain waves from EEG signals with the advantage over the existing state-of-the-art real-time emotion recognition systems in terms of classification accuracy and the ability to recognise similar discrete emotions that are close in the valence-arousal coordinate space.
Abstract: Recognition of a human's continuous emotional states in real time plays an important role in machine emotional intelligence and human-machine interaction. Existing real-time emotion recognition systems use stimuli with low ecological validity (e.g., picture, sound) to elicit emotions and to recognise only valence and arousal. To overcome these limitations, in this paper, we construct a standardised database of 16 emotional film clips that were selected from over one thousand film excerpts. Based on emotional categories that are induced by these film clips, we propose a real-time movie-induced emotion recognition system for identifying an individual's emotional states through the analysis of brain waves. Thirty participants took part in this study and watched 16 standardised film clips that characterise real-life emotional experiences and target seven discrete emotions and neutrality. Our system uses a 2-s window and a 50 percent overlap between two consecutive windows to segment the EEG signals. Emotional states, including not only the valence and arousal dimensions but also similar discrete emotions in the valence-arousal coordinate space, are predicted in each window. Our real-time system achieves an overall accuracy of 92.26 percent in recognising high-arousal and valenced emotions from neutrality and 86.63 percent in recognising positive from negative emotions. Moreover, our system classifies three positive emotions (joy, amusement, tenderness) with an average of 86.43 percent accuracy and four negative emotions (anger, disgust, fear, sadness) with an average of 65.09 percent accuracy. These results demonstrate the advantage over the existing state-of-the-art real-time emotion recognition systems from EEG signals in terms of classification accuracy and the ability to recognise similar discrete emotions that are close in the valence-arousal coordinate space.
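The windowing scheme is simple enough to sketch directly: 2-second windows with 50 percent overlap over a multi-channel EEG array. The sampling rate and the synthetic signal below are assumptions for illustration.

```python
# Segment EEG into 2-s windows with 50% overlap, as described above;
# the 128 Hz sampling rate and random signal are placeholders.
import numpy as np

def segment(eeg, fs, win_sec=2.0, overlap=0.5):
    """Split a (channels, samples) EEG array into overlapping windows."""
    win = int(win_sec * fs)
    step = int(win * (1 - overlap))
    starts = range(0, eeg.shape[1] - win + 1, step)
    return np.stack([eeg[:, s:s + win] for s in starts])

if __name__ == "__main__":
    fs = 128                                   # assumed sampling rate (Hz)
    eeg = np.random.randn(32, fs * 60)         # 32 channels, 60 s of signal
    windows = segment(eeg, fs)
    print(windows.shape)                       # (59, 32, 256)
```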

218 citations


Journal ArticleDOI
TL;DR: The multiple feature fusion approach is robust in dealing with video-based facial expression recognition problems under lab-controlled environment and in the wild compared with the other state-of-the-art methods.
Abstract: Video based facial expression recognition has been a long standing problem and attracted growing attention recently. The key to a successful facial expression recognition system is to exploit the potentials of audiovisual modalities and design robust features to effectively characterize the facial appearance and configuration changes caused by facial motions. We propose an effective framework to address this issue in this paper. In our study, both visual modalities (face images) and audio modalities (speech) are utilized. A new feature descriptor called Histogram of Oriented Gradients from Three Orthogonal Planes (HOG-TOP) is proposed to extract dynamic textures from video sequences to characterize facial appearance changes. A new effective geometric feature derived from the warp transformation of facial landmarks is also proposed to capture facial configuration changes. Moreover, the role of audio modalities on recognition is also explored in our study. We apply multiple feature fusion to tackle the video-based facial expression recognition problems under lab-controlled environment and in the wild, respectively. Experiments conducted on the extended Cohn-Kanade (CK+) database and the Acted Facial Expression in Wild (AFEW) 4.0 database show that our approach is robust in dealing with video-based facial expression recognition problems under lab-controlled environment and in the wild compared with the other state-of-the-art methods.
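A rough sketch of the HOG-TOP idea follows: HOG descriptors computed on the XY, XT and YT planes of a video volume and concatenated. Sampling a single middle slice per plane and the HOG parameters are simplifying assumptions, not the authors' settings.

```python
# HOG over three orthogonal planes of a video block; plane sampling and
# HOG parameters here are illustrative choices.
import numpy as np
from skimage.feature import hog

def hog_top(volume, **hog_kwargs):
    """volume: (T, H, W) grayscale video block."""
    t, h, w = volume.shape
    xy = volume[t // 2]            # spatial plane at the middle frame
    xt = volume[:, h // 2, :]      # temporal plane at the middle row
    yt = volume[:, :, w // 2]      # temporal plane at the middle column
    return np.concatenate([hog(p, **hog_kwargs) for p in (xy, xt, yt)])

if __name__ == "__main__":
    clip = np.random.rand(32, 64, 64)
    feat = hog_top(clip, orientations=8, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))
    print(feat.shape)
```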

176 citations


Journal ArticleDOI
TL;DR: A new database, CAS(ME)², provides both long videos and cropped expression samples, which may aid researchers in developing efficient algorithms for the spotting and recognition of macro-expressions and micro-expressions.
Abstract: Deception is a very common phenomenon and its detection can be beneficial to our daily lives. Compared with other deception cues, micro-expression has shown great potential as a promising cue for deception detection. The spotting and recognition of micro-expression from long videos may significantly aid both law enforcement officers and researchers. However, a database that contains both micro-expression and macro-expression in long videos is still not publicly available. To facilitate development in this field, we present a new database, Chinese Academy of Sciences Macro-Expressions and Micro-Expressions (CAS(ME)²), which provides both macro-expressions and micro-expressions in two parts (A and B). Part A contains 87 long videos that contain spontaneous macro-expressions and micro-expressions. Part B includes 300 cropped spontaneous macro-expression samples and 57 micro-expression samples. The emotion labels are based on a combination of action units (AUs), self-reported emotion for every facial movement, and the emotion types of emotion-evoking videos. Local Binary Pattern (LBP) was employed for the spotting and recognition of macro-expressions and micro-expressions and the results were reported as a baseline evaluation. The CAS(ME)² database offers both long videos and cropped expression samples, which may aid researchers in developing efficient algorithms for the spotting and recognition of macro-expressions and micro-expressions.
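A minimal sketch of an LBP histogram feature of the kind used for the baseline evaluation, using scikit-image's uniform LBP; the radius, neighbourhood size and the absence of spatial blocks are illustrative simplifications.

```python
# Uniform LBP histogram of a face image; parameters are illustrative.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray, P=8, R=1):
    codes = local_binary_pattern(gray, P, R, method="uniform")
    n_bins = P + 2                                  # uniform patterns + "other"
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist

if __name__ == "__main__":
    face = (np.random.rand(128, 128) * 255).astype(np.uint8)
    print(lbp_histogram(face))
```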

174 citations


Journal ArticleDOI
TL;DR: A new approach to predict Beck Depression Inventory II (BDI-II) values from video data is proposed based on deep networks, designed in a two-stream manner and aiming at capturing both facial appearance and dynamics.
Abstract: As a severe psychiatric disorder, depression is a state of low mood and aversion to activity, which prevents a person from functioning normally in both work and daily life. The study of automated mental health assessment has been given increasing attention in recent years. In this paper, we study the problem of automatic diagnosis of depression. A new approach to predict the Beck Depression Inventory II (BDI-II) values from video data is proposed based on deep networks. The proposed framework is designed in a two-stream manner, aiming at capturing both the facial appearance and dynamics. Further, we employ joint tuning layers that can implicitly integrate the appearance and dynamic information. Experiments are conducted on two depression databases, AVEC2013 and AVEC2014. The experimental results show that our proposed approach significantly improves the depression prediction performance, compared to other visual-based approaches.
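A toy two-stream regressor in PyTorch, sketching the appearance-plus-dynamics structure with joint layers described above; the backbones, layer sizes and the use of optical flow as the dynamics input are assumptions, not the authors' architecture.

```python
# Two-stream regression toward a single BDI-II score; all sizes illustrative.
import torch
import torch.nn as nn

class Stream(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(32 * 8 * 8, 128), nn.ReLU())

    def forward(self, x):
        return self.net(x)

class TwoStreamRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.appearance = Stream(in_channels=3)   # RGB frame
        self.dynamics = Stream(in_channels=2)     # assumed optical-flow input (dx, dy)
        self.joint = nn.Sequential(               # stand-in for joint tuning layers
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, frame, flow):
        fused = torch.cat([self.appearance(frame), self.dynamics(flow)], dim=1)
        return self.joint(fused).squeeze(1)       # predicted BDI-II score

if __name__ == "__main__":
    model = TwoStreamRegressor()
    score = model(torch.randn(4, 3, 112, 112), torch.randn(4, 2, 112, 112))
    print(score.shape)                            # torch.Size([4])
```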

145 citations


Journal ArticleDOI
TL;DR: Rolling multi-task hypergraph learning (RMTHG) is presented to consistently combine these factors and a learning algorithm is designed for automatic optimization to predict the personalized emotion perceptions of images for each individual viewer.
Abstract: Images can convey rich semantics and induce various emotions to viewers. Most existing works on affective image analysis focused on predicting the dominant emotions for the majority of viewers. However, such dominant emotion is often insufficient in real-world applications, as the emotions induced by an image are highly subjective and differ across viewers. In this paper, we propose to predict the personalized emotion perceptions of images for each individual viewer. Different types of factors that may affect personalized image emotion perceptions, including visual content, social context, temporal evolution, and location influence, are jointly investigated. Rolling multi-task hypergraph learning (RMTHG) is presented to consistently combine these factors and a learning algorithm is designed for automatic optimization. For evaluation, we set up a large scale image emotion dataset from Flickr, named Image-Emotion-Social-Net, on both dimensional and categorical emotion representations with over 1 million images and about 8,000 users. Experiments conducted on this dataset demonstrate that the proposed method can achieve significant performance gains on personalized emotion classification, as compared to several state-of-the-art approaches.

142 citations


Journal ArticleDOI
TL;DR: Using statistical features extracted from speaking behaviour, eye activity, and head pose, the behaviour associated with major depression is characterised, and classification performance is examined for individual modalities and for their fusion.
Abstract: An estimated 350 million people worldwide are affected by depression. Using affective sensing technology, our long-term goal is to develop an objective multimodal system that augments clinical opinion during the diagnosis and monitoring of clinical depression. This paper steps towards developing a classification system-oriented approach, where feature selection, classification and fusion-based experiments are conducted to infer which types of behaviour (verbal and nonverbal) and behaviour combinations can best discriminate between depression and non-depression. Using statistical features extracted from speaking behaviour, eye activity, and head pose, we characterise the behaviour associated with major depression and examine the performance of the classification of individual modalities and when fused. Using a real-world, clinically validated dataset of 30 severely depressed patients and 30 healthy control subjects, a Support Vector Machine is used for classification with several feature selection techniques. Given the statistical nature of the extracted features, feature selection based on T-tests performed better than other methods. Individual modality classification results were considerably higher than chance level (83 percent for speech, 73 percent for eye, and 63 percent for head). Fusing all modalities shows a remarkable improvement compared to unimodal systems, which demonstrates the complementary nature of the modalities. Among the different fusion approaches used here, feature fusion performed best with up to 88 percent average accuracy. We believe that is due to the compatible nature of the extracted statistical features.
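A minimal sketch of the classification setup: feature-level fusion of modality features, t-test-style feature selection and a linear SVM, on synthetic data. Scikit-learn's f_classif (equivalent to a two-sample t-test ranking for two classes) stands in for the paper's T-test selection, and all dimensions are illustrative.

```python
# Fuse modality features, select by ANOVA F (≈ t-test for 2 classes), classify with SVM.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
speech = rng.normal(size=(60, 40))          # synthetic speech features
eye = rng.normal(size=(60, 20))             # synthetic eye-activity features
head = rng.normal(size=(60, 10))            # synthetic head-pose features
X = np.hstack([speech, eye, head])          # feature-level fusion
y = np.repeat([0, 1], 30)                   # 30 depressed / 30 controls

clf = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=20), SVC(kernel="linear"))
print(cross_val_score(clf, X, y, cv=5).mean())
```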

124 citations


Journal ArticleDOI
TL;DR: The focus of this paper is on developing automatic classifiers to infer working conditions and stress related mental states from a multimodal set of sensor data (computer logging, facial expressions, posture and physiology).
Abstract: Employees often report the experience of stress at work. In the SWELL project we investigate how new context aware pervasive systems can support knowledge workers to diminish stress. The focus of this paper is on developing automatic classifiers to infer working conditions and stress related mental states from a multimodal set of sensor data (computer logging, facial expressions, posture and physiology). We address two methodological and applied machine learning challenges: 1) Detecting work stress using several (physically) unobtrusive sensors, and 2) Taking into account individual differences. A comparison of several classification approaches showed that, for our SWELL-KW dataset, neutral and stressful working conditions can be distinguished with 90 percent accuracy by means of SVM. Posture yields most valuable information, followed by facial expressions. Furthermore, we found that the subjective variable ‘mental effort’ can be better predicted from sensor data than, e.g., ‘perceived stress’. A comparison of several regression approaches showed that mental effort can be predicted best by a decision tree (correlation of 0.82). Facial expressions yield most valuable information, followed by posture. We find that especially for estimating mental states it makes sense to address individual differences. When we train models on particular subgroups of similar users, (in almost all cases) a specialized model performs equally well or better than a generic model.

104 citations


Journal ArticleDOI
TL;DR: A computational framework for automatically quantifying verbal and nonverbal behaviors in the context of job interviews is presented; it recommends speaking more fluently, using fewer filler words, speaking as “we” (versus “I”), using more unique words, and smiling more.
Abstract: We present a computational framework for automatically quantifying verbal and nonverbal behaviors in the context of job interviews. The proposed framework is trained by analyzing the videos of 138 interview sessions with 69 internship-seeking undergraduates at the Massachusetts Institute of Technology (MIT). Our automated analysis includes facial expressions (e.g., smiles, head gestures, facial tracking points), language (e.g., word counts, topic modeling), and prosodic information (e.g., pitch, intonation, and pauses) of the interviewees. The ground truth labels are derived by taking a weighted average over the ratings of nine independent judges. Our framework can automatically predict the ratings for interview traits such as excitement, friendliness, and engagement with correlation coefficients of 0.70 or higher, and can quantify the relative importance of prosody, language, and facial expressions. By analyzing the relative feature weights learned by the regression models, our framework recommends to speak more fluently, use fewer filler words, speak as “we” (versus “I”), use more unique words, and smile more. We also find that the students who were rated highly while answering the first interview question were also rated highly overall (i.e., first impression matters). Finally, our MIT Interview dataset is available to other researchers to further validate and expand our findings.
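A small sketch of the pipeline shape described above: ground-truth ratings built as a weighted average over judges, and a linear regressor whose coefficients serve as relative feature weights. The feature names, weighting and data are hypothetical, and Ridge regression stands in for the paper's regression models.

```python
# Weighted-average judge ratings + linear regression with interpretable weights.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
judge_ratings = rng.uniform(1, 7, size=(138, 9))       # 9 judges per interview
overall = judge_ratings @ rng.dirichlet(np.ones(9))    # weighted-average ground truth

# Hypothetical behavioural features, loosely tied to the rating for illustration.
feature_names = ["speaking_rate", "filler_words", "unique_words", "smiles", "pitch_var"]
X = rng.normal(size=(138, 5)) + 0.3 * (overall - overall.mean())[:, None] * [1, -1, 1, 1, 0]
X = StandardScaler().fit_transform(X)

model = Ridge(alpha=1.0).fit(X, overall)
for name, w in sorted(zip(feature_names, model.coef_), key=lambda t: -abs(t[1])):
    print(f"{name:>14s}: {w:+.3f}")                    # relative feature weights
```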

Journal ArticleDOI
TL;DR: This paper proposes to improve the assembling of the committee by introducing supervised learning on the ensemble computation: a CNN is trained on the posterior-class probabilities produced by the individual members, which allows non-linear dependencies among committee members to be captured and the combination to be learned from data.
Abstract: Automated emotion recognition from facial images is an unsolved problem in computer vision. Although recent methods achieve close to human accuracy in controlled scenarios, the recognition of emotions in the wild remains a challenging problem. Recent advances in Deep learning have brought a significant breakthrough in many computer vision tasks, including facial expression analysis. Particularly, the use of Deep Convolutional Neural Networks has attained the best results in the recent public challenges. The current state-of-the-art algorithms suggest that the use of ensembles of CNNs can outperform individual CNN classifiers. Two key considerations influence these results: (i) the design of CNNs involves the adjustment of parameters that allow diversity and complementarity in the partial classification results, and (ii) the final classification rule that assembles the result of the committee. In this paper we propose to improve the assembling of the committee by introducing supervised learning on the ensemble computation. We train a CNN on the posterior-class probabilities resulting from the individual members, which allows us to capture non-linear dependencies among committee members and to learn this combination from data. The validation shows an accuracy 5 percent higher than previous state-of-the-art results based on averaging classifiers, and 4 percent higher than the majority voting rule.
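A minimal sketch of learning the committee combination from member posteriors, compared against simple averaging; an MLP on concatenated probabilities stands in for the combining CNN, and all data are synthetic.

```python
# Learn the ensemble combination from stacked posterior probabilities.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_samples, n_members, n_classes = 1000, 5, 7
y = rng.integers(0, n_classes, n_samples)
# Simulated member posteriors: noisy distributions biased toward the true label.
posteriors = rng.dirichlet(np.ones(n_classes), size=(n_samples, n_members))
for m in range(n_members):
    posteriors[np.arange(n_samples), m, y] += 1.0
posteriors /= posteriors.sum(axis=2, keepdims=True)

X = posteriors.reshape(n_samples, -1)        # concatenated committee outputs
train, test = slice(0, 800), slice(800, None)

avg_pred = posteriors[test].mean(axis=1).argmax(axis=1)            # averaging rule
meta = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
meta.fit(X[train], y[train])                                       # learned combination
print("averaging:", (avg_pred == y[test]).mean(),
      "learned:", meta.score(X[test], y[test]))
```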

Journal ArticleDOI
Baohan Xu, Yanwei Fu, Yu-Gang Jiang, Boyang Li, Leonid Sigal
TL;DR: In this paper, the authors proposed a technique for transferring knowledge from heterogeneous external sources, including image and textual data, to facilitate three related tasks in understanding video emotion: emotion recognition, emotion attribution and emotion-oriented summarization.
Abstract: Emotion is a key element in user-generated video. However, it is difficult to understand emotions conveyed in such videos due to the complex and unstructured nature of user-generated content and the sparsity of video frames expressing emotion. In this paper, for the first time, we propose a technique for transferring knowledge from heterogeneous external sources, including image and textual data, to facilitate three related tasks in understanding video emotion: emotion recognition, emotion attribution and emotion-oriented summarization. Specifically, our framework (1) learns a video encoding from an auxiliary emotional image dataset in order to improve supervised video emotion recognition, and (2) transfers knowledge from an auxiliary textual corpus for zero-shot recognition of emotion classes unseen during training. The proposed technique for knowledge transfer facilitates novel applications of emotion attribution and emotion-oriented summarization. A comprehensive set of experiments on multiple datasets demonstrates the effectiveness of our framework.

Journal ArticleDOI
TL;DR: This paper recognizes an individual's personality traits by analyzing brain waves when he or she watches emotional materials and demonstrates the advantage of personality inference from EEG signals over state-of-the-art explicit behavioral indicators in terms of classification accuracy.
Abstract: The stable relationship between personality and EEG ensures the feasibility of personality inference from brain activities. In this paper, we recognize an individual's personality traits by analyzing brain waves when he or she watches emotional materials. Thirty-seven participants took part in this study and watched 7 standardized film clips that characterize real-life emotional experiences and target seven discrete emotions. Features extracted from EEG signals and subjective ratings enter the SVM classifier as inputs to predict five dimensions of personality traits. Our model achieves better classification performance for Extraversion (81.08 percent), Agreeableness (86.11 percent), and Conscientiousness (80.56 percent) when positive emotions are elicited than negative ones, higher classification accuracies for Neuroticism (78.38-81.08 percent) when negative emotions, except disgust, are evoked than positive emotions, and the highest classification accuracy for Openness (83.78 percent) when a disgusting film clip is presented. Additionally, the introduction of features from subjective ratings increases not only classification accuracy in all five personality traits (ranging from 0.43 percent for Conscientiousness to 6.3 percent for Neuroticism) but also the discriminative power of the classification accuracies between five personality traits in each category of emotion. These results demonstrate the advantage of personality inference from EEG signals over state-of-the-art explicit behavioral indicators in terms of classification accuracy.

Journal ArticleDOI
TL;DR: A novel transductive transfer subspace learning method for cross-domain facial expression recognition that achieves much better recognition performance compared with the state-of-the-art methods.
Abstract: Facial expression recognition across domains, e.g., training and testing facial images come from different facial poses, is very challenging due to the different marginal distributions between training and testing facial feature vectors. To deal with such challenging cross-domain facial expression recognition problem, a novel transductive transfer subspace learning method is proposed in this paper. In this method, a labelled facial image set from the source domain is combined with an unlabelled auxiliary facial image set from the target domain to jointly learn a discriminative subspace and make the class labels prediction of the unlabelled facial images, where a transductive transfer regularized least-squares regression (TTRLSR) model is proposed to this end. Then, based on the auxiliary facial image set, we train an SVM classifier for classifying the expressions of other facial images in the target domain. Moreover, we also investigate the use of color facial features to evaluate the recognition performance of the proposed facial expression recognition method, where color scale invariant feature transform (CSIFT) features associated with 49 landmark facial points are extracted to describe each color facial image. Finally, extensive experiments on BU-3DFE and Multi-PIE multiview color facial expression databases are conducted to evaluate the cross-database & cross-view facial expression recognition performance of the proposed method. Comparisons with state-of-the-art domain adaption methods are also included in the experiments. The experimental results demonstrate that the proposed method achieves much better recognition performance compared with the state-of-the-art methods.

Journal ArticleDOI
TL;DR: An audio-visual emotion recognition system that uses a mixture of rule-based and machine learning techniques to improve the recognition efficacy in the audio and video paths is proposed.
Abstract: This paper proposes an audio-visual emotion recognition system that uses a mixture of rule-based and machine learning techniques to improve the recognition efficacy in the audio and video paths. The visual path is designed using the Bi-directional Principal Component Analysis (BDPCA) and Least-Square Linear Discriminant Analysis (LSLDA) for dimensionality reduction and discrimination. The extracted visual features are passed into a newly designed Optimized Kernel-Laplacian Radial Basis Function (OKL-RBF) neural classifier. The audio path is designed using a combination of input prosodic features (pitch, log-energy, zero crossing rates and Teager energy operator) and spectral features (Mel-scale frequency cepstral coefficients). The extracted audio features are passed into an audio feature level fusion module that uses a set of rules to determine the most likely emotion contained in the audio signal. An audio visual fusion module fuses outputs from both paths. The performances of the proposed audio path, visual path, and the final system are evaluated on standard databases. Experiment results and comparisons reveal the good performance of the proposed system.
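A minimal sketch of extracting the audio-path features named above (pitch, log-energy, zero crossing rate, Teager energy and MFCCs) with librosa and NumPy on a synthetic signal; frame settings are library defaults and the rule-based fusion stage is omitted.

```python
# Extract prosodic and spectral audio features of the kind listed above.
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)   # synthetic stand-in for speech

pitch = librosa.yin(y, fmin=65, fmax=400, sr=sr)            # f0 track
log_energy = np.log(librosa.feature.rms(y=y)[0] + 1e-8)
zcr = librosa.feature.zero_crossing_rate(y)[0]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
teager = y[1:-1] ** 2 - y[:-2] * y[2:]                      # Teager energy operator

features = np.concatenate([
    [pitch.mean(), log_energy.mean(), zcr.mean(), teager.mean()],
    mfcc.mean(axis=1)])
print(features.shape)                                       # (17,)
```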

Journal ArticleDOI
TL;DR: A methodology for analyzing multimodal stress detection results by taking into account the variety of stress assessments is introduced and it is argued that a multiple assessment approach provides more robust results.
Abstract: Stress is a complex phenomenon that impacts the body and the mind at several levels. It has been studied for more than a century from different perspectives, which result in different definitions and different ways to assess the presence of stress. This paper introduces a methodology for analyzing multimodal stress detection results by taking into account the variety of stress assessments. As a first step, we have collected video, depth and physiological data from 25 subjects in a stressful situation: a socially evaluated mental arithmetic test. As a second step, we have acquired three different assessments of stress: self-assessment, assessments from external observers and assessment from a physiology expert. Finally, we extract 101 behavioural and physiological features and evaluate their predictive power for the three collected assessments using a classification task. Using multimodal features, we obtain average F1 scores up to 0.85. By investigating the composition of the best selected feature subsets and the individual feature classification performances, we show that several features provide valuable information for the classification of the three assessments: features related to body movement, blood volume pulse and heart rate. From a methodological point of view, we argue that a multiple assessment approach provides more robust results.

Journal ArticleDOI
TL;DR: A multidisciplinary state-of-the-art for affective movie content analysis is given, in order to promote and encourage exchanges between researchers from a very wide range of fields.
Abstract: In our present society, the cinema has become one of the major forms of entertainment providing unlimited contexts of emotion elicitation for the emotional needs of human beings. Since emotions are universal and shape all aspects of our interpersonal and intellectual experience, they have proved to be a highly multidisciplinary research field, ranging from psychology, sociology, neuroscience, etc., to computer science. However, affective multimedia content analysis work from the computer science community benefits but little from the progress achieved in other research fields. In this paper, a multidisciplinary state-of-the-art for affective movie content analysis is given, in order to promote and encourage exchanges between researchers from a very wide range of fields. In contrast to other state-of-the-art papers on affective video content analysis, this work confronts the ideas and models of psychology, sociology, neuroscience, and computer science. The concepts of aesthetic emotions and emotion induction, as well as the different representations of emotions are introduced, based on psychological and sociological theories. Previous global and continuous affective video content analysis work, including video emotion recognition and violence detection, are also presented in order to point out the limitations of affective video content analysis work.

Journal ArticleDOI
TL;DR: This study presents and evaluates several models trained on a large dataset of short YouTube video blog posts for predicting apparent Big Five personality traits of people and whether they seem suitable to be recommended to a job interview.
Abstract: People form first impressions about the personalities of unfamiliar individuals even after very brief interactions with them. In this study we present and evaluate several models that mimic this automatic social behavior. Specifically, we present several models trained on a large dataset of short YouTube video blog posts for predicting apparent Big Five personality traits of people and whether they seem suitable to be recommended to a job interview. Along with presenting our audiovisual approach and results that won the third place in the ChaLearn First Impressions Challenge, we investigate modeling in different modalities including audio only, visual only, language only, audiovisual, and combination of audiovisual and language. Our results demonstrate that the best performance could be obtained using a fusion of all data modalities. Finally, in order to promote explainability in machine learning and to provide an example for the upcoming ChaLearn challenges, we present a simple approach for explaining the predictions for job interview recommendations.

Journal ArticleDOI
TL;DR: This paper proposes a deep learning solution to APA from short video sequences, which also formed the winning entry in the Apparent Personality Analysis competition track of the ChaLearn Looking at People challenge in association with ECCV 2016.
Abstract: Apparent personality analysis (APA) is an important problem of personality computing, and furthermore, automatic APA becomes a hot and challenging topic in computer vision and multimedia. In this paper, we propose a deep learning solution to APA from short video sequences. In order to capture rich information from both the visual and audio modality of videos, we tackle these tasks with our Deep Bimodal Regression (DBR) framework. In DBR, for the visual modality, we modify the traditional convolutional neural networks for exploiting important visual cues. In addition, taking into account the model efficiency, we extract audio representations and build a linear regressor for the audio modality. For combining the complementary information from the two modalities, we ensemble these predicted regression scores by both early fusion and late fusion. Finally, based on the proposed framework, we come up with a solution for the Apparent Personality Analysis competition track in the ChaLearn Looking at People challenge in association with ECCV 2016. Our DBR is the winner (first place) of this challenge with 86 registered participants. Beyond the competition, we further investigate the performance of different loss functions in our visual models, and prove non-convex loss functions for regression are optimal on the human-labeled video data.

Journal ArticleDOI
TL;DR: This paper takes a computational approach to studying details of facial expressions of children with high functioning autism, using motion capture data obtained from subjects with HFA and typically developing subjects while they produced various facial expressions, in order to uncover characteristics of facial expression that are otherwise difficult to detect by visual inspection.
Abstract: Several studies have established that facial expressions of children with autism are often perceived as atypical, awkward or less engaging by typical adult observers. Despite this clear deficit in the quality of facial expression production, very little is understood about its underlying mechanisms and characteristics. This paper takes a computational approach to studying details of facial expressions of children with high functioning autism (HFA). The objective is to uncover those characteristics of facial expressions, notably distinct from those in typically developing children, and which are otherwise difficult to detect by visual inspection. We use motion capture data obtained from subjects with HFA and typically developing subjects while they produced various facial expressions. This data is analyzed to investigate how the overall and local facial dynamics of children with HFA differ from their typically developing peers. Our major observations include reduced complexity in the dynamic facial behavior of the HFA group arising primarily from the eye region.

Journal ArticleDOI
TL;DR: An approach for stress assessment is presented that leverages data extracted from smartphone sensors and is not invasive concerning privacy; results show how the two proposed methods enable an accurate stress assessment without being too intrusive, thus increasing ecological validity of the data and user acceptance.
Abstract: The increasing presence of stress in people's lives has motivated much research effort focusing on continuous stress assessment methods for individuals, leveraging smartphones and wearable devices. These methods have several drawbacks, i.e., they use invasive external devices, thus increasing entry costs and reducing user acceptance, or they use privacy-sensitive information. This paper presents an approach for stress assessment that leverages data extracted from smartphone sensors, and that is not invasive concerning privacy. Two different approaches are presented: the first, based on smartphone gesture analysis, e.g., ‘tap’, ‘scroll’, ‘swipe’ and ‘text writing’, was evaluated in laboratory settings with 13 participants (F-measure 79-85 percent within-subject model, 70-80 percent global model); the second, based on smartphone usage analysis, was tested in-the-wild with 25 participants (F-measure 77-88 percent within-subject model, 63-83 percent global model). Results show how these two methods enable an accurate stress assessment without being too intrusive, thus increasing ecological validity of the data and user acceptance.

Journal ArticleDOI
TL;DR: This research addresses the role of lyrics in the music emotion recognition process by conducting regression and classification experiments by quadrant, arousal and valence categories, and by identifying interpretable rules that show the relations between features and emotions and among features.
Abstract: This research addresses the role of lyrics in the music emotion recognition process. Our approach is based on several state of the art features complemented by novel stylistic, structural and semantic features. To evaluate our approach, we created a ground truth dataset containing 180 song lyrics, according to Russell's emotion model. We conduct four types of experiments: regression and classification by quadrant, arousal and valence categories. Compared to the state-of-the-art features (n-grams baseline), adding other features, including novel features, improved the F-measure from 69.9, 82.7 and 85.6 percent to 80.1, 88.3 and 90 percent, respectively, for the three classification experiments. To study the relation between features and emotions (quadrants) we performed experiments to identify the features that best describe and discriminate each quadrant. To further validate these experiments, we built a validation set comprising 771 lyrics extracted from the AllMusic platform, having achieved 73.6 percent F-measure in the classification by quadrants. We also conducted experiments to identify interpretable rules that show the relation between features and emotions and the relation among features. Regarding regression, results show that, compared to similar studies for audio, we achieve a similar performance for arousal and a much better performance for valence.
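For context, the n-gram baseline that the reported improvements are measured against can be sketched as a TF-IDF bag of unigrams/bigrams feeding a linear classifier over Russell quadrants; the toy lyrics and labels below are placeholders, and the novel stylistic, structural and semantic features are not reproduced.

```python
# N-gram baseline for lyric emotion classification by Russell quadrant.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

lyrics = ["sunshine dancing all night long", "tears falling in the cold rain",
          "rage burning through the dark", "quiet evening soft and warm"]
quadrants = ["Q1", "Q3", "Q2", "Q4"]        # toy quadrant labels

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(lyrics, quadrants)
print(model.predict(["storm and fury in my heart"]))
```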

Journal ArticleDOI
TL;DR: This work models the personality traits of users from the collection of images they tag as ‘favorite’ (or like) on Flickr, using a novel machine learning approach based on semantic image features.
Abstract: The increased proliferation of data production technologies (e.g., cameras) and consumption avenues (e.g., social media) has led to images and videos being utilized by users to convey innate preferences and tastes. This has opened up the possibility of using multimedia as a source for user-modeling. This work attempts to model personality traits (based on the Five Factor Theory) of users using a collection of images they tag as ‘favorite’ (or like) on Flickr. First, a set of semantic features are proposed to be used for representing different concepts in images which influence users to like them. The addition of the proposed features led to improvement over state-of-the-art by 12 percent. Second, a novel machine learning approach is developed to model users’ personality based on the image features (resulting in up to 15 percent improvement). Third, efficacy of the semantic features and the modeling approach is shown in recommending images based on personality modeling. Using the modeling approach, recommendations are made regarding the factors that might influence users with different personality traits to like an image.

Journal ArticleDOI
TL;DR: Experimental results show this model works very well, and time-frequency features are effective in characterizing and recognizing emotions for this non-contact gait data; the recognition accuracy can be further improved, on average, by the optimization algorithm.
Abstract: Automatic emotion recognition from gait information is discussed in this paper, which has been investigated widely in the fields of human-machine interaction, psychology, psychiatry, behavioral science, etc. The gait information is non-contact, collected with Microsoft Kinect sensors, and contains the 3-dimensional coordinates of 25 joints per person. These joint coordinates vary with time. Using the discrete Fourier transform and statistical methods, time-frequency features related to neutral, happy and angry emotions are extracted and used to establish a classification model to identify these three emotions. Experimental results show this model works very well, and time-frequency features are effective in characterizing and recognizing emotions from this non-contact gait data. In particular, with the optimization algorithm, the recognition accuracy can be further improved by about 13.7 percent on average.
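A minimal sketch of turning Kinect joint trajectories into time-frequency features with the discrete Fourier transform, as described above; the number of retained frequency bins and the time-domain statistics are illustrative choices.

```python
# DFT-based time-frequency features from (frames, 25 joints, 3 coords) gait data.
import numpy as np

def gait_features(joints, n_freq=5):
    """joints: (frames, 25, 3) array of 3-D joint coordinates over time."""
    frames = joints.shape[0]
    flat = joints.reshape(frames, -1)                 # (frames, 75) signals
    spectrum = np.abs(np.fft.rfft(flat, axis=0))      # magnitude spectra
    freq_part = spectrum[1:n_freq + 1].ravel()        # low-frequency bins (skip DC)
    time_part = np.concatenate([flat.mean(axis=0), flat.std(axis=0)])
    return np.concatenate([time_part, freq_part])

if __name__ == "__main__":
    walk = np.random.rand(120, 25, 3)                 # ~4 s of Kinect frames
    print(gait_features(walk).shape)                  # (525,)
```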

Journal ArticleDOI
TL;DR: This paper aims to improve the recommendation accuracy of socially-aware recommender systems by proposing a linear hybrid recommender algorithm called Personality and Socially-Aware Recommender (PerSAR), which hybridizes the social and personality behaviours of smart conference attendees.
Abstract: In order to innovatively solve cold-start problems, research involving trust and socially aware recommender systems is currently proliferating. The relative importance of academic conferences has led to the necessity of recommender systems that seek to generate recommendations for conference attendees. In this paper, we aim to improve the recommendation accuracy of socially-aware recommender systems by proposing a linear hybrid recommender algorithm called Personality and Socially-Aware Recommender (PerSAR). PerSAR hybridizes the social and personality behaviours of smart conference attendees. Our recommendation methodology mainly aims to employ an algorithmic framework that computes the personality similarities and tie strengths of conference attendees so that effective and reliable recommendations can be generated for them using a relevant dataset. The experimental results substantiate that our proposed recommendation method is favorable and outperforms other related and contemporary recommendation methods and techniques.

Journal ArticleDOI
TL;DR: A data-driven facial beauty analysis framework that contains three application modules: prediction, retrieval, and manipulation is proposed and experimental results show that the exemplar-based approach is better for shape beautification; the model-based approach is suitable for texture beautification; and the combination of them can increase the attractiveness of a query face robustly.
Abstract: Facial beauty analysis becomes an emerging research area due to many potential applications, such as aesthetic surgery plan, cosmetic industry, photo retouching, and entertainment. In this paper, we propose a data-driven facial beauty analysis framework that contains three application modules: prediction, retrieval, and manipulation. A beauty model is the core of the framework. With carefully designed features, the model can be built for different purposes. For prediction, we combine several low-level face representations and high-level features to form a feature vector and perform feature selection to optimize the feature set. The model built with the optimized feature set outperforms state-of-the-art methods. Then, we discuss two scenarios of beauty-oriented face retrieval: for recommendation and for beautification. Finally, we propose two approaches for facial beauty manipulation. One is an exemplar-based approach that uses the retrieved results. The other is a model-based approach that modifies facial features along the gradient of the beauty model. In this case, the model is built with the shape or appearance feature. Experimental results show that the exemplar-based approach is better for shape beautification; the model-based approach is suitable for texture beautification; and the combination of them can increase the attractiveness of a query face robustly.
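A minimal sketch of the model-based manipulation idea: with a (here linear) beauty model over shape features, a query's feature vector is moved a small step along the model gradient to raise its predicted score. The Ridge model, feature dimensionality and step size are illustrative simplifications of the paper's approach.

```python
# Gradient-step beautification under a simple linear beauty model (illustrative).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
shapes = rng.normal(size=(200, 30))            # e.g., normalised landmark-shape features
scores = shapes @ rng.normal(size=30) + rng.normal(scale=0.1, size=200)

beauty_model = Ridge(alpha=1.0).fit(shapes, scores)
query = rng.normal(size=30)
gradient = beauty_model.coef_                  # d(score)/d(features) for a linear model
beautified = query + 0.1 * gradient / np.linalg.norm(gradient)
print(beauty_model.predict([query])[0], "->", beauty_model.predict([beautified])[0])
```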

Journal ArticleDOI
TL;DR: The design of a VR-based social communication platform augmented with a technologically-enhanced eye-tracking facility is presented as a proof-of-concept application; results of a usability study carried out in the Indian sub-continent showed the potential of the system to have implications on one's task performance and gaze-related indices in response to a virtual peer's emotional expressions.
Abstract: Autism spectrum disorder (ASD) is often characterized by core deficits in social communication and ability to understand others’ non-verbal emotional cues. This can be attributed to their atypical eye-gaze patterns along with reduced fixation towards communicator's face during social communication. With technological progress, Virtual Reality (VR) augmented with peripherals such as, eye tracker can offer a promising complementary assistive platform for presenting various social situations to this target group along with quantification of one's task performance and measurement of gaze-related indices. This paper presents the design of a VR-based social communication platform augmented with technologically-enhanced eye-tracking facility as a proof-of-concept application. We measured one's performance score along with real-time synchronized gaze-related indices while one interacted with VR-based social tasks having both context-relevant verbal and non-verbal components of social interaction. The results of a usability study carried out in the Indian sub-continent with eight pairs of individuals with ASD and typically-developing individuals showed the potential of our system to have implications on one's task performance and gaze-related indices in response to virtual peer's emotional expressions. The implication of emotions on gaze-related behavioral and physiological indices shows the potential of using gaze-related indices as bio-markers of one's anxiety during social communication.

Journal ArticleDOI
TL;DR: The VRST revealed that the increased difficulty found in tasks like the Stroop interference task directly evokes autonomic changes in psychophysiological arousal beyond the threatening stimuli themselves.
Abstract: Understanding the ways in which persons rapidly transfer attention between tasks while still retaining ability to perform these tasks is an important area of study. Everyday activities commonly come in the form of emotional distractors. A recently developed Virtual Reality Stroop Task (VRST) allows for assessing neurocognitive and psychophysiological responding while traveling through simulated safe and ambush desert environments as Stroop stimuli appear on the windshield. We evaluated differences in psychophysiological response patterns associated with completion of an affective task alone versus completion of an affective task that also included a Stroop task. The VRST elicited increased heart rate, respiration rate, skin conductance level, and number of spontaneous fluctuations in electrodermal activity. Increased cognitive workload was found to be associated with the more cognitively challenging Stroop conditions which led to an increase in response level. This expands on previous findings and indicates that allocating attention away from the environment and toward Stroop stimuli likely requires greater inhibitory control. This is corroborated by behavioral findings from previous investigations with the VRST. The VRST revealed that the increased difficulty found in tasks like the Stroop interference task directly evoke autonomic changes in psychophysiological arousal beyond the threatening stimuli themselves.

Journal ArticleDOI
TL;DR: This study used web-based surveys to collect information about demographics, listening habits, musical education, personality, and perceptual ratings with respect to perceived emotions, tempo, complexity, and instrumentation for 15 segments of Beethoven's 3rd symphony, “Eroica”.
Abstract: This study deals with the strong relationship between emotions and music, investigating three main research questions: (RQ1) Are there differences in human music perception (e.g., emotions, tempo, instrumentation, and complexity), according to musical education, experience, demographics, and personality traits?; (RQ2) Do certain perceived music characteristics correlate (e.g., tension and sadness), irrespective of a particular listener's background or personality?; (RQ3) Does human perception of music characteristics, such as emotions and tempo, correlate with descriptors extracted from music audio signals? To investigate our research questions, we conducted two user studies focusing on different groups of subjects. We used web-based surveys to collect information about demographics, listening habits, musical education, personality, and perceptual ratings with respect to perceived emotions, tempo, complexity, and instrumentation for 15 segments of Beethoven's 3rd symphony, “Eroica”. Our experiments showed that all three research questions can be affirmed, at least partly. We found strong support for RQ2 and RQ3, while RQ1 could be confirmed only for some perceptual aspects and user groups.