
Showing papers on "Facial expression" published in 2018


Journal ArticleDOI
16 May 2018-PLOS ONE
TL;DR: The RAVDESS is a validated, gender-balanced multimodal database of emotional speech and song from 24 professional actors vocalizing lexically matched statements in a neutral North American accent; it shows high levels of emotional validity and test-retest intrarater reliability.
Abstract: The RAVDESS is a validated multimodal database of emotional speech and song. The database is gender balanced, consisting of 24 professional actors vocalizing lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity, with an additional neutral expression. All conditions are available in face-and-voice, face-only, and voice-only formats. Each of the 7356 recordings was rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals who were characteristic of untrained research participants from North America. A further set of 72 participants provided test-retest data. High levels of emotional validity and test-retest intrarater reliability were reported. Corrected accuracy and composite "goodness" measures are presented to assist researchers in the selection of stimuli. All recordings are made freely available under a Creative Commons license and can be downloaded at https://doi.org/10.5281/zenodo.1188976.

1,036 citations
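
As a small illustration of the kind of test-retest intrarater reliability reported above, the sketch below computes a per-rater Pearson correlation between two rating sessions and averages it. The CSV-style layout, column names, and toy values are hypothetical and do not reflect the released RAVDESS rating files.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: one row per (rater, clip) with ratings from two sessions.
ratings = pd.DataFrame({
    "rater_id":     [1, 1, 1, 2, 2, 2],
    "clip_id":      ["a", "b", "c", "a", "b", "c"],
    "intensity_t1": [3, 5, 2, 4, 5, 1],
    "intensity_t2": [3, 4, 2, 4, 5, 2],
})

def intrarater_reliability(df, col_t1="intensity_t1", col_t2="intensity_t2"):
    """Pearson correlation between a rater's two sessions, averaged over raters."""
    rs = []
    for _, grp in df.groupby("rater_id"):
        rs.append(np.corrcoef(grp[col_t1], grp[col_t2])[0, 1])
    return float(np.mean(rs))

print(f"mean test-retest intrarater reliability: {intrarater_reliability(ratings):.2f}")
```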


Book ChapterDOI
08 Sep 2018
TL;DR: In this article, a GAN conditioning scheme based on Action Unit (AU) annotations is proposed, which allows controlling the magnitude of activation of each AU and combining several of them.
Abstract: Recent advances in Generative Adversarial Networks (GANs) have shown impressive results for the task of facial expression synthesis. The most successful architecture is StarGAN, which conditions the GAN's generation process on images of a specific domain, namely a set of images of persons sharing the same expression. While effective, this approach can only generate a discrete number of expressions, determined by the content of the dataset. To address this limitation, in this paper we introduce a novel GAN conditioning scheme based on Action Unit (AU) annotations, which describes in a continuous manifold the anatomical facial movements defining a human expression. Our approach allows controlling the magnitude of activation of each AU and combining several of them. Additionally, we propose a fully unsupervised strategy to train the model, which only requires images annotated with their activated AUs, and exploit attention mechanisms that make our network robust to changing backgrounds and lighting conditions. An extensive evaluation shows that our approach goes beyond competing conditional generators both in its capability to synthesize a much wider range of expressions ruled by anatomically feasible muscle movements, and in its capacity to deal with images in the wild.

533 citations
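
To make the conditioning idea concrete, here is a minimal PyTorch sketch of a generator conditioned on a continuous AU activation vector, in the general spirit described above. It is not the paper's architecture: the layer sizes, the 17-AU dimensionality, and the broadcast-and-concatenate conditioning are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AUConditionedGenerator(nn.Module):
    """Toy generator conditioned on a continuous AU activation vector.

    The AU vector is broadcast to every pixel and concatenated with the image
    channels, so each AU's magnitude can be varied continuously at test time.
    """
    def __init__(self, num_aus=17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + num_aus, 64, 7, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 7, padding=3),
            nn.Tanh(),
        )

    def forward(self, img, au_vector):
        b, _, h, w = img.shape
        au_maps = au_vector.view(b, -1, 1, 1).expand(b, au_vector.size(1), h, w)
        return self.net(torch.cat([img, au_maps], dim=1))

gen = AUConditionedGenerator()
img = torch.rand(1, 3, 128, 128)
au = torch.zeros(1, 17)
au[0, 11] = 0.7          # partially activate one AU (the index here is arbitrary)
edited = gen(img, au)    # changing the magnitude yields a continuously varying edit
```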


Journal ArticleDOI
TL;DR: A novel framework for facial expression recognition that automatically distinguishes expressions is presented; the high recognition accuracy achieved demonstrates the feasibility and effectiveness of the approach.

419 citations


Journal ArticleDOI
TL;DR: A newly developed spontaneous micro-facial movement dataset with diverse participants, coded using the Facial Action Coding System, is presented; the proposed detection approach outperforms the state of the art with a recall of 0.91, and the dataset can become a new standard for micro-movement data.
Abstract: Micro-facial expressions are spontaneous, involuntary movements of the face when a person experiences an emotion but attempts to hide their facial expression, most likely in a high-stakes environment. Recently, research in this field has grown in popularity; however, publicly available datasets of micro-expressions have limitations due to the difficulty of naturally inducing spontaneous micro-expressions. Other issues include lighting, low resolution, and low participant diversity. We present a newly developed spontaneous micro-facial movement dataset with diverse participants, coded using the Facial Action Coding System. The experimental protocol addresses the limitations of previous datasets, including eliciting emotional responses with stimuli tailored to each participant. Dataset evaluation was completed by running preliminary experiments to classify micro-movements from non-movements. Results were obtained using a selection of spatio-temporal descriptors and machine learning. We further evaluate the dataset on emerging methods of feature difference analysis and propose an Adaptive Baseline Threshold that uses an individualised neutral expression to improve the performance of micro-movement detection. In contrast to the machine learning approaches, this thresholding method outperforms the state of the art with a recall of 0.91. The outcomes show the dataset can become a new standard for micro-movement data, with future work expanding on data representation and analysis.

353 citations
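
The sketch below illustrates, under my own simplifying assumptions, the kind of adaptive baseline thresholding the abstract describes: a per-participant threshold derived from feature differences in their neutral footage, used to flag micro-movements. The feature (a plain vector per frame) and the mean-plus-k-sigma rule are placeholders, not the paper's exact formulation.

```python
import numpy as np

def feature_difference(frames):
    """Magnitude of frame-to-frame change of some appearance feature (placeholder)."""
    return np.linalg.norm(np.diff(frames, axis=0), axis=1)

def adaptive_baseline_threshold(neutral_frames, k=3.0):
    """Threshold individualised to one participant's neutral-expression footage."""
    diffs = feature_difference(neutral_frames)
    return diffs.mean() + k * diffs.std()

def detect_micro_movements(test_frames, threshold):
    """Flag frame transitions whose feature change exceeds the personal baseline."""
    return feature_difference(test_frames) > threshold

rng = np.random.default_rng(0)
neutral = rng.normal(0, 0.05, size=(100, 64))   # features from neutral footage
test = neutral[:50].copy()
test[20:23] += 0.8                              # a brief, subtle movement
thr = adaptive_baseline_threshold(neutral)
print("movement frames:", np.where(detect_micro_movements(test, thr))[0])
```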


Proceedings ArticleDOI
18 Jun 2018
TL;DR: The DeRL method has been evaluated on five databases, CK+, Oulu-CASIA, MMI, BU-3DFE, and BP4D+.
Abstract: A facial expression is a combination of an expressive component and a neutral component of a person. In this paper, we propose to recognize facial expressions by extracting information about the expressive component through a de-expression learning procedure called De-expression Residue Learning (DeRL). First, a generative model is trained using a conditional GAN (cGAN). This model generates the corresponding neutral face image for any input face image. We call this procedure de-expression because the expressive information is filtered out by the generative model; however, the expressive information is still recorded in the intermediate layers. Given the neutral face image, unlike previous works that use pixel-level or feature-level differences for facial expression classification, our new method learns the deposition (or residue) that remains in the intermediate layers of the generative model. Such a residue is essential, as it contains the expressive component deposited in the generative model by any input facial expression image. Seven public facial expression databases are employed in our experiments. With two databases (BU-4DFE and BP4D-spontaneous) used for pre-training, the DeRL method has been evaluated on five databases: CK+, Oulu-CASIA, MMI, BU-3DFE, and BP4D+. The experimental results demonstrate the superior performance of the proposed method.

342 citations
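
A highly simplified sketch of the de-expression idea follows: a generator maps an expressive face to a neutral one, and a classifier is trained on the generator's intermediate activations rather than on pixel or feature differences. The layer shapes and pooling are made up; this is not the paper's cGAN or classifier.

```python
import torch
import torch.nn as nn

class DeExpressionGenerator(nn.Module):
    """Toy encoder-decoder that maps an expressive face to a neutral face."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        h1 = self.enc1(x)          # intermediate layers: this is where the
        h2 = self.enc2(h1)         # expressive "residue" is assumed to remain
        return self.dec(h2), (h1, h2)

class ResidueClassifier(nn.Module):
    """Classifies the expression from pooled intermediate activations."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.fc = nn.Linear(32 + 64, num_classes)

    def forward(self, intermediates):
        pooled = [h.mean(dim=(2, 3)) for h in intermediates]   # global average pool
        return self.fc(torch.cat(pooled, dim=1))

gen, clf = DeExpressionGenerator(), ResidueClassifier()
face = torch.rand(4, 3, 64, 64)
neutral, feats = gen(face)
logits = clf(feats)   # expression predicted from the residue, not from pixel differences
```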


Posted Content
TL;DR: A novel GAN conditioning scheme based on Action Unit (AU) annotations, which describes in a continuous manifold the anatomical facial movements defining a human expression, together with a fully unsupervised strategy for training the model that only requires images annotated with their activated AUs.
Abstract: Recent advances in Generative Adversarial Networks (GANs) have shown impressive results for the task of facial expression synthesis. The most successful architecture is StarGAN, which conditions the GAN's generation process on images of a specific domain, namely a set of images of persons sharing the same expression. While effective, this approach can only generate a discrete number of expressions, determined by the content of the dataset. To address this limitation, in this paper we introduce a novel GAN conditioning scheme based on Action Unit (AU) annotations, which describes in a continuous manifold the anatomical facial movements defining a human expression. Our approach allows controlling the magnitude of activation of each AU and combining several of them. Additionally, we propose a fully unsupervised strategy to train the model, which only requires images annotated with their activated AUs, and exploit attention mechanisms that make our network robust to changing backgrounds and lighting conditions. An extensive evaluation shows that our approach goes beyond competing conditional generators both in its capability to synthesize a much wider range of expressions ruled by anatomically feasible muscle movements, and in its capacity to deal with images in the wild.

251 citations


Journal ArticleDOI
TL;DR: This work designs an effective multitask network that is capable of learning from rich auxiliary attributes such as gender, age, and head pose, beyond just facial expression data, and uses the expression recognition network as the branches of a Siamese model to predict interpersonal relations.
Abstract: Interpersonal relation defines the association, e.g., warmth, friendliness, and dominance, between two or more people. We investigate whether such fine-grained and high-level relation traits can be characterized and quantified from face images in the wild. We address this challenging problem by first studying a deep network architecture for robust recognition of facial expressions. Unlike existing models that typically learn from facial expression labels alone, we devise an effective multitask network that is capable of learning from rich auxiliary attributes such as gender, age, and head pose, beyond just facial expression data. While conventional supervised training requires datasets with complete labels (e.g., all samples must be labeled with gender, age, and expression), we show that this requirement can be relaxed via a novel attribute propagation method. The approach further allows us to leverage the inherent correspondences between heterogeneous attribute sources despite the disparate distributions of different datasets. With this network we demonstrate state-of-the-art results on existing facial expression recognition benchmarks. To predict interpersonal relations, we use the expression recognition network as the branches of a Siamese model. Extensive experiments show that our model is capable of mining the mutual context of faces for accurate fine-grained interpersonal relation prediction.

216 citations
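
Below is a small sketch of multitask learning with incomplete labels in the spirit described above: a shared trunk, separate heads for expression, gender, age, and pose, and losses masked out for samples that lack a given label. It is an illustration only, not the paper's network or its attribute propagation method; the backbone, label encoding (-1 or NaN for "missing"), and head sizes are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskFaceNet(nn.Module):
    """Shared trunk with one head per attribute (expression, gender, age, pose)."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.expr = nn.Linear(16, 7)      # 7 basic expressions
        self.gender = nn.Linear(16, 2)
        self.age = nn.Linear(16, 1)       # regression
        self.pose = nn.Linear(16, 3)      # yaw / pitch / roll

    def forward(self, x):
        f = self.trunk(x)
        return self.expr(f), self.gender(f), self.age(f), self.pose(f)

def masked_loss(expr_logits, gender_logits, age_pred, pose_pred, labels):
    """Each label tensor marks missing entries with -1 (or NaN for regression)."""
    loss = 0.0
    m = labels["expr"] >= 0
    if m.any():
        loss = loss + F.cross_entropy(expr_logits[m], labels["expr"][m])
    m = labels["gender"] >= 0
    if m.any():
        loss = loss + F.cross_entropy(gender_logits[m], labels["gender"][m])
    m = ~torch.isnan(labels["age"])
    if m.any():
        loss = loss + F.mse_loss(age_pred.squeeze(1)[m], labels["age"][m])
    return loss   # pose loss omitted to keep the sketch short

net = MultiTaskFaceNet()
imgs = torch.rand(4, 3, 64, 64)
labels = {"expr": torch.tensor([0, 3, -1, 5]),          # -1 = expression label missing
          "gender": torch.tensor([1, -1, 0, -1]),
          "age": torch.tensor([25.0, float("nan"), 40.0, float("nan")])}
loss = masked_loss(*net(imgs), labels)
```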


Journal ArticleDOI
TL;DR: The multiple feature fusion approach is robust in dealing with video-based facial expression recognition problems both in lab-controlled environments and in the wild, compared with other state-of-the-art methods.
Abstract: Video-based facial expression recognition has been a long-standing problem and has attracted growing attention recently. The key to a successful facial expression recognition system is to exploit the potential of audiovisual modalities and to design robust features that effectively characterize the facial appearance and configuration changes caused by facial motions. We propose an effective framework to address this issue in this paper. In our study, both visual modalities (face images) and audio modalities (speech) are utilized. A new feature descriptor called Histogram of Oriented Gradients from Three Orthogonal Planes (HOG-TOP) is proposed to extract dynamic textures from video sequences and characterize facial appearance changes. In addition, a new effective geometric feature derived from the warp transformation of facial landmarks is proposed to capture facial configuration changes. Moreover, the role of audio modalities in recognition is also explored in our study. We apply multiple feature fusion to tackle video-based facial expression recognition problems in lab-controlled environments and in the wild, respectively. Experiments conducted on the extended Cohn-Kanade (CK+) database and the Acted Facial Expressions in the Wild (AFEW) 4.0 database show that our approach is robust in dealing with video-based facial expression recognition problems both in lab-controlled environments and in the wild, compared with other state-of-the-art methods.

176 citations
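
The following numpy sketch only illustrates the "three orthogonal planes" part of a HOG-TOP-style descriptor: an orientation histogram is computed on the spatial (XY) plane and on the two spatio-temporal (XT, YT) planes of a video volume. A real HOG-TOP would use cells, blocks, and dense plane sampling; the single-slice choice and bin count here are simplifications of mine.

```python
import numpy as np

def orientation_histogram(plane, bins=8):
    """Simplified HOG-style descriptor: gradient-orientation histogram of one plane."""
    gy, gx = np.gradient(plane.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                 # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

def hog_top(video, bins=8):
    """Concatenate orientation histograms from the XY, XT and YT planes.

    `video` has shape (T, H, W); only one central slice per plane is used here.
    """
    t, h, w = video.shape
    xy = orientation_histogram(video[t // 2], bins)          # spatial appearance
    xt = orientation_histogram(video[:, h // 2, :], bins)    # horizontal motion over time
    yt = orientation_histogram(video[:, :, w // 2], bins)    # vertical motion over time
    return np.concatenate([xy, xt, yt])

clip = np.random.rand(16, 64, 64)
print(hog_top(clip).shape)   # (24,) = 3 planes x 8 orientation bins
```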


Journal ArticleDOI
TL;DR: In this paper, the authors classify facial landmark detection algorithms into three major categories: holistic methods, Constrained Local Model (CLM) methods, and regression-based methods, which differ in how they utilize facial appearance and shape information.
Abstract: The locations of the fiducial facial landmark points around facial components and the facial contour capture the rigid and non-rigid facial deformations due to head movements and facial expressions. They are hence important for various facial analysis tasks. Many facial landmark detection algorithms have been developed to automatically detect these key points over the years, and in this paper we perform an extensive review of them. We classify the facial landmark detection algorithms into three major categories: holistic methods, Constrained Local Model (CLM) methods, and regression-based methods. They differ in the ways they utilize facial appearance and shape information. The holistic methods explicitly build models to represent the global facial appearance and shape information. The CLMs explicitly leverage the global shape model but build local appearance models. The regression-based methods implicitly capture facial shape and appearance information. For algorithms within each category, we discuss their underlying theories as well as their differences. We also compare their performance on both controlled and in-the-wild benchmark datasets, under varying facial expressions, head poses, and occlusion. Based on the evaluations, we point out their respective strengths and weaknesses. There is also a separate section to review the latest deep learning-based algorithms. The survey also includes a listing of the benchmark databases and existing software. Finally, we identify future research directions, including combining methods in different categories to leverage their respective strengths to solve landmark detection "in the wild".

173 citations
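
As a compact illustration of the regression-based family the survey describes, the sketch below starts from a mean shape and applies a cascade of regressors that update the landmark coordinates from shape-indexed features. The feature extractor and the randomly initialised "learned" regressors are toy stand-ins, not any specific published detector.

```python
import numpy as np

def shape_indexed_features(image, shape):
    """Toy feature: pixel intensities sampled at the current landmark estimates."""
    h, w = image.shape
    pts = np.clip(shape.reshape(-1, 2), [0, 0], [w - 1, h - 1]).astype(int)
    return image[pts[:, 1], pts[:, 0]]

class CascadedRegressor:
    """Cascade of linear regressors predicting shape increments."""
    def __init__(self, mean_shape, regressors):
        self.mean_shape = mean_shape      # (2 * n_landmarks,)
        self.regressors = regressors      # list of (W, b) pairs, one per stage

    def predict(self, image):
        shape = self.mean_shape.copy()
        for W, b in self.regressors:
            feats = shape_indexed_features(image, shape)
            shape = shape + feats @ W + b          # learned update toward the true shape
        return shape.reshape(-1, 2)

# Toy usage with random "learned" parameters for 5 landmarks and 3 cascade stages.
n_pts, img = 5, np.random.rand(64, 64)
mean_shape = np.tile([32.0, 32.0], n_pts)
stages = [(np.random.randn(n_pts, 2 * n_pts) * 0.01, np.zeros(2 * n_pts)) for _ in range(3)]
print(CascadedRegressor(mean_shape, stages).predict(img))
```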


Proceedings ArticleDOI
15 May 2018
TL;DR: In this article, the authors present an open-source pipeline for face registration based on Gaussian processes, together with an application to face image analysis; the registration strategy considers symmetry, multi-scale, and spatially varying details.
Abstract: In this paper, we present a novel open-source pipeline for face registration based on Gaussian processes as well as an application to face image analysis. Non-rigid registration of faces is significant for many applications in computer vision, such as the construction of 3D Morphable face models (3DMMs). Gaussian Process Morphable Models (GPMMs) unify a variety of non-rigid deformation models, with B-splines and PCA models as examples. GPMMs separate problem-specific requirements from the registration algorithm by incorporating domain-specific adaptations as a prior model. The novelties of this paper are the following: (i) We present a strategy and modeling technique for face registration that considers symmetry, multi-scale, and spatially-varying details. The registration is applied to neutral faces and facial expressions. (ii) We release an open-source software framework for registration and model-building, demonstrated on the publicly available BU-3DFE database. The released pipeline also contains an implementation of Analysis-by-Synthesis model adaptation to 2D face images, tested on the Multi-PIE and LFW databases. This enables the community to reproduce, evaluate, and compare the individual steps from registration to model-building and 3D/2D model fitting. (iii) Along with the framework release, we publish a new version of the Basel Face Model (BFM-2017) with an improved age distribution and an additional facial expression model.

166 citations
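
The numpy sketch below is only a schematic of the modelling idea behind Gaussian Process Morphable Models: the deformation prior is defined by a kernel, and requirements such as multi-scale smoothness and mirror symmetry can be encoded by summing and symmetrising kernels. It is a scalar-valued toy (real GPMMs use matrix-valued kernels over deformation fields) and is not the released registration pipeline; all scales, variances, and weights are invented.

```python
import numpy as np

def gaussian_kernel(x, y, scale, variance):
    """Scalar squared-exponential kernel on 3D points."""
    return variance * np.exp(-np.sum((x - y) ** 2) / (scale ** 2))

def multi_scale_kernel(x, y, scales=(100.0, 50.0, 10.0), variances=(10.0, 5.0, 1.0)):
    """Sum of kernels: coarse scales model global shape, fine scales model detail."""
    return sum(gaussian_kernel(x, y, s, v) for s, v in zip(scales, variances))

def mirror(x):
    """Reflect a point across the x = 0 symmetry plane of the face."""
    return np.array([-x[0], x[1], x[2]])

def symmetric_kernel(x, y, base=multi_scale_kernel, weight=0.5):
    """Softly favour deformations that are mirror-symmetric about the midline."""
    return base(x, y) + weight * base(mirror(x), y)

p = np.array([30.0, 10.0, 5.0])    # a point on the right cheek (units arbitrary)
q = np.array([-30.0, 10.0, 5.0])   # its mirrored counterpart on the left cheek
print(symmetric_kernel(p, q), symmetric_kernel(p, p))   # mirrored pair is strongly correlated
```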


Journal ArticleDOI
TL;DR: The behavioral ecology view of facial displays (BECV) offers an externalist, functionalist view of facial displays that is not bound to Western conceptions about either expressions or emotions, and easily accommodates recent findings of diversity in facial displays, their public context-dependency, and the curious but common occurrence of solitary facial behavior.

Journal ArticleDOI
TL;DR: Overall, iMotions can achieve acceptable accuracy for standardized pictures of prototypical (vs. natural) facial expressions, but performs worse for more natural facial expressions.
Abstract: The goal of this study was to validate AFFDEX and FACET, two algorithms classifying emotions from facial expressions, in iMotions’s software suite. In Study 1, pictures of standardized emotional facial expressions from three databases, the Warsaw Set of Emotional Facial Expression Pictures (WSEFEP), the Amsterdam Dynamic Facial Expression Set (ADFES), and the Radboud Faces Database (RaFD), were classified with both modules. Accuracy (Matching Scores) was computed to assess and compare the classification quality. Results show a large variance in accuracy across emotions and databases, with a performance advantage for FACET over AFFDEX. In Study 2, 110 participants’ facial expressions were measured while being exposed to emotionally evocative pictures from the International Affective Picture System (IAPS), the Geneva Affective Picture Database (GAPED) and the Radboud Faces Database (RaFD). Accuracy again differed for distinct emotions, and FACET performed better. Overall, iMotions can achieve acceptable accuracy for standardized pictures of prototypical (vs. natural) facial expressions, but performs worse for more natural facial expressions. We discuss potential sources for limited validity and suggest research directions in the broader context of emotion research.

Journal ArticleDOI
TL;DR: The proposed FER method outperforms state-of-the-art FER methods based on hand-crafted features or single-channel deep networks, and achieves comparable performance to multi-channel deep networks with simpler procedures.
Abstract: Facial expression recognition (FER) is a significant task for machines to understand the emotional changes in human beings. However, accurate hand-crafted features that are highly related to changes in expression are difficult to extract because of the influences of individual differences and variations in emotional intensity. Therefore, features that can accurately describe the changes in facial expressions are urgently required. Method: A weighted mixture deep neural network (WMDNN) is proposed to automatically extract features that are effective for FER tasks. Several pre-processing approaches, such as face detection, rotation rectification, and data augmentation, are implemented to restrict the regions for FER. Two channels of facial images, namely facial grayscale images and their corresponding local binary pattern (LBP) facial images, are processed by WMDNN. Expression-related features of facial grayscale images are extracted by fine-tuning a partial VGG16 network, the parameters of which are initialized using a VGG16 model trained on the ImageNet database. Features of LBP facial images are extracted by a shallow convolutional neural network (CNN) built based on DeepID. The outputs of both channels are fused in a weighted manner. The final recognition result is calculated using softmax classification. Results: Experimental results indicate that the proposed algorithm can recognize six basic facial expressions (happiness, sadness, anger, disgust, fear, and surprise) with high accuracy. The average recognition accuracies for the benchmark data sets CK+, JAFFE, and Oulu-CASIA are 0.970, 0.922, and 0.923, respectively. Conclusions: The proposed FER method outperforms state-of-the-art FER methods based on hand-crafted features or deep networks using one channel. Compared with deep networks that use multiple channels, our proposed network can achieve comparable performance with easier procedures. Fine-tuning is effective for FER tasks with a well pre-trained model if sufficient samples cannot be collected.
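
A sketch of the two-channel weighted fusion described above follows. Tiny placeholder CNNs stand in for the fine-tuned partial VGG16 and the shallow LBP-channel network; the learnable scalar fusion weight, image sizes, and six-class output are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoChannelFER(nn.Module):
    """Grayscale channel + LBP channel, fused with a learnable weight."""
    def __init__(self, num_classes=6):
        super().__init__()
        # Placeholders: the paper uses a fine-tuned partial VGG16 for the grayscale
        # channel and a shallow CNN on LBP maps; tiny CNNs are used here instead.
        self.gray_branch = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                         nn.Linear(16, num_classes))
        self.lbp_branch = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(8, num_classes))
        self.alpha = nn.Parameter(torch.tensor(0.5))    # fusion weight

    def forward(self, gray_img, lbp_img):
        p1 = F.softmax(self.gray_branch(gray_img), dim=1)
        p2 = F.softmax(self.lbp_branch(lbp_img), dim=1)
        w = torch.sigmoid(self.alpha)                   # keep the weight in (0, 1)
        return w * p1 + (1 - w) * p2                    # weighted fusion of both channels

model = TwoChannelFER()
probs = model(torch.rand(2, 1, 48, 48), torch.rand(2, 1, 48, 48))
print(probs.argmax(dim=1))   # predicted expression per image
```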

Proceedings ArticleDOI
01 Jun 2018
TL;DR: In this paper, a manifold network structure was used for covariance pooling to improve facial expression recognition, achieving a recognition accuracy of 58.14% on the validation set of Static Facial Expressions in the Wild (SFEW 2.0) and 87.0% on the validation set of the Real-World Affective Faces (RAF) database.
Abstract: Classifying facial expressions into different categories requires capturing regional distortions of facial landmarks. We believe that second-order statistics such as covariance are better able to capture such distortions in regional facial features. In this work, we explore the benefits of using a manifold network structure for covariance pooling to improve facial expression recognition. In particular, we first employ this kind of manifold network in conjunction with traditional convolutional networks for spatial pooling within individual image feature maps in an end-to-end deep learning manner. By doing so, we are able to achieve a recognition accuracy of 58.14% on the validation set of Static Facial Expressions in the Wild (SFEW 2.0) and 87.0% on the validation set of the Real-World Affective Faces (RAF) database. Both of these are the best results we are aware of. Besides, we leverage covariance pooling to capture the temporal evolution of per-frame features for video-based facial expression recognition. Our reported results demonstrate the advantage of pooling image-set features temporally by stacking the designed manifold network for covariance pooling on top of convolutional network layers.
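
Below is a minimal sketch of covariance pooling over a convolutional feature map, the second-order statistic this work builds on. The SPD-manifold layers used in such networks are omitted here; only a regularised covariance plus a log-Euclidean (matrix logarithm) flattening is shown, which is one standard way to handle SPD descriptors, not the paper's full architecture.

```python
import torch

def covariance_pooling(feature_map, eps=1e-4):
    """Second-order pooling of a CNN feature map.

    feature_map: (B, C, H, W). Returns a (B, C, C) covariance matrix per image,
    regularised so that a matrix logarithm is well defined.
    """
    b, c, h, w = feature_map.shape
    x = feature_map.reshape(b, c, h * w)
    x = x - x.mean(dim=2, keepdim=True)
    cov = x @ x.transpose(1, 2) / (h * w - 1)
    return cov + eps * torch.eye(c, device=feature_map.device)

def log_euclidean(cov):
    """Matrix logarithm via eigendecomposition, flattening the SPD manifold."""
    eigvals, eigvecs = torch.linalg.eigh(cov)
    return eigvecs @ torch.diag_embed(torch.log(eigvals)) @ eigvecs.transpose(-1, -2)

feats = torch.rand(2, 32, 7, 7)          # pretend output of a convolutional backbone
desc = log_euclidean(covariance_pooling(feats))
print(desc.shape)                        # (2, 32, 32) second-order descriptor
```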

Journal ArticleDOI
TL;DR: It is demonstrated that emotion recognition based on facial expressions is feasible in distance education, permitting identification of a student's learning status in real time and helping teachers adapt their teaching strategies in virtual learning environments according to the student's emotions.

Journal ArticleDOI
TL;DR: An effective performance analysis of the proposed and conventional methods, such as convolutional neural network, NN-Levenberg-Marquardt, NN-Gradient Descent, NN-Evolutionary Algorithm, NN-Firefly, and NN-Particle Swarm Optimisation, is provided by evaluating a few performance measures, thereby validating the effectiveness of the suggested strategy over the conventional methods.
Abstract: The channels used to convey human emotions include actions, behaviours, poses, facial expressions, and speech. Immense research has been carried out to analyse the relationship between facial emotions and these channels. The goal of this study is to develop a system for Facial Emotion Recognition (FER) that can analyse the elemental facial expressions of humans, such as normal, smile, sad, surprise, anger, fear, and disgust. The recognition process of the proposed FER system is categorised into four stages, namely pre-processing, feature extraction, feature selection, and classification. After pre-processing, a scale-invariant feature transform (SIFT)-based feature extraction method is used to extract features from the facial points. Further, a meta-heuristic algorithm called Grey Wolf Optimisation (GWO) is used to select the optimal features. Subsequently, a GWO-based neural network (NN) is used to classify the emotions from the selected features. Moreover, an effective performance analysis of the proposed as well as the conventional methods, such as convolutional neural network, NN-Levenberg-Marquardt, NN-Gradient Descent, NN-Evolutionary Algorithm, NN-Firefly, and NN-Particle Swarm Optimisation, is provided by evaluating a few performance measures, thereby validating the effectiveness of the proposed strategy over the conventional methods.
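
To show the flavour of Grey Wolf Optimisation used as a wrapper feature selector, here is a compact sketch. The SIFT features and the GWO-trained neural network of the pipeline above are not reproduced; synthetic data and a k-NN classifier (via scikit-learn) stand in for them, and the sigmoid transfer, penalty, and hyperparameters are my own choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y):
    """Cross-validated accuracy on the selected features, lightly penalising their number."""
    if mask.sum() == 0:
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(3), X[:, mask], y, cv=3).mean()
    return acc - 0.01 * mask.mean()

def gwo_feature_selection(X, y, n_wolves=8, n_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    pos = rng.uniform(-1, 1, size=(n_wolves, dim))       # continuous wolf positions

    def to_mask(p):                                       # sigmoid transfer -> binary mask
        return (1 / (1 + np.exp(-p))) > 0.5

    for t in range(n_iter):
        scores = np.array([fitness(to_mask(p), X, y) for p in pos])
        order = np.argsort(scores)[::-1]
        alpha, beta, delta = pos[order[0]], pos[order[1]], pos[order[2]]
        a = 2 - 2 * t / n_iter                            # exploration factor decays to 0
        for i in range(n_wolves):
            new = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2 * a * r1 - a, 2 * r2
                new += leader - A * np.abs(C * leader - pos[i])
            pos[i] = new / 3.0                            # average pull toward the leaders
    return max((to_mask(p) for p in pos), key=lambda m: fitness(m, X, y))

X, y = make_classification(n_samples=120, n_features=20, n_informative=5, random_state=0)
print("selected features:", np.where(gwo_feature_selection(X, y))[0])
```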

Journal ArticleDOI
TL;DR: This paper proposes an artificial intelligence system that can predict the scales of the Beck Depression Inventory II (BDI-II) from vocal and visual expressions, and that outperforms all other existing methods on the same dataset.
Abstract: A human being's cognitive system can be simulated by artificially intelligent systems. Machines and robots equipped with cognitive capabilities can automatically recognize a human's mental state through their gestures and facial expressions. In this paper, an artificially intelligent system is proposed to monitor depression. It can predict the scales of the Beck Depression Inventory II (BDI-II) from vocal and visual expressions. First, different visual features are extracted from facial expression images. A deep learning method is utilized to extract key visual features from the facial expression frames. Second, spectral low-level descriptors and mel-frequency cepstral coefficient (MFCC) features are extracted from short audio segments to capture the vocal expressions. Third, a feature dynamic history histogram (FDHH) is proposed to capture the temporal movement in the feature space. Finally, these FDHH and audio features are fused using regression techniques for the prediction of the BDI-II scales. The proposed method has been tested on the public Audio/Visual Emotion Challenge (AVEC) 2014 dataset, as it is tuned to be more focused on the study of depression. The results outperform all other existing methods on the same dataset.

Proceedings ArticleDOI
Lingxiao Song, Zhihe Lu, Ran He, Zhenan Sun, Tieniu Tan
15 Oct 2018
TL;DR: In this article, a Geometry-Guided Generative Adversarial Network (G2-GAN) is proposed for continuously adjusting and identity-preserving facial expression synthesis, which employs facial geometry (fiducial points) as a controllable condition to guide facial texture synthesis with specific expression.
Abstract: Facial expression synthesis has drawn much attention in the fields of computer graphics and pattern recognition. It has been widely used in face animation and recognition. However, it remains challenging due to the high-level semantics and the large, non-linear variations of face geometry. This paper proposes a Geometry-Guided Generative Adversarial Network (G2-GAN) for continuously-adjusting and identity-preserving facial expression synthesis. We employ facial geometry (fiducial points) as a controllable condition to guide facial texture synthesis with a specific expression. A pair of generative adversarial subnetworks is jointly trained towards opposite tasks: expression removal and expression synthesis. The paired networks form a mapping cycle between the neutral expression and arbitrary expressions, with which the proposed approach can be trained on unpaired data. The paired networks also facilitate other applications such as face transfer, expression interpolation, and expression-invariant face recognition. Experimental results on several facial expression databases show that our method can generate compelling perceptual results on different expression editing tasks.
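
A schematic sketch of the paired-generator idea follows: one network synthesizes an expression on a neutral face guided by a fiducial-point heatmap, the other removes it, and a cycle-consistency term ties them together so unpaired data can be used. The tiny networks, single-channel heatmaps, and L1 cycle loss are toy placeholders, not the paper's G2-GAN (whose adversarial terms are omitted here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tiny_unet(in_ch):
    """Placeholder image-to-image network (stand-in for each GAN generator)."""
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())

class GeometryGuidedPair(nn.Module):
    """Two opposite generators: expression synthesis and expression removal."""
    def __init__(self):
        super().__init__()
        self.synthesize = tiny_unet(3 + 1)   # neutral face + target landmark heatmap
        self.remove = tiny_unet(3 + 1)       # expressive face + neutral landmark heatmap

    def cycle_loss(self, neutral, expr_heatmap, neutral_heatmap):
        expressive = self.synthesize(torch.cat([neutral, expr_heatmap], dim=1))
        recovered = self.remove(torch.cat([expressive, neutral_heatmap], dim=1))
        return F.l1_loss(recovered, neutral)      # neutral -> expression -> neutral

model = GeometryGuidedPair()
neutral = torch.rand(2, 3, 64, 64)
expr_hm, neutral_hm = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
loss = model.cycle_loss(neutral, expr_hm, neutral_hm)   # adversarial losses omitted
```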

Proceedings Article
27 Apr 2018
TL;DR: An Expression Generative Adversarial Network (ExprGAN) is proposed for photo-realistic facial expression editing with controllable expression intensity, enabling the expression intensity to be continuously adjusted from low to high.
Abstract: Facial expression editing is a challenging task, as it needs a high-level semantic understanding of the input face image. In conventional methods, either paired training data is required or the synthetic face's resolution is low. Moreover, only the categories of facial expression can be changed. To address these limitations, we propose an Expression Generative Adversarial Network (ExprGAN) for photo-realistic facial expression editing with controllable expression intensity. An expression controller module is specially designed to learn an expressive and compact expression code in addition to the encoder-decoder network. This novel architecture enables the expression intensity to be continuously adjusted from low to high. We further show that ExprGAN can be applied to other tasks, such as expression transfer, image retrieval, and data augmentation for training improved facial expression recognition models. To tackle the small size of the training database, an effective incremental learning scheme is proposed. Quantitative and qualitative evaluations on the widely used Oulu-CASIA dataset demonstrate the effectiveness of ExprGAN.

Proceedings ArticleDOI
01 Aug 2018
TL;DR: An end-to-end trainable Patch-Gated Convolutional Neural Network (PG-CNN) that can automatically perceive the occluded regions of the face and focus on the most discriminative un-occluded regions, improving recognition accuracy on both the original faces and faces with synthesized occlusions.
Abstract: Facial expression recognition in the wild is challenging due to various unconstrained conditions. Although existing facial expression classifiers have been almost perfect at analyzing constrained frontal faces, they fail to perform well on partially occluded faces that are common in the wild. In this paper, we propose an end-to-end trainable Patch-Gated Convolutional Neural Network (PG-CNN) that can automatically perceive the occluded regions of the face and focus on the most discriminative un-occluded regions. To determine the possible regions of interest on the face, PG-CNN decomposes an intermediate feature map into several patches according to the positions of related facial landmarks. Then, via a proposed Patch-Gated Unit, PG-CNN reweights each patch by an unobstructedness or importance score that is computed from the patch itself. The proposed PG-CNN is evaluated on the two largest in-the-wild facial expression datasets (RAF-DB and AffectNet) and their modifications with synthesized facial occlusions. Experimental results show that PG-CNN improves the recognition accuracy on both the original faces and faces with synthesized occlusions. Visualization results demonstrate that, compared with a CNN without the Patch-Gated Unit, PG-CNN is capable of shifting attention from an occluded patch to other related but unobstructed ones. Experiments also show that PG-CNN outperforms other state-of-the-art methods on several widely used in-the-lab facial expression datasets under the cross-dataset evaluation protocol.
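
A toy sketch of the patch-gating idea is given below: patches of an intermediate feature map are cropped around landmark positions, a small gate scores each patch, and the gated patch descriptors are pooled before classification. The patch size, gating network, and pooling are invented for illustration and differ from the paper's PG-CNN.

```python
import torch
import torch.nn as nn

class PatchGatedHead(nn.Module):
    """Gate and pool landmark-centred patches of a feature map."""
    def __init__(self, channels=64, patch=6, num_classes=7):
        super().__init__()
        self.patch = patch
        self.gate = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())  # per-patch weight
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, feat_map, landmarks):
        """feat_map: (B, C, H, W); landmarks: (B, N, 2) in feature-map coordinates."""
        b, c, h, w = feat_map.shape
        half = self.patch // 2
        pooled = []
        for n in range(landmarks.size(1)):
            descs = []
            for i in range(b):
                x, y = landmarks[i, n].long().tolist()
                x = min(max(x, half), w - half)
                y = min(max(y, half), h - half)
                patch = feat_map[i, :, y - half:y + half, x - half:x + half]
                descs.append(patch.mean(dim=(1, 2)))       # (C,) patch descriptor
            pooled.append(torch.stack(descs))               # (B, C)
        pooled = torch.stack(pooled, dim=1)                  # (B, N, C)
        gates = self.gate(pooled)                            # (B, N, 1) "unobstructedness"
        face_desc = (gates * pooled).sum(dim=1) / (gates.sum(dim=1) + 1e-6)
        return self.classifier(face_desc)

head = PatchGatedHead()
logits = head(torch.rand(2, 64, 28, 28), torch.randint(0, 28, (2, 5, 2)).float())
```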

Journal ArticleDOI
TL;DR: New studies conducted since 2008 have examined a wider sample of small-scale societies, including on the African and South American continents, providing an important opportunity for reevaluating the universality thesis.
Abstract: It has long been claimed that certain facial movements are universally perceived as emotional expressions. The critical tests of this universality thesis were conducted between 1969 and 1975 in small-scale societies in the Pacific using confirmation-based research methods. New studies conducted since 2008 have examined a wider sample of small-scale societies, including on the African and South American continents. They used more discovery-based research methods, providing an important opportunity for reevaluating the universality thesis. These new studies reveal diversity, rather than uniformity, in how perceivers make sense of facial movements, calling the universality thesis into doubt. Instead, they support a perceiver-constructed account of emotion perception that is consistent with the broader literature on perception.

Journal ArticleDOI
Geng Jiahao, Tianjia Shao, Youyi Zheng, Yanlin Weng, Kun Zhou
TL;DR: This paper introduces a novel method for real-time portrait animation from a single photo, which factorizes out the nonlinear geometric transformations exhibited in facial expressions via lightweight 2D warps and leaves the appearance detail synthesis to conditional generative neural networks for high-fidelity facial animation generation.
Abstract: This paper introduces a novel method for real-time portrait animation from a single photo. Our method requires only a single portrait photo and a set of facial landmarks derived from a driving source (e.g., a photo or a video sequence), and generates an animated image with rich facial details. The core of our method is a warp-guided generative model that instantly fuses various fine facial details (e.g., creases and wrinkles), which are necessary to generate a high-fidelity facial expression, onto a pre-warped image. Our method factorizes out the nonlinear geometric transformations exhibited in facial expressions via lightweight 2D warps and leaves the appearance detail synthesis to conditional generative neural networks for high-fidelity facial animation generation. We show that such a factorization of geometric transformation and appearance synthesis largely helps the network better learn the high nonlinearity of the facial expression functions and also facilitates the design of the network architecture. Through extensive experiments on various portrait photos from the Internet, we show the significant efficacy of our method compared with prior art.

Proceedings ArticleDOI
15 May 2018
TL;DR: It is shown that ExpNet produces expression coefficients which better discriminate between facial emotions than those obtained using state-of-the-art facial landmark detectors, and that it is more robust to scale changes than landmark detectors.
Abstract: We describe a deep learning-based method for estimating 3D facial expression coefficients. Unlike previous work, our process does not rely on facial landmark detection methods as a proxy step. Recent methods have shown that a CNN can be trained to regress accurate and discriminative 3D morphable model (3DMM) representations directly from image intensities. By foregoing landmark detection, these methods were able to estimate shapes for occluded faces appearing in unprecedented viewing conditions. We build on those methods by showing that facial expressions can also be estimated by a robust, deep, landmark-free approach. Our ExpNet CNN is applied directly to the intensities of a face image and regresses a 29D vector of 3D expression coefficients. We propose a unique method for collecting data to train our network, leveraging the robustness of deep networks to training label noise. We further offer a novel means of evaluating the accuracy of estimated expression coefficients: by measuring how well they capture facial emotions on the CK+ and EmotiW-17 emotion recognition benchmarks. We show that our ExpNet produces expression coefficients which better discriminate between facial emotions than those obtained using state-of-the-art facial landmark detectors. Moreover, this advantage grows as image scale drops, demonstrating that our ExpNet is more robust to scale changes than landmark detectors. Finally, our ExpNet is orders of magnitude faster than its alternatives.
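
The minimal sketch below shows the landmark-free formulation in the abstract: a CNN regresses a fixed-length 29-D vector of 3DMM expression coefficients directly from image intensities. The small backbone here is a placeholder, not the authors' network.

```python
import torch
import torch.nn as nn

class ExpressionCoeffRegressor(nn.Module):
    """Regress a 29-D vector of 3DMM expression coefficients from raw pixels."""
    def __init__(self, num_coeffs=29):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_coeffs)

    def forward(self, img):
        return self.head(self.backbone(img))       # no landmark detection step anywhere

model = ExpressionCoeffRegressor()
coeffs = model(torch.rand(1, 3, 224, 224))
print(coeffs.shape)    # torch.Size([1, 29])
# Training would minimise e.g. an L2 loss against collected expression coefficients,
# accepting some label noise, which the paper argues deep networks tolerate well.
```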

Book ChapterDOI
01 Jan 2018
TL;DR: This work examines the performance of two well-known deep learning approaches (GoogLeNet and AlexNet) on facial expression recognition, more specifically on recognizing the existence of emotional content and on recognizing the exact emotional content of facial expressions.
Abstract: Emotions constitute an innate and important aspect of human behavior that colors the way humans communicate. The accurate analysis and interpretation of the emotional content of human facial expressions is essential for a deeper understanding of human behavior. Although a human can detect and interpret faces and facial expressions naturally, with little or no effort, accurate and robust facial expression recognition by computer systems is still a great challenge. The analysis of human face characteristics and the recognition of its emotional states are considered to be very challenging and difficult tasks. The main difficulties come from the non-uniform nature of the human face and from variations in conditions such as lighting, shadows, facial pose, and orientation. Deep learning approaches have been examined as a stream of methods to achieve robustness and provide the necessary scalability for new types of data. In this work, we examine the performance of two well-known deep learning approaches (GoogLeNet and AlexNet) on facial expression recognition, more specifically on recognizing the existence of emotional content and on recognizing the exact emotional content of facial expressions. The results collected from the study are quite interesting.

BookDOI
19 Feb 2018
TL;DR: In this article, the authors discuss the universality of human nonverbal communication and its role in the development of emotions in a social and cultural context, as well as its role as a mediator between biology and human culture.
Abstract: Contents: Preface. Introduction: Nonverbal Communication: Crossing the Boundary Between Culture and Nature. Part I: New Findings on the Universality of Human Nonverbal Communication. P. Ekman, D. Keltner, Universal Facial Expressions of Emotion: An Old Controversy and New Findings. U. Dimberg, Psychophysiological Reactions to Facial Expressions. W. Schiefenhoevel, Universals in Interpersonal Interactions. Part II: Development of Emotions in a Social and Cultural Context. H. Papousek, M. Papousek, Preverbal Communication in Humans and the Genesis of Culture. K. Schneider, Development of Emotions and Their Expression in Task-Oriented Situations in Infants and Preschool Children. S.S. Suomi, Nonverbal Communication in Nonhuman Primates: Implications for the Emergence of Culture. Part III: The Social Role of Nonverbal Communication and Emotions: Phylogenetic Inference. P. Marler, C.S. Evans, Communication Signals of Animals: Contributions of Emotion and Reference. S. Preuschoft, J.A.R.A.M. van Hooff, The Social Function of "Smile" and "Laughter": Variations Across Primate Species and Societies. A. Maryanski, Primate Communication and the Ecology of a Language Niche. J.H. Turner, The Evolution of Emotions: The Nonverbal Basis of Human Social Organization. Part IV: Nonverbal Communication as a Mediator Between Biology and Human Culture. W. Goldschmidt, Nonverbal Communication and Culture. M. Heller, Posture as an Interface Between Nature and Culture. A. Nitschke, Sign Languages and Gestures in Medieval Europe: Monasteries, Courts of Justice, and Society. R. Frank, Nonverbal Communication and the Emergence of Moral Sentiments.

Journal ArticleDOI
01 Feb 2018-Sensors
TL;DR: A brief study of the various approaches and techniques of emotion recognition is presented, including a succinct review of the databases that are used as data sets for algorithms that detect emotions from facial expressions.
Abstract: Extensive application possibilities have made emotion recognition ineluctable and challenging in the field of computer science. The use of non-verbal cues such as gestures, body movement, and facial expressions conveys feeling and feedback to the user. This discipline of Human–Computer Interaction relies on algorithmic robustness and the sensitivity of the sensor to ameliorate recognition. Sensors play a significant role in accurate detection by providing very high-quality input, hence increasing the efficiency and the reliability of the system. Automatic recognition of human emotions would help in teaching social intelligence to machines. This paper presents a brief study of the various approaches and techniques of emotion recognition. The survey covers a succinct review of the databases that are considered as data sets for algorithms detecting emotions from facial expressions. Later, the mixed reality device Microsoft HoloLens (MHL) is introduced for observing emotion recognition in Augmented Reality (AR). A brief introduction to its sensors, their application in emotion recognition, and some preliminary results of emotion recognition using MHL are presented. The paper then concludes by comparing the results of emotion recognition by the MHL and a regular webcam.

Proceedings ArticleDOI
02 Oct 2018
TL;DR: Improvements in emotion recognition techniques that combine acoustic features and facial features in both non-temporal and temporal modes prove the effectiveness of the methods.
Abstract: In this paper, we present our latest progress in emotion recognition techniques, which combine acoustic features and facial features in both non-temporal and temporal modes. This paper presents the details of the techniques we used in the Audio-Video Emotion Recognition subtask of the 2018 Emotion Recognition in the Wild (EmotiW) Challenge. After multimodal result fusion, our final accuracy on the Acted Facial Expressions in the Wild (AFEW) test dataset reaches 61.87%, which is 1.53 percentage points higher than the best result from last year. Such improvements prove the effectiveness of our methods.
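
As a small sketch of the score-level (late) fusion this kind of multimodal system relies on, the snippet below combines per-modality class probabilities with fixed weights. The probability values, the seven-class label set, and the 0.4/0.6 weights are placeholders, not the team's actual models or tuned fusion weights.

```python
import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

def late_fusion(prob_list, weights):
    """Weighted sum of per-modality class probabilities, renormalised."""
    fused = sum(w * p for w, p in zip(weights, prob_list))
    return fused / fused.sum(axis=-1, keepdims=True)

# Placeholder outputs of an acoustic model and a facial (temporal) model for one clip.
audio_probs = np.array([0.10, 0.05, 0.05, 0.30, 0.25, 0.15, 0.10])
video_probs = np.array([0.05, 0.05, 0.05, 0.55, 0.15, 0.10, 0.05])

fused = late_fusion([audio_probs, video_probs], weights=[0.4, 0.6])
print(EMOTIONS[int(np.argmax(fused))])   # "happy"
```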

Book ChapterDOI
04 Oct 2018
TL;DR: A novel Multi-Region Ensemble CNN (MRE-CNN) framework for facial expression recognition, which aims to enhance the learning power of CNN models by capturing both the global and the local features from multiple human face sub-regions.
Abstract: Facial expressions play an important role in conveying the emotional states of human beings. Recently, deep learning approaches have been applied to the image recognition field due to the discriminative power of Convolutional Neural Networks (CNNs). In this paper, we first propose a novel Multi-Region Ensemble CNN (MRE-CNN) framework for facial expression recognition, which aims to enhance the learning power of CNN models by capturing both global and local features from multiple human face sub-regions. Second, the weighted prediction scores from each sub-network are aggregated to produce a final prediction of high accuracy. Third, we investigate the effects of different sub-regions of the whole face on facial expression recognition. Our proposed method is evaluated on two well-known, publicly available facial expression databases, AFEW 7.0 and RAF-DB, and is shown to achieve state-of-the-art recognition accuracy.
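
A toy sketch of the multi-region ensemble idea follows: each face sub-region (plus the whole face) is fed to its own sub-network, and their softmax scores are aggregated with weights. The region names, crop shapes, fixed weights, and tiny sub-networks are invented for illustration and are not the MRE-CNN configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def small_cnn(num_classes=7):
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_classes))

class MultiRegionEnsemble(nn.Module):
    """One sub-network per region; weighted sum of their softmax scores."""
    def __init__(self, region_names=("whole_face", "eyes", "mouth"), weights=(0.5, 0.25, 0.25)):
        super().__init__()
        self.region_names = region_names
        self.subnets = nn.ModuleDict({name: small_cnn() for name in region_names})
        self.weights = weights

    def forward(self, regions):
        """regions: dict mapping region name -> (B, 3, H, W) crop of that region."""
        scores = [w * F.softmax(self.subnets[name](regions[name]), dim=1)
                  for name, w in zip(self.region_names, self.weights)]
        return torch.stack(scores).sum(dim=0)

model = MultiRegionEnsemble()
batch = {"whole_face": torch.rand(2, 3, 96, 96),
         "eyes": torch.rand(2, 3, 32, 96),        # crops would come from facial landmarks
         "mouth": torch.rand(2, 3, 32, 96)}
print(model(batch).argmax(dim=1))
```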

Journal ArticleDOI
TL;DR: A computational framework for automatically quantifying verbal and nonverbal behaviors in the context of job interviews is presented; based on the learned feature weights, it recommends speaking more fluently, using fewer filler words, speaking as "we" (versus "I"), using more unique words, and smiling more.
Abstract: We present a computational framework for automatically quantifying verbal and nonverbal behaviors in the context of job interviews. The proposed framework is trained by analyzing the videos of 138 interview sessions with 69 internship-seeking undergraduates at the Massachusetts Institute of Technology (MIT). Our automated analysis includes facial expressions (e.g., smiles, head gestures, facial tracking points), language (e.g., word counts, topic modeling), and prosodic information (e.g., pitch, intonation, and pauses) of the interviewees. The ground truth labels are derived by taking a weighted average over the ratings of nine independent judges. Our framework can automatically predict the ratings for interview traits such as excitement, friendliness, and engagement with correlation coefficients of 0.70 or higher, and can quantify the relative importance of prosody, language, and facial expressions. By analyzing the relative feature weights learned by the regression models, our framework recommends to speak more fluently, use fewer filler words, speak as “we” (versus “I”), use more unique words, and smile more. We also find that the students who were rated highly while answering the first interview question were also rated highly overall (i.e., first impression matters). Finally, our MIT Interview dataset is available to other researchers to further validate and expand our findings.

Proceedings ArticleDOI
01 Jan 2018
TL;DR: A Privacy-Preserving Representation-Learning Variational Generative Adversarial Network (PPRL-VGAN) learns an image representation that is explicitly disentangled from identity information, while remaining discriminative for facial expression recognition and generative enough to allow expression-equivalent face image synthesis.
Abstract: Reliable facial expression recognition plays a critical role in human-machine interactions. However, most of the facial expression analysis methodologies proposed to date pay little or no attention to the protection of a user's privacy. In this paper, we propose a Privacy-Preserving Representation-Learning Variational Generative Adversarial Network (PPRL-VGAN) to learn an image representation that is explicitly disentangled from the identity information. At the same time, this representation is discriminative from the standpoint of facial expression recognition and generative as it allows expression-equivalent face image synthesis. We evaluate the proposed model on two public datasets under various threat scenarios. Quantitative and qualitative results demonstrate that our approach strikes a balance between the preservation of privacy and data utility. We further demonstrate that our model can be effectively applied to other tasks such as expression morphing and image completion.
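
To give a flavour of the disentanglement objective described above, the sketch below trains an encoder so that an expression classifier succeeds on its representation while an identity classifier is pushed toward chance, here realised with a gradient-reversal trick. This is one common way to implement such an objective, not the paper's variational GAN architecture; the network sizes, class counts, and loss weighting are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class PrivacyPreservingEncoder(nn.Module):
    def __init__(self, num_expressions=7, num_identities=50):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 32))
        self.expr_head = nn.Linear(32, num_expressions)   # should succeed on the code
        self.id_head = nn.Linear(32, num_identities)      # should be driven toward chance

    def losses(self, img, expr_label, id_label):
        z = self.encoder(img)
        expr_loss = F.cross_entropy(self.expr_head(z), expr_label)
        id_loss = F.cross_entropy(self.id_head(GradReverse.apply(z)), id_label)
        return expr_loss + id_loss   # reversed gradient makes z identity-uninformative

model = PrivacyPreservingEncoder()
loss = model.losses(torch.rand(4, 1, 48, 48),
                    torch.randint(0, 7, (4,)), torch.randint(0, 50, (4,)))
loss.backward()
```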