
Showing papers in "IEEE Transactions on Affective Computing in 2020"


Journal ArticleDOI
TL;DR: A comprehensive review on deep facial expression recognition can be found in this article, including datasets and algorithms that provide insights into the problems of overfitting caused by a lack of sufficient training data and expression-unrelated variations.
Abstract: With the transition of facial expression recognition (FER) from laboratory-controlled to in-the-wild conditions and the recent success of deep learning in various fields, deep neural networks have increasingly been leveraged to learn discriminative representations for automatic FER. Recent deep FER systems generally focus on two important issues: overfitting caused by a lack of sufficient training data and expression-unrelated variations, such as illumination, head pose and identity bias. In this survey, we provide a comprehensive review of deep FER, including datasets and algorithms that provide insights into these problems. First, we introduce available datasets that are widely used and provide data selection and evaluation principles. We then describe the standard pipeline of a deep FER system with related background knowledge and suggestions for applicable implementations. For the state of the art in deep FER, we introduce existing deep networks and training strategies that are designed for FER, and discuss their advantages and limitations. Competitive performances and experimental comparisons on widely used benchmarks are also summarized. We then extend our survey to additional related issues and application scenarios. Finally, we review the remaining challenges and opportunities in this field as well as future directions for the design of robust deep FER systems.

663 citations


Journal ArticleDOI
TL;DR: The proposed DGCNN method can dynamically learn the intrinsic relationship between different electroencephalogram (EEG) channels via training a neural network, which benefits more discriminative EEG feature extraction.
Abstract: In this paper, a multichannel EEG emotion recognition method based on a novel dynamical graph convolutional neural network (DGCNN) is proposed. The basic idea of the proposed EEG emotion recognition method is to use a graph to model the multichannel EEG features and then perform EEG emotion classification based on this model. Different from traditional graph convolutional neural network (GCNN) methods, the proposed DGCNN method can dynamically learn the intrinsic relationship between different electroencephalogram (EEG) channels, represented by an adjacency matrix, via training a neural network, which benefits more discriminative EEG feature extraction. The learned adjacency matrix is then used to learn more discriminative features for improving EEG emotion recognition. We conduct extensive experiments on the SJTU emotion EEG dataset (SEED) and the DREAMER dataset. The experimental results demonstrate that the proposed method achieves better recognition performance than state-of-the-art methods: an average recognition accuracy of 90.4 percent is achieved in the subject-dependent experiment and 79.95 percent in the subject-independent cross-validation on the SEED database, and average accuracies of 86.23, 84.54 and 85.02 percent are obtained for valence, arousal and dominance classification, respectively, on the DREAMER database.
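A minimal sketch of the core idea, assuming a simplified single-layer variant rather than the authors' implementation: the adjacency matrix over EEG channels is itself a learnable parameter updated by backpropagation, and features are propagated over the resulting graph.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGraphConv(nn.Module):
    def __init__(self, num_channels: int, in_feats: int, out_feats: int):
        super().__init__()
        # Learnable adjacency over EEG channels, updated during training.
        self.adj = nn.Parameter(torch.rand(num_channels, num_channels))
        self.lin = nn.Linear(in_feats, out_feats)

    def forward(self, x):                          # x: (batch, channels, in_feats)
        a = F.relu(self.adj)                       # keep edge weights non-negative
        a = a + a.t()                              # symmetrise
        d = torch.diag(a.sum(dim=1).clamp(min=1e-6).pow(-0.5))
        a_norm = d @ a @ d                         # normalised adjacency
        return F.relu(self.lin(a_norm @ x))        # propagate features over the graph

class DGCNNClassifier(nn.Module):
    def __init__(self, num_channels=62, in_feats=5, hidden=32, num_classes=3):
        super().__init__()
        self.gconv = DynamicGraphConv(num_channels, in_feats, hidden)
        self.fc = nn.Linear(num_channels * hidden, num_classes)

    def forward(self, x):
        return self.fc(self.gconv(x).flatten(start_dim=1))

# Example shapes are assumptions: 62 SEED channels, 5 features per channel.
logits = DGCNNClassifier()(torch.randn(8, 62, 5))
```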

600 citations


Journal ArticleDOI
TL;DR: Empirical results demonstrate that using MTL to account for individual differences provides large performance improvements over traditional machine learning methods and provides personalized, actionable insights.
Abstract: While accurately predicting mood and wellbeing could have a number of important clinical benefits, traditional machine learning (ML) methods frequently yield low performance in this domain. We posit that this is because a one-size-fits-all machine learning model is inherently ill-suited to predicting outcomes like mood and stress, which vary greatly due to individual differences. Therefore, we employ Multitask Learning (MTL) techniques to train personalized ML models which are customized to the needs of each individual, but still leverage data from across the population. Three formulations of MTL are compared: i) MTL deep neural networks, which share several hidden layers but have final layers unique to each task; ii) Multi-task Multi-Kernel learning, which feeds information across tasks through kernel weights on feature types; and iii) a Hierarchical Bayesian model in which tasks share a common Dirichlet Process prior. We offer the code for this work in open source. These techniques are investigated in the context of predicting future mood, stress, and health using data collected from surveys, wearable sensors, smartphone logs, and the weather. Empirical results demonstrate that using MTL to account for individual differences provides large performance improvements over traditional machine learning methods and provides personalized, actionable insights.
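A minimal sketch of the first MTL formulation described above (shared hidden layers with a task-specific output head); the layer sizes and task count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiTaskMoodNet(nn.Module):
    def __init__(self, in_dim: int, num_tasks: int, hidden: int = 64):
        super().__init__()
        self.shared = nn.Sequential(               # layers shared across the population
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One small head per task (e.g., per person), so predictions are personalized.
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(num_tasks)])

    def forward(self, x, task_id: int):
        return self.heads[task_id](self.shared(x))

model = MultiTaskMoodNet(in_dim=100, num_tasks=30)
pred = model(torch.randn(4, 100), task_id=5)       # predicted mood score for task/person 5
```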

187 citations


Journal ArticleDOI
TL;DR: In this paper, an attention-based convolutional recurrent neural network (ACRNN) was proposed to extract more discriminative features from EEG signals and improve the accuracy of emotion recognition.
Abstract: Emotion recognition based on electroencephalography (EEG) is a significant task in the brain-computer interface field. Recently, many deep learning-based emotion recognition methods have been demonstrated to outperform traditional methods. However, it remains challenging to extract discriminative features for EEG emotion recognition, and most methods ignore useful information across channels and time. This paper proposes an attention-based convolutional recurrent neural network (ACRNN) to extract more discriminative features from EEG signals and improve the accuracy of emotion recognition. First, the proposed ACRNN adopts a channel-wise attention mechanism to adaptively assign weights to different channels, and a CNN is employed to extract the spatial information of the encoded EEG signals. Then, to explore the temporal information of EEG signals, extended self-attention is integrated into an RNN to re-encode the importance based on intrinsic similarity in EEG signals. We conducted extensive experiments on the DEAP and DREAMER databases. The experimental results demonstrate that the proposed ACRNN outperforms state-of-the-art methods.
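A minimal sketch, assuming a simplified formulation, of the channel-wise attention stage: each EEG channel receives an adaptive weight before a CNN extracts spatial information.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, num_channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(num_channels, num_channels // reduction), nn.Tanh(),
            nn.Linear(num_channels // reduction, num_channels), nn.Softmax(dim=-1),
        )

    def forward(self, x):                  # x: (batch, channels, time)
        summary = x.mean(dim=-1)           # per-channel summary over time
        weights = self.fc(summary)         # adaptive weight per channel
        return x * weights.unsqueeze(-1)   # re-weight channels before the CNN

x = torch.randn(8, 32, 384)                # assumed: 32 DEAP channels, 3 s at 128 Hz
encoded = ChannelAttention(32)(x)
spatial = nn.Conv1d(32, 64, kernel_size=5)(encoded)   # CNN extracts spatial information
```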

166 citations


Journal ArticleDOI
TL;DR: This paper essentially maps out the state-of-the-art in cyberbullying detection research and serves as a resource for researchers to determine where to best direct their future research efforts in this field.
Abstract: Research into cyberbullying detection has increased in recent years, due in part to the proliferation of cyberbullying across social media and its detrimental effect on young people. A growing body of work is emerging on automated approaches to cyberbullying detection. These approaches utilise machine learning and natural language processing techniques to identify the characteristics of a cyberbullying exchange and automatically detect cyberbullying by matching textual data to the identified traits. In this paper, we present a systematic review of published research (as identified via the Scopus, ACM and IEEE Xplore bibliographic databases) on cyberbullying detection approaches. On the basis of our extensive literature review, we categorise existing approaches into four main classes, namely supervised learning, lexicon-based, rule-based, and mixed-initiative approaches. Supervised learning-based approaches typically use classifiers such as SVM and Naive Bayes to develop predictive models for cyberbullying detection. Lexicon-based systems utilise word lists and use the presence of words within the lists to detect cyberbullying. Rule-based approaches match text to predefined rules to identify bullying, and mixed-initiative approaches combine human-based reasoning with one or more of the aforementioned approaches. We found that a lack of labelled datasets and the non-holistic consideration of cyberbullying by researchers when developing detection systems are two key challenges facing cyberbullying detection research. This paper essentially maps out the state of the art in cyberbullying detection research and serves as a resource for researchers to determine where to best direct their future research efforts in this field.
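As a concrete illustration of the supervised-learning class of approaches described above, the following sketch (with hypothetical toy data) feeds bag-of-words features to an SVM classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["you are great", "nobody likes you, loser"]      # hypothetical labelled exchanges
labels = [0, 1]                                           # 1 = cyberbullying

# Bag-of-words (TF-IDF) features followed by a linear SVM classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["everyone hates you"]))
```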

142 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a regularized graph neural network (RGNN) for EEG-based emotion recognition, which considers the biological topology among different brain regions to capture both local and global relations among different EEG channels.
Abstract: Electroencephalography (EEG) measures the neuronal activities in different brain regions via electrodes. Many existing studies on EEG-based emotion recognition do not fully exploit the topology of EEG channels. In this paper, we propose a regularized graph neural network (RGNN) for EEG-based emotion recognition. RGNN considers the biological topology among different brain regions to capture both local and global relations among different EEG channels. Specifically, we model the inter-channel relations in EEG signals via an adjacency matrix in a graph neural network, where the connection and sparseness of the adjacency matrix are inspired by neuroscience theories of human brain organization. In addition, we propose two regularizers, namely node-wise domain adversarial training (NodeDAT) and emotion-aware distribution learning (EmotionDL), to better handle cross-subject EEG variations and noisy labels, respectively. Extensive experiments on two public datasets, SEED and SEED-IV, demonstrate the superior performance of our model over state-of-the-art models in most experimental settings. Moreover, ablation studies show that the proposed adjacency matrix and the two regularizers contribute consistent and significant gains to the performance of our RGNN model. Finally, investigations of the neuronal activities reveal important brain regions and inter-channel relations for EEG-based emotion recognition.
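A minimal sketch of one plausible reading of the EmotionDL regularizer (an assumption, not the authors' exact formulation): hard emotion labels are softened into a distribution over classes and the model is trained with a KL-divergence loss, which mitigates the effect of noisy labels.

```python
import torch
import torch.nn.functional as F

def emotion_distribution_loss(logits, labels, num_classes=4, noise=0.1):
    # Assumed label-softening scheme: spread `noise` probability mass from the
    # labelled class uniformly over the remaining classes.
    target = torch.full((labels.size(0), num_classes), noise / (num_classes - 1))
    target.scatter_(1, labels.unsqueeze(1), 1.0 - noise)
    # KL divergence between predicted and softened target distributions.
    return F.kl_div(F.log_softmax(logits, dim=1), target, reduction="batchmean")

loss = emotion_distribution_loss(torch.randn(8, 4), torch.randint(0, 4, (8,)))
```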

140 citations


Journal ArticleDOI
TL;DR: It is shown that recurrent neural networks, especially character-based ones, can improve over bag-of-words and latent semantic indexing models and that the newly proposed training heuristic produces a unison model with performance comparable to that of the three single models.
Abstract: Despite recent successes of deep learning in many fields of natural language processing, previous studies of emotion recognition on Twitter mainly focused on the use of lexicons and simple classifiers on bag-of-words models. The central question of our study is whether we can improve their performance using deep learning. To this end, we exploit hashtags to create three large emotion-labeled data sets corresponding to different classifications of emotions. We then compare the performance of several word- and character-based recurrent and convolutional neural networks with that of bag-of-words and latent semantic indexing models. We also investigate the transferability of the final hidden state representations between different classifications of emotions, and whether it is possible to build a unison model for predicting all of them using a shared representation. We show that recurrent neural networks, especially character-based ones, can improve over bag-of-words and latent semantic indexing models. Although the transfer capabilities of these models are poor, the newly proposed training heuristic produces a unison model with performance comparable to that of the three single models.
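A minimal sketch of a character-based recurrent classifier of the kind found competitive here; the vocabulary (raw byte values) and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CharGRUClassifier(nn.Module):
    def __init__(self, vocab_size=128, emb=32, hidden=128, num_emotions=6):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.gru = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_emotions)

    def forward(self, char_ids):              # char_ids: (batch, seq_len) of character codes
        _, h = self.gru(self.emb(char_ids))   # final hidden state summarises the tweet
        return self.out(h.squeeze(0))

tweet = torch.tensor([[ord(c) for c in "feeling great today"]])
logits = CharGRUClassifier()(tweet)
```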

139 citations


Journal ArticleDOI
TL;DR: A deep regression network termed DepressNet is presented to learn a depression representation with visual explanation, with results showing that the DAM induced by the learned deep model may help reveal the visual depression pattern on faces and provide insights into automated depression diagnosis.
Abstract: Recent evidence in mental health assessment has demonstrated that facial appearance could be highly indicative of depressive disorder. While previous methods based on facial analysis promise to advance clinical diagnosis of depressive disorder in a more efficient and objective manner, challenges in the visual representation of complex depression patterns prevent widespread practice of automated depression diagnosis. In this paper, we present a deep regression network termed DepressNet to learn a depression representation with visual explanation. Specifically, a deep convolutional neural network equipped with a global average pooling layer is first trained with facial depression data, which allows for identifying salient regions of the input image in terms of its severity score based on the generated depression activation map (DAM). We then propose a multi-region DepressNet, with which multiple local deep regression models for different face regions are jointly learned and their responses are fused to improve the overall recognition performance. We evaluate our method on two benchmark datasets, and the results show that our method significantly boosts the state-of-the-art performance of visual-based depression recognition. Most importantly, the DAM induced by our learned deep model may help reveal the visual depression pattern on faces and provide insights into automated depression diagnosis.
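A minimal sketch of the mechanism described above, not the authors' architecture: a small CNN with global average pooling regresses a severity score, and the spatial feature maps weighted by the regression weights yield a depression activation map (DAM).

```python
import torch
import torch.nn as nn

class GapRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.score = nn.Linear(32, 1)          # severity regression on pooled features

    def forward(self, img):
        fmap = self.features(img)                          # (B, 32, H, W)
        pooled = fmap.mean(dim=(2, 3))                     # global average pooling
        severity = self.score(pooled)
        # DAM: weight each spatial feature map by its contribution to the severity score.
        dam = torch.einsum("bchw,c->bhw", fmap, self.score.weight.squeeze(0))
        return severity, dam

severity, dam = GapRegressor()(torch.randn(1, 3, 128, 128))
```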

129 citations


Journal ArticleDOI
TL;DR: An automatic ECG-based emotion recognition algorithm to recognize human emotions elicited by listening to music and a sequential forward floating selection-kernel-based class separability-based (SFFS-KBCS-based) feature selection algorithm to effectively select significant ECG features associated with emotions.
Abstract: This paper presents an automatic ECG-based algorithm for human emotion recognition. First, we adopt a musical induction method to induce participants’ real emotional states and collect their ECG signals without any deliberate laboratory setting. Afterward, we develop an automatic ECG-based emotion recognition algorithm to recognize human emotions elicited by listening to music. Physiological ECG features extracted from time- and frequency-domain and nonlinear analyses of ECG signals are used to find emotion-relevant features and to correlate them with emotional states. Subsequently, we develop a sequential forward floating selection-kernel-based class separability-based (SFFS-KBCS-based) feature selection algorithm and utilize generalized discriminant analysis (GDA) to effectively select significant ECG features associated with emotions and to reduce the dimensions of the selected features, respectively. Positive/negative valence, high/low arousal, and four types of emotions (joy, tension, sadness, and peacefulness) are recognized using least squares support vector machine (LS-SVM) recognizers. The results show that the correct classification rates for the positive/negative valence, high/low arousal, and four-emotion classification tasks are 82.78, 72.91, and 61.52 percent, respectively.
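A minimal sketch of greedy sequential forward feature selection; for brevity it omits the floating (backward) step of SFFS and uses cross-validated SVM accuracy as a stand-in for the paper's KBCS criterion.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_select(X, y, k):
    selected, remaining = [], list(range(X.shape[1]))
    score = lambda idx: cross_val_score(SVC(), X[:, idx], y, cv=3).mean()
    while len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))  # add the most helpful feature
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = np.random.randn(60, 12), np.random.randint(0, 2, 60)        # toy ECG feature matrix
print(forward_select(X, y, k=4))
```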

120 citations


Journal ArticleDOI
TL;DR: This article points out the shortcomings and under-explored, yet key aspects of this field that are necessary to attain true sentiment understanding and attempts to chart a possible course for this field that covers many overlooked and unanswered questions.
Abstract: Sentiment analysis as a field has come a long way since it was first introduced as a task nearly 20 years ago. It has widespread commercial applications in various domains like marketing, risk management, market research, and politics, to name a few. Given its saturation in specific subtasks -- such as sentiment polarity classification -- and datasets, there is an underlying perception that this field has reached its maturity. In this article, we discuss this perception by pointing out the shortcomings and under-explored, yet key aspects of this field that are necessary to attain true sentiment understanding. We analyze the significant leaps responsible for its current relevance. Further, we attempt to chart a possible course for this field that covers many overlooked and unanswered questions.

119 citations


Journal ArticleDOI
TL;DR: This work advances the music emotion recognition state-of-the-art by proposing novel emotionally-relevant audio features related to musical texture and expressive techniques; analysing the features' relevance and results also uncovered interesting relations.
Abstract: This work advances the music emotion recognition state-of-the-art by proposing novel emotionally-relevant audio features. We reviewed the existing audio features implemented in well-known frameworks and their relationships with the eight commonly defined musical concepts. This knowledge helped uncover musical concepts lacking computational extractors, for which we propose algorithms, namely related to musical texture and expressive techniques. To evaluate our work, we created a public dataset of 900 audio clips, with subjective annotations following Russell's emotion quadrants. The existing audio features (baseline) and the proposed features (novel) were tested using 20 repetitions of 10-fold cross-validation. Adding the proposed features improved the F1-score to 76.4 percent (by 9 percent), when compared to a similar number of baseline-only features. Moreover, analysing the features' relevance and results uncovered interesting relations, namely the weight of specific features and musical concepts for each emotion quadrant, which warrant promising new directions for future research in the field of music emotion recognition, interactive media, and novel music interfaces.

Journal ArticleDOI
TL;DR: The issues and challenges that are related to extraction of different aspects and their relevant sentiments, relational mapping between aspects, interactions, dependencies and contextual-semantic relationships between different data objects for improved sentiment accuracy, and prediction of sentiment evolution dynamicity are emphasized.
Abstract: The domain of Aspect-based Sentiment Analysis, in which aspects are extracted, their sentiments are analyzed and the evolution of sentiments over time is tracked, is getting much attention with the increasing feedback of the public and customers on social media. The immense advancements in the field have urged researchers to devise new techniques and approaches, each addressing a different research question, to cope with upcoming issues and complex scenarios of Aspect-based Sentiment Analysis. Therefore, this survey emphasizes the issues and challenges that are related to the extraction of different aspects and their relevant sentiments, relational mapping between aspects, interactions, dependencies and contextual-semantic relationships between different data objects for improved sentiment accuracy, and the prediction of sentiment evolution dynamics. A rigorous overview of recent progress is summarized based on whether the studies contributed towards highlighting and mitigating the issue of Aspect Extraction, Aspect Sentiment Analysis or Sentiment Evolution. The reported performance for each scrutinized study of Aspect Extraction and Aspect Sentiment Analysis is also given, showing the quantitative evaluation of the proposed approach. Future research directions, which will be helpful for researchers and beneficial for improving aspect-level sentiment classification, are proposed and discussed by critically analysing the presented recent solutions.

Journal ArticleDOI
TL;DR: It is suggested that using only the signal section which best describes emotions improves the classification of emotions, and a viable framework for emotion recognition is introduced.
Abstract: Emotion recognition using brain wave signals involves using high-dimensional electroencephalogram (EEG) data. In this paper, a window selection method based on mutual information is introduced to select an appropriate signal window to reduce the length of the signals. The motivation for the windowing method comes from EEG emotion recognition being computationally costly and the data having a low signal-to-noise ratio. The aim of the windowing method is to find a reduced signal where the emotions are strongest. In this paper, it is suggested that using only the signal section which best describes emotions improves the classification of emotions. This is achieved by iteratively comparing different-length EEG signals at different time locations, using the mutual information between the reduced signal and emotion labels as the criterion. The reduced signal with the highest mutual information is used for extracting the features for emotion classification. In addition, a viable framework for emotion recognition is introduced. Experimental results on the publicly available DEAP and MAHNOB-HCI datasets show significant improvement in emotion recognition accuracy.
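A minimal sketch of the window-selection idea, with an assumed window grid and a deliberately simple per-channel feature: candidate windows of different lengths and positions are scored by the mutual information between their features and the emotion labels.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_window(eeg, labels, lengths=(256, 512), step=128):
    # eeg: (trials, channels, time); labels: (trials,)
    best, best_mi = None, -np.inf
    for win in lengths:
        for start in range(0, eeg.shape[2] - win + 1, step):
            feats = eeg[:, :, start:start + win].mean(axis=2)   # simple per-channel feature
            mi = mutual_info_classif(feats, labels).mean()       # MI between features and labels
            if mi > best_mi:
                best, best_mi = (start, win), mi
    return best                                                  # (start, length) of reduced signal

eeg = np.random.randn(40, 32, 1024)
labels = np.random.randint(0, 2, 40)
print(select_window(eeg, labels))
```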

Journal ArticleDOI
TL;DR: This paper presents a new database for the analysis of valence (positive or negative emotions), which comprises physiological recordings and 257-channel EEG data, contrary to all previously published datasets, which include at most 62 EEG channels.
Abstract: Electroencephalography (EEG)-based emotion recognition is currently a hot issue in the affective computing community. Numerous studies have been published on this topic, following generally the same schema: 1) presentation of emotional stimuli to a number of subjects during the recording of their EEG, 2) application of machine learning techniques to classify the subjects’ emotions. The proposed approaches vary mainly in the type of features extracted from the EEG and in the employed classifiers, but it is difficult to compare the reported results due to the use of different datasets. In this paper, we present a new database for the analysis of valence (positive or negative emotions), which is made publicly available. The database comprises physiological recordings and 257-channel EEG data, contrary to all previously published datasets, which include at most 62 EEG channels. Furthermore, we reconstruct the brain activity on the cortical surface by applying source localization techniques. We then compare the performances of valence classification that can be achieved with various features extracted from all source regions (source space features) and from all EEG channels (sensor space features), showing that the source reconstruction improves the classification results. Finally, we discuss the influence of several parameters on the classification scores.

Journal ArticleDOI
TL;DR: The proposed deep model is called ATtention-based LSTM with Domain Discriminator (ATDD-LSTM) that can characterize nonlinear relations among EEG signals of different electrodes that achieves superior performance on subject-dependent, subject-independent and cross-session evaluation.
Abstract: Most previous EEG-based emotion recognition methods studied hand-crafted EEG features extracted from different electrodes. In this paper, we study the relation among different EEG electrodes and propose a deep learning method to automatically extract the spatial features that characterize the functional relation between EEG signals at different electrodes. Our proposed deep model is called ATtention-based LSTM with Domain Discriminator (ATDD-LSTM) that can characterize nonlinear relations among EEG signals of different electrodes. To achieve state-of-the-art emotion recognition performance, the architecture of ATDD-LSTM has two distinguishing characteristics: (1) By applying the attention mechanism to the feature vectors produced by LSTM, ATDD-LSTM automatically selects suitable EEG channels for emotion recognition, which makes the learned model concentrate on the emotion related channels in response to a given emotion; (2) To minimize the significant feature distribution shift between different sessions and/or subjects, ATDD-LSTM uses a domain discriminator to modify the data representation space and generate domain-invariant features. We evaluate the proposed ATDD-LSTM model on three public EEG emotional databases (DEAP, SEED and CMEED) for emotion recognition. The experimental results demonstrate that our ATDD-LSTM model achieves superior performance on subject-dependent (for the same subject), subject-independent (for different subjects) and cross-session (for the same subject) evaluation.
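A minimal sketch of a strongly simplified variant of the two ideas named above, attention over per-electrode LSTM features and an adversarial domain head via gradient reversal; the dimensions and the way electrode features are fed to the LSTM are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, g):
        return -g                          # reversed gradients make features domain-invariant

class ATDDLite(nn.Module):
    def __init__(self, feat=64, hidden=64, classes=3, domains=15):
        super().__init__()
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.emotion = nn.Linear(hidden, classes)
        self.domain = nn.Linear(hidden, domains)   # e.g., one class per subject/session

    def forward(self, x):                          # x: (batch, electrodes, feat)
        h, _ = self.lstm(x)                        # one feature vector per electrode
        a = torch.softmax(self.attn(h), dim=1)     # attention over emotion-related channels
        z = (a * h).sum(dim=1)
        return self.emotion(z), self.domain(GradReverse.apply(z))

emo_logits, dom_logits = ATDDLite()(torch.randn(8, 62, 64))
```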

Journal ArticleDOI
TL;DR: A multi-task ensemble framework that jointly learns multiple related problems of emotion and sentiment analysis and outperforms the single-task frameworks in all experiments.
Abstract: We propose a multi-task ensemble framework that jointly learns multiple related problems. The ensemble model aims to leverage the learned representations of three deep learning models (i.e., CNN, LSTM and GRU) and a hand-crafted feature representation for the predictions. Through the multi-task framework, we address four problems of emotion and sentiment analysis, i.e., "emotion classification & intensity", "valence, arousal & dominance for emotion", "valence & arousal for sentiment", and "3-class categorical & 5-class ordinal classification for sentiment". The underlying problems cover two levels of granularity (i.e., coarse-grained and fine-grained) and a diverse range of domains (i.e., tweets, Facebook posts, news headlines, blogs, letters, etc.). Experimental results suggest that the proposed multi-task framework outperforms the single-task frameworks in all experiments.

Journal ArticleDOI
Byung Hyung Kim, Sungho Jo
TL;DR: A robust physiological model for the recognition of human emotions, based on a convolutional long short-term memory network and a new temporal margin-based loss function, which improves the performance of emotion recognition.
Abstract: Here we present a robust physiological model for the recognition of human emotions, called the Deep Physiological Affect Network. This model is based on a convolutional long short-term memory (ConvLSTM) network and a new temporal margin-based loss function. Formulating the emotion recognition problem as a spectral-temporal sequence classification problem of bipolar EEG signals underlying brain lateralization and photoplethysmogram signals, the proposed model improves the performance of emotion recognition. Specifically, the new loss function allows the model to be more confident as it observes more of a specific feeling while training ConvLSTM models; the function is designed to penalize violations of such confidence. Our experiments on a public dataset show that our deep physiological learning technology significantly increases the recognition rate over state-of-the-art techniques, with a 15.96 percent increase in accuracy. An extensive analysis of the relationship between participants’ emotion ratings and physiological changes in brain lateralization function during the experiment is also presented.
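A minimal sketch of one possible reading of the temporal margin idea (an assumption, not the paper's exact loss): as the model observes more of the sequence, its confidence in the true class should not drop below its earlier best by more than a margin, and violations are penalised.

```python
import torch

def temporal_margin_penalty(scores_over_time, labels, margin=0.05):
    # scores_over_time: (batch, steps, classes) softmax scores at successive time steps
    idx = labels.view(-1, 1, 1).expand(-1, scores_over_time.size(1), 1)
    true = scores_over_time.gather(2, idx).squeeze(2)        # confidence in the true class
    running_max, _ = torch.cummax(true, dim=1)               # best confidence seen so far
    drop = (running_max[:, :-1] - true[:, 1:] - margin).clamp(min=0)
    return drop.mean()                                       # penalty for confidence drops

scores = torch.softmax(torch.randn(4, 10, 3), dim=-1)
loss = temporal_margin_penalty(scores, torch.randint(0, 3, (4,)))
```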

Journal ArticleDOI
TL;DR: This paper proposes a novel machine learning approach that characterizes the categorical image emotions as a discrete probability distribution (DPD) and presents shared sparse learning to learn the combination coefficients, with which the DPD of an unseen image is predicted by linearly combining the D PDs of the training images.
Abstract: Computationally modelling the affective content of images has been extensively studied recently because of its wide applications in entertainment, advertisement, and education. Significant progress has been made on designing discriminative features to bridge the affective gap. Assuming that viewers can reach a consensus on the emotion of images, most existing works focused on assigning the dominant emotion category or the average dimension values to an image. However, the image emotions perceived by viewers are subjective by nature with the influence of personal and situational factors. In this paper, we propose a novel machine learning approach that characterizes the categorical image emotions as a discrete probability distribution (DPD). To associate emotion with the visual features extracted from images, we present shared sparse learning to learn the combination coefficients, with which the DPD of an unseen image is predicted by linearly combining the DPDs of the training images. Furthermore, we extend our method to the setup where multi-features are available and learn the optimal weights for each feature to reflect the importance of different features. Extensive experiments are carried out on Abstract, Emotion6 and IESN datasets and the results demonstrate the superiority of the proposed method, as compared to the state-of-the-art approaches.
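A minimal sketch that substitutes an off-the-shelf lasso solver for the paper's shared sparse learning formulation: the test image's features are sparsely reconstructed from the training features, and the same coefficients linearly combine the training DPDs.

```python
import numpy as np
from sklearn.linear_model import Lasso

def predict_dpd(train_feats, train_dpds, test_feat, alpha=0.01):
    # train_feats: (n, d); train_dpds: (n, emotions); test_feat: (d,)
    coder = Lasso(alpha=alpha, positive=True, fit_intercept=False)
    coder.fit(train_feats.T, test_feat)        # sparse combination coefficients over training images
    w = coder.coef_
    dpd = w @ train_dpds                        # combine training DPDs with the same coefficients
    return dpd / max(dpd.sum(), 1e-8)           # renormalise to a probability distribution

feats = np.random.rand(50, 128)
dpds = np.random.dirichlet(np.ones(8), size=50)
print(predict_dpd(feats, dpds, np.random.rand(128)))
```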

Journal ArticleDOI
TL;DR: A computational model of positive emotional contagion is proposed to describe how safety officers calm a crowd down and can provide guidance for emergency response management, and a maximization problem of emotional contagion is formulated.
Abstract: In a crisis situation, negative emotions often spread among the crowd, and they have adverse impacts on human decisions, resulting in stampedes and crushes. Safety officers are often dispatched to scenes of emergencies because their positive emotions can calm the crowd down and avoid serious accidents. However, how to utilize the positive emotional contagion to maximize the “calm-down” effect remains a challenging problem in crowd evacuation. In this paper, we present an approach for optimizing positive emotional contagion in crowd evacuation. First, a computational model of positive emotional contagion is proposed to describe how safety officers calm a crowd down. To capture important influential factors for positive emotional contagion, such as the trust relationships among the individuals involved in a crisis situation and the variations of emotional contagion speed, we construct a trust-based emotional contagion network (Trust-ECN) and a heterogeneous emotional contagion speed computation model (HECS-CM). Based on these models, the emotional contagion process can be analyzed in a parametric way, and the infection probability for each individual in a given time window can be computed analytically with a continuous-time Markov chain (CTMC). Second, a maximization problem of emotional contagion is formulated. Since this optimization problem is NP-hard, an artificial bee colony optimized emotional contagion (ABCEC) algorithm is used to solve for the optimal positions of safety officers. We demonstrate the effectiveness of our method on both synthetic and real-world data at different scales. Finally, we implement a crowd simulation system to visualize the results of our theoretical analysis in a graphical manner. The proposed method can provide guidance for emergency response management.

Journal ArticleDOI
TL;DR: Empirical experiments on cross-corpus speech emotion recognition tasks demonstrate that the proposed feature selection based transfer subspace learning method can achieve encouraging results in comparison with state-of-the-art algorithms.
Abstract: Cross-corpus speech emotion recognition has recently received considerable attention due to the widespread existence of various emotional speech. It takes one corpus as the training data aiming to recognize emotions of another corpus, and generally involves two basic problems, i.e., feature matching and feature selection. Many previous works study these two problems independently, or just focus on solving the first problem. In this paper, we propose a novel algorithm, called feature selection based transfer subspace learning (FSTSL), to address these two problems. To deal with the first problem, a latent common subspace is learnt by reducing the difference of different corpora and preserving the important properties. Meanwhile, we adopt the $l_{2,1}$-norm on the projection matrix to deal with the second problem. Besides, to guarantee the subspace to be robust and discriminative, the geometric information of data is exploited simultaneously in the proposed FSTSL framework. Empirical experiments on cross-corpus speech emotion recognition tasks demonstrate that our proposed method can achieve encouraging results in comparison with state-of-the-art algorithms.
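For reference, a minimal sketch of the $l_{2,1}$-norm used above; applied to the projection matrix it induces row sparsity, so entire features can be discarded. The matrix sizes are assumptions.

```python
import numpy as np

def l21_norm(W):
    # Sum of the Euclidean norms of the rows: ||W||_{2,1} = sum_i ||W_{i,:}||_2
    return np.sqrt((W ** 2).sum(axis=1)).sum()

W = np.random.randn(384, 30)     # projection from 384 acoustic features to a 30-dim subspace
reg_term = l21_norm(W)           # added (with a weight) to the learning objective
```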

Journal ArticleDOI
TL;DR: This paper proposes to use automatically detected human behaviour primitives as the low-dimensional descriptor for each frame of video-based automatic depression analysis, and proposes two novel spectral representations to represent video-level multi-scale temporal dynamics of expressive behaviour.
Abstract: Depression is a serious mental disorder affecting millions of people. Traditional clinical diagnosis methods are subjective, complicated and require extensive participation of clinicians. Recent advances in automatic depression analysis systems promise a future where these shortcomings are addressed by objective, repeatable, and readily available diagnostic tools to aid health professionals in their work. Yet there remain a number of barriers to the development of such tools. One barrier is that existing automatic depression analysis algorithms base their predictions on very brief sequential segments, sometimes as little as one frame. Another barrier is that existing methods do not take into account what the context of the measured behaviour is. In this paper, we extract multi-scale video-level features for video-based automatic depression analysis. We propose to use automatically detected human behaviour primitives as the low-dimensional descriptor for each frame. We also propose two novel spectral representations to represent video-level multi-scale temporal dynamics of expressive behaviour. Constructed spectral representations are fed to CNNs and ANNs for depression analysis. In addition to achieving state-of-the-art accuracy in depression severity estimation, we show that the task conducted by the user matters, that fusion of a combination of tasks reaches highest accuracy, and that longer tasks are more informative than shorter tasks, up to a point.
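A minimal sketch, with an assumed input format, of turning per-frame behaviour primitives into a video-level spectral descriptor: the amplitude spectrum of each primitive's time series summarises its temporal dynamics over the whole video.

```python
import numpy as np

def spectral_representation(primitives, num_bins=64):
    # primitives: (frames, num_primitives), e.g., per-frame action-unit intensities or head pose
    spectrum = np.abs(np.fft.rfft(primitives, axis=0))   # amplitude per frequency and primitive
    spectrum = spectrum[:num_bins]                        # keep the low-frequency bins
    return spectrum.T.flatten()                           # one fixed-length video descriptor

video = np.random.rand(3000, 16)          # assumed: ~2 min at 25 fps, 16 behaviour primitives
feat = spectral_representation(video)      # descriptor fed to a CNN/ANN for severity estimation
```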

Journal ArticleDOI
Cai Hanshu, Xiangzi Zhang, Yanhao Zhang, Ziyang Wang, Bin Hu
TL;DR: This paper provides a novel pervasive and effective method for automatic detection of depression using Electroencephalography data collected using a portable three-electrode EEG device and applying multiple classifiers.
Abstract: Depression, threatening the well-being of millions, has become one of the major diseases in the past decade. However, the current method of diagnosing depression relies on questionnaire-based interviews, which are labor-intensive and highly dependent on doctors’ experience. Thus, objective and cost-efficient methods are needed. In this paper, we present a case-based reasoning model for identifying depression. Electroencephalography data were collected using a portable three-electrode EEG device and then processed to remove artifacts and extract features. We applied multiple classifiers; the best-performing k-Nearest Neighbor (KNN) classifier was selected as the evaluation function to select the effective features, which were then used to create the case base. Based on the weight set of standard deviations, the similarity was calculated using a normalized Euclidean distance to obtain the optimal recognition rate of depression. The accuracy of optimal similarity identification of patients with depression was 91.25 percent, an improvement over the accuracy of the KNN classifier (81.44 percent) and of previously reported classifiers. Thus, we provide a novel pervasive and effective method for automatic detection of depression.
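A minimal sketch, under assumed notation, of the retrieval step in such a case-based reasoning model: a Euclidean distance in which each EEG feature is normalised by its standard deviation is turned into a similarity score for ranking stored cases.

```python
import numpy as np

def case_similarity(query, case, std):
    d = np.sqrt((((query - case) / std) ** 2).sum())   # std-normalised Euclidean distance
    return 1.0 / (1.0 + d)                             # higher score = more similar case

case_base = np.random.randn(100, 20)                   # 100 stored cases, 20 EEG features (assumed)
std = case_base.std(axis=0)
query = np.random.randn(20)
scores = np.array([case_similarity(query, c, std) for c in case_base])
best_case = scores.argmax()                            # label of the best case decides the outcome
```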

Journal ArticleDOI
TL;DR: A real-time anxiety monitoring system was established based on the above anxiety detection method, and it was shown that social anxiety significantly reduces the complexity of the heartbeats.
Abstract: Social anxiety is a negative emotion which may impair the health of the heart and the social functioning of an individual. This work analyzes the influence of social anxiety on the autonomic nerve control of the heart in two social exposure events: public speaking and thesis defending. In an experiment on public speaking, 59 human subjects were tested, and 11 conventional heartbeat measures and a heartbeat measure named the range of local Hurst exponents (RLHE) were evaluated for their capabilities to reveal the onset of social anxiety. A two-sample t-test between the baseline data and the high-anxiety data shows that social anxiety significantly reduces the complexity of the heartbeats. In an experiment on thesis defense, heartbeat data were acquired from nine graduate students. With the combination of three conventional features and the RLHE feature, a support vector machine classifier obtained a true positive rate and a true negative rate of 84.88 and 97.29 percent, respectively, in the five-fold cross-validation process of binary classification between high-anxiety and low-anxiety status; the classifier also achieved a generalization accuracy of 81.82 percent in detecting the high-anxiety status in the thesis defense. A real-time anxiety monitoring system was established based on the above anxiety detection method.

Journal ArticleDOI
TL;DR: A novel Slide-Patch and Whole-Face Attention model with SE blocks (SPWFA-SE), which jointly perceives the discriminative locality characteristics and informative global features of the face for effective FER, is proposed.
Abstract: Learning discriminative features is of vital importance for automatic Facial Expression Recognition (FER) in the wild. In this paper, we propose a novel Slide-Patch and Whole-Face Attention model with SE blocks (SPWFA-SE), which jointly perceives the discriminative locality characteristics and informative global features of the face for effective FER. Specifically, well-designed slide patches are proposed to extract local features. Different from existing methods, our slide patches can not only maintain the information at the edge areas of patches but also do not require facial landmark detection. Moreover, to make the model adaptively focus on the distinguishable regions, an attention module is proposed at the patch level to learn the weight of each patch. Furthermore, squeeze-and-excitation blocks are explored at the channel level to learn the weight of each channel. As such, the proposed multi-level feature extraction and attention mechanisms can enhance the representative ability of the learned features. Extensive experiments on five challenging datasets demonstrate that our method achieves state-of-the-art performance. Cross-database experiments on another three databases show the superior generalization performance of our model. Furthermore, complexity analysis shows that our model contains fewer parameters and trains faster than other competing models.
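A minimal sketch of a standard squeeze-and-excitation (SE) block, the channel-level attention the SPWFA-SE model builds on; the channel count and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (batch, channels, H, W)
        w = self.fc(x.mean(dim=(2, 3)))      # squeeze: global average pool per channel
        return x * w[:, :, None, None]       # excite: re-weight each channel

out = SEBlock(64)(torch.randn(8, 64, 28, 28))
```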

Journal ArticleDOI
TL;DR: An improved EMD feature extraction method applying Singular Value Decomposition (SVD) was proposed in this study, which can extract, as features, the coefficients of expansion over all IMFs as accurately as possible, regardless of potential linear dependence among the IMFs.
Abstract: Depression is a mental disorder characterized by persistent low mood that affects a person's thoughts, behavior, feelings, and sense of well-being. Depression will become the second major life-threatening illness in 2020. Electroencephalogram (EEG) signals are regarded as the best physiological tool for depression detection. Previous studies used the Empirical Mode Decomposition (EMD) method, which can deal with the highly complex, nonlinear and non-stationary nature of EEG, to extract features from EEG signals. However, for some special data, the neighboring components extracted through EMD can have sections of data carrying the same frequency at different time durations. Thus, the Intrinsic Mode Functions (IMFs) of the data can be linearly dependent, and the previously proposed EMD-based features cannot be extracted. To solve this problem, an improved EMD feature extraction method applying Singular Value Decomposition (SVD) was proposed in this study, which can extract, as features, the coefficients of expansion over all IMFs as accurately as possible, regardless of potential linear dependence among the IMFs. Experiments were conducted on four EEG databases for detecting depression. The improved EMD-based feature extraction method can extract features on all four EEG databases. The average classification results of the proposed method on the four EEG databases reached 83.27%, 85.19%, 81.98% and 88.07%, respectively.
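A minimal sketch of an assumed simplification of the idea: the IMFs from EMD are stacked into a matrix, SVD yields an orthogonal basis for their span, and the EEG segment's expansion coefficients in that basis serve as features even when the IMFs are linearly dependent.

```python
import numpy as np

def svd_imf_features(signal, imfs, k=5):
    # imfs: (num_imfs, samples) from any EMD implementation
    U, s, Vt = np.linalg.svd(imfs, full_matrices=False)
    basis = Vt[:k]                      # orthogonal directions spanning the IMFs
    return basis @ signal               # expansion coefficients used as features

t = np.linspace(0, 1, 256)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)
# Stand-in IMFs for illustration; a real pipeline would obtain these from EMD.
imfs = np.vstack([np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 20 * t), signal])
print(svd_imf_features(signal, imfs))
```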

Journal ArticleDOI
TL;DR: This study proposes a deeply-supervised attention network (DSAN) to recognize human emotions based on facial images automatically, taking full advantage of the race/gender/age-related information.
Abstract: Facial expression recognition (FER) is crucial for social communication. However, current studies present limitations when addressing facial expression differences due to demographic variation, e.g., race, gender, and age. In this study, we first propose a deeply-supervised attention network (DSAN) to recognize human emotions from facial images automatically. Based on DSAN, a two-stage training scheme is designed, taking full advantage of race/gender/age-related information. In our DSAN framework, multi-scale features are leveraged to capture more discriminative information from the deep layers to the shallow layers. Furthermore, we adopt the attention block to highlight the essential local facial characteristics; it performs well when incorporated into the deeply-supervised framework. Finally, we combine the complementary characteristics of multiple convolutional layers in a deeply-supervised manner and ensemble the intermediate predicted scores. Our experimental results show that our proposed framework can (i) effectively integrate demographic information to improve the performance of a variety of FER tasks, (ii) learn informative feature representations with a visual explanation by capturing the regions of interest (ROI), and (iii) achieve superior performance for both posed and spontaneous FER databases, each containing pictures of human facial expressions varying in gender, age and race.

Journal ArticleDOI
TL;DR: This article proposes a multi-task learning framework that uses auxiliary tasks for which data is abundantly available, such as gender identification and speaker recognition, which allow the use of very large datasets, e.g., speaker classification datasets.
Abstract: Despite the emerging importance of Speech Emotion Recognition (SER), the state-of-the-art accuracy is quite low and needs improvement to make commercial applications of SER viable. A key underlying reason for the low accuracy is the scarcity of emotion datasets, which is a challenge for developing any robust machine learning model in general. In this paper, we propose a solution to this problem: a multi-task learning framework that uses auxiliary tasks for which data is abundantly available. We show that utilisation of this additional data can improve the primary task of SER, for which only limited labelled data is available. In particular, we use gender identification and speaker recognition as auxiliary tasks, which allow the use of very large datasets, e.g., speaker classification datasets. To maximise the benefit of multi-task learning, we further use an adversarial autoencoder (AAE) within our framework, which has a strong capability to learn powerful and discriminative features. Furthermore, the unsupervised AAE in combination with the supervised classification networks enables semi-supervised learning, which incorporates a discriminative component in the AAE unsupervised training pipeline. The proposed model is rigorously evaluated for categorical and dimensional emotion recognition and in cross-corpus scenarios. Experimental results demonstrate that the proposed model achieves state-of-the-art performance on two publicly available datasets.

Journal ArticleDOI
TL;DR: This work proposes an approach to short-term detection of mood disorder based on the patterns in emotion of elicited speech responses and a class-specific latent affective structure model (LASM) is proposed to model the structural relationships among the emotion codewords with respect to six emotional videos for mood disorder detection.
Abstract: Mood disorders, including unipolar depression (UD) and bipolar disorder (BD) [1], are reported to be among the most common mental illnesses in recent years. In diagnostic evaluations of outpatients with mood disorders, a large portion of BD patients are initially misdiagnosed as having UD [2]. As most previous research focused on long-term monitoring of mood disorders, short-term detection, which could be used in early detection and intervention, is thus desirable. This work proposes an approach to short-term detection of mood disorder based on the patterns in emotion of elicited speech responses. To the best of our knowledge, there is currently no database for short-term detection of the discrimination between BD and UD. This work collected two databases: an emotional database (MHMC-EM) collected by the Multimedia Human Machine Communication (MHMC) lab, and a mood disorder database (CHI-MEI) collected by the CHI-MEI Medical Center, Taiwan. As the collected CHI-MEI mood disorder database is quite small and emotion annotation is difficult, the MHMC-EM emotional database is selected as a reference database for data adaptation. For the CHI-MEI mood disorder data collection, six eliciting emotional videos are selected and used to elicit the participants’ emotions. After watching each of the six eliciting emotional video clips, the participants answer the questions raised by the clinician. The speech responses are then used to construct the CHI-MEI mood disorder database. Hierarchical spectral clustering is used to adapt the collected MHMC-EM emotional database to fit the CHI-MEI mood disorder database, dealing with the data bias problem. The adapted MHMC-EM emotional data are then fed to a denoising autoencoder for bottleneck feature extraction. The bottleneck features are used to construct a long short-term memory (LSTM)-based emotion detector for the generation of emotion profiles from each speech response. The emotion profiles are then clustered into emotion codewords using the K-means algorithm. Finally, a class-specific latent affective structure model (LASM) is proposed to model the structural relationships among the emotion codewords with respect to the six emotional videos for mood disorder detection. A leave-one-group-out cross-validation scheme was employed for the evaluation of the proposed class-specific LASM-based approaches. Experimental results show that the proposed class-specific LASM-based method achieved an accuracy of 73.33 percent for mood disorder detection, outperforming classifiers based on SVM and LSTM.

Journal ArticleDOI
TL;DR: SplitFace as discussed by the authors is a deep convolutional neural network-based method that is explicitly designed to perform attribute detection in partially occluded faces, taking several facial segments and the full face as input, the proposed method takes a data driven approach to determine which attributes are localized in which facial segments.
Abstract: State-of-the-art methods of attribute detection from faces almost always assume the presence of a full, unoccluded face. Hence, their performance degrades for partially visible and occluded faces. In this paper, we introduce SPLITFACE, a deep convolutional neural network-based method that is explicitly designed to perform attribute detection in partially occluded faces. Taking several facial segments and the full face as input, the proposed method takes a data driven approach to determine which attributes are localized in which facial segments. The unique architecture of the network allows each attribute to be predicted by multiple segments, which permits the implementation of committee machine techniques for combining local and global decisions to boost performance. With access to segment-based predictions, SPLITFACE can predict well those attributes which are localized in the visible parts of the face, without having to rely on the presence of the whole face. We use the CelebA and LFWA facial attribute datasets for standard evaluations. We also modify both datasets, to occlude the faces, so that we can evaluate the performance of attribute detection algorithms on partial faces. Our evaluation shows that SPLITFACE significantly outperforms other recent methods especially for partial faces.

Journal ArticleDOI
TL;DR: Although the focus of this article is on classical feature engineering methodologies (based on handcrafted features), perspectives on deep learning-based approaches are discussed and strategies for future research on feature engineering for MER are proposed.
Abstract: The design of meaningful audio features is a key need to advance the state-of-the-art in Music Emotion Recognition (MER). This work presents a survey on the existing emotionally-relevant computational audio features, supported by the music psychology literature on the relations between eight musical dimensions (melody, harmony, rhythm, dynamics, tone color, expressivity, texture and form) and specific emotions. Based on this review, current gaps and needs are identified and strategies for future research on feature engineering for MER are proposed, namely ideas for computational audio features that capture elements of musical form, texture and expressivity that should be further researched. Finally, although the focus of this article is on classical feature engineering methodologies (based on handcrafted features), perspectives on deep learning-based approaches are discussed.