AVEC 2016 – Depression, Mood, and Emotion Recognition Workshop and Challenge

Michel Valstar, University of Nottingham, School of Computer Science
Jonathan Gratch, University of Southern California, ICT
Björn Schuller, University of Passau, Chair of Complex & Intelligent Systems
Fabien Ringeval, Université Grenoble Alpes, Laboratoire d'Informatique de Grenoble
Denis Lalanne, University of Fribourg, Human-IST Research Center
Mercedes Torres Torres, University of Nottingham, School of Computer Science
Stefan Scherer, University of Southern California, ICT
Giota Stratou, University of Southern California, ICT
Roddy Cowie, Queen's University Belfast, Department of Psychology
Maja Pantic, Imperial College London, Intelligent Behaviour Understanding Group
ABSTRACT
The Audio/Visual Emotion Challenge and Workshop
(AVEC 2016) “Depression, Mood and Emotion” will be the
sixth competition event aimed at comparison of multime-
dia processing and machine learning methods for automatic
audio, visual and physiological depression and emotion anal-
ysis, with all participants competing under strictly the same
conditions. The goal of the Challenge is to provide a com-
mon benchmark test set for multi-modal information pro-
cessing and to bring together the depression and emotion
recognition communities, as well as the audio, video and
physiological processing communities, to compare the rela-
tive merits of the various approaches to depression and emo-
tion recognition under well-defined and strictly comparable
conditions and establish to what extent fusion of the ap-
proaches is possible and beneficial. This paper presents the
challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.

Author notes: one author is further affiliated with Imperial College London, Department of Computing, London, U.K.; one with the University of Passau, Chair of Complex & Intelligent Systems; and one with Twente University, EEMCS, Twente, The Netherlands.

AVEC'16, 16 October 2016, Amsterdam, NL. Copyright is held by the owner/author(s); publication rights licensed to ACM. ACM 978-1-4503-4516-3/16/10, $15.00. DOI: http://dx.doi.org/10.1145/2988257.2988258
Keywords
Affective Computing, Emotion Recognition, Speech, Facial
Expression, Physiological signals, Challenge
1. INTRODUCTION
The 2016 Audio-Visual Emotion Challenge and Workshop
(AVEC 2016) will be the sixth competition event aimed
at comparison of multimedia processing and machine learn-
ing methods for automatic audio, video, and physiological
analysis of emotion and depression, with all participants
competing under strictly the same conditions. The goal
of the Challenge is to compare the relative merits of the
approaches (audio, video, and/or physiological) to emotion
recognition and severity of depression estimation under well-
defined and strictly comparable conditions, and establish to
what extent fusion of the approaches is possible and ben-
eficial. A second motivation is the need to advance emo-
tion recognition for multimedia retrieval to a level where
behaviomedical systems [38] are able to deal with large vol-
umes of non-prototypical naturalistic behaviour in reaction
to known stimuli, as this is exactly the type of data that di-
agnostic and in particular monitoring tools, as well as other
applications, would have to face in the real world.
AVEC 2016 will address emotion and depression recog-
nition. The emotion recognition sub-challenge is a refined
re-run of the AVEC 2015 challenge [27], largely based on
the same dataset. The depression severity estimation sub-
challenge is based on a novel dataset of human-agent inter-
actions, and sees the return of depression analysis, which

was a huge success in the AVEC 2013 [41] and 2014 [40]
challenges.
Depression Classification Sub-Challenge (DCC): participants are required to classify whether a person is depressed or not, where the binary ground-truth is based on the severity of self-reported depression as indicated by the PHQ-8 score for every human-agent interaction. For the DCC, performance in the competition will be measured using the average F1 score over the two classes, depressed and not depressed. Participants are also encouraged to provide an estimate of the severity of depression, assessed as the root mean square error between the predicted and ground-truth PHQ-8 scores over all HCI experiment sessions. In addition, participants are encouraged to report overall accuracy, average precision, and average recall to further analyse their results in the paper accompanying their submission.
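For concreteness, these measures could be computed with scikit-learn and numpy as in the sketch below; the variable names (y_true, y_pred, phq_true, phq_pred) are placeholders, and this is an illustration rather than the official challenge scoring script.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, mean_squared_error

# y_true / y_pred: binary depression labels per session (placeholders),
# phq_true / phq_pred: ground-truth and predicted PHQ-8 scores per session.
f1_avg = f1_score(y_true, y_pred, average="macro")            # mean F1 over the two classes
prec_avg = precision_score(y_true, y_pred, average="macro")   # average precision
rec_avg = recall_score(y_true, y_pred, average="macro")       # average recall
rmse = np.sqrt(mean_squared_error(phq_true, phq_pred))        # PHQ-8 severity RMSE
```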
Multimodal Affect Recognition Sub-Challenge (MASC): participants are required to perform fully continuous affect recognition of two affective dimensions, arousal and valence, where the level of affect has to be predicted for every moment of the recording. For the MASC, two regression problems need to be solved: prediction of the continuous dimensions valence and arousal. The MASC competition measure is the Concordance Correlation Coefficient (CCC), which combines Pearson's correlation coefficient (CC) with the squared difference between the means of the two compared time series, as shown in Equation (1):
ρ_c = (2 ρ σ_x σ_y) / (σ_x^2 + σ_y^2 + (µ_x − µ_y)^2)    (1)

where ρ is the Pearson correlation coefficient between the two time series (e. g., prediction and gold standard), σ_x^2 and σ_y^2 are the variances of each time series, and µ_x and µ_y are their respective means. Therefore, predictions that are well correlated with the gold standard but shifted in value are penalised in proportion to the deviation.
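As a reference, a minimal numpy implementation of Equation (1) could look as follows; this is a sketch for illustration, not the official challenge evaluation code.

```python
import numpy as np

def concordance_cc(prediction, gold_standard):
    """Concordance Correlation Coefficient (CCC) as defined in Equation (1)."""
    x = np.asarray(prediction, dtype=float)
    y = np.asarray(gold_standard, dtype=float)
    rho = np.corrcoef(x, y)[0, 1]  # Pearson's CC between prediction and gold standard
    return (2.0 * rho * x.std() * y.std()) / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)
```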
To be eligible to participate in the challenge, every entry
has to be accompanied by a paper presenting the results and
the methods that created them, which will undergo peer-
review. Only contributions with a relevant accepted paper
will be eligible for challenge participation. The organisers
reserve the right to re-evaluate the findings, but will not
participate in the Challenge themselves.
2. DEPRESSION ANALYSIS CORPUS
The Distress Analysis Interview Corpus - Wizard of Oz
(DAIC-WOZ) database is part of a larger corpus, the Dis-
tress Analysis Interview Corpus (DAIC) [13], that contains
clinical interviews designed to support the diagnosis of psy-
chological distress conditions such as anxiety, depression,
and post-traumatic stress disorder. These interviews were
collected as part of a larger effort to create a computer agent
that interviews people and identifies verbal and nonverbal
indicators of mental illness [8]. Data collected include audio
and video recordings and extensive questionnaire responses;
this part of the corpus includes the Wizard-of-Oz interviews,
conducted by an animated virtual interviewer called Ellie, controlled by a human interviewer in another room. Data has been transcribed and annotated for a variety of verbal and non-verbal features.

Figure 1: Histogram of depression severity scores (PHQ-8) for the DCC; data from the training and development sets are shown.

Information on how to obtain the shared data can be found at http://dcapswoz.ict.usc.edu; the data is freely available for research purposes.
2.1 Depression Analysis Labels
The level of depression is labelled with a single value per
recording using a standardised self-assessed subjective de-
pression questionnaire, the PHQ-8 [21]. This is similar to the
PHQ-9 questionnaire, but with the suicidal ideation question
removed for ethical reasons. The average depression sever-
ity on the training and development set of the challenge is
M = 6.67 (SD = 5.75). The distribution of the depression
severity scores based on the challenge training and develop-
ment set is provided in Figure 1. A baseline classifier that
constantly predicts the mean score of depression provides an
RMSE = 5.73 and an MAE = 4.74.
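The constant-mean baseline above can be reproduced in a few lines, as sketched below; `phq_scores` is a placeholder for the PHQ-8 labels, and the exact values depend on which partition the mean is computed over.

```python
import numpy as np

# phq_scores: PHQ-8 labels of the training and development sessions (placeholder).
constant_prediction = np.full(len(phq_scores), np.mean(phq_scores))
rmse = np.sqrt(np.mean((phq_scores - constant_prediction) ** 2))   # reported as 5.73
mae = np.mean(np.abs(phq_scores - constant_prediction))            # reported as 4.74
```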
2.2 Depression Analysis Baseline Features
In the following sections we describe how the publicly avail-
able baseline feature sets are computed for either the audio
or the video data. Participants can use these feature sets
exclusively or in addition to their own features. For ethical
reasons, no raw video is made available.
2.2.1 Video Features
Based on the OpenFace [2] framework (https://github.com/TadasBaltrusaitis/CLM-framework), we provide different types of video features:
facial landmarks: 2D and 3D coordinates of 68 points
on the face, estimated from video
HOG (histogram of oriented gradients) features on the
aligned 112x112 area of the face
gaze direction estimate for both eyes
head pose: 3D position and orientation of the head

In addition, we provide continuous emotion and facial action unit measures based on the FACET software [23].
Specifically, we provide the following measures:
emotion: {Anger, Contempt, Disgust, Joy, Fear, Neu-
tral, Sadness, Surprise, Confusion, Frustration}
AUs: {AU1, AU2, AU4, AU5, AU6, AU7, AU9, AU10,
AU12, AU14, AU15, AU17, AU18, AU20, AU23,
AU24, AU25, AU26, AU28, AU43}
2.2.2 Audio Features
For the audio features we utilized COVAREP (v1.3.2), a freely available open-source Matlab and Octave toolbox for speech analysis [7] (http://covarep.github.io/covarep/). The toolbox comprises well-validated and tested feature extraction methods that aim to capture both voice quality and prosodic characteristics of the speaker. These methods have been shown to correlate with psychological distress and depression in particular [32, 33]. We extracted the following features:
Prosodic: Fundamental frequency (F0) and voicing
(VUV)
Voice Quality: Normalized amplitude quotient
(NAQ), Quasi open quotient (QOQ), the difference
in amplitude of the first two harmonics of the dif-
ferentiated glottal source spectrum (H1H2), parabolic
spectral parameter (PSP), maxima dispersion quotient
(MDQ), spectral tilt/slope of wavelet responses (peak-
Slope), and shape parameter of the Liljencrants-Fant
model of the glottal pulse dynamics (Rd)
Spectral: Mel cepstral coefficients (MCEP0-24), Har-
monic Model and Phase Distortion mean (HMPDM0-
24) and deviations (HMPDD0-12).
In addition to the feature set above, raw audio and transcripts of the interviews are provided, allowing participants to compute additional features on their own. For more details on the shared features and the file formats, participants should also review the DAIC-WOZ documentation (http://dcapswoz.ict.usc.edu/wwwutil_files/DAICWOZDepression_Documentation.pdf).
3. EMOTION ANALYSIS CORPUS
The Remote Collaborative and Affective Interactions
(RECOLA) database [29] was recorded to study socio-
affective behaviours from multimodal data in the context
of computer supported collaborative work [28]. Sponta-
neous and naturalistic interactions were collected during
the resolution of a collaborative task that was performed in
dyads and remotely through video conference. Multimodal
signals, i. e., audio, video, electro-cardiogram (ECG) and
electro-dermal activity (EDA), were synchronously recorded
from 27 French-speaking subjects. Even though all subjects
speak French fluently, they have different nationalities (i. e.,
French, Italian or German), which provides some diversity in the expression of emotion.
The data is freely available for research purposes; information on how to obtain the RECOLA database can be found at http://diuf.unifr.ch/diva/recola.
Table 1: Inter-rater reliability on arousal and valence for the 6 raters and the 27 subjects of the RECOLA database, for raw and normalised ratings [26].

                        RMSE    CC     CCC    ICC     α
Raw         Arousal     .344   .400   .277   .775   .800
            Valence     .218   .446   .370   .811   .802
Normalised  Arousal     .263   .496   .431   .827   .856
            Valence     .174   .492   .478   .844   .829
Table 2: Partitioning of the RECOLA database into train, development, and test sets.

#             train        dev          test
female          6            5            5
male            3            4            4
French          6            7            7
Italian         2            1            2
German          1            1            0
age µ (σ)   21.2 (1.9)   21.8 (2.5)   21.2 (1.9)
3.1 Emotion Analysis Labels
Time-continuous ratings (40 ms binned frames) of emotional arousal and valence were created by six gender-balanced French-speaking assistants for the first five minutes of all recordings, because participants discussed their strategy, and hence showed more emotion, at the beginning of their interaction.
To assess inter-rater reliability, we computed the intra-
class correlation coefficient (ICC(3,1)) [36], and Cronbach’s
α [5]; ratings are concatenated over all subjects. Addi-
tionally, we computed the root-mean-square error (RMSE),
Pearson's CC and the CCC [22]; values are averaged over the C(6,2) = 15 pairs of raters. Results indicate a very strong inter-
rater reliability for both arousal and valence, cf. Table 1. A
normalisation technique based on the Evaluator Weighted Estimator [14] is used prior to the computation of the gold standard, i. e., the average of all ratings for each subject [26].
This technique has significantly (p < 0.001 for CC) improved
the inter-rater reliability for both arousal and valence; the
Fisher Z-transform is used to perform statistical compar-
isons between CC in this study.
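As an illustration, one common way to compute an Evaluator Weighted Estimator style gold standard is sketched below; the exact weighting scheme used for the challenge may differ, so this is an assumption-laden approximation of the approach in [14, 26] rather than the released annotation pipeline.

```python
import numpy as np

def ewe_gold_standard(ratings):
    """Weighted average of raters, with each rater weighted by the correlation
    of their trace with the plain inter-rater mean (one common EWE variant).
    `ratings` has shape (n_raters, n_frames)."""
    ratings = np.asarray(ratings, dtype=float)
    mean_rating = ratings.mean(axis=0)
    weights = np.array([np.corrcoef(r, mean_rating)[0, 1] for r in ratings])
    weights = np.clip(weights, 0.0, None)   # ignore negatively correlated raters
    return weights @ ratings / weights.sum()
```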
The dataset was divided into speaker disjoint subsets for
training, development (validation) and testing, by stratify-
ing (balancing) on gender and mother tongue, cf. Table 2.
3.2 Emotion Analysis Baseline Features
In the following we describe how the baseline feature sets
are computed for video, audio, and physiological data.
3.2.1 Video Features
Facial expressions play an important role in the commu-
nication of emotion [9]. Features are usually grouped into two types of facial descriptors: appearance-based and geometry-based [39]. For the video baseline feature set, we computed both, using Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP) [1] for appearance and facial landmarks [42] for geometry.
The LGBP-TOP are computed by splitting the video into
spatio-temporal video volumes. Each slice of the video vol-
ume extracted along 3 orthogonal planes (x-y, x-t and y-t) is
first convolved with a bank of 2D Gabor filters. The result-
ing Gabor pictures in the direction of x-y plane are divided
into 4x4 blocks. In the x-t and y-t directions they are divided
into 4x1 blocks. The LBP operator is then applied to each
of these resulting blocks followed by the concatenation of
the resulting LBP histograms from all the blocks. A feature
reduction is then performed by applying a Principal Com-
ponent Analysis (PCA) from a low-rank (up to rank 500)
approximation [15]. We obtained 84 features representing
98 % of the variance.
In order to extract geometric features, we tracked 49 facial
landmarks with the Supervised Descent Method (SDM) [42]
and aligned them with a mean shape from stable points (lo-
cated on the eye corners and on the nose region). As fea-
tures, we computed the difference between the coordinates of
the aligned landmarks and those from the mean shape, and
also between the aligned landmark locations in the previous
and the current frame; this procedure provided 196 features
in total. We then split the facial landmarks into groups ac-
cording to three different regions: i) the left eye and left
eyebrow, ii) the right eye and right eyebrow and iii) the
mouth. For each of these groups, the Euclidean distances
(L2-norm) and the angles (in radians) between the points
are computed, providing 71 features. We also computed the
Euclidean distance between the median of the stable land-
marks and each aligned landmark in a video frame. In total
the geometric set includes 316 features.
Both appearance and geometric feature sets are inter-
polated by a piecewise cubic Hermite polynomial to cope
with dropped frames. Finally, the arithmetic mean and the
standard-deviation are computed on all features using a slid-
ing window, which is shifted forward at a rate of 40 ms.
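A minimal sketch of these windowed statistics is given below, assuming the features are available as an (n_frames x n_dims) array sampled at a fixed frame rate; the handling of window centring and padding is an assumption, not the exact baseline implementation.

```python
import numpy as np

def windowed_mean_std(features, frame_rate, window_s, hop_s=0.04):
    """Arithmetic mean and standard deviation of each feature over a sliding
    window of `window_s` seconds, shifted forward by `hop_s` seconds (40 ms)."""
    features = np.asarray(features, dtype=float)
    win = max(int(round(window_s * frame_rate)), 1)
    hop = max(int(round(hop_s * frame_rate)), 1)
    outputs = []
    for start in range(0, max(len(features) - win + 1, 1), hop):
        chunk = features[start:start + win]
        outputs.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
    return np.vstack(outputs)
```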
3.2.2 Audio Features
In contrast to large scale feature sets, which have been
successfully applied to many speech classification tasks [34,
35], smaller, expert-knowledge based feature sets have also
shown high robustness for the modelling of emotion from
speech [25, 3]. Some recommendations for the definition of
a minimalistic acoustic standard parameter set have been re-
cently investigated, and have led to the Geneva Minimalistic
Acoustic Parameter Set (GeMAPS), and to an extended
version (eGeMAPS) [10], which is used here as baseline.
The acoustic low-level descriptors (LLD) cover spectral, cep-
stral, prosodic and voice quality information and are ex-
tracted with the openSMILE toolkit [11].
As the data in the RECOLA database contains long con-
tinuous recordings, we used overlapping fixed length seg-
ments, which are shifted forward at a rate of 40 ms, to ex-
tract functionals; the arithmetic mean and the coefficient
of variation are computed on all 42 LLDs. The following functionals are additionally applied to pitch and loudness: percentiles 20, 50 and 80, the range of percentiles 20 to 80, and the mean and standard deviation of the slope of rising/falling signal parts. Functionals applied to the pitch,
jitter, shimmer, and all formant related LLDs, are applied
to voiced regions only. Additionally, the average RMS en-
ergy is computed and 6 temporal features are included: the
rate of loudness peaks per second, mean length and standard
deviation of continuous voiced and unvoiced segments and
the rate of voiced segments per second, approximating the
pseudo syllable rate. Overall, the acoustic baseline feature set contains 88 features.
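For reference, eGeMAPS features are typically extracted by calling the openSMILE command-line tool with the eGeMAPS configuration; the sketch below is only indicative, as the configuration path, file names, and output options are assumptions that depend on the installed openSMILE version.

```python
import subprocess

# Indicative SMILExtract call; the config path and file names are placeholders
# and may need to be adapted to the local openSMILE installation.
subprocess.run(
    [
        "SMILExtract",
        "-C", "config/gemaps/eGeMAPSv01a.conf",  # eGeMAPS configuration (assumed path)
        "-I", "recording.wav",                   # input audio (placeholder)
        "-O", "egemaps_features.csv",            # output file (format set by the config)
    ],
    check=True,
)
```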
3.2.3 Physiological Features
Physiological signals are known to be well correlated with
emotion [20, 19], despite not being directly perceptible the way audio-visual signals are. Although there are some controversies about peripheral physiology and emotion [31, 18], we believe that autonomic measures should be considered along with audio-visual data in the realm of affective computing, as they not only provide complementary descriptions of
affect, but can also be easily and continuously monitored
with wearable sensors [30, 24, 4].
For the baseline, we extracted features from both the ECG and EDA signals using overlapping windows (40 ms step). The ECG signal was first band-pass filtered ([3, 27] Hz)
with a zero-delay 6th order Butterworth filter [26], and 19
features were then computed: the zero-crossing rate, the
four first statistical moments, the normalised length den-
sity, the non-stationary index, the spectral entropy, slope,
mean frequency plus 6 spectral coefficients, the power in
low frequency (LF, 0.04-0.15 Hz), high frequency (HF, 0.15-
0.4 Hz) and the LF/HF power ratio. Additionally, we ex-
tracted the heart rate (HR) and its measure of variability
(HRV) from the filtered ECG signal [26]. For each of those
two descriptors, we computed the two first statistical mo-
ments, the arithmetic mean of rising and falling slope, and
the percentage of rising values, which provided 10 features
in total.
EDA reflects a rapid, transient response called skin con-
ductance response (SCR), as well as a slower, basal drift
called skin conductance level (SCL) [6]. Both SCL (0–0.5 Hz) and SCR (0.5–1 Hz) are estimated using a 3rd order Butterworth filter; 8 features are then computed for each of those three low-level descriptors (EDA, SCL, and SCR): the first four statistical moments of the original time series and of its first-order derivative w.r.t. time.
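The filtering steps described above can be approximated with scipy as sketched below; `fs` and the signal arrays are placeholders, and using forward-backward filtering (`filtfilt`) is our reading of the "zero-delay" filter, which doubles the effective filter order.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def zero_phase_bandpass(signal, fs, low_hz, high_hz, order=3):
    """Zero-phase Butterworth band-pass via forward-backward filtering."""
    nyq = 0.5 * fs
    b, a = butter(order, [low_hz / nyq, high_hz / nyq], btype="band")
    return filtfilt(b, a, signal)

# Illustrative use (ecg, eda are 1-D arrays sampled at fs Hz, all placeholders):
# ecg_filtered = zero_phase_bandpass(ecg, fs, 3.0, 27.0)   # [3, 27] Hz ECG band
# scr = zero_phase_bandpass(eda, fs, 0.5, 1.0)             # 0.5-1 Hz skin conductance response
# b, a = butter(3, 0.5 / (0.5 * fs), btype="low")          # 0-0.5 Hz skin conductance level
# scl = filtfilt(b, a, eda)
```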
4. CHALLENGE BASELINES
For transparency and reproducibility, we use standard and
open-source algorithms for both sub-challenges. We describe
below how the baseline system was defined and the results
we obtained for each modality separately, as well as on the
fusion of all modalities.
4.1 Depression
The challenge baseline for the depression classification
sub-challenge is computed using the scikit-learn toolbox (http://scikit-learn.org/).
In particular, we fit a linear support vector machine with
stochastic gradient descent, i. e. the loss is computed one
sample at a time and the model is sequentially updated.
We validated the model on the development set and con-
ducted a grid search for optimal hyper-parameters on the
development set of both the audio data and video data sep-
arately. Features of both modalities are taken from the pro-
vided challenge baseline features. Classification and training
was performed on a frame-wise basis (i. e., at 100 Hz for audio and 30 Hz for video); temporal fusion was conducted through simple majority voting over all the frames within an entire screening interview. For both modalities we conducted a grid search over the following parameters: loss function ∈ {logarithmic, hinge loss}, regularization ∈ {L1, L2}, and α ∈ {1e1, 1e0, ..., 1e−5}. For the audio data the optimal identified hyper-parameters are loss function = hinge loss, regularization = L1, and α = 1e−3. For the video data the optimal identified hyper-parameters are loss function = logarithmic, regularization = L1, and α = 1e0. The ensemble of audio and video was computed through a simple binary fusion with a logical AND. The test performance was computed with a classifier trained using the optimal parameters found in the grid search. Since the positive outputs of the video modality are a subset of those of the audio modality, the ensemble classifier's performance is exactly the same as that of the video modality on both the development and test sets. Results are summarized in Table 3.

Table 3: Baseline results for depression classification. Performance is measured as the F1 score for the depressed and not depressed classes as reported through the PHQ-8; precision and recall are also provided. Values for the class not depressed are reported in brackets.

Partition     Modality   F1 score      Precision     Recall
Development   Audio      .462 (.682)   .316 (.938)   .857 (.540)
Development   Video      .500 (.896)   .600 (.867)   .428 (.928)
Development   Ensemble   .500 (.896)   .600 (.867)   .428 (.928)
Test          Audio      .410 (.582)   .267 (.941)   .889 (.421)
Test          Video      .583 (.851)   .467 (.938)   .778 (.790)
Test          Ensemble   .583 (.857)   .467 (.938)   .778 (.790)
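A rough scikit-learn sketch of the frame-wise classifier and grid search described above is given below; the data variables and the session-level aggregation are placeholders (the loss is named "log" rather than "log_loss" in older scikit-learn releases), so this is an approximation of the baseline rather than the released code.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid

# X_frames / y_frames: frame-wise features and labels of the training partition;
# dev_sessions / dev_labels: per-session frame matrices and session labels of the
# development partition (all placeholders).
grid = ParameterGrid({
    "loss": ["hinge", "log_loss"],                     # "log" in older scikit-learn versions
    "penalty": ["l1", "l2"],
    "alpha": [10.0 ** k for k in range(1, -6, -1)],    # 1e1, 1e0, ..., 1e-5
})
best_score, best_params = -1.0, None
for params in grid:
    clf = SGDClassifier(**params).fit(X_frames, y_frames)
    # Majority vote over all frames of an interview gives the session-level label.
    session_pred = [int(round(clf.predict(frames).mean())) for frames in dev_sessions]
    score = f1_score(dev_labels, session_pred, average="macro")
    if score > best_score:
        best_score, best_params = score, params
```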
In addition to the classification baseline, we also computed a regression baseline using a random forest regressor. The only hyper-parameter in this experiment was the number of trees ∈ {10, 20, 50, 100, 200} in the random forest. For both audio and video the best performing random forest uses 10 trees. Regression was performed on a frame-wise basis, as for the classification, and temporal fusion over the interview was conducted by averaging the outputs over the entire screening interview. Fusion of the audio and video modalities was performed by averaging the regression outputs of the unimodal random forest regressors. The performance in terms of root mean square error (RMSE) and mean absolute error (MAE) on the development and test sets is provided in Table 4.

Table 4: Baseline results for depression severity estimation. Performance is measured as the mean absolute error (MAE) and root mean square error (RMSE) between the predicted and reported PHQ-8 scores, averaged over all sequences.

Partition     Modality      RMSE   MAE
Development   Audio         6.74   5.36
Development   Video         7.13   5.88
Development   Audio-Video   6.62   5.52
Test          Audio         7.78   5.72
Test          Video         6.97   6.12
Test          Audio-Video   7.05   5.66
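The regression baseline described above could be sketched as follows; the variable names are placeholders, and the snippet only illustrates the frame-wise training, the per-interview averaging, and the late fusion by averaging, not the released baseline code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# X_audio / X_video: frame-wise features, y: frame-wise PHQ-8 targets;
# dev_audio_sessions / dev_video_sessions: lists of per-session frame matrices
# (all placeholders).
for n_trees in [10, 20, 50, 100, 200]:
    rf_audio = RandomForestRegressor(n_estimators=n_trees).fit(X_audio, y)
    rf_video = RandomForestRegressor(n_estimators=n_trees).fit(X_video, y)
    audio_pred = np.array([rf_audio.predict(f).mean() for f in dev_audio_sessions])
    video_pred = np.array([rf_video.predict(f).mean() for f in dev_video_sessions])
    fused_pred = (audio_pred + video_pred) / 2.0   # late fusion by averaging unimodal outputs
```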
Table 5: Size of the window W in seconds used to extract features on the different modalities, and delay D in seconds applied to the gold standard, according to the emotional dimension, i. e., arousal (A) and valence (V); parameters were obtained by optimising the performance, measured as CCC, on the development partition.

Modality           W_A   D_A   W_V   D_V
Audio               4    2.8    6    3.6
Video-appearance    6    2.8    4    2.4
Video-geometric     4    2.4    8    2.8
ECG                 4    0.4   10    2.0
HRHRV               8    0.0    8    0.0
EDA                 8    0.0   10    0.4
SCL                 4    0.0   14    2.4
SCR                 4    0.8   14    0.8
4.2 Affect
Mono-modal emotion recognition was first investigated
separately for each modality. Baseline features were ex-
tracted as previously described, with a window size W rang-
ing from four to 14 seconds, and a step of two seconds. The
window was centred, i. e., the first feature vector was as-
signed to the center of the window (W/2), and duplicated for
the previous frames; the same procedure was applied for the
last frames. For video data, frames for which the face was
not detected were ignored. For EDA, SCL, and SCR, test
data from subject #7 was not used, due to an issue during the recording of this subject (the sensor was partially detached from the skin). Two different techniques were investigated
to standardise the features: (i) online (standardisation pa-
rameters µ and σ are computed on the training partition
and used on all partitions), and (ii) speaker dependent (µ
and σ are computed and applied on features of each sub-
ject). In order to compensate for the reaction time of the raters, a time delay D is applied to the gold standard by shifting its values back in time (the last value is duplicated), with a delay ranging from zero to eight seconds and a step of 400 ms.
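The delay compensation can be illustrated as follows, assuming the 40 ms binned gold standard (25 Hz); the padding with the last value follows the description above, while the rest of the interface is a placeholder.

```python
import numpy as np

def shift_gold_standard(gold, delay_s, frame_rate=25.0):
    """Shift the gold standard back in time by `delay_s` seconds and duplicate
    the last value so the series keeps its original length."""
    gold = np.asarray(gold, dtype=float)
    n = int(round(delay_s * frame_rate))
    if n <= 0:
        return gold.copy()
    return np.concatenate([gold[n:], np.full(n, gold[-1])])
```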
For machine learning, we used a linear Support Vector Machine (SVM) to perform the regression task with the liblinear library [12]; the L2-regularised L2-loss dual solver was chosen (option -s 12) and a unit bias was added to the feature vector (option -B 1); all other parameters were kept
