AVEC 2016 – Depression, Mood, and Emotion Recognition Workshop and Challenge

Michel Valstar, University of Nottingham, School of Computer Science
Jonathan Gratch, University of Southern California, ICT
Björn Schuller, University of Passau, Chair of Complex & Intelligent Systems
Fabien Ringeval, Université Grenoble Alpes, Laboratoire d'Informatique de Grenoble
Denis Lalanne, University of Fribourg, Human-IST Research Center
Mercedes Torres Torres, University of Nottingham, School of Computer Science
Stefan Scherer, University of Southern California, ICT
Giota Stratou, University of Southern California, ICT
Roddy Cowie, Queen's University Belfast, Department of Psychology
Maja Pantic, Imperial College London, Intelligent Behaviour Understanding Group
ABSTRACT
The Audio/Visual Emotion Challenge and Workshop
(AVEC 2016) “Depression, Mood and Emotion” will be the
sixth competition event aimed at comparison of multime-
dia processing and machine learning methods for automatic
audio, visual and physiological depression and emotion anal-
ysis, with all participants competing under strictly the same
conditions. The goal of the Challenge is to provide a com-
mon benchmark test set for multi-modal information pro-
cessing and to bring together the depression and emotion
recognition communities, as well as the audio, video and
physiological processing communities, to compare the rela-
tive merits of the various approaches to depression and emo-
tion recognition under well-defined and strictly comparable
conditions and establish to what extent fusion of the ap-
proaches is possible and beneficial. This paper presents the
challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.

Author notes: one author is further affiliated with Imperial College London, Department of Computing, London, U.K.; one with the University of Passau, Chair of Complex & Intelligent Systems; and one with Twente University, EEMCS, Twente, The Netherlands.

AVEC'16, 16 October 2016, Amsterdam, NL. Copyright is held by the owner/author(s); publication rights licensed to ACM. ACM 978-1-4503-4516-3/16/10, $15.00. DOI: http://dx.doi.org/10.1145/2988257.2988258
Keywords
Affective Computing, Emotion Recognition, Speech, Facial
Expression, Physiological signals, Challenge
1. INTRODUCTION
The 2016 Audio-Visual Emotion Challenge and Workshop
(AVEC 2016) will be the sixth competition event aimed
at comparison of multimedia processing and machine learn-
ing methods for automatic audio, video, and physiological
analysis of emotion and depression, with all participants
competing under strictly the same conditions. The goal
of the Challenge is to compare the relative merits of the
approaches (audio, video, and/or physiological) to emotion
recognition and severity of depression estimation under well-
defined and strictly comparable conditions, and establish to
what extent fusion of the approaches is possible and ben-
eficial. A second motivation is the need to advance emo-
tion recognition for multimedia retrieval to a level where
behaviomedical systems [38] are able to deal with large vol-
umes of non-prototypical naturalistic behaviour in reaction
to known stimuli, as this is exactly the type of data that di-
agnostic and in particular monitoring tools, as well as other
applications, would have to face in the real world.
AVEC 2016 will address emotion and depression recog-
nition. The emotion recognition sub-challenge is a refined
re-run of the AVEC 2015 challenge [27], largely based on
the same dataset. The depression severity estimation sub-
challenge is based on a novel dataset of human-agent inter-
actions, and sees the return of depression analysis, which

was a huge success in the AVEC 2013 [41] and 2014 [40]
challenges.
Depression Classification Sub-Challenge (DCC): participants are required to classify whether a person is depressed or not, where the binary ground-truth is based on the severity of self-reported depression as indicated by the PHQ-8 score for every human-agent interaction. For the DCC, performance in the competition will be measured using the average F1 score over the two classes, depressed and not depressed. Participants are also encouraged to provide an estimate of the severity of depression, assessed as the root mean square error between the predicted and ground-truth PHQ-8 scores over all HCI experiment sessions. In addition, participants are encouraged to report overall accuracy, average precision, and average recall to further analyse their results in the paper accompanying their submission.
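For concreteness, these measures could be computed with scikit-learn and numpy as in the sketch below; the variable names (y_true, y_pred, phq_true, phq_pred) are placeholders, and this is an illustration rather than the official challenge scoring script.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, mean_squared_error

# y_true / y_pred: binary depression labels per session (placeholders),
# phq_true / phq_pred: ground-truth and predicted PHQ-8 scores per session.
f1_avg = f1_score(y_true, y_pred, average="macro")            # mean F1 over the two classes
prec_avg = precision_score(y_true, y_pred, average="macro")   # average precision
rec_avg = recall_score(y_true, y_pred, average="macro")       # average recall
rmse = np.sqrt(mean_squared_error(phq_true, phq_pred))        # PHQ-8 severity RMSE
```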
Multimodal Affect Recognition Sub-Challenge (MASC): participants are required to perform fully continuous affect recognition of two affective dimensions, arousal and valence, where the level of affect has to be predicted for every moment of the recording. For the MASC, two regression problems need to be solved: prediction of the continuous dimensions valence and arousal. The MASC competition measure is the Concordance Correlation Coefficient (CCC), which combines Pearson's correlation coefficient (CC) with the squared difference between the means of the two compared time series, as shown in Equation (1):
ρ_c = (2 ρ σ_x σ_y) / (σ_x^2 + σ_y^2 + (µ_x − µ_y)^2)    (1)

where ρ is the Pearson correlation coefficient between the two time series (e. g., prediction and gold standard), σ_x^2 and σ_y^2 are the variances of each time series, and µ_x and µ_y are their respective means. Therefore, predictions that are well correlated with the gold standard but shifted in value are penalised in proportion to the deviation.
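As a reference, a minimal numpy implementation of Equation (1) could look as follows; this is a sketch for illustration, not the official challenge evaluation code.

```python
import numpy as np

def concordance_cc(prediction, gold_standard):
    """Concordance Correlation Coefficient (CCC) as defined in Equation (1)."""
    x = np.asarray(prediction, dtype=float)
    y = np.asarray(gold_standard, dtype=float)
    rho = np.corrcoef(x, y)[0, 1]  # Pearson's CC between prediction and gold standard
    return (2.0 * rho * x.std() * y.std()) / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)
```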
To be eligible to participate in the challenge, every entry
has to be accompanied by a paper presenting the results and
the methods that created them, which will undergo peer-
review. Only contributions with a relevant accepted paper
will be eligible for challenge participation. The organisers
reserve the right to re-evaluate the findings, but will not
participate in the Challenge themselves.
2. DEPRESSION ANALYSIS CORPUS
The Distress Analysis Interview Corpus - Wizard of Oz
(DAIC-WOZ) database is part of a larger corpus, the Dis-
tress Analysis Interview Corpus (DAIC) [13], that contains
clinical interviews designed to support the diagnosis of psy-
chological distress conditions such as anxiety, depression,
and post-traumatic stress disorder. These interviews were
collected as part of a larger effort to create a computer agent
that interviews people and identifies verbal and nonverbal
indicators of mental illness [8]. Data collected include audio
and video recordings and extensive questionnaire responses;
this part of the corpus includes the Wizard-of-Oz interviews,
conducted by an animated virtual interviewer called Ellie, controlled by a human interviewer in another room. Data has been transcribed and annotated for a variety of verbal and non-verbal features.

Figure 1: Histogram of depression severity scores (PHQ-8) for the DCC; data from the training and development sets are shown.

Information on how to obtain the shared data can be found at http://dcapswoz.ict.usc.edu; the data is freely available for research purposes.
2.1 Depression Analysis Labels
The level of depression is labelled with a single value per
recording using a standardised self-assessed subjective de-
pression questionnaire, the PHQ-8 [21]. This is similar to the
PHQ-9 questionnaire, but with the suicidal ideation question
removed for ethical reasons. The average depression sever-
ity on the training and development set of the challenge is
M = 6.67 (SD = 5.75). The distribution of the depression
severity scores based on the challenge training and develop-
ment set is provided in Figure 1. A baseline classifier that
constantly predicts the mean score of depression provides an
RMSE = 5.73 and an MAE = 4.74.
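The constant-mean baseline above can be reproduced in a few lines, as sketched below; `phq_scores` is a placeholder for the PHQ-8 labels, and the exact values depend on which partition the mean is computed over.

```python
import numpy as np

# phq_scores: PHQ-8 labels of the training and development sessions (placeholder).
constant_prediction = np.full(len(phq_scores), np.mean(phq_scores))
rmse = np.sqrt(np.mean((phq_scores - constant_prediction) ** 2))   # reported as 5.73
mae = np.mean(np.abs(phq_scores - constant_prediction))            # reported as 4.74
```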
2.2 Depression Analysis Baseline Features
In the following sections we describe how the publicly avail-
able baseline feature sets are computed for either the audio
or the video data. Participants can use these feature sets
exclusively or in addition to their own features. For ethical
reasons, no raw video is made available.
2.2.1 Video Features
Based on the OpenFace [2] framework (https://github.com/TadasBaltrusaitis/CLM-framework), we provide different types of video features:
facial landmarks: 2D and 3D coordinates of 68 points
on the face, estimated from video
HOG (histogram of oriented gradients) features on the
aligned 112x112 area of the face
gaze direction estimate for both eyes
head pose: 3D position and orientation of the head

In addition, we provide continuous emotion and facial action unit measures based on the FACET software [23].
Specifically, we provide the following measures:
emotion: {Anger, Contempt, Disgust, Joy, Fear, Neu-
tral, Sadness, Surprise, Confusion, Frustration}
AUs: {AU1, AU2, AU4, AU5, AU6, AU7, AU9, AU10,
AU12, AU14, AU15, AU17, AU18, AU20, AU23,
AU24, AU25, AU26, AU28, AU43}
2.2.2 Audio Features
For the audio features we utilized COVAREP (v1.3.2), a freely available open-source Matlab and Octave toolbox for speech analysis [7] (http://covarep.github.io/covarep/). The toolbox comprises well-validated and tested feature extraction methods that aim to capture both voice quality and prosodic characteristics of the speaker. These methods have been shown to correlate with psychological distress and depression in particular [32, 33]. We extracted the following features:
Prosodic: Fundamental frequency (F0) and voicing
(VUV)
Voice Quality: Normalized amplitude quotient
(NAQ), Quasi open quotient (QOQ), the difference
in amplitude of the first two harmonics of the dif-
ferentiated glottal source spectrum (H1H2), parabolic
spectral parameter (PSP), maxima dispersion quotient
(MDQ), spectral tilt/slope of wavelet responses (peak-
Slope), and shape parameter of the Liljencrants-Fant
model of the glottal pulse dynamics (Rd)
Spectral: Mel cepstral coefficients (MCEP0-24), Har-
monic Model and Phase Distortion mean (HMPDM0-
24) and deviations (HMPDD0-12).
In addition to the feature set above, raw audio and transcripts of the interviews are provided, allowing participants to compute additional features on their own. For more details on the shared features and the file formats, participants should also review the DAIC-WOZ documentation (http://dcapswoz.ict.usc.edu/wwwutil_files/DAICWOZDepression_Documentation.pdf).
3. EMOTION ANALYSIS CORPUS
The Remote Collaborative and Affective Interactions
(RECOLA) database [29] was recorded to study socio-
affective behaviours from multimodal data in the context
of computer supported collaborative work [28]. Sponta-
neous and naturalistic interactions were collected during
the resolution of a collaborative task that was performed in
dyads and remotely through video conference. Multimodal
signals, i. e., audio, video, electro-cardiogram (ECG) and
electro-dermal activity (EDA), were synchronously recorded
from 27 French-speaking subjects. Even though all subjects
speak French fluently, they have different nationalities (i. e.,
French, Italian or German), which provides some diversity in the expression of emotion.
The data is freely available for research purposes; information on how to obtain the RECOLA database can be found at http://diuf.unifr.ch/diva/recola.
Table 1: Inter-rater reliability on arousal and valence for the 6 raters and the 27 subjects of the RECOLA database, for raw and normalised ratings [26].

                        RMSE    CC     CCC    ICC     α
Raw         Arousal     .344   .400   .277   .775   .800
            Valence     .218   .446   .370   .811   .802
Normalised  Arousal     .263   .496   .431   .827   .856
            Valence     .174   .492   .478   .844   .829
Table 2: Partitioning of the RECOLA database into train, development, and test sets.

#             train        dev          test
female          6            5            5
male            3            4            4
French          6            7            7
Italian         2            1            2
German          1            1            0
age µ (σ)   21.2 (1.9)   21.8 (2.5)   21.2 (1.9)
3.1 Emotion Analysis Labels
Time-continuous ratings (40 ms binned frames) of emotional arousal and valence were created by six gender-balanced French-speaking assistants for the first five minutes of all recordings, because participants discussed their strategy, and hence showed more emotion, at the beginning of their interaction.
To assess inter-rater reliability, we computed the intra-
class correlation coefficient (ICC(3,1)) [36], and Cronbach’s
α [5]; ratings are concatenated over all subjects. Addi-
tionally, we computed the root-mean-square error (RMSE),
Pearson's CC and the CCC [22]; values are averaged over the C(6,2) = 15 pairs of raters. Results indicate a very strong inter-
rater reliability for both arousal and valence, cf. Table 1. A
normalisation technique based on the Evaluator Weighted Estimator [14] is used prior to the computation of the gold standard, i. e., the average of all ratings for each subject [26].
This technique has significantly (p < 0.001 for CC) improved
the inter-rater reliability for both arousal and valence; the
Fisher Z-transform is used to perform statistical compar-
isons between CC in this study.
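As an illustration, one common way to compute an Evaluator Weighted Estimator style gold standard is sketched below; the exact weighting scheme used for the challenge may differ, so this is an assumption-laden approximation of the approach in [14, 26] rather than the released annotation pipeline.

```python
import numpy as np

def ewe_gold_standard(ratings):
    """Weighted average of raters, with each rater weighted by the correlation
    of their trace with the plain inter-rater mean (one common EWE variant).
    `ratings` has shape (n_raters, n_frames)."""
    ratings = np.asarray(ratings, dtype=float)
    mean_rating = ratings.mean(axis=0)
    weights = np.array([np.corrcoef(r, mean_rating)[0, 1] for r in ratings])
    weights = np.clip(weights, 0.0, None)   # ignore negatively correlated raters
    return weights @ ratings / weights.sum()
```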
The dataset was divided into speaker disjoint subsets for
training, development (validation) and testing, by stratify-
ing (balancing) on gender and mother tongue, cf. Table 2.
3.2 Emotion Analysis Baseline Features
In the following we describe how the baseline feature sets
are computed for video, audio, and physiological data.
3.2.1 Video Features
Facial expressions play an important role in the commu-
nication of emotion [9]. Features are usually grouped into two types of facial descriptors: appearance-based and geometry-based [39]. For the video baseline feature set, we computed both, using Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP) [1] for appearance and facial landmarks [42] for geometry.
The LGBP-TOP are computed by splitting the video into
spatio-temporal video volumes. Each slice of the video vol-
ume extracted along 3 orthogonal planes (x-y, x-t and y-t) is
first convolved with a bank of 2D Gabor filters. The result-
ing Gabor pictures in the direction of x-y plane are divided
into 4x4 blocks. In the x-t and y-t directions they are divided
into 4x1 blocks. The LBP operator is then applied to each
of these resulting blocks followed by the concatenation of
the resulting LBP histograms from all the blocks. A feature
reduction is then performed by applying a Principal Com-
ponent Analysis (PCA) from a low-rank (up to rank 500)
approximation [15]. We obtained 84 features representing
98 % of the variance.
In order to extract geometric features, we tracked 49 facial
landmarks with the Supervised Descent Method (SDM) [42]
and aligned them with a mean shape from stable points (lo-
cated on the eye corners and on the nose region). As fea-
tures, we computed the difference between the coordinates of
the aligned landmarks and those from the mean shape, and
also between the aligned landmark locations in the previous
and the current frame; this procedure provided 196 features
in total. We then split the facial landmarks into groups ac-
cording to three different regions: i) the left eye and left
eyebrow, ii) the right eye and right eyebrow and iii) the
mouth. For each of these groups, the Euclidean distances
(L2-norm) and the angles (in radians) between the points
are computed, providing 71 features. We also computed the
Euclidean distance between the median of the stable land-
marks and each aligned landmark in a video frame. In total
the geometric set includes 316 features.
Both appearance and geometric feature sets are inter-
polated by a piecewise cubic Hermite polynomial to cope
with dropped frames. Finally, the arithmetic mean and the
standard-deviation are computed on all features using a slid-
ing window, which is shifted forward at a rate of 40 ms.
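A minimal sketch of these windowed statistics is given below, assuming the features are available as an (n_frames x n_dims) array sampled at a fixed frame rate; the handling of window centring and padding is an assumption, not the exact baseline implementation.

```python
import numpy as np

def windowed_mean_std(features, frame_rate, window_s, hop_s=0.04):
    """Arithmetic mean and standard deviation of each feature over a sliding
    window of `window_s` seconds, shifted forward by `hop_s` seconds (40 ms)."""
    features = np.asarray(features, dtype=float)
    win = max(int(round(window_s * frame_rate)), 1)
    hop = max(int(round(hop_s * frame_rate)), 1)
    outputs = []
    for start in range(0, max(len(features) - win + 1, 1), hop):
        chunk = features[start:start + win]
        outputs.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
    return np.vstack(outputs)
```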
3.2.2 Audio Features
In contrast to large scale feature sets, which have been
successfully applied to many speech classification tasks [34,
35], smaller, expert-knowledge based feature sets have also
shown high robustness for the modelling of emotion from
speech [25, 3]. Some recommendations for the definition of
a minimalistic acoustic standard parameter set have been re-
cently investigated, and have led to the Geneva Minimalistic
Acoustic Parameter Set (GeMAPS), and to an extended
version (eGeMAPS) [10], which is used here as baseline.
The acoustic low-level descriptors (LLD) cover spectral, cep-
stral, prosodic and voice quality information and are ex-
tracted with the openSMILE toolkit [11].
As the data in the RECOLA database contains long con-
tinuous recordings, we used overlapping fixed length seg-
ments, which are shifted forward at a rate of 40 ms, to ex-
tract functionals; the arithmetic mean and the coefficient
of variation are computed on all 42 LLDs. The following functionals are additionally applied to pitch and loudness: percentiles 20, 50 and 80, the range of percentiles 20 to 80, and the mean and standard deviation of the slope of rising/falling signal parts. Functionals applied to the pitch,
jitter, shimmer, and all formant related LLDs, are applied
to voiced regions only. Additionally, the average RMS en-
ergy is computed and 6 temporal features are included: the
rate of loudness peaks per second, mean length and standard
deviation of continuous voiced and unvoiced segments and
the rate of voiced segments per second, approximating the
pseudo syllable rate. Overall, the acoustic baseline feature set contains 88 features.
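For reference, eGeMAPS features are typically extracted by calling the openSMILE command-line tool with the eGeMAPS configuration; the sketch below is only indicative, as the configuration path, file names, and output options are assumptions that depend on the installed openSMILE version.

```python
import subprocess

# Indicative SMILExtract call; the config path and file names are placeholders
# and may need to be adapted to the local openSMILE installation.
subprocess.run(
    [
        "SMILExtract",
        "-C", "config/gemaps/eGeMAPSv01a.conf",  # eGeMAPS configuration (assumed path)
        "-I", "recording.wav",                   # input audio (placeholder)
        "-O", "egemaps_features.csv",            # output file (format set by the config)
    ],
    check=True,
)
```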
3.2.3 Physiological Features
Physiological signals are known to be well correlated with
emotion [20, 19], despite not being directly perceptible the way audio-visual signals are. Although there are some controversies about peripheral physiology and emotion [31, 18], we believe that autonomic measures should be considered along with audio-visual data in the realm of affective computing, as they not only provide complementary descriptions of
affect, but can also be easily and continuously monitored
with wearable sensors [30, 24, 4].
For the baseline, we extracted features from both the ECG and EDA signals using overlapping windows (40 ms step). The ECG signal was first band-pass filtered ([3, 27] Hz)
with a zero-delay 6th order Butterworth filter [26], and 19
features were then computed: the zero-crossing rate, the
four first statistical moments, the normalised length den-
sity, the non-stationary index, the spectral entropy, slope,
mean frequency plus 6 spectral coefficients, the power in
low frequency (LF, 0.04-0.15 Hz), high frequency (HF, 0.15-
0.4 Hz) and the LF/HF power ratio. Additionally, we ex-
tracted the heart rate (HR) and its measure of variability
(HRV) from the filtered ECG signal [26]. For each of those
two descriptors, we computed the two first statistical mo-
ments, the arithmetic mean of rising and falling slope, and
the percentage of rising values, which provided 10 features
in total.
EDA reflects a rapid, transient response called skin con-
ductance response (SCR), as well as a slower, basal drift
called skin conductance level (SCL) [6]. Both SCL (0–0.5 Hz) and SCR (0.5–1 Hz) are estimated using a 3rd order Butterworth filter; 8 features are then computed for each of those three low-level descriptors (EDA, SCL, and SCR): the first four statistical moments of the original time series and of its first-order derivative w.r.t. time.
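The filtering steps described above can be approximated with scipy as sketched below; `fs` and the signal arrays are placeholders, and using forward-backward filtering (`filtfilt`) is our reading of the "zero-delay" filter, which doubles the effective filter order.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def zero_phase_bandpass(signal, fs, low_hz, high_hz, order=3):
    """Zero-phase Butterworth band-pass via forward-backward filtering."""
    nyq = 0.5 * fs
    b, a = butter(order, [low_hz / nyq, high_hz / nyq], btype="band")
    return filtfilt(b, a, signal)

# Illustrative use (ecg, eda are 1-D arrays sampled at fs Hz, all placeholders):
# ecg_filtered = zero_phase_bandpass(ecg, fs, 3.0, 27.0)   # [3, 27] Hz ECG band
# scr = zero_phase_bandpass(eda, fs, 0.5, 1.0)             # 0.5-1 Hz skin conductance response
# b, a = butter(3, 0.5 / (0.5 * fs), btype="low")          # 0-0.5 Hz skin conductance level
# scl = filtfilt(b, a, eda)
```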
4. CHALLENGE BASELINES
For transparency and reproducibility, we use standard and
open-source algorithms for both sub-challenges. We describe
below how the baseline system was defined and the results
we obtained for each modality separately, as well as on the
fusion of all modalities.
4.1 Depression
The challenge baseline for the depression classification
sub-challenge is computed using the scikit-learn toolbox (http://scikit-learn.org/).
In particular, we fit a linear support vector machine with
stochastic gradient descent, i. e. the loss is computed one
sample at a time and the model is sequentially updated.
We validated the model on the development set and con-
ducted a grid search for optimal hyper-parameters on the
development set of both the audio data and video data sep-
arately. Features of both modalities are taken from the pro-
vided challenge baseline features. Classification and training
was performed on a frame-wise basis (i. e., at 100 Hz for audio and 30 Hz for video); temporal fusion was conducted through simple majority voting over all the frames within an entire screening interview. For both modalities we conducted a grid search over the following parameters: loss function ∈ {logarithmic, hinge loss}, regularization ∈ {L1, L2}, and α ∈ {1e1, 1e0, ..., 1e−5}. For the audio data the optimal identified hyper-parameters are loss function = hinge loss, regularization = L1, and α = 1e−3. For the video data the optimal identified hyper-parameters are loss function = logarithmic, regularization = L1, and α = 1e0. The ensemble of audio and video was computed through a simple binary fusion with a logical AND. The test performance was computed with a classifier trained using the optimal parameters found in the grid search. Since the positive outputs of the video modality are a subset of those of the audio modality, the ensemble classifier's performance is exactly the same as that of the video modality on both the development and test sets. Results are summarized in Table 3.

Table 3: Baseline results for depression classification. Performance is measured as the F1 score for the depressed and not depressed classes as reported through the PHQ-8; precision and recall are also provided. Values for the class not depressed are reported in brackets.

Partition     Modality   F1 score      Precision     Recall
Development   Audio      .462 (.682)   .316 (.938)   .857 (.540)
Development   Video      .500 (.896)   .600 (.867)   .428 (.928)
Development   Ensemble   .500 (.896)   .600 (.867)   .428 (.928)
Test          Audio      .410 (.582)   .267 (.941)   .889 (.421)
Test          Video      .583 (.851)   .467 (.938)   .778 (.790)
Test          Ensemble   .583 (.857)   .467 (.938)   .778 (.790)
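A rough scikit-learn sketch of the frame-wise classifier and grid search described above is given below; the data variables and the session-level aggregation are placeholders (the loss is named "log" rather than "log_loss" in older scikit-learn releases), so this is an approximation of the baseline rather than the released code.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid

# X_frames / y_frames: frame-wise features and labels of the training partition;
# dev_sessions / dev_labels: per-session frame matrices and session labels of the
# development partition (all placeholders).
grid = ParameterGrid({
    "loss": ["hinge", "log_loss"],                     # "log" in older scikit-learn versions
    "penalty": ["l1", "l2"],
    "alpha": [10.0 ** k for k in range(1, -6, -1)],    # 1e1, 1e0, ..., 1e-5
})
best_score, best_params = -1.0, None
for params in grid:
    clf = SGDClassifier(**params).fit(X_frames, y_frames)
    # Majority vote over all frames of an interview gives the session-level label.
    session_pred = [int(round(clf.predict(frames).mean())) for frames in dev_sessions]
    score = f1_score(dev_labels, session_pred, average="macro")
    if score > best_score:
        best_score, best_params = score, params
```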
In addition to the classification baseline, we also computed a regression baseline using a random forest regressor. The only hyper-parameter in this experiment was the number of trees ∈ {10, 20, 50, 100, 200} in the random forest. For both audio and video the best performing random forest uses 10 trees. Regression was performed on a frame-wise basis, as for the classification, and temporal fusion over the interview was conducted by averaging the outputs over the entire screening interview. Fusion of the audio and video modalities was performed by averaging the regression outputs of the unimodal random forest regressors. The performance in terms of root mean square error (RMSE) and mean absolute error (MAE) on the development and test sets is provided in Table 4.

Table 4: Baseline results for depression severity estimation. Performance is measured as the mean absolute error (MAE) and root mean square error (RMSE) between the predicted and reported PHQ-8 scores, averaged over all sequences.

Partition     Modality      RMSE   MAE
Development   Audio         6.74   5.36
Development   Video         7.13   5.88
Development   Audio-Video   6.62   5.52
Test          Audio         7.78   5.72
Test          Video         6.97   6.12
Test          Audio-Video   7.05   5.66
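The regression baseline described above could be sketched as follows; the variable names are placeholders, and the snippet only illustrates the frame-wise training, the per-interview averaging, and the late fusion by averaging, not the released baseline code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# X_audio / X_video: frame-wise features, y: frame-wise PHQ-8 targets;
# dev_audio_sessions / dev_video_sessions: lists of per-session frame matrices
# (all placeholders).
for n_trees in [10, 20, 50, 100, 200]:
    rf_audio = RandomForestRegressor(n_estimators=n_trees).fit(X_audio, y)
    rf_video = RandomForestRegressor(n_estimators=n_trees).fit(X_video, y)
    audio_pred = np.array([rf_audio.predict(f).mean() for f in dev_audio_sessions])
    video_pred = np.array([rf_video.predict(f).mean() for f in dev_video_sessions])
    fused_pred = (audio_pred + video_pred) / 2.0   # late fusion by averaging unimodal outputs
```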
Table 5: Size of the window W in seconds used to extract features on the different modalities, and delay D in seconds applied to the gold standard, according to the emotional dimension, i. e., arousal (A) and valence (V); parameters were obtained by optimising the performance, measured as CCC, on the development partition.

Modality           W_A   D_A   W_V   D_V
Audio               4    2.8    6    3.6
Video-appearance    6    2.8    4    2.4
Video-geometric     4    2.4    8    2.8
ECG                 4    0.4   10    2.0
HRHRV               8    0.0    8    0.0
EDA                 8    0.0   10    0.4
SCL                 4    0.0   14    2.4
SCR                 4    0.8   14    0.8
4.2 Affect
Mono-modal emotion recognition was first investigated
separately for each modality. Baseline features were ex-
tracted as previously described, with a window size W rang-
ing from four to 14 seconds, and a step of two seconds. The
window was centred, i. e., the first feature vector was as-
signed to the center of the window (W/2), and duplicated for
the previous frames; the same procedure was applied for the
last frames. For video data, frames for which the face was
not detected were ignored. For EDA, SCL, and SCR, test
data from subject #7 was not used, due to an issue during the recording of this subject (the sensor was partially detached from the skin). Two different techniques were investigated
to standardise the features: (i) online (standardisation pa-
rameters µ and σ are computed on the training partition
and used on all partitions), and (ii) speaker dependent (µ
and σ are computed and applied on features of each sub-
ject). In order to compensate for the reaction time of the raters, a time delay D is applied to the gold standard by shifting its values back in time (the last value is duplicated), with a delay ranging from zero to eight seconds and a step of 400 ms.
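The delay compensation can be illustrated as follows, assuming the 40 ms binned gold standard (25 Hz); the padding with the last value follows the description above, while the rest of the interface is a placeholder.

```python
import numpy as np

def shift_gold_standard(gold, delay_s, frame_rate=25.0):
    """Shift the gold standard back in time by `delay_s` seconds and duplicate
    the last value so the series keeps its original length."""
    gold = np.asarray(gold, dtype=float)
    n = int(round(delay_s * frame_rate))
    if n <= 0:
        return gold.copy()
    return np.concatenate([gold[n:], np.full(n, gold[-1])])
```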
For machine learning, we used a linear Support Vector Machine (SVM) to perform the regression task with the liblinear library [12]; the L2-regularised L2-loss dual solver was chosen (option -s 12) and a unit bias was added to the feature vector (option -B 1); all other parameters were kept
