Continuous Estimation of Emotions in Speech
by Dynamic Cooperative Speaker Models
Arianna Mencattini, Eugenio Martinelli*, Fabien Ringeval, Björn Schuller, Corrado Di Natale

A. Mencattini, E. Martinelli, and C. Di Natale are with the Department of Electronic Engineering, University of Rome Tor Vergata, Rome, Italy. E-mail: mencattini,martinelli@ing.uniroma2.it, dinatale@uniroma2.it
F. Ringeval is with the Chair of Complex & Intelligent Systems, University of Passau, Passau, Germany. E-mail: fabien.ringeval@uni-passau.de
B. Schuller is with the Chair of Complex & Intelligent Systems, University of Passau, Passau, Germany, and with the Department of Computing, Imperial College London, London, UK. E-mail: bjoern.schuller@imperial.ac.uk
Abstract—Automatic emotion recognition from speech has recently focused on the prediction of time-continuous dimensions (e.g., arousal and valence) of spontaneous and realistic expressions of emotion, as found in real-life interactions. However, the automatic prediction of such emotions poses several challenges, such as the subjectivity found in the definition of a gold standard from a pool of raters and the issue of data scarcity in training models. In this work, we introduce a novel emotion recognition system based on ensembles of single-speaker-regression-models (SSRMs). The estimation of emotion is provided by combining a subset of the initial pool of SSRMs, selecting those that are most concordant among them. The proposed approach allows the addition or removal of speakers from the ensemble without the necessity to re-build the entire machine learning system. The simplicity of this aggregation strategy, coupled with the flexibility assured by the modular architecture, and the promising results obtained on the RECOLA database highlight the potential implications of the proposed method in a real-life scenario and in particular in web-based applications.
Index Terms—Speech emotion recognition, cooperative regression
model, naturalistic emotional display
1 INTRODUCTION
Speech is one of, if not the, most natural ways for humans to communicate. In everyday social interactions, humans express various complex feelings such as emotion and empathy. Despite the fact that the cognitive processes used to encode affective information during social interactions are relatively complex, humans can easily manage to decode such information in real time from multimodal cues. Conversely, the effort required of computer-based systems for a reliable and autonomous understanding of emotion is still challenging, even for the unimodal analysis of speech. Nonetheless, the development of such affective computing systems is promising for many distinct fields of research. Health care systems may offer a personalized treatment according to the measured emotional content, along with an auxiliary diagnostic tool of the psychological or developmental state of the patient, such as depression [1], [2] or autism spectrum conditions [3]. Remote care assistance can benefit from the estimation of the affective state (e.g., stress or fear) in the voice of elderly people [4]. Moreover, applications such as speech-based advertising [5], remote teaching (e-learning) [6], job interviews [7], and surveillance systems [8] may be incredibly enriched by customer-affect oriented services and monitoring, among many others.
Beyond the proven interest in the relatively new discipline of affective computing, numerous issues have until now limited the full development and use of speech emotion recognition (SER) systems in real-life applications [9]. Whereas the automatic recognition of acted emotion can provide useful insights into the process of encoding affective behaviors into speech and lead to very high recognition rates [10], [11], [12], it is widely acknowledged that such data cannot be a good representative of the emotions produced in real-life interactions [13]. Spontaneous emotions are indeed much more subtle and almost never appear as a “full-blown” expression [14]. As a result, the automatic recognition of spontaneous emotions is much more challenging in comparison to the automatic recognition of acted emotions. In such a scenario, we aimed at developing a system able to continuously and automatically predict the perceived emotional condition of a subject expressed in any kind of naturalistic environment.
1.1 Related work
Recently, databases of emotion collected during natural interactions with time-continuous ratings (e.g., arousal and valence [15]) have emerged, such as the Sensitive Artificial Listener (SAL) set in the HUMAINE database [16], the SEMAINE database [17] and the RECOLA database [18]. Such databases have caused a shift in methods, first of all moving from classification to regression to be able to model continuous affective dimensions [19], and next moving from utterance or segment level labels [20] to quasi time-continuous labels [17], [21]. Automatic recognition of naturalistic emotion from time-continuous labels presents, however, several challenges that are not yet solved [9], such as the definition of a reliable gold standard from a pool of raters and the issue of data scarcity in training models.
In the light of the appraisal theory from the domain of emotion psychology [22], each annotator may have a subjective perception of the affective state expressed by an individual, motivated by his/her own past and present experience, memories, reasoning, etc. Additionally, humans have natural biases and inconsistencies in their judgement [23], which create additional noise in the ratings. Further, the variability in emotion perception can also be observed in the time domain, since the evaluators may have a different reaction lag (RL) during the procedure of time-continuous annotation [24]. However, the natural diversity found in emotion perception is usually merged when a machine learning model is trained, by averaging several evaluations from a pool of raters into a single gold standard. Whereas the use of all annotation data can help preserve diversity in emotion perception, e.g., by using multi-task learning of each annotator [25], [26], it has the main disadvantage of increasing the overall complexity of the model according to the number of available raters. The issue of synchronisation of various individual ratings for defining a gold standard has also been investigated with signal processing techniques. Models of RL have been estimated from the data, by maximising the correlation coefficient ([27], [28], [29]) or the mutual information ([24]) between audiovisual features and emotional ratings while shifting the latter back in time.
Regarding the issue of data scarcity, the main question to be solved is how to deal with the huge diversity found in a collection of spontaneous displays of emotion. The common approach in the literature is to use all the emotion variability found in the data as training material and tune the machine learning system in order to disregard the less relevant instances (e.g., by optimising the number of support vectors and the soft margin in Support Vector Regression (SVR)) for emotion prediction [19], [30], [31], [32]. Some recent works have proposed to use cooperative learning as a means to select the most informative instances from a set of unlabelled acoustic utterances [33]. But the core underlying idea of this approach is to reduce the cost of the human annotation task, e.g., by selecting instances which are predicted with a low confidence level, not to consider consensus as a way to optimise the predictability of a given SER system. Attempts have already been made in developing cooperative strategies in supervised classification with ensemble models [34], or by considering multi-scaled sliding windows for binary classification [35]. Cooperative strategies have also been used to perform fusion of multimodal stimuli, by using either early (i.e., features) or late (i.e., decisions) fusion techniques [26], [36], [37], [38].

Taking inspiration from the cooperative strategy proposed in [39], here we introduce a system able to autonomously and temporarily change the composition of a restricted group of predictors provided by single-speaker-regression-models (SSRMs) in a cooperation task governed by a concordance paradigm.
1.2 Main contributions
Motivations of our work lie in the intention to produce a system that can predict the perceived level of emotion of a subject from speech analysis through the fusion of multiple independently trained systems. In this regard, we propose a three-topics formulation of the problem of SER from time-continuous labels: emotion subjectivity, models concordance, and dynamic settings.
As mentioned earlier, the use of annotated data of emotion has the immediate consequence of forcing the discrepancy between the emotion produced by the subject and that perceived by the evaluators [22]. Even though the latter may not match the actual affective state of the subject, the evaluators provide the only available judgement about the emotion, transferring the natural subjectivity of the speaker into the subjectivity of a group of listeners. Hence, in this paper, we propose a modular strategy based on cooperative models to perform emotion prediction from speech data. A consensus-based merging strategy is crucial for the cooperation of concordant responses, either of the evaluators (e.g., the Evaluator Weighted Estimator (EWE) [40]) or of the models developed for each speaker. The main goal here is not to consider emotion prediction as a fixed evaluation procedure, but rather as a dynamic cooperative task.
The first stage of our SER system consists of developing an SSRM for each speaker. Then, a second stage follows that consists of applying a cooperative strategy to merge the responses provided by the different SSRMs, while dynamically selecting the window of observation in which the concordance of the responses is estimated. The possibility to develop single-speaker models merged through a cooperative strategy makes the proposed method easily applicable to real-time applications. Mobile devices and web-based applications require that the regression model can be continuously updated with new data, while avoiding the exponential increase of the learning time or the re-training of the whole model after the addition of new speakers to the system. The cooperative approach proposed in this paper offers an elegant solution to this constraint, because it is able to embed new speakers' models, independently trained on separate speech sequences, in a dynamic cooperative generation rule. Further, the dynamic adaptation of the SSRM along the observation window allows the system to automatically select the most concordant models and thus maximise the overall performance.
In line with the three-topics formulation reported above, and in accordance with the paper organization, the main contributions of this paper can be listed as follows: (i) we propose to use a quadrant-based temporal division to estimate the RL of emotion annotation and perform feature selection (topic emotion subjectivity), (ii) we define a dynamic consensus-based cooperative strategy to predict emotion from several SSRMs (topics dynamic settings and models concordance), and (iii) we perform extensive evaluations on a fully naturalistic database of emotion (RECOLA) to compare the performance of our system with methods from the state of the art.

The remainder of this article is structured as follows: first, Section 2 gives a detailed description of the proposed consensus-based SER system and introduces the database used for the experiments; next, Section 3 reports results. Final remarks and directions for future research are given in Section 4.
2 DATA AND METHODS
2.1 Database
A new multimodal corpus of spontaneous interactions in French called RECOLA, for REmote COLlaborative and Affective interactions, was recently introduced by Ringeval et al. [18]. Spontaneous interactions were collected during the resolution of a collaborative task (“Winter survival task”) that was performed in dyads (i.e., two speakers interacting at a time) and remotely by video conference. The RECOLA database includes 9.5 h of multimodal recordings, i.e., audio, video, electro-cardiogram (ECG) and electro-dermal activity (EDA), that were continuously and synchronously recorded from 46 participants. Ratings of emotion were performed time- and value-continuously by six French-speaking assistants (three male, three female) via the ANNEMO web-based annotation tool [18], for the first five minutes of all recorded sequences. The dataset for which participants gave their consent to share their data is reduced to a set of 34 participants, for an overall duration of seven hours, from which the annotations of 23 participants (10 male, 13 female; age: µ = 21.3 years and σ = 4.1 years) were made publicly available (http://diuf.unifr.ch/diva/recola/). Although all participants were French speakers, they had different mother tongues: 17 subjects were French, three German, and three Italian. Note that the nonconsecutive numeric speaker labels displayed in this paper (e.g., P16, P17, P21, and so on) originate from the RECOLA dataset.
[Figure 1: block diagram in which the speech and annotations of the 1st, 2nd, ..., nth speaker each feed a Single Speaker Regression Model (SSRM); the Cooperative Regression Model (CRM) merges the SSRM responses that reach a common consensus to produce the prediction for an unlabelled speaker.]
Fig. 1. Schematic description of the consensus-based speech emotion recognition system.
Algorithm 1 Construction of each Single Speaker Regression Model (SSRM)
1: acoustic feature extraction
2: gold standard estimation
3: Quadrant-Based Temporal Division (QBTD)
4: for all $q \in \{a^-, a^+, v^-, v^+\}$ do
5:     $L_q \leftarrow$ length of each segment
6:     for RL = 0 to 8 s, step 0.04 s do
7:         shift gold standard by RL
8:         compute CFS(RL) for feature selection
9:     end for
10:    $RL^q_{opt} \leftarrow \arg\max_{RL} CFS(RL)$
11:    save selected features according to $RL^q_{opt}$
12: end for
13: $RL_a \leftarrow \frac{1}{L_{a^-} + L_{a^+}} \left( L_{a^-} \cdot RL^{a^-}_{opt} + L_{a^+} \cdot RL^{a^+}_{opt} \right)$
14: $RL_v \leftarrow \frac{1}{L_{v^-} + L_{v^+}} \left( L_{v^-} \cdot RL^{v^-}_{opt} + L_{v^+} \cdot RL^{v^+}_{opt} \right)$
15: gold standard synchronization by $RL_a$ and $RL_v$
16: concatenate selected features for each dimension
17: feature normalization by Z-score
18: linear regression by Partial Least Squares (PLS)
2.2 Single Speaker Regression Model (SSRM)
Fig. 1 shows a schematic description of the whole method. Coloured blocks identify each SSRM, receiving as input the speech of a speaker as well as the corresponding annotations in terms of arousal and valence. The cooperative regression model (CRM) used for the prediction of an emotional dimension (e.g., arousal or valence) from an unlabelled speech sequence involves averaging the responses of each SSRM exhibiting a common consensus, as illustrated by the stylised men with a raised hand. The steps needed for the construction of the SSRM and the CRM are listed in Algorithms 1 and 2, respectively, and are detailed in the following sections.
2.2.1 Acoustic features extraction
According to previous work [26], we consider the 65 acoustic low-level descriptors (LLDs) and their first order derivatives (producing 130 LLDs in total) that were used for the INTERSPEECH Computational Paralinguistics ChallengE since its 2013 edition [41]. The COMPARE feature set has been computed with the open-source extractor OPENSMILE (release 2.0) [42]. This feature set includes a group of 4 energy related LLDs, 55 spectral related LLDs, and 6 voicing related LLDs, cf. Table 1 and step 1 in Algorithm 1. For more details on the COMPARE feature set, the reader is referred to [43]. In what follows, we denote with $N_t$ the temporal length of each speech sequence, with $N_f$ the total number of acoustic features, with $N_e$ the number of evaluators for each speech sequence, and with $N_{sp}$ the number of speakers for which data and annotations are available as training material.
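As a practical illustration of this extraction step, the snippet below computes ComParE-style LLDs with openSMILE. It is a minimal sketch assuming the audEERING opensmile Python wrapper and its ComParE_2016 configuration (the paper used the standalone openSMILE 2.0 extractor with the 2013 ComParE configuration); the audio path is a placeholder.

```python
# Minimal sketch: extract ComParE low-level descriptors (LLDs) with openSMILE.
# Assumes the audEERING "opensmile" Python package (pip install opensmile); this
# approximates, but does not reproduce, the exact 2013 configuration of the paper.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,             # closest available ComParE set
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,  # frame-level LLDs, not functionals
)

# Returns a pandas DataFrame indexed by time, one column per LLD (N_f columns).
llds = smile.process_file("speaker_01.wav")                    # placeholder path
print(llds.shape)
```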
2.2.2 Gold standard estimation
Learning the acoustic model of an emotional dimension requires the computation of a gold standard from the annotated data of each speaker, cf. step 2 in Algorithm 1. This is often achieved by averaging the traces provided by each rater. The EWE [40] procedure can be used to center the ratings to a value that maximises the inter-rater agreement [26]. Assuming that individual mean centering of each annotation may alter the original rating by resetting the natural bias of each annotator, i.e., the subjective perception of each rater, here we propose a new weighted averaging strategy that maintains the original dynamic of the annotations, similarly to the one used in [26].
Formally, indicating with $d$ each dimension, i.e., $d = \{a, v\}$, and starting from the evaluation provided by each rater $e_i$, $y_d^{e_i}(t)$, $i = 1, \ldots, N_e$, the six evaluations are shifted by the same quantity $\bar{y}_d$, obtained by applying Eqs. (1)-(3):

$$\bar{\rho}_d(i) = \frac{1}{N_e - 1} \sum_{\substack{j=1 \\ j \neq i,\ \rho_d(i,j) > 0}}^{N_e} \tilde{\rho}_d(i,j) \qquad (1)$$

$$\bar{y}_d = \frac{1}{\sum_{i=1}^{N_e} \bar{\rho}_d(i)} \sum_{i=1}^{N_e} \left( \frac{1}{T} \sum_{t} y_d^{e_i}(t) \right) \bar{\rho}_d(i) \qquad (2)$$

$$y_d(t) = \frac{1}{N_e} \sum_{i=1}^{N_e} \left( y_d^{e_i}(t) - \bar{y}_d \right) \qquad (3)$$

with $\bar{\rho}_d(i)$ the mean pair-wise Pearson correlation coefficient of the annotation provided by the evaluator $e_i$ with the remaining $N_e - 1$, and $\tilde{\rho}_d(i,j) = \max\left(0, \rho_d(i,j)\right)$ the positive Pearson correlation coefficient of the ratings provided by the evaluators $e_i$ and $e_j$.
Such a procedure thus gives priority to the raters that agree more with the pool when averaging their respective annotations. If all raters perfectly agree with each other, then all pair-wise correlation coefficients are equal to one and our procedure corresponds to a simple average of the annotations after mean centering.

TABLE 1
COMPARE acoustic feature set: 65 low-level descriptors (LLD).

4 energy related LLD                                Group
Sum of auditory spectrum (loudness)                 prosodic
Sum of RASTA-filtered auditory spectrum             prosodic
RMS Energy, Zero-Crossing Rate                      prosodic

55 spectral LLD                                     Group
RASTA-filt. aud. spect. bds. 1–26 (0–8 kHz)         spectral
MFCC 1–14                                           cepstral
Spectral energy 250–650 Hz, 1 k–4 kHz               spectral
Spectral Roll-Off Pt. 0.25, 0.5, 0.75, 0.9          spectral
Spectral Flux, Centroid, Entropy, Slope             spectral
Psychoacoustic Sharpness, Harmonicity               spectral
Spectral Variance, Skewness, Kurtosis               spectral

6 voicing related LLD                               Group
F0 (SHS & Viterbi smoothing)                        prosodic
Prob. of voicing                                    voice qual.
log. HNR, Jitter (local & δ), Shimmer (local)       voice qual.
Note that we do not consider in the computation
of the gold standard the annotations that exhibit
negative correlation coefficients to avoid unwanted
compensation effects in the normalisation procedure.
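The weighted averaging of Eqs. (1)-(3) can be summarised in a few lines of NumPy. The sketch below uses illustrative names and synthetic rater traces; it is not the authors' implementation.

```python
# Sketch of the gold-standard estimation of Eqs. (1)-(3): annotations are
# shifted by a common offset computed from positive pair-wise correlations,
# then averaged.
import numpy as np

def gold_standard(ratings: np.ndarray) -> np.ndarray:
    """ratings: array of shape (N_e, T), one time-continuous trace per rater."""
    n_e, _ = ratings.shape
    rho = np.corrcoef(ratings)                      # pair-wise Pearson correlations
    rho_pos = np.maximum(rho, 0.0)                  # tilde-rho: negative values ignored
    np.fill_diagonal(rho_pos, 0.0)
    rho_bar = rho_pos.sum(axis=1) / (n_e - 1)       # Eq. (1)
    y_bar = (ratings.mean(axis=1) * rho_bar).sum() / rho_bar.sum()   # Eq. (2)
    return (ratings - y_bar).mean(axis=0)           # Eq. (3)

# Example with six synthetic raters (7501 frames, e.g. 5 min at 25 Hz):
rng = np.random.default_rng(0)
traces = rng.standard_normal((6, 7501)) * 0.1 + np.sin(np.linspace(0, 10, 7501))
gs = gold_standard(traces)
```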
2.2.3 Quadrant-based temporal division (QBTD)
According to Russell's two-dimensional representation of emotions [15], each quadrant of the diagram conveys specific characteristics of emotion. Further, not all emotions are conveyed by a unique acoustic feature set [44], and such associations can also vary according to the age and the gender of the speaker, among many other paralinguistic traits and states [45].
We therefore propose to consider those peculiarities to select relevant acoustic feature subsets and estimate the RL of the raters. For the purpose of optimizing the feature selection as well as the reaction lag estimation procedures, we decide to segment the gold standards $y_d(t)$ and the corresponding acoustic features $x_k(t)$, $k = 1, \ldots, N_f$, into segments of positive and negative arousal or valence. Denoting with $q = \{a^+, a^-, v^+, v^-\}$ each possible quadrant of the 2D arousal-valence space, cf. step 3 in Algorithm 1, the corresponding segments of the gold standard are indicated by $y_{a^+}(t)$, $y_{a^-}(t)$ and $y_{v^+}(t)$, $y_{v^-}(t)$, and the corresponding segments of acoustic features by $x^k_{a^+}(t)$, $x^k_{a^-}(t)$ and $x^k_{v^+}(t)$, $x^k_{v^-}(t)$, where $y_{a^+}(t) = \{y_a \mid y_a \geq 0\}$, $y_{a^-}(t) = \{y_a \mid y_a < 0\}$ and $y_{v^+}(t) = \{y_v \mid y_v \geq 0\}$, $y_{v^-}(t) = \{y_v \mid y_v < 0\}$. With reference to the Russell representation, we call this segmentation the quadrant-based temporal division (QBTD). Segmentation is performed by simply concatenating all the segments of a single quadrant. Such a procedure adds the benefit of avoiding that feature selection is mostly guided by the most populated quadrant.
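A minimal sketch of the QBTD segmentation described above, assuming the gold standard and the LLD matrix are sampled on the same time axis; function and variable names are illustrative.

```python
# Sketch of the quadrant-based temporal division (QBTD): the gold standard and
# the feature matrix are split by the sign of the gold standard, and the frames
# of each side are concatenated.
import numpy as np

def qbtd(gold: np.ndarray, feats: np.ndarray):
    """gold: (T,) gold standard for one dimension; feats: (T, N_f) acoustic features."""
    pos = gold >= 0.0                       # e.g. a+ (or v+) frames
    neg = ~pos                              # e.g. a- (or v-) frames
    return (gold[pos], feats[pos]), (gold[neg], feats[neg])

# (gs_pos, X_pos), (gs_neg, X_neg) = qbtd(gold_arousal, lld_matrix)
```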

2.2.4 Reaction lag estimation and feature selection
It is known that evaluators need some time to evaluate the cues observable in an audiovisual sequence and then report the corresponding emotion. This is especially observable in time-continuous ratings used for dimensional models of emotion, where a delay occurs between the observable cues and the reported emotional value. According to the evaluations performed in [24], we assume here an RL distinct for each speaker and emotional dimension, with a negligible variation among the six ratings of the same speaker, compensating this effect with the correlation-based estimation of the gold standard. However, we relate the estimation of the optimal RL to a feature selection procedure that is performed independently on each quadrant of the 2D arousal-valence emotional space, to consider the peculiarities of the acoustic features according to the emotions.

The importance of this kind of analysis has been demonstrated by the results obtained in preliminary comparative simulations performed without the RL-based synchronization of features and gold standard. In this regard, in Section 3.5 we will discuss the results of the related experiments run to reinforce our assumption.
All gold standard segments $y_{a^+}(t)$, $y_{a^-}(t)$, $y_{v^+}(t)$ and $y_{v^-}(t)$ extracted by the QBTD decomposition are thus used separately for each quadrant to perform synchronisation with the corresponding acoustic features. For each quadrant $q$ and a variable RL value in the range [0, 8] s with a step of 0.04 s, the corresponding gold standard segment is shifted back in time with a lag equal to RL and the correlation-based feature selection (CFS) measure is computed [46], [47] (steps 7-8 in Algorithm 1). The optimal reaction lag $RL^q_{opt}$ is then defined as the RL that maximises the CFS measure (step 10 in Algorithm 1). Given the two optimal values $RL^q_{opt}$ for a given dimension (i.e., arousal or valence), the final reaction lag is estimated by weighting the two values obtained on each side of the considered dimension with the length of the corresponding segments (step 13 for arousal and step 14 for valence in Algorithm 1). Compensation of the annotation delay is finally obtained by shifting back in time the gold standard with the corresponding RL (step 15 in Algorithm 1).
Results show that an average RL of 3.89 s is obtained for arousal (σ = 1.16 s) and 4.52 s (σ = 2.15 s) for valence, in total agreement with the experimental results reported in the literature [24], [26]. Arousal is indeed a less subjective emotion than valence and thus requires less time to be evaluated. Concerning the results of feature selection, we list in Appendix A the most frequently selected features in each quadrant along with the related description. Note that the list of features that are selected in each quadrant is saved (step 11 in Algorithm 1) and concatenated (step 16 in Algorithm 1) for each affective dimension in order to be used for the prediction of an unknown speaker's emotion.
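The per-quadrant search for the optimal reaction lag can be sketched as follows. The merit function below (mean absolute feature/gold-standard correlation) is only a stand-in for the CFS measure of [46], and the 0.04 s step is assumed to match the frame rate of the features; all names are illustrative.

```python
# Sketch of the per-quadrant reaction-lag (RL) search: the gold standard segment
# is shifted back in time over a 0-8 s grid (0.04 s step) and the lag maximising
# a CFS-like relevance score is kept; the two per-quadrant lags of a dimension
# are then combined by segment-length weighting (steps 13-14 of Algorithm 1).
import numpy as np

FRAME = 0.04  # s, assumed frame period of features and gold standard

def cfs_like_score(gold: np.ndarray, feats: np.ndarray) -> float:
    # Placeholder merit: average |Pearson correlation| between each feature and the gold standard.
    c = [abs(np.corrcoef(feats[:, k], gold)[0, 1]) for k in range(feats.shape[1])]
    return float(np.nanmean(c))

def optimal_rl(gold: np.ndarray, feats: np.ndarray, max_lag_s: float = 8.0) -> float:
    lags = np.arange(0.0, max_lag_s + FRAME, FRAME)
    scores = []
    for lag in lags:
        n = int(round(lag / FRAME))
        if n >= len(gold) - 1:
            break
        # Shift the gold standard back in time by n frames relative to the features.
        g, x = gold[n:], feats[: len(gold) - n]
        scores.append(cfs_like_score(g, x))
    return float(lags[int(np.argmax(scores))])

def weighted_rl(rl_pos: float, len_pos: int, rl_neg: float, len_neg: int) -> float:
    # Length-weighted combination of the two quadrant lags of one dimension.
    return (len_pos * rl_pos + len_neg * rl_neg) / (len_pos + len_neg)
```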
2.2.5 Feature normalization and linear regression
The features selected using the QBTD procedure are normalized by a Z-score (step 17 in Algorithm 1), i.e., the mean is removed from the features and the values are further divided by the standard deviation, and the normalization parameters $\mu_{\tilde{x}^k_q}$ and $\sigma_{\tilde{x}^k_q}$ (mean and standard deviation) are stored in the SSRM's parameters for being used later in the cooperative regression. Concerning the regression part of the SSRM, we trained Partial Least Squares regression (PLS) on the selected features (step 18 in Algorithm 1). The SIMPLS algorithm is used for this purpose [48]. The optimal numbers of latent variables $LV_a$ and $LV_v$ (for arousal and valence, respectively) are extracted through contiguous block splitting cross-validation (10 splits) performed on the entire speech of the speaker.
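A compact sketch of this regression stage is given below using scikit-learn. Note that PLSRegression implements NIPALS rather than the SIMPLS algorithm used in the paper, and the number of latent variables is shown as a fixed parameter rather than the result of the contiguous-block cross-validation, so this is an approximation of the setup.

```python
# Sketch of the SSRM regression stage: z-score normalisation followed by PLS
# regression; the z-score parameters are stored inside the fitted pipeline.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import make_pipeline

def train_ssrm(X: np.ndarray, gold: np.ndarray, n_latent: int = 5):
    """X: (T, N_sel) selected features; gold: (T,) synchronised gold standard."""
    model = make_pipeline(StandardScaler(), PLSRegression(n_components=n_latent))
    model.fit(X, gold)
    return model

# Prediction on an unlabelled speaker's features:
# y_hat = model.predict(X_new).ravel()
```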
2.3 Cooperative Regression Model (CRM)
The principle of the cooperative regression model (CRM) is illustrated in Fig. 1. The CRM receives as inputs the predictions provided by each SSRM. Only the predictions that exhibit a common consensus (indicated by the men with a raised hand) are averaged and a final prediction is produced. The cooperation principle is based on a two-fold strategy. First, each SSRM is applied on the speech of a new speaker $sp_x$, which produces an individual response. Then, only the most concordant responses among the $N_{sp}$ available ones are retained and merged to produce the final prediction. In order to select the most concordant predictions, we used the mutual concordance correlation coefficient (CCC), $\rho_c$ [49]. It is a measure of agreement between two time-continuous predictions that non-linearly combines in a unique parameter the Pearson correlation coefficient (CC), $\rho$, and the mean square error. The parameter CCC computed on two time-series $y_1(t)$ and $y_2(t)$ on a given observation time-interval $T$ is defined as follows:

$$\rho_c(y_1, y_2) = \frac{2\,\rho(y_1, y_2)\,\sigma_{y_1}\sigma_{y_2}}{\sigma_{y_1}^2 + \sigma_{y_2}^2 + \left(\mu_{y_1} - \mu_{y_2}\right)^2} \qquad (4)$$
where the CC ($\rho$), the mean ($\mu$), and the standard deviation ($\sigma$) are meant to be computed under the assumption of stationarity of the two time-series $y_1(t)$ and $y_2(t)$ on the observation time-interval $T$. The underlying idea of using the CCC is to measure the consensus of the predictions provided by the speakers in the cooperation observed on a given time period $T$. The steps used in the CRM are listed in Algorithm 2 and detailed below.
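Since Algorithm 2 is not reproduced here, the sketch below only illustrates the consensus idea: pair-wise CCC values (Eq. (4)) are computed between the SSRM predictions over an observation window, and the most mutually concordant predictions are averaged. The threshold-based selection rule and all names are assumptions, not the paper's exact procedure.

```python
# Sketch of the consensus step of the CRM: keep the SSRM predictions whose mean
# CCC with the other predictions exceeds a threshold, then average them.
import numpy as np

def ccc(y1: np.ndarray, y2: np.ndarray) -> float:
    # Concordance correlation coefficient of Eq. (4).
    rho = np.corrcoef(y1, y2)[0, 1]
    s1, s2 = y1.std(), y2.std()
    m1, m2 = y1.mean(), y2.mean()
    return 2.0 * rho * s1 * s2 / (s1**2 + s2**2 + (m1 - m2) ** 2)

def crm_predict(preds: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """preds: (N_sp, T) predictions of the SSRMs on the same observation window."""
    n = preds.shape[0]
    mutual = np.array([
        np.mean([ccc(preds[i], preds[j]) for j in range(n) if j != i])
        for i in range(n)
    ])
    keep = mutual >= threshold
    if not keep.any():                      # fall back to all models if none agree
        keep[:] = True
    return preds[keep].mean(axis=0)         # final CRM prediction for the window
```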
