Continuous Estimation of Emotions in Speech
by Dynamic Cooperative Speaker Models
Arianna Mencattini, Eugenio Martinelli*, Fabien Ringeval, Björn Schuller, Corrado Di Natale

A. Mencattini, E. Martinelli, and C. Di Natale are with the Department of Electronic Engineering, University of Rome Tor Vergata, Rome, Italy. E-mail: mencattini,martinelli@ing.uniroma2.it, dinatale@uniroma2.it
F. Ringeval is with the Chair of Complex & Intelligent Systems, University of Passau, Passau, Germany. E-mail: fabien.ringeval@uni-passau.de
B. Schuller is with the Chair of Complex & Intelligent Systems, University of Passau, Passau, Germany, and with the Department of Computing, Imperial College London, London, UK. E-mail: bjoern.schuller@imperial.ac.uk
Abstract—Automatic emotion recognition from speech has recently focused on the prediction of time-continuous dimensions (e.g., arousal and valence) of spontaneous and realistic expressions of emotion, as found in real-life interactions. However, the automatic prediction of such emotions poses several challenges, such as the subjectivity found in the definition of a gold standard from a pool of raters and the issue of data scarcity in training models. In this work, we introduce a novel emotion recognition system based on ensembles of single-speaker-regression-models (SSRMs). The estimation of emotion is provided by combining a subset of the initial pool of SSRMs, selecting those that are most concordant among them. The proposed approach allows the addition or removal of speakers from the ensemble without the necessity to re-build the entire machine learning system. The simplicity of this aggregation strategy, coupled with the flexibility assured by the modular architecture, and the promising results obtained on the RECOLA database highlight the potential implications of the proposed method in a real-life scenario and in particular in web-based applications.
Index Terms—Speech emotion recognition, cooperative regression
model, naturalistic emotional display
1 INTRODUCTION
Speech is one of, if not the, most natural ways for humans to communicate. In everyday social interactions, humans express various complex feelings such as emotion and empathy. Despite the fact that the cognitive processes used to encode affective information during social interactions are relatively complex, humans can easily manage to decode such information in real time from multimodal cues. Conversely, the effort required of computer-based systems for a reliable and autonomous understanding of emotion is still challenging, even for the unimodal analysis of speech. Nonetheless, the development of such affective computing systems is promising for many distinct fields of research. Health care systems may offer a personalized treatment according to the measured emotional content, along with an auxiliary diagnostic tool of the psychological or developmental state of the patient, such as depression [1], [2] or autism spectrum conditions [3]. Remote care assistance can benefit from the estimation of the affective state (e.g., stress or fear) in the voice of elderly people [4]. Moreover, applications such as speech-based advertising [5], remote teaching (e-learning) [6], job interviews [7], and surveillance systems [8] may be incredibly enriched by customer-affect oriented services and monitoring, among many others.
Beyond the proven interest in the relatively new discipline of affective computing, numerous issues have until now limited the full development and use of speech emotion recognition (SER) systems in real-life applications [9]. Whereas the automatic recognition of acted emotion can provide useful insights into the process of encoding affective behaviors into speech and lead to very high recognition rates [10], [11], [12], it is widely acknowledged that such data cannot be a good representative of the emotions produced in real-life interactions [13]. Spontaneous emotions are indeed much more subtle and almost never appear as a “full-blown” expression [14]. As a result, the automatic recognition of spontaneous emotions is much more challenging in comparison to the automatic recognition of acted emotions. In such a scenario, we aimed at developing a system able to continuously and automatically predict the perceived emotional condition of a subject expressed in any kind of naturalistic environment.
1.1 Related work
Recently, databases of emotion collected during natural interactions with time-continuous ratings (e.g., arousal and valence [15]) have emerged, such as the Sensitive Artificial Listener (SAL) set in the HUMAINE database [16], the SEMAINE database [17] and the RECOLA database [18]. Such databases have caused a shift in methods, first of all moving from classification to regression to be able to model continuous affective dimensions [19], and next moving from utterance or segment level labels [20] to quasi time-continuous labels [17], [21]. Automatic recognition of naturalistic emotion from time-continuous labels presents, however, several challenges that are not yet solved [9], such as the definition of a reliable gold standard from a pool of raters and the issue of data scarcity in training models.
In the light of the appraisal theory from the domain of emotion psychology [22], each annotator may have a subjective perception of the affective state expressed by an individual, motivated by his/her own past and present experience, memories, reasoning, etc. Additionally, humans have natural biases and inconsistencies in their judgement [23], which create additional noise in the ratings. Further, the variability in emotion perception can also be observed in the time domain, since the evaluators may have a different reaction lag (RL) during the procedure of time-continuous annotation [24]. However, the natural diversity found in emotion perception is usually merged when a machine learning model is trained, by averaging several evaluations from a pool of raters into a single gold standard. Whereas the use of all annotation data can help preserve diversity in emotion perception, e.g., by using multi-task learning of each annotator [25], [26], it has the main disadvantage of increasing the overall complexity of the model according to the number of available raters. The issue of synchronisation of various individual ratings for defining a gold standard has also been investigated with signal processing techniques. Models of RL have been estimated from the data, by maximising the correlation coefficient ([27], [28], [29]) or the mutual information ([24]) between audiovisual features and emotional ratings while shifting the latter back in time.
Regarding the issue of data scarcity, the main question to be solved is how to deal with the huge diversity found in a collection of spontaneous displays of emotion. The common approach in the literature is to use all the emotion variability found in the data as training material and tune the machine learning system in order to disregard the less relevant instances (e.g., by optimising the number of support vectors and the soft margin in Support Vector Regression (SVR)) for emotion prediction [19], [30], [31], [32]. Some recent works have proposed to use cooperative learning as a means to select the most informative instances from a set of unlabelled acoustic utterances [33]. But the core underlying idea of this approach is to reduce the cost of the human annotation task, e.g., by selecting instances which are predicted with a low confidence level, not to consider consensus as a way to optimise the predictability of a given SER system. Attempts have already been made in developing cooperative strategies in supervised classification with ensemble models [34], or by considering multi-scaled sliding windows for binary classification [35]. Cooperative strategies have also been used to perform fusion of multimodal stimuli, by using either early (i.e., features) or late (i.e., decisions) fusion techniques [26], [36], [37], [38].

Taking inspiration from the cooperative strategy proposed in [39], here we introduce a system able to autonomously and temporarily change the composition of a restricted group of predictors provided by single-speaker-regression-models (SSRMs) in a cooperation task governed by a concordance paradigm.
1.2 Main contributions
Motivations of our work lie in the intention to produce a system that can predict the perceived level of emotion of a subject from speech analysis through the fusion of multiple independently trained systems. In this regard, we propose a three-topics formulation of the problem of SER from time-continuous labels: emotion subjectivity, models concordance, and dynamic settings.
As mentioned earlier, the use of annotated data of emotion has the immediate consequence of forcing the discrepancy between the emotion produced by the subject and that perceived by the evaluators [22]. Even though the latter may not match the actual affective state of the subject, the evaluators provide the only available judgement about the emotion, transferring the natural subjectivity of the speaker into the subjectivity of a group of listeners. Hence, in this paper, we propose a modular strategy based on cooperative models to perform emotion prediction from speech data. A consensus-based merging strategy is crucial for the cooperation of concordant responses, either of the evaluators (e.g., the Evaluator Weighted Estimator (EWE) [40]) or of the models developed for each speaker. The main goal here is not to consider emotion prediction as a fixed evaluation procedure, but rather as a dynamic cooperative task.
The first stage of our SER system consists of developing an SSRM for each speaker. Then, a second stage follows that consists of applying a cooperative strategy to merge the responses provided by the different SSRMs, while dynamically selecting the window of observation in which the concordance of the responses is estimated. The possibility to develop single-speaker models merged through a cooperative strategy makes the proposed method easily applicable to real-time applications. Mobile devices and web-based applications require that the regression model can be continuously updated with new data, while avoiding the exponential increase of the learning time or the re-training of the whole model after the addition of new speakers to the system. The cooperative approach proposed in this paper offers an elegant solution to this constraint, because it is able to embed new speakers' models, independently trained on separate speech sequences, in a dynamic cooperative generation rule. Further, the dynamic adaptation of the SSRM along the observation window allows the system to automatically select the most concordant models and thus maximise the overall performance.
In line with the three-topics formulation reported above, and in accordance with the paper organization, the main contributions of this paper can be listed as follows: (i) we propose to use a quadrant-based temporal division to estimate the RL of emotion annotation and perform feature selection (topic emotion subjectivity), (ii) we define a dynamic consensus-based cooperative strategy to predict emotion from several SSRMs (topics dynamic settings and models concordance), and (iii) we perform extensive evaluations on a fully naturalistic database of emotion (RECOLA) to compare the performance of our system with methods from the state of the art.

The remainder of this article is structured as follows: first, Section 2 gives a detailed description of the proposed consensus-based SER system and introduces the database used for the experiments; next, Section 3 reports results. Final remarks and directions for future research are given in Section 4.
2 DATA AND METHODS
2.1 Database
A new multimodal corpus of spontaneous interactions in French called RECOLA, for REmote COLlaborative and Affective interactions, was recently introduced by Ringeval et al. [18]. Spontaneous interactions were collected during the resolution of a collaborative task (“Winter survival task”) that was performed in dyads (i.e., two speakers interacting at a time) and remotely by video conference. The RECOLA database includes 9.5 h of multimodal recordings, i.e., audio, video, electro-cardiogram (ECG) and electro-dermal activity (EDA), that were continuously and synchronously recorded from 46 participants. Ratings of emotion were performed time- and value-continuously by six French-speaking assistants (three male, three female) via the ANNEMO web-based annotation tool [18], for the first five minutes of all recorded sequences. The dataset for which participants gave their consent to share their data is reduced to a set of 34 participants, for an overall duration of seven hours, from which the annotations of 23 participants (10 male, 13 female; age: µ = 21.3 years and σ = 4.1 years) were made publicly available (http://diuf.unifr.ch/diva/recola/). Although all participants were French speakers, they had different mother tongues: 17 subjects were French, three German, and three Italian. Note that the nonconsecutive numeric speaker labels displayed in this paper (e.g., P16, P17, P21, and so on) originate from the RECOLA dataset.
[Figure 1: block diagram in which the speech and annotations of the 1st, 2nd, ..., nth speaker each feed a Single Speaker Regression Model (SSRM); the Cooperative Regression Model (CRM) merges the SSRM responses that reach a common consensus to produce the prediction for an unlabelled speaker.]
Fig. 1. Schematic description of the consensus-based speech emotion recognition system.
Algorithm 1 Construction of each Single Speaker Regression Model (SSRM)
1: acoustic feature extraction
2: gold standard estimation
3: Quadrant-Based Temporal Division (QBTD)
4: for all $q \in \{a^-, a^+, v^-, v^+\}$ do
5:     $L_q \leftarrow$ length of each segment
6:     for RL = 0 to 8 s, step 0.04 s do
7:         shift gold standard by RL
8:         compute CFS(RL) for feature selection
9:     end for
10:    $RL^q_{opt} \leftarrow \arg\max_{RL} CFS(RL)$
11:    save selected features according to $RL^q_{opt}$
12: end for
13: $RL_a \leftarrow \frac{1}{L_{a^-} + L_{a^+}} \left( L_{a^-} \cdot RL^{a^-}_{opt} + L_{a^+} \cdot RL^{a^+}_{opt} \right)$
14: $RL_v \leftarrow \frac{1}{L_{v^-} + L_{v^+}} \left( L_{v^-} \cdot RL^{v^-}_{opt} + L_{v^+} \cdot RL^{v^+}_{opt} \right)$
15: gold standard synchronization by $RL_a$ and $RL_v$
16: concatenate selected features for each dimension
17: feature normalization by Z-score
18: linear regression by Partial Least Squares (PLS)
2.2 Single Speaker Regression Model (SSRM)
Fig. 1 shows a schematic description of the whole method. Coloured blocks identify each SSRM, receiving as input the speech of a speaker as well as the corresponding annotations in terms of arousal and valence. The cooperative regression model (CRM) used for the prediction of an emotional dimension (e.g., arousal or valence) from an unlabelled speech sequence involves averaging the responses of each SSRM exhibiting a common consensus, as illustrated by the stylised men with a raised hand. The steps needed for the construction of the SSRM and the CRM are listed in Algorithms 1 and 2, respectively, and are detailed in the following sections.
2.2.1 Acoustic features extraction
According to previous work [26], we consider the 65 acoustic low-level descriptors (LLDs) and their first order derivatives (producing 130 LLDs in total) that were used for the INTERSPEECH Computational Paralinguistics ChallengE since its 2013 edition [41]. The COMPARE feature set has been computed with the open-source extractor OPENSMILE (release 2.0) [42]. This feature set includes a group of 4 energy related LLDs, 55 spectral related LLDs, and 6 voicing related LLDs, cf. Table 1 and step 1 in Algorithm 1. For more details on the COMPARE feature set, the reader is referred to [43]. In what follows, we denote with $N_t$ the temporal length of each speech sequence, with $N_f$ the total number of acoustic features, with $N_e$ the number of evaluators for each speech sequence, and with $N_{sp}$ the number of speakers for which data and annotations are available as training material.
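As a practical illustration of this extraction step, the snippet below computes ComParE-style LLDs with openSMILE. It is a minimal sketch assuming the audEERING opensmile Python wrapper and its ComParE_2016 configuration (the paper used the standalone openSMILE 2.0 extractor with the 2013 ComParE configuration); the audio path is a placeholder.

```python
# Minimal sketch: extract ComParE low-level descriptors (LLDs) with openSMILE.
# Assumes the audEERING "opensmile" Python package (pip install opensmile); this
# approximates, but does not reproduce, the exact 2013 configuration of the paper.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,             # closest available ComParE set
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,  # frame-level LLDs, not functionals
)

# Returns a pandas DataFrame indexed by time, one column per LLD (N_f columns).
llds = smile.process_file("speaker_01.wav")                    # placeholder path
print(llds.shape)
```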
2.2.2 Gold standard estimation
Learning the acoustic model of an emotional dimension requires the computation of a gold standard from the annotated data of each speaker, cf. step 2 in Algorithm 1. This is often achieved by averaging the traces provided by each rater. The EWE [40] procedure can be used to center the ratings to a value that maximises the inter-rater agreement [26]. Assuming that individual mean centering of each annotation may alter the original rating by resetting the natural bias of each annotator, i.e., the subjective perception of each rater, here we propose a new weighted averaging strategy that maintains the original dynamic of the annotations, similarly to the one used in [26].
Formally, indicating with $d$ each dimension, i.e., $d = \{a, v\}$, and starting from the evaluation provided by each rater $e_i$, $y_d^{e_i}(t)$, $i = 1, \ldots, N_e$, the six evaluations are shifted by the same quantity $\bar{y}_d$, obtained by applying Eqs. (1)-(3):

$$\bar{\rho}_d(i) = \frac{1}{N_e - 1} \sum_{\substack{j=1 \\ j \neq i,\ \rho_d(i,j) > 0}}^{N_e} \tilde{\rho}_d(i,j) \qquad (1)$$

$$\bar{y}_d = \frac{1}{\sum_{i=1}^{N_e} \bar{\rho}_d(i)} \sum_{i=1}^{N_e} \left( \frac{1}{T} \sum_{t} y_d^{e_i}(t) \right) \bar{\rho}_d(i) \qquad (2)$$

$$y_d(t) = \frac{1}{N_e} \sum_{i=1}^{N_e} \left( y_d^{e_i}(t) - \bar{y}_d \right) \qquad (3)$$

with $\bar{\rho}_d(i)$ the mean pair-wise Pearson correlation coefficient of the annotation provided by the evaluator $e_i$ with the remaining $N_e - 1$, and $\tilde{\rho}_d(i,j) = \max\left(0, \rho_d(i,j)\right)$ the positive Pearson correlation coefficient of the ratings provided by the evaluators $e_i$ and $e_j$.
Such a procedure thus gives priority to the raters that agree more with the pool when averaging their respective annotations. If all raters perfectly agree with each other, then all pair-wise correlation coefficients are equal to one and our procedure corresponds to a simple average of the annotations after mean centering.

TABLE 1
COMPARE acoustic feature set: 65 low-level descriptors (LLD).

4 energy related LLD                                Group
Sum of auditory spectrum (loudness)                 prosodic
Sum of RASTA-filtered auditory spectrum             prosodic
RMS Energy, Zero-Crossing Rate                      prosodic

55 spectral LLD                                     Group
RASTA-filt. aud. spect. bds. 1–26 (0–8 kHz)         spectral
MFCC 1–14                                           cepstral
Spectral energy 250–650 Hz, 1 k–4 kHz               spectral
Spectral Roll-Off Pt. 0.25, 0.5, 0.75, 0.9          spectral
Spectral Flux, Centroid, Entropy, Slope             spectral
Psychoacoustic Sharpness, Harmonicity               spectral
Spectral Variance, Skewness, Kurtosis               spectral

6 voicing related LLD                               Group
F0 (SHS & Viterbi smoothing)                        prosodic
Prob. of voicing                                    voice qual.
log. HNR, Jitter (local & δ), Shimmer (local)       voice qual.
Note that we do not consider in the computation
of the gold standard the annotations that exhibit
negative correlation coefficients to avoid unwanted
compensation effects in the normalisation procedure.
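The weighted averaging of Eqs. (1)-(3) can be summarised in a few lines of NumPy. The sketch below uses illustrative names and synthetic rater traces; it is not the authors' implementation.

```python
# Sketch of the gold-standard estimation of Eqs. (1)-(3): annotations are
# shifted by a common offset computed from positive pair-wise correlations,
# then averaged.
import numpy as np

def gold_standard(ratings: np.ndarray) -> np.ndarray:
    """ratings: array of shape (N_e, T), one time-continuous trace per rater."""
    n_e, _ = ratings.shape
    rho = np.corrcoef(ratings)                      # pair-wise Pearson correlations
    rho_pos = np.maximum(rho, 0.0)                  # tilde-rho: negative values ignored
    np.fill_diagonal(rho_pos, 0.0)
    rho_bar = rho_pos.sum(axis=1) / (n_e - 1)       # Eq. (1)
    y_bar = (ratings.mean(axis=1) * rho_bar).sum() / rho_bar.sum()   # Eq. (2)
    return (ratings - y_bar).mean(axis=0)           # Eq. (3)

# Example with six synthetic raters (7501 frames, e.g. 5 min at 25 Hz):
rng = np.random.default_rng(0)
traces = rng.standard_normal((6, 7501)) * 0.1 + np.sin(np.linspace(0, 10, 7501))
gs = gold_standard(traces)
```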
2.2.3 Quadrant-based temporal division (QBTD)
According to Russell's two-dimensional representation of emotions [15], each quadrant of the diagram conveys specific characteristics of emotion. Further, not all emotions are conveyed by a unique acoustic feature set [44], and such associations can also vary according to the age and the gender of the speaker, among many other paralinguistic traits and states [45].
We therefore propose to consider those peculiarities to select relevant acoustic feature subsets and estimate the RL of the raters. For the purpose of optimizing the feature selection as well as the reaction lag estimation procedures, we decide to segment the gold standards $y_d(t)$ and the corresponding acoustic features $x_k(t)$, $k = 1, \ldots, N_f$, into segments of positive and negative arousal or valence. Denoting with $q = \{a^+, a^-, v^+, v^-\}$ each possible quadrant of the 2D arousal-valence space, cf. step 3 in Algorithm 1, the corresponding segments of the gold standard are indicated by $y_{a^+}(t)$, $y_{a^-}(t)$ and $y_{v^+}(t)$, $y_{v^-}(t)$, and the corresponding segments of acoustic features by $x^k_{a^+}(t)$, $x^k_{a^-}(t)$ and $x^k_{v^+}(t)$, $x^k_{v^-}(t)$, where $y_{a^+}(t) = \{y_a \mid y_a \geq 0\}$, $y_{a^-}(t) = \{y_a \mid y_a < 0\}$ and $y_{v^+}(t) = \{y_v \mid y_v \geq 0\}$, $y_{v^-}(t) = \{y_v \mid y_v < 0\}$. With reference to the Russell representation, we call this segmentation the quadrant-based temporal division (QBTD). Segmentation is performed by simply concatenating all the segments of a single quadrant. Such a procedure adds the benefit of avoiding that feature selection is mostly guided by the most populated quadrant.
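A minimal sketch of the QBTD segmentation described above, assuming the gold standard and the LLD matrix are sampled on the same time axis; function and variable names are illustrative.

```python
# Sketch of the quadrant-based temporal division (QBTD): the gold standard and
# the feature matrix are split by the sign of the gold standard, and the frames
# of each side are concatenated.
import numpy as np

def qbtd(gold: np.ndarray, feats: np.ndarray):
    """gold: (T,) gold standard for one dimension; feats: (T, N_f) acoustic features."""
    pos = gold >= 0.0                       # e.g. a+ (or v+) frames
    neg = ~pos                              # e.g. a- (or v-) frames
    return (gold[pos], feats[pos]), (gold[neg], feats[neg])

# (gs_pos, X_pos), (gs_neg, X_neg) = qbtd(gold_arousal, lld_matrix)
```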

2.2.4 Reaction lag estimation and feature selection
It is known that evaluators need some time to evaluate the cues observable in an audiovisual sequence and then report the corresponding emotion. This is especially observable in time-continuous ratings used for dimensional models of emotion, where a delay occurs between the observable cues and the reported emotional value. According to the evaluations performed in [24], we assume here an RL distinct for each speaker and emotional dimension, with a negligible variation among the six ratings of the same speaker, compensating this effect with the correlation-based estimation of the gold standard. However, we relate the estimation of the optimal RL to a feature selection procedure that is performed independently on each quadrant of the 2D arousal-valence emotional space, to consider the peculiarities of the acoustic features according to the emotions.

The importance of this kind of analysis has been demonstrated by the results obtained in preliminary comparative simulations performed without the RL-based synchronization of features and gold standard. In this regard, in Section 3.5 we will discuss the results of the related experiments run to reinforce our assumption.
All gold standard segments $y_{a^+}(t)$, $y_{a^-}(t)$, $y_{v^+}(t)$ and $y_{v^-}(t)$ extracted by the QBTD decomposition are thus used separately for each quadrant to perform synchronisation with the corresponding acoustic features. For each quadrant $q$ and a variable RL value in the range [0, 8] s with a step of 0.04 s, the corresponding gold standard segment is shifted back in time with a lag equal to RL and the correlation-based feature selection (CFS) measure is computed [46], [47] (steps 7-8 in Algorithm 1). The optimal reaction lag $RL^q_{opt}$ is then defined as the RL that maximises the CFS measure (step 10 in Algorithm 1). Given the two optimal values $RL^q_{opt}$ for a given dimension (i.e., arousal or valence), the final reaction lag is estimated by weighting the two values obtained on each side of the considered dimension with the length of the corresponding segments (step 13 for arousal and step 14 for valence in Algorithm 1). Compensation of the annotation delay is finally obtained by shifting back in time the gold standard with the corresponding RL (step 15 in Algorithm 1).
Results show that an average RL of 3.89 s is obtained for arousal (σ = 1.16 s) and 4.52 s (σ = 2.15 s) for valence, in total agreement with the experimental results reported in the literature [24], [26]. Arousal is indeed a less subjective emotion than valence and thus requires less time to be evaluated. Concerning the results of feature selection, we list in Appendix A the most frequently selected features in each quadrant along with the related description. Note that the list of features that are selected in each quadrant is saved (step 11 in Algorithm 1) and concatenated (step 16 in Algorithm 1) for each affective dimension in order to be used for the prediction of an unknown speaker's emotion.
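The per-quadrant search for the optimal reaction lag can be sketched as follows. The merit function below (mean absolute feature/gold-standard correlation) is only a stand-in for the CFS measure of [46], and the 0.04 s step is assumed to match the frame rate of the features; all names are illustrative.

```python
# Sketch of the per-quadrant reaction-lag (RL) search: the gold standard segment
# is shifted back in time over a 0-8 s grid (0.04 s step) and the lag maximising
# a CFS-like relevance score is kept; the two per-quadrant lags of a dimension
# are then combined by segment-length weighting (steps 13-14 of Algorithm 1).
import numpy as np

FRAME = 0.04  # s, assumed frame period of features and gold standard

def cfs_like_score(gold: np.ndarray, feats: np.ndarray) -> float:
    # Placeholder merit: average |Pearson correlation| between each feature and the gold standard.
    c = [abs(np.corrcoef(feats[:, k], gold)[0, 1]) for k in range(feats.shape[1])]
    return float(np.nanmean(c))

def optimal_rl(gold: np.ndarray, feats: np.ndarray, max_lag_s: float = 8.0) -> float:
    lags = np.arange(0.0, max_lag_s + FRAME, FRAME)
    scores = []
    for lag in lags:
        n = int(round(lag / FRAME))
        if n >= len(gold) - 1:
            break
        # Shift the gold standard back in time by n frames relative to the features.
        g, x = gold[n:], feats[: len(gold) - n]
        scores.append(cfs_like_score(g, x))
    return float(lags[int(np.argmax(scores))])

def weighted_rl(rl_pos: float, len_pos: int, rl_neg: float, len_neg: int) -> float:
    # Length-weighted combination of the two quadrant lags of one dimension.
    return (len_pos * rl_pos + len_neg * rl_neg) / (len_pos + len_neg)
```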
2.2.5 Feature normalization and linear regression
The features selected using the QBTD procedure are normalized by a Z-score (step 17 in Algorithm 1), i.e., the mean is removed from the features and the values are further divided by the standard deviation, and the normalization parameters $\mu_{\tilde{x}^k_q}$ and $\sigma_{\tilde{x}^k_q}$ (mean and standard deviation) are stored in the SSRM's parameters for being used later in the cooperative regression. Concerning the regression part of the SSRM, we trained Partial Least Squares regression (PLS) on the selected features (step 18 in Algorithm 1). The SIMPLS algorithm is used for this purpose [48]. The optimal numbers of latent variables $LV_a$ and $LV_v$ (for arousal and valence, respectively) are extracted through contiguous block splitting cross-validation (10 splits) performed on the entire speech of the speaker.
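A compact sketch of this regression stage is given below using scikit-learn. Note that PLSRegression implements NIPALS rather than the SIMPLS algorithm used in the paper, and the number of latent variables is shown as a fixed parameter rather than the result of the contiguous-block cross-validation, so this is an approximation of the setup.

```python
# Sketch of the SSRM regression stage: z-score normalisation followed by PLS
# regression; the z-score parameters are stored inside the fitted pipeline.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import make_pipeline

def train_ssrm(X: np.ndarray, gold: np.ndarray, n_latent: int = 5):
    """X: (T, N_sel) selected features; gold: (T,) synchronised gold standard."""
    model = make_pipeline(StandardScaler(), PLSRegression(n_components=n_latent))
    model.fit(X, gold)
    return model

# Prediction on an unlabelled speaker's features:
# y_hat = model.predict(X_new).ravel()
```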
2.3 Cooperative Regression Model (CRM)
The principle of the cooperative regression model (CRM) is illustrated in Fig. 1. The CRM receives as inputs the predictions provided by each SSRM. Only the predictions that exhibit a common consensus (indicated by the men with a raised hand) are averaged and a final prediction is produced. The cooperation principle is based on a two-fold strategy. First, each SSRM is applied on the speech of a new speaker $sp_x$, which produces an individual response. Then, only the most concordant responses among the $N_{sp}$ available ones are retained and merged to produce the final prediction. In order to select the most concordant predictions, we used the mutual concordance correlation coefficient (CCC), $\rho_c$ [49]. It is a measure of agreement between two time-continuous predictions that non-linearly combines in a unique parameter the Pearson correlation coefficient (CC), $\rho$, and the mean square error. The parameter CCC computed on two time-series $y_1(t)$ and $y_2(t)$ on a given observation time-interval $T$ is defined as follows:

$$\rho_c(y_1, y_2) = \frac{2\,\rho(y_1, y_2)\,\sigma_{y_1}\sigma_{y_2}}{\sigma_{y_1}^2 + \sigma_{y_2}^2 + \left(\mu_{y_1} - \mu_{y_2}\right)^2} \qquad (4)$$
where the CC ($\rho$), the mean ($\mu$), and the standard deviation ($\sigma$) are meant to be computed under the assumption of stationarity of the two time-series $y_1(t)$ and $y_2(t)$ on the observation time-interval $T$. The underlying idea of using the CCC is to measure the consensus of the predictions provided by the speakers in the cooperation observed on a given time period $T$. The steps used in the CRM are listed in Algorithm 2 and detailed below.
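Since Algorithm 2 is not reproduced here, the sketch below only illustrates the consensus idea: pair-wise CCC values (Eq. (4)) are computed between the SSRM predictions over an observation window, and the most mutually concordant predictions are averaged. The threshold-based selection rule and all names are assumptions, not the paper's exact procedure.

```python
# Sketch of the consensus step of the CRM: keep the SSRM predictions whose mean
# CCC with the other predictions exceeds a threshold, then average them.
import numpy as np

def ccc(y1: np.ndarray, y2: np.ndarray) -> float:
    # Concordance correlation coefficient of Eq. (4).
    rho = np.corrcoef(y1, y2)[0, 1]
    s1, s2 = y1.std(), y2.std()
    m1, m2 = y1.mean(), y2.mean()
    return 2.0 * rho * s1 * s2 / (s1**2 + s2**2 + (m1 - m2) ** 2)

def crm_predict(preds: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """preds: (N_sp, T) predictions of the SSRMs on the same observation window."""
    n = preds.shape[0]
    mutual = np.array([
        np.mean([ccc(preds[i], preds[j]) for j in range(n) if j != i])
        for i in range(n)
    ])
    keep = mutual >= threshold
    if not keep.any():                      # fall back to all models if none agree
        keep[:] = True
    return preds[keep].mean(axis=0)         # final CRM prediction for the window
```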
