A multimodal data set for the analysis of human affective states is presented, together with a novel method for stimuli selection that uses retrieval by affective tags from the last.fm website, video highlight detection, and an online assessment tool.
Abstract:
We present a multimodal data set for the analysis of human affective states. The electroencephalogram (EEG) and peripheral physiological signals of 32 participants were recorded as each watched 40 one-minute long excerpts of music videos. Participants rated each video in terms of the levels of arousal, valence, like/dislike, dominance, and familiarity. For 22 of the 32 participants, frontal face video was also recorded. A novel method for stimuli selection is proposed using retrieval by affective tags from the last.fm website, video highlight detection, and an online assessment tool. An extensive analysis of the participants' ratings during the experiment is presented. Correlates between the EEG signal frequencies and the participants' ratings are investigated. Methods and results are presented for single-trial classification of arousal, valence, and like/dislike ratings using the modalities of EEG, peripheral physiological signals, and multimedia content analysis. Finally, decision fusion of the classification results from different modalities is performed. The data set is made publicly available and we encourage other researchers to use it for testing their own affective state estimation methods.
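The decision-fusion step mentioned at the end of the abstract can be illustrated with a short sketch. The following is a minimal example assuming each modality's classifier outputs class probabilities; the modality names, weights, and probability values are illustrative assumptions, not the paper's exact fusion scheme.

```python
import numpy as np

def fuse_decisions(prob_by_modality, weights=None):
    """Weighted-sum fusion of per-modality class probabilities.

    prob_by_modality: dict mapping modality name -> array of shape
    (n_classes,) holding that classifier's posterior estimates.
    weights: optional dict of per-modality weights (defaults to uniform).
    Returns the index of the winning class.
    """
    names = sorted(prob_by_modality)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    fused = sum(weights[name] * np.asarray(prob_by_modality[name])
                for name in names)
    return int(np.argmax(fused))

# Hypothetical posteriors for a single trial (low vs. high arousal).
probs = {"eeg": [0.4, 0.6], "peripheral": [0.55, 0.45], "mca": [0.3, 0.7]}
print(fuse_decisions(probs))  # -> 1 (high arousal)
```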
TL;DR: An emotion detection method for music videos using central and peripheral nervous system physiological signals as well as multimedia content analysis is presented; the performance of the personalized emotion detection is shown to be significantly superior to that of a random regressor.
TL;DR: This paper presents ongoing work on using Deep Belief Networks (DBNs) to automatically extract high-level features from raw EEG signals and shows that the learned features perform comparably to manually generated features for emotion recognition.
TL;DR: A new facial expression database containing spontaneous expressions of both male and female participants of Indian origin is proposed and established, which will help in the development and validation of algorithms for recognition of spontaneous expressions.
TL;DR: This work aims at creating an ontology-based context model for emotion recognition using EEG, which implements one complete pass of the W2T data cycle: from low-level EEG feature acquisition to emotion recognition.
TL;DR: A detailed survey of the current literature on brain biometric systems is provided, including an up-to-date review of state-of-the-art acquisition, collection, processing, and analysis of brainwave signals, publicly available databases, feature extraction and selection, and classifiers.
TL;DR: Reports of affective experience obtained using SAM are compared to the Semantic Differential scale devised by Mehrabian and Russell (An approach to environmental psychology, 1974), which requires 18 different ratings.
TL;DR: Key issues in affective computing ("computing that relates to, arises from, or influences emotions") are presented, and new applications are described for computer-assisted learning, perceptual information retrieval, arts and entertainment, and human health and interaction.
Q1. What are some common features used to characterize affect in music?
Tempo, Mel-frequency cepstral coefficients (MFCCs), pitch, and zero-crossing rate are among the common features that have been used to characterize affect in music.
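As a rough illustration of extracting these features, here is a minimal sketch using the librosa library (an assumption on my part; the paper does not specify a toolkit), with a hypothetical file path:

```python
import librosa
import numpy as np

# Load an audio excerpt (the path is hypothetical).
y, sr = librosa.load("music_video_excerpt.wav", sr=22050)

# Common affect-related audio features mentioned above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # timbre
zcr = librosa.feature.zero_crossing_rate(y)          # noisiness
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)       # tempo (BPM)
pitches, mags = librosa.piptrack(y=y, sr=sr)         # frame-wise pitch candidates

# Summarize frame-level features into one fixed-length vector per excerpt
# (pitch candidates would be aggregated similarly before use).
features = np.concatenate([mfcc.mean(axis=1), [zcr.mean(), tempo]])
```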
Q2. What are the contributions mentioned in the paper "Deap: a database for emotion analysis using physiological signals" ?
The authors present a multimodal data set for the analysis of human affective states, propose a novel method for stimuli selection, provide an extensive analysis of the participants' ratings during the experiment, and make the data set publicly available.
Q3. What test was used to test for significance?
To test for significance, an independent one-sample t-test was performed, comparing the F1-distribution over participants to the 0.5 baseline.
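A minimal sketch of this test using scipy, with hypothetical per-participant F1 scores, might look as follows:

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant F1 scores for one modality and target.
f1_scores = np.array([0.62, 0.55, 0.48, 0.71, 0.59, 0.66, 0.53, 0.60])

# One-sample t-test against the 0.5 chance baseline, as described above.
t_stat, p_value = stats.ttest_1samp(f1_scores, popmean=0.5)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```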
Q4. What other features have been shown to be correlated with valence?
There are other content features such as color variance and key lighting that have been shown to be correlated with valence [30].
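For illustration, the sketch below computes a per-frame color variance with OpenCV, following one common formulation (the determinant of the covariance matrix of the L, U, V pixel values); the exact feature definition in [30] may differ in detail.

```python
import cv2
import numpy as np

def color_variance(video_path, max_frames=500):
    """Mean per-frame color variance in CIE LUV space.

    One common formulation: the determinant of the covariance matrix
    of the (L, U, V) pixel values in each frame.
    """
    cap = cv2.VideoCapture(video_path)
    variances = []
    while len(variances) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        luv = cv2.cvtColor(frame, cv2.COLOR_BGR2LUV)
        pixels = luv.reshape(-1, 3).astype(np.float64)
        cov = np.cov(pixels, rowvar=False)  # 3x3 covariance of L, U, V
        variances.append(np.linalg.det(cov))
    cap.release()
    return float(np.mean(variances))
```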
Q5. What are the four quadrants of the valence-arousal space?
The valence-arousal space can be subdivided into four quadrants, namely low arousal/low valence (LALV), low arousal/high valence (LAHV), high arousal/low valence (HALV), and high arousal/high valence (HAHV).
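A minimal helper for mapping ratings to these quadrants might look as follows; the 5.0 midpoint used as the class boundary on the 1-9 rating scale is an illustrative assumption:

```python
def va_quadrant(valence, arousal, midpoint=5.0):
    """Map a (valence, arousal) rating pair to one of the four quadrants.

    DEAP ratings lie on a 1-9 scale; the 5.0 midpoint used as the
    class boundary here is an illustrative assumption.
    """
    v = "HV" if valence > midpoint else "LV"
    a = "HA" if arousal > midpoint else "LA"
    return a + v  # e.g. "LALV", "HAHV"

print(va_quadrant(7.2, 3.1))  # -> "LAHV" (calm, pleasant)
```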
Q6. How many videos were selected via Last.fm affective tags?
Of the 40 selected videos, 17 were selected via Last.fm affective tags, indicating that useful stimuli can be selected via this method.
Q7. What are the common types of emotional information used for emotion assessment?
Physiological signals are also known to include emotional information that can be used for emotion assessment but they have received less attention.
Q8. What are the two widely available databases for emotion assessment?
To the best of their knowledge, the only publicly available multimodal emotional databases that include both physiological responses and facial expressions are the eNTERFACE 2005 emotional database and MAHNOB-HCI [4], [5].
Q9. What was the emotional highlight score of the i-th segment ei?
The emotional highlight score of the i-th segment, e_i, was computed as e_i = √(a_i² + v_i²) (Equation 1), where the arousal a_i and valence v_i were first centered.
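A direct translation of Equation (1) into code, with hypothetical per-segment arousal and valence estimates:

```python
import numpy as np

def highlight_scores(arousal, valence):
    """Emotional highlight score e_i = sqrt(a_i^2 + v_i^2) per segment,
    computed after centering the arousal and valence sequences."""
    a = np.asarray(arousal, dtype=float)
    v = np.asarray(valence, dtype=float)
    a -= a.mean()  # centering, as described in the answer above
    v -= v.mean()
    return np.sqrt(a**2 + v**2)

# Hypothetical per-segment affect estimates for one music video.
scores = highlight_scores([3.1, 5.4, 7.8], [4.0, 4.5, 6.2])
best_segment = int(np.argmax(scores))  # segment with the strongest emotion
```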
Q10. How did the participants rate their familiarity with the songs?
After the experiment, participants were asked to rate their familiarity with each of the songs on a scale of 1 ("Never heard it before the experiment") to 5 ("Knew the song very well").
Q11. What was the arousal and valence level of each video?
The participants rated the arousal and valence levels of each video, and these ratings were used to classify the corresponding EEG and physiological signals into low/high arousal and low/high valence classes.
The paper mentions several emotion datasets, including one recorded by Healey with physiological signals and others with speech, visual, or audiovisual data.