
A Multimodal Database for
Affect Recognition and Implicit Tagging
Mohammad Soleymani, Member, IEEE, Jeroen Lichtenauer,
Thierry Pun, Member, IEEE, and Maja Pantic, Fellow, IEEE
Abstract—MAHNOB-HCI is a multimodal database recorded in response to affective stimuli with the goal of emotion recognition and
implicit tagging research. A multimodal setup was arranged for synchronized recording of face videos, audio signals, eye gaze data,
and peripheral/central nervous system physiological signals. Twenty-seven participants from both genders and different cultural
backgrounds participated in two experiments. In the first experiment, they watched 20 emotional videos and self-reported their felt
emotions using arousal, valence, dominance, and predictability as well as emotional keywords. In the second experiment, short videos
and images were shown once without any tag and then with correct or incorrect tags. Agreement or disagreement with the displayed
tags was assessed by the participants. The recorded videos and bodily responses were segmented and stored in a database. The
database is made available to the academic community via a web-based system. The collected data were analyzed and single
modality and modality fusion results for both emotion recognition and implicit tagging experiments are reported. These results show the
potential uses of the recorded modalities and the significance of the emotion elicitation protocol.
Index Terms—Emotion recognition, EEG, physiological signals, facial expressions, eye gaze, implicit tagging, pattern classification,
affective computing.
1 INTRODUCTION
ALTHOUGH the human emotional experience plays a central part in our lives, our scientific knowledge about human emotions is still very limited. Progress in the field of affective sciences is crucial for the development of psychology as a scientific discipline and for any application that deals with humans as emotional beings. More specifically, applications in human-computer interaction rely on knowledge about the human emotional experience, as well as on knowledge about the relation between emotional experience and affective expression.
An area of commerce that could obviously benefit from an automatic understanding of the human emotional experience is the multimedia sector. Media items such as movies and songs are often valued primarily for the way in which they stimulate a certain emotional experience. Yet, while it is often this affective experience that a person is looking for, media items are currently tagged mainly by genre, subject, or factual content. Implicit affective tagging, based on the automatic understanding of an individual's response to media items, would make it possible to rapidly tag large quantities of media at a detailed level and in a way that reflects how people actually experience the affective aspects of media content [1]. This would allow more effective content retrieval, which is required to manage the ever-increasing quantity of shared media.
To study human emotional experience and expression in more detail and on a scientific level, and to develop and benchmark methods for automatic recognition, researchers need rich sets of data from repeatable experiments [2]. Such corpora should include high-quality measurements of the important cues that relate to human emotional experience and expression. The richness of human emotional expressiveness poses both a technological and a research challenge. This is recognized and reflected by an increasing interest in pattern recognition methods for human behavior analysis that can deal with the fusion of measurements from different sensor modalities [2]. However, obtaining multimodal sensor data is a challenge in itself. Different modalities of measurement require different equipment, developed and manufactured by different companies, and different expertise to set up and operate. The need for interdisciplinary knowledge, as well as for technological solutions that combine measurement data from a diversity of sensor equipment, is probably the main reason for the current lack of multimodal databases of recordings dedicated to human emotional experiences.
To address this need for emotional databases and affective tagging research, we have recorded a database of multimodal recordings of participants responding to affectively stimulating excerpts from movies, and to images and videos shown with correct or incorrect tags associated with human actions. The database is freely available to the academic community and is easily accessible through a web interface (http://mahnob-db.eu).
• M. Soleymani and T. Pun are with the Computer Vision and Multimedia Laboratory, Computer Science Department, University of Geneva, Battelle Campus, Building A, Rte. de Drize 7, Carouge (GE) CH-1227, Switzerland. E-mail: {mohammad.soleymani, thierry.pun}@unige.ch.
• J. Lichtenauer is with the Department of Computing, Imperial College London, 180 Queen's Gate, London SW7 2AZ, United Kingdom. E-mail: j.lichtenauer@imperial.ac.uk.
• M. Pantic is with the Department of Computing, Imperial College London, 180 Queen's Gate, London SW7 2AZ, United Kingdom, and the Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS), University of Twente, Drienerlolaan 5, Enschede 7522 NB, The Netherlands. E-mail: m.pantic@imperial.ac.uk.
Manuscript received 12 Nov. 2010; revised 1 July 2011; accepted 6 July 2011;
published online 28 July 2011.
Recommended for acceptance by B. Schuller.
For information on obtaining reprints of this article, please send e-mail to:
taffc@computer.org, and reference IEEECS Log Number
TAFFCSI-2010-11-0112.
Digital Object Identifier no. 10.1109/T-AFFC.2011.25.
1949-3045/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society.

The recordings for all excerpts are annotated through an affective feedback form, filled in by the participants immediately after each excerpt. A summary of the MAHNOB-HCI database characteristics is given in Table 1. The recordings in this database are precisely synchronized, and its multimodality permits researchers to study simultaneous emotional responses across different channels. Two typical data sets, responses to emotional videos and agreement or disagreement with displayed tags (implicit tagging), can be used for both emotion recognition and multimedia tagging studies. Emotion recognition and implicit tagging baseline results are provided for researchers who intend to use the database; these baseline results set a target to reach.
In Section 2, we give an overview of existing affective databases, followed by descriptions of the modalities recorded in our database in Section 3. Section 4 explains the experimental setup. The paradigm, statistics, and emotion classification results of the first experiment are presented in Section 5, and those of the second experiment in Section 6. A discussion on the use of the database and recommendations for recording such databases are given in Section 7, followed by our conclusions in Section 8.
2 BACKGROUND
Creating affective databases is an important step in emotion
recognition studies. Recent advances in emotion recognition
have motivated the creation of novel databases containing
emotional expressions. These databases mostly include
speech, visual, or audio-visual data [5], [6], [7], [8]. The
visual modality of the emotional databases includes face
and/or body gestures. The audio modality carries acted or
genuine emotional speech in different languages. In the last
decade, most of the databases consisted only of acted or
deliberately expressed emotions. More recently, researchers
have begun sharing spontaneous and natural emotional
databases such as [6], [7], [9]. We only review the publicly
available spontaneous or naturalistic databases and refer
the reader to the following review [2] for posed, audio, and
audio-visual databases.
Pantic et al. created the MMI web-based emotional database of posed and spontaneous facial expressions with both static images and videos [5], [10]. The MMI database consists of images and videos captured from both frontal and profile views. It includes data from 61 adults acting out different basic emotions and 25 adults reacting to emotional videos. This web-based database offers a search option over the corpus and is downloadable (http://www.mmifacedb.com/).
One notable database with spontaneous reactions is the Belfast database (BE) created by Cowie et al. [11]. The BE database includes spontaneous reactions in TV talk shows. Although the database is very rich in body gestures and facial expressions, the variety in the backgrounds makes it a challenging data set for automated emotion recognition. The BE database was later included in a much larger ensemble of databases, the HUMAINE database [6]. The HUMAINE database consists of three naturalistic and six induced-reaction databases. These databases vary in size, from 8 to 125 participants, and in modalities, from audio-visual only to peripheral physiological signals. They were developed independently at different sites and collected under the HUMAINE project.
The “Vera am Mittag” (VAM) audio-visual database [7]
is another example of using spontaneous naturalistic
reactions during a talk show to develop a database. Twelve
hours of audio-visual recordings from a German talk show,
“Vera am Mittag,” were segmented and annotated. The
segments were annotated using valence, activation, and
dominance. The audio-visual signals consist of the video
and utterances from 104 different speakers.
Compared to audio-visual databases, there are fewer publicly available affective physiological databases. Healey and Picard recorded one of the first affective physiological data sets at MIT, containing the reactions of 17 drivers under different levels of stress [4]. Their recordings include the electrocardiogram (ECG), galvanic skin response (GSR) recorded from hands and feet, the electromyogram (EMG) from the right trapezius, and the respiration pattern. This stress recognition in drivers data set is publicly available from PhysioNet (http://www.physionet.org/pn3/drivedb/).
The Database for Emotion Analysis using Physiological Signals (DEAP) [9] is a recent database that includes peripheral and central nervous system physiological signals in addition to face videos from 32 participants. The face videos were recorded for only 22 of the participants. EEG signals were recorded from 32 active electrodes. The peripheral nervous system physiological signals were EMG, electrooculogram (EOG), blood volume pulse (BVP) recorded with a plethysmograph, skin temperature, and GSR. The spontaneous reactions of the participants were recorded in response to music video clips. This database is publicly available on the Internet (http://www.eecs.qmul.ac.uk/mmv/datasets/deap/).

TABLE 1
MAHNOB-HCI Database Content Summary

The characteristics of the reviewed databases are summarized in Table 2.
3 MODALITIES AND APPARATUS
3.1 Stimuli and Video Selection
Although the most straightforward way to represent an emotion is to use discrete labels such as fear or joy, label-based representations have some disadvantages. Specifically, labels are not cross-lingual: Emotions do not have exact translations in different languages, e.g., "disgust" does not have an exact translation in Polish [12]. Psychologists therefore often represent emotions or feelings in an n-dimensional space (generally 2 or 3D). The most famous such space, which is used in the present study and originates from cognitive theory, is the 3D valence-arousal-dominance or pleasure-arousal-dominance (PAD) space [13]. The valence scale ranges from unpleasant to pleasant. The arousal scale ranges from passive to active or excited. The dominance scale ranges from submissive (or "without control") to dominant (or "in control, empowered"). Fontaine et al. [14] proposed adding a predictability dimension to the PAD dimensions. The predictability level describes to what extent the sequence of events is predictable or surprising for a viewer.
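For concreteness, a single self-report in this four-dimensional space can be represented as a simple record. The sketch below is purely illustrative: the 1-9 integer range mirrors the nine-point scales used in the preliminary study described next, and is assumed here for the other dimensions as well.

```python
from dataclasses import dataclass

# Hypothetical container for one self-report in the 4D space described above.
@dataclass
class SelfReport:
    valence: int         # 1 = unpleasant ... 9 = pleasant
    arousal: int         # 1 = calm/passive ... 9 = excited/active
    dominance: int       # 1 = submissive ... 9 = in control
    predictability: int  # 1 = surprising ... 9 = predictable
    keyword: str         # discrete emotional keyword, e.g., "sadness"

    def __post_init__(self):
        for name in ("valence", "arousal", "dominance", "predictability"):
            value = getattr(self, name)
            if not 1 <= value <= 9:
                raise ValueError(f"{name} must be on the 1-9 scale, got {value}")

report = SelfReport(valence=2, arousal=7, dominance=3, predictability=2, keyword="fear")
```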
In a preliminary study, 155 video clips containing movie scenes manually selected from 21 commercially produced movies were shown to more than 50 participants; each video clip received 10 annotations on average [15]. The preliminary study was conducted utilizing an online affective annotation system in which the participants reported their emotions in response to the videos played by a web-based video player.
In the preliminary study, the participants were asked to self-assess their emotion by reporting the felt arousal (ranging from calm to excited/activated) and valence (ranging from unpleasant to pleasant) on nine-point scales. Self-Assessment Manikins (SAM) were shown to facilitate the self-assessment of valence and arousal [16]. Fourteen video clips were chosen, based on the preliminary study, from the clips that received the highest number of tags in a given emotion class; e.g., the clip with the highest number of sad tags was selected to induce sadness. Three other popular video clips from online resources were added to this set (two for joy and one for disgust). Three past weather forecast reports (retrieved from youtube.com) were also used as neutral-emotion clips. The videos from online resources were added to the data set to enable us to distribute some of the emotional video samples with the multimodal database described below. The full list of videos is given in Table 3.
Ultimately, 20 videos were selected to be shown, with lengths between 34.9 and 117 s (M = 81.4 s, SD = 22.5 s). Psychologists recommend videos from 1 to 10 minutes long for the elicitation of a single emotion [17], [18]. Here, the video clips were kept as short as possible to avoid eliciting multiple emotions or habituation to the stimuli, while keeping them long enough to observe the effect.
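The selection rule used in the preliminary study, keeping, for each target emotion, the clip that received the most matching keyword tags, can be sketched as follows. The clip identifiers and annotation data below are hypothetical.

```python
from collections import defaultdict

# Hypothetical (clip, keyword) annotations from an online preliminary study.
annotations = [
    ("clip_042", "sadness"), ("clip_042", "sadness"), ("clip_017", "sadness"),
    ("clip_101", "joy"), ("clip_101", "joy"), ("clip_077", "joy"),
]

tag_counts = defaultdict(lambda: defaultdict(int))  # emotion -> clip -> count
for clip, keyword in annotations:
    tag_counts[keyword][clip] += 1

# For each emotion class, keep the clip with the highest tag count.
selected = {emotion: max(clips, key=clips.get) for emotion, clips in tag_counts.items()}
print(selected)  # {'sadness': 'clip_042', 'joy': 'clip_101'}
```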
3.2 Facial Expressions and Audio Signals
One of the most well-studied emotional expression channels is facial expression. Human beings use facial expressions as a natural means of emotional communication. Emotional expressions are also used in human-human communication to clarify and stress what is said, to signal comprehension, disagreement, and intentions; in brief, to regulate interactions with the environment and other persons in the vicinity [19], [20]. Automatic analysis of facial expressions is an interesting topic from both a scientific and a practical point of view. It has attracted the interest of many researchers, since such systems have numerous applications in behavioral science, medicine, security, and human-computer interaction. To develop and evaluate such applications, large collections of training and test data are needed [21], [22]. In the current database, we are interested in studying the spontaneous responses of participants while watching video clips. These responses can later be used for the emotional implicit tagging of multimedia content.

TABLE 2
The Summary of the Characteristics of the Emotional Databases Reviewed

TABLE 3
The Video Clips Listed with Their Sources
The listed emotional keywords were chosen by polling the participants' self-reports in the preliminary study.
Fig. 1 shows the synchronized views from the six different cameras. Two types of cameras were used in the recordings: one Allied Vision Stingray F-046C color camera (C1) and five Allied Vision Stingray F-046B monochrome cameras (BW1 to BW5). All cameras recorded at a resolution of 780 × 580 pixels at 60 frames per second. The two close-up cameras above the screen give a near-frontal view of the face in color (Fig. 1a) or monochrome (Fig. 1b). The monochrome views have better sharpness and less motion blur than the color camera. The two views from the bottom of the screen (Figs. 1c and 1d) give a close-up view that may be more useful for down-facing head poses, and make it possible to apply passive stereo imaging. For this purpose, the intrinsic and extrinsic parameters of all cameras have been calibrated. Linear polarizing filters were used with the two bottom cameras to reduce the reflection of the computer screen in eyes and glasses. The profile view (Fig. 1e) can be used to extract backward-forward head/body movements or to aid the extraction of facial expressions, together with the other cameras. The wide-angle view (Fig. 1f) captures the upper body, arms, and hands, which can also carry important information about a person's affective state.
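As a rough illustration of how the calibrated intrinsic and extrinsic parameters of the two bottom cameras enable passive stereo, the sketch below triangulates a single facial point seen in both views with OpenCV. All calibration matrices, the baseline, and the image points are placeholder values, not the actual calibration data distributed with the database.

```python
import numpy as np
import cv2

# Hypothetical intrinsic matrices for the two bottom cameras.
K1 = np.array([[900.0, 0.0, 390.0], [0.0, 900.0, 290.0], [0.0, 0.0, 1.0]])
K2 = np.array([[905.0, 0.0, 385.0], [0.0, 905.0, 292.0], [0.0, 0.0, 1.0]])
R = np.eye(3)                          # rotation of camera 2 w.r.t. camera 1 (assumed)
t = np.array([[0.12], [0.0], [0.0]])   # ~12 cm horizontal baseline (assumed)

# Projection matrices P = K [R | t].
P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K2 @ np.hstack([R, t])

# Matching pixel coordinates of the same facial landmark in both views.
pt1 = np.array([[400.0], [300.0]])
pt2 = np.array([[352.0], [298.0]])

X_h = cv2.triangulatePoints(P1, P2, pt1, pt2)  # homogeneous 4x1 result
X = (X_h[:3] / X_h[3]).ravel()                 # 3D point in the camera-1 frame
print("3D position (camera 1 frame):", X)
```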
Although we did not explicitly ask the participants to speak or otherwise express themselves during the experiments, we expected some natural utterances and laughter in the recorded audio signals. The audio was recorded for its potential use in video tagging; e.g., it has been used to measure the hilarity of videos by analyzing a user's laughter [23]. However, the amount of laughter and other audible responses from participants in the database is not sufficient for such studies, and the audio signals were therefore not analyzed. The recorded audio contains two channels. Channel one (or "left" if interpreted as a stereo stream) contains the signal from an AKG C 1000 S MkIII room microphone, which captures the room noise as well as the sound of the video stimuli. Channel two contains the signal from an AKG HC 577 L head-worn microphone.
3.3 Eye Gaze Data
The Tobii X120 eye gaze tracker (http://www.tobii.com) provides the position of the projected eye gaze on the screen, the pupil diameter, the moments when the eyes were closed, and the instantaneous distance of the participant's eyes to the gaze tracker device. The eye gaze data were sampled at 60 Hz due to instability of the eye gaze tracker system at 120 Hz. Blinking moments can also be extracted from the eye gaze data by finding the moments in the eye gaze responses where the coordinates are equal to -1. Pupil diameter has been shown to change in different emotional states [24], [25]. Examples of eye gaze responses are shown in Fig. 2.
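A minimal sketch of this blink-extraction rule, assuming the gaze data are available as an (N, 2) array of on-screen coordinates sampled at 60 Hz, with eyes-closed samples marked by -1:

```python
import numpy as np

def blink_intervals(gaze_xy, fs=60.0):
    """Return (start_s, end_s) intervals during which the eyes were closed."""
    closed = np.all(gaze_xy == -1, axis=1)
    # Rising/falling edges of the eyes-closed indicator.
    edges = np.diff(closed.astype(int))
    starts = np.flatnonzero(edges == 1) + 1
    ends = np.flatnonzero(edges == -1) + 1
    if closed[0]:
        starts = np.r_[0, starts]
    if closed[-1]:
        ends = np.r_[ends, len(closed)]
    return [(s / fs, e / fs) for s, e in zip(starts, ends)]

# Toy example: one two-sample blink in a short gaze trace.
gaze = np.array([[512, 300], [510, 302], [-1, -1], [-1, -1], [515, 305]])
print(blink_intervals(gaze))  # [(0.0333..., 0.0666...)]
```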
3.4 Physiological Signals
Physiological responses (ECG, GSR, respiration amplitude, and skin temperature) were recorded at a 1,024 Hz sampling rate and later downsampled to 256 Hz to reduce memory and processing costs. The trend of the ECG and GSR signals was removed by subtracting the temporal low-frequency drift, which was computed by smoothing each ECG and GSR channel with a 256-point moving average.
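This preprocessing can be sketched as follows, assuming the raw signal is a one-dimensional NumPy array sampled at 1,024 Hz; scipy.signal.decimate stands in for whichever anti-aliased downsampling was actually used.

```python
import numpy as np
from scipy.signal import decimate

def preprocess(raw, fs_in=1024, fs_out=256, win=256):
    # Downsample 1,024 Hz -> 256 Hz with anti-aliasing.
    x = decimate(raw, fs_in // fs_out)
    # Estimate the low-frequency drift with a moving average over `win` samples,
    # then subtract it to detrend the signal (as done for ECG and GSR channels).
    kernel = np.ones(win) / win
    drift = np.convolve(x, kernel, mode="same")
    return x - drift

# Synthetic drifting signal for demonstration.
fs = 1024
t = np.arange(0, 10, 1 / fs)
raw_gsr = 0.5 * t + np.sin(2 * np.pi * 1.0 * t)
clean_gsr = preprocess(raw_gsr)
```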
GSR provides a measure of the resistance of the skin by
positioning two electrodes on the distal phalanges of the
middle and index fingers and passing a negligible current
through the body. This resistance decreases due to an
increase of perspiration, which usually occurs when one is
experiencing emotions such as stress or surprise. Moreover,
Lang et al. discovered that the mean value of the GSR is
related to the level of arousal [26].
ECG signals were recorded using three sensors attached to the participant's body. Two of the electrodes were placed on the upper right and upper left of the chest, below the clavicles, and the third electrode was placed on the abdomen below the last rib, for setup simplicity. This setup allows precise identification of heartbeats and, consequently, the computation of heart rate (HR).
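As an illustration, heart rate can be estimated from the detrended 256 Hz ECG channel by detecting R-peaks and averaging the R-R intervals. The simple amplitude-and-distance peak criterion below is an illustrative choice, not necessarily the method used for the baseline results.

```python
import numpy as np
from scipy.signal import find_peaks

def heart_rate_bpm(ecg, fs=256):
    """Estimate the mean heart rate (beats per minute) from a detrended ECG trace."""
    # Require peaks at least 0.4 s apart (i.e., below 150 bpm) and reasonably tall.
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs), height=0.6 * np.max(ecg))
    rr = np.diff(peaks) / fs            # R-R intervals in seconds
    return 60.0 / np.mean(rr) if len(rr) else float("nan")
```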
Skin temperature was recorded by a temperature sensor placed on the participant's little finger. The respiration amplitude was measured with a respiration belt tied around the participant's abdomen.
Psychological studies regarding the relations between emotions and the brain are uncovering the strong implication of cognitive processes in emotions [27]. As a result, the EEG signals carry valuable information about the participants' felt emotions. EEG signals were recorded using active AgCl electrodes placed according to the international 10-20 system. Examples of peripheral physiological responses are shown in Fig. 2.
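One common way to summarize such EEG recordings is the spectral power of each electrode in standard frequency bands. The sketch below is an illustration under stated assumptions (the band boundaries and Welch parameters are chosen here for demonstration), not the feature set reported in the baseline experiments; `eeg` is assumed to be a (channels, samples) array at 256 Hz.

```python
import numpy as np
from scipy.signal import welch

# Illustrative band boundaries in Hz.
BANDS = {"theta": (4, 8), "alpha": (8, 12), "beta": (12, 30), "gamma": (30, 45)}

def band_powers(eeg, fs=256):
    """Return the mean PSD per channel in each frequency band."""
    freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2, axis=-1)
    out = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        out[name] = psd[:, mask].mean(axis=-1)
    return out
```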
4 EXPERIMENTAL SETUP
4.1 Experimental Protocol
As explained above, we set up an apparatus to record facial
videos, audio and vocal expressions, eye gaze, and
physiological signals simultaneously. The experiment was controlled by the Tobii Studio software. The Biosemi Active II system (http://www.biosemi.com) with active electrodes was used for physiological signal acquisition. Physiological signals including ECG, EEG (32 channels), respiration amplitude, and skin temperature were recorded while the videos were shown to the participants. In the first experiment, five multiple-choice questions were asked during the self-report for each video. For the second experiment, where the feedback was limited to yes and no, two big colored buttons (red and green) were provided.

Fig. 1. Snapshots of videos captured from six cameras recording facial expressions and head pose.
Thirty participants with different cultural backgrounds volunteered in response to a campus-wide call for volunteers at Imperial College London. Of the 30 young, healthy adult participants, 17 were female and 13 were male; ages ranged from 19 to 40 years (M = 26.06, SD = 4.39). Participants had different educational backgrounds, from undergraduate students to postdoctoral fellows, and their English proficiency ranged from intermediate to native. The data recorded from three participants (P9, P12, P15) were not analyzed due to technical problems and unfinished data collection. Hence, the analysis results of this paper are based only on the responses recorded from the remaining 27 participants.
4.2 Synchronized Setup
An overview of the synchronization in the recording setup is shown in Fig. 3. To synchronize the sensors, we centrally monitored the timing of all sensors using a MOTU 8pre audio interface (http://www.motu.com/products/motuaudio/8pre; "c" in Fig. 3) that can sample up to eight analog inputs simultaneously. This allowed the derivation of the exact temporal relations between events in each of the eight channels. By recording the external camera trigger pulse signal ("b" in Fig. 3) in a parallel audio track (see the fifth signal in Fig. 4), each recorded video frame could be related to the recorded audio with an uncertainty below 25 μs. More details about the data synchronization can be found in [28].
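A sketch of how such a trigger track can be used: detect the rising edge of each camera trigger pulse in the audio-rate recording and take its sample index as the frame's timestamp on the audio timeline. The threshold, pulse polarity, and variable names below are assumptions, not details specified by the recording setup.

```python
import numpy as np

def frame_times_from_trigger(trigger, audio_fs, threshold=0.5):
    """Return the audio-track time (in seconds) of each camera trigger rising edge."""
    high = trigger > threshold
    rising = np.flatnonzero(np.diff(high.astype(int)) == 1) + 1
    return rising / audio_fs   # one timestamp per recorded video frame

# At 60 fps, consecutive timestamps should be roughly 16.7 ms apart.
```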
The gaze tracking data and physiological signals were recorded with separate capture systems. Because neither of
Fig. 3. Overview of our synchronized multisensor data capture system,
consisting of (a) a physiological measurement device, (b) video
cameras, (c) a multichannel A/D converter, (d) an A/V capture PC,
(e) microphones, (f) an eye gaze capture PC, (g) an eye gaze tracker,
and (h) a photo diode to capture the pulsed IR-illumination from the eye
gaze tracker. The camera trigger was recorded both as an audio channel and as a physiological channel for synchronization.
Fig. 2. Natural expressions in response to a fearful (left) and a disgusting (right) video. Shown are snapshots of the stimulus videos with and without the eye gaze overlay, the frontal captured video, raw physiological signals, and raw eye gaze data. In the first row, the red circles show the fixation points, and their radius indicates the time spent at each fixation point. The red lines indicate the moments at which the snapshots were captured.
