
A Multimodal Database for
Affect Recognition and Implicit Tagging
Mohammad Soleymani, Member, IEEE, Jeroen Lichtenauer,
Thierry Pun, Member, IEEE, and Maja Pantic, Fellow, IEEE
Abstract—MAHNOB-HCI is a multimodal database recorded in response to affective stimuli with the goal of emotion recognition and
implicit tagging research. A multimodal setup was arranged for synchronized recording of face videos, audio signals, eye gaze data,
and peripheral/central nervous system physiological signals. Twenty-seven participants from both genders and different cultural
backgrounds participated in two experiments. In the first experiment, they watched 20 emotional videos and self-reported their felt
emotions using arousal, valence, dominance, and predictability as well as emotional keywords. In the second experiment, short videos
and images were shown once without any tag and then with correct or incorrect tags. Agreement or disagreement with the displayed
tags was assessed by the participants. The recorded videos and bodily responses were segmented and stored in a database. The
database is made available to the academic community via a web-based system. The collected data were analyzed and single
modality and modality fusion results for both emotion recognition and implicit tagging experiments are reported. These results show the
potential uses of the recorded modalities and the significance of the emotion elicitation protocol.
Index Terms—Emotion recognition, EEG, physiological signals, facial expressions, eye gaze, implicit tagging, pattern classification,
affective computing.
1 INTRODUCTION
ALTHOUGH the human emotional experience plays a central part in our lives, our scientific knowledge about human emotions is still very limited. Progress in the field of affective sciences is crucial for the development of psychology as a scientific discipline and for any application that deals with humans as emotional beings. More specifically, applications in human-computer interaction rely on knowledge about the human emotional experience, as well as on knowledge about the relation between emotional experience and affective expression.
An area of commerce that could obviously benefit from an automatic understanding of the human emotional experience is the multimedia sector. Media items such as movies and songs are often valued primarily for the way in which they stimulate a certain emotional experience. Yet, while it is often this affective experience that a person is looking for, media items are currently tagged mainly by genre, subject, or factual content. Implicit affective tagging, based on the automatic understanding of an individual's response to media items, would make it possible to rapidly tag large quantities of media at a detailed level and in a way that reflects how people actually experience the affective aspects of media content [1]. This would allow more effective content retrieval, which is required to manage the ever-increasing quantity of shared media.
To study human emotional experience and expression in more detail and on a scientific level, and to develop and benchmark methods for automatic recognition, researchers need rich sets of data from repeatable experiments [2]. Such corpora should include high-quality measurements of the important cues that relate to human emotional experience and expression. The richness of human emotional expressiveness poses both a technological and a research challenge. This is recognized and reflected by an increasing interest in pattern recognition methods for human behavior analysis that can deal with the fusion of measurements from different sensor modalities [2]. However, obtaining multimodal sensor data is a challenge in itself. Different modalities of measurement require different equipment, developed and manufactured by different companies, and different expertise to set up and operate. The need for interdisciplinary knowledge, as well as for technological solutions that combine measurement data from a diversity of sensor equipment, is probably the main reason for the current lack of multimodal databases of recordings dedicated to human emotional experiences.
To address this need for emotional databases and affective tagging research, we have recorded a database of multimodal recordings of participants responding to affectively stimulating excerpts from movies, and to images and videos shown with correct or incorrect tags associated with human actions. The database is freely available to the academic community and is easily accessible through a web interface (http://mahnob-db.eu).
• M. Soleymani and T. Pun are with the Computer Vision and Multimedia Laboratory, Computer Science Department, University of Geneva, Battelle Campus, Building A, Rte. de Drize 7, Carouge (GE) CH-1227, Switzerland. E-mail: {mohammad.soleymani, thierry.pun}@unige.ch.
• J. Lichtenauer is with the Department of Computing, Imperial College London, 180 Queen's Gate, London SW7 2AZ, United Kingdom. E-mail: j.lichtenauer@imperial.ac.uk.
• M. Pantic is with the Department of Computing, Imperial College London, 180 Queen's Gate, London SW7 2AZ, United Kingdom, and the Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS), University of Twente, Drienerlolaan 5, Enschede 7522 NB, The Netherlands. E-mail: m.pantic@imperial.ac.uk.
Manuscript received 12 Nov. 2010; revised 1 July 2011; accepted 6 July 2011;
published online 28 July 2011.
Recommended for acceptance by B. Schuller.
For information on obtaining reprints of this article, please send e-mail to:
taffc@computer.org, and reference IEEECS Log Number
TAFFCSI-2010-11-0112.
Digital Object Identifier no. 10.1109/T-AFFC.2011.25.
1949-3045/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society.

The recordings for all excerpts are annotated through an affective feedback form, filled in by the participants immediately after each excerpt. A summary of the MAHNOB-HCI database characteristics is given in Table 1. The recordings in this database are precisely synchronized, and its multimodality permits researchers to study simultaneous emotional responses across different channels. Two typical data sets, responses to emotional videos and agreement or disagreement with displayed tags (implicit tagging), can be used for both emotion recognition and multimedia tagging studies. Emotion recognition and implicit tagging baseline results are provided for researchers who intend to use the database; these baseline results set a target to reach.
In Section 2, we give an overview of existing affective databases, followed by descriptions of the modalities recorded in our database in Section 3. Section 4 explains the experimental setup. The paradigm, statistics, and emotion classification results of the first experiment are presented in Section 5, and those of the second experiment in Section 6. A discussion on the use of the database and recommendations for recording such databases are given in Section 7, followed by our conclusions in Section 8.
2 BACKGROUND
Creating affective databases is an important step in emotion
recognition studies. Recent advances in emotion recognition
have motivated the creation of novel databases containing
emotional expressions. These databases mostly include
speech, visual, or audio-visual data [5], [6], [7], [8]. The
visual modality of the emotional databases includes face
and/or body gestures. The audio modality carries acted or
genuine emotional speech in different languages. In the last
decade, most of the databases consisted only of acted or
deliberately expressed emotions. More recently, researchers
have begun sharing spontaneous and natural emotional
databases such as [6], [7], [9]. We only review the publicly
available spontaneous or naturalistic databases and refer
the reader to the following review [2] for posed, audio, and
audio-visual databases.
Pantic et al. created the MMI web-based emotional database of posed and spontaneous facial expressions with both static images and videos [5], [10]. The MMI database consists of images and videos captured from both frontal and profile views. It includes data from 61 adults acting out different basic emotions and 25 adults reacting to emotional videos. This web-based database offers a search option over the corpus and is downloadable (http://www.mmifacedb.com/).
One notable database with spontaneous reactions is the Belfast database (BE) created by Cowie et al. [11]. The BE database includes spontaneous reactions in TV talk shows. Although the database is very rich in body gestures and facial expressions, the variety in the backgrounds makes it a challenging data set for automated emotion recognition. The BE database was later included in a much larger ensemble of databases, the HUMAINE database [6]. The HUMAINE database consists of three naturalistic and six induced-reaction databases. These databases vary in size, from 8 to 125 participants, and in modalities, from audio-visual only to peripheral physiological signals. They were developed independently at different sites and collected under the HUMAINE project.
The “Vera am Mittag” (VAM) audio-visual database [7]
is another example of using spontaneous naturalistic
reactions during a talk show to develop a database. Twelve
hours of audio-visual recordings from a German talk show,
“Vera am Mittag,” were segmented and annotated. The
segments were annotated using valence, activation, and
dominance. The audio-visual signals consist of the video
and utterances from 104 different speakers.
Compared to audio-visual databases, there are fewer publicly available affective physiological databases. Healey and Picard recorded one of the first affective physiological data sets at MIT, containing the reactions of 17 drivers under different levels of stress [4]. Their recordings include the electrocardiogram (ECG), galvanic skin response (GSR) recorded from hands and feet, the electromyogram (EMG) from the right trapezius, and the respiration pattern. This stress recognition in drivers data set is publicly available from PhysioNet (http://www.physionet.org/pn3/drivedb/).
The Database for Emotion Analysis using Physiological Signals (DEAP) [9] is a recent database that includes peripheral and central nervous system physiological signals in addition to face videos from 32 participants. The face videos were recorded for only 22 of the participants. EEG signals were recorded from 32 active electrodes. The peripheral nervous system physiological signals were EMG, electrooculogram (EOG), blood volume pulse (BVP) recorded with a plethysmograph, skin temperature, and GSR. The spontaneous reactions of the participants were recorded in response to music video clips. This database is publicly available on the Internet (http://www.eecs.qmul.ac.uk/mmv/datasets/deap/).

TABLE 1
MAHNOB-HCI Database Content Summary

The characteristics of the reviewed databases are summarized in Table 2.
3 MODALITIES AND APPARATUS
3.1 Stimuli and Video Selection
Although the most straightforward way to represent an emotion is to use discrete labels such as fear or joy, label-based representations have some disadvantages. Specifically, labels are not cross-lingual: Emotions do not have exact translations in different languages, e.g., "disgust" does not have an exact translation in Polish [12]. Psychologists therefore often represent emotions or feelings in an n-dimensional space (generally 2 or 3D). The most famous such space, which is used in the present study and originates from cognitive theory, is the 3D valence-arousal-dominance or pleasure-arousal-dominance (PAD) space [13]. The valence scale ranges from unpleasant to pleasant. The arousal scale ranges from passive to active or excited. The dominance scale ranges from submissive (or "without control") to dominant (or "in control, empowered"). Fontaine et al. [14] proposed adding a predictability dimension to the PAD dimensions. The predictability level describes to what extent the sequence of events is predictable or surprising for a viewer.
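For concreteness, a single self-report in this four-dimensional space can be represented as a simple record. The sketch below is purely illustrative: the 1-9 integer range mirrors the nine-point scales used in the preliminary study described next, and is assumed here for the other dimensions as well.

```python
from dataclasses import dataclass

# Hypothetical container for one self-report in the 4D space described above.
@dataclass
class SelfReport:
    valence: int         # 1 = unpleasant ... 9 = pleasant
    arousal: int         # 1 = calm/passive ... 9 = excited/active
    dominance: int       # 1 = submissive ... 9 = in control
    predictability: int  # 1 = surprising ... 9 = predictable
    keyword: str         # discrete emotional keyword, e.g., "sadness"

    def __post_init__(self):
        for name in ("valence", "arousal", "dominance", "predictability"):
            value = getattr(self, name)
            if not 1 <= value <= 9:
                raise ValueError(f"{name} must be on the 1-9 scale, got {value}")

report = SelfReport(valence=2, arousal=7, dominance=3, predictability=2, keyword="fear")
```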
In a preliminary study, 155 video clips containing movie scenes manually selected from 21 commercially produced movies were shown to more than 50 participants; each video clip received 10 annotations on average [15]. The preliminary study was conducted utilizing an online affective annotation system in which the participants reported their emotions in response to the videos played by a web-based video player.
In the preliminary study, the participants were asked to self-assess their emotion by reporting the felt arousal (ranging from calm to excited/activated) and valence (ranging from unpleasant to pleasant) on nine-point scales. Self-Assessment Manikins (SAM) were shown to facilitate the self-assessment of valence and arousal [16]. Fourteen video clips were chosen, based on the preliminary study, from the clips that received the highest number of tags in a given emotion class; e.g., the clip with the highest number of sad tags was selected to induce sadness. Three other popular video clips from online resources were added to this set (two for joy and one for disgust). Three past weather forecast reports (retrieved from youtube.com) were also used as neutral-emotion clips. The videos from online resources were added to the data set to enable us to distribute some of the emotional video samples with the multimodal database described below. The full list of videos is given in Table 3.
Ultimately, 20 videos were selected to be shown, with lengths between 34.9 and 117 s (M = 81.4 s, SD = 22.5 s). Psychologists recommend videos from 1 to 10 minutes long for the elicitation of a single emotion [17], [18]. Here, the video clips were kept as short as possible to avoid eliciting multiple emotions or habituation to the stimuli, while keeping them long enough to observe the effect.
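The selection rule used in the preliminary study, keeping, for each target emotion, the clip that received the most matching keyword tags, can be sketched as follows. The clip identifiers and annotation data below are hypothetical.

```python
from collections import defaultdict

# Hypothetical (clip, keyword) annotations from an online preliminary study.
annotations = [
    ("clip_042", "sadness"), ("clip_042", "sadness"), ("clip_017", "sadness"),
    ("clip_101", "joy"), ("clip_101", "joy"), ("clip_077", "joy"),
]

tag_counts = defaultdict(lambda: defaultdict(int))  # emotion -> clip -> count
for clip, keyword in annotations:
    tag_counts[keyword][clip] += 1

# For each emotion class, keep the clip with the highest tag count.
selected = {emotion: max(clips, key=clips.get) for emotion, clips in tag_counts.items()}
print(selected)  # {'sadness': 'clip_042', 'joy': 'clip_101'}
```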
3.2 Facial Expressions and Audio Signals
One of the most well-studied emotional expression channels is facial expression. Human beings use facial expressions as a natural means of emotional communication. Emotional expressions are also used in human-human communication to clarify and stress what is said, to signal comprehension, disagreement, and intentions; in brief, to regulate interactions with the environment and other persons in the vicinity [19], [20]. Automatic analysis of facial expressions is an interesting topic from both a scientific and a practical point of view. It has attracted the interest of many researchers, since such systems have numerous applications in behavioral science, medicine, security, and human-computer interaction. To develop and evaluate such applications, large collections of training and test data are needed [21], [22]. In the current database, we are interested in studying the spontaneous responses of participants while watching video clips. These responses can later be used for the emotional implicit tagging of multimedia content.

TABLE 2
The Summary of the Characteristics of the Emotional Databases Reviewed

TABLE 3
The Video Clips Listed with Their Sources
The listed emotional keywords were chosen by polling the participants' self-reports in the preliminary study.
Fig. 1 shows the synchronized views from the six different cameras. Two types of cameras were used in the recordings: one Allied Vision Stingray F-046C color camera (C1) and five Allied Vision Stingray F-046B monochrome cameras (BW1 to BW5). All cameras recorded at a resolution of 780 × 580 pixels at 60 frames per second. The two close-up cameras above the screen give a near-frontal view of the face in color (Fig. 1a) or monochrome (Fig. 1b). The monochrome views have better sharpness and less motion blur than the color camera. The two views from the bottom of the screen (Figs. 1c and 1d) give a close-up view that may be more useful for down-facing head poses, and make it possible to apply passive stereo imaging. For this purpose, the intrinsic and extrinsic parameters of all cameras have been calibrated. Linear polarizing filters were used with the two bottom cameras to reduce the reflection of the computer screen in eyes and glasses. The profile view (Fig. 1e) can be used to extract backward-forward head/body movements or to aid the extraction of facial expressions, together with the other cameras. The wide-angle view (Fig. 1f) captures the upper body, arms, and hands, which can also carry important information about a person's affective state.
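As a rough illustration of how the calibrated intrinsic and extrinsic parameters of the two bottom cameras enable passive stereo, the sketch below triangulates a single facial point seen in both views with OpenCV. All calibration matrices, the baseline, and the image points are placeholder values, not the actual calibration data distributed with the database.

```python
import numpy as np
import cv2

# Hypothetical intrinsic matrices for the two bottom cameras.
K1 = np.array([[900.0, 0.0, 390.0], [0.0, 900.0, 290.0], [0.0, 0.0, 1.0]])
K2 = np.array([[905.0, 0.0, 385.0], [0.0, 905.0, 292.0], [0.0, 0.0, 1.0]])
R = np.eye(3)                          # rotation of camera 2 w.r.t. camera 1 (assumed)
t = np.array([[0.12], [0.0], [0.0]])   # ~12 cm horizontal baseline (assumed)

# Projection matrices P = K [R | t].
P1 = K1 @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K2 @ np.hstack([R, t])

# Matching pixel coordinates of the same facial landmark in both views.
pt1 = np.array([[400.0], [300.0]])
pt2 = np.array([[352.0], [298.0]])

X_h = cv2.triangulatePoints(P1, P2, pt1, pt2)  # homogeneous 4x1 result
X = (X_h[:3] / X_h[3]).ravel()                 # 3D point in the camera-1 frame
print("3D position (camera 1 frame):", X)
```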
Although we did not explicitly ask the participants to speak or otherwise express themselves during the experiments, we expected some natural utterances and laughter in the recorded audio signals. The audio was recorded for its potential use in video tagging; e.g., it has been used to measure the hilarity of videos by analyzing a user's laughter [23]. However, the amount of laughter and other audible responses from participants in the database is not sufficient for such studies, and the audio signals were therefore not analyzed. The recorded audio contains two channels. Channel one (or "left" if interpreted as a stereo stream) contains the signal from an AKG C 1000 S MkIII room microphone, which captures the room noise as well as the sound of the video stimuli. Channel two contains the signal from an AKG HC 577 L head-worn microphone.
3.3 Eye Gaze Data
The Tobii X120 eye gaze tracker (http://www.tobii.com) provides the position of the projected eye gaze on the screen, the pupil diameter, the moments when the eyes were closed, and the instantaneous distance of the participant's eyes to the gaze tracker device. The eye gaze data were sampled at 60 Hz due to instability of the eye gaze tracker system at 120 Hz. Blinking moments can also be extracted from the eye gaze data by finding the moments in the eye gaze responses where the coordinates are equal to -1. Pupil diameter has been shown to change in different emotional states [24], [25]. Examples of eye gaze responses are shown in Fig. 2.
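A minimal sketch of this blink-extraction rule, assuming the gaze data are available as an (N, 2) array of on-screen coordinates sampled at 60 Hz, with eyes-closed samples marked by -1:

```python
import numpy as np

def blink_intervals(gaze_xy, fs=60.0):
    """Return (start_s, end_s) intervals during which the eyes were closed."""
    closed = np.all(gaze_xy == -1, axis=1)
    # Rising/falling edges of the eyes-closed indicator.
    edges = np.diff(closed.astype(int))
    starts = np.flatnonzero(edges == 1) + 1
    ends = np.flatnonzero(edges == -1) + 1
    if closed[0]:
        starts = np.r_[0, starts]
    if closed[-1]:
        ends = np.r_[ends, len(closed)]
    return [(s / fs, e / fs) for s, e in zip(starts, ends)]

# Toy example: one two-sample blink in a short gaze trace.
gaze = np.array([[512, 300], [510, 302], [-1, -1], [-1, -1], [515, 305]])
print(blink_intervals(gaze))  # [(0.0333..., 0.0666...)]
```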
3.4 Physiological Signals
Physiological responses (ECG, GSR, respiration amplitude, and skin temperature) were recorded at a 1,024 Hz sampling rate and later downsampled to 256 Hz to reduce memory and processing costs. The trend of the ECG and GSR signals was removed by subtracting the temporal low-frequency drift, which was computed by smoothing each ECG and GSR channel with a 256-point moving average.
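This preprocessing can be sketched as follows, assuming the raw signal is a one-dimensional NumPy array sampled at 1,024 Hz; scipy.signal.decimate stands in for whichever anti-aliased downsampling was actually used.

```python
import numpy as np
from scipy.signal import decimate

def preprocess(raw, fs_in=1024, fs_out=256, win=256):
    # Downsample 1,024 Hz -> 256 Hz with anti-aliasing.
    x = decimate(raw, fs_in // fs_out)
    # Estimate the low-frequency drift with a moving average over `win` samples,
    # then subtract it to detrend the signal (as done for ECG and GSR channels).
    kernel = np.ones(win) / win
    drift = np.convolve(x, kernel, mode="same")
    return x - drift

# Synthetic drifting signal for demonstration.
fs = 1024
t = np.arange(0, 10, 1 / fs)
raw_gsr = 0.5 * t + np.sin(2 * np.pi * 1.0 * t)
clean_gsr = preprocess(raw_gsr)
```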
GSR provides a measure of the resistance of the skin by
positioning two electrodes on the distal phalanges of the
middle and index fingers and passing a negligible current
through the body. This resistance decreases due to an
increase of perspiration, which usually occurs when one is
experiencing emotions such as stress or surprise. Moreover,
Lang et al. discovered that the mean value of the GSR is
related to the level of arousal [26].
ECG signals were recorded using three sensors attached to the participant's body. Two of the electrodes were placed on the upper right and upper left of the chest, below the clavicles, and the third electrode was placed on the abdomen below the last rib, for setup simplicity. This setup allows precise identification of heartbeats and, consequently, the computation of heart rate (HR).
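As an illustration, heart rate can be estimated from the detrended 256 Hz ECG channel by detecting R-peaks and averaging the R-R intervals. The simple amplitude-and-distance peak criterion below is an illustrative choice, not necessarily the method used for the baseline results.

```python
import numpy as np
from scipy.signal import find_peaks

def heart_rate_bpm(ecg, fs=256):
    """Estimate the mean heart rate (beats per minute) from a detrended ECG trace."""
    # Require peaks at least 0.4 s apart (i.e., below 150 bpm) and reasonably tall.
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs), height=0.6 * np.max(ecg))
    rr = np.diff(peaks) / fs            # R-R intervals in seconds
    return 60.0 / np.mean(rr) if len(rr) else float("nan")
```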
Skin temperature was recorded by a temperature sensor placed on the participant's little finger. The respiration amplitude was measured with a respiration belt tied around the participant's abdomen.
Psychological studies regarding the relations between emotions and the brain are uncovering the strong implication of cognitive processes in emotions [27]. As a result, the EEG signals carry valuable information about the participants' felt emotions. EEG signals were recorded using active AgCl electrodes placed according to the international 10-20 system. Examples of peripheral physiological responses are shown in Fig. 2.
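One common way to summarize such EEG recordings is the spectral power of each electrode in standard frequency bands. The sketch below is an illustration under stated assumptions (the band boundaries and Welch parameters are chosen here for demonstration), not the feature set reported in the baseline experiments; `eeg` is assumed to be a (channels, samples) array at 256 Hz.

```python
import numpy as np
from scipy.signal import welch

# Illustrative band boundaries in Hz.
BANDS = {"theta": (4, 8), "alpha": (8, 12), "beta": (12, 30), "gamma": (30, 45)}

def band_powers(eeg, fs=256):
    """Return the mean PSD per channel in each frequency band."""
    freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2, axis=-1)
    out = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        out[name] = psd[:, mask].mean(axis=-1)
    return out
```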
4 EXPERIMENTAL SETUP
4.1 Experimental Protocol
As explained above, we set up an apparatus to record facial
videos, audio and vocal expressions, eye gaze, and
physiological signals simultaneously. The experiment was controlled by the Tobii Studio software. The Biosemi Active II system (http://www.biosemi.com) with active electrodes was used for physiological signal acquisition. Physiological signals including ECG, EEG (32 channels), respiration amplitude, and skin temperature were recorded while the videos were shown to the participants. In the first experiment, five multiple-choice questions were asked during the self-report for each video. For the second experiment, where the feedback was limited to yes and no, two big colored buttons (red and green) were provided.

Fig. 1. Snapshots of videos captured from six cameras recording facial expressions and head pose.
Thirty participants with different cultural backgrounds volunteered in response to a campus-wide call for volunteers at Imperial College London. Of the 30 young, healthy adult participants, 17 were female and 13 were male; ages ranged from 19 to 40 years (M = 26.06, SD = 4.39). Participants had different educational backgrounds, from undergraduate students to postdoctoral fellows, and their English proficiency ranged from intermediate to native. The data recorded from three participants (P9, P12, P15) were not analyzed due to technical problems and unfinished data collection. Hence, the analysis results of this paper are based only on the responses recorded from the remaining 27 participants.
4.2 Synchronized Setup
An overview of the synchronization in the recording setup is shown in Fig. 3. To synchronize the sensors, we centrally monitored the timing of all sensors using a MOTU 8pre audio interface (http://www.motu.com/products/motuaudio/8pre; "c" in Fig. 3) that can sample up to eight analog inputs simultaneously. This allowed the derivation of the exact temporal relations between events in each of the eight channels. By recording the external camera trigger pulse signal ("b" in Fig. 3) in a parallel audio track (see the fifth signal in Fig. 4), each recorded video frame could be related to the recorded audio with an uncertainty below 25 μs. More details about the data synchronization can be found in [28].
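A sketch of how such a trigger track can be used: detect the rising edge of each camera trigger pulse in the audio-rate recording and take its sample index as the frame's timestamp on the audio timeline. The threshold, pulse polarity, and variable names below are assumptions, not details specified by the recording setup.

```python
import numpy as np

def frame_times_from_trigger(trigger, audio_fs, threshold=0.5):
    """Return the audio-track time (in seconds) of each camera trigger rising edge."""
    high = trigger > threshold
    rising = np.flatnonzero(np.diff(high.astype(int)) == 1) + 1
    return rising / audio_fs   # one timestamp per recorded video frame

# At 60 fps, consecutive timestamps should be roughly 16.7 ms apart.
```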
The gaze tracking data and physiological signals were recorded with separate capture systems. Because neither of
Fig. 3. Overview of our synchronized multisensor data capture system,
consisting of (a) a physiological measurement device, (b) video
cameras, (c) a multichannel A/D converter, (d) an A/V capture PC,
(e) microphones, (f) an eye gaze capture PC, (g) an eye gaze tracker,
and (h) a photo diode to capture the pulsed IR-illumination from the eye
gaze tracker. The camera trigger was recorded both as an audio channel and as a physiological channel for synchronization.
Fig. 2. Natural expressions in response to a fearful (left) and a disgusting (right) video. Shown are snapshots of the stimulus videos with and without the eye gaze overlay, the frontal captured video, raw physiological signals, and raw eye gaze data. In the first row, the red circles show the fixation points, and their radius indicates the time spent at each fixation point. The red lines indicate the moments at which the snapshots were captured.
