A Bayesian Framework for Video Affective Representation

Mohammad Soleymani, Joep J.M. Kierkels, Guillaume Chanel, Thierry Pun
Computer Vision and Multimedia Laboratory, Computer Science Department
University of Geneva
Battelle Building A, Rte. De Drize 7, CH-1227 Carouge, Geneva, Switzerland
{mohammad.soleymani, joep.kierkels, guillaume.chanel, thierry.pun}@unige.ch
http://cvml.unige.ch

Published in: 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII 2009), proceedings, 2009, p. 1-7.
DOI: 10.1109/ACII.2009.5349563
Available at: http://archive-ouverte.unige.ch/unige:47663

Abstract
Emotions that are elicited in response to a video
scene contain valuable information for multimedia
tagging and indexing. The novelty of this paper is to
introduce a Bayesian classification framework for
affective video tagging that allows taking contextual
information into account. A set of 21 full length movies
was first segmented and informative content-based
features were extracted from each shot and scene. Shots
were then emotionally annotated, providing ground
truth affect. The arousal of shots was computed using a
linear regression on the content-based features.
Bayesian classification based on the shots' arousal and
content-based features allowed tagging these scenes
into three affective classes, namely calm, positive
excited and negative excited. To improve classification
accuracy, two contextual priors have been proposed:
the movie genre prior, and the temporal dimension prior
consisting of the probability of transition between
emotions in consecutive scenes. The f1 classification
measure of 54.9% that was obtained on three emotional
classes with a naïve Bayes classifier was improved to
63.4% after utilizing all the priors.
1. Introduction
1.1. Overview
Video and audio on-demand systems are getting more
and more popular and are likely to replace traditional
TVs. Online video content has been growing rapidly in
the last five years. For example, the open-access online video database YouTube already had a viewing rate of more than 100 million videos per day in 2006 [1]. The
enormous mass of digital multimedia content with its
huge variety requires more efficient multimedia
management methods. Many studies have been
conducted in the last decade to increase the accuracy of
current multimedia retrieval systems. These studies were
mostly based on content analysis and textual tags [2].
Although the emotional preferences of a user play an
important role in multimedia content selection, few publications in the field of affective indexing take these preferences into account [2-7].
The present study is focused on movies because they
represent one of the most common and popular types of
multimedia content. An affective representation of
scenes will be useful for tagging, indexing and
highlighting of important parts in a movie. We believe
that using the existing online metadata can improve the
affective representation and classification of movies.
Such metadata, like movie genre, is available on the internet (e.g. the Internet Movie Database, http://www.imdb.com).
Movie genre can be exploited to improve an affect
representation system's inference about the possible
emotion which is going to be elicited in the audience.
For example, the probability of a happy scene in a
comedy certainly differs from that in a drama. Moreover,
the temporal order of the evoked emotions, which can be
modeled by the probability of emotion transition in
consecutive scenes, is also expected to be useful for the
improvement of an affective representation system.
It is shown here how to benefit from the proposed
priors in a Bayesian classification framework. Affect classification was performed for a three-label scene classification problem, where the labels are "calm", "positive excited", and "negative excited". Ground truth was obtained through manual annotation with a FEELTRACE-like [8] annotation tool, the resulting self-assessments serving as the classification ground truth. The usefulness of the priors is shown by comparing classification results with and without them.
[Figure 1. A diagram of the proposed video affective representation. The diagram shows video, audio, and textual information feeding a feature extraction block and a database; metadata from the internet (genre, rating, ...); the user's personal profile (gender, age, etc.); and the affect representation system that outputs the affect.]
In our proposed affective indexing and retrieval
system, different modalities, such as video, audio, and
textual data (subtitles) of a movie will be used for
feature extraction. Figure 1 shows the diagram of such a
system. The feature extraction block extracts features
from the three modalities and stores them in a database.
Then, the affect representation system fuses the
extracted features, the stored personal information, and
the metadata to represent the evoked emotion. For a
personalized retrieval, a personal profile of a user (with
his/her gender, age, location, social network) will help
the affective retrieval process.
The paper is organized as follows. A review of the
current state of the art and an explanation on affect and
affective representation are given in the following
subsections of the first Section. Methods used including
the arousal representation at the shot level and affect
classification at the scene level are given in Section 2.
Section 3 details the movie dataset used and the features
that have been extracted. The obtained classification
results at the scene level, the comparisons with and
without using genre and temporal priors are discussed in
Section 4. Section 5 concludes the article and offers
perspectives for future work.
1.2. State of the art
Video affect representation requires understanding of
the intensity and type of user’s affect while watching a
video. There are only a limited number of studies on
content-based affective representation of movies. Wang
and Cheong [2] used content-based audio and video features to classify basic emotions elicited by movie scenes. In [2], audio was classified into music, speech and environment signals, each treated separately to shape an affective feature vector. The audio affective vector was
used with video-based features such as key lighting and
visual excitement to form a scene feature vector. Finally,
the scene feature vector was classified and labeled with
emotions.
Hanjalic et al. [4] introduced “personalized content
delivery” as a valuable tool in affective indexing and
retrieval systems. In order to represent affect in video,
they first selected video- and audio-based content
features based on their relation to the valence-arousal
space that was defined as an affect model (for the
definition of affect model, see Section 1.3) [4]. Then,
arising emotions were estimated in this space by
combining these features. While arousal and valence
could be used separately for indexing, they combined
these values by following their temporal pattern in the
arousal and valence space. This allowed determining an
affect curve, shown to be useful for extracting video
highlights in a movie or sports video.
A hierarchical movie content analysis method based
on arousal and valence related features was presented by
M. Xu et al. [6]. In this method the affect of each shot
was first classified in the arousal domain using the
arousal correlated features and fuzzy clustering. The
audio short time energy and the first four Mel frequency
cepstral coefficients, MFCC (as a representation of
energy features), shot length, and the motion component
of consecutive frames were used to classify shots in
three arousal classes. Next, they used color energy, lighting, and brightness as valence-related features for an HMM-based valence classification of the previously arousal-categorized shots.
A personalized affect representation method based on
a regression approach for estimating user-felt arousal
and valence from multimedia content features and/or
from physiological responses was presented by
Soleymani et al. [7]. A relevance vector machine was
used to find linear regression weights. This allowed
predicting valence and arousal from the measured
multimedia and/or physiological data. During the
experiments, 64 video clips were shown to 8 participants
while their physiological responses were recorded; user's
self-assessments of valence and arousal served as ground
truth. A comparison was made on the arousal and
valence values obtained by different modalities which
were the physiological signals, the video- and audio-
based features, and the self-assessments. In [7], an experiment with multiple participants was conducted for personalized emotion assessment based on content analysis.
1.3. Affect and Affective representation
Russell [10] proposed a 3D continuous space called
the valence-arousal-dominance space which was based
on a self-representation of emotions from multiple
subjects. In this paper we use a valence-arousal
dimensional approach for affect representation and
annotation. The third dimensional axis, namely
dominance / control, is not used in our study. In the
valence-arousal space it is possible to represent almost
any emotion. The valence axis represents the
pleasantness of a situation, from unpleasant to pleasant;
the arousal axis expresses the degree of felt excitement,
from calm to exciting. Russell demonstrated that this space has the advantages of being cross-cultural and of allowing emotion labels to be mapped onto it. Although the
most straightforward way to represent an emotion is to
use discrete labels such as fear, anxiety and joy, label-
based representations have several disadvantages. The
main one is that despite the universality of basic
emotions, the labels themselves are not universal. They
can be misinterpreted from one language (or culture) to
another. In addition, emotions are continuous
phenomena rather than discrete ones and labels are
unable to define the strength of an emotion.
In a dimensional approach for affect representation,
the affect of a video scene can be represented by its
coordinates in the valence-arousal space. Valence and
arousal can be determined by self reporting. The goal of
an affective representation system is to estimate user’s
valence and arousal or emotion categories in response to
each movie segment. Emotion categories are defined as
regions in the valence-arousal space. Each movie
consists of scenes and each scene consists of a sequence
of shots happening in the same location. A shot is the part of a movie between two cuts, typically filmed without interruption [11].
2. Methods
2.1. Arousal estimation with regression on shots
Informative features for arousal estimation include
loudness and energy of the audio signals, motion
component, visual excitement and shot duration. Using a
method similar to Hanjalic et al. [4] and to the one
proposed in [7], the felt arousal from each shot is
computed by a regression of the content features (see
Section 3 for a detailed description). In order to find the
best weights for arousal estimation using regression, a
leave one movie out strategy on the whole dataset was
used and the linear weights were computed by means of
a relevance vector machine (RVM) from the RVM
toolbox provided by Tipping [12]. The RVM is able to reject uninformative features during its training; hence, no further feature selection was used for arousal determination.
Equation (1) shows how the N_s audio- and video-based features z_k^i of the k-th shot are linearly combined by the weights w_i to compute the arousal â_k at the shot level:

$$\hat{a}_k = w_0 + \sum_{i=1}^{N_s} w_i z_k^i \qquad (1)$$
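As a minimal illustration of Equation (1), the sketch below (Python/numpy) computes â_k for a batch of shots, assuming the weight vector and bias have already been obtained from the regression described above; the toy feature values and names are purely illustrative and not taken from the paper.

```python
import numpy as np

def shot_arousal(shot_features: np.ndarray, weights: np.ndarray, bias: float) -> np.ndarray:
    """Eq. (1): a_hat_k = w_0 + sum_i w_i * z_k^i, evaluated for every shot k.

    shot_features: (n_shots, n_features) array, one row of content features z_k per shot.
    weights:       (n_features,) regression weights w_i.
    bias:          the offset w_0.
    """
    return bias + shot_features @ weights

# Toy usage with made-up numbers (illustrative only):
z = np.array([[0.2, 0.7, 1.3],   # e.g. loudness, motion, visual excitement of shot 1
              [0.9, 1.1, 0.4]])  # shot 2
w = np.array([0.5, 0.3, 0.1])
print(shot_arousal(z, w, bias=0.05))  # one predicted arousal value per shot
```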
After computing arousal at the shot level, the average
and maximum arousals of the shots of each scene are
computed and used as arousal indicator features for the
scene affective classification. During an exciting scene
the arousal-related features do not all remain at their
extreme level. In order to represent the highest arousal
of each scene, the maximum of the shots’ arousal was
chosen to be used as a feature for scene classification.
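A short sketch of this aggregation step, assuming the shot-level arousal values have already been grouped per scene (the list-of-arrays layout is our assumption):

```python
import numpy as np

def scene_arousal_features(shot_arousals_per_scene):
    """Return the (mean, max) shot arousal for each scene, used as two
    arousal-indicator features in the scene-level classification."""
    return [(float(np.mean(a)), float(np.max(a))) for a in shot_arousals_per_scene]

# Example: two scenes containing 3 and 2 shots respectively.
print(scene_arousal_features([np.array([0.1, 0.8, 0.4]), np.array([0.6, 0.2])]))
```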
The linear regression weights that were computed
from our data set were used to determine the arousal of
each movie’s shots. This was done in such a way that all
movies from the dataset except for the one to which the
shot belonged were used as the training set for the
RVM. Any missing affective annotation for a shot was
approximated using linear interpolation from the closest
affective annotated time points in a movie.
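The leave-one-movie-out procedure and the interpolation of missing annotations could be organized as in the sketch below; scikit-learn's BayesianRidge is used here only as a convenient stand-in for the RVM regressor of [12], and the dictionary-based data layout is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge  # stand-in for the RVM of Tipping [12]

def interpolate_annotations(shot_times, annotated_times, annotated_values):
    """Approximate missing shot annotations by linear interpolation between
    the closest annotated time points of the same movie."""
    return np.interp(shot_times, annotated_times, annotated_values)

def leave_one_movie_out_arousal(features, arousal):
    """features: dict movie_id -> (n_shots, n_feat) array; arousal: dict movie_id -> (n_shots,).
    For each movie, train on all the other movies and predict its shot arousal."""
    predictions = {}
    for test_movie in features:
        X_train = np.vstack([X for m, X in features.items() if m != test_movie])
        y_train = np.concatenate([y for m, y in arousal.items() if m != test_movie])
        model = BayesianRidge().fit(X_train, y_train)
        predictions[test_movie] = model.predict(features[test_movie])
    return predictions
```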
It was observed that arousal has higher linear
correlation with multimedia content-based features than
valence. Valence estimation from regression is not as
accurate as arousal estimation and therefore valence
estimation has not been performed at the shot level.
2.2. Bayesian framework and scene
classification
For the purpose of categorizing the valence-arousal
space into three affect classes, the valence-arousal space
was divided into the three areas shown in Figure 2, each
corresponding to one class. According to [13], emotions mapped to the lower arousal category are neither extremely pleasant nor unpleasant and are difficult to differentiate. Emotional evaluations have been shown to have a heart-shaped distribution in the valence-arousal space [13]. Hence, we categorized the lower half
of the plane into one class. The points with an arousal of
zero were counted in class 1 and the points with arousal
greater than zero and valence equal to zero were
considered in class 2. These classes were used as a
simple representation for the emotion categories based
on the previous literature on emotion assessment [14].
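Read literally, this partition of the valence-arousal plane can be written as the small sketch below (the function name and the tie-breaking at zero follow the rules just described):

```python
def affect_class(valence: float, arousal: float) -> int:
    """Map a (valence, arousal) point to the classes of Figure 2:
    1 = calm, 2 = positive excited, 3 = negative excited."""
    if arousal <= 0:      # lower half of the plane, arousal == 0 counted as calm
        return 1
    if valence >= 0:      # arousal > 0; valence == 0 counted as positive excited
        return 2
    return 3              # arousal > 0 and valence < 0

print(affect_class(0.3, 0.7), affect_class(-0.5, 0.6), affect_class(0.9, -0.2))  # 2 3 1
```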
In order to classify movie scenes into these
affective categories, the average and maximum arousal
of the shots of each scene and the low level extracted
audio- and video- based features were used to form a
feature vector. This feature vector in turn was used for
the classification.
If the content feature vector of the j-th scene is x_j, the problem of finding the emotion class ŷ_j of this scene is formulated as estimating the ŷ_j which maximizes the probability p(y_j|x_j,θ), where θ is the prior information, which can include the user's preferences and the video clip's metadata. In this paper, one element of the prior metadata θ is the genre of the movie; personal profile parameters can also be added to θ. Since the whole affect representation is trained on the self-reports of a single participant, the model is assumed to be personalized for this participant. When the emotion of the previous scene is used as an additional prior, the scene affect probability becomes p(y_j|y_{j-1},x_j,θ). Assuming for simplification that the emotion of the previous scene is independent of the content features of the current scene, this probability can be reformulated as:

$$p(y_j \mid y_{j-1}, x_j, \theta) = \frac{p(y_{j-1} \mid y_j, \theta)\, p(y_j \mid x_j, \theta)}{p(y_{j-1} \mid \theta)} \qquad (2)$$
[Figure 2. Three classes in the valence-arousal space, namely calm (1), positive excited (2) and negative excited (3). An approximation of the heart-shaped distribution of valence and arousal is shown.]

The classification problem is then simplified to finding the maximum value of the numerator of Equation (2), since the denominator is the same for all affect classes y_j. The priors are established from the empirical probabilities observed in the training data. For example, the occurrence
probability of having a given emotion followed by any of the emotion categories was computed from the participant's self-assessments and for each genre. This allowed obtaining p(y_{j-1}|y_j). Different methods were evaluated to estimate the posterior probability p(y_j|x_j). A naïve Bayes approach assuming Gaussian class-conditional probabilities was chosen as it provided the best performance on the dataset; the superiority of this method can be attributed to its generalization ability.
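Putting the pieces together, a compact sketch of the resulting decision rule is given below. It combines a Gaussian naive Bayes posterior with the genre-dependent transition prior; scikit-learn's GaussianNB is used as a stand-in for the naive Bayes estimator, the greedy scene-by-scene decoding is one possible interpretation of the rule, and all data layouts are assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def classify_scenes(X_scenes, nb, transition_prior, first_scene_prior):
    """Label scenes in temporal order with y_j = argmax_y p(y_{j-1} | y) * p(y | x_j).

    X_scenes:          (n_scenes, n_features) scene feature vectors, in temporal order.
    nb:                a GaussianNB already fitted on training scene features.
    transition_prior:  (n_classes, n_classes) matrix with entry [prev, cur] = p(y_prev | y_cur),
                       estimated per genre from the annotated training scenes.
    first_scene_prior: (n_classes,) class prior used for the first scene of the movie.
    """
    posteriors = nb.predict_proba(X_scenes)        # p(y_j | x_j) for every scene j
    labels, prev = [], None
    for post in posteriors:
        scores = (first_scene_prior if prev is None else transition_prior[prev, :]) * post
        prev = int(np.argmax(scores))
        labels.append(prev)
    return labels
```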
3. Material description
A dataset of movies segmented and affectively
annotated by arousal and valence is used as the training
set. This training set consists of twenty one full length
movies (mostly popular movies). The majority of movies
were selected either because they were used in similar
studies (e.g. [15]), or because they were recent and
popular. The dataset included four genres: drama,
horror, action, and comedy. The following three
information streams were extracted from the media:
video (visual), sound (auditory), and subtitles (textual).
The video stream of the movies has been segmented at
the shot level using the OMT shot segmentation software
and manually segmented into scenes [16;17]. Movie
videos were encoded into the MPEG-1 format to extract
motion vectors and I frames for further feature
extraction. We used the OVAL library (Object-based
Video Access Library) [18] to capture video frames and
extract motion vectors.
The second information stream, namely sound, has an important impact on the user's affect. For example, according to the findings of Picard [19], loudness of speech (energy) is related to evoked arousal, while
rhythm and average pitch in speech signals are related to
valence. The audio channels of the movies were
extracted and encoded into monophonic information
(MPEG layer 3 format) at a sampling rate of 48 kHz. All
of the resulting audio signals were normalized to the
same amplitude range before further processing.
Textual features were also extracted from the subtitles
track of the movies. According to [9] the semantic
analysis of the textual information can improve affect
classification. As the semantic analysis over the textual
data was not the focus of our work we extracted simple
features from subtitles by tokenizing the text and
counting the number of words. These statistics were used together with the timing of the subtitles to extract a talking-rate feature, i.e. the number of words spoken per second of subtitle display time. The other extracted feature is the number of spoken words in a scene divided by the length of the scene, which can represent the amount or presence of dialogue in a scene. A list of the movies in the dataset and their corresponding genres is given in Table 1.
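For illustration, the talking-rate computation could look like the sketch below; the subtitle entries are assumed to be already parsed into (start, end, text) tuples, which is our assumption rather than the paper's actual pipeline.

```python
def talking_rate(subtitles):
    """subtitles: list of (start_sec, end_sec, text) tuples for one scene.
    Returns the number of words spoken per second of subtitle display time."""
    n_words = sum(len(text.split()) for _, _, text in subtitles)
    shown = sum(end - start for start, end, _ in subtitles)
    return n_words / shown if shown > 0 else 0.0

print(talking_rate([(0.0, 2.5, "How are you"), (3.0, 5.0, "Fine thanks")]))  # 5 words / 4.5 s
```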
3.1. Audio features
A total of 53 low-level audio features were
determined for each of the audio signals. These features,
listed in Table 2, are commonly used in audio and
speech processing and audio classification [20;21].
Wang et al. [2] demonstrated the relationship between
audio type proportions (for example, the proportion of
music in an audio segment) and affect, where these
proportions refer to the respective duration of music,
speech, environment, and silence in the audio signal of a
video clip. To determine the three important audio types
(music, speech, environment), we implemented a three
class audio type classifier using support vector machines
(SVM) operating on audio low-level features in a one
second segment. Before classification, silence was identified by comparing the average audio signal energy of each sound segment (using the averaged squared magnitude in a time window) with a pre-defined threshold empirically extracted from the first seven percent of the audio energy histogram. This histogram was computed from a randomly selected 30-minute segment of each movie's audio stream.
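A minimal sketch of this silence-detection step is given below. It simply takes the threshold at the 7th percentile of the window energies of the audio passed to the function (whereas in the paper the histogram comes from a 30-minute random excerpt of each movie); the window length and parameter names are our choices.

```python
import numpy as np

def silence_mask(signal, sr, win_sec=0.02, percentile=7.0):
    """Flag low-energy windows as silence.

    The energy of each window is the averaged squared magnitude of its samples;
    windows below the threshold derived from the lowest `percentile` percent of
    the energy distribution are considered silent."""
    win = max(1, int(win_sec * sr))
    n = len(signal) // win
    frames = np.asarray(signal[: n * win], dtype=float).reshape(n, win)
    energy = (frames ** 2).mean(axis=1)
    threshold = np.percentile(energy, percentile)
    return energy <= threshold  # True where the window is treated as silence
```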
Table 2. Low-level features extracted from audio signals.
MFCC: MFCC coefficients (13 features) [20], derivative of MFCC (13 features), autocorrelation of MFCC (13 features)
Energy: average energy of the audio signal [20]
Formants: formants up to 5500 Hz (female voice) (five features)
Time-frequency: spectrum flux, spectral centroid, delta spectrum magnitude, band energy ratio [20;21]
Pitch: first pitch frequency
Zero crossing rate: average, standard deviation [20]
Silence ratio: proportion of silence in a time window [24]
After removing silence, the remaining audio signals
were classified by an SVM with a polynomial kernel, using the LIBSVM toolbox (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). The SVM
was trained on about three hours of audio, extracted
from movies (not from the dataset of this paper) and
labeled manually. Despite the fact that in various cases
the audio type classes were overlapping (e.g. presence of
a musical background during a dialogue), the classifier
was usually able to recognize the dominant audio type
with an accuracy of about 80%.
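A sketch of such an audio-type classifier using scikit-learn's SVC with a polynomial kernel as a stand-in for the LIBSVM toolbox; the feature scaling, label encoding, and polynomial degree are assumptions, and the 53 low-level features of Table 2 are assumed to be computed elsewhere.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_audio_type_classifier(X_train, y_train):
    """X_train: (n_segments, 53) low-level features per one-second segment (Table 2).
    y_train:  manually labelled audio type per segment, e.g. 0 = music, 1 = speech,
              2 = environment (silent segments removed beforehand)."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3))
    return clf.fit(X_train, y_train)
```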
The classification results were then used to compute the ratio of
Table 1. List of the movies in the dataset.
Drama: The Pianist, Blood Diamond, Hotel Rwanda, Apocalypse Now, American History X, Hannibal
Comedy: Man on the Moon, Mr. Bean's Holiday, Love Actually, Shaun of the Dead, Shrek
Horror: Silent Hill, Ringu (Japanese), 28 Days Later, The Shining
Action: Man on Fire, Kill Bill Vol. 1, Kill Bill Vol. 2, Platoon, The Thin Red Line, Gangs of New York
