A Bayesian Framework for Video Affective Representation

Mohammad Soleymani, Joep J.M. Kierkels, Guillaume Chanel, Thierry Pun
Computer Vision and Multimedia Laboratory, Computer Science Department
University of Geneva
Battelle Building A, Rte. De Drize 7, CH-1227 Carouge, Geneva, Switzerland
{mohammad.soleymani, joep.kierkels, guillaume.chanel, thierry.pun}@unige.ch
http://cvml.unige.ch

Published in: 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII 2009), proceedings, 2009, p. 1-7.
DOI: 10.1109/ACII.2009.5349563
Available at: http://archive-ouverte.unige.ch/unige:47663

Abstract
Emotions that are elicited in response to a video
scene contain valuable information for multimedia
tagging and indexing. The novelty of this paper is to
introduce a Bayesian classification framework for
affective video tagging that allows taking contextual
information into account. A set of 21 full length movies
was first segmented and informative content-based
features were extracted from each shot and scene. Shots
were then emotionally annotated, providing ground
truth affect. The arousal of shots was computed using a
linear regression on the content-based features.
Bayesian classification based on the shots' arousal and
content-based features allowed tagging these scenes
into three affective classes, namely calm, positive
excited and negative excited. To improve classification
accuracy, two contextual priors have been proposed:
the movie genre prior, and the temporal dimension prior
consisting of the probability of transition between
emotions in consecutive scenes. The f1 classification
measure of 54.9% that was obtained on three emotional
classes with a naïve Bayes classifier was improved to
63.4% after utilizing all the priors.
1. Introduction
1.1. Overview
Video and audio on-demand systems are getting more
and more popular and are likely to replace traditional
TVs. Online video content has been growing rapidly in
the last five years. For example, the open-access online video database YouTube already had a viewing rate of more than 100 million videos per day in 2006 [1]. The
enormous mass of digital multimedia content with its
huge variety requires more efficient multimedia
management methods. Many studies have been
conducted in the last decade to increase the accuracy of
current multimedia retrieval systems. These studies were
mostly based on content analysis and textual tags [2].
Although the emotional preferences of a user play an
important role in multimedia content selection, few publications in the field of affective indexing take these preferences into account [2-7].
The present study is focused on movies because they
represent one of the most common and popular types of
multimedia content. An affective representation of
scenes will be useful for tagging, indexing and
highlighting of important parts in a movie. We believe
that using the existing online metadata can improve the
affective representation and classification of movies.
Such metadata, like movie genre, is available on the internet (e.g. the Internet Movie Database, http://www.imdb.com).
Movie genre can be exploited to improve an affect
representation system's inference about the possible
emotion which is going to be elicited in the audience.
For example, the probability of a happy scene in a
comedy certainly differs from that in a drama. Moreover,
the temporal order of the evoked emotions, which can be
modeled by the probability of emotion transition in
consecutive scenes, is also expected to be useful for the
improvement of an affective representation system.
It is shown here how to benefit from the proposed
priors in a Bayesian classification framework. Affect classification was performed for a three-label scene classification problem, where the labels are "calm", "positive excited", and "negative excited". Ground truth was obtained through manual annotation with a FEELTRACE-like [8] annotation tool, the resulting self-assessments serving as the classification ground truth. The usefulness of the priors is shown by comparing classification results with and without them.
[Figure 1. A diagram of the proposed video affective representation. The diagram shows video, audio, and textual information feeding a feature extraction block and a database; metadata from the internet (genre, rating, ...); the user's personal profile (gender, age, etc.); and the affect representation system that outputs the affect.]
In our proposed affective indexing and retrieval
system, different modalities, such as video, audio, and
textual data (subtitles) of a movie will be used for
feature extraction. Figure 1 shows the diagram of such a
system. The feature extraction block extracts features
from the three modalities and stores them in a database.
Then, the affect representation system fuses the
extracted features, the stored personal information, and
the metadata to represent the evoked emotion. For a
personalized retrieval, a personal profile of a user (with
his/her gender, age, location, social network) will help
the affective retrieval process.
The paper is organized as follows. A review of the
current state of the art and an explanation on affect and
affective representation are given in the following
subsections of the first Section. Methods used including
the arousal representation at the shot level and affect
classification at the scene level are given in Section 2.
Section 3 details the movie dataset used and the features
that have been extracted. The obtained classification
results at the scene level, the comparisons with and
without using genre and temporal priors are discussed in
Section 4. Section 5 concludes the article and offers
perspectives for future work.
1.2. State of the art
Video affect representation requires understanding of
the intensity and type of user’s affect while watching a
video. There are only a limited number of studies on
content-based affective representation of movies. Wang
and Cheong [2] used content-based audio and video features to classify basic emotions elicited by movie scenes. In [2], audio was classified into music, speech and environment signals, each treated separately to shape an affective feature vector. The audio affective vector was
used with video-based features such as key lighting and
visual excitement to form a scene feature vector. Finally,
the scene feature vector was classified and labeled with
emotions.
Hanjalic et al. [4] introduced “personalized content
delivery” as a valuable tool in affective indexing and
retrieval systems. In order to represent affect in video,
they first selected video- and audio-based content
features based on their relation to the valence-arousal
space that was defined as an affect model (for the
definition of affect model, see Section 1.3) [4]. Then,
arising emotions were estimated in this space by
combining these features. While arousal and valence
could be used separately for indexing, they combined
these values by following their temporal pattern in the
arousal and valence space. This allowed determining an
affect curve, shown to be useful for extracting video
highlights in a movie or sports video.
A hierarchical movie content analysis method based
on arousal and valence related features was presented by
M. Xu et al. [6]. In this method the affect of each shot
was first classified in the arousal domain using the
arousal correlated features and fuzzy clustering. The
audio short time energy and the first four Mel frequency
cepstral coefficients, MFCC (as a representation of
energy features), shot length, and the motion component
of consecutive frames were used to classify shots in
three arousal classes. Next, they used color energy, lighting, and brightness as valence-related features for an HMM-based valence classification of the previously arousal-categorized shots.
A personalized affect representation method based on
a regression approach for estimating user-felt arousal
and valence from multimedia content features and/or
from physiological responses was presented by
Soleymani et al. [7]. A relevance vector machine was
used to find linear regression weights. This allowed
predicting valence and arousal from the measured
multimedia and/or physiological data. During the
experiments, 64 video clips were shown to 8 participants
while their physiological responses were recorded; user's
self-assessments of valence and arousal served as ground
truth. A comparison was made on the arousal and
valence values obtained by different modalities which
were the physiological signals, the video- and audio-
based features, and the self-assessments. In [7], an experiment with multiple participants was conducted for personalized emotion assessment based on content analysis.
1.3. Affect and Affective representation
Russell [10] proposed a 3D continuous space called
the valence-arousal-dominance space which was based
on a self-representation of emotions from multiple
subjects. In this paper we use a valence-arousal
dimensional approach for affect representation and
annotation. The third dimensional axis, namely
dominance / control, is not used in our study. In the
valence-arousal space it is possible to represent almost
any emotion. The valence axis represents the
pleasantness of a situation, from unpleasant to pleasant;
the arousal axis expresses the degree of felt excitement,
from calm to exciting. Russell demonstrated that this space has the advantages of being cross-cultural and of allowing emotion labels to be mapped onto it. Although the
most straightforward way to represent an emotion is to
use discrete labels such as fear, anxiety and joy, label-
based representations have several disadvantages. The
main one is that despite the universality of basic
emotions, the labels themselves are not universal. They
can be misinterpreted from one language (or culture) to
another. In addition, emotions are continuous
phenomena rather than discrete ones and labels are
unable to define the strength of an emotion.
In a dimensional approach for affect representation,
the affect of a video scene can be represented by its
coordinates in the valence-arousal space. Valence and
arousal can be determined by self reporting. The goal of
an affective representation system is to estimate user’s
valence and arousal or emotion categories in response to
each movie segment. Emotion categories are defined as
regions in the valence-arousal space. Each movie
consists of scenes and each scene consists of a sequence
of shots happening in the same location. A shot is the part of a movie between two cuts, typically filmed without interruption [11].
2. Methods
2.1. Arousal estimation with regression on shots
Informative features for arousal estimation include
loudness and energy of the audio signals, motion
component, visual excitement and shot duration. Using a
method similar to Hanjalic et al. [4] and to the one
proposed in [7], the felt arousal from each shot is
computed by a regression of the content features (see
Section 3 for a detailed description). In order to find the
best weights for arousal estimation using regression, a
leave one movie out strategy on the whole dataset was
used and the linear weights were computed by means of
a relevance vector machine (RVM) from the RVM
toolbox provided by Tipping [12]. The RVM is able to reject uninformative features during its training; hence, no further feature selection was used for arousal determination.
Equation (1) shows how the N_s audio- and video-based features z_k^i of the k-th shot are linearly combined by the weights w_i to compute the arousal â_k at the shot level:

$$\hat{a}_k = w_0 + \sum_{i=1}^{N_s} w_i z_k^i \qquad (1)$$
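As a minimal illustration of Equation (1), the sketch below (Python/numpy) computes â_k for a batch of shots, assuming the weight vector and bias have already been obtained from the regression described above; the toy feature values and names are purely illustrative and not taken from the paper.

```python
import numpy as np

def shot_arousal(shot_features: np.ndarray, weights: np.ndarray, bias: float) -> np.ndarray:
    """Eq. (1): a_hat_k = w_0 + sum_i w_i * z_k^i, evaluated for every shot k.

    shot_features: (n_shots, n_features) array, one row of content features z_k per shot.
    weights:       (n_features,) regression weights w_i.
    bias:          the offset w_0.
    """
    return bias + shot_features @ weights

# Toy usage with made-up numbers (illustrative only):
z = np.array([[0.2, 0.7, 1.3],   # e.g. loudness, motion, visual excitement of shot 1
              [0.9, 1.1, 0.4]])  # shot 2
w = np.array([0.5, 0.3, 0.1])
print(shot_arousal(z, w, bias=0.05))  # one predicted arousal value per shot
```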
After computing arousal at the shot level, the average
and maximum arousals of the shots of each scene are
computed and used as arousal indicator features for the
scene affective classification. During an exciting scene
the arousal-related features do not all remain at their
extreme level. In order to represent the highest arousal
of each scene, the maximum of the shots’ arousal was
chosen to be used as a feature for scene classification.
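A short sketch of this aggregation step, assuming the shot-level arousal values have already been grouped per scene (the list-of-arrays layout is our assumption):

```python
import numpy as np

def scene_arousal_features(shot_arousals_per_scene):
    """Return the (mean, max) shot arousal for each scene, used as two
    arousal-indicator features in the scene-level classification."""
    return [(float(np.mean(a)), float(np.max(a))) for a in shot_arousals_per_scene]

# Example: two scenes containing 3 and 2 shots respectively.
print(scene_arousal_features([np.array([0.1, 0.8, 0.4]), np.array([0.6, 0.2])]))
```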
The linear regression weights that were computed
from our data set were used to determine the arousal of
each movie’s shots. This was done in such a way that all
movies from the dataset except for the one to which the
shot belonged were used as the training set for the
RVM. Any missing affective annotation for a shot was
approximated using linear interpolation from the closest
affective annotated time points in a movie.
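The leave-one-movie-out procedure and the interpolation of missing annotations could be organized as in the sketch below; scikit-learn's BayesianRidge is used here only as a convenient stand-in for the RVM regressor of [12], and the dictionary-based data layout is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge  # stand-in for the RVM of Tipping [12]

def interpolate_annotations(shot_times, annotated_times, annotated_values):
    """Approximate missing shot annotations by linear interpolation between
    the closest annotated time points of the same movie."""
    return np.interp(shot_times, annotated_times, annotated_values)

def leave_one_movie_out_arousal(features, arousal):
    """features: dict movie_id -> (n_shots, n_feat) array; arousal: dict movie_id -> (n_shots,).
    For each movie, train on all the other movies and predict its shot arousal."""
    predictions = {}
    for test_movie in features:
        X_train = np.vstack([X for m, X in features.items() if m != test_movie])
        y_train = np.concatenate([y for m, y in arousal.items() if m != test_movie])
        model = BayesianRidge().fit(X_train, y_train)
        predictions[test_movie] = model.predict(features[test_movie])
    return predictions
```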
It was observed that arousal has higher linear
correlation with multimedia content-based features than
valence. Valence estimation from regression is not as
accurate as arousal estimation and therefore valence
estimation has not been performed at the shot level.
2.2. Bayesian framework and scene
classification
For the purpose of categorizing the valence-arousal
space into three affect classes, the valence-arousal space
was divided into the three areas shown in Figure 2, each
corresponding to one class. According to [13], emotions mapped to the lower arousal category are neither extremely pleasant nor unpleasant and are difficult to differentiate. Emotional evaluations have been shown to have a heart-shaped distribution in the valence-arousal space [13]. Hence, we categorized the lower half
of the plane into one class. The points with an arousal of
zero were counted in class 1 and the points with arousal
greater than zero and valence equal to zero were
considered in class 2. These classes were used as a
simple representation for the emotion categories based
on the previous literature on emotion assessment [14].
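Read literally, this partition of the valence-arousal plane can be written as the small sketch below (the function name and the tie-breaking at zero follow the rules just described):

```python
def affect_class(valence: float, arousal: float) -> int:
    """Map a (valence, arousal) point to the classes of Figure 2:
    1 = calm, 2 = positive excited, 3 = negative excited."""
    if arousal <= 0:      # lower half of the plane, arousal == 0 counted as calm
        return 1
    if valence >= 0:      # arousal > 0; valence == 0 counted as positive excited
        return 2
    return 3              # arousal > 0 and valence < 0

print(affect_class(0.3, 0.7), affect_class(-0.5, 0.6), affect_class(0.9, -0.2))  # 2 3 1
```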
In order to classify movie scenes into these
affective categories, the average and maximum arousal
of the shots of each scene and the low level extracted
audio- and video- based features were used to form a
feature vector. This feature vector in turn was used for
the classification.
If the content feature vector of the j-th scene is x_j, the problem of finding the emotion class ŷ_j of this scene is formulated as estimating the ŷ_j which maximizes the probability p(y_j|x_j,θ), where θ is the prior information, which can include the user's preferences and the video clip's metadata. In this paper, one element of the prior metadata θ is the genre of the movie; personal profile parameters can also be added to θ. Since the whole affect representation is trained on the self-reports of a single participant, the model is assumed to be personalized for this participant. When the emotion of the previous scene is used as an additional prior, the scene affect probability becomes p(y_j|y_{j-1},x_j,θ). Assuming for simplification that the emotion of the previous scene is independent of the content features of the current scene, this probability can be reformulated as:

$$p(y_j \mid y_{j-1}, x_j, \theta) = \frac{p(y_{j-1} \mid y_j, \theta)\, p(y_j \mid x_j, \theta)}{p(y_{j-1} \mid \theta)} \qquad (2)$$
[Figure 2. Three classes in the valence-arousal space, namely calm (1), positive excited (2) and negative excited (3). An approximation of the heart-shaped distribution of valence and arousal is shown.]

The classification problem is then simplified to finding the maximum value of the numerator of Equation (2), since the denominator is the same for all affect classes y_j. The priors are established from the empirical probabilities observed in the training data. For example, the occurrence
probability of having a given emotion followed by any of the emotion categories was computed from the participant's self-assessments and for each genre. This allowed obtaining p(y_{j-1}|y_j). Different methods were evaluated to estimate the posterior probability p(y_j|x_j). A naïve Bayes approach assuming Gaussian class-conditional probabilities was chosen as it provided the best performance on the dataset; the superiority of this method can be attributed to its generalization ability.
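Putting the pieces together, a compact sketch of the resulting decision rule is given below. It combines a Gaussian naive Bayes posterior with the genre-dependent transition prior; scikit-learn's GaussianNB is used as a stand-in for the naive Bayes estimator, the greedy scene-by-scene decoding is one possible interpretation of the rule, and all data layouts are assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def classify_scenes(X_scenes, nb, transition_prior, first_scene_prior):
    """Label scenes in temporal order with y_j = argmax_y p(y_{j-1} | y) * p(y | x_j).

    X_scenes:          (n_scenes, n_features) scene feature vectors, in temporal order.
    nb:                a GaussianNB already fitted on training scene features.
    transition_prior:  (n_classes, n_classes) matrix with entry [prev, cur] = p(y_prev | y_cur),
                       estimated per genre from the annotated training scenes.
    first_scene_prior: (n_classes,) class prior used for the first scene of the movie.
    """
    posteriors = nb.predict_proba(X_scenes)        # p(y_j | x_j) for every scene j
    labels, prev = [], None
    for post in posteriors:
        scores = (first_scene_prior if prev is None else transition_prior[prev, :]) * post
        prev = int(np.argmax(scores))
        labels.append(prev)
    return labels
```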
3. Material description
A dataset of movies segmented and affectively
annotated by arousal and valence is used as the training
set. This training set consists of twenty one full length
movies (mostly popular movies). The majority of movies
were selected either because they were used in similar
studies (e.g. [15]), or because they were recent and
popular. The dataset included four genres: drama,
horror, action, and comedy. The following three
information streams were extracted from the media:
video (visual), sound (auditory), and subtitles (textual).
The video stream of the movies has been segmented at
the shot level using the OMT shot segmentation software
and manually segmented into scenes [16;17]. Movie
videos were encoded into the MPEG-1 format to extract
motion vectors and I frames for further feature
extraction. We used the OVAL library (Object-based
Video Access Library) [18] to capture video frames and
extract motion vectors.
The second information stream, namely sound, has an important impact on the user's affect. For example, according to the findings of Picard [19], loudness of speech (energy) is related to evoked arousal, while
rhythm and average pitch in speech signals are related to
valence. The audio channels of the movies were
extracted and encoded into monophonic information
(MPEG layer 3 format) at a sampling rate of 48 kHz. All
of the resulting audio signals were normalized to the
same amplitude range before further processing.
Textual features were also extracted from the subtitles
track of the movies. According to [9] the semantic
analysis of the textual information can improve affect
classification. As the semantic analysis over the textual
data was not the focus of our work we extracted simple
features from subtitles by tokenizing the text and
counting the number of words. These statistics were used together with the timing of the subtitles to extract a talking-rate feature, i.e. the number of words spoken per second of subtitle display time. The other extracted feature is the number of spoken words in a scene divided by the length of the scene, which can represent the amount or presence of dialogue in a scene. A list of the movies in the dataset and their corresponding genres is given in Table 1.
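For illustration, the talking-rate computation could look like the sketch below; the subtitle entries are assumed to be already parsed into (start, end, text) tuples, which is our assumption rather than the paper's actual pipeline.

```python
def talking_rate(subtitles):
    """subtitles: list of (start_sec, end_sec, text) tuples for one scene.
    Returns the number of words spoken per second of subtitle display time."""
    n_words = sum(len(text.split()) for _, _, text in subtitles)
    shown = sum(end - start for start, end, _ in subtitles)
    return n_words / shown if shown > 0 else 0.0

print(talking_rate([(0.0, 2.5, "How are you"), (3.0, 5.0, "Fine thanks")]))  # 5 words / 4.5 s
```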
3.1. Audio features
A total of 53 low-level audio features were
determined for each of the audio signals. These features,
listed in Table 2, are commonly used in audio and
speech processing and audio classification [20;21].
Wang et al. [2] demonstrated the relationship between
audio type proportions (for example, the proportion of
music in an audio segment) and affect, where these
proportions refer to the respective duration of music,
speech, environment, and silence in the audio signal of a
video clip. To determine the three important audio types
(music, speech, environment), we implemented a three
class audio type classifier using support vector machines
(SVM) operating on audio low-level features in a one
second segment. Before classification, silence was identified by comparing the average audio signal energy of each sound segment (using the averaged squared magnitude in a time window) with a pre-defined threshold empirically extracted from the first seven percent of the audio energy histogram. This histogram was computed from a randomly selected 30-minute segment of each movie's audio stream.
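A minimal sketch of this silence-detection step is given below. It simply takes the threshold at the 7th percentile of the window energies of the audio passed to the function (whereas in the paper the histogram comes from a 30-minute random excerpt of each movie); the window length and parameter names are our choices.

```python
import numpy as np

def silence_mask(signal, sr, win_sec=0.02, percentile=7.0):
    """Flag low-energy windows as silence.

    The energy of each window is the averaged squared magnitude of its samples;
    windows below the threshold derived from the lowest `percentile` percent of
    the energy distribution are considered silent."""
    win = max(1, int(win_sec * sr))
    n = len(signal) // win
    frames = np.asarray(signal[: n * win], dtype=float).reshape(n, win)
    energy = (frames ** 2).mean(axis=1)
    threshold = np.percentile(energy, percentile)
    return energy <= threshold  # True where the window is treated as silence
```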
Table 2. Low-level features extracted from audio signals.
MFCC: MFCC coefficients (13 features) [20], derivative of MFCC (13 features), autocorrelation of MFCC (13 features)
Energy: average energy of the audio signal [20]
Formants: formants up to 5500 Hz (female voice) (five features)
Time-frequency: spectrum flux, spectral centroid, delta spectrum magnitude, band energy ratio [20;21]
Pitch: first pitch frequency
Zero crossing rate: average, standard deviation [20]
Silence ratio: proportion of silence in a time window [24]
After removing silence, the remaining audio signals
were classified by an SVM with a polynomial kernel, using the LIBSVM toolbox (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). The SVM
was trained on about three hours of audio, extracted
from movies (not from the dataset of this paper) and
labeled manually. Despite the fact that in various cases
the audio type classes were overlapping (e.g. presence of
a musical background during a dialogue), the classifier
was usually able to recognize the dominant audio type
with an accuracy of about 80%.
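A sketch of such an audio-type classifier using scikit-learn's SVC with a polynomial kernel as a stand-in for the LIBSVM toolbox; the feature scaling, label encoding, and polynomial degree are assumptions, and the 53 low-level features of Table 2 are assumed to be computed elsewhere.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_audio_type_classifier(X_train, y_train):
    """X_train: (n_segments, 53) low-level features per one-second segment (Table 2).
    y_train:  manually labelled audio type per segment, e.g. 0 = music, 1 = speech,
              2 = environment (silent segments removed beforehand)."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3))
    return clf.fit(X_train, y_train)
```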
The classification results were then used to compute the ratio of
Table 1. List of the movies in the dataset.
Drama: The Pianist, Blood Diamond, Hotel Rwanda, Apocalypse Now, American History X, Hannibal
Comedy: Man on the Moon, Mr. Bean's Holiday, Love Actually, Shaun of the Dead, Shrek
Horror: Silent Hill, Ringu (Japanese), 28 Days Later, The Shining
Action: Man on Fire, Kill Bill Vol. 1, Kill Bill Vol. 2, Platoon, The Thin Red Line, Gangs of New York
