Proceedings ArticleDOI

A Bayesian framework for video affective representation

TL;DR: A Bayesian classification framework for affective video tagging that allows taking contextual information into account is introduced and two contextual priors have been proposed: the movie genre prior, and the temporal dimension prior consisting of the probability of transition between emotions in consecutive scenes.
Abstract: Emotions that are elicited in response to a video scene contain valuable information for multimedia tagging and indexing. The novelty of this paper is to introduce a Bayesian classification framework for affective video tagging that allows taking contextual information into account. A set of 21 full length movies was first segmented and informative content-based features were extracted from each shot and scene. Shots were then emotionally annotated, providing ground truth affect. The arousal of shots was computed using a linear regression on the content-based features. Bayesian classification based on the shots arousal and content-based features allowed tagging these scenes into three affective classes, namely calm, positive excited and negative excited. To improve classification accuracy, two contextual priors have been proposed: the movie genre prior, and the temporal dimension prior consisting of the probability of transition between emotions in consecutive scenes. The f1 classification measure of 54.9% that was obtained on three emotional classes with a naive Bayes classifier was improved to 63.4% after utilizing all the priors.

Summary (3 min read)

1.1. Overview

  • Video and audio on-demand systems are getting more and more popular and are likely to replace traditional TVs.
  • The enormous mass of digital multimedia content with its huge variety requires more efficient multimedia management methods.
  • These studies were mostly based on content analysis and textual tags [2].
  • Then, the affect representation system fuses the extracted features, the stored personal information, and the metadata to represent the evoked emotion.
  • Section 3 details the movie dataset used and the features that have been extracted.

1.2. State of the art

  • Video affect representation requires understanding of the intensity and type of user’s affect while watching a video.
  • In order to represent affect in video, they first selected video- and audio- content based features based on their relation to the valence-arousal space that was defined as an affect model (for the definition of affect model, see Section 1.3) [4].
  • Next, they used color energy, lighting and brightness as valence related features to be used for a HMM-based valence classification of the previously arousal-categorized shots.
  • A personalized affect representation method based on a regression approach for estimating user-felt arousal and valence from multimedia content features and/or from physiological responses was presented by Soleymani et al. [7].
  • A relevance vector machine was used to find linear regression weights.

1.3. Affect and Affective representation

  • Russell [10] proposed a 3D continuous space called the valence-arousal-dominance space which was based on a self-representation of emotions from multiple subjects.
  • The valence axis represents the pleasantness of a situation, from unpleasant to pleasant; the arousal axis expresses the degree of felt excitement, from calm to exciting.
  • Although the most straightforward way to represent an emotion is to use discrete labels such as fear, anxiety and joy, label-based representations have several disadvantages.
  • The main one is that despite the universality of basic emotions, the labels themselves are not universal.
  • Each movie consists of scenes and each scene consists of a sequence of shots which are happening in the same location.

2.1. Arousal estimation with regression on shots

  • Informative features for arousal estimation include loudness and energy of the audio signals, motion component, visual excitement and shot duration.
  • The RVM is able to reject uninformative features during its training; hence, no further feature selection was used for arousal determination.
  • After computing arousal at the shot level with the linear combination of Equation (1), the average and maximum arousals of the shots of each scene are computed and used as arousal indicator features for the scene affective classification.
  • During an exciting scene the arousal related features do not all remain at their extreme level.
  • This was done in such a way that all movies from the dataset except for the one to which the shot belonged were used as the training set for the RVM.

2.2. Bayesian framework and scene classification

  • For the purpose of categorizing the valence-arousal space into three affect classes, the valence-arousal space was divided into the three areas shown in Figure 2, each corresponding to one class.
  • Hence, the authors categorized the lower half of the plane into one class.
  • These classes were used as a simple representation for the emotion categories based on the previous literature on emotion assessment [14].
  • This feature vector in turn was used for the classification.
  • Different methods were evaluated to estimate the posterior probability p(yj|xj).
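
As a rough illustration of the classification step summarized in these bullets, the sketch below combines a Gaussian naive Bayes posterior p(yj|xj) with a genre-dependent emotion-transition prior and picks the most probable class. The data layout, variable names and the use of scikit-learn are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def classify_scene(x_j, prev_class, genre, nb_model, transition_prior):
    """Pick the class maximizing p(y_prev | y, genre) * p(y | x_j).

    transition_prior[genre] is an (n_classes, n_classes) array whose entry
    [y, y_prev] approximates the probability that emotion y_prev precedes y.
    """
    posterior = nb_model.predict_proba(x_j.reshape(1, -1))[0]   # p(y | x_j)
    prior = transition_prior[genre][:, prev_class]              # p(y_prev | y, genre)
    return int(np.argmax(prior * posterior))

# Toy usage: random vectors stand in for scene feature vectors,
# classes encoded as 0 = calm, 1 = positive excited, 2 = negative excited.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(60, 8)), np.arange(60) % 3
nb = GaussianNB().fit(X, y)
flat_prior = {"drama": np.full((3, 3), 1.0 / 3.0)}
print(classify_scene(rng.normal(size=8), prev_class=1, genre="drama",
                     nb_model=nb, transition_prior=flat_prior))
```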

3. Material description

  • A dataset of movies segmented and affectively annotated by arousal and valence is used as the training set.
  • The majority of movies were selected either because they were used in similar studies (e.g. [15]), or because they were recent and popular.
  • Movie videos were encoded into the MPEG-1 format to extract motion vectors and I frames for further feature extraction.
  • The second information stream, namely sound, has an important impact on user’s affect.
  • Textual features were also extracted from the subtitles track of the movies.

3.1. Audio features

  • A total of 53 low-level audio features were determined for each of the audio signals.
  • To determine the three important audio types (music, speech, environment), the authors implemented a three class audio type classifier using support vector machines (SVM) operating on audio low-level features in a one second segment.
  • Feature categories include MFCC (MFCC coefficients, their derivatives and autocorrelations, 13 features each) [20] and energy (average energy of the audio signal) [20].
  • Time-frequency features include spectrum flux, spectral centroid, delta spectrum magnitude and band energy ratio [20;21].
  • The audio type classification results were used to compute the ratio of each audio type in a movie segment.

3.2. Visual features

  • From a movie director's point of view, lighting key [2;23] and color variance [2] are important tools to evoke emotions.
  • The average shot change rate, and shot length variance were extracted to characterize video rhythm.
  • Fast moving scenes or objects' movements in consecutive frames are also an effective factor for evoking excitement.
  • Colors and their proportions are important parameters to elicit emotions [17].
  • In order to use colors in the list of video features, a 20 bin color histogram of hue and lightness values in the HSV space was computed for each I frame and subsequently averaged over all frames.
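
To illustrate the last bullet, here is a minimal sketch (not the authors' code) that computes 20-bin histograms over the hue and value channels of frames already converted to HSV and averages them across frames; the channel scaling, the use of the value channel as "lightness", and the synthetic frames are assumptions.

```python
import numpy as np

def average_color_histogram(hsv_frames, bins=20):
    """20-bin hue and lightness histograms averaged over a sequence of HSV frames.

    hsv_frames: iterable of (H, W, 3) float arrays with channels scaled to [0, 1];
    the value channel is used here as a stand-in for lightness.
    Returns a vector of length 2 * bins.
    """
    hists = []
    for frame in hsv_frames:
        hue = frame[..., 0].ravel()
        light = frame[..., 2].ravel()
        h_hist, _ = np.histogram(hue, bins=bins, range=(0.0, 1.0), density=True)
        l_hist, _ = np.histogram(light, bins=bins, range=(0.0, 1.0), density=True)
        hists.append(np.concatenate([h_hist, l_hist]))
    return np.mean(hists, axis=0)

# Synthetic frames standing in for decoded I-frames already converted to HSV.
frames = [np.random.rand(120, 160, 3) for _ in range(5)]
print(average_color_histogram(frames).shape)  # (40,)
```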

3.3. Affective annotation

  • The coordinates of a pointer manipulated by the user are continuously recorded during the show time of the stimuli (video, image, or external source) and used as the affect indicators.
  • A set of SAM manikins (Self-Assessment Manikins [25]) are generated for different combinations of arousal and valence to help the user understand the emotions related to the regions of valence-arousal space.
  • E.g. the positive excited manikin is generated by combining the positive manikin and the excited manikin.
  • The participant was asked to annotate the movies so as to indicate at which times his/her felt emotion has changed.
  • The participant was asked to indicate at least one point during each scene so as not to leave any scene without assessment.

4.1. Arousal estimation of shots

  • Figure 4 shows a sample arousal curve from part of the film entitled “Silent Hill”.
  • The participant’s felt emotion was however not completely in agreement with the estimated curve, as can for instance be observed in the second half of the plot.
  • A possible cause for the discrepancy is the low temporal resolution of the self-assessment.
  • Another possible cause is experimental weariness: after having had exciting stimuli for minutes, a participant's arousal might be decreasing despite strong movements in the video and loud audio.
  • Finally, some emotional feelings might simply not be captured by low-level features; this would for instance be the case for a racist comment in a movie dialogue which evokes disgust for a participant.

4.2. Classification results

  • For the ten-fold cross validation, the original samples (movie scenes) were partitioned into 10 subsample sets.
  • The naïve Bayesian classifier results are shown in Table 3-a.
  • As with the temporal prior, the genre prior leads to a better estimate of the emotion class.
  • The evolution of classification results over consecutive scenes when adding the time prior shows that this prior allows correcting results for some samples that were misclassified using the genre prior only.
  • Using physiological signals or audiovisual recordings will help overcome these problems and facilitate this part of the work, by yielding continuous affective annotations without interrupting the user [7].
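
The f1 measure quoted in these results is presumably the standard per-class F1 score; under that assumption it is defined, for each class, from precision P and recall R as:

```latex
F_1 = \frac{2\,P\,R}{P + R},
\qquad
P = \frac{TP}{TP + FP},
\qquad
R = \frac{TP}{TP + FN}
```

The reported figure would then be an average of the per-class F1 values over the three emotion classes; the exact averaging scheme is not restated here.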

5. Conclusions and perspectives

  • An affective representation system for estimating felt emotions at the scene level has been proposed using a Bayesian classification framework that allows taking some form of context into account.
  • Results showed the advantage of using well chosen priors, such as temporal information provided by the previous scene emotion, and movie genre.
  • The f1 classification measure of 54.9% that was obtained on three emotional classes with a naïve Bayesian classifier was improved to 56.5% and 59.5%, respectively, when using only the time prior and only the genre prior.
  • This measure finally improved to 63.4% after utilizing all the priors.
  • It will also provide us with a better understanding of the feasibility of using group-wise profiles containing some affective characteristics that are shared between users.


Proceedings Chapter
Reference

SOLEYMANI, Mohammad, KIERKELS, Joep J. M., CHANEL, Guillaume, PUN, Thierry. A Bayesian Framework for Video Affective Representation. In: 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII 2009): proceedings. 2009. p. 1-7.
Computer Vision and Multimedia Laboratory, Computer Science Department, University of Geneva, Switzerland. http://cvml.unige.ch
DOI: 10.1109/ACII.2009.5349563
Available at: http://archive-ouverte.unige.ch/unige:47663

Abstract
Emotions that are elicited in response to a video
scene contain valuable information for multimedia
tagging and indexing. The novelty of this paper is to
introduce a Bayesian classification framework for
affective video tagging that allows taking contextual
information into account. A set of 21 full length movies
was first segmented and informative content-based
features were extracted from each shot and scene. Shots
were then emotionally annotated, providing ground
truth affect. The arousal of shots was computed using a
linear regression on the content-based features.
Bayesian classification based on the shots arousal and
content-based features allowed tagging these scenes
into three affective classes, namely calm, positive
excited and negative excited. To improve classification
accuracy, two contextual priors have been proposed:
the movie genre prior, and the temporal dimension prior
consisting of the probability of transition between
emotions in consecutive scenes. The f1 classification
measure of 54.9% that was obtained on three emotional
classes with a naïve Bayes classifier was improved to
63.4% after utilizing all the priors.
1. Introduction
1.1. Overview
Video and audio on-demand systems are getting more
and more popular and are likely to replace traditional
TVs. Online video content has been growing rapidly in
the last five years. For example the open access online
video database, YouTube, had a watching rate of more
than 100 millions videos per day in 2006 [1]. The
enormous mass of digital multimedia content with its
huge variety requires more efficient multimedia
management methods. Many studies have been
conducted in the last decade to increase the accuracy of
current multimedia retrieval systems. These studies were
mostly based on content analysis and textual tags [2].
Although the emotional preferences of a user play an
important role in multimedia content selection, few
publications exist in the field of affective indexing which
consider emotional preferences of users [2-7].
The present study is focused on movies because they
represent one of the most common and popular types of
multimedia content. An affective representation of
scenes will be useful for tagging, indexing and
highlighting of important parts in a movie. We believe
that using the existing online metadata can improve the
affective representation and classification of movies.
Such metadata, like movie genre, is available on internet
(e.g. internet movie database http://www.imdb.com).
Movie genre can be exploited to improve an affect
representation system's inference about the possible
emotion which is going to be elicited in the audience.
For example, the probability of a happy scene in a
comedy certainly differs from that in a drama. Moreover,
the temporal order of the evoked emotions, which can be
modeled by the probability of emotion transition in
consecutive scenes, is also expected to be useful for the
improvement of an affective representation system.
It is shown here how to benefit from the proposed
priors in a Bayesian classification framework. Affect
classification was done for a three labels scene
classification problem, where the labels are “calm”,
“positive excited”, and “negative excited”. Ground truth
was obtained through manual annotation with a
FEELTRACE-like [8] annotation tool with the self-
assessments serving as the classification ground-truth.
The usefulness of priors is shown by comparing
classification results with or without using them.
Figure 1. A diagram of the proposed video affective representation.
In our proposed affective indexing and retrieval
system, different modalities, such as video, audio, and
textual data (subtitles) of a movie will be used for
feature extraction. Figure 1 shows the diagram of such a
system. The feature extraction block extracts features
from the three modalities and stores them in a database.
Then, the affect representation system fuses the
extracted features, the stored personal information, and
the metadata to represent the evoked emotion. For a
personalized retrieval, a personal profile of a user (with
his/her gender, age, location, social network) will help
the affective retrieval process.
The paper is organized as follows. A review of the
current state of the art and an explanation on affect and
affective representation are given in the following
subsections of the first Section. Methods used including
the arousal representation at the shot level and affect
classification at the scene level are given in Section 2.
Section 3 details the movie dataset used and the features
that have been extracted. The obtained classification
results at the scene level, the comparisons with and
without using genre and temporal priors are discussed in
Section 4. Section 5 concludes the article and offers
perspectives for future work.
1.2. State of the art
Video affect representation requires understanding of
the intensity and type of user’s affect while watching a
video. There are only a limited number of studies on
content-based affective representation of movies. Wang
and Cheong [2] used content audio and video features to
classify basic emotions elicited by movie scenes. In [2]
audio was classified into music, speech and environment
signals and had been treated separately to shape an
affective feature vector. The audio affective vector was
used with video-based features such as key lighting and
visual excitement to form a scene feature vector. Finally,
the scene feature vector was classified and labeled with
emotions.
Hanjalic et al. [4] introduced “personalized content
delivery” as a valuable tool in affective indexing and
retrieval systems. In order to represent affect in video,
they first selected video- and audio- content based
features based on their relation to the valence-arousal
space that was defined as an affect model (for the
definition of affect model, see Section 1.3) [4]. Then,
arising emotions were estimated in this space by
combining these features. While arousal and valence
could be used separately for indexing, they combined
these values by following their temporal pattern in the
arousal and valence space. This allowed determining an
affect curve, shown to be useful for extracting video
highlights in a movie or sports video.
A hierarchical movie content analysis method based
on arousal and valence related features was presented by
M. Xu et al. [6]. In this method the affect of each shot
was first classified in the arousal domain using the
arousal correlated features and fuzzy clustering. The
audio short time energy and the first four Mel frequency
cepstral coefficients, MFCC (as a representation of
energy features), shot length, and the motion component
of consecutive frames were used to classify shots in
three arousal classes. Next, they used color energy,
lighting and brightness as valence related features to be
used for a HMM-based valence classification of the
previously arousal-categorized shots.
A personalized affect representation method based on
a regression approach for estimating user-felt arousal
and valence from multimedia content features and/or
from physiological responses was presented by
Soleymani et al. [7]. A relevance vector machine was
used to find linear regression weights. This allowed
predicting valence and arousal from the measured
multimedia and/or physiological data. During the
experiments, 64 video clips were shown to 8 participants
while their physiological responses were recorded; user's
self-assessments of valence and arousal served as ground
truth. A comparison was made on the arousal and
valence values obtained by different modalities which
were the physiological signals, the video- and audio-
based features, and the self-assessments. In [7] an
experiment with multiple participants has been
conducted for personalized emotion assessment based on
content analysis.
1.3. Affect and Affective representation
Russell [10] proposed a 3D continuous space called
the valence-arousal-dominance space which was based
on a self-representation of emotions from multiple
subjects. In this paper we use a valence-arousal
dimensional approach for affect representation and
annotation. The third dimensional axis, namely
dominance / control, is not used in our study. In the
valence-arousal space it is possible to represent almost
any emotion. The valence axis represents the
pleasantness of a situation, from unpleasant to pleasant;
the arousal axis expresses the degree of felt excitement,
from calm to exciting. Russell demonstrated that this
space has the advantages of being cross-cultural and that
it is possible to map labels on this space. Although the
most straightforward way to represent an emotion is to
use discrete labels such as fear, anxiety and joy, label-
based representations have several disadvantages. The
main one is that despite the universality of basic
emotions, the labels themselves are not universal. They
can be misinterpreted from one language (or culture) to
another. In addition, emotions are continuous
phenomena rather than discrete ones and labels are
unable to define the strength of an emotion.
In a dimensional approach for affect representation,
the affect of a video scene can be represented by its
coordinates in the valence-arousal space. Valence and
arousal can be determined by self reporting. The goal of
an affective representation system is to estimate user’s
valence and arousal or emotion categories in response to
each movie segment. Emotion categories are defined as
regions in the valence-arousal space. Each movie
consists of scenes and each scene consists of a sequence
of shots which are happening in the same location. A

shot is the part of a movie between two cuts which is
typically filmed without interruptions [11].
2. Methods
2.1. Arousal estimation with regression on shots
Informative features for arousal estimation include
loudness and energy of the audio signals, motion
component, visual excitement and shot duration. Using a
method similar to Hanjalic et al. [4] and to the one
proposed in [7], the felt arousal from each shot is
computed by a regression of the content features (see
Section 3 for a detailed description). In order to find the
best weights for arousal estimation using regression, a
leave one movie out strategy on the whole dataset was
used and the linear weights were computed by means of
a relevance vector machine (RVM) from the RVM
toolbox provided by Tipping [12]. The RVM is able to
reject uninformative features during its training; hence no
further feature selection was used for arousal
determination.
Equation (1) shows how the N_s audio- and video-based features z_i^k of the k-th shot are linearly combined by the weights w_i to compute the arousal â_k at the shot level:

\hat{a}_k = w_0 + \sum_{i=1}^{N_s} w_i z_i^k \qquad (1)
After computing arousal at the shot level, the average
and maximum arousals of the shots of each scene are
computed and used as arousal indicator features for the
scene affective classification. During an exciting scene
the arousal related features do not all remain at their
extreme level. In order to represent the highest arousal
of each scene, the maximum of the shots’ arousal was
chosen to be used as a feature for scene classification.
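
A minimal sketch of these two steps, assuming the weights have already been fitted and that a scene is given as a list of shot feature vectors (the names and data layout are placeholders, not the authors' code):

```python
import numpy as np

def shot_arousal(z, w, w0):
    """Equation (1): a_hat_k = w0 + sum_i w_i * z_i^k."""
    return w0 + np.dot(w, z)

def scene_arousal_features(shot_feature_vectors, w, w0):
    """Mean and maximum shot arousal, used as scene-level arousal indicators."""
    arousals = np.array([shot_arousal(z, w, w0) for z in shot_feature_vectors])
    return arousals.mean(), arousals.max()

# Toy example: one scene with 4 shots, each described by 5 content features.
rng = np.random.default_rng(0)
w, w0 = rng.normal(size=5), 0.1
scene_shots = [rng.normal(size=5) for _ in range(4)]
print(scene_arousal_features(scene_shots, w, w0))
```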
The linear regression weights that were computed
from our data set were used to determine the arousal of
each movie’s shots. This was done in such a way that all
movies from the dataset except for the one to which the
shot belonged were used as the training set for the
RVM. Any missing affective annotation for a shot was
approximated using linear interpolation from the closest
affective annotated time points in a movie.
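
The leave-one-movie-out procedure can be sketched as follows; an ordinary least-squares fit is used here as a stand-in for the RVM of [12], the annotations are assumed already interpolated, and the dictionary layout is an assumption for illustration only.

```python
import numpy as np

def leave_one_movie_out(features_by_movie, arousal_by_movie):
    """Fit regression weights on all movies but one, predict shot arousal on the held-out movie.

    features_by_movie: dict movie_id -> (n_shots, n_features) array
    arousal_by_movie:  dict movie_id -> (n_shots,) annotated arousal values
    """
    predictions = {}
    for held_out in features_by_movie:
        X_train = np.vstack([X for m, X in features_by_movie.items() if m != held_out])
        y_train = np.concatenate([y for m, y in arousal_by_movie.items() if m != held_out])
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])   # bias column for w_0
        w, *_ = np.linalg.lstsq(X_b, y_train, rcond=None)        # least-squares stand-in for the RVM
        X_test = features_by_movie[held_out]
        predictions[held_out] = np.hstack([np.ones((len(X_test), 1)), X_test]) @ w
    return predictions

rng = np.random.default_rng(1)
movies = {m: rng.normal(size=(30, 5)) for m in ("movie_a", "movie_b", "movie_c")}
annotations = {m: rng.normal(size=30) for m in movies}
print({m: p.shape for m, p in leave_one_movie_out(movies, annotations).items()})
```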
It was observed that arousal has higher linear
correlation with multimedia content-based features than
valence. Valence estimation from regression is not as
accurate as arousal estimation and therefore valence
estimation has not been performed at the shot level.
2.2. Bayesian framework and scene
classification
For the purpose of categorizing the valence-arousal
space into three affect classes, the valence-arousal space
was divided into the three areas shown in Figure 2, each
corresponding to one class. According to [13] emotions
mapped to the lower arousal category are neither
extremely pleasant nor unpleasant emotions and are
difficult to differentiate. Emotional evaluations are
shown to have a heart shaped distribution on valence-
arousal space [13]. Hence, we categorized the lower half
of the plane into one class. The points with an arousal of
zero were counted in class 1 and the points with arousal
greater than zero and valence equal to zero were
considered in class 2. These classes were used as a
simple representation for the emotion categories based
on the previous literature on emotion assessment [14].
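
A small helper reflecting this partition of the valence-arousal plane, as described above (a sketch of the stated rule, not the authors' implementation):

```python
def affect_class(valence, arousal):
    """Map a (valence, arousal) point to one of the three classes of Figure 2.

    1 = calm, 2 = positive excited, 3 = negative excited.
    """
    if arousal <= 0:      # lower half of the plane; arousal == 0 counted in class 1
        return 1
    if valence >= 0:      # valence == 0 with positive arousal counted in class 2
        return 2
    return 3

assert affect_class(0.5, -0.2) == 1
assert affect_class(0.0, 0.7) == 2
assert affect_class(-0.4, 0.7) == 3
```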
In order to characterize movie scenes into these
affective categories, the average and maximum arousal
of the shots of each scene and the low level extracted
audio- and video- based features were used to form a
feature vector. This feature vector in turn was used for
the classification.
If the content feature vector of the j-th scene is x_j, the problem of finding the emotion class ŷ_j of this scene is formulated as estimating the ŷ_j which maximizes the probability p(y_j|x_j,θ), where θ is the prior information which can include the user's preferences and the video clip's metadata. In this paper one of the prior metadata (θ) we used is for instance the genre of the movie. Personal profile parameters can also be added to θ. Since in this paper the whole affect representation is trained on the self-report of one participant, the model is assumed to be personalized for this participant. When the emotion of the previous scene is used as another prior, the scene affect probability formula changes to p(y_j|y_{j-1},x_j,θ). Assuming for simplification that the emotion of the previous scene is independent from the content features of the current scene, this probability can be reformulated as:

p(y_j \mid y_{j-1}, x_j, \theta) = \frac{p(y_{j-1} \mid y_j, \theta)\; p(y_j \mid x_j, \theta)}{p(y_{j-1} \mid \theta)} \qquad (2)
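
One way to read Equation (2): it is Bayes' rule conditioned on x_j and θ, followed by the stated independence assumption between the previous scene's emotion and the current scene's content features:

```latex
p(y_j \mid y_{j-1}, x_j, \theta)
  = \frac{p(y_{j-1} \mid y_j, x_j, \theta)\, p(y_j \mid x_j, \theta)}{p(y_{j-1} \mid x_j, \theta)}
  \approx \frac{p(y_{j-1} \mid y_j, \theta)\, p(y_j \mid x_j, \theta)}{p(y_{j-1} \mid \theta)}
```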
The classification problem is then simplified into the determination of the maximum value of the numerator of Equation (2), since the denominator will be the same for all different affect classes y_j. The priors are
established based on the empirical probabilities obtained
from the training data.

Figure 2. Three classes in the valence-arousal space are shown, namely calm (1), positive excited (2) and negative excited (3). An approximation of the heart shaped distribution of valence and arousal is shown.

For example, the occurrence probability of having a given emotion followed by any of the emotion categories was computed from the participant's self-assessments and for each genre. This allowed obtaining the p(y_{j-1}|y_j). Different methods were evaluated to estimate the posterior probability p(y_j|x_j). A
naïve Bayesian approach which assumes the conditional
probabilities are Gaussian was chosen as providing the
best performance on the dataset; the superiority of this
method can be attributed to its generalization abilities.
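
A sketch of how such empirical priors could be counted from the annotated scene sequences; the data layout, the 0-based class encoding and the additive smoothing are assumptions, not the authors' code.

```python
import numpy as np
from collections import defaultdict

def estimate_transition_priors(annotated_movies, n_classes=3, smoothing=1.0):
    """Estimate p(y_{j-1} | y_j) per genre from labeled scene sequences.

    annotated_movies: iterable of (genre, [scene class labels in temporal order]),
    with classes encoded 0 .. n_classes-1. Returns genre -> (n_classes, n_classes)
    array whose row y_j (normalized to sum to one, with additive smoothing)
    approximates the distribution of the preceding class y_{j-1}.
    """
    counts = defaultdict(lambda: np.full((n_classes, n_classes), smoothing))
    for genre, labels in annotated_movies:
        for prev, cur in zip(labels[:-1], labels[1:]):
            counts[genre][cur, prev] += 1.0
    return {g: c / c.sum(axis=1, keepdims=True) for g, c in counts.items()}

priors = estimate_transition_priors([("comedy", [0, 1, 1, 0, 1]),
                                     ("horror", [0, 2, 2, 2, 0])])
print(priors["horror"])
```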
3. Material description
A dataset of movies segmented and affectively
annotated by arousal and valence is used as the training
set. This training set consists of twenty one full length
movies (mostly popular movies). The majority of movies
were selected either because they were used in similar
studies (e.g. [15]), or because they were recent and
popular. The dataset included four genres: drama,
horror, action, and comedy. The following three
information streams were extracted from the media:
video (visual), sound (auditory), and, subtitles (textual).
The video stream of the movies has been segmented at
the shot level using the OMT shot segmentation software
and manually segmented into scenes [16;17]. Movie
videos were encoded into the MPEG-1 format to extract
motion vectors and I frames for further feature
extraction. We used the OVAL library (Object-based
Video Access Library) [18] to capture video frames and
extract motion vectors.
The second information stream, namely sound, has an
important impact on user’s affect. For example
according to the findings of Picard [19], loudness of
speech (energy) is related to evoked arousal, while
rhythm and average pitch in speech signals are related to
valence. The audio channels of the movies were
extracted and encoded into monophonic information
(MPEG layer 3 format) at a sampling rate of 48 kHz. All
of the resulting audio signals were normalized to the
same amplitude range before further processing.
Textual features were also extracted from the subtitles
track of the movies. According to [9] the semantic
analysis of the textual information can improve affect
classification. As the semantic analysis over the textual
data was not the focus of our work we extracted simple
features from subtitles by tokenizing the text and
counting the number of words. These statistics have
been used with the timing of the subtitles to extract the
talking rate feature which is the number of words that
had been spoken per second during the subtitle display time.
The other extracted feature is the number of spoken
words in a scene divided by the length of the scene,
which can represent the amount or existence of
dialogues in a scene. A list of the movies in the dataset
and their corresponding genre is given in Table 1.
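
The talking-rate and dialogue-density features described above can be sketched as follows, assuming subtitles have already been parsed into (start_s, end_s, text) tuples (the parsing step and the field names are assumptions):

```python
def talking_rate(subtitles):
    """Words per second over the total subtitle display time."""
    words = sum(len(text.split()) for _, _, text in subtitles)
    shown = sum(end - start for start, end, _ in subtitles)
    return words / shown if shown > 0 else 0.0

def dialogue_density(subtitles, scene_length_s):
    """Number of spoken words in a scene divided by the scene length in seconds."""
    words = sum(len(text.split()) for _, _, text in subtitles)
    return words / scene_length_s

subs = [(0.0, 2.5, "Where are we going?"), (3.0, 5.0, "Somewhere quiet.")]
print(talking_rate(subs), dialogue_density(subs, scene_length_s=30.0))
```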
3.1. Audio features
A total of 53 low-level audio features were
determined for each of the audio signals. These features,
listed in Table 2, are commonly used in audio and
speech processing and audio classification [20;21].
Wang et al. [2] demonstrated the relationship between
audio type’s proportions (for example, the proportion of
music in an audio segment) and affect, where these
proportions refer to the respective duration of music,
speech, environment, and silence in the audio signal of a
video clip. To determine the three important audio types
(music, speech, environment), we implemented a three
class audio type classifier using support vector machines
(SVM) operating on audio low-level features in a one
second segment. Before classification, silence had been
identified by comparing the average audio signal energy
of each sound segment (using the averaged square
magnitude in a time window) with a pre-defined
threshold empirically extracted from the first seven
percent of the audio energy histogram. This audio
histogram was computed from a randomly selected 30
minutes segment of each movie’s audio stream.
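
The energy-threshold idea can be illustrated as below; the window handling, the use of a percentile to approximate "the first seven percent of the energy histogram", and the synthetic audio are assumptions beyond what is stated in the text.

```python
import numpy as np

def frame_energies(signal, frame_len):
    """Average squared magnitude per non-overlapping time window."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    return (frames ** 2).mean(axis=1)

def silence_mask(signal, frame_len, reference_signal=None, percent=7.0):
    """Flag windows whose energy falls below the empirical threshold.

    The threshold is taken from the lowest `percent` of the energy distribution of
    a reference excerpt (here, the signal itself if no excerpt is given).
    """
    ref = reference_signal if reference_signal is not None else signal
    threshold = np.percentile(frame_energies(ref, frame_len), percent)
    return frame_energies(signal, frame_len) < threshold

# Synthetic 10 s signal at 48 kHz whose loudness ramps up over time.
audio = np.random.randn(48000 * 10) * np.linspace(0.0, 1.0, 48000 * 10)
print(silence_mask(audio, frame_len=48000).sum(), "silent one-second windows")
```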
Table 2. Low-level features extracted from audio signals.

Feature category: Extracted features
MFCC: MFCC coefficients (13 features) [20], Derivative of MFCC (13 features), Autocorrelation of MFCC (13 features)
Energy: Average energy of audio signal [20]
Formants: Formants up to 5500 Hz (female voice) (five features)
Time frequency: Spectrum flux, Spectral centroid, Delta spectrum magnitude, Band energy ratio [20;21]
Pitch: First pitch frequency
Zero crossing rate: Average, Standard deviation [20]
Silence ratio: Proportion of silence in a time window [24]
After removing silence, the remaining audio signals
were classified by an SVM with a polynomial kernel,
using the LIBSVM toolbox
(http://www.csie.ntu.edu.tw/~cjlin/libsvm/). The SVM
was trained on about three hours of audio, extracted
from movies (not from the dataset of this paper) and
labeled manually. Despite the fact that in various cases
the audio type classes were overlapping (e.g. presence of
a musical background during a dialogue), the classifier
was usually able to recognize the dominant audio type
with an accuracy of about 80%.
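
A minimal stand-in for this audio-type classifier, using scikit-learn's SVC with a polynomial kernel in place of LIBSVM; the random features and labels only stand in for the real one-second segments and manual annotations, and this is not the authors' training setup.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Rows: 53 low-level features per one-second segment; labels: 0=music, 1=speech, 2=environment.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(300, 53))
y_train = np.arange(300) % 3

audio_type_clf = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3))
audio_type_clf.fit(X_train, y_train)

segment = rng.normal(size=(1, 53))
print(audio_type_clf.predict(segment))  # predicted dominant audio type for one segment
```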
The classification results were used to compute the ratio of
each audio type in a movie segment.

Table 1. List of the movies in the dataset.

Drama movies: The pianist, Blood diamond, Hotel Rwanda, Apocalypse now, American history X, Hannibal
Comedy movies: Man on the moon, Mr. Bean's holiday, Love actually, Shaun of the dead, Shrek
Horror movies: Silent hill, Ringu (Japanese), 28 days later, The shining
Action movies: Man on Fire, Kill Bill Vol. 1, Kill Bill Vol. 2, Platoon, The thin red line, Gangs of New York

Citations
Journal ArticleDOI
TL;DR: A multimodal data set for the analysis of human affective states was presented and a novel method for stimuli selection is proposed using retrieval by affective tags from the last.fm website, video highlight detection, and an online assessment tool.
Abstract: We present a multimodal data set for the analysis of human affective states. The electroencephalogram (EEG) and peripheral physiological signals of 32 participants were recorded as each watched 40 one-minute long excerpts of music videos. Participants rated each video in terms of the levels of arousal, valence, like/dislike, dominance, and familiarity. For 22 of the 32 participants, frontal face video was also recorded. A novel method for stimuli selection is proposed using retrieval by affective tags from the last.fm website, video highlight detection, and an online assessment tool. An extensive analysis of the participants' ratings during the experiment is presented. Correlates between the EEG signal frequencies and the participants' ratings are investigated. Methods and results are presented for single-trial classification of arousal, valence, and like/dislike ratings using the modalities of EEG, peripheral physiological signals, and multimedia content analysis. Finally, decision fusion of the classification results from different modalities is performed. The data set is made publicly available and we encourage other researchers to use it for testing their own affective state estimation methods.

3,013 citations


Cites background or methods from "A Bayesian framework for video affe..."


  • ...We propose here a semi-automated method for stimulus selection, with the goal of minimizing the bias arising from the manual stimuli selection....

    [...]

  • ...For each found affective tag, the 10 songs most often labeled with this tag were selected....

    [...]

Journal ArticleDOI
TL;DR: The results over a population of 24 participants demonstrate that user-independent emotion recognition can outperform individual self-reports for arousal assessments and do not underperform for valence assessments.
Abstract: This paper presents a user-independent emotion recognition method with the goal of recovering affective tags for videos using electroencephalogram (EEG), pupillary response and gaze distance. We first selected 20 video clips with extrinsic emotional content from movies and online resources. Then, EEG responses and eye gaze data were recorded from 24 participants while watching emotional video clips. Ground truth was defined based on the median arousal and valence scores given to clips in a preliminary study using an online questionnaire. Based on the participants' responses, three classes for each dimension were defined. The arousal classes were calm, medium aroused, and activated and the valence classes were unpleasant, neutral, and pleasant. One of the three affective labels of either valence or arousal was determined by classification of bodily responses. A one-participant-out cross validation was employed to investigate the classification performance in a user-independent approach. The best classification accuracies of 68.5 percent for three labels of valence and 76.4 percent for three labels of arousal were obtained using a modality fusion strategy and a support vector machine. The results over a population of 24 participants demonstrate that user-independent emotion recognition can outperform individual self-reports for arousal assessments and do not underperform for valence assessments.

582 citations


Cites background from "A Bayesian framework for video affe..."

  • ...[3] M. Soleymani, G. Chanel, J. J. M. Kierkels, and T. Pun,“ Affective Characterization of Movie Scenes Based on Content Analysis and Physiological Changes,”International Journal of Semantic Computing, vol. 3, no. 2, pp. 235-254, June 2009....

    [...]

Journal ArticleDOI
TL;DR: A large video database, namely LIRIS-ACCEDE, is proposed, which consists of 9,800 good quality video excerpts with a large content diversity and provides four experimental protocols and a baseline for prediction of emotions using a large set of both visual and audio features.
Abstract: Research in affective computing requires ground truth data for training and benchmarking computational models for machine-based emotion understanding. In this paper, we propose a large video database, namely LIRIS-ACCEDE, for affective content analysis and related applications, including video indexing, summarization or browsing. In contrast to existing datasets with very few video resources and limited accessibility due to copyright constraints, LIRIS-ACCEDE consists of 9,800 good quality video excerpts with a large content diversity. All excerpts are shared under creative commons licenses and can thus be freely distributed without copyright issues. Affective annotations were achieved using crowdsourcing through a pair-wise video comparison protocol, thereby ensuring that annotations are fully consistent, as testified by a high inter-annotator agreement, despite the large diversity of raters’ cultural backgrounds. In addition, to enable fair comparison and landmark progresses of future affective computational models, we further provide four experimental protocols and a baseline for prediction of emotions using a large set of both visual and audio features. The dataset (the video clips, annotations, features and protocols) is publicly available at: http://liris-accede.ec-lyon.fr/.

270 citations


Additional excerpts

  • ...Bayesian framework for video affective representation [25]...

    [...]

Journal ArticleDOI
TL;DR: A general framework for video affective content analysis is proposed, which includes video content, emotional descriptors, and users' spontaneous nonverbal responses, as well as the relationships between the three.
Abstract: Video affective content analysis has been an active research area in recent decades, since emotion is an important component in the classification and retrieval of videos. Video affective content analysis can be divided into two approaches: direct and implicit. Direct approaches infer the affective content of videos directly from related audiovisual features. Implicit approaches, on the other hand, detect affective content from videos based on an automatic analysis of a user’s spontaneous response while consuming the videos. This paper first proposes a general framework for video affective content analysis, which includes video content, emotional descriptors, and users’ spontaneous nonverbal responses, as well as the relationships between the three. Then, we survey current research in both direct and implicit video affective content analysis, with a focus on direct video affective content analysis . Lastly, we identify several challenges in this field and put forward recommendations for future research.

158 citations

Proceedings ArticleDOI
03 Nov 2014
TL;DR: A fast and effective heuristic ranking approach based on heterogeneous late fusion by jointly considering three aspects: venue categories, visual scene, and user listening history that recommends appealing soundtracks for UGVs to enhance the viewing experience is proposed.
Abstract: Capturing videos anytime and anywhere, and then instantly sharing them online, has become a very popular activity. However, many outdoor user-generated videos (UGVs) lack a certain appeal because their soundtracks consist mostly of ambient background noise. Aimed at making UGVs more attractive, we introduce ADVISOR, a personalized video soundtrack recommendation system. We propose a fast and effective heuristic ranking approach based on heterogeneous late fusion by jointly considering three aspects: venue categories, visual scene, and user listening history. Specifically, we combine confidence scores, produced by SVMhmm models constructed from geographic, visual, and audio features, to obtain different types of video characteristics. Our contributions are threefold. First, we predict scene moods from a real-world video dataset that was collected from users' daily outdoor activities. Second, we perform heuristic rankings to fuse the predicted confidence scores of multiple models, and third we customize the video soundtrack recommendation functionality to make it compatible with mobile devices. A series of extensive experiments confirm that our approach performs well and recommends appealing soundtracks for UGVs to enhance the viewing experience.

77 citations


Cites background from "A Bayesian framework for video affe..."

  • ...[27] introduced a Bayesian classification framework for affective video tagging which takes contextual information into account since emotions that are elicited in response to a video scene contain valuable information for multimedia indexing and tagging....

    [...]

  • ...There exist a few approaches [6,27,29] to recognize emotions from videos but the field of video soundtrack recommendation for UGVs [24,34] is largely unexplored....

    [...]

References
01 Sep 2000
TL;DR: FEELTRACE has resolving power comparable to an emotion vocabulary of 20 non-overlapping words, with the advantage of allowing intermediate ratings, and above all, the ability to track impressions continuously.
Abstract: FEELTRACE is an instrument developed to let observers track the emotional content of a stimulus as they perceive it over time, allowing the emotional dynamics of speech episodes to be examined. It is based on activation-evaluation space, a representation derived from psychology. The activation dimension measures how dynamic the emotional state is; the evaluation dimension is a global measure of the positive or negative feeling associated with the state. Research suggests that the space is naturally circular, i.e. states which are at the limit of emotional intensity define a circle, with alert neutrality at the centre. To turn those ideas into a recording tool, the space was represented by a circle on a computer screen, and observers described perceived emotional state by moving a pointer (in the form of a disc) to the appropriate point in the circle, using a mouse. Prototypes were tested, and in the light of results, refinements were made to ensure that outputs were as consistent and meaningful as possible. They include colour coding the pointer in a way that users readily associate with the relevant emotional state; presenting key emotion words as ‘landmarks’ at the strategic points in the space; and developing an induction procedure to introduce observers to the system. An experiment assessed the reliability of the developed system. Stimuli were 16 clips from TV programs, two showing relatively strong emotions in each quadrant of activation-evaluation space, each paired with one of the same person in a relatively neutral state. 24 raters took part. Differences between clips chosen to contrast were statistically robust. Results were plotted in activation-evaluation space as ellipses, each with its centre at the mean co-ordinates for the clip, and its width proportional to standard deviation across raters. The size of the ellipses meant that about 25 could be fitted into the space, i.e. FEELTRACE has resolving power comparable to an emotion vocabulary of 20 non-overlapping words, with the advantage of allowing intermediate ratings, and above all, the ability to track impressions continuously.

568 citations


"A Bayesian framework for video affe..." refers methods in this paper

  • ...Ground truth was obtained through manual annotation with a FEELTRACE-like [8] annotation tool with the self-assessments serving as the classification ground-truth....

    [...]

Journal Article
TL;DR: The authors explored several different measurement formats including: verbal self-reports (adjective checklists), physiological techniques, photodecks, and dial-turning instruments.
Abstract: Although consumer research began focusing on emotional response to advertising during the 1980s (Goodstein, Edell, and Chapman Moore. 1990; Burke and Edell, 1989; Aaker, Stayman, and Vezina, 1988; Holbrook and Batra, 1988), perhaps one of the most practical measures of affective response has only recently emerged. Part of the difficulty in developing measures of emotional response stems from the complexity of emotion itself (Plummer and Leckenby, 1985). Researchers have explored several different measurement formats including: verbal self-reports (adjective checklists), physiological techniques, photodecks, and dial-turning instruments.

484 citations

Journal ArticleDOI
TL;DR: This work describes a scheme that is able to classify audio segments into seven categories consisting of silence, single speaker speech, music, environmental noise, multiple speakers' speech, simultaneous speech and music, and speech and noise, and shows that cepstral-based features such as the Mel-frequency cepstral coefficients (MFCC) and linear prediction coefficients (LPC) provide better classification accuracy compared to temporal and spectral features.

315 citations

Journal ArticleDOI
TL;DR: A systematic approach grounded upon psychology and cinematography is developed to address several important issues in affective understanding and a holistic method of extracting affective information from the multifaceted audio stream has been introduced.
Abstract: Affective understanding of film plays an important role in sophisticated movie analysis, ranking and indexing. However, due to the seemingly inscrutable nature of emotions and the broad affective gap from low-level features, this problem is seldom addressed. In this paper, we develop a systematic approach grounded upon psychology and cinematography to address several important issues in affective understanding. An appropriate set of affective categories are identified and steps for their classification developed. A number of effective audiovisual cues are formulated to help bridge the affective gap. In particular, a holistic method of extracting affective information from the multifaceted audio stream has been introduced. Besides classifying every scene in Hollywood domain movies probabilistically into the affective categories, some exciting applications are demonstrated. The experimental results validate the proposed approach and the efficacy of the audiovisual cues.

273 citations


"A Bayesian framework for video affe..." refers background in this paper

  • ...These studies were mostly based on content analysis and textual tags [2]....

    [...]

Proceedings ArticleDOI
Lie Lu1, Hao Jiang1, Hong-Jiang Zhang1
01 Oct 2001
TL;DR: A robust algorithm that is capable of segmenting and classifying an audio stream into speech, music, environment sound and silence is presented and some new features such as the noise frame ratio and band periodicity are introduced.
Abstract: In this paper, we present a robust algorithm for audio classification that is capable of segmenting and classifying an audio stream into speech, music, environment sound and silence. Audio classification is processed in two steps, which makes it suitable for different applications. The first step of the classification is speech and non-speech discrimination. In this step, a novel algorithm based on KNN and LSP VQ is presented. The second step further divides non-speech class into music, environment sounds and silence with a rule based classification scheme. Some new features such as the noise frame ratio and band periodicity are introduced and discussed in detail. Our experiments in the context of video structure parsing have shown the algorithms produce very satisfactory results.

266 citations

Frequently Asked Questions (1)
Q1. What are the contributions in this paper?

The novelty of this paper is to introduce a Bayesian classification framework for affective video tagging that allows taking contextual information into account.