
A Bayesian framework for video affective representation

TL;DR: A Bayesian classification framework for affective video tagging that allows taking contextual information into account is introduced and two contextual priors have been proposed: the movie genre prior, and the temporal dimension prior consisting of the probability of transition between emotions in consecutive scenes.
Abstract: Emotions that are elicited in response to a video scene contain valuable information for multimedia tagging and indexing. The novelty of this paper is to introduce a Bayesian classification framework for affective video tagging that allows taking contextual information into account. A set of 21 full length movies was first segmented and informative content-based features were extracted from each shot and scene. Shots were then emotionally annotated, providing ground truth affect. The arousal of shots was computed using a linear regression on the content-based features. Bayesian classification based on the shots arousal and content-based features allowed tagging these scenes into three affective classes, namely calm, positive excited and negative excited. To improve classification accuracy, two contextual priors have been proposed: the movie genre prior, and the temporal dimension prior consisting of the probability of transition between emotions in consecutive scenes. The f1 classification measure of 54.9% that was obtained on three emotional classes with a naive Bayes classifier was improved to 63.4% after utilizing all the priors.

Summary (3 min read)

1.1. Overview

  • Video and audio on-demand systems are getting more and more popular and are likely to replace traditional TVs.
  • The enormous mass of digital multimedia content with its huge variety requires more efficient multimedia management methods.
  • Studies on multimedia retrieval have mostly been based on content analysis and textual tags [2].
  • Then, the affect representation system fuses the extracted features, the stored personal information, and the metadata to represent the evoked emotion.
  • Section 3 details the movie dataset used and the features that have been extracted.

1.2. State of the art

  • Video affect representation requires understanding of the intensity and type of user’s affect while watching a video.
  • In order to represent affect in video, they first selected video- and audio- content based features based on their relation to the valence-arousal space that was defined as an affect model (for the definition of affect model, see Section 1.3) [4].
  • Next, they used color energy, lighting and brightness as valence related features to be used for a HMM-based valence classification of the previously arousal-categorized shots.
  • A personalized affect representation method based on a regression approach for estimating user-felt arousal and valence from multimedia content features and/or from physiological responses was presented by Soleymani et al. [7].
  • A relevance vector machine was used to find linear regression weights.

1.3. Affect and Affective representation

  • Russell [10] proposed a 3D continuous space called the valence-arousal-dominance space which was based on a self-representation of emotions from multiple subjects.
  • The valence axis represents the pleasantness of a situation, from unpleasant to pleasant; the arousal axis expresses the degree of felt excitement, from calm to exciting.
  • Although the most straightforward way to represent an emotion is to use discrete labels such as fear, anxiety and joy, label-based representations have several disadvantages.
  • The main one is that despite the universality of basic emotions, the labels themselves are not universal.
  • Each movie consists of scenes and each scene consists of a sequence of shots which are happening in the same location.

2.1. Arousal estimation with regression on shots

  • Informative features for arousal estimation include loudness and energy of the audio signals, motion component, visual excitement and shot duration.
  • The RVM is able to reject uninformative features during its training hence no further feature selection was used for arousal determination.
  • After computing arousal at the shot level with the linear model of Equation (1), $\hat{a}_k = w_0 + \sum_{i=1}^{N_s} w_i z_i^k$, the average and maximum arousals of the shots of each scene are computed and used as arousal indicator features for the scene affective classification (a short sketch of this aggregation follows this list).
  • During an exciting scene the arousal related features do not all remain at their extreme level.
  • This was done in such a way that all movies from the dataset except the one to which the shot belonged were used as the training set for the RVM.
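
To make the shot-to-scene aggregation step concrete, here is a minimal sketch (not the authors' code), assuming the regression weights w with intercept w0 have already been obtained and that a per-shot scene index is available; all names are illustrative.

    import numpy as np

    def shot_arousal(Z, w, w0):
        # Equation (1): a_hat_k = w0 + sum_i w_i * z_ik, evaluated for all shots at once.
        # Z has one row of content-based features per shot.
        return Z @ w + w0

    def scene_arousal_features(shot_arousals, scene_ids):
        # Per-scene arousal indicators: mean and maximum of the shots' arousal.
        feats = {}
        for sid in np.unique(scene_ids):
            a = shot_arousals[scene_ids == sid]
            feats[sid] = (a.mean(), a.max())
        return feats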

2.2. Bayesian framework and scene classification

  • For the purpose of categorizing the valence-arousal space into three affect classes, the space was divided into the three areas shown in Figure 2, each corresponding to one class (a mapping sketch follows this list).
  • Hence, the authors categorized the lower half of the plane into one class.
  • These classes were used as a simple representation for the emotion categories based on the previous literature on emotion assessment [14].
  • This feature vector in turn was used for the classification.
  • Different methods were evaluated to estimate the posterior probability p(yj|xj).
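
A small sketch of the three-way categorization described above, following the partition of Figure 2 in the full text (arousal at or below zero is treated as calm, and zero valence with positive arousal as positive excited); this illustrates the stated rule and is not code from the paper.

    def affect_class(valence, arousal):
        # Figure 2 partition of the valence-arousal plane into three classes.
        if arousal <= 0:
            return 1  # calm (lower half of the plane, arousal == 0 included)
        if valence >= 0:
            return 2  # positive excited (valence == 0 with positive arousal included)
        return 3      # negative excited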

3. Material description

  • A dataset of movies segmented and affectively annotated by arousal and valence is used as the training set.
  • The majority of movies were selected either because they were used in similar studies (e.g. [15]), or because they were recent and popular.
  • Movie videos were encoded into the MPEG-1 format to extract motion vectors and I frames for further feature extraction.
  • The second information stream, namely sound, has an important impact on user’s affect.
  • Textual features were also extracted from the subtitles track of the movies.

3.1. Audio features

  • A total of 53 low-level audio features were determined for each of the audio signals.
  • To determine the three important audio types (music, speech, environment), the authors implemented a three class audio type classifier using support vector machines (SVM) operating on audio low-level features in a one second segment.
  • Extracted feature categories include MFCC coefficients, their derivatives and autocorrelations (13 features each), the average energy of the audio signal [20], and time-frequency descriptors such as spectrum flux, spectral centroid, delta spectrum magnitude and band energy ratio [20;21] (a short feature-extraction sketch follows this list).
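
The paper does not name the toolchain used for the low-level audio descriptors; as a hedged illustration, the sketch below computes a few of the listed features (MFCCs, their derivatives, energy, spectral centroid, zero-crossing rate) with the librosa library. Function and variable names are illustrative.

    import numpy as np
    import librosa

    def lowlevel_audio_features(path, n_mfcc=13):
        # Load the monophonic audio track and compute a subset of the descriptors
        # listed in Table 2, each summarized by its mean over time.
        y, sr = librosa.load(path, sr=None, mono=True)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        d_mfcc = librosa.feature.delta(mfcc)                       # derivative of MFCC
        energy = librosa.feature.rms(y=y)                          # average-energy proxy
        centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
        zcr = librosa.feature.zero_crossing_rate(y)
        return np.concatenate([f.mean(axis=1) for f in (mfcc, d_mfcc, energy, centroid, zcr)])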

3.2. Visual features

  • From a movie director's point of view, lighting key [2;23] and color variance [2] are important tools to evoke emotions.
  • The average shot change rate, and shot length variance were extracted to characterize video rhythm.
  • Fast moving scenes or objects' movements in consecutive frames are also an effective factor for evoking excitement.
  • Colors and their proportions are important parameters to elicit emotions [17].
  • In order to use colors in the list of video features, a 20-bin color histogram of hue and lightness values in the HSV space was computed for each I frame and subsequently averaged over all frames (a sketch follows this list).
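
The summary mentions hue and lightness, so the sketch below assumes an HLS-style conversion (OpenCV's COLOR_BGR2HLS); the paper's exact color space and binning are not restated here, and all names are illustrative.

    import cv2
    import numpy as np

    def average_color_histogram(frames, bins=20):
        # frames: iterable of BGR images (e.g. decoded I-frames of a shot or scene).
        hists = []
        for frame in frames:
            hls = cv2.cvtColor(frame, cv2.COLOR_BGR2HLS)
            h_hue, _ = np.histogram(hls[..., 0], bins=bins, range=(0, 180), density=True)
            h_light, _ = np.histogram(hls[..., 1], bins=bins, range=(0, 256), density=True)
            hists.append(np.concatenate([h_hue, h_light]))
        return np.mean(hists, axis=0)   # histogram averaged over all frames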

3.3. Affective annotation

  • The coordinates of a pointer manipulated by the user are continuously recorded during the show time of the stimuli (video, image, or external source) and used as the affect indicators.
  • A set of SAM manikins (Self-Assessment Manikins [25]) are generated for different combinations of arousal and valence to help the user understand the emotions related to the regions of valence-arousal space.
  • E.g. the positive excited manikin is generated by combining the positive manikin and the excited manikin.
  • The participant was asked to annotate the movies so as to indicate at which times his/her felt emotion had changed.
  • The participant was asked to indicate at least one point during each scene, so as not to leave any scene without an assessment.

4.1. Arousal estimation of shots

  • Figure 4 shows a sample arousal curve from part of the film entitled “Silent Hill”.
  • The participant’s felt emotion was however not completely in agreement with the estimated curve, as can for instance be observed in the second half of the plot.
  • A possible cause for the discrepancy is the low temporal resolution of the self-assessment.
  • Another possible cause is experimental weariness: after having had exciting stimuli for minutes, a participant's arousal might be decreasing despite strong movements in the video and loud audio.
  • Finally, some emotional feelings might simply not be captured by low-level features; this would for instance be the case for a racist comment in a movie dialogue which evokes disgust for a participant.

4.2. Classification results

  • For the ten-fold cross-validation, the original samples (movie scenes) were partitioned into 10 subsample sets; classification was evaluated with the f1 measure defined in Equation (3) (a brief sketch follows this list).
  • The naïve Bayesian classifier results are shown in Table 3-a.
  • As with the temporal prior, the genre prior leads to a better estimate of the emotion class.
  • The evolution of classification results over consecutive scenes when adding the time prior shows that this prior allows correcting results for some samples that were misclassified using the genre prior only.
  • Using physiological signals or audiovisual recordings will help overcome these problems and facilitate this part of the work, by yielding continuous affective annotations without interrupting the user [7].
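
A minimal sketch of the evaluation protocol (ten-fold cross-validation with a Gaussian naive Bayes scene classifier); the macro-averaged f1 is assumed here, since the averaging convention is not spelled out in the summary, and the names are illustrative.

    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    def tenfold_f1(X, y):
        # X: scene feature vectors, y: the three affective class labels.
        clf = GaussianNB()  # naive Bayes with Gaussian class-conditional densities
        scores = cross_val_score(clf, X, y, cv=10, scoring="f1_macro")
        return scores.mean()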

5. Conclusions and perspectives

  • An affective representation system for estimating felt emotions at the scene level has been proposed using a Bayesian classification framework that allows taking some form of context into account.
  • Results showed the advantage of using well chosen priors, such as temporal information provided by the previous scene emotion, and movie genre.
  • The f1 classification measure of 54.9% that was obtained on three emotional classes with a naïve Bayesian classifier was improved to 56.5% and 59.5% when using only the time prior and the genre prior, respectively.
  • This measure finally improved to 63.4% after utilizing all the priors.
  • It will also provide us with a better understanding of the feasibility of using group-wise profiles containing some affective characteristics that are shared between users.


SOLEYMANI, Mohammad, et al. A Bayesian Framework for Video Affective Representation. In: 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII 2009): proceedings, 2009, p. 1-7.
DOI: 10.1109/ACII.2009.5349563
Available at: http://archive-ouverte.unige.ch/unige:47663

Abstract
Emotions that are elicited in response to a video
scene contain valuable information for multimedia
tagging and indexing. The novelty of this paper is to
introduce a Bayesian classification framework for
affective video tagging that allows taking contextual
information into account. A set of 21 full length movies
was first segmented and informative content-based
features were extracted from each shot and scene. Shots
were then emotionally annotated, providing ground
truth affect. The arousal of shots was computed using a
linear regression on the content-based features.
Bayesian classification based on the shots arousal and
content-based features allowed tagging these scenes
into three affective classes, namely calm, positive
excited and negative excited. To improve classification
accuracy, two contextual priors have been proposed:
the movie genre prior, and the temporal dimension prior
consisting of the probability of transition between
emotions in consecutive scenes. The f1 classification
measure of 54.9% that was obtained on three emotional
classes with a naïve Bayes classifier was improved to
63.4% after utilizing all the priors.
1. Introduction
1.1. Overview
Video and audio on-demand systems are getting more
and more popular and are likely to replace traditional
TVs. Online video content has been growing rapidly in
the last five years. For example, the open access online
video database YouTube had a watching rate of more
than 100 million videos per day in 2006 [1]. The
enormous mass of digital multimedia content with its
huge variety requires more efficient multimedia
management methods. Many studies have been
conducted in the last decade to increase the accuracy of
current multimedia retrieval systems. These studies were
mostly based on content analysis and textual tags [2].
Although the emotional preferences of a user play an
important role in multimedia content selection, few
publications exist in the field of affective indexing which
consider emotional preferences of users [2-7].
The present study is focused on movies because they
represent one of the most common and popular types of
multimedia content. An affective representation of
scenes will be useful for tagging, indexing and
highlighting of important parts in a movie. We believe
that using the existing online metadata can improve the
affective representation and classification of movies.
Such metadata, like movie genre, is available on the internet
(e.g. the Internet Movie Database, http://www.imdb.com).
Movie genre can be exploited to improve an affect
representation system's inference about the possible
emotion which is going to be elicited in the audience.
For example, the probability of a happy scene in a
comedy certainly differs from that in a drama. Moreover,
the temporal order of the evoked emotions, which can be
modeled by the probability of emotion transition in
consecutive scenes, is also expected to be useful for the
improvement of an affective representation system.
It is shown here how to benefit from the proposed
priors in a Bayesian classification framework. Affect
classification was done for a three-label scene
classification problem, where the labels are “calm”,
“positive excited”, and “negative excited”. Ground truth
was obtained through manual annotation with a
FEELTRACE-like [8] annotation tool, with the self-
assessments serving as the classification ground truth.
The usefulness of priors is shown by comparing
classification results with or without using them.
A Bayesian Framework for Video Affective Representation
Mohammad Soleymani, Joep J.M. Kierkels, Guillaume Chanel, Thierry Pun
Computer Vision and Multimedia Laboratory, Computer Science Department
University of Geneva
Battelle Building A, Rte. De Drize 7,
CH - 1227 Carouge, Geneva, Switzerland
{mohammad.soleymani, joep.kierkels, guillaume.chanel, thierry.pun}@unige.ch
http://cvml.unige.ch

[Figure 1. A diagram of the proposed video affective representation: metadata (genre, rating, ...) retrieved from the internet, textual, video and audio feature extraction into a database, the user's personal profile (gender, age, etc.), and the affect representation system that outputs the affect.]

In our proposed affective indexing and retrieval
system, different modalities, such as video, audio, and
textual data (subtitles) of a movie will be used for
feature extraction. Figure 1 shows the diagram of such a
system. The feature extraction block extracts features

from the three modalities and stores them in a database.
Then, the affect representation system fuses the
extracted features, the stored personal information, and
the metadata to represent the evoked emotion. For a
personalized retrieval, a personal profile of a user (with
his/her gender, age, location, social network) will help
the affective retrieval process.
The paper is organized as follows. A review of the
current state of the art and an explanation on affect and
affective representation are given in the following
subsections of the first Section. Methods used including
the arousal representation at the shot level and affect
classification at the scene level are given in Section 2.
Section 3 details the movie dataset used and the features
that have been extracted. The obtained classification
results at the scene level, the comparisons with and
without using genre and temporal priors are discussed in
Section 4. Section 5 concludes the article and offers
perspectives for future work.
1.2. State of the art
Video affect representation requires understanding of
the intensity and type of user’s affect while watching a
video. There are only a limited number of studies on
content-based affective representation of movies. Wang
and Cheong [2] used content audio and video features to
classify basic emotions elicited by movie scenes. In [2]
audio was classified into music, speech and environment
signals and had been treated separately to shape an
affective feature vector. The audio affective vector was
used with video-based features such as key lighting and
visual excitement to form a scene feature vector. Finally,
the scene feature vector was classified and labeled with
emotions.
Hanjalic et al. [4] introduced “personalized content
delivery” as a valuable tool in affective indexing and
retrieval systems. In order to represent affect in video,
they first selected video- and audio- content based
features based on their relation to the valence-arousal
space that was defined as an affect model (for the
definition of affect model, see Section 1.3) [4]. Then,
arising emotions were estimated in this space by
combining these features. While arousal and valence
could be used separately for indexing, they combined
these values by following their temporal pattern in the
arousal and valence space. This allowed determining an
affect curve, shown to be useful for extracting video
highlights in a movie or sports video.
A hierarchical movie content analysis method based
on arousal and valence related features was presented by
M. Xu et al. [6]. In this method the affect of each shot
was first classified in the arousal domain using the
arousal correlated features and fuzzy clustering. The
audio short time energy and the first four Mel frequency
cepstral coefficients, MFCC (as a representation of
energy features), shot length, and the motion component
of consecutive frames were used to classify shots in
three arousal classes. Next, they used color energy,
lighting and brightness as valence related features to be
used for a HMM-based valence classification of the
previously arousal-categorized shots.
A personalized affect representation method based on
a regression approach for estimating user-felt arousal
and valence from multimedia content features and/or
from physiological responses was presented by
Soleymani et al. [7]. A relevance vector machine was
used to find linear regression weights. This allowed
predicting valence and arousal from the measured
multimedia and/or physiological data. During the
experiments, 64 video clips were shown to 8 participants
while their physiological responses were recorded; user's
self-assessments of valence and arousal served as ground
truth. A comparison was made on the arousal and
valence values obtained by different modalities which
were the physiological signals, the video- and audio-
based features, and the self-assessments. In [7], an
experiment with multiple participants was
conducted for personalized emotion assessment based on
content analysis.
1.3. Affect and Affective representation
Russell [10] proposed a 3D continuous space called
the valence-arousal-dominance space which was based
on a self-representation of emotions from multiple
subjects. In this paper we use a valence-arousal
dimensional approach for affect representation and
annotation. The third dimensional axis, namely
dominance / control, is not used in our study. In the
valence-arousal space it is possible to represent almost
any emotion. The valence axis represents the
pleasantness of a situation, from unpleasant to pleasant;
the arousal axis expresses the degree of felt excitement,
from calm to exciting. Russell demonstrated that this
space has the advantages of being cross-cultural and that
it is possible to map labels on this space. Although the
most straightforward way to represent an emotion is to
use discrete labels such as fear, anxiety and joy, label-
based representations have several disadvantages. The
main one is that despite the universality of basic
emotions, the labels themselves are not universal. They
can be misinterpreted from one language (or culture) to
another. In addition, emotions are continuous
phenomena rather than discrete ones and labels are
unable to define the strength of an emotion.
In a dimensional approach for affect representation,
the affect of a video scene can be represented by its
coordinates in the valence-arousal space. Valence and
arousal can be determined by self reporting. The goal of
an affective representation system is to estimate user’s
valence and arousal or emotion categories in response to
each movie segment. Emotion categories are defined as
regions in the valence-arousal space. Each movie
consists of scenes and each scene consists of a sequence
of shots which are happening in the same location. A

shot is the part of a movie between two cuts which is
typically filmed without interruptions [11].
2. Methods
2.1. Arousal estimation with regression on shots
Informative features for arousal estimation include
loudness and energy of the audio signals, motion
component, visual excitement and shot duration. Using a
method similar to Hanjalic et al. [4] and to the one
proposed in [7], the felt arousal from each shot is
computed by a regression of the content features (see
Section 3 for a detailed description). In order to find the
best weights for arousal estimation using regression, a
leave one movie out strategy on the whole dataset was
used and the linear weights were computed by means of
a relevance vector machine (RVM) from the RVM
toolbox provided by Tipping [12]. The RVM is able to
reject uninformative features during its training hence no
further feature selection was used for arousal
determination.
Equation (1) shows how the $N_s$ audio- and video-based
features $z_i^k$ of the k-th shot are linearly combined with the
weights $w_i$ to compute the arousal $\hat{a}_k$ at the shot level:

    $\hat{a}_k = w_0 + \sum_{i=1}^{N_s} w_i z_i^k$            (1)
After computing arousal at the shot level, the average
and maximum arousals of the shots of each scene are
computed and used as arousal indicator features for the
scene affective classification. During an exciting scene
the arousal related features do not all remain at their
extreme level. In order to represent the highest arousal
of each scene, the maximum of the shots’ arousal was
chosen to be used as a feature for scene classification.
The linear regression weights that were computed
from our data set were used to determine the arousal of
each movie’s shots. This was done in such a way that all
movies from the dataset except for the one to which the
shot belonged to were used as the training set for the
RVM. Any missing affective annotation for a shot was
approximated using linear interpolation from the closest
affective annotated time points in a movie.
It was observed that arousal has higher linear
correlation with multimedia content-based features than
valence. Valence estimation from regression is not as
accurate as arousal estimation and therefore valence
estimation has not been performed at the shot level.
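
The paper uses Tipping's RVM toolbox for the sparse Bayesian linear regression; as a hedged stand-in, the sketch below runs the leave-one-movie-out protocol with scikit-learn's ARDRegression, a related sparse Bayesian linear model. Variable names are illustrative, not from the authors' code.

    import numpy as np
    from sklearn.linear_model import ARDRegression

    def leave_one_movie_out_arousal(Z, arousal, movie_ids):
        # Z: (n_shots, n_features) content features; arousal: annotated shot arousal;
        # movie_ids: movie label per shot (numpy arrays). Each shot's arousal is
        # predicted by a model trained on all movies except the one it belongs to.
        a_hat = np.zeros(len(arousal), dtype=float)
        for m in np.unique(movie_ids):
            test = movie_ids == m
            model = ARDRegression().fit(Z[~test], arousal[~test])
            a_hat[test] = model.predict(Z[test])
        return a_hat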
2.2. Bayesian framework and scene classification
For the purpose of categorizing the valence-arousal
space into three affect classes, the valence-arousal space
was divided into the three areas shown in Figure 2, each
corresponding to one class. According to [13], emotions
mapped to the lower arousal category are neither
extremely pleasant nor unpleasant and are
difficult to differentiate. Emotional evaluations are
shown to have a heart-shaped distribution in the valence-
arousal space [13]. Hence, we categorized the lower half
of the plane into one class. The points with an arousal of
zero were counted in class 1 and the points with arousal
greater than zero and valence equal to zero were
considered in class 2. These classes were used as a
simple representation for the emotion categories based
on the previous literature on emotion assessment [14].
In order to characterize movie scenes into these
affective categories, the average and maximum arousal
of the shots of each scene and the low level extracted
audio- and video- based features were used to form a
feature vector. This feature vector in turn was used for
the classification.
If the content feature vector of the j-th scene is $x_j$, the
problem of finding the emotion class $\hat{y}_j$ of this scene is
formulated as estimating the $\hat{y}_j$ which maximizes the
probability $p(y_j \mid x_j, \theta)$, where $\theta$ is the prior
information which can include the user’s preferences and
the video clip’s metadata. In this paper one of the prior
metadata ($\theta$) we used is for instance the genre of the
movie. Personal profile parameters can also be added to $\theta$.
Since in this paper the whole affect representation is
trained by the self-report of one participant, the model is
assumed to be personalized for this participant. When the
emotion of the previous scene is used as another prior,
the scene affect probability formula changes to
$p(y_j \mid y_{j-1}, x_j, \theta)$. Assuming for simplification that
the emotion of the previous scene is independent from the
content features of the current scene, this probability can
be reformulated as:

    $p(y_j \mid y_{j-1}, x_j, \theta) = \frac{p(y_{j-1} \mid y_j, \theta)\; p(y_j \mid x_j, \theta)}{p(y_{j-1} \mid \theta)}$            (2)
The classification problem is then simplified into
the determination of the maximum value of the
numerator of Equation (2), since the denominator will be
the same for all affect classes $y_j$. The priors are
established based on the empirical probabilities obtained
from the training data.

[Figure 2. Three classes in the valence-arousal space are shown,
namely calm (1), positive excited (2) and negative excited (3).
An approximation of the heart-shaped distribution of valence and
arousal is shown.]

For example, the occurrence
probability of having a given emotion followed by any
of the emotion categories was computed from the
participant’s self-assessments and for each genre. This
allowed obtaining $p(y_{j-1} \mid y_j)$. Different methods were
evaluated to estimate the posterior probability $p(y_j \mid x_j)$. A
naïve Bayesian approach which assumes the conditional
probabilities are Gaussian was chosen as providing the
best performance on the dataset; the superiority of this
method can be attributed to its generalization abilities.
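
To illustrate how Equation (2) can be applied, the sketch below combines a Gaussian naive Bayes content posterior p(y_j|x_j) with an empirical transition prior p(y_{j-1}|y_j) in a greedy scene-by-scene decoding; the paper does not specify the decoding scheme, and genre conditioning would amount to selecting per-genre models and priors. All function and variable names are illustrative.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def fit_content_model(X_train, y_train):
        # p(y | x): naive Bayes with Gaussian class-conditional densities.
        return GaussianNB().fit(X_train, y_train)

    def classify_scenes(X, nb_model, trans, prior):
        # trans[p, c]: empirical p(y_{j-1} = p | y_j = c); prior[p]: empirical p(y = p),
        # both indexed in the order of nb_model.classes_.
        # Scene j receives the class maximizing the numerator of Equation (2);
        # the division by p(y_{j-1}) is constant over candidate classes.
        post = nb_model.predict_proba(X)          # p(y_j | x_j) for every scene
        y = np.empty(len(X), dtype=int)
        y[0] = int(np.argmax(post[0]))            # first scene: content posterior only
        for j in range(1, len(X)):
            score = trans[y[j - 1], :] * post[j] / prior[y[j - 1]]
            y[j] = int(np.argmax(score))
        return y                                  # class indices into nb_model.classes_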
3. Material description
A dataset of movies segmented and affectively
annotated by arousal and valence is used as the training
set. This training set consists of twenty one full length
movies (mostly popular movies). The majority of movies
were selected either because they were used in similar
studies (e.g. [15]), or because they were recent and
popular. The dataset included four genres: drama,
horror, action, and comedy. The following three
information streams were extracted from the media:
video (visual), sound (auditory), and subtitles (textual).
The video stream of the movies has been segmented at
the shot level using the OMT shot segmentation software
and manually segmented into scenes [16;17]. Movie
videos were encoded into the MPEG-1 format to extract
motion vectors and I frames for further feature
extraction. We used the OVAL library (Object-based
Video Access Library) [18] to capture video frames and
extract motion vectors.
The second information stream, namely sound, has an
important impact on user’s affect. For example
according to the findings of Picard [19], loudness of
speech (energy) is related to evoked arousal, while
rhythm and average pitch in speech signals are related to
valence. The audio channels of the movies were
extracted and encoded into monophonic information
(MPEG layer 3 format) at a sampling rate of 48 kHz. All
of the resulting audio signals were normalized to the
same amplitude range before further processing.
Textual features were also extracted from the subtitles
track of the movies. According to [9] the semantic
analysis of the textual information can improve affect
classification. As the semantic analysis over the textual
data was not the focus of our work, we extracted simple
features from the subtitles by tokenizing the text and
counting the number of words. These statistics were
used with the timing of the subtitles to extract the
talking-rate feature, which is the number of words
spoken per second of subtitle show time.
The other extracted feature is the number of spoken
words in a scene divided by the length of the scene,
which can represent the amount or existence of
dialogues in a scene. A list of the movies in the dataset
and their corresponding genre is given in Table 1.
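
As an illustration of the talking-rate feature, the sketch below counts words per second of subtitle show time directly from a raw .srt string; this is a simplified stand-in for the paper's subtitle tokenization, and the parsing is deliberately minimal.

    import re

    _TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+) --> (\d+):(\d+):(\d+)[,.](\d+)")

    def _seconds(h, m, s, ms):
        return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

    def talking_rate(srt_text):
        # Number of words spoken per second of subtitle show time.
        total_words, total_time = 0, 0.0
        for block in srt_text.strip().split("\n\n"):
            lines = block.splitlines()
            for i, line in enumerate(lines):
                m = _TIME.match(line.strip())
                if m:
                    total_time += _seconds(*m.groups()[4:]) - _seconds(*m.groups()[:4])
                    total_words += len(" ".join(lines[i + 1:]).split())
                    break
        return total_words / total_time if total_time > 0 else 0.0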
3.1. Audio features
A total of 53 low-level audio features were
determined for each of the audio signals. These features,
listed in Table 2, are commonly used in audio and
speech processing and audio classification [20;21].
Wang et al. [2] demonstrated the relationship between
audio type’s proportions (for example, the proportion of
music in an audio segment) and affect, where these
proportions refer to the respective duration of music,
speech, environment, and silence in the audio signal of a
video clip. To determine the three important audio types
(music, speech, environment), we implemented a three
class audio type classifier using support vector machines
(SVM) operating on audio low-level features in a one
second segment. Before classification, silence had been
identified by comparing the average audio signal energy
of each sound segment (using the averaged square
magnitude in a time window) with a pre-defined
threshold empirically extracted from the first seven
percent of the audio energy histogram. This audio
histogram was computed from a randomly selected 30
minutes segment of each movie’s audio stream.
Table 2. Low-level features extracted from audio signals.
Feature category: Extracted features
MFCC: MFCC coefficients (13 features) [20], Derivative of MFCC (13 features), Autocorrelation of MFCC (13 features)
Energy: Average energy of audio signal [20]
Formants: Formants up to 5500 Hz (female voice) (five features)
Time frequency: Spectrum flux, Spectral centroid, Delta spectrum magnitude, Band energy ratio [20;21]
Pitch: First pitch frequency
Zero crossing rate: Average, Standard deviation [20]
Silence ratio: Proportion of silence in a time window [24]
After removing silence, the remaining audio signals
were classified by an SVM with a polynomial kernel,
using the LIBSVM toolbox
(http://www.csie.ntu.edu.tw/~cjlin/libsvm/). The SVM
was trained on about three hours of audio, extracted
from movies (not from the dataset of this paper) and
labeled manually. Despite the fact that in various cases
the audio type classes were overlapping (e.g. presence of
a musical background during a dialogue), the classifier
was usually able to recognize the dominant audio type
with an accuracy of about 80%.
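
A hedged sketch of the silence-removal and audio-type steps: the energy threshold is approximated here by a low percentile of the segment-energy distribution (the paper takes it from the first seven percent of the energy histogram of a 30-minute random excerpt), and scikit-learn's SVC wraps LIBSVM with the polynomial kernel mentioned above. Names are illustrative.

    import numpy as np
    from sklearn.svm import SVC

    def silence_mask(segment_energy, percentile=7):
        # segment_energy: mean squared amplitude of each one-second segment.
        threshold = np.percentile(segment_energy, percentile)
        return segment_energy < threshold      # True where the segment is treated as silence

    def classify_audio_type(train_feats, train_labels, test_feats):
        # Three-class (music / speech / environment) classifier on low-level
        # audio features of one-second segments, SVM with a polynomial kernel.
        clf = SVC(kernel="poly").fit(train_feats, train_labels)
        return clf.predict(test_feats)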
The classification results were used to compute the ratio of
each audio type in a movie segment.

Table 1. List of the movies in the dataset.
Drama movies: The pianist, Blood diamond, Hotel Rwanda, Apocalypse now, American history X, Hannibal
Comedy movies: Man on the moon, Mr. Bean’s holiday, Love actually, Shaun of the dead, Shrek
Horror movies: Silent hill, Ringu (Japanese), 28 days later, The shining
Action movies: Man on Fire, Kill Bill Vol. 1, Kill Bill Vol. 2, Platoon, The thin red line, Gangs of New York

Citations
Proceedings ArticleDOI
06 Jun 2016
TL;DR: This doctoral research studies the multimodal analysis of UGC in support of above-mentioned social media problems and proposes approaches, results, and works in progress on these problems.
Abstract: The number of user-generated multimedia content (UGC) online has increased rapidly in recent years due to the ubiquitous availability of smartphones, cameras, and affordable network infrastructures. Thus, it attracts companies to provide diverse multimedia-related services such as preference-aware multimedia recommendations, multimedia-based e--learning, and event summarization from a large collection of multimedia content. However, a real-world UGC is complex and extracting semantics from only multimedia content is difficult because suitable concepts may be exhibited in different representations. Modern devices capture contextual information in conjunction with a multimedia content, which greatly facilitates in the semantics understanding of the multimedia content. Thus, it is beneficial to analyse UGC from multiple modalities such as multimedia content and contextual information (eg., spatial and temporal information). This doctoral research studies the multimodal analysis of UGC in support of above-mentioned social media problems. We present our proposed approaches, results, and works in progress on these problems.

28 citations


Cites background from "A Bayesian framework for video affective representation":

  • ...There exist a few approaches [6,20,22] to recognize emotions from videos but the field of video soundtrack recommendation for UGVs [4,24] is largely unexplored....

Journal ArticleDOI
TL;DR: In this paper, the state-of-the-art multimedia affective computing (AC) technologies comprehensively for large-scale heterogeneous multimedia data are surveyed and compared, with the focus on both handcrafted features-based methods and deep learning methods.
Abstract: The wide popularity of digital photography and social networks has generated a rapidly growing volume of multimedia data (i.e., image, music, and video), resulting in a great demand for managing, retrieving, and understanding these data. Affective computing (AC) of these data can help to understand human behaviors and enable wide applications. In this article, we survey the state-of-the-art AC technologies comprehensively for large-scale heterogeneous multimedia data. We begin this survey by introducing the typical emotion representation models from psychology that are widely employed in AC. We briefly describe the available datasets for evaluating AC algorithms. We then summarize and compare the representative methods on AC of different multimedia types, i.e., images, music, videos, and multimodal data, with the focus on both handcrafted features-based methods and deep learning methods. Finally, we discuss some challenges and future directions for multimedia affective computing.

27 citations

Journal ArticleDOI
TL;DR: It is found that variable shot lengths in a trailer tend to produce a rhythm that is likely to stimulate a viewer's positive preference, as demonstrated by viewers'eye-tracking data.
Abstract: Nowadays, there are many movie trailers publicly available on social media website such as YouTube, and many thousands of users have independently indicated whether they like or dislike those trailers. Although it is understandable that there are multiple factors that could influence viewers’ like or dislike of the trailer, we aim to address a preference question in this work: Can subjective multimedia features be developed to predict the viewer's preference presented by like (by thumbs-up) or dislike (by thumbs-down) during and after watching movie trailers? We designed and implemented a computational framework that is composed of low-level multimedia feature extraction, feature screening and selection, and classification, and applied it to a collection of 725 movie trailers. Experimental results demonstrated that, among dozens of multimedia features, the single low-level multimedia feature of shot length variance is highly predictive of a viewer's “like/dislike” for a large portion of movie trailers. We interpret these findings such that variable shot lengths in a trailer tend to produce a rhythm that is likely to stimulate a viewer's positive preference. This conclusion was also proved by the repeatability experiments results using another 600 trailer videos and it was further interpreted by viewers'eye-tracking data.

20 citations


Cites background from "A Bayesian framework for video affective representation":

  • ...So, the trailer aims to tell the story of a film in a highly condensed and attractive fashion....

Proceedings ArticleDOI
01 Oct 2016
TL;DR: This work leverages both multimedia content and contextual information (eg., spatial and temporal metadata) to address above-mentioned social media problems in their doctoral research.
Abstract: The rapid growth in the amount of user-generated content (UGCs) online necessitates for social media companies to automatically extract knowledge structures (concepts) from user-generated images (UGIs) and user-generated videos (UGVs) to provide diverse multimedia-related services. For instance, recommending preference-aware multimedia content, the understanding of semantics and sentics from UGCs, and automatically computing tag relevance for UGIs are benefited from knowledge structures extracted from multiple modalities. Since contextual information captured by modern devices in conjunction with a media item greatly helps in its understanding, we leverage both multimedia content and contextual information (eg., spatial and temporal metadata) to address above-mentioned social media problems in our doctoral research. We present our approaches, results, and works in progress on these problems.

17 citations


Cites background from "A Bayesian framework for video affective representation":

  • ...STATE OF THE ART Earlier works [11, 34, 37] recognize emotions from videos but the field of soundtrack recommendation for UGVs [9,39] is largely unexplored....

Book ChapterDOI
07 Sep 2015
TL;DR: There is a significant correlation between audio-visual features of movies and corresponding brain signals specially in the visual and temporal lobes, and the genre of movie clips can be classified with an accuracy significantly over the chance level using the MEG signal.
Abstract: Genre classification is an essential part of multimedia content recommender systems. In this study, we provide experimental evidence for the possibility of performing genre classification based on brain recorded signals. The brain decoding paradigm is employed to classify magnetoencephalography (MEG) data presented in [1] to four genre classes: Comedy, Romantic, Drama, and Horror. Our results show that: 1) there is a significant correlation between audio-visual features of movies and corresponding brain signals specially in the visual and temporal lobes; 2) the genre of movie clips can be classified with an accuracy significantly over the chance level using the MEG signal. On top of that we show that the combination of multimedia features and MEG-based features achieves the best accuracy. Our study provides a primary step towards user-centric media content retrieval using brain signals.

13 citations


Cites background from "A Bayesian framework for video affective representation":

  • ...in response to a video clip contain useful information regarding the genre of the video clip [19]....

  • ...[19] showed that a Bayesian classification approach can tag movie scenes into three affective classes (calm, positive excited and negative excited)....

References
01 Jan 2006

5,265 citations

Journal ArticleDOI
Michael E. Tipping
TL;DR: It is demonstrated that by exploiting a probabilistic Bayesian learning framework, the 'relevance vector machine' (RVM) can derive accurate prediction models which typically utilise dramatically fewer basis functions than a comparable SVM while offering a number of additional advantages.
Abstract: This paper introduces a general Bayesian framework for obtaining sparse solutions to regression and classification tasks utilising models linear in the parameters. Although this framework is fully general, we illustrate our approach with a particular specialisation that we denote the 'relevance vector machine' (RVM), a model of identical functional form to the popular and state-of-the-art 'support vector machine' (SVM). We demonstrate that by exploiting a probabilistic Bayesian learning framework, we can derive accurate prediction models which typically utilise dramatically fewer basis functions than a comparable SVM while offering a number of additional advantages. These include the benefits of probabilistic predictions, automatic estimation of 'nuisance' parameters, and the facility to utilise arbitrary basis functions (e.g. non-'Mercer' kernels). We detail the Bayesian framework and associated learning algorithm for the RVM, and give some illustrative examples of its application along with some comparative benchmarks. We offer some explanation for the exceptional degree of sparsity obtained, and discuss and demonstrate some of the advantageous features, and potential extensions, of Bayesian relevance learning.

5,116 citations

Journal ArticleDOI
TL;DR: Response specificity, particularly facial expressiveness, supported the view that specific affects have unique patterns of reactivity, and consistency of the dimensional relationships between evaluative judgments and physiological response emphasizes that emotion is fundamentally organized by these motivational parameters.
Abstract: Colored photographic pictures that varied widely across the affective dimensions of valence (pleasant-unpleasant) and arousal (excited-calm) were each viewed for a 6-s period while facial electromyographic (zygomatic and corrugator muscle activity) and visceral (heart rate and skin conductance) reactions were measured. Judgments relating to pleasure, arousal, interest, and emotional state were measured, as was choice viewing time. Significant covariation was obtained between (a) facial expression and affective valence judgments and (b) skin conductance magnitude and arousal ratings. Interest ratings and viewing time were also associated with arousal. Although differences due to the subject's gender and cognitive style were obtained, affective responses were largely independent of the personality factors investigated. Response specificity, particularly facial expressiveness, supported the view that specific affects have unique patterns of reactivity. The consistency of the dimensional relationships between evaluative judgments (i.e., pleasure and arousal) and physiological response, however, emphasizes that emotion is fundamentally organized by these motivational parameters.

3,089 citations

Journal ArticleDOI
TL;DR: This article developed a set of films that reliably elicit eight emotional states (amusement, anger, contentment, disgust, fear, neutral, sadness, and surprise) from a large sample of 494 English-speaking subjects.
Abstract: Researchers interested in emotion have long struggled with the problem of how to elicit emotional responses in the laboratory. In this article, we summarise five years of work to develop a set of films that reliably elicit each of eight emotional states (amusement, anger, contentment, disgust, fear, neutral, sadness, and surprise). After evaluating over 250 films, we showed selected film clips to an ethnically diverse sample of 494 English-speaking subjects. We then chose the two best films for each of the eight target emotions based on the intensity and discreteness of subjects' responses to each film. We found that our set of 16 films successfully elicited amusement, anger, contentment, disgust, sadness, surprise, a relatively neutral state, and, to a lesser extent, fear. We compare this set of films with another set recently described by Philippot (1993), and indicate that detailed instructions for creating our set of film stimuli will be provided on request.

2,327 citations

Frequently Asked Questions (1)
Q1. What are the contributions in this paper?

The novelty of this paper is to introduce a Bayesian classification framework for affective video tagging that allows taking contextual information into account.