A Bayesian framework for video affective representation
Summary
1.1. Overview
- Video and audio on-demand systems are becoming increasingly popular and are likely to replace traditional television.
- The enormous volume and variety of digital multimedia content calls for more efficient multimedia management methods.
- These studies were mostly based on content analysis and textual tags [2].
- Then, the affect representation system fuses the extracted features, the stored personal information, and the metadata to represent the evoked emotion.
- Section 3 details the movie dataset used and the features that have been extracted.
1.2. State of the art
- Video affect representation requires understanding of the intensity and type of user’s affect while watching a video.
- In order to represent affect in video, they first selected video- and audio-content features according to their relation to the valence-arousal space, which was defined as an affect model (for the definition of affect model, see Section 1.3) [4].
- Next, they used color energy, lighting and brightness as valence-related features for an HMM-based valence classification of the previously arousal-categorized shots (a minimal sketch follows this list).
- A personalized affect representation method based on a regression approach for estimating user-felt arousal and valence from multimedia content features and/or from physiological responses was presented by Soleymani et al. [7].
- A relevance vector machine was used to find linear regression weights.
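As a hedged illustration of the HMM-based valence classification mentioned above, the sketch below trains one Gaussian HMM per valence class on sequences of shot features and classifies a new sequence by maximum log-likelihood; the hmmlearn models, feature layout and state counts are assumptions, not the cited authors' implementation.

```python
# Sketch: HMM-based valence classification of arousal-categorized shots.
# One GaussianHMM per valence class; a scene (a sequence of shot features)
# is assigned to the class whose model gives the highest log-likelihood.
# Feature values and model settings are illustrative assumptions.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_valence_hmms(sequences_by_class, n_states=3):
    """sequences_by_class: {class_label: list of (n_shots, n_features) arrays}."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)                      # stack all shot-feature sequences
        lengths = [len(s) for s in seqs]         # sequence boundaries for hmmlearn
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify_valence(models, shot_features):
    """shot_features: (n_shots, n_features) array for one scene."""
    scores = {label: m.score(shot_features) for label, m in models.items()}
    return max(scores, key=scores.get)           # class with highest log-likelihood
```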
1.3. Affect and Affective representation
- Russell [10] proposed a 3D continuous space called the valence-arousal-dominance space which was based on a self-representation of emotions from multiple subjects.
- The valence axis represents the pleasantness of a situation, from unpleasant to pleasant; the arousal axis expresses the degree of felt excitement, from calm to exciting.
- The most straightforward way to represent an emotion is to use discrete labels such as fear, anxiety and joy; however, label-based representations have several disadvantages.
- The main one is that despite the universality of basic emotions, the labels themselves are not universal.
- Each movie consists of scenes, and each scene consists of a sequence of shots that take place in the same location.
2.1. Arousal estimation with regression on shots
- Informative features for arousal estimation include loudness and energy of the audio signals, motion component, visual excitement and shot duration.
- The RVM is able to reject uninformative features during its training; hence, no further feature selection was used for arousal estimation.
- The shot arousal is estimated with the linear model $\hat{a}_s = w_0 + \sum_{i} w_i z_{i,s}$ (1), where $z_{i,s}$ is the $i$-th content feature of shot $s$ and the weights $w_i$ are obtained from the RVM. After computing arousal at the shot level, the average and maximum arousals of the shots of each scene are computed and used as arousal indicator features for the scene affective classification (a regression sketch follows this list).
- During an exciting scene the arousal-related features do not all remain at their extreme level.
- This was done in such a way that all movies from the dataset except the one to which the shot belonged were used as the training set for the RVM.
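A minimal sketch of the shot-level arousal regression of eq. (1) and the scene-level aggregation described above; sklearn's ARDRegression is used here as a sparse Bayesian stand-in for the relevance vector machine, so the model choice and data layout are assumptions rather than the authors' implementation.

```python
# Sketch: shot-level arousal regression (eq. 1) and scene-level aggregation.
# ARDRegression stands in for the relevance vector machine (RVM): both are
# sparse Bayesian linear models, but this substitution is an assumption.
import numpy as np
from sklearn.linear_model import ARDRegression

def fit_arousal_regressor(Z_train, a_train):
    """Z_train: (n_shots, n_features) content features; a_train: annotated arousal."""
    model = ARDRegression()
    model.fit(Z_train, a_train)     # learns the intercept w_0 and sparse weights w_i
    return model

def scene_arousal_indicators(model, scene_shot_features):
    """Average and maximum of the estimated shot arousals within one scene."""
    a_hat = model.predict(scene_shot_features)   # \hat{a}_s for each shot of the scene
    return a_hat.mean(), a_hat.max()
```

In a leave-one-movie-out setup, `fit_arousal_regressor` would be trained on the shots of all movies except the one containing the scene under evaluation.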
2.2. Bayesian framework and scene classification
- To categorize the valence-arousal space into three affect classes, it was divided into the three areas shown in Figure 2, each corresponding to one class.
- Hence, the authors categorized the lower half of the plane into one class.
- These classes were used as a simple representation for the emotion categories based on the previous literature on emotion assessment [14].
- This feature vector in turn was used for the classification.
- Different methods were evaluated to estimate the posterior probability p(y_j | x_j); a sketch of one way to combine this posterior with contextual priors follows this list.
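The following sketch shows one plausible way a content-based likelihood could be combined with temporal and genre priors under an independence assumption; the exact factorization and probability estimates used in the paper may differ.

```python
# Sketch: combining a content-based likelihood with temporal and genre priors
# to obtain a posterior over the three affect classes (naive Bayes style).
# The factorization shown here is an assumed, simplified combination.
import numpy as np

def scene_posterior(content_likelihood, transition, genre_prior, prev_class):
    """
    content_likelihood: p(x_j | y_j), shape (3,)
    transition:         p(y_j | y_{j-1}), shape (3, 3), rows = previous class
    genre_prior:        p(y_j | genre),   shape (3,)
    prev_class:         class index assigned to the previous scene
    """
    unnorm = content_likelihood * transition[prev_class] * genre_prior
    return unnorm / unnorm.sum()     # posterior p(y_j | x_j, y_{j-1}, genre)

# Example with made-up numbers:
post = scene_posterior(
    content_likelihood=np.array([0.2, 0.5, 0.3]),
    transition=np.array([[0.6, 0.3, 0.1],
                         [0.2, 0.6, 0.2],
                         [0.1, 0.3, 0.6]]),
    genre_prior=np.array([0.3, 0.4, 0.3]),
    prev_class=1,
)
print(post.argmax())   # most probable affect class for the current scene
```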
3. Material description
- A dataset of movies segmented and affectively annotated by arousal and valence is used as the training set.
- The majority of movies were selected either because they were used in similar studies (e.g. [15]), or because they were recent and popular.
- Movie videos were encoded into the MPEG-1 format to extract motion vectors and I-frames for further feature extraction (an extraction sketch follows this list).
- The second information stream, namely sound, has an important impact on the user's affect.
- Textual features were also extracted from the subtitles track of the movies.
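A hedged sketch of how I-frames could be pulled out of the MPEG-1 streams with ffmpeg for later visual feature extraction; the command, filter syntax and file names are assumptions, since the paper does not specify the tooling used.

```python
# Sketch: extracting I-frames from an MPEG-1 encoded movie with ffmpeg,
# for subsequent visual feature extraction. Command, filter expression and
# output naming are assumptions, not the authors' documented pipeline.
import subprocess

def extract_iframes(video_path, out_pattern="iframe_%05d.png"):
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-vf", r"select=eq(pict_type\,I)",   # keep only intra-coded (I) frames
        "-vsync", "vfr",                     # one output image per selected frame
        out_pattern,
    ], check=True)

extract_iframes("movie.mpg")
```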
3.1. Audio features
- A total of 53 low-level audio features were determined for each of the audio signals.
- To determine the three important audio types (music, speech, environment), the authors implemented a three-class audio type classifier using support vector machines (SVM) operating on low-level audio features computed over one-second segments (a classifier sketch follows this list).
- Extracted audio features by category:
  - MFCC: MFCC coefficients (13 features) [20], derivative of MFCC (13 features), autocorrelation of MFCC (13 features)
  - Energy: average energy of the audio signal [20]
  - Time-frequency: spectrum flux, spectral centroid, delta spectrum magnitude, band energy ratio [20;21]
- (Figure: presence of each audio type in a movie segment, illustrated for movies such as The Thin Red Line and Gangs of New York.)
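Below is a minimal sketch of the three-class audio type classifier (music / speech / environment) operating on low-level features of one-second segments; the librosa feature set and SVM settings are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch: SVM classifier for audio type (music / speech / environmental sound)
# on low-level features of one-second segments. The feature choice (MFCCs,
# spectral centroid, energy) and SVM settings are illustrative assumptions.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def segment_features(y, sr):
    """Low-level features of a one-second audio segment (illustrative choice)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # 13 MFCCs per frame
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # spectral centroid
    rms = librosa.feature.rms(y=y)                             # energy proxy
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                           [centroid.mean()], [rms.mean()]])

def train_audio_type_classifier(X, labels):
    """X: feature vectors of labeled one-second segments; labels: music/speech/env."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    return clf.fit(X, labels)
```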
3.2. Visual features
- From a movie director's point of view, lighting key [2;23] and color variance [2] are important tools to evoke emotions.
- The average shot change rate, and shot length variance were extracted to characterize video rhythm.
- Fast-moving scenes or object movement across consecutive frames is also an effective factor for evoking excitement.
- Colors and their proportions are important parameters to elicit emotions [17].
- In order to use colors in the list of video features, a 20-bin color histogram of hue and lightness values in the HSV space was computed for each I-frame and subsequently averaged over all frames, as sketched below.
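A small sketch of the per-I-frame hue histogram averaged over frames, using OpenCV; the 20-bin layout follows the text, but the handling of lightness and the exact color space conversion are simplified assumptions.

```python
# Sketch: 20-bin hue histogram per I-frame, averaged over all I-frames.
# Only the hue channel is shown; the paper also uses lightness, which is
# omitted here for brevity (an assumption of this sketch).
import cv2
import numpy as np

def average_hue_histogram(iframe_paths, n_bins=20):
    hists = []
    for path in iframe_paths:
        bgr = cv2.imread(path)
        hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
        hue = hsv[:, :, 0].ravel()                     # OpenCV hue range is 0..179
        hist, _ = np.histogram(hue, bins=n_bins, range=(0, 180), density=True)
        hists.append(hist)
    return np.mean(hists, axis=0)                      # averaged over all I-frames
```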
3.3. Affective annotation
- The coordinates of a pointer manipulated by the user are continuously recorded during the show time of the stimuli (video, image, or external source) and used as the affect indicators.
- A set of SAM manikins (Self-Assessment Manikins [25]) is generated for different combinations of arousal and valence to help the user understand the emotions related to the regions of the valence-arousal space.
- For example, the positive-excited manikin is generated by combining the positive manikin and the excited manikin.
- The participant was asked to annotate the movies so as to indicate at which times his/her felt emotion had changed.
- The participant was asked to indicate at least one point during each scene, so as not to leave any scene without an assessment (a per-scene aggregation sketch follows this list).
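A minimal sketch, under an assumed data layout, of how the continuously recorded pointer coordinates could be reduced to one valence-arousal annotation per scene by averaging the samples falling within each scene's time span; the timestamped-sample format and scene boundaries are hypothetical.

```python
# Sketch: turning continuously recorded pointer coordinates (valence, arousal)
# into one annotation per scene by averaging samples within the scene's time span.
# The data layout (timestamped samples, scene boundaries) is a hypothetical one.
import numpy as np

def scene_annotations(timestamps, valence, arousal, scene_bounds):
    """scene_bounds: list of (start_time, end_time) tuples, one per scene."""
    labels = []
    for start, end in scene_bounds:
        mask = (timestamps >= start) & (timestamps < end)
        labels.append((valence[mask].mean(), arousal[mask].mean()))
    return labels
```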
4.1. Arousal estimation of shots
- Figure 4 shows a sample arousal curve from part of the film entitled “Silent Hill”.
- The participant’s felt emotion was however not completely in agreement with the estimated curve, as can for instance be observed in the second half of the plot.
- A possible cause for the discrepancy is the low temporal resolution of the self-assessment.
- Another possible cause is experimental weariness: after several minutes of exciting stimuli, a participant's arousal might decrease despite strong movements in the video and loud audio.
- Finally, some emotional feelings might simply not be captured by low-level features; this would for instance be the case for a racist comment in a movie dialogue which evokes disgust for a participant.
4.2. Classification results
- Classification performance was evaluated with the F1 measure, $F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$ (3). For the ten-fold cross-validation, the original samples (movie scenes) were partitioned into 10 subsample sets (an evaluation sketch follows this list).
- The naïve Bayesian classifier results are shown in Table 3-a.
- As with the temporal prior, the genre prior leads to a better estimate of the emotion class.
- The evolution of classification results over consecutive scenes when adding the time prior shows that this prior allows correcting results for some samples that were misclassified using the genre prior only.
- Using physiological signals or audiovisual recordings will help overcome these problems and facilitate this part of the work, by yielding continuous affective annotations without interrupting the user [7].
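A short sketch of the ten-fold cross-validation of a naive Bayesian classifier scored with the F1 measure, using sklearn; the feature matrix, label vector and the macro-averaging of F1 are assumptions, and the temporal/genre priors are not included in this baseline.

```python
# Sketch: ten-fold cross-validation of a Gaussian naive Bayes classifier on
# scene feature vectors, scored with the F1 measure (eq. 3). The feature
# matrix X and three-class labels y are assumed to come from the earlier steps.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

def evaluate_naive_bayes(X, y):
    """X: scene feature vectors, y: three-class affect labels."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="f1_macro")
    return scores.mean()   # average F1 over the 10 folds
```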
5. Conclusions and perspectives
- An affective representation system for estimating felt emotions at the scene level has been proposed using a Bayesian classification framework that allows taking some form of context into account.
- Results showed the advantage of using well chosen priors, such as temporal information provided by the previous scene emotion, and movie genre.
- The F1 classification measure of 54.9% that was obtained on three emotional classes with a naïve Bayesian classifier was improved to 56.5% and 59.5% using only the time prior and the genre prior, respectively.
- This measure finally improved to 63.4% after utilizing all the priors.
- It will also provide us with a better understanding of the feasibility of using group-wise profiles containing some affective characteristics that are shared between users.