IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 7, NO. 1, FEBRUARY 2005
Affective Video Content
Representation and Modeling
Alan Hanjalic, Member, IEEE, and Li-Qun Xu, Member, IEEE
Abstract—This paper looks into a new direction in video content
analysis: the representation and modeling of affective video con-
tent. The affective content of a given video clip can be defined as
the intensity and type of feeling or emotion (both are referred to
as affect) that are expected to arise in the user while watching that
clip. The availability of methodologies for automatically extracting
this type of video content will extend the current scope of possibil-
ities for video indexing and retrieval. For instance, we will be able
to search for the funniest or the most thrilling parts of a movie,
or the most exciting events of a sport program. Furthermore, as
the user may want to select a movie not only based on its genre,
cast, director and story content, but also on its prevailing mood, the
affective content analysis is also likely to contribute to enhancing
the quality of personalizing the video delivery to the user. We pro-
pose in this paper a computational framework for affective video
content representation and modeling. This framework is based on
the dimensional approach to affect that is known from the field of
psychophysiology. According to this approach, the affective video
content can be represented as a set of points in the two-dimen-
sional (2-D) emotion space that is characterized by the dimensions
of arousal (intensity of affect) and valence (type of affect). We map
the affective video content onto the 2-D emotion space by using the
models that link the arousal and valence dimensions to low-level
features extracted from video data. This results in the arousal and
valence time curves that, either considered separately or combined
into the so-called affect curve, are introduced as reliable represen-
tations of expected transitions from one feeling to another along a
video, as perceived by a viewer.
Index Terms—Affective video content analysis, video abstrac-
tion, video content modeling, video content representation, video
highlights extraction.
I. INTRODUCTION
DIGITAL VIDEO collections are growing rapidly in both
the professional and consumer environment, and are char-
acterized by a steadily increasing capacity and content variety.
Since searching manually through these collections is tedious
and time-consuming, transferring the search and retrieval tasks
to automated systems becomes crucial for being able to effi-
ciently handle stored video volumes. The development of such
systems is based on the
algorithms for video content analysis.
These algorithms are built around the models bridging the gap
between the syntax of the digital video data stream (captured
in the so-called low-level features) and the semantic meaning
of that stream. Using the information that is extracted from a
video by these algorithms, digital video data can be indexed,
classified, filtered or organized automatically based on semantic
criteria.

Manuscript received August 31, 2001; revised June 18, 2003. The associate
editor coordinating the review of this manuscript and approving it for publica-
tion was Dr. Sankar Basu.
A. Hanjalic is with the Department of Mediamatics, Delft University of Tech-
nology, 2628 CD Delft, The Netherlands (e-mail: A.Hanjalic@ewi.tudelft.nl).
L.-Q. Xu is with the Broadband Applications Research Centre, BT
Research Venturing, Martlesham Heath, Ipswich IP5 3RE, U.K. (e-mail:
li-qun.xu@bt.com).
Digital Object Identifier 10.1109/TMM.2004.840618

Fig. 1. Overview of two different levels of video content perception, analysis,
and retrieval.
The semantic meaning of a given video clip is not unique,
as the content of this clip can be perceived in many different
ways. Clearly, each way of perceiving video content requires a
particular type of information in order to index, classify, filter
or organize the video collection correspondingly. As depicted
in Fig. 1, we differentiate between two basic levels of video
content perception, hence two different levels of analyzing and
retrieving video content.
Cognitive level.
Affective level.
An algorithm analyzing a video at cognitive level aims at ex-
tracting information that describes the “facts,” e.g., the struc-
ture of the story, the composition of a scene, and the objects and
people captured by the camera. For example, these facts can in-
clude labels such as “a panorama of San Francisco,” an “out-
door” or “indoor” scene, a broadcast news report on “Topic X,”
a “dialog between person A and person B,” or the “fast breaks,”
“steals,” and “scores” of a basketball match. Most of the world-
wide research efforts in the field of video content analysis have
been invested so far in raising the efficiency and reliability of
analyzing the video content at cognitive level. Good overviews
of, and references to, the results of these efforts can be found in
[7], [13], and [21].
Little research effort has been invested so far in extracting the
information that describes the affective content of a video. This
content can be defined as the amount and type of affect (feeling
or emotion) that are contained in a video and expected to arise in
users while watching that video. This expected feeling or emo-
tion can be seen as the one that is either intended to be commu-
nicated toward the audience (by video program directors), or
that which is likely to be elicited from the majority of the audience
watching the particular video clip. To illustrate the former, we
quote I. Maitland [25], the Emmy-Award-winning director and
editor: “It is the filmmaker’s job to create moods in such a
realistic manner that the audience will experience those same
emotions enacted on the screen, and thus feel part of the expe-
rience.” The expected affective response of a broad audience
can best be illustrated by the example of a sport broadcast:
a score (goal) in a soccer match can generally be considered a
highly exciting event, just like the finish of a swimming compe-
tition or the sprint over the last 50 m in a running contest.
At this stage it is worth emphasizing that the affective
content of a video does not necessarily correspond to the affec-
tive response of a particular user to this content. In other words,
the expected feeling or emotion as described above should not
be mixed up with the actual feeling or emotion that is evoked in
a user while watching video. The expected affective response
can be considered objective, as it results from the actions of
the movie director, or reflects the more-or-less unanimous re-
sponse of a general audience to a given stimulus. Opposed to
this, the perceived feeling or emotion is highly subjective and
context-dependent. Therefore, it may be very different from the
expected one and may also vary from one individual to another.
For instance, the same soccer television broadcast may make
the winning team’s fans happy, the losing team’s fans sad, and elicit no
emotions at all from an audience that is not interested in soccer.
The relation between the expected and the subjective affective
responses (e.g., marking a horror movie with the label “funny”
for those people who always laugh while watching such movies)
and the information about the context (e.g., winning or losing
soccer fan) can be taken into account, for instance, by gener-
ating the profile of a particular user. This profile can then be
used to map the expected affective response to a given stimulus
onto the user-specific affective response to that stimulus.
We propose in this paper a computational framework for af-
fective video content representation and modeling. The repre-
sentation part of the framework consists of a set of curves that
reliably depict the expected transitions from one feeling to an-
other along a video, as elicited from a general user. The mod-
eling part addresses the problem of computing the values of the
content representation curves on the basis of low-level features
extracted from video.
This paper is organized as follows. In Section II, we discuss
the importance of extending the research in the field of video
content analysis from the cognitive to the affective level, which
allows for a number of new or enhanced video indexing and
retrieval applications. In Section III we elaborate on the dimen-
sional approach to affect that is known from psychophysiology
and that provides the fundamentals of the proposed framework.
The detailed framework is then presented in Section IV (repre-
sentation part) and Section V (modeling part), together with the
validation using real video program data. Conclusions and rec-
ommendations for future research are given in Section VI.
II. WHY AFFECTIVE VIDEO CONTENT ANALYSIS?
A. Personalized Video Delivery
In view of the rapidly growing technological awareness of
the average user, the availability of automated systems that can
optimally prepare data for easy access by the user becomes
crucial for the commercial success of consumer-oriented mul-
timedia databases. The minimum expected capabilities of such
systems will definitely evolve beyond the pure automation of
retrieval processes: an average user will require more and more
from his electronic infrastructure at home. In the particular case
of video storage systems, this “more” can directly be interpreted
as personalized video delivery. Since the video storage system at
home will soon become a necessary buffer for the hundreds of
television channels reaching one’s home, the system module
handling the stored video data will increasingly be expected to
take into account the preferences of the user and to filter and
organize the stored content accordingly. The systems currently
available for personalized video delivery usually filter the pro-
grams on the basis of information like, in the case of a movie,
the genre, cast, director and story (script) content. As the user
preferences in this case are also largely determined by the pre-
vailing mood of a movie, any information regarding this
mood (obtainable by analyzing the types and intensities of feel-
ings or emotions along a video) is likely to improve the quality
of personalized video delivery.
B. Video Indexing Using Affective Labels
The availability of methods for automatically extracting the
affective video content will extend the current scope of possi-
bilities for video indexing and retrieval. The evidence reported
by Picard [25] is that finding photographs having a particular
mood was the most frequent request of advertising customers in
a study of image retrieval made with Kodak Picture Exchange
[27]. One can easily extend this result to video collections as
well: an average user will often search for the “funniest,” “most
sentimental,” or “most thrilling” fragments of a movie, as well
as for the “most exciting” segments of a sport event.
C. Video Highlighting
Although the highlights generally stand for the most inter-
esting parts of a video, the definition of what is “interesting”
may vary widely across video genres and for different applica-
tions. For instance, while a highlight of a news program may
be determined by the novelty and impact of the news (e.g.,
“breaking news,” “headline news”), the criteria for highlight ex-
traction from a home video are rather content-dependent, like
“where my baby walked for the first time.” The ability to an-
alyze video at affective level will broaden the possibilities for
highlights extraction in a number of application contexts, such
as automated movie trailer generation and sport broadcast sum-
marization.
A movie trailer is a concatenation of movie excerpts that last
only for several tens of seconds but are capable of commanding
the attention of a large number of potential cinema goers and
video on-demand users. Analyzing a movie at affective level
can provide valuable clues about which parts of the movie are
most suitable for being an element of the trailer. This is because
emotion plays a primary role when processing mediated stimuli
[9]. The emotion (affective content) influences the attention of a
user and his or her evaluation and memory for the facts (cogni-
tive content). Consequently, the perception of the affective con-
tent interferes with the perception of the cognitive content and
influences a user’s reactions to the cognitive content, such as
liking or not-liking, enjoyment and memory. Since memory is
the most important factor when creating a trailer, it is worth
noting that memory for highly emotional and, in particular,
highly arousing video fragments has been proven to last longer
than the memory for less-emotional video clips [16], [17]. If the
information on the affective video content is available, the creation
of movie trailers can be performed fully automatically. Also,
the trailers can be generated remotely at a user’s home, for each
movie downloaded by the home digital video storage system.
Previous approaches to automated highlights extraction
in sport video were usually based on the development of
domain-specific models for predefined events (e.g., goals in
soccer, home runs in baseball, fast breaks and steals in basket-
ball, etc.) that are supposed to be interpreted by the users as
highlights [3], [15], [22]. The need for event modeling not only
makes highlight extraction technically and semantically a
complex task in many broadcasts, but it also requires the devel-
opment of a separate highlights-detecting algorithm for each
particular sport program genre. Since it is realistic to assume
that each highlight event (e.g., a goal, touchdown, home run,
the finals of a swimming competition, or the last 50 meters
in a running contest) induces a steady increase in a user’s ex-
citement, an alternative to the domain-specific approach could
be to search for highlights in those video segments that excite
the users most. In this way, generic methods for highlights
extraction could be developed that are independent of the type
of events appearing in a particular sports program genre and
of the differences in event realization and coverage.
III. DIMENSIONAL APPROACH TO AFFECT
As studied by Bradley [5], Lang et al. [19], Osgood et al. [24],
and Russell and Mehrabian [28], affect has three basic underlying
dimensions.
Valence (V).
Arousal (A).
Control (Dominance) (C).
Valence is typically characterized as a continuous range of affec-
tive responses or states extending from “pleasant” or “positive” to
“unpleasant” or “negative” [8], while arousal is characterized by
affective states ranging on a continuous scale from “energized,”
“excited,” and “alert” to “calm,” “drowsy,” or “peaceful.” We can also say
that arousal stands for the intensity of emotion, while valence
can be related to the type of emotion. The third dimension,
control (dominance), is particularly useful in distinguishing
among emotional states having similar arousal and valence (e.g.,
differentiating between grief and rage), and typically ranges
from “no control” to “full control.” Consequently, the entire
scope of human emotions can be represented as a set of points
in the three-dimensional (3-D) VAC coordinate space.
Fig. 2. Illustration of the 3-D emotion space (from Dietz and Lang [9]).
While one might assume that the points corresponding
to different affective states are equally likely to be found any-
where in the three-dimensional VAC coordinate space, psycho-
physiological experiments show that only certain areas of this
space are actually relevant. These experiments typically include
measurements of affective responses of a large group of subjects
to calibrated audio-visual stimuli collected in the International
Affective Picture System (IAPS, Lang et al. [20]) and the Inter-
national Affective Digitized Sounds system (IADS, Bradley and
Lang [6]). Subjects’ affective responses to these stimuli can be
quantified either by evaluating their own reports, e.g., by using
the Self-Assessment Manikin ([18]) or by measuring physio-
logical functions that are considered related to particular affect
dimensions. For example, heart rate reliably indexes valence,
while skin conductance is associated with arousal. It was found
that the heart rate accelerates as a reaction to pleasant stimuli,
while unpleasant stimuli cause the heart rate to slow down [8],
[10], [12]. Also, an increase in arousal causes the sweat glands
to become active and the skin conductance responses larger and
more frequent [8], [14]. Since IAPS and IADS are specially
created to evoke a wide range of emotions with their audio-vi-
sual content, the three-dimensional surface enclosing the
affective responses after their mapping onto the corresponding
points in the 3-D VAC coordinate system is roughly parabolic.
An idea about the shape of the surface can be obtained from
the illustration in Fig. 2. The parabolic shape becomes logical
if we realize that there are relatively few or even no stimuli that
would cause an emotional state characterized by, for instance,
high arousal and neutral valence, or high valence accompanied
by low arousal [9].
The dimensional approach to representing emotion as de-
scribed above can play an important role in the development
of affective agents that serve as mediators between the com-
puter and user, and involve the user in an interaction with the
computer in the same way as he/she interacts with other hu-
mans. Since human-to-human interaction is strongly determined
by emotions, an affective agent should be able to sense, synthesize, and
express emotions. For example, Dietz and Lang [9] use the par-
abolic surface from Fig. 2 as the basis for assigning a tempera-
ment, mood and emotion to an affective agent, thus defining the
personality of that agent. The temperament is a fixed point in
the space that defines the “at rest” state of the agent (its rudimen-
tary personality). While the temperament is static, the points
corresponding to the mood and emotion of the agent can move
freely within the space. The position of the emotion point gives
rise to the expressions of the agent and determines its current
affective state. Further, the emotion point gravitates toward the
position of the mood; the mood, in turn, moves through the space
relatively slowly, is mainly pulled by emotional events, and
gravitates toward the position of the temperament. The dynamics
of the system are therefore influenced by both the agent’s current
affective state and its temperament.

Fig. 3. Illustration of the 2-D emotion space (from Dietz and Lang [9]).
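Read as a dynamical system, this description lends itself to a short simulation. The sketch below is our own toy reading of such gravitation dynamics, not Dietz and Lang's published model; the update rule, the gains k_emotion and k_mood, and all numeric values are illustrative assumptions.

```python
import numpy as np

def update_agent_state(emotion, mood, temperament,
                       k_emotion=0.1, k_mood=0.01, stimulus=None):
    """One toy time step in the 3-D VAC space: the emotion point is
    pushed by emotional events and gravitates toward the mood, while
    the mood slowly gravitates toward the fixed temperament.
    All states are arrays of (valence, arousal, control)."""
    if stimulus is not None:
        emotion = emotion + stimulus                   # emotional event
    emotion = emotion + k_emotion * (mood - emotion)   # fast pull toward mood
    mood = mood + k_mood * (temperament - mood)        # slow pull toward temperament
    return emotion, mood

# Example: a calm temperament reacting to a negative, arousing event
temperament = np.array([0.2, -0.1, 0.0])               # "at rest" state
emotion, mood = temperament.copy(), temperament.copy()
emotion, mood = update_agent_state(emotion, mood, temperament,
                                   stimulus=np.array([-0.8, 0.6, 0.0]))
```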
IV. AFFECTIVE VIDEO CONTENT REPRESENTATION
A. Two-Dimensional (2-D) Emotion Space
As can be seen from Fig. 2, the effect of the control dimen-
sion becomes visible only at points with distinctly high abso-
lute valence values. This effect is also quite small, mainly due
to a rather narrow range of values belonging to this dimension.
Consequently, it can be said that the control dimension plays
only a limited role in characterizing various emotional states.
As a matter of fact, Greenwald et al. [12] have shown that va-
lence and arousal account for most of the independent variance
in emotional responses. This is especially true for the problem
to be addressed in this paper: the extraction of the affective
content from a video. Numerous studies of human emotional
responses to media have shown that emotion elicited by pic-
tures, television, radio, computers, and sounds can be mapped
onto an emotion space created by the arousal and valence axes
[9]. For this reason, we neglect the control dimension and con-
sider the arousal and valence dimensions only. Instead of the
three-dimensional surface introduced in the previous section,
the relevant emotion space for the purpose of affective video
content analysis is reduced to the projection of this surface onto
the arousal-valence plane. Fig. 3 shows an illustration of the re-
sulting 2-D emotion space. The parabolic contour is generated to
enclose the scatter plot of affective responses with respect to
arousal and valence only, which were collected using the IAPS
and IADS stimuli. It is expected that the affective states ex-
tracted from a video can be represented as the points within this
contour.
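As a small illustration of how such a contour can be used computationally, the test below models the realistic region as a band around a parabola in the valence-arousal plane. The parabola coefficient and band width are hypothetical placeholders, not values fitted to the IAPS/IADS data.

```python
def inside_emotion_space(valence, arousal, a=1.0, width=0.25):
    """Approximate membership test for the 2-D emotion space, with
    valence in [-1, 1] and arousal in [0, 1]. Points are accepted only
    within a band around the parabola arousal = a * valence**2, so
    high arousal with neutral valence, or strong valence with low
    arousal, is rejected (cf. the discussion of Fig. 2)."""
    return abs(arousal - a * valence ** 2) <= width

print(inside_emotion_space(0.0, 0.1))    # True: calm and neutral
print(inside_emotion_space(0.0, 0.9))    # False: high arousal, neutral valence
print(inside_emotion_space(0.9, 0.85))   # True: strong valence with high arousal
```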
B. Arousal, Valence, and Affect Curve
By computing the arousal and valence values along a video,
the arousal and valence time curves can be obtained. We intro-
duce these curves, either considered separately or combined into
the so-called affect curve, as suitable representations of the af-
fective content of a video in view of the applications described
in Section II.
Fig. 4. Illustration of the arousal, valence, and affect curve.
The arousal time curve indicates how the intensity of the
emotional load changes along a video, and depicts the expected
changes in a user’s excitement while watching that video. In this
sense, the arousal curve is particularly suitable for locating the
“exciting” video segments. On the basis of the arousal time
curve we can generate a video abstract containing the highlights
in a desired length. Namely, given the maximum abstract length
N in frames, a horizontal line can be drawn cutting off the peaks
of the curve in such a way that the number of frames covered by
the peaks is not larger than N. This is illustrated in Fig. 4(a).
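A minimal sketch of this cut-off, assuming the arousal curve is available as one value per frame; the function name and the frame-level granularity are our simplifications.

```python
import numpy as np

def highlight_mask(arousal, n_max):
    """Select at most n_max frames by lowering a horizontal line over
    the arousal curve: frames above the line are exactly the frames
    covered by the cut-off peaks."""
    arousal = np.asarray(arousal, dtype=float)
    if n_max >= arousal.size:
        return np.ones(arousal.size, dtype=bool)
    threshold = np.sort(arousal)[-n_max - 1]  # line just below the n_max largest values
    return arousal > threshold                # ties at the line are excluded, keeping <= n_max

# Example: a 10-frame toy curve with an abstract budget of 3 frames
mask = highlight_mask([.1, .4, .9, .8, .2, .1, .5, .95, .7, .3], n_max=3)
```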
The valence time curve depicts the changes in the type
of feelings or emotions contained in a video over time. As
such, this curve mimics the expected changes in the mood of the
user while watching a video. Using the valence time curve we
can also determine the “positive” and “negative” video segments
with respect to the expected type of feeling that is evoked in the
user during these segments. This information can serve to match
the video to the personal preferences of the user, but also to auto-
matically perform censorship tasks, that is, to remove all seg-
ments from a video that are too negative for certain groups of
the audience. As illustrated in Fig. 4(b), such segments may be
sought among those for which the valence curve reaches local
minima. The arousal and valence time curves can be combined
into the affect curve. This curve is composed of the value pairs
of the arousal and valence time curves that are taken per time
stamp of the video and mapped onto the corresponding points
of the 2-D emotion space [Fig. 4(c)].
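Assuming both curves are sampled per frame on the same time axis, forming the affect curve is a per-frame pairing, as in this sketch:

```python
import numpy as np

def affect_curve(arousal, valence):
    """Combine the arousal and valence time curves into the affect
    curve: one (valence, arousal) point per frame, tracing a path
    through the 2-D emotion space."""
    arousal = np.asarray(arousal, dtype=float)
    valence = np.asarray(valence, dtype=float)
    assert arousal.shape == valence.shape, "curves must share a time axis"
    return np.column_stack([valence, arousal])
```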
The affect curve can be seen as the most complete repre-
sentation of the affective content of a video, which can be ob-
tained automatically. This curve can be interpreted in various
ways and used for numerous applications related to video con-
tent representation and retrieval at the affective level. For in-
stance, assuming that the affect curve has already been com-
puted for a given video, an arbitrary temporal segment of that
video can automatically be indexed with respect to the affective
states through which the corresponding part of the affect curve
passes. Indexes can be provided in the form of labels that are
assigned a priori to different regions of the 2-D emotion space,
as illustrated in Fig. 5. Also, the area of the 2-D emotion space
through which the curve passes most of the time corresponds to
the dominant affective state (the prevailing mood) of a video. This
can be highly useful for automatically classifying a video into
different affective genres. Further, the affect curve may directly
serve as a criterion for filtering the incoming videos according
to a user’s preferences. Namely, an affect curve representing
a user’s preferences can be obtained by simply combining the
affect curves of all programs that the user has selected in the
past (in the learning phase of the system). Filtering an incoming
video according to this user’s preferences is then nothing more
than matching the affect curve of the incoming video with the
affect curve describing the user’s preferences.

Fig. 5. Illustration of the possibility for video content indexing and retrieval
at affective level.
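The sketch below illustrates two of these uses: indexing via a-priori region labels and finding the prevailing mood as the most-visited region. The quadrant boundaries and label names are hypothetical; the paper does not prescribe a particular labeling.

```python
from collections import Counter

def region_label(valence, arousal, arousal_mid=0.5):
    """A-priori label for a point of the 2-D emotion space; these four
    quadrant labels are illustrative placeholders."""
    if arousal >= arousal_mid:
        return "excited/joyful" if valence >= 0 else "fearful/angry"
    return "relaxed/content" if valence >= 0 else "sad/bored"

def prevailing_mood(curve):
    """Dominant affective state: the region in which the affect curve
    (rows of (valence, arousal)) spends most of its time."""
    labels = [region_label(v, a) for v, a in curve]
    return Counter(labels).most_common(1)[0][0]
```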
V. AFFECTIVE VIDEO CONTENT MODELING
In order to obtain the affective content representation as de-
scribed in the previous section, models need to be developed
for the arousal and valence time curves. These models fulfill the
task of deriving arousal and valence values from the values
of low-level features computed in a video. In Section V-A, we
introduce the basic criteria that need to be taken into account
during the model development. Then, in Section V-B, we elab-
orate on the possibilities for establishing relations between the
affect dimensions and low-level features. Finally, we propose
models for the arousal and valence time curves and experiment
with these models on a number of video excerpts from movies
and soccer television broadcasts.
A. Criteria for Developing Affect Models
As arousal and valence are psychological categories, their
models need to be psychologically justifiable. To achieve this,
we introduce the following three criteria that a model for the
arousal, valence or affect curve should satisfy.
Comparability.
Compatibility.
Smoothness.
The first criterion (Comparability) ensures that the values of
the arousal, valence and the resulting affect curve obtained in
different videos for similar types of events are comparable.
This criterion obviously imposes normalization and scaling
requirements when computing the time curves. The second
criterion (Compatibility) ensures that the affect curve covers
an area in the valence-arousal coordinate system, the shape of
which roughly corresponds to the parabolic-like contour of the
2-D emotion space. The third criterion (Smoothness) accounts
for the degree of memory retention of preceding frames and
shots [1]. It ensures that the perception of the content, and
consequently the mediated affective state, does not change
abruptly from one video frame to another but is a function of a
number of consecutive frames (shots).
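These criteria translate naturally into normalization and smoothing steps. The sketch below shows one way to realize comparability and smoothness for a per-frame feature signal; the window shape and length are illustrative choices, not the paper's exact model.

```python
import numpy as np

def normalize(x):
    """Comparability: scale a signal to [0, 1] so curves from
    different videos can be compared and combined."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def smooth(x, window_len=301):
    """Smoothness: convolve with a window spanning many frames so the
    curve reflects memory retention rather than frame-to-frame jumps."""
    window = np.hanning(window_len)
    return np.convolve(x, window / window.sum(), mode="same")

def feature_to_curve(feature_signal):
    # normalize -> smooth -> rescale, so the final curve again spans
    # the full comparable value range
    return normalize(smooth(normalize(feature_signal)))
```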
B. Feature Selection
Little is known regarding the relations between the low-level
features and affect. While the problem of bridging the semantic
gap remains very hard in the case of cognitive video content
analysis, the magnitude of this problem in the affective case
is even bigger. The reason for this is that in the cognitive case
the low-level features describe aspects of a real entity, e.g., the
choice of the color red as one of the features to characterize
a red car. In the affective case, however, we need to relate the
low-level features to something rather abstract, such as feeling
or emotion. In the context of this paper, we are particularly inter-
ested in the relations between low-level features and the affect
dimensions of arousal and valence.
One of the most extensively investigated visual features in the
context of affective video content analysis is motion. Research
results show that motion in a television picture has a signifi-
cant impact on individual affective responses. This has also been
recognized by film theorists, who contend that motion is highly
expressive and is able to evoke strong emotional responses in
viewers ([2], [11]). In particular, Detenber et al. [8] and Sim-
mons et al. [29] investigated the influence of camera and object
motion on the emotional responses of humans and concluded that
an increase of motion intensity on the screen causes an increase
in arousal. The type of emotion (represented by the sign of va-
lence) was found to be independent of motion: if the mood of a test
person was “positive” or “negative” while watching a still pic-
ture, the sign of the mood will not change if motion is intro-
duced within that picture.
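As an illustration of turning this finding into an arousal-related feature, the sketch below derives a per-frame motion-intensity signal from grayscale frames; the mean absolute frame difference is our cheap stand-in for the motion-vector-based activity measures such studies rely on.

```python
import numpy as np

def motion_intensity(frames):
    """Per-frame motion intensity for a sequence of grayscale frames
    (2-D arrays). Larger inter-frame differences stand in for more
    camera/object motion, the cue linked to increased arousal."""
    values = [0.0]  # no motion estimate for the first frame
    for prev, cur in zip(frames[:-1], frames[1:]):
        values.append(np.abs(cur.astype(float) - prev.astype(float)).mean())
    return np.asarray(values)
```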
Based on the results obtained by Murray and Arnott [23], as
well as those reported by Picard in [25] and [26], various vocal
effects present in the sound track of a video may bear broad
relations to the affective content of that video. In terms of affect
dimensions, the loudness (signal energy) and speech rate (e.g.,
faster for fear or joy and slower for disgust or romance) are
often related to arousal, while the inflection, rhythm,
duration of the last syllable of a sentence, voice quality (e.g.,
breathy or resonant), as well as the pitch-related features (pitch
average, pitch range and pitch changes), are commonly related
to valence [23], [25].
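Of the vocal cues above, loudness is the simplest to compute; the sketch below measures short-time signal energy as an arousal-related feature (pitch- and rhythm-based valence features would need a pitch tracker and are omitted). The frame and hop sizes are arbitrary illustrative values.

```python
import numpy as np

def short_time_energy(samples, frame_len=1024, hop=512):
    """Loudness proxy: mean squared amplitude over sliding windows of
    the audio track; sustained high energy is one of the cues the
    text relates to arousal."""
    samples = np.asarray(samples, dtype=float)
    n_frames = 1 + max(0, len(samples) - frame_len) // hop
    return np.array([
        np.mean(samples[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
```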
References
[18] M. M. Bradley and P. J. Lang, “Measuring emotion: The Self-Assessment Manikin and the semantic differential,” J. Behav. Ther. Exp. Psychiatry, vol. 25, no. 1, pp. 49–59, 1994.
[24] C. E. Osgood, G. J. Suci, and P. H. Tannenbaum, The Measurement of Meaning. Urbana, IL: Univ. of Illinois Press, 1957.
[25] R. W. Picard, Affective Computing. Cambridge, MA: MIT Press, 1997.
[26] R. W. Picard, E. Vyzas, and J. Healey, “Toward machine emotional intelligence: Analysis of affective physiological state,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 10, pp. 1175–1191, Oct. 2001.