
Multimedia Tools and Applications, 26, 259–276, 2005
© 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.
Motion-Based Selection of Relevant Video Segments for Video Summarization
NATHALIE PEYRARD nathalie.peyrard@avignon.inra.fr
INRA, Domaine Saint Paul Site Agroparc, 84914 Avignon Cedex 9, France
PATRICK BOUTHEMY patrick.bouthemy@irisa.fr
IRISA/INRIA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
Abstract. We present a method for motion-based video segmentation and segment classification as a step towards
video summarization. The sequential segmentation of the video is performed by detecting changes in the dominant
image motion, assumed to be related to camera motion and represented by a 2D affine model. The detection is
achieved by analysing the temporal variations of some coefficients of the 2D affine model (robustly) estimated.
The obtained video segments supply reasonable temporal units to be further classified. For the second stage, we
adopt a statistical representation of the residual motion content of the video scene, relying on the distribution
of temporal co-occurrences of local motion-related measurements. Pre-identified classes of dynamic events are
learned off-line from a training set of video samples of the genre of interest. Each video segment is then classified
according to a Maximum Likelihood criterion. Finally, excerpts of the relevant classes can be selected for video
summarization. Experiments regarding the two steps of the method are presented on different video genres, leading to very encouraging results even though only low-level motion information is considered.
Keywords: video segmentation, probabilistic motion modelling, supervised event classification
1. Introduction
Replacing a long video by a small number of representative segments provides a synthetic
description of the document, which can be exploited for numerous applications including
both home video and professional usages. However, the construction of video summaries remains an open problem and the subject of active research. The main difficulty obviously lies in the detection of semantic events from low-level information. Several
approaches have been developed involving different video information and different rep-
resentations. For instance, [21] proposed a strategy for video summarization and browsing
by selecting key-frames that are maximally distinct and carry the most information, based
on the chrominance components of each pixel in the image. The video elementary unit
considered in this method is simply the frame, which can be restrictive when trying to
detect temporal semantic events. In [12], the authors present a generic method based on the
modelling of user attention. Visual as well as audio information is combined to provide
a user attention curve whose maxima define video segments likely to be of interest for the
viewer. The combination of audio and image information is also exploited in [14] for video
summarization. In this paper, we consider the task of selecting relevant temporal segments
in a video and we adopt an approach based on motion-content analysis. It is obvious that
the use of complementary information, such as color or audio, would lead to better results.
The aim here is not to fully solve the problem but to explore the potential of motion
information for the specified task.
When dealing with video summarization, the first step usually consists in partitioning the
video into elementary temporal segments. Such elementary units can be shots [1, 3, 4, 13, 20]
which reveal the technical acquisition and editing processes of the video. We believe that a
content-based segmentation, relying on the analysis of the evolution of the motion content in
the sequence of images, should be more suited to the extraction of relevant dynamic events.
In a previous approach [16], we have exploited the global motion in a video and a distance
between probabilistic motion models to perform the video segmentation. In this paper we
propose a simpler method based on the camera motion only. Indeed, a single shot can involve
different successive camera motions and a camera motion change usually reflects a change
in the activity content of the depicted scene. This segmentation is performed by detecting
changes in the temporal evolution of coefficients of the 2D affine model representing the
dominant motion in the images, the latter being assumed to be due to camera motion. Such
a model has often been considered to estimate the camera motion, for instance in [13] using
the MPEG motion vectors of the compressed video stream.
In a second stage, we apply a supervised classification algorithm based on the characteri-
zation of the residual image motion. Indeed, if the dominant image motion is due to camera
motion, the residual motion (i.e., after subtracting the dominant motion in the global image
motion) can be related to the projection of the scene motion. Once temporal units of the
processed video are identified, one way to characterize their dynamic content would be to
consider again parametric motion models (e.g. 2D affine or quadratic motion models). How-
ever, the dynamic situations which can be described by such models are too restricted. They
are suitable for modelling camera motion but not for modelling general scene motion.
Several motion characterizations have been investigated. In [1], motion-related features are
derived from the computation of the optical flow field. Based on these features, a method
for video indexing is proposed. The study described in [18] for the detection of a sequence
of home activities in a video relies on segmenting moving objects and detecting temporal
discontinuities in the successive optical flow fields. It involves the analysis of the evolution
of the most significant coefficients of the singular value decomposition (SVD) of the set of
successive flow fields. Still dealing with video content characterization but in the context
of video classification into genres, statistical models are introduced by [20] to describe two
components of the video structure: shot duration and shot activity. In [22], in order to clus-
ter temporal dynamic events, the latter are captured by the computed histograms of local
spatial and temporal intensity gradients at different temporal scales. A distance between
events is then built, based on the comparison of the empirical histograms of these features.
Recently in [11], the authors have proposed motion pattern descriptors extracted from mo-
tion vector fields and have exploited support vector machines (SVM) for classification of
video clips into semantic categories. For an objective very close to video summarization, namely shot overview, the method in [5] relies on the nonlinear temporal modelling of wavelet-based motion features. Like [22], we propose to exploit low-level motion measurements, but ones conveying richer motion information while remaining easily computable compared to optic flow. These local measurements are straightforwardly extracted from the image
intensities and are exploited with statistical motion models as introduced in [8]. These mod-
els are specified from the temporal co-occurrences of the quantized local motion-related
measurements and can handle a wide range of dynamic contents. Exploiting this statistical
framework, we propose to label each video segment according to learned classes of dy-
namic events, using a Maximum Likelihood criterion. Then, only the segments associated with classes defined as relevant in terms of dynamic events are selected, and excerpts of these significant segments can be further exploited for video summarization.
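As a minimal sketch of this Maximum Likelihood labelling step (helper names are ours, and we assume the learned class models are stored as normalized co-occurrence probability matrices, as suggested above):

```python
import numpy as np

def log_likelihood(cooc_counts, class_model, eps=1e-12):
    """Log-likelihood of a segment's co-occurrence counts under one
    class model (a normalized co-occurrence probability matrix)."""
    return float(np.sum(cooc_counts * np.log(class_model + eps)))

def classify_segment(cooc_counts, class_models):
    """Maximum Likelihood labelling: return the event class whose learned
    model best explains the segment's temporal co-occurrence statistics."""
    scores = {label: log_likelihood(cooc_counts, model)
              for label, model in class_models.items()}
    return max(scores, key=scores.get)

# Segments whose label belongs to the set of relevant classes (e.g. {"play"}
# for sports videos) would then be kept as candidates for the summary.
```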
Section 2 describes the temporal video segmentation stage based on camera motion,
and the behaviour of the resulting algorithm is illustrated on videos of different genres. In
Section 3, we present the classification stage relying on a probabilistic motion modelling,
and the global two-step method is applied for the recognition of relevant events in two sports
videos. Section 4 contains concluding remarks.
2. Temporal video segmentation
In this section, we present the first stage of our method for selecting relevant segments
in a given video. It consists in performing a sequential segmentation of the video into
homogeneous segments in terms of camera motion. Its performance is illustrated on three
real video documents. In particular, we compare the results with those obtained with other
types of segmentation (segmentation into shots or based on global motion content) to confirm the suitability of the segmentation presented in this paper for the targeted objective, i.e., the
detection of particular dynamic events in a video.
2.1. Camera motion modelling and estimation
The video segmentation relies on the analysis of the dominant image motion computed
between two successive frames of the video. The dominant image motion is assumed to
correspond to the apparent motion of the background induced by the 3D camera motion. It
is possible to represent the projection of the 3D motion field (relative to the camera) of the
static background by a 2D parametric motion model (assuming a shallow environment, or
accounting for its main part if it involves different depth layers). For example, we can deal
with the affine motion model defined at image point p = (x, y) by:
\[
w_\theta(p) = \begin{pmatrix} a_1 + a_2 x + a_3 y \\ a_4 + a_5 x + a_6 y \end{pmatrix}, \qquad (1)
\]
where \(\theta = (a_1, a_2, a_3, a_4, a_5, a_6)\) is the motion model parameter vector. Such a model can
handle different camera motions: panning, zooming, tracking (including of course static
shots). For more complex situations, a quadratic model can be used but we will restrict
ourselves here to the affine model. It forms a good trade-off between the relevance of the
motion representation and the complexity of the estimation. The model parameters θ of the
dominant image motion are then estimated using the gradient-based robust multiresolution
algorithm designed in [15]. More precisely, the robustness is ensured by the minimization
of an M-estimator criterion. The constraint is derived from the assumption of brightness
constancy and the parameter estimator of the affine motion model between frame I (k) and
frame I (k + 1) is defined as:
\[
\hat{\theta} = \arg\min_{\theta} \sum_{p \in R} \rho\bigl(\mathrm{DFD}_\theta(p)\bigr),
\]
where \(\mathrm{DFD}_\theta(p) = I(p + w_\theta(p), k+1) - I(p, k)\) is the Displaced Frame Difference and
R is the spatial image grid. The M-estimator \(\rho\) is chosen as a hard-redescending
function. The minimization takes advantage of a multiresolution framework and an incremental
scheme based on the Gauss-Newton method. It is implemented as an iteratively reweighted
least-squares technique. This method yields an accurate estimation of the dominant motion
between two images even if other secondary motions are present.
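The algorithm of [15] is multiresolution and incremental; the following single-resolution sketch (all function names and the choice of Tukey's biweight are ours) only illustrates the underlying iteratively reweighted least-squares scheme on the linearized brightness-constancy constraint:

```python
import numpy as np

def tukey_weights(r, c=4.685):
    """Hard-redescending M-estimator weights (Tukey's biweight; the paper
    only states that a hard-redescending function is used)."""
    scale = np.median(np.abs(r)) / 0.6745 + 1e-8   # robust scale via MAD
    u = r / (c * scale)
    w = (1.0 - u**2)**2
    w[np.abs(u) >= 1.0] = 0.0                      # outliers get zero weight
    return w

def estimate_affine_dominant_motion(I0, I1, n_iters=10):
    """IRLS estimate of theta = (a1..a6) minimizing sum_p rho(DFD_theta(p)),
    with the DFD linearized around zero motion. Single resolution only, so
    it handles small displacements; [15] adds multiresolution + increments."""
    I0 = I0.astype(float)
    I1 = I1.astype(float)
    Iy, Ix = np.gradient(I0)                       # spatial gradients
    It = (I1 - I0).ravel()                         # temporal difference
    h, w = I0.shape
    yy, xx = np.mgrid[0:h, 0:w]
    x, y = xx.ravel().astype(float), yy.ravel().astype(float)
    gx, gy = Ix.ravel(), Iy.ravel()
    # Linearized brightness constancy: gx*u + gy*v + It ~= 0, with
    # u = a1 + a2*x + a3*y and v = a4 + a5*x + a6*y.
    A = np.stack([gx, gx * x, gx * y, gy, gy * x, gy * y], axis=1)
    b = -It
    theta = np.zeros(6)
    for _ in range(n_iters):
        r = A @ theta - b                          # residuals ~ DFD values
        wts = tukey_weights(r)
        AW = A * wts[:, None]
        theta = np.linalg.solve(AW.T @ A + 1e-8 * np.eye(6), AW.T @ b)
    return theta
```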
2.2. Detection of camera motion changes
In order to detect changes in the camera motion, we analyse the temporal evolution of the two translation coefficients \(a_1\) and \(a_4\) of the affine model (1). In general, a change in camera motion induces a jump in the evolution of these two signals. To detect such ruptures we propose to apply a Hinkley test to each signal, \(a_1(k)\) and \(a_4(k)\). This statistical test is derived from likelihood ratio tests, evaluating the "no change" hypothesis (no change between frames k − 1 and k) versus the "change" hypothesis. It provides a simple and efficient means to detect jumps in the mean of a signal and it is known to be robust since it is cumulative (see [2]). Since the direction of the change in the mean of the signal is unknown, in practice two tests are performed in parallel, looking respectively for a decrease and for an increase in the mean. Let us first consider the signal \(a_1(k)\) and the case of testing for a decrease. Let \(\mu_0\) be the value of the mean before the change occurs and \(j_{\min}\) the a priori chosen minimum jump magnitude. The sequence \(D_n\), defined as follows:
\[
D_0 = 0, \qquad D_n = \sum_{k=1}^{n} \Bigl( a_1(k) - \mu_0 + \frac{j_{\min}}{2} \Bigr),
\]
represents the cumulative sum of the differences between signal \(a_1\) and \(\mu_0 - j_{\min}/2\). A jump is detected when \(d_n - D_n > \lambda\), with \(d_n = \max_{0 \le k \le n} D_k\) and \(\lambda\) a predefined threshold. Intuitively, this means that a change in the mean is detected when a value of \(a_1\) is significantly smaller than \(\mu_0 - j_{\min}/2\) and the phenomenon is not isolated. The mean \(\mu_0\) is estimated on-line and reinitialised after each jump detection. In the case of testing for an increase, the test performed is defined by:
\[
U_0 = 0, \qquad U_n = \sum_{k=1}^{n} \Bigl( a_1(k) - \mu_0 - \frac{j_{\min}}{2} \Bigr), \qquad u_n = \min_{0 \le k \le n} U_k, \qquad \text{alarm if } U_n - u_n > \lambda.
\]
When a change is detected, the jump location is given by the last k satisfying \(d_k - D_k = 0\) or \(U_k - u_k = 0\), the variable \(D_n\) is reinitialised to 0 and the next search starts from the instant (image) following the detected jump location.
Similarly and in parallel, the Hinkley test is performed for detecting ruptures in signal \(a_4\), corresponding to the two detection rules:
\[
M_0 = 0, \qquad M_n = \sum_{k=1}^{n} \Bigl( a_4(k) - \mu_0 + \frac{j_{\min}}{2} \Bigr), \qquad m_n = \max_{0 \le k \le n} M_k, \qquad \text{alarm if } m_n - M_n > \lambda,
\]
\[
R_0 = 0, \qquad R_n = \sum_{k=1}^{n} \Bigl( a_4(k) - \mu_0 - \frac{j_{\min}}{2} \Bigr), \qquad r_n = \min_{0 \le k \le n} R_k, \qquad \text{alarm if } R_n - r_n > \lambda.
\]
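To make the detection rules concrete, here is a compact sketch of the two-sided Hinkley detector applied to one coefficient signal (parameter values are placeholders to be tuned per video; \(\mu_0\) is re-estimated on-line and everything is reinitialised after each detection, as described above):

```python
import numpy as np

def hinkley_changepoints(signal, jump_min=0.1, lam=1.0):
    """Two-sided Hinkley test on a 1D signal such as a1(k) or a4(k).

    Decrease test: D_n with running max d_n; increase test: U_n with
    running min u_n. mu0 is the on-line mean since the last detection."""
    changes = []
    D = U = d_max = u_min = 0.0
    mu0, count = 0.0, 0
    k_d = k_u = 0                 # last k with d_k - D_k = 0 / U_k - u_k = 0
    for k, a in enumerate(np.asarray(signal, dtype=float)):
        count += 1
        mu0 += (a - mu0) / count              # on-line mean estimate
        D += a - mu0 + jump_min / 2.0         # decrease statistic
        U += a - mu0 - jump_min / 2.0         # increase statistic
        if D >= d_max:
            d_max, k_d = D, k
        if U <= u_min:
            u_min, k_u = U, k
        decrease, increase = d_max - D > lam, U - u_min > lam
        if decrease or increase:
            changes.append(k_d if decrease else k_u)   # jump location
            D = U = d_max = u_min = 0.0                # reinitialise
            mu0, count = 0.0, 0
    return changes
```

Running this detector on \(a_1(k)\) and \(a_4(k)\) in parallel and merging the alarms yields the candidate segment boundaries.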
Note that for (perfect) zooming and for a static shot, the coefficients \(a_1\) and \(a_4\) are supposed to be zero. Thus, if two successive shots are both static shots, both pure zooming motions, or one is a static shot and the other involves a pure zooming motion, no change would occur in the \(a_1\) and \(a_4\) values. In practice, if the two shots are separated by a cut transition, (high) erroneous measures of \(a_1\) and \(a_4\) are observed at the cut location and the motion change is detected all the same. Besides, the method could be completed by the analysis of the temporal evolution of other parameters of model (1). More precisely, in the case of a pure zoom, the two diagonal coefficients \(a_2\) and \(a_6\) are supposed to be equal in theory and they often exhibit a rather constant slope over the time interval corresponding to the zooming motion. Consequently, performing a Hinkley test on the temporal derivative of the signal \(a_2(t)\) should allow us to detect changes between two zooms or between a zoom and a static shot. In practice, it seems more reasonable to work with the divergence parameter \(\mathrm{div} = \frac{1}{2}(a_2 + a_6)\) for stability reasons.
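A short sketch of this variant, reusing the hypothetical hinkley_changepoints function above (parameter values are again placeholders):

```python
import numpy as np

def zoom_change_points(a2, a6, jump_min=0.02, lam=0.5):
    """Detect zoom-rate changes from the divergence div = (a2 + a6) / 2,
    running the Hinkley detector on its temporal derivative, since a pure
    zoom exhibits a roughly constant slope in div."""
    div = 0.5 * (np.asarray(a2, dtype=float) + np.asarray(a6, dtype=float))
    return hinkley_changepoints(np.diff(div), jump_min=jump_min, lam=lam)
```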
2.3. Results
We have carried out experiments on three real video documents of different genres (a movie, a documentary and a sports program). For each example, we compare the result of the automatic segmentation based on camera motion with a manually-made segmentation according to the same criterion. We also compare with a manual segmentation into shots and with the method proposed in [16]. With the latter, homogeneous video segments are built in a sequential way by analysing the temporal variations of the global motion (i.e., residual plus dominant motion) in the sequence of images. The global motion of successive temporal units of the video is described by a statistical motion model as introduced in [8] (see also Section 3.1). Then, a merging decision criterion is considered, relying on a distance between the involved statistical motion models, to sequentially decide whether the successive temporal video units should be merged into a homogeneous segment or not.
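The merging criterion of [16] is only summarized here; as a rough illustration (the symmetrized Kullback-Leibler divergence and the threshold are our illustrative choices, not necessarily those of [16]), such a test could compare the normalized motion statistics of adjacent units:

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized Kullback-Leibler divergence between two distributions."""
    p = p / p.sum()
    q = q / q.sum()
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * (kl(p, q) + kl(q, p))

def should_merge(cooc_a, cooc_b, tau=0.1):
    """Merge two adjacent temporal units when their motion statistics
    (here, flattened co-occurrence matrices) are close enough."""
    return sym_kl(cooc_a.ravel().astype(float),
                  cooc_b.ravel().astype(float)) < tau
```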

References
S. Geman and D. Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE Trans. Pattern Analysis and Machine Intelligence, 1984.
M. Basseville, "Detecting changes in signals and systems—a survey," Automatica, 1988.
J.-M. Odobez and P. Bouthemy, "Robust multiresolution estimation of parametric motion models," Journal of Visual Communication and Image Representation, 1995.
J.S. Boreczky and L.A. Rowe, "Comparison of video shot boundary detection techniques," Journal of Electronic Imaging, 1996.
Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, "A user attention model for video summarization," Proc. ACM Multimedia, 2002.
Frequently Asked Questions (12)
Q1. What have the authors contributed in "Motion-based selection of relevant video segments for video summarization"?

The authors present a method for motion-based video segmentation and segment classification as a step towards video summarization. For the second stage, the authors adopt a statistical representation of the residual motion content of the video scene, relying on the distribution of temporal co-occurrences of local motion-related measurements. The obtained video segments supply reasonable temporal units to be further classified. 

The automatic segmentation leads to an oversegmentation in the first third of the video, probably due to the successive slight increases and decreases in the velocity of the camera following the actors in the corridor scenes. 

For an objective very close to video summarization, shots overview, the method in [5] relies on the nonlinear temporal modelling of waveletbased motion features. 

a continuous local motion measure is computed as a weighted mean, over a small spatial window, of the residual normal flow magnitude in order to obtain more reliable motion information.

The goal of the proposed video segmentation method is only to detect changes in camera motion and not to identify the nature of this motion. 

Once temporal units of the processed video are identified, one way to characterize their dynamic content would be to consider again parametric motion models (e.g. 2D affine or quadratic motion models). 

The study described in [18] for the detection of a sequence of home activities in a video relies on segmenting moving objects and detecting temporal discontinuities in the successive optical flow fields. 

The video elementary unit considered in this method is simply the frame, which can be restrictive when trying to detect temporal semantic events. 

In all their experiments, the ability of the method to provide homogeneous segments in terms of camera motion has been proved, which is the main requirement to carry on with the second step. 

the accuracy of the evaluation of the residual normal flow is highly dependent on the norm of the spatial intensity gradient, and this accuracy increases with \(\|\nabla I(p, k)\|\).
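The exact weighting scheme is not fully specified in these excerpts; a plausible reconstruction (our assumption: weights proportional to the squared gradient norm, with a floor \(\eta^2\) on the normalization) of the windowed measure is:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_motion_measure(Ix, Iy, It_res, eta2=25.0, win=3):
    """Windowed, gradient-weighted mean of the residual normal flow magnitude.

    |v_n| = |It_res| / ||grad I||, weighted by ||grad I||^2 (our assumption),
    so the numerator reduces to |It_res| * ||grad I||. The floor eta2 keeps
    the measure stable where the gradient is weak."""
    g2 = Ix**2 + Iy**2
    num = uniform_filter(np.abs(It_res) * np.sqrt(g2), size=win)
    den = np.maximum(uniform_filter(g2, size=win), eta2)
    return num / den
```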

With Table 1(b), the authors can see that among the segments classified as play by their algorithm, only 3% are "no play" segments (false alarms).

The temporal co-occurrence distribution \(\Gamma(y)\) of the sequence y is a matrix \(\{\Gamma(\nu, \nu' \mid y)\}_{(\nu,\nu') \in \Lambda^2}\) defined by:
\[
\Gamma(\nu, \nu' \mid y) = \sum_{k=2}^{K} \sum_{p \in R} \delta(\nu, y(p, k)) \cdot \delta(\nu', y(p, k-1)), \qquad (3)
\]
where \(\delta(i, j)\) is the Kronecker symbol (equal to 1 if i = j and to zero otherwise).
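A direct transcription of Eq. (3) (array layout and names are ours; y is assumed already quantized into n_levels values):

```python
import numpy as np

def temporal_cooccurrences(y, n_levels):
    """Gamma(nu, nu' | y) of Eq. (3): counts of quantized measure nu at
    frame k co-occurring with nu' at frame k-1 at the same pixel p.

    y: integer array of shape (K, H, W) with values in {0, ..., n_levels-1}."""
    gamma = np.zeros((n_levels, n_levels), dtype=np.int64)
    for k in range(1, y.shape[0]):
        np.add.at(gamma, (y[k].ravel(), y[k - 1].ravel()), 1)
    return gamma
```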