
Multimedia Tools and Applications, 26, 259–276, 2005
© 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.
Motion-Based Selection of Relevant Video Segments for Video Summarization
NATHALIE PEYRARD nathalie.peyrard@avignon.inra.fr
INRA, Domaine Saint Paul Site Agroparc, 84914 Avignon Cedex 9, France
PATRICK BOUTHEMY patrick.bouthemy@irisa.fr
IRISA/INRIA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
Abstract. We present a method for motion-based video segmentation and segment classification as a step towards
video summarization. The sequential segmentation of the video is performed by detecting changes in the dominant
image motion, assumed to be related to camera motion and represented by a 2D affine model. The detection is
achieved by analysing the temporal variations of some coefficients of the 2D affine model (robustly) estimated.
The obtained video segments supply reasonable temporal units to be further classified. For the second stage, we
adopt a statistical representation of the residual motion content of the video scene, relying on the distribution
of temporal co-occurrences of local motion-related measurements. Pre-identified classes of dynamic events are
learned off-line from a training set of video samples of the genre of interest. Each video segment is then classified
according to a Maximum Likelihood criterion. Finally, excerpts of the relevant classes can be selected for video
summarization. Experiments regarding the two steps of the method are presented on different video genres, leading to very encouraging results even though only low-level motion information is considered.
Keywords: video segmentation, probabilistic motion modelling, supervised event classification
1. Introduction
Replacing a long video by a small number of representative segments provides a synthetic
description of the document, which can be exploited for numerous applications including
both home video and professional usages. However, the construction of video summaries remains an open problem and the subject of active research. The main difficulty obviously lies in the detection of semantic events from low-level information. Several
approaches have been developed involving different video information and different rep-
resentations. For instance, [21] proposed a strategy for video summarization and browsing
by selecting key-frames that are maximally distinct and carry the most information, based
on the chrominance components of each pixel in the image. The video elementary unit
considered in this method is simply the frame, which can be restrictive when trying to
detect temporal semantic events. In [12], the authors present a generic method based on the
modelling of user attention. Visual as well as audio information is combined to provide
a user attention curve whose maxima define video segments likely to be of interest for the
viewer. The combination of audio and image information is also exploited in [14] for video
summarization. In this paper, we consider the task of selecting relevant temporal segments
in a video and we adopt an approach based on motion-content analysis. It is obvious that
the use of complementary information, such as color or audio, would lead to better results.
The aim here is not to fully solve the problem but to explore the potential of motion
information for the specified task.
When dealing with video summarization, the first step usually consists in partitioning the
video into elementary temporal segments. Such elementary units can be shots [1, 3, 4, 13, 20]
which reveal the technical acquisition and editing processes of the video. We believe that a
content-based segmentation, relying on the analysis of the evolution of the motion content in
the sequence of images, should be more suited to the extraction of relevant dynamic events.
In a previous approach [16], we have exploited the global motion in a video and a distance
between probabilistic motion models to perform the video segmentation. In this paper we
propose a simpler method based on the camera motion only. Indeed, a single shot can involve
different successive camera motions and a camera motion change usually reflects a change
in the activity content of the depicted scene. This segmentation is performed by detecting
changes in the temporal evolution of coefficients of the 2D affine model representing the
dominant motion in the images, the latter being assumed to be due to camera motion. Such
a model has often been considered to estimate the camera motion, for instance in [13] using
the MPEG motion vectors of the compressed video stream.
In a second stage, we apply a supervised classification algorithm based on the characteri-
zation of the residual image motion. Indeed, if the dominant image motion is due to camera
motion, the residual motion (i.e., after subtracting the dominant motion in the global image
motion) can be related to the projection of the scene motion. Once temporal units of the
processed video are identified, one way to characterize their dynamic content would be to
consider again parametric motion models (e.g. 2D affine or quadratic motion models). How-
ever, the dynamic situations which can be described by such models are too restricted. They
are suitable for modelling camera motion but not for modelling general scene motion.
Several motion characterizations have been investigated. In [1], motion-related features are
derived from the computation of the optical flow field. Based on these features, a method
for video indexing is proposed. The study described in [18] for the detection of a sequence
of home activities in a video relies on segmenting moving objects and detecting temporal
discontinuities in the successive optical flow fields. It involves the analysis of the evolution
of the most significant coefficients of the singular value decomposition (SVD) of the set of
successive flow fields. Still dealing with video content characterization but in the context
of video classification into genres, statistical models are introduced by [20] to describe two
components of the video structure: shot duration and shot activity. In [22], in order to clus-
ter temporal dynamic events, the latter are captured by the computed histograms of local
spatial and temporal intensity gradients at different temporal scales. A distance between
events is then built, based on the comparison of the empirical histograms of these features.
Recently in [11], the authors have proposed motion pattern descriptors extracted from mo-
tion vector fields and have exploited support vector machines (SVM) for classification of
video clips into semantic categories. For an objective very close to video summarization, namely shot overview, the method in [5] relies on the nonlinear temporal modelling of wavelet-based motion features. Like [22], we propose to exploit low-level motion measurements, but ones conveying richer motion information while remaining easily computable compared to optic flow. These local measurements are straightforwardly extracted from the image
intensities and are exploited with statistical motion models as introduced in [8]. These mod-
els are specified from the temporal co-occurrences of the quantized local motion-related
measurements and can handle a wide range of dynamic contents. Exploiting this statistical
framework, we propose to label each video segment according to learned classes of dy-
namic events, using a Maximum Likelihood criterion. Then, only the segments associated with classes defined as relevant in terms of dynamic events are selected, and excerpts of these significant segments can be further exploited for video summarization.
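As a minimal sketch of this Maximum Likelihood labelling step (helper names are ours, and we assume the learned class models are stored as normalized co-occurrence probability matrices, as suggested above):

```python
import numpy as np

def log_likelihood(cooc_counts, class_model, eps=1e-12):
    """Log-likelihood of a segment's co-occurrence counts under one
    class model (a normalized co-occurrence probability matrix)."""
    return float(np.sum(cooc_counts * np.log(class_model + eps)))

def classify_segment(cooc_counts, class_models):
    """Maximum Likelihood labelling: return the event class whose learned
    model best explains the segment's temporal co-occurrence statistics."""
    scores = {label: log_likelihood(cooc_counts, model)
              for label, model in class_models.items()}
    return max(scores, key=scores.get)

# Segments whose label belongs to the set of relevant classes (e.g. {"play"}
# for sports videos) would then be kept as candidates for the summary.
```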
Section 2 describes the temporal video segmentation stage based on camera motion,
and the behaviour of the resulting algorithm is illustrated on videos of different genres. In
Section 3, we present the classification stage relying on a probabilistic motion modelling,
and the global two-step method is applied for the recognition of relevant events in two sports
videos. Section 4 contains concluding remarks.
2. Temporal video segmentation
In this section, we present the first stage of our method for selecting relevant segments
in a given video. It consists in performing a sequential segmentation of the video into
homogeneous segments in terms of camera motion. Its performance is illustrated on three
real video documents. In particular, we compare the results with those obtained with other
types of segmentation (segmentation into shots or based on global motion content) to confirm the suitability of the segmentation presented in this paper for the targeted objective, i.e., the
detection of particular dynamic events in a video.
2.1. Camera motion modelling and estimation
The video segmentation relies on the analysis of the dominant image motion computed
between two successive frames of the video. The dominant image motion is assumed to
correspond to the apparent motion of the background induced by the 3D camera motion. It
is possible to represent the projection of the 3D motion field (relative to the camera) of the
static background by a 2D parametric motion model (assuming a shallow environment, or
accounting for its main part if it involves different depth layers). For example, we can deal
with the affine motion model defined at image point p = (x, y) by:
\[
w_\theta(p) = \begin{pmatrix} a_1 + a_2 x + a_3 y \\ a_4 + a_5 x + a_6 y \end{pmatrix}, \qquad (1)
\]
where \(\theta = (a_1, a_2, a_3, a_4, a_5, a_6)\) is the motion model parameter vector. Such a model can
handle different camera motions: panning, zooming, tracking (including of course static
shots). For more complex situations, a quadratic model can be used but we will restrict
ourselves here to the affine model. It forms a good trade-off between the relevance of the
motion representation and the complexity of the estimation. The model parameters θ of the
dominant image motion are then estimated using the gradient-based robust multiresolution
algorithm designed in [15]. More precisely, the robustness is ensured by the minimization
of an M-estimator criterion. The constraint is derived from the assumption of brightness
constancy and the parameter estimator of the affine motion model between frame I (k) and
frame I (k + 1) is defined as:
\[
\hat{\theta} = \arg\min_{\theta} \sum_{p \in R} \rho\bigl(\mathrm{DFD}_\theta(p)\bigr),
\]
where \(\mathrm{DFD}_\theta(p) = I(p + w_\theta(p), k+1) - I(p, k)\) is the Displaced Frame Difference and
R is the spatial image grid. The M-estimator \(\rho\) is chosen as a hard-redescending
function. The minimization takes advantage of a multiresolution framework and an incremental
scheme based on the Gauss-Newton method. It is implemented as an iteratively reweighted
least-squares technique. This method yields an accurate estimation of the dominant motion
between two images even if other secondary motions are present.
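The algorithm of [15] is multiresolution and incremental; the following single-resolution sketch (all function names and the choice of Tukey's biweight are ours) only illustrates the underlying iteratively reweighted least-squares scheme on the linearized brightness-constancy constraint:

```python
import numpy as np

def tukey_weights(r, c=4.685):
    """Hard-redescending M-estimator weights (Tukey's biweight; the paper
    only states that a hard-redescending function is used)."""
    scale = np.median(np.abs(r)) / 0.6745 + 1e-8   # robust scale via MAD
    u = r / (c * scale)
    w = (1.0 - u**2)**2
    w[np.abs(u) >= 1.0] = 0.0                      # outliers get zero weight
    return w

def estimate_affine_dominant_motion(I0, I1, n_iters=10):
    """IRLS estimate of theta = (a1..a6) minimizing sum_p rho(DFD_theta(p)),
    with the DFD linearized around zero motion. Single resolution only, so
    it handles small displacements; [15] adds multiresolution + increments."""
    I0 = I0.astype(float)
    I1 = I1.astype(float)
    Iy, Ix = np.gradient(I0)                       # spatial gradients
    It = (I1 - I0).ravel()                         # temporal difference
    h, w = I0.shape
    yy, xx = np.mgrid[0:h, 0:w]
    x, y = xx.ravel().astype(float), yy.ravel().astype(float)
    gx, gy = Ix.ravel(), Iy.ravel()
    # Linearized brightness constancy: gx*u + gy*v + It ~= 0, with
    # u = a1 + a2*x + a3*y and v = a4 + a5*x + a6*y.
    A = np.stack([gx, gx * x, gx * y, gy, gy * x, gy * y], axis=1)
    b = -It
    theta = np.zeros(6)
    for _ in range(n_iters):
        r = A @ theta - b                          # residuals ~ DFD values
        wts = tukey_weights(r)
        AW = A * wts[:, None]
        theta = np.linalg.solve(AW.T @ A + 1e-8 * np.eye(6), AW.T @ b)
    return theta
```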
2.2. Detection of camera motion changes
In order to detect changes in the camera motion, we analyse the temporal evolution of the two translation coefficients \(a_1\) and \(a_4\) of the affine model (1). In general, a change in camera motion induces a jump in the evolution of these two signals. To detect such ruptures we propose to apply a Hinkley test to each signal, \(a_1(k)\) and \(a_4(k)\). This statistical test is derived from likelihood ratio tests, evaluating the "no change" hypothesis (no change between frames k − 1 and k) versus the "change" hypothesis. It provides a simple and efficient means to detect jumps in the mean of a signal and it is known to be robust since it is cumulative (see [2]). Since the direction of the change in the mean of the signal is unknown, in practice two tests are performed in parallel, looking respectively for a decrease and for an increase in the mean. Let us first consider the signal \(a_1(k)\) and the case of testing for a decrease. Let \(\mu_0\) be the value of the mean before the change occurs and \(j_{\min}\) the a priori chosen minimum jump magnitude. The sequence \(D_n\), defined as follows:
\[
D_0 = 0, \qquad D_n = \sum_{k=1}^{n} \Bigl( a_1(k) - \mu_0 + \frac{j_{\min}}{2} \Bigr),
\]
represents the cumulative sum of the differences between signal \(a_1\) and \(\mu_0 - j_{\min}/2\). A jump is detected when \(d_n - D_n > \lambda\), with \(d_n = \max_{0 \le k \le n} D_k\) and \(\lambda\) a predefined threshold. Intuitively, this means that a change in the mean is detected when a value of \(a_1\) is significantly smaller than \(\mu_0 - j_{\min}/2\) and the phenomenon is not isolated. The mean \(\mu_0\) is estimated on-line and reinitialised after each jump detection. In the case of testing for an increase, the test performed is defined by:
\[
U_0 = 0, \qquad U_n = \sum_{k=1}^{n} \Bigl( a_1(k) - \mu_0 - \frac{j_{\min}}{2} \Bigr), \qquad u_n = \min_{0 \le k \le n} U_k, \qquad \text{alarm if } U_n - u_n > \lambda.
\]
When a change is detected, the jump location is given by the last k satisfying \(d_k - D_k = 0\) or \(U_k - u_k = 0\), the variable \(D_n\) is reinitialised to 0 and the next search starts from the instant (image) following the detected jump location.
Similarly and in parallel, the Hinkley test is performed for detecting ruptures in signal \(a_4\), corresponding to the two detection rules:
\[
M_0 = 0, \qquad M_n = \sum_{k=1}^{n} \Bigl( a_4(k) - \mu_0 + \frac{j_{\min}}{2} \Bigr), \qquad m_n = \max_{0 \le k \le n} M_k, \qquad \text{alarm if } m_n - M_n > \lambda,
\]
\[
R_0 = 0, \qquad R_n = \sum_{k=1}^{n} \Bigl( a_4(k) - \mu_0 - \frac{j_{\min}}{2} \Bigr), \qquad r_n = \min_{0 \le k \le n} R_k, \qquad \text{alarm if } R_n - r_n > \lambda.
\]
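To make the detection rules concrete, here is a compact sketch of the two-sided Hinkley detector applied to one coefficient signal (parameter values are placeholders to be tuned per video; \(\mu_0\) is re-estimated on-line and everything is reinitialised after each detection, as described above):

```python
import numpy as np

def hinkley_changepoints(signal, jump_min=0.1, lam=1.0):
    """Two-sided Hinkley test on a 1D signal such as a1(k) or a4(k).

    Decrease test: D_n with running max d_n; increase test: U_n with
    running min u_n. mu0 is the on-line mean since the last detection."""
    changes = []
    D = U = d_max = u_min = 0.0
    mu0, count = 0.0, 0
    k_d = k_u = 0                 # last k with d_k - D_k = 0 / U_k - u_k = 0
    for k, a in enumerate(np.asarray(signal, dtype=float)):
        count += 1
        mu0 += (a - mu0) / count              # on-line mean estimate
        D += a - mu0 + jump_min / 2.0         # decrease statistic
        U += a - mu0 - jump_min / 2.0         # increase statistic
        if D >= d_max:
            d_max, k_d = D, k
        if U <= u_min:
            u_min, k_u = U, k
        decrease, increase = d_max - D > lam, U - u_min > lam
        if decrease or increase:
            changes.append(k_d if decrease else k_u)   # jump location
            D = U = d_max = u_min = 0.0                # reinitialise
            mu0, count = 0.0, 0
    return changes
```

Running this detector on \(a_1(k)\) and \(a_4(k)\) in parallel and merging the alarms yields the candidate segment boundaries.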
Note that for (perfect) zooming and for a static shot, the coefficients \(a_1\) and \(a_4\) are supposed to be zero. Thus, if two successive shots are both static shots, both pure zooming motions, or one is a static shot and the other involves a pure zooming motion, no change would occur in the \(a_1\) and \(a_4\) values. In practice, if the two shots are separated by a cut transition, (high) erroneous measures of \(a_1\) and \(a_4\) are observed at the cut location and the motion change is detected all the same. Besides, the method could be completed by the analysis of the temporal evolution of other parameters of model (1). More precisely, in the case of a pure zoom, the two diagonal coefficients \(a_2\) and \(a_6\) are supposed to be equal in theory and they often exhibit a rather constant slope over the time interval corresponding to the zooming motion. Consequently, performing a Hinkley test on the temporal derivative of the signal \(a_2(t)\) should allow us to detect changes between two zooms or between a zoom and a static shot. In practice, it seems more reasonable to work with the divergence parameter \(\mathrm{div} = \frac{1}{2}(a_2 + a_6)\) for stability reasons.
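A short sketch of this variant, reusing the hypothetical hinkley_changepoints function above (parameter values are again placeholders):

```python
import numpy as np

def zoom_change_points(a2, a6, jump_min=0.02, lam=0.5):
    """Detect zoom-rate changes from the divergence div = (a2 + a6) / 2,
    running the Hinkley detector on its temporal derivative, since a pure
    zoom exhibits a roughly constant slope in div."""
    div = 0.5 * (np.asarray(a2, dtype=float) + np.asarray(a6, dtype=float))
    return hinkley_changepoints(np.diff(div), jump_min=jump_min, lam=lam)
```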
2.3. Results
We have carried out experiments on three real video documents of different genres (a movie, a documentary and a sports program). For each example, we compare the result of the automatic segmentation based on camera motion with a manually-made segmentation according to the same criterion. We also compare with a manual segmentation into shots and with the method proposed in [16]. With the latter, homogeneous video segments are built in a sequential way by analysing the temporal variations of the global motion (i.e., residual plus dominant motion) in the sequence of images. The global motion of successive temporal units of the video is described by a statistical motion model as introduced in [8] (see also Section 3.1). Then, a merging decision criterion is considered, relying on a distance between the involved statistical motion models, to sequentially decide whether the successive temporal video units should be merged into a homogeneous segment or not.
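The merging criterion of [16] is only summarized here; as a rough illustration (the symmetrized Kullback-Leibler divergence and the threshold are our illustrative choices, not necessarily those of [16]), such a test could compare the normalized motion statistics of adjacent units:

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized Kullback-Leibler divergence between two distributions."""
    p = p / p.sum()
    q = q / q.sum()
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * (kl(p, q) + kl(q, p))

def should_merge(cooc_a, cooc_b, tau=0.1):
    """Merge two adjacent temporal units when their motion statistics
    (here, flattened co-occurrence matrices) are close enough."""
    return sym_kl(cooc_a.ravel().astype(float),
                  cooc_b.ravel().astype(float)) < tau
```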

References
S. Geman and D. Geman, "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE Trans. Pattern Analysis and Machine Intelligence, 1984.
M. Basseville, "Detecting changes in signals and systems—a survey," Automatica, 1988.
J.-M. Odobez and P. Bouthemy, "Robust multiresolution estimation of parametric motion models," Journal of Visual Communication and Image Representation, 1995.
J.S. Boreczky and L.A. Rowe, "Comparison of video shot boundary detection techniques," Journal of Electronic Imaging, 1996.
Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, "A user attention model for video summarization," Proc. ACM Multimedia, 2002.
Frequently Asked Questions (12)
Q1. What have the authors contributed in "Motion-based selection of relevant video segments for video summarization"?

The authors present a method for motion-based video segmentation and segment classification as a step towards video summarization. For the second stage, the authors adopt a statistical representation of the residual motion content of the video scene, relying on the distribution of temporal co-occurrences of local motion-related measurements. The obtained video segments supply reasonable temporal units to be further classified. 

The automatic segmentation leads to an oversegmentation in the first third of the video, probably due to the successive slight increases and decreases in the velocity of the camera following the actors in the corridor scenes. 

For an objective very close to video summarization, shots overview, the method in [5] relies on the nonlinear temporal modelling of waveletbased motion features. 

a continuous local motion measure is computed as a weighted mean, over a small spatial window, of the residual normal flow magnitude in order to obtain more reliable motion information.

The goal of the proposed video segmentation method is only to detect changes in camera motion and not to identify the nature of this motion. 

Once temporal units of the processed video are identified, one way to characterize their dynamic content would be to consider again parametric motion models (e.g. 2D affine or quadratic motion models). 

The study described in [18] for the detection of a sequence of home activities in a video relies on segmenting moving objects and detecting temporal discontinuities in the successive optical flow fields. 

The video elementary unit considered in this method is simply the frame, which can be restrictive when trying to detect temporal semantic events. 

In all their experiments, the ability of the method to provide homogeneous segments in terms of camera motion has been proved, which is the main requirement to carry on with the second step. 

the accuracy of the evaluation of the residual normal flow is highly dependent on the norm of the spatial intensity gradient, and this accuracy increases with \(\|\nabla I(p, k)\|\).
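The exact weighting scheme is not fully specified in these excerpts; a plausible reconstruction (our assumption: weights proportional to the squared gradient norm, with a floor \(\eta^2\) on the normalization) of the windowed measure is:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_motion_measure(Ix, Iy, It_res, eta2=25.0, win=3):
    """Windowed, gradient-weighted mean of the residual normal flow magnitude.

    |v_n| = |It_res| / ||grad I||, weighted by ||grad I||^2 (our assumption),
    so the numerator reduces to |It_res| * ||grad I||. The floor eta2 keeps
    the measure stable where the gradient is weak."""
    g2 = Ix**2 + Iy**2
    num = uniform_filter(np.abs(It_res) * np.sqrt(g2), size=win)
    den = np.maximum(uniform_filter(g2, size=win), eta2)
    return num / den
```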

With Table 1(b), the authors can see that among the segments classified as play by their algorithm, only 3% are "no play" segments (false alarms).

The temporal co-occurrence distribution \(\Gamma(y)\) of the sequence y is a matrix \(\{\Gamma(\nu, \nu' \mid y)\}_{(\nu,\nu') \in \Lambda^2}\) defined by:
\[
\Gamma(\nu, \nu' \mid y) = \sum_{k=2}^{K} \sum_{p \in R} \delta(\nu, y(p, k)) \cdot \delta(\nu', y(p, k-1)), \qquad (3)
\]
where \(\delta(i, j)\) is the Kronecker symbol (equal to 1 if i = j and to zero otherwise).
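A direct transcription of Eq. (3) (array layout and names are ours; y is assumed already quantized into n_levels values):

```python
import numpy as np

def temporal_cooccurrences(y, n_levels):
    """Gamma(nu, nu' | y) of Eq. (3): counts of quantized measure nu at
    frame k co-occurring with nu' at frame k-1 at the same pixel p.

    y: integer array of shape (K, H, W) with values in {0, ..., n_levels-1}."""
    gamma = np.zeros((n_levels, n_levels), dtype=np.int64)
    for k in range(1, y.shape[0]):
        np.add.at(gamma, (y[k].ravel(), y[k - 1].ravel()), 1)
    return gamma
```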