Equivalent Key Frames Selection Based on Iso-Content Principles

Abstract
We present a key frame selection algorithm based on three iso-content principles (iso-content distance, iso-content error and iso-content distortion), so that the selected key frames are equidistant in video content according to the principle used. Two automatic approaches for defining the most appropriate number of key frames are proposed by exploiting supervised and unsupervised content criteria. Experimental results and comparisons with existing methods from the literature on a large dataset of real-life video sequences illustrate the high performance of the proposed schemata.


Equivalent Key Frames Selection Based on
Iso-Content Distance and Iso-Distortion Principles
Costas Panagiotakis, Anastasios Doulamis and Georgios Tziritas
Multimedia Informatics Laboratory of Computer Science Department, University of Crete
Heraklion, P.O. Box 2208, 71409, Greece
phone: + (30) 2810 393517, fax: + (30) 2810 393501
e-mail: {cpanag, adoulam, tziritas}@csd.uoc.gr
Abstract: We present a key frame selection algorithm based on the Iso-Content Distance and Iso-Distortion principles, which adapts readily to any change of content descriptors. In both cases, the equality principle makes the selected key frames equivalent with respect to video content summarization. The estimated key frame properties and the experimental results indicate the good performance of the proposed schemata.
I. INTRODUCTION
Most key frame selection techniques assume that the video has been segmented into shots and then extract a small number of representative key frames within each shot. A shot can be defined as a sequence of frames that are (or appear to be) continuously captured from the same camera. Key frames can be defined as a subset of a video sequence that represents the visual content of the video as closely as possible with a limited number of frames [1].
Before a video summarization algorithm is applied, appropriate visual features are extracted from each video file [2]. Visual content descriptors such as color-texture descriptors, color-edge histograms and motion vectors have been used in key frame selection methods [3].
Key frame selection approaches can be classified into cluster-based methods, energy minimization-based methods and sequential methods. Clustering techniques [4] take all the frames of a shot together and classify them according to their content similarity. The key frames are then determined as the representative frames of the clusters. The disadvantage of these approaches is that they ignore the temporal information of a video sequence, so the selected key frames cannot be used in applications based on video similarity and indexing. Energy minimization-based methods [5] extract the key frames by solving an energy minimization problem. These methods are generally computationally expensive, since they use iterative techniques to perform the minimization. Sequential methods [6] select a new key frame when the content difference from the previous key frame exceeds a predefined threshold that is determined by the user. Three approaches for video summarization have been proposed in [7]. All of them minimize a cross-correlation criterion, so that the most uncorrelated frames, as represented in the feature domain, are extracted as the most representative ones. Since the complexity of an exhaustive search is very high, especially for long shots and a high number of key frames, a logarithmic, a stochastic logarithmic and a genetic approach have been proposed in [7] to improve the search efficiency. An extension of [7] to stereoscopic data has been proposed in [8].
All the above-mentioned approaches address the video summarization problem by focusing on restricted video content, ignoring temporal variation, minimizing metric criteria on the feature domain, or applying simple clustering-based techniques. In contrast, in this paper video summarization is performed by an innovative computational geometry algorithm, which equally partitions the content curve of a video sequence, resulting in key frames that are equivalent in the content domain under any type of video content description [9]–[11]. We propose two general principles based on this algorithm that can be used in key frame selection. Applying the equipartition algorithm directly to the content description yields key frames that are equidistant in the sense of video content; we call this the Iso-Content Distance principle. Alternatively, under the Iso-Content Distortion principle, the selected key frames minimize a global distortion criterion while providing equal distortions per key frame cluster. The main contribution of this work is to address the problem of video summarization from two different views, the proposed Iso-Distance and Iso-Distortion principles, which take into account the equivalence property of the key frames under any type of content description.
The rest of the paper is organized as follows. Section II gives the problem formulation and describes the proposed principles. Section III presents the key frame selection algorithm. The visual content descriptors are presented in Section IV. The experimental results are given in Section V. Finally, conclusions are provided in Section VI.
II. PROBLEM FORMULATION
Let us assume a video shot with a duration of $N$ frames and that, for each frame of the shot, we have extracted several descriptors and included them in a vector denoted as $p_i$, where the index $i$ corresponds to the $i$th frame of the shot. Let $P$ denote the set of all vectors $p_i$ for $i = 1, 2, \cdots, N$, that is, $P = \{p_1, p_2, \cdots, p_N\}$. The vectors $p_i$ are assumed to lie in $\mathbb{R}^n$, i.e., $n$ descriptors are extracted to represent the content of each frame.
The equipartition problem (EP) requires as input a continuous-time descriptor curve $C(t)$, where $t$ denotes the time variable, instead of the set $P$. Therefore, $C(t)$ is derived by linear interpolation of the descriptors of successive frames in the $n$-dimensional space. To simplify the mathematical formulation, and without loss of generality, we normalize the time variable so that $t \in [0, 1]$, where 0 and 1 correspond to the first and last frames, respectively. Thus, $C(t)$ starts at $A = C(0) = p_1$ and ends at $B = C(1) = p_N$. In the following sections, we use the continuous normalized time space $[0, 1]$ instead of the discrete frame time space $\{1, 2, \cdots, N\}$.
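As a minimal sketch of this construction, the following Python fragment builds $C(t)$ from an array of frame descriptors by piecewise-linear interpolation over normalized time. The function name `content_curve` and the use of NumPy are illustrative assumptions and are not part of the method described in [9]–[11].

```python
import numpy as np


def content_curve(P):
    """Return C(t), t in [0, 1], as the piecewise-linear interpolation of the rows of P.

    P is an (N, n) array: one n-dimensional descriptor vector per frame.
    The name and interface are hypothetical, chosen only for illustration.
    """
    P = np.asarray(P, dtype=float)
    N = P.shape[0]
    if N < 2:
        raise ValueError("at least two frames are needed to define a curve")

    def C(t):
        # Map normalized time to a (0-based) position on the frame axis.
        s = float(np.clip(t, 0.0, 1.0)) * (N - 1)
        k = min(int(np.floor(s)), N - 2)  # index of the left frame of the segment
        w = s - k                         # interpolation weight in [0, 1]
        return (1.0 - w) * P[k] + w * P[k + 1]

    return C
```

With three frames, `content_curve(P)(0.0)` returns the first descriptor, `content_curve(P)(1.0)` the last one, and `t = 0.5` falls exactly on the middle frame.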
The EP problem is defined under a predefined smooth semimetric function, such as the Euclidean distance. According to our analysis [9]–[11], the equipartition problem (EP) always admits a solution under any semimetric function. Therefore, $g(x, y)$ must be a semimetric function in order to guarantee at least one solution. Let $g(x, y)$, where $x, y \in [0, 1]$ denote normalized time variables, be the smooth semimetric function used between two curve points $C(x)$ and $C(y)$.
In the key frame selection problem, the key frames are selected to summarize the video content. Let $M$ be the number of selected key frames and $t_i \in [0, 1]$, $i \in \{1, \cdots, M\}$, be the selected key frames in the normalized time space. The proposed method selects the first and the last key frames to be the first and the last frames of the shot sequence, respectively ($t_1 = 0$, $t_M = 1$). Therefore, the goal of the proposed method is to compute the $M - 2$ key frames $t_i$, $i \in \{2, \cdots, M - 1\}$, under the constraint that successive key frames are equidistant in the sense of the semimetric function $g(x, y)$: $g(t_{i-1}, t_i) = g(t_i, t_{i+1})$, $i \in \{2, \cdots, M - 1\}$, with $t_1 = 0$ and $t_M = 1$. This means that the distance between each successive pair of key frames is equal. The length $r$ of each equal chord is given by the following equation:
$$r = g(0, t_2) = g(t_2, t_3) = \cdots = g(t_{M-1}, 1) \qquad (1)$$
Therefore, the set of key frames $K = \{t_1, t_2, \cdots, t_M\}$, $t_i < t_{i+1}$, $C(t_1) = p_1$, $C(t_M) = p_N$, is selected under any predefined content description, and the key frames are equivalent with respect to the video content descriptors. We propose two general principles that can be used in key frame selection: the Iso-Content Distance and the Iso-Content Distortion principles.
A. Iso-Content Distance Principle
Under the Iso-Content Distance principle, the content distances between successive key frames should be equal, so that the selected key frames are equidistant in content. Thus, under the definition of Section II, we have to compute the $M - 2$ sequential key frames $t_i$, $i \in \{2, \cdots, M - 1\}$, with $t_1 = 0$ and $t_M = 1$, under the constraint $r = d(t_{i-1}, t_i) = d(t_i, t_{i+1})$, $i \in \{2, \cdots, M - 1\}$, where $d(x, y)$, $x, y \in [0, 1]$, denotes the semimetric distance function. Several distances $d(x, y)$ can be used, such as the Euclidean, Manhattan or $\chi^2$ distance, depending on the space of the content descriptors. If there are several solutions, the one with the maximum chord length $r$ is selected, since this solution is the best approximation of the content curve.
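To make the constraint concrete, the following sketch searches the discrete frames of a short shot exhaustively and returns the key frames whose successive Euclidean chords are as equal as possible, preferring the candidate with the longest chords when several tie. It is a brute-force illustration with exponential cost, restricted to frame positions, and is not the EP algorithm of [9]–[11]; the function name and the rounding tolerance used to detect ties are assumptions.

```python
from itertools import combinations

import numpy as np


def iso_distance_key_frames(P, M):
    """Brute-force discrete illustration of the Iso-Content Distance constraint.

    P: (N, n) array of frame descriptors; M: number of key frames (2 <= M <= N).
    Returns 0-based frame indices, always including the first and last frame.
    """
    P = np.asarray(P, dtype=float)
    N = len(P)
    if not 2 <= M <= N:
        raise ValueError("M must satisfy 2 <= M <= N")

    # Pairwise Euclidean distances G[a, b] between frame descriptors.
    diff = P[:, None, :] - P[None, :, :]
    G = np.sqrt((diff ** 2).sum(axis=2))

    best_key, best = None, None
    for interior in combinations(range(1, N - 1), M - 2):
        idx = (0, *interior, N - 1)
        chords = [G[a, b] for a, b in zip(idx, idx[1:])]
        spread = max(chords) - min(chords)
        # Smallest spread first; among ties, the longest chords win.
        key = (round(spread, 9), -min(chords))
        if best_key is None or key < best_key:
            best_key, best = key, idx
    return list(best)
```

For a shot whose content barely changes in its first frames and then changes quickly, the interior key frame moves towards the fast-changing part, because the selection is driven by content distance and not by elapsed time.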
B. Iso-Content Distortion Principle
Under the Iso-Content Distortion principle, the distortions between any two pairs of successive key frames should be equal, $\bar{d}(t_i, t_{i+1}) = \bar{d}(t_j, t_{j+1})$. We consider the following definition of distortion. Let $t_i, t_{i+1}$ be two successive key frames. The distortion $\bar{d}(t_i, t_{i+1})$ is defined as the sum, over the frames $t_j$ with $t_i \le t_j \le t_{i+1}$, of the minimum content distance between $t_j$ and the two key frames $t_i, t_{i+1}$:
$$\bar{d}(t_i, t_{i+1}) = \sum_{j = t_i}^{t_{i+1}} \min\big(d(t_i, t_j), d(t_j, t_{i+1})\big). \qquad (2)$$
This definition is similar to the definition of distortion used by Lee and Kim [5],
$$\bar{d}(K, B) = \sum_{i=1}^{M} \int_{b_i}^{b_{i+1}} d(t, t_i)\, dt.$$
Breakpoints ($B = \{b_0, \cdots, b_M\}$) are not used here, since their role is covered by the combination of the two successive key frames in Eq. (2). If we define the total distortion as the maximum of the corresponding distortions, $\max_{i \in \{1, 2, \cdots, M-1\}} \bar{d}(t_i, t_{i+1})$, then almost optimal solutions are achieved using the proposed schema. If there are several solutions, the one with the minimum distortion is selected.
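The distortion of Eq. (2) and the total distortion defined above can be evaluated directly from a matrix of pairwise frame distances. The sketch below assumes Euclidean distances between descriptors and 0-based frame indices; the function names are hypothetical.

```python
import numpy as np


def distance_matrix(P):
    """Pairwise Euclidean distances between the rows of P (one row per frame)."""
    P = np.asarray(P, dtype=float)
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def segment_distortion(G, a, b):
    """Eq. (2): sum over frames j in [a, b] of min(G[a, j], G[j, b])."""
    j = np.arange(a, b + 1)
    return float(np.minimum(G[a, j], G[j, b]).sum())


def total_distortion(G, key_frames):
    """Maximum of the distortions of all successive key frame pairs."""
    return max(segment_distortion(G, a, b)
               for a, b in zip(key_frames, key_frames[1:]))
```

Each frame between two key frames contributes its distance to the nearer of the two, so a long segment accumulates distortion even when its end points are close in content. This is how duration enters the criterion.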
III. KEY FRAMES SELECTION ALGORITHM
A straightforward implementation of the EP method directly provides the $M$ key frames. For a specific $M$, the EP algorithm computes the $M$ key frames $K$ under the semimetric function $g(x, y)$. The number of key frames $M$ can be given by the user or can be estimated automatically by exploiting the variation of the feature vector trajectory in time [12].
The input of the method is the number of key frames $M$. In addition, it needs the values of the symmetric matrix $g(t_k, t_l)$, $k, l \in \{1, 2, \cdots, N\}$. The algorithm is described in [9], [11] and computes all solutions in $O(M \cdot N^2)$ steps. A brief description of the EP algorithm is given next. It is an iterative method: when it is executed for $M$ segments, it uses the precomputed results for $M - 1$ segments. In each iteration step $l$, the algorithm computes the zero level curves $L_l$ from $L_{l-1}$. The points of these curves belong to the same level of $g(x, y)$, and the key frames are inductively computed (from $L_l$ to $L_{l-1}$) on these curves (see Figs. 4(a) and 4(g)). According to our analysis [9]–[11], the equipartition problem (EP) always admits a solution. The EP problem can have more than one solution, depending on the curve shape, the distance metric and the value of $M$.
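The level-curve construction itself is given in [9], [11] and is not reproduced here. As a loosely related illustration of how results for fewer segments can be reused, the sketch below swaps in a plain dynamic program that places $M$ key frames on the discrete frames so that the largest segment cost is minimized. This is a different technique from the EP algorithm; the function name and the `cost(a, b)` callback are assumptions, and the callback may be a chord length or the distortion of Eq. (2).

```python
import numpy as np


def minimax_key_frames(cost, N, M):
    """Place M key frames on frames 0..N-1, minimizing the largest segment cost.

    cost(a, b) returns the cost of the segment between key frames a < b.
    The loops make O(M * N^2) calls to cost; the table for m segments is
    built from the table for m - 1 segments.
    """
    if not 2 <= M <= N:
        raise ValueError("M must satisfy 2 <= M <= N")
    # best[m, j]: smallest worst-segment cost using m segments that end at frame j.
    best = np.full((M, N), np.inf)
    prev = np.zeros((M, N), dtype=int)
    best[0, 0] = 0.0  # zero segments: only the first frame is selected
    for m in range(1, M):
        for j in range(m, N):
            for i in range(m - 1, j):
                if np.isinf(best[m - 1, i]):
                    continue
                c = max(best[m - 1, i], cost(i, j))
                if c < best[m, j]:
                    best[m, j] = c
                    prev[m, j] = i
    # Backtrack from the last frame to recover the key frame indices.
    keys = [N - 1]
    for m in range(M - 1, 0, -1):
        keys.append(int(prev[m, keys[-1]]))
    return keys[::-1], float(best[M - 1, N - 1])
```

Unlike the EP algorithm, this sketch returns a single solution and does not enforce exactly equal segment costs; it only bounds the worst one.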
IV. VISUAL CONTENT DESCRIPTION
The proposed method can be executed under any choice or combination of audio/visual content descriptors. However, the selected key frames depend on the content description used, so appropriate descriptors have to be chosen. Within this framework, we propose to use MPEG-7 visual descriptors [3] such as the Color Layout Descriptor (CLD), a low-cost and compact descriptor that suffices to describe smoothly the changes in the visual content of a shot. We use the following function $D$ to measure the content distance between two CLDs, $\{DY, DCb, DCr\}$ and $\{DY', DCb', DCr'\}$:
$$D = \sqrt{\sum_i (DY_i - DY'_i)^2} + \sqrt{\sum_i (DCb_i - DCb'_i)^2} + \sqrt{\sum_i (DCr_i - DCr'_i)^2},$$
where $(DY_i, DCb_i, DCr_i)$ represent the $i$th DCT coefficients of the respective color components. The function $D$ is a semimetric distance.
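The function $D$ can be written in a few lines. The sketch below assumes that each CLD is given as a triple of coefficient arrays (DY, DCb, DCr) and applies no per-coefficient weights, following the formula as stated here; the function name is hypothetical.

```python
import numpy as np


def cld_distance(cld_a, cld_b):
    """Sum of the Euclidean distances between the Y, Cb and Cr coefficient vectors.

    cld_a, cld_b: triples (DY, DCb, DCr) of DCT coefficient sequences.
    """
    return float(sum(
        np.sqrt(np.sum((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2))
        for x, y in zip(cld_a, cld_b)
    ))
```

The function is symmetric and returns zero for identical descriptors, which are the two semimetric properties that can be checked directly.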

Fig. 1. Frames (a) #0, (b) #97, (c) #132 and (d) #299 of the coast shot. The proposed key frames are {0, 97, 299} and {0, 132, 299} under the Iso-Content Distance and Iso-Distortion principles, respectively.
V. EXPERIMENTAL RESULTS
In this section, the experimental results of the proposed
algorithm and comparisons to other algorithms are presented.
The method has been implemented using C and Matlab.
A. Evaluation of the Proposed Schemata
We have tested the proposed algorithm on a data set containing more than 250 video sequences. Most of them are athletics videos such as pole vault, high jump, triple jump, long jump, running and hurdling. Moreover, we have used widely known MPEG test sequences such as the coast sequence, the table tennis sequence and the hall monitor sequence. Figs. 5 and 6 show the sequences used in this article.
A typical processing time for the execution of the proposed EP algorithm, when the shot contains 300 images (e.g. the coast MPEG sequence) and $M = 10$, is between 4 and 5 seconds, depending on the principle used. Figs. 4(a), 4(g) and Figs. 2(a), 2(f) show the surfaces $g(x, y)$ for the pole vault and coast sequences, respectively, under the proposed principles. Deep blue colors correspond to values close to zero. This is the reason for the deep blue diagonals, since $g(x, x) = 0$. Deep red colors correspond to the highest values of $g(x, y)$. The estimated solution is projected on $g(x, y)$ with circles. The $L_l$ curves are projected on $g(x, y)$ in gray, on both sides of the diagonal $x = y$. If more than one solution appears, the points of the selected solution are drawn with large circles.
Figs. 4 and 2 illustrate the results of the two proposed schemata for the pole vault and coast sequences, respectively. The number of key frames has been automatically estimated using the criterion of [12]. We have observed that, under the Iso-Content Distortion principle, the frames that better represent their clusters are selected. Moreover, the Iso-Content Distance principle does not take the duration between the selected key frames into account, while the Iso-Content Distortion principle combines the duration with the content variation (Fig. 1). We have also tested the two proposed schemata for a low number of key frames (see Fig. 1), which indicates the robustness of the method.
B. Comparison to Other Algorithms
The proposed scheme has been compared with two approaches presented in the literature. The first exploits the minimization of a cross-correlation criterion [7], so that the most uncorrelated frames are extracted as key frames. A logarithmic search approach is adopted, as in [7], to estimate the key frames. The second technique formulates the summarization problem as an interpolation problem. Fig. 3 illustrates the performance of both methods along with the proposed one for the pole vault sequence. It is observed that the proposed approach represents the content of the sequence more efficiently than the compared works. In all cases, the same number of key frames has been extracted, obtained using the criterion of [12].
Fig. 2. Results of the two proposed schemata in the coast shot using four key frames. The estimated solution and the $L_l$ curves are projected on $g(x, y)$ under (a) the Iso-Content Distance and (f) the Iso-Content Distortion principle. (b)–(e) The selected key frames under the Iso-Content Distance principle: #0, #70, #168, #299. (g)–(j) The selected key frames under the Iso-Content Distortion principle: #0, #73, #173, #299.
VI. CONCLUSIONS
In this paper, two key frame selection schemata based on the equipartition problem are described. The first one uses the Iso-Content Distance principle, under which the key frames are equidistant in video content. Under the Iso-Content Distortion principle, the frame clusters derived from the key frames are equal-sized in the sense of distortion. Thus, the selected key frames have different properties according to the principle used. In either case, however, the key frames are equivalent with respect to video content summarization: each key frame has the same significance under the principle used. In this work, we have used the Color Layout Descriptor of the MPEG-7 visual descriptors.
An extension of the proposed methodology may include the automatic computation of the number of key frames and the use of more audio/visual descriptors and principles.
ACKNOWLEDGMENT
This research was partially supported by the Greek PENED
2003 project.
REFERENCES
[1] M. Yeung and B.-L. Yeo, “Video visualization for compact presentation and fast browsing of pictorial content,” IEEE Trans. Circuits Syst. Video Techn., vol. 7, no. 5, pp. 771–785, 1997.
[2] Y.-P. Tan, S. R. Kulkarni, and P. J. Ramadge, “A framework for measuring video similarity and its application to video query by example,” 1999.
[3] B. Manjunath, J. Ohm, V. Vasudevan, and A. Yamada, “Color and texture descriptors,” IEEE Trans. on Circuits and Systems for Video Tech., vol. 11, no. 6, pp. 703–715, 2001.
[4] A. Girgensohn and J. S. Boreczky, “Time-constrained keyframe selection technique,” Multimedia Tools and Applications, vol. 11, no. 3, pp. 347–358, 2000.
[5] H.-C. Lee and S.-D. Kim, “Iterative key frame selection in the rate-constraint environment,” Signal Processing: Image Communication, vol. 18, pp. 1–15, 2003.
[6] J. Vermaak, P. Perez, and M. Gangnet, “Rapid summarization and browsing of video sequences,” in British Machine Vision Conf., 2002.

Fig. 4. Results of the two proposed schemata in the pole vault shot using five key frames. The estimated solution and the $L_l$ curves are projected on $g(x, y)$ under (a) the Iso-Content Distance and (g) the Iso-Content Distortion principle. (b)–(f) The selected key frames under the Iso-Content Distance principle: #1, #65, #104, #109, #123. (h)–(l) The selected key frames under the Iso-Content Distortion principle: #1, #53, #69, #101, #123.
Fig. 5. The pole vault sequence, which contains 123 frames. Panels (a)–(m) show frames #0, #10, #20, #30, #40, #50, #60, #70, #80, #90, #100, #110 and #123.
Fig. 6. The coast sequence, which contains 300 frames. Panels (a)–(u) show frames #0, #15, #30, #45, #60, #75, #90, #105, #120, #135, #150, #165, #180, #195, #210, #225, #240, #255, #270, #285 and #299.
[7] N. Doulamis, A. Doulamis, Y. Avrithis, K. Ntalianis, and S. Kollias, “A stochastic framework for optimal key frame extraction from MPEG video databases,” Journal of Computer Vision and Image Understanding, vol. 75, no. 4, pp. 3–24, 1999.
[8] N. Doulamis, A. Doulamis, Y. Avrithis, K. Ntalianis, and S. Kollias, “Efficient summarization of stereoscopic video sequences,” IEEE Trans. Circuits Syst. Video Techn., vol. 10, no. 4, pp. 501–517, 2000.
[9] C. Panagiotakis, G. Georgakopoulos, and G. Tziritas, “The curve equipartition problem,” submitted to Computational Geometry, 2005.
[Online]. Available: http://www.csd.uoc.gr/
cpanag/papers/EP.pdf
[10] C. Panagiotakis, G. Georgakopoulos, and G. Tziritas, “On the curve equipartition problem: a brief exposition of basic issues,” in European Workshop on Computational Geometry, 2006.
[11] C. Panagiotakis and G. Tziritas, “Any dimension polygonal approximation based on equal errors principle,” Pattern Recogn. Lett., vol. 28, no. 5, pp. 582–591, 2007.
[12] A. D. Doulamis, N. Doulamis, and S. Kollias, “Non-sequential video content representation using temporal variation of feature vectors,” IEEE Trans. on Consumer Electronics, vol. 46, pp. 758–768, 2000.
Fig. 3. Results of the interpolation method and the logarithmic method described in [7] in the pole vault shot using five key frames. (a)–(e) The selected key frames under the interpolation method: #1, #21, #69, #82, #123. (f)–(j) The selected key frames under the logarithmic method: #34, #49, #67, #72, #90.