Space-time Interest Points
Ivan Laptev and Tony Lindeberg
Computational Vision and Active Perception Laboratory (CVAP)
Dept. of Numerical Analysis and Computer Science
KTH, SE-100 44 Stockholm, Sweden
{laptev, tony}@nada.kth.se
Abstract
Local image features or interest points provide compact and abstract representations of patterns in an image. In this paper, we propose to extend the notion of spatial interest points into the spatio-temporal domain and show how the resulting features often reflect interesting events that can be used for a compact representation of video data as well as for its interpretation.

To detect spatio-temporal events, we build on the idea of the Harris and Förstner interest point operators and detect local structures in space-time where the image values have significant local variations in both space and time. We then estimate the spatio-temporal extents of the detected events and compute their scale-invariant spatio-temporal descriptors. Using such descriptors, we classify events and construct video representation in terms of labeled space-time points. For the problem of human motion analysis, we illustrate how the proposed method allows for detection of walking people in scenes with occlusions and dynamic backgrounds.
1. Introduction
Analyzing and interpreting video is a growing topic in computer vision and its applications. Video data contains information about changes in the environment and is highly important for many visual tasks including navigation, surveillance and video indexing.

Traditional approaches for motion analysis mainly involve the computation of optic flow [1] or feature tracking [28, 4]. Although very effective for many tasks, both of these techniques have limitations. Optic flow approaches mostly capture first-order motion and often fail when the motion has sudden changes. Feature trackers often assume a constant appearance of image patches over time and may hence fail when this appearance changes, for example, in situations when two objects in the image merge or split.
The support from the Swedish Research Council and from the Royal
Swedish Academy of Sciences as well as the Knut and Alice Wallenberg
Foundation is gratefully acknowledged.
Figure 1: Result of detecting the strongest spatio-temporal interest point in a football sequence with a player heading the ball. The detected event corresponds to the high spatio-temporal variation of the image data, or a “space-time corner”, as illustrated by the spatio-temporal slice on the right.
Image structures in video are not restricted to constant velocity and/or constant appearance over time. On the contrary, many interesting events in video are characterized by strong variations of the data in both the spatial and the temporal dimensions. As examples, consider scenes with a person entering a room, applauding hand gestures, a car crash or a water splash; see also the illustration in figure 1.

More generally, points with non-constant motion correspond to accelerating local image structures that might correspond to accelerating objects in the world. Hence, such points might contain important information about the forces that act in the environment and change its structure.

In the spatial domain, points with a significant local variation of image intensities have been extensively investigated in the past [9, 11, 26]. Such image points are frequently denoted as “interest points” and are attractive due to their high information contents. Highly successful applications of interest point detectors have been presented for image indexing [25], stereo matching [30, 23, 29], optic flow estimation and tracking [28], and recognition [20, 10].

In this paper we detect interest points in the spatio-temporal domain and illustrate how the resulting space-time features often correspond to interesting events in video data. To detect spatio-temporal interest points, we build on the idea of the Harris and Förstner interest point operators [11, 9] and describe the detection method in section 2. To capture events with different spatio-temporal extents [32],
Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV’03)
0-7695-1950-4/03 $ 17.00 © 2003 IEEE

we compute interest points in spatio-temporal scale-space and select scales that roughly correspond to the size of the detected events in space and to their durations in time.

In section 3 we show how interesting events in video can be learned and classified using k-means clustering and point descriptors defined by local spatio-temporal image derivatives. In section 4 we consider video representation in terms of classified spatio-temporal interest points and demonstrate how this representation can be efficient for the task of video registration. In particular, we present an approach for detecting walking people in complex scenes with occlusions and dynamic background. Finally, section 5 concludes the paper with the discussion of the method.
2. Interest point detection
2.1. Interest points in spatial domain
The idea of the Harris interest point detector is to detect locations in a spatial image $f^{sp}$ where the image values have significant variations in both directions. For a given scale of observation $\sigma_l^2$, such interest points can be found from a windowed second moment matrix integrated at scale $\sigma_i^2 = s\,\sigma_l^2$

$$\mu^{sp} = g^{sp}(\cdot;\sigma_i^2) * \begin{pmatrix} (L^{sp}_x)^2 & L^{sp}_x L^{sp}_y \\ L^{sp}_x L^{sp}_y & (L^{sp}_y)^2 \end{pmatrix}, \qquad (1)$$

where $L^{sp}_x$ and $L^{sp}_y$ are Gaussian derivatives defined as

$$L^{sp}_x(\cdot;\sigma_l^2) = \partial_x\,(g^{sp}(\cdot;\sigma_l^2) * f^{sp}), \qquad L^{sp}_y(\cdot;\sigma_l^2) = \partial_y\,(g^{sp}(\cdot;\sigma_l^2) * f^{sp}), \qquad (2)$$

and where $g^{sp}$ is the spatial Gaussian kernel

$$g^{sp}(x,y;\sigma^2) = \frac{1}{2\pi\sigma^2}\exp\!\left(-(x^2+y^2)/2\sigma^2\right). \qquad (3)$$
As the eigenvalues $\lambda_1, \lambda_2$ ($\lambda_1 \le \lambda_2$) of $\mu^{sp}$ represent characteristic variations of $f^{sp}$ in both image directions, two significant values of $\lambda_1, \lambda_2$ indicate the presence of an interest point. To detect such points, Harris and Stephens [11] propose to detect positive maxima of the corner function

$$H^{sp} = \det(\mu^{sp}) - k\,\mathrm{trace}^2(\mu^{sp}) = \lambda_1\lambda_2 - k(\lambda_1 + \lambda_2)^2. \qquad (4)$$
2.2. Interest points in the space-time
The idea of interest points in the spatial domain can be extended into the spatio-temporal domain by requiring the image values in space-time to have large variations in both the spatial and the temporal dimensions. Points with such properties will be spatial interest points with a distinct location in time corresponding to moments with non-constant motion of the image in a local spatio-temporal neighborhood [15].
To model a spatio-temporal image sequence, we use a function $f : \mathbb{R}^2 \times \mathbb{R} \to \mathbb{R}$ and construct its linear scale-space representation $L : \mathbb{R}^2 \times \mathbb{R} \times \mathbb{R}_+^2 \to \mathbb{R}$ by convolution of $f$ with an anisotropic Gaussian kernel¹ with distinct spatial variance $\sigma_l^2$ and temporal variance $\tau_l^2$

$$L(\cdot;\sigma_l^2,\tau_l^2) = g(\cdot;\sigma_l^2,\tau_l^2) * f(\cdot), \qquad (5)$$

where the spatio-temporal separable Gaussian kernel is defined as

$$g(x,y,t;\sigma_l^2,\tau_l^2) = \frac{\exp\!\left(-(x^2+y^2)/2\sigma_l^2 - t^2/2\tau_l^2\right)}{\sqrt{(2\pi)^3\,\sigma_l^4\,\tau_l^2}}. \qquad (6)$$
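Since the kernel in (6) is separable, it equals a product of three 1-D Gaussians, which is what makes the smoothing in (5) implementable as three 1-D convolution passes. A quick numerical check of this factorization (the sample arguments below are arbitrary choices of ours):

```python
import numpy as np

def g_st(x, y, t, sigma2, tau2):
    """Spatio-temporal Gaussian kernel of eq. (6)."""
    return (np.exp(-(x ** 2 + y ** 2) / (2 * sigma2) - t ** 2 / (2 * tau2))
            / np.sqrt((2 * np.pi) ** 3 * sigma2 ** 2 * tau2))

def g1(u, s2):
    """1-D Gaussian with variance s2."""
    return np.exp(-u ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

# Separability: g(x, y, t) = g1(x) * g1(y) * g1(t) for any arguments.
x, y, t, s2, t2 = 0.5, -1.0, 2.0, 2.0, 3.0
lhs = g_st(x, y, t, s2, t2)
rhs = g1(x, s2) * g1(y, s2) * g1(t, t2)
```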
Similar to the spatial domain, we consider the spatio-temporal second-moment matrix, which is a 3-by-3 matrix composed of first order spatial and temporal derivatives averaged with a Gaussian weighting function $g(\cdot;\sigma_i^2,\tau_i^2)$

$$\mu = g(\cdot;\sigma_i^2,\tau_i^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix}, \qquad (7)$$
where the integration scales are $\sigma_i^2 = s\,\sigma_l^2$ and $\tau_i^2 = s\,\tau_l^2$, while the first-order derivatives are defined as $L_\xi(\cdot;\sigma_l^2,\tau_l^2) = \partial_\xi (g * f)$. The second-moment matrix $\mu$ has been used previously by Nagel and Gehrke [24] in the context of optic flow computation.
To detect interest points, we search for regions in $f$ having significant eigenvalues $\lambda_1, \lambda_2, \lambda_3$ of $\mu$. Among different approaches to find such regions, we choose to extend the Harris corner function (4) defined for the spatial domain into the spatio-temporal domain by combining the determinant and the trace of $\mu$ in the following way

$$H = \det(\mu) - k\,\mathrm{trace}^3(\mu) = \lambda_1\lambda_2\lambda_3 - k(\lambda_1 + \lambda_2 + \lambda_3)^3. \qquad (8)$$
To show that the positive local maxima of $H$ correspond to points with high values of $\lambda_1, \lambda_2, \lambda_3$ ($\lambda_1 \le \lambda_2 \le \lambda_3$), we define the ratios $\alpha = \lambda_2/\lambda_1$ and $\beta = \lambda_3/\lambda_1$ and rewrite $H = \lambda_1^3\,(\alpha\beta - k(1+\alpha+\beta)^3)$. Then, from the requirement $H \ge 0$, we get $k \le \alpha\beta/(1+\alpha+\beta)^3$ and it follows that as $k$ increases towards its maximal value $k = 1/27$, both ratios $\alpha$ and $\beta$ tend to one. For sufficiently large values of $k$, positive local maxima of $H$ correspond to points with high variation of the image gray-values in both the spatial and the temporal dimensions. Thus, spatio-temporal interest points of $f$ can be found by detecting local positive spatio-temporal maxima in $H$.
¹ In general, convolution with a Gaussian kernel in the temporal domain violates causality constraints, since the temporal image data is available only for the past. Whereas for real-time implementations this problem can be solved using causal recursive filters [12, 19], in this paper we simplify the investigation and assume that the data is available for a sufficiently long period of time and that the image sequence can hence be convolved with a truncated Gaussian in both space and time. However, the proposed interest points can be computed using recursive filters in on-line mode.
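The detection pipeline of eqs. (5)-(8) can likewise be sketched numerically. The following is an illustrative sketch under our own choices of synthetic input and parameter values ($\sigma_l$, $\tau_l$, $s$, $k$), not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d(f, sigma_l=1.5, tau_l=1.5, s=2.0, k=0.005):
    """Spatio-temporal corner function H = det(mu) - k*trace^3(mu), eqs. (5)-(8).
    f is a video volume indexed as f[t, y, x]."""
    sig = (tau_l, sigma_l, sigma_l)              # separable smoothing, eqs. (5)-(6)
    L = gaussian_filter(f, sig)
    Lt, Ly, Lx = np.gradient(L)                  # first-order derivatives
    g = lambda a: gaussian_filter(a, tuple(np.sqrt(s) * v for v in sig))
    # Entries of the 3x3 second-moment matrix of eq. (7)
    xx, yy, tt = g(Lx * Lx), g(Ly * Ly), g(Lt * Lt)
    xy, xt, yt = g(Lx * Ly), g(Lx * Lt), g(Ly * Lt)
    det = (xx * (yy * tt - yt ** 2)
           - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    return det - k * (xx + yy + tt) ** 3         # eq. (8)

# Synthetic event: a square that is static, translates for a while, then stops.
f = np.zeros((32, 32, 32))
for ti in range(32):
    x0 = int(np.clip(ti - 10, 0, 12))            # motion between frames 10 and 22
    f[ti, 8:24, x0:x0 + 8] = 1.0
H = harris3d(f)
t, y, x = np.unravel_index(np.argmax(H), H.shape)   # strongest space-time point
```

The strongest response should occur around the moments when the motion starts or stops (here near frames 10 and 22), i.e. at a "space-time corner": constant-velocity edges give linearly dependent derivatives, hence a rank-deficient $\mu$ and a small determinant.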

2.3. Experiments on synthetic sequences
To illustrate the detection of spatio-temporal interest points on synthetic image sequences, we show the spatio-temporal data as 3-D space-time plots where the original signal is represented by a threshold surface while the detected interest points are presented by ellipsoids with semi-axes proportional to the corresponding scale parameters $\sigma_l$ and $\tau_l$.
Figure 2: Results of detecting spatio-temporal interest points on synthetic image sequences: (a) Moving corner; (b) A merge of a ball and a wall; (c) Collision of two balls with interest points detected at scales $\sigma_l^2 = 8$ and $\tau_l^2 = 8$; (d) the same as in (c) but with interest points detected at scales $\sigma_l^2 = 16$ and $\tau_l^2 = 16$.
Figure 2a illustrates a sequence with a moving corner. The interest point is detected at the moment in time when the motion of the corner changes direction. This type of event occurs frequently in natural sequences, such as sequences of articulated motion. Other typical types of events detected by the proposed method are splits and mergers of image structures. In figure 2b, the interest point is detected at the moment and the position corresponding to the collision of a ball and a wall. Similarly, interest points are detected at the moment of collision and bouncing of two balls as shown in figure 2c-d. Note that different types of events are detected depending on the scale of observation.

In general, the result of the interest point detector will depend on the scale parameters. Hence, the correct estimation of the spatio-temporal extents of events is highly important for their detection and further interpretation.
2.4. Scale selection in space-time
To estimate the spatio-temporal extent of an event in space-time, we follow the idea of local scale selection proposed in the spatial domain by Lindeberg [18] as well as in the temporal domain [17]. As a prototype event we study a spatio-temporal Gaussian blob $f = g(x,y,t;\sigma_0^2,\tau_0^2)$ with spatial variance $\sigma_0^2$ and temporal variance $\tau_0^2$. Using the semi-group property of the Gaussian kernel, it follows that the scale-space representation of $f$ is $L(x,y,t;\sigma^2,\tau^2) = g(x,y,t;\sigma_0^2+\sigma^2,\tau_0^2+\tau^2)$.
To recover the spatio-temporal extent $(\sigma_0, \tau_0)$ of $f$ we consider the scale-normalized spatio-temporal Laplacian operator defined by

$$\nabla^2_{\mathrm{norm}} L = L_{xx,\mathrm{norm}} + L_{yy,\mathrm{norm}} + L_{tt,\mathrm{norm}}, \qquad (9)$$

where $L_{xx,\mathrm{norm}} = \sigma^{2a}\tau^{2b}L_{xx}$ and $L_{tt,\mathrm{norm}} = \sigma^{2c}\tau^{2d}L_{tt}$. As shown in [15], given the appropriate normalization parameters $a = 1$, $b = 1/4$, $c = 1/2$ and $d = 3/4$, the size of the blob $f$ can be estimated from the scale values $\tilde{\sigma}^2$ and $\tilde{\tau}^2$ for which $\nabla^2_{\mathrm{norm}} L$ assumes local extrema over scales, space and time. Hence, the spatio-temporal extent of the blob can be estimated by detecting local extrema of

$$\nabla^2_{\mathrm{norm}} L = \sigma^2\tau^{1/2}(L_{xx} + L_{yy}) + \sigma\tau^{3/2}L_{tt} \qquad (10)$$

over both spatial and temporal scales.
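The scale-selection property of (10) can be checked numerically on the prototype Gaussian blob: evaluating the scale-normalized Laplacian at the blob centre over a set of candidate scales, its magnitude should peak where the filter variances match the blob variances. The sketch below uses our own grid size and scale set; note that $\sigma$ and $\tau$ in (10) denote standard deviations, so in terms of the variances $\sigma^2, \tau^2$ the normalization powers are $1$, $1/4$, $1/2$ and $3/4$:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lap_norm_center(f, sigma2, tau2):
    """Scale-normalized spatio-temporal Laplacian (eq. 10) at the volume centre.
    f is indexed as f[t, y, x]; sigma2 and tau2 are the filter variances."""
    L = gaussian_filter(f, (np.sqrt(tau2), np.sqrt(sigma2), np.sqrt(sigma2)))
    Ltt, Lyy, Lxx = [np.gradient(np.gradient(L, axis=a), axis=a) for a in (0, 1, 2)]
    c = tuple(n // 2 for n in f.shape)
    # sigma^2 tau^(1/2) (Lxx + Lyy) + sigma tau^(3/2) Ltt, with sigma = sqrt(sigma2)
    return (sigma2 * tau2 ** 0.25 * (Lxx[c] + Lyy[c])
            + sigma2 ** 0.5 * tau2 ** 0.75 * Ltt[c])

# Prototype event: a Gaussian blob with sigma0^2 = tau0^2 = 4.
n = 33
t, y, x = np.meshgrid(*(np.arange(n) - n // 2,) * 3, indexing="ij")
f = np.exp(-(x ** 2 + y ** 2) / (2 * 4.0) - t ** 2 / (2 * 4.0))

scales = [1.0, 2.0, 4.0, 8.0, 16.0]
resp = [abs(lap_norm_center(f, s2, s2)) for s2 in scales]
best = scales[int(np.argmax(resp))]   # should recover sigma0^2 = 4
```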
2.5. Scale-adapted space-time interest points
Local scale estimation using the normalized Laplace operator has been shown to be very useful in the spatial domain [18, 6]. In particular, Mikolajczyk and Schmid [22] combined the Harris interest point operator with the normalized Laplace operator and derived the scale-invariant Harris-Laplace interest point detector. The idea is to find points in scale-space that are both maxima of the Harris function $H^{sp}$ (4) in space and extrema of the scale-normalized spatial Laplace operator over scale.

Here, we extend this idea and detect interest points that are simultaneous maxima of the spatio-temporal corner function $H$ (8) as well as extrema of the normalized spatio-temporal Laplace operator $\nabla^2_{\mathrm{norm}} L$ (9). Hence, we detect interest points for a set of sparsely distributed scale values and then track these points in spatio-temporal scale-space towards the extrema of $\nabla^2_{\mathrm{norm}} L$. We do this by iteratively updating the scale and the position of the interest points by (i) selecting the neighboring spatio-temporal scale that maximizes $(\nabla^2_{\mathrm{norm}} L)^2$ and (ii) re-detecting the space-time location of the interest point at the new scale until the position and the scale converge to stable values [15].
To illustrate the performance of the scale-adapted spatio-temporal interest point detector, let us consider a sequence with a walking person and non-constant image velocities due to the oscillating motion of the legs. As can be seen in figure 3, the pattern gives rise to stable interest points. Note that the detected points are well-localized both in space and time and correspond to events such as the stopping and starting feet. From the space-time plot in figure 3(a), we can also

Figure 3: Results of detecting spatio-temporal interest points for the motion of the legs of a walking person: (a) 3-D plot with a threshold surface of a leg pattern (upside down) and detected interest points; (b) interest points overlaid on single frames in the sequence.

observe how the selected spatial and temporal scales of the detected features roughly match the spatio-temporal extents of the corresponding image structures.
Figure 4: Result of interest point detection for a sequence with waving hand gestures: (a) Interest points for hand gestures with high frequency; (b) Interest points for hand gestures with low frequency.
The second example explicitly illustrates how the proposed method is able to estimate the temporal extent of detected events. Figure 4 shows a person making hand-waving gestures with high frequency on the left and low frequency on the right. The distinct interest points are detected at moments and at spatial positions where the hand changes its direction of motion. Whereas the spatial scales of the detected interest points remain roughly constant, the selected temporal scales depend on the frequency of the wave pattern.
3. Classification of events
The detected interest points have significant variations of image values in their local spatio-temporal neighborhoods. To differentiate events from each other and from noise, one approach is to compare their local neighborhoods and assign points with similar neighborhoods to the same class of events. A similar approach has proven successful in the spatial domain for the tasks of image representation [21], indexing [25] and recognition [10, 31, 16]. In the spatio-temporal domain, local descriptors have been used previously by [7] and others.
To describe a spatio-temporal neighborhood we consider normalized spatio-temporal Gaussian derivatives defined as

$$L_{x^m y^n t^k} = \sigma^{m+n}\tau^{k}\,(\partial_{x^m y^n t^k}\, g) * f, \qquad (11)$$

computed at the scales used for detecting the corresponding interest points. The normalization with respect to the scale parameters guarantees the invariance of the derivative responses with respect to image scalings in both the spatial domain and the temporal domain. Using derivatives, we define event descriptors from the third order local jet² [13]

$$j = (L_x, L_y, L_t, L_{xx}, \ldots, L_{ttt}). \qquad (12)$$

To compare two events we compute the Mahalanobis distance between their descriptors as

$$d^2(j_1, j_2) = (j_1 - j_2)\,\Sigma^{-1}\,(j_1 - j_2)^T, \qquad (13)$$

where $\Sigma$ is a covariance matrix corresponding to the typical distribution of interest points in the data.
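The distance (13) is straightforward to compute. Below is a minimal sketch with a toy three-component descriptor and a hand-picked diagonal covariance; the actual jet descriptor (12) is higher-dimensional and $\Sigma$ is estimated from the data:

```python
import numpy as np

def mahalanobis2(j1, j2, Sigma):
    """Squared Mahalanobis distance between two descriptors, eq. (13)."""
    d = np.asarray(j1, float) - np.asarray(j2, float)
    return float(d @ np.linalg.inv(Sigma) @ d)

# Toy example: with a diagonal covariance the distance reduces to a
# per-component variance-weighted squared difference.
Sigma = np.diag([1.0, 4.0, 0.25])
d2 = mahalanobis2([1.0, 2.0, 0.5], [0.0, 0.0, 0.0], Sigma)  # 1/1 + 4/4 + 0.25/0.25
```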
To detect similar events in the data, we apply k-means clustering [8] in the space of point descriptors and detect groups of points with similar spatio-temporal neighborhoods. The clustering of spatio-temporal neighborhoods is similar to the idea of textons [21] used to describe image texture as well as to detect object parts for spatial recognition [31]. Given training sequences with periodic motion, we can expect repeating events to give rise to populated clusters. On the contrary, sporadic interest points can be expected to be sparsely distributed over the descriptor space, giving rise to weakly populated clusters. To test this idea we applied k-means clustering with $k = 15$ to the sequence of a walking person in the upper row of figure 5. We found that four of the most densely populated clusters c1, ..., c4 indeed corresponded to the stable interest points of the gait pattern. Local spatio-temporal neighborhoods of these points are shown in figure 6, where we can confirm the similarity of patterns inside the clusters and their difference between clusters.
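The clustering step can be sketched with a plain Lloyd-style k-means on toy descriptors, where two dense point groups stand in for repeating events and a few uniform points for sporadic detections (all data below is synthetic and illustrative):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=None):
    """Plain Lloyd's k-means; returns (cluster means, point labels)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each point to its nearest mean, then recompute the means
        labels = ((X[:, None, :] - means[None]) ** 2).sum(-1).argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):
                means[i] = X[labels == i].mean(axis=0)
    return means, labels

rng = np.random.default_rng(0)
repeating_a = rng.normal([0.0, 0.0], 0.1, (40, 2))   # one repeating event type
repeating_b = rng.normal([5.0, 5.0], 0.1, (40, 2))   # another repeating event type
sporadic = rng.uniform(-10.0, 15.0, (6, 2))          # sporadic interest points
X = np.vstack([repeating_a, repeating_b, sporadic])
means, labels = kmeans(X, k=4, seed=1)
counts = np.bincount(labels, minlength=4)            # dense clusters dominate
```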
² Note that our representation is currently not invariant with respect to planar image rotations. Such invariance could be added if considering steerable derivatives or rotationally invariant operators in space.

[Figure 5 panels: first row "K-means clustering of interest points", second row "Classification of interest points"; cluster labels c1-c4 overlaid on the frames.]
Figure 5: Interest points detected for sequences of walking persons. First row: result of clustering spatio-temporal interest
points. Labeled points correspond to the four most populated clusters; Second row: result of classification of interest points
with respect to the clusters found in the first sequence.
[Figure 6 panels c1-c4: 3-D (X, Y, Time) plots of local neighborhoods.]
Figure 6: Local spatio-temporal neighborhoods of interest points corresponding to the first four most populated clusters.
To represent characteristic repetitive events in video, we compute cluster means $m_i = \frac{1}{n_i}\sum_{k=1}^{n_i} j_k$ for each significant cluster $c_i$ consisting of $n_i$ points. Then, in order to classify an event in an unseen sequence, we assign the detected point to the cluster $c_i$ if it minimizes the distance $d(m_i, j_0)$ (13) between the jet of the interest point $j_0$ and the cluster mean $m_i$. If the distance is above a threshold, the point is classified as background. Application of this classification scheme is demonstrated in the second row of figure 5. As can be seen, most of the points corresponding to the gait pattern are correctly classified while the other interest points are discarded. Observe that the person in the second sequence of figure 5 undergoes significant size changes in the image. Due to the scale-invariance of the interest points as well as of the jet responses, size transformations affect neither the result of event detection nor the result of classification.
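The classification rule above amounts to a nearest-cluster-mean assignment under the distance (13), with a rejection threshold for background. A minimal sketch with toy two-dimensional descriptors and hand-picked means (illustrative values only, not learned from video):

```python
import numpy as np

def classify(j0, means, Sigma, thresh):
    """Assign descriptor j0 to the nearest cluster mean under the Mahalanobis
    distance (13); return -1 (background) if that distance exceeds thresh."""
    Sinv = np.linalg.inv(Sigma)
    d2 = [float((j0 - m) @ Sinv @ (j0 - m)) for m in means]
    i = int(np.argmin(d2))
    return i if d2[i] <= thresh ** 2 else -1

means = np.array([[0.0, 0.0], [4.0, 4.0]])   # toy cluster means m_i
Sigma = np.eye(2)
label = classify(np.array([10.0, -7.0]), means, Sigma, thresh=1.0)  # far from all
```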
4. Application to video interpretation
In this section, we illustrate how the representation of video
sequences by classified spatio-temporal interest points can
be used for video interpretation. We consider the problem of
detecting walking people and estimating their poses when
viewed from the side in outdoor scenes. Such a task is
complicated, since the variations in appearance of people
together with the variations in the background may lead to
ambiguous interpretations. Human motion is a strong cue
that has been used to resolve this ambiguity in a number
of previous works. Some of the works rely on pure spatial
image features while using sophisticated body models and
tracking schemes to constrain the interpretation [2, 5, 27].
Other approaches use spatio-temporal image cues such as
optical flow [3] or motion templates [2].
The idea of this approach is to represent both the model
and the data using local and discriminative spatio-temporal
features and to match the model by matching its features