Space-time Interest Points
Ivan Laptev and Tony Lindeberg
Computational Vision and Active Perception Laboratory (CVAP)
Dept. of Numerical Analysis and Computer Science
KTH, SE-100 44 Stockholm, Sweden
{laptev, tony}@nada.kth.se
Abstract
Local image features or interest points provide compact and abstract representations of patterns in an image. In this paper, we propose to extend the notion of spatial interest points into the spatio-temporal domain and show how the resulting features often reflect interesting events that can be used for a compact representation of video data as well as for its interpretation.

To detect spatio-temporal events, we build on the idea of the Harris and Förstner interest point operators and detect local structures in space-time where the image values have significant local variations in both space and time. We then estimate the spatio-temporal extents of the detected events and compute their scale-invariant spatio-temporal descriptors. Using such descriptors, we classify events and construct video representation in terms of labeled space-time points. For the problem of human motion analysis, we illustrate how the proposed method allows for detection of walking people in scenes with occlusions and dynamic backgrounds.
1. Introduction
Analyzing and interpreting video is a growing topic in computer vision and its applications. Video data contains information about changes in the environment and is highly important for many visual tasks including navigation, surveillance and video indexing.

Traditional approaches for motion analysis mainly involve the computation of optic flow [1] or feature tracking [28, 4]. Although very effective for many tasks, both of these techniques have limitations. Optic flow approaches mostly capture first-order motion and often fail when the motion has sudden changes. Feature trackers often assume a constant appearance of image patches over time and may hence fail when this appearance changes, for example, in situations when two objects in the image merge or split.
The support from the Swedish Research Council and from the Royal
Swedish Academy of Sciences as well as the Knut and Alice Wallenberg
Foundation is gratefully acknowledged.
Figure 1: Result of detecting the strongest spatio-temporal interest point in a football sequence with a player heading the ball. The detected event corresponds to the high spatio-temporal variation of the image data, or a “space-time corner”, as illustrated by the spatio-temporal slice on the right.
Image structures in video are not restricted to constant velocity and/or constant appearance over time. On the contrary, many interesting events in video are characterized by strong variations of the data in both the spatial and the temporal dimensions. As examples, consider scenes with a person entering a room, applauding hand gestures, a car crash or a water splash; see also the illustration in figure 1.

More generally, points with non-constant motion correspond to accelerating local image structures that might correspond to accelerating objects in the world. Hence, such points might contain important information about the forces that act in the environment and change its structure.

In the spatial domain, points with a significant local variation of image intensities have been extensively investigated in the past [9, 11, 26]. Such image points are frequently denoted as “interest points” and are attractive due to their high information contents. Highly successful applications of interest point detectors have been presented for image indexing [25], stereo matching [30, 23, 29], optic flow estimation and tracking [28], and recognition [20, 10].

In this paper we detect interest points in the spatio-temporal domain and illustrate how the resulting space-time features often correspond to interesting events in video data. To detect spatio-temporal interest points, we build on the idea of the Harris and Förstner interest point operators [11, 9] and describe the detection method in section 2. To capture events with different spatio-temporal extents [32],
Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV’03)
0-7695-1950-4/03 $ 17.00 © 2003 IEEE

we compute interest points in spatio-temporal scale-space and select scales that roughly correspond to the size of the detected events in space and to their durations in time.

In section 3 we show how interesting events in video can be learned and classified using k-means clustering and point descriptors defined by local spatio-temporal image derivatives. In section 4 we consider video representation in terms of classified spatio-temporal interest points and demonstrate how this representation can be efficient for the task of video registration. In particular, we present an approach for detecting walking people in complex scenes with occlusions and dynamic background. Finally, section 5 concludes the paper with the discussion of the method.
2. Interest point detection
2.1. Interest points in spatial domain
The idea of the Harris interest point detector is to detect locations in a spatial image $f^{sp}$ where the image values have significant variations in both directions. For a given scale of observation $\sigma_l^2$, such interest points can be found from a windowed second moment matrix integrated at scale $\sigma_i^2 = s\,\sigma_l^2$

$$\mu^{sp} = g^{sp}(\cdot;\sigma_i^2) * \begin{pmatrix} (L^{sp}_x)^2 & L^{sp}_x L^{sp}_y \\ L^{sp}_x L^{sp}_y & (L^{sp}_y)^2 \end{pmatrix}, \qquad (1)$$

where $L^{sp}_x$ and $L^{sp}_y$ are Gaussian derivatives defined as

$$L^{sp}_x(\cdot;\sigma_l^2) = \partial_x\,(g^{sp}(\cdot;\sigma_l^2) * f^{sp}), \qquad L^{sp}_y(\cdot;\sigma_l^2) = \partial_y\,(g^{sp}(\cdot;\sigma_l^2) * f^{sp}), \qquad (2)$$

and where $g^{sp}$ is the spatial Gaussian kernel

$$g^{sp}(x,y;\sigma^2) = \frac{1}{2\pi\sigma^2}\exp\!\left(-(x^2+y^2)/2\sigma^2\right). \qquad (3)$$
As the eigenvalues $\lambda_1, \lambda_2$ ($\lambda_1 \le \lambda_2$) of $\mu^{sp}$ represent characteristic variations of $f^{sp}$ in both image directions, two significant values of $\lambda_1, \lambda_2$ indicate the presence of an interest point. To detect such points, Harris and Stephens [11] propose to detect positive maxima of the corner function

$$H^{sp} = \det(\mu^{sp}) - k\,\mathrm{trace}^2(\mu^{sp}) = \lambda_1\lambda_2 - k(\lambda_1 + \lambda_2)^2. \qquad (4)$$
2.2. Interest points in the space-time
The idea of interest points in the spatial domain can be extended into the spatio-temporal domain by requiring the image values in space-time to have large variations in both the spatial and the temporal dimensions. Points with such properties will be spatial interest points with a distinct location in time corresponding to moments with non-constant motion of the image in a local spatio-temporal neighborhood [15].
To model a spatio-temporal image sequence, we use a function $f : \mathbb{R}^2 \times \mathbb{R} \to \mathbb{R}$ and construct its linear scale-space representation $L : \mathbb{R}^2 \times \mathbb{R} \times \mathbb{R}_+^2 \to \mathbb{R}$ by convolution of $f$ with an anisotropic Gaussian kernel¹ with distinct spatial variance $\sigma_l^2$ and temporal variance $\tau_l^2$

$$L(\cdot;\sigma_l^2,\tau_l^2) = g(\cdot;\sigma_l^2,\tau_l^2) * f(\cdot), \qquad (5)$$

where the spatio-temporal separable Gaussian kernel is defined as

$$g(x,y,t;\sigma_l^2,\tau_l^2) = \frac{\exp\!\left(-(x^2+y^2)/2\sigma_l^2 - t^2/2\tau_l^2\right)}{\sqrt{(2\pi)^3\,\sigma_l^4\,\tau_l^2}}. \qquad (6)$$
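Since the kernel in (6) is separable, it equals a product of three 1-D Gaussians, which is what makes the smoothing in (5) implementable as three 1-D convolution passes. A quick numerical check of this factorization (the sample arguments below are arbitrary choices of ours):

```python
import numpy as np

def g_st(x, y, t, sigma2, tau2):
    """Spatio-temporal Gaussian kernel of eq. (6)."""
    return (np.exp(-(x ** 2 + y ** 2) / (2 * sigma2) - t ** 2 / (2 * tau2))
            / np.sqrt((2 * np.pi) ** 3 * sigma2 ** 2 * tau2))

def g1(u, s2):
    """1-D Gaussian with variance s2."""
    return np.exp(-u ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

# Separability: g(x, y, t) = g1(x) * g1(y) * g1(t) for any arguments.
x, y, t, s2, t2 = 0.5, -1.0, 2.0, 2.0, 3.0
lhs = g_st(x, y, t, s2, t2)
rhs = g1(x, s2) * g1(y, s2) * g1(t, t2)
```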
Similar to the spatial domain, we consider the spatio-temporal second-moment matrix, which is a 3-by-3 matrix composed of first order spatial and temporal derivatives averaged with a Gaussian weighting function $g(\cdot;\sigma_i^2,\tau_i^2)$

$$\mu = g(\cdot;\sigma_i^2,\tau_i^2) * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix}, \qquad (7)$$
where the integration scales are $\sigma_i^2 = s\,\sigma_l^2$ and $\tau_i^2 = s\,\tau_l^2$, while the first-order derivatives are defined as $L_\xi(\cdot;\sigma_l^2,\tau_l^2) = \partial_\xi (g * f)$. The second-moment matrix $\mu$ has been used previously by Nagel and Gehrke [24] in the context of optic flow computation.
To detect interest points, we search for regions in $f$ having significant eigenvalues $\lambda_1, \lambda_2, \lambda_3$ of $\mu$. Among different approaches to find such regions, we choose to extend the Harris corner function (4) defined for the spatial domain into the spatio-temporal domain by combining the determinant and the trace of $\mu$ in the following way

$$H = \det(\mu) - k\,\mathrm{trace}^3(\mu) = \lambda_1\lambda_2\lambda_3 - k(\lambda_1 + \lambda_2 + \lambda_3)^3. \qquad (8)$$
To show that the positive local maxima of $H$ correspond to points with high values of $\lambda_1, \lambda_2, \lambda_3$ ($\lambda_1 \le \lambda_2 \le \lambda_3$), we define the ratios $\alpha = \lambda_2/\lambda_1$ and $\beta = \lambda_3/\lambda_1$ and rewrite $H = \lambda_1^3\,(\alpha\beta - k(1+\alpha+\beta)^3)$. Then, from the requirement $H \ge 0$, we get $k \le \alpha\beta/(1+\alpha+\beta)^3$ and it follows that as $k$ increases towards its maximal value $k = 1/27$, both ratios $\alpha$ and $\beta$ tend to one. For sufficiently large values of $k$, positive local maxima of $H$ correspond to points with high variation of the image gray-values in both the spatial and the temporal dimensions. Thus, spatio-temporal interest points of $f$ can be found by detecting local positive spatio-temporal maxima in $H$.
¹ In general, convolution with a Gaussian kernel in the temporal domain violates causality constraints, since the temporal image data is available only for the past. Whereas for real-time implementations this problem can be solved using causal recursive filters [12, 19], in this paper we simplify the investigation and assume that the data is available for a sufficiently long period of time and that the image sequence can hence be convolved with a truncated Gaussian in both space and time. However, the proposed interest points can be computed using recursive filters in on-line mode.
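The detection pipeline of eqs. (5)-(8) can likewise be sketched numerically. The following is an illustrative sketch under our own choices of synthetic input and parameter values ($\sigma_l$, $\tau_l$, $s$, $k$), not the authors' implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d(f, sigma_l=1.5, tau_l=1.5, s=2.0, k=0.005):
    """Spatio-temporal corner function H = det(mu) - k*trace^3(mu), eqs. (5)-(8).
    f is a video volume indexed as f[t, y, x]."""
    sig = (tau_l, sigma_l, sigma_l)              # separable smoothing, eqs. (5)-(6)
    L = gaussian_filter(f, sig)
    Lt, Ly, Lx = np.gradient(L)                  # first-order derivatives
    g = lambda a: gaussian_filter(a, tuple(np.sqrt(s) * v for v in sig))
    # Entries of the 3x3 second-moment matrix of eq. (7)
    xx, yy, tt = g(Lx * Lx), g(Ly * Ly), g(Lt * Lt)
    xy, xt, yt = g(Lx * Ly), g(Lx * Lt), g(Ly * Lt)
    det = (xx * (yy * tt - yt ** 2)
           - xy * (xy * tt - yt * xt)
           + xt * (xy * yt - yy * xt))
    return det - k * (xx + yy + tt) ** 3         # eq. (8)

# Synthetic event: a square that is static, translates for a while, then stops.
f = np.zeros((32, 32, 32))
for ti in range(32):
    x0 = int(np.clip(ti - 10, 0, 12))            # motion between frames 10 and 22
    f[ti, 8:24, x0:x0 + 8] = 1.0
H = harris3d(f)
t, y, x = np.unravel_index(np.argmax(H), H.shape)   # strongest space-time point
```

The strongest response should occur around the moments when the motion starts or stops (here near frames 10 and 22), i.e. at a "space-time corner": constant-velocity edges give linearly dependent derivatives, hence a rank-deficient $\mu$ and a small determinant.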

2.3. Experiments on synthetic sequences
To illustrate the detection of spatio-temporal interest points on synthetic image sequences, we show the spatio-temporal data as 3-D space-time plots where the original signal is represented by a threshold surface while the detected interest points are presented by ellipsoids with semi-axes proportional to the corresponding scale parameters $\sigma_l$ and $\tau_l$.
Figure 2: Results of detecting spatio-temporal interest points on synthetic image sequences: (a) Moving corner; (b) A merge of a ball and a wall; (c) Collision of two balls with interest points detected at scales $\sigma_l^2 = 8$ and $\tau_l^2 = 8$; (d) the same as in (c) but with interest points detected at scales $\sigma_l^2 = 16$ and $\tau_l^2 = 16$.
Figure 2a illustrates a sequence with a moving corner. The interest point is detected at the moment in time when the motion of the corner changes direction. This type of event occurs frequently in natural sequences, such as sequences of articulated motion. Other typical types of events detected by the proposed method are splits and mergers of image structures. In figure 2b, the interest point is detected at the moment and the position corresponding to the collision of a ball and a wall. Similarly, interest points are detected at the moment of collision and bouncing of two balls as shown in figure 2c-d. Note that different types of events are detected depending on the scale of observation.

In general, the result of the interest point detector will depend on the scale parameters. Hence, the correct estimation of the spatio-temporal extents of events is highly important for their detection and further interpretation.
2.4. Scale selection in space-time
To estimate the spatio-temporal extent of an event in space-time, we follow the idea of local scale selection proposed in the spatial domain by Lindeberg [18] as well as in the temporal domain [17]. As a prototype event we study a spatio-temporal Gaussian blob $f = g(x,y,t;\sigma_0^2,\tau_0^2)$ with spatial variance $\sigma_0^2$ and temporal variance $\tau_0^2$. Using the semi-group property of the Gaussian kernel, it follows that the scale-space representation of $f$ is $L(x,y,t;\sigma^2,\tau^2) = g(x,y,t;\sigma_0^2+\sigma^2,\tau_0^2+\tau^2)$.
To recover the spatio-temporal extent $(\sigma_0, \tau_0)$ of $f$ we consider the scale-normalized spatio-temporal Laplacian operator defined by

$$\nabla^2_{\mathrm{norm}} L = L_{xx,\mathrm{norm}} + L_{yy,\mathrm{norm}} + L_{tt,\mathrm{norm}}, \qquad (9)$$

where $L_{xx,\mathrm{norm}} = \sigma^{2a}\tau^{2b}L_{xx}$ and $L_{tt,\mathrm{norm}} = \sigma^{2c}\tau^{2d}L_{tt}$. As shown in [15], given the appropriate normalization parameters $a = 1$, $b = 1/4$, $c = 1/2$ and $d = 3/4$, the size of the blob $f$ can be estimated from the scale values $\tilde{\sigma}^2$ and $\tilde{\tau}^2$ for which $\nabla^2_{\mathrm{norm}} L$ assumes local extrema over scales, space and time. Hence, the spatio-temporal extent of the blob can be estimated by detecting local extrema of

$$\nabla^2_{\mathrm{norm}} L = \sigma^2\tau^{1/2}(L_{xx} + L_{yy}) + \sigma\tau^{3/2}L_{tt} \qquad (10)$$

over both spatial and temporal scales.
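The scale-selection property of (10) can be checked numerically on the prototype Gaussian blob: evaluating the scale-normalized Laplacian at the blob centre over a set of candidate scales, its magnitude should peak where the filter variances match the blob variances. The sketch below uses our own grid size and scale set; note that $\sigma$ and $\tau$ in (10) denote standard deviations, so in terms of the variances $\sigma^2, \tau^2$ the normalization powers are $1$, $1/4$, $1/2$ and $3/4$:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lap_norm_center(f, sigma2, tau2):
    """Scale-normalized spatio-temporal Laplacian (eq. 10) at the volume centre.
    f is indexed as f[t, y, x]; sigma2 and tau2 are the filter variances."""
    L = gaussian_filter(f, (np.sqrt(tau2), np.sqrt(sigma2), np.sqrt(sigma2)))
    Ltt, Lyy, Lxx = [np.gradient(np.gradient(L, axis=a), axis=a) for a in (0, 1, 2)]
    c = tuple(n // 2 for n in f.shape)
    # sigma^2 tau^(1/2) (Lxx + Lyy) + sigma tau^(3/2) Ltt, with sigma = sqrt(sigma2)
    return (sigma2 * tau2 ** 0.25 * (Lxx[c] + Lyy[c])
            + sigma2 ** 0.5 * tau2 ** 0.75 * Ltt[c])

# Prototype event: a Gaussian blob with sigma0^2 = tau0^2 = 4.
n = 33
t, y, x = np.meshgrid(*(np.arange(n) - n // 2,) * 3, indexing="ij")
f = np.exp(-(x ** 2 + y ** 2) / (2 * 4.0) - t ** 2 / (2 * 4.0))

scales = [1.0, 2.0, 4.0, 8.0, 16.0]
resp = [abs(lap_norm_center(f, s2, s2)) for s2 in scales]
best = scales[int(np.argmax(resp))]   # should recover sigma0^2 = 4
```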
2.5. Scale-adapted space-time interest points
Local scale estimation using the normalized Laplace operator has been shown to be very useful in the spatial domain [18, 6]. In particular, Mikolajczyk and Schmid [22] combined the Harris interest point operator with the normalized Laplace operator and derived the scale-invariant Harris-Laplace interest point detector. The idea is to find points in scale-space that are both maxima of the Harris function $H^{sp}$ (4) in space and extrema of the scale-normalized spatial Laplace operator over scale.

Here, we extend this idea and detect interest points that are simultaneous maxima of the spatio-temporal corner function $H$ (8) as well as extrema of the normalized spatio-temporal Laplace operator $\nabla^2_{\mathrm{norm}} L$ (9). Hence, we detect interest points for a set of sparsely distributed scale values and then track these points in spatio-temporal scale-space towards the extrema of $\nabla^2_{\mathrm{norm}} L$. We do this by iteratively updating the scale and the position of the interest points by (i) selecting the neighboring spatio-temporal scale that maximizes $(\nabla^2_{\mathrm{norm}} L)^2$ and (ii) re-detecting the space-time location of the interest point at the new scale until the position and the scale converge to stable values [15].
To illustrate the performance of the scale-adapted spatio-temporal interest point detector, let us consider a sequence with a walking person and non-constant image velocities due to the oscillating motion of the legs. As can be seen in figure 3, the pattern gives rise to stable interest points. Note that the detected points are well-localized both in space and time and correspond to events such as the stopping and starting feet. From the space-time plot in figure 3(a), we can also

Figure 3: Results of detecting spatio-temporal interest points for the motion of the legs of a walking person: (a) 3-D plot with a threshold surface of a leg pattern (upside down) and detected interest points; (b) interest points overlaid on single frames in the sequence.

observe how the selected spatial and temporal scales of the detected features roughly match the spatio-temporal extents of the corresponding image structures.
Figure 4: Result of interest point detection for a sequence with waving hand gestures: (a) Interest points for hand gestures with high frequency; (b) Interest points for hand gestures with low frequency.
The second example explicitly illustrates how the proposed method is able to estimate the temporal extent of detected events. Figure 4 shows a person making hand-waving gestures with high frequency on the left and low frequency on the right. The distinct interest points are detected at moments and at spatial positions where the hand changes its direction of motion. Whereas the spatial scales of the detected interest points remain roughly constant, the selected temporal scales depend on the frequency of the wave pattern.
3. Classification of events
The detected interest points have significant variations of image values in their local spatio-temporal neighborhoods. To differentiate events from each other and from noise, one approach is to compare their local neighborhoods and assign points with similar neighborhoods to the same class of events. A similar approach has proven successful in the spatial domain for the tasks of image representation [21], indexing [25] and recognition [10, 31, 16]. In the spatio-temporal domain, local descriptors have been used previously by [7] and others.
To describe a spatio-temporal neighborhood we consider normalized spatio-temporal Gaussian derivatives defined as

$$L_{x^m y^n t^k} = \sigma^{m+n}\tau^{k}\,(\partial_{x^m y^n t^k}\, g) * f, \qquad (11)$$

computed at the scales used for detecting the corresponding interest points. The normalization with respect to the scale parameters guarantees the invariance of the derivative responses with respect to image scalings in both the spatial domain and the temporal domain. Using derivatives, we define event descriptors from the third order local jet² [13]

$$j = (L_x, L_y, L_t, L_{xx}, \ldots, L_{ttt}). \qquad (12)$$

To compare two events we compute the Mahalanobis distance between their descriptors as

$$d^2(j_1, j_2) = (j_1 - j_2)\,\Sigma^{-1}\,(j_1 - j_2)^T, \qquad (13)$$

where $\Sigma$ is a covariance matrix corresponding to the typical distribution of interest points in the data.
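The distance (13) is straightforward to compute. Below is a minimal sketch with a toy three-component descriptor and a hand-picked diagonal covariance; the actual jet descriptor (12) is higher-dimensional and $\Sigma$ is estimated from the data:

```python
import numpy as np

def mahalanobis2(j1, j2, Sigma):
    """Squared Mahalanobis distance between two descriptors, eq. (13)."""
    d = np.asarray(j1, float) - np.asarray(j2, float)
    return float(d @ np.linalg.inv(Sigma) @ d)

# Toy example: with a diagonal covariance the distance reduces to a
# per-component variance-weighted squared difference.
Sigma = np.diag([1.0, 4.0, 0.25])
d2 = mahalanobis2([1.0, 2.0, 0.5], [0.0, 0.0, 0.0], Sigma)  # 1/1 + 4/4 + 0.25/0.25
```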
To detect similar events in the data, we apply k-means clustering [8] in the space of point descriptors and detect groups of points with similar spatio-temporal neighborhoods. The clustering of spatio-temporal neighborhoods is similar to the idea of textons [21] used to describe image texture as well as to detect object parts for spatial recognition [31]. Given training sequences with periodic motion, we can expect repeating events to give rise to populated clusters. On the contrary, sporadic interest points can be expected to be sparsely distributed over the descriptor space, giving rise to weakly populated clusters. To test this idea we applied k-means clustering with $k = 15$ to the sequence of a walking person in the upper row of figure 5. We found that four of the most densely populated clusters c1, ..., c4 indeed corresponded to the stable interest points of the gait pattern. Local spatio-temporal neighborhoods of these points are shown in figure 6, where we can confirm the similarity of patterns inside the clusters and their difference between clusters.
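The clustering step can be sketched with a plain Lloyd-style k-means on toy descriptors, where two dense point groups stand in for repeating events and a few uniform points for sporadic detections (all data below is synthetic and illustrative):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=None):
    """Plain Lloyd's k-means; returns (cluster means, point labels)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each point to its nearest mean, then recompute the means
        labels = ((X[:, None, :] - means[None]) ** 2).sum(-1).argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):
                means[i] = X[labels == i].mean(axis=0)
    return means, labels

rng = np.random.default_rng(0)
repeating_a = rng.normal([0.0, 0.0], 0.1, (40, 2))   # one repeating event type
repeating_b = rng.normal([5.0, 5.0], 0.1, (40, 2))   # another repeating event type
sporadic = rng.uniform(-10.0, 15.0, (6, 2))          # sporadic interest points
X = np.vstack([repeating_a, repeating_b, sporadic])
means, labels = kmeans(X, k=4, seed=1)
counts = np.bincount(labels, minlength=4)            # dense clusters dominate
```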
² Note that our representation is currently not invariant with respect to planar image rotations. Such invariance could be added if considering steerable derivatives or rotationally invariant operators in space.

[Figure 5 panels: first row "K-means clustering of interest points", second row "Classification of interest points"; cluster labels c1-c4 overlaid on the frames.]
Figure 5: Interest points detected for sequences of walking persons. First row: result of clustering spatio-temporal interest
points. Labeled points correspond to the four most populated clusters; Second row: result of classification of interest points
with respect to the clusters found in the first sequence.
[Figure 6 panels c1-c4: 3-D (X, Y, Time) plots of local neighborhoods.]
Figure 6: Local spatio-temporal neighborhoods of interest points corresponding to the first four most populated clusters.
To represent characteristic repetitive events in video, we compute cluster means $m_i = \frac{1}{n_i}\sum_{k=1}^{n_i} j_k$ for each significant cluster $c_i$ consisting of $n_i$ points. Then, in order to classify an event in an unseen sequence, we assign the detected point to the cluster $c_i$ if it minimizes the distance $d(m_i, j_0)$ (13) between the jet of the interest point $j_0$ and the cluster mean $m_i$. If the distance is above a threshold, the point is classified as background. Application of this classification scheme is demonstrated in the second row of figure 5. As can be seen, most of the points corresponding to the gait pattern are correctly classified while the other interest points are discarded. Observe that the person in the second sequence of figure 5 undergoes significant size changes in the image. Due to the scale-invariance of the interest points as well as of the jet responses, size transformations affect neither the result of event detection nor the result of classification.
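The classification rule above amounts to a nearest-cluster-mean assignment under the distance (13), with a rejection threshold for background. A minimal sketch with toy two-dimensional descriptors and hand-picked means (illustrative values only, not learned from video):

```python
import numpy as np

def classify(j0, means, Sigma, thresh):
    """Assign descriptor j0 to the nearest cluster mean under the Mahalanobis
    distance (13); return -1 (background) if that distance exceeds thresh."""
    Sinv = np.linalg.inv(Sigma)
    d2 = [float((j0 - m) @ Sinv @ (j0 - m)) for m in means]
    i = int(np.argmin(d2))
    return i if d2[i] <= thresh ** 2 else -1

means = np.array([[0.0, 0.0], [4.0, 4.0]])   # toy cluster means m_i
Sigma = np.eye(2)
label = classify(np.array([10.0, -7.0]), means, Sigma, thresh=1.0)  # far from all
```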
4. Application to video interpretation
In this section, we illustrate how the representation of video
sequences by classified spatio-temporal interest points can
be used for video interpretation. We consider the problem of
detecting walking people and estimating their poses when
viewed from the side in outdoor scenes. Such a task is
complicated, since the variations in appearance of people
together with the variations in the background may lead to
ambiguous interpretations. Human motion is a strong cue
that has been used to resolve this ambiguity in a number
of previous works. Some of the works rely on pure spatial
image features while using sophisticated body models and
tracking schemes to constrain the interpretation [2, 5, 27].
Other approaches use spatio-temporal image cues such as
optical flow [3] or motion templates [2].
The idea of this approach is to represent both the model
and the data using local and discriminative spatio-temporal
features and to match the model by matching its features