Open Access Proceedings Article DOI

Discovery of Collocation Episodes in Spatiotemporal Data

TLDR
This work formally defines the problem of mining collocation episodes, proposes two scalable algorithms for its efficient solution, and empirically evaluates the performance of the proposed methods using synthetically generated data that emulate real-world object movements.
Abstract
Given a collection of trajectories of moving objects with different types (e.g., pumas, deers, vultures, etc.), we introduce the problem of discovering collocation episodes in them (e.g., if a puma is moving near a deer, then a vulture is also going to move close to the same deer with high probability within the next 3 minutes). Collocation episodes catch the inter-movement regularities among different types of objects. We formally define the problem of mining collocation episodes and propose two scalable algorithms for its efficient solution. We empirically evaluate the performance of the proposed methods using synthetically generated data that emulate real-world object movements.



Discovery of Collocation Episodes in Spatiotemporal Data
Huiping Cao, Nikos Mamoulis, and David W. Cheung
Department of Computer Science
The University of Hong Kong
Pokfulam Road, Hong Kong
{hpcao,nikos,dcheung}@cs.hku.hk
(Work supported by grant HKU 7142/04E from Hong Kong RGC.)

Abstract

Given a collection of trajectories of moving objects with different types (e.g., pumas, deers, vultures, etc.), we introduce the problem of discovering collocation episodes in them (e.g., if a puma is moving near a deer, then a vulture is also going to move close to the same deer with high probability within the next 3 minutes). Collocation episodes catch the inter-movement regularities among different types of objects. We formally define the problem of mining collocation episodes and propose two scalable algorithms for its efficient solution. We empirically evaluate the performance of the proposed methods using synthetically generated data that emulate real-world object movements.
1 Introduction

The large volume of spatiotemporal data (i.e., moving object trajectories) renders their manual analysis tedious, if not impossible. For their efficient analysis, spatiotemporal data mining [1] has been proposed, i.e., the development and application of novel computational techniques. Given a trajectory database, our goal is to unveil inter-movement regularities among objects of different types, modeled as a sequence of collocation events. Consider an application that monitors the activities of animals (e.g., via sensors attached to them). An exemplary collocation episode for this application could be: "Once we detect that a puma is moving close to a deer for 1 minute, we expect that a vulture will also move near to this deer in 3 minutes with high probability."

A collocation episode is in fact a sequence of spatiotemporal collocation events. Such events are sets of objects moving close to each other for some period. In addition, there is a particular object type (e.g., deer), called the centric feature, which participates in a sequence of collocations (e.g., deer-puma, deer-vulture). Finally, in a valid instance of the episode the object that instantiates the common feature should be the same in all collocation instances (e.g., the same deer appears in both the deer-puma and the deer-vulture collocations). Our definition of a collocation event is a temporal extension of the spatial collocation defined in [6], which models the co-existence of a set of (non-spatial) features, such as environmental observations (humidity and pollution values), plant and animal types, etc., in a spatial neighborhood. E.g., pattern (wet, bamboo) means that, with high probability, we can find bamboo plants near places with high humidity values. Existing methods that consider only spatial relationships between static features are not directly applicable to our problem, since (i) we require a temporal duration for a valid collocation event and (ii) we search for temporal episodes of such events. Our problem also has some similarity with episode mining in sequence data [5], where frequent episodes are (partially or totally) ordered lists of events. A sliding window w is used to extract subsequences in the event series, and the contribution of every subsequence to each candidate episode's frequency is counted. However, the events in our episodes are complex collocations as opposed to simple categorical values. In addition, for a valid episode instance there must be a common feature instantiation¹ (centric feature requirement), as opposed to an appearance of any event of the same type in [5].

In view of the challenges in this problem, we propose a two-step framework for mining collocation episodes. First, we apply a hash-based technique to efficiently retrieve the object pairs whose trajectories are close during some periods, and to identify these intervals. Then, we provide two collocation episode mining algorithms (one Apriori-based approach and one based on the vertical mining paradigm) and some pruning techniques to improve the mining efficiency. Finally, we empirically evaluate the performance of the proposed methods using synthetically generated data. In the remainder of the paper, we introduce some related work, formally define the problem, outline our algorithms, and present an experimental evaluation for them.

¹ Note that two trajectories of the same type (e.g., deer) may correspond to different objects (e.g., two different deers).
2 Related Work

Besides the work reviewed in Section 1, our work is also related to pattern discovery in one-dimensional time series, e.g., [4]. Nevertheless, these problems differ in three main aspects from our work: (i) the pattern/rule element is just a symbol or an event, while our pattern unit is a topological structure with a temporal duration; (ii) patterns are defined based on a single time series, but our patterns are based on relationships among different sequences; and (iii) temporal and time-series data mining is usually based on a predefined categorization of 1D values, whereas we work on a continuous (spatiotemporal) data space. In addition, several efforts have been made to extend spatial collocation patterns [6] with temporal aspects, towards different directions. E.g., from spatiotemporal data, [9] searches for evolving collocations, which in nature are pattern components in our context, while [8] discovers topological patterns (without temporal order). Finally, a related piece of work to our problem is [3], where spatiotemporal pattern queries are proposed and studied. An intuitive example of such a query is "find the moving object that is close to location A at time t_1, and then moves to region C during time interval [t_3, t_4]". The main differences of our work are that (i) we automatically identify frequent patterns that relate the movements of objects, instead of posing explicit queries, and (ii) our patterns relate two or more trajectories (that are feature instances), instead of searching for trajectories that follow a specific "route" specified by a temporal sequence of static regions.
3 Problem Definition

This section formally defines spatiotemporal collocation episodes, by gracefully combining the concept of an episode in event sequences and that of a collocation in spatial databases.

3.1 Spatiotemporal sequences and close subsequences

A spatiotemporal sequence S is the trajectory of a moving object. Formally, it is an ordered list of location-time pairs (l_0, t_0), (l_1, t_1), ..., (l_{m-1}, t_{m-1}), where t_i < t_j if i < j (i, j in [0, m)). The pair (l_i, t_i) denotes that the object was at location l_i at time t_i. In practice, l_i is a 2D position (x_i, y_i), and t_i records the time represented in time units (e.g., one minute is a time unit). In the following discussion, the subscript of a location implicitly refers to its timestamp (i.e., l_i implies the location at time t_i).

Given n (n >= 1) objects o_1, o_2, ..., o_n, the trajectory of object o_i is denoted by S_i. Figure 1 plots three exemplary sequences S_P, S_D and S_V (abbreviating S_Puma, S_Deer, and S_Vulture, respectively). For illustration purposes, we use 1D values to represent spatial locations; however, our discussion extends naturally to the multidimensional space.
A subsequence s of S is a list of continuous location-time pairs of S: (l_{i_1}, t_{i_1}), (l_{i_2}, t_{i_2}), ..., (l_{i_q}, t_{i_q}), where for j in [1, q], l_{i_j} is in S and t_{i_j} + 1 = t_{i_{j+1}}. The starting (ending) time of s is denoted by s.t_s (s.t_e). For a complete sequence S with m positions, S.t_s = t_0 and S.t_e = t_{m-1} + 1.

Definition 1 A window is a time interval [t_s, t_e). The time span (or length) of a window [t_s, t_e) is t_e − t_s. A window with time span w is called a w-window.

Definition 2 A subsequence s is on window [t_s, t_e) if s.t_s = t_s and s.t_e = t_e. A subsequence is called a w-subsequence if it is on a w-window. Two subsequences s_i of S_i and s_j of S_j are concurrent subsequences if they are on the same window.
Figure 1: Example of trajectories and windows (plot omitted: location versus time for the Puma, Deer and Vulture trajectories, with the two 3-windows w1 and w2 marked)

Figure 2: Collocation unit and episode ((a) a unit over the features D and P; (b) the episode (D, P)(D, V))
Example: Figure 1 shows two windows with time span 3: w1 = [10, 13) and w2 = [15, 18). For S_D, the two subsequences on w1 and w2 are s_D1 = (3.5, 10), (4.3, 11), (5.0, 12) and s_D2 = (6.2, 15), (6.1, 16), (6.0, 17). For S_P, the subsequence on w1 is s_P1 = (2.5, 10), (3.6, 11), (4.6, 12), while on w2 we have s_V2 = (6.7, 15), (6.5, 16), (5.2, 17) from S_V. s_D1 and s_P1 (also: s_D2 and s_V2) are concurrent subsequences. S_D is on window [9, 20) and has eight (t_e − t_s − w = 20 − 9 − 3) 3-subsequences, the second of which is on w1.
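To make the notions of w-subsequences and concurrency concrete, here is a minimal Python sketch (ours, not the paper's; the helper name subsequence_on and the list-of-pairs trajectory encoding are assumptions) that extracts the subsequence of a trajectory on a given window:

def subsequence_on(trajectory, t_s, t_e):
    # A trajectory is a list of (location, time) pairs sampled at whole time units.
    # Return its subsequence on window [t_s, t_e), or None if some timestamp is missing.
    sub = [(loc, t) for (loc, t) in trajectory if t_s <= t < t_e]
    return sub if len(sub) == t_e - t_s else None

# Deer trajectory samples from the example of Figure 1 (intermediate samples omitted):
S_D = [(3.5, 10), (4.3, 11), (5.0, 12), (6.2, 15), (6.1, 16), (6.0, 17)]
s_D1 = subsequence_on(S_D, 10, 13)   # the 3-subsequence on w1 = [10, 13)
s_D2 = subsequence_on(S_D, 15, 18)   # the 3-subsequence on w2 = [15, 18)
# Two subsequences of different trajectories on the same window are concurrent.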
We define the closeness of two concurrent subsequences using an aggregate function aggDist over their element-to-element distances. Typically, aggDist is the maximum (max) or the average (avg) of the component distances. Assuming that s_i and s_j are both on window [t_s, t_e), and dist is some atomic distance function (e.g., Euclidean distance),

maxDist(s_i, s_j) = max_{t_s <= t < t_e} { dist(l_i^t, l_j^t) }, and
avgDist(s_i, s_j) = ( Σ_{t_s <= t < t_e} dist(l_i^t, l_j^t) ) / (t_e − t_s).

A distance threshold ε is used to model closeness:

Definition 3 Two concurrent subsequences s_i and s_j are close, denoted by close(s_i, s_j), if aggDist(s_i, s_j) <= ε.

Example: Assuming ε = 2.5 and aggDist = maxDist, the two concurrent subsequences s_D1 and s_P1 on window w1 of Figure 1 are close to each other, since for every timestamp t in [10, 13) the location pair l_D^t in s_D1 and l_P^t in s_P1 satisfies dist(l_D^t, l_P^t) <= ε.
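The following sketch (not from the paper; dist, max_dist, avg_dist and close are our names) implements Definition 3 under the stated assumptions: the two subsequences are concurrent, hence equally long and aligned by timestamp, and the atomic distance is Euclidean for coordinate pairs or the absolute difference for the 1D locations of Figure 1:

import math

def dist(p, q):
    # Atomic distance: |p - q| for 1D locations, Euclidean distance for (x, y) pairs.
    if isinstance(p, (int, float)):
        return abs(p - q)
    return math.hypot(p[0] - q[0], p[1] - q[1])

def max_dist(s_i, s_j):
    # maxDist: maximum element-to-element distance of two aligned concurrent subsequences.
    return max(dist(l_i, l_j) for (l_i, _), (l_j, _) in zip(s_i, s_j))

def avg_dist(s_i, s_j):
    # avgDist: average element-to-element distance over the window.
    return sum(dist(l_i, l_j) for (l_i, _), (l_j, _) in zip(s_i, s_j)) / len(s_i)

def close(s_i, s_j, eps, agg_dist=max_dist):
    # close(s_i, s_j) holds iff aggDist(s_i, s_j) <= eps (Definition 3).
    return agg_dist(s_i, s_j) <= eps

# The example above: s_D1 and s_P1 on w1 are close for eps = 2.5 and aggDist = maxDist.
s_D1 = [(3.5, 10), (4.3, 11), (5.0, 12)]
s_P1 = [(2.5, 10), (3.6, 11), (4.6, 12)]
assert close(s_D1, s_P1, eps=2.5)   # maxDist = 1.0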
3.2 Spatiotemporal collocations and episodes thereof

Let F be a set of moving object types (e.g., different animals). Given a database of object trajectories, the type (or feature) of an object o_i is denoted by type(o_i), such that type(o_i) is in F. In general, the number of objects n in the database can be larger than the number |F| of types in F; i.e., more than one object may belong to the same object type.

Definition 4 A spatiotemporal collocation unit g (simply unit) is an undirected graph (V, E) where each vertex in g.V is an object type in F. The length of the unit g is the number of vertices |g.V| in it. Given a unit time span w, a valid instance I_g of unit g = (V, E), where V = {f_1, f_2, ..., f_|V|}, is a set of concurrent w-subsequences {s_1, s_2, ..., s_|V|} on a window [t_s, t_e) such that (i) s_i is of type f_i (1 <= i <= |V|) and (ii) if (f_i, f_j) is in E, then close(s_i, s_j).

The starting (ending) time of I_g is denoted by I_g.t_s (I_g.t_e). For example, the two concurrent window trajectories s_D1 and s_P1 in Figure 1 form an instance of the collocation unit in Figure 2a, and the related window is [10, 13) (i.e., w1).
Definition 5 A spatiotemporal collocation pattern (or episode) P is an ordered list of spatiotemporal collocation units g_1 g_2 ... g_ℓ, where the intersection of the vertex sets, ∩_{i=1..ℓ} (g_i.V), is non-empty.

The object types in ∩_{i=1..ℓ} (g_i.V) are called the reference types (features) of pattern P. The length of the pattern P is defined by Σ_{i=1..ℓ} |g_i.V|. A pattern with length k is called a k-pattern. In this paper, we only consider the case that |∩_{i=1..ℓ} (g_i.V)| = 1, and we denote the common (reference) object type by f_r. The reference object type f_r is also called the centric feature of the pattern. Thus, we can also represent a pattern in the form (f_r, g_1.V − {f_r}) ... (f_r, g_ℓ.V − {f_r}), where f_r, the reference feature, is underlined (here, listed first in each unit).

Example: Figure 2b shows a 4-collocation episode, indicating that when a deer and a puma are close during w = 3 time units, a vulture will come close to this deer later. This episode's common feature is D, and the episode can also be represented by (D, P)(D, V).
Definition 6 Given a maximum pattern time span W, a valid instance I_P for a pattern P = g_1 g_2 ... g_ℓ is a sequence of valid unit instances I_{g_1} I_{g_2} ... I_{g_ℓ} such that (i) in all unit instances the reference feature f_r is instantiated by a subsequence of the same object sequence, (ii) for every i < j, I_{g_i}.t_e <= I_{g_j}.t_s, and (iii) I_{g_ℓ}.t_e − I_{g_1}.t_s <= W.

Example: Let ε = 2.5, w = 3, and W = 8. In Figure 1, we can identify a valid instance of the episode of Figure 2b. Specifically, s_D1 and s_P1 (s_D2 and s_V2) instantiate the first (second) unit of the pattern. In addition, s_D1 and s_D2 instantiate the common feature D in both units and they are parts of the same trajectory. Furthermore, I_{g_1}.t_e < I_{g_2}.t_s, since the end point of w1 is before w2. Finally, I_{g_2}.t_e − I_{g_1}.t_s = 18 − 10 <= W.

Given the maximal episode time span W, we say that a W-window [t_s, t_e) covers a pattern instance if the time span of the instance, [I_{g_1}.t_s, I_{g_ℓ}.t_e), satisfies t_s <= I_{g_1}.t_s and I_{g_ℓ}.t_e <= t_e. We use |I_P| to denote the number of W-windows which cover at least one instance of pattern P.
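As a sketch of how conditions (i)-(iii) of Definition 6 can be checked (our encoding, not the paper's: each unit instance is reduced to the id of the object instantiating the reference feature plus its window, and closeness inside each unit is assumed to have been verified already via Definition 4):

def valid_pattern_instance(unit_instances, W):
    # unit_instances: list of (ref_object_id, t_s, t_e) tuples, one per unit, in pattern order.
    ref_ids = {ref for (ref, _, _) in unit_instances}
    if len(ref_ids) != 1:                                   # (i) same reference object in every unit
        return False
    for (_, _, te_i), (_, ts_j, _) in zip(unit_instances, unit_instances[1:]):
        if te_i > ts_j:                                     # (ii) a unit must end before the next one starts
            return False
    span = unit_instances[-1][2] - unit_instances[0][1]
    return span <= W                                        # (iii) the whole instance fits in a W-window

# The running example: (deer, [10,13)) followed by (deer, [15,18)), with W = 8.
assert valid_pattern_instance([("deer1", 10, 13), ("deer1", 15, 18)], W=8)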
Definition 7 Pattern P = g_1 g_2 ... g_p is a superpattern of P' = g'_1 g'_2 ... g'_q if (i) P.f_r = P'.f_r and (ii) there exist q units g_{i_1} g_{i_2} ... g_{i_q} (1 <= i_j < i_{j+1} <= p, 1 <= j < q) of P such that g'_j.V is a subset of g_{i_j}.V and g'_j.E is a subset of g_{i_j}.E for j in [1, q]. P' is a subpattern of P.

For example, P = (A, B, C)(A, C, D)(A, E) is a superpattern of P' = (A, B)(A, C, D) (in both, A is the reference feature).
To measure the interestingness of a collocation episode, we use the reference type as the key factor, since it does not make sense to overcount the same instance of the reference feature (e.g., deer) with different instances of the other object types (e.g., puma, vulture) in the pattern. In addition, we consider all possible time windows W where the pattern may appear.

Definition 8 The frequency of a pattern P with reference object type f_r is
fr(P, w, W) = |I_P| / Σ_{type(o_i) = f_r} |win_i|,
where the denominator is the total number of W-subsequences in all sequences S_i with type(o_i) = f_r.

Let min_sup be the minimum frequency threshold that the users are interested in; a pattern P is frequent if fr(P, w, W) >= min_sup.
Problem Definition: Given a database of trajectories S_1, ..., S_n of n moving objects, each with type(o_i) in F, discover all the frequent spatiotemporal collocation episodes, subject to a distance threshold ε, a closeness duration window length w, a maximum pattern window length W, and a frequency threshold min_sup in [0, 1).
4 Algorithms

To find the collocation episodes, the main tasks are: (i) identify the types of objects that move closely to each other, and (ii) find on which W-windows this closeness is observed.
4.1 Finding close subsequences

The first mining phase aims at discovering object pairs of different types (f_i, f_j) that have close concurrent subsequences. The ultimate objective is to identify the collocation units that may form longer episodes. For this, we scan each S_i of type f_i to identify its w-subsequences that are close to object subsequences of a different type f_j, j ≠ i. We store the starting position t_s of each such window [t_s, t_s + w) along with the set of object types close to S_i during [t_s, t_s + w). Eventually, each trajectory S_i is converted to a feature sequence of the form S^f_i = {(F_1, t_1), (F_2, t_2), ..., (F_m, t_m)}, where F_s is the set of object types other than f_i that are close to the w-subsequence of S_i that starts at time t_s.
A naive method for the computation of S^f_i for each S_i is to scan all the other sequences in order to identify the windows and feature-sets in each S^f_i. We now present a hash-based technique, shown in Figure 3, that achieves this goal in only two database scans. In the first pass, all data are hashed to a 3D grid in the trajectory space (Line 1), where G and T are the projected lengths of each cell on the spatial and temporal dimensions, chosen to be ε and w, respectively. Then, the algorithm performs a pass over the hashed data by examining only one hyperplane of cells at a time, corresponding to a w-period. For each cell gc, the neighboring cells in the spatial dimensions having the same temporal coordinates as gc are examined. In the 2D example of Figure 4, for cell gc, starting at time t_s, and for the trajectory S_i (partially) inside gc, the (shaded) cells are checked for possible containment of subsequences which are partially close to S_i. Note that S_j is close to S_i at time t_s + 2, which means that this closeness relationship can be extended to a subsequence closeness in cells of time span [t_s + w, t_s + 2w). Thus, for each S_i the algorithm buffers such partially close subsequences that can be extended to S^f_i elements. When the next hyperplane of cells is examined, the partial closeness results are examined for potential extension and inclusion in S^f_i, along with the generation of new partial results. The sorting of cell contents by time facilitates the fast identification and extension of partial closeness results, in a merge-join fashion.
Algorithm getFS(S_1, ..., S_n, D, w)
1.  impose a spatiotemporal grid with cell extents G x G x T;
2.  hash all locations of S_1, ..., S_n to cells;
3.  for every cell gc in the grid,
4.    sort locations in gc according to their time;
5.  initialize a (partial results) buffer buf_i for each S_i;
6.  for each timestamp t_s, multiple of w
7.    GC := cells with time interval [t_s, t_s + w);
8.    for each timestamp t in [t_s, t_s + w)
9.      for each grid cell gc in GC
10.       find location pairs (l_i^t, l_j^t) within ε;
11.       extend buf_i and buf_j for each pair;
12.       if S_j in buf_i is close for at least w
13.         add (f_j, t − w + 1) to S^f_i;
14.       if S_i in buf_j is close for at least w
15.         add (f_i, t − w + 1) to S^f_j;
Figure 3: Hash-based computation of close feature sets

Figure 4: Example of hashing (sketch omitted: a 2D space-time grid with cell extents ε and w; cell gc starts at time t_s and contains part of S_i, while S_j passes through a neighboring cell)
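The sketch below conveys the idea of getFS in a simplified form; it is not the paper's implementation. Locations are hashed to a spatial grid of cell size ε per timestamp, candidate close pairs are generated only from the same or neighboring cells, and a run of at least w consecutive close timestamps produces a feature-set entry of S^f_i (so closeness here follows the maxDist aggregate). The buffering of partial results across consecutive w-hyperplanes is omitted, and all names (get_feature_sequences, close_at, etc.) are ours:

from collections import defaultdict
from math import floor

def get_feature_sequences(trajectories, types, eps, w):
    # trajectories: {obj_id: {t: (x, y)}}; types: {obj_id: feature}.
    # Returns {obj_id: {t_s: set(features)}}, approximating the S^f_i sequences.

    # 1. Hash every location sample to a spatial cell of side eps, separately per timestamp.
    grid = defaultdict(list)                               # (t, cell_x, cell_y) -> [obj_id, ...]
    for oid, samples in trajectories.items():
        for t, (x, y) in samples.items():
            grid[(t, floor(x / eps), floor(y / eps))].append(oid)

    # 2. Compare objects only against objects in the same or neighboring cells (same timestamp),
    #    recording the timestamps at which two objects of different types are within eps.
    close_at = defaultdict(set)                            # (obj_i, obj_j) -> {t, ...}
    for (t, cx, cy), objs in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for oi in objs:
                    for oj in grid.get((t, cx + dx, cy + dy), []):
                        if oi == oj or types[oi] == types[oj]:
                            continue
                        (xi, yi), (xj, yj) = trajectories[oi][t], trajectories[oj][t]
                        if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2:
                            close_at[(oi, oj)].add(t)

    # 3. Every run of w consecutive close timestamps yields one feature-set entry of S^f_i.
    feature_seq = defaultdict(lambda: defaultdict(set))
    for (oi, oj), timestamps in close_at.items():
        for t_s in timestamps:
            if all(t_s + k in timestamps for k in range(w)):
                feature_seq[oi][t_s].add(types[oj])
    return feature_seq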
4.2 Discovery of collocation episodes

In this section, we present two algorithms that discover frequent collocations, based on different usages of the transformed sequences of feature sets.

4.2.1 Pattern extraction from sequences of feature sets

The first algorithm, Apriori, shown in Figure 5, finds the collocations level-by-level. It takes as input the close feature sets S^f_i found for each S_i and the minimum frequency mincnt_{f_r} (= min_sup x |win|) for an episode to be frequent. First, the S^f_i's are partitioned into |F| groups, one for each different f_i in F. Thus, the group S^f_r for feature f_r is used to find the patterns having f_r as their reference feature. We note that the apriori property holds for frequent episodes, i.e., if P' is a superpattern of P, then fr(P, w, W) >= fr(P', w, W). In this algorithm, when we measure the length of a pattern, we exclude the reference feature f_r from the units, since it is implicit. For example, a 3-candidate (f_i)(f_j, f_k) represents a real 5-candidate (f_r, f_i)(f_r, f_j, f_k).

Function gen_cand, used to generate the ℓ-candidates from the (ℓ−1)-patterns, is exactly as in sequential pattern mining [7], so we do not discuss it in detail.

Algorithm Apriori(S^f_r, W, mincnt_{f_r})
1.  L_1 := frequent 1-patterns; ℓ := 2;
2.  while (L_{ℓ−1} ≠ ∅)
3.    C_ℓ := gen_cand(L_{ℓ−1});
4.    for each S^f_i in S^f_r
5.      slide_window(C_ℓ, S^f_i, W);
6.    L_ℓ := {P in C_ℓ | P.count >= mincnt_{f_r}};
7.    ℓ := ℓ + 1;
8.  return L := union of all L_ℓ;
Figure 5: Apriori-based algorithm

Algorithm MJ(S^f_r, W, mincnt_{f_r})
1.  generate ITList_{f_r}(f_j) for each 1-pattern (f_j);
2.  use the ITList_{f_r}(f_j) lists to generate L_1;
3.  ℓ := 2;
4.  while (L_{ℓ−1} ≠ ∅)
5.    C_ℓ := gen_cand(L_{ℓ−1});
6.    for each P in C_ℓ
7.      ITList_{f_r}(P) := MJ_count_cand(P, W);
8.    L_ℓ := {P in C_ℓ | P.count >= mincnt_{f_r}};
9.    ℓ := ℓ + 1;
10. return L := union of all L_ℓ;
Figure 6: Merge join algorithm
The patterns excluding the reference features are similar to the sequential patterns in transactional databases [7]. However, counting the support of our patterns is different, since we consider all positions of a sliding window, whereas for sequential patterns each transaction sequence contributes one or none to a sequential pattern (depending on whether the sequence is a superpattern of it or not). Function slide_window is used to count |I_c| (the number of windows that contain valid instances of c) for each candidate c in C_ℓ from a transformed sequence S^f_i in S^f_r. In brief, the idea is to slide a W-window over S^f_i to get a subsequence of feature sets. For each subsequence s on a W-window, we find the candidates that have a valid instance which is covered (i.e., supported) by s, and increase their count. Sliding window counting for event episodes has also been proposed in [5]; however, the valid instances in our case are more difficult to count, because of the constraint that one collocation unit instance should end before the beginning of the next one (see condition (ii) in Definition 6). For example, assuming w = 3 and S^f_i = {((f_1, f_2), 10), ((f_1, f_3), 11), ((f_4), 14)}, pattern c_1 = (f_1, f_2)(f_4) is supported by S^f_i, but pattern c_2 = (f_1, f_2)(f_3) is not, since f_3 is close to the reference feature at time 11, which is before the end of the unit containing f_2 (10 + w = 13). In simple words, in a valid pattern instance the collocation unit instances should not overlap in time.
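A sketch of the check behind slide_window for a single candidate, under our own simplified encoding (the paper's figure gives the actual procedure): a transformed sequence is a sorted list of (feature_set, t_s) entries, and a candidate is a list of unit feature sets with the reference feature omitted. The function tests whether the sequence contains a chain of non-overlapping unit instances whose total span fits in W:

def supports(feature_seq, candidate, w, W):
    # feature_seq: list of (frozenset_of_features, t_s) entries of one S^f_i, sorted by t_s.
    # candidate: list of frozensets, one per unit, reference feature omitted.
    def extend(unit_idx, earliest_start, first_start):
        if unit_idx == len(candidate):
            return True                              # all units instantiated
        for fset, t in feature_seq:
            if t < earliest_start:
                continue                             # this unit must start after the previous unit ends
            if first_start is not None and t + w - first_start > W:
                break                                # total span would exceed W (entries are sorted)
            if candidate[unit_idx] <= fset:          # all features of the unit are close at time t
                if extend(unit_idx + 1, t + w, first_start if first_start is not None else t):
                    return True
        return False
    return extend(0, float("-inf"), None)

# The example above: w = 3 and S^f_i = {((f1,f2),10), ((f1,f3),11), ((f4),14)}.
seq = [(frozenset({"f1", "f2"}), 10), (frozenset({"f1", "f3"}), 11), (frozenset({"f4"}), 14)]
assert supports(seq, [frozenset({"f1", "f2"}), frozenset({"f4"})], w=3, W=8)        # c1 is supported
assert not supports(seq, [frozenset({"f1", "f2"}), frozenset({"f3"})], w=3, W=8)    # c2 is not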
Optimizing the support counting. While sliding a W-window over the transformed sequence, if the subsequence of S^f_i covered by the window remains the same compared to the previous window position, the set of candidates supported by the window does not change. As a result, we examine only positions of the W-window where either (i) a feature-set F is included in the window for the first time, or (ii) F ceases to be included in the window (compared to the previous position). E.g., let w = 3, W = 8, and S^f_i = {((f_1, f_2), 10), ((f_1, f_3), 11), ((f_4), 14)}. Since only three windows, [5, 13), [6, 14), [9, 17), correspond to the event of a feature-set entering the sliding window, and two windows, [11, 19), [12, 20), correspond to the event that a feature-set leaves the window, we just need to examine these five windows. Each feature-set (F_i, t_i) in S^f_i affects two positions of window [t_s, t_e): the one with t_e = t_i + w (where F_i enters the window) and the one with t_s = t_i + 1 (where F_i leaves it). As a result, the cost of examining a feature-set sequence S^f_i becomes proportional to |S^f_i|, instead of the number of window positions (which normally is much larger).
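A small sketch of which window positions this optimization examines (our code and names; the restriction to windows that still cover some feature-set is our reading of why the example lists five windows rather than six):

def positions_to_examine(entry_times, w, W):
    # Start times t_s of the W-windows [t_s, t_s + W) where a feature-set enters
    # (window end t_e = t_i + w, i.e. t_s = t_i + w - W) or leaves (t_s = t_i + 1),
    # keeping only windows that still fully cover at least one feature-set.
    candidates = {t_i + w - W for t_i in entry_times} | {t_i + 1 for t_i in entry_times}
    return sorted(t_s for t_s in candidates
                  if any(t_s <= t_i and t_i + w <= t_s + W for t_i in entry_times))

# The example above: w = 3, W = 8, feature-sets at times 10, 11 and 14.
assert positions_to_examine([10, 11, 14], w=3, W=8) == [5, 6, 9, 11, 12]
# i.e. the five windows [5,13), [6,14), [9,17), [11,19) and [12,20).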
Figure 7 shows in detail this optimized counting method, applied to each S^f_i. To avoid overcounting a pattern having more than one instance at a window position, when we detect a valid instance we add to its support only the window positions where previous instances are not valid. For this, we maintain a variable c.last for each candidate (initialized to −1), indicating the last known position of W having an instance of c. In addition, the algorithm keeps track of the feature-sets fs contained in W. Whenever a feature-set F exits the sliding window, it is removed from fs. If a new F enters fs, we search for candidates for which the last unit is instantiated by some features in F (instances not affected by F are identified at earlier positions of W); i.e., only candidates for which the features in the last unit are all contained in F are checked for instantiation. For each candidate, if we detect a valid instance at the current window position, we look for the pattern instance with the latest starting time I_{g_1}.t_s. The support of the candidate is then updated with the number of window positions, I_{g_1}.t_s − t_s + 1, during which the pattern instance remains valid (when t_s > I_{g_1}.t_s, the instance becomes outdated). Finally, if some window positions were already counted due to the last detected pattern for c, i.e., if c.last >= t_s, then we add I_{g_1}.t_s − c.last to c.count (instead of I_{g_1}.t_s − t_s + 1), in order not to overcount the specific candidate.
Function slide_window(C_ℓ, S^f_i, W)
1.  for each candidate c
2.    c.last := −1; c.count := 0; fs := ∅;
3.  slide a [t_s, t_e) W-window over S^f_i
4.    if some feature set F in fs becomes outdated
5.      fs := fs − F;
6.    if some feature set F enters the window
7.      fs := fs + F;
8.      for each candidate c
9.        find instance of c with I_{g_ℓ} instantiated by F
10.         and largest possible I_{g_1}.t_s;
11.       if there exists such an instance
12.         h := min{I_{g_1}.t_s − t_s + 1, I_{g_1}.t_s − c.last};
13.         c.count := c.count + h;
14.         c.last := I_{g_1}.t_s;
Figure 7: Optimized support counting
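For comparison, the unoptimized count that Figure 7 speeds up can be written directly on top of the supports() sketch given earlier; the enumeration range of window positions below is our own simplification:

def count_supporting_windows(feature_seq, candidate, w, W):
    # |I_c| restricted to one transformed sequence: number of W-window positions
    # [t_s, t_s + W) that cover at least one valid instance of `candidate`.
    if not feature_seq:
        return 0
    first_t, last_t = feature_seq[0][1], feature_seq[-1][1]
    count = 0
    for t_s in range(first_t + w - W, last_t + 1):
        # keep only the feature-sets whose w-window lies fully inside the W-window
        covered = [(fs, t) for (fs, t) in feature_seq if t_s <= t and t + w <= t_s + W]
        if supports(covered, candidate, w, W):
            count += 1
    return count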
4.2.2 Pattern extraction by joining instances of patterns

Our second algorithm follows the vertical mining paradigm. Instead of scanning the S^f_i lists multiple times while generating and counting candidates level-by-level, we keep track of the details about the instances of the patterns and join them to produce the instances of their superpatterns.

Figure 6 shows the pseudocode for this merge join (MJ) algorithm. First (Line 1), we scan the S^f_i lists to produce the instance lists (ITLists) of all 1-patterns. For each reference feature f_r, all S^f_i in S^f_r produce the instances of 1-patterns having f_r as reference feature. Consider a feature-set (F_i, t_i) in S^f_i. For each feature f_j in F_i, an element (o_i, t_i) is added to list ITList_{f_r}(f_j), indicating that there is an object of feature f_j close to object o_i of feature f_r at time window [t_i, t_i + w). By sliding a window W over ITList_{f_r}(f_j), we can compute the support of the 1-pattern (f_j) referencing f_r. The ITLists are then used to find the frequent 1-patterns L_1 (Line 2 of the algorithm).
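A sketch of the ITList construction for 1-patterns, using the same (feature_set, t_s) encoding of the transformed sequences as before (function and variable names are ours):

from collections import defaultdict

def build_itlists(feature_seqs, ref_type, types):
    # feature_seqs: {obj_id: [(feature_set, t_s), ...]} (the S^f_i sequences);
    # types: {obj_id: feature}. Returns {f_j: [(obj_id, t_s), ...]} = ITList_{f_r}(f_j).
    itlists = defaultdict(list)
    for oid, seq in feature_seqs.items():
        if types[oid] != ref_type:
            continue                                   # only sequences of the reference feature f_r
        for feature_set, t_s in seq:
            for f_j in feature_set:
                itlists[f_j].append((oid, t_s))        # f_j is close to oid during [t_s, t_s + w)
    for entries in itlists.values():
        entries.sort()                                 # sort by (obj_id, t_s) for later merge-joins
    return itlists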
For counting the instances of a longer candidate pattern P (procedure MJ_count_cand), we slide a W-window along the two ITLists of the two subpatterns P_1 and P_2 that generate P, and merge-join the lists to create ITList_{f_r}(P). For every position t of W such that ITList_{f_r}(P_1) and ITList_{f_r}(P_2) contain entries of the same o_i and these entries qualify the pattern constraints, a new instance is generated for ITList_{f_r}(P). An entry in the ITList of a pattern with k units is a tuple (o_i, I_{g_1}.t_s, ..., I_{g_k}.t_s). We distinguish three cases for this merge-join process:

- P_1 and P_2 contain collocation units that are exactly the same in P. For example, P_1 = (f_1)(f_2), P_2 = (f_1)(f_3), P = (f_1)(f_2)(f_3). In this case, ITList_{f_r}(P_1) and ITList_{f_r}(P_2) are joined according to the t_s time of the common unit, while the rest of the temporal constraints are verified.

- P_1 and P_2 contain collocation units that are joined in P. E.g., P_1 = (f_1, f_2), P_2 = (f_2)(f_3), P = (f_1, f_2)(f_3). In this case, ITList_{f_r}(P_1) and ITList_{f_r}(P_2) are joined according to the t_s time of the joined units, while the rest of the temporal constraints are verified.

- P_1 and P_2 do not have common or joined units. For example, P_1 = (f_1), P_2 = (f_2), P = (f_1)(f_2). In this case, we perform a band-join [2] between ITList_{f_r}(P_1) and ITList_{f_r}(P_2) to produce ITList_{f_r}(P). The band-join is a straightforward extension of the merge join algorithm that replaces the equality condition with a maximum difference constraint (maximum time difference W in our example).
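As an illustration of the third case only, the following sketch joins the ITLists of two single-unit subpatterns; for clarity it uses a plain nested loop instead of the sorted band-join of [2], and the tuple encodings are the ones assumed in the previous sketch:

def band_join_itlists(itlist1, itlist2, w, W):
    # itlist1, itlist2: [(obj_id, t_s), ...] for single-unit subpatterns P1 and P2.
    # An output instance (obj_id, t_s1, t_s2) needs the same reference object,
    # unit 1 ending before unit 2 starts, and a total span of at most W.
    result = []
    for oid1, t1 in itlist1:
        for oid2, t2 in itlist2:
            if oid1 == oid2 and t1 + w <= t2 and (t2 + w) - t1 <= W:
                result.append((oid1, t1, t2))
    return result

# Running example: a puma close to deer1 on [10, 13) and a vulture close to deer1 on [15, 18).
assert band_join_itlists([("deer1", 10)], [("deer1", 15)], w=3, W=8) == [("deer1", 10, 15)]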
5 Experimental Evaluation

This section experimentally evaluates the performance of the proposed algorithms on synthetically generated data, due to the lack of real data. All experiments were run on a Pentium III Xeon 700MHz workstation with 4096MB RAM, running Solaris 9 x86. The generator takes as input the following parameters: |F|, the number of features; ℓ, the maximal length of the generated episodes; n, the number of sequences (i.e., objects); m, the maximal length of every sequence; and ε, w, W, and min_sup, which have the same meaning as in the problem definition. Given these parameters, we generate n trajectories, each of which is assigned to a type in F, while making sure that the generated trajectories instantiate collocation episodes. The default values of the data generation parameters are n = 500, m = 2000, w = 2, W = 20, |F| = 40, ℓ = 7, ε = 1%, and min_sup = 0.03. Unless otherwise stated, we use the same parameter values in data generation and data mining.

Performance evaluation. Our methods discover the collocation episodes in two steps: first, close feature sets are found, and then longer patterns are extracted from them. For the first step, apart from the proposed hash-based method, we implemented a naive one that linearly scans all other trajectories. For the second step, besides implementing the two algorithms Apriori and MJ, we also developed a non-optimized version of the Apriori algorithm, which does not employ the optimized counting approach shown in Figure 7. We compare the performance of four methods. Apriori-base applies linear scan in the first step and non-optimized Apriori for finding the patterns. Apriori-noprune, Apriori, and MJ use the hash-based method in the first step, and non-optimized Apriori, optimized Apriori, and MJ, respectively, in the second step. The difference between the linear scan method and the hash-based approach in the first step can be seen by comparing Apriori-base and Apriori-noprune. The difference between finding collocations using the transformed sequences and using the ITLists can be observed by comparing Apriori and MJ. Finally, by comparing Apriori-noprune with Apriori we can see the effect of optimized support counting in Apriori.
Figure 8: Time (sec) vs. ℓ, for Apriori-base, Apriori-noprune, Apriori, and MJ (plot omitted)

Figure 9: Time (sec) vs. m (in thousands), for Apriori-base, Apriori-noprune, Apriori, and MJ (plot omitted)
Figure 8 shows that the mining cost increases with the maximal length ℓ of the generated episodes. In addition, since the number of candidates at each level grows exponentially with ℓ, the cost varies slightly for smaller ℓ and increases sharply when ℓ becomes large. However, the optimized counting of Apriori slows down this exponential growth. Figure 9 illustrates the scalability of the methods with the maximal length m of the sequences. It shows that the mining cost grows nearly linearly with m, exhibiting good scalability over the data volume. For changing n, a similar linear trend is observed. To summarize, for finding close feature pairs, the hash-based technique is much faster than the linear scan method, whereas for discovering collocation episodes from feature sets, the Apriori method with the counting optimization technique performs best. On the other hand, in most cases MJ is not as efficient as Apriori, due to the large volume of generated and joined ITLists.
6 Conclusion

In this paper, we studied the problem of discovering frequent collocation episodes from spatiotemporal data. We provided a novel and carefully designed definition of this new and important mining problem. In addition, we designed an efficient two-phase mining methodology. In the first phase, a hash-based technique is used to convert the original trajectories to sequences of features that are close to the corresponding object. In the second phase, an Apriori-based technique is devised to discover the frequent episodes. We showed by experimentation that the best combination of techniques in both phases is efficient and scalable.
References

[1] G. Andrienko, D. Malerba, M. May, and M. Teisseire, editors. ECML/PKDD Workshop on Mining Spatio-Temporal Data, 2005.
[2] D. J. DeWitt, J. F. Naughton, and D. A. Schneider. An evaluation of non-equijoin algorithms. In VLDB, 1991.
[3] M. Hadjieleftheriou, G. Kollios, P. Bakalov, and V. J. Tsotras. Complex spatio-temporal pattern queries. In VLDB, 2005.
[4] J. Lin, E. J. Keogh, A. W.-C. Fu, and H. V. Herle. Approximations to magic: Finding unusual medical time series. In 18th IEEE Symp. on Computer-Based Medical Systems (CBMS), 2005.
[5] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Min. Knowl. Discov., 1(3), 1997.
[6] S. Shekhar and Y. Huang. Discovering spatial co-location patterns: A summary of results. In SSTD, 2001.
[7] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In EDBT, 1996.
[8] J. Wang, W. Hsu, and M.-L. Lee. A framework for mining topological patterns in spatio-temporal databases. In CIKM, 2005.
[9] H. Yang, S. Parthasarathy, and S. Mehta. A generalized framework for mining spatio-temporal patterns in scientific data. In KDD, 2005.
Citations
Journal ArticleDOI

CONSTAnT – A Conceptual Data Model for Semantic Trajectories of Moving Objects

TL;DR: A semantic trajectory conceptual data model named CONSTAnT is presented, which defines the most important aspects of semantic trajectories and believes that this model will be the foundation for the design of semantic trajectory databases, where several aspects that make a trajectory “semantic” are taken into account.
Proceedings ArticleDOI

DB-SMoT: A direction-based spatio-temporal clustering method

TL;DR: This paper presents a novel approach to find interesting places in trajectories, considering the variation of the direction as the main aspect, and demonstrates that the method is very appropriate for applications in which the direction variation plays the essential role.
Journal ArticleDOI

Mining frequent trajectory patterns in spatial-temporal databases

TL;DR: An efficient graph-based mining (GBM) algorithm for mining the frequent trajectory patterns in a spatial-temporal database that outperforms the Apriori-based and PrefixSpan-based methods by more than one order of magnitude.
Journal ArticleDOI

ST‐DMQL: A Semantic Trajectory Data Mining Query Language

TL;DR: This paper proposes through a semantic trajectory data mining query language several functionalities to select, preprocess, and transform trajectory sample points into semantic trajectories at higher abstraction levels, in order to allow the user to extract meaningful, understandable, and useful patterns from trajectories.
Book

Mobility Data Management and Exploration

TL;DR: This text presents a step-by-step methodology to understand and exploit mobility data: collecting and cleansing data, storage in Moving Object Database engines, indexing, processing, analyzing and mining mobility data.
References
Book ChapterDOI

Mining Sequential Patterns: Generalizations and Performance Improvements

TL;DR: This work adds time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern, and relaxes the restriction that the items in an element of a sequential pattern must come from the same transaction.
Journal ArticleDOI

Discovery of Frequent Episodes in Event Sequences

TL;DR: This work gives efficient algorithms for the discovery of all frequent episodes from a given class of episodes and presents detailed experimental results; the algorithms are in use in telecommunication alarm management.
Journal ArticleDOI

Levelwise Search and Borders of Theories in Knowledge Discovery

TL;DR: The concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm, is introduced and strong connections between the verification problem and the hypergraph transversal problem are shown.
Book ChapterDOI

Discovering Spatial Co-location Patterns: A Summary of Results

TL;DR: This work proposes a notion of user-specified neighborhoods in place of transactions to specify groups of items to solve the spatial co-location rule problem.
Proceedings Article

An Evaluation of Non-Equijoin Algorithms

TL;DR: A comparison between the partitioned band-join algorithm and the classical sort-merge join algorithm is presented, with data from speedup and scaleup experiments demonstrating that the partitioned band join is efficiently parallelizable.