Open Access Proceedings Article DOI

Discovery of Collocation Episodes in Spatiotemporal Data

TLDR
This work formally defines the problem of mining collocation episodes, proposes two scalable algorithms for its efficient solution, and empirically evaluates the performance of the proposed methods using synthetically generated data that emulate real-world object movements.
Abstract
Given a collection of trajectories of moving objects with different types (e.g., pumas, deers, vultures, etc.), we introduce the problem of discovering collocation episodes in them (e.g., if a puma is moving near a deer, then a vulture is also going to move close to the same deer with high probability within the next 3 minutes). Collocation episodes catch the inter-movement regularities among different types of objects. We formally define the problem of mining collocation episodes and propose two scalable algorithms for its efficient solution. We empirically evaluate the performance of the proposed methods using synthetically generated data that emulate real-world object movements.



Discovery of Collocation Episodes in Spatiotemporal Data
Huiping Cao, Nikos Mamoulis, and David W. Cheung
Department of Computer Science
The University of Hong Kong
Pokfulam Road, Hong Kong
{hpcao,nikos,dcheung}@cs.hku.hk
(Work supported by grant HKU 7142/04E from Hong Kong RGC.)

Abstract

Given a collection of trajectories of moving objects with different types (e.g., pumas, deers, vultures, etc.), we introduce the problem of discovering collocation episodes in them (e.g., if a puma is moving near a deer, then a vulture is also going to move close to the same deer with high probability within the next 3 minutes). Collocation episodes catch the inter-movement regularities among different types of objects. We formally define the problem of mining collocation episodes and propose two scalable algorithms for its efficient solution. We empirically evaluate the performance of the proposed methods using synthetically generated data that emulate real-world object movements.
1 Introduction

The large volume of spatiotemporal data (i.e., moving object trajectories) renders their manual analysis tedious, if not impossible. For their efficient analysis, spatiotemporal data mining [1] has been proposed, i.e., the development and application of novel computational techniques. Given a trajectory database, our goal is to unveil inter-movement regularities among objects of different types, modeled as a sequence of collocation events. Consider an application that monitors the activities of animals (e.g., via sensors attached to them). An exemplary collocation episode for this application could be: "Once we detect that a puma is moving close to a deer for 1 minute, we expect that a vulture will also move near to this deer in 3 minutes with high probability."

A collocation episode is in fact a sequence of spatiotemporal collocation events. Such events are sets of objects moving close to each other for some period. In addition, there is a particular object type (e.g., deer), called the centric feature, which participates in a sequence of collocations (e.g., deer-puma, deer-vulture). Finally, in a valid instance of the episode the object that instantiates the common feature should be the same in all collocation instances (e.g., the same deer appears in both the deer-puma and the deer-vulture collocations). Our definition of a collocation event is a temporal extension of the spatial collocation defined in [6], which models the co-existence of a set of (non-spatial) features, such as environmental observations (humidity and pollution values), plant and animal types, etc., in a spatial neighborhood. E.g., pattern (wet, bamboo) means that, with high probability, we can find bamboo plants near places with high humidity values. Existing methods that consider only spatial relationships between static features are not directly applicable to our problem, since (i) we require a temporal duration for a valid collocation event and (ii) we search for temporal episodes of such events. Our problem also has some similarity with episode mining in sequence data [5], where frequent episodes are (partially or totally) ordered lists of events. A sliding window w is used to extract subsequences in the event series, and the contribution of every subsequence to each candidate episode's frequency is counted. However, the events in our episodes are complex collocations as opposed to simple categorical values. In addition, for a valid episode instance there must be a common feature instantiation¹ (centric feature requirement), as opposed to an appearance of any event of the same type in [5].

In view of the challenges in this problem, we propose a two-step framework for mining collocation episodes. First, we apply a hash-based technique to efficiently retrieve the object pairs whose trajectories are close during some periods, and to identify these intervals. Then, we provide two collocation episode mining algorithms (one Apriori-based approach and one based on the vertical mining paradigm) and some pruning techniques to improve the mining efficiency. Finally, we empirically evaluate the performance of the proposed methods using synthetically generated data. In the remainder of the paper, we introduce some related work, formally define the problem, outline our algorithms, and present an experimental evaluation for them.

¹ Note that two trajectories of the same type (e.g., deer) may correspond to different objects (e.g., two different deers).
2 Related Work

Besides the work reviewed in Section 1, our work is also related to pattern discovery in one-dimensional time series, e.g., [4]. Nevertheless, these problems differ in three main aspects from our work: (i) the pattern/rule element is just a symbol or an event, while our pattern unit is a topological structure with a temporal duration; (ii) patterns are defined based on a single time series, but our patterns are based on relationships among different sequences; and (iii) temporal and time-series data mining is usually based on a predefined categorization of 1D values, whereas we work on a continuous (spatiotemporal) data space. In addition, several efforts have been made to extend spatial collocation patterns [6] with temporal aspects, towards different directions. E.g., from spatiotemporal data, [9] searches for evolving collocations, which in nature are pattern components in our context, while [8] discovers topological patterns (without temporal order). Finally, a related piece of work to our problem is [3], where spatiotemporal pattern queries are proposed and studied. An intuitive example of such a query is "find the moving object that is close to location A at time t_1, and then moves to region C during time interval [t_3, t_4]". The main differences of our work are that (i) we automatically identify frequent patterns that relate the movements of objects, instead of posing explicit queries, and (ii) our patterns relate two or more trajectories (that are feature instances), instead of searching for trajectories that follow a specific "route" specified by a temporal sequence of static regions.
3 Problem Definition

This section formally defines spatiotemporal collocation episodes, by gracefully combining the concept of an episode in event sequences and that of a collocation in spatial databases.

3.1 Spatiotemporal sequences and close subsequences

A spatiotemporal sequence S is the trajectory of a moving object. Formally, it is an ordered list of location-time pairs (l_0, t_0), (l_1, t_1), ..., (l_{m-1}, t_{m-1}), where t_i < t_j if i < j (i, j in [0, m)). The pair (l_i, t_i) denotes that the object was at location l_i at time t_i. In practice, l_i is a 2D position (x_i, y_i), and t_i records the time represented in time units (e.g., one minute is a time unit). In the following discussion, the subscript of a location implicitly refers to its timestamp (i.e., l_i implies the location at time t_i).

Given n (n >= 1) objects o_1, o_2, ..., o_n, the trajectory of object o_i is denoted by S_i. Figure 1 plots three exemplary sequences S_P, S_D and S_V (abbreviating S_Puma, S_Deer, and S_Vulture, respectively). For illustration purposes, we use 1D values to represent spatial locations; however, our discussion extends naturally to the multidimensional space.
A subsequence s of S is a list of continuous location-time pairs of S: (l_{i_1}, t_{i_1}), (l_{i_2}, t_{i_2}), ..., (l_{i_q}, t_{i_q}), where for j in [1, q], l_{i_j} is in S and t_{i_j} + 1 = t_{i_{j+1}}. The starting (ending) time of s is denoted by s.t_s (s.t_e). For a complete sequence S with m positions, S.t_s = t_0 and S.t_e = t_{m-1} + 1.

Definition 1 A window is a time interval [t_s, t_e). The time span (or length) of a window [t_s, t_e) is t_e − t_s. A window with time span w is called a w-window.

Definition 2 A subsequence s is on window [t_s, t_e) if s.t_s = t_s and s.t_e = t_e. A subsequence is called a w-subsequence if it is on a w-window. Two subsequences s_i of S_i and s_j of S_j are concurrent subsequences if they are on the same window.
Figure 1: Example of trajectories and windows (plot omitted: location versus time for the Puma, Deer and Vulture trajectories, with the two 3-windows w1 and w2 marked)

Figure 2: Collocation unit and episode ((a) a unit over the features D and P; (b) the episode (D, P)(D, V))
Example: Figure 1 shows two windows with time span 3: w1 = [10, 13) and w2 = [15, 18). For S_D, the two subsequences on w1 and w2 are s_D1 = (3.5, 10), (4.3, 11), (5.0, 12) and s_D2 = (6.2, 15), (6.1, 16), (6.0, 17). For S_P, the subsequence on w1 is s_P1 = (2.5, 10), (3.6, 11), (4.6, 12), while on w2 we have s_V2 = (6.7, 15), (6.5, 16), (5.2, 17) from S_V. s_D1 and s_P1 (also: s_D2 and s_V2) are concurrent subsequences. S_D is on window [9, 20) and has eight (t_e − t_s − w = 20 − 9 − 3) 3-subsequences, the second of which is on w1.
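To make the notions of w-subsequences and concurrency concrete, here is a minimal Python sketch (ours, not the paper's; the helper name subsequence_on and the list-of-pairs trajectory encoding are assumptions) that extracts the subsequence of a trajectory on a given window:

def subsequence_on(trajectory, t_s, t_e):
    # A trajectory is a list of (location, time) pairs sampled at whole time units.
    # Return its subsequence on window [t_s, t_e), or None if some timestamp is missing.
    sub = [(loc, t) for (loc, t) in trajectory if t_s <= t < t_e]
    return sub if len(sub) == t_e - t_s else None

# Deer trajectory samples from the example of Figure 1 (intermediate samples omitted):
S_D = [(3.5, 10), (4.3, 11), (5.0, 12), (6.2, 15), (6.1, 16), (6.0, 17)]
s_D1 = subsequence_on(S_D, 10, 13)   # the 3-subsequence on w1 = [10, 13)
s_D2 = subsequence_on(S_D, 15, 18)   # the 3-subsequence on w2 = [15, 18)
# Two subsequences of different trajectories on the same window are concurrent.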
We define the closeness of two concurrent subsequences using an aggregate function aggDist over their element-to-element distances. Typically, aggDist is the maximum (max) or the average (avg) of the component distances. Assuming that s_i and s_j are both on window [t_s, t_e), and dist is some atomic distance function (e.g., Euclidean distance),

maxDist(s_i, s_j) = max_{t_s <= t < t_e} { dist(l_i^t, l_j^t) }, and
avgDist(s_i, s_j) = ( Σ_{t_s <= t < t_e} dist(l_i^t, l_j^t) ) / (t_e − t_s).

A distance threshold ε is used to model closeness:

Definition 3 Two concurrent subsequences s_i and s_j are close, denoted by close(s_i, s_j), if aggDist(s_i, s_j) <= ε.

Example: Assuming ε = 2.5 and aggDist = maxDist, the two concurrent subsequences s_D1 and s_P1 on window w1 of Figure 1 are close to each other, since for every timestamp t in [10, 13) the location pair l_D^t in s_D1 and l_P^t in s_P1 satisfies dist(l_D^t, l_P^t) <= ε.
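The following sketch (not from the paper; dist, max_dist, avg_dist and close are our names) implements Definition 3 under the stated assumptions: the two subsequences are concurrent, hence equally long and aligned by timestamp, and the atomic distance is Euclidean for coordinate pairs or the absolute difference for the 1D locations of Figure 1:

import math

def dist(p, q):
    # Atomic distance: |p - q| for 1D locations, Euclidean distance for (x, y) pairs.
    if isinstance(p, (int, float)):
        return abs(p - q)
    return math.hypot(p[0] - q[0], p[1] - q[1])

def max_dist(s_i, s_j):
    # maxDist: maximum element-to-element distance of two aligned concurrent subsequences.
    return max(dist(l_i, l_j) for (l_i, _), (l_j, _) in zip(s_i, s_j))

def avg_dist(s_i, s_j):
    # avgDist: average element-to-element distance over the window.
    return sum(dist(l_i, l_j) for (l_i, _), (l_j, _) in zip(s_i, s_j)) / len(s_i)

def close(s_i, s_j, eps, agg_dist=max_dist):
    # close(s_i, s_j) holds iff aggDist(s_i, s_j) <= eps (Definition 3).
    return agg_dist(s_i, s_j) <= eps

# The example above: s_D1 and s_P1 on w1 are close for eps = 2.5 and aggDist = maxDist.
s_D1 = [(3.5, 10), (4.3, 11), (5.0, 12)]
s_P1 = [(2.5, 10), (3.6, 11), (4.6, 12)]
assert close(s_D1, s_P1, eps=2.5)   # maxDist = 1.0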
3.2 Spatiotemporal collocations and episodes thereof

Let F be a set of moving object types (e.g., different animals). Given a database of object trajectories, the type (or feature) of an object o_i is denoted by type(o_i), such that type(o_i) is in F. In general, the number of objects n in the database can be larger than the number |F| of types in F; i.e., more than one object may belong to the same object type.

Definition 4 A spatiotemporal collocation unit g (simply unit) is an undirected graph (V, E) where each vertex in g.V is an object type in F. The length of the unit g is the number of vertices |g.V| in it. Given a unit time span w, a valid instance I_g of unit g = (V, E), where V = {f_1, f_2, ..., f_|V|}, is a set of concurrent w-subsequences {s_1, s_2, ..., s_|V|} on a window [t_s, t_e) such that (i) s_i is of type f_i (1 <= i <= |V|) and (ii) if (f_i, f_j) is in E, then close(s_i, s_j).

The starting (ending) time of I_g is denoted by I_g.t_s (I_g.t_e). For example, the two concurrent window trajectories s_D1 and s_P1 in Figure 1 form an instance of the collocation unit in Figure 2a, and the related window is [10, 13) (i.e., w1).
Definition 5 A spatiotemporal collocation pattern (or episode) P is an ordered list of spatiotemporal collocation units g_1 g_2 ... g_ℓ, where the intersection of the vertex sets, ∩_{i=1..ℓ} (g_i.V), is non-empty.

The object types in ∩_{i=1..ℓ} (g_i.V) are called the reference types (features) of pattern P. The length of the pattern P is defined by Σ_{i=1..ℓ} |g_i.V|. A pattern with length k is called a k-pattern. In this paper, we only consider the case that |∩_{i=1..ℓ} (g_i.V)| = 1, and we denote the common (reference) object type by f_r. The reference object type f_r is also called the centric feature of the pattern. Thus, we can also represent a pattern in the form (f_r, g_1.V − {f_r}) ... (f_r, g_ℓ.V − {f_r}), where f_r, the reference feature, is underlined (here, listed first in each unit).

Example: Figure 2b shows a 4-collocation episode, indicating that when a deer and a puma are close during w = 3 time units, a vulture will come close to this deer later. This episode's common feature is D, and the episode can also be represented by (D, P)(D, V).
Definition 6 Given a maximum pattern time span W, a valid instance I_P for a pattern P = g_1 g_2 ... g_ℓ is a sequence of valid unit instances I_{g_1} I_{g_2} ... I_{g_ℓ} such that (i) in all unit instances the reference feature f_r is instantiated by a subsequence of the same object sequence, (ii) for every i < j, I_{g_i}.t_e <= I_{g_j}.t_s, and (iii) I_{g_ℓ}.t_e − I_{g_1}.t_s <= W.

Example: Let ε = 2.5, w = 3, and W = 8. In Figure 1, we can identify a valid instance of the episode of Figure 2b. Specifically, s_D1 and s_P1 (s_D2 and s_V2) instantiate the first (second) unit of the pattern. In addition, s_D1 and s_D2 instantiate the common feature D in both units and they are parts of the same trajectory. Furthermore, I_{g_1}.t_e < I_{g_2}.t_s, since the end point of w1 is before w2. Finally, I_{g_2}.t_e − I_{g_1}.t_s = 18 − 10 <= W.

Given the maximal episode time span W, we say that a W-window [t_s, t_e) covers a pattern instance if the time span of the instance, [I_{g_1}.t_s, I_{g_ℓ}.t_e), satisfies t_s <= I_{g_1}.t_s and I_{g_ℓ}.t_e <= t_e. We use |I_P| to denote the number of W-windows which cover at least one instance of pattern P.
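As a sketch of how conditions (i)-(iii) of Definition 6 can be checked (our encoding, not the paper's: each unit instance is reduced to the id of the object instantiating the reference feature plus its window, and closeness inside each unit is assumed to have been verified already via Definition 4):

def valid_pattern_instance(unit_instances, W):
    # unit_instances: list of (ref_object_id, t_s, t_e) tuples, one per unit, in pattern order.
    ref_ids = {ref for (ref, _, _) in unit_instances}
    if len(ref_ids) != 1:                                   # (i) same reference object in every unit
        return False
    for (_, _, te_i), (_, ts_j, _) in zip(unit_instances, unit_instances[1:]):
        if te_i > ts_j:                                     # (ii) a unit must end before the next one starts
            return False
    span = unit_instances[-1][2] - unit_instances[0][1]
    return span <= W                                        # (iii) the whole instance fits in a W-window

# The running example: (deer, [10,13)) followed by (deer, [15,18)), with W = 8.
assert valid_pattern_instance([("deer1", 10, 13), ("deer1", 15, 18)], W=8)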
Definition 7 Pattern P = g_1 g_2 ... g_p is a superpattern of P' = g'_1 g'_2 ... g'_q if (i) P.f_r = P'.f_r and (ii) there exist q units g_{i_1} g_{i_2} ... g_{i_q} (1 <= i_j < i_{j+1} <= p, 1 <= j < q) of P such that g'_j.V is a subset of g_{i_j}.V and g'_j.E is a subset of g_{i_j}.E for j in [1, q]. P' is a subpattern of P.

For example, P = (A, B, C)(A, C, D)(A, E) is a superpattern of P' = (A, B)(A, C, D) (in both, A is the reference feature).
To measure the interestingness of a collocation episode, we use the reference type as the key factor, since it does not make sense to overcount the same instance of the reference feature (e.g., deer) with different instances of the other object types (e.g., puma, vulture) in the pattern. In addition, we consider all possible time windows W where the pattern may appear.

Definition 8 The frequency of a pattern P with reference object type f_r is
fr(P, w, W) = |I_P| / Σ_{type(o_i) = f_r} |win_i|,
where the denominator is the total number of W-subsequences in all sequences S_i with type(o_i) = f_r.

Let min_sup be the minimum frequency threshold that the users are interested in; a pattern P is frequent if fr(P, w, W) >= min_sup.
Problem Definition: Given a database of trajectories S_1, ..., S_n of n moving objects, each with type(o_i) in F, discover all the frequent spatiotemporal collocation episodes, subject to a distance threshold ε, a closeness duration window length w, a maximum pattern window length W, and a frequency threshold min_sup in [0, 1).
4 Algorithms

To find the collocation episodes, the main tasks are: (i) identify the types of objects that move closely to each other, and (ii) find on which W-windows this closeness is observed.
4.1 Finding close subsequences

The first mining phase aims at discovering object pairs of different types (f_i, f_j) that have close concurrent subsequences. The ultimate objective is to identify the collocation units that may form longer episodes. For this, we scan each S_i of type f_i to identify its w-subsequences that are close to object subsequences of a different type f_j, j ≠ i. We store the starting position t_s of each such window [t_s, t_s + w) along with the set of object types close to S_i during [t_s, t_s + w). Eventually, each trajectory S_i is converted to a feature sequence of the form S^f_i = {(F_1, t_1), (F_2, t_2), ..., (F_m, t_m)}, where F_s is the set of object types other than f_i that are close to the w-subsequence of S_i that starts at time t_s.
A naive method for the computation of S^f_i for each S_i is to scan all the other sequences in order to identify the windows and feature-sets in each S^f_i. We now present a hash-based technique, shown in Figure 3, that achieves this goal in only two database scans. In the first pass, all data are hashed to a 3D grid in the trajectory space (Line 1), where G and T are the projected lengths of each cell on the spatial and temporal dimensions, chosen to be ε and w, respectively. Then, the algorithm performs a pass over the hashed data by examining only one hyperplane of cells at a time, corresponding to a w-period. For each cell gc, the neighboring cells in the spatial dimensions having the same temporal coordinates as gc are examined. In the 2D example of Figure 4, for cell gc, starting at time t_s, and for the trajectory S_i (partially) inside gc, the (shaded) cells are checked for possible containment of subsequences which are partially close to S_i. Note that S_j is close to S_i at time t_s + 2, which means that this closeness relationship can be extended to a subsequence closeness in cells of time span [t_s + w, t_s + 2w). Thus, for each S_i the algorithm buffers such partially close subsequences that can be extended to S^f_i elements. When the next hyperplane of cells is examined, the partial closeness results are examined for potential extension and inclusion in S^f_i, along with the generation of new partial results. The sorting of cell contents by time facilitates the fast identification and extension of partial closeness results, in a merge-join fashion.
Algorithm getFS(S_1, ..., S_n, D, w)
1.  impose a spatiotemporal grid with cell extents G x G x T;
2.  hash all locations of S_1, ..., S_n to cells;
3.  for every cell gc in the grid,
4.    sort locations in gc according to their time;
5.  initialize a (partial results) buffer buf_i for each S_i;
6.  for each timestamp t_s, multiple of w
7.    GC := cells with time interval [t_s, t_s + w);
8.    for each timestamp t in [t_s, t_s + w)
9.      for each grid cell gc in GC
10.       find location pairs (l_i^t, l_j^t) within ε;
11.       extend buf_i and buf_j for each pair;
12.       if S_j in buf_i is close for at least w
13.         add (f_j, t − w + 1) to S^f_i;
14.       if S_i in buf_j is close for at least w
15.         add (f_i, t − w + 1) to S^f_j;
Figure 3: Hash-based computation of close feature sets

Figure 4: Example of hashing (sketch omitted: a 2D space-time grid with cell extents ε and w; cell gc starts at time t_s and contains part of S_i, while S_j passes through a neighboring cell)
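The sketch below conveys the idea of getFS in a simplified form; it is not the paper's implementation. Locations are hashed to a spatial grid of cell size ε per timestamp, candidate close pairs are generated only from the same or neighboring cells, and a run of at least w consecutive close timestamps produces a feature-set entry of S^f_i (so closeness here follows the maxDist aggregate). The buffering of partial results across consecutive w-hyperplanes is omitted, and all names (get_feature_sequences, close_at, etc.) are ours:

from collections import defaultdict
from math import floor

def get_feature_sequences(trajectories, types, eps, w):
    # trajectories: {obj_id: {t: (x, y)}}; types: {obj_id: feature}.
    # Returns {obj_id: {t_s: set(features)}}, approximating the S^f_i sequences.

    # 1. Hash every location sample to a spatial cell of side eps, separately per timestamp.
    grid = defaultdict(list)                               # (t, cell_x, cell_y) -> [obj_id, ...]
    for oid, samples in trajectories.items():
        for t, (x, y) in samples.items():
            grid[(t, floor(x / eps), floor(y / eps))].append(oid)

    # 2. Compare objects only against objects in the same or neighboring cells (same timestamp),
    #    recording the timestamps at which two objects of different types are within eps.
    close_at = defaultdict(set)                            # (obj_i, obj_j) -> {t, ...}
    for (t, cx, cy), objs in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for oi in objs:
                    for oj in grid.get((t, cx + dx, cy + dy), []):
                        if oi == oj or types[oi] == types[oj]:
                            continue
                        (xi, yi), (xj, yj) = trajectories[oi][t], trajectories[oj][t]
                        if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2:
                            close_at[(oi, oj)].add(t)

    # 3. Every run of w consecutive close timestamps yields one feature-set entry of S^f_i.
    feature_seq = defaultdict(lambda: defaultdict(set))
    for (oi, oj), timestamps in close_at.items():
        for t_s in timestamps:
            if all(t_s + k in timestamps for k in range(w)):
                feature_seq[oi][t_s].add(types[oj])
    return feature_seq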
4.2 Discovery of collocation episodes

In this section, we present two algorithms that discover frequent collocations, based on different usages of the transformed sequences of feature sets.

4.2.1 Pattern extraction from sequences of feature sets

The first algorithm, Apriori, shown in Figure 5, finds the collocations level-by-level. It takes as input the close feature sets S^f_i found for each S_i and the minimum frequency mincnt_{f_r} (= min_sup x |win|) for an episode to be frequent. First, the S^f_i's are partitioned into |F| groups, one for each different f_i in F. Thus, the group S^f_r for feature f_r is used to find the patterns having f_r as their reference feature. We note that the apriori property holds for frequent episodes, i.e., if P' is a superpattern of P, then fr(P, w, W) >= fr(P', w, W). In this algorithm, when we measure the length of a pattern, we exclude the reference feature f_r from the units, since it is implicit. For example, a 3-candidate (f_i)(f_j, f_k) represents a real 5-candidate (f_r, f_i)(f_r, f_j, f_k).

Function gen_cand, used to generate the ℓ-candidates from the (ℓ−1)-patterns, is exactly as in sequential pattern mining [7], so we do not discuss it in detail.

Algorithm Apriori(S^f_r, W, mincnt_{f_r})
1.  L_1 := frequent 1-patterns; ℓ := 2;
2.  while (L_{ℓ−1} ≠ ∅)
3.    C_ℓ := gen_cand(L_{ℓ−1});
4.    for each S^f_i in S^f_r
5.      slide_window(C_ℓ, S^f_i, W);
6.    L_ℓ := {P in C_ℓ | P.count >= mincnt_{f_r}};
7.    ℓ := ℓ + 1;
8.  return L := union of all L_ℓ;
Figure 5: Apriori-based algorithm

Algorithm MJ(S^f_r, W, mincnt_{f_r})
1.  generate ITList_{f_r}(f_j) for each 1-pattern (f_j);
2.  use the ITList_{f_r}(f_j) lists to generate L_1;
3.  ℓ := 2;
4.  while (L_{ℓ−1} ≠ ∅)
5.    C_ℓ := gen_cand(L_{ℓ−1});
6.    for each P in C_ℓ
7.      ITList_{f_r}(P) := MJ_count_cand(P, W);
8.    L_ℓ := {P in C_ℓ | P.count >= mincnt_{f_r}};
9.    ℓ := ℓ + 1;
10. return L := union of all L_ℓ;
Figure 6: Merge join algorithm
The patterns excluding the reference features are similar to the sequential patterns in transactional databases [7]. However, counting the support of our patterns is different, since we consider all positions of a sliding window, whereas for sequential patterns each transaction sequence contributes one or none to a sequential pattern (depending on whether the sequence is a superpattern of it or not). Function slide_window is used to count |I_c| (the number of windows that contain valid instances of c) for each candidate c in C_ℓ from a transformed sequence S^f_i in S^f_r. In brief, the idea is to slide a W-window over S^f_i to get a subsequence of feature sets. For each subsequence s on a W-window, we find the candidates that have a valid instance which is covered (i.e., supported) by s, and increase their count. Sliding window counting for event episodes has also been proposed in [5]; however, the valid instances in our case are more difficult to count, because of the constraint that one collocation unit instance should end before the beginning of the next one (see condition (ii) in Definition 6). For example, assuming w = 3 and S^f_i = {((f_1, f_2), 10), ((f_1, f_3), 11), ((f_4), 14)}, pattern c_1 = (f_1, f_2)(f_4) is supported by S^f_i, but pattern c_2 = (f_1, f_2)(f_3) is not, since f_3 is close to the reference feature at time 11, which is before the end of the unit containing f_2 (10 + w = 13). In simple words, in a valid pattern instance the collocation unit instances should not overlap in time.
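A sketch of the check behind slide_window for a single candidate, under our own simplified encoding (the paper's figure gives the actual procedure): a transformed sequence is a sorted list of (feature_set, t_s) entries, and a candidate is a list of unit feature sets with the reference feature omitted. The function tests whether the sequence contains a chain of non-overlapping unit instances whose total span fits in W:

def supports(feature_seq, candidate, w, W):
    # feature_seq: list of (frozenset_of_features, t_s) entries of one S^f_i, sorted by t_s.
    # candidate: list of frozensets, one per unit, reference feature omitted.
    def extend(unit_idx, earliest_start, first_start):
        if unit_idx == len(candidate):
            return True                              # all units instantiated
        for fset, t in feature_seq:
            if t < earliest_start:
                continue                             # this unit must start after the previous unit ends
            if first_start is not None and t + w - first_start > W:
                break                                # total span would exceed W (entries are sorted)
            if candidate[unit_idx] <= fset:          # all features of the unit are close at time t
                if extend(unit_idx + 1, t + w, first_start if first_start is not None else t):
                    return True
        return False
    return extend(0, float("-inf"), None)

# The example above: w = 3 and S^f_i = {((f1,f2),10), ((f1,f3),11), ((f4),14)}.
seq = [(frozenset({"f1", "f2"}), 10), (frozenset({"f1", "f3"}), 11), (frozenset({"f4"}), 14)]
assert supports(seq, [frozenset({"f1", "f2"}), frozenset({"f4"})], w=3, W=8)        # c1 is supported
assert not supports(seq, [frozenset({"f1", "f2"}), frozenset({"f3"})], w=3, W=8)    # c2 is not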
Optimizing the support counting. While sliding a W-window over the transformed sequence, if the subsequence of S^f_i covered by the window remains the same compared to the previous window position, the set of candidates supported by the window does not change. As a result, we examine only positions of the W-window where either (i) a feature-set F is included in the window for the first time, or (ii) F ceases to be included in the window (compared to the previous position). E.g., let w = 3, W = 8, and S^f_i = {((f_1, f_2), 10), ((f_1, f_3), 11), ((f_4), 14)}. Since only three windows, [5, 13), [6, 14), [9, 17), correspond to the event of a feature-set entering the sliding window, and two windows, [11, 19), [12, 20), correspond to the event that a feature-set leaves the window, we just need to examine these five windows. Each feature-set (F_i, t_i) in S^f_i affects two positions of window [t_s, t_e): the one with t_e = t_i + w (where F_i enters the window) and the one with t_s = t_i + 1 (where F_i leaves it). As a result, the cost of examining a feature-set sequence S^f_i becomes proportional to |S^f_i|, instead of the number of window positions (which normally is much larger).
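A small sketch of which window positions this optimization examines (our code and names; the restriction to windows that still cover some feature-set is our reading of why the example lists five windows rather than six):

def positions_to_examine(entry_times, w, W):
    # Start times t_s of the W-windows [t_s, t_s + W) where a feature-set enters
    # (window end t_e = t_i + w, i.e. t_s = t_i + w - W) or leaves (t_s = t_i + 1),
    # keeping only windows that still fully cover at least one feature-set.
    candidates = {t_i + w - W for t_i in entry_times} | {t_i + 1 for t_i in entry_times}
    return sorted(t_s for t_s in candidates
                  if any(t_s <= t_i and t_i + w <= t_s + W for t_i in entry_times))

# The example above: w = 3, W = 8, feature-sets at times 10, 11 and 14.
assert positions_to_examine([10, 11, 14], w=3, W=8) == [5, 6, 9, 11, 12]
# i.e. the five windows [5,13), [6,14), [9,17), [11,19) and [12,20).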
Figure 7 shows in detail this optimized counting method, applied to each S^f_i. To avoid overcounting a pattern having more than one instance at a window position, when we detect a valid instance we add to its support only the window positions where previous instances are not valid. For this, we maintain a variable c.last for each candidate (initialized to −1), indicating the last known position of W having an instance of c. In addition, the algorithm keeps track of the feature-sets fs contained in W. Whenever a feature-set F exits the sliding window, it is removed from fs. If a new F enters fs, we search for candidates for which the last unit is instantiated by some features in F (instances not affected by F are identified at earlier positions of W); i.e., only candidates for which the features in the last unit are all contained in F are checked for instantiation. For each candidate, if we detect a valid instance at the current window position, we look for the pattern instance with the latest starting time I_{g_1}.t_s. The support of the candidate is then updated with the number of window positions, I_{g_1}.t_s − t_s + 1, during which the pattern instance remains valid (when t_s > I_{g_1}.t_s, the instance becomes outdated). Finally, if some window positions were already counted due to the last detected pattern for c, i.e., if c.last >= t_s, then we add I_{g_1}.t_s − c.last to c.count (instead of I_{g_1}.t_s − t_s + 1), in order not to overcount the specific candidate.
Function slide_window(C_ℓ, S^f_i, W)
1.  for each candidate c
2.    c.last := −1; c.count := 0; fs := ∅;
3.  slide a [t_s, t_e) W-window over S^f_i
4.    if some feature set F in fs becomes outdated
5.      fs := fs − F;
6.    if some feature set F enters the window
7.      fs := fs + F;
8.      for each candidate c
9.        find instance of c with I_{g_ℓ} instantiated by F
10.         and largest possible I_{g_1}.t_s;
11.       if there exists such an instance
12.         h := min{I_{g_1}.t_s − t_s + 1, I_{g_1}.t_s − c.last};
13.         c.count := c.count + h;
14.         c.last := I_{g_1}.t_s;
Figure 7: Optimized support counting
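For comparison, the unoptimized count that Figure 7 speeds up can be written directly on top of the supports() sketch given earlier; the enumeration range of window positions below is our own simplification:

def count_supporting_windows(feature_seq, candidate, w, W):
    # |I_c| restricted to one transformed sequence: number of W-window positions
    # [t_s, t_s + W) that cover at least one valid instance of `candidate`.
    if not feature_seq:
        return 0
    first_t, last_t = feature_seq[0][1], feature_seq[-1][1]
    count = 0
    for t_s in range(first_t + w - W, last_t + 1):
        # keep only the feature-sets whose w-window lies fully inside the W-window
        covered = [(fs, t) for (fs, t) in feature_seq if t_s <= t and t + w <= t_s + W]
        if supports(covered, candidate, w, W):
            count += 1
    return count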
4.2.2 Pattern extraction by joining instances of patterns

Our second algorithm follows the vertical mining paradigm. Instead of scanning the S^f_i lists multiple times while generating and counting candidates level-by-level, we keep track of the details about the instances of the patterns and join them to produce the instances of their superpatterns.

Figure 6 shows the pseudocode for this merge join (MJ) algorithm. First (Line 1), we scan the S^f_i lists to produce the instance lists (ITLists) of all 1-patterns. For each reference feature f_r, all S^f_i in S^f_r produce the instances of 1-patterns having f_r as reference feature. Consider a feature-set (F_i, t_i) in S^f_i. For each feature f_j in F_i, an element (o_i, t_i) is added to list ITList_{f_r}(f_j), indicating that there is an object of feature f_j close to object o_i of feature f_r at time window [t_i, t_i + w). By sliding a window W over ITList_{f_r}(f_j), we can compute the support of the 1-pattern (f_j) referencing f_r. The ITLists are then used to find the frequent 1-patterns L_1 (Line 2 of the algorithm).
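A sketch of the ITList construction for 1-patterns, using the same (feature_set, t_s) encoding of the transformed sequences as before (function and variable names are ours):

from collections import defaultdict

def build_itlists(feature_seqs, ref_type, types):
    # feature_seqs: {obj_id: [(feature_set, t_s), ...]} (the S^f_i sequences);
    # types: {obj_id: feature}. Returns {f_j: [(obj_id, t_s), ...]} = ITList_{f_r}(f_j).
    itlists = defaultdict(list)
    for oid, seq in feature_seqs.items():
        if types[oid] != ref_type:
            continue                                   # only sequences of the reference feature f_r
        for feature_set, t_s in seq:
            for f_j in feature_set:
                itlists[f_j].append((oid, t_s))        # f_j is close to oid during [t_s, t_s + w)
    for entries in itlists.values():
        entries.sort()                                 # sort by (obj_id, t_s) for later merge-joins
    return itlists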
For counting the instances of a longer candidate pattern P (procedure MJ_count_cand), we slide a W-window along the two ITLists of the two subpatterns P_1 and P_2 that generate P, and merge-join the lists to create ITList_{f_r}(P). For every position t of W such that ITList_{f_r}(P_1) and ITList_{f_r}(P_2) contain entries of the same o_i and these entries qualify the pattern constraints, a new instance is generated for ITList_{f_r}(P). An entry in the ITList of a pattern with k units is a tuple (o_i, I_{g_1}.t_s, ..., I_{g_k}.t_s). We distinguish three cases for this merge-join process:

- P_1 and P_2 contain collocation units that are exactly the same in P. For example, P_1 = (f_1)(f_2), P_2 = (f_1)(f_3), P = (f_1)(f_2)(f_3). In this case, ITList_{f_r}(P_1) and ITList_{f_r}(P_2) are joined according to the t_s time of the common unit, while the rest of the temporal constraints are verified.

- P_1 and P_2 contain collocation units that are joined in P. E.g., P_1 = (f_1, f_2), P_2 = (f_2)(f_3), P = (f_1, f_2)(f_3). In this case, ITList_{f_r}(P_1) and ITList_{f_r}(P_2) are joined according to the t_s time of the joined units, while the rest of the temporal constraints are verified.

- P_1 and P_2 do not have common or joined units. For example, P_1 = (f_1), P_2 = (f_2), P = (f_1)(f_2). In this case, we perform a band-join [2] between ITList_{f_r}(P_1) and ITList_{f_r}(P_2) to produce ITList_{f_r}(P). The band-join is a straightforward extension of the merge join algorithm that replaces the equality condition with a maximum difference constraint (maximum time difference W in our example).
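As an illustration of the third case only, the following sketch joins the ITLists of two single-unit subpatterns; for clarity it uses a plain nested loop instead of the sorted band-join of [2], and the tuple encodings are the ones assumed in the previous sketch:

def band_join_itlists(itlist1, itlist2, w, W):
    # itlist1, itlist2: [(obj_id, t_s), ...] for single-unit subpatterns P1 and P2.
    # An output instance (obj_id, t_s1, t_s2) needs the same reference object,
    # unit 1 ending before unit 2 starts, and a total span of at most W.
    result = []
    for oid1, t1 in itlist1:
        for oid2, t2 in itlist2:
            if oid1 == oid2 and t1 + w <= t2 and (t2 + w) - t1 <= W:
                result.append((oid1, t1, t2))
    return result

# Running example: a puma close to deer1 on [10, 13) and a vulture close to deer1 on [15, 18).
assert band_join_itlists([("deer1", 10)], [("deer1", 15)], w=3, W=8) == [("deer1", 10, 15)]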
5 Experimental Evaluation

This section experimentally evaluates the performance of the proposed algorithms on synthetically generated data, due to the lack of real data. All experiments were run on a Pentium III Xeon 700MHz workstation with 4096MB RAM, running Solaris 9 x86. The generator takes as input the following parameters: |F|, the number of features; ℓ, the maximal length of the generated episodes; n, the number of sequences (i.e., objects); m, the maximal length of every sequence; and ε, w, W, and min_sup, which have the same meaning as in the problem definition. Given these parameters, we generate n trajectories, each of which is assigned to a type in F, while making sure that the generated trajectories instantiate collocation episodes. The default values of the data generation parameters are n = 500, m = 2000, w = 2, W = 20, |F| = 40, ℓ = 7, ε = 1%, and min_sup = 0.03. Unless otherwise stated, we use the same parameter values in data generation and data mining.

Performance evaluation. Our methods discover the collocation episodes in two steps: first, close feature sets are found, and then longer patterns are extracted from them. For the first step, apart from the proposed hash-based method, we implemented a naive one that linearly scans all other trajectories. For the second step, besides implementing the two algorithms Apriori and MJ, we also developed a non-optimized version of the Apriori algorithm, which does not employ the optimized counting approach shown in Figure 7. We compare the performance of four methods. Apriori-base applies linear scan in the first step and non-optimized Apriori for finding the patterns. Apriori-noprune, Apriori, and MJ use the hash-based method in the first step, and non-optimized Apriori, optimized Apriori, and MJ, respectively, in the second step. The difference between the linear scan method and the hash-based approach in the first step can be seen by comparing Apriori-base and Apriori-noprune. The difference between finding collocations using the transformed sequences and using the ITLists can be observed by comparing Apriori and MJ. Finally, by comparing Apriori-noprune with Apriori we can see the effect of optimized support counting in Apriori.
Figure 8: Time (sec) vs. ℓ, for Apriori-base, Apriori-noprune, Apriori, and MJ (plot omitted)

Figure 9: Time (sec) vs. m (in thousands), for Apriori-base, Apriori-noprune, Apriori, and MJ (plot omitted)
Figure 8 shows that the mining cost increases with the maximal length ℓ of the generated episodes. In addition, since the number of candidates at each level grows exponentially with ℓ, the cost varies slightly for smaller ℓ and increases sharply when ℓ becomes large. However, the optimized counting of Apriori slows down this exponential growth. Figure 9 illustrates the scalability of the methods with the maximal length m of the sequences. It shows that the mining cost grows nearly linearly with m, exhibiting good scalability over the data volume. For changing n, a similar linear trend is observed. To summarize, for finding close feature pairs, the hash-based technique is much faster than the linear scan method, whereas for discovering collocation episodes from feature sets, the Apriori method with the counting optimization technique performs best. On the other hand, in most cases MJ is not as efficient as Apriori, due to the large volume of generated and joined ITLists.
6 Conclusion

In this paper, we studied the problem of discovering frequent collocation episodes from spatiotemporal data. We provided a novel and carefully designed definition of this new and important mining problem. In addition, we designed an efficient two-phase mining methodology. In the first phase, a hash-based technique is used to convert the original trajectories to sequences of features that are close to the corresponding object. In the second phase, an Apriori-based technique is devised to discover the frequent episodes. We showed by experimentation that the best combination of techniques in both phases is efficient and scalable.
References

[1] G. Andrienko, D. Malerba, M. May, and M. Teisseire, editors. ECML/PKDD Workshop on Mining Spatio-Temporal Data, 2005.
[2] D. J. DeWitt, J. F. Naughton, and D. A. Schneider. An evaluation of non-equijoin algorithms. In VLDB, 1991.
[3] M. Hadjieleftheriou, G. Kollios, P. Bakalov, and V. J. Tsotras. Complex spatio-temporal pattern queries. In VLDB, 2005.
[4] J. Lin, E. J. Keogh, A. W.-C. Fu, and H. V. Herle. Approximations to magic: Finding unusual medical time series. In 18th IEEE Symp. on Computer-Based Medical Systems (CBMS), 2005.
[5] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Min. Knowl. Discov., 1(3), 1997.
[6] S. Shekhar and Y. Huang. Discovering spatial co-location patterns: A summary of results. In SSTD, 2001.
[7] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In EDBT, 1996.
[8] J. Wang, W. Hsu, and M.-L. Lee. A framework for mining topological patterns in spatio-temporal databases. In CIKM, 2005.
[9] H. Yang, S. Parthasarathy, and S. Mehta. A generalized framework for mining spatio-temporal patterns in scientific data. In KDD, 2005.
Citations
Journal ArticleDOI

CONSTAnT – A Conceptual Data Model for Semantic Trajectories of Moving Objects

TL;DR: A semantic trajectory conceptual data model named CONSTAnT is presented, which defines the most important aspects of semantic trajectories and believes that this model will be the foundation for the design of semantic trajectory databases, where several aspects that make a trajectory “semantic” are taken into account.
Proceedings ArticleDOI

DB-SMoT: A direction-based spatio-temporal clustering method

TL;DR: This paper presents a novel approach to find interesting places in trajectories, considering the variation of the direction as the main aspect, and demonstrates that the method is very appropriate for applications in which the direction variation plays the essential role.
Journal ArticleDOI

Mining frequent trajectory patterns in spatial-temporal databases

TL;DR: An efficient graph-based mining (GBM) algorithm for mining the frequent trajectory patterns in a spatial-temporal database that outperforms the Apriori-based and PrefixSpan-based methods by more than one order of magnitude.
Journal ArticleDOI

ST‐DMQL: A Semantic Trajectory Data Mining Query Language

TL;DR: This paper proposes through a semantic trajectory data mining query language several functionalities to select, preprocess, and transform trajectory sample points into semantic trajectories at higher abstraction levels, in order to allow the user to extract meaningful, understandable, and useful patterns from trajectories.
Book

Mobility Data Management and Exploration

TL;DR: This text presents a step-by-step methodology to understand and exploit mobility data: collecting and cleansing data, storage in Moving Object Database engines, indexing, processing, analyzing and mining mobility data.
References
Book ChapterDOI

Mining Sequential Patterns: Generalizations and Performance Improvements

TL;DR: This work adds time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern, and relaxes the restriction that the items in an element of a sequential pattern must come from the same transaction.
Journal ArticleDOI

Discovery of Frequent Episodes in Event Sequences

TL;DR: This work gives efficient algorithms for the discovery of all frequent episodes from a given class of episodes and presents detailed experimental results; the algorithms are in use in telecommunication alarm management.
Journal ArticleDOI

Levelwise Search and Borders of Theories in Knowledge Discovery

TL;DR: The concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm, is introduced and strong connections between the verification problem and the hypergraph transversal problem are shown.
Book ChapterDOI

Discovering Spatial Co-location Patterns: A Summary of Results

TL;DR: This work proposes a notion of user-specified neighborhoods in place of transactions to specify groups of items to solve the spatial co-location rule problem.
Proceedings Article

An Evaluation of Non-Equijoin Algorithms

TL;DR: A comparison between the partitioned band-join algorithm and the classical sort-merge join algorithm is presented, with data from speedup and scaleup experiments demonstrating that the partitioned band join is efficiently parallelizable.