Mining frequent spatio-temporal sequential patterns

doi:10.1109/ICDM.2005.95

Mining Frequent Spatio-temporal Sequential Patterns

Huiping Cao, Nikos Mamoulis, and David W. Cheung

Department of Computer Science

The University of Hong Kong

Pokfulam Road, Hong Kong

{hpcao, nikos, dcheung}@cs.hku.hk

Abstract

Many applications track the movement of mobile objects, which can be represented as sequences of timestamped locations. Given such a spatio-temporal series, we study the problem of discovering sequential patterns, which are routes frequently followed by the object. Sequential pattern mining algorithms for transaction data are not directly applicable to this setting. The challenges to address are (i) the fuzziness of locations in patterns, and (ii) the identification of non-explicit pattern instances. In this paper, we define pattern elements as spatial regions around frequent line segments. Our method first transforms the original sequence into a list of sequence segments, and detects frequent regions in a heuristic way. Then, we propose algorithms to find patterns by employing a newly proposed substring tree structure and improving the Apriori technique. A performance evaluation demonstrates the effectiveness and efficiency of our approach.

1 Introduction

The movement of an object (i.e., its trajectory) can be described by a sequence of spatial locations sampled at consecutive timestamps (e.g., with the use of Global Positioning System (GPS) devices). Parts of the object's routes are often repeated in the archived history of locations. For instance, buses move along series of streets repeatedly, people go to and return from work following more or less the same routes, etc. The movement routes of most objects (e.g., private cars) are not predefined. Even for objects with pre-scheduled paths (e.g., buses), the routes may not be repeated with the same frequency, due to different schedules on weekends or special days. We are interested in finding frequently repeated paths, i.e., spatio-temporal sequential patterns, from a long spatio-temporal sequence. These patterns could help to analyze the past movement or predict the future movement of the object, support approximate queries on the original data, and so on. However, they cannot be obtained straightforwardly by eliminating the noisy movement, because of the large volume of the spatio-temporal data.

Discovery of sequential patterns from transactional databases has attracted a lot of interest since Agrawal et al. introduced the problem [1]. In such a database, each transaction contains a set of items bought by some customer at one time, and a transaction sequence is a list of transactions ordered by time. For example, (a, b), (a, c), (b) is a sequence containing three transactions (a, b), (a, c) and (b). Given a collection of transaction sequences, the problem is to find ordered lists of itemsets appearing with high frequency. E.g., (b), (a), (b) is a pattern supported by the above sequence.

Unfortunately, pattern discovery techniques for transactional databases are not readily applicable to finding sequential patterns in spatio-temporal data. First, the elements in a transactional pattern are items that explicitly appear in pattern instances. On the other hand, location coordinates in a spatio-temporal series are real numbers, which do not repeat themselves exactly in every pattern instance. Second, the patterns are discovered from explicitly defined sets of sequences, like (a, b), (a, c), (b) in the previous example. Thus, a transaction list only contributes 0 or 1 to the support of a pattern, depending on whether the pattern appears or not in the specific sequence-set. In our setting, however, we detect frequent patterns from one long spatio-temporal sequence, without predefined segmentation of the data. The challenge is to identify the segments that contribute to a pattern, without allowing them to overlap with each other.

To summarize, the main contributions of this paper are: (i) We propose a model for spatio-temporal sequential pattern mining, based on appropriate definitions for pattern elements and pattern instances. (ii) We present an effective method for extracting pattern elements. (iii) We provide efficient pattern mining algorithms for discovering longer patterns. The remainder of the paper is organized as follows. Section 2 reviews the related literature. The formal definition of a spatio-temporal sequential pattern is given in Section 3. Section 4 presents our solutions in detail. An experimental evaluation of the effectiveness and efficiency of our approach is presented in Section 5. Finally, Section 6 concludes this paper.

Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05)

2 Related work

Our work is most related to pattern discovery from sequential data, which includes time series, event sequences, and spatio-temporal trajectories.

Mannila et al. [10] investigated the discovery of frequent episodes from event sequences. An episode is a (partially or totally) ordered list of events, and thus is a variant of a sequential pattern. A fixed sliding window $w$ is used to extract segments (i.e., subsequences) in the event series, and the contribution of every segment to each candidate episode's frequency is counted. The segments supporting one episode may overlap, which is reasonable since episodes try to capture the order in which instantaneous events appear. However, this methodology may not give satisfactory results in finding spatio-temporal patterns, for several reasons. First, the window limits the length of the patterns. Second, pattern supports may not be counted correctly. E.g., suppose the object's movement is aabbcdefg, where each character a, b, etc. corresponds to a spatial region. The occurrence count of the pattern abc should be 1, since the object moves from a to c once. However, if $w$ is 5, pattern abc has support 4 due to the contribution of 4 segments (one for each combination of an a and a b preceding c in aabbc). Third, as opposed to well-defined categorical values for event instances, object locations do not repeat themselves exactly in pattern instances, for they are usually ordinal and inexact. Yang et al. investigated mining long sequential patterns in [13], also dealing with event series with noise.
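The over-counting in the aabbcdefg example can be reproduced with a small sketch. The helper names below are hypothetical, and collapsing repeated symbols is only an illustrative stand-in for "the object moves from a to c once", not the method of [10] or of this paper.

```python
from itertools import combinations

def windowed_embeddings(seq, pattern, w):
    """Count index tuples that spell `pattern` in order within a span of at most w symbols."""
    count = 0
    for idx in combinations(range(len(seq)), len(pattern)):
        fits_window = idx[-1] - idx[0] < w
        if fits_window and all(seq[i] == p for i, p in zip(idx, pattern)):
            count += 1
    return count

def collapse_runs(seq):
    """Merge consecutive repeats, e.g. 'aabbc' -> 'abc' (assumption: the object stays in one region)."""
    out = []
    for ch in seq:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)

def non_overlapping_count(seq, pattern):
    """Count non-overlapping contiguous occurrences of `pattern` (str.count semantics)."""
    return seq.count(pattern)

movement = "aabbcdefg"
print(windowed_embeddings(movement, "abc", 5))               # window-based count
print(non_overlapping_count(collapse_runs(movement), "abc"))  # one traversal from a to c
```

With a window of 5 the first function reports 4 for abc, while the non-overlapping count on the collapsed string reports 1.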

Previous work on detecting patterns from time series (e.g., [2, 7]) converted the problem to finding subsequences in lists of categorical data (e.g., event sequences), by preprocessing the original sequence into a string. A window $w$ of fixed size is slid along the sequence, and a subsequence of length $w$ is extracted at every position. In [2], the subsequences are clustered based on their shapes, and each cluster is given an id. In [7], some features are extracted from each subsequence (e.g., the slope of the best-fitting line of the sub-series, the mean of the signal, etc.). The feature space is divided into groups of similar values, and every subsequence is converted to a group-id. The raw sequence is then transformed to a string of cluster-ids or group-ids. The use of the window may over-count the patterns for the reason explained above. In addition, since $w$ is fixed, the extracted subsequences have the same length, which may affect the resulting patterns. Furthermore, for spatio-temporal data, even when we extract the subsequences using a sliding window and get simple features from these segments, we cannot directly group these features using the methods in [2] and [7]. The cluster-based approach ([2]) has been discredited by [8]. The way of grouping the subsequence features in [7] may be effective for time series with one-dimensional values. For more complex spatio-temporal data, if we directly apply this method, i.e., split the features into groups, we may miss the information about the spatial proximity of segments, which is essential for grouping.

The first study on finding frequent sequential patterns from spatio-temporal data is [11]. The raw data there is not one long sequence, but lists of spatial locations. After discretizing the locations according to a predefined spatial decomposition, the process is intrinsically similar to that in transactional databases.

[9] addresses the problem of discovering periodic patterns in spatio-temporal data, which is a generalization of mining periodic patterns in event sequences. Given a period $T$, in the case of spatio-temporal data, a periodic pattern is a (not necessarily contiguous) sequence of spatial regions, which appears frequently every $T$ timestamps and describes the object movement (e.g., a bus moves from district a to district b and then to c with high probability, every three hours). The contribution of [9] is that it does not treat spatio-temporal series as event sequences, by merely replacing each location by a predefined region enclosing it, but automatically discovers the regions that form the patterns. This method, although effective for its purpose, relies on a fixed $T$ (i.e., the patterns repeat themselves at regular time periods). In addition, it is sensitive to distortions or shifts of the pattern instances, i.e., periodic segments where the pattern does not appear in the same positions as in the pattern definition do not contribute to the pattern's support.

3 Spatio-temporal sequential patterns

A spatio-temporal sequence $S$ is a list of locations, $(x_1, y_1, t_1), (x_2, y_2, t_2), \ldots, (x_n, y_n, t_n)$, where $t_i$ represents the timestamp of location $(x_i, y_i)$ ($1 \le i \le n$). Figure 1 illustrates the movement of an object which repeats a similar route in three runs. We are interested in movement patterns repeated frequently in such a series. This section first motivates our solution, then formally defines the problem.

3.1 Motivation

Locations are not repeated exactly in every instance of a movement pattern. Our idea is to summarize a series of spatial locations into a series of spatial regions.

A naive method is to use a regular grid (or some predefined spatial decomposition) to divide the space into regions, taking a user-defined parameter $G$, the approximate number of parts into which each axis is split. Then, the location series can be converted to a sequence of grid-ids using a transformation approach. The first method, Grid I, converts each location to the id of the cell it falls in. E.g., the raw series in Figure 1a can be transformed to the cell-id sequence $c_2 c_4 c_8 c_9 c_6 c_2 \ldots c_3$. Although intuitive, this method has two problems. First, we lose the information on how the object moves inside a cell if the space decomposition is coarse, so the patterns may not be very descriptive. Second, for two instances of a pattern, the locations may not fall into the same cell (i.e., two adjacent locations appear in neighboring cells). We may miss some frequent patterns, whose instances are divided between different grid-based patterns. The first problem could be alleviated by making the grid finer (i.e., increasing $G$); however, this would increase the chances of missing patterns due to the second problem. An alternative conversion technique, Grid II, adds the ids of the cells that intersect the line segments connecting consecutive locations to the transformed sequence. In the example of Figure 1a, Grid II converts the sequence for the first run to $c_2 c_1 c_4 c_7 c_8 c_9 c_6 c_3 c_2$. Nevertheless, with this improvement the new series may be significantly longer than the original one, which may already be extremely long, as spatio-temporal sequences usually are.
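The Grid I conversion can be sketched as follows. The `bounds` argument and the row-major numbering from the top-left cell (matching the 1 to 9 layout of Figure 1a) are assumptions made for illustration; points are assumed to lie inside `bounds`.

```python
def grid_cell_id(x, y, bounds, G):
    """Return the 1-based, row-major id of the grid cell containing (x, y)."""
    xmin, ymin, xmax, ymax = bounds
    # Clamp so that points on the far boundary fall into the last cell.
    col = min(int((x - xmin) / (xmax - xmin) * G), G - 1)
    row = min(int((ymax - y) / (ymax - ymin) * G), G - 1)  # row 0 is the top row
    return row * G + col + 1

def grid_one(points, bounds, G):
    """Grid I: replace each (x, y, t) location by the id of the cell it falls in."""
    return [grid_cell_id(x, y, bounds, G) for x, y, _t in points]

# Hypothetical locations on a 3x3 grid over [0, 3] x [0, 3].
run = [(1.5, 2.5, 0), (0.5, 1.5, 1), (1.5, 0.5, 2), (2.5, 0.5, 3), (2.5, 1.5, 4)]
print(grid_one(run, (0.0, 0.0, 3.0, 3.0), 3))
```

For these sample points the output follows the cells 2, 4, 8, 9, 6, the same prefix as the cell-id sequence quoted in the text.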

Figure 1. Object Movement: (a) three runs (run 1, run 2, run 3) over a grid of cells numbered 1 to 9; (b) the same three runs with the line segment l.

Thus, we need a better way to abstract the trajectory. Motivated by line simplification techniques ([3]), we represent segments of the spatio-temporal series by directed line segments. Figure 1b shows that the line segment $l$ summarizes the first three points in each of the three runs with little error. In this way, not only do we compress the original data, decreasing the mining effort, but the derived line segments (which approximately describe movement) also provide initial seeds for defining the spatial regions, which could be expanded later by merging similar and close segments.

3.2 Problem deﬁnition

A segment $s_{ij}$ in a spatio-temporal sequence $S$ ($1 \le i < j \le n$) is a contiguous subsequence of $S$, starting from $(x_i, y_i, t_i)$ and ending at $(x_j, y_j, t_j)$. Given $s_{ij}$, we define its representative line segment $\vec{l}_{ij}$ with starting point $(x_i, y_i)$ and ending point $(x_j, y_j)$. Let $\epsilon$ be a distance error threshold; $s_{ij}$ complies with $\vec{l}_{ij}$ with respect to $\epsilon$, denoted as $s_{ij} \propto \vec{l}_{ij}$, if $dist((x_k, y_k), \vec{l}_{ij}) \le \epsilon$ for all $k$ ($i \le k \le j$), where $dist((x_k, y_k), \vec{l})$ is the distance between $(x_k, y_k)$ and line segment $\vec{l}$. When $s_{ij} \propto \vec{l}_{ij}$, each point $(x_k, y_k)$, $i \le k \le j$, in $s_{ij}$ can be projected to a point $(x'_k, y'_k)$ on $\vec{l}_{ij}$. $(x'_k, y'_k)$ implicitly denotes the projection of $(x_k, y_k)$ to $\vec{l}_{ij}$. Figure 2a illustrates a segment $s_{ij}$ complying with $\vec{l}_{ij}$ and shows the projection $(x'_k, y'_k)$ of point $(x_k, y_k)$ on $\vec{l}_{ij}$. A segmental decomposition $S^s$ of $S$ is defined by a list of consecutive segments that constitute $S$. Formally, $S^s = s_{k_0 k_1} s_{k_1 k_2} \ldots s_{k_{m-1} k_m}$, $k_0 = 1$, $k_m = n$, $m < n$, where $s_{k_i k_{i+1}} \propto \vec{l}_{k_i k_{i+1}}$ for all $i$. To simplify notation, we use $s_0 s_1 \ldots s_{m-1}$ to denote $S^s$.
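A minimal sketch of the compliance test, assuming points are given as (x, y) pairs with timestamps dropped and that dist is the Euclidean distance from a point to a line segment; the function names are illustrative only.

```python
import math

def point_segment_dist(p, a, b):
    """Euclidean distance from point p to the line segment with endpoints a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Parameter of the projection of p onto the line, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def complies(points, eps):
    """True if every point lies within eps of the segment joining the first and last point."""
    a, b = points[0], points[-1]
    return all(point_segment_dist(p, a, b) <= eps for p in points)

segment = [(0.0, 0.0), (1.0, 0.5), (2.0, -0.5), (3.0, 0.0)]
print(complies(segment, 0.6), complies(segment, 0.4))
```

The same list of points complies with its representative line segment for a tolerance of 0.6 but not for 0.4, since the two interior points lie 0.5 away from it.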

Let $\vec{l}$ represent a directed line segment, and let $\vec{l}.angle$ and $\vec{l}.len$ be its slope angle and length, respectively. Two line segments $\vec{l}_{ij}$ and $\vec{l}_{gh}$ representing segments $s_{ij}$ and $s_{gh}$ are similar, denoted by $\vec{l}_{ij} \sim \vec{l}_{gh}$, with respect to angle difference threshold $\theta$ and length factor $f$ ($0 \le f \le 1$) if:
(i) $|\vec{l}_{ij}.angle - \vec{l}_{gh}.angle| \le \theta$, and
(ii) $|\vec{l}_{ij}.len - \vec{l}_{gh}.len| \le f \times \max(\vec{l}_{ij}.len, \vec{l}_{gh}.len)$.
If $\vec{l}_{ij} \sim \vec{l}_{gh}$, $s_{ij}$ and $s_{gh}$ are also treated as similar to each other. Note that similarity is symmetric. The location information of segments is not considered in defining similarity, since we use it when defining the segments' closeness.

Line segment $\vec{l}_{ij}$ is close to $\vec{l}_{gh}$ if, for all $(x'_k, y'_k) \in \vec{l}_{ij}$, $dist((x'_k, y'_k), \vec{l}_{gh}) \le \epsilon$. When $\vec{l}_{ij}$ is close to $\vec{l}_{gh}$, we also say that the segment $s_{ij}$ is close to the segment $s_{gh}$, where $s_{ij} \propto \vec{l}_{ij}$ and $s_{gh} \propto \vec{l}_{gh}$. As opposed to similarity, closeness is asymmetric. Figure 2b shows an example. Let $\vec{l}_{ij}$ be parallel to $\vec{l}_{gh}$ and $\epsilon = 5.0$. The distance between these two parallel line segments is 4.5. Observe that $\vec{l}_{ij}$ is close to $\vec{l}_{gh}$ because the distance from each point in $\vec{l}_{ij}$ to $\vec{l}_{gh}$ is less than 5.0. However, $\vec{l}_{gh}$ is not close to $\vec{l}_{ij}$, because the point in its upper right part has a distance to $\vec{l}_{ij}$ larger than 5.0.
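The similarity and closeness tests can be sketched as below. Two choices here are assumptions rather than part of the definitions: angle differences are wrapped around the circle, and closeness is checked on the two endpoints only, which suffices because the distance to a line segment is a convex function and so is maximal at an endpoint. The coordinates are made up to mimic the parallel segments of Figure 2b.

```python
import math

def point_segment_dist(p, a, b):
    """Euclidean distance from point p to the line segment with endpoints a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def seg_len(seg):
    (ax, ay), (bx, by) = seg
    return math.hypot(bx - ax, by - ay)

def seg_angle(seg):
    (ax, ay), (bx, by) = seg
    return math.atan2(by - ay, bx - ax)

def similar(s1, s2, theta, f):
    """Conditions (i) and (ii): angle difference and relative length difference."""
    d = abs(seg_angle(s1) - seg_angle(s2))
    d = min(d, 2 * math.pi - d)  # assumption: wrap the difference around the circle
    l1, l2 = seg_len(s1), seg_len(s2)
    return d <= theta and abs(l1 - l2) <= f * max(l1, l2)

def close(s1, s2, eps):
    """s1 is close to s2 if both endpoints of s1 lie within eps of s2."""
    return all(point_segment_dist(p, s2[0], s2[1]) <= eps for p in s1)

l_ij = ((2.0, 4.5), (6.0, 4.5))   # short segment, 4.5 above l_gh
l_gh = ((0.0, 0.0), (10.0, 0.0))  # long parallel segment
print(close(l_ij, l_gh, 5.0), close(l_gh, l_ij, 5.0))
```

Here the short segment is close to the long one, but not the other way round, because the right end of the long segment is more than 5.0 away from the short one.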

Let $L$ be a set of segments from sequence $S^s$. The mean line segment for $L$, $\vec{l}_c$, is a line segment that best fits all the points in $L$ with the minimum sum of squared errors (SSE). In other words, if $PSet$ contains all the points of the segments in $L$, the mean line segment $\vec{l}_c$ is such that $\sum_{p \in PSet} dist(p, \vec{l}_c) \le \sum_{p \in PSet} dist(p, \vec{l})$, $\forall \vec{l} \ne \vec{l}_c$.

Let $tol$ be the average orthogonal distance of all the points in $L$ to $\vec{l}_c$. A spatial pattern element is a rectangular spatial region $r_L$ with four sides determined by $(\vec{l}_c, tol)$ as follows: (1) the two sides of $r_L$ that are parallel to $\vec{l}_c$ have the same length as $\vec{l}_c$, and their distances to $\vec{l}_c$ are $tol$; (2) the other two sides, perpendicular to $\vec{l}_c$, have length $2 \cdot tol$, and their midpoints are the two end points of $\vec{l}_c$. We refer to $\vec{l}_c$ as the central line segment of region $r_L$. We say that region $r_L$ contains $k$ segments, or that $k$ segments contribute to $r_L$, if $L$ consists of $k$ segments. Figure 2c visualizes this definition.
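One way to obtain a central line segment and its region is sketched below. It assumes the best-fitting line is the total-least-squares line (the principal axis through the centroid, which minimizes the sum of squared orthogonal distances), and that the segment's endpoints are the extreme projections of the points on that line. Both choices, and the function names, are assumptions; the orientation of the resulting segment is arbitrary here.

```python
import math

def mean_line_segment(points):
    """Fit a line by orthogonal least squares; return (start, end, tol)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    ang = 0.5 * math.atan2(2 * sxy, sxx - syy)  # direction of the principal axis
    ux, uy = math.cos(ang), math.sin(ang)
    # Signed positions of the projections along the fitted line.
    ts = [(x - cx) * ux + (y - cy) * uy for x, y in points]
    start = (cx + min(ts) * ux, cy + min(ts) * uy)
    end = (cx + max(ts) * ux, cy + max(ts) * uy)
    # tol: average orthogonal distance of the points to the fitted line.
    tol = sum(abs((y - cy) * ux - (x - cx) * uy) for x, y in points) / n
    return start, end, tol

def region_corners(start, end, tol):
    """Corners of the rectangle with central line segment (start, end) and half-width tol."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)
    nx, ny = -dy / length, dx / length  # unit normal to the central line segment
    return [
        (start[0] + nx * tol, start[1] + ny * tol),
        (end[0] + nx * tol, end[1] + ny * tol),
        (end[0] - nx * tol, end[1] - ny * tol),
        (start[0] - nx * tol, start[1] - ny * tol),
    ]

pts = [(0.0, 1.0), (0.0, -1.0), (4.0, 1.0), (4.0, -1.0)]
start, end, tol = mean_line_segment(pts)
print(start, end, tol, region_corners(start, end, tol))
```

For the four sample points the fitted segment runs along the x-axis from x = 0 to x = 4 with tol = 1, so the region is the rectangle spanned by the points themselves.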

A spatio-temporal sequential pattern $P$ is an ordered sequence of pattern elements: $r_1 r_2 \ldots r_q$ ($1 \le q \le m$). The length of pattern $P$ is the number of regions in it.

A contiguous subsequence of $S^s$, $s_i s_{i+1} \ldots s_{i+q-1}$, is a pattern instance for $P: r_1 r_2 \ldots r_q$ if, for all $j$ ($1 \le j \le q$), the representative line segment for segment $s_{i+j-1}$ is similar and close to the central line segment of region $r_j$. A pattern's instances cannot overlap in time (otherwise the pattern may be over-counted, as in [10]); i.e., if two contiguous subsequences of $S^s$, $s_i \ldots s_j$ and $s_g \ldots s_h$, are two instances of pattern $P$, then either $j < g$ or $h < i$. Given patterns $P': r'_1 r'_2 \ldots r'_i$ and $P: r_1 r_2 \ldots r_j$, $P'$ is a subpattern of $P$ if $i \le j$ and $\exists k$ ($1 \le k \le j - i + 1$) such that $r'_1 = r_k$, $r'_2 = r_{k+1}$, ..., $r'_i = r_{k+i-1}$. $P$ is a superpattern of $P'$.

Figure 2. Example of definitions: (a) Segment complies with $\vec{l}_{ij}$; (b) Example for closeness; (c) Region $r$ determined by $(\vec{l}_c, tol)$.

The support of a pattern $P$ is the number of instances supporting $P$. Given a support threshold min_sup, $P$ is frequent if its support exceeds min_sup. Since a pattern with the same frequency as one of its superpatterns is redundant, we focus on detecting closed frequent patterns [4], i.e., frequent patterns for which no proper superpattern has the same frequency. The mining problem is to find frequent patterns from a long spatio-temporal sequence $S$ with respect to a support threshold min_sup, subject to a segmenting distance error threshold $\epsilon$, a similarity parameter $\theta$ and a length factor $f$. The parameter values depend on the application domain, or can be tuned as part of the mining process [2]. Section 5.1 discusses how to set the parameters in practice when discovering patterns from raw data.
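Support counting and the closed-pattern filter can be sketched on a sequence of region ids. This assumes that segments have already been mapped to region ids, so that an instance is an exact contiguous match; the function names are hypothetical, and the sample sequence is written as integers for brevity.

```python
def support(seq, pattern):
    """Number of non-overlapping contiguous occurrences of `pattern` (greedy left-to-right scan)."""
    count, i, q = 0, 0, len(pattern)
    while i + q <= len(seq):
        if seq[i:i + q] == pattern:
            count += 1
            i += q  # skip past the instance so that instances do not overlap
        else:
            i += 1
    return count

def is_subpattern(p_sub, p):
    """True if p_sub occurs as a contiguous run inside p."""
    q = len(p_sub)
    return any(p[k:k + q] == p_sub for k in range(len(p) - q + 1))

def closed_patterns(freq):
    """Keep the patterns that have no proper superpattern with the same support."""
    return {
        p: s for p, s in freq.items()
        if not any(len(q) > len(p) and s2 == s and is_subpattern(p, q)
                   for q, s2 in freq.items())
    }

seq = (1, 2, 3, 4, 1, 3, 4, 2, 3, 4, 1, 2, 3, 4)
freq = {p: support(seq, p) for p in [(3,), (4,), (3, 4), (2, 3), (2, 3, 4)]}
print(freq)
print(closed_patterns(freq))
```

In this sample, the single regions 3 and 4 and the pair (2, 3) are dropped, because (3, 4) and (2, 3, 4) have the same supports.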

4 Solution

In this section, we describe how to discover frequent singular patterns, i.e., frequent spatial regions (Section 4.1), and longer closed patterns (Section 4.2).

4.1 Discovering frequent singular patterns

Segmentation (line simplification) algorithms ([3, 5, 6]) are used to convert a location series into a sequence of segments, so that each raw sequence segment can be abstracted by a line segment. Our idea is to transform $S$ to $S^s$ using such a technique, and take the segments obtained as seeds for the desired spatial regions, whose central line segments best fit the points of the segments in the regions. The DP (Douglas-Peucker) algorithm [3] is a classical top-down approach for this problem. [6] provides an online algorithm for splitting a sequence into segments with quite good quality. Since it is important to keep the internal movement inside a region, we need to capture the sharp turns of the movement in the transformation. We employ the DP method because it has been shown to be the best algorithm for choosing splitting points [12]. In brief, the DP algorithm recursively decomposes $S: \{p_1, \ldots, p_n\}$ into a series of line segments $l_1, \ldots, l_m$, $m \le n$, each of which, $l_i$, simplifies a subsequence $S_{l_i}$, such that the perpendicular distance from every point in $S_{l_i}$ to $l_i$ is at most $\epsilon$. For efficiency, an improved version of DP ([5]) could be adopted.
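The top-down decomposition can be sketched as a textbook Douglas-Peucker recursion. This is not the improved variant of [5]; it measures the distance from a point to the segment rather than to the infinite line, which is an assumption chosen to match the dist used in Section 3.2, and it expects at least two points.

```python
import math

def point_segment_dist(p, a, b):
    """Euclidean distance from point p to the line segment with endpoints a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def douglas_peucker(points, eps):
    """Return the indices of the split points; consecutive indices delimit the segments."""
    def split(i, j):
        # Find the interior point farthest from the segment (points[i], points[j]).
        k, dmax = max(
            ((k, point_segment_dist(points[k], points[i], points[j])) for k in range(i + 1, j)),
            key=lambda kd: kd[1],
            default=(None, 0.0),
        )
        if dmax <= eps:
            return [i, j]  # all points comply with this line segment
        return split(i, k)[:-1] + split(k, j)  # split at the farthest point and recurse
    return split(0, len(points) - 1)

track = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.0), (3.0, 2.0), (4.0, 4.0)]
print(douglas_peucker(track, 0.5))
```

With a tolerance of 0.5 the sample track is split at the sharp turn (index 2), giving the two segments 0..2 and 2..4; with a tolerance of 2.0 a single segment 0..4 suffices.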

Discovering frequent singular patterns from $S^s$ is a hard problem, since in the worst case all combinations of segments in $S^s$ have to be considered as candidates. To expedite the process, we employ a heuristic, Growing. Let Segs be a set initially containing all the segments in $S^s$. Growing works as follows. It selects the segment $s$ with median length, i.e., the median of the lengths of the segments in Segs, as the seed for the initial spatial region $r$. Then, $r$ is grown by merging other segments in Segs through filtering and verification steps, described later. Next, from the set of remaining segments not merged into $r$, the segment $s'$ with median length is selected as the seed for growing. Finally, the overall algorithm terminates after all segments (i) have been assigned to a region (as initial seeds or to the region of another seed), or (ii) have been found not to belong to any frequent region and marked as outliers. Selecting the segment with median length as seed could help to absorb short segments with less error, compared to taking a longer segment as seed. Meanwhile, it could prevent generating regions with too fine a granularity, which could happen when a shorter segment is used as seed. With this seed selection procedure, Growing is deterministic.

The filtering process checks two conditions. First, for each $s_i$ in Segs, the angle difference $diffa_i$ between $\vec{l}_s$ and $s_i$ is computed, and $s_i$ is treated as a candidate if $diffa_i$ is less than $\theta$. All the candidate segments are put into a set $C$. Second, the minimum distance from every segment in $C$ to $\vec{l}_s$ is computed, and all segments whose minimum distance to $\vec{l}_s$ is larger than $f \cdot \vec{l}_s.len$ are pruned. The remaining segments in $C$ will be used for verification.
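The seed selection and the first (angle) filter could look like the sketch below. Picking the upper median for an even number of segments and wrapping angle differences around the circle are assumptions, and the second filter (pruning by minimum distance) is deliberately left out.

```python
import math

def seg_len(seg):
    (ax, ay), (bx, by) = seg
    return math.hypot(bx - ax, by - ay)

def seg_angle(seg):
    (ax, ay), (bx, by) = seg
    return math.atan2(by - ay, bx - ax)

def pick_seed(segs):
    """Segment with median length (upper median for an even count, an assumption)."""
    return sorted(segs, key=seg_len)[len(segs) // 2]

def angle_filter(seed, segs, theta):
    """First filtering condition: keep segments whose direction differs from the seed by less than theta."""
    candidates = []
    for s in segs:
        if s is seed:
            continue
        d = abs(seg_angle(s) - seg_angle(seed))
        d = min(d, 2 * math.pi - d)  # assumption: wrap the difference around the circle
        if d < theta:
            candidates.append(s)
    return candidates

segs = [
    ((0.0, 0.0), (1.0, 0.0)),
    ((0.0, 1.0), (3.0, 1.1)),
    ((0.0, 0.0), (0.0, 2.0)),
    ((0.0, 3.0), (10.0, 3.0)),
    ((5.0, 5.0), (9.0, 5.0)),
]
seed = pick_seed(segs)
print(seed, angle_filter(seed, segs, 0.1))
```

In this made-up input the nearly horizontal segment of length about 3 becomes the seed, and the vertical segment is the only one rejected by the angle filter.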

The filtering step computes the minimum distance between segments, but it does not consider the length difference (the second condition of similarity) between each $\vec{l}_{s_i} \in C$ and $\vec{l}_s$, or the exact spatial distances of the segments in $C$ to $\vec{l}_s$ (the closeness condition). In the verification step, Algorithm 1 (shown below) merges a segment $s_i \in C$ into the spatial region $r$ around $\vec{l}_s$ if $s_i$ satisfies the closeness and length difference conditions. Otherwise, we extract from $s_i$ the part that satisfies the conditions, and merge this part with $r$. The remaining part of $s_i$ is a new segment and is inserted back into Segs (Line 15) for later processing.

Algorithm 1 Verification($\vec{l}_s$, C, Segs, f, min_sup)
1: $\alpha := \vec{l}_s.len \times f$; m := 0;
2: // length check
3: for each segment $s_i$ in C do
4:     intersect $s_i$ with $\vec{l}_s$, get $s'$ and $\vec{l}_{s'}$;
5:     if (diff($\vec{l}_s.len$, $\vec{l}_{s'}.len$) $\le \alpha$) m++;
6: end for
7: // closeness check
8: while (m $\ge$ min_sup) do
9:     Get $\vec{l}_c$ from all intersected points for region r;
10:     Validate all intersected parts from C;
11:     if (all intersected parts are close to $\vec{l}_c$) break;
12: end while
13: if (m < min_sup) return;
14: for each segment $s_i$ in C do
15:     Add non-intersected part of $s_i$ to Segs;
16:     Remove $s_i$ from Segs;
17: end for
18: Remove segment that $\vec{l}_s$ represents from Segs;

We explain how we compute the intersected part of $s_i$ and $\vec{l}_s$ in Line 4. Let $\vec{l}_{s_i}$ be the representative line segment for $s_i$. For every projection point $(x'_k, y'_k)$ in $\vec{l}_{s_i}$ whose distance to $\vec{l}_s$ is no more than $\alpha$ (Line 1), its related location point $(x_k, y_k)$ in the segment is put into the intersected part $s'$. The line segment created by mapping each point in $s'$ to $\vec{l}_{s_i}$ is denoted as $\vec{l}_{s'}$. For example, let $s_i$ represent segment $(x_{10}, y_{10}, t_{10}), \ldots, (x_{30}, y_{30}, t_{30})$. Assume that the distances from points in $\vec{l}_{s_i}$ to $\vec{l}_s$ are all smaller than $\alpha$ except points from $(x'_{10}, y'_{10})$ to $(x'_{15}, y'_{15})$. Then, $s'$ is segment $(x_{16}, y_{16}, t_{16}), \ldots, (x_{30}, y_{30}, t_{30})$, and $\vec{l}_{s'}$ represents the line segment from $(x'_{16}, y'_{16})$ to $(x'_{30}, y'_{30})$ in $\vec{l}_{s_i}$.

4.2 Deriving longer patterns

After finding frequently visited spatial regions, the original data $S$ is converted to a series $S^R$ of spatial regions by changing the segments in frequent regions to region ids, and those not in any region to outliers. $S^R$ preserves the motion continuity of the object by showing how it moves among regions. Although each region in $S^R$ is repeated frequently, the concatenation of some regions may not be frequent. E.g., a person living in $r_1$ often goes to a place $r_2$ on some days and to region $r_3$ on other days. $r_1$, $r_2$ and $r_3$ are frequently visited, but the path $r_2 r_3$ is not frequent. This section discusses how to detect the longer frequent patterns.

4.2.1 Level-wise mining

A direct way is to perform level-wise pattern mining. However, this approach suffers from the disadvantage that $S^R$ needs to be scanned many times. We propose solutions to reduce the number of candidates and scans in probing long candidates, based on the following properties we observe.

Property 1 (Connectivity Constraint): Due to the continuity of object movement, a spatial region can only connect to some, but not all, of the others in $S^R$. This constraint can help reduce the number of generated candidates, as follows. We first construct a connectivity graph for all the spatial regions in $S^R$. A directed edge from $r_i$ to $r_j$ is added to the graph if the substring $r_i r_j$ appears in the sequence. The edge weight is the number of times that $r_i r_j$ appears in the sequence. Let $r_1 r_2 \ldots r_k$ be a frequent pattern, and suppose $r_k$ only points to $r_i$ and $r_j$; then only two candidates, $r_1 r_2 \ldots r_k r_i$ and $r_1 r_2 \ldots r_k r_j$, are generated. Further, if the edge weight from $r_k$ to some element, say $r_i$, is no more than min_sup, we need not generate candidate $r_1 r_2 \ldots r_k r_i$.
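The connectivity graph and the candidate generation it enables can be sketched with a counter over adjacent pairs. Region ids are plain integers here, outlier handling is ignored, and the function names are hypothetical.

```python
from collections import Counter

def connectivity_graph(seq):
    """Map each directed edge (ri, rj) to the number of times rj directly follows ri."""
    return Counter(zip(seq, seq[1:]))

def extend_candidates(pattern, graph, min_sup):
    """Extend `pattern` by one region, using only edges whose weight exceeds min_sup."""
    last = pattern[-1]
    return [
        pattern + (rj,)
        for (ri, rj), weight in sorted(graph.items())
        if ri == last and weight > min_sup
    ]

seq = (1, 2, 3, 4, 1, 3, 4, 2, 3, 4, 1, 2, 3, 4)
graph = connectivity_graph(seq)
print(dict(graph))
print(extend_candidates((3, 4), graph, 1))
```

In this sample, region 4 is followed by region 1 twice and by region 2 once, so with min_sup = 1 the pattern (3, 4) is extended only to (3, 4, 1).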

Property 2 (Closeness Property): Given a pattern $P$, suppose its last element connects to $r_1$, $r_1$ connects to $r_2$, ..., $r_{m-1}$ connects to $r_m$ ($m \ge 2$). We can get patterns $P_1 = P r_1$ (concatenating $P$ and $r_1$), $P_2 = P r_1 r_2$, ..., $P_m = P r_1 r_2 \ldots r_m$. Obviously, if $P_1$ and $P_m$ have the same support, any $P_i$ ($1 < i < m$) also has the same support. This property helps to generate candidates more efficiently. Let result be the set of frequent patterns at the end of the $k$th scan and $P$ be a pattern in it with last element $r$. We can extend $P$ using other patterns in result that start with $r$. For instance, let $P = r_1 r_2 r_3$, and let $r_3$ only connect to $r_4$ in the connectivity graph. In addition, assume that result contains only one pattern starting from $r_3$: $P' = r_3 r_4 r_6 r_7$. $P$ can then be extended to candidates $r_1 r_2 r_3 r_4$ (using Property 1) and $r_1 r_2 r_3 r_4 r_6 r_7$ (using Property 2). If $r_1 r_2 r_3 r_4$ and $r_1 r_2 r_3 r_4 r_6 r_7$ have the same support after counting, we only need to consider candidates longer than $r_1 r_2 r_3 r_4 r_6 r_7$ later, significantly reducing the number of scans.

4.2.2 Mining using the substring tree

We propose a substring tree structure to facilitate counting of long substrings with different elements. The substring tree is a rooted directed tree whose root links to multiple substring sub-trees. Each node in a sub-tree consists of a pattern element and a counter, which counts the number of substrings (i.e., subsequences of elements) that contribute to the pattern formed by the path from the root to this node. A substring tree example is shown in Figure 3a.
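A prefix-tree reading of this structure can be sketched as follows. It only illustrates nodes that hold a pattern element and a counter; the class and method names are hypothetical, and incrementing every node on the inserted path is an assumption about how substrings contribute to the counters.

```python
class Node:
    """A tree node holding one pattern element and a counter."""
    def __init__(self, element=None):
        self.element = element
        self.count = 0
        self.children = {}

class SubstringTree:
    def __init__(self):
        self.root = Node()  # the root holds no element; its children start the sub-trees

    def insert(self, substring):
        """Insert a substring of region ids, incrementing the counter of every node on its path."""
        node = self.root
        for r in substring:
            node = node.children.setdefault(r, Node(r))
            node.count += 1

    def count(self, pattern):
        """Counter of the node reached by following `pattern` from the root (0 if absent)."""
        node = self.root
        for r in pattern:
            node = node.children.get(r)
            if node is None:
                return 0
        return node.count

tree = SubstringTree()
for sub in [(1, 2, 3, 4), (1, 2, 3, 4), (2, 3, 4), (1, 3, 4)]:
    tree.insert(sub)
print(tree.count((1, 2)), tree.count((1,)), tree.count((2, 3, 4)))
```

After these four insertions, the path for (1, 2) has counter 2, the node for region 1 has counter 3, and the path for (2, 3, 4) has counter 1.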

To construct the tree, while scanning $S^R$ we extract substrings containing distinct elements and insert them into the tree. On seeing an element $r$ in $S^R$, we concatenate it to the substrings found so far that do not contain $r$. Also, if no substring starting with $r$ is found, $r$ is treated as a new substring. We give an example to illustrate the extraction of substrings. Let $S^R$ be $r_1 r_2 r_3 r_4 r_1 r_3 r_4 r_2 r_3 r_4 r_1 r_2 r_3 r_4$. Initially, no substring is extracted. When we see the first $r_1$, we create a new substring for it. On seeing the second element $r_2$, we create a new substring $r_2$ since no substring starting with $r_2$ exists. In addition, we concatenate it to the only substring $r_1$ and get $r_1 r_2$. The process continues until we see the fifth element $r_1$. There is already a string $r_1 r_2 r_3 r_4$ with $r_1$ as first element, so $r_1 r_2 r_3 r_4$ is inserted to the tree,


References

Mining sequential patterns

Algorithms for the reduction of the number of points required to represent a digitized line or its caricature

Discovery of Frequent Episodes in Event Sequences

An online algorithm for segmenting time series

Levelwise Search and Borders of Theories in Knowledge Discovery
