Proceedings ArticleDOI

Mining frequent spatio-temporal sequential patterns

27 Nov 2005-pp 82-89
TL;DR: This paper defines pattern elements as spatial regions around frequent line segments, and proposes algorithms that find patterns by employing a newly proposed substring tree structure and an improved Apriori technique.
Abstract: Many applications track the movement of mobile objects, which can be represented as sequences of timestamped locations. Given such a spatio-temporal series, we study the problem of discovering sequential patterns, which are routes frequently followed by the object. Sequential pattern mining algorithms for transaction data are not directly applicable to this setting. The challenges to address are: (i) the fuzziness of locations in patterns, and (ii) the identification of non-explicit pattern instances. In this paper, we define pattern elements as spatial regions around frequent line segments. Our method first transforms the original sequence into a list of sequence segments, and detects frequent regions in a heuristic way. Then, we propose algorithms to find patterns by employing a newly proposed substring tree structure and improving the Apriori technique. A performance evaluation demonstrates the effectiveness and efficiency of our approach.

Summary (3 min read)

1 Introduction

  • The movement of an object (i.e., trajectory) can be described by a sequence of spatial locations sampled at consecutive timestamps (e.g., with the use of Global Positioning System (GPS) devices).
  • Buses move along series of streets repeatedly, people go to and return from work following more or less the same routes, etc.
  • Unfortunately, pattern discovery techniques in transactional databases are not readily applicable for finding sequential patterns in spatio-temporal data.

3.1 Motivation

  • Locations are not repeated exactly in every instance of a movement pattern.
  • A naive method is to use a regular grid (or some predefined spatial decomposition) to divide the space into regions, taking a user-defined parameter G, the approximate number of parts into which each axis is split.
  • Grid-based conversion may miss some frequent patterns whose instances are divided between different grid-based patterns.
  • An alternative conversion technique adds the ids of cells that intersect with the line segments connecting consecutive locations to the transformed sequence.
  • Motivated by line simplification techniques ([3]), the authors represent segments of the spatio-temporal series by directed line segments.

3.2 Problem definition

  • Given a segment s_ij, the authors define its representative line segment l_ij with starting point (x_i, y_i) and ending point (x_j, y_j).
  • In the closeness example, l_gh is not close to l_ij, because a point in its upper-right part lies farther than 5.0 from l_ij.
  • Given a support threshold min_sup, P is frequent if its support exceeds min_sup.
  • The parameter values depend on the application domain, or can be tuned as part of the mining process [2].

4.1 Discovering frequent singular patterns

  • The segmentation (line simplification) algorithm ([3, 5, 6]) is used to convert the locations series to segments sequences so that each raw sequence segment could be abstracted by a line segment.
  • The DP (Douglas-Peucker) algorithm [3] is a classical top down approach for this problem. [6] provides an online algorithm in splitting a sequence to segments with quite good quality.
  • It selects the segment s with median length, i.e., the median of the lengths of the segments in Segs, as seed for the initial spatial region r.
  • Let lsi be the representative line segment for si.

4.2 Deriving longer patterns

  • S_R preserves the motion continuity of the object by showing how it moves among regions.
  • The concatenation of some regions may not be frequent.
  • r_1, r_2 and r_3 are frequently visited, but the path r_2 r_3 is not frequent.
  • This section discusses how to detect the longer frequent patterns.

4.2.1 Level-wise mining

  • This approach suffers from the disadvantage that S_R needs to be scanned many times.
  • The connectivity constraint can help reduce the number of generated candidates, as follows.
  • The authors first construct a connectivity graph for all the spatial regions in S_R.
  • The edge weight is the frequency with which r_i r_j appears in the sequence.
  • In addition, assume that result contains only one pattern starting from r_3: P′ = r_3 r_4 r_6 r_7.

4.2.2 Mining using the substring tree

  • The authors propose a substring tree structure to facilitate counting of long substrings with different elements.
  • The substring tree is a rooted directed tree whose root links to multiple substring sub-trees.
  • The process continues until the fifth element, r_1, is read.
  • Each element in the stack comprises a pattern, its count and a level, indicating whether the pattern has reached a leaf or not.

5 Experiments

  • This section evaluates the proposed approach with real and synthetic data.
  • The real data contain tracked bus movements in Patras, Greece.
  • The generator takes three parameters, |p|, n, and m. |p| is the number of line segments constituting circular paths (i.e., patterns) of the movement.
  • The description of the artificial series is given in related experiments.
  • For each value in the set, the authors cluster the y coordinates of the sample points and derive dense ranges of y values.

5.2 Effectiveness and efficiency study

  • The authors examine the effectiveness of their method taking as input a raw bus movement sequence shown in Figure 5a, which contains 6921 locations.
  • This is quite coarse, since the movement inside each cell is unknown.
  • Table 1b compares the total time spent by their methods, and the grid methods which use the substring tree for finding longer patterns.
  • This happens because many cells in the sequence become outliers for this case, thus Grid II discovers shorter patterns (whereas Grid I finds longer ones, since it does not introduce intermediate cells at a sharp movement).

6 Conclusion

  • The authors modeled the problem of mining sequential patterns from spatio-temporal data by considering both spatial and temporal information.
  • Singular frequent patterns are found effectively, by grouping segments not only by similar shape (like previous work in time-series mining), but also by closeness in space.
  • In addition, the authors employed special properties of the problem (spatial connectivity, closeness) and a newly proposed substring tree to accelerate search for longer patterns.


Mining Frequent Spatio-temporal Sequential Patterns
Huiping Cao, Nikos Mamoulis, and David W. Cheung
Department of Computer Science
The University of Hong Kong
Pokfulam Road, Hong Kong
{hpcao, nikos, dcheung}@cs.hku.hk
Abstract
Many applications track the movement of mobile objects,
which can be represented as sequences of timestamped lo-
cations. Given such a spatio-temporal series, we study
the problem of discovering sequential patterns, which are
routes frequently followed by the object. Sequential pat-
tern mining algorithms for transaction data are not directly
applicable for this setting. The challenges to address are
(i) the fuzziness of locations in patterns, and (ii) the iden-
tification of non-explicit pattern instances. In this paper,
we define pattern elements as spatial regions around fre-
quent line segments. Our method first transforms the orig-
inal sequence into a list of sequence segments, and detects
frequent regions in a heuristic way. Then, we propose algorithms to find patterns by employing a newly proposed substring tree structure and improving the Apriori technique. A
performance evaluation demonstrates the effectiveness and
efficiency of our approach.
1 Introduction
The movement of an object (i.e., its trajectory) can be described by a sequence of spatial locations sampled at consecutive timestamps (e.g., with the use of Global Positioning System (GPS) devices). Parts of an object's route are often repeated in the archived history of locations. For instance, buses move along series of streets repeatedly, people go to and return from work following more or less the same routes, etc. The movement routes of most objects (e.g., private cars) are not predefined. Even for objects (e.g., buses) with pre-scheduled paths, the routes may not be repeated with the same frequency, due to different schedules on weekends or special days. We are interested in finding frequently repeated paths, i.e., spatio-temporal sequential patterns, from a long spatio-temporal sequence. These patterns could help to analyze the past movement or predict the future movement of the object, support approximate queries on the original data, and so on. However, they cannot be obtained straightforwardly by eliminating the noisy movement, because of the large volume of the spatio-temporal data.
Discovery of sequential patterns from transactional databases has attracted a lot of interest since Agrawal et al. introduced the problem [1]. In such a database, each transaction contains a set of items bought by some customer at one time, and a transaction sequence is a list of transactions ordered by time. For example, (a, b), (a, c), (b) is a sequence containing three transactions (a, b), (a, c) and (b). Given a collection of transaction sequences, the problem is to find ordered lists of itemsets appearing with high frequency. E.g., (b), (a), (b) is a pattern supported by the above sequence.
Unfortunately, pattern discovery techniques in transactional databases are not readily applicable for finding sequential patterns in spatio-temporal data. First, the elements in a transactional pattern are items that explicitly appear in pattern instances. On the other hand, location coordinates in a spatio-temporal series are real numbers, which do not repeat themselves exactly in every pattern instance. Second, the patterns are discovered from explicitly defined sets of sequences, like (a, b), (a, c), (b) in the previous example. Thus, a transaction list only contributes 0 or 1 to the support of a pattern, depending on whether the pattern appears or not in the specific sequence-set. In our setting, however, we detect frequent patterns from one long spatio-temporal sequence, without predefined segmentation of the data. The challenge is to identify the segments that contribute to a pattern, without allowing them to overlap with each other.
To summarize, the main contributions of this paper are: (i) We propose a model for spatio-temporal sequential pattern mining, based on appropriate definitions for pattern elements and pattern instances. (ii) We present an effective method for extracting pattern elements. (iii) We provide efficient pattern mining algorithms for discovering longer patterns. The remainder of the paper is organized as follows. Section 2 reviews the related literature. The formal definition of spatio-temporal sequential patterns is given in Section 3. Section 4 presents our solutions in detail. An experimental evaluation of the effectiveness and efficiency of our approach is presented in Section 5. Finally, Section 6 concludes this paper.
Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05)
1550-4786/05 $20.00 © 2005 IEEE
2 Related work
Our work is most related to pattern discovery from se-
quential data, which include time series, event sequences,
and spatio-temporal trajectories.
Mannila et al. [10] investigated the discovery of frequent episodes from event sequences. An episode is a (partially or totally) ordered list of events, and is thus a variant of a sequential pattern. A fixed sliding window w is used to extract segments (i.e., subsequences) in the event series, and the contribution of every segment to each candidate episode's frequency is counted. The segments supporting one episode may overlap, which is reasonable since episodes try to capture the order in which instantaneous events appear. However, this methodology may not give satisfactory results in finding spatio-temporal patterns, for several reasons. First, the window limits the length of the patterns. Second, pattern supports may not be counted correctly. E.g., suppose the object's movement is aabbcdefg, where each character a, b, etc. corresponds to a spatial region. The occurrence of the pattern abc should be 1, since the object moves from a to c once. However, if w is 5, pattern abc has support 4 due to the contribution of 4 segments (a b c, ab c, a bc, and a bc). Third, as opposed to well-defined categorical values for event instances, object locations do not repeat themselves exactly in pattern instances, for they are usually ordinal and inexact. Yang et al. investigated mining long sequential patterns in [13], also dealing with event series with noise.
Previous work on detecting patterns from time-series (e.g., [2, 7]) converted the problem to finding subsequences in lists of categorical data (e.g., event sequences), by preprocessing the original sequence to a string. A window w of fixed size is slid along the sequence, and a subsequence with length w is extracted for every position. In [2], the subsequences are clustered based on their shapes, and each cluster is given an id. In [7], some features are extracted from each subsequence (e.g., the slope of the best-fitting line of the sub-series, the mean of the signal, etc.). The feature space is divided into groups of similar values, and every subsequence is converted to a group-id. The raw sequence is then transformed to a string of cluster-ids or group-ids. The use of the window may over-count the patterns for the reason explained above. In addition, since w is fixed, the extracted subsequences have the same length, which may affect the resultant patterns. Furthermore, for spatio-temporal data, even when we extract the subsequences using a sliding window and get simple features from these segments, we cannot directly group these features using the methods in [2] and [7]. The cluster-based approach ([2]) has been discredited by [8]. The way to group the subsequence features ([7]) may be effective for time-series with one-dimensional values. For more complex spatio-temporal data, if we directly apply this method, i.e., split the features into groups, we may miss the information about the spatial proximity of segments, which is essential for grouping.
The first study on finding frequent sequential patterns from spatio-temporal data is [11]. The raw data here is not a long sequence, but lists of spatial locations. After discretizing the locations according to a pre-defined spatial decomposition, the process is intrinsically similar to that in transactional databases.
[9] addresses the problem of discovering periodic patterns in spatio-temporal data, which is a generalization of mining periodic patterns in event sequences. Given a period T, in the case of spatio-temporal data, a periodic pattern is a (not necessarily contiguous) sequence of spatial regions, which appears frequently every T timestamps and describes the object movement (e.g., a bus moves from district a to district b and then to c with high probability, every three hours). The contribution of [9] is that it does not treat spatio-temporal series as event sequences, by merely replacing each location by a predefined region enclosing it, but automatically discovers the regions that form the patterns. This method, although effective for its purpose, relies on a fixed T (i.e., the patterns repeat themselves at regular time periods). In addition, it is prone to distortions/shiftings of the pattern instances, i.e., periodic segments where the pattern does not appear in the same positions as in the pattern definition do not contribute to the pattern's support.
3 Spatio-temporal sequential patterns
A spatio-temporal sequence $S$ is a list of locations, $(x_1, y_1, t_1), (x_2, y_2, t_2), \ldots, (x_n, y_n, t_n)$, where $t_i$ represents the timestamp of location $(x_i, y_i)$ ($1 \le i \le n$). Figure 1 illustrates the movement of an object which repeats a similar route in three runs. We are interested in movement patterns repeated frequently in such a series. This section first motivates our solution, then formally defines the problem.
3.1 Motivation
Locations are not repeated exactly in every instance of
a movement pattern. Our idea is to summarize a series of
spatial locations to that of spatial regions.
A naive method is to use a regular grid (or some predefined spatial decomposition) to divide the space into regions, taking a user-defined parameter $G$, the approximate number of parts into which each axis is split. Then, the location series can be converted to a sequence of grid-ids using a transformation approach. The first method, Grid I, converts each location to the id of the cell it falls in. E.g., the raw series in Figure 1a can be transformed to the cell-id sequence $c_2 c_4 c_8 c_9 c_6 c_2 \ldots c_3$. Although intuitive, this method has two problems. First, we lose the information on how the object moves inside a cell, if the space decomposition is
coarse. The patterns may not be very descriptive. Second, for two instances of a pattern, the locations may not fall into the same cell (i.e., two adjacent locations appear in neighboring cells). We may miss some frequent patterns, whose instances are divided between different grid-based patterns. The first problem could be alleviated by increasing $G$ (i.e., using a finer grid); however, this would increase the chances of missing patterns due to the second problem. An alternative conversion technique, Grid II, adds the ids of cells that intersect with the line segments connecting consecutive locations to the transformed sequence. In the example of Figure 1a, Grid II converts the sequence for the first run to $c_2 c_1 c_4 c_7 c_8 c_9 c_6 c_3 c_2$. Nevertheless, with this improvement, the new series may be significantly longer than the original one, which may already be extremely long, as spatio-temporal sequences usually are.
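As a rough illustration of the Grid I conversion described above, the following sketch maps each location to a cell id. The function name `grid1_cell_ids`, the bounding-box argument and the row-by-row numbering from the top-left corner are assumptions made for this example; the coordinates are invented so that the output reproduces the first five ids of the cell-id sequence quoted in the text.

```python
def grid1_cell_ids(points, bounds, g):
    """Map each (x, y) location to the id of the grid cell it falls in.

    bounds = (min_x, min_y, max_x, max_y); g = number of splits per axis.
    Cells are numbered 1..g*g row by row from the top-left corner; this
    numbering is an assumed convention, not one fixed by the text.
    """
    min_x, min_y, max_x, max_y = bounds
    width = (max_x - min_x) / g
    height = (max_y - min_y) / g
    ids = []
    for x, y in points:
        col = min(int((x - min_x) / width), g - 1)   # clamp points on the right edge
        row = min(int((max_y - y) / height), g - 1)  # row 0 is the top row
        ids.append(row * g + col + 1)
    return ids


# Hypothetical coordinates for one run over a 3 x 3 grid.
run = [(4.0, 8.5), (1.0, 5.0), (4.5, 1.0), (8.0, 1.5), (8.5, 4.0)]
cells = grid1_cell_ids(run, (0.0, 0.0, 9.0, 9.0), 3)
print(cells)
```

Two locations that lie close together but on opposite sides of a cell border, such as (2.9, 5.0) and (3.1, 5.0) here, receive different ids, which is the second problem the text attributes to grid-based conversion.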
Figure 1. Object Movement: (a) runs 1-3 over a 3 × 3 grid with cells numbered 1-9; (b) runs 1-3 with line segment l
Thus, we need a better way to abstract the trajectory. Motivated by line simplification techniques ([3]), we represent segments of the spatio-temporal series by directed line segments. Figure 1b shows that the line segment l summarizes the first three points in each of the three runs with little error. In this way, not only do we compress the original data, decreasing the mining effort, but the derived line segments (which approximately describe movement) also provide initial seeds for defining the spatial regions, which can be expanded later by merging similar and close segments.
3.2 Problem definition
A segment $s_{ij}$ in a spatio-temporal sequence $S$ ($1 \le i < j \le n$) is a contiguous subsequence of $S$, starting from $(x_i, y_i, t_i)$ and ending at $(x_j, y_j, t_j)$. Given $s_{ij}$, we define its representative line segment $\vec{l}_{ij}$ with starting point $(x_i, y_i)$ and ending point $(x_j, y_j)$. Let $\epsilon$ be a distance error threshold; $s_{ij}$ complies with $\vec{l}_{ij}$ with respect to $\epsilon$, denoted as $s_{ij} \propto \vec{l}_{ij}$, if $dist((x_k, y_k), \vec{l}_{ij}) \le \epsilon$ for all $k$ ($i \le k \le j$), where $dist((x_k, y_k), \vec{l})$ is the distance between $(x_k, y_k)$ and line segment $\vec{l}$. When $s_{ij} \propto \vec{l}_{ij}$, each point $(x_k, y_k)$, $i \le k \le j$, in $s_{ij}$ can be projected to a point $(x'_k, y'_k)$ on $\vec{l}_{ij}$. $(x'_k, y'_k)$ implicitly denotes the projection of $(x_k, y_k)$ to $\vec{l}_{ij}$. Figure 2a illustrates a segment $s_{ij}$ complying with $\vec{l}_{ij}$ and shows the projection $(x'_k, y'_k)$ of point $(x_k, y_k)$ on $\vec{l}_{ij}$. A segmental decomposition $S_s$ of $S$ is defined by a list of consecutive segments that constitute $S$. Formally, $S_s = s_{k_0 k_1} s_{k_1 k_2} \ldots s_{k_{m-1} k_m}$, $k_0 = 1$, $k_m = n$, $m < n$, where $s_{k_i k_{i+1}} \propto \vec{l}_{k_i k_{i+1}}$ for all $i$. To simplify notation, we use $s_0 s_1 \ldots s_{m-1}$ to denote $S_s$.
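The compliance test defined above can be sketched as follows. This is a minimal illustration, assuming that dist is the Euclidean distance from a point to the closest point of the line segment; the function names `point_segment_distance` and `complies` and the sample coordinates are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:                      # degenerate segment: a == b
        return math.hypot(px - ax, py - ay)
    # Parameter of the projection of p on the line, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def complies(segment, eps):
    """True if every point of the segment lies within eps of the line
    segment joining its first and last locations.

    segment is a list of (x, y, t) tuples.
    """
    start, end = segment[0][:2], segment[-1][:2]
    return all(point_segment_distance(p[:2], start, end) <= eps for p in segment)


seg = [(0.0, 0.0, 0), (5.0, 1.0, 1), (10.0, 0.0, 2)]
print(complies(seg, 1.0), complies(seg, 0.5))
```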
Let $\vec{l}$ represent a directed line segment, and let $\vec{l}.angle$ and $\vec{l}.len$ be its slope angle and length respectively. Two line segments $\vec{l}_{ij}$ and $\vec{l}_{gh}$ representing segments $s_{ij}$ and $s_{gh}$ are similar, denoted by $\vec{l}_{ij} \sim \vec{l}_{gh}$, with respect to angle difference threshold $\theta$ and length factor $f$ ($0 \le f \le 1$) if:
(i) $|\vec{l}_{ij}.angle - \vec{l}_{gh}.angle| \le \theta$ and
(ii) $|\vec{l}_{ij}.len - \vec{l}_{gh}.len| \le f \times \max(\vec{l}_{ij}.len, \vec{l}_{gh}.len)$.
If $\vec{l}_{ij} \sim \vec{l}_{gh}$, $s_{ij}$ and $s_{gh}$ are also treated as similar to each other. Note that similarity is symmetric. The location information of segments is not considered in defining similarity, since we use it when defining the segments' closeness.
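A small sketch of the similarity test follows. The names `angle_and_length` and `similar` are hypothetical, angles are taken in radians via `atan2`, and the wrap-around handling of the angle difference is an added assumption that the definition above does not spell out.

```python
import math


def angle_and_length(seg):
    """Slope angle (radians) and length of a directed segment ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1), math.hypot(x2 - x1, y2 - y1)


def similar(seg_a, seg_b, theta, f):
    """Conditions (i) and (ii): angle difference at most theta, and length
    difference at most f times the longer of the two lengths."""
    ang_a, len_a = angle_and_length(seg_a)
    ang_b, len_b = angle_and_length(seg_b)
    diff = abs(ang_a - ang_b)
    diff = min(diff, 2 * math.pi - diff)  # assumption: compare angles modulo 2*pi
    return diff <= theta and abs(len_a - len_b) <= f * max(len_a, len_b)


a = ((0.0, 0.0), (10.0, 0.0))
b = ((0.0, 5.0), (9.0, 6.0))      # slightly rotated, slightly shorter
c = ((10.0, 0.0), (0.0, 0.0))     # same line as a, opposite direction
d = ((0.0, 0.0), (5.0, 0.0))      # same direction as a, half the length
print(similar(a, b, 0.2, 0.2), similar(a, c, 0.2, 0.2), similar(a, d, 0.2, 0.2))
```

Because both conditions use absolute differences and a maximum, swapping the arguments does not change the result, which matches the remark that similarity is symmetric.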
Line segment $\vec{l}_{ij}$ is close to $\vec{l}_{gh}$ if for $\forall (x_k, y_k) \in \vec{l}_{ij}$, $dist((x_k, y_k), \vec{l}_{gh}) \le \epsilon$. When $\vec{l}_{ij}$ is close to $\vec{l}_{gh}$, we also say that the segment $s_{ij}$ is close to the segment $s_{gh}$, where $s_{ij} \propto \vec{l}_{ij}$ and $s_{gh} \propto \vec{l}_{gh}$. As opposed to similarity, closeness is asymmetric. Figure 2b shows an example. Let $\vec{l}_{ij}$ be parallel to $\vec{l}_{gh}$ and $\epsilon = 5.0$. The distance between these two parallel line segments is 4.5. Observe that $\vec{l}_{ij}$ is close to $\vec{l}_{gh}$ because the distance from each point in $\vec{l}_{ij}$ to $\vec{l}_{gh}$ is less than 5.0. However, $\vec{l}_{gh}$ is not close to $\vec{l}_{ij}$, because the point in its upper-right part has a distance to $\vec{l}_{ij}$ greater than 5.0.
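The asymmetry of closeness can be checked with a short sketch. The coordinates below are hypothetical (they are not taken from Figure 2b); they only reproduce the setting of two parallel segments 4.5 apart with a threshold of 5.0. The helper names are assumptions as well. Since the distance to a line segment is a convex function of the point, its maximum over a segment is reached at an endpoint, so testing the two endpoints is enough.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def close_to(seg_a, seg_b, eps):
    """True if every point of seg_a is within eps of seg_b.

    Only the endpoints of seg_a are tested, because the largest distance
    from a point of seg_a to seg_b occurs at one of them.
    """
    return all(point_segment_distance(p, seg_b[0], seg_b[1]) <= eps for p in seg_a)


l_ij = ((5.0, 4.5), (15.0, 4.5))   # short segment
l_gh = ((0.0, 0.0), (30.0, 0.0))   # longer parallel segment, 4.5 below
print(close_to(l_ij, l_gh, 5.0), close_to(l_gh, l_ij, 5.0))
```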
Let $L$ be a set of segments from sequence $S_s$. The mean line segment for $L$, $\vec{l}_c$, is a line segment that best fits all the points in $L$ with the minimum sum of squared errors (SSE). In other words, if $PSet$ contains all the points of the segments in $L$, the mean line segment $\vec{l}_c$ is such that $\sum_{p \in PSet} dist(p, \vec{l}_c)^2 \le \sum_{p \in PSet} dist(p, \vec{l})^2$, $\forall \vec{l} \ne \vec{l}_c$.
Let $tol$ be the average orthogonal distance of all the points in $L$ to $\vec{l}_c$. A spatial pattern element is a rectangular spatial region $r_L$ with four sides determined by $(\vec{l}_c, tol)$ as follows: (1) the two sides of $r_L$ that are parallel to $\vec{l}_c$ have the same length as $\vec{l}_c$, and their distances to $\vec{l}_c$ are $tol$; (2) the other two sides, perpendicular to $\vec{l}_c$, have length $2 \cdot tol$, and their midpoints are the two end points of $\vec{l}_c$. We refer to $\vec{l}_c$ as the central line segment of region $r_L$. We say that region $r_L$ contains $k$ segments, or that $k$ segments contribute to $r_L$, if $L$ consists of $k$ segments. Figure 2c visualizes this definition.
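One possible way to approximate a central line segment and its region is sketched below. It is not the paper's procedure: it assumes that an orthogonal least-squares line through the points (the principal axis of their scatter), clipped to the extent of the projected points, is an acceptable stand-in for the mean line segment, and it ignores the segment's direction. The names `mean_line_and_tol` and `region_corners` are hypothetical.

```python
import math


def mean_line_and_tol(points):
    """Fit a line by orthogonal least squares and return (start, end, tol).

    start/end bound the projections of the points on the fitted line;
    tol is the average orthogonal distance of the points to that line.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)   # principal axis direction
    ux, uy = math.cos(angle), math.sin(angle)
    along = [(x - cx) * ux + (y - cy) * uy for x, y in points]
    across = [abs((y - cy) * ux - (x - cx) * uy) for x, y in points]
    lo, hi = min(along), max(along)
    start = (cx + lo * ux, cy + lo * uy)
    end = (cx + hi * ux, cy + hi * uy)
    return start, end, sum(across) / n


def region_corners(start, end, tol):
    """Corners of the rectangle around the central segment: two sides parallel
    to it at distance tol, two sides of length 2*tol through its end points."""
    (x1, y1), (x2, y2) = start, end
    length = math.hypot(x2 - x1, y2 - y1)
    nx, ny = -(y2 - y1) / length, (x2 - x1) / length   # unit normal
    return [(x1 + tol * nx, y1 + tol * ny), (x2 + tol * nx, y2 + tol * ny),
            (x2 - tol * nx, y2 - tol * ny), (x1 - tol * nx, y1 - tol * ny)]


pts = [(0.0, 1.0), (0.0, -1.0), (10.0, 1.0), (10.0, -1.0)]
start, end, tol = mean_line_and_tol(pts)
corners = region_corners(start, end, tol)
print(start, end, tol)
```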
A spatio-temporal sequential pattern $P$ is an ordered sequence of pattern elements: $r_1 r_2 \ldots r_q$, ($1 \le q \le m$). The length of pattern $P$ is the number of regions in it.
A contiguous subsequence of $S_s$, $s_i s_{i+1} \ldots s_{i+q-1}$, is a pattern instance for $P: r_1 r_2 \ldots r_q$ if $\forall j$ ($1 \le j \le q$), the representative line segment for segment $s_{i+j-1}$ is similar and close to the central line segment of region $r_j$. A pattern's instances cannot overlap in time (the pattern may be over-counted like that in [10] otherwise), i.e., if two contiguous subsequences of $S_s$, $s_i \ldots s_j$ and $s_g \ldots s_h$, are two instances for pattern $P$, either $j < g$ or $h < i$. Given patterns $P': r'_1 r'_2 \ldots r'_i$ and $P: r_1 r_2 \ldots r_j$, $P'$ is a subpattern of $P$ if $i \le j$ and $\exists k$, ($1 \le k \le j - i + 1$) such that $r'_1 = r_k$, $r'_2 = r_{k+1}$, ..., $r'_i = r_{k+i-1}$. $P$ is a superpattern of $P'$.
Figure 2. Example of definitions: (a) Segment complies with $\vec{l}_{ij}$; (b) Example for closeness; (c) Region r determined by $(\vec{l}_c, tol)$
The support of a pattern $P$ is the number of instances supporting $P$. Given a support threshold min_sup, $P$ is frequent if its support exceeds min_sup. Since a pattern with the same frequency as one of its superpatterns is redundant, we focus on detecting closed frequent patterns [4], i.e., frequent patterns that have no proper superpattern with equal frequency. The mining problem is to find frequent patterns from a long spatio-temporal sequence $S$ with respect to a support threshold min_sup, subject to a segmenting distance error threshold $\epsilon$, a similarity parameter $\theta$ and a length factor $f$. The parameter values depend on the application domain, or can be tuned as part of the mining process [2]. Section 5.1 discusses how to set the parameters in practice when discovering patterns from raw data.
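To make the support definition concrete, here is a simplified sketch. It assumes that every segment has already been matched to a region label (the labels and the helper name `support` are hypothetical), so that a pattern instance is just a contiguous run of labels, and it counts instances greedily from left to right so that no two counted instances overlap.

```python
def support(region_seq, pattern):
    """Count non-overlapping contiguous occurrences of pattern in region_seq."""
    count, i, q = 0, 0, len(pattern)
    while i + q <= len(region_seq):
        if region_seq[i:i + q] == pattern:
            count += 1
            i += q          # jump past this instance so instances never overlap
        else:
            i += 1
    return count


seq = "r1 r2 r3 r4 r1 r3 r4 r2 r3 r4 r1 r2 r3 r4".split()
min_sup = 2                                   # illustrative threshold
print(support(seq, ["r3", "r4"]))             # contiguous r3 r4 runs
print(support(seq, ["r1", "r2"]) > min_sup)   # frequent only if support exceeds min_sup
print(support(["a", "a", "a"], ["a", "a"]))   # overlapping matches are counted once
```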
4 Solution
In this section, we describe how to discover frequent singular patterns, i.e., frequent spatial regions (Section 4.1), and longer closed patterns (Section 4.2).
4.1 Discovering frequent singular patterns
The segmentation (line simplification) algorithm ([3, 5, 6]) is used to convert the location series to a sequence of segments, so that each raw sequence segment can be abstracted by a line segment. Our idea is to transform $S$ to $S_s$ using such a technique, and take the segments obtained as seeds for the desired spatial regions, whose central line segments best fit the points of the segments in the regions. The DP (Douglas-Peucker) algorithm [3] is a classical top-down approach for this problem. [6] provides an online algorithm for splitting a sequence into segments with quite good quality. Since it is important to keep the internal movement inside a region, we need to capture the sharp turns of the movement in the transformation. We employ the DP method because it has been shown to be the best algorithm in choosing splitting points [12]. In brief, the DP algorithm recursively decomposes $S: \{p_1, \ldots, p_n\}$ into a series of line segments $l_1, \ldots, l_m$, $m \le n$, each of which, $l_i$, simplifies a subsequence $S_{l_i}$, such that the perpendicular distance from every point in $S_{l_i}$ to $l_i$ is at most $\epsilon$. For efficiency, DP's improved version ([5]) could be adopted.
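The recursive decomposition just described can be sketched with the textbook form of the Douglas-Peucker algorithm. This is the plain recursive variant, not the improved version of [5]; the function names and the sample path are hypothetical, and the function returns the indices of the retained split points, so that consecutive indices delimit the segments.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from p to the infinite line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0.0:                            # a and b coincide
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - (ax - px) * dy) / norm


def douglas_peucker(points, eps):
    """Return the indices of the points kept as segment boundaries."""
    if len(points) < 2:
        return list(range(len(points)))

    def recurse(first, last):
        # Find the interior point farthest from the line first -> last.
        max_d, idx = 0.0, None
        for k in range(first + 1, last):
            d = perpendicular_distance(points[k], points[first], points[last])
            if d > max_d:
                max_d, idx = d, k
        if idx is None or max_d <= eps:        # every point is within eps
            return [first, last]
        left = recurse(first, idx)             # split at the farthest point
        right = recurse(idx, last)
        return left[:-1] + right               # idx appears only once

    return recurse(0, len(points) - 1)


# An L-shaped path: the sharp turn at index 3 is kept as a split point.
path = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
print(douglas_peucker(path, 0.5))
```

Splitting at the point of largest deviation is what lets the method retain sharp turns, which the text identifies as the reason for preferring DP.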
Discovering frequent singular patterns from $S_s$ is a hard problem, since in the worst case all combinations of segments in $S_s$ have to be considered as candidates. To expedite the process, we employ a heuristic, Growing. Let Segs be a set initially containing all the segments in $S_s$. Growing works as follows. It selects the segment $s$ with median length, i.e., the median of the lengths of the segments in Segs, as seed for the initial spatial region $r$. Then, $r$ is grown by merging other segments in Segs through filtering and verification steps, described later. Next, from the set of remaining segments not merged into $r$, the segment $s'$ with median length is selected as the seed for growing. Finally, the overall algorithm terminates after all segments (i) have been assigned to a region (as initial seeds or to the region of another seed), or (ii) have been found not to belong to any frequent region and marked as outliers. Selecting the segment with median length as seed helps to absorb short segments with less error, compared to taking a longer segment as seed. Meanwhile, it prevents generating regions with too fine a granularity, which could happen when a shorter segment is used as seed. Growing is deterministic when this seed selection procedure is used.
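The seed selection step of Growing can be sketched in a few lines. The helper names are hypothetical, segments are represented only by their two end points, and taking the upper median when the number of segments is even is an assumption, since the text does not say how ties are resolved.

```python
import math


def segment_length(seg):
    """Length of a segment given as ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)


def pick_seed(segs):
    """Return the segment whose length is the median of all segment lengths."""
    ordered = sorted(segs, key=segment_length)
    return ordered[len(ordered) // 2]   # assumption: upper median for even counts


segs = [((0, 0), (1, 0)), ((0, 0), (5, 0)), ((0, 0), (3, 0))]
seed = pick_seed(segs)
print(seed, segment_length(seed))
```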
The filtering process checks two conditions. First, for each $s_i$ in Segs the angle difference $\mathit{diffa}_i$ between $\vec{l}_s$ and $s_i$ is computed, and $s_i$ is treated as a candidate if $\mathit{diffa}_i$ is less than $\theta$. All the candidate segments are put into a set $C$. Second, the minimum distance from every segment in $C$ to $\vec{l}_s$ is computed, and all segments whose minimum distance to $\vec{l}_s$ is larger than $f \cdot \vec{l}_s.len$ are pruned. The remaining segments in $C$ will be used for verification.
The filtering step computes the minimum distance between segments, but it does not consider the length difference (second condition of similarity) between each $\vec{l}_{s_i} \in C$ and $\vec{l}_s$, nor the exact spatial distances of segments in $C$ to $\vec{l}_s$ (closeness condition). In the verification step, Algorithm 1 (shown below) merges the segments in $C$ into the spatial region $r$ around $\vec{l}_s$, if $s_i \in C$ satisfies the closeness and length difference conditions. Otherwise, we extract from $s_i$ the part that satisfies the conditions, and merge this part with $r$. The remaining part of $s_i$ is a new segment and is inserted back into Segs (Line 15) for later processing.
Algorithm 1 Verification($\vec{l}_s$, $C$, Segs, $f$, min_sup)
1: $\alpha := \vec{l}_s.len \times f$; $m := 0$;
2: //length check
3: for each segment $s_i$ in $C$ do
4:     intersect $s_i$ with $\vec{l}_s$, get $s'$ and $\vec{l}_{s'}$;
5:     if (diff($\vec{l}_s.len$, $\vec{l}_{s'}.len$) $\le \alpha$) m++;
6: end for
7: //closeness check
8: while ($m \ge$ min_sup) do
9:     Get $\vec{l}_c$ from all intersected points for region $r$;
10:     Validate all intersected parts from $C$;
11:     if (all intersected parts are close to $\vec{l}_c$) break;
12: end while
13: if ($m <$ min_sup) return;
14: for each segment $s_i$ in $C$ do
15:     Add non-intersected part of $s_i$ to Segs;
16:     Remove $s_i$ from Segs;
17: end for
18: Remove segment that $\vec{l}_s$ represents from Segs;
We explain how we compute the intersected part of $s_i$ and $\vec{l}_s$ in Line 4. Let $\vec{l}_{s_i}$ be the representative line segment for $s_i$. For every projection point $(x'_k, y'_k)$ in $\vec{l}_{s_i}$ whose distance to $\vec{l}_s$ is no more than $\alpha$ (Line 1), its related location point $(x_k, y_k)$ in the segment is put into the intersected part $s'$. The line segment created by mapping each point in $s'$ to $\vec{l}_{s_i}$ is denoted as $\vec{l}_{s'}$. For example, let $s_i$ represent segment $(x_{10}, y_{10}, t_{10}), \ldots, (x_{30}, y_{30}, t_{30})$. Assume that the distances from points in $\vec{l}_{s_i}$ to $\vec{l}_s$ are all smaller than $\alpha$ except for points from $(x'_{10}, y'_{10})$ to $(x'_{15}, y'_{15})$. Then, $s'$ is segment $(x_{16}, y_{16}, t_{16}), \ldots, (x_{30}, y_{30}, t_{30})$, and $\vec{l}_{s'}$ represents the line segment from $(x'_{16}, y'_{16})$ to $(x'_{30}, y'_{30})$ in $\vec{l}_{s_i}$.
4.2 Deriving longer patterns
After finding frequently visited spatial regions, the original data $S$ is converted to a series $S_R$ of spatial regions by changing the segments in frequent regions to region ids, and those not in any region to outliers. $S_R$ preserves the motion continuity of the object by showing how it moves among regions. Although each region in $S_R$ is repeated frequently, the concatenation of some regions may not be frequent. E.g., a person living in $r_1$ often goes to a place $r_2$ on some days and to region $r_3$ on other days. $r_1$, $r_2$ and $r_3$ are frequently visited, but the path $r_2 r_3$ is not frequent. This section discusses how to detect the longer frequent patterns.
4.2.1 Level-wise mining
A direct way is to perform level-wise pattern mining. However, this approach suffers from the disadvantage that $S_R$ needs to be scanned many times. We propose solutions to reduce the number of candidates and scans in probing long candidates, based on the following properties we observe.
Property 1 (Connectivity Constraint): Due to continuity of object movement, a spatial region can only connect to some but not all the others in $S_R$. This constraint can help reduce the number of generated candidates, as follows. We first construct a connectivity graph for all the spatial regions in $S_R$. A directed edge from $r_i$ to $r_j$ is added to the graph if the substring $r_i r_j$ appears in the sequence. The edge weight is the frequency with which $r_i r_j$ appears in the sequence. If $r_1 r_2 \ldots r_k$ is a frequent pattern and $r_k$ only points to $r_i$ and $r_j$, only two candidates, $r_1 r_2 \ldots r_k r_i$ and $r_1 r_2 \ldots r_k r_j$, are generated. Further, if the edge weight from $r_k$ to some element, say $r_i$, is no more than min_sup, we need not generate candidate $r_1 r_2 \ldots r_k r_i$.
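The connectivity graph and the candidate pruning of Property 1 can be sketched as follows. The representation is an assumption made for illustration: $S_R$ is a list of region labels in which outliers appear as `None`, the graph is a `Counter` keyed by (source, target) pairs, and the function names are hypothetical.

```python
from collections import Counter


def connectivity_graph(region_seq):
    """Weighted directed edges: how often r_i is immediately followed by r_j.

    Pairs involving an outlier (None) are skipped.
    """
    return Counter((a, b) for a, b in zip(region_seq, region_seq[1:])
                   if a is not None and b is not None)


def extend_candidates(pattern, graph, min_sup):
    """Append to pattern only those regions that its last element points to
    with an edge weight exceeding min_sup."""
    last = pattern[-1]
    return [pattern + [dst]
            for (src, dst), weight in sorted(graph.items())
            if src == last and weight > min_sup]


seq = "r1 r2 r3 r4 r1 r3 r4 r2 r3 r4 r1 r2 r3 r4".split()
graph = connectivity_graph(seq)
print(graph[("r3", "r4")], graph[("r1", "r3")])
print(extend_candidates(["r1"], graph, 1))
```

In this toy sequence r1 points to both r2 and r3, but the edge to r3 has weight 1, so with min_sup = 1 only the candidate ending in r2 is generated.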
Property 2 (Closeness Property): Given a pattern $P$, suppose its last
element connects to $r_1$, $r_1$ connects to $r_2$, ..., and $r_{m-1}$
connects to $r_m$ ($m \ge 2$). We can get the patterns $P_1 = P r_1$
(concatenating $P$ and $r_1$), $P_2 = P r_1 r_2$, ...,
$P_m = P r_1 r_2 \ldots r_m$. Obviously, if $P_1$ and $P_m$ have the same
support, any $P_i$ ($1 < i < m$) also has the same support. This property
helps to generate candidates more efficiently. Let result be the set of
frequent patterns at the end of the $k$th scan and $P$ be a pattern in it
with last element $r$. We can extend $P$ using other patterns in result
that start with $r$. For instance, let $P = r_1 r_2 r_3$, and let $r_3$
connect only to $r_4$ in the connectivity graph. In addition, assume that
result contains only one pattern starting from $r_3$:
$P' = r_3 r_4 r_6 r_7$. $P$ can then be extended to the candidates
$r_1 r_2 r_3 r_4$ (using Property 1) and $r_1 r_2 r_3 r_4 r_6 r_7$ (using
Property 2). If $r_1 r_2 r_3 r_4$ and $r_1 r_2 r_3 r_4 r_6 r_7$ have the
same support after the counting, we only need to consider candidates longer
than $r_1 r_2 r_3 r_4 r_6 r_7$ later, significantly reducing the number of
scans.
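The extension step based on Property 2 can be sketched as below. The helper names `closeness_extensions` and `infer_intermediate_supports` are hypothetical, patterns are tuples of region labels, and the support values in the usage example are made up for illustration; only the joining rule and the equal-support inference come from the text.

```python
def closeness_extensions(pattern, result):
    """Join `pattern` with every other pattern in `result` that starts
    with its last element, dropping the shared element once."""
    last = pattern[-1]
    return [
        pattern + other[1:]
        for other in result
        if other != pattern and len(other) > 1 and other[0] == last
    ]

def infer_intermediate_supports(short, long, support):
    """If `short` (a prefix of `long`) and `long` have equal support,
    Property 2 assigns that support to every pattern strictly between
    them, so those patterns need no further counting."""
    if long[:len(short)] != short:
        raise ValueError("short must be a prefix of long")
    if support[short] != support[long]:
        return {}
    return {long[:n]: support[short] for n in range(len(short) + 1, len(long))}

# The example from the text: P = r1 r2 r3 and result also holds r3 r4 r6 r7.
P = ("r1", "r2", "r3")
result = [P, ("r3", "r4", "r6", "r7")]
long = closeness_extensions(P, result)[0]
print(long)  # ('r1', 'r2', 'r3', 'r4', 'r6', 'r7')

# Hypothetical supports obtained from one counting scan.
short = ("r1", "r2", "r3", "r4")
support = {short: 3, long: 3}
print(infer_intermediate_supports(short, long, support))
# {('r1', 'r2', 'r3', 'r4', 'r6'): 3}
```

The inference relies on support being non-increasing as a pattern is extended, so equal support at both ends forces equal support in between.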
4.2.2 Mining using the substring tree
We propose a substring tree structure to facilitate counting of long
substrings with different elements. The substring tree is a rooted directed
tree whose root links to multiple substring sub-trees. Each node in a
sub-tree consists of a pattern element and a counter, which counts the
number of substrings (i.e., subsequences of elements) that contribute to
the pattern formed by the path from the root to this node. A substring tree
example is shown in Figure 3a.
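Since Figure 3a is not reproduced here, the following is a minimal sketch of a node layout that matches the description: each node stores a pattern element and a counter, and the root links to one sub-tree per first element. The class name `SubstringTreeNode`, the helpers `insert_substring` and `support_of`, and the update rule (increment the counter of every node on the inserted path) are assumptions, not details confirmed by the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence

@dataclass
class SubstringTreeNode:
    element: Optional[str] = None  # None marks the root
    count: int = 0                 # substrings contributing to this path
    children: Dict[str, "SubstringTreeNode"] = field(default_factory=dict)

def insert_substring(root: SubstringTreeNode, substring: Sequence[str]) -> None:
    """Insert a substring, incrementing the counter of each node on its
    path (assumed update rule)."""
    node = root
    for element in substring:
        if element not in node.children:
            node.children[element] = SubstringTreeNode(element)
        node = node.children[element]
        node.count += 1

def support_of(root: SubstringTreeNode, pattern: Sequence[str]) -> int:
    """Counter stored at the end of the path spelling `pattern`, or 0 if
    the path does not exist."""
    node = root
    for element in pattern:
        node = node.children.get(element)
        if node is None:
            return 0
    return node.count

root = SubstringTreeNode()
insert_substring(root, ["r1", "r2", "r3", "r4"])
insert_substring(root, ["r1", "r3", "r4"])
print(support_of(root, ["r1"]))        # 2
print(support_of(root, ["r1", "r2"]))  # 1
```

Under this assumed rule, the supports of all prefixes of an inserted substring are updated in a single walk, which is what makes counting long substrings cheap.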
To construct the tree, in scanning $S_R$, we extract substrings containing
distinct elements and insert them into the tree. On seeing an element $r$
in $S_R$, we concatenate it to the substrings found so far that do not
contain $r$. Also, if no substring starting with $r$ is found, $r$ is
treated as a new substring. We give an example to illustrate the extraction
of substrings. Let $S_R$ be
$r_1 r_2 r_3 r_4 r_1 r_3 r_4 r_2 r_3 r_4 r_1 r_2 r_3 r_4$. Initially, no
substring is extracted. When we see the first $r_1$, we create a new
substring for it. On seeing the second element $r_2$, we create a new
substring $r_2$ since no substring starting with $r_2$ exists. In addition,
we concatenate it to the only substring $r_1$ and get $r_1 r_2$. The
process continues until we see the fifth element $r_1$. There is already a
string $r_1 r_2 r_3 r_4$ with $r_1$ as first element, so $r_1 r_2 r_3 r_4$
is inserted into the tree,
Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05)
1550-4786/05 $20.00 © 2005 IEEE

Citations
Journal ArticleDOI
TL;DR: It is believed that frequent pattern mining research has substantially broadened the scope of data analysis and will have deep impact on data mining methodologies and applications in the long run; however, there are still some challenging research issues that need to be solved before frequent pattern mining can claim a cornerstone approach in data mining applications.
Abstract: Frequent pattern mining has been a focused theme in data mining research for over a decade. Abundant literature has been dedicated to this research and tremendous progress has been made, ranging from efficient and scalable algorithms for frequent itemset mining in transaction databases to numerous research frontiers, such as sequential pattern mining, structured pattern mining, correlation mining, associative classification, and frequent pattern-based clustering, as well as their broad applications. In this article, we provide a brief overview of the current status of frequent pattern mining and discuss a few promising research directions. We believe that frequent pattern mining research has substantially broadened the scope of data analysis and will have deep impact on data mining methodologies and applications in the long run. However, there are still some challenging research issues that need to be solved before frequent pattern mining can claim a cornerstone approach in data mining applications.

1,448 citations


Cites background from "Mining frequent spatio-temporal seq..."

  • ...Such optimization ideas can be extended to mining spatiotemporal sequential patterns as well, as shown in Cao et al. (2005)....


Journal ArticleDOI
Yu Zheng
TL;DR: A systematic survey on the major research into trajectory data mining, providing a panorama of the field as well as the scope of its research topics, and introduces the methods that transform trajectories into other data formats, such as graphs, matrices, and tensors.
Abstract: The advances in location-acquisition and mobile computing techniques have generated massive spatial trajectory data, which represent the mobility of a diversity of moving objects, such as people, vehicles, and animals. Many techniques have been proposed for processing, managing, and mining trajectory data in the past decade, fostering a broad range of applications. In this article, we conduct a systematic survey on the major research into trajectory data mining, providing a panorama of the field as well as the scope of its research topics. Following a road map from the derivation of trajectory data, to trajectory data preprocessing, to trajectory data management, and to a variety of mining tasks (such as trajectory pattern mining, outlier detection, and trajectory classification), the survey explores the connections, correlations, and differences among these existing techniques. This survey also introduces the methods that transform trajectories into other data formats, such as graphs, matrices, and tensors, to which more data mining and machine learning techniques can be applied. Finally, some public trajectory datasets are presented. This survey can help shape the field of trajectory data mining, providing a quick understanding of this field to the community.

1,289 citations


Cites background or methods from "Mining frequent spatio-temporal seq..."

  • ...1 Sequential Pattern Mining in a Free Space Line-simplification-based methods: An early solution aiming to deal with the aforementioned issues was proposed in 2005 [11]....


  • ...Line-Simplification-Based Methods: An early solution aiming to deal with the aforementioned issues was proposed in 2005 [Cao et al. 2005]....


Proceedings ArticleDOI
12 Aug 2007
TL;DR: This paper develops an extension of the sequential pattern mining paradigm that analyzes the trajectories of moving objects and introduces trajectory patterns as concise descriptions of frequent behaviours in terms of both space and time.
Abstract: The increasing pervasiveness of location-acquisition technologies (GPS, GSM networks, etc.) is leading to the collection of large spatio-temporal datasets and to the opportunity of discovering usable knowledge about movement behaviour, which fosters novel applications and services. In this paper, we move towards this direction and develop an extension of the sequential pattern mining paradigm that analyzes the trajectories of moving objects. We introduce trajectory patterns as concise descriptions of frequent behaviours, in terms of both space (i.e., the regions of space visited during movements) and time (i.e., the duration of movements). In this setting, we provide a general formal statement of the novel mining problem and then study several different instantiations of different complexity. The various approaches are then empirically evaluated over real data and synthetic benchmarks, comparing their strengths and weaknesses.

1,099 citations


Cites background from "Mining frequent spatio-temporal seq..."

  • ...The work in [3] considers patterns that are in the form...


Proceedings ArticleDOI
04 Nov 2009
TL;DR: The results show that the ST-Matching algorithm significantly outperforms the incremental algorithm in terms of matching accuracy for low-sampling-rate trajectories and that, compared with the AFD-based global algorithm, ST-Matching also improves accuracy as well as running time.
Abstract: Map-matching is the process of aligning a sequence of observed user positions with the road network on a digital map. It is a fundamental pre-processing step for many applications, such as moving object management, traffic flow analysis, and driving directions. In practice there exists huge amount of low-sampling-rate (e.g., one point every 2--5 minutes) GPS trajectories. Unfortunately, most current map-matching approaches only deal with high-sampling-rate (typically one point every 10--30s) GPS data, and become less effective for low-sampling-rate points as the uncertainty in data increases. In this paper, we propose a novel global map-matching algorithm called ST-Matching for low-sampling-rate GPS trajectories. ST-Matching considers (1) the spatial geometric and topological structures of the road network and (2) the temporal/speed constraints of the trajectories. Based on spatio-temporal analysis, a candidate graph is constructed from which the best matching path sequence is identified. We compare ST-Matching with the incremental algorithm and Average-Frechet-Distance (AFD) based global map-matching algorithm. The experiments are performed both on synthetic and real dataset. The results show that our ST-matching algorithm significantly outperform incremental algorithm in terms of matching accuracy for low-sampling trajectories. Meanwhile, when compared with AFD-based global algorithm, ST-Matching also improves accuracy as well as running time.

817 citations

BookDOI
01 Oct 2011
TL;DR: This book presents an overview on both fundamentals and the state-of-the-art research inspired by spatial trajectory data, as well as a special focus on trajectory pattern mining, spatio-temporal data mining and location-based social networks.
Abstract: Spatial trajectories have been bringing the unprecedented wealth to a variety of research communities. A spatial trajectory records the paths of a variety of moving objects, such as people who log their travel routes with GPS trajectories. The field of moving objects related research has become extremely active within the last few years, especially with all major database and data mining conferences and journals. Computing with Spatial Trajectories introduces the algorithms, technologies, and systems used to process, manage and understand existing spatial trajectories for different applications. This book also presents an overview on both fundamentals and the state-of-the-art research inspired by spatial trajectory data, as well as a special focus on trajectory pattern mining, spatio-temporal data mining and location-based social networks. Each chapter provides readers with a tutorial-style introduction to one important aspect of location trajectory computing, case studies and many valuable references to other relevant research work. Computing with Spatial Trajectories is designed as a reference or secondary text book for advanced-level students and researchers mainly focused on computer science and geography. Professionals working on spatial trajectory computing will also find this book very useful.

564 citations

References
Proceedings ArticleDOI
06 Mar 1995
TL;DR: Three algorithms are presented to solve the problem of mining sequential patterns over databases of customer transactions, and empirically evaluating their performance using synthetic data shows that two of them have comparable performance.
Abstract: We are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. We introduce the problem of mining sequential patterns over such databases. We present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. Two of the proposed algorithms, AprioriSome and AprioriAll, have comparable performance, albeit AprioriSome performs a little better when the minimum number of customers that must support a sequential pattern is low. Scale-up experiments show that both AprioriSome and AprioriAll scale linearly with the number of customer transactions. They also have excellent scale-up properties with respect to the number of transactions per customer and the number of items in a transaction.

5,663 citations

Journal ArticleDOI
TL;DR: In this paper, two algorithms to reduce the number of points required to represent the line and, if desired, produce caricatures are presented and compared with the most promising methods so far suggested.
Abstract: All digitizing methods, as a general rule, record lines with far more data than is necessary for accurate graphic reproduction or for computer analysis. Two algorithms to reduce the number of points required to represent the line and, if desired, produce caricatures, are presented and compared with the most promising methods so far suggested. Line reduction will form a major part of automated generalization.

3,749 citations


"Mining frequent spatio-temporal seq..." refers methods in this paper

  • ...The DP (Douglas-Peucker) algorithm [3] is a classical top down approach for this problem....


  • ...Motivated by line simplification techniques ([3]), we represent segments of the spatio-temporal series by directed line segments....


  • ...The segmentation (line simplification) algorithm ([3, 5, 6]) is used to convert the locations series to segments sequences so that each raw sequence segment could be abstracted by a line segment....


Journal ArticleDOI
TL;DR: This work gives efficient algorithms for the discovery of all frequent episodes from a given class of episodes, and presents detailed experimental results that are in use in telecommunication alarm management.
Abstract: Sequences of events describing the behavior and actions of users or systems can be collected in several domains. An episode is a collection of events that occur relatively close to each other in a given partial order. We consider the problem of discovering frequently occurring episodes in a sequence. Once such episodes are known, one can produce rules for describing or predicting the behavior of the sequence. We give efficient algorithms for the discovery of all frequent episodes from a given class of episodes, and present detailed experimental results. The methods are in use in telecommunication alarm management.

1,593 citations


"Mining frequent spatio-temporal seq..." refers background in this paper

  • ...A pattern’s instances cannot overlap in time (the pattern may be over-counted like that in [10] otherwise), i....


  • ...[10] investigated the discovery of frequent episodes from event sequences....


Proceedings ArticleDOI
29 Nov 2001
TL;DR: This paper undertakes the first extensive review and empirical comparison of all proposed techniques for obtaining piecewise linear approximations of time-series data, shows that all of them have fatal flaws from a data-mining perspective, and introduces a novel algorithm that is empirically shown to be superior to all others in the literature.
Abstract: In recent years, there has been an explosion of interest in mining time-series databases. As with most computer science problems, representation of the data is the key to efficient and effective solutions. One of the most commonly used representations is piecewise linear approximation. This representation has been used by various researchers to support clustering, classification, indexing and association rule mining of time-series data. A variety of algorithms have been proposed to obtain this representation, with several algorithms having been independently rediscovered several times. In this paper, we undertake the first extensive review and empirical comparison of all proposed techniques. We show that all these algorithms have fatal flaws from a data-mining perspective. We introduce a novel algorithm that we empirically show to be superior to all others in the literature.

1,193 citations


"Mining frequent spatio-temporal seq..." refers methods in this paper

  • ...[6] provides an online algorithm in splitting a sequence to segments with quite good quality....


  • ...The segmentation (line simplification) algorithm ([3, 5, 6]) is used to convert the locations series to segments sequences so that each raw sequence segment could be abstracted by a line segment....


Journal ArticleDOI
TL;DR: The concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm, is introduced and strong connections between the verification problem and the hypergraph transversal problem are shown.
Abstract: One of the basic problems in knowledge discovery in databases (KDD) is the following: given a data set r, a class L of sentences for defining subgroups of r, and a selection predicate, find all sentences of L deemed interesting by the selection predicate. We analyze the simple levelwise algorithm for finding all such descriptions. We give bounds for the number of database accesses that the algorithm makes. For this, we introduce the concept of the border of a theory, a notion that turns out to be surprisingly powerful in analyzing the algorithm. We also consider the verification problem of a KDD process: given r and a set of sentences S ⊆ L determine whether S is exactly the set of interesting statements about r. We show strong connections between the verification problem and the hypergraph transversal problem. The verification problem arises in a natural way when using sampling to speed up the pattern discovery step in KDD.

952 citations