Mining Frequent Spatio-temporal Sequential Patterns
Huiping Cao, Nikos Mamoulis, and David W. Cheung
Department of Computer Science
The University of Hong Kong
Pokfulam Road, Hong Kong
{hpcao, nikos, dcheung}@cs.hku.hk
Abstract
Many applications track the movement of mobile objects, which can be represented as sequences of timestamped locations. Given such a spatio-temporal series, we study the problem of discovering sequential patterns, which are routes frequently followed by the object. Sequential pattern mining algorithms for transaction data are not directly applicable to this setting. The challenges to address are (i) the fuzziness of locations in patterns, and (ii) the identification of non-explicit pattern instances. In this paper, we define pattern elements as spatial regions around frequent line segments. Our method first transforms the original sequence into a list of sequence segments, and detects frequent regions in a heuristic way. Then, we propose algorithms to find patterns by employing a newly proposed substring tree structure and improving the Apriori technique. A performance evaluation demonstrates the effectiveness and efficiency of our approach.
1 Introduction
The movement of an object (i.e., its trajectory) can be described by a sequence of spatial locations sampled at consecutive timestamps (e.g., with the use of Global Positioning System (GPS) devices). Parts of the object’s routes are often repeated in the archived history of locations. For instance, buses move along series of streets repeatedly, people go to and return from work following more or less the same routes, etc. The movement routes of most objects (e.g., private cars) are not predefined. Even for objects with pre-scheduled paths (e.g., buses), the routes may not be repeated with the same frequency, due to different schedules on weekends or special days. We are interested in finding frequently repeated paths, i.e., spatio-temporal sequential patterns, from a long spatio-temporal sequence. These patterns can help to analyze the past movement or predict the future movement of the object, support approximate queries on the original data, and so on. However, they cannot be obtained straightforwardly by eliminating the noisy movement, because of the large volume of the spatio-temporal data.
Discovery of sequential patterns from transactional databases has attracted considerable interest since Agrawal et al. introduced the problem [1]. In such a database, each transaction contains a set of items bought by some customer at one time, and a transaction sequence is a list of transactions ordered by time. For example, (a, b), (a, c), (b) is a sequence containing three transactions (a, b), (a, c) and (b). Given a collection of transaction sequences, the problem is to find ordered lists of itemsets appearing with high frequency. E.g., (b), (a), (b) is a pattern supported by the above sequence.
Unfortunately, pattern discovery techniques for transactional databases are not readily applicable to finding sequential patterns in spatio-temporal data. First, the elements in a transactional pattern are items that explicitly appear in pattern instances. In contrast, location coordinates in a spatio-temporal series are real numbers, which do not repeat themselves exactly in every pattern instance. Second, the patterns are discovered from explicitly defined sets of sequences, like (a, b), (a, c), (b) in the previous example. Thus, a transaction list contributes only 0 or 1 to the support of a pattern, depending on whether or not the pattern appears in the specific sequence-set. In our setting, however, we detect frequent patterns from one long spatio-temporal sequence, without predefined segmentation of the data. The challenge is to identify the segments that contribute to a pattern, without allowing them to overlap with each other.
To summarize, the main contributions of this paper are: (i) We propose a model for spatio-temporal sequential pattern mining, based on appropriate definitions for pattern elements and pattern instances. (ii) We present an effective method for extracting pattern elements. (iii) We provide efficient pattern mining algorithms for discovering longer patterns. The remainder of the paper is organized as follows. Section 2 reviews the related literature. The formal definition of spatio-temporal sequential patterns is given in Section 3. Section 4 presents our solutions in detail. An experimental evaluation of the effectiveness and efficiency of our approach is presented in Section 5. Finally, Section 6 concludes this paper.
Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05)
1550-4786/05 $20.00 © 2005 IEEE
2 Related work
Our work is most related to pattern discovery from sequential data, which include time series, event sequences, and spatio-temporal trajectories.
Mannila et al. [10] investigated the discovery of frequent episodes from event sequences. An episode is a (partially or totally) ordered list of events, and is thus a variant of a sequential pattern. A fixed sliding window w is used to extract segments (i.e., subsequences) in the event series, and the contribution of every segment to each candidate episode’s frequency is counted. The segments supporting one episode may overlap, which is reasonable since episodes try to capture the order of appearance of instantaneous events. However, this methodology may not give satisfactory results in finding spatio-temporal patterns, for several reasons. First, the window limits the length of the patterns. Second, pattern supports may not be counted correctly. E.g., suppose the object’s movement is aabbcdefg, where each character a, b, etc. corresponds to a spatial region. The number of occurrences of the pattern abc should be 1, since the object moves from a to c once. However, if w is 5, pattern abc has support 4 due to the contribution of 4 segments (a b c, ab c, a bc, and a bc). Third, as opposed to the well-defined categorical values of event instances, object locations do not repeat themselves exactly in pattern instances, for they are usually ordinal and inexact. Yang et al. investigated mining long sequential patterns in [13], also dealing with event series with noise.
Previous work on detecting patterns from time series (e.g., [2, 7]) converted the problem to finding subsequences in lists of categorical data (e.g., event sequences), by preprocessing the original sequence into a string. A window w of fixed size is slid along the sequence, and a subsequence of length w is extracted at every position. In [2], the subsequences are clustered based on their shapes, and each cluster is given an id. In [7], some features are extracted from each subsequence (e.g., the slope of the best-fitting line of the sub-series, the mean of the signal, etc.). The feature space is divided into groups of similar values, and every subsequence is converted to a group-id. The raw sequence is then transformed to a string of cluster-ids or group-ids. The use of the window may over-count the patterns, for the reason explained above. In addition, since w is fixed, the extracted subsequences have the same length, which may affect the resulting patterns. Furthermore, for spatio-temporal data, even when we extract the subsequences using a sliding window and get simple features from these segments, we cannot directly group these features using the methods in [2] and [7]. The cluster-based approach ([2]) has been discredited by [8]. The way of grouping the subsequence features in [7] may be effective for time series with 1-dimensional values. For more complex spatio-temporal data, if we directly apply this method, i.e., split the features into groups, we may miss the information about the spatial proximity of segments, which is essential for grouping.
The first study on finding frequent sequential patterns from spatio-temporal data is [11]. The raw data there is not a long sequence, but lists of spatial locations. After discretizing the locations according to a pre-defined spatial decomposition, the process is intrinsically similar to that in transactional databases.
[9] addresses the problem of discovering periodic patterns in spatio-temporal data, which is a generalization of mining periodic patterns in event sequences. Given a period T, in the case of spatio-temporal data, a periodic pattern is a (not necessarily contiguous) sequence of spatial regions, which appears frequently every T timestamps and describes the object movement (e.g., a bus moves from district a to district b and then to c with high probability, every three hours). The contribution of [9] is that it does not treat spatio-temporal series as event sequences, by merely replacing each location by a predefined region enclosing it, but automatically discovers the regions that form the patterns. This method, although effective for its purpose, relies on a fixed T (i.e., the patterns repeat themselves at regular time periods). In addition, it is sensitive to distortions/shifts of the pattern instances, i.e., periodic segments where the pattern does not appear in the same positions as in the pattern definition do not contribute to the pattern’s support.
3 Spatio-temporal sequential patterns
A spatio-temporal sequence $S$ is a list of locations, $(x_1, y_1, t_1), (x_2, y_2, t_2), \ldots, (x_n, y_n, t_n)$, where $t_i$ represents the timestamp of location $(x_i, y_i)$ $(1 \le i \le n)$. Figure 1 illustrates the movement of an object which repeats a similar route in three runs. We are interested in movement patterns repeated frequently in such a series. This section first motivates our solution, then formally defines the problem.
3.1 Motivation
Locations are not repeated exactly in every instance of a movement pattern. Our idea is to summarize a series of spatial locations as a series of spatial regions.
A naive method is to use a regular grid (or some predefined spatial decomposition) to divide the space into regions, taking a user-defined parameter $G$, the approximate number of intervals into which each axis is split. Then, the location series can be turned into a sequence of grid-ids by a transformation approach. The first method, Grid I, converts each location to the id of the cell it falls in. E.g., the raw series in Figure 1a can be transformed to the cell-id sequence $c_2 c_4 c_8 c_9 c_6 c_2 \ldots c_3$. Although intuitive, this method has two problems. First, we lose the information on how the object moves inside a cell, if the space decomposition is coarse. The patterns may not be very descriptive. Second, for two instances of a pattern, the locations may not fall into the same cell (i.e., two adjacent locations appear in neighboring cells). We may miss some frequent patterns, whose instances are divided between different grid-based patterns. The first problem could be alleviated by using a finer grid (i.e., a larger $G$); however, this would increase the chances of missing patterns due to the second problem. An alternative conversion technique, Grid II, adds the ids of cells that intersect with the line segments connecting consecutive locations to the transformed sequence. In the example of Figure 1a, Grid II converts the sequence for the first run to $c_2 c_1 c_4 c_7 c_8 c_9 c_6 c_3 c_2$. Nevertheless, with this improvement, the new series may be significantly longer than the original one, which may already be extremely long, as spatio-temporal sequences usually are.
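The Grid I conversion described above can be illustrated with a minimal sketch. The function name `grid1_cell_ids`, the bounding-box argument, and the row-by-row cell numbering from the top-left corner (mirroring the 3 × 3 layout of Figure 1a) are assumptions made for illustration; the coordinates are hypothetical.

```python
def grid1_cell_ids(points, bounds, g):
    """Map each (x, y) location to a 1-based cell id on a g x g grid.

    Assumption: cells are numbered row by row, starting at the top-left
    corner, as in the 3 x 3 layout 1 2 3 / 4 5 6 / 7 8 9.
    """
    xmin, ymin, xmax, ymax = bounds
    ids = []
    for x, y in points:
        # Column index from the left; points on the right border go to the last column.
        col = min(int((x - xmin) / (xmax - xmin) * g), g - 1)
        # Row index counted from the bottom, then flipped so that row 0 is the top row.
        row_from_bottom = min(int((y - ymin) / (ymax - ymin) * g), g - 1)
        row = g - 1 - row_from_bottom
        ids.append(row * g + col + 1)
    return ids


# Hypothetical locations inside a 3 x 3 unit grid.
run = [(1.5, 2.5), (0.5, 1.5), (1.5, 0.5), (2.5, 0.5), (2.5, 1.5)]
print(grid1_cell_ids(run, (0.0, 0.0, 3.0, 3.0), 3))
```

Grid II would additionally need the cells crossed by the line between consecutive locations, which is why its output is longer.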
Figure 1. Object Movement: (a) runs 1 to 3 over a 3 × 3 grid with cells numbered 1 to 9; (b) runs 1 to 3 with line segment l.
Thus, we need a better way to abstract the trajectory. Motivated by line simplification techniques ([3]), we represent segments of the spatio-temporal series by directed line segments. Figure 1b shows that the line segment l summarizes the first three points in each of the three runs with little error. In this way, not only do we compress the original data, decreasing the mining effort, but the derived line segments (which approximately describe movement) also provide initial seeds for defining the spatial regions, which can be expanded later by merging similar and close segments.
3.2 Problem definition
A segment $s_{ij}$ in a spatio-temporal sequence $S$ $(1 \le i < j \le n)$ is a contiguous subsequence of $S$, starting from $(x_i, y_i, t_i)$ and ending at $(x_j, y_j, t_j)$. Given $s_{ij}$, we define its representative line segment $\vec{l}_{ij}$ with starting point $(x_i, y_i)$ and ending point $(x_j, y_j)$. Let $\epsilon$ be a distance error threshold. $s_{ij}$ complies with $\vec{l}_{ij}$ with respect to $\epsilon$, denoted as $s_{ij} \propto \vec{l}_{ij}$, if $\mathrm{dist}((x_k, y_k), \vec{l}_{ij}) \le \epsilon$ for all $k$ $(i \le k \le j)$, where $\mathrm{dist}((x_k, y_k), \vec{l})$ is the distance between $(x_k, y_k)$ and line segment $\vec{l}$. When $s_{ij} \propto \vec{l}_{ij}$, each point $(x_k, y_k)$, $i \le k \le j$, in $s_{ij}$ can be projected to a point $(x'_k, y'_k)$ on $\vec{l}_{ij}$; $(x'_k, y'_k)$ implicitly denotes the projection of $(x_k, y_k)$ to $\vec{l}_{ij}$. Figure 2a illustrates a segment $s_{ij}$ complying with $\vec{l}_{ij}$ and shows the projection $(x'_k, y'_k)$ of point $(x_k, y_k)$ on $\vec{l}_{ij}$. A segmental decomposition $S^s$ of $S$ is defined by a list of consecutive segments that constitute $S$. Formally, $S^s = s_{k_0 k_1} s_{k_1 k_2} \ldots s_{k_{m-1} k_m}$, $k_0 = 1$, $k_m = n$, $m < n$, where $s_{k_i k_{i+1}} \propto \vec{l}_{k_i k_{i+1}}$ for all $i$. To simplify notation, we use $s_0 s_1 \ldots s_{m-1}$ to denote $S^s$.
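As an illustrative sketch of the compliance test, the following code checks whether every point of a segment lies within a distance threshold of the line segment joining its first and last points. The helper names `point_segment_distance` and `complies` are hypothetical, timestamps are omitted, and the distance is taken to be the Euclidean distance to the closest point of the line segment, which is an assumption about how dist is evaluated.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Parameter of the projection of p onto the line, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def complies(points, eps):
    """True if all points are within eps of the segment joining the end points."""
    start, end = points[0], points[-1]
    return all(point_segment_distance(p, start, end) <= eps for p in points)


# Hypothetical (x, y) locations of one segment.
segment = [(0.0, 0.0), (1.0, 0.4), (2.0, -0.3), (3.0, 0.0)]
print(complies(segment, 0.5), complies(segment, 0.35))
```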
Let $\vec{l}$ represent a directed line segment, and let $\vec{l}.angle$ and $\vec{l}.len$ be its slope angle and length, respectively. Two line segments $\vec{l}_{ij}$ and $\vec{l}_{gh}$ representing segments $s_{ij}$ and $s_{gh}$ are similar, denoted by $\vec{l}_{ij} \sim \vec{l}_{gh}$, with respect to angle difference threshold $\theta$ and length factor $f$ $(0 \le f \le 1)$ if:
(i) $|\vec{l}_{ij}.angle - \vec{l}_{gh}.angle| \le \theta$ and
(ii) $|\vec{l}_{ij}.len - \vec{l}_{gh}.len| \le f \times \max(\vec{l}_{ij}.len, \vec{l}_{gh}.len)$.
If $\vec{l}_{ij} \sim \vec{l}_{gh}$, $s_{ij}$ and $s_{gh}$ are also treated as similar to each other. Note that similarity is symmetric. The location information of segments is not considered in defining similarity, since we use it when defining the segments’ closeness.
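The two similarity conditions can be sketched as follows. The function names are hypothetical, angles are in radians, and the wrap-around handling of the angle difference is an added assumption that the definition above does not state.

```python
import math


def angle_and_length(seg):
    """Slope angle (radians) and length of a directed segment ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1), math.hypot(x2 - x1, y2 - y1)


def similar(seg_a, seg_b, theta, f):
    """Check conditions (i) and (ii) for two directed line segments."""
    ang_a, len_a = angle_and_length(seg_a)
    ang_b, len_b = angle_and_length(seg_b)
    diff = abs(ang_a - ang_b)
    # Assumption: angles near +pi and -pi are treated as neighbours.
    diff = min(diff, 2.0 * math.pi - diff)
    return diff <= theta and abs(len_a - len_b) <= f * max(len_a, len_b)


a = ((0.0, 0.0), (10.0, 0.0))
b = ((0.0, 1.0), (9.0, 2.0))
b_reversed = ((9.0, 2.0), (0.0, 1.0))
print(similar(a, b, 0.2, 0.2), similar(a, b_reversed, 0.2, 0.2))
```

Because both conditions use absolute differences and a maximum, swapping the arguments gives the same answer, which matches the remark that similarity is symmetric. Reversing the direction of a segment changes its angle, so the reversed segment is not similar to the original one.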
Line segment $\vec{l}_{ij}$ is close to $\vec{l}_{gh}$ if, for all $(x_k, y_k) \in \vec{l}_{ij}$, $\mathrm{dist}((x_k, y_k), \vec{l}_{gh}) \le \epsilon$. When $\vec{l}_{ij}$ is close to $\vec{l}_{gh}$, we also say that the segment $s_{ij}$ is close to the segment $s_{gh}$, where $s_{ij} \propto \vec{l}_{ij}$ and $s_{gh} \propto \vec{l}_{gh}$. As opposed to similarity, closeness is asymmetric. Figure 2b shows an example. Let $\vec{l}_{ij}$ be parallel to $\vec{l}_{gh}$ and $\epsilon = 5.0$. The distance between these two parallel line segments is 4.5. Observe that $\vec{l}_{ij}$ is close to $\vec{l}_{gh}$ because the distance from each point in $\vec{l}_{ij}$ to $\vec{l}_{gh}$ is less than 5.0. However, $\vec{l}_{gh}$ is not close to $\vec{l}_{ij}$, because the point in the upper right part has a distance to $\vec{l}_{ij}$ greater than 5.0.
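The asymmetry of closeness can be reproduced with a small sketch. The coordinates below are hypothetical and only mimic the configuration of Figure 2b (two parallel segments 4.5 apart, threshold 5.0). Checking the two end points is sufficient here because the distance to a line segment is a convex function, so its maximum over another segment is attained at an end point.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def is_close(seg_a, seg_b, eps):
    """True if every point of seg_a is within eps of seg_b (end points suffice)."""
    return all(point_segment_distance(p, seg_b[0], seg_b[1]) <= eps for p in seg_a)


l_ij = ((2.0, 4.5), (8.0, 4.5))   # short segment
l_gh = ((0.0, 0.0), (20.0, 0.0))  # longer parallel segment, 4.5 below
print(is_close(l_ij, l_gh, 5.0))  # the short segment is close to the long one
print(is_close(l_gh, l_ij, 5.0))  # the far end of the long segment is too distant
```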
Let $L$ be a set of segments from sequence $S^s$. The mean line segment for $L$, $\vec{l}_c$, is a line segment that best fits all the points in $L$ with the minimum sum of squared errors (SSE). In other words, if $PSet$ contains all the points of the segments in $L$, the mean line segment $\vec{l}_c$ is such that $\sum_{p \in PSet} \mathrm{dist}(p, \vec{l}_c) \le \sum_{p \in PSet} \mathrm{dist}(p, \vec{l})$, $\forall \vec{l} \ne \vec{l}_c$.
Let $tol$ be the average orthogonal distance of all the points in $L$ to $\vec{l}_c$. A spatial pattern element is a rectangular spatial region $r_L$ with four sides determined by $(\vec{l}_c, tol)$ as follows: (1) the two sides of $r_L$ that are parallel to $\vec{l}_c$ have the same length as $\vec{l}_c$, and their distances to $\vec{l}_c$ are $tol$; (2) the other two sides, perpendicular to $\vec{l}_c$, have length $2 \cdot tol$, and their midpoints are the two end points of $\vec{l}_c$. We refer to $\vec{l}_c$ as the central line segment of region $r_L$. We say that region $r_L$ contains $k$ segments, or that $k$ segments contribute to $r_L$, if $L$ consists of $k$ segments. Figure 2c visualizes this definition.
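One possible way to obtain a central line segment, the tolerance, and the rectangle is sketched below. The fitting procedure is an assumption: it uses an orthogonal least-squares line through the centroid, clipped to the extreme projections of the points, and it ignores segment direction and timestamps. The names `mean_line_segment` and `region_corners` are hypothetical.

```python
import math


def mean_line_segment(points):
    """Fit a line by orthogonal least squares; return (start, end, tol)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Direction of the principal axis of the point cloud.
    ang = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    ux, uy = math.cos(ang), math.sin(ang)
    # Clip the fitted line to the extreme projections of the points.
    proj = [(x - cx) * ux + (y - cy) * uy for x, y in points]
    lo, hi = min(proj), max(proj)
    start = (cx + lo * ux, cy + lo * uy)
    end = (cx + hi * ux, cy + hi * uy)
    # Average orthogonal distance of the points to the fitted line.
    tol = sum(abs((y - cy) * ux - (x - cx) * uy) for x, y in points) / n
    return start, end, tol


def region_corners(start, end, tol):
    """Corners of the rectangle of half-width tol around the central segment."""
    (x1, y1), (x2, y2) = start, end
    length = math.hypot(x2 - x1, y2 - y1)
    nx, ny = -(y2 - y1) / length, (x2 - x1) / length  # unit normal
    return [(x1 + tol * nx, y1 + tol * ny), (x2 + tol * nx, y2 + tol * ny),
            (x2 - tol * nx, y2 - tol * ny), (x1 - tol * nx, y1 - tol * ny)]


pts = [(0.0, 1.0), (0.0, -1.0), (10.0, 1.0), (10.0, -1.0)]
start, end, tol = mean_line_segment(pts)
print(start, end, tol, region_corners(start, end, tol))
```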
A spatio-temporal sequential pattern $P$ is an ordered sequence of pattern elements: $r_1 r_2 \ldots r_q$ $(1 \le q \le m)$. The length of pattern $P$ is the number of regions in it.
A contiguous subsequence of $S^s$, $s_i s_{i+1} \ldots s_{i+q-1}$, is a pattern instance for $P: r_1 r_2 \ldots r_q$ if, $\forall j$ $(1 \le j \le q)$, the representative line segment for segment $s_{i+j-1}$ is similar and close to the central line segment of region $r_j$. A pattern’s instances cannot overlap in time (otherwise the pattern may be over-counted, as in [10]), i.e., if two contiguous subsequences of $S^s$, $s_i \ldots s_j$ and $s_g \ldots s_h$, are two instances for pattern $P$, then either $j < g$ or $h < i$. Given patterns $P': r'_1 r'_2 \ldots r'_i$ and $P: r_1 r_2 \ldots r_j$, $P'$ is a subpattern of $P$ if $i \le j$ and $\exists k$ $(1 \le k \le j - i + 1)$ such that $r'_1 = r_k$, $r'_2 = r_{k+1}$, ..., $r'_i = r_{k+i-1}$. $P$ is a superpattern of $P'$.
Figure 2. Example of definitions: (a) segment complies with $\vec{l}_{ij}$; (b) example for closeness; (c) region $r$ determined by $(\vec{l}_c, tol)$.
The support of a pattern $P$ is the number of instances supporting $P$. Given a support threshold $min\_sup$, $P$ is frequent if its support exceeds $min\_sup$. Since a pattern with the same frequency as one of its superpatterns is redundant, we focus on detecting closed frequent patterns [4], for which no proper superpattern has equal frequency. The mining problem is to find frequent patterns from a long spatio-temporal sequence $S$ with respect to a support threshold $min\_sup$, and subject to a segmenting distance error threshold $\epsilon$, a similarity parameter $\theta$ and a length factor $f$. The parameter values depend on the application domain, or can be tuned as part of the mining process [2]. Section 5.1 discusses how to set the parameters in practice when discovering patterns from the raw data.
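The support and subpattern definitions can be made concrete with a short sketch. It assumes that segments have already been matched to regions, so that the data is a list of region ids, and it counts non-overlapping instances with a greedy left-to-right scan; both the representation and the scan order are assumptions for illustration, and the function names are hypothetical.

```python
def count_support(region_seq, pattern):
    """Count non-overlapping contiguous occurrences of pattern in region_seq."""
    q, count, i = len(pattern), 0, 0
    while i + q <= len(region_seq):
        if region_seq[i:i + q] == pattern:
            count += 1
            i += q  # jump past this instance so that instances cannot overlap
        else:
            i += 1
    return count


def is_subpattern(sub, pattern):
    """True if sub occurs as a contiguous run of elements inside pattern."""
    return any(pattern[k:k + len(sub)] == sub
               for k in range(len(pattern) - len(sub) + 1))


seq = [1, 2, 3, 4, 1, 3, 4, 2, 3, 4, 1, 2, 3, 4]  # hypothetical region ids
print(count_support(seq, [3, 4]), count_support(seq, [1, 2, 3, 4]))
print(count_support([1, 1, 1], [1, 1]))  # overlapping matches are counted once
print(is_subpattern([3, 4], [1, 2, 3, 4]), is_subpattern([2, 4], [1, 2, 3, 4]))
```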
4 Solution
In this section, we describe how to discover frequent singular patterns, i.e., frequent spatial regions (Section 4.1), and longer closed patterns (Section 4.2).
4.1 Discovering frequent singular patterns
The segmentation (line simplification) algorithm ([3, 5, 6]) is used to convert the location series to a sequence of segments, so that each raw sequence segment can be abstracted by a line segment. Our idea is to transform $S$ to $S^s$ using such a technique, and take the segments obtained as seeds for the desired spatial regions, whose central line segments best fit the points of segments in the regions. The DP (Douglas-Peucker) algorithm [3] is a classical top-down approach for this problem. [6] provides an online algorithm for splitting a sequence into segments with quite good quality. Since it is important to keep the internal movement inside a region, we need to capture the sharp turns of the movement in the transformation. We employ the DP method because it has been shown to be the best algorithm for choosing splitting points [12]. In brief, the DP algorithm recursively decomposes $S: \{p_1, \ldots, p_n\}$ into a series of line segments $l_1, \ldots, l_m$, $m \le n$, each of which, $l_i$, simplifies a subsequence $S_{l_i}$, such that the perpendicular distance from every point in $S_{l_i}$ to $l_i$ is at most $\epsilon$. For efficiency, DP’s improved version ([5]) could be adopted.
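A minimal sketch of the basic Douglas-Peucker decomposition is given below. It is the textbook top-down variant, not the improved version of [5], and it returns the indices of the chosen split points; the function names and the sample trajectory are hypothetical.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / norm


def douglas_peucker(points, eps):
    """Return the sorted indices of the points kept as segment end points."""
    keep = [0, len(points) - 1]
    stack = [(0, len(points) - 1)]
    while stack:
        i, j = stack.pop()
        worst, idx = 0.0, -1
        # Find the interior point farthest from the line joining points i and j.
        for k in range(i + 1, j):
            d = point_line_distance(points[k], points[i], points[j])
            if d > worst:
                worst, idx = d, k
        if worst > eps:
            # Split at the farthest point and process both halves.
            keep.append(idx)
            stack.append((i, idx))
            stack.append((idx, j))
    return sorted(keep)


trajectory = [(0, 0), (1, 0.1), (2, 0), (3, 3), (4, 6), (5, 6.1), (6, 6)]
print(douglas_peucker(trajectory, 0.5))
```

Consecutive indices in the result delimit the segments of the decomposition, so the sharp turns of the movement end up as split points.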
Discovering frequent singular patterns from $S^s$ is a hard problem, since in the worst case, all combinations of segments in $S^s$ have to be considered as candidates. To expedite the process, we employ a heuristic, Growing. Let $Segs$ be a set initially containing all the segments in $S^s$. Growing works as follows. It selects the segment $s$ with median length, i.e., the median of the lengths of the segments in $Segs$, as the seed for the initial spatial region $r$. Then, $r$ is grown by merging other segments in $Segs$ through filtering and verification steps, described later. Next, from the set of remaining segments not merged into $r$, the segment with median length is selected as the seed for growing. Finally, the overall algorithm terminates after all segments (i) have been assigned to a region (as initial seeds or to the region of another seed), or (ii) have been found not to belong to any frequent region and marked as outliers. Selecting the segment with median length as the seed helps to absorb short segments with less error, compared to taking a longer segment as the seed. Meanwhile, it prevents generating regions with too fine a granularity, which could happen when a shorter segment is used as the seed. With this seed selection procedure, Growing is deterministic.
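The seed selection step can be sketched as follows. The text does not say which segment to take when the number of segments is even or when lengths tie, so the lower median and a stable sort are assumptions chosen here to keep the choice deterministic; the names are hypothetical.

```python
import math


def segment_length(seg):
    """Length of a line segment given as ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)


def pick_seed(segs):
    """Return the segment with median length (lower median for even counts)."""
    ordered = sorted(segs, key=segment_length)  # stable: ties keep input order
    return ordered[(len(ordered) - 1) // 2]


segs = [((0, 0), (1, 0)), ((0, 0), (5, 0)), ((0, 0), (3, 0))]
print(pick_seed(segs))  # the segment of length 3
```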
The filtering process checks two conditions. First, for each $s_i$ in $Segs$, the angle difference $diffa_i$ between $\vec{l}_s$ (the representative line segment of the seed $s$) and $s_i$ is computed, and $s_i$ is treated as a candidate if $diffa_i$ is less than $\theta$. All the candidate segments are put into a set $C$. Second, the minimum distance from every segment in $C$ to $\vec{l}_s$ is computed, and all segments whose minimum distance to $\vec{l}_s$ is larger than $f \cdot \vec{l}_s.len$ are pruned. The remaining segments in $C$ will be used for verification.
The filtering step computes the minimum distance between segments, but it does not consider the length difference (second condition of similarity) between each $\vec{l}_{s_i}$ ($s_i \in C$) and $\vec{l}_s$, or the exact spatial distances of segments in $C$ to $\vec{l}_s$ (closeness condition). In the verification step, Algorithm 1 (shown below) merges the segments in $C$ into the spatial region $r$ around $\vec{l}_s$, if $s_i \in C$ satisfies the closeness and length difference conditions. Otherwise, we extract from $s_i$ the part that satisfies the conditions, and merge this part with $r$. The remaining part of $s_i$ is a new segment and is inserted back into $Segs$ (Line 15) for later processing.
Algorithm 1 Verification($\vec{l}_s$, $C$, $Segs$, $f$, $min\_sup$)
1: $\alpha := \vec{l}_s.len \times f$; $m := 0$;
2: // length check
3: for each segment $s_i$ in $C$ do
4:     intersect $s_i$ with $\vec{l}_s$, get $s'$ and $\vec{l}_{s'}$;
5:     if ($diff(\vec{l}_{s'}.len, \vec{l}_s.len) \le \alpha$) $m$++;
6: end for
7: // closeness check
8: while ($m \ge min\_sup$) do
9:     Get $\vec{l}_c$ from all intersected points for region $r$;
10:     Validate all intersected parts from $C$;
11:     if (all intersected parts are close to $\vec{l}_c$) break;
12: end while
13: if ($m < min\_sup$) return;
14: for each segment $s_i$ in $C$ do
15:     Add non-intersected part of $s_i$ to $Segs$;
16:     Remove $s_i$ from $Segs$;
17: end for
18: Remove segment that $\vec{l}_s$ represents from $Segs$;
We explain how we compute the intersected part of $s_i$ and $\vec{l}_s$ in Line 4. Let $\vec{l}_{s_i}$ be the representative line segment for $s_i$. For every projection point $(x'_k, y'_k)$ in $\vec{l}_{s_i}$ whose distance to $\vec{l}_s$ is no more than $\alpha$ (Line 1), the corresponding location point $(x_k, y_k)$ in the segment is put into the intersected part $s'$. The line segment created by mapping each point in $s'$ to $\vec{l}_{s_i}$ is denoted as $\vec{l}_{s'}$. For example, let $s_i$ represent segment $(x_{10}, y_{10}, t_{10}), \ldots, (x_{30}, y_{30}, t_{30})$. Assume that the distances from points in $\vec{l}_{s_i}$ to $\vec{l}_s$ are all smaller than $\alpha$ except points from $(x'_{10}, y'_{10})$ to $(x'_{15}, y'_{15})$. Then, $s'$ is segment $(x_{16}, y_{16}, t_{16}), \ldots, (x_{30}, y_{30}, t_{30})$, and $\vec{l}_{s'}$ represents the line segment from $(x'_{16}, y'_{16})$ to $(x'_{30}, y'_{30})$ in $\vec{l}_{s_i}$.
4.2 Deriving longer patterns
After finding frequently visited spatial regions, the original data $S$ is converted to a series $S^R$ of spatial regions by changing the segments in frequent regions to region ids, and those not in any region to outliers. $S^R$ preserves the motion continuity of the object by showing how it moves among regions. Although each region in $S^R$ is repeated frequently, the concatenation of some regions may not be frequent. E.g., a person living in $r_1$ often goes to a place $r_2$ on some days and to region $r_3$ on other days. $r_1$, $r_2$ and $r_3$ are frequently visited, but the path $r_2 r_3$ is not frequent. This section discusses how to detect the longer frequent patterns.
4.2.1 Level-wise mining
A direct way is to perform level-wise pattern mining. However, this approach suffers from the disadvantage that $S^R$ needs to be scanned many times. We propose solutions to reduce the number of candidates and scans in probing long candidates, based on the following properties we observe.
Property 1 (Connectivity Constraint): Due to the continuity of object movement, a spatial region can only connect to some, but not all, of the others in $S^R$. This constraint can help reduce the number of generated candidates, as follows. We first construct a connectivity graph for all the spatial regions in $S^R$. A directed edge from $r_i$ to $r_j$ is added to the graph if the substring $r_i r_j$ appears in the sequence. The edge weight is the frequency with which $r_i r_j$ appears in the sequence. Let $r_1 r_2 \ldots r_k$ be a frequent pattern. If $r_k$ only points to $r_i$ and $r_j$, only two candidates, $r_1 r_2 \ldots r_k r_i$ and $r_1 r_2 \ldots r_k r_j$, are generated. Further, if the edge weight from $r_k$ to some element, say $r_i$, is no more than $min\_sup$, we need not generate candidate $r_1 r_2 \ldots r_k r_i$.
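A small sketch of the connectivity graph and of candidate generation under Property 1 is shown below. Regions are represented by integer ids and outliers are not modelled; these choices and the function names are assumptions made for illustration.

```python
from collections import Counter


def connectivity_graph(region_seq):
    """Edge weights: how often the substring (r_i, r_j) appears in region_seq."""
    return Counter(zip(region_seq, region_seq[1:]))


def extend_candidates(pattern, graph, min_sup):
    """Extend pattern by one region along edges whose weight exceeds min_sup."""
    last = pattern[-1]
    return [pattern + [nxt]
            for (src, nxt), weight in sorted(graph.items())
            if src == last and weight > min_sup]


seq = [1, 2, 3, 4, 1, 3, 4, 2, 3, 4, 1, 2, 3, 4]  # hypothetical region ids
graph = connectivity_graph(seq)
print(graph[(3, 4)], graph[(4, 2)])
print(extend_candidates([1, 2, 3], graph, 1))
print(extend_candidates([4], graph, 1))  # the edge 4 -> 2 is too light
```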
Property 2 (Closeness Property): Given a pattern $P$, suppose its last element connects to $r_1$, $r_1$ connects to $r_2$, ..., $r_{m-1}$ connects to $r_m$ $(m \ge 2)$. We can get patterns $P_1 = P r_1$ (concatenating $P$ and $r_1$), $P_2 = P r_1 r_2$, ..., $P_m = P r_1 r_2 \ldots r_m$. Since extending a pattern cannot increase its support, if $P_1$ and $P_m$ have the same support, any $P_i$ $(1 < i < m)$ also has the same support. This property helps to generate candidates more efficiently. Let $result$ be the set of frequent patterns at the end of the $k$th scan and $P$ be a pattern in it with last element $r$. We can extend $P$ using other patterns in $result$ that start with $r$. For instance, let $P = r_1 r_2 r_3$, and let $r_3$ only connect to $r_4$ in the connectivity graph. In addition, assume that $result$ contains only one pattern starting from $r_3$: $P' = r_3 r_4 r_6 r_7$. $P$ can then be extended to candidates $r_1 r_2 r_3 r_4$ (using Property 1), and $r_1 r_2 r_3 r_4 r_6 r_7$ (using Property 2). If $r_1 r_2 r_3 r_4$ and $r_1 r_2 r_3 r_4 r_6 r_7$ have the same support after the counting, we only need to consider candidates longer than $r_1 r_2 r_3 r_4 r_6 r_7$ later, significantly reducing the number of scans.
4.2.2 Mining using the substring tree
We propose a substring tree structure to facilitate counting of long substrings with different elements. The substring tree is a rooted directed tree whose root links to multiple substring sub-trees. Each node in a sub-tree consists of a pattern element and a counter, which counts the number of substrings (i.e., subsequences of elements) that contribute to the pattern formed by the path from the root to this node. A substring tree example is shown in Figure 3a.
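The node layout described above (a pattern element plus a counter, under a common root) might be represented as in the sketch below. The trie-style insertion, which increments the counter of every node along the inserted path, is an assumption made for illustration and may differ from the update rule actually used for the substring tree; the class and function names are hypothetical.

```python
class SubstringTreeNode:
    """A node holding a pattern element, a counter, and child links."""

    def __init__(self, element=None):
        self.element = element  # None for the root
        self.count = 0
        self.children = {}


def insert_substring(root, substring):
    """Insert a substring, incrementing the counter of each node on its path."""
    node = root
    for element in substring:
        if element not in node.children:
            node.children[element] = SubstringTreeNode(element)
        node = node.children[element]
        node.count += 1  # assumption: every prefix of the substring is counted


def path_count(root, pattern):
    """Counter stored at the node reached by following pattern from the root."""
    node = root
    for element in pattern:
        node = node.children.get(element)
        if node is None:
            return 0
    return node.count


root = SubstringTreeNode()
insert_substring(root, [1, 2, 3, 4])
insert_substring(root, [1, 3, 4])
print(path_count(root, [1]), path_count(root, [1, 2]), path_count(root, [2]))
```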
To construct the tree, while scanning $S^R$, we extract substrings containing distinct elements, and insert them into the tree. On seeing an element $r$ in $S^R$, we concatenate it to the substrings found so far that do not contain $r$. Also, if no substring starting with $r$ is found, $r$ is treated as a new substring. We give an example to illustrate the extraction of substrings. Let $S^R$ be $r_1 r_2 r_3 r_4 r_1 r_3 r_4 r_2 r_3 r_4 r_1 r_2 r_3 r_4$. Initially, no substring is extracted. When we see the first $r_1$, we create a new substring for it. On seeing the second element $r_2$, we create a new substring $r_2$ since no substring starting with $r_2$ exists. In addition, we concatenate it to the only substring $r_1$ and get $r_1 r_2$. The process continues until we see the fifth element $r_1$. There is already a string $r_1 r_2 r_3 r_4$ with $r_1$ as first element, so $r_1 r_2 r_3 r_4$ is inserted into the tree,
References
Mining sequential patterns. Proceedings article.
Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Journal article.
Discovery of Frequent Episodes in Event Sequences. Journal article.
An online algorithm for segmenting time series. Proceedings article.
Levelwise Search and Borders of Theories in Knowledge Discovery. Journal article.