On The Marriage of Lp-norms and Edit Distance
Lei Chen
School of Computer Science
University of Waterloo
l6chen@uwaterloo.ca
Raymond Ng
Department of Computer Science
University of British Columbia
rng@cs.ubc.ca
Abstract
Existing studies on time series are based on
two categories of distance functions. The first
category consists of the Lp-norms. They are
metric distance functions but cannot support
local time shifting. The second category con-
sists of distance functions which are capable
of handling local time shifting but are non-
metric. The first contribution of this paper
is the proposal of a new distance function,
which we call ERP (“Edit distance with Real
Penalty”). Representing a marriage of L1-
norm and the edit distance, ERP can support
local time shifting, and is a metric.
The second contribution of the paper is the de-
velopment of pruning strategies for large time
series databases. Given that ERP is a met-
ric, one way to prune is to apply the trian-
gle inequality. Another way to prune is to
develop a lower bound on the ERP distance.
We propose such a lower bound, which has
the nice computational property that it can
be efficiently indexed with a standard B+-
tree. Moreover, we show that these two ways
of pruning can be used simultaneously for
ERP distances. Specifically, the false posi-
tives obtained from the B+-tree can be further
minimized by applying the triangle inequal-
ity. Based on extensive experimentation with
existing benchmarks and techniques, we show
that this combination delivers superb pruning
power and search time performance, and dom-
inates all existing strategies.
Permission to copy without fee all or part of this material is
granted provided that the copies are not made or distributed for
direct commercial advantage, the VLDB copyright notice and
the title of the publication and its date appear, and notice is
given that copying is by permission of the Very Large Data Base
Endowment. To copy otherwise, or to republish, requires a fee
and/or special permission from the Endowment.
Proceedings of the 30th VLDB Conference,
Toronto, Canada, 2004
1 Introduction
Many applications require the retrieval of similar time series. Examples include financial data analysis and market prediction [1, 2, 10], moving object trajectory determination [6] and music retrieval [31]. Studies in this area revolve around two key issues: the choice of a distance function (similarity model), and the mechanism to improve retrieval efficiency.
Concerning the first issue, many distance functions have been considered, including Lp-norms [1, 10], dynamic time warping (DTW) [30, 18, 14], longest common subsequence (LCSS) [4, 25] and edit distance on real sequence (EDR) [6]. Lp-norms are easy to compute. However, they cannot handle local time shifting, which is essential for time series similarity matching. DTW, LCSS and EDR have been proposed precisely to deal with local time shifting. However, they are non-metric distance functions.
This leads to the second issue of improving retrieval
efficiency. Specifically, non-metric distance functions
complicate matters, as the violation of the triangle in-
equality renders most indexing structures inapplica-
ble. To this end, studies on this topic propose various
lower bounds on the actual distance to guarantee no
false dismissals [30, 18, 14, 31]. However, those lower
bounds can admit a high percentage of false positives.
In this paper, we consider both issues and explore
the following questions.
Is there a way to combine Lp-norms and the other distance functions so that we can get the best of both worlds, namely being able to support local time shifting and being a metric distance function?
With such a metric distance function, we can apply the triangle inequality for pruning, but can we develop a lower bound for the distance function? If so, is lower bounding more efficient than applying the triangle inequality? Or, is it possible to do both?
Our contributions are as follows:
We propose in Section 3 a distance function which we call Edit distance with Real Penalty (ERP). It can be viewed as a variant of the L1-norm, except that it can support local time shifting. It can also be viewed as a variant of EDR and DTW, except that it is a metric distance function. We present benchmark results showing that this distance function is natural for time series data.
We propose in Section 4 a new lower bound for ERP, which can be efficiently indexed with a standard B+-tree. Given that ERP is a metric distance function, we can also apply the triangle inequality. We present benchmark results in Section 5 comparing the efficiency of lower bounding versus applying the triangle inequality.
Last but not least, we develop in Section 4 a k-nearest neighbor (k-NN) algorithm that applies both lower bounding and the triangle inequality. We give extensive experimental results in Section 5 showing that this algorithm gets the best of both paradigms, delivers superb retrieval efficiency and dominates all existing strategies.
2 Related Work
Many studies on similarity-based retrieval of time se-
ries were conducted in the past decade. The pioneering
work by Agrawal et al. [1] used Euclidean distance to
measure similarity. Discrete Fourier Transform (DFT)
was used as a dimensionality reduction technique for
time series data, and an R-tree was used as the index
structure. Faloutsos et al. [10] extended this work to
allow subsequence matching and proposed the GEMINI framework for indexing time series. The key is the use of a lower bound on the true distance to guarantee no false dismissals when the index is used as a filter.
Subsequent work has focused on two main aspects: new dimensionality reduction techniques (assuming that the Euclidean distance is the similarity measure), and new approaches for measuring the similarity between two time series. Examples of dimensionality reduction techniques include Singular Value Decomposition [19], Discrete Wavelet Transform [20, 22], Piecewise Aggregate Approximation [15, 29], and Adaptive Piecewise Constant Approximation [14].
The motivation for seeking new similarity measures is that the Euclidean distance is very weak at handling noise and local time shifting. Berndt and Clifford [3] introduced DTW to allow a time series to be
“stretched” to provide a better match with another
time series. Das et al. [9] and Vlachos et al. [25] ap-
plied the LCSS measure to time series matching. Chen
et al. [6] applied EDR to trajectory data retrieval and
proposed a dimensionality reduction technique via a
symbolic representation of trajectories. However, none
of DTW, LCSS and EDR is a metric distance function
for time series.
Most of the approaches on indexing time series follow the GEMINI framework. However, if the distance measure is a metric, then existing indexing structures proposed for metrics may be applicable. Examples include the MVP-tree [5], the M-tree [8], the Sa-tree [21], and the OMNI-family of access methods [11]. A survey of metric space indexing is given in [7]. In our experiments, we pick M-trees and OMNI-sequential as the strawman structures for comparison; MVP-trees and Sa-trees are not compared because they are main memory resident structures. The other access methods of the OMNI-family are not used because the dimensionality of OMNI-coordinates is high (e.g., 20), which may lead to the dimensionality curse [28]. In general, a common strategy to apply the triangle inequality for pruning is to use a set of reference points (time series in this case). Different studies propose different ways to choose the reference points. In our experiments, we compare our strategies in selecting reference points with the HF algorithm of the OMNI-family.

Symbol             Meaning
S                  a time series $[s_1, \ldots, s_n]$
Rest(S)            $[s_2, \ldots, s_n]$
$dist(s_i, r_i)$   the distance between two elements
$\tilde{S}$        S after being aligned with another series
DLB                a lower bound of the distance

Figure 1: Meanings of Symbols Used
3 Edit Distance With Real Penalty
3.1 Reviewing Existing Distance Functions
A time series S is defined as a sequence of real values, with each value $s_i$ sampled at a specific time, i.e., $S = [s_1, s_2, \ldots, s_n]$. The length of S is n, and the n values are referred to as the n elements. This sequence is called the raw representation of the time series. Given S, we can normalize it using its mean ($\mu$) and standard deviation ($\sigma$) [13]: $Norm(S) = [\frac{s_1 - \mu}{\sigma}, \frac{s_2 - \mu}{\sigma}, \ldots, \frac{s_n - \mu}{\sigma}]$. Normalization is recommended so that the distance between two time series is invariant to amplitude scaling and (global) shifting of the time series. Throughout this paper, we use S to denote Norm(S) for simplicity, even though all the results developed below apply to the raw representation as well. Figure 1 summarizes the main symbols used in this paper.
Given two time series R and S of the same length n, the L1-norm distance between R and S is $\sum_{i=1}^{n} dist(r_i, s_i) = \sum_{i=1}^{n} |r_i - s_i|$. This distance function satisfies the triangle inequality and is a metric. The problem in using the L1-norm for time series is that it requires the time series to be of the same length and does not support local time shifting.
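To make these preliminaries concrete, here is a minimal Python sketch of normalization and the L1-norm distance (the function names norm and l1_dist are ours, and plain lists of floats are assumed; this is an illustration, not the paper's code):

```python
import math

def norm(s):
    # Normalize a series by its mean and standard deviation (assumes sigma > 0).
    mu = sum(s) / len(s)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in s) / len(s))
    return [(x - mu) / sigma for x in s]

def l1_dist(r, s):
    # L1-norm distance; both series must have the same length.
    assert len(r) == len(s)
    return sum(abs(ri - si) for ri, si in zip(r, s))
```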
To cope with local time shifting, one can borrow
ideas from the domain of strings. A string is a se-
quence of elements, each of which is a symbol in an
alphabet. Two strings, possibly of different lengths,
are aligned so that they become identical with the
smallest number of added, deleted or changed sym-
bols. Among these three operations, deletion can be
treated as adding a symbol in the other string. Hereafter, we refer to an added symbol as a gap element. This distance is called the string edit distance. The cost/distance of introducing a gap element is set to 1:

$$dist(r_i, s_i) = \begin{cases}
0 & \text{if } r_i = s_i \\
1 & \text{if } r_i \text{ or } s_i \text{ is a gap} \\
1 & \text{otherwise}
\end{cases} \quad (1)$$

In the above formula, we highlight the second case to indicate that if a gap is introduced in the alignment, the cost is 1. The string edit distance satisfies the triangle inequality and is a metric [27].
To generalize from strings to time series, the complication is that the elements $r_i$ and $s_i$ are not symbols, but real values. For most applications, strict equality would not make sense as, for instance, the pair $r_i = 1, s_i = 2$ should be considered more similar than the pair $r_i = 1, s_i = 10000$. To take the real values into account, one way is to relax equality to be within a certain tolerance $\delta$:

$$dist_{edr}(r_i, s_i) = \begin{cases}
0 & \text{if } |r_i - s_i| \le \delta \\
1 & \text{if } r_i \text{ or } s_i \text{ is a gap} \\
1 & \text{otherwise}
\end{cases} \quad (2)$$

This is a simple generalization of Formula (1).
Based on Formula (2) on individual elements and gaps, the edit distance between two time sequences R and S, of length m and n respectively, is defined in [6] as Formula (3) in Figure 2. Here $r_1$ and Rest(R) denote the first element and the remaining sequence of R, respectively. Notice that, given Formula (2), the last case in Formula (3) can be simplified to $\min\{EDR(Rest(R), Rest(S)) + 1,\ EDR(Rest(R), S) + 1,\ EDR(R, Rest(S)) + 1\}$. Local time shifting is essentially implemented by a dynamic-programming style minimization of the above three possibilities.
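As an illustration of Formula (3), a bottom-up dynamic-programming sketch in Python might look as follows (the function name edr and the full table layout are our choices, not the paper's implementation):

```python
def edr(r, s, delta):
    # Edit Distance on Real sequence (Formula 3), computed bottom-up.
    m, n = len(r), len(s)
    # dp[i][j] = EDR between the first i elements of r and the first j elements of s.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i              # i elements of r matched against gaps
    for j in range(n + 1):
        dp[0][j] = j              # j elements of s matched against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subcost = 0 if abs(r[i - 1] - s[j - 1]) <= delta else 1
            dp[i][j] = min(dp[i - 1][j - 1] + subcost,   # match or change
                           dp[i - 1][j] + 1,             # gap in s
                           dp[i][j - 1] + 1)             # gap in r
    return dp[m][n]
```

On the example discussed next (Q = [0], R = [1, 2], S = [2, 3, 3], delta = 1), this sketch returns EDR(Q, R) = 1, EDR(R, S) = 1 and EDR(Q, S) = 3.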
While EDR can handle local time shifting, it no longer satisfies the triangle inequality. The problem arises precisely from relaxing equality, i.e., $|r_i - s_i| \le \delta$. More specifically, for three elements $q_i, r_i, s_i$, we can have $|q_i - r_i| \le \delta$ and $|r_i - s_i| \le \delta$, but $|q_i - s_i| > \delta$.
To illustrate, let us consider a very simple example of three time series: Q = [0], R = [1, 2] and S = [2, 3, 3]. Let $\delta = 1$. To best match R, Q is aligned to be $\tilde{Q} = [0, -]$, where the symbol "-" denotes a gap. (There may exist many alternative ways to align sequences to get their best match. We only show one of the possible alignments for simplicity.) Thus, EDR(Q, R) = 0 + 1 = 1. Similarly, to best match S, R is aligned to be $\tilde{R} = [1, 2, -]$, giving rise to EDR(R, S) = 1. Finally, to best match S, Q is aligned to be $\tilde{Q} = [0, -, -]$, leading to EDR(Q, S) = 3 > EDR(Q, R) + EDR(R, S) = 1 + 1 = 2!
DTW differs from EDR in two key ways, summarized in the following formula:

$$dist_{dtw}(r_i, s_i) = \begin{cases}
|r_i - s_i| & \text{if } r_i, s_i \text{ are not gaps} \\
|r_i - s_{i-1}| & \text{if } s_i \text{ is a gap} \\
|s_i - r_{i-1}| & \text{if } r_i \text{ is a gap}
\end{cases} \quad (6)$$
First, unlike EDR, DTW does not use a $\delta$ threshold to relax equality; the actual L1-norm is used. Second, unlike EDR, there is no explicit gap concept introduced in its original definition [3]. We treat the elements replicated during the process of aligning two sequences as the gaps of DTW. Therefore, the cost of a gap is not set to 1 as in EDR; it amounts to replicating the previous element, based on which the L1-norm is computed. Based on the above formula, the dynamic time warping distance between two time series, denoted as DTW(R, S), is defined formally as Formula (4) in Figure 2. The last case in the formula deals with the possibilities of replicating either $s_{i-1}$ or $r_{i-1}$.
Let us repeat the previous example with DTW: Q = [0], R = [1, 2] and S = [2, 3, 3]. To best match R, Q is aligned to be $\tilde{Q} = [0, -] = [0, 0]$. Thus, DTW(Q, R) = 1 + 2 = 3. Similarly, to best match S, R is aligned to be $\tilde{R} = [1, 2, -] = [1, 2, 2]$, giving rise to DTW(R, S) = 3. Finally, to best match S, Q is aligned to be $\tilde{Q} = [0, -, -] = [0, 0, 0]$, leading to DTW(Q, S) = 8 > DTW(Q, R) + DTW(R, S) = 3 + 3 = 6.
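For comparison, here is a corresponding sketch of Formula (4), with gaps realized implicitly by the dp[i-1][j] and dp[i][j-1] transitions that repeat the previous element (the function name dtw is ours):

```python
def dtw(r, s):
    # Dynamic Time Warping with an L1 ground distance (Formula 4).
    m, n = len(r), len(s)
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(r[i - 1] - s[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j - 1],   # advance in both series
                                  dp[i - 1][j],       # repeat s[j-1] (gap in s)
                                  dp[i][j - 1])       # repeat r[i-1] (gap in r)
    return dp[m][n]
```

On the same example, this sketch yields DTW(Q, R) = 3, DTW(R, S) = 3 and DTW(Q, S) = 8, as above.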
It has been shown in [24] that for speech applica-
tions, DTW “loosely” satisfies the triangle inequality.
We verified this observation with the 24 benchmark
data sets used in [14, 31]. It appears that this obser-
vation is not true in general, as on average nearly 30%
of all the triplets do not satisfy the triangle inequality.
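The fraction of violating triplets can be estimated directly. One way to count, using the dtw sketch above on a small list of series (the function triangle_violation_rate is our own, and treating a triplet as violating if any ordering of it violates the inequality is our interpretation):

```python
from itertools import combinations

def triangle_violation_rate(series, dist):
    # Fraction of unordered triplets for which some ordering violates
    # dist(a, c) <= dist(a, b) + dist(b, c).
    triplets = list(combinations(series, 3))
    violated = 0
    for a, b, c in triplets:
        d_ab, d_bc, d_ac = dist(a, b), dist(b, c), dist(a, c)
        if d_ac > d_ab + d_bc or d_ab > d_ac + d_bc or d_bc > d_ab + d_ac:
            violated += 1
    return violated / len(triplets) if triplets else 0.0

# Example: triangle_violation_rate(list_of_series, dtw)
```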
3.2 ERP and its Properties
The key reason why DTW does not satisfy the triangle inequality is that, when a gap needs to be added, it replicates the previous element. Thus, as shown in the second and third cases of Formula (6), the difference between an element and a gap varies according to $r_{i-1}$ or $s_{i-1}$. Contrast this situation with EDR, which makes every such difference a constant 1 (second case in Formula (2)). On the other hand, the problem for EDR lies in its use of a $\delta$ tolerance. DTW does not have this problem because it uses the L1-norm between two non-gap elements.
We propose ERP such that it uses a real penalty between two non-gap elements, but a constant value for computing the distance for gaps. Thus, ERP uses the following distance formula:

$$dist_{erp}(r_i, s_i) = \begin{cases}
|r_i - s_i| & \text{if } r_i, s_i \text{ are not gaps} \\
|r_i - g| & \text{if } s_i \text{ is a gap} \\
|s_i - g| & \text{if } r_i \text{ is a gap}
\end{cases} \quad (7)$$

where g is a constant value. Based on Formula (7), we define the ERP distance between two time series, denoted as ERP(R, S), as Formula (5) in Figure 2.

$$EDR(R, S) = \begin{cases}
n & \text{if } m = 0 \\
m & \text{if } n = 0 \\
EDR(Rest(R), Rest(S)) & \text{if } dist_{edr}(r_1, s_1) = 0 \\
\min\{EDR(Rest(R), Rest(S)) + dist_{edr}(r_1, s_1),\ EDR(Rest(R), S) + dist_{edr}(r_1, gap),\ EDR(R, Rest(S)) + dist_{edr}(gap, s_1)\} & \text{otherwise}
\end{cases} \quad (3)$$

$$DTW(R, S) = \begin{cases}
0 & \text{if } m = n = 0 \\
\infty & \text{if } m = 0 \text{ or } n = 0 \\
dist_{dtw}(r_1, s_1) + \min\{DTW(Rest(R), Rest(S)),\ DTW(Rest(R), S),\ DTW(R, Rest(S))\} & \text{otherwise}
\end{cases} \quad (4)$$

$$ERP(R, S) = \begin{cases}
\sum_{i=1}^{n} |s_i - g| & \text{if } m = 0 \\
\sum_{i=1}^{m} |r_i - g| & \text{if } n = 0 \\
\min\{ERP(Rest(R), Rest(S)) + dist_{erp}(r_1, s_1),\ ERP(Rest(R), S) + dist_{erp}(r_1, gap),\ ERP(R, Rest(S)) + dist_{erp}(s_1, gap)\} & \text{otherwise}
\end{cases} \quad (5)$$

Figure 2: Comparing the Distance Functions

A careful comparison of the formulas reveals that ERP can be seen as a combination of the L1-norm and EDR. ERP differs from EDR in avoiding the $\delta$ tolerance. On the other hand, ERP differs from DTW in not replicating the previous elements. The following lemma shows that for any fixed constant g, the triangle inequality is satisfied.
Lemma 1 For any three elements $q_i, r_i, s_i$, any of which can be a gap element, it is necessary that $dist(q_i, s_i) \le dist(q_i, r_i) + dist(r_i, s_i)$ based on Formula (7).

Theorem 1 Let Q, R, S be three time series of arbitrary length. Then it is necessary that $ERP(Q, S) \le ERP(Q, R) + ERP(R, S)$.
The proof of this theorem is a consequence of
Lemma 1 and the proof of the result by Waterman et
al. [27] on string edit distance. The Waterman proof
essentially shows that defining the distance between
two strings based on their best alignment in a dynamic
programming style preserves the triangle inequality, as
long as the underlying distance function also satisfies
the triangle inequality. The latter requirement is guar-
anteed by Lemma 1. Due to lack of space, we omit a
detailed proof.
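Formula (5) admits the same dynamic-programming treatment as EDR and DTW. Below is a minimal Python sketch, with the function name erp chosen by us and the gap value g defaulting to 0 (a choice justified in Section 3.2.1 below):

```python
def erp(r, s, g=0.0):
    # Edit distance with Real Penalty (Formula 5), computed bottom-up.
    m, n = len(r), len(s)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + abs(r[i - 1] - g)   # r matched entirely against gaps
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + abs(s[j - 1] - g)   # s matched entirely against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + abs(r[i - 1] - s[j - 1]),  # no gap
                           dp[i - 1][j] + abs(r[i - 1] - g),             # gap in s
                           dp[i][j - 1] + abs(s[j - 1] - g))             # gap in r
    return dp[m][n]
```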
3.2.1 Picking a Value for g
A natural question to ask here is: what is an appropriate value of g? The above lemma says that any value of g, as long as it is fixed, satisfies the triangle inequality. We pick g = 0 for two reasons. First, g = 0 admits an intuitive geometric interpretation. Consider plotting the time series with the x-axis representing (equally-spaced) time points and the y-axis representing the values of the elements. In this case, the x-axis corresponds to g = 0. Thus, the distance between two time series R, S corresponds to the difference between the area under R and the area under S.
Second, to best match R, S is aligned to form $\tilde{S}$ with the addition of gap elements. However, since the gap elements are of value g = 0, it is easy to see that $\sum_i \tilde{s}_i = \sum_j s_j$, making the area under S and that under $\tilde{S}$ the same. The following lemma states this property. In the next section, we will see the computational significance of this lemma.

Lemma 2 Let R, S be two time series. By setting g = 0 in Formula (7), $\sum_i \tilde{s}_i = \sum_j s_j$, where S is aligned to form $\tilde{S}$ to match R.
Let us repeat the previous example with ERP: Q = [0], R = [1, 2] and S = [2, 3, 3]. To best match R, Q is aligned to be $\tilde{Q} = [0, 0]$. Thus, ERP(Q, R) = 1 + 2 = 3. Similarly, to best match S, R is aligned to be $\tilde{R} = [1, 2, 0]$, giving rise to ERP(R, S) = 5. Finally, to best match S, Q is aligned to be $\tilde{Q} = [0, 0, 0]$, leading to ERP(Q, S) = 8 $\le$ ERP(Q, R) + ERP(R, S) = 3 + 5 = 8, satisfying the triangle inequality.
To see how local time shifting works for ERP, let us change Q to [3] instead. Then ERP(Q, R) = 1 + 1 = 2, as $\tilde{Q} = [0, 3]$. Similarly, ERP(Q, S) = 2 + 3 = 5, as $\tilde{Q} = [0, 3, 0]$. The triangle inequality is satisfied as expected.
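Using the erp sketch from above, these numbers can be reproduced directly:

```python
# Reproducing the worked examples (g = 0):
print(erp([0], [1, 2]), erp([1, 2], [2, 3, 3]), erp([0], [2, 3, 3]))  # 3.0 5.0 8.0
print(erp([3], [1, 2]), erp([3], [2, 3, 3]))                          # 2.0 5.0
```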
Notice that none of the results in this section is restricted to the L1-norm. That is, if we use another Lp-norm to replace the L1-norm in Formula (7), the lemma and the theorem remain valid. For the rest of the paper, we continue with the L1-norm for simplicity.
3.3 On the Naturalness of ERP
Even though ERP is a metric distance function, it is
a valid question to ask whether ERP is “natural” for
time series. In general, whether a distance function is
natural mainly depends on the application semantics.
Nonetheless, we show two experiments below suggest-
ing that ERP appears to be at least as natural as the
existing distance functions.
The first experiment is a simple sanity check. We first generated a simple time series Q, shown in Figure 3. Then we generated 5 other time series ($T_1$ to $T_5$) by adding time shifting or noise data at one or two positions of Q, as shown in Figure 3. For example, $T_1$ was generated by shifting the sequence values of Q to the left starting from position 4, and $T_2$ was derived from Q by introducing noise at position 4. Finally, we used the L1-norm, DTW, EDR, ERP and LCSS to rank the five time series relative to Q. The rankings are listed left to right, with the leftmost being the most similar to Q. The rankings are as follows:
Figure 3: Subjective Evaluation of Distance Functions
L1-norm: $T_1, T_4, T_5, T_3, T_2$
LCSS: $T_1, \{T_2, T_3, T_4\}, T_5$
EDR: $T_1, \{T_2, T_3\}, T_4, T_5$
DTW: $T_1, T_4, T_3, T_5, T_2$
ERP: $T_1, T_2, T_4, T_5, T_3$
As shown by the above results, the L1-norm is sensitive to noise, as $T_2$ is considered the worst match. LCSS focuses only on the matched parts and ignores all the unmatched portions. As such, it gives $T_2, T_3, T_4$ the same rank, and considers $T_5$ the worst match. EDR gives $T_2, T_3$ the same rank, higher than $T_4$. DTW gives $T_3$ a higher rank than $T_5$. Finally, ERP gives a ranked list different from all the others. Notice that the point here is not that ERP is the most natural. Rather, the point is that ERP appears to be no worse, if not better, than the existing distance functions.
In the second experiment, we turn to a more objective evaluation. Recently, Keogh et al. [17] have proposed using classification on labelled data to evaluate the efficacy of a distance function on time series. Specifically, each time series is assigned a class label. Then the "leave one out" prediction mechanism is applied to each time series in turn. That is, the class label of the chosen time series is predicted to be the class label of its nearest neighbour, defined based on the given distance function. If the prediction is correct, then it is a hit; otherwise, it is a miss. The classification error rate is defined as the ratio of the number of misses to the total number of time series. In the table below, we show the average classification error rate using three benchmarks: the Cylinder-Bell-Funnel (CBFtr) data [12, 14], the ASL data [25] and the "cameramouse" (CM) data [25]. (All can be downloaded from http://db.uwaterloo.ca/l6chen/testdata.) Compared to the standard CBF data [17], temporal shifting is introduced in the CBFtr data set. The CBFtr data set is a 3-class problem. The ASL data set, from the UCI KDD archive, consists of signs from the Australian Sign Language; it is a 10-class problem. The "cameramouse" data set contains 15 trajectories of 5 classes (words), 3 for each word. As shown in the table below, for all three data sets ERP performs (one of) the best, showing that it is not dominated by the other well-known alternatives.
Avg. Error Rate   L1     DTW    LCSS   EDR    ERP
CBFtr             0.03   0.01   0.01   0.01   0.01
ASL               0.16   0.10   0.11   0.11   0.09
CM                0.40   0.00   0.06   0.00   0.00
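For illustration, the "leave one out" 1-NN error rate described above can be computed with a short sketch like the following (the function name loo_error_rate and the (series, label) pair format are our assumptions, not the benchmark code used in the paper):

```python
def loo_error_rate(dataset, dist):
    # dataset: list of (time_series, class_label) pairs.
    # Predict each series' label from its nearest neighbour among the rest.
    misses = 0
    for i, (s, label) in enumerate(dataset):
        best_d, best_label = float("inf"), None
        for j, (t, t_label) in enumerate(dataset):
            if i == j:
                continue
            d = dist(s, t)
            if d < best_d:
                best_d, best_label = d, t_label
        if best_label != label:
            misses += 1
    return misses / len(dataset)

# Example: loo_error_rate(labelled_series, erp), where labelled_series is hypothetical.
```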
4 Indexing for ERP
Recall from Figure 2 that ERP can be seen as a vari-
ant of EDR and DTW. In particular, they share the
same computational behavior. Thus, like EDR and
DTW, it takes O(mn) time to compute ERP (Q, S)
for time series Q, S of length m, n respectively. For
large time series databases, it is important that for a
given query Q, we try to minimize the computation of
the true distance between Q and S for all series S in
the database. The topic explored here is indexing for
k-NN queries. An extension to range queries is rather
straightforward; we omit details for brevity.
Given that ERP is a metric distance function, one
obvious way to prune is to apply the triangle inequal-
ity. In Section 4.1, we present an algorithm to do just
that. Metric or not, another common way to prune
is to apply the GEMINI framework that is, using
lower bounds to guarantee no false negatives. Specif-
ically, even though DTW is not a metric, three lower
bounds have been proposed [30, 18, 14]. In Section
4.2.1, we show how to adapt these lower bounds for
ERP. In Section 4.2.2, we propose a new lower bound
for ERP. The beauty of this lower bound is that it can
be indexed by a simple B+-tree.
4.1 Pruning by the Triangle Inequality
The procedure TrianglePruning, shown in Figure 4, gives a skeleton of how the triangle inequality is applied. The array procArray stores the true ERP distances computed so far. That is, if $\{R_1, \ldots, R_u\}$ is the set of time series for which $ERP(Q, R_i)$ has been computed, the distance $ERP(Q, R_i)$ is recorded in procArray. Thus, for the time series S currently being evaluated, the triangle inequality ensures that $ERP(Q, S) \ge ERP(Q, R_i) - ERP(R_i, S)$ for all $1 \le i \le u$. Thus, it is necessary that $ERP(Q, S) \ge \max_{1 \le i \le u}\{ERP(Q, R_i) - ERP(R_i, S)\}$. This is implemented in lines 2 to 4. If this distance maxPruneDist is already worse than the current k-NN distance stored in result, then S can be skipped entirely. Otherwise, the true distance ERP(Q, S) is computed, and procArray is updated to include S. Finally, the result array is updated, if necessary, to reflect the current k-NN neighbours and distances in sorted order.
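A simplified rendering of this pruning logic as a sequential k-NN scan follows; it tracks the description above but is not the paper's Figure 4 pseudocode (the names knn_triangle_scan and ref_dists, a map of precomputed pairwise ERP distances between stored series, are ours):

```python
import heapq

def knn_triangle_scan(query, database, k, dist, ref_dists):
    # database: list of time series; ref_dists[(i, j)] = dist(database[i], database[j]),
    # precomputed for the pairs available (e.g., against chosen reference series).
    proc = {}        # index -> true distance to the query (the procArray)
    heap = []        # max-heap via negation: current k nearest as (-distance, index)
    for j, s in enumerate(database):
        # Lower-bound dist(query, s) via the triangle inequality (cf. lines 2 to 4).
        max_prune = 0.0
        for i, d_qi in proc.items():
            d_ij = ref_dists.get((i, j), ref_dists.get((j, i)))
            if d_ij is not None:
                max_prune = max(max_prune, d_qi - d_ij)
        if len(heap) == k and max_prune >= -heap[0][0]:
            continue                     # cannot improve the current k-NN: skip s
        d = dist(query, s)               # expensive true distance computation
        proc[j] = d
        heapq.heappush(heap, (-d, j))
        if len(heap) > k:
            heapq.heappop(heap)
    return sorted((-nd, j) for nd, j in heap)

# Example: knn_triangle_scan(q, db, 5, erp, precomputed_erp_distances)
```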
The algorithm given in Figure 5 shows how the
result and procArray should be initialized when the
procedure TrianglePruning is called repeatedly in line
4. Line 3 of the algorithm represents a simple sequen-
tial scan of all the time series in the database. Note
that we are not saying that a sequential scan should be
used. We include it for two reasons. The first reason
is to show how the procedure TrianglePruning can be