Word image matching using dynamic time warping

Toni M. Rath and R. Manmatha
Multi-Media Indexing and Retrieval Group
Center for Intelligent Information Retrieval
University of Massachusetts
Amherst, MA 01003
Abstract
Libraries and other institutions are interested in providing access to scanned versions of their large collections of handwritten historical manuscripts on electronic media. Convenient access to a collection requires an index, which is manually created at great labour and expense. Since current handwriting recognizers do not perform well on historical documents, a technique called word spotting has been developed: clusters with occurrences of the same word in a collection are established using image matching. By annotating “interesting” clusters, an index can be built automatically.
We present an algorithm for matching handwritten words in noisy historical documents. The segmented word images are preprocessed to create sets of 1-dimensional features, which are then compared using dynamic time warping. We present experimental results on two different data sets from the George Washington collection. Our experiments show that this algorithm performs better and is faster than competing matching techniques.
1. Introduction
Traditional libraries contain an enormous number of handwritten
historical documents that they would like to make
available electronically on the Internet or on digital media.
However, such large collections can only be accessed effi-
ciently if a searchable or browsable index exists, just like
in the back of a book. The current state-of-the-art approach
to this task is to manually create an index for the collection.
Since manual indexing is expensive, automation is desirable
in order to reduce costs.
Success in offline handwriting recognition, where only an image of the produced writing is available, has been limited to domains with small vocabularies, such as automatic mail sorting and check processing. In addition, these domains usually provide good quality images, while the quality of historical documents is often significantly degraded due to faded ink, stained paper, and other adverse factors (see Figure 1). Consequently, traditional Optical Character Recognition (OCR) techniques, which usually recognize words character-by-character, fail when applied to historical manuscripts.
Footnote: This work was supported in part by the Center for Intelligent Information Retrieval and in part by the National Science Foundation under grant number IIS-9909073. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.
For collections of handwritten manuscripts written by a single author (or a few authors), for example the George Washington collection used in this paper, the images of multiple instances of the same word are likely to look similar. For such collections, the word spotting idea [5] provides an alternative approach to index generation: first, each page in the document collection is segmented into words, and the different instances of a word are clustered together using image matching. Then, a human can tag the n most interesting clusters for indexing with the appropriate ASCII equivalent, which could be used to build a partial index for the analyzed collection. Historical handwritten documents are often of poor quality and, unlike printed documents, there is variation in the way the words are written. Thus, both segmentation of a page into words and the matching of word images are challenging problems for such documents.
Previous work by [6] has dealt with the problem of segmenting
such images of historical documents. In this work,
we present a word matching algorithm that compares word
images using Dynamic Time Warping (DTW). DTW has
been widely used in the speech processing, bio-informatics
and also the online handwriting communities to match 1-D
signals. Although the matching of word images is in general
a 2-dimensional problem, we recast it as a 1-dimensional
problem since there is a loose association of image columns
with the time that they were written over. By carefully pre-
processing the image we try to minimize the variations in
the other dimension. We then extract a number of features
from each image column and match the resulting feature se-
quences with the DTW algorithm. DTW can handle local
distortions in word images and is not restricted to a single
global transform. We compare this approach to a number of other techniques, including affine-corrected Euclidean Distance Mapping, the shape context algorithm, and correlation using sum of squared differences. Our results show that the algorithm proposed here outperforms the other techniques in both accuracy and speed.
In the following section, we put our work in context with
previous efforts in this direction. Section 2 reviews the dy-
namic time warping algorithm and introduces our matching
technique. After presenting our results and comparing them
to other word image matching methods in section 3, we con-
clude with an outlook on further research.
1.1. Previous Work
In [10] the problem of spotting word images in historical
documents using a perfect transcript (obtained manually) is
addressed. An OCR is used to recognize the word images
and the recognized images are aligned with the transcript.
Good results were only obtained when the recognizer’s lex-
icon was restricted to the ASCII versions of the line to be
recognized (obtained from the perfect transcript). The word
alignment accuracy of just about 83% (on a single page)
shows how challenging the task of word spotting for histor-
ical documents is, even in the presence of a perfect (manu-
ally generated) transcript.
The word spotting idea was proposed by [5]. The authors
presented some preliminary work on matching techniques
and methods for discarding unlikely matches (“pruning”)
based on simple image features. In [3], the previously de-
scribed techniques were extended and refined. Partial re-
sults on three annotated data sets, each 10 pages, were re-
ported.
[4] examine the problem of spotting occurrences of a known template word in each line of several pages. Their approach is line-based, unlike the word-based approach used here. Thus, while our algorithm solves a sequence matching problem, their algorithm solves a very expensive subsequence matching problem. Since [4] do not perform segmentation, the word templates are hand generated. In addition, the technique requires multiple (>10) handpicked training samples for each word. We believe this makes their technique impractical for automation. In contrast, the templates proposed here are automatically generated and multiple training samples are not needed. The matching algorithm proposed in [4] is also problematic, since it aligns each feature using a separate dynamic time warp and combines the results heuristically. This means that for the same word-line pair, each feature may produce a different alignment. In this paper, on the other hand, we correctly align the entire feature vector simultaneously so as to produce a common alignment over all feature vectors. [4] provide results for 4 hand-picked individual words on the Archives of the Indies; this data set seems to have been scanned from the originals and is probably of good quality. It appears from these results that the best result for any individual word template has a precision of 0.4 or less. No statistical results for a set of word templates are provided (presumably because this line-based approach is too expensive to run).
The shape context approach [1] for shape matching is
currently the best classifier for handwritten digits. Two
shapes are matched by establishing correspondences be-
tween their outlines. The outlines are sampled and shape
context histograms are generated for each sample point:
each histogram describes the distribution of sample points
in the shape with respect to the sample point at which it
is generated. Points with similar histograms are deemed
correspondences and a warping transform between the two
shapes is calculated and performed. The matching cost is
determined from the cost associated with the chosen cor-
respondences. We compare the performance of the shape
context algorithm against our technique in section 3.
2. Matching
Previous research [3] indicates that good matching perfor-
mance can be achieved by a technique that skews, resizes
and aligns two candidate word images with respect to each
other and then compares them pixel-by-pixel. We use DTW
to match word images, because it offers additional flexibil-
ity to compensate for handwriting variations.
Running a matching algorithm becomes expensive as collection sizes grow, so pruning techniques that can quickly discard unlikely matches are used. We briefly summarize the applied pruning techniques in the next section. Then, we briefly review the Dynamic Time Warping algorithm before explaining its application in our matching technique.
2.1. Pruning
Pruning is a way to quickly determine whether a pair of
images is either dissimilar or likely to match each other.
In [5], pruning of word pairs based on the area and aspect
ratio of their bounding boxes was performed. The idea is to
require word images, which will later be compared, to have
similar pruning statistics (e.g. area of bounding box).
The authors of [3] extended the pruning based on area
and aspect ratio of word bounding boxes. Their technique
additionally requires two words to have the same number of
descenders (strokes below the baseline [footnote 1], e.g. the bottom part of
the letter ’q’).
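The pruning rules described above can be sketched as a simple pair filter. This is only an illustration: the `WordStats` fields, the helper names and both ratio thresholds are assumptions made for the example, since the text does not state the actual tolerances used in [3] or [5], and the descender count is taken as a given input rather than computed.

```python
from dataclasses import dataclass


@dataclass
class WordStats:
    """Pruning statistics of one word image (field names are illustrative)."""
    width: int        # bounding-box width in pixels
    height: int       # bounding-box height in pixels
    descenders: int   # number of strokes reaching below the baseline


def similar(a: float, b: float, max_ratio: float) -> bool:
    """True if the larger of two positive values is at most max_ratio times the smaller."""
    return max(a, b) <= max_ratio * min(a, b)


def keep_pair(p: WordStats, q: WordStats,
              max_area_ratio: float = 1.5,       # assumed threshold
              max_aspect_ratio: float = 1.3) -> bool:  # assumed threshold
    """Return True if the pair survives pruning and should be passed to the matcher."""
    area_ok = similar(p.width * p.height, q.width * q.height, max_area_ratio)
    aspect_ok = similar(p.width / p.height, q.width / q.height, max_aspect_ratio)
    return area_ok and aspect_ok and p.descenders == q.descenders
```

Only pairs for which `keep_pair` returns `True` would then be handed to the more expensive matching step.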
2.2. DTW
Dynamic Time Warping [8] is used to compute a distance [footnote 2] between two time series. A time series is a list of samples taken from a signal, ordered by the time that the respective samples were obtained.
Footnote 1: The baseline is the imaginary line people write on.
Footnote 2: The terms distance and matching cost are used synonymously in this work; we do not require the presented distances to obey all metric axioms.
Figure 1: Part of a scanned document from the George Washington collection.
A naive approach to calculating a matching distance be-
tween two time series could be to resample one of them and
then compare the series sample-by-sample. The drawback
of this method is that it does not produce intuitive results,
as it compares samples that might not correspond well (see
Figure 2(a)).
Figure 2: Different alignments of two similar time series: (a) naive alignment after resampling, (b) alignment with DTW.
Dynamic Time Warping solves this discrepancy between
intuition and calculated matching distance by recovering
optimal alignments between sample points in the two time
series. The alignment is optimal in the sense that it mini-
mizes a cumulative distance measure consisting of “local”
distances between aligned samples. Figure 2(b) shows such
an alignment. The procedure is called Time Warping be-
cause it warps the time axes of the two time series in such a
way that corresponding samples appear at the same location
on a common time axis.
The DTW-distance between two time series $x_1 \ldots x_M$ and $y_1 \ldots y_N$ is $D(M, N)$, which we calculate in a dynamic programming approach using
$$D(i, j) = \min \{ D(i, j-1),\; D(i-1, j),\; D(i-1, j-1) \} + d(x_i, y_j). \tag{1}$$
The particular choice of recurrence equation and “local” distance function $d(\cdot, \cdot)$ varies with the application. Using the given three values $D(i, j-1)$, $D(i-1, j)$ and $D(i-1, j-1)$ in the calculation of $D(i, j)$ realizes a local continuity constraint (cf. Figure 3(a)), which ensures smooth time warping (e.g. no samples left out in warping).
Figure 3: Constraints used in the current dynamic time warping implementation: (a) local continuity constraint; (b) global path constraint (r = 15 in our implementation).
Backtracking along the minimum cost index pairs $(i, j)_k$
starting from $(M, N)$ yields the DTW warping path. We
use the Sakoe-Chiba band constraint [7] to ensure this path
stays close to the diagonal of the matrix which contains the
D(i, j) (see Figure 3(b)). This way, pathological warpings
that align a small portion in one sequence to a large por-
tion in the other are avoided. A more detailed discussion of
continuity constraints can be found in [8].
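As a rough illustration of the recurrence and the band constraint, the sketch below fills the cumulative cost matrix only inside a band of half-width `r` around the diagonal. Several details are assumptions rather than the authors' implementation: the function name, the default scalar squared-difference local distance, and the way the band is centred on the diagonal rescaled from (1, 1) to (M, N) so that sequences of different length can be compared.

```python
import math


def dtw_distance(x, y, r=15, d=lambda a, b: (a - b) ** 2):
    """Cumulative DTW cost D(M, N) for sequences x and y under a band of half-width r."""
    m, n = len(x), len(y)
    inf = math.inf
    # D[i][j] is the cost of the best path ending at (i, j); row 0 and column 0 are borders.
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        # Assumed band shape: stay within r columns of the rescaled diagonal.
        centre = i * n / m
        lo = max(1, math.ceil(centre - r))
        hi = min(n, math.floor(centre + r))
        for j in range(lo, hi + 1):
            best = min(D[i][j - 1], D[i - 1][j], D[i - 1][j - 1])
            D[i][j] = best + d(x[i - 1], y[j - 1])
    return D[m][n]
```

Cells outside the band keep an infinite cost, so a path can never leave it; if the band is too narrow to connect (1, 1) with (M, N), the function returns infinity.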
2.3. Matching Words with DTW
While the slant and skew angle at which a person writes
is usually constant for single words, the inter-character and
intra-character spacing is subject to larger variations. DTW
offers a more flexible way to compensate for these varia-
tions than linear scaling: in the matching algorithm that we
propose, image columns are aligned and compared using
DTW.
To do this, we first have to normalize the slant and skew angle of candidate images to compensate for inter-word variations. Then, from each word, four features per image column are extracted and combined into a single time series of multi-variate samples. That is, for each image $I$ with height $h$ and width $w$, we extract a time series $X(I) = x_1 \ldots x_w$, where each
$$x_i = (f_1(I, i), f_2(I, i), f_3(I, i), f_4(I, i))^T, \qquad 0 \le f_k(\cdot, \cdot) \le 1, \quad k = 1, 2, 3, 4.$$
This makes $X(I)$ a 4-variate vector of length $w$, where the $f_k$ are the four extracted features per image column.
In order to run the DTW algorithm on two time series
X(I) and Y (J) extracted from images I and J, we have
to define a local distance function that compares the feature
sets at aligned columns. We have chosen to use the square
of the Euclidean distance:
$$d(x_i, y_j) = \sum_{k=1}^{4} \left( f_k(I, i) - f_k(J, j) \right)^2. \tag{2}$$
This penalizes large differences between the extracted fea-
tures more heavily than the Euclidean distance would.
Now the DTW algorithm can be run to determine a warping path between $X$ and $Y$. The length $K$ of the warping path $((i_1, j_1), \ldots, (i_K, j_K))$ biases the determined distance
$$D(X, Y) = \sum_{k=1}^{K} d(x_{i_k}, y_{j_k}). \tag{3}$$
When comparing a template series $X$ to others, shorter series would be favored (i.e. produce smaller costs). For this reason, our final matching cost is normalized by the length $K$ of the warping path:
$$\text{matching cost}(X, Y) = D(X, Y)/K. \tag{4}$$
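To make Equations (2) to (4) concrete, the following sketch computes the path-length-normalized cost for two sequences of feature vectors and returns the warping path found by backtracking. It is a simplified illustration, not the authors' code: the function names are hypothetical, the band constraint is left out for brevity, and ties during backtracking are broken in favour of the diagonal step, which is an arbitrary choice.

```python
import math


def local_distance(x, y):
    """Squared Euclidean distance between two feature vectors (Equation 2)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def matching_cost(X, Y):
    """Return (D(X, Y) / K, warping path) for two non-empty sequences of feature vectors."""
    if not X or not Y:
        raise ValueError("both sequences must contain at least one sample")
    m, n = len(X), len(Y)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = local_distance(X[i - 1], Y[j - 1]) + min(
                D[i][j - 1], D[i - 1][j], D[i - 1][j - 1])
    # Backtrack from (M, N) to (1, 1), always moving to the cheapest predecessor.
    path = [(m, n)]
    i, j = m, n
    while (i, j) != (1, 1):
        i, j = min(((i - 1, j - 1), (i - 1, j), (i, j - 1)),
                   key=lambda p: D[p[0]][p[1]])
        path.append((i, j))
    path.reverse()
    return D[m][n] / len(path), path
```

Because the border cells of `D` are infinite, the backtracking step never leaves the valid index range, and the length of the returned path is the K used for normalization.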
In the following section, the column features used for
matching will be described.
2.4. Features
The images we operate on are all grayscale with 256 lev-
els of intensity [0..255]. Before column features can be ex-
tracted from an image, inter-word variations, such as the
baseline offset and the skew/slant angles have to be detected
and normalized. All of the column features we describe in
the following are normalized to the range [0..1]. Specific
pixel intensity values in an image I (dimensions h × w) are
referred to as I(r, c), where r and c indicate the row and
column index of the pixel. Our goal was to choose a variety
of features presented in handwriting recognition literature
(e.g. [2]), such that an approximate reconstruction of a word
from its features would be possible.
2.5. Projection Profile
Projection profiles capture the distribution of ink along one of the two dimensions in a word image. A vertical projection profile is computed by summing the intensity values [footnote 3] in each image column separately:
$$pp(I, c) = \sum_{r=1}^{h} \left( 255 - I(r, c) \right). \tag{5}$$
Due to the variations in quality (e.g. contrast, faded ink) of the scanned images, different projection profiles do not generally vary in the same range. To make them comparable, the range of the projection profiles is normalized to the range [0..1], which yields $f_1(I, c)$. Figure 4 shows an example projection profile and the original image it was extracted from.
Figure 4: Original image and projection profile feature: (a) original image, slant/skew/baseline-normalized and cleaned; (b) normalized projection profile.
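The projection profile of Equation (5) can be sketched in a few lines. The image is assumed to be a list of rows of grayscale values in [0..255], and min-max scaling is used as one plausible reading of "normalized to the range [0..1]"; the paper does not spell out the exact normalization, so treat `normalize` as an assumption.

```python
def projection_profile(image):
    """Column-wise sum of inverted intensities, pp(I, c), for a list-of-rows image."""
    width = len(image[0])
    return [sum(255 - row[c] for row in image) for c in range(width)]


def normalize(values):
    """Min-max scale a profile to [0, 1]; a constant profile maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


# Tiny 3 x 3 example: the middle column carries most of the ink.
image = [
    [255, 0, 255],
    [255, 0, 128],
    [255, 255, 255],
]
f1 = normalize(projection_profile(image))
```

In this example the raw profile is `[0, 510, 127]`, so the middle column receives the feature value 1.0 and the empty first column 0.0.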
2.6. Word Profiles
Word profiles capture part of the outlining shape of a word.
The current word matching algorithm uses upper and lower
word profiles: these two features are calculated by going
along the upper (lower) boundary of a word’s bounding box
and recording for each image column the distance to the
nearest “ink” pixel in that column. The identification of ink
pixels is currently realized using a thresholding technique
which we have found to be sufficient for our purposes.
Footnote 3: We invert the pixel intensities, because the result is visually more intuitive (peaks for pronounced vertical components in the input word image).
Due to a number of factors, such as pressure on the writ-
ing instrument and fading ink, some image columns may
not contain ink pixels. The occurrence of such gaps is not
consistent for multiple instances of the same word. There-
fore, we close these gaps by linearly interpolating between
the two closest points where the word profile feature values
could be reliably determined.
Figure 5: Normalized upper word profile (negative feature
value displayed).
The features $f_2$ and $f_3$ can be obtained from the upper and lower word profiles by normalizing their maximum range to [0..1]. Figure 5 shows an upper word profile feature, generated from the original in Figure 4(a).
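A minimal sketch of the word profile computation, including the gap closing by linear interpolation, might look as follows. The threshold value, the function names, and the handling of gaps at the left or right end of the word (copying the nearest known value) are assumptions; the text only states that a thresholding technique is used and that gaps are interpolated between the two closest reliable points. The final normalization to [0..1] is omitted here.

```python
def raw_profile(image, threshold=128, from_top=True):
    """Distance from the top (or bottom) box edge to the nearest ink pixel per column.

    Columns without ink yield None. The threshold is an assumed value.
    """
    height, width = len(image), len(image[0])
    rows = range(height) if from_top else range(height - 1, -1, -1)
    profile = []
    for c in range(width):
        dist = None
        for steps, r in enumerate(rows):
            if image[r][c] < threshold:   # a dark pixel counts as ink
                dist = steps
                break
        profile.append(dist)
    return profile


def fill_gaps(profile):
    """Linearly interpolate over None entries; gaps at either end copy the nearest value."""
    known = [i for i, v in enumerate(profile) if v is not None]
    if not known:
        raise ValueError("no ink found in any column")
    filled = list(profile)
    for i in range(len(profile)):
        if filled[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = float(profile[right])
        elif right is None:
            filled[i] = float(profile[left])
        else:
            t = (i - left) / (right - left)
            filled[i] = profile[left] + t * (profile[right] - profile[left])
    return filled
```

Calling `raw_profile(image, from_top=False)` gives the lower profile, measured upwards from the bottom edge of the bounding box.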
2.7. Background/Ink Transitions
So far, the above features represent the distribution of ink in the columns of a word image and the outlining shape of the word. To capture part of the “inner” structure of a word, we chose to record the number of background to ink transitions $\mathit{nbit}(I, c)$ in an image column as the last feature. The range of this feature is normalized with a (conservatively estimated) constant that ensures a range of [0..1]:
$$f_4(I, c) = \mathit{nbit}(I, c)/6. \tag{6}$$
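Counting the transitions of Equation (6) amounts to one pass down each column. In this sketch the ink threshold and the function names are assumptions, and the region above the image is treated as background so that ink in the first row counts as one transition.

```python
def ink_transitions(image, c, threshold=128):
    """Count background-to-ink transitions in column c, scanning from top to bottom."""
    count = 0
    previous_is_ink = False   # the area above the image is treated as background
    for row in image:
        is_ink = row[c] < threshold
        if is_ink and not previous_is_ink:
            count += 1
        previous_is_ink = is_ink
    return count


def f4(image, c, max_transitions=6):
    """Transition count divided by the normalization constant from Equation (6)."""
    return ink_transitions(image, c) / max_transitions
```

A column that crosses two separate strokes, for example, yields two transitions and a feature value of 2/6.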
With this feature set at hand, we will now demonstrate its
effectiveness when used within the proposed DTW match-
ing algorithm (section 2.3).
We tried other features, including Gaussian derivatives,
but the above set seemed to work the best.
3. Experimental Results
3.1. Data Sets and Processing
Word matching experiments were conducted on two test
sets of different quality, both 10 pages in size. The first
set is of acceptable quality (see Figure 6(a)). The second set is very degraded (see Figure 6(b)); it is difficult even for people to read these documents, and it was used to test how badly the algorithms would perform. A number of algorithms were tested and results are presented on four sets which were constructed as follows:
A: 15 images in test set 1, analyzed in [3].
B: entire test set 1 (2381 images total, 9 do not contain words [footnote 4]).
C: 32 images in test set 2, analyzed in [3].
D: entire test set 2 (3370 images total, 108 do not contain words [footnote 4]).
Footnote 4: These images result from segmentation errors.
The subsets A and C allow us to test algorithms which
would otherwise take too long to run on the entire dataset.
Each page in the two test sets was segmented into words
using the algorithm described in [6]. The algorithm uses
scale-space techniques to determine word boundaries which
are then used to extract single word images. For reasons of
comparability we used the exact same segmentation results
as in [3].
For the matching based on DTW and the shape context
run (see below), we normalized the slant and skew of the
word images and cleaned the images to remove noise in
the background and parts of other words that reach into the
bounding box.
Test set   total #queries   #pruned pairs / #total pairs   Recall
A          15               12.71%                         90.72%
B          2372             13.57%                         71.11%
C          32               13.01%                         56.49%
D          3262             14.26%                         55.05%
Table 1: Effects of pruning for all analyzed data sets.
The total number of word pairs, which would otherwise have to be processed by the matching algorithm, was reduced by applying the pruning techniques described in section 2.1. Table 1 shows the effects of pruning on the 4 subsets A, B, C and D. #pruned pairs denotes the number of image pairs left for comparison after pruning; #total pairs is the number of query words in the (partial) test set multiplied by the number of words in the enclosing collection (either 2381 or 3370). Recall is the proportion of valid matches that remains in the pruned set (100% = no valid matches discarded).
3.2. Evaluation Method
Each word in the data sets was tagged with its ASCII equiv-
alent. In case of segmentation errors, a tag corresponding to
all visible characters in the segmented word image was as-
signed. Based on this annotation, relevance judgments were
produced for the data sets. Two word images were consid-
ered relevant, if they have the same tags.
To evaluate the word image matching algorithms, we used an information retrieval approach: each image in a data set is viewed as a query which is used to retrieve similar images from the entire collection enclosing the data set (e.g. data set A is enclosed in set 1). Matching the query against other images produces a ranked list of retrieved images, sorted by the matching cost. Using the trec_eval program, we calculated average precision scores [11] for all queries in the sets A through D.
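For readers unfamiliar with the measure, average precision for a single query can be sketched as below. This follows the commonly used definition (mean of the precision values at the ranks of the relevant items, divided by the number of relevant items) and is meant as an illustration, not as a reimplementation of trec_eval; the function name and the optional `total_relevant` parameter are assumptions.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average precision for one query.

    ranked_relevance: booleans in ranked order (best match first), True = relevant.
    total_relevant: number of relevant items in the collection; defaults to the
    number of relevant items present in the ranked list.
    """
    if total_relevant is None:
        total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    return precision_sum / total_relevant
```

Passing `total_relevant` explicitly matters when pruning has removed some relevant images from the ranked list, because those missing matches then lower the score.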
References
Shape matching and object recognition using shape contexts
Dynamic programming algorithm optimization for spoken word recognition
Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison
An algorithm for associating the features of two images