What contributions have the authors mentioned in the paper "Word image matching using dynamic time warping" ?

The authors present an algorithm for matching handwritten words in noisy historical documents. The authors present experimental results on two different data sets from the George Washington collection.

What have the authors stated for future works in "Word image matching using dynamic time warping" ?

Their future work will focus on improving the accuracy as well as the speed of the techniques used here. Accuracy can be improved by using better pruning techniques as well as using a larger feature set which discriminates words better from each other. Speed can be improved by optimizing their implementation of the dynamic time warping algorithm, as well as looking at related computational techniques to minimize the number of possible matches.

How is the identification of ink pixels realized?

The identification of ink pixels is currently realized using a thresholding technique which the authors have found to be sufficient for their purposes.

What is the effect of the pruning method on the smaller set A?

The pruning method works much better on the smaller set A: it preserves about 91% of the relevant documents for data set A, but yields only 71% recall on data set B.

What is the difference between the two projection profiles?

Due to variations in the quality of the scanned images (e.g. contrast, faded ink), different projection profiles do not generally vary in the same range, so their range is normalized to [0..1] to make them comparable.

(Open Access) Word image matching using dynamic time warping (2003) | Toni M. Rath

Q: What is the way to compensate for the variations in the slant and skew?

DTW offers a more flexible way to compensate for these variations than linear scaling: in the matching algorithm that the authors propose, image columns are aligned and compared using DTW.

Q: What is the method for determining the likelihood of a pair of words matching?

Previous research [3] indicates that good matching performance can be achieved by a technique that skews, resizes and aligns two candidate word images with respect to each other and then compares them pixel-by-pixel.

Q: Why do some image columns not contain ink pixels?

Due to a number of factors, such as pressure on the writing instrument and fading ink, some image columns may not contain ink pixels.

Q: What constraint is used to ensure that the path stays close to the diagonal of the matrix?

The authors use the Sakoe-Chiba band constraint [7] to ensure this path stays close to the diagonal of the matrix which contains the D(i, j) (see Figure 3(b)).

Word Image Matching Using Dynamic Time Warping

Toni M. Rath and R. Manmatha∗

Multi-Media Indexing and Retrieval Group

Center for Intelligent Information Retrieval

University of Massachusetts

Amherst, MA 01003

Abstract

Libraries and other institutions are interested in providing access to scanned versions of their large collections of handwritten historical manuscripts on electronic media. Convenient access to a collection requires an index, which is manually created at great labour and expense. Since current handwriting recognizers do not perform well on historical documents, a technique called word spotting has been developed: clusters with occurrences of the same word in a collection are established using image matching. By annotating “interesting” clusters, an index can be built automatically.

We present an algorithm for matching handwritten words in noisy historical documents. The segmented word images are preprocessed to create sets of 1-dimensional features, which are then compared using dynamic time warping. We present experimental results on two different data sets from the George Washington collection. Our experiments show that this algorithm performs better and is faster than competing matching techniques.

1. Introduction

Traditional libraries contain an enormous amount of handwritten historical documents that they would like to make available electronically on the Internet or on digital media. However, such large collections can only be accessed efficiently if a searchable or browsable index exists, just like in the back of a book. The current state-of-the-art approach to this task is to manually create an index for the collection. Since manual indexing is expensive, automation is desirable in order to reduce costs.

Success in offline handwriting recognition, where only an image of the produced writing is available, has been limited to domains with small vocabularies, such as automatic mail sorting and check processing. In addition, these domains usually provide good quality images, while the quality of historical documents is often significantly degraded due to faded ink, stained paper, and other adverse factors (see Figure 1). Consequently, traditional Optical Character Recognition (OCR) techniques, which usually recognize words character-by-character, fail when applied to historical manuscripts.

∗ This work was supported in part by the Center for Intelligent Information Retrieval and in part by the National Science Foundation under grant number IIS-9909073. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.

For collections of handwritten manuscripts written by a single author (or a few authors), for example the George Washington collection used in this paper, the images of multiple instances of the same word are likely to look similar. For such collections, the word spotting idea [5] provides an alternative approach to index generation: first, each page in the document collection is segmented into words, and the different instances of a word are clustered together using image matching. Then, a human can tag the n most interesting clusters for indexing with the appropriate ASCII equivalent, which could be used to build a partial index for the analyzed collection. Historical handwritten documents are often of poor quality and, unlike printed documents, show variation in the way the words are written. Thus, both segmentation of a page into words and the matching of word images are challenging problems for such documents.

Previous work [6] has dealt with the problem of segmenting such images of historical documents. In this work, we present a word matching algorithm that compares word images using Dynamic Time Warping (DTW). DTW has been widely used in the speech processing, bioinformatics and online handwriting communities to match 1-D signals. Although the matching of word images is in general a 2-dimensional problem, we recast it as a 1-dimensional problem, since there is a loose association of image columns with the time at which they were written. By carefully preprocessing the image we try to minimize the variations in the other dimension. We then extract a number of features from each image column and match the resulting feature sequences with the DTW algorithm. DTW can handle local distortions in word images and is not restricted to a single global transform. We compare this approach to a number of other techniques, including affine-corrected Euclidean Distance Mapping, the shape context algorithm, and correlation using sum of squared differences. Our results show that the algorithm proposed here outperforms the other techniques in terms of both accuracy and speed.

In the following section, we put our work in context with previous efforts in this direction. Section 2 reviews the dynamic time warping algorithm and introduces our matching technique. After presenting our results and comparing them to other word image matching methods in Section 3, we conclude with an outlook on further research.

1.1. Previous Work

In [10] the problem of spotting word images in historical documents using a perfect transcript (obtained manually) is addressed. An OCR system is used to recognize the word images and the recognized images are aligned with the transcript. Good results were only obtained when the recognizer's lexicon was restricted to the ASCII versions of the line to be recognized (obtained from the perfect transcript). The word alignment accuracy of just about 83% (on a single page) shows how challenging the task of word spotting for historical documents is, even in the presence of a perfect (manually generated) transcript.

The word spotting idea was proposed by [5]. The authors presented some preliminary work on matching techniques and methods for discarding unlikely matches (“pruning”) based on simple image features. In [3], the previously described techniques were extended and refined. Partial results on three annotated data sets, each 10 pages, were reported.

[4] examine the problem of spotting occurrences of a known template word in each line of several pages. Their approach is line based, unlike the word based approach used here. Thus, while our algorithm solves a sequence matching problem, their algorithm solves a very expensive subsequence matching problem. Since [4] do not perform segmentation, the word templates are hand generated. In addition, the technique requires multiple (>10) handpicked training samples for each word. We believe this makes their technique impractical for automation. In contrast, the templates proposed here are automatically generated and multiple training samples are not needed. The matching algorithm proposed in [4] is also problematic, since it aligns each feature using a separate dynamic time warp and combines the results heuristically. This means that for the same word-line pair, each feature may produce a different alignment. In this paper, on the other hand, we correctly align the entire feature vector simultaneously so as to produce a common alignment over all features. [4] provide results for 4 hand-picked individual words on the Archives of the Indies; this data set seems to have been scanned from the originals and is probably of good quality. It appears from these results that the best result for any individual word template has a precision of 0.4 or less. No statistical results for a set of word templates are provided (presumably because this line-based approach is too expensive to run).

The shape context approach [1] for shape matching is currently the best classifier for handwritten digits. Two shapes are matched by establishing correspondences between their outlines. The outlines are sampled and shape context histograms are generated for each sample point: each histogram describes the distribution of sample points in the shape with respect to the sample point at which it is generated. Points with similar histograms are deemed correspondences and a warping transform between the two shapes is calculated and performed. The matching cost is determined from the cost associated with the chosen correspondences. We compare the performance of the shape context algorithm against our technique in Section 3.

2. Matching

Previous research [3] indicates that good matching performance can be achieved by a technique that skews, resizes and aligns two candidate word images with respect to each other and then compares them pixel-by-pixel. We use DTW to match word images, because it offers additional flexibility to compensate for handwriting variations.

Running a matching algorithm becomes expensive as collection sizes grow, so pruning techniques which can quickly discard unlikely matches are used. We briefly summarize the applied pruning techniques in the next section. Then, we briefly review the Dynamic Time Warping algorithm before explaining its application in our matching technique.

2.1. Pruning

Pruning is a way to quickly determine whether a pair of images is either dissimilar or likely to match each other. In [5], pruning of word pairs based on the area and aspect ratio of their bounding boxes was performed. The idea is to require word images, which will later be compared, to have similar pruning statistics (e.g. area of bounding box).

The authors of [3] extended the pruning based on area and aspect ratio of word bounding boxes. Their technique additionally requires two words to have the same number of descenders (strokes below the baseline, e.g. the bottom part of the letter 'q').
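As an illustration only (not the implementation used in [3] or [5]), such a pruning test could be sketched as follows. The `WordStats` fields, the helper names and both tolerance values are assumptions introduced here; the text above only states that area, aspect ratio and descender count are compared.

```python
from dataclasses import dataclass


@dataclass
class WordStats:
    """Hypothetical per-image pruning statistics (names are illustrative)."""
    width: int        # bounding-box width in pixels (assumed > 0)
    height: int       # bounding-box height in pixels (assumed > 0)
    descenders: int   # number of strokes below the baseline


def ratio_within(a: float, b: float, tolerance: float) -> bool:
    """True if the larger of a, b is at most `tolerance` times the smaller."""
    lo, hi = min(a, b), max(a, b)
    return lo > 0 and hi / lo <= tolerance


def may_match(p: WordStats, q: WordStats,
              area_tol: float = 1.5, aspect_tol: float = 1.3) -> bool:
    """Keep a pair only if area, aspect ratio and descender count agree.

    Both tolerance defaults are assumed values chosen for illustration.
    """
    same_area = ratio_within(p.width * p.height, q.width * q.height, area_tol)
    same_aspect = ratio_within(p.width / p.height, q.width / q.height, aspect_tol)
    return same_area and same_aspect and p.descenders == q.descenders
```

Only pairs for which `may_match` returns True would be passed on to the more expensive matching step.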

2.2. DTW

Dynamic Time Warping [8] is used to compute a distance between two time series. A time series is a list of samples taken from a signal, ordered by the time that the respective samples were obtained.

Footnote (Section 2.1): The baseline is the imaginary line people write on.

Footnote: The terms distance and matching cost are used synonymously in this work; we do not require the presented distances to obey all metric axioms.

Figure 1: Part of a scanned document from the George Washington collection.

A naive approach to calculating a matching distance between two time series could be to resample one of them and then compare the series sample-by-sample. The drawback of this method is that it does not produce intuitive results, as it compares samples that might not correspond well (see Figure 2(a)).

Figure 2: Different alignments of two similar time series (horizontal axis: samples): (a) naive alignment after resampling, (b) alignment with DTW.

Dynamic Time Warping solves this discrepancy between intuition and calculated matching distance by recovering optimal alignments between sample points in the two time series. The alignment is optimal in the sense that it minimizes a cumulative distance measure consisting of “local” distances between aligned samples. Figure 2(b) shows such an alignment. The procedure is called Time Warping because it warps the time axes of the two time series in such a way that corresponding samples appear at the same location on a common time axis.

The DTW-distance between two time series $x_1 \ldots x_M$ and $y_1 \ldots y_N$ is $D(M, N)$, which we calculate in a dynamic programming approach using

$$D(i, j) = \min \left\{ \begin{array}{l} D(i, j-1) \\ D(i-1, j) \\ D(i-1, j-1) \end{array} \right\} + d(x_i, y_j). \tag{1}$$

The particular choice of recurrence equation and “local” distance function d(·, ·) varies with the application. Using the given three values D(i, j − 1), D(i − 1, j) and D(i − 1, j − 1) in the calculation of D(i, j) realizes a local continuity constraint (cf. Figure 3(a)), which ensures smooth time warping (e.g. no samples left out in warping).
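A minimal sketch of recurrence (1) in Python is given here for illustration. The boundary conditions (D(0, 0) = 0 and infinity for all other cells in row and column 0) are an assumption, as is the function name; the local distance d is passed in by the caller.

```python
import math
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def dtw_distance(x: Sequence[T], y: Sequence[T],
                 d: Callable[[T, T], float]) -> float:
    """Return D(M, N) computed with recurrence (1)."""
    m, n = len(x), len(y)
    # D uses 1-based indices; row 0 and column 0 hold the assumed boundary
    # conditions (0 at the origin, infinity elsewhere).
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Local continuity constraint: only the three neighbouring cells
            # D(i, j-1), D(i-1, j) and D(i-1, j-1) may precede D(i, j).
            best_prev = min(D[i][j - 1], D[i - 1][j], D[i - 1][j - 1])
            D[i][j] = best_prev + d(x[i - 1], y[j - 1])
    return D[m][n]
```

For two scalar series that differ only by a repeated sample, such as `[1, 2, 3]` and `[1, 2, 2, 3]` with the absolute difference as local distance, the repeated sample is absorbed by the warping and the accumulated distance is zero.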

Figure 3: Constraints used in the current dynamic time warping implementation: (a) local continuity constraint, in which D(i, j) is computed from D(i − 1, j), D(i − 1, j − 1) and D(i, j − 1); (b) global path constraint between (1, 1) and (M, N) with band parameter r (r = 15 in our implementation).

Backtracking along the minimum cost index pairs (i, j) starting from (M, N) yields the DTW warping path. We use the Sakoe-Chiba band constraint [7] to ensure this path stays close to the diagonal of the matrix which contains the D(i, j) (see Figure 3(b)). This way, pathological warpings that align a small portion in one sequence to a large portion in the other are avoided. A more detailed discussion of continuity constraints can be found in [8].
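The band constraint and the backtracking step could be sketched as below. This is an illustration under stated assumptions: the band is taken to be the set of cells with |i − j| ≤ r on the raw indices (the treatment of sequences of different length is not specified above), ties are broken arbitrarily, and the function name `dtw_path` is introduced here.

```python
import math
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def dtw_path(x: Sequence[T], y: Sequence[T], d: Callable[[T, T], float],
             r: int = 15) -> Tuple[float, List[Tuple[int, int]]]:
    """Band-constrained DTW: returns (D(M, N), warping path of 1-based (i, j))."""
    m, n = len(x), len(y)
    if m == 0 or n == 0:
        raise ValueError("both sequences must be non-empty")
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        # Only cells with |i - j| <= r are filled; the rest stay infinite.
        for j in range(max(1, i - r), min(n, i + r) + 1):
            best_prev = min(D[i][j - 1], D[i - 1][j], D[i - 1][j - 1])
            D[i][j] = best_prev + d(x[i - 1], y[j - 1])
    if math.isinf(D[m][n]):
        raise ValueError("no warping path inside the band; increase r")
    # Backtrack from (M, N), always moving to the cheapest predecessor.
    path = [(m, n)]
    i, j = m, n
    while (i, j) != (1, 1):
        candidates = [(D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j], i - 1, j),
                      (D[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return D[m][n], path
```

With this definition, two sequences whose lengths differ by more than r have no admissible path, which is why the sketch raises an error in that case rather than returning an infinite cost.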

2.3. Matching Words with DTW

While the slant and skew angle at which a person writes is usually constant for single words, the inter-character and intra-character spacing is subject to larger variations. DTW offers a more flexible way to compensate for these variations than linear scaling: in the matching algorithm that we propose, image columns are aligned and compared using DTW.

To do this, we first have to normalize the slant and skew angle of candidate images to compensate for inter-word variations. Then, from each word, four features per image column are extracted and combined into a single time series of multi-variate samples. That is, for each image $I$ with height $h$ and width $w$, we extract a time series $X(I) = x_1 \ldots x_w$, where each

$$x_i = (f_1(I, i), \ldots, f_4(I, i)), \quad 0 \le f_k(\cdot, \cdot) \le 1, \quad k = 1, 2, 3, 4.$$

This makes $X(I)$ a 4-variate time series of length $w$, where the $f_k$ are the four extracted features per image column.

In order to run the DTW algorithm on two time series X(I) and Y(J) extracted from images I and J, we have to define a local distance function that compares the feature sets at aligned columns. We have chosen to use the square of the Euclidean distance:

$$d(x_i, y_j) = \sum_{k=1}^{4} \left( f_k(I, i) - f_k(J, j) \right)^2. \tag{2}$$

This penalizes large differences between the extracted features more heavily than the Euclidean distance would.
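For illustration, the local distance of Equation (2) can be written as a short Python function. The function name and the representation of a column as a sequence of four floats are assumptions made here.

```python
from typing import Sequence


def local_distance(xi: Sequence[float], yj: Sequence[float]) -> float:
    """Squared Euclidean distance between two column feature vectors."""
    if len(xi) != len(yj):
        raise ValueError("feature vectors must have the same length")
    # Sum of squared per-feature differences, as in Equation (2).
    return sum((a - b) ** 2 for a, b in zip(xi, yj))
```

Because all features lie in [0, 1], squaring shrinks small differences much more than large ones: a single feature difference of 0.1 contributes 0.01, while a difference of 0.5 contributes 0.25.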

Now the DTW algorithm can be run to determine a warping path between X and Y. The length K of the warping path $((i_1, j_1), \ldots, (i_K, j_K))$ biases the determined distance

$$D(X, Y) = \sum_{k=1}^{K} d(x_{i_k}, y_{j_k}). \tag{3}$$

When comparing a template series X to others, shorter series would be favored (i.e. produce smaller costs). For this reason, our final matching cost is normalized by the length K of the warping path:

$$\text{matching cost}(X, Y) = D(X, Y) / K. \tag{4}$$

In the following section, the column features used for matching will be described.
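The normalized matching cost of Equation (4) could be computed as sketched below. This is a simplified illustration, not the implementation evaluated in this paper: the band constraint is omitted for brevity, the path length K is tracked alongside the cumulative cost instead of by explicit backtracking, and among equally cheap predecessors the one with the shorter path is chosen (an assumed tie-breaking rule). All names are introduced here.

```python
import math
from typing import Sequence

Column = Sequence[float]


def sq_euclidean(a: Column, b: Column) -> float:
    """Squared Euclidean local distance between two feature columns."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def matching_cost(X: Sequence[Column], Y: Sequence[Column]) -> float:
    """D(X, Y) / K, where K is the length of a minimum-cost warping path."""
    m, n = len(X), len(Y)
    if m == 0 or n == 0:
        raise ValueError("both feature sequences must be non-empty")
    # Each cell stores (cumulative cost, number of path steps so far).
    unreachable = (math.inf, 0)
    cell = [[unreachable] * (n + 1) for _ in range(m + 1)]
    cell[0][0] = (0.0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Tuples compare by cost first, then by path length.
            cost, steps = min(cell[i][j - 1], cell[i - 1][j], cell[i - 1][j - 1])
            cell[i][j] = (cost + sq_euclidean(X[i - 1], Y[j - 1]), steps + 1)
    total, K = cell[m][n]
    return total / K
```

Dividing by K turns the accumulated distance into an average local distance per aligned column pair, so that a short candidate is not preferred merely because its warping path has fewer terms.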

2.4. Features

The images we operate on are all grayscale with 256 levels of intensity [0..255]. Before column features can be extracted from an image, inter-word variations, such as the baseline offset and the skew/slant angles, have to be detected and normalized. All of the column features we describe in the following are normalized to the range [0..1]. Specific pixel intensity values in an image I (dimensions h × w) are referred to as I(r, c), where r and c indicate the row and column index of the pixel. Our goal was to choose a variety of features presented in handwriting recognition literature (e.g. [2]), such that an approximate reconstruction of a word from its features would be possible.

2.5. Projection Proﬁle

Projection profiles capture the distribution of ink along one of the two dimensions in a word image. A vertical projection profile is computed by summing the intensity values in each image column separately:

$$pp(I, c) = \sum_{r=1}^{h} (255 - I(r, c)). \tag{5}$$

Due to the variations in quality (e.g. contrast, faded ink) of the scanned images, different projection profiles do not generally vary in the same range. To make them comparable, the range of the projection profiles is normalized to the range [0..1], which yields $f_1(I, c)$. Figure 4 shows an example projection profile and the original image it was extracted from.

Figure 4: Original image and projection profile feature: (a) original image (slant/skew/baseline-normalized, cleaned), (b) normalized projection profile.
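A possible sketch of the projection profile feature follows. It assumes the image is given as a non-empty list of rows of intensities in 0..255, and it uses min-max scaling to reach the range [0..1]; the exact normalization used in the experiments is not spelled out above, so this choice and the handling of a flat profile are assumptions.

```python
from typing import List, Sequence


def projection_profile(image: Sequence[Sequence[int]]) -> List[float]:
    """Normalized vertical projection profile, one value per image column."""
    h, w = len(image), len(image[0])
    # Equation (5): sum the inverted intensities down each column.
    pp = [sum(255 - image[r][c] for r in range(h)) for c in range(w)]
    lo, hi = min(pp), max(pp)
    if hi == lo:
        # Flat profile: avoid division by zero (assumed handling).
        return [0.0] * w
    # Min-max scaling to [0, 1] (assumed normalization).
    return [(v - lo) / (hi - lo) for v in pp]
```

Columns with much dark ink produce values near 1 and empty columns produce values near 0, independent of the overall contrast of the scan.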

2.6. Word Proﬁles

Word profiles capture part of the outlining shape of a word. The current word matching algorithm uses upper and lower word profiles: these two features are calculated by going along the upper (lower) boundary of a word's bounding box and recording for each image column the distance to the nearest “ink” pixel in that column. The identification of ink pixels is currently realized using a thresholding technique which we have found to be sufficient for our purposes.

Footnote (Equation 5): We invert the pixel intensities, because the result is visually more intuitive (peaks for pronounced vertical components in the input word image).

Due to a number of factors, such as pressure on the writing instrument and fading ink, some image columns may not contain ink pixels. The occurrence of such gaps is not consistent for multiple instances of the same word. Therefore, we close these gaps by linearly interpolating between the two closest points where the word profile feature values could be reliably determined.

Figure 5: Normalized upper word profile (negative feature value displayed).

The features $f_2$ and $f_3$ can be obtained from the upper and lower word profiles by normalizing their maximum range to [0..1]. Figure 5 shows an upper word profile feature, generated from the original in Figure 4(a).
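The upper word profile and the gap-closing step described in this section could be sketched as follows. The threshold value of 128, the treatment of gaps at the left and right image border (copying the nearest known value) and all function names are assumptions; the lower profile would be computed analogously from the bottom of the bounding box.

```python
from typing import List, Optional, Sequence


def upper_profile(image: Sequence[Sequence[int]],
                  threshold: int = 128) -> List[Optional[int]]:
    """Distance from the top row to the first ink pixel in each column.

    Pixels darker than `threshold` count as ink (assumed value).
    Columns without any ink pixel yield None.
    """
    h, w = len(image), len(image[0])
    profile: List[Optional[int]] = []
    for c in range(w):
        dist = next((r for r in range(h) if image[r][c] < threshold), None)
        profile.append(dist)
    return profile


def fill_gaps(profile: Sequence[Optional[float]]) -> List[float]:
    """Linearly interpolate None entries between the nearest known columns."""
    known = [i for i, v in enumerate(profile) if v is not None]
    if not known:
        raise ValueError("profile contains no ink columns")
    filled = list(profile)
    # Gaps at either border copy the nearest known value (assumed handling).
    for i in range(known[0]):
        filled[i] = profile[known[0]]
    for i in range(known[-1] + 1, len(profile)):
        filled[i] = profile[known[-1]]
    # Interior gaps: straight line between the two enclosing known values.
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            t = (i - left) / (right - left)
            filled[i] = profile[left] + t * (profile[right] - profile[left])
    return [float(v) for v in filled]
```

The filled profile would then be scaled to [0..1] to obtain the corresponding feature.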

2.7. Background/Ink Transitions

So far, the above features represent the distribution of ink in the columns of a word image and the outlining shape of the word. To capture part of the “inner” structure of a word, we chose to record the number of background to ink transitions nbit(I, c) in an image column as the last feature. The range of this feature is normalized with a (conservatively estimated) constant that ensures a range of [0..1]:

$$f_4(I, c) = nbit(I, c) / 6. \tag{6}$$
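As an illustration, counting background-to-ink transitions per column might look like this. The ink threshold of 128, the top-to-bottom scan direction and the clamping of counts above 6 are assumptions made for the sketch; only the division by the constant 6 is taken from Equation (6).

```python
from typing import List, Sequence


def ink_transitions(image: Sequence[Sequence[int]], threshold: int = 128,
                    max_transitions: int = 6) -> List[float]:
    """Normalized number of background-to-ink transitions for every column."""
    h, w = len(image), len(image[0])
    feature = []
    for c in range(w):
        nbit = 0
        prev_is_ink = False  # the area above the image counts as background
        for r in range(h):
            is_ink = image[r][c] < threshold
            if is_ink and not prev_is_ink:
                nbit += 1
            prev_is_ink = is_ink
        # Clamp so the value stays in [0, 1] even for unusual columns
        # (assumed safeguard, not stated in the text).
        feature.append(min(nbit, max_transitions) / max_transitions)
    return feature
```

A column that crosses two separate strokes, for example, yields nbit = 2 and a feature value of 2/6.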

With this feature set at hand, we will now demonstrate its effectiveness when used within the proposed DTW matching algorithm (Section 2.3).

We tried other features, including Gaussian derivatives, but the above set seemed to work best.

3. Experimental Results

3.1. Data Sets and Processing

Word matching experiments were conducted on two test sets of different quality, both 10 pages in size. The first set is of acceptable quality (see Figure 6(a)). The second set is very degraded (see Figure 6(b)), to the point that it is difficult even for people to read these documents, and it was used to test how badly the algorithms would perform. A number of algorithms were tested and results are presented on four sets, which were constructed as follows:

A: 15 images in test set 1, analyzed in [3].

B: entire test set 1 (2381 images total, 9 do not contain words).

C: 32 images in test set 2, analyzed in [3].

D: entire test set 2 (3370 images total, 108 do not contain words).

Footnote (sets B and D): These images result from segmentation errors.

The subsets A and C allow us to test algorithms which would otherwise take too long to run on the entire dataset.

Each page in the two test sets was segmented into words using the algorithm described in [6]. The algorithm uses scale-space techniques to determine word boundaries which are then used to extract single word images. For reasons of comparability we used the exact same segmentation results as in [3].

For the matching based on DTW and the shape context run (see below), we normalized the slant and skew of the word images and cleaned the images to remove noise in the background and parts of other words that reach into the bounding box.

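| Test set | Total #queries | #pruned pairs / #total pairs | Recall |
|----------|----------------|------------------------------|--------|
| A        | 15             | 12.71%                       | 90.72% |
| B        | 2372           | 13.57%                       | 71.11% |
| C        | 32             | 13.01%                       | 56.49% |
| D        | 3262           | 14.26%                       | 55.05% |

Table 1: Effects of pruning for all analyzed data sets.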

The total number of word pairs, which would otherwise have to be processed by the matching algorithm, was reduced by applying the pruning techniques described in Section 2.1. Table 1 shows the effects of pruning on the 4 sets A, B, C and D. #pruned pairs denotes the number of image pairs left for comparison after pruning; #total pairs is the number of query words in the (partial) test set multiplied by the number of words in the enclosing collection (either 2381 or 3370). Recall is the proportion of valid matches that remains in the pruned set (100% = no valid matches discarded).
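To make the two statistics of Table 1 concrete, they could be computed as in the following sketch. Representing a word pair as a (query index, candidate index) tuple and the function name `pruning_stats` are assumptions introduced for illustration.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]  # (query index, candidate index), hypothetical encoding


def pruning_stats(kept: Set[Pair], relevant: Set[Pair],
                  num_queries: int, collection_size: int) -> Tuple[float, float]:
    """Return (#pruned pairs / #total pairs, recall) as fractions in [0, 1]."""
    # #total pairs: every query compared against the whole enclosing collection.
    total = num_queries * collection_size
    kept_fraction = len(kept) / total
    # Recall: share of valid matches that survive pruning.
    recall = len(kept & relevant) / len(relevant) if relevant else 1.0
    return kept_fraction, recall
```

For set A, for instance, #total pairs is 15 × 2381 = 35715, so the 12.71% in Table 1 corresponds to roughly 4500 pairs that remain for the matching algorithm.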

3.2. Evaluation Method

Each word in the data sets was tagged with its ASCII equivalent. In case of segmentation errors, a tag corresponding to all visible characters in the segmented word image was assigned. Based on this annotation, relevance judgments were produced for the data sets. Two word images were considered relevant if they have the same tag.

To evaluate the word image matching algorithms, we used an information retrieval approach: each image in a data set is viewed as a query which is used to retrieve similar images from the entire collection enclosing the data set (e.g. data set A is enclosed in set 1). Matching the query against other images produces a ranked list of retrieved images, sorted by the matching cost. Using the trec_eval program, we calculated average precision scores [11] for all queries in the sets A through D.
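For readers unfamiliar with the measure, the standard (non-interpolated) average precision of a single ranked list can be sketched as below. This is a plain illustration of the textbook definition, not a reproduction of trec_eval, whose handling of ties and other details is not modeled here; the function name is an assumption.

```python
from typing import Hashable, Sequence, Set


def average_precision(ranked: Sequence[Hashable],
                      relevant: Set[Hashable]) -> float:
    """Average of the precision values at the rank of each relevant item."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)
```

Here `ranked` would be the word images sorted by increasing matching cost for one query, and `relevant` the set of images carrying the same tag as the query.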


References

Shape matching and object recognition using shape contexts

Dynamic programming algorithm optimization for spoken word recognition

Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison

Time Warps, String Edits, and Macromolecules – The Theory and Practice of Sequence Comparison . David Sankoff and Joseph Kruskal. ISBN 1-57586-217-4. Price £13.95 (US$22·95).

An algorithm for associating the features of two images
