Journal ArticleDOI

A Novel Connectionist System for Unconstrained Handwriting Recognition

TL;DR: This paper proposes an alternative approach based on a novel type of recurrent neural network, specifically designed for sequence labeling tasks where the data is hard to segment and contains long-range bidirectional interdependencies, significantly outperforming a state-of-the-art HMM-based system.
Abstract: Recognizing lines of unconstrained handwritten text is a challenging task. The difficulty of segmenting cursive or overlapping characters, combined with the need to exploit surrounding context, has led to low recognition rates for even the best current recognizers. Most recent progress in the field has been made either through improved preprocessing or through advances in language modeling. Relatively little work has been done on the basic recognition algorithms. Indeed, most systems rely on the same hidden Markov models that have been used for decades in speech and handwriting recognition, despite their well-known shortcomings. This paper proposes an alternative approach based on a novel type of recurrent neural network, specifically designed for sequence labeling tasks where the data is hard to segment and contains long-range bidirectional interdependencies. In experiments on two large unconstrained handwriting databases, our approach achieves word recognition accuracies of 79.7 percent on online data and 74.1 percent on offline data, significantly outperforming a state-of-the-art HMM-based system. In addition, we demonstrate the network's robustness to lexicon size, measure the individual influence of its hidden layers, and analyze its use of context. Last, we provide an in-depth discussion of the differences between the network and HMMs, suggesting reasons for the network's superior performance.

Summary (5 min read)

Introduction

  • However, an obvious drawback of whole word classification is that it does not scale to large vocabularies.
  • Some systems segment characters before recognition, using techniques based on unsupervised learning and data-driven methods.
  • For this reason the authors chose the bidirectional Long Short-Term Memory (BLSTM; [33]) architecture, which provides access to long range context along both input directions.

II. ONLINE DATA PREPARATION

  • The online data used in their experiments were recorded from a whiteboard using the eBeam interface1 [34].
  • The acquisition interface outputs a sequence of (x, y)-coordinates representing the location of the tip of the pen together with a time stamp for each location.
  • The coordinates are only recorded during the periods when the pen-tip is in continuous contact with the whiteboard.
  • After some standard steps to correct for missing and noisy points [35], the data was stored in xml-format, along with the frame rate, which varied from 30 to 70 frames per second.

B. Feature Extraction

  • To extract the feature vectors from the normalised images, a sliding window approach is used.
  • The positions of the uppermost and lowermost black pixels, the rate of change of these positions (with respect to the neighbouring windows), and the number of black-white transitions between the uppermost and lowermost pixels are among the features computed at each window position.
  • For a more detailed description of the offline features, see [17].

III. OFFLINE DATA PREPARATION

  • The offline data used in their experiments consists of greyscale images scanned from handwritten forms, with a scanning resolution of 300 dpi and a greyscale bit depth of 8.
  • Then a histogram of the horizontal black/white transitions was calculated, and the text was split at the local minima to give a series of horizontal lines.
  • Any stroke crossing the boundaries between two lines was assigned to the line containing its centre of gravity.
  • With this method almost all text lines were extracted correctly.
  • Once the line images were extracted, the next stage was to normalise the text with respect to writing skew and slant, and character size.

A. Recurrent Neural Networks

  • Recurrent neural networks (RNNs) are a connectionist model containing a self-connected hidden layer.
  • One benefit of the recurrent connection is that a ‘memory’ of previous inputs remains in the network’s internal state, allowing it to make use of past context.
  • Another important advantage of recurrency is that the rate of change of the internal state can be finely modulated by the recurrent weights, which builds in robustness to localised distortions of the input data.
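
To make the role of the self-connected hidden layer concrete, the following minimal NumPy sketch (illustrative only; the dimensions, tanh nonlinearity and function names are assumptions, not the paper's implementation) shows how the hidden state carries a memory of past inputs forward through time.

    import numpy as np

    def rnn_forward(x_seq, W_in, W_rec, b):
        """Run a single-layer vanilla RNN over an input sequence.

        x_seq : (T, n_in) array, one feature vector per time step
        W_in  : (n_hidden, n_in) input weights
        W_rec : (n_hidden, n_hidden) recurrent (self-connection) weights
        b     : (n_hidden,) bias
        """
        T = x_seq.shape[0]
        n_hidden = b.shape[0]
        h = np.zeros(n_hidden)              # internal state, initially empty
        states = np.zeros((T, n_hidden))
        for t in range(T):
            # the new state depends on the current input AND the previous state,
            # so information about earlier inputs persists in h
            h = np.tanh(W_in @ x_seq[t] + W_rec @ h + b)
            states[t] = h
        return states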

B. Long Short-Term Memory (LSTM)

  • Unfortunately, the range of contextual information that standard RNNs can access is in practice quite limited.
  • The problem is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network’s recurrent connections.
  • In an unrolled standard RNN, the influence of the first input on the hidden units decays exponentially over time.
  • In an LSTM memory block, the cell has a recurrent connection with fixed weight 1.0.
  • The three gates collect input from the rest of the network and control the cell via multiplicative units.
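
A minimal sketch of one LSTM memory block with a single cell, written in NumPy to mirror the description above. It follows the standard LSTM step equations; the peephole weights shown in the paper's figure are omitted for brevity, and all names are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        """One time step of an LSTM block. W, U, b stack the parameters for the
        input gate (i), forget gate (f), output gate (o) and cell input (g)."""
        z = W @ x + U @ h_prev + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # multiplicative gate units
        g = np.tanh(g)                                 # squashed cell input
        # the cell's recurrent self-connection has fixed weight 1.0;
        # the forget gate scales it and the input gate scales the new content
        c = f * c_prev + i * g
        # the output gate controls when the cell's content is visible
        h = o * np.tanh(c)
        return h, c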

C. Bidirectional Recurrent Neural Networks

  • In handwriting recognition, for example, the identification of a given letter is helped by knowing the letters both to the right and left of it.
  • Bidirectional recurrent neural networks [41], [42] are able to access context in both directions along the input sequence.
  • Combining BRNNs and LSTM gives bidirectional LSTM .
  • BLSTM has previously outperformed other network architectures, including standard LSTM, BRNNs and HMM-RNN hybrids, on phoneme recognition [33], [45].
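
The bidirectional combination itself is simple to express: run one LSTM layer forwards over the sequence, another backwards, and concatenate their hidden states at every time step. The sketch below is illustrative and reuses the hypothetical lstm_step helper from the previous block.

    import numpy as np

    def blstm_layer(x_seq, fwd_params, bwd_params, n_hidden):
        """Bidirectional LSTM layer: forward and backward passes concatenated."""
        T = x_seq.shape[0]
        h_f, c_f = np.zeros(n_hidden), np.zeros(n_hidden)
        h_b, c_b = np.zeros(n_hidden), np.zeros(n_hidden)
        out_f = np.zeros((T, n_hidden))
        out_b = np.zeros((T, n_hidden))
        for t in range(T):                    # forward direction: past context
            h_f, c_f = lstm_step(x_seq[t], h_f, c_f, *fwd_params)
            out_f[t] = h_f
        for t in reversed(range(T)):          # backward direction: future context
            h_b, c_b = lstm_step(x_seq[t], h_b, c_b, *bwd_params)
            out_b[t] = h_b
        # every time step now sees context from both input directions
        return np.concatenate([out_f, out_b], axis=1)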

D. Connectionist Temporal Classification (CTC)

  • Traditional RNN objective functions require a presegmented input sequence with a separate target for every segment.
  • Moreover, because the outputs of such an RNN are a series of independent, local classifications, some form of post processing is required to transform them into the desired label sequence.
  • A CTC output layer contains as many units as there are labels in the task, plus an additional ‘blank’ or ‘no label’ unit.
  • Note that each output is conditioned on the entire input sequence.
  • One such architecture is bidirectional LSTM, as described in the previous section.

E. CTC Forward Backward Algorithm

  • To allow for blanks in the output paths, the authors consider modified label sequences l′, with blanks added to the beginning and the end of l, and inserted between every pair of consecutive labels.
  • In calculating the probabilities of prefixes of l′ the authors allow all transitions between blank and non-blank labels, and also those between any pair of distinct non-blank labels.
  • The backward variables $\beta_t(s)$ are defined as the summed probability of all paths whose suffixes starting at t map onto the suffix of l starting at label s/2: $\beta_t(s) = \sum_{\pi:\ \mathcal{B}(\pi_{t:T}) = \mathbf{l}_{s/2:|\mathbf{l}|}} \prod_{t'=t+1}^{T} y^{t'}_{\pi_{t'}}$.
  • Finally, the label sequence probability is given by the sum of the products of the forward and backward variables at any time t: $p(\mathbf{l}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{l}'|} \alpha_t(s)\,\beta_t(s)$. (6)
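
As a concrete illustration of these recursions, the sketch below computes the forward and backward variables over the blank-extended sequence l' and recovers p(l|x). It is a minimal reading of the algorithm, not the authors' implementation: it works directly in probability space, whereas a practical version would use log-space arithmetic to avoid underflow.

    import numpy as np

    def extend_with_blanks(labels, blank=0):
        """Build l': blanks at the start, end and between every pair of labels."""
        ext = [blank]
        for l in labels:
            ext += [l, blank]
        return ext

    def ctc_alpha_beta(y, labels, blank=0):
        """Forward/backward variables for CTC.

        y      : (T, K) softmax outputs, one distribution per time step
        labels : target label sequence z (no blanks)
        Returns alpha, beta of shape (T, |l'|) and p(z|x).
        """
        T = y.shape[0]
        lp = extend_with_blanks(labels, blank)          # l'
        S = len(lp)

        alpha = np.zeros((T, S))
        alpha[0, 0] = y[0, lp[0]]
        if S > 1:
            alpha[0, 1] = y[0, lp[1]]
        for t in range(1, T):
            for s in range(S):
                a = alpha[t - 1, s]
                if s - 1 >= 0:
                    a += alpha[t - 1, s - 1]
                # skip transition allowed between distinct non-blank labels
                if s - 2 >= 0 and lp[s] != blank and lp[s] != lp[s - 2]:
                    a += alpha[t - 1, s - 2]
                alpha[t, s] = a * y[t, lp[s]]

        # beta excludes the emission at time t, so that
        # p(z|x) = sum_s alpha[t, s] * beta[t, s] holds for any t
        beta = np.zeros((T, S))
        beta[T - 1, S - 1] = 1.0
        if S > 1:
            beta[T - 1, S - 2] = 1.0
        for t in range(T - 2, -1, -1):
            for s in range(S - 1, -1, -1):
                b = beta[t + 1, s] * y[t + 1, lp[s]]
                if s + 1 < S:
                    b += beta[t + 1, s + 1] * y[t + 1, lp[s + 1]]
                if s + 2 < S and lp[s + 2] != blank and lp[s + 2] != lp[s]:
                    b += beta[t + 1, s + 2] * y[t + 1, lp[s + 2]]
                beta[t, s] = b

        p = np.sum(alpha[0] * beta[0])     # the same value is obtained for any t
        return alpha, beta, p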

F. CTC Objective Function

  • The CTC objective function is defined as the negative log probability of the network correctly labelling the entire training set.
  • Then the objective function O can be expressed as $O = -\sum_{(\mathbf{x},\mathbf{z}) \in S} \ln p(\mathbf{z}|\mathbf{x})$. (7) The network can be trained with gradient descent by first differentiating O with respect to the outputs, then using backpropagation through time [47] to find the derivatives with respect to the network weights.
  • Substituting this into (7) gives $\frac{\partial O}{\partial y^t_k} = -\frac{1}{p(\mathbf{z}|\mathbf{x})\, y^t_k} \sum_{s \in \mathrm{lab}(\mathbf{z},k)} \alpha_t(s)\,\beta_t(s)$. (9) To backpropagate the gradient through the output layer, the authors need the objective function derivatives with respect to the outputs $a^t_k$ before the activation function is applied.
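
Continuing the sketch above (again illustrative and in probability space), the error signal backpropagated through the CTC output layer follows directly from the forward and backward variables; after the softmax, the derivative with respect to the activation $a^t_k$ takes the familiar form of the output minus the normalised label occupancy.

    import numpy as np

    def ctc_gradients(y, labels, blank=0):
        """Gradient of O = -ln p(z|x) with respect to the pre-softmax
        activations a[t, k], built on the alpha/beta sketch above."""
        alpha, beta, p = ctc_alpha_beta(y, labels, blank)
        T, K = y.shape
        lp = extend_with_blanks(labels, blank)

        # occupancy[t, k]: sum of alpha*beta over positions of l' carrying label k
        occupancy = np.zeros((T, K))
        for s, k in enumerate(lp):
            occupancy[:, k] += alpha[:, s] * beta[:, s]

        # equation (9): dO/dy[t, k] = -occupancy / (p * y[t, k])
        # after the softmax:  dO/da[t, k] = y[t, k] - occupancy / p
        return y - occupancy / p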

G. CTC Decoding

  • Once the network is trained, the authors would ideally transcribe some unknown input sequence x by choosing the labelling $\mathbf{l}^*$ with the highest conditional probability: $\mathbf{l}^* = \arg\max_{\mathbf{l}} p(\mathbf{l}|\mathbf{x})$.
  • For some tasks the authors want to constrain the output labellings according to a grammar.
  • Note that the assumption that the input sequences and the grammar are independent is in general false, since both depend on the underlying generator of the data, for example the language being spoken.
  • If the authors further assume that, prior to any knowledge about the input or the grammar, all label sequences are equally probable, (14) reduces to $\mathbf{l}^* = \arg\max_{\mathbf{l}} p(\mathbf{l}|\mathbf{x})\, p(\mathbf{l}|G)$. (17)
  • Note that, since the number of possible label sequences is finite (because both L and |l| are finite), assigning equal prior probabilities does not lead to an improper prior.
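
When no dictionary or grammar is imposed, the arg max can be approximated cheaply by best path decoding: take the most active output at every time step, then collapse repeated labels and remove blanks. This is a common approximation rather than the constrained search the authors actually use (their token passing algorithm, described next, handles the dictionary and bigram probabilities); the sketch is illustrative.

    import numpy as np

    def best_path_decode(y, blank=0):
        """Approximate l* = argmax_l p(l|x) by the single most probable path."""
        path = np.argmax(y, axis=1)          # most active output unit per time step
        decoded, prev = [], None
        for k in path:
            if k != blank and k != prev:     # collapse repeats, then drop blanks
                decoded.append(int(k))
            prev = k
        return decoded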

H. CTC Token Passing Algorithm

  • The authors now describe an algorithm, based on the token passing algorithm for HMMs [48], that allows us to find an approximate solution to (17) for a simple grammar.
  • The transition probabilities are used when a token is passed from the last character in one word to the first character in another.
  • The token tok(w, s, t) is the highest scoring token reaching segment s of word w at time t.
  • Because the output tokens tok(w, −1, T) are sorted in order of score, the search can be terminated when a token is reached whose score is less than the current best score with the transition included.
  • If no bigrams are used, lines 15-17 of the algorithm can be replaced by a simple search for the highest scoring output token, and the complexity reduces to O(TW).

V. EXPERIMENTS AND RESULTS

  • The aim of their experiments was to evaluate the complete RNN handwriting recognition system, illustrated in Figure 13, on both online and offline handwriting.
  • The online and offline databases used were the IAM-OnDB and the IAM-DB, respectively.
  • Note that these do not correspond to the same handwriting samples: the IAM-OnDB was acquired from a whiteboard (see Section II), while the IAM-DB consists of scanned images of handwritten forms (see Section III).
  • To make the comparisons fair, the same online and offline preprocessing was used for both the HMM and RNN systems.
  • As well as the main comparisons, extra experiments were carried out on the online database to determine the effect of varying the dictionary size, and of disabling the forward and backward hidden layers in the RNN.

A. Data Sets

  • For the online experiments the authors used the IAM-OnDB, a database acquired from a ‘smart’ whiteboard [49].
  • After preprocessing, the online input data consisted of 25 inputs per time step, and the offline data of 9.
  • Both the online and offline transcriptions contain 81 separate characters, including all lower case and capital letters as well as various other special characters, e.g., punctuation marks, digits, a character for garbage symbols, and a character for the space.
  • Note that for the RNN systems, only 80 of these were used, since the garbage symbol is not required for CTC.

B. Language Model and Dictionaries

  • Dictionaries were used for all the experiments where word accuracy was recorded.
  • All dictionaries were derived from three different text corpora, the LOB (excluding the data used as prompts), the Brown corpus [52], and the Wellington corpus [53].
  • The main dictionary contained the 20,000 most frequently occurring words in these corpora; this size was chosen because it had previously been shown to give the best results for HMMs [51].
  • Note that this dictionary was ‘open’, in the sense that it did not contain all the words in either the online or offline test set.
  • To analyse the dependency of the recognition scores on the lexicon, the authors carried out extra experiments on the online data using open dictionaries with between 5,000 and 30,000 words (also chosen from the most common words in LOB).

C. HMM Parameters

  • The HMM system used for the online and offline experiments was similar to that described in [35].
  • For the online data every character model contained eight states, while for the offline data the number of states was chosen individually for each character [51].
  • Both databases are available for public download, along with the corresponding task definitions, at http://www.iam.unibe.ch/fki/databases/iam-on-line-handwriting-database.
  • The observation probabilities were modelled by mixtures of diagonal Gaussians.
  • 32 Gaussians were used for online data and 12 for the offline data.
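
For reference, the state observation likelihood of such a diagonal-covariance Gaussian mixture can be evaluated as in the sketch below (illustrative parameter shapes; the 32-component, 25-feature example mirrors the online setup, but this is not the baseline system's actual code).

    import numpy as np

    def diag_gmm_loglik(x, weights, means, variances):
        """Log-likelihood of one observation under a diagonal-covariance GMM.

        x         : (D,) feature vector
        weights   : (M,) mixture weights summing to 1
        means     : (M, D) component means
        variances : (M, D) per-dimension variances
        """
        # log N(x; mu_m, diag(var_m)) for every component m
        log_norm = -0.5 * (np.log(2.0 * np.pi * variances)
                           + (x - means) ** 2 / variances).sum(axis=1)
        # weighted log-sum-exp over the mixture components
        return np.logaddexp.reduce(np.log(weights) + log_norm)

    # e.g. one online-data state: 32 Gaussians over the 25 online features
    ll = diag_gmm_loglik(np.zeros(25), np.full(32, 1 / 32),
                         np.zeros((32, 25)), np.ones((32, 25)))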

D. RNN Parameters

  • The RNN had a bidirectional Long Short-Term Memory architecture with a connectionist temporal classification (CTC) output layer (see Section IV for details).
  • The forward and backward hidden layers each contained 100 LSTM memory blocks.
  • The size of the input layer was determined by the data: for the online data there were 25 inputs, for the offline data there were 9.
  • Otherwise the network was identical for the two tasks.
  • The error rate was recorded every 5 epochs on the validation set and training was stopped when performance had ceased to improve on the validation set for 50 epochs.
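
For orientation, a comparable network can be written in a few lines of modern PyTorch. This is a reconstruction under the hyperparameters quoted above (100 LSTM blocks per direction, 25 or 9 inputs, 80 character labels plus one CTC blank), not the authors' original code; in particular, nn.LSTM omits the peephole connections used in the paper, and the training loop, validation schedule and early stopping are not shown.

    import torch
    import torch.nn as nn

    class BLSTMCTC(nn.Module):
        """Bidirectional LSTM with a CTC output layer, sized as in the paper."""
        def __init__(self, n_inputs=25, n_hidden=100, n_labels=80):
            super().__init__()
            self.blstm = nn.LSTM(input_size=n_inputs, hidden_size=n_hidden,
                                 bidirectional=True, batch_first=True)
            self.output = nn.Linear(2 * n_hidden, n_labels + 1)  # +1 for the blank

        def forward(self, x):                  # x: (batch, T, n_inputs)
            h, _ = self.blstm(x)
            return self.output(h).log_softmax(dim=-1)

    # training against unsegmented targets with the built-in CTC loss
    model = BLSTMCTC(n_inputs=25)              # use n_inputs=9 for the offline features
    ctc_loss = nn.CTCLoss(blank=80)            # the blank is the last output unit
    x = torch.randn(1, 300, 25)                # one (dummy) line of online features
    targets = torch.randint(0, 80, (1, 40))    # character indices of the transcript
    log_probs = model(x).permute(1, 0, 2)      # CTCLoss expects (T, batch, K)
    loss = ctc_loss(log_probs, targets,
                    input_lengths=torch.tensor([300]),
                    target_lengths=torch.tensor([40]))
    loss.backward()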

E. Main Results

  • As can be seen from Tables I and II, the RNN substantially outperformed the HMM on both databases.
  • To put these results in perspective, the Microsoft tablet PC handwriting recogniser [55] gave a word accuracy score of 71.32% on the online test set.
  • It suggests that their recogniser is competitive with the best commercial systems for unconstrained handwriting.
  • In a second experiment, the authors took a closed dictionary containing the 5,597 words in the online test set, and measured the change in performance when this was padded to 20,000 words.
  • The results for the first set of experiments are shown in Table III and plotted in Figure 14.

I. Use of Context by the RNN

  • The authors have previously asserted that the BLSTM architecture is able to access long-range, bidirectional context.
  • The larger the values of the sequential Jacobian for some t, t′, the more sensitive the network output at time t is to the input at time t′.
  • Figure 16 plots the sequential Jacobian for a single output during the transcription of a line from the online database.
  • As can be seen, the network output is sensitive to information from about the first 120 time steps of the sequence, which corresponds roughly to the length of the first word.
  • Moreover, this area of sensitivity extends in both directions from the point where the prediction is made.
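
The sequential Jacobian is straightforward to reproduce with automatic differentiation: pick one output unit at one time step and differentiate it with respect to every input frame. The sketch below is illustrative and reuses the hypothetical BLSTMCTC model from the earlier block; the per-frame gradient norm plays the role of the sensitivity curve discussed above.

    import torch

    def sequential_jacobian(model, x, t_out, k_out):
        """Sensitivity of output unit k_out at time t_out to every input frame.

        x : (1, T, n_inputs) input sequence. Returns a length-T vector whose
        entries are the norm of d output[t_out, k_out] / d x[t] over features.
        """
        x = x.clone().requires_grad_(True)
        log_probs = model(x)                     # (1, T, K)
        log_probs[0, t_out, k_out].backward()
        return x.grad[0].norm(dim=1)             # one sensitivity value per time step

    # example: how far does the context used at time step 200 extend?
    sensitivity = sequential_jacobian(model, torch.randn(1, 300, 25),
                                      t_out=200, k_out=5)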

VI. DISCUSSION

  • The authors' experiments reveal a substantial gap in performance between the HMM and RNN systems, with a relative error reduction of over 40% in some cases.
  • Firstly, standard HMMs are generative, while an RNN trained with a discriminative objective function (such as CTC) is discriminative.
  • A second difference is that RNNs provide more flexible models of the input features than the mixtures of diagonal Gaussians used in standard HMMs.
  • It should be noted that RNNs typically perform better using input features with simpler relationships.
  • Because an HMM's only memory of the past observation sequence is its internal state, an HMM with n states can carry only O(log n) bits of information about that sequence.

VII. CONCLUSIONS

  • The authors have introduced a novel approach for recognising unconstrained handwritten text, using a recurrent neural network.
  • The key features of the network are the bidirectional Long Short-Term Memory architecture, which provides access to long range, bidirectional contextual information, and the connectionist temporal classification output layer, which allows the network to be trained on unsegmented sequence data.
  • In experiments on online and offline handwriting data, the new approach outperformed a state-of-the-art HMM-based system, and also proved more robust to changes in dictionary size.


A Novel Connectionist System for Unconstrained
Handwriting Recognition
Alex Graves, Marcus Liwicki, Santiago Fernández,
Roman Bertolami, Horst Bunke, Jürgen Schmidhuber
Abstract—Recognising lines of unconstrained handwritten text
is a challenging task. The difficulty of segmenting cursive
or overlapping characters, combined with the need to exploit
surrounding context, has led to low recognition rates for even the
best current recognisers. Most recent progress in the field has
been made either through improved preprocessing, or through
advances in language modelling. Relatively little work has been
done on the basic recognition algorithms. Indeed, most systems
rely on the same hidden Markov models that have been used
for decades in speech and handwriting recognition, despite their
well-known shortcomings. This paper proposes an alternative
approach based on a novel type of recurrent neural network,
specifically designed for sequence labelling tasks where the
data is hard to segment and contains long range, bidirectional
interdependencies. In experiments on two large unconstrained
handwriting databases, our approach achieves word recognition
accuracies of 79.7% on online data and 74.1% on offline
data, significantly outperforming a state-of-the-art HMM-based
system. In addition, we demonstrate the network’s robustness to
lexicon size, measure the individual influence of its hidden layers,
and analyse its use of context. Lastly we provide an in depth
discussion of the differences between the network and HMMs,
suggesting reasons for the network’s superior performance.
Index Terms—Handwriting recognition, online handwriting,
offline handwriting, connectionist temporal classification, bidi-
rectional long short-term memory, recurrent neural networks,
hidden Markov model
I. INTRODUCTION
Handwriting recognition is traditionally divided into
online and offline recognition. In online recognition a
time series of coordinates, representing the movement of the
pen-tip, is captured, while in the offline case only an image of
the text is available. Because of the greater ease of extracting
relevant features, online recognition generally yields better re-
sults [1]. Another important distinction is between recognising
isolated characters or words, and recognising whole lines of
text. Unsurprisingly, the latter is substantially harder, and the
excellent results that have been obtained for digit and character
recognition [2], [3] have never been matched for complete
lines. Lastly, handwriting recognition can be split into cases
where the writing style is constrained in some way—for
Manuscript received May 9, 2008.
Alex Graves and Jürgen Schmidhuber are with TU Munich, Boltzmannstr. 3, D-85748 Garching, Munich, Germany.
Marcus Liwicki, Roman Bertolami and Horst Bunke are with the IAM, Universität Bern, Neubrückstrasse 10, CH-3012 Bern, Switzerland.
Marcus Liwicki is also with the German Research Center for AI (DFKI GmbH), Knowledge Management Department, Kaiserslautern, Germany.
Santiago Fernández and Jürgen Schmidhuber are with IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland.
example, only hand printed characters are allowed—and the
more challenging scenario where it is unconstrained. Despite
more than 30 years of handwriting recognition research [2],
[3], [4], [5], developing a reliable, general-purpose system for
unconstrained text line recognition remains an open problem.
A well known test bed for isolated handwritten character
recognition is the UNIPEN database [6]. Systems that have
been found to perform well on UNIPEN include: a writer-
independent approach based on hidden Markov models [7]; a
hybrid technique called cluster generative statistical dynamic
time warping [8], which combines dynamic time warping
with HMMs and embeds clustering and statistical sequence
modelling in a single feature space; and a support vector
machine with a novel Gaussian dynamic time warping ker-
nel [9]. Typical error rates on UNIPEN range from 3% for digit
recognition, to about 10% for lower case character recognition.
Similar techniques can be used to classify isolated words,
and this has given good results for small vocabularies (for
example a writer dependent word error rate of about 4.5% for
32 words [10]). However an obvious drawback of whole word
classification is that it does not scale to large vocabularies.
For large vocabulary recognition tasks, such as those con-
sidered in this paper, the only feasible approach is to recognise
individual characters and map them onto complete words using
a dictionary. Naively, this could be done by presegmenting
words into characters and classifying each segment. However,
segmentation is difficult for cursive or unconstrained text,
unless the words have already been recognised. This creates
a circular dependency between segmentation and recognition
that is sometimes referred to as Sayre’s paradox [11].
One solution to Sayre’s paradox is to simply ignore it, and
carry out segmentation before recognition. For example [3]
describes techniques for character segmentation, based on un-
supervised learning and data-driven methods. Other strategies
first segment the text into basic strokes, rather than characters.
The stroke boundaries may be defined in various ways, such
as the minima of the velocity, the minima of the y-coordinates,
or the points of maximum curvature. For example, one online
approach first segments the data at the minima of the y-
coordinates, then applies self-organising maps [12]. Another,
offline, approach uses the minima of the vertical histogram
for an initial estimation of the character boundaries and then
applies various heuristics to improve the segmentation [13].
A more promising approach to Sayre’s paradox is to
segment and recognise at the same time. Hidden Markov
models (HMMs) are able to do this, which is one reason for
their popularity in unconstrained handwriting recognition [14],

[15], [16], [17], [18], [19]. The idea of applying HMMs
to handwriting recognition was originally motivated by their
success in speech recognition [20], where a similar conflict
exists between recognition and segmentation. Over the years,
numerous refinements of the basic HMM approach have been
proposed, such as the writer independent system considered
in [7], which combines point oriented and stroke oriented input
features.
However, HMMs have several well-known drawbacks. One
of these is that they assume the probability of each observation
depends only on the current state, which makes contextual ef-
fects difficult to model. Another is that HMMs are generative,
while discriminative models generally give better performance
in labelling and classification tasks.
Recurrent neural networks (RNNs) do not suffer from these
limitations, and would therefore seem a promising alterna-
tive to HMMs. However the application of RNNs alone to
handwriting recognition have so far been limited to isolated
character recognition (e.g. [21]). The main reason for this is
that traditional neural network objective functions require a
separate training signal for every point in the input sequence,
which in turn requires presegmented data.
A more successful use of neural networks for handwriting
recognition has been to combine them with HMMs in the so-
called hybrid approach [22], [23]. A variety of network archi-
tectures have been tried for hybrid handwriting recognition,
including multilayer perceptrons [24], [25], time delay neural
networks [18], [26], [27], and RNNs [28], [29], [30]. However,
although hybrid models alleviate the difficulty of introducing
context to HMMs, they still suffer from many of the drawbacks
of HMMs, and they do not realise the full potential of RNNs
for sequence modelling.
This paper proposes an alternative approach, in which a
single RNN is trained directly for sequence labelling. The
network uses the connectionist temporal classification (CTC)
output layer [31], [32], first applied to speech recognition.
CTC uses the network to map directly from the complete input
sequence to the sequence of output labels, obviating the need
for presegmented data. We extend the original formulation of
CTC by combining it with a dictionary and language model to
obtain word recognition scores that can be compared directly
with other systems. Although CTC can be used with any type
of RNN, best results are given by networks able to incorpo-
rate as much context as possible. For this reason we chose
the bidirectional Long Short-Term Memory (BLSTM; [33])
architecture, which provides access to long range context along
both input directions.
In experiments on large online and offline handwriting
databases, our approach significantly outperforms a state-of-
the-art HMM-based system on unconstrained text line recog-
nition. Furthermore, the network retains its advantage over a
wide range of dictionary sizes.
The paper is organised as follows. Sections II and III
describe the preprocessing and feature extraction methods for
the online and offline data respectively. Section IV introduces
the novel RNN-based recogniser. Section V describes the
databases and presents the experimental analysis and results.
Section VI discusses the differences between the new system
Fig. 1. Illustration of the recording
Fig. 2. Examples of handwritten text acquired from a whiteboard
and HMMs, and suggests reasons for the network’s superior
performance. Our conclusions are presented in Section VII.
II. ONLINE DATA PREPARATION
The online data used in our experiments were recorded from
a whiteboard using the eBeam interface¹ [34]. As illustrated
in Figure 1, the interface consists of a normal pen in a
special casing, which sends infrared signals to a triangular
receiver mounted in one of the corners of the whiteboard. The
acquisition interface outputs a sequence of (x, y)-coordinates
representing the location of the tip of the pen together with
a time stamp for each location. The coordinates are only
recorded during the periods when the pen-tip is in continuous
contact with the whiteboard. We refer to these periods as
strokes. After some standard steps to correct for missing and
noisy points [35], the data was stored in xml-format, along
with the frame rate, which varied from 30 to 70 frames per
second.
A. Normalisation
The preprocessing begins with data normalisation. This is
an important step in handwriting recognition because writing
styles differ greatly with respect to the skew, slant, height and
width of the characters.
Since the subjects stand rather than sit, and their arms do
not rest on a table, handwriting rendered on a whiteboard
¹ eBeam System by Luidia, Inc. - www.e-Beam.com

Fig. 3. Processing the text line. Top: text line split into individual parts with
estimated skew in the middle of the text line; bottom: text line after skew
normalisation. Note that baseline and corpus line detection (described below)
give an improved final estimate of the skew.
Fig. 4. Slant correction; grey lines indicate the estimated slant angle
is different from that produced with a pen on a writing
tablet. In particular, it has been observed that the baseline
on a whiteboard cannot usually be approximated by a simple
straight line. Furthermore, the size and width of the characters
become smaller the more the pen moves to the right. Examples
of both effects can be seen in Figure 2. Consequently, online
handwriting gathered from a whiteboard requires some special
preprocessing steps.
Since the text lines on a whiteboard usually have no uniform
skew, they are split into smaller parts and the rest of the
preprocessing is done for each part separately. To accomplish
the splitting, all gaps within a line are determined first. The
text is split at a gap if it is wider than the median gap width,
and if the size of both parts resulting from the split is larger
than some predefined threshold. An example of the splitting
process is shown in Figure 3 with the resulting parts indicated
by lines below the text.
Next the parts are corrected with respect to their skew and
slant. A linear regression is performed through all the points,
and the orientation of the text line is corrected according to the
regression parameters (see Figure 3). For slant normalisation,
we compute the histogram over all angles enclosed by the
lines connecting two successive points of the trajectory and
the horizontal line [26]. The histogram ranges from −90°
to 90° with a step size of 2°. We weight the histogram values
with a Gaussian whose mean is at the vertical angle and whose
variance is chosen empirically. This is beneficial because some
words are not properly corrected if a single long straight line
is drawn in the horizontal direction, which results in a large
histogram value. We also smooth each histogram entry $\gamma_i$
using its nearest neighbours, $\bar{\gamma}_i = (\gamma_{i-1} + 2\gamma_i + \gamma_{i+1})/4$,
because in some cases the correct slant is at the border of
two angle intervals and a single peak at another interval may
be slightly higher. This single peak will become smaller after
smoothing. Figure 4 shows a text line before and after slant
correction.
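
The smoothing above amounts to a three-tap filter over the angle histogram; a minimal NumPy sketch (illustrative, with hypothetical 2-degree binning and zero-padded edge bins rather than the exact handling used here) is:

    import numpy as np

    def smooth_histogram(gamma):
        """Smooth each histogram entry with its neighbours:
        gamma_bar_i = (gamma_{i-1} + 2*gamma_i + gamma_{i+1}) / 4."""
        kernel = np.array([1.0, 2.0, 1.0]) / 4.0
        return np.convolve(gamma, kernel, mode='same')

    # hypothetical slant histogram over angles from -90 to 90 degrees in 2-degree steps
    angles = np.arange(-90, 90, 2)
    gamma = np.random.rand(angles.size)
    slant_angle = angles[np.argmax(smooth_histogram(gamma))]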
Delayed strokes, such as the crossing of a ‘t’ or the dot
of an ‘i’, are a well known problem in online handwriting
recognition, because the order in which they are written varies
between different writers. For this reason, delayed strokes
(identified as strokes above already written parts, followed by
a pen-movement to the right [26]) are removed. Note that some
i-dots may be missed by this procedure. However, this usually
occurs only if there was no pen-movement to the left, meaning
that the writing order is not disrupted. A special hat-feature
is used to indicate to the recogniser that a delayed stroke was
removed.
To correct for variations in writing speed, the input se-
quences are transformed so that the points are equally spaced.
The optimal value for this distance is found empirically.
The next step is the computation of the baseline and the
corpus line, which are then used to normalise the size of the
text. The baseline corresponds to the original line on which the
text was written, i.e. it passes through the bottom of the char-
acters. The corpus line goes through the top of the lower case
letters. To obtain these lines two linear regressions through the
minima and maxima are computed. After removing outliers the
regression is repeated twice, resulting in the estimated baseline
(minima) and corpus line (maxima). Figure 5 illustrates the
estimated baseline and the corpus line of part of the example
shown in Figure 3. The baseline is subtracted from all y-
coordinates and the heights of the three resulting areas are
normalised.
Fig. 5. Baseline and corpus line of an example part of a text line
The final preprocessing step is to normalise the width of the
characters. This is done by scaling the text horizontally with a
fraction of the number of strokes crossing the horizontal line
between the baseline and the corpus line. This preprocessing
step is needed because the x-coordinates of the points are
taken as a feature.
B. Feature Extraction
The input to the recogniser consists of 25 features for
each (x, y)-coordinate recorded by the acquisition system.
These features can be divided into two classes. The first class
consists of features extracted for each point by considering its
neighbours in the time series. The second class is based on the
spatial information given by the offline matrix representation.
For point (x, y), the features in the first class are as follows:
• A feature indicating whether the pen-tip is touching the board or not
• The hat-feature indicating whether a delayed stroke was removed at y
• The velocity computed before resampling
• The x-coordinate after high-pass filtering, i.e. after subtracting a moving average from the true horizontal position
• The y-coordinate after normalisation
• The cosine and sine of the angle between the line segment starting at the point and the x-axis (writing direction)
• The cosine and sine of the angle between the lines to the previous and the next point (curvature)
• The vicinity aspect, equal to the aspect of the trajectory (see Figure 6): $(\Delta y(t) - \Delta x(t))\,/\,(\Delta y(t) + \Delta x(t))$
• The cosine and sine of the angle α of the straight line from the first to the last vicinity point (see Figure 6)
• The length of the trajectory in the vicinity divided by $\max(\Delta x(t), \Delta y(t))$
• The average squared distance $d^2$ of each point in the vicinity to the straight line from the first to the last vicinity point

Fig. 6. Vicinity features of the point (x(t), y(t)). The three previous and three next points are considered in the example shown in this figure.
The features in the second class, illustrated in Figure 7,
are computed using a two-dimensional matrix $B = (b_{i,j})$
representing the offline version of the data. For each position
$b_{i,j}$ the number of points on the trajectory of the strokes is
stored, providing a low-resolution image of the handwritten
data. The following features are used:
• The number of points above the corpus line (ascenders) and below the baseline (descenders) in the vicinity of the current point
• The number of black points in each region of the context map (the two-dimensional vicinity of the current point is transformed to a 3 × 3 map with width and height set to the height of the corpus)
Fig. 7. Offline matrix features. The large white dot marks the considered
point. The other online points are marked with smaller dots. The strokes have
been widened for ease of visualisation.
Fig. 8. Preprocessing of an image of handwritten text, showing the original
image (top), and the normalised image (bottom).
III. OFFLINE DATA PREPARATION
The offline data used in our experiments consists of
greyscale images scanned from handwritten forms, with a
scanning resolution of 300 dpi and a greyscale bit depth of
8. The following procedure was carried out to extract the text
lines from the images. First, the image was rotated to account
for the overall skew of the document, and the handwritten
part was extracted from the form. Then a histogram of the
horizontal black/white transitions was calculated, and the text
was split at the local minima to give a series of horizontal
lines. Any stroke crossing the boundaries between two lines
was assigned to the line containing its centre of gravity. With
this method almost all text lines were extracted correctly.²
Once the line images were extracted, the next stage was to
normalise the text with respect to writing skew and slant, and
character size.
A. Normalisation
Unlike the online data, the normalisations for the offline data
are applied to entire text lines at once. First of all the image is
rotated to account for the line skew. Then the mean slant of the
text is estimated, and a shearing transformation is applied to
the image to bring the handwriting to an upright position. Next
the baseline and the corpus line are normalised. Normalisation
of the baseline means that the body of the text line (the
part which is located between the upper and lower baselines),
the ascender part (located above the upper baseline), and the
descender part (below the lower baseline) are each scaled to
a predefined height. Finally the image is scaled horizontally
so that the mean character width is approximately equal to a
predefined size. Figure 8 illustrates the offline preprocessing.
B. Feature Extraction
To extract the feature vectors from the normalised images, a
sliding window approach is used. The width of the window is
one pixel, and nine geometrical features are computed at each
window position. Each text line image is therefore converted
to a sequence of 9-dimensional vectors. The nine features are
as follows:
• The mean grey value of the pixels
• The centre of gravity of the pixels
• The second order vertical moment of the centre of gravity
• The positions of the uppermost and lowermost black pixels
• The rate of change of these positions (with respect to the neighbouring windows)
• The number of black-white transitions between the uppermost and lowermost pixels
• The proportion of black pixels between the uppermost and lowermost pixels

² Only about 1 % of the text lines contain errors. These have been corrected manually in previous work [36].

For a more detailed description of the offline features, see [17].
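
To make the sliding-window extraction concrete, the sketch below computes a subset of these nine features for a single one-pixel-wide window (one column of the normalised line image). The ink convention, threshold and handling of empty columns are assumptions for illustration, not the exact procedure of [17].

    import numpy as np

    def column_features(col, ink_threshold=0.5):
        """A few of the geometric features for one image column.

        col : 1-D array of ink values in [0, 1] (1 = black), index 0 = top of line.
        """
        rows = np.arange(col.size)
        black = col > ink_threshold                  # binarised ink mask
        total_ink = max(col.sum(), 1e-8)

        mean_grey = col.mean()
        centre_of_gravity = (rows * col).sum() / total_ink
        second_moment = (rows ** 2 * col).sum() / total_ink

        if black.any():
            top = rows[black][0]                     # uppermost black pixel
            bottom = rows[black][-1]                 # lowermost black pixel
            span = black[top:bottom + 1]
            transitions = int(np.count_nonzero(span[1:] != span[:-1]))
            black_fraction = span.mean()             # proportion of black pixels
        else:
            top, bottom, transitions, black_fraction = -1, -1, 0, 0.0

        return np.array([mean_grey, centre_of_gravity, second_moment,
                         top, bottom, transitions, black_fraction])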
IV. NEURAL NETWORK RECOGNISER
A. Recurrent Neural Networks
Recurrent neural networks (RNNs) are a connectionist
model containing a self-connected hidden layer. One benefit
of the recurrent connection is that a ‘memory’ of previous
inputs remains in the network’s internal state, allowing it to
make use of past context. Context plays an important role in
handwriting recognition, as illustrated in Figure 9. Another
important advantage of recurrency is that the rate of change
of the internal state can be finely modulated by the recurrent
weights, which builds in robustness to localised distortions of
the input data.
Fig. 9. Importance of context in handwriting recognition. The word ‘defence’
is clearly legible, but the letter ‘n’ in isolation is ambiguous.
B. Long Short-Term Memory (LSTM)
Unfortunately, the range of contextual information that stan-
dard RNNs can access is in practice quite limited. The problem
is that the influence of a given input on the hidden layer,
and therefore on the network output, either decays or blows
up exponentially as it cycles around the network’s recurrent
connections. This shortcoming (referred to in the literature as
the vanishing gradient problem [37], [38]) makes it hard for an
RNN to bridge gaps of more than about 10 time steps between
relevant input and target events [37]. The vanishing gradient
problem is illustrated schematically in Figure 10.
Long Short-Term Memory (LSTM) [39], [40] is an RNN
architecture specifically designed to address the vanishing gra-
dient problem. An LSTM hidden layer consists of recurrently
connected subnets, called memory blocks. Each block contains
a set of internal units, or cells, whose activation is controlled
by three multiplicative gates: the input gate, forget gate and
output gate. Figure 11 provides a detailed illustration of an
LSTM memory block with a single cell.
The effect of the gates is to allow the cells to store and
access information over long periods of time. For example, as
long as the input gate remains closed (i.e. has an activation
close to 0), the activation of the cell will not be overwritten
by the new inputs arriving in the network. Similarly, the cell
activation is only available to the rest of the network when
Fig. 10. Illustration of the vanishing gradient problem. The diagram
represents a recurrent network unrolled in time. The units are shaded according
to how sensitive they are to the input at time 1 (where black is high and white
is low). As can be seen, the influence of the first input decays exponentially
over time.
Fig. 11. LSTM memory block with one cell. The cell has a recurrent
connection with fixed weight 1.0. The three gates collect input from the rest
of the network, and control the cell via multiplicative units (small circles). The
input and output gates scale the input and output of the cell, while the forget
gate scales the recurrent connection of the cell. The cell squashing functions
(g and h) are applied at the indicated places. The internal connections from
the cell to the gates are known as peephole weights.
the output gate is open, and the cell’s recurrent connection is
switched on and off by the forget gate.
Figure 12 illustrates how an LSTM block maintains gradient
information over time. Note that the dependency is ‘carried’
by the memory cell as long as the forget gate is open and the
input gate is closed, and that the output dependency can be
switched on and off by the output gate, without affecting the
hidden cell.
C. Bidirectional Recurrent Neural Networks
For many tasks it is useful to have access to future as well
as past context. In handwriting recognition, for example, the
identification of a given letter is helped by knowing the letters
both to the right and left of it.

Citations
Journal ArticleDOI
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviews deep supervised learning, unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

14,635 citations


Cites background from "A Novel Connectionist System for Un..."

  • ...…analysis (Hochreiter & Obermayer, 2005), handwriting recognition (Bluche et al., 2014; Graves, Fernandez, Liwicki, Bunke, & Schmidhuber, 2008; Graves et al., 2009; Graves & Schmidhuber, 2009), voice activity detection (Eyben, Weninger, Squartini, & Schuller, 2013), optical character…...


  • ...BRNNs and DAG-RNNs unfold their full potential when combined with the LSTM concept (Graves et al., 2009; Graves & Schmidhuber, 2005, 2009)....


  • ...Compare Graves and Jaitly (2014), Graves and Schmidhuber (2005), Graves et al. (2009), Graves et al. (2013) and Schmidhuber, Ciresan, Meier, Masci, and Graves (2011) (Section 5.22)....


Journal ArticleDOI
TL;DR: This paper presents the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling, and observes that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
Abstract: Several variants of the long short-term memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful functional ANalysis Of VAriance framework. In total, we summarize the results of 5400 experimental runs ( $\approx 15$ years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.

4,746 citations


Cites background from "A Novel Connectionist System for Un..."

  • ...includes handwriting recognition [3]–[5] and generation [6], language modeling [7] and translation [8], acoustic modeling of speech [9], speech synthesis [10], protein secondary...


Posted Content
TL;DR: This paper proposes a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem and validates empirically the hypothesis and proposed solutions.
Abstract: There are two widely known issues with properly training Recurrent Neural Networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.

3,549 citations


Cites background from "A Novel Connectionist System for Un..."

  • ...In Hochreiter and Schmidhuber (1997); Graves et al. (2009) a solution is proposed for the vanishing gradients problem, where the structure of the model is changed....


Proceedings Article
16 Jun 2013
TL;DR: In this article, a gradient norm clipping strategy is proposed to deal with the vanishing and exploding gradient problems in recurrent neural networks. But the proposed solution is limited to the case of RNNs.
Abstract: There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.

2,586 citations

Proceedings ArticleDOI
01 Jan 2014
TL;DR: The first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines is introduced and it is shown that a two-layer deep LSTm RNN where each L STM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance.
Abstract: Long Short-Term Memory (LSTM) is a specific recurrent neural network (RNN) architecture that was designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we explore LSTM RNN architectures for large scale acoustic modeling in speech recognition. We recently showed that LSTM RNNs are more effective than DNNs and conventional RNNs for acoustic modeling, considering moderately-sized models trained on a single machine. Here, we introduce the first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines. We show that a two-layer deep LSTM RNN where each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance. This architecture makes more effective use of model parameters than the others considered, converges quickly, and outperforms a deep feed forward neural network having an order of magnitude more parameters. Index Terms: Long Short-Term Memory, LSTM, recurrent neural network, RNN, speech recognition, acoustic modeling.

2,492 citations


Cites methods from "A Novel Connectionist System for Un..."

  • ...For online and offline handwriting recognition, BLSTM networks used together with a Connectionist Temporal Classification (CTC) layer and trained from unsegmented sequence data, have been shown to outperform a stateof-the-art Hidden-Markov-Model (HMM) based system [10]....

    [...]

References
Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O. 1. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations


"A Novel Connectionist System for Un..." refers background in this paper

  • ...An LSTM hidden layer consists of recurrently connected subnets, called memory blocks....


  • ...For this reason, we chose the bidirectional Long Short-Term Memory (BLSTM) [33] architecture, which provides access to long-range context along both input directions....


  • ...The RNN had a BLSTM architecture with a CTC output layer (see Section 4 for details)....


  • ...11 provides a detailed illustration of an LSTM memory block with a single cell....


  • ...One such architecture is BLSTM, as described in the previous section....


Journal ArticleDOI
Lawrence R. Rabiner
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described. >

21,819 citations

Journal ArticleDOI
TL;DR: This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.
Abstract: Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered. >

7,309 citations


"A Novel Connectionist System for Un..." refers background in this paper

  • ...literature as the vanishing gradient problem [37], [38]) makes it...


Journal ArticleDOI
TL;DR: It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.
Abstract: In the first part of this paper, a regular recurrent neural network (RNN) is extended to a bidirectional recurrent neural network (BRNN). The BRNN can be trained without the limitation of using input information just up to a preset future frame. This is accomplished by training it simultaneously in positive and negative time direction. Structure and training procedure of the proposed network are explained. In regression and classification experiments on artificial data, the proposed structure gives better results than other approaches. For real data, classification experiments for phonemes from the TIMIT database show the same tendency. In the second part of this paper, it is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution. For this part, experiments on real data are reported.

7,290 citations


"A Novel Connectionist System for Un..." refers background or methods in this paper

  • ...BRNNs have outperformed standard RNNs in several sequence learning tasks, notably protein structure prediction [43] and speech processing [41], [44]....


  • ...Bidirectional RNNs (BRNNs) [41], [42] are able to access context in both directions along the input sequence....


Proceedings ArticleDOI
25 Jun 2006
TL;DR: This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.
Abstract: Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.

5,188 citations



Frequently Asked Questions (1)
Q1. What have the authors contributed in "A novel connectionist system for unconstrained handwriting recognition"?

This paper proposes an alternative approach based on a novel type of recurrent neural network, specifically designed for sequence labelling tasks where the data is hard to segment and contains long range, bidirectional interdependencies. In experiments on two large unconstrained handwriting databases, their approach achieves word recognition accuracies of 79.7 percent on online data and 74.1 percent on offline data, significantly outperforming a state-of-the-art HMM-based system. In addition, the authors demonstrate the network's robustness to lexicon size, measure the individual influence of its hidden layers, and analyse its use of context. Lastly the authors provide an in-depth discussion of the differences between the network and HMMs, suggesting reasons for the network's superior performance.