Journal ArticleDOI

A Novel Connectionist System for Unconstrained Handwriting Recognition

TL;DR: This paper proposes an alternative approach based on a novel type of recurrent neural network, specifically designed for sequence labeling tasks where the data is hard to segment and contains long-range bidirectional interdependencies, significantly outperforming a state-of-the-art HMM-based system.
Abstract: Recognizing lines of unconstrained handwritten text is a challenging task. The difficulty of segmenting cursive or overlapping characters, combined with the need to exploit surrounding context, has led to low recognition rates for even the best current recognizers. Most recent progress in the field has been made either through improved preprocessing or through advances in language modeling. Relatively little work has been done on the basic recognition algorithms. Indeed, most systems rely on the same hidden Markov models that have been used for decades in speech and handwriting recognition, despite their well-known shortcomings. This paper proposes an alternative approach based on a novel type of recurrent neural network, specifically designed for sequence labeling tasks where the data is hard to segment and contains long-range bidirectional interdependencies. In experiments on two large unconstrained handwriting databases, our approach achieves word recognition accuracies of 79.7 percent on online data and 74.1 percent on offline data, significantly outperforming a state-of-the-art HMM-based system. In addition, we demonstrate the network's robustness to lexicon size, measure the individual influence of its hidden layers, and analyze its use of context. Last, we provide an in-depth discussion of the differences between the network and HMMs, suggesting reasons for the network's superior performance.

Summary (5 min read)

Introduction

  • However, an obvious drawback of whole word classification is that it does not scale to large vocabularies.
  • Some systems segment characters before recognition, using techniques based on unsupervised learning and data-driven methods.
  • For this reason the authors chose the bidirectional Long Short-Term Memory (BLSTM; [33]) architecture, which provides access to long range context along both input directions.

II. ONLINE DATA PREPARATION

  • The online data used in their experiments were recorded from a whiteboard using the eBeam interface1 [34].
  • The acquisition interface outputs a sequence of (x, y)-coordinates representing the location of the tip of the pen together with a time stamp for each location.
  • The coordinates are only recorded during the periods when the pen-tip is in continuous contact with the whiteboard.
  • After some standard steps to correct for missing and noisy points [35], the data was stored in xml-format, along with the frame rate, which varied from 30 to 70 frames per second.

B. Feature Extraction

  • To extract the feature vectors from the normalised images, a sliding window approach is used.
  • The positions of the uppermost and lowermost black pixels, the rate of change of these positions (with respect to the neighbouring windows), and the number of black-white transitions between the uppermost and lowermost pixels are among the features computed at each window position.
  • For a more detailed description of the offline features, see [17].

III. OFFLINE DATA PREPARATION

  • The offline data used in their experiments consists of greyscale images scanned from handwritten forms, with a scanning resolution of 300 dpi and a greyscale bit depth of 8.
  • Then a histogram of the horizontal black/white transitions was calculated, and the text was split at the local minima to give a series of horizontal lines.
  • Any stroke crossing the boundaries between two lines was assigned to the line containing its centre of gravity.
  • With this method almost all text lines were extracted correctly.
  • Once the line images were extracted, the next stage was to normalise the text with respect to writing skew and slant, and character size.

A. Recurrent Neural Networks

  • Recurrent neural networks (RNNs) are a connectionist model containing a self-connected hidden layer.
  • One benefit of the recurrent connection is that a ‘memory’ of previous inputs remains in the network’s internal state, allowing it to make use of past context.
  • Another important advantage of recurrency is that the rate of change of the internal state can be finely modulated by the recurrent weights, which builds in robustness to localised distortions of the input data.
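
To make the role of the self-connected hidden layer concrete, the following minimal NumPy sketch (illustrative only; the dimensions, tanh nonlinearity and function names are assumptions, not the paper's implementation) shows how the hidden state carries a memory of past inputs forward through time.

    import numpy as np

    def rnn_forward(x_seq, W_in, W_rec, b):
        """Run a single-layer vanilla RNN over an input sequence.

        x_seq : (T, n_in) array, one feature vector per time step
        W_in  : (n_hidden, n_in) input weights
        W_rec : (n_hidden, n_hidden) recurrent (self-connection) weights
        b     : (n_hidden,) bias
        """
        T = x_seq.shape[0]
        n_hidden = b.shape[0]
        h = np.zeros(n_hidden)              # internal state, initially empty
        states = np.zeros((T, n_hidden))
        for t in range(T):
            # the new state depends on the current input AND the previous state,
            # so information about earlier inputs persists in h
            h = np.tanh(W_in @ x_seq[t] + W_rec @ h + b)
            states[t] = h
        return states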

B. Long Short-Term Memory (LSTM)

  • Unfortunately, the range of contextual information that standard RNNs can access is in practice quite limited.
  • The problem is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network’s recurrent connections.
  • In an unrolled standard RNN, the influence of the first input on the hidden units decays exponentially over time.
  • In an LSTM memory block, the cell has a recurrent connection with fixed weight 1.0.
  • The three gates collect input from the rest of the network and control the cell via multiplicative units.
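
A minimal sketch of one LSTM memory block with a single cell, written in NumPy to mirror the description above. It follows the standard LSTM step equations; the peephole weights shown in the paper's figure are omitted for brevity, and all names are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        """One time step of an LSTM block. W, U, b stack the parameters for the
        input gate (i), forget gate (f), output gate (o) and cell input (g)."""
        z = W @ x + U @ h_prev + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # multiplicative gate units
        g = np.tanh(g)                                 # squashed cell input
        # the cell's recurrent self-connection has fixed weight 1.0;
        # the forget gate scales it and the input gate scales the new content
        c = f * c_prev + i * g
        # the output gate controls when the cell's content is visible
        h = o * np.tanh(c)
        return h, c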

C. Bidirectional Recurrent Neural Networks

  • In handwriting recognition, for example, the identification of a given letter is helped by knowing the letters both to the right and left of it.
  • Bidirectional recurrent neural networks [41], [42] are able to access context in both directions along the input sequence.
  • Combining BRNNs and LSTM gives bidirectional LSTM .
  • BLSTM has previously outperformed other network architectures, including standard LSTM, BRNNs and HMM-RNN hybrids, on phoneme recognition [33], [45].
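
The bidirectional combination itself is simple to express: run one LSTM layer forwards over the sequence, another backwards, and concatenate their hidden states at every time step. The sketch below is illustrative and reuses the hypothetical lstm_step helper from the previous block.

    import numpy as np

    def blstm_layer(x_seq, fwd_params, bwd_params, n_hidden):
        """Bidirectional LSTM layer: forward and backward passes concatenated."""
        T = x_seq.shape[0]
        h_f, c_f = np.zeros(n_hidden), np.zeros(n_hidden)
        h_b, c_b = np.zeros(n_hidden), np.zeros(n_hidden)
        out_f = np.zeros((T, n_hidden))
        out_b = np.zeros((T, n_hidden))
        for t in range(T):                    # forward direction: past context
            h_f, c_f = lstm_step(x_seq[t], h_f, c_f, *fwd_params)
            out_f[t] = h_f
        for t in reversed(range(T)):          # backward direction: future context
            h_b, c_b = lstm_step(x_seq[t], h_b, c_b, *bwd_params)
            out_b[t] = h_b
        # every time step now sees context from both input directions
        return np.concatenate([out_f, out_b], axis=1)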

D. Connectionist Temporal Classification (CTC)

  • Traditional RNN objective functions require a presegmented input sequence with a separate target for every segment.
  • Moreover, because the outputs of such an RNN are a series of independent, local classifications, some form of post processing is required to transform them into the desired label sequence.
  • A CTC output layer contains as many units as there are labels in the task, plus an additional ‘blank’ or ‘no label’ unit.
  • Note that each output is conditioned on the entire input sequence.
  • One such architecture is bidirectional LSTM, as described in the previous section.

E. CTC Forward Backward Algorithm

  • To allow for blanks in the output paths, the authors consider modified label sequences l′, with blanks added to the beginning and the end of l, and inserted between every pair of consecutive labels.
  • In calculating the probabilities of prefixes of l′ the authors allow all transitions between blank and non-blank labels, and also those between any pair of distinct non-blank labels.
  • The backward variables $\beta_t(s)$ are defined as the summed probability of all paths whose suffixes starting at t map onto the suffix of l starting at label s/2: $\beta_t(s) = \sum_{\pi:\ \mathcal{B}(\pi_{t:T}) = \mathbf{l}_{s/2:|\mathbf{l}|}} \prod_{t'=t+1}^{T} y^{t'}_{\pi_{t'}}$.
  • Finally, the label sequence probability is given by the sum of the products of the forward and backward variables at any time t: $p(\mathbf{l}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{l}'|} \alpha_t(s)\,\beta_t(s)$. (6)
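
As a concrete illustration of these recursions, the sketch below computes the forward and backward variables over the blank-extended sequence l' and recovers p(l|x). It is a minimal reading of the algorithm, not the authors' implementation: it works directly in probability space, whereas a practical version would use log-space arithmetic to avoid underflow.

    import numpy as np

    def extend_with_blanks(labels, blank=0):
        """Build l': blanks at the start, end and between every pair of labels."""
        ext = [blank]
        for l in labels:
            ext += [l, blank]
        return ext

    def ctc_alpha_beta(y, labels, blank=0):
        """Forward/backward variables for CTC.

        y      : (T, K) softmax outputs, one distribution per time step
        labels : target label sequence z (no blanks)
        Returns alpha, beta of shape (T, |l'|) and p(z|x).
        """
        T = y.shape[0]
        lp = extend_with_blanks(labels, blank)          # l'
        S = len(lp)

        alpha = np.zeros((T, S))
        alpha[0, 0] = y[0, lp[0]]
        if S > 1:
            alpha[0, 1] = y[0, lp[1]]
        for t in range(1, T):
            for s in range(S):
                a = alpha[t - 1, s]
                if s - 1 >= 0:
                    a += alpha[t - 1, s - 1]
                # skip transition allowed between distinct non-blank labels
                if s - 2 >= 0 and lp[s] != blank and lp[s] != lp[s - 2]:
                    a += alpha[t - 1, s - 2]
                alpha[t, s] = a * y[t, lp[s]]

        # beta excludes the emission at time t, so that
        # p(z|x) = sum_s alpha[t, s] * beta[t, s] holds for any t
        beta = np.zeros((T, S))
        beta[T - 1, S - 1] = 1.0
        if S > 1:
            beta[T - 1, S - 2] = 1.0
        for t in range(T - 2, -1, -1):
            for s in range(S - 1, -1, -1):
                b = beta[t + 1, s] * y[t + 1, lp[s]]
                if s + 1 < S:
                    b += beta[t + 1, s + 1] * y[t + 1, lp[s + 1]]
                if s + 2 < S and lp[s + 2] != blank and lp[s + 2] != lp[s]:
                    b += beta[t + 1, s + 2] * y[t + 1, lp[s + 2]]
                beta[t, s] = b

        p = np.sum(alpha[0] * beta[0])     # the same value is obtained for any t
        return alpha, beta, p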

F. CTC Objective Function

  • The CTC objective function is defined as the negative log probability of the network correctly labelling the entire training set.
  • Then the objective function O can be expressed as $O = -\sum_{(\mathbf{x},\mathbf{z}) \in S} \ln p(\mathbf{z}|\mathbf{x})$. (7) The network can be trained with gradient descent by first differentiating O with respect to the outputs, then using backpropagation through time [47] to find the derivatives with respect to the network weights.
  • Substituting this into (7) gives $\frac{\partial O}{\partial y^t_k} = -\frac{1}{p(\mathbf{z}|\mathbf{x})\, y^t_k} \sum_{s \in \mathrm{lab}(\mathbf{z},k)} \alpha_t(s)\,\beta_t(s)$. (9) To backpropagate the gradient through the output layer, the authors need the objective function derivatives with respect to the outputs $a^t_k$ before the activation function is applied.
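
Continuing the sketch above (again illustrative and in probability space), the error signal backpropagated through the CTC output layer follows directly from the forward and backward variables; after the softmax, the derivative with respect to the activation $a^t_k$ takes the familiar form of the output minus the normalised label occupancy.

    import numpy as np

    def ctc_gradients(y, labels, blank=0):
        """Gradient of O = -ln p(z|x) with respect to the pre-softmax
        activations a[t, k], built on the alpha/beta sketch above."""
        alpha, beta, p = ctc_alpha_beta(y, labels, blank)
        T, K = y.shape
        lp = extend_with_blanks(labels, blank)

        # occupancy[t, k]: sum of alpha*beta over positions of l' carrying label k
        occupancy = np.zeros((T, K))
        for s, k in enumerate(lp):
            occupancy[:, k] += alpha[:, s] * beta[:, s]

        # equation (9): dO/dy[t, k] = -occupancy / (p * y[t, k])
        # after the softmax:  dO/da[t, k] = y[t, k] - occupancy / p
        return y - occupancy / p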

G. CTC Decoding

  • Once the network is trained, the authors would ideally transcribe some unknown input sequence x by choosing the labelling $\mathbf{l}^*$ with the highest conditional probability: $\mathbf{l}^* = \arg\max_{\mathbf{l}} p(\mathbf{l}|\mathbf{x})$.
  • For some tasks the authors want to constrain the output labellings according to a grammar.
  • Note that the assumption that the input sequences and the grammar are independent is in general false, since both depend on the underlying generator of the data, for example the language being spoken.
  • If the authors further assume that, prior to any knowledge about the input or the grammar, all label sequences are equally probable, (14) reduces to $\mathbf{l}^* = \arg\max_{\mathbf{l}} p(\mathbf{l}|\mathbf{x})\, p(\mathbf{l}|G)$. (17)
  • Note that, since the number of possible label sequences is finite (because both L and |l| are finite), assigning equal prior probabilities does not lead to an improper prior.
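
When no dictionary or grammar is imposed, the arg max can be approximated cheaply by best path decoding: take the most active output at every time step, then collapse repeated labels and remove blanks. This is a common approximation rather than the constrained search the authors actually use (their token passing algorithm, described next, handles the dictionary and bigram probabilities); the sketch is illustrative.

    import numpy as np

    def best_path_decode(y, blank=0):
        """Approximate l* = argmax_l p(l|x) by the single most probable path."""
        path = np.argmax(y, axis=1)          # most active output unit per time step
        decoded, prev = [], None
        for k in path:
            if k != blank and k != prev:     # collapse repeats, then drop blanks
                decoded.append(int(k))
            prev = k
        return decoded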

H. CTC Token Passing Algorithm

  • The authors now describe an algorithm, based on the token passing algorithm for HMMs [48], that allows us to find an approximate solution to (17) for a simple grammar.
  • The transition probabilities are used when a token is passed from the last character in one word to the first character in another.
  • The token tok(w, s, t) is the highest scoring token reaching segment s of word w at time t.
  • Because the output tokens tok(w, −1, T) are sorted in order of score, the search can be terminated when a token is reached whose score is less than the current best score with the transition included.
  • If no bigrams are used, lines 15-17 of the algorithm can be replaced by a simple search for the highest scoring output token, and the complexity reduces to O(TW).

V. EXPERIMENTS AND RESULTS

  • The aim of their experiments was to evaluate the complete RNN handwriting recognition system, illustrated in Figure 13, on both online and offline handwriting.
  • The online and offline databases used were the IAM-OnDB and the IAM-DB, respectively.
  • Note that these do not correspond to the same handwriting samples: the IAM-OnDB was acquired from a whiteboard (see Section II), while the IAM-DB consists of scanned images of handwritten forms (see Section III).
  • To make the comparisons fair, the same online and offline preprocessing was used for both the HMM and RNN systems.
  • As well as the main comparisons, extra experiments were carried out on the online database to determine the effect of varying the dictionary size, and of disabling the forward and backward hidden layers in the RNN.

A. Data Sets

  • For the online experiments the authors used the IAM-OnDB, a database acquired from a ‘smart’ whiteboard [49].
  • After preprocessing, the online input data consisted of 25 inputs per time step, and the offline data of 9.
  • Both the online and offline transcriptions contain 81 separate characters, including all lower case and capital letters as well as various other special characters, e.g., punctuation marks, digits, a character for garbage symbols, and a character for the space.
  • Note that for the RNN systems, only 80 of these were used, since the garbage symbol is not required for CTC.

B. Language Model and Dictionaries

  • Dictionaries were used for all the experiments where word accuracy was recorded.
  • All dictionaries were derived from three different text corpora, the LOB (excluding the data used as prompts), the Brown corpus [52], and the Wellington corpus [53].
  • The main dictionary contained the 20,000 most frequently occurring words in these corpora; this size was chosen because it had previously been shown to give the best results for HMMs [51].
  • Note that this dictionary was ‘open’, in the sense that it did not contain all the words in either the online or offline test set.
  • To analyse the dependency of the recognition scores on the lexicon, the authors carried out extra experiments on the online data using open dictionaries with between 5,000 and 30,000 words (also chosen from the most common words in LOB).

C. HMM Parameters

  • The HMM system used for the online and offline experiments was similar to that described in [35].
  • For the online data every character model contained eight states, while for the offline data the number of states was chosen individually for each character [51].
  • Both databases are available for public download, along with the corresponding task definitions, at http://www.iam.unibe.ch/fki/databases/iam-on-line-handwriting-database.
  • The observation probabilities were modelled by mixtures of diagonal Gaussians.
  • 32 Gaussians were used for online data and 12 for the offline data.
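
For reference, the state observation likelihood of such a diagonal-covariance Gaussian mixture can be evaluated as in the sketch below (illustrative parameter shapes; the 32-component, 25-feature example mirrors the online setup, but this is not the baseline system's actual code).

    import numpy as np

    def diag_gmm_loglik(x, weights, means, variances):
        """Log-likelihood of one observation under a diagonal-covariance GMM.

        x         : (D,) feature vector
        weights   : (M,) mixture weights summing to 1
        means     : (M, D) component means
        variances : (M, D) per-dimension variances
        """
        # log N(x; mu_m, diag(var_m)) for every component m
        log_norm = -0.5 * (np.log(2.0 * np.pi * variances)
                           + (x - means) ** 2 / variances).sum(axis=1)
        # weighted log-sum-exp over the mixture components
        return np.logaddexp.reduce(np.log(weights) + log_norm)

    # e.g. one online-data state: 32 Gaussians over the 25 online features
    ll = diag_gmm_loglik(np.zeros(25), np.full(32, 1 / 32),
                         np.zeros((32, 25)), np.ones((32, 25)))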

D. RNN Parameters

  • The RNN had a bidirectional Long Short-Term Memory architecture with a connectionist temporal classification (CTC) output layer (see Section IV for details).
  • The forward and backward hidden layers each contained 100 LSTM memory blocks.
  • The size of the input layer was determined by the data: for the online data there were 25 inputs, for the offline data there were 9.
  • Otherwise the network was identical for the two tasks.
  • The error rate was recorded every 5 epochs on the validation set and training was stopped when performance had ceased to improve on the validation set for 50 epochs.
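
For orientation, a comparable network can be written in a few lines of modern PyTorch. This is a reconstruction under the hyperparameters quoted above (100 LSTM blocks per direction, 25 or 9 inputs, 80 character labels plus one CTC blank), not the authors' original code; in particular, nn.LSTM omits the peephole connections used in the paper, and the training loop, validation schedule and early stopping are not shown.

    import torch
    import torch.nn as nn

    class BLSTMCTC(nn.Module):
        """Bidirectional LSTM with a CTC output layer, sized as in the paper."""
        def __init__(self, n_inputs=25, n_hidden=100, n_labels=80):
            super().__init__()
            self.blstm = nn.LSTM(input_size=n_inputs, hidden_size=n_hidden,
                                 bidirectional=True, batch_first=True)
            self.output = nn.Linear(2 * n_hidden, n_labels + 1)  # +1 for the blank

        def forward(self, x):                  # x: (batch, T, n_inputs)
            h, _ = self.blstm(x)
            return self.output(h).log_softmax(dim=-1)

    # training against unsegmented targets with the built-in CTC loss
    model = BLSTMCTC(n_inputs=25)              # use n_inputs=9 for the offline features
    ctc_loss = nn.CTCLoss(blank=80)            # the blank is the last output unit
    x = torch.randn(1, 300, 25)                # one (dummy) line of online features
    targets = torch.randint(0, 80, (1, 40))    # character indices of the transcript
    log_probs = model(x).permute(1, 0, 2)      # CTCLoss expects (T, batch, K)
    loss = ctc_loss(log_probs, targets,
                    input_lengths=torch.tensor([300]),
                    target_lengths=torch.tensor([40]))
    loss.backward()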

E. Main Results

  • As can be seen from Tables I and II, the RNN substantially outperformed the HMM on both databases.
  • To put these results in perspective, the Microsoft tablet PC handwriting recogniser [55] gave a word accuracy score of 71.32% on the online test set.
  • It suggests that their recogniser is competitive with the best commercial systems for unconstrained handwriting.
  • In a second experiment, the authors took a closed dictionary containing the 5,597 words in the online test set, and measured the change in performance when this was padded to 20,000 words.
  • The results for the first set of experiments are shown in Table III and plotted in Figure 14.

I. Use of Context by the RNN

  • The authors have previously asserted that the BLSTM architecture is able to access long-range, bidirectional context.
  • The larger the values of the sequential Jacobian for some t, t′, the more sensitive the network output at time t is to the input at time t′.
  • Figure 16 plots the sequential Jacobian for a single output during the transcription of a line from the online database.
  • As can be seen, the network output is sensitive to information from about the first 120 time steps of the sequence, which corresponds roughly to the length of the first word.
  • Moreover, this area of sensitivity extends in both directions from the point where the prediction is made.
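
The sequential Jacobian is straightforward to reproduce with automatic differentiation: pick one output unit at one time step and differentiate it with respect to every input frame. The sketch below is illustrative and reuses the hypothetical BLSTMCTC model from the earlier block; the per-frame gradient norm plays the role of the sensitivity curve discussed above.

    import torch

    def sequential_jacobian(model, x, t_out, k_out):
        """Sensitivity of output unit k_out at time t_out to every input frame.

        x : (1, T, n_inputs) input sequence. Returns a length-T vector whose
        entries are the norm of d output[t_out, k_out] / d x[t] over features.
        """
        x = x.clone().requires_grad_(True)
        log_probs = model(x)                     # (1, T, K)
        log_probs[0, t_out, k_out].backward()
        return x.grad[0].norm(dim=1)             # one sensitivity value per time step

    # example: how far does the context used at time step 200 extend?
    sensitivity = sequential_jacobian(model, torch.randn(1, 300, 25),
                                      t_out=200, k_out=5)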

VI. DISCUSSION

  • The authors' experiments reveal a substantial gap in performance between the HMM and RNN systems, with a relative error reduction of over 40% in some cases.
  • Firstly, standard HMMs are generative, while an RNN trained with a discriminative objective function (such as CTC) is discriminative.
  • A second difference is that RNNs provide more flexible models of the input features than the mixtures of diagonal Gaussians used in standard HMMs.
  • It should be noted that RNNs typically perform better using input features with simpler relationships.
  • Because an HMM's only memory of the past observation sequence is its internal state, an HMM with n states can carry only O(log n) bits of information about that sequence.

VII. CONCLUSIONS

  • The authors have introduced a novel approach for recognising unconstrained handwritten text, using a recurrent neural network.
  • The key features of the network are the bidirectional Long Short-Term Memory architecture, which provides access to long range, bidirectional contextual information, and the connectionist temporal classification output layer, which allows the network to be trained on unsegmented sequence data.
  • In experiments on online and offline handwriting data, the new approach outperformed a state-of-the-art HMM-based system, and also proved more robust to changes in dictionary size.


A Novel Connectionist System for Unconstrained
Handwriting Recognition
Alex Graves, Marcus Liwicki, Santiago Fernández,
Roman Bertolami, Horst Bunke, Jürgen Schmidhuber
Abstract—Recognising lines of unconstrained handwritten text
is a challenging task. The difficulty of segmenting cursive
or overlapping characters, combined with the need to exploit
surrounding context, has led to low recognition rates for even the
best current recognisers. Most recent progress in the field has
been made either through improved preprocessing, or through
advances in language modelling. Relatively little work has been
done on the basic recognition algorithms. Indeed, most systems
rely on the same hidden Markov models that have been used
for decades in speech and handwriting recognition, despite their
well-known shortcomings. This paper proposes an alternative
approach based on a novel type of recurrent neural network,
specifically designed for sequence labelling tasks where the
data is hard to segment and contains long range, bidirectional
interdependencies. In experiments on two large unconstrained
handwriting databases, our approach achieves word recognition
accuracies of 79.7% on online data and 74.1% on offline
data, significantly outperforming a state-of-the-art HMM-based
system. In addition, we demonstrate the network’s robustness to
lexicon size, measure the individual influence of its hidden layers,
and analyse its use of context. Lastly we provide an in depth
discussion of the differences between the network and HMMs,
suggesting reasons for the network’s superior performance.
Index Terms—Handwriting recognition, online handwriting,
offline handwriting, connectionist temporal classification, bidi-
rectional long short-term memory, recurrent neural networks,
hidden Markov model
I. INTRODUCTION
Handwriting recognition is traditionally divided into
online and offline recognition. In online recognition a
time series of coordinates, representing the movement of the
pen-tip, is captured, while in the offline case only an image of
the text is available. Because of the greater ease of extracting
relevant features, online recognition generally yields better re-
sults [1]. Another important distinction is between recognising
isolated characters or words, and recognising whole lines of
text. Unsurprisingly, the latter is substantially harder, and the
excellent results that have been obtained for digit and character
recognition [2], [3] have never been matched for complete
lines. Lastly, handwriting recognition can be split into cases
where the writing style is constrained in some way—for
Manuscript received May 9, 2008.
Alex Graves and Jürgen Schmidhuber are with TU Munich, Boltzmannstr. 3, D-85748 Garching, Munich, Germany.
Marcus Liwicki, Roman Bertolami and Horst Bunke are with the IAM, Universität Bern, Neubrückstrasse 10, CH-3012 Bern, Switzerland.
Marcus Liwicki is also with the German Research Center for AI (DFKI GmbH), Knowledge Management Department, Kaiserslautern, Germany.
Santiago Fernández and Jürgen Schmidhuber are with IDSIA, Galleria 2, CH-6928 Manno-Lugano, Switzerland.
example, only hand printed characters are allowed—and the
more challenging scenario where it is unconstrained. Despite
more than 30 years of handwriting recognition research [2],
[3], [4], [5], developing a reliable, general-purpose system for
unconstrained text line recognition remains an open problem.
A well known test bed for isolated handwritten character
recognition is the UNIPEN database [6]. Systems that have
been found to perform well on UNIPEN include: a writer-
independent approach based on hidden Markov models [7]; a
hybrid technique called cluster generative statistical dynamic
time warping [8], which combines dynamic time warping
with HMMs and embeds clustering and statistical sequence
modelling in a single feature space; and a support vector
machine with a novel Gaussian dynamic time warping ker-
nel [9]. Typical error rates on UNIPEN range from 3% for digit
recognition, to about 10% for lower case character recognition.
Similar techniques can be used to classify isolated words,
and this has given good results for small vocabularies (for
example a writer dependent word error rate of about 4.5% for
32 words [10]). However an obvious drawback of whole word
classification is that it does not scale to large vocabularies.
For large vocabulary recognition tasks, such as those con-
sidered in this paper, the only feasible approach is to recognise
individual characters and map them onto complete words using
a dictionary. Naively, this could be done by presegmenting
words into characters and classifying each segment. However,
segmentation is difficult for cursive or unconstrained text,
unless the words have already been recognised. This creates
a circular dependency between segmentation and recognition
that is sometimes referred to as Sayre’s paradox [11].
One solution to Sayre’s paradox is to simply ignore it, and
carry out segmentation before recognition. For example [3]
describes techniques for character segmentation, based on un-
supervised learning and data-driven methods. Other strategies
first segment the text into basic strokes, rather than characters.
The stroke boundaries may be defined in various ways, such
as the minima of the velocity, the minima of the y-coordinates,
or the points of maximum curvature. For example, one online
approach first segments the data at the minima of the y-
coordinates, then applies self-organising maps [12]. Another,
offline, approach uses the minima of the vertical histogram
for an initial estimation of the character boundaries and then
applies various heuristics to improve the segmentation [13].
A more promising approach to Sayre’s paradox is to
segment and recognise at the same time. Hidden Markov
models (HMMs) are able to do this, which is one reason for
their popularity in unconstrained handwriting recognition [14],

[15], [16], [17], [18], [19]. The idea of applying HMMs
to handwriting recognition was originally motivated by their
success in speech recognition [20], where a similar conflict
exists between recognition and segmentation. Over the years,
numerous refinements of the basic HMM approach have been
proposed, such as the writer independent system considered
in [7], which combines point oriented and stroke oriented input
features.
However, HMMs have several well-known drawbacks. One
of these is that they assume the probability of each observation
depends only on the current state, which makes contextual ef-
fects difficult to model. Another is that HMMs are generative,
while discriminative models generally give better performance
in labelling and classification tasks.
Recurrent neural networks (RNNs) do not suffer from these
limitations, and would therefore seem a promising alterna-
tive to HMMs. However the application of RNNs alone to
handwriting recognition have so far been limited to isolated
character recognition (e.g. [21]). The main reason for this is
that traditional neural network objective functions require a
separate training signal for every point in the input sequence,
which in turn requires presegmented data.
A more successful use of neural networks for handwriting
recognition has been to combine them with HMMs in the so-
called hybrid approach [22], [23]. A variety of network archi-
tectures have been tried for hybrid handwriting recognition,
including multilayer perceptrons [24], [25], time delay neural
networks [18], [26], [27], and RNNs [28], [29], [30]. However,
although hybrid models alleviate the difficulty of introducing
context to HMMs, they still suffer from many of the drawbacks
of HMMs, and they do not realise the full potential of RNNs
for sequence modelling.
This paper proposes an alternative approach, in which a
single RNN is trained directly for sequence labelling. The
network uses the connectionist temporal classification (CTC)
output layer [31], [32], first applied to speech recognition.
CTC uses the network to map directly from the complete input
sequence to the sequence of output labels, obviating the need
for presegmented data. We extend the original formulation of
CTC by combining it with a dictionary and language model to
obtain word recognition scores that can be compared directly
with other systems. Although CTC can be used with any type
of RNN, best results are given by networks able to incorpo-
rate as much context as possible. For this reason we chose
the bidirectional Long Short-Term Memory (BLSTM; [33])
architecture, which provides access to long range context along
both input directions.
In experiments on large online and offline handwriting
databases, our approach significantly outperforms a state-of-
the-art HMM-based system on unconstrained text line recog-
nition. Furthermore, the network retains its advantage over a
wide range of dictionary sizes.
The paper is organised as follows. Sections II and III
describe the preprocessing and feature extraction methods for
the online and offline data respectively. Section IV introduces
the novel RNN-based recogniser. Section V describes the
databases and presents the experimental analysis and results.
Section VI discusses the differences between the new system
Fig. 1. Illustration of the recording
Fig. 2. Examples of handwritten text acquired from a whiteboard
and HMMs, and suggests reasons for the network’s superior
performance. Our conclusions are presented in Section VII.
II. ONLINE DATA PREPARATION
The online data used in our experiments were recorded from
a whiteboard using the eBeam interface¹ [34]. As illustrated
in Figure 1, the interface consists of a normal pen in a
special casing, which sends infrared signals to a triangular
receiver mounted in one of the corners of the whiteboard. The
acquisition interface outputs a sequence of (x, y)-coordinates
representing the location of the tip of the pen together with
a time stamp for each location. The coordinates are only
recorded during the periods when the pen-tip is in continuous
contact with the whiteboard. We refer to these periods as
strokes. After some standard steps to correct for missing and
noisy points [35], the data was stored in xml-format, along
with the frame rate, which varied from 30 to 70 frames per
second.
A. Normalisation
The preprocessing begins with data normalisation. This is
an important step in handwriting recognition because writing
styles differ greatly with respect to the skew, slant, height and
width of the characters.
Since the subjects stand rather than sit, and their arms do
not rest on a table, handwriting rendered on a whiteboard
¹ eBeam System by Luidia, Inc. - www.e-Beam.com

Fig. 3. Processing the text line. Top: text line split into individual parts with
estimated skew in the middle of the text line; bottom: text line after skew
normalisation. Note that baseline and corpus line detection (described below)
give an improved final estimate of the skew.
Fig. 4. Slant correction; grey lines indicate the estimated slant angle
is different from that produced with a pen on a writing
tablet. In particular, it has been observed that the baseline
on a whiteboard cannot usually be approximated by a simple
straight line. Furthermore, the size and width of the characters
become smaller the more the pen moves to the right. Examples
of both effects can be seen in Figure 2. Consequently, online
handwriting gathered from a whiteboard requires some special
preprocessing steps.
Since the text lines on a whiteboard usually have no uniform
skew, they are split into smaller parts and the rest of the
preprocessing is done for each part separately. To accomplish
the splitting, all gaps within a line are determined first. The
text is split at a gap if it is wider than the median gap width,
and if the size of both parts resulting from the split is larger
than some predefined threshold. An example of the splitting
process is shown in Figure 3 with the resulting parts indicated
by lines below the text.
Next the parts are corrected with respect to their skew and
slant. A linear regression is performed through all the points,
and the orientation of the text line is corrected according to the
regression parameters (see Figure 3). For slant normalisation,
we compute the histogram over all angles enclosed by the
lines connecting two successive points of the trajectory and
the horizontal line [26]. The histogram ranges from −90°
to 90° with a step size of 2°. We weight the histogram values
with a Gaussian whose mean is at the vertical angle and whose
variance is chosen empirically. This is beneficial because some
words are not properly corrected if a single long straight line
is drawn in the horizontal direction, which results in a large
histogram value. We also smooth each histogram entry $\gamma_i$
using its nearest neighbours, $\bar{\gamma}_i = (\gamma_{i-1} + 2\gamma_i + \gamma_{i+1})/4$,
because in some cases the correct slant is at the border of
two angle intervals and a single peak at another interval may
be slightly higher. This single peak will become smaller after
smoothing. Figure 4 shows a text line before and after slant
correction.
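
The smoothing above amounts to a three-tap filter over the angle histogram; a minimal NumPy sketch (illustrative, with hypothetical 2-degree binning and zero-padded edge bins rather than the exact handling used here) is:

    import numpy as np

    def smooth_histogram(gamma):
        """Smooth each histogram entry with its neighbours:
        gamma_bar_i = (gamma_{i-1} + 2*gamma_i + gamma_{i+1}) / 4."""
        kernel = np.array([1.0, 2.0, 1.0]) / 4.0
        return np.convolve(gamma, kernel, mode='same')

    # hypothetical slant histogram over angles from -90 to 90 degrees in 2-degree steps
    angles = np.arange(-90, 90, 2)
    gamma = np.random.rand(angles.size)
    slant_angle = angles[np.argmax(smooth_histogram(gamma))]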
Delayed strokes, such as the crossing of a ‘t’ or the dot
of an ‘i’, are a well known problem in online handwriting
recognition, because the order in which they are written varies
between different writers. For this reason, delayed strokes
(identified as strokes above already written parts, followed by
a pen-movement to the right [26]) are removed. Note that some
i-dots may be missed by this procedure. However, this usually
occurs only if there was no pen-movement to the left, meaning
that the writing order is not disrupted. A special hat-feature
is used to indicate to the recogniser that a delayed stroke was
removed.
To correct for variations in writing speed, the input se-
quences are transformed so that the points are equally spaced.
The optimal value for this distance is found empirically.
The next step is the computation of the baseline and the
corpus line, which are then used to normalise the size of the
text. The baseline corresponds to the original line on which the
text was written, i.e. it passes through the bottom of the char-
acters. The corpus line goes through the top of the lower case
letters. To obtain these lines two linear regressions through the
minima and maxima are computed. After removing outliers the
regression is repeated twice, resulting in the estimated baseline
(minima) and corpus line (maxima). Figure 5 illustrates the
estimated baseline and the corpus line of part of the example
shown in Figure 3. The baseline is subtracted from all y-
coordinates and the heights of the three resulting areas are
normalised.
Fig. 5. Baseline and corpus line of an example part of a text line
The final preprocessing step is to normalise the width of the
characters. This is done by scaling the text horizontally with a
fraction of the number of strokes crossing the horizontal line
between the baseline and the corpus line. This preprocessing
step is needed because the x-coordinates of the points are
taken as a feature.
B. Feature Extraction
The input to the recogniser consists of 25 features for
each (x, y)-coordinate recorded by the acquisition system.
These features can be divided into two classes. The first class
consists of features extracted for each point by considering its
neighbours in the time series. The second class is based on the
spatial information given by the offline matrix representation.
For point (x, y), the features in the first class are as follows:
• A feature indicating whether the pen-tip is touching the board or not
• The hat-feature indicating whether a delayed stroke was removed at y
• The velocity computed before resampling
• The x-coordinate after high-pass filtering, i.e. after subtracting a moving average from the true horizontal position
• The y-coordinate after normalisation
• The cosine and sine of the angle between the line segment starting at the point and the x-axis (writing direction)
• The cosine and sine of the angle between the lines to the previous and the next point (curvature)
• The vicinity aspect, equal to the aspect of the trajectory (see Figure 6): $(\Delta y(t) - \Delta x(t))\,/\,(\Delta y(t) + \Delta x(t))$
• The cosine and sine of the angle α of the straight line from the first to the last vicinity point (see Figure 6)
• The length of the trajectory in the vicinity divided by $\max(\Delta x(t), \Delta y(t))$
• The average squared distance $d^2$ of each point in the vicinity to the straight line from the first to the last vicinity point

Fig. 6. Vicinity features of the point (x(t), y(t)). The three previous and three next points are considered in the example shown in this figure.
The features in the second class, illustrated in Figure 7,
are computed using a two-dimensional matrix $B = (b_{i,j})$
representing the offline version of the data. For each position
$b_{i,j}$ the number of points on the trajectory of the strokes is
stored, providing a low-resolution image of the handwritten
data. The following features are used:
• The number of points above the corpus line (ascenders) and below the baseline (descenders) in the vicinity of the current point
• The number of black points in each region of the context map (the two-dimensional vicinity of the current point is transformed to a 3 × 3 map with width and height set to the height of the corpus)
Fig. 7. Offline matrix features. The large white dot marks the considered
point. The other online points are marked with smaller dots. The strokes have
been widened for ease of visualisation.
Fig. 8. Preprocessing of an image of handwritten text, showing the original
image (top), and the normalised image (bottom).
III. OFFLINE DATA PREPARATION
The offline data used in our experiments consists of
greyscale images scanned from handwritten forms, with a
scanning resolution of 300 dpi and a greyscale bit depth of
8. The following procedure was carried out to extract the text
lines from the images. First, the image was rotated to account
for the overall skew of the document, and the handwritten
part was extracted from the form. Then a histogram of the
horizontal black/white transitions was calculated, and the text
was split at the local minima to give a series of horizontal
lines. Any stroke crossing the boundaries between two lines
was assigned to the line containing its centre of gravity. With
this method almost all text lines were extracted correctly.²
Once the line images were extracted, the next stage was to
normalise the text with respect to writing skew and slant, and
character size.
A. Normalisation
Unlike the online data, the normalisations for the offline data
are applied to entire text lines at once. First of all the image is
rotated to account for the line skew. Then the mean slant of the
text is estimated, and a shearing transformation is applied to
the image to bring the handwriting to an upright position. Next
the baseline and the corpus line are normalised. Normalisation
of the baseline means that the body of the text line (the
part which is located between the upper and lower baselines),
the ascender part (located above the upper baseline), and the
descender part (below the lower baseline) are each scaled to
a predefined height. Finally the image is scaled horizontally
so that the mean character width is approximately equal to a
predefined size. Figure 8 illustrates the offline preprocessing.
B. Feature Extraction
To extract the feature vectors from the normalised images, a
sliding window approach is used. The width of the window is
one pixel, and nine geometrical features are computed at each
window position. Each text line image is therefore converted
to a sequence of 9-dimensional vectors. The nine features are
as follows:
• The mean grey value of the pixels
• The centre of gravity of the pixels
• The second order vertical moment of the centre of gravity
• The positions of the uppermost and lowermost black pixels
• The rate of change of these positions (with respect to the neighbouring windows)
• The number of black-white transitions between the uppermost and lowermost pixels
• The proportion of black pixels between the uppermost and lowermost pixels

² Only about 1 % of the text lines contain errors. These have been corrected manually in previous work [36].

For a more detailed description of the offline features, see [17].
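
To make the sliding-window extraction concrete, the sketch below computes a subset of these nine features for a single one-pixel-wide window (one column of the normalised line image). The ink convention, threshold and handling of empty columns are assumptions for illustration, not the exact procedure of [17].

    import numpy as np

    def column_features(col, ink_threshold=0.5):
        """A few of the geometric features for one image column.

        col : 1-D array of ink values in [0, 1] (1 = black), index 0 = top of line.
        """
        rows = np.arange(col.size)
        black = col > ink_threshold                  # binarised ink mask
        total_ink = max(col.sum(), 1e-8)

        mean_grey = col.mean()
        centre_of_gravity = (rows * col).sum() / total_ink
        second_moment = (rows ** 2 * col).sum() / total_ink

        if black.any():
            top = rows[black][0]                     # uppermost black pixel
            bottom = rows[black][-1]                 # lowermost black pixel
            span = black[top:bottom + 1]
            transitions = int(np.count_nonzero(span[1:] != span[:-1]))
            black_fraction = span.mean()             # proportion of black pixels
        else:
            top, bottom, transitions, black_fraction = -1, -1, 0, 0.0

        return np.array([mean_grey, centre_of_gravity, second_moment,
                         top, bottom, transitions, black_fraction])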
IV. NEURAL NETWORK RECOGNISER
A. Recurrent Neural Networks
Recurrent neural networks (RNNs) are a connectionist
model containing a self-connected hidden layer. One benefit
of the recurrent connection is that a ‘memory’ of previous
inputs remains in the network’s internal state, allowing it to
make use of past context. Context plays an important role in
handwriting recognition, as illustrated in Figure 9. Another
important advantage of recurrency is that the rate of change
of the internal state can be finely modulated by the recurrent
weights, which builds in robustness to localised distortions of
the input data.
Fig. 9. Importance of context in handwriting recognition. The word ‘defence’
is clearly legible, but the letter ‘n’ in isolation is ambiguous.
B. Long Short-Term Memory (LSTM)
Unfortunately, the range of contextual information that stan-
dard RNNs can access is in practice quite limited. The problem
is that the influence of a given input on the hidden layer,
and therefore on the network output, either decays or blows
up exponentially as it cycles around the network’s recurrent
connections. This shortcoming (referred to in the literature as
the vanishing gradient problem [37], [38]) makes it hard for an
RNN to bridge gaps of more than about 10 time steps between
relevant input and target events [37]. The vanishing gradient
problem is illustrated schematically in Figure 10.
Long Short-Term Memory (LSTM) [39], [40] is an RNN
architecture specifically designed to address the vanishing gra-
dient problem. An LSTM hidden layer consists of recurrently
connected subnets, called memory blocks. Each block contains
a set of internal units, or cells, whose activation is controlled
by three multiplicative gates: the input gate, forget gate and
output gate. Figure 11 provides a detailed illustration of an
LSTM memory block with a single cell.
The effect of the gates is to allow the cells to store and
access information over long periods of time. For example, as
long as the input gate remains closed (i.e. has an activation
close to 0), the activation of the cell will not be overwritten
by the new inputs arriving in the network. Similarly, the cell
activation is only available to the rest of the network when
Fig. 10. Illustration of the vanishing gradient problem. The diagram
represents a recurrent network unrolled in time. The units are shaded according
to how sensitive they are to the input at time 1 (where black is high and white
is low). As can be seen, the influence of the first input decays exponentially
over time.
Fig. 11. LSTM memory block with one cell. The cell has a recurrent
connection with fixed weight 1.0. The three gates collect input from the rest
of the network, and control the cell via multiplicative units (small circles). The
input and output gates scale the input and output of the cell, while the forget
gate scales the recurrent connection of the cell. The cell squashing functions
(g and h) are applied at the indicated places. The internal connections from
the cell to the gates are known as peephole weights.
the output gate is open, and the cell’s recurrent connection is
switched on and off by the forget gate.
Figure 12 illustrates how an LSTM block maintains gradient
information over time. Note that the dependency is ‘carried’
by the memory cell as long as the forget gate is open and the
input gate is closed, and that the output dependency can be
switched on and off by the output gate, without affecting the
hidden cell.
C. Bidirectional Recurrent Neural Networks
For many tasks it is useful to have access to future as well
as past context. In handwriting recognition, for example, the
identification of a given letter is helped by knowing the letters
both to the right and left of it.

Citations
Journal ArticleDOI
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviews deep supervised learning, unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

14,635 citations


Cites background from "A Novel Connectionist System for Un..."

  • ...…analysis (Hochreiter & Obermayer, 2005), handwriting recognition (Bluche et al., 2014; Graves, Fernandez, Liwicki, Bunke, & Schmidhuber, 2008; Graves et al., 2009; Graves & Schmidhuber, 2009), voice activity detection (Eyben, Weninger, Squartini, & Schuller, 2013), optical character…...


  • ...BRNNs and DAG-RNNs unfold their full potential when combined with the LSTM concept (Graves et al., 2009; Graves & Schmidhuber, 2005, 2009)....


  • ...Compare Graves and Jaitly (2014), Graves and Schmidhuber (2005), Graves et al. (2009), Graves et al. (2013) and Schmidhuber, Ciresan, Meier, Masci, and Graves (2011) (Section 5.22)....


Journal ArticleDOI
TL;DR: This paper presents the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling, and observes that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.
Abstract: Several variants of the long short-term memory (LSTM) architecture for recurrent neural networks have been proposed since its inception in 1995. In recent years, these networks have become the state-of-the-art models for a variety of machine learning problems. This has led to a renewed interest in understanding the role and utility of various computational components of typical LSTM variants. In this paper, we present the first large-scale analysis of eight LSTM variants on three representative tasks: speech recognition, handwriting recognition, and polyphonic music modeling. The hyperparameters of all LSTM variants for each task were optimized separately using random search, and their importance was assessed using the powerful functional ANalysis Of VAriance framework. In total, we summarize the results of 5400 experimental runs ( $\approx 15$ years of CPU time), which makes our study the largest of its kind on LSTM networks. Our results show that none of the variants can improve upon the standard LSTM architecture significantly, and demonstrate the forget gate and the output activation function to be its most critical components. We further observe that the studied hyperparameters are virtually independent and derive guidelines for their efficient adjustment.

4,746 citations


Cites background from "A Novel Connectionist System for Un..."

  • ...includes handwriting recognition [3]–[5] and generation [6], language modeling [7] and translation [8], acoustic modeling of speech [9], speech synthesis [10], protein secondary...


Posted Content
TL;DR: This paper proposes a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem and validates empirically the hypothesis and proposed solutions.
Abstract: There are two widely known issues with properly training Recurrent Neural Networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.

3,549 citations


Cites background from "A Novel Connectionist System for Un..."

  • ...In Hochreiter and Schmidhuber (1997); Graves et al. (2009) a solution is proposed for the vanishing gradients problem, where the structure of the model is changed....


Proceedings Article
16 Jun 2013
TL;DR: In this article, a gradient norm clipping strategy is proposed to deal with the vanishing and exploding gradient problems in recurrent neural networks. But the proposed solution is limited to the case of RNNs.
Abstract: There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.

2,586 citations

Proceedings ArticleDOI
01 Jan 2014
TL;DR: The first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines is introduced and it is shown that a two-layer deep LSTm RNN where each L STM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance.
Abstract: Long Short-Term Memory (LSTM) is a specific recurrent neural network (RNN) architecture that was designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we explore LSTM RNN architectures for large scale acoustic modeling in speech recognition. We recently showed that LSTM RNNs are more effective than DNNs and conventional RNNs for acoustic modeling, considering moderately-sized models trained on a single machine. Here, we introduce the first distributed training of LSTM RNNs using asynchronous stochastic gradient descent optimization on a large cluster of machines. We show that a two-layer deep LSTM RNN where each LSTM layer has a linear recurrent projection layer can exceed state-of-the-art speech recognition performance. This architecture makes more effective use of model parameters than the others considered, converges quickly, and outperforms a deep feed forward neural network having an order of magnitude more parameters. Index Terms: Long Short-Term Memory, LSTM, recurrent neural network, RNN, speech recognition, acoustic modeling.

2,492 citations


Cites methods from "A Novel Connectionist System for Un..."

  • ...For online and offline handwriting recognition, BLSTM networks used together with a Connectionist Temporal Classification (CTC) layer and trained from unsegmented sequence data, have been shown to outperform a stateof-the-art Hidden-Markov-Model (HMM) based system [10]....

    [...]

References
Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O. 1. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations


"A Novel Connectionist System for Un..." refers background in this paper

  • ...An LSTM hidden layer consists of recurrently connected subnets, called memory blocks....


  • ...For this reason, we chose the bidirectional Long Short-Term Memory (BLSTM) [33] architecture, which provides access to long-range context along both input directions....


  • ...The RNN had a BLSTM architecture with a CTC output layer (see Section 4 for details)....


  • ...11 provides a detailed illustration of an LSTM memory block with a single cell....


  • ...One such architecture is BLSTM, as described in the previous section....


Journal ArticleDOI
Lawrence R. Rabiner
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described. >

21,819 citations

Journal ArticleDOI
TL;DR: This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.
Abstract: Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered. >

7,309 citations


"A Novel Connectionist System for Un..." refers background in this paper

  • ...literature as the vanishing gradient problem [37], [38]) makes it...


Journal ArticleDOI
TL;DR: It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.
Abstract: In the first part of this paper, a regular recurrent neural network (RNN) is extended to a bidirectional recurrent neural network (BRNN). The BRNN can be trained without the limitation of using input information just up to a preset future frame. This is accomplished by training it simultaneously in positive and negative time direction. Structure and training procedure of the proposed network are explained. In regression and classification experiments on artificial data, the proposed structure gives better results than other approaches. For real data, classification experiments for phonemes from the TIMIT database show the same tendency. In the second part of this paper, it is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution. For this part, experiments on real data are reported.

7,290 citations


"A Novel Connectionist System for Un..." refers background or methods in this paper

  • ...BRNNs have outperformed standard RNNs in several sequence learning tasks, notably protein structure prediction [43] and speech processing [41], [44]....


  • ...Bidirectional RNNs (BRNNs) [41], [42] are able to access context in both directions along the input sequence....


Proceedings ArticleDOI
25 Jun 2006
TL;DR: This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.
Abstract: Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.

5,188 citations



Frequently Asked Questions (1)
Q1. What have the authors contributed in "A novel connectionist system for unconstrained handwriting recognition"?

This paper proposes an alternative approach based on a novel type of recurrent neural network, specifically designed for sequence labelling tasks where the data is hard to segment and contains long range, bidirectional interdependencies. In experiments on two large unconstrained handwriting databases, their approach achieves word recognition accuracies of 79.7 percent on online data and 74.1 percent on offline data, significantly outperforming a state-of-the-art HMM-based system. In addition, the authors demonstrate the network's robustness to lexicon size, measure the individual influence of its hidden layers, and analyse its use of context. Lastly the authors provide an in-depth discussion of the differences between the network and HMMs, suggesting reasons for the network's superior performance.