A Novel Connectionist System for Unconstrained Handwriting Recognition
Summary (5 min read)
Introduction
- However, an obvious drawback of whole-word classification is that it does not scale to large vocabularies.
- Based on unsupervised learning and data-driven methods.
- For this reason the authors chose the bidirectional Long Short-Term Memory (BLSTM; [33]) architecture, which provides access to long-range context along both input directions.
II. ONLINE DATA PREPARATION
- The online data used in their experiments were recorded from a whiteboard using the eBeam interface [34].
- The acquisition interface outputs a sequence of (x, y)-coordinates representing the location of the tip of the pen together with a time stamp for each location.
- The coordinates are only recorded during the periods when the pen-tip is in continuous contact with the whiteboard.
- After some standard steps to correct for missing and noisy points [35], the data was stored in xml-format, along with the frame rate, which varied from 30 to 70 frames per second.
III. OFFLINE DATA PREPARATION
- The offline data used in their experiments consists of greyscale images scanned from handwritten forms, with a scanning resolution of 300 dpi and a greyscale bit depth of 8.
- A histogram of the horizontal black/white transitions was then calculated, and the text was split at the local minima to give a series of horizontal lines.
- Any stroke crossing the boundary between two lines was assigned to the line containing its centre of gravity.
- With this method almost all text lines were extracted correctly.
- Once the line images were extracted, the next stage was to normalise the text with respect to writing skew, slant, and character size.
B. Feature Extraction
- To extract the feature vectors from the normalised images, a sliding window approach is used (a small sketch of the computation follows this list).
- The features include the positions of the uppermost and lowermost black pixels, the rate of change of these positions (with respect to the neighbouring windows), and the number of black-white transitions between the uppermost and lowermost pixels.
- For a more detailed description of the offline features, see [17].
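The full offline feature set has nine components per window [17]; the following is a minimal numpy sketch, under the assumption of a binarised line image and a one-pixel-wide window, of how the geometric features summarised above could be computed. The threshold and window width are illustrative, not the paper's values.

```python
import numpy as np

def window_features(line_img, threshold=128):
    """Sketch of per-window geometric features for an offline text line.

    Assumes `line_img` is a 2-D greyscale array (rows = y, cols = x) and a
    one-column-wide sliding window; the real system uses the nine features
    described in [17].
    """
    binary = line_img < threshold             # True where the pixel is "black"
    height, width = binary.shape
    features = []
    prev_top, prev_bottom = 0.0, 0.0
    for x in range(width):                    # slide the window left to right
        col = binary[:, x]
        ys = np.flatnonzero(col)
        if ys.size:
            top, bottom = float(ys[0]), float(ys[-1])
            # black/white transitions between the uppermost and lowermost pixels
            span = col[ys[0]:ys[-1] + 1]
            transitions = int(np.count_nonzero(np.diff(span.astype(int)) != 0))
        else:
            top, bottom, transitions = 0.0, 0.0, 0
        features.append([
            top,                              # uppermost black pixel
            bottom,                           # lowermost black pixel
            top - prev_top,                   # rate of change w.r.t. previous window
            bottom - prev_bottom,
            transitions,
        ])
        prev_top, prev_bottom = top, bottom
    return np.asarray(features, dtype=np.float32)   # shape: (width, n_features)
```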
A. Recurrent Neural Networks
- Recurrent neural networks (RNNs) are connectionist models containing a self-connected hidden layer.
- One benefit of the recurrent connection is that a ‘memory’ of previous inputs remains in the network’s internal state, allowing it to make use of past context.
- Another important advantage of recurrency is that the rate of change of the internal state can be finely modulated by the recurrent weights, which builds in robustness to localised distortions of the input data.
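As a minimal illustration of the recurrence (not the architecture used in the paper), the sketch below shows how the self-connected hidden layer folds each new input into a state that summarises the past; the weight shapes and the tanh activation are illustrative.

```python
import numpy as np

def rnn_forward(xs, W_in, W_rec, b):
    """Forward pass of a simple recurrent hidden layer.

    xs:    (T, n_in) input sequence
    W_in:  (n_hidden, n_in) input weights
    W_rec: (n_hidden, n_hidden) recurrent (self-connected) weights
    b:     (n_hidden,) bias
    """
    T = xs.shape[0]
    h = np.zeros(W_rec.shape[0])
    hs = np.zeros((T, W_rec.shape[0]))
    for t in range(T):
        # the recurrent term carries a 'memory' of all previous inputs
        h = np.tanh(W_in @ xs[t] + W_rec @ h + b)
        hs[t] = h
    return hs
```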
B. Long Short-Term Memory (LSTM)
- Unfortunately, the range of contextual information that standard RNNs can access is in practice quite limited.
- The problem is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network’s recurrent connections.
- As can be seen, the influence of the first input decays exponentially over time.
- The cell has a recurrent connection with fixed weight 1.0.
- The three gates collect input from the rest of the network, and control the cell via multiplicative units (small circles).
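A sketch of one memory-block update, assuming a single cell per block and the usual sigmoid gates. The gate layout and parameter packing are illustrative, but the cell state is carried through the fixed-weight (1.0) self-connection and is scaled only by the multiplicative gates, as described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, s_prev, W, U, b):
    """One LSTM memory-block step. W: (4n, n_in), U: (4n, n), b: (4n,).

    The rows of W/U/b are split into input gate, forget gate, cell input
    and output gate (an illustrative layout, not the paper's exact
    parameterisation).
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * n:1 * n])         # input gate
    f = sigmoid(z[1 * n:2 * n])         # forget gate
    g = np.tanh(z[2 * n:3 * n])         # cell input
    o = sigmoid(z[3 * n:4 * n])         # output gate
    # cell state: self-connection of weight 1.0, modulated only by the gates
    s = f * s_prev + i * g
    h = o * np.tanh(s)                  # block output
    return h, s
```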
C. Bidirectional Recurrent Neural Networks
- In handwriting recognition, for example, the identification of a given letter is helped by knowing the letters both to the right and left of it.
- Bidirectional recurrent neural networks (BRNNs) [41], [42] are able to access context in both directions along the input sequence.
- Combining BRNNs and LSTM gives bidirectional LSTM (BLSTM).
- BLSTM has previously outperformed other network architectures, including standard LSTM, BRNNs and HMM-RNN hybrids, on phoneme recognition [33], [45].
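The bidirectional wrapper is independent of the cell type: one hidden layer reads the sequence forwards, a second reads it backwards, and their outputs are concatenated at every time step before being fed to the output layer. A minimal sketch, in which the step functions and initial states are placeholders (for example the LSTM step sketched earlier):

```python
import numpy as np

def run_direction(xs, step, h0, s0):
    """Run a recurrent layer over xs (T, n_in) with a per-step function."""
    h, s = h0, s0
    outputs = []
    for x in xs:
        h, s = step(x, h, s)
        outputs.append(h)
    return np.stack(outputs)

def bidirectional(xs, step_fwd, step_bwd, h0, s0):
    """Concatenate a forward pass and a (time-reversed) backward pass."""
    fwd = run_direction(xs, step_fwd, h0, s0)
    bwd = run_direction(xs[::-1], step_bwd, h0, s0)[::-1]   # re-align in time
    return np.concatenate([fwd, bwd], axis=1)               # (T, 2 * n_hidden)
```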
D. Connectionist Temporal Classification (CTC)
- Traditional RNN objective functions require a presegmented input sequence with a separate target for every segment.
- Moreover, because the outputs of such an RNN are a series of independent, local classifications, some form of post processing is required to transform them into the desired label sequence.
- A CTC output layer contains as many units as there are labels in the task, plus an additional ‘blank’ or ‘no label’ unit.
- Note that each output is conditioned on the entire input sequence.
- One such architecture is bidirectional LSTM, as described in the previous section.
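Concretely, the network emits a softmax distribution over the L labels plus the blank at every time step, and a frame-wise path is turned into a labelling by first merging repeated labels and then deleting blanks. A minimal sketch of that collapsing map (the blank index is illustrative):

```python
def collapse(path, blank=0):
    """CTC collapsing map B: merge repeated labels, then delete blanks.

    e.g. with blank=0,  [1, 1, 0, 1, 2, 2, 0]  ->  [1, 1, 2]
    """
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out
```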
E. CTC Forward-Backward Algorithm
- To allow for blanks in the output paths, the authors consider modified label sequences l′, with blanks added to the beginning and the end of l, and inserted between every pair of consecutive labels.
- In calculating the probabilities of prefixes of l′ the authors allow all transitions between blank and non-blank labels, and also those between any pair of distinct non-blank labels.
- The backward variables $\beta_t(s)$ are defined as the summed probability of all paths whose suffixes starting at t map onto the suffix of l starting at label s/2: $\beta_t(s) = \sum_{\pi:\, \mathcal{B}(\pi_{t:T}) = \mathbf{l}_{s/2:|\mathbf{l}|}} \prod_{t'=t+1}^{T} y^{t'}_{\pi_{t'}}$.
- Finally, the label sequence probability is given by the sum of the products of the forward and backward variables at any time t: $p(\mathbf{l}|\mathbf{x}) = \sum_{s=1}^{|\mathbf{l}'|} \alpha_t(s)\,\beta_t(s)$. (6)
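A sketch of the forward recursion over the extended sequence l′, assuming y is a (T, L+1) array of per-frame label probabilities with the blank at a known index; the skip transition between two distinct non-blank labels mirrors the allowed transitions described above. A real implementation would work in log space to avoid underflow.

```python
import numpy as np

def ctc_forward(y, labels, blank=0):
    """Forward variables alpha[t, s] for CTC.

    y:      (T, L+1) softmax outputs, blank at column `blank`
    labels: target label sequence l (without blanks)
    Returns alpha and p(l|x).
    """
    # extended sequence l': blank, l_1, blank, l_2, ..., blank
    ext = [blank]
    for c in labels:
        ext += [c, blank]
    T, S = y.shape[0], len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, blank]           # start with a blank ...
    if S > 1:
        alpha[0, 1] = y[0, ext[1]]      # ... or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                         # stay on the same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]                # move from the previous symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]                # skip the blank between
                                                        # two distinct labels
            alpha[t, s] = a * y[t, ext[s]]
    # the labelling ends with the last label or the trailing blank
    p = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return alpha, p
```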
F. CTC Objective Function
- The CTC objective function is defined as the negative log probability of the network correctly labelling the entire training set.
- Then the objective function O can be expressed as $O = -\sum_{(\mathbf{x},\mathbf{z})\in S} \ln p(\mathbf{z}|\mathbf{x})$. (7) The network can be trained with gradient descent by first differentiating O with respect to the outputs, then using backpropagation through time [47] to find the derivatives with respect to the network weights.
- Substituting this into (7) gives $\frac{\partial O}{\partial y^t_k} = -\frac{1}{p(\mathbf{z}|\mathbf{x})\, y^t_k} \sum_{s \in lab(\mathbf{z},k)} \alpha_t(s)\,\beta_t(s)$. (9) To backpropagate the gradient through the output layer, the authors need the objective function derivatives with respect to the outputs $a^t_k$ before the activation function is applied.
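Applying the softmax derivative to (9) and using (6) gives the pre-activation gradient in the form usually quoted for CTC (a standard manipulation, stated here for completeness rather than verbatim from the paper):

$$\frac{\partial O}{\partial a^t_k} \;=\; y^t_k \;-\; \frac{1}{p(\mathbf{z}|\mathbf{x})}\sum_{s \in lab(\mathbf{z},k)} \alpha_t(s)\,\beta_t(s)$$

where $lab(\mathbf{z},k)$ denotes the positions in the extended sequence z′ at which label k occurs.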
G. CTC Decoding
- Once the network is trained, the authors would ideally transcribe some unknown input sequence x by choosing the labelling with the highest conditional probability: $\mathbf{l}^* = \arg\max_{\mathbf{l}} p(\mathbf{l}|\mathbf{x})$.
- For some tasks the authors want to constrain the output labellings according to a grammar.
- The decoding then assumes that the input sequence is conditionally independent of the grammar given the labelling; note that this assumption is in general false, since both the input sequences and the grammar depend on the underlying generator of the data, for example the language being spoken.
- If the authors further assume that, prior to any knowledge about the input or the grammar, all label sequences are equally probable, (14) reduces to $\mathbf{l}^* = \arg\max_{\mathbf{l}} p(\mathbf{l}|\mathbf{x})\,p(\mathbf{l}|G)$. (17)
- Note that, since the number of possible label sequences is finite (because both L and |l| are finite), assigning equal prior probabilities does not lead to an improper prior.
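Without a dictionary or grammar, a common approximation is best-path decoding: take the most active output at every time step and collapse the result with B. A sketch, in which the alphabet and blank index are illustrative:

```python
import numpy as np

def best_path_decode(y, alphabet, blank=0):
    """Approximate l* by taking the argmax label at each frame and
    collapsing repeats and blanks (best-path decoding).

    y:        (T, L+1) per-frame label probabilities, blank at index `blank`
    alphabet: list mapping label index -> character
    """
    path = np.argmax(y, axis=1)
    chars, prev = [], None
    for k in path:
        if k != prev and k != blank:
            chars.append(alphabet[k])
        prev = k
    return "".join(chars)
```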
H. CTC Token Passing Algorithm
- The authors now describe an algorithm, based on the token passing algorithm for HMMs [48], that allows us to find an approximate solution to (17) for a simple grammar.
- The transition probabilities are used when a token is passed from the last character in one word to the first character in another.
- Each token tok(w, s, t) is the highest scoring token reaching segment s of word w at time t.
- Because the output tokens tok(w, −1, T) are sorted in order of score, the search can be terminated when a token is reached whose score is less than the current best score with the transition included.
- If no bigrams are used, lines 15-17 can be replaced by a simple search for the highest scoring output token, and the complexity reduces to O(TW).
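The full token passing pseudocode is given in the paper; the sketch below is deliberately not that algorithm, but a brute-force single-word variant that illustrates what each token's score represents: the CTC log probability of one dictionary word under the frame-wise outputs, with the best-scoring word returned. The word list and character-to-label mapping are assumed inputs.

```python
import numpy as np

def word_log_prob(y, labels, blank=0):
    """log p(word|x) for one label sequence, via the CTC forward recursion
    (log-space version of the alpha recursion sketched earlier)."""
    ext = [blank]
    for c in labels:
        ext += [c, blank]
    T, S = y.shape[0], len(ext)
    logy = np.log(np.clip(y, 1e-30, 1.0))
    NEG = -1e30
    alpha = np.full((T, S), NEG)
    alpha[0, 0] = logy[0, blank]
    if S > 1:
        alpha[0, 1] = logy[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]
            if s > 0:
                terms.append(alpha[t - 1, s - 1])
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(terms) + logy[t, ext[s]]
    tail = alpha[T - 1, S - 2] if S > 1 else NEG
    return np.logaddexp(alpha[T - 1, S - 1], tail)

def best_word(y, dictionary, char_to_label):
    """Pick the single dictionary word with the highest CTC log probability."""
    scores = {w: word_log_prob(y, [char_to_label[c] for c in w]) for w in dictionary}
    return max(scores, key=scores.get)
```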
V. EXPERIMENTS AND RESULTS
- The aim of their experiments was to evaluate the complete RNN handwriting recognition system, illustrated in Figure 13, on both online and offline handwriting.
- The online and offline databases used were the IAM-OnDB and the IAM-DB, respectively.
- Note that these do not correspond to the same handwriting samples: the IAM-OnDB was acquired from a whiteboard (see Section II), while the IAM-DB consists of scanned images of handwritten forms (see Section III).
- To make the comparisons fair, the same online and offline preprocessing was used for both the HMM and RNN systems.
- As well as the main comparisons, extra experiments were carried out on the online database to determine the effect of varying the dictionary size, and of disabling the forward and backward hidden layers in the RNN.
A. Data Sets
- For the online experiments the authors used the IAM-OnDB, a database acquired from a ‘smart’ whiteboard [49].
- After preprocessing, the input data consisted of 25 inputs per time step.
- Both the online and offline transcriptions contain 81 separate characters, including all lower case and capital letters as well as various other special characters, e.g., punctuation marks, digits, a character for garbage symbols, and a character for the space.
- Note that for the RNN systems, only 80 of these were used, since the garbage symbol is not required for CTC.
B. Language Model and Dictionaries
- Dictionaries were used for all the experiments where word accuracy was recorded.
- All dictionaries were derived from three different text corpora, the LOB (excluding the data used as prompts), the Brown corpus [52], and the Wellington corpus [53].
- The figure 20,000 was chosen because it had been previously shown to give best results for HMMs [51].
- Note that this dictionary was ‘open’, in the sense that it did not contain all the words in either the online or offline test set.
- To analyse the dependency of the recognition scores on the lexicon, the authors carried out extra experiments on the online data using open dictionaries with between 5,000 and 30,000 words (also chosen from the most common words in LOB).
C. HMM Parameters
- The HMM system used for the online and offline experiments was similar to that described in [35].
- For the online data every character model contained eight states, while for the offline data the number of states was chosen individually for each character [51].
- Both databases are available for public download, along with the corresponding task definitions: http://www.iam.unibe.ch/fki/databases/iam-on-line-handwriting-database
- The observation probabilities were modelled by mixtures of diagonal Gaussians.
- 32 Gaussians were used for the online data and 12 for the offline data.
D. RNN Parameters
- The RNN had a bidirectional Long Short-Term Memory architecture with a connectionist temporal classification (CTC) output layer (see Section IV for details).
- The forward and backward hidden layers each contained 100 LSTM memory blocks.
- The size of the input layer was determined by the data: for the online data there were 25 inputs, for the offline data there were 9.
- Otherwise the network was identical for the two tasks.
- The error rate was recorded every 5 epochs on the validation set and training was stopped when performance had ceased to improve on the validation set for 50 epochs.
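For orientation, this is roughly how the reported configuration maps onto a modern toolkit. It is a PyTorch sketch, not the authors' implementation: a hidden unit per direction stands in for each LSTM memory block, the output layer has 80 character labels plus the CTC blank, and the input size is 25 for the online features or 9 for the offline ones.

```python
import torch.nn as nn

class BLSTMCTC(nn.Module):
    """Sketch of the reported setup: one BLSTM hidden layer + CTC output layer."""

    def __init__(self, n_inputs=25, n_hidden=100, n_labels=80):
        super().__init__()
        # 100 units per direction, standing in for 100 LSTM memory blocks
        self.blstm = nn.LSTM(n_inputs, n_hidden, bidirectional=True)
        # 80 character labels + 1 blank unit for CTC
        self.out = nn.Linear(2 * n_hidden, n_labels + 1)

    def forward(self, x):                         # x: (T, batch, n_inputs)
        h, _ = self.blstm(x)
        return self.out(h).log_softmax(dim=-1)    # (T, batch, n_labels + 1)

# Training would pair this with a CTC loss and early stopping on a
# validation set, as described above (all hyperparameters illustrative).
model = BLSTMCTC(n_inputs=25)                     # use n_inputs=9 for the offline features
ctc_loss = nn.CTCLoss(blank=80)                   # the extra output unit is the blank
```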
E. Main Results
- As can be seen from Tables I and II, the RNN substantially outperformed the HMM on both databases.
- To put these results in perspective, the Microsoft tablet PC handwriting recogniser [55] gave a word accuracy score of 71.32% on the online test set.
- This suggests that their recogniser is competitive with the best commercial systems for unconstrained handwriting.
- In the second set of experiments, the authors took a closed dictionary containing the 5,597 words in the online test set, and measured the change in performance when this was padded to 20,000 words.
- The results for the first set of experiments are shown in Table III and plotted in Figure 14.
I. Use of Context by the RNN
- The authors have previously asserted that the BLSTM architecture is able to access long-range, bidirectional context.
- The larger the values of the sequential Jacobian for some t, t′, the more sensitive the network output at time t is to the input at time t′.
- Figure 16 plots the sequential Jacobian for a single output during the transcription of a line from the online database.
- As can be seen, the network output is sensitive to information from about the first 120 time steps of the sequence, which corresponds roughly to the length of the first word.
- Moreover, this area of sensitivity extends in both directions from the point where the prediction is made.
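The sequential Jacobian can be approximated with automatic differentiation: pick one output unit at one time step and measure the gradient magnitude with respect to the input at every other time step. The sketch below uses an untrained toy BLSTM purely to illustrate the measurement; all sizes are arbitrary.

```python
import torch
import torch.nn as nn

T, n_in, n_hidden, n_out = 50, 9, 16, 10
blstm = nn.LSTM(n_in, n_hidden, bidirectional=True)
readout = nn.Linear(2 * n_hidden, n_out)

x = torch.randn(T, 1, n_in, requires_grad=True)    # one sequence, batch size 1
h, _ = blstm(x)
y = readout(h)                                      # (T, 1, n_out)

t, k = 25, 3                                        # output time step and unit to inspect
y[t, 0, k].backward()

# sensitivity of y[t, k] to the input at every time step t'
sensitivity = x.grad.abs().sum(dim=-1).squeeze(1)   # shape (T,)
print(sensitivity)
```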
VI. DISCUSSION
- The authors' experiments reveal a substantial gap in performance between the HMM and RNN systems, with a relative error reduction of over 40% in some cases.
- Firstly, standard HMMs are generative, while an RNN trained with a discriminative objective function (such as CTC) is discriminative.
- A second difference is that RNNs provide more flexible models of the input features than the mixtures of diagonal Gaussians used in standard HMMs.
- It should be noted that RNNs typically perform better using input features with simpler relationships.
- A third difference is that the internal state of an HMM is a single discrete variable; this means that for an HMM with n states, only O(log n) bits of information about the past observation sequence are carried by the internal state.
VII. CONCLUSIONS
- The authors have introduced a novel approach for recognising unconstrained handwritten text, using a recurrent neural network.
- The key features of the network are the bidirectional Long Short-Term Memory architecture, which provides access to long range, bidirectional contextual information, and the connectionist temporal classification output layer, which allows the network to be trained on unsegmented sequence data.
- In experiments on online and offline handwriting data, the new approach outperformed a state-of-the-art HMM-based system, and also proved more robust to changes in dictionary size.