
0162-8828 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2017.2732978, IEEE
Transactions on Pattern Analysis and Machine Intelligence
A SUBMISSION TO IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1
Learning Spatial-Semantic Context with Fully
Convolutional Recurrent Network for Online
Handwritten Chinese Text Recognition
Zecheng Xie, Zenghui Sun, Lianwen Jin, Hao Ni, and Terry Lyons
Abstract—Online handwritten Chinese text recognition (OHCTR) is a challenging problem as it involves a large-scale character set,
ambiguous segmentation, and variable-length input sequences. In this paper, we exploit the outstanding capability of path signature to
translate online pen-tip trajectories into informative signature feature maps, successfully capturing the analytic and geometric
properties of pen strokes with strong local invariance and robustness. A multi-spatial-context fully convolutional recurrent network
(MC-FCRN) is proposed to exploit the multiple spatial contexts from the signature feature maps and generate a prediction sequence
while completely avoiding the difficult segmentation problem. Furthermore, an implicit language model is developed to make
predictions based on semantic context within a predicting feature sequence, providing a new perspective for incorporating lexicon
constraints and prior knowledge about a certain language in the recognition procedure. Experiments on two standard benchmarks,
Dataset-CASIA and Dataset-ICDAR, yielded outstanding results, with correct rates of 97.50% and 96.58%, respectively, which are
significantly better than the best result reported thus far in the literature.
Index Terms—Handwritten Chinese text recognition, path signature, residual recurrent network, multiple spatial contexts, implicit
language model
1 INTRODUCTION
In recent years, increasingly in-depth studies have led to significant developments in the field of handwritten text recognition. Various methods have been proposed by the research community, including integrated segmentation-recognition methods [1], [2], [3], [4], [5], hidden Markov models (HMMs) and their hybrid variants [6], [7], segmentation-free methods [8], [9], [10] with long short-term memory (LSTM) and multi-dimensional long short-term memory (MDLSTM), and integrated convolutional neural network (CNN)-LSTM methods [11], [12], [13], [14].
In this paper, we investigate the most recently developed
methods for online handwritten Chinese text recognition
(OHCTR), which is an interesting research topic presenting
the following challenges: a large character set, ambiguous
segmentation, and variable-length input sequences.
Segmentation is the fundamental component of hand-
written text recognition, and it has attracted the attention of
numerous researchers [1], [2], [3], [4], [5], [15], [16]. Among
the above-mentioned methods, over-segmentation [1], [2],
[3], [4], [5], i.e., an integrated segmentation-recognition
method, is the most efficient method and still plays a
crucial role in OHCTR. The basic concept underlying over-
segmentation is to slice the input string into sequential
character segments whose candidate classes can be used to
construct the segmentation-recognition lattice [2]. Based on
the lattice, path evaluation, which integrates the recognition scores, geometry information, and semantic context, is conducted to search for the optimal path and generate the recognition result. In practice, segmentation inevitably leads to mis-segmentation, which is barely rectifiable through post-processing and thus degrades the overall performance.

Z. Xie, Z. Sun, and L. Jin are with the College of Electronic and Information Engineering, South China University of Technology, Guangzhou, China. E-mail: {zcheng.xie, sunfreding, lianwen.jin}@gmail.com
H. Ni is with the Oxford-Man Institute for Quantitative Finance, University of Oxford, Oxford, UK. E-mail: hao.ni@maths.ox.ac.uk
T. Lyons is with the Mathematical Institute, University of Oxford, Oxford, UK. E-mail: tlyons@maths.ox.ac.uk
Segmentation-free methods are flexible alternative methods that completely avoid the segmentation procedure. HMMs and their hybrid variants [6], [7] have been widely
used in handwritten text recognition. In general, the input
string is converted into slices by sliding windows, followed
by feature extraction and frame-wise prediction using an
HMM. Finally, the Viterbi algorithm is applied to search
for the best character string with maximum a posteriori
probability. However, HMMs are limited not only by the
assumption that their observation depends only on the cur-
rent state but also by their generative nature that generally
leads to poor performance in labeling and classification
tasks, as compared to discriminative models. Even though
hybrid models that combine HMMs with other network
architectures, including recurrent neural networks [17] and
multilayer perceptrons [18], have been proposed to alleviate
the above-mentioned limitations by introducing context into
HMMs, they still suffer from the drawbacks of HMMs.
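The sliding-window HMM pipeline described above ends with a Viterbi search for the maximum a posteriori character string. As a generic illustration only (a toy sketch with hypothetical probabilities, not the cited systems' actual models), a log-domain Viterbi decoder for a discrete-emission HMM can be written as:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Log-domain Viterbi: maximum a posteriori state sequence for an HMM.
    log_pi: (n,) initial state log-probs; log_A: (n, n) transition log-probs;
    log_B: (n, m) emission log-probs; obs: list of observed symbol indices."""
    T = len(obs)
    n = log_pi.shape[0]
    delta = log_pi + log_B[:, obs[0]]    # best log-score of paths ending in each state
    back = np.zeros((T, n), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: best path entering j from i
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    # backtrack from the best final state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With a two-state toy model whose emissions strongly favor one state per symbol, the decoded state sequence simply tracks the observations, which is the expected MAP behavior.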
The recent development of recurrent neural networks,
especially LSTM [8], [9], [19] and MDLSTM [10], [19], has
provided a revolutionary segmentation-free perspective to
the problem of handwritten text recognition. In general,
LSTM is directly fed with a point-wise feature vector that
consists of the (x, y)-coordinate and relative features, while
it recurrently updates its hidden state and generates per-
frame predictions for each time step. Then, it applies con-
nectionist temporal classification (CTC) to perform tran-
scription. It is worth noting that LSTM and MDLSTM have

[Fig. 1 (image): pipeline overview — the input trajectory is converted to signature feature maps; an FCN with receptive fields of multiple scales, a residual recurrent network, and a transcription layer together form the MC-FCRN and produce a predicting sequence; an embedding layer and multilayered BLSTM then perform implicit language modeling to learn semantic context. The figure also shows candidate CTC alignments (with blanks) of a Chinese text line.]
Fig. 1. Overview of the proposed method. Variable-length pen-tip trajectories are first translated into offline signature feature maps that preserve the essential online information. Then, a multi-spatial-context fully convolutional recurrent network (MC-FCRN) takes the signature feature maps as input, using receptive fields of different scales in a sliding-window manner, and generates a predicting sequence. Finally, an implicit LM derives the final label sequence by exploiting the semantic context of embedding vectors transformed from the predicting sequence.
been successfully applied to handwritten text recognition in Western languages, where the character set is relatively small (e.g., English has only 52 letter classes, so the network is easy to train). However, to the best of our knowledge, very few studies have addressed large-scale handwritten text recognition problems such as OHCTR, where the text lines may involve more than 7,000 basic character classes and more than 1 million character samples.
Architectures that integrate CNN and LSTM exhibit ex-
cellent performance in terms of visual recognition and de-
scription [20], [21], scene text recognition [12], [13], [14], and
handwritten text recognition [11]. In text recognition prob-
lems, deep CNNs generate highly abstract feature sequences
from input sequential data. LSTM is fed with such feature
sequences and generates corresponding character strings.
Jointly training LSTM with CNN is straightforward and
can improve the overall performance significantly. However,
in the above-mentioned methods, the CNNs, specifically
fully convolutional networks (FCNs), process the input
string with only a fixed-size receptive field in a sliding
window manner, which we claim is inflexible for uncon-
strained written characters in OHCTR. Moreover, a deep
integrated CNN-LSTM network is usually accompanied by the degradation problem [22], which slows down the convergence procedure and hampers system optimization.
In this paper, we propose a novel solution (see Fig. 1)
that integrates path signature, a multi-spatial-context fully
convolutional recurrent network (MC-FCRN), and an im-
plicit language model (implicit LM) to address the problem
of unconstrained online handwritten text recognition. Path
signature, a recent development in the field of the rough
path theory [23], [24], [25], is a promising approach for
translating variable-length pen-tip trajectories into offline
signature feature maps in our system, because it effective-
ly preserves the online information that characterizes the
analytic and geometric properties of the path. Encouraged
by recent advances in deep CNNs and LSTMs, we propose
the MC-FCRN for robust recognition of signature feature
maps. MC-FCRN leverages the multiple spatial contexts that
correspond to multiple receptive fields in each time step to
achieve strong robustness and high accuracy. Furthermore,
we propose an implicit LM, which incorporates semantic
context within the entire predicting feature sequence from
both forward and reverse directions, to enhance the predic-
tion for each time step. The contributions of this paper can
be summarized as follows:
• We develop a novel segmentation-free MC-FCRN to effectively capture the variable spatial contextual dynamics as well as the character information for high-performance recognition. With a series of receptive fields of different scales, MC-FCRN is able to model complicated spatial context with strong robustness and high accuracy.
• The residual recurrent network, a basic component of MC-FCRN, not only accelerates the convergence process but also improves the optimization result, while adding neither extra parameters nor computational burden to the system, as compared to an ordinary stacked recurrent network.
• We propose an implicit LM that learns to model the output distribution given the entire predicting feature sequence. Unlike statistical language models, which predict the next word given only a few previous words, our implicit LM exploits the semantic context not only from the forward and reverse directions of the text but also over arbitrary text lengths.
• Path signature, a novel mathematical feature set brought from rough path theory [23], [24], [25]

as a non-linear generalization of the classical theory of controlled differential equations, is successfully applied to capture essential online information from long pen-tip trajectories. Moreover, we investigate path signature for learning the variable online knowledge of the input string with different iterated integrals.
The remainder of this paper is organized as follows.
Section 2 reviews the related studies. Section 3 formally
introduces path signature. Section 4 details the network
architecture of FCRN and its extended version, namely
MC-FCRN. Section 5 describes the proposed implicit LM
and discusses the corresponding training strategy. Section 6
presents the experimental results. Finally, Section 7 con-
cludes the paper.
2 RELATED WORK
Feature extraction [26], [27], [28], [29], [30], [31] plays a cru-
cial role in traditional online handwritten text recognition.
The 8-directional feature [26], [29] is widely used in OHCTR
owing to its excellent ability to express stroke directions.
The projection of each trajectory point in eight directions
is calculated in a 2-D manner and eight pattern images
are generated accordingly. Going further, Graves et al. [8] considered not only the (x, y)-coordinate and its relationship with its neighbors in the time series but also the spatial information from an offline perspective, thus obtaining 25 features for each point. However, the above-mentioned techniques have been developed empirically.
Inspired by the theoretical work of Lyons and his colleagues
[23], [24], [25], we applied path signature to translate the
online pen-tip trajectories into offline signature feature maps
that maintain the essential features for characterizing the
online information of the trajectories. Furthermore, we can
use truncated path signature in practical applications to
achieve a trade-off between complexity and precision.
Yang et al. [32], [33] showed that the domain-specific
information extracted by the aforementioned methods can
improve the recognition performance with deep CNNs (DCNNs). However, DCNN-based networks are unable to handle input sequences of variable length in OHCTR. On the
contrary, LSTM- and MDLSTM-based networks have an
inherent advantage in dealing with such input sequences
and demonstrate excellent performance in unconstrained
handwritten text recognition [8], [9], [34], [35]. Recently,
deep learning methods that integrate LSTM and CNN have
demonstrated outstanding capability in the field of visual
captioning [21], [36] and scene text recognition [12], [13].
However, in this paper, we show that the simple combina-
tion of CNN and LSTM cannot utilize their full potential,
which is probably due to the degradation problem [22]. On
the other hand, highway networks [37], [38] and residual connections [22], [39] were advocated to solve the degradation problem in training very deep networks. Therefore, we
take inspiration from them and present the residual recur-
rent network to realize faster and better optimization of the
system. Furthermore, our MC-FCRN also differs from these
methods in that it uses multiple receptive fields of different
scales to capture highly informative contextual features in
each time step. Such a multi-scale strategy originates from
traditional methods. The pyramid match kernel [40] maps
features to multi-dimensional multi-resolution histograms
that help to capture co-occurring features. The SIFT vectors
[41] search for stable features across all possible scales and
construct a high-dimensional vector for the key points. Fur-
ther, spatial pyramid pooling [42] allows images of varying
size or scale to be fed during training and enhances the net-
work performance significantly. GoogLeNet [43] introduced
the concept of “inception” whereby multi-scale convolution
kernels are integrated to boost performance. We have drawn
inspiration from these multi-scale methods to design our
MC-FCRN.
In general, language modeling is applied after feature
extraction and recognition in order to improve the overall
performance of the system [1], [2], [3], [4], [31], [44]. The
concept of ‘embedding’ plays a critical role in computational
linguistics. Traditionally, one character is strictly represented by one 'embedding' [45]. However, as emphasized by Vilnis et al. [46], representing an object as a single point in space carries limitations. Instead, density-based distributed embeddings can provide much more information about each word, e.g., capturing uncertainty about a representation and its relationships. Recently, Mukherjee [47] further verified that a visual-linguistic mapping in which words and visual categories are both represented by distributions can improve results at the intersection of language and vision, owing to better exploitation of the intra-concept variability in each modality. In this paper, we take inspiration from these works and use the predicting feature sequence, instead of one-hot vectors, as the input of the implicit LM to maintain intra-concept variability, which reflects recognition confidence information in our problem.
neural networks, especially LSTM, in the field of language
translation [48] and visual captioning [20], [21] has provided
us with a new perspective of language models. To the best
of our knowledge, neural networks were first applied to lan-
guage modeling by Bengio et al. [45]. Subsequently, Mikolov
et al. [49] used recurrent neural network, and Sundermeyer
et al. [50] used LSTM for language modeling. For language
translation, Sutskever et al. [48] used multilayered LSTM to
encode the input text into a vector of fixed dimensionality
and then applied another deep LSTM to decode the text
in a different language. For visual captioning, Venugopalan
et al. [21] and Pan et al. [20] extracted deep visual CNN
representations from image or video data and then used
an LSTM as a sequence decoder to generate a description
for the representations. Partially inspired by these methods,
we developed our implicit LM to incorporate semantic con-
text for recognition. However, unlike the above-mentioned
methods, which only derive context information from the
past predicted text, our implicit LM learns to make predic-
tions given the entire predicting feature sequence in both
forward and reverse directions.
3 PATH SIGNATURE
Proper translation of online data into offline feature maps
while retaining most, or hopefully all, of the online knowl-
edge within the pen-tip trajectory plays an essential role in
online handwritten recognition. To this end, we investigate
path signature, which was pioneered by Chen [51] in the
form of iterated integrals and developed by Lyons and

[Fig. 2 (image): panel (a) shows the feature-extraction pipeline — uniform-time sampling, equal-distance sampling, and the window-based signature combined via Chen's identity; panel (b) gives a worked example of the calculation using Eq. (2) and Eq. (5); panels (c) and (d) visualize signature feature maps of handwritten text.]
Fig. 2. (a) Illustration of feature extraction with path signature. (b) A simple example of the calculation of path signature features. (c) Path signature of a typical online handwritten text example. (d) Left: path signature of the original pen-tip trajectories; Right: path signature of the pen-tip trajectories with randomly added connections between adjacent strokes. Note that, except for the additional connections, the original part of the sequential data has the same path signature (same color).
his colleagues as a fundamental component of rough path
theory [23], [24], [25]. Path signature was first introduced into handwritten Chinese character recognition by Benjamin Graham [52] and later adopted by Yang et al. [32], [33], but only at the character level. We go further by applying path signature to extremely long sequential data that usually consist of hundreds of thousands of points, and we prove its effectiveness on the OHCTR problem. In the following, we first introduce path signature theoretically, and then technically for the sake of implementation and application.
Consider the pen strokes of the online handwritten text collected from a writing plane $H \subset \mathbb{R}^2$. Then, a pen stroke can be expressed as a continuous mapping denoted by $D : [a, b] \to H$ with $D = (D_t^1, D_t^2)$ and $t \in [a, b]$. For $k \geq 1$ and a collection of indexes $i_1, \cdots, i_k \in \{1, 2\}$, the $k$-th fold iterated integral of $D$ along the indexes $i_1, \cdots, i_k$ can be defined by

$$P(D)_{a,b}^{i_1, \cdots, i_k} = \int_{a < t_1 < \cdots < t_k < b} \mathrm{d}D_{t_1}^{i_1} \cdots \mathrm{d}D_{t_k}^{i_k}. \qquad (1)$$
The signature of the path is the collection of all the iterated integrals of $D$:

$$P(D)_{a,b} = \big(1,\, P(D)_{a,b}^{1},\, P(D)_{a,b}^{2},\, P(D)_{a,b}^{1,1},\, P(D)_{a,b}^{1,2},\, P(D)_{a,b}^{2,1},\, P(D)_{a,b}^{2,2},\, \cdots\big), \qquad (2)$$

where the superscripts of the terms $P(D)_{a,b}^{i_1, \cdots, i_k}$ run over the set of all multi-indexes

$$G = \{(i_1, \ldots, i_k) \mid i_1, \cdots, i_k \in \{1, 2\},\; k \geq 1\}. \qquad (3)$$
Then, the $k$-th iterated integral of the signature, $P(D)_{a,b}^{(k)}$, is the finite collection of terms $P(D)_{a,b}^{i_1, \cdots, i_k}$ with multi-indexes of length $k$. More specifically, $P(D)_{a,b}^{(k)}$ is the $2^k$-dimensional vector defined by

$$P(D)_{a,b}^{(k)} = \big(P(D)_{a,b}^{i_1, \cdots, i_k} \mid i_1, \cdots, i_k \in \{1, 2\}\big). \qquad (4)$$
In [25], it is proved that the whole signature of a path determines the path up to time re-parameterization; i.e., path signature can not only characterize the path displacement and its further derivatives, as the classical directional features do, but also provide more detailed analytic and geometric properties of the path. In practice, we have to use the truncated signature feature, which can still capture the global information of the path. Increasing the truncation degree of the signature results in exponential growth of the feature dimension but may not always lead to significant gain.
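To make the exponential growth explicit: for a 2-D path truncated at level $k$, the number of iterated-integral terms (counting the constant level-0 term) is $\sum_{i=0}^{k} 2^i = 2^{k+1} - 1$. A quick check of this count (an illustrative helper, not part of the authors' code):

```python
def truncated_sig_dim(d, k):
    """Number of iterated-integral terms up to level k for a d-dimensional
    path, including the constant level-0 term: sum_{i=0..k} d**i."""
    return sum(d ** i for i in range(k + 1))
```

For a 2-D pen trajectory, level 2 already gives 7 terms and level 4 gives 31, which is why the truncation degree is kept low in practice.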
Next, we describe the practical calculation of path signature in OHCTR from the implementation and application point of view. As illustrated in Fig. 2a, the pen-tip trajectories of the online handwritten text samples are represented by a sequence of uniform-time sampling points. First, the uniform-time sampling trajectory is translated into equal-distance sampling style. Then, to calculate the signature feature for a specific point, e.g., the red point in Fig. 2a, we estimate the window-based signature $P(D)_{t_1,t_3}$ that takes this point as the midpoint. To calculate $P(D)_{t_1,t_3}$, we first compute the point-wise signatures $P(D)_{t_1,t_2}$ and $P(D)_{t_2,t_3}$, and then combine them according to Chen's identity [51].
As adjacent sampling points of text samples are connected by a straight line $D = (D_t^1, D_t^2)$ with $t \in [a, b]$, the iterated integrals $P(D)_{a,b}^{(k)}$ can be calculated iteratively as follows:

$$P(D)_{a,b}^{(k)} = \begin{cases} 1, & k = 0, \\ \big(P(D)_{a,b}^{(k-1)} \otimes \triangle_{a,b}\big)/k, & k \geq 1, \end{cases} \qquad (5)$$

where $\triangle_{a,b} := D_b - D_a$ denotes the path displacement and $\otimes$ represents the tensor product. In Fig. 2b, we provide a
simple example to explain the calculation of path signature according to Eq. (5) and Eq. (2). Suppose the path consists of two adjacent straight-line segments, over $t \in [t_1, t_2]$ and $t \in [t_2, t_3]$, as shown in Fig. 2a. Then, following Chen's identity [51], we can calculate the path signature for the concatenation of these two segments as

$$P(D)_{t_1,t_3}^{(k)} = \sum_{i=0}^{k} P(D)_{t_1,t_2}^{(i)} \otimes P(D)_{t_2,t_3}^{(k-i)}. \qquad (6)$$
Given the pen-tip trajectories of online handwritten text, for each sequential stroke point, we first compute the path signature within a sliding window according to Eq. (2) and Eq. (6). Then, the path signature features of a certain level $k$ along all the stroke points form the corresponding feature maps. Specifically, the path signature feature vector of each stroke point spreads over $2^{k+1} - 1$ two-dimensional matrices, according to the coordinates of the stroke point in the handwritten text data. The 0th, 1st, and 2nd iterated-integral signature feature maps, i.e., the above-mentioned two-dimensional matrices, of one typical online handwritten text example are visualized in Fig. 2c.

In Fig. 2a, $P(D)_{t_1,t_3}$ is computed with window size 3. In practice, we set the window size to 9 to maintain strong local invariance and robustness. Fig. 2d shows that, although connections are randomly added between adjacent strokes within a character or between characters, their impact on the path signature of the original input string is not significant, which demonstrates that sliding-window-based path signature has excellent local invariance and robustness.
4 MULTI-SPATIAL-CONTEXT FCRN
Unlike character recognition, where it is easy to normalize
characters to a fixed size, text recognition is complicated be-
cause it involves input sequences of variable length, such as
feature maps and online pen-tip trajectories. We propose a
new fully convolutional recurrent network (FCRN) for spa-
tial context learning to overcome this problem by leveraging
a fully convolutional network, a residual recurrent network,
and connectionist temporal classification, all of which nat-
urally take inputs of arbitrary size or length. Furthermore,
we extend our FCRN to multi-spatial-context FCRN (MC-
FCRN), as shown in Fig. 1, to learn multi-spatial-context
knowledge from complicated signature feature maps. In
the following subsections, we briefly introduce the basic
components of FCRN and explain their roles in the architec-
ture. Then, we demonstrate how MC-FCRN performs multi-
spatial-context learning for the OHCTR problem.
4.1 Fully Convolutional Recurrent Network
4.1.1 Fully Convolutional Network
DCNNs exhibit excellent performance in computer vision
applications such as image classification [39], [42], scene text
recognition [12], [13], and visual description [20], [21]. Fol-
lowing the approach of Long et al. [53], we remove the orig-
inal last fully connected classification layer from DCNNs to
construct a fully convolutional network. Fully convolutional
networks not only inherit the ability of DCNNs to learn
powerful and interpretable image features but also adapt to
variable input image size and generate corresponding-size
feature maps. It is worth noting that such CNN feature maps
contain strong spatial order information from the overlap
regions (known as receptive fields) of the original feature
maps. Such spatial order information is very important
and can be leveraged to learn spatial context to enhance
the overall performance of the system. Furthermore, unlike
image cropping or sliding window-based approaches, FCNs
eliminate redundant computations by sharing convolutional
response maps layer by layer to achieve efficient inference
and backpropagation.
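The spatial order information above is tied to how the receptive field of each output column grows with depth. Under standard convolution arithmetic, this growth can be computed as follows (a generic sketch with hypothetical layer settings, not the paper's exact architecture):

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv/pool layers.
    layers: list of (kernel_size, stride) pairs, input-to-output order."""
    rf = 1     # receptive field of one output unit
    jump = 1   # distance in input pixels between adjacent units at this depth
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf
```

For example, three 3×3 convolutions interleaved with two stride-2 poolings already see an 18-pixel-wide window, illustrating how stacked layers trade resolution for context.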
4.1.2 The Residual Recurrent Network
Recurrent neural networks (RNNs), which are well known
for the self-connected hidden layer that recurrently transfers
information from output to input, have been widely adopt-
ed to learn continuous sequential features. Recently, long
short-term memory (LSTM) [54], a variant of RNN that over-
comes the gradient vanishing and exploding problem, has
demonstrated excellent performance in terms of learning
complex and long-term temporal dynamics in applications
such as language translation [55], visual description [20],
[21], and text recognition [12], [13]. Bidirectional LSTM
(BLSTM) facilitates the learning of complex context dynam-
ics in both forward and reverse directions, thereby out-
performing unidirectional networks significantly. Stacked
LSTM is also popular for sequence learning for accessing
higher-level abstract information in temporal dimensions.
Integrated CNN-LSTM systems demonstrate their out-
standing capability in visual recognition and description
[20], [21] and scene text recognition [12], [13], [14]. How-
ever, the degradation problem [39] usually accompanies
deep integrated CNN-LSTM networks and slows down the
convergence process. Driven by the significance of deep
residual learning [22], [39] for optimization of very deep
networks, we present the residual recurrent network to accelerate the convergence of our FCRN and obtain a better optimization result. Theoretically, we explicitly reformulate
the LSTM\BLSTM layer (denoted by $h$ with parameters $\omega_h$) as learning the spatial contextual information with reference to the input. Denoting the $l$-th LSTM\BLSTM layer output as $q_l(x)$, we have

$$q_l(x) = h(q_{l-1}(x)) + q_{l-1}(x). \qquad (7)$$

Iteratively applying $q_l(x) = h(q_{l-1}(x)) + q_{l-1}(x) = h(q_{l-1}(x)) + h(q_{l-2}(x)) + q_{l-2}(x)$ up to $q_L(x)$, we get

$$q_L(x) = q_0(x) + \sum_{l=0}^{L-1} h(q_l(x)), \qquad (8)$$

where $L$ is the total number of layers of the residual multilayered LSTM\BLSTM. The residual recurrent network has the
following advantages in jointly learning with deep CNN.
First, gradient information can easily pass through the
complex residual recurrent network through the identity
mapping q
L
(x) = q
0
(x) according to Eq. (8), as the term
P
L1
l=1
h(q
l
(x)) for the residual spatial learning is very small
and has not yet functioned in the early training stage. There-
fore, the system gains rapid growth in the nascent period (as
illustrated in Fig. 6 ). Furthermore, by gradually occupying a
greater proportion in Eq. (8), the term
P
L1
l=1
h(q
l
(x)) plays
an increasingly important role in spatial context learning.
As a result, our residual recurrent network captures the
contextual information from a sequence through the term
P
L1
l=1
h(q
l
(x)) in an elegant manner, making the text recog-
nition process more efficient and reliable than processing
each character independently. Finally, the residual recurrent
network significantly promotes system performance while
not adding extra parameter or computational burden to the
system.
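The recursion in Eq. (7) and its telescoped form in Eq. (8) can be checked numerically. In the sketch below, a hypothetical tanh-linear map stands in for the LSTM\BLSTM layer $h$; the algebra only requires that $h$ be a fixed function, so any stand-in illustrates the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the LSTM\BLSTM layer h; any fixed map works here.
W = 0.1 * rng.standard_normal((8, 8))
def h(q):
    return np.tanh(q @ W)

L = 4                         # number of residual recurrent layers
q0 = rng.standard_normal(8)   # input feature vector q_0(x)

# Eq. (7): q_l = h(q_{l-1}) + q_{l-1}, applied layer by layer.
outputs = [q0]
for _ in range(L):
    outputs.append(h(outputs[-1]) + outputs[-1])
qL = outputs[-1]

# Eq. (8): the same result, as the identity path plus accumulated residuals.
qL_telescoped = q0 + sum(h(outputs[l]) for l in range(L))

assert np.allclose(qL, qL_telescoped)
```

The early-training argument follows directly: when the residual terms $h(q_l(x))$ are near zero, $q_L(x) \approx q_0(x)$ and gradients flow through the identity path.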
4.1.3 Transcription
Connectionist temporal classification (CTC), which facili-
tates the use of FCN and LSTM for sequential training
without requiring any prior alignment between input im-
ages and their corresponding label sequences, is adopted as
the transcription layer in our framework. Let $C$ represent
all the characters used in this problem and let "blank"
represent the null emission. Then, the character set can
be denoted as $C' = C \cup \{\text{blank}\}$. Given an input sequence
$u = (u_1, u_2, \cdots, u_T)$ of length $T$, where $u_t \in \mathbb{R}^{|C'|}$, we can
obtain an exponentially large number of label sequences of
length $T$, referred to as alignments $\pi$, by assigning a label to
each time step.
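Alignments relate to label sequences through CTC's many-to-one collapse mapping (usually written $\mathcal{B}$), which merges repeated labels and then removes blanks; this is standard CTC behavior rather than anything specific to this paper. A minimal sketch, using "-" as a stand-in for the blank symbol:

```python
# Sketch of CTC's many-to-one mapping B: merge repeated labels, then drop blanks.
# The "-" symbol is a hypothetical stand-in for the blank label.
BLANK = "-"

def collapse(pi):
    """Map an alignment (string over C' = C + {blank}) to its label sequence."""
    out = []
    prev = None
    for c in pi:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return "".join(out)

# Many distinct length-T alignments map to the same label sequence:
assert collapse("aa-b") == "ab"
assert collapse("-ab-") == "ab"
# A blank between repeats preserves a doubled character:
assert collapse("ab-b") == "abb"
```

This many-to-one structure is why the number of alignments grows exponentially in $T$ while the set of label sequences stays comparatively small.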
References
- Deep Residual Learning for Image Recognition: proposes a residual learning framework that eases the training of networks substantially deeper than those used previously; winner of the ILSVRC 2015 classification task.
- Long Short-Term Memory: introduces an efficient, gradient-based method (LSTM) that learns to bridge time lags in excess of 1,000 discrete time steps by enforcing constant error flow through constant error carousels within special units.
- Distinctive Image Features from Scale-Invariant Keypoints: presents a method for extracting distinctive invariant features from images that enables reliable matching between different views of an object or scene, robust to clutter and occlusion at near real-time performance.
- Going Deeper with Convolutions: presents the Inception architecture, which set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift: applied to a state-of-the-art image classification model, batch normalization achieves the same accuracy with 14 times fewer training steps and beats the original model by a significant margin.
Frequently Asked Questions (10)
Q1. What is the network architecture of the implicit LM?

The network architecture of their implicit LM consists of three components: the embedding layer, the language modeling layer, and the transcription layer. 

The residual recurrent network of MC-FCRN not only substantially accelerates the convergence procedure but also promotes the performance significantly, while adding neither extra parameter nor computational burden to the system. 

This is because the recurrent network had not yet functioned during that period; thus the FCN can be optimized directly with the CTC loss function through the residual connections. 

Among the above-mentioned methods, over-segmentation [1], [2], [3], [4], [5], i.e., an integrated segmentation-recognition method, is the most efficient method and still plays a crucial role in OHCTR. 

On the other hand, highway networks [37], [38] and residual connections [22], [39] were proposed to solve the degradation problem [22] in training very deep networks. 

Fig. 2d shows that, although connections are randomly added between adjacent strokes within a character or between characters, their impact on the path signature of the original input string is not significant, demonstrating that the sliding-window-based path signature has excellent local invariance and robustness. 

To evaluate the effectiveness of the proposed system, the authors conducted experiments on the standard benchmark dataset CASIA-OLHWDB [57] and the ICDAR2013 Chinese handwriting recognition competition dataset [58] for unconstrained online handwritten Chinese text recognition. 

As adjacent sampling points of text samples are connected by a straight line $D = (D^1_t, D^2_t)$ with $t \in [a, b]$, the iterated integrals $P(D)^{(k)}_{a,b}$ can be calculated iteratively as follows:

$$P(D)^{(k)}_{a,b} = \begin{cases} 1, & k = 0, \\ \left( P(D)^{(k-1)}_{a,b} \otimes \triangle_{a,b} \right) / k, & k \geq 1, \end{cases} \qquad (5)$$

where $\triangle_{a,b} := D_b - D_a$ denotes the path displacement and $\otimes$ represents the tensor product.
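Because each segment is a straight line, the recursion in Eq. (5) yields the closed form $P(D)^{(k)}_{a,b} = \triangle_{a,b}^{\otimes k} / k!$. A small numerical sketch (the displacement value is hypothetical):

```python
import numpy as np

# Hypothetical pen displacement delta = D_b - D_a for one straight segment.
delta = np.array([3.0, 4.0])

def signature_levels(delta, K):
    """Iterated integrals of a straight-line segment, per the recursion in Eq. (5)."""
    levels = [np.array(1.0)]  # k = 0: the signature starts at 1
    for k in range(1, K + 1):
        # tensor product with the displacement, divided by k
        levels.append(np.tensordot(levels[-1], delta, axes=0) / k)
    return levels

levels = signature_levels(delta, 3)
assert np.allclose(levels[1], delta)                       # level 1: displacement
assert np.allclose(levels[2], np.outer(delta, delta) / 2)  # level 2: delta (x) delta / 2!
```

Level $k$ is a rank-$k$ tensor, so truncating at a small $K$ keeps the signature feature maps compact.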

As a result, their residual recurrent network captures the contextual information from a sequence through the term $\sum_{l=0}^{L-1} h(q_l(x))$ in an elegant manner, making the text recognition process more efficient and reliable than processing each character independently. 

The authors propose a new fully convolutional recurrent network (FCRN) for spatial context learning to overcome this problem by leveraging a fully convolutional network, a residual recurrent network, and connectionist temporal classification, all of which naturally take inputs of arbitrary size or length.