
0162-8828 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2017.2732978, IEEE
Transactions on Pattern Analysis and Machine Intelligence
A SUBMISSION TO IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1
Learning Spatial-Semantic Context with Fully
Convolutional Recurrent Network for Online
Handwritten Chinese Text Recognition
Zecheng Xie, Zenghui Sun, Lianwen Jin, Hao Ni, and Terry Lyons
Abstract—Online handwritten Chinese text recognition (OHCTR) is a challenging problem as it involves a large-scale character set,
ambiguous segmentation, and variable-length input sequences. In this paper, we exploit the outstanding capability of path signature to
translate online pen-tip trajectories into informative signature feature maps, successfully capturing the analytic and geometric
properties of pen strokes with strong local invariance and robustness. A multi-spatial-context fully convolutional recurrent network
(MC-FCRN) is proposed to exploit the multiple spatial contexts from the signature feature maps and generate a prediction sequence
while completely avoiding the difficult segmentation problem. Furthermore, an implicit language model is developed to make
predictions based on semantic context within a predicting feature sequence, providing a new perspective for incorporating lexicon
constraints and prior knowledge about a certain language in the recognition procedure. Experiments on two standard benchmarks,
Dataset-CASIA and Dataset-ICDAR, yielded outstanding results, with correct rates of 97.50% and 96.58%, respectively, which are
significantly better than the best result reported thus far in the literature.
Index Terms—Handwritten Chinese text recognition, path signature, residual recurrent network, multiple spatial contexts, implicit
language model
1 INTRODUCTION
In recent years, increasingly in-depth studies have led to significant developments in the field of handwritten text recognition. Various methods have been proposed by the research community, including integrated segmentation-recognition methods [1], [2], [3], [4], [5], hidden Markov models (HMMs) and their hybrid variants [6], [7], segmentation-free methods [8], [9], [10] with long short-term memory (LSTM) and multi-dimensional long short-term memory (MDLSTM), and integrated convolutional neural network (CNN)-LSTM methods [11], [12], [13], [14].
In this paper, we investigate the most recently developed
methods for online handwritten Chinese text recognition
(OHCTR), which is an interesting research topic presenting
the following challenges: a large character set, ambiguous
segmentation, and variable-length input sequences.
Segmentation is the fundamental component of hand-
written text recognition, and it has attracted the attention of
numerous researchers [1], [2], [3], [4], [5], [15], [16]. Among
the above-mentioned methods, over-segmentation [1], [2],
[3], [4], [5], i.e., an integrated segmentation-recognition
method, is the most efficient method and still plays a
crucial role in OHCTR. The basic concept underlying over-
segmentation is to slice the input string into sequential
character segments whose candidate classes can be used to
construct the segmentation-recognition lattice [2]. Based on
the lattice, path evaluation, which integrates the recognition scores, geometry information, and semantic context, is conducted to search for the optimal path and generate the recognition result. In practice, segmentation inevitably leads to mis-segmentation, which is barely rectifiable through post-processing and thus degrades the overall performance.

Z. Xie, Z. Sun, and L. Jin are with the College of Electronic and Information Engineering, South China University of Technology, Guangzhou, China. E-mail: {zcheng.xie, sunfreding, lianwen.jin}@gmail.com
H. Ni is with the Oxford-Man Institute for Quantitative Finance, University of Oxford, Oxford, UK. E-mail: hao.ni@maths.ox.ac.uk
T. Lyons is with the Mathematical Institute, University of Oxford, Oxford, UK. E-mail: tlyons@maths.ox.ac.uk
Segmentation-free methods are flexible alternative methods that completely avoid the segmentation procedure. HMMs and their hybrid variants [6], [7] have been widely
used in handwritten text recognition. In general, the input
string is converted into slices by sliding windows, followed
by feature extraction and frame-wise prediction using an
HMM. Finally, the Viterbi algorithm is applied to search
for the best character string with maximum a posteriori
probability. However, HMMs are limited not only by the
assumption that their observation depends only on the cur-
rent state but also by their generative nature that generally
leads to poor performance in labeling and classification
tasks, as compared to discriminative models. Even though
hybrid models that combine HMMs with other network
architectures, including recurrent neural networks [17] and
multilayer perceptrons [18], have been proposed to alleviate
the above-mentioned limitations by introducing context into
HMMs, they still suffer from the drawbacks of HMMs.
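The sliding-window HMM pipeline described above ends with a Viterbi search for the maximum a posteriori character string. As a generic illustration only (a toy sketch with hypothetical probabilities, not the cited systems' actual models), a log-domain Viterbi decoder for a discrete-emission HMM can be written as:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Log-domain Viterbi: maximum a posteriori state sequence for an HMM.
    log_pi: (n,) initial state log-probs; log_A: (n, n) transition log-probs;
    log_B: (n, m) emission log-probs; obs: list of observed symbol indices."""
    T = len(obs)
    n = log_pi.shape[0]
    delta = log_pi + log_B[:, obs[0]]    # best log-score of paths ending in each state
    back = np.zeros((T, n), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: best path entering j from i
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    # backtrack from the best final state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With a two-state toy model whose emissions strongly favor one state per symbol, the decoded state sequence simply tracks the observations, which is the expected MAP behavior.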
The recent development of recurrent neural networks,
especially LSTM [8], [9], [19] and MDLSTM [10], [19], has
provided a revolutionary segmentation-free perspective to
the problem of handwritten text recognition. In general,
LSTM is directly fed with a point-wise feature vector that
consists of the (x, y)-coordinate and relative features, while
it recurrently updates its hidden state and generates per-
frame predictions for each time step. Then, it applies con-
nectionist temporal classification (CTC) to perform tran-
scription. It is worth noting that LSTM and MDLSTM have

[Fig. 1 (image): pipeline overview — the input trajectory is converted to signature feature maps; an FCN with receptive fields of multiple scales, a residual recurrent network, and a transcription layer together form the MC-FCRN and produce a predicting sequence; an embedding layer and multilayered BLSTM then perform implicit language modeling to learn semantic context. The figure also shows candidate CTC alignments (with blanks) of a Chinese text line.]
Fig. 1. Overview of the proposed method. Variable-length pen-tip trajectories are first translated into offline signature feature maps that preserve the essential online information. Then, a multi-spatial-context fully convolutional recurrent network (MC-FCRN) takes the signature feature maps as input, using receptive fields of different scales in a sliding-window manner, and generates a predicting sequence. Finally, an implicit LM derives the final label sequence by exploiting the semantic context of embedding vectors transformed from the predicting sequence.
been successfully applied to handwritten text recognition in Western languages, where the character set is relatively small (e.g., English has only 52 letter classes, so the network is easy to train). However, to the best of our knowledge, very few studies have addressed large-scale handwritten text recognition problems such as OHCTR, where the text lines may involve more than 7,000 basic character classes and more than 1 million character samples.
Architectures that integrate CNN and LSTM exhibit ex-
cellent performance in terms of visual recognition and de-
scription [20], [21], scene text recognition [12], [13], [14], and
handwritten text recognition [11]. In text recognition prob-
lems, deep CNNs generate highly abstract feature sequences
from input sequential data. LSTM is fed with such feature
sequences and generates corresponding character strings.
Jointly training LSTM with CNN is straightforward and
can improve the overall performance significantly. However,
in the above-mentioned methods, the CNNs, specifically
fully convolutional networks (FCNs), process the input
string with only a fixed-size receptive field in a sliding
window manner, which we claim is inflexible for uncon-
strained written characters in OHCTR. Moreover, a deep
integrated CNN-LSTM network is usually accompanied by the degradation problem [22], which slows down the convergence procedure and hampers system optimization.
In this paper, we propose a novel solution (see Fig. 1)
that integrates path signature, a multi-spatial-context fully
convolutional recurrent network (MC-FCRN), and an im-
plicit language model (implicit LM) to address the problem
of unconstrained online handwritten text recognition. Path
signature, a recent development in the field of the rough
path theory [23], [24], [25], is a promising approach for
translating variable-length pen-tip trajectories into offline
signature feature maps in our system, because it effective-
ly preserves the online information that characterizes the
analytic and geometric properties of the path. Encouraged
by recent advances in deep CNNs and LSTMs, we propose
the MC-FCRN for robust recognition of signature feature
maps. MC-FCRN leverages the multiple spatial contexts that
correspond to multiple receptive fields in each time step to
achieve strong robustness and high accuracy. Furthermore,
we propose an implicit LM, which incorporates semantic
context within the entire predicting feature sequence from
both forward and reverse directions, to enhance the predic-
tion for each time step. The contributions of this paper can
be summarized as follows:
• We develop a novel segmentation-free MC-FCRN to effectively capture the variable spatial contextual dynamics as well as the character information for high-performance recognition. With a series of receptive fields of different scales, MC-FCRN is able to model complicated spatial context with strong robustness and high accuracy.
• The residual recurrent network, a basic component of MC-FCRN, not only accelerates the convergence process but also improves the optimization result, while adding neither extra parameters nor computational burden to the system, as compared to an ordinary stacked recurrent network.
• We propose an implicit LM that learns to model the output distribution given the entire predicting feature sequence. Unlike statistical language models, which predict the next word given only a few previous words, our implicit LM exploits the semantic context not only from the forward and reverse directions of the text but also over arbitrary text lengths.
• Path signature, a novel mathematical feature set brought from rough path theory [23], [24], [25]

as a non-linear generalization of the classical theory of controlled differential equations, is successfully applied to capture essential online information from long pen-tip trajectories. Moreover, we investigate path signature for learning the variable online knowledge of the input string with different iterated integrals.
The remainder of this paper is organized as follows.
Section 2 reviews the related studies. Section 3 formally
introduces path signature. Section 4 details the network
architecture of FCRN and its extended version, namely
MC-FCRN. Section 5 describes the proposed implicit LM
and discusses the corresponding training strategy. Section 6
presents the experimental results. Finally, Section 7 con-
cludes the paper.
2 RELATED WORK
Feature extraction [26], [27], [28], [29], [30], [31] plays a cru-
cial role in traditional online handwritten text recognition.
The 8-directional feature [26], [29] is widely used in OHCTR
owing to its excellent ability to express stroke directions.
The projection of each trajectory point in eight directions
is calculated in a 2-D manner and eight pattern images
are generated accordingly. Going further, Graves et al. [8] considered not only the (x, y)-coordinate and its relationship with its neighbors in the time series but also the spatial information from an offline perspective, thus obtaining 25 features for each point. However, the above-mentioned techniques have been developed empirically.
Inspired by the theoretical work of Lyons and his colleagues
[23], [24], [25], we applied path signature to translate the
online pen-tip trajectories into offline signature feature maps
that maintain the essential features for characterizing the
online information of the trajectories. Furthermore, we can
use truncated path signature in practical applications to
achieve a trade-off between complexity and precision.
Yang et al. [32], [33] showed that the domain-specific
information extracted by the aforementioned methods can
improve the recognition performance with deep CNNs (DCNNs). However, DCNN-based networks are unable to handle input sequences of variable length in OHCTR. On the
contrary, LSTM- and MDLSTM-based networks have an
inherent advantage in dealing with such input sequences
and demonstrate excellent performance in unconstrained
handwritten text recognition [8], [9], [34], [35]. Recently,
deep learning methods that integrate LSTM and CNN have
demonstrated outstanding capability in the field of visual
captioning [21], [36] and scene text recognition [12], [13].
However, in this paper, we show that the simple combina-
tion of CNN and LSTM cannot utilize their full potential,
which is probably due to the degradation problem [22]. On
the other hand, highway networks [37], [38] and residual connections [22], [39] were advocated to solve the degradation problem in training very deep networks. Therefore, we
take inspiration from them and present the residual recur-
rent network to realize faster and better optimization of the
system. Furthermore, our MC-FCRN also differs from these
methods in that it uses multiple receptive fields of different
scales to capture highly informative contextual features in
each time step. Such a multi-scale strategy originates from
traditional methods. The pyramid match kernel [40] maps
features to multi-dimensional multi-resolution histograms
that help to capture co-occurring features. The SIFT vectors
[41] search for stable features across all possible scales and
construct a high-dimensional vector for the key points. Fur-
ther, spatial pyramid pooling [42] allows images of varying
size or scale to be fed during training and enhances the net-
work performance significantly. GoogLeNet [43] introduced
the concept of “inception” whereby multi-scale convolution
kernels are integrated to boost performance. We have drawn
inspiration from these multi-scale methods to design our
MC-FCRN.
In general, language modeling is applied after feature
extraction and recognition in order to improve the overall
performance of the system [1], [2], [3], [4], [31], [44]. The
concept of ‘embedding’ plays a critical role in computational
linguistics. Traditionally, one character is strictly represented by one 'embedding' [45]. However, as emphasized by Vilnis et al. [46], representing an object as a single point in space carries limitations. Instead, density-based distributed embeddings can provide much more information about each word, e.g., capturing uncertainty about a representation and its relationships. Recently, Mukherjee [47] further verified that a visual-linguistic mapping in which words and visual categories are both represented by distributions can improve results at the intersection of language and vision, owing to better exploitation of the intra-concept variability in each modality. In this paper, we take inspiration from these works and use the predicting feature sequence, instead of one-hot vectors, as the input of the implicit LM to maintain intra-concept variability, which reflects recognition confidence information in our problem.
neural networks, especially LSTM, in the field of language
translation [48] and visual captioning [20], [21] has provided
us with a new perspective of language models. To the best
of our knowledge, neural networks were first applied to lan-
guage modeling by Bengio et al. [45]. Subsequently, Mikolov
et al. [49] used recurrent neural network, and Sundermeyer
et al. [50] used LSTM for language modeling. For language
translation, Sutskever et al. [48] used multilayered LSTM to
encode the input text into a vector of fixed dimensionality
and then applied another deep LSTM to decode the text
in a different language. For visual captioning, Venugopalan
et al. [21] and Pan et al. [20] extracted deep visual CNN
representations from image or video data and then used
an LSTM as a sequence decoder to generate a description
for the representations. Partially inspired by these methods,
we developed our implicit LM to incorporate semantic con-
text for recognition. However, unlike the above-mentioned
methods, which only derive context information from the
past predicted text, our implicit LM learns to make predic-
tions given the entire predicting feature sequence in both
forward and reverse directions.
3 PATH SIGNATURE
Proper translation of online data into offline feature maps
while retaining most, or hopefully all, of the online knowl-
edge within the pen-tip trajectory plays an essential role in
online handwritten recognition. To this end, we investigate
path signature, which was pioneered by Chen [51] in the
form of iterated integrals and developed by Lyons and

[Fig. 2 (image): panel (a) shows the feature-extraction pipeline — uniform-time sampling, equal-distance sampling, and the window-based signature combined via Chen's identity; panel (b) gives a worked example of the calculation using Eq. (2) and Eq. (5); panels (c) and (d) visualize signature feature maps of handwritten text.]
Fig. 2. (a) Illustration of feature extraction with path signature. (b) A simple example of the calculation of path signature features. (c) Path signature of a typical online handwritten text example. (d) Left: path signature of the original pen-tip trajectories; Right: path signature of the pen-tip trajectories with randomly added connections between adjacent strokes. Note that, except for the additional connections, the original part of the sequential data has the same path signature (same color).
his colleagues as a fundamental component of rough path
theory [23], [24], [25]. Path signature was first introduced into handwritten Chinese character recognition by Benjamin Graham [52] and later adopted by Yang et al. [32], [33], but only at the character level. We go further by applying path signature to extremely long sequential data that usually consist of hundreds of thousands of points, and we prove its effectiveness on the OHCTR problem. In the following, we first introduce path signature theoretically, and then technically for the sake of implementation and application.
Consider the pen strokes of the online handwritten text collected from a writing plane $H \subset \mathbb{R}^2$. Then, a pen stroke can be expressed as a continuous mapping denoted by $D : [a, b] \to H$ with $D = (D_t^1, D_t^2)$ and $t \in [a, b]$. For $k \geq 1$ and a collection of indexes $i_1, \cdots, i_k \in \{1, 2\}$, the $k$-th fold iterated integral of $D$ along the indexes $i_1, \cdots, i_k$ can be defined by

$$P(D)_{a,b}^{i_1, \cdots, i_k} = \int_{a < t_1 < \cdots < t_k < b} \mathrm{d}D_{t_1}^{i_1} \cdots \mathrm{d}D_{t_k}^{i_k}. \qquad (1)$$
The signature of the path is the collection of all the iterated integrals of $D$:

$$P(D)_{a,b} = \big(1,\, P(D)_{a,b}^{1},\, P(D)_{a,b}^{2},\, P(D)_{a,b}^{1,1},\, P(D)_{a,b}^{1,2},\, P(D)_{a,b}^{2,1},\, P(D)_{a,b}^{2,2},\, \cdots\big), \qquad (2)$$

where the superscripts of the terms $P(D)_{a,b}^{i_1, \cdots, i_k}$ run over the set of all multi-indexes

$$G = \{(i_1, \ldots, i_k) \mid i_1, \cdots, i_k \in \{1, 2\},\; k \geq 1\}. \qquad (3)$$
Then, the $k$-th iterated integral of the signature, $P(D)_{a,b}^{(k)}$, is the finite collection of terms $P(D)_{a,b}^{i_1, \cdots, i_k}$ with multi-indexes of length $k$. More specifically, $P(D)_{a,b}^{(k)}$ is the $2^k$-dimensional vector defined by

$$P(D)_{a,b}^{(k)} = \big(P(D)_{a,b}^{i_1, \cdots, i_k} \mid i_1, \cdots, i_k \in \{1, 2\}\big). \qquad (4)$$
In [25], it is proved that the whole signature of a path determines the path up to time re-parameterization; i.e., path signature can not only characterize the path displacement and its further derivatives, as the classical directional features do, but also provide more detailed analytic and geometric properties of the path. In practice, we have to use the truncated signature feature, which can still capture the global information of the path. Increasing the truncation degree of the signature results in exponential growth of the feature dimension but may not always lead to significant gain.
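To make the exponential growth explicit: for a 2-D path truncated at level $k$, the number of iterated-integral terms (counting the constant level-0 term) is $\sum_{i=0}^{k} 2^i = 2^{k+1} - 1$. A quick check of this count (an illustrative helper, not part of the authors' code):

```python
def truncated_sig_dim(d, k):
    """Number of iterated-integral terms up to level k for a d-dimensional
    path, including the constant level-0 term: sum_{i=0..k} d**i."""
    return sum(d ** i for i in range(k + 1))
```

For a 2-D pen trajectory, level 2 already gives 7 terms and level 4 gives 31, which is why the truncation degree is kept low in practice.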
Next, we describe the practical calculation of path signature in OHCTR from the implementation and application point of view. As illustrated in Fig. 2a, the pen-tip trajectories of the online handwritten text samples are represented by a sequence of uniform-time sampling points. First, the uniform-time sampling trajectory is translated into equal-distance sampling style. Then, to calculate the signature feature for a specific point, e.g., the red point in Fig. 2a, we estimate the window-based signature $P(D)_{t_1,t_3}$ that takes this point as the midpoint. To calculate $P(D)_{t_1,t_3}$, we first compute the point-wise signatures $P(D)_{t_1,t_2}$ and $P(D)_{t_2,t_3}$, and then combine them according to Chen's identity [51].
As adjacent sampling points of text samples are connected by a straight line $D = (D_t^1, D_t^2)$ with $t \in [a, b]$, the iterated integrals $P(D)_{a,b}^{(k)}$ can be calculated iteratively as follows:

$$P(D)_{a,b}^{(k)} = \begin{cases} 1, & k = 0, \\ \big(P(D)_{a,b}^{(k-1)} \otimes \triangle_{a,b}\big)/k, & k \geq 1, \end{cases} \qquad (5)$$

where $\triangle_{a,b} := D_b - D_a$ denotes the path displacement and $\otimes$ represents the tensor product. In Fig. 2b, we provide a
simple example to explain the calculation of path signature according to Eq. (5) and Eq. (2). Suppose the path consists of two adjacent straight-line segments, over $t \in [t_1, t_2]$ and $t \in [t_2, t_3]$, as shown in Fig. 2a. Then, following Chen's identity [51], we can calculate the path signature for the concatenation of these two segments as

$$P(D)_{t_1,t_3}^{(k)} = \sum_{i=0}^{k} P(D)_{t_1,t_2}^{(i)} \otimes P(D)_{t_2,t_3}^{(k-i)}. \qquad (6)$$
Given the pen-tip trajectories of online handwritten text, for each sequential stroke point, we first compute the path signature within a sliding window according to Eq. (2) and Eq. (6). Then, the path signature features of a certain level $k$ along all the stroke points form the corresponding feature maps. Specifically, the path signature feature vector of each stroke point spreads over $2^{k+1} - 1$ two-dimensional matrices, according to the coordinates of the stroke point in the handwritten text data. The 0th, 1st, and 2nd iterated-integral signature feature maps, i.e., the above-mentioned two-dimensional matrices, of one typical online handwritten text example are visualized in Fig. 2c.

In Fig. 2a, $P(D)_{t_1,t_3}$ is computed with window size 3. In practice, we set the window size to 9 to maintain strong local invariance and robustness. Fig. 2d shows that, although connections are randomly added between adjacent strokes within a character or between characters, their impact on the path signature of the original input string is not significant, which demonstrates that sliding-window-based path signature has excellent local invariance and robustness.
4 MULTI-SPATIAL-CONTEXT FCRN
Unlike character recognition, where it is easy to normalize
characters to a fixed size, text recognition is complicated be-
cause it involves input sequences of variable length, such as
feature maps and online pen-tip trajectories. We propose a
new fully convolutional recurrent network (FCRN) for spa-
tial context learning to overcome this problem by leveraging
a fully convolutional network, a residual recurrent network,
and connectionist temporal classification, all of which nat-
urally take inputs of arbitrary size or length. Furthermore,
we extend our FCRN to multi-spatial-context FCRN (MC-
FCRN), as shown in Fig. 1, to learn multi-spatial-context
knowledge from complicated signature feature maps. In
the following subsections, we briefly introduce the basic
components of FCRN and explain their roles in the architec-
ture. Then, we demonstrate how MC-FCRN performs multi-
spatial-context learning for the OHCTR problem.
4.1 Fully Convolutional Recurrent Network
4.1.1 Fully Convolutional Network
DCNNs exhibit excellent performance in computer vision
applications such as image classification [39], [42], scene text
recognition [12], [13], and visual description [20], [21]. Fol-
lowing the approach of Long et al. [53], we remove the orig-
inal last fully connected classification layer from DCNNs to
construct a fully convolutional network. Fully convolutional
networks not only inherit the ability of DCNNs to learn
powerful and interpretable image features but also adapt to
variable input image size and generate corresponding-size
feature maps. It is worth noting that such CNN feature maps
contain strong spatial order information from the overlap
regions (known as receptive fields) of the original feature
maps. Such spatial order information is very important
and can be leveraged to learn spatial context to enhance
the overall performance of the system. Furthermore, unlike
image cropping or sliding window-based approaches, FCNs
eliminate redundant computations by sharing convolutional
response maps layer by layer to achieve efficient inference
and backpropagation.
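The spatial order information above is tied to how the receptive field of each output column grows with depth. Under standard convolution arithmetic, this growth can be computed as follows (a generic sketch with hypothetical layer settings, not the paper's exact architecture):

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv/pool layers.
    layers: list of (kernel_size, stride) pairs, input-to-output order."""
    rf = 1     # receptive field of one output unit
    jump = 1   # distance in input pixels between adjacent units at this depth
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf
```

For example, three 3×3 convolutions interleaved with two stride-2 poolings already see an 18-pixel-wide window, illustrating how stacked layers trade resolution for context.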
4.1.2 The Residual Recurrent Network
Recurrent neural networks (RNNs), which are well known
for the self-connected hidden layer that recurrently transfers
information from output to input, have been widely adopt-
ed to learn continuous sequential features. Recently, long
short-term memory (LSTM) [54], a variant of RNN that over-
comes the gradient vanishing and exploding problem, has
demonstrated excellent performance in terms of learning
complex and long-term temporal dynamics in applications
such as language translation [55], visual description [20],
[21], and text recognition [12], [13]. Bidirectional LSTM
(BLSTM) facilitates the learning of complex context dynam-
ics in both forward and reverse directions, thereby out-
performing unidirectional networks significantly. Stacked
LSTM is also popular for sequence learning for accessing
higher-level abstract information in temporal dimensions.
Integrated CNN-LSTM systems demonstrate their out-
standing capability in visual recognition and description
[20], [21] and scene text recognition [12], [13], [14]. How-
ever, the degradation problem [39] usually accompanies
deep integrated CNN-LSTM networks and slows down the
convergence process. Driven by the significance of deep
residual learning [22], [39] for optimization of very deep
networks, we present the residual recurrent network to accelerate the convergence of our FCRN and obtain a better optimization result. Theoretically, we explicitly reformulate
the LSTM\BLSTM layer (denoted by $h$ with parameters $\omega_h$) as learning the spatial contextual information with reference to the input. Denoting the $l$-th LSTM\BLSTM layer output as $q_l(x)$, we have

$$q_l(x) = h(q_{l-1}(x)) + q_{l-1}(x). \qquad (7)$$

Iteratively applying $q_l(x) = h(q_{l-1}(x)) + q_{l-1}(x) = h(q_{l-1}(x)) + h(q_{l-2}(x)) + q_{l-2}(x)$ up to $q_L(x)$, we get

$$q_L(x) = q_0(x) + \sum_{l=0}^{L-1} h(q_l(x)), \qquad (8)$$

where $L$ is the total number of layers of the residual multilayered LSTM\BLSTM. The residual recurrent network has the
following advantages in jointly learning with deep CNN.
First, gradient information can easily pass through the
complex residual recurrent network through the identity
mapping q
L
(x) = q
0
(x) according to Eq. (8), as the term
P
L1
l=1
h(q
l
(x)) for the residual spatial learning is very small
and has not yet functioned in the early training stage. There-
fore, the system gains rapid growth in the nascent period (as
illustrated in Fig. 6 ). Furthermore, by gradually occupying a
greater proportion in Eq. (8), the term
P
L1
l=1
h(q
l
(x)) plays
an increasingly important role in spatial context learning.
As a result, our residual recurrent network captures the
contextual information from a sequence through the term
P
L1
l=1
h(q
l
(x)) in an elegant manner, making the text recog-
nition process more efficient and reliable than processing
each character independently. Finally, the residual recurrent
network significantly promotes system performance while
not adding extra parameter or computational burden to the
system.
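The recursion in Eq. (7) and its telescoped form in Eq. (8) can be checked numerically. In the sketch below, a hypothetical tanh-linear map stands in for the LSTM\BLSTM layer $h$; the algebra only requires that $h$ be a fixed function, so any stand-in illustrates the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the LSTM\BLSTM layer h; any fixed map works here.
W = 0.1 * rng.standard_normal((8, 8))
def h(q):
    return np.tanh(q @ W)

L = 4                         # number of residual recurrent layers
q0 = rng.standard_normal(8)   # input feature vector q_0(x)

# Eq. (7): q_l = h(q_{l-1}) + q_{l-1}, applied layer by layer.
outputs = [q0]
for _ in range(L):
    outputs.append(h(outputs[-1]) + outputs[-1])
qL = outputs[-1]

# Eq. (8): the same result, as the identity path plus accumulated residuals.
qL_telescoped = q0 + sum(h(outputs[l]) for l in range(L))

assert np.allclose(qL, qL_telescoped)
```

The early-training argument follows directly: when the residual terms $h(q_l(x))$ are near zero, $q_L(x) \approx q_0(x)$ and gradients flow through the identity path.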
4.1.3 Transcription
Connectionist temporal classification (CTC), which facili-
tates the use of FCN and LSTM for sequential training
without requiring any prior alignment between input im-
ages and their corresponding label sequences, is adopted as
the transcription layer in our framework. Let $C$ represent
all the characters used in this problem and let "blank"
represent the null emission. Then, the character set can
be denoted as $C' = C \cup \{\text{blank}\}$. Given an input sequence
$u = (u_1, u_2, \cdots, u_T)$ of length $T$, where $u_t \in \mathbb{R}^{|C'|}$, we can
obtain an exponentially large number of label sequences of
length $T$, referred to as alignments $\pi$, by assigning a label to
each time step.
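Alignments relate to label sequences through CTC's many-to-one collapse mapping (usually written $\mathcal{B}$), which merges repeated labels and then removes blanks; this is standard CTC behavior rather than anything specific to this paper. A minimal sketch, using "-" as a stand-in for the blank symbol:

```python
# Sketch of CTC's many-to-one mapping B: merge repeated labels, then drop blanks.
# The "-" symbol is a hypothetical stand-in for the blank label.
BLANK = "-"

def collapse(pi):
    """Map an alignment (string over C' = C + {blank}) to its label sequence."""
    out = []
    prev = None
    for c in pi:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return "".join(out)

# Many distinct length-T alignments map to the same label sequence:
assert collapse("aa-b") == "ab"
assert collapse("-ab-") == "ab"
# A blank between repeats preserves a doubled character:
assert collapse("ab-b") == "abb"
```

This many-to-one structure is why the number of alignments grows exponentially in $T$ while the set of label sequences stays comparatively small.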
References
- Deep Residual Learning for Image Recognition: proposes a residual learning framework that eases the training of networks substantially deeper than those used previously; winner of the ILSVRC 2015 classification task.
- Long Short-Term Memory: introduces an efficient, gradient-based method (LSTM) that learns to bridge time lags in excess of 1,000 discrete time steps by enforcing constant error flow through constant error carousels within special units.
- Distinctive Image Features from Scale-Invariant Keypoints: presents a method for extracting distinctive invariant features from images that enables reliable matching between different views of an object or scene, robust to clutter and occlusion at near real-time performance.
- Going Deeper with Convolutions: presents the Inception architecture, which set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift: applied to a state-of-the-art image classification model, batch normalization achieves the same accuracy with 14 times fewer training steps and beats the original model by a significant margin.
Frequently Asked Questions (10)
Q1. What is the network architecture of the implicit LM?

The network architecture of their implicit LM consists of three components: the embedding layer, the language modeling layer, and the transcription layer. 

The residual recurrent network of MC-FCRN not only substantially accelerates the convergence procedure but also promotes the performance significantly, while adding neither extra parameter nor computational burden to the system. 

This is because the recurrent network had not yet functioned during that period; thus the FCN can be optimized directly with the CTC loss function through the residual connections. 

Among the above-mentioned methods, over-segmentation [1], [2], [3], [4], [5], i.e., an integrated segmentation-recognition method, is the most efficient method and still plays a crucial role in OHCTR. 

On the other hand, highway networks [37], [38] and residual connections [22], [39] were proposed to solve the degradation problem [22] in training very deep networks. 

Fig. 2d shows that, although connections are randomly added between adjacent strokes within a character or between characters, their impact on the path signature of the original input string is not significant, demonstrating that the sliding-window-based path signature has excellent local invariance and robustness. 

To evaluate the effectiveness of the proposed system, the authors conducted experiments on the standard benchmark dataset CASIA-OLHWDB [57] and the ICDAR2013 Chinese handwriting recognition competition dataset [58] for unconstrained online handwritten Chinese text recognition. 

As adjacent sampling points of text samples are connected by a straight line $D = (D^1_t, D^2_t)$ with $t \in [a, b]$, the iterated integrals $P(D)^{(k)}_{a,b}$ can be calculated iteratively as follows:

$$P(D)^{(k)}_{a,b} = \begin{cases} 1, & k = 0, \\ \left( P(D)^{(k-1)}_{a,b} \otimes \triangle_{a,b} \right) / k, & k \geq 1, \end{cases} \qquad (5)$$

where $\triangle_{a,b} := D_b - D_a$ denotes the path displacement and $\otimes$ represents the tensor product.
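Because each segment is a straight line, the recursion in Eq. (5) yields the closed form $P(D)^{(k)}_{a,b} = \triangle_{a,b}^{\otimes k} / k!$. A small numerical sketch (the displacement value is hypothetical):

```python
import numpy as np

# Hypothetical pen displacement delta = D_b - D_a for one straight segment.
delta = np.array([3.0, 4.0])

def signature_levels(delta, K):
    """Iterated integrals of a straight-line segment, per the recursion in Eq. (5)."""
    levels = [np.array(1.0)]  # k = 0: the signature starts at 1
    for k in range(1, K + 1):
        # tensor product with the displacement, divided by k
        levels.append(np.tensordot(levels[-1], delta, axes=0) / k)
    return levels

levels = signature_levels(delta, 3)
assert np.allclose(levels[1], delta)                       # level 1: displacement
assert np.allclose(levels[2], np.outer(delta, delta) / 2)  # level 2: delta (x) delta / 2!
```

Level $k$ is a rank-$k$ tensor, so truncating at a small $K$ keeps the signature feature maps compact.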

As a result, their residual recurrent network captures the contextual information from a sequence through the term $\sum_{l=0}^{L-1} h(q_l(x))$ in an elegant manner, making the text recognition process more efficient and reliable than processing each character independently. 

The authors propose a new fully convolutional recurrent network (FCRN) for spatial context learning to overcome this problem by leveraging a fully convolutional network, a residual recurrent network, and connectionist temporal classification, all of which naturally take inputs of arbitrary size or length.