
A survey of methods and strategies in character segmentation


A SURVEY OF METHODS AND STRATEGIES IN
CHARACTER SEGMENTATION
Richard G. Casey and Eric Lecolinet
ENST Paris and IBM Almaden Research Center
ABSTRACT
Character segmentation has long been a critical area of the OCR process. The higher
recognition rates for isolated characters vs. those obtained for words and connected
character strings well illustrate this fact. A good part of recent progress in reading
unconstrained printed and written text may be ascribed to more insightful handling of
segmentation.
This paper provides a review of these advances. The aim is to provide an appreciation
for the range of techniques that have been developed, rather than to simply list
sources. Segmentation methods are listed under four main headings. What may be
termed the "classical" approach consists of methods that partition the input image into
subimages, which are then classified. The operation of attempting to decompose the
image into classifiable units is called "dissection". The second class of methods avoids
dissection, and segments the image either explicitly, by classification of prespecified
windows, or implicitly by classification of subsets of spatial features collected from the
image as a whole. The third strategy is a hybrid of the first two, employing dissection
together with recombination rules to define potential segments, but using classification
to select from the range of admissible segmentation possibilities offered by these
subimages. Finally, holistic approaches that avoid segmentation by recognizing entire
character strings as units are described.

KEYWORDS
Optical character recognition, character segmentation, survey, holistic recognition, Hidden
Markov Models, graphemes, contextual methods, recognition-based segmentation

1. Introduction
1.1. The role of segmentation in recognition processing
Character segmentation is an operation that seeks to decompose an image of a sequence of characters
into subimages of individual symbols. It is one of the decision processes in a system for optical
character recognition (OCR). Its decision, that a pattern isolated from the image is that of a character (or
some other identifiable unit), can be right or wrong. It is wrong sufficiently often to make a major
contribution to the error rate of the system.
In what may be called the "classical" approach to OCR, Fig. 1, segmentation is the initial step in a
three-step procedure:
Given a starting point in a document image:
1. Find the next character image.
2. Extract distinguishing attributes of the character image.
3. Find the member of a given symbol set whose attributes best match those of the input, and output
its identity.
This sequence is repeated until no additional character images are found.
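The three-step loop can be sketched in code. The following toy is illustrative only: the one-dimensional "image" (a string in which glyphs are runs of non-blank symbols separated by spaces), the template alphabet, and the helper names are all invented for the sketch, standing in for a real segmenter, feature extractor, and classifier.

```python
# Hypothetical glyph templates: keys are toy "bitmaps", values are symbol labels.
TEMPLATES = {"|-|": "H", "|": "I", "()": "O"}

def next_glyph(image, pos):
    """Step 1: isolate the next character subimage (a run of non-blanks)."""
    while pos < len(image) and image[pos] == " ":
        pos += 1
    if pos == len(image):
        return None, pos
    end = pos
    while end < len(image) and image[end] != " ":
        end += 1
    return image[pos:end], end

def classify(glyph):
    """Steps 2 and 3: here the 'features' are the glyph itself; match by lookup."""
    return TEMPLATES.get(glyph, "?")

def read_line(image):
    """Repeat steps 1-3 until no further character images are found."""
    text, pos = "", 0
    while True:
        glyph, pos = next_glyph(image, pos)
        if glyph is None:
            return text
        text += classify(glyph)

print(read_line("|-| | ()"))   # prints "HIO"
print(read_line("|-||"))       # touching glyphs defeat step 1: prints "?"
```

The second call shows the weakness of the classical sequence: when two glyphs touch, step 1 hands the classifier a pattern that matches nothing in its alphabet.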
An implementation of step 1, the segmentation step, requires answering a simply-posed question:
"What constitutes a character?" The many researchers and developers who have tried to provide an
algorithmic answer to this question find themselves in a Catch-22 situation. A character is a pattern that
resembles one of the symbols the system is designed to recognize. But to determine such a resemblance
the pattern must be segmented from the document image. Each stage depends on the other, and in
complex cases it is paradoxical to seek a pattern that will match a member of the system’s recognition
alphabet of symbols without incorporating detailed knowledge of the structure of those symbols into the
process.
Furthermore, the segmentation decision is not a local decision, independent of previous and
subsequent decisions. Producing a good match to a library symbol is necessary, but not sufficient, for
reliable recognition. That is, a poor match on a later pattern can cast doubt on the correctness of the
current segmentation/recognition result. Even a series of satisfactory pattern matches can be judged
incorrect if contextual requirements on the system output are not satisfied. For example, the letter
sequence "cl" can often closely resemble a "d", but usually such a choice will not constitute a
contextually valid result.
Thus the segmentation decision is interdependent with local decisions regarding shape
similarity, and with global decisions regarding contextual acceptability. This observation summarizes
the refinement of character segmentation processes over the past 40 years or so. Initially, designers
sought to perform segmentation as per the "classical" sequence listed above. As faster, more powerful
electronic circuitry has encouraged the application of OCR to more complex documents, designers have
realized that step 1 cannot be divorced from the other facets of the recognition process.
In fact, researchers have been aware of the limitations of the classical approach for many years.
Workers in the 1960s and 1970s observed that segmentation caused more errors than shape distortions
in reading unconstrained characters, whether hand- or machine-printed. The problem was often masked
in experimental work by the use of databases of well-segmented patterns, or by scanning character
strings printed with extra spacing. In commercial applications, stringent requirements for document
preparation were imposed. By the early 1980s, workers were encouraging renewed research
interest [73] to permit extension of OCR to less constrained documents.
The problems of segmentation persist today. The well-known tests of commercial printed-text OCR
systems by the University of Nevada, Las Vegas [64][65] consistently ascribe a high proportion of errors
to segmentation. Even when perfect patterns, the bitmapped characters that are input to digital printers,
were recognized, commercial systems averaged 0.5% spacing errors. This is essentially a segmentation
error by a process that attempts to isolate a word subimage. The article [6] emphatically illustrates the
woes of current machine-print recognition systems as segmentation difficulties increase (see Fig. 2). The
degradation in performance in NIST tests of handwriting recognition on segmented [86] and
unsegmented [88] images underscores the continuing need for refinement and fresh approaches in this
area. On the positive side of the ledger, the study [29] shows the dramatic improvements that can be
obtained when a thoroughgoing segmentation scheme replaces one of prosaic design.
Some authors have previously surveyed segmentation, often as part of a more comprehensive work,
e.g., cursive recognition [36] [19] [20] [55] [58] [81], or document analysis [23] [29]. In the present
paper we present a survey whose focus is character segmentation, and which attempts to provide broad
coverage of the topic.
1.2 Organization of methods
A major problem in discussing segmentation is how to classify methods. Tappert et al. [81], for
example, speak of "external" vs. "internal" segmentation, depending on whether recognition is required
in the process. Dunn and Wang [20] use "straight segmentation" and "segmentation-recognition" for a
similar dichotomy.
A somewhat different point of view is proposed in this paper. The division according to use or
non-use of recognition in the process fails to make clear the fundamental distinctions among present-day
approaches. For example, it is not uncommon in text recognition to use a spelling corrector as a
post-processor. This stage may propose the substitution of two letters for a single letter output by the
classifier. This is in effect a use of recognition to resegment the subimage involved. However, the
process represents only a trivial advance on traditional methods that segment independently of
recognition.

In this paper the distinction between methods is based on how segmentation and classification
interact in the overall process. In the example just cited, segmentation is done in two stages, one before
and one after image classification. Basically, an unacceptable recognition result is re-examined and
modified by an (implied) resegmentation. This is a rather "loose" coupling of segmentation and
classification.
A more profound interaction between the two aspects of recognition occurs when a classifier is
invoked to select the segments from a set of possibilities. In this family of approaches segmentation and
classification are integrated. To some observers it even appears that the classifier performs segmentation
since, conceptually at least, it could select the desired segments by exhaustive evaluation of all possible
sets of subimages of the input image.
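The integrated strategy can be made concrete with a small sketch. Here the "image" is reduced to a string of pre-extracted primitives, and the classifier to a template table with made-up confidences; a dynamic program then evaluates every way of cutting the string and keeps the segmentation whose pieces classify with the highest joint confidence. The table is contrived to mirror the earlier "cl" vs. "d" ambiguity.

```python
def best_segmentation(primitives, templates):
    """Choose cut points by classification confidence (toy sketch).

    `templates` maps a primitive substring to (label, confidence).
    best[i] holds the highest-scoring (score, transcription) for the
    prefix primitives[:i]; a segmentation's joint score is the product
    of its per-segment confidences.
    """
    n = len(primitives)
    best = [None] * (n + 1)
    best[0] = (1.0, "")
    for i in range(1, n + 1):
        for j in range(i):
            piece = primitives[j:i]
            if best[j] is None or piece not in templates:
                continue
            label, conf = templates[piece]
            score = best[j][0] * conf
            if best[i] is None or score > best[i][0]:
                best[i] = (score, best[j][1] + label)
    return best[n]

# Made-up confidences: the joined pattern "cl" also matches "d", but less well.
TEMPLATES = {"c": ("c", 0.9), "l": ("l", 0.9), "cl": ("d", 0.7)}
print(best_segmentation("cl", TEMPLATES))  # picks "cl" (score 0.9*0.9) over "d" (0.7)
```

The exhaustive evaluation the text alludes to is exactly this inner double loop; real systems prune it with dissection-derived candidate cuts and add contextual scores on top of the shape confidences.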
After reviewing available literature, we have concluded that there are three "pure" strategies for
segmentation, plus numerous hybrid approaches that are weighted combinations of these three. The
elemental strategies are:
1. the classical approach, in which segments are identified based on "character-like" properties. This
process of cutting up the image into meaningful components is given a special name, "dissection",
in discussions below.
2. recognition-based segmentation, in which the system searches the image for components that match
classes in its alphabet.
3. holistic methods, in which the system seeks to recognize words as a whole, thus avoiding the need
to segment into characters.
In strategy (1) the criterion for good segmentation is the agreement of general properties of the
segments obtained with those expected for valid characters. Examples of such properties are height,
width, separation from neighboring components, disposition along a baseline, etc. In method (2) the
criterion is recognition confidence, perhaps including syntactic or semantic correctness of the overall
result. Holistic methods (3) in essence revert to the classical approach, with words as the alphabet to be
read. The reader interested in an early illustration of these basic techniques may glance ahead to Fig. 6
for examples of dissection processes, Fig. 13 for a recognition-based strategy, and Fig. 16 for a holistic
approach.
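As a concrete instance of strategy (1), the oldest dissection technique cuts wherever the vertical projection of the line image contains no ink. A minimal sketch, assuming a toy binary bitmap encoded as strings of '#' (ink) and '.' (background):

```python
def dissect_by_projection(bitmap):
    """Cut a text-line bitmap wherever the vertical projection is zero.

    `bitmap` is a list of equal-length strings: '#' marks ink, '.'
    marks background. Returns (start, end) column spans, one per run
    of columns containing ink -- the candidate character subimages.
    """
    width = len(bitmap[0])
    has_ink = [any(row[c] == "#" for row in bitmap) for c in range(width)]
    spans, start = [], None
    for c, ink in enumerate(has_ink):
        if ink and start is None:
            start = c                      # a run of inked columns begins
        elif not ink and start is not None:
            spans.append((start, c))       # run ends at the first white column
            start = None
    if start is not None:
        spans.append((start, width))       # run touching the right edge
    return spans

line = ["#.#..##",
        "###..#."]
print(dissect_by_projection(line))  # [(0, 3), (5, 7)]
```

A practical dissector would then test each span against the "character-like" properties listed above (width, height, baseline position) and trigger splitting or merging when a span is implausibly wide or narrow.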
Although examples of these basic strategies are offered below, much of the literature reviewed for
this survey reports a blend of methods, using combinations of dissection, recognition searching, and
word characteristics. Thus, although the paper necessarily has a discrete organization, the situation is
perhaps better conceived as in Fig. 3. Here the three fundamental strategies occupy orthogonal axes:
hybrid methods can be represented as weighted combinations of these lying at points in the intervening
space. There is a continuous space of segmentation strategies rather than a discrete set of classes with
well-defined boundaries. Of course, such a space exists only conceptually; it is not meaningful to assign
precise weights to the elements of a particular combination.
