Proceedings ArticleDOI

Automatic localization of page segmentation errors

TL;DR: This work focuses on localizing line level segmentation errors without directly using the ground truth and performs experiments on more than 18000 scanned pages of 109 books belonging to four prominent south Indian languages.
Abstract: Page segmentation is a basic step in any character recognition system. Its failure is one of the major causes of deteriorating overall accuracy of current Indian language OCR engines. Many segmentation algorithms have been proposed in the literature. Often these algorithms fail to adapt dynamically to a given page and thus tend to yield poor segmentation for some specific regions or some specific pages. Given the ground truth, locating page segmentation errors is a straightforward problem and is merely useful for comparing segmentation algorithms. In this work, we locate segmentation errors without directly using the ground truth. Such automatic localization of page segmentation errors can be considered a major step towards improving page segmentation. In this work, we focus on localizing line level segmentation errors. We perform experiments on more than 18000 scanned pages of 109 books belonging to four prominent south Indian languages.

Summary (2 min read)

1. INTRODUCTION

  • The success of the page segmentation algorithm critically affects the performance of OCR.
  • Most of these segmentation algorithms perform satisfactorily well but tend to fail in some specific region or for some specific pages.
  • The primary objective of this work is to automatically locate segmentation errors with very high accuracy.
  • The objective of this work is to locate these errors without the help of ground truth.

2. PAGE SEGMENTATION ERRORS

  • There are a large number of document segmentation algorithms available in the literature.
  • Most of these segmentation algorithms suffer from some or other page segmentation errors.
  • Let S and G be the set of lines denoting segmentation output and ground truth respectively.
  • The authors then locate the errors by classifying each line as either correct, over-segmented, under-segmented, false alarm or missing component.

3. THE PROBLEM OF LOCATING PAGE SEGMENTATION ERRORS

  • More often the existing page segmentation algorithms tend to fail for some specific pages or some specific regions of the page.
  • Once segmentation errors are localized, one can use human intervention or alternate algorithm with tuned parameters for error correction.
  • In the learning phase, line level features are computed for each line of the training document images.
  • The authors achieve this by using a set of simple features in stage-1, where they classify pages as correct or incorrect, and in stage-2 they compute computationally more expensive line level features only for the pages which are classified as incorrect by stage-1.
  • To evaluate the performance of their system the authors first locate all the errors using ground truth as in [14].

3.1 Features

  • The authors observe that (1) most of the characters in a page are of the same size, font and style, (2) line spacing within a document is mostly the same, (3) a page is formatted uniformly within a book, (4) two nearby lines in a document are mostly of the same height.
  • The features the authors use for classifying a segmented page as correct or incorrect, i.e., stage-1 classification, are as follows: f1: maximum line height.
  • f5: maximum of the difference in line heights and line gap.
  • To identify such cases the authors compute the maximum word gap in a line and use this as feature F4. F5: maximum area of connected component in a line.

4.1 About Dataset

  • The authors use a dataset [7] of 109 books in four prominent south Indian languages for all their experiments.
  • Table 4 gives the details of the dataset.
  • This dataset contains pages scanned in 600 dpi.
  • Segmentation of Indian language document pages is a challenging task, mainly due to (1) the presence of dangling modifiers and (2) the relative positions of neighbouring characters not being fixed.
  • In phase-2, the authors learn the ground truth based error localization for the training images.

4.2 Error localization using ground truth

  • The authors first run the segmentation algorithm on all the pages.
  • Further, if all the lines in a page are correctly segmented the authors tag that page as correct.
  • Table 2 summarizes segmentation errors at line level caused by the segmentation algorithm, which is actually a very large quantity considering the huge size of the dataset.
  • The authors' aim is to localize these errors automatically, i.e., without using ground truth.

4.3 Automatic error localization

  • For a given segmented page, the authors say whether the page is correctly segmented or not.
  • Thus the authors define a set of performance measures using confusion matrix.
  • The authors see that they are able to classify correct pages as correct and incorrect pages as incorrect with more than 89% accuracy.
  • At line level error localization, the authors classify each segmented line as correct or as one of the segmentation errors.
  • To measure the performance of line level error localization, the authors define a performance metric using a confusion matrix.

5. CONCLUSIONS

  • The authors address the problem of localizing page segmentation errors.
  • The proposed scheme is able to locate segmentation errors without ground truth with high accuracy.
  • Such error localization is very important for segmentation error correction which can be done either by manual intervention or running alternate segmentation algorithms in the error localized part.
  • Further, the proposed error localization scheme is independent of segmentation algorithms.
  • A future direction of this work is to design a segmentation post-processor to automatically correct page segmentation errors.




Automatic Localization of Page Segmentation Errors
Dheeraj Mundhra
IIT Kharagpur
Kharagpur, India
09MA2009@iitkgp.ac.in
Anand Mishra
IIIT Hyderabad
Hyderabad, India
anand.mishra@research.iiit.ac.in
C. V. Jawahar
IIIT Hyderabad
Hyderabad, India
jawahar@iiit.ac.in
ABSTRACT
Page segmentation is a basic step in any character recognition system. Its failure is one of the major causes of deteriorating overall accuracy of current Indian language OCR engines. Many segmentation algorithms have been proposed in the literature. Often these algorithms fail to adapt dynamically to a given page and thus tend to yield poor segmentation for some specific regions or some specific pages. Given the ground truth, locating page segmentation errors is a straightforward problem and is merely useful for comparing segmentation algorithms. In this work, we locate segmentation errors without directly using the ground truth. Such automatic localization of page segmentation errors can be considered a major step towards improving page segmentation. In this work, we focus on localizing line level segmentation errors. We perform experiments on more than 18000 scanned pages of 109 books belonging to four prominent south Indian languages.
General Terms
Experimentation.
Keywords
OCR, document image segmentation, under-segmentation,
over-segmentation, false alarm.
1. INTRODUCTION
The success of the page segmentation algorithm critically affects the performance of OCR. Page segmentation algorithms are one of the widely studied topics in the document image analysis literature (see [6], [14]). Most of these segmentation algorithms perform satisfactorily well but tend to fail in some specific regions or for some specific pages. The main reason for such failures is that these algorithms are heavily dependent on parameters and thus fail to adapt to a given page dynamically. (This work was carried out when Dheeraj Mundhra was visiting IIIT Hyderabad.)
Document image segmentation is a widely studied topic in the literature. The consistent appearance of page segmentation work (see [2], [3], [4]) in competitions at ICDAR shows the interest of the community in this area. Kise et al. [8] proposed a powerful Voronoi diagram based segmentation algorithm a decade ago. Nevertheless, the complexity and variations in document images make the task of segmenting a given document page into lines still challenging. There is also interest in designing hybrid segmentation algorithms [1] or designing segmentation algorithms by learning page features [10].
A set theoretic approach to analysing segmentation algorithms was presented in [11]. In [14] the performance of the six most popular page segmentation algorithms was analysed. Sesh Kumar et al. [9] did a similar analysis of segmentation algorithms for Indian languages. All these works compared document image segmentation algorithms assuming the availability of the ground truth. Although such works allow fair comparison of segmentation algorithms, they have several disadvantages: (1) they do not directly lead towards improvement in the segmentation algorithm, (2) they do not specify why the errors arise, and (3) they are not applicable for large scale evaluation of OCRs [15] due to unavailability of the ground truth.
Recently, researchers in the computer vision community have shown interest in unsupervised evaluation of image segmentation algorithms [17]. Here, for a given image, availability of the ground truth is not assumed. Rather, a set of features is computed from the segmented image, and based on these features the performance of image segmentation algorithms is measured. A survey of unsupervised methods for evaluating segmentation algorithms is given in [17]. We are highly inspired by such methods. However, we do not do any evaluation of page segmentation algorithms in this work. Rather, we go one step further and try to find out where and what type of page segmentation errors are present at line level for a given page. Such segmentation error localization can be considered a major step towards improvement in segmentation output. Once segmentation errors are localized automatically, one can either use human intervention or alternate segmentation algorithms for error correction. However, improving the segmentation accuracy is beyond the scope of this work. The primary objective of this work is to automatically locate segmentation errors with very high accuracy.

Similar to [14], we focus on line level document image segmentation, where a segmentation algorithm partitions the text block into lines. Often these segmentation algorithms fail to segment lines correctly. In other words, each segmented line is either correctly segmented, over-segmented, under-segmented, a false alarm, or missing a dangling modifier. The objective of this work is to locate these errors without the help of ground truth. (We use ground truth only for evaluation of the proposed error localization scheme.) We formulate the problem of locating page segmentation errors as a multi-class classification problem where each segmented line is classified into one of five classes, i.e., correct, under-segmented, over-segmented, false alarm or missing dangling modifier. For this we compute a set of features for a few segmented lines assuming the availability of the ground truth for them. We then compute the same set of features for the rest of the segmented lines to locate the segmentation errors in a classification framework.
We have shown segmentation error localization performance on a specific segmentation algorithm. However, the proposed method is independent of the segmentation algorithm. We used 109 books in four prominent south Indian languages for our experiments, where a randomly selected half of the pages was used for training whereas the rest was used for testing. To evaluate the performance of the proposed method, we used ground truth based comparison. The proposed scheme localizes the segmentation errors with more than 78% accuracy, which means that we are able to localize more than seventy-eight percent of the total page segmentation errors automatically.
The remainder of the paper is organised as follows. In Section 2, various types of page segmentation errors are described. In Section 3, we formulate the problem of locating page segmentation errors as a classification problem. Here we describe the features used to learn various types of page segmentation errors. Section 4 describes experiments and results. We finally conclude our work in Section 5.
2. PAGE SEGMENTATION ERRORS
There are a large number of document segmentation algorithms available in the literature. The popular ones are Recursive XY cut [12], White-space analysis [5], Docstrum [13], Voronoi diagram based [8], and RLSA [16]. A description of these algorithms is not in the scope of this paper; however, readers are encouraged to see [14] for a description of these segmentation algorithms. Most of these segmentation algorithms suffer from some or other page segmentation errors. These errors can be defined in a set theoretic notion as in [14]. We summarize these definitions here.
Let S and G be the sets of lines denoting the segmentation output and the ground truth respectively. Then we can define segmentation errors as follows:
Correct.
A line $B \in S$ is said to be correct if there exists a unique line $A \in G$ such that $A \cap B$ is significant.

Over-segmented.
A line $B \in S$ is said to be over-segmented if there exists at least one more line $B' \in S$ and a line $A \in G$ such that both $A \cap B$ and $A \cap B'$ are significant.

Figure 1: Ground truth and segmented page.

Under-segmented.
A line $B \in S$ is said to be under-segmented if there exist multiple lines $A$ in $G$ such that $A \cap B$ is significant.

Missing component.
A line $B \in S$ is said to be a missing component if there exists a unique line $A \in G$ such that $A \cap B$ is not significant. In other words, by calling line $A$ a missing component we mean that line $A$ has missed some dangling modifier either above or below the line. This error is very common in Indian language document image segmentation. (Note that this error is not defined in [14].)

False alarm.
A line $B \in S$ is said to be a false alarm if there does not exist any line $A \in G$ such that $A \cap B \neq \phi$.

Missed line.
A line $A \in G$ is said to be a missed line if there does not exist any line $B \in S$ such that $A \cap B \neq \phi$.
We demonstrate typical examples of page segmentation errors in Figure 1. Here lines $B_1$ and $B_2$ are correctly segmented. $B_3$ and $B_4$ are over-segmented. Line $B_5$ is under-segmented. Line $B_6$ is a false alarm whereas $A_6$ is a missed line. Line $B_7$ is a missing component as it is missing a small dangling modifier which is just below the line. In this work, for a given segmented page we compute certain features (which we describe in the next section) and locate the first five segmentation errors in a given page in a multi-class classification framework. For learning the features we assume the availability of the ground truth for the training images, whereas for a test image we compute the same set of features for every line. We then locate the errors by classifying each line as either correct, over-segmented, under-segmented, false alarm or missing component.
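To make these definitions concrete, the following is a minimal sketch, in Python, of how a segmented line could be assigned one of the above categories from its overlap with ground-truth lines. It is not the authors' implementation: the representation of a line as a set of foreground pixel coordinates, the significance threshold SIG, and the simplified handling of the missing-component case are all our assumptions.

    # Sketch: classify one segmented line B against ground truth G (Section 2).
    SIG = 0.6  # assumed fraction of the smaller line that must be covered

    def significant(a, b):
        """Overlap of two pixel sets is 'significant' if it covers most of the smaller one."""
        return len(a & b) >= SIG * min(len(a), len(b))

    def classify_line(B, S, G):
        """B: pixel set of one segmented line; S: all segmented lines; G: ground-truth lines."""
        if all(not (A & B) for A in G):
            return "false_alarm"              # B overlaps no ground-truth line at all
        overlaps = [A for A in G if significant(A, B)]
        if len(overlaps) > 1:
            return "under_segmented"          # B spans several ground-truth lines
        if len(overlaps) == 1:
            A = overlaps[0]
            if any(other is not B and significant(A, other) for other in S):
                return "over_segmented"       # A is split across B and another segmented line
            return "correct"
        # B overlaps some line, but not significantly; treated here as a missing
        # component (e.g. a dropped dangling modifier) -- a simplification.
        return "missing_component"

The missed-line case concerns ground-truth lines rather than segmented lines, so it is not returned by this function.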
3. THE PROBLEM OF LOCATING PAGE
SEGMENTATION ERRORS
More often than not, the existing page segmentation algorithms tend to fail for some specific pages or some specific regions of the page. Figure 2 shows some typical examples of failure at line level segmentation. Given a segmented page, our goal is to locate page segmentation errors. Once segmentation errors are localized, one can use human intervention or an alternate algorithm with tuned parameters for error correction. However, improving the segmentation accuracy is beyond the scope of this paper. We rather work on segmentation error localization. We do it in two stages. Figure 3 demonstrates the process of error localization. In stage-1 we compute some page level features and classify each page as correct or erroneous (over, under, false alarm, or missing component). In erroneous pages we compute line level features and classify each line either as correct, over-segmented, under-segmented, false alarm or missing component. Note that we learn both line level and page level features in a supervised learning framework, i.e. we assume availability of ground truth for the training images so that we can learn line/page level features and the corresponding errors. However, for the test images we locate the errors using nearest neighbour based classification.

Figure 2: Typical segmentation errors: left and right columns show part of a sample page and the corresponding segmented output respectively. (a) Two lines are merged into one line (under-segmentation) (b) One line is split into two lines (over-segmentation) (c) A dangling modifier shown in a small red circle is missed (missing component).

Figure 3: Process of locating segmentation errors.

The training and testing phases of locating page segmentation errors can be summarized as follows:
Page level.
Learning phase:
1. Compute page level features for each page of the training document images.
2. Assign a correct or incorrect label to each page using ground truth. (Note that by a correct label for a page we mean that all the lines in that page are correctly segmented.)
Testing phase:
1. Compute page level features for each page of the test document images.
2. Classify each page as correct or incorrect using k-nearest neighbour based classification.

Line level.
Learning phase:
1. Compute line level features for each line of the training document images.
2. Assign correct or error labels to each line using ground truth.
Testing phase:
1. Compute line level features for each line of a test document image.
2. Classify each line as correct, over-segmented, under-segmented, false alarm or missing component using k-nearest neighbour based classification.
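As a rough illustration of the page level and line level phases summarized above, the sketch below wires the two k-nearest-neighbour classifiers together; scikit-learn, the label names, and the toy feature vectors are our choices and not part of the paper.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def fit_knn(features, labels, k=3):
        """Fit a k-NN classifier on page level or line level feature vectors."""
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(np.asarray(features, dtype=float), labels)
        return clf

    def localize(page_clf, line_clf, page_feat, line_feats):
        """Stage-1 on the page; run stage-2 on its lines only if the page looks incorrect."""
        if page_clf.predict([page_feat])[0] == "correct":
            return ["correct"] * len(line_feats)     # cheap path: skip the line level stage
        return list(line_clf.predict(line_feats))

    # Toy usage with made-up two-dimensional feature vectors:
    page_clf = fit_knn([[40, 5], [90, 30]], ["correct", "incorrect"], k=1)
    line_clf = fit_knn([[0, 12], [45, 2]], ["correct", "under_segmented"], k=1)
    print(localize(page_clf, line_clf, [85, 28], [[1, 11], [44, 3]]))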
The motivation behind the two stage classification is that we want to avoid extra computation for correct pages. We achieve this by using a set of simple features in stage-1, where we classify correct and incorrect pages, and in stage-2 we compute computationally more expensive line level features only for the pages which are classified as incorrect by stage-1.
To evaluate the performance of our system we first locate all the errors using ground truth as in [14]. We then compare our results of error localization with it and report the percentage of errors which we are able to localize automatically.
3.1 Features
We observe that (1) most of the characters in a page are of the same size, font and style, (2) line spacing within a document is mostly the same, (3) a page is formatted uniformly within a book, and (4) two nearby lines in a document are mostly of the same height. We use these observations to design a set of features at page level. These features are powerful enough to decide whether a page is correctly segmented or not by looking into the segmented page. The features we use for classifying a segmented page as correct or incorrect, i.e., stage-1 classification, are as follows:
f1: Maximum line height.
Given a segmented page with a set of lines $\{L_1, L_2, ..., L_n\}$ we define feature f1 as:
$f_1 = \max\{LH_1, LH_2, ..., LH_n\}$,
where $LH_i$ is the height of the $i$-th line $L_i$.

f2: Minimum line height.
Given a segmented page with a set of lines $\{L_1, L_2, ..., L_n\}$ we define feature f2 as:
$f_2 = \min\{LH_1, LH_2, ..., LH_n\}$,
where $LH_i$ is the height of the $i$-th line $L_i$.

f3: Difference of maximum and average line height.
Feature f3 is defined as:
$f_3 = f_1 - \frac{1}{n}\sum_{i=1}^{n} LH_i$.

f4: Difference in average and minimum line gap.
Let $LG_{i-1,i}$ be the line gap between lines $L_i$ and $L_{i-1}$; then we define f4 as:
$f_4 = \frac{1}{n-1}\sum_{i=2}^{n} LG_{i-1,i} - \min\{LG_{1,2}, LG_{2,3}, ..., LG_{n-1,n}\}$.

f5: Maximum of difference in line heights and line gap.
Let $LH_{i-1}$, $LH_i$ and $LH_{i+1}$ be the line heights of lines $i-1$, $i$ and $i+1$ respectively. Further, suppose $LG_{i-1,i}$ and $LG_{i,i+1}$ are the gaps between lines $i-1$ and $i$, and $i$ and $i+1$, respectively. Then we define feature f5 as:
$f_5 = \max\{|LH_{i-1} - LH_i| - LG_{i-1,i}\} \;\forall i \in \{2, 3, ..., n\}$.

f6: Maximum area of connected component between lines.
We find all connected components (CCs) between lines which are not part of any line. We compute the area of all such CCs and take the maximum area as a feature.

f7: Minimum area of connected component between lines.
Similarly we take the minimum area of such CCs as a feature.

f8: Maximum white width in the vertical profile of the page.
We compute the vertical profile of the page and find the longest run of zeros in it. We use this as feature f8. This helps us to identify if two multi-column lines are merged.
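A possible implementation of these page level features, assuming each segmented line is available as a (top, bottom, left, right) bounding box and the page as a binary image, is sketched below; f6 and f7 (areas of connected components falling between lines) are omitted since they need a connected-component labelling step. This is only one plausible reading of the definitions above, not the authors' code.

    import numpy as np

    def page_features(line_boxes, page_bin):
        """f1-f5 and f8 of Section 3.1 (sketch). line_boxes: (top, bottom, left, right)
        per segmented line, ordered top to bottom; page_bin: binary image, ink = 1."""
        heights = np.array([b - t for t, b, _, _ in line_boxes], dtype=float)
        tops = np.array([t for t, _, _, _ in line_boxes], dtype=float)
        bottoms = np.array([b for _, b, _, _ in line_boxes], dtype=float)
        gaps = tops[1:] - bottoms[:-1]                   # LG_{i-1,i}
        f1 = heights.max()                               # maximum line height
        f2 = heights.min()                               # minimum line height
        f3 = f1 - heights.mean()                         # max minus average height
        f4 = gaps.mean() - gaps.min()                    # average minus minimum line gap
        f5 = (np.abs(np.diff(heights)) - gaps).max()     # |LH_{i-1}-LH_i| - LG_{i-1,i}
        column_ink = page_bin.sum(axis=0)                # vertical projection profile
        run = best = 0
        for v in column_ink:                             # f8: longest run of empty columns
            run = run + 1 if v == 0 else 0
            best = max(best, run)
        return [f1, f2, f3, f4, f5, float(best)]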
In stage-2, we wish to locate page segmentation errors, i.e. for each segmented line in a segmented page we have to classify it as correct or erroneous. This is a tougher task than stage-1, thus we need powerful and error specific features for this stage. However, this will not affect the overall computation of the error localization, as we locate errors only for those pages which are classified as incorrect by stage-1. The set of features which we compute for each line $L_i$ in order to classify it as correct, over-segmented, under-segmented, false alarm or missing component are as follows:
F1: Difference in line heights and line gap.
We define this feature as follows:
$F_1 = \max\{|LH_{i-1} - LH_i| - LG_{i-1,i},\; |LH_i - LH_{i+1}| - LG_{i,i+1}\}$,
where $LH_i$ is the height of line $i$ and $LG_{i-1,i}$ is the gap between lines $i-1$ and $i$. The intuition behind this feature is that two closest lines should be of similar height.

F2: Difference in line height and maximum height of connected component.
We use the difference of the line height and the maximum height of a connected component in the line as a feature. This helps us to locate under-segmentation, where the line height is far greater than the size of the maximum connected component.

F3: Maximum area of CCs closest to a line.
To define this feature for a line $L_i$ we find the CCs which are not part of any line and are closer to line $L_i$ than to its above or below lines, i.e., lines $L_{i-1}$ and $L_{i+1}$. We compute the areas of all such CCs and take the maximum area as feature F3 for line $L_i$.

F4: Maximum word gap in a line.
In the case of multi-column pages, some segmentation algorithms merge two horizontal lines. To identify such cases we compute the maximum word gap in a line and use this as feature F4.

Language  total lines  overseg (%)  underseg (%)  m.c. (%)  f.a. (%)
Telugu  123493  6.47  0.78  4.8  0.89
Tamil  144215  2.74  3.3  1.1  1.43
Malayalam  181951  0.41  0.43  1.21  0.72
Kannada  114468  4.42  11.49  3.6  4.95
Total  564127  3.14  3.48  2.45  1.8
Table 2: Percentage of lines having errors, estimated using ground truth (m.c. = missing component, f.a. = false alarm). This shows that around 11% of the total segmented lines have some segmentation error.
F5: Maximum area of connected component in a line.
In every line we compute the maximum area of a connected component and use it as a feature. Very high and very low values of this feature correspond to false alarms.

F6: Minimum of upper and lower line gaps.
We define feature F6 for the line $L_i$ as:
$F_6 = \min\{LG_{i-1,i}, LG_{i,i+1}\}$.
This feature helps us to locate over-segmentation, where one of the gaps $LG_{i-1,i}$ or $LG_{i,i+1}$ is very low.
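Analogously, a sketch of some of the line level features (F1, F4, F5, F6) for a single line is given below; the inputs (line bounding boxes, per-line connected-component areas, and per-line maximum word gaps) are assumed to be precomputed, and F2/F3 are left out for brevity. As before, this is an illustrative reading of the definitions rather than the authors' implementation.

    import numpy as np

    def line_features(i, line_boxes, cc_areas_per_line, max_word_gaps):
        """F1, F4, F5, F6 of Section 3.1 for line i (sketch). line_boxes: (top, bottom)
        per line; cc_areas_per_line[i]: areas of CCs inside line i; max_word_gaps[i]:
        largest horizontal word gap in line i."""
        top_i, bot_i = line_boxes[i]
        h_i = bot_i - top_i
        # Heights of and gaps to the neighbouring lines; use infinity at page borders.
        gap_up = top_i - line_boxes[i - 1][1] if i > 0 else np.inf
        gap_dn = line_boxes[i + 1][0] - bot_i if i + 1 < len(line_boxes) else np.inf
        h_up = line_boxes[i - 1][1] - line_boxes[i - 1][0] if i > 0 else h_i
        h_dn = line_boxes[i + 1][1] - line_boxes[i + 1][0] if i + 1 < len(line_boxes) else h_i
        F1 = max(abs(h_up - h_i) - gap_up, abs(h_i - h_dn) - gap_dn)  # height/gap mismatch
        F4 = max_word_gaps[i]                                         # merged-column cue
        F5 = max(cc_areas_per_line[i])                                # largest CC in the line
        F6 = min(gap_up, gap_dn)                                      # over-segmentation cue
        return [F1, F4, F5, F6]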
4. EXPERIMENTS AND RESULTS
4.1 About Dataset
We use a dataset [7] of 109 books in four prominent south Indian languages for all our experiments. Table 4 gives the details of the dataset. This dataset contains pages scanned at 600 dpi. Some sample pages of this dataset can be seen in Figure 4. We also have line level annotation in the form of XML for this dataset, produced using a semi-automatic tool.
Segmentation of Indian language document pages is a challenging task, mainly due to (1) the presence of dangling modifiers and (2) the relative positions of the neighbouring characters not being fixed. In [9], the authors experimentally show that many well-known segmentation algorithms perform poorer on Indian languages compared to English, which also makes automatic error localization an important task.
For performance evaluation of our system, we need to compare our error localization with the one obtained using ground truth. Moreover, we also use ground truth based evaluation in the learning phase of our system. Thus we do our experiments in two phases. In phase-1, we localize the segmentation errors with the help of ground truth. In phase-2, we learn the ground truth based error localization for the training images. However, for the test images we automatically localize errors using nearest neighbour based classification.
4.2 Error localization using ground truth
We first run the segmentation algorithm on all the pages. With the help of ground truth we locate all the segmentation errors and store the line co-ordinates along with the corresponding error type in the database. Further, if all the lines in a page are correctly segmented we tag that page as correct. Similarly, a page having a line level segmentation error is tagged as incorrect. Table 2 summarizes the segmentation errors at line level caused by the segmentation algorithm. We see that on average around 11% of the segmented lines are not correct, which is actually a very large quantity considering the huge size of the dataset. Our aim is to localize these errors automatically, i.e. without using ground truth.
Language  $\rho_{cc}$  $\rho_{ii}$  $\rho_p$
Telugu  76.30  94.52  91.92
Tamil  87.48  91.12  89.90
Malayalam  85.66  84.55  85.06
Kannada  89.49  93.84  92.76
Average  84.42  90.73  89.66
Table 3: Accuracy of classifying pages as correct/incorrect. $\rho_{cc}$, $\rho_{ii}$ and $\rho_p$ denote how accurately we classify correct as correct, incorrect as incorrect, and the overall accuracy at page level.
Language  overseg  underseg  m.c.  f.a.  $\rho_l$
Telugu  89.67  64.48  80.64  52.82  82.68
Tamil  84.65  91.65  44.44  79.83  83.84
Malayalam  79.30  99.83  52.24  76.62  71.68
Kannada  59.11  88.07  62.46  77.50  78.55
Average  79.63  88.09  62.46  70.62  78.51
Table 4: Percentage of segmentation errors we automatically detect (Proposed Scheme). Here $\rho_l$ denotes the overall error localization performance.
4.3 Automatic error localization
4.3.1 Page level
In this stage, for a given segmented page, we say whether the page is correctly segmented or not. For this we use half of the pages for training and the rest for testing. For training images we compute page level features as described in Section 3 and learn the correctness of the page from ground truth, and for testing images we use a k-NN based classifier to decide whether the page is correct or not. (Note that by correct pages we mean those pages which are perfectly segmented at line level.)
This step is important to avoid unnecessary computation for the correctly segmented pages, as we can work on only the incorrect pages for line level error localization.
The task of tagging pages as correct/incorrect is performed in a classification framework. Thus we define a set of performance measures using the confusion matrix. Let C be a confusion matrix where rows 0 and 1 correspond to correct and incorrect classification respectively. (Recall that the entry $C_{ij}$ corresponds to the total number of lines classified as class $j$ but actually belonging to class $i$.) We define correct as correct ($\rho_{cc}$), incorrect as incorrect ($\rho_{ii}$) and overall classification accuracy ($\rho_p$) as follows:

$\rho_{cc} = \frac{C_{00} \times 100}{C_{00} + C_{01}}$

$\rho_{ii} = \frac{C_{11} \times 100}{C_{10} + C_{11}}$

$\rho_p = \frac{(C_{00} + C_{11}) \times 100}{C_{00} + C_{01} + C_{10} + C_{11}}$
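The page level measures above are straightforward to compute from the confusion matrix; a small sketch is given below, with made-up counts used only for the usage example.

    import numpy as np

    def page_level_metrics(C):
        """C: 2x2 confusion matrix, row = true class (0 correct, 1 incorrect),
        column = predicted class. Returns (rho_cc, rho_ii, rho_p) in percent."""
        C = np.asarray(C, dtype=float)
        rho_cc = 100.0 * C[0, 0] / (C[0, 0] + C[0, 1])
        rho_ii = 100.0 * C[1, 1] / (C[1, 0] + C[1, 1])
        rho_p = 100.0 * (C[0, 0] + C[1, 1]) / C.sum()
        return rho_cc, rho_ii, rho_p

    # Example with made-up counts: 80 of 90 correct pages and 101 of 110
    # incorrect pages classified correctly -> roughly (88.9, 91.8, 90.5).
    print(page_level_metrics([[80, 10], [9, 101]]))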
Table 3 describes how accurately we tag correct pages as correct and incorrect pages as incorrect based on the above measures. We see that we are able to classify correct pages as correct and incorrect pages as incorrect with more than 89% accuracy.

Citations
More filters
Proceedings ArticleDOI
16 Dec 2012
TL;DR: The proposed segmentation post processor, which works in a "learning by examples" framework, is not only independent to segmentation algorithms but also robust to the diversity of scanned pages.
Abstract: Text line segmentation is a basic step in any OCR system. Its failure deteriorates the performance of OCR engines. This is especially true for the Indian languages due to the nature of scripts. Many segmentation algorithms are proposed in literature. Often these algorithms fail to adapt dynamically to a given page and thus tend to yield poor segmentation for some specific regions or some specific pages. In this work we design a text line segmentation post processor which automatically localizes and corrects the segmentation errors. The proposed segmentation post processor, which works in a "learning by examples" framework, is not only independent to segmentation algorithms but also robust to the diversity of scanned pages.We show over 5% improvement in text line segmentation on a large dataset of scanned pages for multiple Indian languages.

1 citations


Cites background or methods from "Automatic localization of page segm..."

  • ...These books were identified based on experiments in [12]...


  • ...Figure 1: Typical segmentation errors in Indian scripts as discussed in [12] (a) Two lines are merged into one line (under-segmentation) (b) One line is spilt into two lines (over segmentation) (c) A dangling modifier shown in a small red circle is missed (missing component)....


  • ...Extending the work of [12] on localizing segmentation errors, we design a post-processor which automatically localizes and corrects the errors....


  • ...The exhaustive experiments on scanned document of a large collection of Indian language dataset are conducted in [10, 12]....


  • ...Note that identical to [12] when we measure overall error localization accuracyρl we also consider percentage of correct lines classified as correct....


References
More filters
Journal ArticleDOI
TL;DR: An extensive evaluation of the unsupervised objective evaluation methods that have been proposed in the literature are presented and the advantages and shortcomings of the underlying design mechanisms in these methods are discussed and analyzed.

996 citations


"Automatic localization of page segm..." refers background or methods in this paper

  • ...Recently, in computer vision community researchers have shown interest in unsupervised evaluation of image segmentation algorithms [17]....


  • ...A survey of unsupervised methods for evaluating segmentation algorithms is given in [17]....


Journal ArticleDOI
TL;DR: The requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing, are outlined and several critical functions have been investigated and the technical approaches are discussed.
Abstract: This paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing. Several critical functions have been investigated and the technical approaches are discussed. The first is the segmentation and classification of digitized printed documents into regions of text and images. A nonlinear, run-length smoothing algorithm has been used for this purpose. By using the regular features of text lines, a linear adaptive classification scheme discriminates text regions from others. The second technique studied is an adaptive approach to the recognition of the hundreds of font styles and sizes that can occur on printed documents. A preclassifier is constructed during the input process and used to speed up a well-known pattern-matching method for clustering characters from an arbitrary print source into a small sample of prototypes. Experimental results are included.

718 citations


Additional excerpts

  • ...The popular ones are Recursive XY cut [12], White-space analysis [5], Docstrum [13], Voronoi diagram based [8], and RLSA [16]....

Journal ArticleDOI
Lawrence O'Gorman1
TL;DR: The document spectrum (or docstrum) as discussed by the authors is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components, which yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks.
Abstract: Page layout analysis is a document processing technique used to determine the format of a page. This paper describes the document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components. The method yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks. It is advantageous over many other methods in three main ways: independence from skew angle, independence from different text spacings, and the ability to process local regions of different text orientations within the same image. Results of the method shown for several different page formats and for randomly oriented subpages on the same image illustrate the versatility of the method. We also discuss the differences, advantages, and disadvantages of the docstrum with respect to other lay-out methods. >

654 citations

Book
01 Jan 1995
TL;DR: The document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components, yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks.
Abstract: Page layout analysis is a document processing technique used to determine the format of a page. This paper describes the document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components. The method yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks. It is advantageous over many other methods in three main ways: independence from skew angle, independence from different text spacings, and the ability to process local regions of different text orientations within the same image. Results of the method shown for several different page formats and for randomly oriented subpages on the same image illustrate the versatility of the method. We also discuss the differences, advantages, and disadvantages of the docstrum with respect to other lay-out methods. >

628 citations


Additional excerpts

  • ...The popular ones are Recursive XY cut [12], White-space analysis [5], Docstrum [13], Voronoi diagram based [8], and RLSA [16]....

Journal ArticleDOI
TL;DR: The document image acquisition process and the knowledge base that must be entered into the system to process a family of page images are described, and the process by which the X-Y tree data structure converts a 2-D page-segmentation problem into a series of 1-D string-parsing problems that can be tackled using conventional compiler tools.
Abstract: Gobbledoc, a system providing remote access to stored documents, which is based on syntactic document analysis and optical character recognition (OCR), is discussed. In Gobbledoc, image processing, document analysis, and OCR operations take place in batch mode when the documents are acquired. The document image acquisition process and the knowledge base that must be entered into the system to process a family of page images are described. The process by which the X-Y tree data structure converts a 2-D page-segmentation problem into a series of 1-D string-parsing problems that can be tackled using conventional compiler tools is also described. Syntactic analysis is used in Gobbledoc to divide each page into labeled rectangular blocks. Blocks labeled text are converted by OCR to obtain a secondary (ASCII) document representation. Since such symbolic files are better suited for computerized search than for human access to the document content and because too many visual layout clues are lost in the OCR process (including some special characters), Gobbledoc preserves the original block images for human browsing. Storage, networking, and display issues specific to document images are also discussed. >

466 citations


Additional excerpts

  • ...The popular ones are Recursive XY cut [12], White-space analysis [5], Docstrum [13], Voronoi diagram based [8], and RLSA [16]....


Frequently Asked Questions (8)
Q1. How many errors are found in the proposed scheme?

Their experimental results show that on average only 0.19%, 0.30%, 0.10% and 0.07% of correctly segmented lines are located as under-segmented, over-segmented, missed component and false alarm respectively. 

With the help of ground truth the authors locate all the segmentation errors and store line co-ordinates along with corresponding error type in the database. 

The authors used 109 books in four prominent south Indian languages for their experiments, where randomly selected half of the pages were used for training whereas rest half was used for testing. 

Segmentation of Indian language document pages is a challenging task, mainly due to (1) Presence of dangling modifiers (2) The relative position of the neighbouring characters are not fixed etc. 

The main reason for such failures is that these algorithms are heavily dependent on parameters and thus fail to adapt to a given page dynamically.

For training images the authors compute page level features as described in Section 3 and learn the correctness of the page from ground truth, and for testing images the authors use k-NN based classifier to decide whether the page is correct or not. 

In [9], authors experimentally show that many well-known segmentation algorithm perform poorer in case of Indian languages compared to English, which also makes automatic error localization an important task. 

Then the authors define the overall error localization accuracy at line level as $\rho_l = \frac{\sum_{i=1}^{4} C_{ii} \times 100}{\sum_{i=1}^{4}\sum_{j=0}^{4} C_{ij}}$. This measure gives the percentage of segmentation errors which the authors are able to detect automatically.