Proceedings ArticleDOI

Automatic localization of page segmentation errors

TL;DR: This work focuses on localizing line level segmentation errors without directly using the ground truth and performs experiments on more than 18000 scanned pages of 109 books belonging to four prominent south Indian languages.
Abstract: Page segmentation is a basic step in any character recognition system. Its failure is one of the major causes of deteriorating overall accuracy of current Indian language OCR engines. Many segmentation algorithms have been proposed in the literature. Often these algorithms fail to adapt dynamically to a given page and thus tend to yield poor segmentation for some specific regions or some specific pages. Given the ground truth, locating page segmentation errors is a straightforward problem and is merely useful for comparing segmentation algorithms. In this work, we locate segmentation errors without directly using the ground truth. Such automatic localization of page segmentation errors can be considered a major step towards improving page segmentation. In this work, we focus on localizing line level segmentation errors. We perform experiments on more than 18000 scanned pages of 109 books belonging to four prominent south Indian languages.

Summary (2 min read)

1. INTRODUCTION

  • The success of the page segmentation algorithm critically affects the performance of OCR.
  • Most of these segmentation algorithms perform satisfactorily well but tend to fail in some specific region or for some specific pages.
  • The primary objective of this work is to automatically locate segmentation errors with very high accuracy.
  • The objective of this work is to locate these errors without the help of ground truth.

2. PAGE SEGMENTATION ERRORS

  • There are a large number of document segmentation algorithms available in the literature.
  • Most of these segmentation algorithms suffer from some or other page segmentation errors.
  • Let S and G be the set of lines denoting segmentation output and ground truth respectively.
  • The authors then locate the errors by classifying each line as either correct, over-segmented, under-segmented, false alarm or missing component.

3. THE PROBLEM OF LOCATING PAGE SEGMENTATION ERRORS

  • More often the existing page segmentation algorithms tend to fail for some specific pages or some specific regions of the page.
  • Once segmentation errors are localized, one can use human intervention or alternate algorithm with tuned parameters for error correction.
  • In the learning phase, line level features are computed for each line of the training document images.
  • The authors achieve this by using a set of simple features in stage-1, where they classify pages as correct or incorrect, and in stage-2 they compute computationally more expensive line level features only for the pages which are classified as incorrect by stage-1.
  • To evaluate the performance of their system the authors first locate all the errors using ground truth as in [14].

3.1 Features

  • The authors observe that (1) most of the characters in a page are of the same size, font and style, (2) line spacing within a document is mostly the same, (3) a page is formatted uniformly within a book, (4) two nearby lines in a document are mostly of the same height.
  • The features the authors use for classifying a segmented page as correct or incorrect, i.e., stage-1 classification, are as follows: f1: maximum line height.
  • f5: maximum of the difference in line heights and line gap.
  • To identify such cases the authors compute the maximum word gap in a line and use this as feature F4. F5: maximum area of connected component in a line.

4.1 About Dataset

  • The authors use a dataset [7] of 109 books in four prominent south Indian languages for all their experiments.
  • Table 4 gives the details of the dataset.
  • This dataset contains pages scanned in 600 dpi.
  • Segmentation of Indian language document pages is a challenging task, mainly due to (1) the presence of dangling modifiers and (2) the relative positions of neighbouring characters not being fixed.
  • In phase-2, the authors learn the ground truth based error localization for the training images.

4.2 Error localization using ground truth

  • The authors first run the segmentation algorithm on all the pages.
  • Further, if all the lines in a page are correctly segmented the authors tag that page as correct.
  • Table 2 summarizes segmentation errors at line level caused by the segmentation algorithm, which is actually a very large quantity considering the huge size of the dataset.
  • The authors' aim is to localize these errors automatically, i.e., without using ground truth.

4.3 Automatic error localization

  • For a given segmented page, the authors say whether the page is correctly segmented or not.
  • Thus the authors define a set of performance measures using confusion matrix.
  • The authors see that they are able to classify correct pages as correct and incorrect pages as incorrect with more than 89% accuracy.
  • At line level error localization, the authors classify each segmented line as correct or as one of the segmentation errors.
  • To measure the performance of line level error localization, the authors define a performance metric using a confusion matrix.

5. CONCLUSIONS

  • The authors address the problem of localizing page segmentation errors.
  • The proposed scheme is able to locate segmentation errors without ground truth with high accuracy.
  • Such error localization is very important for segmentation error correction which can be done either by manual intervention or running alternate segmentation algorithms in the error localized part.
  • Further, the proposed error localization scheme is independent of segmentation algorithms.
  • A future direction of this work is to design a segmentation post-processor to automatically correct page segmentation errors.




Automatic Localization of Page Segmentation Errors
Dheeraj Mundhra
IIT Kharagpur
Kharagpur, India
09MA2009@iitkgp.ac.in
Anand Mishra
IIIT Hyderabad
Hyderabad, India
anand.mishra@research.iiit.ac.in
C. V. Jawahar
IIIT Hyderabad
Hyderabad, India
jawahar@iiit.ac.in
ABSTRACT
Page segmentation is a basic step in any character recognition system. Its failure is one of the major causes of deteriorating overall accuracy of current Indian language OCR engines. Many segmentation algorithms have been proposed in the literature. Often these algorithms fail to adapt dynamically to a given page and thus tend to yield poor segmentation for some specific regions or some specific pages. Given the ground truth, locating page segmentation errors is a straightforward problem and is merely useful for comparing segmentation algorithms. In this work, we locate segmentation errors without directly using the ground truth. Such automatic localization of page segmentation errors can be considered a major step towards improving page segmentation. In this work, we focus on localizing line level segmentation errors. We perform experiments on more than 18000 scanned pages of 109 books belonging to four prominent south Indian languages.
General Terms
Experimentation.
Keywords
OCR, document image segmentation, under-segmentation,
over-segmentation, false alarm.
1. INTRODUCTION
The success of the page segmentation algorithm critically affects the performance of OCR. Page segmentation algorithms are one of the widely studied topics in the document image analysis literature (see [6], [14]). Most of these segmentation algorithms perform satisfactorily well but tend to fail in some specific regions or for some specific pages. The main reason for such failures is that these algorithms are heavily dependent on parameters and thus fail to adapt to a given page dynamically. (This work was carried out when Dheeraj Mundhra was visiting IIIT Hyderabad.)
Document image segmentation is a widely studied topic in the literature. The consistent appearance of page segmentation work (see [2], [3], [4]) in competitions at ICDAR shows the interest of the community in this area. Kise et al. [8] proposed a powerful Voronoi diagram based segmentation algorithm a decade ago. Nevertheless, the complexity and variations in document images make the task of segmenting a given document page into lines still challenging. There is also interest in designing hybrid segmentation algorithms [1] or designing segmentation algorithms by learning page features [10].
A set theoretic approach to analysing segmentation algorithms was presented in [11]. In [14] the performance of the six most popular page segmentation algorithms was analysed. Sesh Kumar et al. [9] did a similar analysis of segmentation algorithms for Indian languages. All these works compared document image segmentation algorithms assuming the availability of the ground truth. Although such works allow fair comparison of segmentation algorithms, they have several disadvantages: (1) they do not directly lead towards improvement in the segmentation algorithm, (2) they do not specify why the errors arise, and (3) they are not applicable for large scale evaluation of OCRs [15] due to unavailability of the ground truth.
Recently, researchers in the computer vision community have shown interest in unsupervised evaluation of image segmentation algorithms [17]. Here, for a given image, availability of the ground truth is not assumed. Rather, a set of features is computed from the segmented image, and based on these features the performance of image segmentation algorithms is measured. A survey of unsupervised methods for evaluating segmentation algorithms is given in [17]. We are highly inspired by such methods. However, we do not do any evaluation of page segmentation algorithms in this work. Rather, we go one step further and try to find out where and what type of page segmentation errors are present at line level for a given page. Such segmentation error localization can be considered a major step towards improvement in segmentation output. Once segmentation errors are localized automatically, one can either use human intervention or alternate segmentation algorithms for error correction. However, improving the segmentation accuracy is beyond the scope of this work. The primary objective of this work is to automatically locate segmentation errors with very high accuracy.

Similar to [14], we focus on line level document image segmentation, where a segmentation algorithm partitions the text block into lines. Often these segmentation algorithms fail to segment lines correctly. In other words, each segmented line is either correctly segmented, over-segmented, under-segmented, a false alarm, or missing a dangling modifier. The objective of this work is to locate these errors without the help of ground truth. (We use ground truth only for evaluation of the proposed error localization scheme.) We formulate the problem of locating page segmentation errors as a multi-class classification problem where each segmented line is classified into one of five classes, i.e., correct, under-segmented, over-segmented, false alarm or missing dangling modifier. For this we compute a set of features for a few segmented lines assuming the availability of the ground truth for them. We then compute the same set of features for the rest of the segmented lines to locate the segmentation errors in a classification framework.
We have shown segmentation error localization performance on a specific segmentation algorithm. However, the proposed method is independent of the segmentation algorithm. We used 109 books in four prominent south Indian languages for our experiments, where a randomly selected half of the pages was used for training whereas the rest was used for testing. To evaluate the performance of the proposed method, we used ground truth based comparison. The proposed scheme localizes the segmentation errors with more than 78% accuracy, which means that we are able to localize more than seventy-eight percent of the total page segmentation errors automatically.
The remainder of the paper is organised as follows. In Section 2, various types of page segmentation errors are described. In Section 3, we formulate the problem of locating page segmentation errors as a classification problem. Here we describe the features used to learn various types of page segmentation errors. Section 4 describes experiments and results. We finally conclude our work in Section 5.
2. PAGE SEGMENTATION ERRORS
There are a large number of document segmentation algorithms available in the literature. The popular ones are Recursive XY cut [12], White-space analysis [5], Docstrum [13], Voronoi diagram based [8], and RLSA [16]. A description of these algorithms is not in the scope of this paper; however, readers are encouraged to see [14] for a description of these segmentation algorithms. Most of these segmentation algorithms suffer from some or other page segmentation errors. These errors can be defined in a set theoretic notion as in [14]. We summarize these definitions here.
Let S and G be the sets of lines denoting the segmentation output and the ground truth respectively. Then we can define segmentation errors as follows:
Correct.
A line $B \in S$ is said to be correct if there exists a unique line $A \in G$ such that $A \cap B$ is significant.

Over-segmented.
A line $B \in S$ is said to be over-segmented if there exists at least one more line $B' \in S$ and a line $A \in G$ such that both $A \cap B$ and $A \cap B'$ are significant.

Figure 1: Ground truth and segmented page.

Under-segmented.
A line $B \in S$ is said to be under-segmented if there exist multiple lines $A$ in $G$ such that $A \cap B$ is significant.

Missing component.
A line $B \in S$ is said to be a missing component if there exists a unique line $A \in G$ such that $A \cap B$ is not significant. In other words, by calling line $A$ a missing component we mean that line $A$ has missed some dangling modifier either above or below the line. This error is very common in Indian language document image segmentation. (Note that this error is not defined in [14].)

False alarm.
A line $B \in S$ is said to be a false alarm if there does not exist any line $A \in G$ such that $A \cap B \neq \phi$.

Missed line.
A line $A \in G$ is said to be a missed line if there does not exist any line $B \in S$ such that $A \cap B \neq \phi$.
We demonstrate typical examples of page segmentation errors in Figure 1. Here lines $B_1$ and $B_2$ are correctly segmented. $B_3$ and $B_4$ are over-segmented. Line $B_5$ is under-segmented. Line $B_6$ is a false alarm whereas $A_6$ is a missed line. Line $B_7$ is a missing component as it is missing a small dangling modifier which is just below the line. In this work, for a given segmented page we compute certain features (which we describe in the next section) and locate the first five segmentation errors in a given page in a multi-class classification framework. For learning the features we assume the availability of the ground truth for the training images, whereas for a test image we compute the same set of features for every line. We then locate the errors by classifying each line as either correct, over-segmented, under-segmented, false alarm or missing component.
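To make these definitions concrete, the following is a minimal sketch, in Python, of how a segmented line could be assigned one of the above categories from its overlap with ground-truth lines. It is not the authors' implementation: the representation of a line as a set of foreground pixel coordinates, the significance threshold SIG, and the simplified handling of the missing-component case are all our assumptions.

    # Sketch: classify one segmented line B against ground truth G (Section 2).
    SIG = 0.6  # assumed fraction of the smaller line that must be covered

    def significant(a, b):
        """Overlap of two pixel sets is 'significant' if it covers most of the smaller one."""
        return len(a & b) >= SIG * min(len(a), len(b))

    def classify_line(B, S, G):
        """B: pixel set of one segmented line; S: all segmented lines; G: ground-truth lines."""
        if all(not (A & B) for A in G):
            return "false_alarm"              # B overlaps no ground-truth line at all
        overlaps = [A for A in G if significant(A, B)]
        if len(overlaps) > 1:
            return "under_segmented"          # B spans several ground-truth lines
        if len(overlaps) == 1:
            A = overlaps[0]
            if any(other is not B and significant(A, other) for other in S):
                return "over_segmented"       # A is split across B and another segmented line
            return "correct"
        # B overlaps some line, but not significantly; treated here as a missing
        # component (e.g. a dropped dangling modifier) -- a simplification.
        return "missing_component"

The missed-line case concerns ground-truth lines rather than segmented lines, so it is not returned by this function.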
3. THE PROBLEM OF LOCATING PAGE
SEGMENTATION ERRORS
More often than not, the existing page segmentation algorithms tend to fail for some specific pages or some specific regions of the page. Figure 2 shows some typical examples of failure at line level segmentation. Given a segmented page, our goal is to locate page segmentation errors. Once segmentation errors are localized, one can use human intervention or an alternate algorithm with tuned parameters for error correction. However, improving the segmentation accuracy is beyond the scope of this paper. We rather work on segmentation error localization. We do it in two stages. Figure 3 demonstrates the process of error localization. In stage-1 we compute some page level features and classify each page as correct or erroneous (over, under, false alarm, or missing component). In erroneous pages we compute line level features and classify each line either as correct, over-segmented, under-segmented, false alarm or missing component. Note that we learn both line level and page level features in a supervised learning framework, i.e. we assume availability of ground truth for the training images so that we can learn line/page level features and the corresponding errors. However, for the test images we locate the errors using nearest neighbour based classification.

Figure 2: Typical segmentation errors: left and right columns show part of a sample page and the corresponding segmented output respectively. (a) Two lines are merged into one line (under-segmentation) (b) One line is split into two lines (over-segmentation) (c) A dangling modifier shown in a small red circle is missed (missing component).

Figure 3: Process of locating segmentation errors.

The training and testing phases of locating page segmentation errors can be summarized as follows:
Page level.
Learning phase:
1. Compute page level features for each page of the training document images.
2. Assign a correct or incorrect label to each page using ground truth. (Note that by a correct label for a page we mean that all the lines in that page are correctly segmented.)
Testing phase:
1. Compute page level features for each page of the test document images.
2. Classify each page as correct or incorrect using k-nearest neighbour based classification.

Line level.
Learning phase:
1. Compute line level features for each line of the training document images.
2. Assign correct or error labels to each line using ground truth.
Testing phase:
1. Compute line level features for each line of a test document image.
2. Classify each line as correct, over-segmented, under-segmented, false alarm or missing component using k-nearest neighbour based classification.
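As a rough illustration of the page level and line level phases summarized above, the sketch below wires the two k-nearest-neighbour classifiers together; scikit-learn, the label names, and the toy feature vectors are our choices and not part of the paper.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def fit_knn(features, labels, k=3):
        """Fit a k-NN classifier on page level or line level feature vectors."""
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(np.asarray(features, dtype=float), labels)
        return clf

    def localize(page_clf, line_clf, page_feat, line_feats):
        """Stage-1 on the page; run stage-2 on its lines only if the page looks incorrect."""
        if page_clf.predict([page_feat])[0] == "correct":
            return ["correct"] * len(line_feats)     # cheap path: skip the line level stage
        return list(line_clf.predict(line_feats))

    # Toy usage with made-up two-dimensional feature vectors:
    page_clf = fit_knn([[40, 5], [90, 30]], ["correct", "incorrect"], k=1)
    line_clf = fit_knn([[0, 12], [45, 2]], ["correct", "under_segmented"], k=1)
    print(localize(page_clf, line_clf, [85, 28], [[1, 11], [44, 3]]))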
The motivation behind the two stage classification is that we want to avoid extra computation for correct pages. We achieve this by using a set of simple features in stage-1, where we classify correct and incorrect pages, and in stage-2 we compute computationally more expensive line level features only for the pages which are classified as incorrect by stage-1.
To evaluate the performance of our system we first locate all the errors using ground truth as in [14]. We then compare our results of error localization with it and report the percentage of errors which we are able to localize automatically.
3.1 Features
We observe that (1) most of the characters in a page are of the same size, font and style, (2) line spacing within a document is mostly the same, (3) a page is formatted uniformly within a book, and (4) two nearby lines in a document are mostly of the same height. We use these observations to design a set of features at page level. These features are powerful enough to decide whether a page is correctly segmented or not by looking into the segmented page. The features we use for classifying a segmented page as correct or incorrect, i.e., stage-1 classification, are as follows:
f1: Maximum line height.
Given a segmented page with a set of lines $\{L_1, L_2, ..., L_n\}$ we define feature f1 as:
$f_1 = \max\{LH_1, LH_2, ..., LH_n\}$,
where $LH_i$ is the height of the $i$-th line $L_i$.

f2: Minimum line height.
Given a segmented page with a set of lines $\{L_1, L_2, ..., L_n\}$ we define feature f2 as:
$f_2 = \min\{LH_1, LH_2, ..., LH_n\}$,
where $LH_i$ is the height of the $i$-th line $L_i$.

f3: Difference of maximum and average line height.
Feature f3 is defined as:
$f_3 = f_1 - \frac{1}{n}\sum_{i=1}^{n} LH_i$.

f4: Difference in average and minimum line gap.
Let $LG_{i-1,i}$ be the line gap between lines $L_i$ and $L_{i-1}$; then we define f4 as:
$f_4 = \frac{1}{n-1}\sum_{i=2}^{n} LG_{i-1,i} - \min\{LG_{1,2}, LG_{2,3}, ..., LG_{n-1,n}\}$.

f5: Maximum of difference in line heights and line gap.
Let $LH_{i-1}$, $LH_i$ and $LH_{i+1}$ be the line heights of lines $i-1$, $i$ and $i+1$ respectively. Further, suppose $LG_{i-1,i}$ and $LG_{i,i+1}$ are the gaps between lines $i-1$ and $i$, and $i$ and $i+1$, respectively. Then we define feature f5 as:
$f_5 = \max\{|LH_{i-1} - LH_i| - LG_{i-1,i}\} \;\forall i \in \{2, 3, ..., n\}$.

f6: Maximum area of connected component between lines.
We find all connected components (CCs) between lines which are not part of any line. We compute the area of all such CCs and take the maximum area as a feature.

f7: Minimum area of connected component between lines.
Similarly we take the minimum area of such CCs as a feature.

f8: Maximum white width in the vertical profile of the page.
We compute the vertical profile of the page and find the longest run of zeros in it. We use this as feature f8. This helps us to identify if two multi-column lines are merged.
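A possible implementation of these page level features, assuming each segmented line is available as a (top, bottom, left, right) bounding box and the page as a binary image, is sketched below; f6 and f7 (areas of connected components falling between lines) are omitted since they need a connected-component labelling step. This is only one plausible reading of the definitions above, not the authors' code.

    import numpy as np

    def page_features(line_boxes, page_bin):
        """f1-f5 and f8 of Section 3.1 (sketch). line_boxes: (top, bottom, left, right)
        per segmented line, ordered top to bottom; page_bin: binary image, ink = 1."""
        heights = np.array([b - t for t, b, _, _ in line_boxes], dtype=float)
        tops = np.array([t for t, _, _, _ in line_boxes], dtype=float)
        bottoms = np.array([b for _, b, _, _ in line_boxes], dtype=float)
        gaps = tops[1:] - bottoms[:-1]                   # LG_{i-1,i}
        f1 = heights.max()                               # maximum line height
        f2 = heights.min()                               # minimum line height
        f3 = f1 - heights.mean()                         # max minus average height
        f4 = gaps.mean() - gaps.min()                    # average minus minimum line gap
        f5 = (np.abs(np.diff(heights)) - gaps).max()     # |LH_{i-1}-LH_i| - LG_{i-1,i}
        column_ink = page_bin.sum(axis=0)                # vertical projection profile
        run = best = 0
        for v in column_ink:                             # f8: longest run of empty columns
            run = run + 1 if v == 0 else 0
            best = max(best, run)
        return [f1, f2, f3, f4, f5, float(best)]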
In stage-2, we wish to locate page segmentation errors, i.e. for each segmented line in a segmented page we have to classify it as correct or erroneous. This is a tougher task than stage-1, thus we need powerful and error specific features for this stage. However, this will not affect the overall computation of the error localization, as we locate errors only for those pages which are classified as incorrect by stage-1. The set of features which we compute for each line $L_i$ in order to classify it as correct, over-segmented, under-segmented, false alarm or missing component are as follows:
F1: Difference in line heights and line gap.
We define this feature as follows:
$F_1 = \max\{|LH_{i-1} - LH_i| - LG_{i-1,i},\; |LH_i - LH_{i+1}| - LG_{i,i+1}\}$,
where $LH_i$ is the height of line $i$ and $LG_{i-1,i}$ is the gap between lines $i-1$ and $i$. The intuition behind this feature is that two closest lines should be of similar height.

F2: Difference in line height and maximum height of connected component.
We use the difference of the line height and the maximum height of a connected component in the line as a feature. This helps us to locate under-segmentation, where the line height is far greater than the size of the maximum connected component.

F3: Maximum area of CCs closest to a line.
To define this feature for a line $L_i$ we find the CCs which are not part of any line and are closer to line $L_i$ than to its above or below lines, i.e., lines $L_{i-1}$ and $L_{i+1}$. We compute the areas of all such CCs and take the maximum area as feature F3 for line $L_i$.

F4: Maximum word gap in a line.
In the case of multi-column pages, some segmentation algorithms merge two horizontal lines. To identify such cases we compute the maximum word gap in a line and use this as feature F4.

Language  total lines  overseg (%)  underseg (%)  m.c. (%)  f.a. (%)
Telugu  123493  6.47  0.78  4.8  0.89
Tamil  144215  2.74  3.3  1.1  1.43
Malayalam  181951  0.41  0.43  1.21  0.72
Kannada  114468  4.42  11.49  3.6  4.95
Total  564127  3.14  3.48  2.45  1.8
Table 2: Percentage of lines having errors, estimated using ground truth (m.c. = missing component, f.a. = false alarm). This shows that around 11% of the total segmented lines have some segmentation error.
F5: Maximum area of connected component in a line.
In every line we compute the maximum area of a connected component and use it as a feature. Very high and very low values of this feature correspond to false alarms.

F6: Minimum of upper and lower line gaps.
We define feature F6 for the line $L_i$ as:
$F_6 = \min\{LG_{i-1,i}, LG_{i,i+1}\}$.
This feature helps us to locate over-segmentation, where one of the gaps $LG_{i-1,i}$ or $LG_{i,i+1}$ is very low.
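Analogously, a sketch of some of the line level features (F1, F4, F5, F6) for a single line is given below; the inputs (line bounding boxes, per-line connected-component areas, and per-line maximum word gaps) are assumed to be precomputed, and F2/F3 are left out for brevity. As before, this is an illustrative reading of the definitions rather than the authors' implementation.

    import numpy as np

    def line_features(i, line_boxes, cc_areas_per_line, max_word_gaps):
        """F1, F4, F5, F6 of Section 3.1 for line i (sketch). line_boxes: (top, bottom)
        per line; cc_areas_per_line[i]: areas of CCs inside line i; max_word_gaps[i]:
        largest horizontal word gap in line i."""
        top_i, bot_i = line_boxes[i]
        h_i = bot_i - top_i
        # Heights of and gaps to the neighbouring lines; use infinity at page borders.
        gap_up = top_i - line_boxes[i - 1][1] if i > 0 else np.inf
        gap_dn = line_boxes[i + 1][0] - bot_i if i + 1 < len(line_boxes) else np.inf
        h_up = line_boxes[i - 1][1] - line_boxes[i - 1][0] if i > 0 else h_i
        h_dn = line_boxes[i + 1][1] - line_boxes[i + 1][0] if i + 1 < len(line_boxes) else h_i
        F1 = max(abs(h_up - h_i) - gap_up, abs(h_i - h_dn) - gap_dn)  # height/gap mismatch
        F4 = max_word_gaps[i]                                         # merged-column cue
        F5 = max(cc_areas_per_line[i])                                # largest CC in the line
        F6 = min(gap_up, gap_dn)                                      # over-segmentation cue
        return [F1, F4, F5, F6]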
4. EXPERIMENTS AND RESULTS
4.1 About Dataset
We use a dataset [7] of 109 books in four prominent south Indian languages for all our experiments. Table 4 gives the details of the dataset. This dataset contains pages scanned at 600 dpi. Some sample pages of this dataset can be seen in Figure 4. We also have line level annotation in the form of XML for this dataset, produced using a semi-automatic tool.
Segmentation of Indian language document pages is a challenging task, mainly due to (1) the presence of dangling modifiers and (2) the relative positions of the neighbouring characters not being fixed. In [9], the authors experimentally show that many well-known segmentation algorithms perform poorer on Indian languages compared to English, which also makes automatic error localization an important task.
For performance evaluation of our system, we need to compare our error localization with the one obtained using ground truth. Moreover, we also use ground truth based evaluation in the learning phase of our system. Thus we do our experiments in two phases. In phase-1, we localize the segmentation errors with the help of ground truth. In phase-2, we learn the ground truth based error localization for the training images. However, for the test images we automatically localize errors using nearest neighbour based classification.
4.2 Error localization using ground truth
We first run the segmentation algorithm on all the pages. With the help of ground truth we locate all the segmentation errors and store the line co-ordinates along with the corresponding error type in the database. Further, if all the lines in a page are correctly segmented we tag that page as correct. Similarly, a page having a line level segmentation error is tagged as incorrect. Table 2 summarizes the segmentation errors at line level caused by the segmentation algorithm. We see that on average around 11% of the segmented lines are not correct, which is actually a very large quantity considering the huge size of the dataset. Our aim is to localize these errors automatically, i.e. without using ground truth.
Language  $\rho_{cc}$  $\rho_{ii}$  $\rho_p$
Telugu  76.30  94.52  91.92
Tamil  87.48  91.12  89.90
Malayalam  85.66  84.55  85.06
Kannada  89.49  93.84  92.76
Average  84.42  90.73  89.66
Table 3: Accuracy of classifying pages as correct/incorrect. $\rho_{cc}$, $\rho_{ii}$ and $\rho_p$ denote how accurately we classify correct as correct, incorrect as incorrect, and the overall accuracy at page level.
Language  overseg  underseg  m.c.  f.a.  $\rho_l$
Telugu  89.67  64.48  80.64  52.82  82.68
Tamil  84.65  91.65  44.44  79.83  83.84
Malayalam  79.30  99.83  52.24  76.62  71.68
Kannada  59.11  88.07  62.46  77.50  78.55
Average  79.63  88.09  62.46  70.62  78.51
Table 4: Percentage of segmentation errors we automatically detect (Proposed Scheme). Here $\rho_l$ denotes the overall error localization performance.
4.3 Automatic error localization
4.3.1 Page level
In this stage, for a given segmented page, we say whether the page is correctly segmented or not. For this we use half of the pages for training and the rest for testing. For training images we compute page level features as described in Section 3 and learn the correctness of the page from ground truth, and for testing images we use a k-NN based classifier to decide whether the page is correct or not. (Note that by correct pages we mean those pages which are perfectly segmented at line level.)
This step is important to avoid unnecessary computation for the correctly segmented pages, as we can work on only the incorrect pages for line level error localization.
The task of tagging pages as correct/incorrect is performed in a classification framework. Thus we define a set of performance measures using the confusion matrix. Let C be a confusion matrix where rows 0 and 1 correspond to correct and incorrect classification respectively. (Recall that the entry $C_{ij}$ corresponds to the total number of lines classified as class $j$ but actually belonging to class $i$.) We define correct as correct ($\rho_{cc}$), incorrect as incorrect ($\rho_{ii}$) and overall classification accuracy ($\rho_p$) as follows:

$\rho_{cc} = \frac{C_{00} \times 100}{C_{00} + C_{01}}$

$\rho_{ii} = \frac{C_{11} \times 100}{C_{10} + C_{11}}$

$\rho_p = \frac{(C_{00} + C_{11}) \times 100}{C_{00} + C_{01} + C_{10} + C_{11}}$
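The page level measures above are straightforward to compute from the confusion matrix; a small sketch is given below, with made-up counts used only for the usage example.

    import numpy as np

    def page_level_metrics(C):
        """C: 2x2 confusion matrix, row = true class (0 correct, 1 incorrect),
        column = predicted class. Returns (rho_cc, rho_ii, rho_p) in percent."""
        C = np.asarray(C, dtype=float)
        rho_cc = 100.0 * C[0, 0] / (C[0, 0] + C[0, 1])
        rho_ii = 100.0 * C[1, 1] / (C[1, 0] + C[1, 1])
        rho_p = 100.0 * (C[0, 0] + C[1, 1]) / C.sum()
        return rho_cc, rho_ii, rho_p

    # Example with made-up counts: 80 of 90 correct pages and 101 of 110
    # incorrect pages classified correctly -> roughly (88.9, 91.8, 90.5).
    print(page_level_metrics([[80, 10], [9, 101]]))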
Table 3 describes how accurately we tag correct pages as correct and incorrect pages as incorrect based on the above measures. We see that we are able to classify correct pages as correct and incorrect pages as incorrect with more than 89% accuracy.

Citations
More filters
Proceedings ArticleDOI
16 Dec 2012
TL;DR: The proposed segmentation post processor, which works in a "learning by examples" framework, is not only independent to segmentation algorithms but also robust to the diversity of scanned pages.
Abstract: Text line segmentation is a basic step in any OCR system. Its failure deteriorates the performance of OCR engines. This is especially true for the Indian languages due to the nature of scripts. Many segmentation algorithms are proposed in literature. Often these algorithms fail to adapt dynamically to a given page and thus tend to yield poor segmentation for some specific regions or some specific pages. In this work we design a text line segmentation post processor which automatically localizes and corrects the segmentation errors. The proposed segmentation post processor, which works in a "learning by examples" framework, is not only independent to segmentation algorithms but also robust to the diversity of scanned pages.We show over 5% improvement in text line segmentation on a large dataset of scanned pages for multiple Indian languages.

1 citations


Cites background or methods from "Automatic localization of page segm..."

  • ...These books were identified based on experiments in [12]...


  • ...Figure 1: Typical segmentation errors in Indian scripts as discussed in [12] (a) Two lines are merged into one line (under-segmentation) (b) One line is spilt into two lines (over segmentation) (c) A dangling modifier shown in a small red circle is missed (missing component)....


  • ...Extending the work of [12] on localizing segmentation errors, we design a post-processor which automatically localizes and corrects the errors....


  • ...The exhaustive experiments on scanned document of a large collection of Indian language dataset are conducted in [10, 12]....


  • ...Note that identical to [12] when we measure overall error localization accuracyρl we also consider percentage of correct lines classified as correct....


References
More filters
Journal ArticleDOI
TL;DR: An extensive evaluation of the unsupervised objective evaluation methods that have been proposed in the literature are presented and the advantages and shortcomings of the underlying design mechanisms in these methods are discussed and analyzed.

996 citations


"Automatic localization of page segm..." refers background or methods in this paper

  • ...Recently, in computer vision community researchers have shown interest in unsupervised evaluation of image segmentation algorithms [17]....


  • ...A survey of unsupervised methods for evaluating segmentation algorithms is given in [17]....


Journal ArticleDOI
TL;DR: The requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing, are outlined and several critical functions have been investigated and the technical approaches are discussed.
Abstract: This paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing. Several critical functions have been investigated and the technical approaches are discussed. The first is the segmentation and classification of digitized printed documents into regions of text and images. A nonlinear, run-length smoothing algorithm has been used for this purpose. By using the regular features of text lines, a linear adaptive classification scheme discriminates text regions from others. The second technique studied is an adaptive approach to the recognition of the hundreds of font styles and sizes that can occur on printed documents. A preclassifier is constructed during the input process and used to speed up a well-known pattern-matching method for clustering characters from an arbitrary print source into a small sample of prototypes. Experimental results are included.

718 citations


Additional excerpts

  • ...The popular ones are Recursive XY cut [12], White-space analysis [5], Docstrum [13], Voronoi diagram based [8], and RLSA [16]....

Journal ArticleDOI
Lawrence O'Gorman1
TL;DR: The document spectrum (or docstrum) as discussed by the authors is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components, which yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks.
Abstract: Page layout analysis is a document processing technique used to determine the format of a page. This paper describes the document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components. The method yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks. It is advantageous over many other methods in three main ways: independence from skew angle, independence from different text spacings, and the ability to process local regions of different text orientations within the same image. Results of the method shown for several different page formats and for randomly oriented subpages on the same image illustrate the versatility of the method. We also discuss the differences, advantages, and disadvantages of the docstrum with respect to other lay-out methods. >

654 citations

Book
01 Jan 1995
TL;DR: The document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components, yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks.
Abstract: Page layout analysis is a document processing technique used to determine the format of a page. This paper describes the document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components. The method yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks. It is advantageous over many other methods in three main ways: independence from skew angle, independence from different text spacings, and the ability to process local regions of different text orientations within the same image. Results of the method shown for several different page formats and for randomly oriented subpages on the same image illustrate the versatility of the method. We also discuss the differences, advantages, and disadvantages of the docstrum with respect to other lay-out methods. >

628 citations


Additional excerpts

  • ...The popular ones are Recursive XY cut [12], White-space analysis [5], Docstrum [13], Voronoi diagram based [8], and RLSA [16]....

Journal ArticleDOI
TL;DR: The document image acquisition process and the knowledge base that must be entered into the system to process a family of page images are described, and the process by which the X-Y tree data structure converts a 2-D page-segmentation problem into a series of 1-D string-parsing problems that can be tackled using conventional compiler tools.
Abstract: Gobbledoc, a system providing remote access to stored documents, which is based on syntactic document analysis and optical character recognition (OCR), is discussed. In Gobbledoc, image processing, document analysis, and OCR operations take place in batch mode when the documents are acquired. The document image acquisition process and the knowledge base that must be entered into the system to process a family of page images are described. The process by which the X-Y tree data structure converts a 2-D page-segmentation problem into a series of 1-D string-parsing problems that can be tackled using conventional compiler tools is also described. Syntactic analysis is used in Gobbledoc to divide each page into labeled rectangular blocks. Blocks labeled text are converted by OCR to obtain a secondary (ASCII) document representation. Since such symbolic files are better suited for computerized search than for human access to the document content and because too many visual layout clues are lost in the OCR process (including some special characters), Gobbledoc preserves the original block images for human browsing. Storage, networking, and display issues specific to document images are also discussed. >

466 citations


Additional excerpts

  • ...The popular ones are Recursive XY cut [12], White-space analysis [5], Docstrum [13], Voronoi diagram based [8], and RLSA [16]....


Frequently Asked Questions (8)
Q1. How many errors are found in the proposed scheme?

Their experimental results show that on average only 0.19%, 0.30%, 0.10% and 0.07% of correctly segmented lines are located as under-segmented, over-segmented, missed component and false alarm respectively. 

With the help of ground truth the authors locate all the segmentation errors and store line co-ordinates along with corresponding error type in the database. 

The authors used 109 books in four prominent south Indian languages for their experiments, where randomly selected half of the pages were used for training whereas rest half was used for testing. 

Segmentation of Indian language document pages is a challenging task, mainly due to (1) Presence of dangling modifiers (2) The relative position of the neighbouring characters are not fixed etc. 

The main reason for such failures is that these algorithms are heavily dependent on parameters and thus fail to adapt to a given page dynamically.

For training images the authors compute page level features as described in Section 3 and learn the correctness of the page from ground truth, and for testing images the authors use k-NN based classifier to decide whether the page is correct or not. 

In [9], authors experimentally show that many well-known segmentation algorithm perform poorer in case of Indian languages compared to English, which also makes automatic error localization an important task. 

Then the authors define the overall error localization accuracy at line level as $\rho_l = \frac{\sum_{i=1}^{4} C_{ii} \times 100}{\sum_{i=1}^{4}\sum_{j=0}^{4} C_{ij}}$. This measure gives the percentage of segmentation errors which the authors are able to detect automatically.