
Layout Analysis for Arabic Historical Document Images Using Machine Learning

Syed Saqib Bukhari, Thomas M. Breuel
Technical University of Kaiserslautern, Germany
bukhari@informatik.uni-kl.de, tmb@informatik.uni-kl.de

Abedelkadir Asi, Jihad El-Sana
Ben-Gurion University of the Negev, Israel
abedas@cs.bgu.ac.il, el-sana@cs.bgu.ac.il

2012 International Conference on Frontiers in Handwriting Recognition (ICFHR), DOI 10.1109/ICFHR.2012.227
Abstract
Page layout analysis is a fundamental step of any document image understanding system. We introduce an approach that segments text appearing in page margins (a.k.a. side-notes text) from manuscripts with complex layout formats. Simple and discriminative features are extracted at the connected-component level, and robust feature vectors are subsequently generated. A multilayer perceptron classifier is exploited to classify connected components into the relevant class of text. A voting scheme is then applied to refine the resulting segmentation and produce the final classification. In contrast to state-of-the-art segmentation approaches, this method is independent of block segmentation as well as pixel-level analysis. The proposed method has been trained and tested on a dataset that contains a variety of complex side-notes layout formats, achieving a segmentation accuracy of about 95%.
1 Introduction
Manually copying a manuscript was the ultimate way to spread knowledge before printing houses were established. Scholars added their own notes on page margins mainly because paper was an expensive material. Historians value both the content of these notes and the role of their layout; the notes have become an important reference in their own right. Hence, analyzing this content became an inevitable step toward reliable manuscript authentication [11], which would subsequently shed light on a manuscript's temporal and geographical origin.
These authors contributed equally.
Figure 1. Arabic historical document image with complex layout formatting due to side-notes text.
The physical structure of handwritten historical manuscripts imposes a variety of challenges for any page layout analysis system. Due to looser formatting rules, non-rectangular layouts, and irregularities in the location of layout entities [2, 11], layout analysis of handwritten ancient documents has become a challenging research problem. In contrast to algorithms that cope with modern machine-printed documents or historical documents from the hand-press period, algorithms for handwritten ancient documents are required to cope with the above challenges.
Page layout analysis is a fundamental step of any document image understanding system. The analysis process consists of two main steps: page decomposition and block classification. Page decomposition segments a document image into homogeneous regions, and the classification step attempts to determine whether each segmented region is text, a picture, or a drawing.
The text regions are later fed into a recognition system, such as an Optical Character Recognition (OCR) engine, to retrieve the letters and words that correspond to the characters in the manuscript.
In this paper, we introduce an approach that segments side-notes text from manuscripts with complex layout formatting (see Figure 1). It extracts and generates feature vectors at the connected-component level. A multi-layer perceptron classifier, which has already been used for page layout analysis by Jain and Zhong [9], is exploited to classify connected components into the relevant classes of text. A voting step is then applied to refine the resulting segmentation and produce the final classification. The suggested approach is independent of block segmentation as well as pixel-level analysis.
In the rest of the paper, we review previous work, present our approach in detail, report experimental results, and finally conclude and suggest directions for future work.
2 Related Work
Due to the challenges posed by handwritten historical documents [2], traditional page layout analysis methods, which usually address machine-printed documents, are not directly applicable. Methods for page layout analysis can be roughly categorized into three major classes: bottom-up, top-down, and hybrid methods [12, 15, 7]. In top-down methods, the document image is divided into regions which are classified and refined according to pre-defined criteria. Bottom-up approaches group basic image elements, such as pixels and connected components, to create larger homogeneous regions. Hybrid schemes exploit the advantages of top-down and bottom-up approaches to yield better results.
Recently, Garz et al. [8] introduced a binarization-free approach that employs the Scale Invariant Feature Transform (SIFT) to analyze the layout of handwritten ancient documents. The method performs a part-based detection of layout entities locally, using a multi-stage algorithm to localize the entities based on interest points; a Support Vector Machine (SVM) is used to discriminate the considered classes. Kise et al. [10] introduced a page segmentation method for documents with non-Manhattan layouts. Their method is based on connected-component analysis and exploits the area Voronoi diagram to segment the page. Bukhari et al. [5] presented an algorithm for segmenting printed document images into text and non-text regions. They examined the document at the level of connected components and introduced a self-tunable training model (AutoMLP) for distinguishing between text and non-text components; connected-component shape and context were utilized to generate feature vectors. Moll et al. [14] suggested an algorithm that classifies individual pixels and applied it to handwritten, machine-printed, and photographed document images. Pixel-based classification approaches are time-consuming in comparison to block-based and component-based approaches.
Page layout analysis has also been posed as a texture segmentation problem in the literature; for texture-based approaches, see the reviews in [13, 16]. Jain and Zhong [9] suggested a texture-based, language-free algorithm for machine-printed document images. A neural network was employed to train a set of masks designed to be robust and distinctive, and texture features were obtained by convolving the trained masks with the input image. Shape and textural image properties motivated the work introduced by Bloomberg [3], in which standard and generalized (multi-resolution) morphological operations were used. Later on, Bukhari et al. [6] generalized Bloomberg's text/image segmentation algorithm to separate text from non-text components, including halftones, drawings, graphs, maps, etc. The approach by Won [19] combines a block-based algorithm and a pixel-based algorithm to segment a document image into text and image regions.
Ouwayed et al. [17] suggested an approach to segment multi-oriented handwritten documents into text lines. Their method addresses documents with complex layout structure. They subdivided the image into rectangular cells and estimated the text orientation in each cell using projection profiles; cells are then merged into larger zones with respect to their orientation, and the Wigner-Ville distribution is exploited to estimate the orientation within the large zones. This method cannot always yield accurate segmentation results, due to some assumptions adopted by the authors. When a window contains several writings in different orientations, the authors assumed that the border between the two types of writing could be detected by finding the minimum index in the projection profile in order to refine the cell subdivision. However, this border is not always obvious, and detecting the minimum index in the projection profile becomes a real challenge when side-notes are written in a flexible writing style (see Figure 1). One can also notice that the robustness of this approach could be negatively affected once the side-notes text has the same orientation as the main-body text and the two types of text have no salient space between them. In this case, the method would not distinguish between the
two coinciding regions, and erroneous text lines would be extracted.
3 Method
Conventional methods for geometric layout analysis could be an adequate choice for the side-notes segmentation problem when main-body and side-notes text have salient and differentiable geometric properties, such as text orientation, text size, or white-space locations. However, layout rules did not necessarily guide the scribes of ancient manuscripts; as a result, document images with complex layouts became common. These documents exhibit non-uniform and/or similar geometric properties for both main-body and side-notes text, a fact that makes developing a method which can gracefully cope with this type of document a challenging task.
Our approach utilizes a machine learning technique to meet the challenges of this problem. In general, classifier tuning is a hard problem with respect to the optimization of sensitive parameters, e.g., the learning parameters C and gamma of an SVM classifier.
Here, we use an MLP classifier for segmenting side-notes from main-body text in complex Arabic documents. This approach is based on previous work by Bukhari et al. [5]. The main reason for using an MLP classifier over others is that it achieves good classification accuracy once it is adequately trained, and it scales well. However, a major difficulty in its use has been the requirement for manual inspection during the training process: MLPs are hard to train because their performance is sensitive to the chosen parameter values, and the optimal parameter values depend heavily on the considered dataset. The parameter optimization problem of MLPs could be solved by using grid search for classifier training, but grid search is a slow process. Therefore, to overcome this problem, we use AutoMLP [4], a self-tuning classifier that can automatically adjust its learning parameters.
3.1 AutoMLP Classifier
AutoMLP combines ideas from genetic algorithms and stochastic optimization. It trains a small number of networks in parallel with different learning rates and different numbers of hidden layers. After a small number of training cycles, the error rate of each network is determined on a validation dataset according to an internal validation process. Based on the validation errors, the networks with bad performance are replaced by modified copies of the networks with good performance. The modified copies are generated with different learning rates and different numbers of hidden layers, using probability distributions derived from the successful rates and sizes. The whole process is repeated a number of times, and finally the best network is selected as the optimally trained MLP classifier.
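
The population-style search behind AutoMLP can be illustrated with a short sketch. The following Python code is not the AutoMLP implementation of [4]; it uses scikit-learn's MLPClassifier as a stand-in, with hypothetical pool sizes and parameter ranges, merely to show the train/validate/replace cycle described above.

```python
# Sketch of an AutoMLP-style population search (illustrative, not the original code):
# train a small pool of MLPs with different learning rates and hidden sizes, and
# periodically replace the worst performers with modified copies of the best one.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def automlp_like_search(X, y, pool_size=4, rounds=5, cycles_per_round=20, seed=0):
    rng = np.random.default_rng(seed)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)

    def random_params():
        return dict(hidden_layer_sizes=(int(rng.integers(16, 128)),),
                    learning_rate_init=10 ** rng.uniform(-4, -1))

    pool = [MLPClassifier(max_iter=cycles_per_round, warm_start=True, **random_params())
            for _ in range(pool_size)]

    best = None
    for r in range(rounds):
        scores = []
        for net in pool:
            net.fit(X_tr, y_tr)                 # a few more training cycles (warm start)
            scores.append(net.score(X_val, y_val))
        order = np.argsort(scores)               # ascending: worst first, best last
        best = pool[order[-1]]
        if r == rounds - 1:
            break
        # Replace the worst half with perturbed copies of the best network's settings.
        for idx in order[: pool_size // 2]:
            pool[idx] = MLPClassifier(
                max_iter=cycles_per_round, warm_start=True,
                hidden_layer_sizes=best.hidden_layer_sizes,
                learning_rate_init=best.learning_rate_init * 10 ** rng.uniform(-0.5, 0.5))
    return best
```

In our setting, X would hold the connected-component feature vectors described in the next subsection and y the main-body/side-notes labels.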
3.2 Feature Extraction
As is widely known, adequately extracted, reliable features can leverage the accuracy of the classification step. Representative feature vectors can be high-dimensional; in this work, however, we extract simple yet distinguishable and representative feature vectors. One can notice that the raw shape of a connected component itself incorporates important discriminative information, such as density, for classifying main-body and side-notes text, as shown in Figure 2. The neighborhood of a connected component also plays a salient role toward a perfect classification; Figure 2 shows the surrounding regions of main-body and side-notes components. We refer to a connected component together with its predefined neighborhood as its context.

We used the following features to generate discriminative feature vectors:
Component Shape: For shape feature generation, each connected component is downscaled to a 64 × 64 pixel window if either its width or height is greater than 64 pixels; otherwise it is fit into the center of a 64 × 64 window. This type of rescaling is used in order to exploit the information incorporated in a component's shape with respect to its size.
We utilize four additional characteristics of connected components:

1. Normalized height: the height of a component divided by the height of the input document image.

2. Foreground area: the number of foreground pixels in the rescaled area of a component divided by the total number of pixels in the rescaled area.

3. Relative distance: the relative distance of a connected component from the center of the document.
4. Orientation: the orientation of a connected component is estimated with respect to its neighborhood. The considered neighborhood is calculated as a function of the width and height of the considered component, as we will elaborate later (component context).
The region's orientation is estimated from directional projection profiles computed for 12 angles in steps of 15°, i.e., from −75° to 90°. The profile with robust alternations between peaks and valleys is chosen: we compute a score s for each rotation angle [18], and the angle that corresponds to the profile with the highest score is taken as the final orientation. The score is calculated according to Eq. 1.
$s = \frac{1}{N}\sum_{n=1}^{N}\frac{y_h^{(n)} - y_l^{(n)}}{h^{(n)}}$   (1)

where N is the number of peaks found in the profile, $y_h^{(n)}$ is the value of the nth peak, and $y_l^{(n)}$ is the value of the highest valley around the nth peak. In our case $h^{(n)} = 1$, because our dataset does not contain non-rectangular document images, which was possible in [18]. (A small code sketch of this scoring follows the shape-feature description below.)
Together with these four discrete values, the generated shape-based feature vector is of size 64 × 64 + 4 = 4100.
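
To make Eq. 1 concrete, the following sketch scores a projection profile and picks the best of the 12 candidate angles. It is a minimal illustration under assumptions, not the implementation of [18]: the peak/valley search uses scipy.signal.find_peaks as a stand-in, the rotation uses scipy.ndimage.rotate, and $h^{(n)}$ is fixed to 1 as stated above.

```python
# Minimal sketch of the Eq. 1 orientation score over directional projection profiles.
import numpy as np
from scipy.ndimage import rotate
from scipy.signal import find_peaks

def orientation_score(profile):
    """Eq. 1 with h^(n) = 1: average peak height above the highest neighboring valley."""
    peaks, _ = find_peaks(profile)
    if len(peaks) == 0:
        return 0.0
    score = 0.0
    for i, p in enumerate(peaks):
        left = peaks[i - 1] if i > 0 else 0
        right = peaks[i + 1] if i + 1 < len(peaks) else len(profile) - 1
        # Highest of the two valleys around the peak (approximated by interval minima).
        valley = max(profile[left:p].min() if p > left else profile[p],
                     profile[p:right + 1].min())
        score += profile[p] - valley
    return score / len(peaks)

def estimate_orientation(patch, angles=range(-75, 91, 15)):
    """Rotate the binary patch, project horizontally, keep the best-scoring angle."""
    best_angle, best_score = 0, -np.inf
    for angle in angles:
        rotated = rotate(patch.astype(float), angle, reshape=True, order=0)
        profile = rotated.sum(axis=1)            # horizontal projection profile
        s = orientation_score(profile)
        if s > best_score:
            best_angle, best_score = angle, s
    return best_angle
```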
Component Context: To generate the context-based feature vector, each connected component with its surrounding context area is rescaled to a 64 × 64 window, while the connected component is kept at the center of the window. The considered neighborhood is calculated adaptively as a function of the component's width and height (denoted by w and h, respectively), and is w_factor × w by h_factor × h, where w_factor is always greater than h_factor because of the horizontal nature of Arabic script. w_factor and h_factor were obtained experimentally and equal 5 and 2, respectively. The rescaled main-body and side-notes component contexts are shown in Figure 2. The size of the context-based feature vector is 64 × 64 = 4096.

In this way, the size of the complete shape-based and context-based feature vector is 4100 + 4096 = 8196.
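
A minimal sketch of how the 4100-dimensional shape part and the 4096-dimensional context part could be assembled for one connected component is given below. The bounding-box representation, the scipy-based rescaling, and the normalization of the relative distance are assumptions made for illustration; only the window size (64 × 64), the context factors (5 and 2), and the four scalar features come from the description above.

```python
# Sketch of the 4100-D shape + 4096-D context feature vector for one connected
# component (bbox = (x0, y0, x1, y1) on a binary page image `page`).
import numpy as np
from scipy.ndimage import zoom

WIN = 64
W_FACTOR, H_FACTOR = 5, 2          # context window factors from the paper

def fit_to_window(patch, win=WIN):
    """Downscale if larger than win in either direction, otherwise center in a win x win frame."""
    h, w = patch.shape
    if h > win or w > win:
        scale = win / max(h, w)
        patch = zoom(patch.astype(float), scale, order=0) > 0.5
        patch = patch[:win, :win]
        h, w = patch.shape
    out = np.zeros((win, win), dtype=float)
    top, left = (win - h) // 2, (win - w) // 2
    out[top:top + h, left:left + w] = patch
    return out

def component_features(page, bbox, orientation):
    """orientation: angle estimated as in the previous sketch (scaling left as an assumption)."""
    x0, y0, x1, y1 = bbox
    shape = fit_to_window(page[y0:y1, x0:x1])

    # Context window of roughly 5w x 2h around the component, clipped to the page.
    w, h = x1 - x0, y1 - y0
    cx0 = max(0, x0 - (W_FACTOR - 1) * w // 2)
    cx1 = min(page.shape[1], x1 + (W_FACTOR - 1) * w // 2)
    cy0 = max(0, y0 - (H_FACTOR - 1) * h // 2)
    cy1 = min(page.shape[0], y1 + (H_FACTOR - 1) * h // 2)
    context = fit_to_window(page[cy0:cy1, cx0:cx1])

    norm_height = h / page.shape[0]                      # normalized height
    fg_area = shape.sum() / shape.size                   # foreground area ratio
    page_cy, page_cx = page.shape[0] / 2, page.shape[1] / 2
    rel_dist = np.hypot((y0 + y1) / 2 - page_cy,         # relative distance from page center
                        (x0 + x1) / 2 - page_cx) / np.hypot(page_cy, page_cx)

    return np.concatenate([shape.ravel(),                            # 4096
                           [norm_height, fg_area, rel_dist, orientation],  # + 4
                           context.ravel()])                         # + 4096 = 8196
```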
3.3 Training Dataset
Our dataset consists of 38 document images, some of which were scanned at a private library located in the old city of Jerusalem and others collected from the Islamic manuscripts digitization project at Leipzig University Library [1]. The dataset contains samples from 7 different books. Of the 38 document images, 28 samples were selected as the training set and the remaining 10 were used as the testing set.

Figure 2. Main-body and side-notes connected components with their corresponding shape and context features.
Main-body text and side-notes text are separated and extracted from the original document images to generate the ground truth for the training phase; the same process is applied to the testing set for evaluation purposes. Around 13 thousand main-body text components and 12 thousand side-notes components are used for training the AutoMLP classifier. Segmented images generated by applying the trained MLP classifier are shown in Figure 3(a) and Figure 3(b). It is widely known that generalization is a critical issue when training a model, namely, obtaining a model that can reliably predict the suitable class of a sample that does not appear in the training set. In our case, we use a relatively small number of document images for training, which is still able to show the effectiveness of our approach.
In order to improve the segmentation results, we use a post-processing step based on a relaxation labeling approach, which is described below.
3.4 Relaxation Labeling
We improve the segmentation results by applying nearest-neighbor analysis and using class probabilities to refine the class label of each connected component. For this purpose, a region of 150 × 150 pixels is selected from the document, keeping the target connected component at its center. Several region sizes were tested, and the one that yielded the highest segmentation accuracy (F-measure; discussed in the next section) was chosen (see Figure 4).
Figure 3. (a) and (b) depict the segmentation of two samples before post-processing; (c) and (d) show the corresponding final segmentation.
The probabilities of the connected components within the selected region were already computed during the classification phase. The label of each connected component is updated using the averages of the main-body and side-notes component probabilities within the selected region. To illustrate the effectiveness of the relaxation labeling step, some segmented images are shown in Figure 3(c) and Figure 3(d).
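
A compact sketch of this refinement step is shown below, assuming the per-component class probabilities and component centers are already available; the array layout and the use of component centers to decide window membership are assumptions for illustration.

```python
# Sketch of the relaxation-labeling refinement: each component's label is reset to the
# class with the higher average probability over all components whose centers fall
# inside a 150 x 150 window centered on it.
import numpy as np

def refine_labels(centers, probs, window=150):
    """
    centers: (N, 2) array of component centers (x, y)
    probs:   (N, 2) array of classifier probabilities [p_main_body, p_side_notes]
    returns: (N,) refined labels (0 = main body, 1 = side notes)
    """
    centers = np.asarray(centers, dtype=float)
    probs = np.asarray(probs, dtype=float)
    half = window / 2
    refined = np.empty(len(centers), dtype=int)
    for i, (cx, cy) in enumerate(centers):
        # All components within the square window centered on component i.
        inside = (np.abs(centers[:, 0] - cx) <= half) & (np.abs(centers[:, 1] - cy) <= half)
        mean_probs = probs[inside].mean(axis=0)   # average per-class probability
        refined[i] = int(mean_probs[1] > mean_probs[0])
    return refined
```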
4 Experimental Results
As stated above, our dataset contains 38 document images, from which 10 images from different books were chosen to build the testing set. We test the performance of our approach on images with various writing styles and different layout structures that were not used for training.
Figure 4. Different window sizes and the corresponding side-notes segmentation accuracy estimated by F-measure.

Pixel-level ground truth has been generated by manually assigning the text in the documents of the testing set to one of the two classes, main-body or side-notes text. Several methods to measure segmentation accuracy have been reported in the literature. We evaluate the segmentation accuracy by adopting the F-measure metric, which combines precision and recall into a single scalar. It guarantees that both values are high (conservative), in contrast to the average (tolerant), which does not have this property. For example, when precision and recall both equal one, the average and the F-measure are both one; but if the precision is one and the recall is zero, the average would be 0.5 whereas the F-measure would be zero. Therefore, this measure has been adopted as it reliably reflects the segmentation accuracy. Precision and recall are estimated according to Eq. 2 and Eq. 3, respectively.
$\mathrm{Precision} = \frac{TP}{TP + FP}$   (2)

$\mathrm{Recall} = \frac{TP}{TP + FN}$   (3)
where True Positive (TP), False Positive (FP), and False Negative (FN), with respect to side-notes, are defined as follows:

TP: side-notes text classified as side-notes text.
FP: main-body text classified as side-notes text.
FN: side-notes text classified as main-body text.
Likewise, these metrics can also be defined with respect to main-body text. Once we have the precision and recall counts, the F-measure is calculated according to Eq. 4.

$\text{F-measure} = \frac{(1 + \beta^{2}) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^{2} \cdot \mathrm{Recall} + \mathrm{Precision}}$   (4)
Setting β = 1 places equal emphasis on precision and recall in the F-measure. The F-measure for both main-body and side-notes text with different post-processing window sizes is shown in Table 1; note that the optimal window size is 150.
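
For completeness, a small sketch of the pixel-level evaluation with respect to the side-notes class (Eqs. 2-4, with β = 1 as used here) is given below; the boolean-mask representation of ground truth and prediction is an assumption.

```python
# Pixel-level evaluation sketch for the side-notes class (Eqs. 2-4, beta = 1).
# gt and pred are boolean arrays of identical shape, True where a pixel belongs
# to side-notes text.
import numpy as np

def f_measure(gt, pred, beta=1.0):
    tp = np.logical_and(pred, gt).sum()     # side-notes predicted as side-notes
    fp = np.logical_and(pred, ~gt).sum()    # main-body predicted as side-notes
    fn = np.logical_and(~pred, gt).sum()    # side-notes predicted as main-body
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0 and recall == 0:
        return 0.0, precision, recall
    f = (1 + beta**2) * precision * recall / (beta**2 * recall + precision)
    return f, precision, recall
```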
References

Texture Analysis Methods - A Review
Text line segmentation of historical documents: a survey
Segmentation of Page Images Using the Area Voronoi Diagram
Document structure analysis algorithms: a literature survey
Page segmentation using texture analysis