
Layout Analysis for Arabic Historical Document Images Using Machine Learning

Syed Saqib Bukhari, Thomas M. Breuel
Technical University of Kaiserslautern, Germany
bukhari@informatik.uni-kl.de, tmb@informatik.uni-kl.de

Abedelkadir Asi, Jihad El-Sana
Ben-Gurion University of the Negev, Israel
abedas@cs.bgu.ac.il, el-sana@cs.bgu.ac.il

2012 International Conference on Frontiers in Handwriting Recognition (ICFHR), DOI 10.1109/ICFHR.2012.227
Abstract
Page layout analysis is a fundamental step of any document image understanding system. We introduce an approach that segments text appearing in page margins (a.k.a. side-notes text) from manuscripts with complex layout formats. Simple and discriminative features are extracted at the connected-component level, and robust feature vectors are subsequently generated. A multilayer perceptron classifier is exploited to classify connected components into the relevant class of text. A voting scheme is then applied to refine the resulting segmentation and produce the final classification. In contrast to state-of-the-art segmentation approaches, this method is independent of block segmentation as well as pixel-level analysis. The proposed method has been trained and tested on a dataset that contains a variety of complex side-notes layout formats, achieving a segmentation accuracy of about 95%.
1 Introduction
Manually copying a manuscript was the ultimate way to spread knowledge before printing houses were established. Scholars added their own notes on page margins mainly because paper was an expensive material. Historians value both the content of these notes and the role of their layout; the notes have become an important reference in their own right. Hence, analyzing this content became an inevitable step toward reliable manuscript authentication [11], which would subsequently shed light on a manuscript's temporal and geographical origin.
These authors contributed equally.
Figure 1. Arabic historical document image with complex layout formatting due to side-notes text.
The physical structure of handwritten historical manuscripts imposes a variety of challenges for any page layout analysis system. Due to looser formatting rules, non-rectangular layouts, and irregularities in the location of layout entities [2, 11], layout analysis of handwritten ancient documents has become a challenging research problem. In contrast to algorithms that cope with modern machine-printed documents or historical documents from the hand-press period, algorithms for handwritten ancient documents are required to cope with the above challenges.
Page layout analysis is a fundamental step of any document image understanding system. The analysis process consists of two main steps: page decomposition and block classification. Page decomposition segments a document image into homogeneous regions, and the classification step attempts to determine whether each segmented region is text, a picture, or a drawing.
The text regions are later fed into a recognition system, such as an Optical Character Recognition (OCR) engine, to retrieve the letters and words that correspond to the characters in the manuscript.
In this paper, we introduce an approach that segments side-notes text from manuscripts with complex layout formatting (see Figure 1). It extracts and generates feature vectors at the connected-component level. A multi-layer perceptron classifier, which has already been used for page layout analysis by Jain and Zhong [9], is exploited to classify connected components into the relevant classes of text. A voting step is then applied to refine the resulting segmentation and produce the final classification. The suggested approach is independent of block segmentation as well as pixel-level analysis.
In the rest of the paper, we review previous work, present our approach in detail, report experimental results, and finally conclude and suggest directions for future work.
2 Related Work
Due to the challenges posed by handwritten historical documents [2], traditional page layout analysis methods, which usually address machine-printed documents, are not directly applicable. Methods for page layout analysis can be roughly categorized into three major classes: bottom-up, top-down, and hybrid methods [12, 15, 7]. In top-down methods, the document image is divided into regions which are classified and refined according to pre-defined criteria. Bottom-up approaches group basic image elements, such as pixels and connected components, to create larger homogeneous regions. Hybrid schemes exploit the advantages of top-down and bottom-up approaches to yield better results.
Recently, Garz et al. [8] introduced a binarization-free approach that employs the Scale Invariant Feature Transform (SIFT) to analyze the layout of handwritten ancient documents. The method performs a part-based detection of layout entities locally, using a multi-stage algorithm to localize the entities based on interest points; a Support Vector Machine (SVM) is used to discriminate the considered classes. Kise et al. [10] introduced a page segmentation method for documents with non-Manhattan layouts. Their method is based on connected-component analysis and exploits the area Voronoi diagram to segment the page. Bukhari et al. [5] presented an algorithm for segmenting printed document images into text and non-text regions. They examined the document at the level of connected components and introduced a self-tunable training model (AutoMLP) for distinguishing between text and non-text components; connected-component shape and context were utilized to generate feature vectors. Moll et al. [14] suggested an algorithm that classifies individual pixels and applied it to handwritten, machine-printed, and photographed document images. Pixel-based classification approaches are time-consuming in comparison to block-based and component-based approaches.
Page layout analysis has also been posed as a texture segmentation problem in the literature; for texture-based approaches, see the reviews in [13, 16]. Jain and Zhong [9] suggested a texture-based, language-free algorithm for machine-printed document images. A neural network was employed to train a set of masks designed to be robust and distinctive, and texture features were obtained by convolving the trained masks with the input image. Shape and textural image properties motivated the work introduced by Bloomberg [3], in which standard and generalized (multi-resolution) morphological operations were used. Later on, Bukhari et al. [6] generalized Bloomberg's text/image segmentation algorithm to separate text from non-text components, including halftones, drawings, graphs, maps, etc. The approach by Won [19] combines a block-based algorithm and a pixel-based algorithm to segment a document image into text and image regions.
Ouwayed et al. [17] suggested an approach to segment multi-oriented handwritten documents into text lines. Their method addresses documents with complex layout structure. They subdivided the image into rectangular cells and estimated the text orientation in each cell using projection profiles; cells are then merged into larger zones with respect to their orientation, and the Wigner-Ville distribution is exploited to estimate the orientation within the large zones. This method cannot always yield accurate segmentation results, due to some assumptions adopted by the authors. When a window contains several writings in different orientations, the authors assumed that the border between the two types of writing could be detected by finding the minimum index in the projection profile in order to refine the cell subdivision. However, this border is not always obvious, and detecting the minimum index in the projection profile becomes a real challenge when side-notes are written in a flexible writing style (see Figure 1). One can also notice that the robustness of this approach could be negatively affected once the side-notes text has the same orientation as the main-body text and the two types of text have no salient space between them. In this case, the method would not distinguish between the
two coinciding regions, and erroneous text lines would be extracted.
3 Method
Conventional methods for geometric layout analysis could be an adequate choice for the side-notes segmentation problem when main-body and side-notes text have salient and differentiable geometric properties, such as text orientation, text size, or white-space locations. However, layout rules did not necessarily guide the scribes of ancient manuscripts; as a result, document images with complex layouts became common. These documents exhibit non-uniform and/or similar geometric properties for both main-body and side-notes text, a fact that makes developing a method which can gracefully cope with this type of document a challenging task.
Our approach utilizes a machine learning technique to meet the challenges of this problem. In general, classifier tuning is a hard problem with respect to the optimization of sensitive parameters, e.g., the learning parameters C and gamma of an SVM classifier.
Here, we use an MLP classifier for segmenting side-notes from main-body text in complex Arabic documents. This approach is based on previous work by Bukhari et al. [5]. The main reason for using an MLP classifier over others is that it achieves good classification accuracy once it is adequately trained, and it scales well. However, a major difficulty in its use has been the requirement for manual inspection during the training process: MLPs are hard to train because their performance is sensitive to the chosen parameter values, and the optimal parameter values depend heavily on the considered dataset. The parameter optimization problem of MLPs could be solved by using grid search for classifier training, but grid search is a slow process. Therefore, to overcome this problem, we use AutoMLP [4], a self-tuning classifier that can automatically adjust its learning parameters.
3.1 AutoMLP Classifier
AutoMLP combines ideas from genetic algorithms and stochastic optimization. It trains a small number of networks in parallel with different learning rates and different numbers of hidden layers. After a small number of training cycles, the error rate of each network is determined on a validation dataset according to an internal validation process. Based on the validation errors, the networks with bad performance are replaced by modified copies of the networks with good performance. The modified copies are generated with different learning rates and different numbers of hidden layers, using probability distributions derived from the successful rates and sizes. The whole process is repeated a number of times, and finally the best network is selected as the optimally trained MLP classifier.
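
The population-style search behind AutoMLP can be illustrated with a short sketch. The following Python code is not the AutoMLP implementation of [4]; it uses scikit-learn's MLPClassifier as a stand-in, with hypothetical pool sizes and parameter ranges, merely to show the train/validate/replace cycle described above.

```python
# Sketch of an AutoMLP-style population search (illustrative, not the original code):
# train a small pool of MLPs with different learning rates and hidden sizes, and
# periodically replace the worst performers with modified copies of the best one.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def automlp_like_search(X, y, pool_size=4, rounds=5, cycles_per_round=20, seed=0):
    rng = np.random.default_rng(seed)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)

    def random_params():
        return dict(hidden_layer_sizes=(int(rng.integers(16, 128)),),
                    learning_rate_init=10 ** rng.uniform(-4, -1))

    pool = [MLPClassifier(max_iter=cycles_per_round, warm_start=True, **random_params())
            for _ in range(pool_size)]

    best = None
    for r in range(rounds):
        scores = []
        for net in pool:
            net.fit(X_tr, y_tr)                 # a few more training cycles (warm start)
            scores.append(net.score(X_val, y_val))
        order = np.argsort(scores)               # ascending: worst first, best last
        best = pool[order[-1]]
        if r == rounds - 1:
            break
        # Replace the worst half with perturbed copies of the best network's settings.
        for idx in order[: pool_size // 2]:
            pool[idx] = MLPClassifier(
                max_iter=cycles_per_round, warm_start=True,
                hidden_layer_sizes=best.hidden_layer_sizes,
                learning_rate_init=best.learning_rate_init * 10 ** rng.uniform(-0.5, 0.5))
    return best
```

In our setting, X would hold the connected-component feature vectors described in the next subsection and y the main-body/side-notes labels.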
3.2 Feature Extraction
As is widely known, adequately extracted, reliable features can leverage the accuracy of the classification step. Representative feature vectors can be high-dimensional; in this work, however, we extract simple yet distinguishable and representative feature vectors. One can notice that the raw shape of a connected component itself incorporates important discriminative information, such as density, for classifying main-body and side-notes text, as shown in Figure 2. The neighborhood of a connected component also plays a salient role toward a perfect classification; Figure 2 shows the surrounding regions of main-body and side-notes components. We refer to a connected component together with its predefined neighborhood as its context.

We used the following features to generate discriminative feature vectors:
Component Shape: For shape feature generation, each connected component is downscaled to a 64 × 64 pixel window if either its width or height is greater than 64 pixels; otherwise it is fit into the center of a 64 × 64 window. This type of rescaling is used in order to exploit the information incorporated in a component's shape with respect to its size.
We utilize four additional characteristics of connected components:

1. Normalized height: the height of a component divided by the height of the input document image.

2. Foreground area: the number of foreground pixels in the rescaled area of a component divided by the total number of pixels in the rescaled area.

3. Relative distance: the relative distance of a connected component from the center of the document.
4. Orientation: the orientation of a connected component is estimated with respect to its neighborhood. The considered neighborhood is calculated as a function of the width and height of the considered component, as we will elaborate later (component context).
The region's orientation is estimated from directional projection profiles computed for 12 angles in steps of 15°, i.e., from −75° to 90°. The profile with robust alternations between peaks and valleys is chosen: we compute a score s for each rotation angle [18], and the angle that corresponds to the profile with the highest score is taken as the final orientation. The score is calculated according to Eq. 1.
$s = \frac{1}{N}\sum_{n=1}^{N}\frac{y_h^{(n)} - y_l^{(n)}}{h^{(n)}}$   (1)

where N is the number of peaks found in the profile, $y_h^{(n)}$ is the value of the nth peak, and $y_l^{(n)}$ is the value of the highest valley around the nth peak. In our case $h^{(n)} = 1$, because our dataset does not contain non-rectangular document images, which was possible in [18]. (A small code sketch of this scoring follows the shape-feature description below.)
Together with these four discrete values, the generated shape-based feature vector is of size 64 × 64 + 4 = 4100.
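
To make Eq. 1 concrete, the following sketch scores a projection profile and picks the best of the 12 candidate angles. It is a minimal illustration under assumptions, not the implementation of [18]: the peak/valley search uses scipy.signal.find_peaks as a stand-in, the rotation uses scipy.ndimage.rotate, and $h^{(n)}$ is fixed to 1 as stated above.

```python
# Minimal sketch of the Eq. 1 orientation score over directional projection profiles.
import numpy as np
from scipy.ndimage import rotate
from scipy.signal import find_peaks

def orientation_score(profile):
    """Eq. 1 with h^(n) = 1: average peak height above the highest neighboring valley."""
    peaks, _ = find_peaks(profile)
    if len(peaks) == 0:
        return 0.0
    score = 0.0
    for i, p in enumerate(peaks):
        left = peaks[i - 1] if i > 0 else 0
        right = peaks[i + 1] if i + 1 < len(peaks) else len(profile) - 1
        # Highest of the two valleys around the peak (approximated by interval minima).
        valley = max(profile[left:p].min() if p > left else profile[p],
                     profile[p:right + 1].min())
        score += profile[p] - valley
    return score / len(peaks)

def estimate_orientation(patch, angles=range(-75, 91, 15)):
    """Rotate the binary patch, project horizontally, keep the best-scoring angle."""
    best_angle, best_score = 0, -np.inf
    for angle in angles:
        rotated = rotate(patch.astype(float), angle, reshape=True, order=0)
        profile = rotated.sum(axis=1)            # horizontal projection profile
        s = orientation_score(profile)
        if s > best_score:
            best_angle, best_score = angle, s
    return best_angle
```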
Component Context: To generate the context-based feature vector, each connected component with its surrounding context area is rescaled to a 64 × 64 window, while the connected component is kept at the center of the window. The considered neighborhood is calculated adaptively as a function of the component's width and height (denoted by w and h, respectively), and is w_factor × w by h_factor × h, where w_factor is always greater than h_factor because of the horizontal nature of Arabic script. w_factor and h_factor were obtained experimentally and equal 5 and 2, respectively. The rescaled main-body and side-notes component contexts are shown in Figure 2. The size of the context-based feature vector is 64 × 64 = 4096.

In this way, the size of the complete shape-based and context-based feature vector is 4100 + 4096 = 8196.
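
A minimal sketch of how the 4100-dimensional shape part and the 4096-dimensional context part could be assembled for one connected component is given below. The bounding-box representation, the scipy-based rescaling, and the normalization of the relative distance are assumptions made for illustration; only the window size (64 × 64), the context factors (5 and 2), and the four scalar features come from the description above.

```python
# Sketch of the 4100-D shape + 4096-D context feature vector for one connected
# component (bbox = (x0, y0, x1, y1) on a binary page image `page`).
import numpy as np
from scipy.ndimage import zoom

WIN = 64
W_FACTOR, H_FACTOR = 5, 2          # context window factors from the paper

def fit_to_window(patch, win=WIN):
    """Downscale if larger than win in either direction, otherwise center in a win x win frame."""
    h, w = patch.shape
    if h > win or w > win:
        scale = win / max(h, w)
        patch = zoom(patch.astype(float), scale, order=0) > 0.5
        patch = patch[:win, :win]
        h, w = patch.shape
    out = np.zeros((win, win), dtype=float)
    top, left = (win - h) // 2, (win - w) // 2
    out[top:top + h, left:left + w] = patch
    return out

def component_features(page, bbox, orientation):
    """orientation: angle estimated as in the previous sketch (scaling left as an assumption)."""
    x0, y0, x1, y1 = bbox
    shape = fit_to_window(page[y0:y1, x0:x1])

    # Context window of roughly 5w x 2h around the component, clipped to the page.
    w, h = x1 - x0, y1 - y0
    cx0 = max(0, x0 - (W_FACTOR - 1) * w // 2)
    cx1 = min(page.shape[1], x1 + (W_FACTOR - 1) * w // 2)
    cy0 = max(0, y0 - (H_FACTOR - 1) * h // 2)
    cy1 = min(page.shape[0], y1 + (H_FACTOR - 1) * h // 2)
    context = fit_to_window(page[cy0:cy1, cx0:cx1])

    norm_height = h / page.shape[0]                      # normalized height
    fg_area = shape.sum() / shape.size                   # foreground area ratio
    page_cy, page_cx = page.shape[0] / 2, page.shape[1] / 2
    rel_dist = np.hypot((y0 + y1) / 2 - page_cy,         # relative distance from page center
                        (x0 + x1) / 2 - page_cx) / np.hypot(page_cy, page_cx)

    return np.concatenate([shape.ravel(),                            # 4096
                           [norm_height, fg_area, rel_dist, orientation],  # + 4
                           context.ravel()])                         # + 4096 = 8196
```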
3.3 Training Dataset
Our dataset consists of 38 document images, some of which were scanned at a private library located in the old city of Jerusalem and others collected from the Islamic manuscripts digitization project at Leipzig University Library [1]. The dataset contains samples from 7 different books. Of the 38 document images, 28 samples were selected as the training set and the remaining 10 were used as the testing set.

Figure 2. Main-body and side-notes connected components with their corresponding shape and context features.
Main-body text and side-notes text are separated and extracted from the original document images to generate the ground truth for the training phase; the same process is applied to the testing set for evaluation purposes. Around 13 thousand main-body text components and 12 thousand side-notes components are used for training the AutoMLP classifier. Segmented images generated by applying the trained MLP classifier are shown in Figure 3(a) and Figure 3(b). It is widely known that generalization is a critical issue when training a model, namely, obtaining a model that can reliably predict the suitable class of a sample that does not appear in the training set. In our case, we use a relatively small number of document images for training, which is still able to show the effectiveness of our approach.
In order to improve the segmentation results, we use a post-processing step based on a relaxation labeling approach, which is described below.
3.4 Relaxation Labeling
We improve the segmentation results by applying nearest-neighbor analysis and using class probabilities to refine the class label of each connected component. For this purpose, a region of 150 × 150 pixels is selected from the document, keeping the target connected component at its center. Several region sizes were tested, and the one that yielded the highest segmentation accuracy (F-measure; discussed in the next section) was chosen (see Figure 4).
Figure 3. (a) and (b) depict the segmentation of two samples before post-processing; (c) and (d) show the corresponding final segmentation.
The probabilities of the connected components within the selected region were already computed during the classification phase. The label of each connected component is updated using the averages of the main-body and side-notes component probabilities within the selected region. To illustrate the effectiveness of the relaxation labeling step, some segmented images are shown in Figure 3(c) and Figure 3(d).
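
A compact sketch of this refinement step is shown below, assuming the per-component class probabilities and component centers are already available; the array layout and the use of component centers to decide window membership are assumptions for illustration.

```python
# Sketch of the relaxation-labeling refinement: each component's label is reset to the
# class with the higher average probability over all components whose centers fall
# inside a 150 x 150 window centered on it.
import numpy as np

def refine_labels(centers, probs, window=150):
    """
    centers: (N, 2) array of component centers (x, y)
    probs:   (N, 2) array of classifier probabilities [p_main_body, p_side_notes]
    returns: (N,) refined labels (0 = main body, 1 = side notes)
    """
    centers = np.asarray(centers, dtype=float)
    probs = np.asarray(probs, dtype=float)
    half = window / 2
    refined = np.empty(len(centers), dtype=int)
    for i, (cx, cy) in enumerate(centers):
        # All components within the square window centered on component i.
        inside = (np.abs(centers[:, 0] - cx) <= half) & (np.abs(centers[:, 1] - cy) <= half)
        mean_probs = probs[inside].mean(axis=0)   # average per-class probability
        refined[i] = int(mean_probs[1] > mean_probs[0])
    return refined
```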
4 Experimental Results
As stated above, our dataset contains 38 document images, from which 10 images from different books were chosen to build the testing set. We test the performance of our approach on images with various writing styles and different layout structures that were not used for training.
Figure 4. Different window sizes and the corresponding side-notes segmentation accuracy estimated by F-measure.

Pixel-level ground truth has been generated by manually assigning the text in the documents of the testing set to one of the two classes, main-body or side-notes text. Several methods to measure segmentation accuracy have been reported in the literature. We evaluate the segmentation accuracy by adopting the F-measure metric, which combines precision and recall into a single scalar. It guarantees that both values are high (conservative), in contrast to the average (tolerant), which does not have this property. For example, when precision and recall both equal one, the average and the F-measure are both one; but if the precision is one and the recall is zero, the average would be 0.5 whereas the F-measure would be zero. Therefore, this measure has been adopted as it reliably reflects the segmentation accuracy. Precision and recall are estimated according to Eq. 2 and Eq. 3, respectively.
$\mathrm{Precision} = \frac{TP}{TP + FP}$   (2)

$\mathrm{Recall} = \frac{TP}{TP + FN}$   (3)
where True Positive (TP), False Positive (FP), and False Negative (FN), with respect to side-notes, are defined as follows:

TP: side-notes text classified as side-notes text.
FP: main-body text classified as side-notes text.
FN: side-notes text classified as main-body text.
Likewise, these metrics can also be defined with respect to main-body text. Once we have the precision and recall counts, the F-measure is calculated according to Eq. 4.

$\text{F-measure} = \frac{(1 + \beta^{2}) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^{2} \cdot \mathrm{Recall} + \mathrm{Precision}}$   (4)
Setting β = 1 places equal emphasis on precision and recall in the F-measure. The F-measure for both main-body and side-notes text with different post-processing window sizes is shown in Table 1; note that the optimal window size is 150.
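
For completeness, a small sketch of the pixel-level evaluation with respect to the side-notes class (Eqs. 2-4, with β = 1 as used here) is given below; the boolean-mask representation of ground truth and prediction is an assumption.

```python
# Pixel-level evaluation sketch for the side-notes class (Eqs. 2-4, beta = 1).
# gt and pred are boolean arrays of identical shape, True where a pixel belongs
# to side-notes text.
import numpy as np

def f_measure(gt, pred, beta=1.0):
    tp = np.logical_and(pred, gt).sum()     # side-notes predicted as side-notes
    fp = np.logical_and(pred, ~gt).sum()    # main-body predicted as side-notes
    fn = np.logical_and(~pred, gt).sum()    # side-notes predicted as main-body
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0 and recall == 0:
        return 0.0, precision, recall
    f = (1 + beta**2) * precision * recall / (beta**2 * recall + precision)
    return f, precision, recall
```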
References

Texture Analysis Methods - A Review
Text line segmentation of historical documents: a survey
Segmentation of Page Images Using the Area Voronoi Diagram
Document structure analysis algorithms: a literature survey
Page segmentation using texture analysis