Automatic localization of page segmentation errors
Summary (2 min read)
1. INTRODUCTION
- The performance of OCR depends critically on the success of the page segmentation algorithm.
- Most segmentation algorithms perform satisfactorily but tend to fail on some specific regions or some specific pages.
- The primary objective of this work is to automatically locate segmentation errors with very high accuracy.
- A further objective is to locate these errors without the help of ground truth.
2. PAGE SEGMENTATION ERRORS
- A large number of document segmentation algorithms are available in the literature.
- Most of them suffer from one type of page segmentation error or another.
- Let S and G be the set of lines denoting segmentation output and ground truth respectively.
- The authors then locate the errors by classifying each line as correct, over-segmented, under-segmented, a false alarm, or a missing component.
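The line classification above can be sketched with a minimal overlap-matching scheme. This is a hypothetical illustration, not the authors' actual matching rule: lines are modeled as vertical (top, bottom) intervals, and a segmented line is matched to a ground-truth line when their overlap exceeds a threshold.

```python
def classify_lines(seg, gt, min_overlap=0.5):
    """Label each segmented line by matching it against ground truth.

    seg, gt: lists of (top, bottom) vertical extents. Hypothetical scheme:
    a seg line matching exactly one gt line (and vice versa) is correct;
    one seg line covering several gt lines is under-segmented;
    several seg lines covering one gt line are over-segmented;
    a seg line matching nothing is a false alarm;
    a gt line matched by nothing is a missing component.
    """
    def overlap(a, b):
        # Intersection length relative to the shorter of the two lines.
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        return inter / max(1e-9, min(a[1] - a[0], b[1] - b[0]))

    seg_matches = [[j for j, g in enumerate(gt) if overlap(s, g) >= min_overlap]
                   for s in seg]
    gt_matches = [[i for i, s in enumerate(seg) if overlap(s, g) >= min_overlap]
                  for g in gt]

    labels = []
    for i, m in enumerate(seg_matches):
        if len(m) == 0:
            labels.append("false_alarm")
        elif len(m) > 1:
            labels.append("under_segmented")   # one seg line spans many gt lines
        elif len(gt_matches[m[0]]) > 1:
            labels.append("over_segmented")    # one gt line split across seg lines
        else:
            labels.append("correct")
    missing = [j for j, m in enumerate(gt_matches) if len(m) == 0]
    return labels, missing
```

For example, a segmented line spanning two ground-truth lines is labeled under-segmented, while an unmatched ground-truth line is reported as a missing component.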
3. THE PROBLEM OF LOCATING PAGE SEGMENTATION ERRORS
- Existing page segmentation algorithms often fail on some specific pages or some specific regions of a page.
- Once segmentation errors are localized, they can be corrected by human intervention or by an alternative algorithm with tuned parameters.
- In the learning phase, line-level features are computed for each line of the training document images.
- The authors achieve efficiency with a two-stage cascade: in stage 1, a set of simple features classifies pages as correct or incorrect, and in stage 2, computationally more expensive line-level features are computed only for the pages that stage 1 classified as incorrect.
- To evaluate the performance of their system the authors first locate all the errors using ground truth as in [14].
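The two-stage cascade can be sketched as follows. The classifiers and feature extractors here are placeholders passed in as arguments, not the authors' actual implementation; the point is the control flow: cheap page-level features gate the expensive line-level analysis.

```python
def localize_errors(pages, stage1_clf, stage2_clf,
                    page_features, line_features):
    """Hypothetical two-stage error localization cascade.

    pages: dict of page_id -> page (each page has a "lines" list).
    stage1_clf: classifies page-level features as "correct"/"incorrect".
    stage2_clf: classifies line-level features into error types.
    """
    results = {}
    for page_id, page in pages.items():
        # Stage 1: cheap page-level features decide correct vs. incorrect.
        if stage1_clf(page_features(page)) == "correct":
            results[page_id] = []  # no errors reported for this page
            continue
        # Stage 2: costly line-level features, only for flagged pages.
        results[page_id] = [
            (i, stage2_clf(line_features(line)))
            for i, line in enumerate(page["lines"])
        ]
    return results
```

Since most pages in a large collection are segmented correctly, this cascade skips the expensive stage-2 computation for the majority of pages.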
3.1 Features
- The authors observe that (1) most characters on a page are of the same size, font, and style, (2) line spacing within a document is mostly uniform, (3) a page is formatted uniformly within a book, and (4) two nearby lines in a document are mostly of the same height.
- The features used to classify a segmented page as correct or incorrect (stage-1 classification) include f1, the maximum line height.
- Further features are the maximum difference in line heights and the maximum line gap.
- To identify such cases, the authors compute the maximum word gap in a line and use it as feature f4; f5 is the maximum area of a connected component in a line.
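A rough sketch of computing such page-level features from bounding boxes is given below. The feature names follow this summary; the exact definitions in the paper may differ, and the input layout (lines as dicts of (x, y, w, h) boxes) is an assumption.

```python
def max_word_gap(words):
    """Largest horizontal gap between consecutive word boxes in one line.

    words: list of (x, y, w, h) boxes, in any order.
    """
    xs = sorted(words, key=lambda b: b[0])
    return max((b[0] - (a[0] + a[2]) for a, b in zip(xs, xs[1:])), default=0)

def page_features(lines):
    """lines: list of dicts with 'bbox' (x, y, w, h), 'words' (word boxes),
    and 'components' (connected-component boxes). All keys are hypothetical.
    """
    heights = [l["bbox"][3] for l in lines]
    tops = sorted(l["bbox"][1] for l in lines)
    line_gaps = [b - a for a, b in zip(tops, tops[1:])] or [0]
    return {
        "f1_max_line_height": max(heights),
        "max_height_diff": max(heights) - min(heights),
        "max_line_gap": max(line_gaps),
        "f4_max_word_gap": max(max_word_gap(l["words"]) for l in lines),
        "f5_max_cc_area": max(c[2] * c[3]
                              for l in lines for c in l["components"]),
    }
```

An unusually large line height or connected-component area relative to the rest of the page is a cheap signal that two lines may have been merged.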
4.1 About Dataset
- The authors use a dataset [7] of 109 books in four prominent south Indian languages for all their experiments.
- Table 4 gives the details of the dataset.
- This dataset contains pages scanned in 600 dpi.
- Segmentation of Indian language document pages is a challenging task, mainly due to (1) the presence of dangling modifiers and (2) the fact that the relative positions of neighbouring characters are not fixed.
- In phase 2, the authors learn ground-truth-based error localization for the training images.
4.2 Error localization using ground truth
- The authors first run the segmentation algorithm on all the pages.
- Further, if all the lines in a page are correctly segmented the authors tag that page as correct.
- Table 2 summarizes the line-level segmentation errors caused by the segmentation algorithm.
- This is a very large number of errors, considering the huge size of the dataset.
- The authors' aim is to localize these errors automatically, i.e., without using ground truth.
4.3 Automatic error localization
- For a given segmented page, the authors determine whether the page is correctly segmented or not.
- Thus the authors define a set of performance measures using confusion matrix.
- The authors see that they are able to classify correct page as correct and incorrect page as incorrect with more than 89% accuracy.
- For line-level error localization, the authors classify each segmented line as correct or as one of the segmentation error types.
- To measure the performance of line level error localization the authors define a performance metric using confusion matrix.
5. CONCLUSIONS
- The authors address the problem of localizing page segmentation errors.
- The proposed scheme is able to locate segmentation errors without ground truth with high accuracy.
- Such error localization is very important for segmentation error correction, which can be done either by manual intervention or by running alternative segmentation algorithms on the error-localized parts.
- Further, the proposed error localization scheme is independent of segmentation algorithms.
- Future direction of this work is to design segmentation postprocessor to automatically correct page segmentation errors.
Frequently Asked Questions (8)
Q2. How do the authors find errors in the segmentation algorithm?
With the help of ground truth, the authors locate all the segmentation errors and store the line coordinates along with the corresponding error type in a database.
Q3. How many books were used for training?
The authors used 109 books in four prominent south Indian languages for their experiments, where a randomly selected half of the pages was used for training and the remaining half for testing.
Q4. What is the problem with segmentation of Indian language document pages?
Segmentation of Indian language document pages is a challenging task, mainly due to (1) the presence of dangling modifiers and (2) the fact that the relative positions of neighbouring characters are not fixed.
Q5. What is the main reason for page segmentation algorithms to fail?
The main reason for such failures is that these algorithms are heavily dependent on parameters and thus fail to adapt dynamically to a given page.
Q6. How do the authors determine if a page is correct?
For training images the authors compute page level features as described in Section 3 and learn the correctness of the page from ground truth, and for testing images the authors use k-NN based classifier to decide whether the page is correct or not.
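The k-NN decision described in Q6 can be sketched with a minimal hand-rolled classifier over page feature vectors. The training pairs and labels here are hypothetical; the paper's actual distance metric and k are not specified in this summary.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training examples.

    train: list of (feature_vector, label) pairs built from the
    ground-truth tagging of training pages; query: a feature vector.
    """
    nearest = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

A test page's feature vector is thus labeled "correct" or "incorrect" according to the majority label among its nearest training pages.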
Q7. How does the algorithm perform in Indian languages?
In [9], the authors experimentally show that many well-known segmentation algorithms perform worse on Indian languages than on English, which also makes automatic error localization an important task.
Q8. How do the authors determine the error localization accuracy?
Then the authors define the overall error localization accuracy at line level as

ρ_l = ( ∑_{i=1}^{4} C_{ii} / ∑_{i=1}^{4} ∑_{j=0}^{4} C_{ij} ) × 100

This measure gives the percentage of segmentation errors that the authors are able to detect automatically.
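The accuracy measure from Q8 is straightforward to compute from the confusion matrix. In the sketch below (an assumption about the indexing, consistent with the formula: rows i = 1..4 are the true error types, columns j = 0..4 the predicted classes, with 0 meaning correct), C is a dict of dicts of counts.

```python
def rho_l(C):
    """Overall line-level error localization accuracy.

    C[i][j]: number of lines of true class i predicted as class j,
    for true classes i in 1..4 (error types) and predicted classes
    j in 0..4 (0 = correct).
    """
    detected = sum(C[i][i] for i in range(1, 5))          # diagonal: hits
    total = sum(C[i][j] for i in range(1, 5) for j in range(5))
    return 100.0 * detected / total
```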