
Showing papers by "Faisal Shafait" published in 2008


Proceedings ArticleDOI
27 Jan 2008
TL;DR: A fast adaptive binarization algorithm that yields the same quality of binarization as the Sauvola method but runs in time close to that of global thresholding methods (like Otsu's method), independent of the window size.
Abstract: Adaptive binarization is an important first step in many document analysis and OCR processes. This paper describes a fast adaptive binarization algorithm that yields the same quality of binarization as the Sauvola method, but runs in time close to that of global thresholding methods (like Otsu's method), independent of the window size. The algorithm combines the statistical constraints of Sauvola's method with integral images. Testing on the UW-1 dataset demonstrates a 20-fold speedup compared to the original Sauvola algorithm.
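To make the idea concrete, here is a minimal sketch of Sauvola-style thresholding (T = m·(1 + k·(s/R − 1)), with local mean m and standard deviation s) computed from integral images so that the per-pixel cost does not depend on the window size. The function name and the parameter defaults (k = 0.3, R = 128) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sauvola_integral(gray, window=25, k=0.3, R=128.0):
    """Sauvola-style adaptive thresholding using integral images.

    A minimal sketch: local mean and standard deviation are obtained in
    O(1) per pixel from integral images of the image and of its square,
    so the running time is independent of the window size.
    """
    img = gray.astype(np.float64)
    h, w = img.shape
    # Integral images, padded with a leading zero row/column so that
    # window sums can be taken with four lookups.
    ii = np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    ii2 = np.pad(img ** 2, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

    r = window // 2
    ys, xs = np.mgrid[0:h, 0:w]
    y0, y1 = np.clip(ys - r, 0, h), np.clip(ys + r + 1, 0, h)
    x0, x1 = np.clip(xs - r, 0, w), np.clip(xs + r + 1, 0, w)
    area = (y1 - y0) * (x1 - x0)

    s1 = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
    s2 = ii2[y1, x1] - ii2[y0, x1] - ii2[y1, x0] + ii2[y0, x0]
    mean = s1 / area
    std = np.sqrt(np.maximum(s2 / area - mean ** 2, 0.0))

    threshold = mean * (1.0 + k * (std / R - 1.0))
    # Text (pixels at or below the threshold) becomes 0, background 255.
    return (img > threshold).astype(np.uint8) * 255
```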

317 citations


Journal ArticleDOI
TL;DR: A vectorial score that is sensitive to, and identifies, the most important classes of segmentation errors (over-, under-, and mis-segmentation) and what page components (lines, blocks, etc.) are affected.
Abstract: Informative benchmarks are crucial for optimizing the page segmentation step of an OCR system, frequently the performance-limiting step for overall OCR system performance. We show that current evaluation scores are insufficient for diagnosing specific errors in page segmentation and fail to identify some classes of serious segmentation errors altogether. This paper introduces a vectorial score that is sensitive to, and identifies, the most important classes of segmentation errors (over-, under-, and mis-segmentation) and what page components (lines, blocks, etc.) are affected. Unlike previous schemes, our evaluation method has a canonical representation of ground-truth data and guarantees pixel-accurate evaluation results for arbitrary region shapes. We present the results of evaluating widely used segmentation algorithms (x-y cut, smearing, whitespace analysis, constrained text-line finding, docstrum, and Voronoi) on the UW-III database and demonstrate that the new evaluation scheme permits the identification of several specific flaws in individual segmentation methods.
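As a rough illustration of pixel-accurate over- and under-segmentation counting (a much simplified stand-in for the vectorial score itself, with all names and the overlap threshold being assumptions of this sketch), consider two label images of the same page, one for the ground truth and one for the computed segmentation:

```python
import numpy as np
from collections import defaultdict

def over_under_segmentation(gt_labels, seg_labels, thresh=0.1):
    """Count over-segmented, under-segmented (merged), and missed
    ground-truth regions from two pixel-accurate label images
    (0 = background, >0 = region id).  An overlap is significant if it
    covers at least `thresh` of the ground-truth region's pixels."""
    gt_sizes = dict(zip(*np.unique(gt_labels[gt_labels > 0], return_counts=True)))

    # Overlap area of every (ground-truth, segmentation) label pair.
    mask = (gt_labels > 0) & (seg_labels > 0)
    base = int(seg_labels.max()) + 1
    pair_ids = gt_labels[mask].astype(np.int64) * base + seg_labels[mask].astype(np.int64)
    ids, counts = np.unique(pair_ids, return_counts=True)

    gt_partners = defaultdict(set)   # gt region  -> significant seg regions
    seg_partners = defaultdict(set)  # seg region -> significant gt regions
    for pid, area in zip(ids, counts):
        g, s = divmod(int(pid), base)
        if area >= thresh * gt_sizes[g]:
            gt_partners[g].add(s)
            seg_partners[s].add(g)

    over = sum(1 for parts in gt_partners.values() if len(parts) > 1)
    merged = sum(1 for gts in seg_partners.values() if len(gts) > 1)
    missed = sum(1 for g in gt_sizes if g not in gt_partners)
    return {"over_segmented": over, "merged": merged, "missed": missed}
```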

204 citations


Journal ArticleDOI
TL;DR: A geometric matching algorithm is used to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property, and it is shown that removing characters outside the computed page frame reduces the OCR error rate.
Abstract: When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective on document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3% to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based document image retrieval application decreases the retrieval error rates by 30%.
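The intuition behind the text alignment property can be sketched very simply: body-text lines share aligned left and right edges, so strongly supported edge positions bound the page content, while marginal noise tends to fall outside that frame. The code below is only a rough illustration of that intuition under the stated assumptions, not the paper's geometric matching algorithm; the function name and the histogram/margin parameters are hypothetical.

```python
import numpy as np

def estimate_page_frame(line_boxes, margin=10):
    """Rough page-frame estimate from text-line bounding boxes.

    line_boxes: iterable of (x0, y0, x1, y1) text-line boxes.
    Returns the estimated frame and a mask of boxes lying inside it.
    """
    boxes = np.asarray(line_boxes, dtype=float)
    x0, y0, x1, y1 = boxes.T

    def dominant(values):
        # The most frequently occupied coordinate bin = aligned edge.
        hist, edges = np.histogram(values, bins=64)
        b = int(np.argmax(hist))
        return 0.5 * (edges[b] + edges[b + 1])

    left, right = dominant(x0), dominant(x1)
    frame = (left - margin, y0.min() - margin, right + margin, y1.max() + margin)

    # Boxes protruding beyond the frame are treated as marginal noise.
    inside = (x0 >= frame[0]) & (x1 <= frame[2])
    return frame, inside
```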

66 citations


Proceedings ArticleDOI
16 Sep 2008
TL;DR: This work presents a new algorithm for curled textline segmentation that is robust to the degree of curl, the direction of curl, and the spacing between adjacent lines, at the expense of high execution time; this insensitivity is demonstrated in a performance evaluation section.
Abstract: Segmentation of curled textlines from warped document images is one of the major issues in document image dewarping. Most of the curled textline segmentation algorithms present in the literature today are sensitive to the degree of curl, the direction of curl, and the spacing between adjacent lines. We present a new algorithm for curled textline segmentation which is robust to the above-mentioned problems at the expense of high execution time; we demonstrate this insensitivity in a performance evaluation section. Our approach is based on a state-of-the-art image segmentation technique, the Active Contour Model (snake), with the novel idea of several baby snakes that converge in the vertical direction only. Experiments on the publicly available CBDAR 2007 document image dewarping contest dataset show that our textline segmentation algorithm achieves an accuracy of 97.96%.
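The following toy sketch conveys the vertical-only convergence idea: one "baby snake" is a chain of control points at fixed x positions that may move only up or down, pulled toward nearby ink and kept smooth by its neighbours. The energy terms, parameters, and function name are assumptions of this sketch, not the paper's active-contour formulation.

```python
import numpy as np

def vertical_snake(ink, xs, y_init, iters=100, alpha=0.5, step=1.0, win=15):
    """Evolve one vertical-only snake over a binary ink image.

    ink:    2D array, 1 where pixels are black, 0 elsewhere.
    xs:     integer x coordinates of the control points.
    y_init: initial y coordinate (scalar) shared by all control points.
    """
    h, _ = ink.shape
    ys = np.full(len(xs), float(y_init))
    for _ in range(iters):
        target = ys.copy()
        for i, x in enumerate(xs):
            lo = int(max(0, ys[i] - win))
            hi = int(min(h, ys[i] + win + 1))
            col = ink[lo:hi, int(x)]
            if col.sum() > 0:
                # External force: pull toward the local centre of ink mass.
                target[i] = lo + np.average(np.arange(len(col)), weights=col)
        # Internal force: move toward the average of the neighbours.
        smooth = np.convolve(ys, [0.25, 0.5, 0.25], mode="same")
        smooth[0], smooth[-1] = ys[0], ys[-1]
        ys += step * np.clip(alpha * (smooth - ys) + (1 - alpha) * (target - ys), -1, 1)
    return ys
```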

42 citations


Proceedings ArticleDOI
16 Sep 2008
TL;DR: This paper finds a robust, pixel-accurate, scanner-independent alignment of the scanned image with the electronic document, allowing the extraction of accurate ground-truth character information.
Abstract: Most optical character recognition (OCR) systems need to be trained and tested on the symbols that are to be recognized. Therefore, ground-truth data is needed. This data consists of character images together with their ASCII codes. Among the approaches for generating ground truth from real-world data, one promising technique is to use an electronic version of the scanned documents. Using an alignment method, the character bounding boxes extracted from the electronic document are matched to the scanned image. Current alignment methods are not robust to different similarity transforms. They also need calibration to deal with non-linear local distortions introduced by the printing/scanning process. In this paper we present a significant improvement over existing methods that allows the calibration step to be skipped and yields a more accurate alignment under all similarity transforms. Our method finds a robust, pixel-accurate, scanner-independent alignment of the scanned image with the electronic document, allowing the extraction of accurate ground-truth character information. The accuracy of the alignment is demonstrated using documents from the UW3 dataset. The results show that the mean distance between the estimated and the ground-truth character bounding box positions is less than one pixel.
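The core geometric step such an alignment needs is estimating a similarity transform (scale, rotation, translation) between corresponding points, for example character bounding-box centres in the electronic page and in the scan. Below is a standard least-squares (Umeyama/Procrustes) fit, shown only as a baseline sketch; the paper's method is more robust than a plain least-squares fit and also copes with non-linear local distortions.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares similarity transform mapping 2D points src onto dst,
    i.e. dst ~ s * R @ src_i + t for each point (Umeyama's method)."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    S, D = src - mu_s, dst - mu_d

    U, sig, Vt = np.linalg.svd(D.T @ S)        # cross-covariance (up to 1/n)
    d = np.ones(2)
    if np.linalg.det(U @ Vt) < 0:              # forbid reflections
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt
    s = (sig * d).sum() / (S ** 2).sum()       # optimal isotropic scale
    t = mu_d - s * R @ mu_s
    return s, R, t

# Hypothetical usage: centres of matched character boxes from the
# electronic page (src) and from the scanned image (dst).
# s, R, t = fit_similarity(electronic_centres, scanned_centres)
# aligned = s * (R @ electronic_centres.T).T + t
```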

31 citations


Book ChapterDOI
07 Aug 2008
TL;DR: This work presents an approach for detecting falsified documents using a document signature obtained from intrinsic features: bounding boxes of connected components serve as the signature, and the method is able to reliably detect faked documents.
Abstract: Document security plays an important role not only for specific documents, e.g. passports, checks, and degrees, but also for everyday documents, e.g. bills and vouchers. Using special high-security features for this class of documents is not feasible due to the cost and complexity of these methods. We present an approach for detecting falsified documents using a document signature obtained from intrinsic features: bounding boxes of connected components are used as the signature. Using the model signature learned from a set of original bills, our approach can identify documents whose signature significantly differs from the model signature. Our approach uses globally optimal document alignment to build a model signature that can be used to compute the probability of a new document being an original one. A preliminary evaluation shows that the method is able to reliably detect faked documents.
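A much simplified sketch of the signature idea follows: describe each connected component by the centre and size of its bounding box, and measure how many model components have no close counterpart in a questioned document. The functions, names, and the fixed tolerance are assumptions of this sketch; the paper instead uses globally optimal alignment and a learned probabilistic model rather than a hard threshold.

```python
import numpy as np

def cc_signature(boxes):
    """Signature from connected-component bounding boxes (x0, y0, x1, y1):
    each component is described by its centre and its width/height."""
    b = np.asarray(boxes, float)
    centres = np.column_stack([(b[:, 0] + b[:, 2]) / 2, (b[:, 1] + b[:, 3]) / 2])
    sizes = np.column_stack([b[:, 2] - b[:, 0], b[:, 3] - b[:, 1]])
    return np.hstack([centres, sizes])

def mismatch_fraction(model_boxes, test_boxes, tol=5.0):
    """Fraction of model components without a test component of similar
    position and size (within `tol` pixels).  Assumes the two documents
    are already aligned."""
    M, T = cc_signature(model_boxes), cc_signature(test_boxes)
    d = np.linalg.norm(M[:, None, :] - T[None, :, :], axis=-1).min(axis=1)
    return float(np.mean(d > tol))

# A document whose mismatch fraction is far above what is observed on
# genuine documents would be flagged as potentially falsified.
```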

20 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: This work proposes a statistically motivated model-based trainable layout analysis system that allows assumption-free adaptation to different layout types and produces likelihood estimates of the correctness of the computed page segmentation.
Abstract: Geometric layout analysis plays an important role in document image understanding. Many algorithms known in the literature work well on standard document images, achieving high text-line segmentation accuracy on the UW-III dataset. These algorithms rely on certain assumptions about document layouts and fail when their underlying assumptions are not met. Also, they do not provide confidence scores for their output. These two problems limit the usefulness of general-purpose layout analysis methods in large-scale applications. In this contribution, we propose a statistically motivated, model-based, trainable layout analysis system that allows assumption-free adaptation to different layout types and produces likelihood estimates of the correctness of the computed page segmentation. The performance of our approach is tested on a subset of the Google 1000 books dataset, where it achieved a text-line segmentation accuracy of 98.4% on layouts where other general-purpose algorithms failed to produce a correct segmentation.
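As a toy illustration of how a trained statistical model can attach a confidence score to a computed segmentation (not the paper's actual model), one can score the observed line heights and inter-line gaps under Gaussians fitted on training pages. The `model` dictionary and its keys below are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import norm

def segmentation_log_likelihood(line_boxes, model):
    """Per-observation log-likelihood of a page segmentation under
    Gaussians fitted on training pages; higher values mean the computed
    segmentation looks more like the training data.

    line_boxes: (x0, y0, x1, y1) boxes of the detected text lines.
    model: dict with fitted means/standard deviations of line heights
           ('h_mean', 'h_std') and inter-line gaps ('g_mean', 'g_std').
    """
    boxes = np.asarray(sorted(line_boxes, key=lambda b: b[1]), float)
    heights = boxes[:, 3] - boxes[:, 1]
    gaps = boxes[1:, 1] - boxes[:-1, 3]
    ll = norm.logpdf(heights, model["h_mean"], model["h_std"]).sum()
    ll += norm.logpdf(gaps, model["g_mean"], model["g_std"]).sum()
    return ll / (len(heights) + len(gaps))
```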

16 citations


Proceedings ArticleDOI
16 Sep 2008
TL;DR: A novel statistical approach to layout analysis of Manhattan layouts, with a probabilistic matching algorithm that gives multiple interpretations of the input layout with associated probabilities.
Abstract: A key limitation of current layout analysis methods is that they rely on many hard-coded assumptions about document layouts and cannot adapt to new layouts for which the underlying assumptions are not satisfied. Another major drawback of these approaches is that they do not return confidence scores for their outputs. These problems pose major challenges in large-scale digitization efforts, where a large number of different layouts need to be handled and manual inspection of the results on each individual page is not feasible. This paper presents a novel statistical approach to layout analysis that aims at solving the above-mentioned problems for Manhattan layouts. The presented approach models known page layouts as a structural mixture model. A probabilistic matching algorithm is presented that gives multiple interpretations of the input layout with associated probabilities. First experiments on documents from the publicly available MARG dataset achieved an error rate below 5% for geometric layout analysis.
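In mixture-model terms, "multiple interpretations with associated probabilities" amounts to computing a posterior over the known layout models for a given page. The sketch below shows only that generic computation; the per-model likelihood functions and priors are assumptions here, whereas the paper defines them through its structural mixture model over blocks and lines.

```python
import numpy as np

def layout_posteriors(page_score_fns, priors, page):
    """Posterior probability of each known layout model for a page.

    page_score_fns: list of functions returning log p(page | layout_k)
                    under each trained layout model.
    priors:         mixture weights p(layout_k), summing to 1.
    """
    log_post = np.array([np.log(p) + f(page) for f, p in zip(page_score_fns, priors)])
    log_post -= log_post.max()            # numerical stability
    post = np.exp(log_post)
    return post / post.sum()              # multiple interpretations + probabilities
```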

15 citations


Dissertation
01 Jan 2008
TL;DR: This thesis presents an efficient local adaptive thresholding algorithm that yields the same quality of binarization as state-of-the-art local binarization methods, but runs in time close to that of global thresholding methods, independent of the local window size.
Abstract: Layout analysis--the division of page images into text blocks and lines and the determination of their reading order--is a major performance-limiting step in large-scale document digitization projects. This thesis addresses this problem in several ways: it presents new performance measures to identify important classes of layout errors, evaluates the performance of state-of-the-art layout analysis algorithms, presents a number of methods to reduce the error rate and catastrophic failures occurring during layout analysis, and develops a statistically motivated, trainable layout analysis system that addresses the needs of large-scale document analysis applications. An overview of the key contributions of this thesis is as follows.

First, this thesis presents an efficient local adaptive thresholding algorithm that yields the same quality of binarization as state-of-the-art local binarization methods, but runs in time close to that of global thresholding methods, independent of the local window size. Tests on the UW-1 dataset demonstrate a 20-fold speedup compared to traditional local thresholding techniques.

Then, this thesis presents a new perspective on document image cleanup. Instead of trying to explicitly detect and remove marginal noise, the approach focuses on locating the page frame, i.e. the actual page contents area. A geometric matching algorithm is presented to extract the page frame of a structured document. It is demonstrated that incorporating a page frame detection step into the document processing chain reduces OCR error rates from 4.3% to 1.7% (n=4,831,618 characters) on the UW-III dataset and layout-based retrieval error rates from 7.5% to 5.3% (n=815 documents) on the MARG dataset.

The performance of six widely used page segmentation algorithms (x-y cut, smearing, whitespace analysis, constrained text-line finding, docstrum, and Voronoi) on the UW-III database is evaluated in this work using a state-of-the-art evaluation methodology. It is shown that current evaluation scores are insufficient for diagnosing specific errors in page segmentation and fail to identify some classes of serious segmentation errors altogether. Thus, a vectorial score is introduced that is sensitive to, and identifies, the most important classes of segmentation errors (over-, under-, and mis-segmentation) and what page components (lines, blocks, etc.) are affected. Unlike previous schemes, this evaluation method has a canonical representation of ground-truth data and guarantees pixel-accurate evaluation results for arbitrary region shapes.

Based on a detailed analysis of the errors made by different page segmentation algorithms, this thesis presents a novel combination of the line-based approach by Breuel with the area-based approach of Baird, which solves the over-segmentation problem in area-based approaches. This new approach achieves a mean text-line extraction error rate of 4.4% (n=878 documents) on the UW-III dataset, the lowest among the analyzed algorithms.

This thesis also describes a simple, fast, and accurate system for document image zone classification that results from a detailed comparative analysis of the performance of widely used features in document analysis and content-based image retrieval. Using a novel combination of known algorithms, an error rate of 1.46% (n=13,811 zones) is achieved on the UW-III dataset, compared to a state-of-the-art system that reports an error rate of 1.55% (n=24,177 zones) using more complicated techniques.
In addition to layout analysis of Roman-script documents, this work also presents the first high-performance layout analysis method for Urdu script. For that purpose, a geometric text-line model for Urdu script is presented. It is shown that the method can accurately extract Urdu text-lines from documents of different layouts, such as prose books, poetry books, magazines, and newspapers. Finally, this thesis presents a novel algorithm for probabilistic layout analysis that specifically addresses the needs of large-scale digitization projects. The presented approach models known page layouts as a structural mixture model. A probabilistic matching algorithm is presented that gives multiple interpretations of the input layout with associated probabilities. An algorithm based on A* search is presented for finding the most likely layout of a page, given its structural layout model. For training layout models, an EM-like algorithm is presented that is capable of learning the geometric variability of layout structures from data, without the need for page segmentation ground truth. Evaluation of the algorithm on documents from the MARG dataset shows an accuracy above 95% for geometric layout analysis.
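To give a flavour of the feature-based zone classification step summarized in the abstract above, a zone image can be mapped to a small feature vector and fed to an off-the-shelf classifier. The concrete features, classifier choice, and variable names below are illustrative assumptions, not the combination evaluated in the thesis.

```python
import numpy as np
from scipy import ndimage
from sklearn.neighbors import KNeighborsClassifier

def zone_features(zone):
    """A few generic features for a binarized zone image (1 = ink):
    ink density, connected-component count, mean component size, and
    mean aspect ratio of component bounding boxes."""
    labels, n = ndimage.label(zone)
    density = float(zone.mean())
    if n == 0:
        return np.array([density, 0.0, 0.0, 0.0])
    slices = ndimage.find_objects(labels)
    sizes = ndimage.sum(zone, labels, index=np.arange(1, n + 1))
    aspects = [(s[1].stop - s[1].start) / max(1, s[0].stop - s[0].start) for s in slices]
    return np.array([density, float(n), float(np.mean(sizes)), float(np.mean(aspects))])

# Hypothetical training setup: `train_zones` and `train_labels`
# (e.g. 'text', 'math', 'table', 'image') come from a labelled set.
# clf = KNeighborsClassifier(n_neighbors=5).fit(
#     [zone_features(z) for z in train_zones], train_labels)
# predicted = clf.predict([zone_features(test_zone)])
```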

14 citations


Book ChapterDOI
01 Apr 2008
TL;DR: The results of the contest are presented, giving an overview of the dataset used in the contest, the evaluation methodology, the participating methods, and the segmentation accuracy they achieved.
Abstract: Automatic conversion of line drawings from paper to electronic form requires the recognition of geometric primitives such as lines, arcs, and circles in scanned documents. Many algorithms have been proposed over the years to extract lines and arcs from document images. To compare different state-of-the-art systems, an arc segmentation contest was held at the Seventh IAPR International Workshop on Graphics Recognition (GREC 2007). Four methods participated in the contest, three of which were commercial systems and one a research algorithm. This paper presents the results of the contest, giving an overview of the dataset used, the evaluation methodology, the participating methods, and the segmentation accuracy they achieved.

10 citations