
Showing papers on "Document layout analysis" published in 2011


Proceedings ArticleDOI
18 Sep 2011
TL;DR: Aletheia is described, an advanced system for accurate yet cost-effective ground truthing of large amounts of documents. It aids the user with a number of automated and semi-automated tools that were partly developed and improved based on feedback from major libraries across Europe and from their digitisation service providers, which use the tool in a production environment.
Abstract: Large-scale digitisation has led to a number of new possibilities with regard to adaptive and learning based methods in the field of Document Image Analysis and OCR. For ground truth production of large corpora, however, there is still a gap in terms of productivity. Ground truth is not only crucial for training and evaluation at the development stage of tools but also for quality assurance in the scope of production workflows for digital libraries. This paper describes Aletheia, an advanced system for accurate and yet cost-effective ground truthing of large amounts of documents. It aids the user with a number of automated and semi-automated tools which were partly developed and improved based on feedback from major libraries across Europe and from their digitisation service providers which are using the tool in a production environment. Novel features are, among others, the support of top-down ground truthing with sophisticated split and shrink tools as well as bottom-up ground truthing supporting the aggregation of lower-level elements to more complex structures. Special features have been developed to support working with the complexities of historical documents. The integrated rules and guidelines validator, in combination with powerful correction tools, enable efficient production of highly accurate ground truth.

124 citations


Book
18 May 2011
TL;DR: This book presents a different approach to pattern recognition (PR) systems, in which users of a system are involved during the recognition process, to help to avoid later errors and reduce the costs associated with post-processing.
Abstract: This book presents a different approach to pattern recognition (PR) systems, in which users of a system are involved during the recognition process. This can help to avoid later errors and reduce the costs associated with post-processing. The book also examines a range of advanced multimodal interactions between the machine and the users, including handwriting, speech and gestures. Features: presents an introduction to the fundamental concepts and general PR approaches for multimodal interaction modeling and search (or inference); provides numerous examples and a helpful Glossary; discusses approaches for computer-assisted transcription of handwritten and spoken documents; examines systems for computer-assisted language translation, interactive text generation and parsing, relevance-based image retrieval, and interactive document layout analysis; reviews several full working prototypes of multimodal interactive PR applications, including live demonstrations that can be publicly accessed on the Internet.

68 citations


Patent
14 Jan 2011
TL;DR: In this article, a method is provided for automatically narrowing the data search space and improving the accuracy of data extraction using known constraints in the layout of extracted data elements for classified documents. The method includes analyzing each document to classify it within a document category, each category having a corresponding set of expected layouts.
Abstract: A method of automatically narrowing data search space and improving accuracy of data extraction using known constraints in a layout of extracted data elements for classified documents is provided. The method includes: analyzing each document to classify it within a document category, each category having a corresponding set of expected layouts; analyzing each electronic document to automatically extract images and text features; automatically constructing a data structure including a layout of the extracted features and layout relationships amongst the extracted features, wherein each of the extracted features in the layout maintains a reference to neighboring features and wherein closely related features are merged to form a combined feature; automatically narrowing data search space by detecting and removing parts of the layout that are not associated with any data elements using the data structure; and automatically detecting data using the extracted feature layout and the layout relationships amongst the extracted features.

58 citations


Proceedings ArticleDOI
18 Sep 2011
TL;DR: This paper presents an advanced framework for evaluating the performance of layout analysis methods that combines efficiency and accuracy by using a special interval based geometric representation of regions.
Abstract: This paper presents an advanced framework for evaluating the performance of layout analysis methods. It combines efficiency and accuracy by using a special interval based geometric representation of regions. A wide range of sophisticated evaluation measures provides the means for a deep insight into the analysed systems, which goes far beyond simple benchmarking. The support of user-defined profiles allows the tuning for practically any kind of evaluation scenario related to real world applications. The framework has been successfully delivered as part of a major EU-funded project (IMPACT) to evaluate large-scale digitisation projects and has been validated using the dataset from the ICDAR2009 Page Segmentation Competition.

54 citations


Journal ArticleDOI
TL;DR: The focus of this article is to present, compare, and analyze techniques for two subareas of document image analysis: skew angle estimation and correction.
Abstract: Document skew estimation and correction is a regular issue in scanned document images. It is an active research area in the domain of document analysis and recognition. Literature is replete with several document skew detection and correction techniques. The focus of this article is to present, compare, and analyze techniques for two subareas of document image analysis: skew angle estimation and correction. Several algorithms proposed in the literature are concisely described. Accordingly, document skew estimation and correction techniques are broadly divided into several categories. The current status of the field is critically discussed, and the persistent problems of each category are highlighted. Finally, possible solutions are recommended.

52 citations
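The projection-profile family of skew estimators covered by such surveys follows a simple recipe: rotate the binarized page over a range of candidate angles and keep the angle whose horizontal projection profile is sharpest (highest variance). A minimal sketch of that idea (the angle range and step size are illustrative choices, not taken from the article):

```python
import numpy as np

def estimate_skew(binary_img, angles=np.arange(-5, 5.1, 0.5)):
    """Estimate the skew angle (in degrees) of a binary page image
    (text pixels = 1) by maximizing the variance of the horizontal
    projection profile over a grid of candidate rotations."""
    ys, xs = np.nonzero(binary_img)
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        rad = np.deg2rad(a)
        # Rotate foreground pixel coordinates and re-project onto rows.
        rot_y = np.round(ys * np.cos(rad) - xs * np.sin(rad)).astype(int)
        profile = np.bincount(rot_y - rot_y.min())
        score = profile.var()  # peaky profile => text lines well aligned
        if score > best_score:
            best_score, best_angle = score, float(a)
    return best_angle
```

Coarse-to-fine refinement of the angle grid is a common speed-up, since each candidate angle costs one pass over the foreground pixels.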


Proceedings ArticleDOI
18 Sep 2011
TL;DR: Evaluation of the presented system on Arabic and Urdu document image datasets consisting of a variety of complex single- and multi-column layouts achieves high accuracies for text and non-text segmentation, text-line extraction, and reading order determination.
Abstract: Text-line extraction and reading order determination are important steps in optical character recognition (OCR) systems. Research in OCR of Arabic script documents has primarily focused on character recognition, and therefore most researchers use primitive methods like projection profile analysis for text-line extraction. Although projection methods achieve good accuracy on clean, skew-corrected documents, their performance drops under challenging conditions (border noise, skew, complex layouts). This paper presents a robust layout analysis system for extracting text-lines in reading order from scanned Arabic script document images written in different languages (Arabic, Urdu, Persian) and styles (Naskh, Nastaliq). The presented system is based on a suitable combination of different well-established techniques for analyzing Latin script documents that have proven to be robust against different types of document image degradations. Evaluation of the presented system on Arabic and Urdu document image datasets consisting of a variety of complex single- and multi-column layouts achieves high accuracies for text and non-text segmentation, text-line extraction, and reading order determination.

26 citations


Patent
09 Mar 2011
TL;DR: In this article, a method and system are provided for identifying the page layout of an image that includes textual regions to undergo optical character recognition (OCR). The system includes an input component that receives an input image including words around which bounding boxes have been formed, and a text identifying component that groups the words into a plurality of text regions.
Abstract: A method and system are provided for identifying a page layout of an image that includes textual regions. The textual regions are to undergo optical character recognition (OCR). The system includes an input component that receives an input image that includes words around which bounding boxes have been formed and a text identifying component that groups the words into a plurality of text regions. A reading line component groups words within each of the text regions into reading lines. A text region sorting component sorts the text regions in accordance with their reading order.

20 citations
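A reading-order sort of the kind the text region sorting component performs can be approximated, for simple banded layouts, by grouping regions into horizontal bands and ordering each band left to right. The sketch below is illustrative only; the patent does not specify its sorting algorithm at this level, and the `(x, y, w, h)` region format is an assumption:

```python
def sort_reading_order(regions, y_tol=10):
    """Sort rectangular text regions (x, y, w, h) into a simple
    top-to-bottom, left-to-right reading order: regions whose top
    edges lie within y_tol of each other form one horizontal band
    and are ordered left to right within it."""
    regions = sorted(regions, key=lambda r: r[1])  # by top edge
    bands, current = [], [regions[0]]
    for r in regions[1:]:
        if abs(r[1] - current[0][1]) <= y_tol:
            current.append(r)       # same band
        else:
            bands.append(current)   # start a new band
            current = [r]
    bands.append(current)
    out = []
    for band in bands:
        out.extend(sorted(band, key=lambda r: r[0]))  # left to right
    return out
```

Multi-column layouts need a more elaborate ordering (e.g. recursive XY cuts), since a pure band sort interleaves the columns.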


Journal ArticleDOI
TL;DR: This paper proposes an efficient approach to space-filling tree representations that uses mechanisms from the point-based rendering paradigm and presents helpful interaction techniques and visual cues that tie in with this new layout.
Abstract: Space-filling layout techniques for tree representations are frequently used when the available screen space is small or the data set is large. In this paper, we propose an efficient approach to space-filling tree representations that uses mechanisms from the point-based rendering paradigm. We present helpful interaction techniques and visual cues that tie in with our layout. Additionally, we relate this new layout approach to common layout mechanisms and evaluate the new layout along the lines of a numerical evaluation using the measures of the Ink-Paper Ratio and overplotted%, and in a preliminary user study. The flexibility of the general approach is illustrated by several enhancements of the basic layout, as well as its usage within the context of two software frameworks from different application fields.

20 citations


Patent
21 Oct 2011
TL;DR: This article uses document-level features to detect whether a document pair is generated through machine translation, and shows that these features correlate with translation quality between the documents in the pair.
Abstract: Various technologies described herein pertain to detecting machine translated content. Documents in a document pair are mutual lingual translations of each other. Further, document level features of the documents in the document pair can be identified. The document level features can correlate with translation quality between the documents in the document pair. Moreover, statistical classification can be used to detect whether the document pair is generated through machine translation based at least in part upon the document level features. Further, a first document can be a machine translation of a second document in the document pair or a disparate document when generated through machine translation.

19 citations


Journal ArticleDOI
TL;DR: A new approach for document clustering based on the Topic Map representation of the documents is introduced and a similarity measure is proposed based upon the inferred information through topic maps data and structures.
Abstract: The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collections of documents like the World Wide Web (WWW). The next challenge lies in performing clustering based on the semantic contents of the document. The problem of document clustering has two main components: (1) to represent the document in such a form that inherently captures the semantics of the text (this may also help to reduce the dimensionality of the document), and (2) to define a similarity measure based on the semantic representation such that it assigns higher numerical values to document pairs which have a higher semantic relationship. The feature space of documents can be very challenging for document clustering. A document may contain multiple topics, a large set of class-independent general words, and a handful of class-specific core words. With these features in mind, traditional agglomerative clustering algorithms, which are based on either the Document Vector model (DVM) or the Suffix Tree model (STC), are less efficient in producing results with high cluster quality. This paper introduces a new approach for document clustering based on the Topic Map representation of the documents. The document is transformed into a compact form. A similarity measure is proposed based upon the information inferred through the topic map's data and structures. The suggested method is implemented using agglomerative hierarchical clustering and tested on standard information retrieval (IR) datasets. The comparative experiments reveal that the proposed approach is effective in improving cluster quality.

19 citations


Proceedings ArticleDOI
26 Sep 2011
TL;DR: Title Vector is proposed to address the issue that the document title is not taken into special consideration, although it obviously contains much semantic information.
Abstract: Text classification is a daunting task because it is difficult to extract the semantics of natural-language texts. Many problems must be resolved before natural-language processing techniques can be effectively applied to a large collection of texts. A significant one is to extract semantic information from a corpus in plain text. In the Vector Space Model, a document is conceptually represented by a vector of terms extracted from the document, with associated weights representing the importance of each term in the document and within the whole document collection. Likewise, an unclassified document is also modeled as a list of terms with associated weights representing the importance of the terms in it. Many techniques introduce statistical information about terms to represent their semantic information. However, the document title is usually not taken into special consideration, although it obviously contains much semantic information. This paper proposes the Title Vector to address this issue.
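The term-weighting scheme the abstract describes - importance of a term within the document, scaled by its importance across the collection - is commonly realized as tf-idf. A minimal sketch of that baseline (the Title Vector itself would additionally up-weight terms drawn from the title; that scheme is not detailed in the abstract, so only the standard weighting is shown):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Vector Space Model baseline: map each tokenized document to a
    dict of term -> tf-idf weight (term frequency in the document
    times log inverse document frequency across the collection)."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency per term
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [["layout", "analysis", "ocr"],
        ["layout", "segmentation"],
        ["ocr", "ocr", "recognition"]]
vecs = tfidf_vectors(docs)
```

Terms appearing in every document get weight zero, while rare terms dominate, which is exactly the "importance within the whole collection" scaling the abstract refers to.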

Proceedings ArticleDOI
18 Sep 2011
TL;DR: The proposed method aims at clustering document snippets, so that an automated clustering of documents can be performed, and shows promising results on a dataset consisting of document snippets with varying shapes, content, writing, and layout.
Abstract: In general, document image analysis methods are pre-processing steps for Optical Character Recognition (OCR) systems. In contrast, the proposed method aims at clustering document snippets, so that an automated clustering of documents can be performed. To this end, words are classified as printed text, manuscript, or noise, where the third class corrects falsely segmented background elements. Having classified the text elements, a layout analysis is carried out which groups words into text lines and paragraphs. A back-propagation of the class weights - assigned to each word in the first step - enables correcting wrong class labels. The proposed method shows promising results on a dataset consisting of document snippets with varying shapes, content, writing, and layout. In addition, the system is compared to page segmentation methods of the ICDAR 2009 Page Segmentation Competition.

Patent
Eric Saund, Prateek Sarkar, Alejandro E. Brito, Marshall Bern, Francois Ragnet
15 Aug 2011
TL;DR: In this article, methods are presented for generating image anchor templates from low-variance regions of document images of a first class.
Abstract: Methods of generating image anchor templates from low variance regions of document images of a first class are provided. The methods select (102) a document image from the document images of the first class and align (104) the other document images of the first class to the selected document image. Low variance regions are then determined (106) by comparing the aligned document images and the selected document image and used to generate (108) image anchor templates.

Proceedings ArticleDOI
19 Sep 2011
TL;DR: An approach called bag- of-related-words is proposed to generate features compounded by a set of related words with a dimensionality smaller than the bag-of-words for the topic hierarchy building.
Abstract: A simple and intuitive way to organize a huge document collection is by a topic hierarchy. Generally, two steps are carried out to build a topic hierarchy automatically: 1) hierarchical document clustering and 2) cluster labeling. For both steps, a good textual document representation is essential. The bag-of-words is the common way to represent text collections. In this representation, each document is represented by a vector where each word in the document collection represents a dimension (feature). This approach has well-known problems, such as the high dimensionality and sparsity of the data. Besides, most concepts are composed of more than one word, such as "document engineering" or "text mining". In this paper, an approach called bag-of-related-words is proposed to generate features composed of sets of related words, with a dimensionality smaller than the bag-of-words. The features are extracted from each textual document of a collection using association rules. Different ways to map a document into transactions, in order to allow the extraction of association rules, and interest measures to prune the number of features are analyzed. To evaluate how much the proposed approach can aid topic hierarchy building, we carried out an objective evaluation of the clustering structure and a subjective evaluation of the topic hierarchies. All results were compared with the bag-of-words. The obtained results demonstrate that the proposed representation is better than the bag-of-words for topic hierarchy building.
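The core of the bag-of-related-words idea - mining frequently co-occurring word sets from per-document transactions and using them as features - can be sketched with a plain support count over word pairs. This is a deliberately simplified stand-in: the paper uses full association rules plus interest measures for pruning, which are omitted here:

```python
from collections import Counter
from itertools import combinations

def related_word_features(transactions, min_support=2):
    """Keep word pairs that co-occur in at least `min_support`
    transactions; each frequent pair becomes a single feature,
    giving far fewer dimensions than the full bag-of-words."""
    pair_counts = Counter()
    for t in transactions:
        # Each transaction contributes each unordered pair once.
        for pair in combinations(sorted(set(t)), 2):
            pair_counts[pair] += 1
    return {p for p, c in pair_counts.items() if c >= min_support}

sentences = [["text", "mining", "methods"],
             ["document", "engineering"],
             ["text", "mining", "tools"],
             ["document", "engineering", "standards"]]
features = related_word_features(sentences)
```

Note how multi-word concepts like "text mining" and "document engineering" survive as single features, which is the representational gain the paper argues for.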

Proceedings ArticleDOI
22 Dec 2011
TL;DR: A modified RLSA, called the Spiral Run Length Smearing Algorithm (SRLSA), is applied to separate non-text components from text ones in handwritten document images, and the components are then classified into text/non-text groups using a Support Vector Machine (SVM) classifier.
Abstract: Document layout analysis is a pre-processing step to convert handwritten/printed documents into electronic form through an Optical Character Recognition (OCR) system. Handwritten documents are usually unstructured, i.e., they do not have a specific layout, and most documents may contain some non-text regions, e.g. graphs, tables, diagrams etc. Therefore, such documents cannot be directly given as input to the OCR system without suppressing the non-text regions in the documents. The traditional Run Length Smoothing Algorithm (RLSA) does not produce good results for handwritten document pages, since the text components in them have lower pixel density than those in printed text. In the present work, a modified RLSA, called the Spiral Run Length Smearing Algorithm (SRLSA), is applied to separate the non-text components from the text ones in handwritten document images. The components in the document pages are then classified into text/non-text groups using a Support Vector Machine (SVM) classifier. The method shows a success rate of 83.3% on a dataset of 3000 components.
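For reference, the traditional horizontal RLSA that SRLSA modifies works by flipping short background runs to foreground so that nearby characters merge into word and line blobs. A minimal sketch of that baseline (the spiral smearing path of SRLSA itself is not described here in enough detail to reproduce):

```python
import numpy as np

def rlsa_horizontal(binary_img, threshold=20):
    """Traditional horizontal RLSA: background (0) runs shorter than
    `threshold` between two foreground pixels are flipped to 1,
    linking nearby characters into word/line blobs."""
    out = binary_img.copy()
    for row in out:
        run_start = None              # start index of the current 0-run
        for x, v in enumerate(row):
            if v == 0:
                if run_start is None:
                    run_start = x
            else:
                if run_start is not None and x - run_start < threshold:
                    row[run_start:x] = 1   # smear the short gap
                run_start = None
    return out
```

Classic RLSA applies this horizontally and vertically and combines the two results; the fixed threshold is exactly what fails on sparse handwriting, motivating the spiral variant.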

Proceedings ArticleDOI
18 Sep 2011
TL;DR: This work was especially interested in developing a 'template-free' form recognition technique that extracts and recognizes target characters without pre-defined layout knowledge (form-template) and was able to use a hypothesis testing approach to successfully extract items from noisy form images and ambiguous alignment layout forms.
Abstract: We present a new form recognition technique. In our work, we were especially interested in developing a 'template-free' form recognition technique that extracts and recognizes target characters without pre-defined layout knowledge (a form template). We also attempted to overcome well-known difficulties in developing template-free form recognition techniques, i.e., extracting items from noisy form images and from forms with ambiguous alignment layouts. We were able to use a hypothesis-testing approach to successfully extract items from such form images.

Patent
16 Nov 2011
TL;DR: An apparatus is described that comprises a difference component operative to determine a set of differences between an old layout and a new layout of a document, and an animation layer generation component that generates animation layers from the set of differences.
Abstract: Techniques for the automatic animation of document content are described. An apparatus may comprise a difference component operative to receive an old layout of a document and a new layout of the document, the new layout corresponding to an application of one or more changes to the old layout of the document, the difference component operative to determine a set of differences between the old layout and the new layout, and an animation layer generation component operative to generate a set of animation layers from the set of differences. Other embodiments are described and claimed.

Proceedings ArticleDOI
18 Sep 2011
TL;DR: A novel framework is proposed for segmentation of documents with complex layouts, performed by a combination of clustering and conditional random field (CRF) based modeling, and has been extensively tested on multi-colored document images with text overlapping graphics/images.
Abstract: In this paper, we propose a novel framework for segmentation of documents with complex layouts. The document segmentation is performed by combination of clustering and conditional random fields (CRF) based modeling. The bottom-up approach for segmentation assigns each pixel to a cluster plane based on color intensity. A CRF based discriminative model is learned to extract the local neighborhood information in different cluster/color planes. The final category assignment is done by a top-level CRF based on the semantic correlation learned across clusters. The proposed framework has been extensively tested on multi-colored document images with text overlapping graphics/image.

Patent
24 Feb 2011
TL;DR: In this paper, the authors present a document image generating apparatus that can keep a layout of original text present in the original text image and then can improve the readability of original texts.
Abstract: It is expected to provide a document image generating apparatus, a document image generating method and a computer program that can keep a layout of original text present in the original text image and then can improve the readability of original text and the readability annotation corresponding to the original text (e.g., translation). A translation 421 of original text 411 is aligned at the interline space between the original text 411 at the first line and the original text 412 at the second line. When the interline space is narrow as shown in FIG. 4B, the original text 411 overlays the translation 421. At that time, the color regarding the original text 411 is changed to be a low visibility color, and the color regarding the translation 421 is changed to be a high visibility color.

Patent
20 Oct 2011
TL;DR: A ruby character is displayed in an appropriate display form corresponding to the characteristics of the layout of the document to which the ruby characters are to be added; a storage portion stores information on the layout of the document as document layout data, and a unification judging portion reads the document layout data stored in the storage portion.
Abstract: A ruby character is displayed in an appropriate display form corresponding to characteristics of the layout of the document to which the ruby characters are to be added. In a document generating apparatus, a storage portion stores information on a layout of a document as document layout data, a unification judging portion reads the document layout data stored in the storage portion and judges the unification of the layout of the whole document based on the read data, and a ruby character setting portion sets a display form of a ruby character based on the judgment result of the unification judging portion.

Patent
20 Dec 2011
TL;DR: In this article, an underlying grid structure that facilitates layout of East Asian text is determined, which includes both a size of character frames and the size of a text block frame, in order to fit a greater or lesser number of characters on a line.
Abstract: Determination of an underlying grid structure that facilitates layout of East Asian text is disclosed. The underlying grid structure includes both a size of character frames and a size of a text block frame. The East Asian text may be obtained from a scan of printed material that has the text formatted according to layout conventions established by the publisher. The text may be reformatted to appear on a display of an electronic device in a manner similar to the formatting in the original scanned document. Reformatting may include reflowing the text in order to fit a greater or lesser number of characters on a line. The reflowing may maintain character spacing from the original document and follow formatting rules against locating certain characters at the start or end of a line.

Patent
14 Jan 2011
TL;DR: In this article, a method of training a document analysis system to extract data from documents is provided, which includes automatically analyzing images and text features extracted from a document to associate the document with a corresponding document category.
Abstract: A method of training a document analysis system to extract data from documents is provided. The method includes: automatically analyzing images and text features extracted from a document to associate the document with a corresponding document category; comparing the extracted text features with a set of text features associated with the corresponding category of the document, in which the set of text features includes a set of characters, words, and phrases; if the extracted features are found to consist of characters, words, and phrases belonging to the set of text features associated with the corresponding document category, storing the extracted text features as the data contained in the corresponding document; and, if the extracted text features are found to include at least one text feature that does not belong to the set of text features associated with the corresponding document category, submitting the unrecognized text features to a training phase.

Book ChapterDOI
05 Oct 2011
TL;DR: The authors propose to use condensed representations of text documents instead of the full-text document to reduce the labeling time for single documents, and evaluate whether document labeling with these condensed representations can be done faster, and just as accurately, by human labelers.
Abstract: In text classification, the amount and quality of training data is crucial for the performance of the classifier. The generation of training data is done by human labelers - a tedious and time-consuming task. We propose to use condensed representations of text documents instead of the full-text document to reduce the labeling time for single documents. These condensed representations are key sentences and key phrases and can be generated in a fully unsupervised way. The key phrases are presented in a layout similar to a tag cloud. In a user study with 37 participants, we evaluated whether document labeling with these condensed representations can be done faster, and just as accurately, by human labelers. Our evaluation shows that the users labeled word clouds twice as fast but just as accurately as full-text documents. While further investigations for different classification tasks are necessary, this insight could potentially reduce the costs of the labeling process for text documents.

Proceedings Article
01 Aug 2011
TL;DR: A binarization-free layout analysis method for ancient manuscripts is proposed, which identifies and localizes layout entities exploiting their structural similarities on the local level.
Abstract: A binarization-free layout analysis method for ancient manuscripts is proposed, which identifies and localizes layout entities exploiting their structural similarities on the local level. Hence, the textual entities are disassembled into segments, and a part-based detection is done which employs local gradient features known from the field of object recognition, the Scale Invariant Feature Transform (SIFT), to describe these structures. Layout analysis is the first step in the process of document understanding; it identifies regions of interest and, hence, serves as input for other algorithms such as Optical Character Recognition (OCR). Moreover, the document layout allows scholars to establish the spatio-temporal origin, authenticate, or index a document. The layout entities considered in this approach include the body text, embellished initials, plain initials and headings.

Proceedings ArticleDOI
Jian Fan
18 Sep 2011
TL;DR: This work proposes a new local homogeneity measure based on line space, and incorporates this new feature into a region growing algorithm that achieved robust performance on PDF magazines with wide-ranging layouts and styles.
Abstract: Text segmentation is usually the first step taken towards the reuse and repurposing of PDF documents. Through experimental evaluation, we found that the leading text segmentation algorithms have limitations for contemporary consumer magazines. We propose a new local homogeneity measure based on line space, and incorporate this new feature into a region growing algorithm. Using a fixed set of parameters, our algorithm achieved robust performance on PDF magazines with wide-ranging layouts and styles.

Proceedings ArticleDOI
23 Jan 2011
TL;DR: An operator is devised that allows optional activation of the stochastic parsing mechanism, and it is shown how this fusion of numeric and symbolic information in a feedback loop can be applied to syntactic methods to improve document description expressiveness.
Abstract: This paper presents an improvement to a document layout analysis system, offering a possible solution to Sayre's paradox ("a letter must be recognized before it can be segmented; and it must be segmented before it can be recognized"). This improvement, based on stochastic parsing, allows the integration of statistical information obtained from recognizers during syntactic layout analysis. We show how this fusion of numeric and symbolic information in a feedback loop can be applied to syntactic methods to simplify document description. To limit combinatorial explosion during exploration of the solution space, we devised an operator that allows optional activation of the stochastic parsing mechanism. Our evaluation on 1250 handwritten business letters shows that this method improves global recognition scores.


Patent
29 Dec 2011
TL;DR: In this paper, an image based document index and retrieval method is described, where each source document is analyzed to generate index information at document, page, region and unit levels, and the index information and the source document images are stored in a database.
Abstract: An image based document index and retrieval method is described. During document indexing, each source document is analyzed to generate index information at document, page, region and unit levels. Region and unit level index information is generated by segmenting each text region into units, constructing unit length or unit density histograms, and analyzing the units in a few most frequent bins of the histogram. The index information and the source document images are stored in a database. During document retrieval, a target document is analyzed to generate target index information in the same way as during document indexing. The target index information is compared to stored index information in a progressive manner (from higher to lower levels) to identify source documents with index information that matches the target index information. Fuzzy logic is used in the comparison steps to increase the robustness of the document retrieval.

Proceedings ArticleDOI
19 Sep 2011
TL;DR: A new algorithm is proposed that approximates a metric function between documents based on their visual similarity so that the computation required to find similar documents in a document database can be significantly reduced.
Abstract: Managing large document databases has become an important task. Being able to automatically compare document layouts and classify and search documents with respect to their visual appearance proves to be desirable in many applications. We propose a new algorithm that approximates a metric function between documents based on their visual similarity. The comparison is based only on the visual appearance of the document without taking into consideration its text content. We measure the similarity of single page documents with respect to distance functions between three document components: background, text, and saliency. Each document component is represented as a Gaussian mixture distribution; and distances between the components of different documents are calculated as an approximation of the Hellinger distance between corresponding distributions. Since the Hellinger distance obeys the triangle inequality, it proves to be favorable in the task of nearest neighbor search in a document database. Thus, the computation required to find similar documents in a document database can be significantly reduced.
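The distance at the heart of this method has a closed form in the single-Gaussian case, which is what makes the mixture-level approximation tractable. A sketch of the univariate version (the paper works with Gaussian mixtures over image features; the formula below is the standard exact result for two Gaussians, shown only to illustrate the metric's properties):

```python
import math

def hellinger_gauss(mu1, s1, mu2, s2):
    """Exact Hellinger distance between univariate Gaussians
    N(mu1, s1^2) and N(mu2, s2^2). Bounded in [0, 1] and obeys the
    triangle inequality, which enables metric-based pruning in
    nearest-neighbor search."""
    v1, v2 = s1 * s1, s2 * s2
    # Bhattacharyya coefficient between the two Gaussians
    bc = math.sqrt(2 * s1 * s2 / (v1 + v2)) * \
         math.exp(-0.25 * (mu1 - mu2) ** 2 / (v1 + v2))
    return math.sqrt(1 - bc)
```

Because the distance is a true metric, a document whose components are far from a query's components can be discarded via the triangle inequality without computing every pairwise comparison.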

Patent
31 Aug 2011
TL;DR: In this article, a scanned grayscale image of the target document is binarized by separating halftone and non-halftone text areas and binarizing them separately.
Abstract: A document authentication method determines the authenticity of a target hardcopy document, which purports to be a true copy of an original hardcopy document. The method compares a binarized image of the target document with a binarized image of the original document which has been stored in a storage device. The image of the original document is generated by binarizing a scanned grayscale image of the original document. Halftone and non-halftone text areas in the grayscale image are separated, and the two types of text are binarized separately. The non-halftone text areas are then down-sampled. During authentication, a scanned grayscale image of the target document is binarized by separating halftone and non-halftone text areas and binarizing them separately, and then down-sampling the non-halftone text areas. The binarized images of the target document and the original document are compared to determine the authenticity of the target document.