
Showing papers on "Document layout analysis published in 2007"


Patent
16 Aug 2007
TL;DR: In this paper, a method and apparatus for reformatting electronic documents is disclosed. The method consists of performing layout analysis on an electronic version of a document to locate text zones, assigning scale and importance attributes to those zones, and reformatting the text based on the attributes to create an image.
Abstract: A method and apparatus for reformatting electronic documents is disclosed. In one embodiment, the method comprises performing layout analysis on an electronic version of a document to locate text zones, assigning attributes for scale and importance to text zones in the electronic version of the document, and reformatting text in the electronic version of the document based on the attributes to create an image.

216 citations


Journal ArticleDOI
TL;DR: This work focuses on techniques that classify single-page typeset document images without using OCR results, and brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism.
Abstract: Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.

181 citations


Book ChapterDOI
01 Jan 2007
TL;DR: Automatic analysis of an arbitrary document with complex layout is an extremely difficult task and is beyond the capabilities of the state-of-the-art document structure and layout analysis systems.
Abstract: A document image is composed of a variety of physical entities or regions such as text blocks, lines, words, figures, tables, and background. We could also assign functional or logical labels such as sentences, titles, captions, author names, and addresses to some of these regions. The process of document structure and layout analysis tries to decompose a given document image into its component regions and understand their functional roles and relationships. The processing is carried out in multiple steps, such as preprocessing, page decomposition, structure understanding, etc. We will look into each of these steps in detail in the following sections. Document images are often generated from physical documents by digitization using scanners or digital cameras. Many documents, such as newspapers, magazines and brochures, contain very complex layout due to the placement of figures, titles, and captions, complex backgrounds, artistic text formatting, etc. (see Figure 1). A human reader uses a variety of additional cues such as context, conventions and information about language/script, along with a complex reasoning process to decipher the contents of a document. Automatic analysis of an arbitrary document with complex layout is an extremely difficult task and is beyond the capabilities of the state-of-the-art document structure and layout analysis systems. This is interesting since documents are designed to be effective and clear to human interpretation unlike natural images.

101 citations


Proceedings Article
06 Jan 2007
TL;DR: A method for detecting scene-frame divisions in comic images using the density gradient, after filling the quadrangle regions in each image with black, is proposed; experimental results show that 80 percent of 672 pages in four print comic booklets are successfully divided into scene frames by the proposed method.
Abstract: Today, demand for comic content services is increasing because paper magazines and books are bulky, while digital content can be read anytime and anywhere on cellular phones and PDAs. To convert existing print comic materials into a digital format that can be read on cellular phones and PDAs with small screens, it is necessary to divide each page into scene frames and to determine the reading order of the scene frames. The division of comic images into scene frames can be considered a type of document layout analysis. We analyzed the layout of comic images using the density gradient. The method can be applied to comics in which comic balloons or pictures are drawn over scene frames. In this research, a method for detecting the scene frame division in comic images using the density gradient, after filling the quadrangle regions in each image with black, is proposed. Experimental results show that 80 percent of 672 pages in four print comic booklets are successfully divided into scene frames by the proposed method.

83 citations


Patent
15 Nov 2007
TL;DR: In this article, the authors present a system for storing, organizing, and accessing image-based documents, which includes an OCR conversion process to produce an equivalent document in text format, identifying the keywords of the equivalent document, linking the keywords with the image-based document, and storing the image-based document, the corresponding equivalent document, and the keywords in a relational database.
Abstract: Methods and systems are provided for storing, organizing, and accessing image-based documents. The method includes receiving an image-based document, conducting an OCR conversion process to produce an equivalent document in text format, identifying keywords of the equivalent document in text format, linking the keywords with the image-based document and the corresponding equivalent document in text format, and storing the image-based document, the corresponding equivalent document in text format, and the keywords in a relational database.

61 citations


Patent
David M. Bargeron1, Charles E. Jacobs1, Wilmot Li1, David Salesin1, Evan J. Schrier1 
16 Jul 2007
TL;DR: The adaptive grid-based document layout system and methods as discussed by the authors provides a template authoring tool and user interface for interactively drawing and arranging layout elements within an adaptive template, including various element types and constraint-based relationships that define the layout of elements.
Abstract: A system and methods for facilitating adaptive grid-based document layout. More particularly, the adaptive grid-based document layout system and methods feature a new approach to adaptive grid-based document layout that utilizes a set of adaptive templates configurable for a range of different page sizes and viewing conditions. The templates include various element types and constraint-based relationships that define the layout of elements with reference to the viewing conditions under which the document content will be displayed and that define other content properties. Through a layout engine and paginator, the adaptive grid-based document layout system and methods determine a desirable sequence of templates to use for adapting document content. Additionally, the system and methods provide a template authoring tool and user interface for interactively drawing and arranging layout elements within an adaptive template.

59 citations


Patent
13 Jun 2007
TL;DR: In this paper, a document annotator converts a source document with a layout to a deterministic format including content and layout metadata; generated document annotations are then associated with positional tags, based on the layout metadata, that locate the annotations in the layout.
Abstract: In a document annotator (8), a document converter (12) is configured to convert a source document (10) with a layout to a deterministic format (14, 64) including content and layout metadata. At least one annotation pipeline (20, 22) is configured to generate document annotations respective to received content. A merger (36, 46) is configured to associate the generated document annotations with positional tags based on the layout metadata, which locate the document annotations in the layout. A document visualizer (58) is configured to render at least some content of the deterministic format and one or more selected annotations (60) in substantial conformance with the layout based on the layout metadata and the positional tags associated with the selected one or more annotations (60).

57 citations


Proceedings ArticleDOI
23 Sep 2007
TL;DR: This paper presents a text line segmentation method for printed or handwritten historical Arabic documents which has a 96% accuracy on a collection of 100 historical documents.
Abstract: This paper presents a text line segmentation method for printed or handwritten historical Arabic documents. Documents are first classified into two classes using a K-means scheme; these classes correspond to document complexity (easy or not easy to segment). A document that includes overlapping and touching characters is then divided into vertical strips. The text blocks extracted by horizontal projection are classified into three categories: small, average, and large. After segmenting the large text blocks, the lines are obtained by matching adjacent blocks within two successive strips using spatial relationships. A document without overlapping or touching characters is segmented without invoking the segmentation module for large text blocks. The text line segmentation method has a 96% accuracy on a collection of 100 historical documents.
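The strip-and-projection step described above can be sketched in a few lines; the function names and block-height thresholds below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def projection_blocks(strip, ink_threshold=0):
    """Split a binary strip (1 = ink) into text blocks via horizontal projection.

    Returns (start_row, end_row) pairs for each run of rows whose
    ink count exceeds ink_threshold.
    """
    profile = strip.sum(axis=1)          # ink pixels per row
    ink_rows = profile > ink_threshold   # rows that contain text
    blocks, start = [], None
    for r, has_ink in enumerate(ink_rows):
        if has_ink and start is None:
            start = r
        elif not has_ink and start is not None:
            blocks.append((start, r - 1))
            start = None
    if start is not None:
        blocks.append((start, len(ink_rows) - 1))
    return blocks

def classify_blocks(blocks, small=3, large=12):
    """Label each block small / average / large by its height in rows
    (the thresholds are made up for illustration)."""
    labels = []
    for s, e in blocks:
        h = e - s + 1
        labels.append("small" if h < small else "large" if h > large else "average")
    return labels
```

Lines would then be recovered by matching these per-strip blocks across adjacent strips, which the sketch leaves out.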

48 citations


Patent
15 Feb 2007
TL;DR: In this paper, a system and method for producing semantically-rich representations of texts to amplify and sharpen the interpretations of texts is proposed, which relies on the fact that there is a substantial amount of semantic content associated with most text strings that is not explicit in those strings, or in the mere statistical co-occurrence of the strings with other strings, but which is nevertheless extremely relevant to the text.
Abstract: A system and method for producing semantically-rich representations of texts to amplify and sharpen the interpretations of texts. The method relies on the fact that there is a substantial amount of semantic content associated with most text strings that is not explicit in those strings, or in the mere statistical co-occurrence of the strings with other strings, but which is nevertheless extremely relevant to the text. This additional information is used to both sharpen the representations derived directly from the text string, and also to augment the representation with content that, while not explicitly mentioned in the string, is implicit in the text and, if made explicit, can be used to support the performance of text processing applications including document indexing and retrieval, document classification, document routing, document summarization, and document tagging. These enhancements may be used to support down-stream processing, such as automated document reading and understanding, online advertising placement, electronic commerce, corporate knowledge management, and business and government intelligence applications.

48 citations


Patent
23 Oct 2007
TL;DR: In this article, the authors propose a toolbar for electronic document classification based on specific properties such as security classification, information type, document type, document retention, and the like.
Abstract: Electronic document classification is disclosed. A toolbar adds the ability to classify documents based on specific properties such as security classification, information type, document type, document retention, document caveats, and the like. Through dropdown selections, the toolbar allows users to select the appropriate classification and properties based upon the content of the document and have appropriate classifiers added to the document. Document classification properties are generated and associated with the document in the document properties, and visual markings are inserted that allow users to quickly identify the security, sensitivity, intended distribution, or retention. By utilizing the classification toolbar a user can classify a document by one or more classification levels and be ensured that the classification will be visible to any person viewing the document.

46 citations


Proceedings ArticleDOI
23 Sep 2007
TL;DR: This work proposes a robust technique for segmenting all sorts of graphics and texts in any orientation from document pages, essential for better OCR performance and vectorization in computer vision applications.
Abstract: Text, graphics, and half-tones are the major constituents of any document page. While half-tones can be characterised by their inherent intensity variation, text and graphics share common characteristics and differ mainly in spatial distribution. The success of document image analysis systems depends on the proper segmentation of text and graphics, as text is further subdivided into other classes such as headings, tables, and math zones. Segmentation of graphics is essential for better OCR performance and for vectorization in computer vision applications. Graphics segmentation from text is particularly difficult for graphics made of small components (dashed or dotted lines, etc.), which share many features with text. Here we propose a robust technique for segmenting all sorts of graphics and text in any orientation from document pages.

Journal ArticleDOI
15 Oct 2007
TL;DR: This paper presents a hybrid approach to segment and classify the contents of document images into three types of regions: Graphics, Text, and Space.
Abstract: In this paper we present a hybrid approach to segment and classify the contents of document images. A document image is segmented into three types of regions: Graphics, Text, and Space. The image of a document is subdivided into blocks, and for each block five GLCM (Grey Level Co-occurrence Matrix) features are extracted. Based on these features, blocks are then clustered into three groups using the K-Means algorithm; connected blocks that belong to the same group are merged. The classification of groups is done using pre-learned heuristic rules. Experiments were conducted on scanned newspapers and images from the MediaTeam Document Database.
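A minimal sketch of the block-level GLCM feature extraction, assuming a pre-quantised grey-level block; the five features chosen here (contrast, energy, homogeneity, entropy, mean) are common GLCM statistics, since the paper's exact five are not listed in the abstract.

```python
import numpy as np

def glcm(block, levels=8, dx=1, dy=0):
    """Grey-level co-occurrence matrix for one image block
    (values assumed quantised into [0, levels))."""
    m = np.zeros((levels, levels))
    h, w = block.shape
    for y in range(h - dy):
        for x in range(w - dx):
            m[block[y, x], block[y + dy, x + dx]] += 1
    return m / max(m.sum(), 1)  # normalise to a probability matrix

def glcm_features(block, levels=8):
    """Five illustrative GLCM features for a block."""
    p = glcm(block, levels)
    i, j = np.indices(p.shape)
    contrast = ((i - j) ** 2 * p).sum()
    energy = (p ** 2).sum()
    homogeneity = (p / (1.0 + np.abs(i - j))).sum()
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
    mean = (i * p).sum()
    return np.array([contrast, energy, homogeneity, entropy, mean])
```

These per-block vectors would then be fed to K-Means (three clusters) before the merge-and-label steps.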

Patent
Qing Yu1, Shuming Shi1, Zhiwei Li1, Ji-Rong Wen1, Wei-Ying Ma1 
01 Mar 2007
TL;DR: In this paper, a method and system for determining the relevance of a document having text and images to a text string is provided; the method is not, however, suited to text-only documents.
Abstract: A method and system for determining relevance of a document having text and images to a text string is provided. A scoring system identifies image text associated with an image of the document. The scoring system calculates an image score indicating relevance of the image text to the text string. The image score may be used in many applications, such as searching, summary generation, and document classification, image search, and image classification.

Patent
23 Jul 2007
TL;DR: In this paper, a product layout is created from a user-selected image, if any, and from text elements having user-supplied text content; empty text elements are disregarded, and positioning of user text entries is determined based on the size of the entries, defined text element spacing distances, and defined positioning rules.
Abstract: Methods and computer programs for automatically creating a text layout in an electronic design for a product to be printed. A number of defined text elements are available for user text entries. The product layout is based on a user-selected image, if any, and on the text elements having user-supplied text content. Text elements without text content are disregarded. Positioning of user text entries is determined based on the size of the text entries, defined text element spacing distances, and defined positioning rules. Creating a layout incorporating user-supplied text entries and/or an image may include cropping or resizing of other design elements in the product design and wrapping of relatively long text entries onto multiple lines.

Proceedings ArticleDOI
12 Aug 2007
TL;DR: This paper presents an approach for extracting relevant named entities from document images by combining rich page layout features in the image space with language content in the OCR text using a discriminative conditional random field (CRF) framework and integrates it into the expense reimbursement solution.
Abstract: Expense reimbursement is a time-consuming and labor-intensive process across organizations. In this paper, we present a prototype expense reimbursement system that dramatically reduces the elapsed time and costs involved, by eliminating paper from the process life cycle. Our complete solution involves (1) an electronic submission infrastructure that provides multi-channel image capture, secure transport and centralized storage of paper documents; (2) an unconstrained data mining approach to extracting relevant named entities from unstructured document images; (3) automation of auditing procedures that enables automatic expense validation with minimum human interaction. Extracting relevant named entities robustly from document images with unconstrained layouts and diverse formatting is a fundamental technical challenge in image-based data mining, question answering, and other information retrieval tasks. In many applications that require such capability, applying traditional language modeling techniques to the stream of OCR text does not give satisfactory results due to the absence of linguistic context. We present an approach for extracting relevant named entities from document images by combining rich page layout features in the image space with language content in the OCR text using a discriminative conditional random field (CRF) framework. We integrate this named entity extraction engine into our expense reimbursement solution and evaluate the system performance on large collections of real-world receipt images provided by IBM World Wide Reimbursement Center.

Proceedings ArticleDOI
23 Sep 2007
TL;DR: This paper presents a new framework for in-depth analysis of the performance of layout analysis methods that provides detailed information at various levels that can be used by method developers to identify specific problems and improve their work.
Abstract: This paper presents a new framework for in-depth analysis of the performance of layout analysis methods. Contrary to existing approaches aimed at evaluation or benchmarking, the proposed framework provides detailed information at various levels that can be used by method developers to identify specific problems and improve their work. Complex layouts are supported as well as the flexible configuration of goal-oriented performance analysis scenarios. The comparison of segmentation results against the ground truth is performed in a very efficient way based on a decomposition of any region shape into an interval-based description. The framework has been validated using the dataset and method results of the ICDAR2005 Page Segmentation Competition.

Proceedings ArticleDOI
23 Sep 2007
TL;DR: An identification technique that automatically detects the underlying script and orientation of scanned document images using stroke density and distribution; the technique is tolerant to document skew and able to detect the orientation of documents in different scripts.
Abstract: This paper presents an identification technique that automatically detects the underlying script and orientation of scanned document images. In the proposed technique, document script and orientation are identified by using the stroke density and distribution, which convert each document image into a document vector. For each script at each orientation, a number of reference document vectors are first constructed. Script and orientation of the query document are then determined according to the similarity between the query document vector and multiple pre-constructed reference document vectors by using the K-nearest neighbor algorithm. Experiments show that the proposed technique is tolerant to the document skew and able to detect orientations of documents of different scripts.
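The K-nearest-neighbor decision over document vectors can be sketched as follows; the vector contents and label strings are hypothetical, standing in for the paper's stroke-density features.

```python
import numpy as np
from collections import Counter

def knn_identify(query_vec, ref_vecs, ref_labels, k=3):
    """Assign the (script, orientation) label that wins a majority vote
    among the k nearest reference document vectors (Euclidean distance)."""
    dists = np.linalg.norm(ref_vecs - query_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(ref_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

In use, each reference vector would be built from stroke density and distribution statistics of a known script at a known orientation, and the query document vector computed the same way.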

Proceedings Article
22 Jul 2007
TL;DR: The experimental results on the DUC2002 dataset demonstrate the effectiveness of the proposed approach based on document expansion, and the cross-document relationships between sentences in the expanded document set are validated to be very important for single document summarization.
Abstract: Existing methods for single document summarization usually make use of only the information contained in the specified document. This paper proposes the technique of document expansion to provide more knowledge to help single document summarization. A specified document is expanded to a small document set by adding a few neighbor documents close to the document, and then the graph-ranking based algorithm is applied on the expanded document set for extracting sentences from the single document, by making use of both the within-document relationships between sentences of the specified document and the cross-document relationships between sentences of all documents in the document set. The experimental results on the DUC2002 dataset demonstrate the effectiveness of the proposed approach based on document expansion. The cross-document relationships between sentences in the expanded document set are validated to be very important for single document summarization.
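The graph-ranking step over the expanded document set is typically a PageRank-style iteration over a sentence similarity matrix; this is a generic sketch, not the authors' exact formulation.

```python
import numpy as np

def graph_rank(sim, d=0.85, iters=50):
    """PageRank-style scores over a sentence similarity matrix.

    sim[i, j] >= 0 is the similarity between sentences i and j
    (covering both within-document and cross-document pairs).
    """
    n = sim.shape[0]
    w = sim.copy().astype(float)
    np.fill_diagonal(w, 0.0)         # no self-votes
    col = w.sum(axis=0)
    col[col == 0] = 1.0              # avoid division by zero for isolated nodes
    m = w / col                      # column-normalise incoming weight
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * m @ r  # damped power iteration
    return r
```

The highest-scoring sentences belonging to the original document would then be selected as the summary.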

Patent
02 Apr 2007
TL;DR: In this article, a method for automated processing of hard copy text documents includes scanning the hard copy document, subjecting the scanned document to an OCR process so as to obtain a text file of the text of the document, and subjecting the text file to a Named Entities (NE) recognition process.
Abstract: A method for automated processing of hard copy text documents includes scanning the hard copy document, subjecting the scanned document to an OCR process, so as to obtain a text file of the text of the document and subjecting the text file to a Named Entities (NE) recognition process. The NE recognition process includes detecting OCR recognition errors in the text file.

Patent
15 May 2007
TL;DR: In this paper, a document resource including pre-built textual components and document settings and properties is first passed through a translation process for translating any pre-built textual content to one or more target languages.
Abstract: Automated localization (translation) and internationalization of document resources may be provided for use by various target user groups requiring different text languages and/or document settings. A document resource including pre-built textual components and document settings and properties is first passed through a translation process for translating any pre-built textual content to one or more target languages. Text strings in the document resource may be extracted, translated and replaced to the document resource. Internationalization processing may then be accomplished wherein default page sizes, margin settings, language reading direction, and other document settings and properties are modified according to each target user group for the document resource. For initial document resource assembly, source files are identified for each component of a given document resource. The source files may be localized and internationalized and then may be used to compile a document resource for each of one or more target user groups.

Proceedings ArticleDOI
13 Dec 2007
TL;DR: Tamil Document Summarization using subgraphs presents a method for extracting sentences from an individual document to serve as a document summary or a precursor to creating a generic document abstract.
Abstract: Document summarization refers to the task of producing a shorter version of the original document by selecting important sentences from the text. Tamil Document Summarization using subgraphs presents a method for extracting sentences from an individual document to serve as a document summary or a precursor to creating a generic document abstract. Language-Neutral Syntax (LNS), a system of representation for natural language sentences, has been used for considering the semantics of the documents. Syntactic analysis of the text that produces a logical form analysis has been applied to each sentence. Subject-Object-Predicate (SOP) triples are extracted from individual sentences to create a semantic graph [2] of the original document and the corresponding human-extracted summary. Semantic normalization is applied to the SOP triples to reduce the number of nodes in the semantic graph of the original document. Using the Support Vector Machine (SVM) learning algorithm, a classifier has been trained to identify SOP triples from the document semantic graph that belong to the summary. The classifier is then used for automatic extraction of summaries from the test documents.

Proceedings ArticleDOI
23 Sep 2007
TL;DR: This paper proposes an approach for the automatic generation of synthesised document images and associated ground-truth information based on a derivation of publishing tools that illustrates the richness of the produced information.
Abstract: Performance evaluation for document image analysis and understanding is a recurring problem. Many ground-truthed document image databases are now used to evaluate algorithms, but these databases are less useful for the design of a complete system in a precise context. This paper proposes an approach for the automatic generation of synthesised document images and associated ground-truth information based on a derivation of publishing tools. An implementation of this approach illustrates the richness of the produced information.

Proceedings ArticleDOI
28 Aug 2007
TL;DR: It is shown that the HTML document, modeled with a Hidden Markov Model, can be accurately segmented into logical zones in the narrow domain of online journal articles.
Abstract: We describe ongoing research on segmenting and labeling HTML medical journal articles. In contrast to existing approaches in which HTML tags usually serve as strong indicators, we seek to minimize dependence on HTML tags. Designing logical component models for general Web pages is a challenging task. However, in the narrow domain of online journal articles, we show that the HTML document, modeled with a Hidden Markov Model, can be accurately segmented into logical zones.
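Labeling a sequence of page zones with an HMM comes down to Viterbi decoding; the sketch below uses made-up toy probabilities (two zone labels, two observation symbols) purely for illustration.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely state sequence for a sequence of observation indices.

    start_p[s] is the initial probability of state s, trans_p[s, t] the
    transition probability s -> t, and emit_p[s, o] the probability of
    emitting observation o from state s.
    """
    logv = np.log(start_p) + np.log(emit_p[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = logv[:, None] + np.log(trans_p)  # scores[s, t]: come from s into t
        back.append(scores.argmax(axis=0))        # best predecessor per state
        logv = scores.max(axis=0) + np.log(emit_p[:, o])
    path = [int(logv.argmax())]
    for bp in reversed(back):                     # trace back the best path
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

For the journal-article task, states would be logical zones (title, author, body, references) and observations would be features of successive text segments rather than raw HTML tags.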

Book ChapterDOI
25 Apr 2007
TL;DR: This work proposes a technique for the discovery of the logical document structure based on the analysis of various visual properties of the document such as the page layout or text properties, currently being tested and some promising preliminary results are available.
Abstract: A great amount of information is still being stored in loosely structured documents in several widely used formats. Due to the lack of data description in these documents, their integration to the existing information systems requires sophisticated pre-processing techniques to be developed. To the document reader, the content structure is mostly presented by visual means. Therefore, we propose a technique for the discovery of the logical document structure based on the analysis of various visual properties of the document such as the page layout or text properties. This technique is currently being tested and some promising preliminary results are available.

Proceedings ArticleDOI
23 Sep 2007
TL;DR: The experimental results show that document filtering based on the proposed method is more than 20 times faster than the one based on OCR, and has comparable filtering accuracy.
Abstract: In order to capture the content of an imaged document while avoiding time-consuming full-scale OCR, which is fragile when handling touching characters, a fast and segmentation-free keyword spotting method is proposed in this paper. The keyword spotting method is based on a word shape coding technique. The proposed coding scheme has little ambiguity and can be swiftly executed, making it a promising technique for better document image retrieval. The strength of the proposed method is demonstrated in a document filtering experiment. The experimental results show that document filtering based on the proposed method is more than 20 times faster than one based on OCR, and has comparable filtering accuracy.
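One simple form of word shape coding maps each character to a coarse vertical-extent class so whole words can be matched without OCR; the code alphabet below is illustrative, as the paper's actual scheme is not spelled out in the abstract.

```python
def shape_code(word):
    """Map a word to a coarse shape code: 'a' for ascender letters,
    'g' for descender letters, 'x' for x-height letters, 'A' for
    anything else (capitals, digits, punctuation)."""
    ascenders = set("bdfhklt")
    descenders = set("gjpqy")
    out = []
    for ch in word:
        if ch in ascenders:
            out.append("a")
        elif ch in descenders:
            out.append("g")
        elif ch.islower():
            out.append("x")
        else:
            out.append("A")
    return "".join(out)

def spot_keyword(page_words, keyword):
    """Return indices of page words whose shape code matches the keyword's,
    i.e. candidate hits found without running OCR."""
    target = shape_code(keyword)
    return [i for i, w in enumerate(page_words) if shape_code(w) == target]
```

In a real system the codes would be computed from word images (e.g. from vertical-extent profiles of connected components), not from known text as in this toy version.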

Proceedings ArticleDOI
23 Sep 2007
TL;DR: A flexible and effective example-based approach for labeling title pages which can be used for automated extraction of bibliographic data and has equivalent and partially better performance when compared to other more complex labeling methods known from the literature.
Abstract: This paper presents a flexible and effective example-based approach for labeling title pages which can be used for automated extraction of bibliographic data. The labels of interest are "title", "author", "abstract" and "affiliation". The method takes a set of labeled document layouts and a single unlabeled document layout as input and finds the best matching layout in the set. The labels of this layout are used to label the new layout. The similarity measure for layouts combines structural layout similarity and textural similarity at the block level. Experimental results yield accuracy rates from 94.8% to 99.6% on the publicly available MARG dataset. This shows that our lightweight method has equivalent, and partially better, performance when compared to more complex labeling methods known from the literature.
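The best-matching-layout idea can be sketched with a purely structural similarity (mean best-match IoU between block bounding boxes); the paper additionally combines this with textural similarity, which is omitted here.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def layout_similarity(layout_a, layout_b):
    """Mean best-match IoU between two lists of block boxes --
    a structural stand-in for the paper's combined measure."""
    if not layout_a or not layout_b:
        return 0.0
    return sum(max(iou(a, b) for b in layout_b) for a in layout_a) / len(layout_a)

def label_by_best_match(unlabeled, labeled_layouts):
    """Pick the labeled layout most similar to the new one and reuse its labels."""
    best = max(labeled_layouts,
               key=lambda item: layout_similarity(unlabeled, item["blocks"]))
    return best["labels"]
```

Here each labeled layout is a dict with "blocks" (bounding boxes) and "labels" (one per block); those field names are this sketch's convention, not the paper's.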

Book ChapterDOI
16 Sep 2007
TL;DR: A system to perform Document Image Retrieval in Digital Libraries that allows users to retrieve digitized pages on the basis of layout similarities and to make textual searches on the documents without relying on OCR.
Abstract: In this paper, we describe a system to perform Document Image Retrieval in Digital Libraries. The system allows users to retrieve digitized pages on the basis of layout similarities and to make textual searches on the documents without relying on OCR. The system is discussed in the context of recent applications of document image retrieval in the field of Digital Libraries. We present the different techniques in a single framework in which the emphasis is put on the representation level at which the similarity between the query and the indexed documents is computed. We also report the results of some recent experiments on the use of layout-based document image retrieval.

Proceedings ArticleDOI
Hervé Déjean1, Jean-Luc Meunier1
28 Aug 2007
TL;DR: A method for document layout analysis based on identifying the function of document elements (what they do); because it is not impacted by layout variability, a key issue in logical document analysis, the method is very robust and versatile.
Abstract: We present in this paper a method for document layout analysis based on identifying the function of document elements (what they do). This approach is orthogonal and complementary to the traditional view based on the form of document elements (how they are constructed). One key advantage of such functional knowledge is that the functions of some document elements are very stable from document to document and over time. Relying on the stability of such functions, the method is not impacted by layout variability, a key issue in logical document analysis and is thus very robust and versatile. The method starts the recognition process by using functional knowledge and uses in a second step formal knowledge as a source of feedback in order to correct some errors. This allows the method to adapt to specific documents by using formal specificities.

Patent
27 Nov 2007
TL;DR: In this article, a method, computer system and computer program product for identifying a writing system associated with a document image containing one or more words written in the writing system is presented.
Abstract: Disclosed herein is a method, computer system and computer program product for identifying a writing system associated with a document image containing one or more words written in the writing system. Initially, a document image fragment is identified based on the document image, wherein the document image fragment contains one or more pixels from one or more of the words in the document image. A set of sequential features associated with the document image fragment is generated, wherein each sequential feature describes one dimensional graphic information derived from the one or more pixels in the document image fragment. A classification score for the document image fragment is generated responsive at least in part to the set of sequential features, the classification score indicating a likelihood that the document image fragment is written in the writing system. The writing system associated with the document image is identified based at least in part on the classification score for the document image fragment.

Proceedings ArticleDOI
T. Hirano1, Y. Okano1, Y. Okada1, F. Yoda1
23 Sep 2007
TL;DR: This method analyzes the page description language (PDL) data generated from a printed document to extract text and layout information from document files of various formats.
Abstract: We propose a document analysis method, which extracts text and layout information from document files of various formats. This method analyzes the page description language (PDL) data generated from a printed document. By converting the document to PDL data, this method can handle various document formats. Graphic elements such as text objects, image objects, and path objects in the PDL data are analyzed to extract text and layout information (character size, character position, and table position). By applying OCR to the image objects and the path objects, text images in source documents and vectorized font characters in engineering drawings are converted to text. Moreover, tables in various documents are detected by analyzing path objects. Therefore, it is possible to extract the full content information from document files of various formats as long as the document is printable.