
Showing papers on "Document layout analysis published in 2007"


Patent
16 Aug 2007
TL;DR: In this paper, a method and apparatus for reformatting electronic documents is disclosed. The method consists of performing layout analysis on an electronic version of a document to locate text zones, assigning scale and importance attributes to those zones, and reformatting the text based on the attributes to create an image.
Abstract: A method and apparatus for reformatting electronic documents is disclosed. In one embodiment, the method comprises performing layout analysis on an electronic version of a document to locate text zones, assigning attributes for scale and importance to text zones in the electronic version of the document, and reformatting text in the electronic version of the document based on the attributes to create an image.

216 citations


Journal ArticleDOI
TL;DR: This work focuses on techniques that classify single-page typeset document images without using OCR results, and brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism.
Abstract: Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.

181 citations


Book ChapterDOI
01 Jan 2007
TL;DR: Automatic analysis of an arbitrary document with complex layout is an extremely difficult task and is beyond the capabilities of the state-of-the-art document structure and layout analysis systems.
Abstract: A document image is composed of a variety of physical entities or regions such as text blocks, lines, words, figures, tables, and background. We could also assign functional or logical labels such as sentences, titles, captions, author names, and addresses to some of these regions. The process of document structure and layout analysis tries to decompose a given document image into its component regions and understand their functional roles and relationships. The processing is carried out in multiple steps, such as preprocessing, page decomposition, structure understanding, etc. We will look into each of these steps in detail in the following sections. Document images are often generated from physical documents by digitization using scanners or digital cameras. Many documents, such as newspapers, magazines and brochures, contain very complex layout due to the placement of figures, titles, and captions, complex backgrounds, artistic text formatting, etc. (see Figure 1). A human reader uses a variety of additional cues such as context, conventions and information about language/script, along with a complex reasoning process to decipher the contents of a document. Automatic analysis of an arbitrary document with complex layout is an extremely difficult task and is beyond the capabilities of the state-of-the-art document structure and layout analysis systems. This is interesting since documents are designed to be effective and clear to human interpretation unlike natural images.

101 citations


Proceedings Article
06 Jan 2007
TL;DR: A method for detecting scene-frame divisions in comic images using the density gradient, after filling the quadrangle regions in each image with black, is proposed; experimental results show that 80 percent of 672 pages in four print comic booklets are successfully divided into scene frames by the proposed method.
Abstract: Today, demand for comic content services is increasing because paper magazines and books are bulky, while digital content can be read anytime and anywhere on cellular phones and PDAs. To convert existing print comic materials into a digital format that can be read on cellular phones and PDAs with small screens, it is necessary to divide each page into scene frames and to determine the reading order of the scene frames. The division of comic images into scene frames can be considered a type of document layout analysis. We analyzed the layout of comic images using the density gradient. The method can be applied to comics in which comic balloons or pictures are drawn over scene frames. In this research, a method for detecting the scene frame division in comic images using the density gradient, after filling the quadrangle regions in each image with black, is proposed. Experimental results show that 80 percent of 672 pages in four print comic booklets are successfully divided into scene frames by the proposed method.

83 citations


Patent
15 Nov 2007
TL;DR: In this article, the authors present a system for storing, organizing, and accessing image-based documents, which includes an OCR conversion process to produce an equivalent document in text format, identifying the keywords of the equivalent document, linking the keywords with the image-based document, and storing the image-based document, the corresponding equivalent document, and the keywords in a relational database.
Abstract: Methods and systems are provided for storing, organizing, and accessing image-based documents. The method includes receiving an image-based document, conducting an OCR conversion process to produce an equivalent document in text format, identifying keywords of the equivalent document in text format, linking the keywords with the image-based document and the corresponding equivalent document in text format, and storing the image-based document, the corresponding equivalent document in text format, and the keywords in a relational database.

61 citations


Patent
David M. Bargeron1, Charles E. Jacobs1, Wilmot Li1, David Salesin1, Evan J. Schrier1 
16 Jul 2007
TL;DR: The adaptive grid-based document layout system and methods as discussed by the authors provides a template authoring tool and user interface for interactively drawing and arranging layout elements within an adaptive template, including various element types and constraint-based relationships that define the layout of elements.
Abstract: A system and methods for facilitating adaptive grid-based document layout. More particularly, the adaptive grid-based document layout system and methods feature a new approach to adaptive grid-based document layout that utilizes a set of adaptive templates configurable for a range of different page sizes and viewing conditions. The templates include various element types and constraint-based relationships that define the layout of elements with reference to the viewing conditions under which the document content will be displayed and that define other content properties. Through a layout engine and paginator, the adaptive grid-based document layout system and methods determine a desirable sequence of templates to use for adapting document content. Additionally, the system and methods provide a template authoring tool and user interface for interactively drawing and arranging layout elements within an adaptive template.

59 citations


Patent
13 Jun 2007
TL;DR: In this paper, a document annotator converts a source document with a layout to a deterministic format including content and layout metadata; generated document annotations are then associated with positional tags, based on the layout metadata, that locate the annotations in the layout.
Abstract: In a document annotator (8), a document converter (12) is configured to convert a source document (10) with a layout to a deterministic format (14, 64) including content and layout metadata. At least one annotation pipeline (20, 22) is configured to generate document annotations respective to received content. A merger (36, 46) is configured to associate the generated document annotations with positional tags based on the layout metadata, which locate the document annotations in the layout. A document visualizer (58) is configured to render at least some content of the deterministic format and one or more selected annotations (60) in substantial conformance with the layout based on the layout metadata and the positional tags associated with the selected one or more annotations (60).

57 citations


Proceedings ArticleDOI
23 Sep 2007
TL;DR: This paper presents a text line segmentation method for printed or handwritten historical Arabic documents which has a 96% accuracy on a collection of 100 historical documents.
Abstract: This paper presents a text line segmentation method for printed or handwritten historical Arabic documents. Documents are first classified into two classes using a K-means scheme; these classes correspond to document complexity (easy or not easy to segment). A document that includes overlapping and touching characters is then divided into vertical strips. The text blocks extracted by horizontal projection are classified into three categories: small, average, and large. After segmenting the large text blocks, the lines are obtained by matching adjacent blocks within two successive strips using spatial relationships. A document without overlapping or touching characters is segmented without invoking the segmentation module for large text blocks. The text line segmentation method has a 96% accuracy on a collection of 100 historical documents.
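The strip-and-projection step described above can be sketched in a few lines; the function names and block-height thresholds below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def projection_blocks(strip, ink_threshold=0):
    """Split a binary strip (1 = ink) into text blocks via horizontal projection.

    Returns (start_row, end_row) pairs for each run of rows whose
    ink count exceeds ink_threshold.
    """
    profile = strip.sum(axis=1)          # ink pixels per row
    ink_rows = profile > ink_threshold   # rows that contain text
    blocks, start = [], None
    for r, has_ink in enumerate(ink_rows):
        if has_ink and start is None:
            start = r
        elif not has_ink and start is not None:
            blocks.append((start, r - 1))
            start = None
    if start is not None:
        blocks.append((start, len(ink_rows) - 1))
    return blocks

def classify_blocks(blocks, small=3, large=12):
    """Label each block small / average / large by its height in rows
    (the thresholds are made up for illustration)."""
    labels = []
    for s, e in blocks:
        h = e - s + 1
        labels.append("small" if h < small else "large" if h > large else "average")
    return labels
```

Lines would then be recovered by matching these per-strip blocks across adjacent strips, which the sketch leaves out.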

48 citations


Patent
15 Feb 2007
TL;DR: In this paper, a system and method for producing semantically-rich representations of texts to amplify and sharpen the interpretations of texts is proposed, which relies on the fact that there is a substantial amount of semantic content associated with most text strings that is not explicit in those strings, or in the mere statistical co-occurrence of the strings with other strings, but which is nevertheless extremely relevant to the text.
Abstract: A system and method for producing semantically-rich representations of texts to amplify and sharpen the interpretations of texts. The method relies on the fact that there is a substantial amount of semantic content associated with most text strings that is not explicit in those strings, or in the mere statistical co-occurrence of the strings with other strings, but which is nevertheless extremely relevant to the text. This additional information is used to both sharpen the representations derived directly from the text string, and also to augment the representation with content that, while not explicitly mentioned in the string, is implicit in the text and, if made explicit, can be used to support the performance of text processing applications including document indexing and retrieval, document classification, document routing, document summarization, and document tagging. These enhancements may be used to support down-stream processing, such as automated document reading and understanding, online advertising placement, electronic commerce, corporate knowledge management, and business and government intelligence applications.

48 citations


Patent
23 Oct 2007
TL;DR: In this article, the authors propose a toolbar for electronic document classification based on specific properties such as security classification, information type, document type, document retention, and the like.
Abstract: Electronic document classification is disclosed. A toolbar adds the ability to classify documents based on specific properties such as security classification, information type, document type, document retention, document caveats, and the like. Through dropdown selections, the toolbar allows users to select the appropriate classification and properties based upon the content of the document and have appropriate classifiers added to the document. Document classification properties are generated and associated with the document in the document properties, and visual markings are inserted that allow users to quickly identify the security, sensitivity, intended distribution, or retention. By utilizing the classification toolbar a user can classify a document by one or more classification levels and be ensured that the classification will be visible to any person viewing the document.

46 citations


Proceedings ArticleDOI
23 Sep 2007
TL;DR: This work proposes a robust technique for segmenting all sorts of graphics and texts in any orientation from document pages, essential for better OCR performance and vectorization in computer vision applications.
Abstract: Text, graphics, and half-tones are the major constituents of any document page. While half-tones can be characterised by their inherent intensity variation, text and graphics share common characteristics and differ mainly in spatial distribution. The success of document image analysis systems depends on the proper segmentation of text and graphics, as text is further subdivided into other classes such as headings, tables, and math zones. Segmentation of graphics is essential for better OCR performance and for vectorization in computer vision applications. Graphics segmentation from text is particularly difficult for graphics made of small components (dashed or dotted lines, etc.), which share many features with text. Here we propose a robust technique for segmenting all sorts of graphics and text in any orientation from document pages.

Journal ArticleDOI
15 Oct 2007
TL;DR: This paper presents a hybrid approach to segment and classify the contents of document images into three types of regions: Graphics, Text, and Space.
Abstract: In this paper we present a hybrid approach to segment and classify the contents of document images. A document image is segmented into three types of regions: Graphics, Text, and Space. The image of a document is subdivided into blocks, and for each block five GLCM (Grey Level Co-occurrence Matrix) features are extracted. Based on these features, blocks are then clustered into three groups using the K-Means algorithm; connected blocks that belong to the same group are merged. The classification of groups is done using pre-learned heuristic rules. Experiments were conducted on scanned newspapers and images from the MediaTeam Document Database.
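A minimal sketch of the block-level GLCM feature extraction, assuming a pre-quantised grey-level block; the five features chosen here (contrast, energy, homogeneity, entropy, mean) are common GLCM statistics, since the paper's exact five are not listed in the abstract.

```python
import numpy as np

def glcm(block, levels=8, dx=1, dy=0):
    """Grey-level co-occurrence matrix for one image block
    (values assumed quantised into [0, levels))."""
    m = np.zeros((levels, levels))
    h, w = block.shape
    for y in range(h - dy):
        for x in range(w - dx):
            m[block[y, x], block[y + dy, x + dx]] += 1
    return m / max(m.sum(), 1)  # normalise to a probability matrix

def glcm_features(block, levels=8):
    """Five illustrative GLCM features for a block."""
    p = glcm(block, levels)
    i, j = np.indices(p.shape)
    contrast = ((i - j) ** 2 * p).sum()
    energy = (p ** 2).sum()
    homogeneity = (p / (1.0 + np.abs(i - j))).sum()
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
    mean = (i * p).sum()
    return np.array([contrast, energy, homogeneity, entropy, mean])
```

These per-block vectors would then be fed to K-Means (three clusters) before the merge-and-label steps.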

Patent
Qing Yu1, Shuming Shi1, Zhiwei Li1, Ji-Rong Wen1, Wei-Ying Ma1 
01 Mar 2007
TL;DR: In this paper, a method and system for determining the relevance of a document having text and images to a text string is provided; the method is not, however, suited to text-only documents.
Abstract: A method and system for determining relevance of a document having text and images to a text string is provided. A scoring system identifies image text associated with an image of the document. The scoring system calculates an image score indicating relevance of the image text to the text string. The image score may be used in many applications, such as searching, summary generation, and document classification, image search, and image classification.

Patent
23 Jul 2007
TL;DR: In this paper, a product layout is created from a user-selected image, if any, and from text elements having user-supplied text content; empty text elements are disregarded, and positioning of user text entries is determined based on the size of the entries, defined text element spacing distances, and defined positioning rules.
Abstract: Methods and computer programs for automatically creating a text layout in an electronic design for a product to be printed. A number of defined text elements are available for user text entries. The product layout is based on a user-selected image, if any, and on the text elements having user-supplied text content. Text elements without text content are disregarded. Positioning of user text entries is determined based on the size of the text entries, defined text element spacing distances, and defined positioning rules. Creating a layout incorporating user-supplied text entries and/or an image may include cropping or resizing of other design elements in the product design and wrapping of relatively long text entries onto multiple lines.

Proceedings ArticleDOI
12 Aug 2007
TL;DR: This paper presents an approach for extracting relevant named entities from document images by combining rich page layout features in the image space with language content in the OCR text using a discriminative conditional random field (CRF) framework and integrates it into the expense reimbursement solution.
Abstract: Expense reimbursement is a time-consuming and labor-intensive process across organizations. In this paper, we present a prototype expense reimbursement system that dramatically reduces the elapsed time and costs involved, by eliminating paper from the process life cycle. Our complete solution involves (1) an electronic submission infrastructure that provides multi-channel image capture, secure transport and centralized storage of paper documents; (2) an unconstrained data mining approach to extracting relevant named entities from unstructured document images; (3) automation of auditing procedures that enables automatic expense validation with minimum human interaction. Extracting relevant named entities robustly from document images with unconstrained layouts and diverse formatting is a fundamental technical challenge in image-based data mining, question answering, and other information retrieval tasks. In many applications that require such capability, applying traditional language modeling techniques to the stream of OCR text does not give satisfactory results due to the absence of linguistic context. We present an approach for extracting relevant named entities from document images by combining rich page layout features in the image space with language content in the OCR text using a discriminative conditional random field (CRF) framework. We integrate this named entity extraction engine into our expense reimbursement solution and evaluate the system performance on large collections of real-world receipt images provided by IBM World Wide Reimbursement Center.

Proceedings ArticleDOI
23 Sep 2007
TL;DR: This paper presents a new framework for in-depth analysis of the performance of layout analysis methods that provides detailed information at various levels that can be used by method developers to identify specific problems and improve their work.
Abstract: This paper presents a new framework for in-depth analysis of the performance of layout analysis methods. Contrary to existing approaches aimed at evaluation or benchmarking, the proposed framework provides detailed information at various levels that can be used by method developers to identify specific problems and improve their work. Complex layouts are supported as well as the flexible configuration of goal-oriented performance analysis scenarios. The comparison of segmentation results against the ground truth is performed in a very efficient way based on a decomposition of any region shape into an interval-based description. The framework has been validated using the dataset and method results of the ICDAR2005 Page Segmentation Competition.

Proceedings ArticleDOI
23 Sep 2007
TL;DR: An identification technique that automatically detects the underlying script and orientation of scanned document images using stroke density and distribution; the technique is tolerant to document skew and able to detect the orientation of documents in different scripts.
Abstract: This paper presents an identification technique that automatically detects the underlying script and orientation of scanned document images. In the proposed technique, document script and orientation are identified by using the stroke density and distribution, which convert each document image into a document vector. For each script at each orientation, a number of reference document vectors are first constructed. Script and orientation of the query document are then determined according to the similarity between the query document vector and multiple pre-constructed reference document vectors by using the K-nearest neighbor algorithm. Experiments show that the proposed technique is tolerant to the document skew and able to detect orientations of documents of different scripts.
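The K-nearest-neighbor decision over document vectors can be sketched as follows; the vector contents and label strings are hypothetical, standing in for the paper's stroke-density features.

```python
import numpy as np
from collections import Counter

def knn_identify(query_vec, ref_vecs, ref_labels, k=3):
    """Assign the (script, orientation) label that wins a majority vote
    among the k nearest reference document vectors (Euclidean distance)."""
    dists = np.linalg.norm(ref_vecs - query_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(ref_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

In use, each reference vector would be built from stroke density and distribution statistics of a known script at a known orientation, and the query document vector computed the same way.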

Proceedings Article
22 Jul 2007
TL;DR: The experimental results on the DUC2002 dataset demonstrate the effectiveness of the proposed approach based on document expansion, and the cross-document relationships between sentences in the expanded document set are validated to be very important for single document summarization.
Abstract: Existing methods for single document summarization usually make use of only the information contained in the specified document. This paper proposes the technique of document expansion to provide more knowledge to help single document summarization. A specified document is expanded to a small document set by adding a few neighbor documents close to the document, and then the graph-ranking based algorithm is applied on the expanded document set for extracting sentences from the single document, by making use of both the within-document relationships between sentences of the specified document and the cross-document relationships between sentences of all documents in the document set. The experimental results on the DUC2002 dataset demonstrate the effectiveness of the proposed approach based on document expansion. The cross-document relationships between sentences in the expanded document set are validated to be very important for single document summarization.
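The graph-ranking step over the expanded document set is typically a PageRank-style iteration over a sentence similarity matrix; this is a generic sketch, not the authors' exact formulation.

```python
import numpy as np

def graph_rank(sim, d=0.85, iters=50):
    """PageRank-style scores over a sentence similarity matrix.

    sim[i, j] >= 0 is the similarity between sentences i and j
    (covering both within-document and cross-document pairs).
    """
    n = sim.shape[0]
    w = sim.copy().astype(float)
    np.fill_diagonal(w, 0.0)         # no self-votes
    col = w.sum(axis=0)
    col[col == 0] = 1.0              # avoid division by zero for isolated nodes
    m = w / col                      # column-normalise incoming weight
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * m @ r  # damped power iteration
    return r
```

The highest-scoring sentences belonging to the original document would then be selected as the summary.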

Patent
02 Apr 2007
TL;DR: In this article, a method for automated processing of hard copy text documents includes scanning the hard copy document, subjecting the scanned document to an OCR process so as to obtain a text file of the text of the document, and subjecting the text file to a Named Entities (NE) recognition process.
Abstract: A method for automated processing of hard copy text documents includes scanning the hard copy document, subjecting the scanned document to an OCR process, so as to obtain a text file of the text of the document and subjecting the text file to a Named Entities (NE) recognition process. The NE recognition process includes detecting OCR recognition errors in the text file.

Patent
15 May 2007
TL;DR: In this paper, a document resource including pre-built textual components and document settings and properties is first passed through a translation process for translating any pre-built textual content to one or more target languages.
Abstract: Automated localization (translation) and internationalization of document resources may be provided for use by various target user groups requiring different text languages and/or document settings. A document resource including pre-built textual components and document settings and properties is first passed through a translation process for translating any pre-built textual content to one or more target languages. Text strings in the document resource may be extracted, translated and replaced to the document resource. Internationalization processing may then be accomplished wherein default page sizes, margin settings, language reading direction, and other document settings and properties are modified according to each target user group for the document resource. For initial document resource assembly, source files are identified for each component of a given document resource. The source files may be localized and internationalized and then may be used to compile a document resource for each of one or more target user groups.

Proceedings ArticleDOI
13 Dec 2007
TL;DR: Tamil Document Summarization using subgraphs presents a method for extracting sentences from an individual document to serve as a document summary or a precursor to creating a generic document abstract.
Abstract: Document summarization refers to the task of producing a shorter version of the original document by selecting important sentences from the text. Tamil Document Summarization using subgraphs presents a method for extracting sentences from an individual document to serve as a document summary or a precursor to creating a generic document abstract. Language-Neutral Syntax (LNS), a system of representation for natural language sentences, has been used for considering the semantics of the documents. Syntactic analysis of the text that produces a logical form analysis has been applied to each sentence. Subject-Object-Predicate (SOP) triples are extracted from individual sentences to create a semantic graph [2] of the original document and the corresponding human-extracted summary. Semantic normalization is applied to the SOP triples to reduce the number of nodes in the semantic graph of the original document. Using the Support Vector Machine (SVM) learning algorithm, a classifier has been trained to identify SOP triples from the document semantic graph that belong to the summary. The classifier is then used for automatic extraction of summaries from the test documents.

Proceedings ArticleDOI
23 Sep 2007
TL;DR: This paper proposes an approach for the automatic generation of synthesised document images and associated ground-truth information based on a derivation of publishing tools that illustrates the richness of the produced information.
Abstract: Performance evaluation for document image analysis and understanding is a recurring problem. Many ground-truthed document image databases are now used to evaluate algorithms, but these databases are less useful for the design of a complete system in a precise context. This paper proposes an approach for the automatic generation of synthesised document images and associated ground-truth information based on a derivation of publishing tools. An implementation of this approach illustrates the richness of the produced information.

Proceedings ArticleDOI
28 Aug 2007
TL;DR: It is shown that the HTML document, modeled with a Hidden Markov Model, can be accurately segmented into logical zones in the narrow domain of online journal articles.
Abstract: We describe ongoing research on segmenting and labeling HTML medical journal articles. In contrast to existing approaches in which HTML tags usually serve as strong indicators, we seek to minimize dependence on HTML tags. Designing logical component models for general Web pages is a challenging task. However, in the narrow domain of online journal articles, we show that the HTML document, modeled with a Hidden Markov Model, can be accurately segmented into logical zones.
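Labeling a sequence of page zones with an HMM comes down to Viterbi decoding; the sketch below uses made-up toy probabilities (two zone labels, two observation symbols) purely for illustration.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely state sequence for a sequence of observation indices.

    start_p[s] is the initial probability of state s, trans_p[s, t] the
    transition probability s -> t, and emit_p[s, o] the probability of
    emitting observation o from state s.
    """
    logv = np.log(start_p) + np.log(emit_p[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = logv[:, None] + np.log(trans_p)  # scores[s, t]: come from s into t
        back.append(scores.argmax(axis=0))        # best predecessor per state
        logv = scores.max(axis=0) + np.log(emit_p[:, o])
    path = [int(logv.argmax())]
    for bp in reversed(back):                     # trace back the best path
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

For the journal-article task, states would be logical zones (title, author, body, references) and observations would be features of successive text segments rather than raw HTML tags.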

Book ChapterDOI
25 Apr 2007
TL;DR: This work proposes a technique for the discovery of the logical document structure based on the analysis of various visual properties of the document such as the page layout or text properties, currently being tested and some promising preliminary results are available.
Abstract: A great amount of information is still being stored in loosely structured documents in several widely used formats. Due to the lack of data description in these documents, their integration to the existing information systems requires sophisticated pre-processing techniques to be developed. To the document reader, the content structure is mostly presented by visual means. Therefore, we propose a technique for the discovery of the logical document structure based on the analysis of various visual properties of the document such as the page layout or text properties. This technique is currently being tested and some promising preliminary results are available.

Proceedings ArticleDOI
23 Sep 2007
TL;DR: The experimental results show that document filtering based on the proposed method is more than 20 times faster than the one based on OCR, and has comparable filtering accuracy.
Abstract: In order to capture the content of an imaged document while avoiding time-consuming full-scale OCR, which is fragile when handling touching characters, a fast and segmentation-free keyword spotting method is proposed in this paper. The keyword spotting method is based on a word shape coding technique. The proposed coding scheme has little ambiguity and can be swiftly executed, making it a promising technique for better document image retrieval. The strength of the proposed method is demonstrated in a document filtering experiment. The experimental results show that document filtering based on the proposed method is more than 20 times faster than one based on OCR, and has comparable filtering accuracy.
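One simple form of word shape coding maps each character to a coarse vertical-extent class so whole words can be matched without OCR; the code alphabet below is illustrative, as the paper's actual scheme is not spelled out in the abstract.

```python
def shape_code(word):
    """Map a word to a coarse shape code: 'a' for ascender letters,
    'g' for descender letters, 'x' for x-height letters, 'A' for
    anything else (capitals, digits, punctuation)."""
    ascenders = set("bdfhklt")
    descenders = set("gjpqy")
    out = []
    for ch in word:
        if ch in ascenders:
            out.append("a")
        elif ch in descenders:
            out.append("g")
        elif ch.islower():
            out.append("x")
        else:
            out.append("A")
    return "".join(out)

def spot_keyword(page_words, keyword):
    """Return indices of page words whose shape code matches the keyword's,
    i.e. candidate hits found without running OCR."""
    target = shape_code(keyword)
    return [i for i, w in enumerate(page_words) if shape_code(w) == target]
```

In a real system the codes would be computed from word images (e.g. from vertical-extent profiles of connected components), not from known text as in this toy version.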

Proceedings ArticleDOI
23 Sep 2007
TL;DR: A flexible and effective example-based approach for labeling title pages which can be used for automated extraction of bibliographic data and has equivalent and partially better performance when compared to other more complex labeling methods known from the literature.
Abstract: This paper presents a flexible and effective example-based approach for labeling title pages which can be used for automated extraction of bibliographic data. The labels of interest are "title", "author", "abstract" and "affiliation". The method takes a set of labeled document layouts and a single unlabeled document layout as input and finds the best matching layout in the set. The labels of this layout are used to label the new layout. The similarity measure for layouts combines structural layout similarity and textural similarity at the block level. Experimental results yield accuracy rates from 94.8% to 99.6% on the publicly available MARG dataset. This shows that our lightweight method has equivalent, and partially better, performance when compared to more complex labeling methods known from the literature.
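The best-matching-layout idea can be sketched with a purely structural similarity (mean best-match IoU between block bounding boxes); the paper additionally combines this with textural similarity, which is omitted here.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def layout_similarity(layout_a, layout_b):
    """Mean best-match IoU between two lists of block boxes --
    a structural stand-in for the paper's combined measure."""
    if not layout_a or not layout_b:
        return 0.0
    return sum(max(iou(a, b) for b in layout_b) for a in layout_a) / len(layout_a)

def label_by_best_match(unlabeled, labeled_layouts):
    """Pick the labeled layout most similar to the new one and reuse its labels."""
    best = max(labeled_layouts,
               key=lambda item: layout_similarity(unlabeled, item["blocks"]))
    return best["labels"]
```

Here each labeled layout is a dict with "blocks" (bounding boxes) and "labels" (one per block); those field names are this sketch's convention, not the paper's.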

Book ChapterDOI
16 Sep 2007
TL;DR: A system to perform Document Image Retrieval in Digital Libraries that allows users to retrieve digitized pages on the basis of layout similarities and to make textual searches on the documents without relying on OCR.
Abstract: In this paper, we describe a system to perform Document Image Retrieval in Digital Libraries. The system allows users to retrieve digitized pages on the basis of layout similarities and to make textual searches on the documents without relying on OCR. The system is discussed in the context of recent applications of document image retrieval in the field of Digital Libraries. We present the different techniques in a single framework in which the emphasis is put on the representation level at which the similarity between the query and the indexed documents is computed. We also report the results of some recent experiments on the use of layout-based document image retrieval.

Proceedings ArticleDOI
Hervé Déjean1, Jean-Luc Meunier1
28 Aug 2007
TL;DR: A method for document layout analysis based on identifying the function of document elements (what they do); because it is not impacted by layout variability, a key issue in logical document analysis, the method is very robust and versatile.
Abstract: We present in this paper a method for document layout analysis based on identifying the function of document elements (what they do). This approach is orthogonal and complementary to the traditional view based on the form of document elements (how they are constructed). One key advantage of such functional knowledge is that the functions of some document elements are very stable from document to document and over time. Relying on the stability of such functions, the method is not impacted by layout variability, a key issue in logical document analysis and is thus very robust and versatile. The method starts the recognition process by using functional knowledge and uses in a second step formal knowledge as a source of feedback in order to correct some errors. This allows the method to adapt to specific documents by using formal specificities.

Patent
27 Nov 2007
TL;DR: In this article, a method, computer system and computer program product for identifying a writing system associated with a document image containing one or more words written in the writing system is presented.
Abstract: Disclosed herein is a method, computer system and computer program product for identifying a writing system associated with a document image containing one or more words written in the writing system. Initially, a document image fragment is identified based on the document image, wherein the document image fragment contains one or more pixels from one or more of the words in the document image. A set of sequential features associated with the document image fragment is generated, wherein each sequential feature describes one dimensional graphic information derived from the one or more pixels in the document image fragment. A classification score for the document image fragment is generated responsive at least in part to the set of sequential features, the classification score indicating a likelihood that the document image fragment is written in the writing system. The writing system associated with the document image is identified based at least in part on the classification score for the document image fragment.

Proceedings ArticleDOI
T. Hirano1, Y. Okano1, Y. Okada1, F. Yoda1
23 Sep 2007
TL;DR: This method analyzes the page description language (PDL) data generated from a printed document to extract text and layout information from document files of various formats.
Abstract: We propose a document analysis method, which extracts text and layout information from document files of various formats. This method analyzes the page description language (PDL) data generated from a printed document. By converting the document to PDL data, this method can handle various document formats. Graphic elements such as text objects, image objects, and path objects in the PDL data are analyzed to extract text and layout information (character size, character position, and table position). By applying OCR to the image objects and the path objects, text images in source documents and vectorized font characters in engineering drawings are converted to text. Moreover, tables in various documents are detected by analyzing path objects. Therefore, it is possible to extract the full content information from document files of various formats as long as the document is printable.