Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1,462 publications have been published within this topic, receiving 34,021 citations.


Papers
Patent
22 May 1996
TL;DR: A knowledge-based document analysis system and method for identifying and decomposing constrained and unconstrained images of documents is disclosed, in which low-level features are extracted from bitonal and grayscale images.
Abstract: A knowledge-based document analysis system and method for identifying and decomposing constrained and unconstrained images of documents is disclosed. Low level features are extracted within bitonal and grayscale images. Low level features are passed to a document classification means which forms initial hypotheses about the document class. For constrained documents, the document analysis system sorts through various models to determine the exact type of document and then extracts the relevant fields for character recognition. For unconstrained documents, through the use of a blackboard architecture which includes a knowledge database and knowledge sources, the document analysis means creates information and hypotheses to identify and locate relevant fields within the document. These fields are then sent for optical character recognition.

108 citations

Journal Article
TL;DR: A survey on the techniques and problems involved in automatic knowledge acquisition through document processing is presented, and the basic concept of document structure and its measurement based on entropy analysis is introduced.
Abstract: The knowledge acquisition bottleneck has become the major impediment to the development and application of effective information systems. To remove this bottleneck, new document processing techniques must be introduced to automatically acquire knowledge from various types of documents. By presenting a survey on the techniques and problems involved, this paper aims at serving as a catalyst to stimulate research in automatic knowledge acquisition through document processing. In this study, a document is considered to have two structures: geometric structure and logical structure. These play a key role in the process of knowledge acquisition, which can be viewed as a process of acquiring the above structures. Extracting the geometric structure from a document refers to document analysis; mapping the geometric structure into logical structure is regarded as document understanding. Both areas are described in this paper, and the basic concept of document structure and its measurement based on entropy analysis is introduced. Logical structure and geometric models are proposed. Both top-down and bottom-up approaches and their entropy analyses are presented. Different techniques are discussed with practical examples. Mapping methods, such as tree transformation, document formatting knowledge and document format description language, are described.

106 citations

Journal Article
TL;DR: SectLabel is described, a module that further develops existing software to detect the logical structure of a document from existing PDF files, using the formalism of conditional random fields.
Abstract: Scholarly digital libraries increasingly provide analytics to information within documents themselves. This includes information about the logical document structure of use to downstream components, such as search, navigation, and summarization. In this paper, the authors describe SectLabel, a module that further develops existing software to detect the logical structure of a document from existing PDF files, using the formalism of conditional random fields. While previous work has assumed access only to the raw text representation of the document, a key aspect of this work is to integrate the use of a richer representation of the document that includes features from optical character recognition (OCR), such as font size and text position. Experiments reveal that using such rich features improves logical structure detection by a significant 9 F1 points over a suitable baseline, motivating the use of richer document representations in other digital library applications.

104 citations
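
As a rough illustration of the "richer representation" mentioned in the SectLabel abstract above, the sketch below encodes per-line layout cues (font size and horizontal position) as feature dictionaries of the kind that could be handed to a conditional random field toolkit. It is not SectLabel's feature set: the input fields (`text`, `font_size`, `x`) and every feature name are assumptions chosen for illustration, and no CRF training is shown.

```python
from statistics import median

def line_features(line, body_font_size):
    """Encode one OCR/PDF line as a feature dict (all feature names are hypothetical)."""
    text = line["text"]
    return {
        # Font size relative to the page's typical size; headings tend to exceed 1.0.
        "relative_font_size": round(line["font_size"] / body_font_size, 2),
        # Horizontal position of the line's left edge, in the extractor's units.
        "x_position": line["x"],
        "is_all_caps": text.isupper(),
        "starts_with_digit": text[:1].isdigit(),
        "word_count": len(text.split()),
    }

def page_features(lines):
    """Return one feature dict per line, using the median font size as the body size."""
    body_font_size = median(l["font_size"] for l in lines)
    return [line_features(l, body_font_size) for l in lines]

# Hypothetical input: three lines from one page.
page = [
    {"text": "1 Introduction", "font_size": 14.0, "x": 72},
    {"text": "Scholarly digital libraries provide analytics.", "font_size": 10.0, "x": 72},
    {"text": "This includes logical structure.", "font_size": 10.0, "x": 72},
]
features = page_features(page)
```

A sequence labeller would consume one such list per page, with labels such as "section header" or "body text" supplied during training.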

Patent
Yoshifumi Sato1, Masatoshi Hino1
29 Aug 1996
TL;DR: A structured document generating method and apparatus is disclosed that can easily generate a structured document matching the document structure of each non-structured document, by using a rule generated directly from a preset document structure definition to convert the non-structured document into the structured document.
Abstract: A structured document generating method and apparatus capable of easily generating a structured document matching the document structure of each non-structured document, by using a rule directly generated from a preset document structure definition for the conversion of the non-structured document into the structured document. A keyword extracting module extracts a keyword representative of the document structure from a non-structured document by using a keyword extracting rule, and a keyword/text model is generated which is described by two elements including keywords and other strings. A parsing module generated by a process of automatically parsing the document structure by referring to a parsing rule generated by modifying and converting DTD, performs a parsing process relative to the keyword/text model to generate an interim SGML document. An SGML document correcting module modifies the interim SGML document and generates a final output of an SGML document by referring to DTD different information generated when the parsing rule was generated.

103 citations
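
The two stages named in the patent abstract above (keyword extraction into a keyword/text model, then conversion into markup) can be loosely sketched as follows. This is a toy under stated assumptions, not the patented method: the keyword rule (`Title`, `Author`, `Abstract`), the tag names, and both function names are hypothetical, and the DTD-derived parsing rule and the correcting module are not modelled.

```python
import re

# Hypothetical keyword extracting rule: a line starting with one of these labels.
KEYWORD_RULE = re.compile(r"^(Title|Author|Abstract):\s*(.*)$")

def keyword_text_model(lines):
    """Turn raw lines into a flat model of ("KEYWORD", name) and ("TEXT", string) items."""
    model = []
    for line in lines:
        match = KEYWORD_RULE.match(line)
        if match:
            model.append(("KEYWORD", match.group(1)))
            if match.group(2):
                model.append(("TEXT", match.group(2)))
        elif line.strip():
            model.append(("TEXT", line.strip()))
    return model

def to_markup(model):
    """Wrap the text following each keyword in an element named after that keyword."""
    elements = []  # each entry is [tag, list of text parts]
    for kind, value in model:
        if kind == "KEYWORD":
            elements.append([value.lower(), []])
        elif elements:
            # Text before the first keyword is dropped in this sketch.
            elements[-1][1].append(value)
    # Note: text is not escaped here; a real converter would have to handle that.
    return "\n".join(f"<{tag}>{' '.join(parts)}</{tag}>" for tag, parts in elements)

raw = ["Title: Layout Analysis", "Author: A. Writer", "Abstract:", "First line.", "Second line."]
interim = to_markup(keyword_text_model(raw))
```

Keeping the keyword/text model as a separate intermediate step mirrors the abstract's description, where parsing operates on that model rather than on the raw text.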

Patent
27 Aug 2012
TL;DR: A document processing system for accurately and efficiently analyzing documents, and methods for making and using the same, are presented; each incoming document includes at least one section of textual content and is provided either in electronic form or as a paper-based document that is converted into electronic form.
Abstract: A document processing system for accurately and efficiently analyzing documents and methods for making and using same. Each incoming document includes at least one section of textual content and is provided in an electronic form or as a paper-based document that is converted into an electronic form. Since many categories of documents, such as legal and accounting documents, often include one or more common text sections with similar textual content, the document processing system compares the documents to identify and classify the common text sections. The document comparison can be further enhanced by dividing the document into document segments and comparing the document segments; whereas, the conversion of paper-based documents likewise can be improved by comparing the resultant electronic document with a library of standard phrases, sentences, and paragraphs. The document processing system thereby enables an image of the document to be manipulated, as desired, to facilitate its review.

103 citations
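
The segment-level comparison described in the abstract above can be sketched with Python's standard `difflib` module. This is a minimal illustration, not the patented system: splitting on blank lines, the 0.8 similarity threshold, and the function names are all assumptions.

```python
from difflib import SequenceMatcher

def split_segments(text):
    """Split a document into segments on blank lines (an assumed segmentation rule)."""
    return [seg.strip() for seg in text.split("\n\n") if seg.strip()]

def common_sections(doc_a, doc_b, threshold=0.8):
    """Return (segment_a, segment_b, similarity) for segment pairs at or above the threshold."""
    matches = []
    for seg_a in split_segments(doc_a):
        for seg_b in split_segments(doc_b):
            # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
            similarity = SequenceMatcher(None, seg_a.lower(), seg_b.lower()).ratio()
            if similarity >= threshold:
                matches.append((seg_a, seg_b, similarity))
    return matches

# Hypothetical documents sharing one clause.
contract_a = "This agreement is governed by the laws of New York.\n\nFee: 100 USD."
contract_b = "This agreement is governed by the laws of New York.\n\nShips in ten days."
shared = common_sections(contract_a, contract_b)
```

The pairwise loop is quadratic in the number of segments, which is fine for a sketch but would need indexing or hashing for a large document collection.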


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 82% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Object detection: 46.1K papers, 1.3M citations, 81% related
Image segmentation: 79.6K papers, 1.8M citations, 80% related
Convolutional neural network: 74.7K papers, 2M citations, 79% related
Performance Metrics
No. of papers in the topic in previous years
Year    Papers
2023    5
2022    19
2021    34
2020    19
2019    14
2018    9