Topic
Document layout analysis
About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have appeared within this topic, receiving 34021 citations.
Papers
•
20 May 2003
TL;DR: In this article, a method of automated document structure identification based on visual cues is proposed, which can be applied in the generation of extensible mark-up language files, natural language parsing and search engine ranking mechanisms.
Abstract: A method of automated document structure identification based on visual cues is disclosed herein. The two-dimensional layout of the document is analyzed to discern visual cues related to the structure of the document, and the text of the document is tokenized so that similarly structured elements are treated similarly. The method can be applied in the generation of extensible mark-up language files, natural language parsing and search engine ranking mechanisms.
91 citations
•
23 May 1996
TL;DR: In this paper, a system for providing information on the structure of a document page so as to complement the textual information provided in an optical character recognition system is presented, which can be used to produce a file editable in a native word-processing environment from input data including the content and characteristics of regions of at least one page forming the document.
Abstract: The present invention is a system for providing information on the structure of a document page so as to complement the textual information provided in an optical character recognition system. The system employs a method that can be used to produce a file editable in a native word-processing environment from input data including the content and characteristics of regions of at least one page forming the document. The method includes the steps of: (a) identifying sections within the page; (b) identifying captions; (c) determining boundaries of at least one column on the page, and optionally (d) resizing at least one element of the page of the document so that all pages of the document are of a common size.
91 citations
•
06 Mar 2006
TL;DR: In this article, a method and apparatus for automated document layout creation is described, which comprises receiving a first layout of document image objects and creating a second layout of image objects subject to placement constraints corresponding to the placement of the image objects.
Abstract: A method and apparatus for automated document layout creation is disclosed. In one embodiment, the method comprises receiving a first layout of document image objects and creating a second layout of document image objects subject to placement constraints corresponding to placement of document image objects, at least one of the placement constraints being based on object content in one or more of the document image objects.
90 citations
•
14 Jan 2004
TL;DR: In this paper, a method is proposed to generate a minimum set of simplified and navigable web contents from a single web document that is oversized for targeted smaller devices, while preserving text, image, transactional and embedded presentation constraint information.
Abstract: A method is disclosed to generate, while preserving text, image, transactional and embedded presentation constraint information, a minimum set of simplified and navigable web contents from a single web document that is oversized for targeted smaller devices. The method includes a parser, a content tree builder, a document tree builder, a document simplifier, a virtual layout engine, a document partitioner, a content scalar and a markup generator. The parser generates markup and data tags from an HTML source document. The builder constructs a content tree. The simplifier transforms the document tree into an intermediate one defined by a subset of XHTML tags and attributes. Layout constraints, including size, area, placement order, and column/row relationships, are calculated for partitioning and scaling the document tree into sub document trees with assigned navigation order and hierarchical hyperlinks. A simplified HTML document is then generated with the markup generator.
90 citations
•
01 Jan 1995
TL;DR: In this article, a conceptual framework for solving the task of document analysis, which consists in the conversion of the document's pixel representation into an equivalent knowledge network representation holding the document content and layout, is presented.
Abstract: The authors present a conceptual framework for solving the task of document analysis, which, in essence, consists in the conversion of the document's pixel representation into an equivalent knowledge network representation holding the document's content and layout. Starting on the pixel level, the formation of elementary geometric objects on which layout analysis as well as the definition of character objects is based is described. Character recognition accomplishes the mapping from geometric object to character meaning in ASCII representation. On the next level of abstraction words are formed and verified by contextual processing. Modeled knowledge about complete documents and about how their constituents are related to the application forms the highest level of abstraction. The various problems arising at each stage are discussed. The dependencies between the different levels are exemplified and technical solutions put forward.
89 citations