Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1,462 publications have appeared within this topic, receiving 34,021 citations.


Papers
Patent
Akio Yamashita, Kazuharu Toyokawa
26 Jul 1995
TL;DR: A tree structure and layout model are generated by automatically extracting the tree structure through document image analysis before the user makes any graphical correction; the resulting area segmentation is then displayed together with the document image and corrected interactively by the user to define the desired tree structure.
Abstract: The present invention provides a method for extracting a tree structure by using image analysis results of an actual document and generating a flexible layout model. A tree structure and layout model are newly generated by automatically extracting the tree structure in accordance with document image analysis before a user executes graphical correction. That is, an input document image is physically analyzed to extract separators that are highly likely to separate the objects of the document, and the document image is segmented into a plurality of areas in accordance with the separator information. Then, the area segmentation is displayed on a display unit together with the document image and interactively corrected by the user to define a desired tree structure, and a flexible layout model is completed by setting a parameter for each node of the tree structure.

121 citations

Patent
31 Jul 2002
TL;DR: A system and method for summarizing the contents of a natural language document provided in electronic or digital form includes preformatting the document, performing linguistic analysis, weighting each sentence in the document as a function of quantitative importance, and generating one or more document summaries, from a plurality of selectable document summary types, as a function of the sentence weights.
Abstract: A system and method for summarizing the contents of a natural language document provided in electronic or digital form includes preformatting the document, performing linguistic analysis, weighting each sentence in the document as a function of quantitative importance, and generating one or more document summaries, from a plurality of selectable document summary types, as a function of the sentence weights.

120 citations
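
The sentence-weighting idea in the abstract above can be sketched as a simple extractive summarizer. The abstract does not say how "quantitative importance" is computed, so the code below assumes an average word-frequency score as a stand-in, and it produces only one summary type (the top-weighted sentences in their original order). The names `weight_sentences`, `summarize`, and `STOPWORDS` are hypothetical.

```python
import re
from collections import Counter
from typing import List

# Assumed, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "for", "on", "as", "with"}


def split_sentences(text: str) -> List[str]:
    """Naive preformatting step: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence: str) -> List[str]:
    """Lowercased content words of a sentence."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def weight_sentences(sentences: List[str]) -> List[float]:
    """Weight each sentence by the mean document frequency of its words."""
    freq = Counter(w for s in sentences for w in tokens(s))
    weights = []
    for s in sentences:
        words = tokens(s)
        weights.append(sum(freq[w] for w in words) / len(words) if words else 0.0)
    return weights


def summarize(text: str, max_sentences: int = 2) -> str:
    """Keep the highest-weighted sentences, in their original order."""
    sentences = split_sentences(text)
    weights = weight_sentences(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Other summary types could reuse the same weights with a different selection rule, for example a threshold on the weight instead of a fixed sentence count.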

Journal ArticleDOI
TL;DR: It is demonstrated that layout offers a rich resource for achieving presentational coherence, alongside more traditional resources such as text-formatting and the text-internal marking of discourse connections, and an integrated approach to layout, text, and diagram generation is introduced.
Abstract: Combining elements appropriately within a coherent page layout is a well-recognized and crucial aspect of sophisticated information presentation. The precise function and nature of layout has not, however, been sufficiently addressed within computational approaches; attention is often restricted to relatively local issues of typography and text-formatting, leaving broader issues of layout unaddressed. In this paper we focus on the selection and function of layout in pages that appropriately combine textual and graphical representation styles to yield coherent presentation designs. We demonstrate that layout offers a rich resource for achieving presentational coherence, alongside more traditional resources such as text-formatting and the text-internal marking of discourse connections. We also introduce an integrated approach to layout, text, and diagram generation. Our approach is developed on the basis of a preliminary empirical investigation of professionally produced layouts, followed by implementation within a prototype information system in the area of art history.

119 citations

Proceedings ArticleDOI
26 Jul 2009
TL;DR: This paper presents a new dataset (and the methodology used to create it) based on a wide range of contemporary documents, with strong emphasis on comprehensive and detailed representation of both complex and simple layouts, and on colour originals.
Abstract: There is a significant need for a realistic dataset on which to evaluate layout analysis methods and examine their performance in detail. This paper presents a new dataset (and the methodology used to create it) based on a wide range of contemporary documents. Strong emphasis is placed on comprehensive and detailed representation of both complex and simple layouts, and on colour originals. In-depth information is recorded both at the page and region level. Ground truth is efficiently created using a new semi-automated tool and stored in a new comprehensive XML representation, the PAGE format. The dataset can be browsed and searched via a web-based front end to the underlying database and suitable subsets (relevant to specific evaluation goals) can be selected and downloaded.

117 citations

01 Jan 2003
TL;DR: This paper summarizes research in document layout analysis carried out over the last few years in the author's laboratory, which has developed a number of novel geometric algorithms and statistical methods that are applicable to a wide variety of languages and layouts.
Abstract: In this paper, I summarize research in document layout analysis carried out over the last few years in our laboratory. Correct document layout analysis is a key step in document capture conversions into electronic formats, optical character recognition (OCR), information retrieval from scanned documents, appearance-based document retrieval, and reformatting of documents for on-screen display. We have developed a number of novel geometric algorithms and statistical methods. Layout analysis systems built from these algorithms are applicable to a wide variety of languages and layouts, and have proven to be robust to the presence of noise and spurious features in a page image. The system itself consists of reusable and independent software modules that can be reconfigured to be adapted to different languages and applications. Currently, we are using them for electronic book and document capture applications. If there is commercial or government demand, we are interested in adapting these tools to information retrieval and intelligence applications.

114 citations


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
82% related
Feature (computer vision)
128.2K papers, 1.7M citations
82% related
Object detection
46.1K papers, 1.3M citations
81% related
Image segmentation
79.6K papers, 1.8M citations
80% related
Convolutional neural network
74.7K papers, 2M citations
79% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  5
2022  19
2021  34
2020  19
2019  14
2018  9