Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1,462 publications have been published within this topic, receiving 34,021 citations.


Papers
Patent
31 Mar 1992
TL;DR: An apparatus and method for editing a document to automatically produce a satisfactory, well ordered layout. The method extracts characteristic quantities that characterize the different elements of the document, derives relationships among the elements in accordance with those quantities, determines a layout of the elements in accordance with the relationships, and processes the document in accordance with the layout.
Abstract: An apparatus and method for editing a document to automatically produce a satisfactory, well ordered layout which includes the steps of (a) extracting characteristic quantities which characterize different elements of the document; (b) deriving relationships among the different elements of the document in accordance with the characteristic quantities; (c) determining a layout of the different elements of the document in accordance with the relationships; and (d) processing the document in accordance with the layout.

40 citations

Patent
Chinmoy Panda1
23 Mar 2000
TL;DR: A computer-implemented method and system for identifying key images in a document. The method extracts one or more document keywords considered important in describing the document, collects one or more images associated with the document together with information describing each image, generates a proximity factor for each image and document keyword pair that reflects the degree of correlation between them, and determines the importance of each image according to an image metric that combines the proximity factors for each document keyword and image pair.
Abstract: A computer-implemented method and system for identifying key images in a document is provided. The operations used include extracting one or more document keywords from the document considered important in describing the document, collecting one or more images associated with the document including information describing each image, generating a proximity factor for each image collected from the document and each document keyword that reflects the degree of correlation between the image and the document keyword, and determining the importance of each image according to an image metric that combines the proximity factors for each document keyword and image pair. In addition, the operations may also include ordering the document keywords according to an ordering criterion and weighting the proximity factor associated with each document keyword and image pair based on the order of the document keyword.

40 citations
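
The scoring scheme in the abstract above can be illustrated with a minimal Python sketch. The abstract does not say how the proximity factor or the image metric is computed, so everything concrete here is an assumption made for illustration: the `Image` class, the word-frequency `proximity` function, the `1 / (rank + 1)` keyword weighting, and the weighted sum used as the image metric are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Image:
    name: str
    description: str  # information describing the image, e.g. a caption


def proximity(image: Image, keyword: str) -> float:
    """Hypothetical proximity factor: share of description words equal to the keyword."""
    words = image.description.lower().split()
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


def rank_images(images: list[Image], keywords: list[str]) -> list[tuple[str, float]]:
    """Score each image against keywords ordered from most to least important.

    Assumption: a keyword at position `rank` is weighted by 1 / (rank + 1),
    and the image metric is the weighted sum of its proximity factors.
    """
    scores = {}
    for img in images:
        scores[img.name] = sum(
            proximity(img, kw) / (rank + 1) for rank, kw in enumerate(keywords)
        )
    # Highest-scoring (most important) images first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    images = [
        Image("figure1", "layout analysis diagram"),
        Image("logo", "company logo"),
    ]
    for name, score in rank_images(images, ["layout", "analysis"]):
        print(name, round(score, 3))
```

A real system would likely use a richer correlation measure than exact word matches, but the structure (one factor per keyword and image pair, weighted by keyword order, combined into one score per image) follows the steps the abstract lists.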

Patent
05 Jun 2003
TL;DR: Indexing information in a distributed data processing system by providing document structure templates comprising model document structures and semantics for those structures, identifying the structure of a document, selecting a template in dependence upon the structure of the document and the model document structures in the templates, and storing search keywords from the document in records in a semantics-based search index according to the semantics from the selected template.
Abstract: Indexing information in a distributed data processing system, including providing document structure templates comprising model document structures and semantics for the model document structures; identifying the structure of a document; selecting a document structure template in dependence upon the structure of the document and the model document structures in the document structure templates; and storing search keywords from the document in records in a semantics-based search index according to the semantics from the selected document structure template. Selecting a document structure template in dependence upon the structure of the document and the model document structures in the document structure templates typically further comprises comparing the structure of the document and the model document structures in the templates; and selecting a template whose model document structure matches the structure of the document.

40 citations

Patent
16 Aug 2006
TL;DR: Methods for automatically creating a text layout in which the layout is based on the text elements that have user text content, while text elements without text content are disregarded. The position of each text element is determined from the height of the text elements, defined text element spacing distances, and a defined positioning order.
Abstract: Methods and computer programs for automatically creating a text layout in a markup language design for a product to be printed. A number of defined text elements are available for user text. The layout is based on the text elements having user text content. Text elements without text content are disregarded. Positioning of the text elements is determined based on the height of the text elements, defined text element spacing distances, and a defined positioning order. Creating a layout may include positioning design elements relative to the text elements. Font sizes and spacing distances are automatically reduced if necessary to create a suitable layout.

39 citations
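
The positioning rule in the abstract above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the patented method: the `TextElement` class, the `LINE_HEIGHT_FACTOR` constant, vertical-only stacking, and the `min_scale` and `step` parameters are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

LINE_HEIGHT_FACTOR = 1.2  # assumed ratio of line height to font size


@dataclass
class TextElement:
    name: str
    text: str
    font_size: float  # in points


def element_height(el: TextElement, scale: float) -> float:
    """Height of an element: number of lines times the scaled line height."""
    lines = el.text.count("\n") + 1
    return lines * el.font_size * scale * LINE_HEIGHT_FACTOR


def layout(elements, spacing, area_height, min_scale=0.5, step=0.05):
    """Stack non-empty elements top to bottom in the given order.

    Returns (positions, scale), where positions maps element name to its
    top y coordinate. Font sizes and spacing are shrunk together by `scale`
    until the stack fits in `area_height` or `min_scale` is reached.
    """
    # Elements without user text are disregarded.
    filled = [el for el in elements if el.text.strip()]
    scale = 1.0
    while True:
        positions = {}
        y = 0.0
        for i, el in enumerate(filled):
            if i > 0:
                y += spacing * scale  # spacing between consecutive elements
            positions[el.name] = y
            y += element_height(el, scale)
        if y <= area_height or scale <= min_scale:
            return positions, scale
        scale = round(scale - step, 10)


if __name__ == "__main__":
    card = [
        TextElement("name", "Ada Lovelace", 20),
        TextElement("title", "", 10),  # empty, so it is skipped
        TextElement("phone", "555-0100", 10),
    ]
    print(layout(card, spacing=4, area_height=100))
```

Because empty elements are filtered out before positioning, the remaining elements close up the gap automatically, which is the behavior the abstract describes for text elements without content.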

01 Jan 2013
TL;DR: This paper reviews the types of noise found in scanned document images, which reduce the accuracy of subsequent OCR (Optical Character Recognition) tasks, and discusses some noise removal methods.
Abstract:  Abstract- document images may be contaminated with noise during transmission, scanning or conversion to digital form. We can categorize noises by identifying their features and can search for similar patterns in a document image to choose appropriate methods for their removal. After a brief introduction, this paper reviews noises that might appear in scanned document images and discusses some noise removal methods. owadays, with the increase in computer use in everybody's lives, the ability for people to convert documents to digital and readable formats has become a necessity. Scanning documents is a way of changing printed documents into digital format. A common problem encountered when scanning documents is 'noise' which can occur in an image because of paper quality, the typing machine used, or it can be created by scanners during the scanning process. Noise removal is one of the steps in pre- processing. Among other things, noise reduces the accuracy of subsequent tasks of OCR (Optical character Recognition) systems. It can appear in the foreground or background of an image and can be generated before or after scanning. Examples of noise in scanned document images are as follows. The page rule line is a source of noise which interferes with text objects. The marginal noise usually appears in a large dark region around the document image and can be textual or non-textual. Some forms of clutter noise appear in an image because of document skew while scanning or are from holes punched in the document, or background noise, such as uneven contrast, show through effects, interfering strokes, and background spots, etc. Next, we will discuss each type in detail.

39 citations


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 82% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Object detection: 46.1K papers, 1.3M citations, 81% related
Image segmentation: 79.6K papers, 1.8M citations, 80% related
Convolutional neural network: 74.7K papers, 2M citations, 79% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2023  5
2022  19
2021  34
2020  19
2019  14
2018  9