Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have appeared within this topic, receiving 34021 citations.


Papers
Proceedings Article
19 Sep 2011
TL;DR: An approach called bag-of-related-words is proposed that generates features composed of sets of related words, with a dimensionality smaller than that of the bag-of-words, for topic hierarchy building.
Abstract: A simple and intuitive way to organize a huge document collection is by a topic hierarchy. Generally, two steps are carried out to build a topic hierarchy automatically: 1) hierarchical document clustering and 2) cluster labeling. For both steps, a good textual document representation is essential. The bag-of-words is the common way to represent text collections. In this representation, each document is represented by a vector in which each word in the document collection is a dimension (feature). This approach has well-known problems, such as the high dimensionality and sparsity of the data. Besides, most concepts are composed of more than one word, such as "document engineering" or "text mining". In this paper, an approach called bag-of-related-words is proposed to generate features composed of sets of related words, with a dimensionality smaller than that of the bag-of-words. The features are extracted from each textual document of a collection using association rules. Different ways of mapping a document into transactions, so that association rules can be extracted, and different interest measures for pruning the number of features are analyzed. To evaluate how much the proposed approach can aid topic hierarchy building, we carried out an objective evaluation of the clustering structure and a subjective evaluation of the topic hierarchies. All results were compared with the bag-of-words. The results showed that the proposed representation is better than the bag-of-words for topic hierarchy building.
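
The idea of deriving multi-word features from co-occurrence within a single document can be sketched roughly as follows. This is only an illustration, not the authors' method: it assumes each sentence is one transaction (the paper compares several mappings), it keeps only word pairs, and it prunes by raw support instead of the interest measures the paper analyzes. The function name `related_word_features` and the parameter `min_support` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def related_word_features(document, min_support=2):
    """Return word pairs that co-occur in at least `min_support` sentences.

    Assumption: one sentence = one transaction. Only frequent pairs are
    mined, which is the first step of association rule mining; no
    confidence or other interest measure is computed here.
    """
    transactions = [
        set(sentence.lower().split())
        for sentence in document.split(".")
        if sentence.strip()
    ]
    pair_counts = Counter()
    for words in transactions:
        # Sort so that each pair has one canonical order.
        pair_counts.update(combinations(sorted(words), 2))
    return {
        "_".join(pair): count
        for pair, count in pair_counts.items()
        if count >= min_support
    }


doc = (
    "text mining finds patterns. "
    "text mining needs features. "
    "document engineering is related."
)
features = related_word_features(doc)
print(features)  # {'mining_text': 2}
```

Because pairs are mined per document and pruned by support, the resulting feature set stays far smaller than the set of all word pairs in the collection, which is the dimensionality argument the abstract makes.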

17 citations

Patent
25 Apr 2000
TL;DR: A document image recognition method is proposed that identifies the regions of color document images and black-and-white/gray images, enabling accurate OCR even on color documents, which have problems peculiar to them.
Abstract: PROBLEM TO BE SOLVED: To provide a document image recognition method that accurately and efficiently identifies the regions of color document images and black-and-white/gray images, and that enables accurate OCR even on color documents, which have problems peculiar to them. SOLUTION: In this method for recognizing a document image, the document image is input as a digital image, its background color is determined, and the image is reduced in size as needed. Using the background color, pixels outside the background area are extracted from the document image, connected components are generated by merging these pixels, the connected components are classified into prescribed region types using at least shape features, and the region identification result for the document image is output. In addition, region identification is performed on a binary image, the result is collated with the color region identification result, and feedback processing is performed as needed, so that a binary image and a region identification result suitable for OCR can be provided. COPYRIGHT: (C)2001,JPO
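
The pipeline in this abstract (determine the background color, extract non-background pixels, merge them into components, classify by shape) can be illustrated with a toy sketch. Everything specific here is an assumption, not taken from the patent: the background is taken to be the most frequent color, pixels are merged with 4-connectivity, and the only shape feature is bounding-box height. The names `find_regions`, `classify_region` and `max_text_height` are hypothetical.

```python
from collections import Counter, deque


def find_regions(image):
    """Return bounding boxes (top, left, bottom, right) of non-background blobs.

    `image` is a list of equal-length rows of color values.
    Assumption: the most frequent color is the background.
    """
    background = Counter(p for row in image for p in row).most_common(1)[0][0]
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or image[y][x] == background:
                continue
            # Breadth-first flood fill over 4-connected foreground pixels.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and image[ny][nx] != background):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def classify_region(box, max_text_height=2):
    """Toy shape rule (an assumption): short boxes are text, tall ones pictures."""
    top, _, bottom, _ = box
    return "text" if bottom - top + 1 <= max_text_height else "picture"


page = [
    "WWWWWWWW",
    "WBBBWWWW",
    "WWWWWWWW",
    "WWWWRRRW",
    "WWWWRRRW",
    "WWWWRRRW",
]
boxes = find_regions(page)
labels = [classify_region(b) for b in boxes]
print(boxes, labels)
```

The sketch omits the image reduction step and the feedback loop between the color and binary results that the abstract describes.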

17 citations

Book Chapter
15 Dec 2009
TL;DR: A model-guided segmentation and document layout extraction scheme based on hierarchical Conditional Random Fields is presented. It was motivated by an automated layout analyser and machine translator for technical papers, and can also be used for other applications such as search, indexing and information retrieval.
Abstract: We present a model-guided segmentation and document layout extraction scheme based on hierarchical Conditional Random Fields (CRFs, hereafter). Common methods to classify a pixel of a document image into classes (text, background and image) are often noisy and error-prone, and often require post-processing through heuristic methods. The input to the system is a pixel-wise classification produced by a Fisher classifier applied to the output of a set of Globally Matched Wavelet (GMW) filters. The system extracts features which encode contextual information and spatial configurations of a given document image, and learns relations between these layout entities using hierarchical CRFs. The hierarchical CRF enables learning at various levels: 1. local features for text, background and image areas; 2. contextual features for further classifying region blocks (title, author block, heading, paragraph, etc.); and 3. a probabilistic layout model for encoding global relations between the above blocks for a particular class of documents. Although the work was motivated by an automated layout analyser and machine translator for technical papers, it can also be used for other applications such as search, indexing and information retrieval.

17 citations

Proceedings Article
18 Aug 1997
TL;DR: A new document model that preserves top-down generation information is proposed; based on it, a document is logically represented for interactive editing, storage, retrieval, transfer and logical analysis.
Abstract: Transforming a paper document to its electronic version in a form suitable for efficient storage, retrieval and interpretation continues to be a challenging problem. An efficient document model is necessary to solve this problem. Document modeling involves techniques of thresholding, skew detection, geometric layout analysis and logical layout analysis. The derived model can then be used in document storage and retrieval. We use the traditional bottom-up approach based on connected component extraction to efficiently implement page segmentation and region identification. A new document model that preserves top-down generation information is proposed; based on it, a document is logically represented for interactive editing, storage, retrieval, transfer and logical analysis.

17 citations

Patent
26 Mar 2012
TL;DR: An energy model of the layout of the user-content components in the user document is generated based on the original positions and sizes of the user-content components in the template document.
Abstract: Methods and systems for optimizing a layout of a document constructed based on a template document, where the template document comprises a plurality of individually specified components including one or more individually specified user-content components configured to receive user content from a user of the template document. An energy model of the layout of the user-content components in the user document is generated based on original positions and sizes of the user-content components in the template document. Positions of corresponding components in the user document are automatically adjusted to minimize the energy of the user-content component layout in the user document.
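
One way to picture energy-based layout adjustment is a one-dimensional spring model, sketched below. The patent abstract does not state the form of the energy, so this is purely an assumed example: components are stacked vertically, the energy is the sum of squared deviations of each gap from its template value, and the gaps must fill whatever page height the resized user content leaves free. Under those assumptions the minimum has a closed form, in which every gap changes by the same amount. The function `relayout` and its parameters are hypothetical.

```python
def relayout(template_tops, template_heights, user_heights, page_height):
    """Return new top positions for vertically stacked components.

    Assumed energy: sum((gap - template_gap) ** 2) over all gaps, including
    the margins above the first and below the last component, subject to
    the gaps and user heights filling the page exactly. Gaps may become
    negative if the user content does not fit; this sketch does not guard
    against that.
    """
    n = len(template_tops)
    bottoms = [t + h for t, h in zip(template_tops, template_heights)]
    template_gaps = (
        [template_tops[0]]
        + [template_tops[i + 1] - bottoms[i] for i in range(n - 1)]
        + [page_height - bottoms[-1]]
    )
    free_space = page_height - sum(user_heights)
    # The constrained quadratic minimum shifts every gap equally.
    shift = (free_space - sum(template_gaps)) / len(template_gaps)
    gaps = [g + shift for g in template_gaps]

    tops, y = [], 0.0
    for gap, height in zip(gaps, user_heights):
        y += gap
        tops.append(y)
        y += height
    return tops


# Template: two boxes at y=10 (height 20) and y=50 (height 30) on a 100-unit page.
# The user's content makes the second box 45 units tall.
tops = relayout([10, 50], [20, 30], [20, 45], page_height=100)
print(tops)  # [5.0, 40.0]
```

In the example the 15 units of extra content are absorbed equally by the three gaps, so each shrinks by 5 units and the relative spacing of the template is preserved as closely as the assumed energy allows.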

17 citations


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
82% related
Feature (computer vision)
128.2K papers, 1.7M citations
82% related
Object detection
46.1K papers, 1.3M citations
81% related
Image segmentation
79.6K papers, 1.8M citations
80% related
Convolutional neural network
74.7K papers, 2M citations
79% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    5
2022    19
2021    34
2020    19
2019    14
2018    9