Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have been published within this topic, receiving 34021 citations.

Papers

Proceedings Article

Building a topic hierarchy using the bag-of-related-words representation

Rafael Geraldeli Rossi¹, Solange Oliveira Rezende¹

University of São Paulo¹

19 Sep 2011

TL;DR: An approach called bag-of-related-words is proposed to generate features composed of sets of related words, with a dimensionality smaller than that of the bag-of-words, for building topic hierarchies.

Abstract: A simple and intuitive way to organize a huge document collection is by a topic hierarchy. Generally, two steps are carried out to build a topic hierarchy automatically: 1) hierarchical document clustering and 2) cluster labeling. For both steps, a good textual document representation is essential. The bag-of-words is the common way to represent text collections. In this representation, each document is represented by a vector in which each word in the document collection corresponds to a dimension (feature). This approach has well-known problems, such as high dimensionality and data sparsity. Moreover, most concepts are composed of more than one word, such as "document engineering" or "text mining". In this paper, an approach called bag-of-related-words is proposed to generate features composed of sets of related words, with a dimensionality smaller than that of the bag-of-words. The features are extracted from each textual document of a collection using association rules. Different ways of mapping a document into transactions, so that association rules can be extracted, are analyzed, as are interest measures for pruning the number of features. To evaluate how much the proposed approach can aid topic hierarchy building, we carried out an objective evaluation of the clustering structure and a subjective evaluation of the topic hierarchies. All results were compared with those of the bag-of-words. The results showed that the proposed representation is better than the bag-of-words for topic hierarchy building.
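
The feature-extraction idea in this abstract can be illustrated with a minimal sketch. It assumes that each sentence of a document is one transaction and that a feature is any set of one or two words whose support (the fraction of sentences containing it) reaches a threshold. The function name `related_word_features`, the sentence-splitting rule and the `min_support` value are hypothetical choices for illustration; the paper itself relies on association rules and interest measures, which this frequent-word-set simplification does not reproduce.

```python
import re
from collections import Counter
from itertools import combinations


def related_word_features(document, min_support=0.5, max_size=2):
    """Return frequent word sets found in one document.

    Assumption: each sentence is one transaction, and a word set is kept
    when the fraction of sentences containing it is >= min_support.
    """
    # Map the document into transactions: one set of lowercase words per sentence.
    sentences = [s for s in re.split(r"[.!?]+", document) if s.strip()]
    transactions = [set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    if not transactions:
        return {}

    # Count every word set of size 1..max_size that occurs in a transaction.
    counts = Counter()
    for words in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(words), size):
                counts[itemset] += 1

    # Keep only the sets whose support reaches the threshold.
    n = len(transactions)
    return {
        "_".join(itemset): count / n
        for itemset, count in counts.items()
        if count / n >= min_support
    }


doc = ("Text mining finds patterns. Text mining uses clustering. "
       "Clustering groups documents.")
features = related_word_features(doc, min_support=0.6)
print(sorted(features))
```

In this toy document the pair "mining" and "text" co-occurs in two of three sentences, so it survives as a single compound feature alongside the frequent single words, while rare words are dropped. That is the sense in which such a representation can be smaller than a plain bag-of-words.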

17 citations

Patent

Method and device for recognizing document image and computer readable recording medium

Tsukasa Kouchi, 司幸地

25 Apr 2000

TL;DR: A document image recognition method that accurately identifies the areas of color document images and black-and-white/gray images, enabling accurate OCR even for color documents, which have problems peculiar to color.

Abstract: PROBLEM TO BE SOLVED: To provide a document image recognition method that accurately and efficiently identifies the areas of color document images and black-and-white/gray images, and that enables accurate OCR even for color documents, which have problems peculiar to color. SOLUTION: In this method for recognizing a document image, the document image is input as a digital image, its background color is determined, and the image is reduced as needed. Using the background color, pixels outside the background area are extracted from the document image and merged into link (connected) components. Each link component is classified into prescribed area types using at least shape features, and the area identification result for the document image is output. In addition, area identification is performed on a binary image, the result is collated with the color area identification result, and feedback processing is performed as needed, so that a binary image and an area identification result suitable for OCR can be provided. COPYRIGHT: (C)2001,JPO

17 citations

Book Chapter

Model-Guided Segmentation and Layout Labelling of Document Images Using a Hierarchical Conditional Random Field

Santanu Chaudhury¹, Megha Jindal¹, Sumantra Dutta Roy¹

Indian Institute of Technology Delhi¹

15 Dec 2009

TL;DR: A model-guided segmentation and document layout extraction scheme based on hierarchical Conditional Random Fields, motivated by an automated layout analyser and machine translator for technical papers, that can also be used for other applications such as search, indexing and information retrieval.

Abstract: We present a model-guided segmentation and document layout extraction scheme based on hierarchical Conditional Random Fields (CRFs, hereafter). Common methods for classifying the pixels of a document image into classes (text, background and image) are often noisy and error-prone, and often require post-processing with heuristic methods. The input to the system is a pixel-wise classification produced by a Fisher classifier applied to the outputs of a set of Globally Matched Wavelet (GMW) filters. The system extracts features which encode contextual information and spatial configurations of a given document image, and learns relations between these layout entities using hierarchical CRFs. The hierarchical CRF enables learning at three levels: (1) local features for text, background and image areas; (2) contextual features for further classifying region blocks (title, author block, heading, paragraph, etc.); and (3) a probabilistic layout model for encoding global relations between the above blocks for a particular class of documents. Although the work was motivated by an automated layout analyser and machine translator for technical papers, it can also be used for other applications such as search, indexing and information retrieval.

17 citations

Proceedings Article

Page segmentation using document model

Anil K. Jain¹, Bin Yu

Michigan State University¹

18 Aug 1997

TL;DR: A new document model that preserves top-down generation information is proposed; based on this model, a document is logically represented for interactive editing, storage, retrieval, transfer and logical analysis.

Abstract: Transforming a paper document into an electronic version in a form suitable for efficient storage, retrieval and interpretation continues to be a challenging problem. An efficient document model is necessary to solve this problem. Document modeling involves techniques of thresholding, skew detection, geometric layout analysis and logical layout analysis. The derived model can then be used in document storage and retrieval. We use the traditional bottom-up approach based on connected component extraction to efficiently implement page segmentation and region identification. A new document model that preserves top-down generation information is proposed; based on this model, a document is logically represented for interactive editing, storage, retrieval, transfer and logical analysis.
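
The bottom-up step mentioned in this abstract, connected component extraction, can be sketched as follows. This is a generic flood-fill illustration rather than the authors' implementation: it assumes the page has already been thresholded into a grid of 0/1 values (1 meaning ink), uses 8-connectivity, and returns one bounding box per component. The function name `connected_component_boxes` and the toy `page` are hypothetical.

```python
from collections import deque


def connected_component_boxes(image):
    """Return bounding boxes (top, left, bottom, right) of ink components.

    Assumption: `image` is a list of rows of 0/1 values where 1 is ink,
    i.e. the page has already been thresholded. 8-connectivity is used.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


page = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
boxes = connected_component_boxes(page)
print(boxes)
```

In a bottom-up pipeline, boxes like these would then be grouped into words, lines and regions; that grouping is outside the scope of this sketch.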

17 citations

Patent

Self-adjusting document layouts using system optimization modeling

Vyacheslav Nykyforov¹

Winterthur Museum, Garden and Library¹

26 Mar 2012

TL;DR: An energy model of the layout of the user-content components in a user document is generated based on the original positions and sizes of those components in the template document, and component positions are adjusted to minimize that energy.

Abstract: Methods and systems for optimizing the layout of a document constructed from a template document, where the template document comprises a plurality of individually specified components, including one or more individually specified user-content components configured to receive user content from a user of the template document. An energy model of the layout of the user-content components in the user document is generated based on the original positions and sizes of the user-content components in the template document. Positions of corresponding components in the user document are automatically adjusted to minimize the energy of the user-content component layout in the user document.
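
The abstract does not specify the energy function, so the following is only a sketch of what energy-minimizing layout adjustment can look like in one dimension. It assumes a vertical stack of components and a hypothetical quadratic energy: each component is pulled toward its template position, and each gap between neighbours is pulled toward the gap it had in the template. The first component is pinned, and the energy is minimized by plain gradient descent. The function name `adjust_layout` and the parameters `gap_weight`, `step` and `iterations` are illustrative assumptions, not taken from the patent.

```python
def adjust_layout(orig_tops, orig_heights, new_heights,
                  gap_weight=10.0, step=0.01, iterations=2000):
    """Minimise a quadratic layout energy by gradient descent (1-D, vertical).

    Assumed energy (illustrative only):
      sum_i (p_i - y_i)^2                      stay near the template position
      + gap_weight * sum_i (g_i - g0_i)^2      keep the template gaps
    where g_i is the gap between components i and i+1, computed with the
    new heights, and g0_i is the same gap in the template. The first
    component is pinned to its template position.
    """
    n = len(orig_tops)
    target_gaps = [orig_tops[i + 1] - (orig_tops[i] + orig_heights[i])
                   for i in range(n - 1)]
    tops = [float(t) for t in orig_tops]
    for _ in range(iterations):
        grad = [2.0 * (tops[i] - orig_tops[i]) for i in range(n)]
        for i in range(n - 1):
            gap = tops[i + 1] - (tops[i] + new_heights[i])
            pull = 2.0 * gap_weight * (gap - target_gaps[i])
            grad[i + 1] += pull   # d(gap)/d(tops[i+1]) = +1
            grad[i] -= pull       # d(gap)/d(tops[i])   = -1
        for i in range(1, n):     # index 0 stays pinned
            tops[i] -= step * grad[i]
    return tops


# Template: a 100-unit block at y=0 and a 50-unit block at y=120 (gap of 20).
# User content makes the first block 140 units tall.
tops = adjust_layout([0, 120], [100, 50], [140, 50])
print([round(t, 2) for t in tops])
```

With these weights the second block moves down to just above y=160, the position that would fully restore the 20-unit gap, because the weaker pull toward its template position at y=120 stops it slightly short. The default step size is chosen for the default `gap_weight`; a larger weight would need a smaller step for the descent to remain stable.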

17 citations

Metrics

Papers: 1,488
Citations: 35,779

Number of papers on this topic in previous years
Year	Papers
2023	5
2022	19
2021	34
2020	19
2019	14
2018	9
