Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have been published within this topic, receiving 34021 citations.

Papers

Proceedings Article•DOI•

A document classification and extraction system with learning ability

Xuhong Li¹, P.A. Ng•Institutions (1)

New Jersey Institute of Technology¹

20 Sep 1999

TL;DR: Two learning methodologies, learning from experience and an enhanced perceptron learning algorithm, are applied in a domain-independent automatic document image understanding system with learning ability.

Abstract: Document image processing begins at the OCR phase with the difficulty of automatic document analysis and understanding. Most existing systems only do well in their specific application domains. In this paper, we describe a domain-independent automatic document image understanding system with learning ability. A segmentation method based on "logical closeness" is proposed. A novel and natural representation of document layout structure, a directed weight graph (DWG), is described. To classify a given document, a string representation matching algorithm is applied first, instead of comparing all the sample graphs. A frame template and a document type hierarchy (DTH) are used to represent the document's logical structure and the hierarchical relationships among these frame templates, respectively. Two learning methodologies are applied: learning from experience and an enhanced perceptron learning algorithm.

24 citations
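
The abstract above names an enhanced perceptron learning algorithm but does not describe the enhancement. As a point of reference only, the following is a minimal sketch of the classic perceptron update rule, not the authors' variant. The feature vectors, the labels (+1 or -1), and the function names `train_perceptron` and `predict` are assumptions made for illustration.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron rule (NOT the paper's enhanced variant).

    samples: list of equal-length feature vectors (hypothetical layout features).
    labels:  list of +1 / -1 class labels.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:
                # Misclassified (or on the boundary): move the boundary toward x.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:
            break  # every training sample is classified correctly
    return w, b


def predict(w, b, x):
    """Return +1 or -1 for feature vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Toy, linearly separable data (invented for the example).
samples = [[2, 1], [3, 2], [-1, -2], [-2, -1]]
labels = [1, 1, -1, -1]
w, b = train_perceptron(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

The loop stops early once a full pass produces no mistakes, which is guaranteed to happen eventually only when the classes are linearly separable.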

Proceedings Article•DOI•

Tree clustering for layout-based document image retrieval

Simone Marinai, Emanuele Marino, Giovanni Soda

27 Apr 2006

TL;DR: The proposed indexing method combines a new tree clustering algorithm (based on self organizing maps) with principal component analysis that allows us to retrieve the most similar pages from large collections without the need for a direct comparison of the query page with each indexed document.

Abstract: We describe a system for the retrieval on the basis of layout similarity of document images belonging to collections stored in digital libraries. Layout regions are extracted and represented with the XY tree. The proposed indexing method combines a new tree clustering algorithm (based on self organizing maps) with principal component analysis. The combination of these techniques allows us to retrieve the most similar pages from large collections without the need for a direct comparison of the query page with each indexed document.

24 citations
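
The abstract above represents layout regions with the XY tree. The following is a minimal sketch of a classic recursive XY-cut that builds such a tree from a binary page image; it is not the authors' implementation and does not cover the tree clustering, self organizing map, or principal component analysis steps. The image format (a list of rows with 1 for ink), the `min_gap` threshold, and all function names are assumptions.

```python
def projection(img, top, bottom, left, right, axis):
    """Ink counts per row (axis="rows") or per column (axis="cols") of a region."""
    if axis == "rows":
        return [sum(img[r][left:right]) for r in range(top, bottom)]
    return [sum(img[r][c] for r in range(top, bottom)) for c in range(left, right)]


def trim(profile):
    """Half-open index range covering the non-zero entries, or None if all zero."""
    nonzero = [i for i, v in enumerate(profile) if v]
    return (nonzero[0], nonzero[-1] + 1) if nonzero else None


def segments(profile, min_gap):
    """Split a trimmed profile at zero runs of at least min_gap entries."""
    segs, start, gap = [], 0, 0
    for i, v in enumerate(profile):
        if v == 0:
            gap += 1
        else:
            if gap >= min_gap:
                segs.append((start, i - gap))
                start = i
            gap = 0
    segs.append((start, len(profile)))
    return segs


def xy_cut(img, top, bottom, left, right, min_gap=2, axis="rows"):
    """Build an XY tree for rows [top, bottom) and columns [left, right)."""
    rows = trim(projection(img, top, bottom, left, right, "rows"))
    if rows is None:
        return None  # region contains no ink
    cols = trim(projection(img, top, bottom, left, right, "cols"))
    # Shrink the region to the bounding box of its ink.
    top, bottom = top + rows[0], top + rows[1]
    left, right = left + cols[0], left + cols[1]
    node = {"box": (top, bottom, left, right), "cut": None, "children": []}
    other = "cols" if axis == "rows" else "rows"
    for try_axis in (axis, other):
        segs = segments(projection(img, top, bottom, left, right, try_axis), min_gap)
        if len(segs) > 1:
            node["cut"] = try_axis
            nxt = "cols" if try_axis == "rows" else "rows"
            for s, e in segs:
                if try_axis == "rows":
                    child = xy_cut(img, top + s, top + e, left, right, min_gap, nxt)
                else:
                    child = xy_cut(img, top, bottom, left + s, left + e, min_gap, nxt)
                node["children"].append(child)
            break
    return node


def leaves(node):
    """Bounding boxes of the leaf regions, in tree order."""
    if not node["children"]:
        return [node["box"]]
    return [box for child in node["children"] for box in leaves(child)]


# Toy page: a full-width header above two text columns (invented for the example).
page = [[0] * 12 for _ in range(10)]
for r in (0, 1):
    page[r] = [1] * 12
for r in range(4, 10):
    page[r] = [1] * 5 + [0] * 2 + [1] * 5

tree = xy_cut(page, 0, 10, 0, 12)
regions = leaves(tree)
```

Each node records the direction of its cut, so two pages can be compared by the shape of their trees rather than pixel by pixel, which is the property a layout-based index relies on.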

Proceedings Article•DOI•

Automatic article extraction in old newspapers digitized collections

David Hebert¹, Thomas Palfray¹, Stéphane Nicolas¹, Pierrick Tranouez¹, Thierry Paquet¹•Institutions (1)

University of Rouen¹

19 May 2014

TL;DR: This top-level structural analysis relies on the generation of an article separation grid applied recursively on the document image, which allows the analysis of any type of Manhattan page layout, even for complex structures with multiple columns and overlapping entities.

Abstract: We present a complete method for article segmentation in old newspapers, which deals with complex layout analysis of degraded documents. The designed workflow can process large amounts of documents and generates digital objects in METS/ALTO format in order to facilitate the indexing and browsing of information in digital libraries. The analysis of the document image is performed by a two-stage scheme. In the first stage, pixels are labeled with a Conditional Random Field model in order to label the areas of interest at a low logical level. In the second stage, this first logical representation of the document content is analyzed to obtain a higher logical representation including article segmentation and reading order. This top-level structural analysis relies on the generation of an article separation grid applied recursively on the document image, which allows the analysis of any type of Manhattan page layout, even for complex structures with multiple columns and overlapping entities. The method, which benefits from both a local analysis using a probabilistic model trained with machine learning procedures and a more global structural analysis using recursive rules, is evaluated on a dataset of daily local press document images covering several time periods and different page layouts, to prove its effectiveness.

24 citations

Patent•

Method and apparatus for logically tagging of document elements in the column by major white region pattern matching

Masaharu Ozaki¹•Institutions (1)

Xerox¹

07 Jun 1995

TL;DR: A system for logically identifying document elements in a document image includes an input port for a signal representing the image, a computer with a document structural model, a document white region extraction system, and a major white region selecting device and a column string selection device that generate column strings of document elements matching the major white regions extracted in a column.

Abstract: A system for logically identifying document elements from a document includes an input port for inputting a signal representing the document image; a computer having a document structural model; a document white region extraction system that extracts major white regions separating and within document elements in the input document image; a major white region selecting device and a column string selection device that generate matching column strings of document elements that match the extracted major white regions in a column; a column expression comparison device that selects the best matching column string; and a logical tagging device that logically tags and then extracts the document elements in the document image using the best matching column string. The method for logically identifying document elements includes providing at least one structural model of a corresponding source document, each structural model including at least one column expression defining relationships between document elements of the source document. The method then identifies the major white regions in the input document image that segment and define the document elements of the document image, assembles a major white region pattern, and generates at least one column string that matches the major white region pattern for each column of the input document. Finally, it determines the column string that most closely matches the column expression and logically identifies each document element of the document image based on the closest matching column string.

24 citations

Proceedings Article•DOI•

Text line script identification for a tri-lingual document

Prakash K. Aithal¹, G Rajesh¹, Dinesh U Acharya¹, N. V. Krishnamoorthi M. Subbareddy¹•Institutions (1)

Manipal Institute of Technology¹

29 Jul 2010

TL;DR: A simple and efficient technique of script identification for Kannada, Hindi and English text lines from a printed document is presented and an overall classification rate of 99.83% is achieved.

Abstract: India is a multilingual, multi-script country. The states of India follow a three-language formula: a document may be printed in English, Hindi, and the state's other official language. For example, in Karnataka, a state in India, a document may contain text lines in English, Hindi, and Kannada script. For Optical Character Recognition (OCR) of such a multilingual document, it is necessary to identify the script before feeding the text lines to the OCRs of the individual scripts. In this paper, a simple and efficient technique of script identification for Kannada, Hindi and English text lines from a printed document is presented. The proposed system uses the horizontal projection profile to distinguish the three scripts. Feature extraction is based on the horizontal projection profile of each text line. The knowledge base of the system is developed from 15 different document images containing about 450 text lines. For a new text line, the necessary features are extracted from the horizontal projection profile and compared with the stored knowledge base to classify the script. The proposed system is tested on 20 different document images containing about 200 text lines of each script, and an overall classification rate of 99.83% is achieved.

24 citations
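
The abstract above classifies a text line by comparing features of its horizontal projection profile with a stored knowledge base, but it does not list the features. The following is a minimal sketch under stated assumptions: the two features (`peak_position`, `peak_to_mean`), the nearest-prototype comparison, the placeholder labels `script_a` and `script_b`, and the prototype values are all hypothetical and are not taken from the paper.

```python
def horizontal_projection(line_img):
    """Count ink pixels in each row of a binary text-line image (1 = ink)."""
    return [sum(row) for row in line_img]


def profile_features(profile):
    """Two illustrative features of a projection profile (assumed, not the paper's)."""
    total = sum(profile)
    if total == 0:
        raise ValueError("text line contains no ink")
    height = len(profile)
    peak = max(profile)
    peak_row = profile.index(peak)
    return {
        # 0.0 = peak at the top row, 1.0 = peak at the bottom row
        "peak_position": peak_row / (height - 1) if height > 1 else 0.0,
        # how strongly the densest row stands out from the average row
        "peak_to_mean": peak * height / total,
    }


def classify(features, knowledge_base):
    """Return the label of the stored prototype closest to the given features."""
    def squared_distance(a, b):
        return sum((a[key] - b[key]) ** 2 for key in a)
    return min(knowledge_base,
               key=lambda label: squared_distance(features, knowledge_base[label]))


# Toy text line with one dense row near the top (invented for the example).
line = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
]

# Hypothetical knowledge base: one prototype feature vector per script label.
knowledge_base = {
    "script_a": {"peak_position": 0.25, "peak_to_mean": 2.5},
    "script_b": {"peak_position": 0.50, "peak_to_mean": 1.3},
}

profile = horizontal_projection(line)
features = profile_features(profile)
label = classify(features, knowledge_base)
```

A dense row near the top of the profile is the kind of cue such a method can exploit: Devanagari text, used for Hindi, joins characters with a continuous headline, which tends to produce a pronounced peak that Latin text lacks.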

Metrics

Papers: 1,488

Citations: 35,779

No. of papers in the topic in previous years
Year	Papers
2023	5
2022	19
2021	34
2020	19
2019	14
2018	9
