Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have appeared within this topic, receiving 34021 citations.


Papers
Proceedings Article
20 Sep 1999
TL;DR: A domain-independent automatic document image understanding system with learning ability is described; it applies two learning methodologies: learning from experience and an enhanced perceptron learning algorithm.
Abstract: Document image processing begins at the OCR phase with the difficulty of automatic document analysis and understanding. Most existing systems only do well in their specific application domains. In this paper, we describe a domain-independent automatic document image understanding system with learning ability. A segmentation method based on "logical closeness" is proposed. A novel and natural representation of document layout structure, a directed weight graph (DWG), is described. To classify a given document, a string representation matching algorithm is applied first, instead of comparing all the sample graphs. A frame template and a document type hierarchy (DTH) are used to represent the document's logical structure and the hierarchical relationships among these frame templates, respectively. Two learning methodologies are applied: learning from experience and an enhanced perceptron learning algorithm.

24 citations

Proceedings Article
27 Apr 2006
TL;DR: The proposed indexing method combines a new tree clustering algorithm (based on self-organizing maps) with principal component analysis, which makes it possible to retrieve the most similar pages from large collections without directly comparing the query page with each indexed document.
Abstract: We describe a system for retrieving document images from digital library collections on the basis of layout similarity. Layout regions are extracted and represented with the XY tree. The proposed indexing method combines a new tree clustering algorithm (based on self-organizing maps) with principal component analysis. The combination of these techniques allows us to retrieve the most similar pages from large collections without the need for a direct comparison of the query page with each indexed document.

24 citations
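
The XY tree mentioned in the entry above is commonly built by recursive XY cuts: a region is split along empty rows, then along empty columns, until no further cut is possible. The following is a minimal sketch of that general idea, not the cited paper's implementation. It assumes a binary image given as a list of rows (1 = ink, 0 = background), and the names `runs`, `xy_cut`, and `min_gap` are illustrative. For brevity it returns only the leaf regions rather than the full tree.

```python
def runs(profile, min_gap=1):
    """Return [start, end) spans of non-zero entries in a projection profile.

    Spans separated by fewer than min_gap zeros are merged into one span.
    """
    spans = []
    for i, value in enumerate(profile):
        if not value:
            continue
        if spans and i - spans[-1][1] < min_gap:
            spans[-1][1] = i + 1          # gap too small: extend the previous span
        else:
            spans.append([i, i + 1])      # start a new span
    return [tuple(span) for span in spans]


def xy_cut(image, top=0, bottom=None, left=0, right=None, min_gap=1):
    """Recursively split a region and return leaf boxes (top, bottom, left, right).

    All box bounds are half-open: rows top..bottom-1, columns left..right-1.
    """
    if bottom is None:
        bottom = len(image)
    if right is None:
        right = len(image[0])

    # Horizontal cuts: look for empty rows inside the region.
    row_profile = [sum(image[r][left:right]) for r in range(top, bottom)]
    row_runs = runs(row_profile, min_gap)
    if not row_runs:
        return []                          # region contains no ink
    if len(row_runs) > 1:
        return [box
                for start, end in row_runs
                for box in xy_cut(image, top + start, top + end, left, right, min_gap)]
    top, bottom = top + row_runs[0][0], top + row_runs[0][1]

    # Vertical cuts: look for empty columns inside the trimmed region.
    col_profile = [sum(image[r][c] for r in range(top, bottom))
                   for c in range(left, right)]
    col_runs = runs(col_profile, min_gap)
    if len(col_runs) > 1:
        return [box
                for start, end in col_runs
                for box in xy_cut(image, top, bottom, left + start, left + end, min_gap)]
    left, right = left + col_runs[0][0], left + col_runs[0][1]
    return [(top, bottom, left, right)]    # no further cut possible: a leaf
```

Raising `min_gap` makes the sketch ignore narrow gaps such as inter-word spacing, so only wider gutters produce cuts.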

Proceedings Article
19 May 2014
TL;DR: This top-level structural analysis relies on the generation of an article separation grid applied recursively on the document image, which allows the analysis of any type of Manhattan page layout, even for complex structures with multiple columns and overlapping entities.
Abstract: We present a complete method for article segmentation in old newspapers, which deals with complex layout analysis of degraded documents. The designed workflow can process large amounts of documents and generates digital objects in METS/ALTO format in order to facilitate the indexing and the browsing of information in digital libraries. The analysis of the document image is performed by a two-stage scheme. In the first stage, pixels are labeled with a Conditional Random Field model in order to label the areas of interest at a low logical level. In the second stage, this first logical representation of the document content is analyzed to obtain a higher-level logical representation that includes article segmentation and reading order. This top-level structural analysis relies on the generation of an article separation grid applied recursively on the document image, which allows the analysis of any type of Manhattan page layout, even for complex structures with multiple columns and overlapping entities. This method, which benefits from both a local analysis using a probabilistic model trained with machine learning procedures and a more global structural analysis using recursive rules, is evaluated on a dataset of daily local press document images covering several time periods and different page layouts, to prove its effectiveness.

24 citations

Patent
Masaharu Ozaki
07 Jun 1995
TL;DR: A system for logically identifying document elements from a document includes an input port for inputting a signal representing the document image, a computer having a document structural model, a document white region extraction system, and a major white region selecting device and a column string selection device that generate matching column strings of document elements that match the extracted major white regions in a column.
Abstract: A system for logically identifying document elements from a document includes an input port for inputting a signal representing the document image, a computer having a document structural model, a document white region extraction system that extracts major white regions separating and within document elements in the input document image, a major white region selecting device and a column string selection device that generate matching column strings of document elements that match the extracted major white regions in a column, a column expression comparison device that selects the best matching column string, and a logical tagging device that logically tags and then extracts the document elements in the document image using the best matching column string. The method for logically identifying document elements includes providing at least one structural model of a corresponding source document, each structural model including at least one column expression defining relationships between document elements of the source document. The method then identifies the major white regions in the input document image that segment and define its document elements, assembles a major white region pattern, and generates at least one column string that matches the major white region pattern for each column of the input document. Finally, it determines the column string that most closely matches the column expression and logically identifies each document element of the document image based on the closest matching column string.

24 citations

Proceedings Article
29 Jul 2010
TL;DR: A simple and efficient technique of script identification for Kannada, Hindi and English text lines from a printed document is presented and an overall classification rate of 99.83% is achieved.
Abstract: India is a multilingual, multi-script country. States of India follow a three-language formula. A document may be printed in English, Hindi and the official language of the state. For example, in Karnataka, a state in India, the document may contain text lines in English, Hindi script. For Optical Character Recognition (OCR) of such a multilingual document, it is necessary to identify the script before feeding the text lines to the OCRs of individual scripts. In this paper, a simple and efficient technique of script identification for Kannada, Hindi and English text lines from a printed document is presented. The proposed system uses the horizontal projection profile to distinguish the three scripts. Feature extraction is based on the horizontal projection profile of each text line. The knowledge base of the system is developed from 15 different document images containing about 450 text lines. For a new text line, the necessary features are extracted from the horizontal projection profile and compared with the stored knowledge base to classify the script. The proposed system is tested on 20 different document images containing about 200 text lines of each script, and an overall classification rate of 99.83% is achieved.

24 citations
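
The horizontal projection profile used in the entry above is simply the count of ink pixels in each row of a text-line image. The following is a minimal sketch of computing it, assuming a binarized line image given as a list of rows (1 = ink, 0 = background). The `peak_position` feature is a hypothetical illustration and not one of the features reported in the paper; it reflects the observation that a script with a headline stroke, such as Devanagari, tends to produce its densest row near the top of the line.

```python
def horizontal_projection_profile(line_image):
    """Return the number of ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in line_image]


def peak_position(profile):
    """Relative height of the densest row: 0.0 is the top row, 1.0 the bottom row.

    Hypothetical feature for illustration only.
    """
    if len(profile) < 2 or max(profile) == 0:
        return 0.0                        # degenerate line: no meaningful peak
    peak_row = profile.index(max(profile))
    return peak_row / (len(profile) - 1)


# A toy 4x4 "text line" whose second row is a solid stroke.
line = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
]
profile = horizontal_projection_profile(line)
```

A classifier along the lines described would compare such features for a new text line against values stored for each script.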


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 82% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Object detection: 46.1K papers, 1.3M citations, 81% related
Image segmentation: 79.6K papers, 1.8M citations, 80% related
Convolutional neural network: 74.7K papers, 2M citations, 79% related
Performance
Metrics
Number of papers in the topic in previous years
Year    Papers
2023    5
2022    19
2021    34
2020    19
2019    14
2018    9