Topic

Document layout analysis

About: Document layout analysis is a research topic. To date, 1,462 publications have appeared on this topic, receiving 34,021 citations.


Papers
Journal Article
TL;DR: A method for determining an algorithm's optimal tuning parameters and the correspondences between detected entities and ground truth is described, and a group of document layout analysis algorithms are evaluated.

26 citations

Proceedings Article
18 Sep 2011
TL;DR: Evaluated on Arabic and Urdu document image datasets covering a variety of complex single- and multi-column layouts, the presented system achieves high accuracy for text and non-text segmentation, text-line extraction, and reading order determination.
Abstract: Text-line extraction and reading order determination are important steps in optical character recognition (OCR) systems. Research in OCR of Arabic script documents has primarily focused on character recognition, and therefore most researchers use primitive methods like projection profile analysis for text-line extraction. Although projection methods achieve good accuracy on clean, skew-corrected documents, their performance drops under challenging conditions (border noise, skew, complex layouts). This paper presents a robust layout analysis system for extracting text-lines in reading order from scanned Arabic script document images written in different languages (Arabic, Urdu, Persian) and styles (Naskh, Nastaliq). The presented system is based on a suitable combination of well-established techniques for analyzing Latin script documents that have proven robust against different types of document image degradation. Evaluated on Arabic and Urdu document image datasets covering a variety of complex single- and multi-column layouts, the presented system achieves high accuracy for text and non-text segmentation, text-line extraction, and reading order determination.

26 citations
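
The projection profile analysis that the abstract above calls a primitive baseline can be sketched as follows. This is only a minimal illustration, not the paper's system: it assumes a binarized, skew-corrected page given as a list of rows (1 = ink, 0 = background), and the function name `extract_text_lines` and the `min_ink` threshold are hypothetical.

```python
from typing import List, Tuple


def extract_text_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    `image` is assumed to be a binarized page: 1 = ink, 0 = background.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines = []
    start = None  # first row of the text line currently being scanned
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                 # entering a run of inked rows
        elif ink < min_ink and start is not None:
            lines.append((start, y))  # leaving the run: close the line
            start = None
    if start is not None:             # a line that touches the bottom edge
        lines.append((start, len(profile)))
    return lines


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(extract_text_lines(page))  # [(1, 3), (4, 5)]
```

Because the profile sums whole rows, skew or multi-column layouts merge neighbouring lines into one run, which is consistent with the performance drop the abstract describes.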

Patent
Bryan Keith Schacht
23 Sep 2004
TL;DR: A system and method for processing a document image using color highlighting are described; the method comprises scanning a document to create a document image and searching the document image for a color-highlighted area.
Abstract: A system and method are provided for processing a document image using color highlighting. The method comprises: scanning a document, creating a document image; searching the document image for a color-highlighted area; processing the document image with optical character recognition (OCR), creating a text document; identifying a text phrase associated with the color-highlighted area; searching the text document for the identified text phrase; and, tracking each area in the document image associated with the identified text phrase. Searching the document image for a color-highlighted area includes supplying a coordinate associated with the color-highlighted area. A text phrase in the text document is identified in response to locating the text phrase at the color-highlighted area coordinates. Tracking each area in the document image associated with the identified text phrase includes: tracking the coordinates of each identified text phrase in the text document; and, transposing the coordinates to the document image.

26 citations

Proceedings Article
25 Aug 2013
TL;DR: This paper proposes a novel text line extraction method for historical documents that takes the layout recognition results as an input, extracts the text lines, and groups them into blocks using the connected components approach.
Abstract: This paper proposes a novel text line extraction method for historical documents. The method works in two steps. In the first step, layout analysis is performed to recognize the physical structure of a given document using a classification technique, more precisely the pixels of a coloured document image are classified into five classes: text-block, core-text-line, decoration, background, and periphery. This layout recognition is achieved by a cascade of two Dynamic Multilayer Perceptron (DMLP) classifiers and works without binarisation. In the second step, an algorithm takes the layout recognition results as an input, extracts the text lines, and groups them into blocks using the connected components approach. Finally, the algorithm refines the boundaries of the text lines using the binary image and the layout recognition results. Our system is evaluated on three historical manuscripts with a test set of 49 pages. The best obtained hit rate for text lines is 96.3%.

26 citations
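
The abstract above groups extracted material using a connected components approach but does not give the details. As a generic illustration only, the sketch below labels 4-connected regions in a binary mask and returns their bounding boxes; the function name `connected_components`, the 4-connectivity choice, and the input format are assumptions rather than the paper's algorithm.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), all inclusive


def connected_components(mask: List[List[int]]) -> List[Box]:
    """Return the bounding box of every 4-connected region of 1-pixels."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []

    for y in range(height):
        for x in range(width):
            if mask[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from this unvisited pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


mask = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
]
print(connected_components(mask))  # [(0, 0, 1, 1), (1, 3, 2, 4)]
```

In a pipeline like the one described, the mask would presumably come from the pixel classification step (for example, pixels labeled as text), and the resulting boxes would be the candidates to merge into lines and blocks.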

Patent
15 Sep 1999
TL;DR: A method and a system for embedding information in document data that include text written in a page description language are presented; the embedding locations are chosen based on an analysis of the layout of the document data.
Abstract: A method and a system for embedding information in document data that include text written in a page description language. First, an analysis is made of the layout of the document data in which information is to be embedded. Then, based on the analysis of the layout, a sequence of locations is generated whereat the information is to be embedded. A page description of the text at a determined location is changed in accordance with the embedded information. As a result, the information is embedded in document data that include text written in a page description language. The sequence of locations is generated by producing a string of sequential pseudo-random numbers.

26 citations
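
The last step of the abstract above, generating the sequence of embedding locations from a string of pseudo-random numbers, might look like the sketch below. The patent abstract does not name a generator or say how it is seeded, so the use of Python's `random.Random`, the shared `seed`, and the function name `embedding_locations` are all assumptions made for illustration.

```python
import random
from typing import List


def embedding_locations(num_candidates: int, num_bits: int, seed: int) -> List[int]:
    """Pick `num_bits` distinct candidate positions in a pseudo-random order.

    `num_candidates` stands for the positions found by layout analysis
    (for example, inter-word gaps); `seed` is a hypothetical shared key.
    """
    if num_bits > num_candidates:
        raise ValueError("not enough candidate locations for the payload")
    rng = random.Random(seed)  # seeded generator, so the sequence is reproducible
    return rng.sample(range(num_candidates), num_bits)


# Embedder and extractor derive the same sequence from the same seed.
writer_side = embedding_locations(num_candidates=200, num_bits=16, seed=1234)
reader_side = embedding_locations(num_candidates=200, num_bits=16, seed=1234)
print(writer_side == reader_side)  # True
```

Seeding the generator is the design choice that matters here: an extractor that knows the seed and repeats the layout analysis can regenerate the same locations without storing them in the document.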


Network Information
Related Topics (5)
Feature extraction
111.8K papers, 2.1M citations
82% related
Feature (computer vision)
128.2K papers, 1.7M citations
82% related
Object detection
46.1K papers, 1.3M citations
81% related
Image segmentation
79.6K papers, 1.8M citations
80% related
Convolutional neural network
74.7K papers, 2M citations
79% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    5
2022    19
2021    34
2020    19
2019    14
2018    9