Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1462 publications have been published within this topic, receiving 34021 citations.


Papers
Patent
10 Aug 2011
TL;DR: Data defining a document is received from an online document processing service and a plurality of elements within the document is identified; for each element, an object is invoked to generate layout data, and the element is rendered based on that layout data.
Abstract: Data defining a document is received from an online document processing service, and a plurality of elements within the document is identified. The plurality of elements may comprise paragraphs, lines of text, images, tables, headers, footers, footnotes, footnote reference information, etc. For each of the plurality of elements, a respective object comprising a layout function and a render function is generated. An object corresponding to an element is invoked to generate layout data associated with the element, and the element is rendered based on the layout data.
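The element-object pattern described in the abstract (one object per element, exposing a layout function and a render function) can be sketched roughly as follows. This is a minimal illustration only; the class names, the `LayoutData` fields, and the `render_document` driver are assumptions made for the sketch and are not taken from the patent.

```python
# Minimal sketch of the per-element layout/render object pattern described
# in the abstract. Class and method names are illustrative, not from the patent.
from dataclasses import dataclass


@dataclass
class LayoutData:
    x: float
    y: float
    width: float
    height: float


class ParagraphElement:
    def __init__(self, text: str):
        self.text = text

    def layout(self, page_width: float, cursor_y: float) -> LayoutData:
        # Compute layout data for this element (position and extent).
        line_count = max(1, len(self.text) // 80 + 1)
        return LayoutData(x=0.0, y=cursor_y, width=page_width, height=14.0 * line_count)

    def render(self, data: LayoutData) -> str:
        # Render the element based on its previously computed layout data.
        return f"<p style='top:{data.y}px;width:{data.width}px'>{self.text}</p>"


def render_document(elements, page_width=612.0):
    """Invoke each element object: first layout, then render."""
    cursor_y, output = 0.0, []
    for element in elements:
        data = element.layout(page_width, cursor_y)
        output.append(element.render(data))
        cursor_y += data.height
    return "\n".join(output)


print(render_document([ParagraphElement("Hello, layout."), ParagraphElement("Second paragraph.")]))
```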

5 citations

Proceedings ArticleDOI
01 Aug 2018
TL;DR: A binarization-free dynamic programming approach that generates an equidistant text line extraction polygon is presented and compared with other solutions, ranging from human-reviewed ground-truth polygons to simpler automatically generated rectangle areas.
Abstract: Text Line Segmentation is a basic document layout task that consists of detecting and extracting the text lines present in a document page image. Although considered a basic task, it is generally a necessary step for higher-level Handwritten Text Recognition (HTR) tasks. Most state-of-the-art automatic text recognition, text-to-line image alignment and keyword spotting systems require it because they need isolated text line images as input. Traditionally, most Text Line Segmentation approaches cover both the detection and extraction sub-steps. However, the community has recently shifted its focus to tackling baseline detection in document images independently. This shift creates the need for extraction methods that use these detected baselines as input. In this paper, a binarization-free dynamic programming approach that generates an equidistant text line extraction polygon is presented. The approach performs this calculation based on the information provided by previously detected text baselines and automatically generated foreground-pixel distance maps. We evaluate our approach both on a synthetic competition corpus and on a challenging real handwritten text recognition corpus. We evaluate it not only at the graphical error level but also in terms of the impact it has on an HTR task trained with the line images it yields. We compare our solution with other solutions, ranging from human-reviewed ground-truth polygons to simpler automatically generated rectangle areas.
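The core idea, a dynamic programming search over a foreground-pixel distance map for a path running between two detected baselines, can be sketched as below. This is a simplified illustration under stated assumptions (a precomputed distance map and integer baseline heights), not the paper's exact algorithm; the function name and cost model are hypothetical.

```python
# Simplified sketch of computing a separating path between two consecutive
# baselines with dynamic programming over a foreground-distance map.
# This illustrates the general idea only; it is not the paper's exact method.
import numpy as np


def separating_path(distance_map: np.ndarray, y_top: int, y_bottom: int) -> np.ndarray:
    """Return, for every column, a row index between y_top and y_bottom.

    The path prefers rows that are far from foreground pixels (large distance
    values) and is kept smooth by allowing at most one row of vertical
    movement per column, in the spirit of a minimum-cost seam.
    """
    band = distance_map[y_top:y_bottom, :]          # restrict to the inter-line band
    height, width = band.shape
    cost = -band.astype(np.float64)                 # high distance -> low cost
    acc = np.full((height, width), np.inf)
    acc[:, 0] = cost[:, 0]
    back = np.zeros((height, width), dtype=np.int64)

    for x in range(1, width):
        for y in range(height):
            lo, hi = max(0, y - 1), min(height, y + 2)   # allow |dy| <= 1 per column
            prev = acc[lo:hi, x - 1]
            k = int(np.argmin(prev))
            acc[y, x] = cost[y, x] + prev[k]
            back[y, x] = lo + k

    # Backtrack the cheapest path from the last column.
    path = np.zeros(width, dtype=np.int64)
    path[-1] = int(np.argmin(acc[:, -1]))
    for x in range(width - 1, 0, -1):
        path[x - 1] = back[path[x], x]
    return path + y_top                             # back to full-image coordinates
```

A distance map of this kind could be obtained, for example, with a Euclidean distance transform of an estimated foreground mask; the paper itself generates its distance maps automatically without binarization.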

5 citations

Journal ArticleDOI
TL;DR: This paper presents a technology that addresses the problem of accurately regenerating the full text layout by closely preserving the original textual layout of the scanned PDF, using the open-source document analysis and OCR system OCRopus together with geometric layout and positioning information.
Abstract: Information can include text, pictures and signatures that can be scanned into a document format, such as the Portable Document Format (PDF), and easily emailed to recipients around the world. Upon the document’s arrival, the receiver can open and view it using a vast array of different PDF viewing applications such as Adobe Reader and Apple Preview. Hence, today the use of the PDF has become pervasive. Since a scanned PDF is an image format, it is inaccessible to assistive technologies such as a screen reader, so retrieving the information requires Optical Character Recognition (OCR). The OCR software processes the scanned PDF file and, through text extraction, generates an editable text-formatted document. This text document can then be edited, formatted, searched and indexed, as well as translated or converted to speech. A problem that the OCR software does not solve is the accurate regeneration of the full text layout. This paper presents a technology that addresses this issue by closely preserving the original textual layout of the scanned PDF, using the open-source document analysis and OCR system OCRopus together with geometric layout and positioning information. The main issues considered in this research are the preservation of the correct reading order and the representation of common logical structural elements such as section headings, line breaks, paragraphs, captions, sidebars, footer bars, running headers, embedded images, graphics, tables and mathematical expressions.
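A rough sketch of the reading-order idea: sort OCR line boxes into columns and top-to-bottom order, and insert paragraph breaks where large vertical gaps occur. The box format and the thresholds are assumptions made for illustration; this is not the OCRopus API or the paper's exact procedure.

```python
# Hedged sketch: rebuilding an approximate reading order and paragraph breaks
# from OCR line bounding boxes. The box format (x0, y0, x1, y1, text) is an
# assumption for illustration; it is not the OCRopus output format.
def reconstruct_text(lines, column_gap=200, paragraph_gap=1.8):
    """lines: list of (x0, y0, x1, y1, text) tuples, with y increasing downwards."""
    # Crude column assignment: bucket lines by their left edge, then sort by y.
    lines = sorted(lines, key=lambda l: (l[0] // column_gap, l[1]))
    output, prev = [], None
    for x0, y0, x1, y1, text in lines:
        if prev is not None:
            prev_x0, prev_y1, prev_height = prev
            new_column = (x0 // column_gap) != (prev_x0 // column_gap)
            big_gap = (y0 - prev_y1) > paragraph_gap * prev_height
            if new_column or big_gap:
                output.append("")            # blank line marks a paragraph break
        output.append(text)
        prev = (x0, y1, max(1, y1 - y0))
    return "\n".join(output)
```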

5 citations

Proceedings ArticleDOI
18 Jun 1996
TL;DR: The algorithm is very fast, works on low-resolution document pages and is robust against skew; no assumptions are made about the layout of the document, the shape of the text regions, or the font size and style.
Abstract: This paper describes a fast and flexible method for extracting text regions from a document page containing text, graphics, and pictures. Such regions can be given as input to an OCR system. The user fixes two parameters, the minimum width w of the text to be detected and the precision ε needed (both expressed as a percentage of the image width), according to the implementation needs. The method works by subdividing the page into overlapping columns whose width and inter-shift depend on w and ε, and by performing text line extraction on each column separately. Subsequently, a statistical analysis of the text line elements found in each column is performed, and they are connected to form complete text lines. Finally, related pieces of text are merged into blocks so that a sensible reading order is provided for the OCR system. The algorithm is very fast, is able to work on low-resolution document pages, and is robust against skew. The algorithm is also very flexible: no assumptions are made about the layout of the document, the shape of the text regions, or the font size and style; the main assumption is that the background is uniform and the text is approximately horizontal. Despite the statistical nature of the method, a single line of text of a certain font size is generally sufficient to warrant detection. Experimental results are shown which demonstrate the effectiveness of the method on several different kinds of documents.
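The overlapping-column subdivision step can be illustrated with a short sketch. The abstract does not give the exact relation between w, ε and the column geometry, so the formulas below are assumptions; only the overall structure (columns of fixed width, shifted by a fixed step) follows the description.

```python
# Hedged sketch of the overlapping-column subdivision step. The exact relation
# between (w, epsilon) and the column geometry is not given in the abstract,
# so the formulas below are illustrative assumptions.
def overlapping_columns(image_width, w_percent, eps_percent):
    """Yield (start, end) pixel ranges of overlapping vertical columns.

    w_percent   -- minimum text width to detect, as % of image width
    eps_percent -- required precision, as % of image width
    """
    col_width = max(1, int(image_width * w_percent / 100.0))
    shift = max(1, int(image_width * eps_percent / 100.0))
    start = 0
    while start < image_width:
        yield start, min(image_width, start + col_width)
        start += shift


# Example: a 2000-pixel-wide page, minimum text width 10%, precision 2%.
columns = list(overlapping_columns(2000, 10, 2))
```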

5 citations

Proceedings ArticleDOI
04 Jul 2005
TL;DR: A method for skew detection of document images is presented, based on the directions of text lines in the images and using a line-likeness measure that reflects the strength and direction with which a pixel belongs to a text line.
Abstract: A method for skew detection of document images is presented. The method is based on information about the directions of text lines in the images. To extract these directions, a measure of the line-likeness of text is adopted, which reflects the strength and direction with which a pixel belongs to a text line. This measure is applicable to text of various fonts, sizes, and even languages. The skew angle of the whole image is obtained as a consensus among all pixels in the image, and several methods for forming this consensus are proposed. Experimental results of the method on document images are presented.
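The consensus step can be illustrated with a minimal sketch in which every pixel votes for an angle with a weight, and the skew estimate is taken as the peak of the weighted histogram. Since the abstract does not define the line-likeness measure, gradient magnitude and orientation are used here as a stand-in, and the function name and parameters are hypothetical.

```python
# Hedged sketch of the consensus step: each pixel votes for an angle with a
# weight, and the skew angle is taken as the peak of the weighted histogram.
# Gradient magnitude/orientation is used here as a stand-in for the paper's
# line-likeness measure, which is not specified in the abstract.
import numpy as np


def estimate_skew(gray: np.ndarray, angle_range=15.0, bins=301) -> float:
    """Estimate the skew angle (degrees) of a grayscale page image."""
    gy, gx = np.gradient(gray.astype(np.float64))
    weight = np.hypot(gx, gy)                       # voting strength per pixel
    # Text-line direction is roughly perpendicular to the intensity gradient.
    angle = np.degrees(np.arctan2(gy, gx)) - 90.0
    angle = (angle + 90.0) % 180.0 - 90.0           # wrap into (-90, 90]

    mask = np.abs(angle) <= angle_range             # keep only near-horizontal votes
    hist, edges = np.histogram(angle[mask], bins=bins,
                               range=(-angle_range, angle_range),
                               weights=weight[mask])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return float(centers[np.argmax(hist)])          # consensus = histogram peak
```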

5 citations


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 82% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Object detection: 46.1K papers, 1.3M citations, 81% related
Image segmentation: 79.6K papers, 1.8M citations, 80% related
Convolutional neural network: 74.7K papers, 2M citations, 79% related
Performance
Metrics
No. of papers in the topic in previous years:
Year  Papers
2023  5
2022  19
2021  34
2020  19
2019  14
2018  9