Topic

Document layout analysis

About: Document layout analysis is a research topic. Over its lifetime, 1,462 publications have appeared within this topic, receiving 34,021 citations.


Papers
Patent
28 Aug 2012
TL;DR: An apparatus for document identification with a capture device for capturing a document feature of a document and a processor designed to perform document identification locally using that feature if a processing criterion for local identification is satisfied.
Abstract: An apparatus for document identification, having a capture device for capturing a document feature of a document, a processor that is designed to perform document identification locally using the document feature if a processing criterion for the local performance of document identification by means of the apparatus for document identification is satisfied, and a transmitter that is designed to send a data record that is dependent on the document feature via a communication network to a communication network address if the processing criterion for the local performance of document identification by means of the apparatus for document identification is not satisfied.

5 citations
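
The local-versus-remote decision described in the patent abstract above can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the names `identify_document`, `Outcome`, `criterion`, `identify_locally`, and `send_remote`, and the `record:` prefix on the data record, are hypothetical and do not come from the patent, which does not say what the processing criterion is or how the data record is built.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    handled_locally: bool
    payload: str  # local identification result, or the data record that was sent


def identify_document(
    feature: str,
    criterion: Callable[[str], bool],        # hypothetical processing criterion
    identify_locally: Callable[[str], str],  # hypothetical local identifier
    send_remote: Callable[[str], None],      # hypothetical transmitter
) -> Outcome:
    """Identify locally when the criterion holds, otherwise send a data record."""
    if criterion(feature):
        return Outcome(True, identify_locally(feature))
    # Assumption: the data record simply wraps the captured feature.
    record = f"record:{feature}"
    send_remote(record)
    return Outcome(False, record)


# Example: a feature is handled locally only if it is in a small local table.
known = {"MRZ-123": "passport"}
outbox = []
local = identify_document("MRZ-123", known.__contains__, known.__getitem__, outbox.append)
remote = identify_document("UNKNOWN-9", known.__contains__, known.__getitem__, outbox.append)
```

Passing the criterion and the transmitter as callables keeps the sketch independent of any particular capture hardware or network stack.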

Proceedings ArticleDOI
06 Jun 2018
TL;DR: A language-independent document segmentation system is presented that segments a heterogeneous printed document into homogeneous components such as halftones and graphics, text, and tables including their individual cells; Text Extraction from Table Cells (TETC) is also proposed.
Abstract: Research on document layout analysis has recently spread over a wide area and continues to demand greater efficiency. Document segmentation is an important preprocessing step before analyzing layouts. This paper presents a language-independent document segmentation system that segments a heterogeneous printed document into homogeneous components such as halftones and graphics, text, and tables including their individual cells. Homogeneous components are segmented from an input document page in three steps with three separate modules: extraction of halftone images, extraction of tables, and segmentation of text blocks. Together these modules form the whole page segmentation system, which takes an input image of a heterogeneous document page and produces an output in which homogeneous segments are explicitly indicated with colored bounding boxes. The modules use morphological operations to detect the components. To improve the performance of image segmentation, Residual Image Fragments Retrieval (RIFR) is proposed. The paper also proposes Text Extraction from Table Cells (TETC). Combining RIFR and TETC gives an overall accuracy of 93%. Table and cell detection have a higher accuracy of 96%, whereas images and text have around 90% accuracy.

5 citations
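
The abstract above says the modules rely on morphological operations and report segments as bounding boxes, but it does not name the operators or their parameters. As a rough illustration only, the sketch below dilates a tiny binary page so that nearby marks merge, then takes the bounding box of each connected component. The functions `dilate` and `bounding_boxes`, the rectangular structuring element, and the toy `page` are all assumptions for illustration; this is not the paper's RIFR or TETC procedure.

```python
from collections import deque


def dilate(grid, reach_x=1, reach_y=0):
    """Binary dilation with a (2*reach_y+1) x (2*reach_x+1) rectangular element."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if grid[y][x]:
                for ny in range(max(0, y - reach_y), min(h, y + reach_y + 1)):
                    for nx in range(max(0, x - reach_x), min(w, x + reach_x + 1)):
                        out[ny][nx] = 1
    return out


def bounding_boxes(grid):
    """Return (top, left, bottom, right) per 4-connected component, in scan order."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if grid[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(y, x)])
                top, left, bottom, right = y, x, y, x
                while queue:  # breadth-first flood fill of one component
                    cy, cx = queue.popleft()
                    top, left = min(top, cy), min(left, cx)
                    bottom, right = max(bottom, cy), max(right, cx)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return boxes


# Toy page: two "words" of two marks each, on separate lines.
page = [
    [1, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1],
]
raw_boxes = bounding_boxes(page)             # four isolated marks
merged_boxes = bounding_boxes(dilate(page))  # two merged blocks
```

Horizontal dilation closes the one-pixel gaps inside each line while the blank row keeps the two lines apart, which is the usual reason for choosing a wide, flat structuring element when grouping text.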

Proceedings ArticleDOI
23 Mar 2012
TL;DR: An algorithm for the scanned document image segmentation based on Voronoi diagram is proposed that can fully identify each area of the document image, and the segmentation is accurate with very little information lost.
Abstract: With the emergence of complex layouts, page layout is no longer confined to rectangular regions. Traditional layout segmentation algorithms therefore no longer apply, and new methods for dealing with complex layouts have emerged. This article proposes an algorithm for scanned document image segmentation based on the Voronoi diagram. The algorithm needs no additional tilt detection for oblique layouts and no preprocessing for tilt correction; the Voronoi diagram is generated directly on the outer edges of the connected elements, without pixel-level processing. The algorithm also does not need to delete redundant Voronoi edges or merge Voronoi edges, which greatly reduces the over-segmentation seen in other algorithms. The algorithm can fully identify each area of the document image, and the segmentation is accurate with very little information lost. In addition, because the structural elements are determined using statistical features, the algorithm is more adaptable.

5 citations

Patent
19 Dec 2013
TL;DR: A computer manages methods for determining accurate document transformation by rendering the source document into a non-rasterized format, where the non-rasterized format is a rendered source document.
Abstract: A computer manages methods for determining accurate document transformation by rendering the source document into a non-rasterized format, where the non-rasterized format is a rendered source document. The computer renders the target document into a non-rasterized format, where the non-rasterized format is a rendered target document. The computer compares one or more aspects of the rendered source document to the corresponding one or more aspects of the rendered target document. The computer determines, based at least in part on the compared aspects, whether or not the source document was accurately transformed to the target document.

4 citations
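
The comparison step in the patent abstract above can be illustrated with a short sketch. It assumes each rendered document has already been reduced to a dictionary of named aspects; the aspect names (`page_count`, `text_runs`, `fonts`) and the functions `compare_renderings` and `accurately_transformed` are hypothetical, and the rendering into a non-rasterized format is not implemented here.

```python
def compare_renderings(source, target, aspects):
    """Map each aspect name to True when both renderings define it and agree."""
    return {
        name: name in source and name in target and source[name] == target[name]
        for name in aspects
    }


def accurately_transformed(source, target, aspects):
    """Treat the transformation as accurate only if every compared aspect matches."""
    return all(compare_renderings(source, target, aspects).values())


# Hypothetical rendered documents, reduced to a few comparable aspects.
rendered_source = {"page_count": 2, "text_runs": ["Title", "Body"], "fonts": {"Serif"}}
rendered_target = {"page_count": 2, "text_runs": ["Title", "Body"], "fonts": {"Sans"}}

report = compare_renderings(rendered_source, rendered_target, ["page_count", "text_runs", "fonts"])
layout_ok = accurately_transformed(rendered_source, rendered_target, ["page_count", "text_runs"])
```

Requiring every selected aspect to match is one possible decision rule; the abstract only says the determination is based at least in part on the compared aspects, so a weighted or tolerance-based rule would fit it equally well.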

Journal Article
TL;DR: A new approach to segment documents with complex layout and degraded image quality is described which uses a local-to-global strategy which can be adapted to a variety of documents.
Abstract: Document layout analysis is concerned with decomposing the raster representation of a document into several regions that contain homogeneous entities. This paper describes a new approach to segmenting documents with complex layouts and degraded image quality. The approach uses a local-to-global strategy that can be adapted to a variety of documents. The system was tested on different English and Japanese documents, and the experiments showed promising results.

4 citations


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 82% related
Feature (computer vision): 128.2K papers, 1.7M citations, 82% related
Object detection: 46.1K papers, 1.3M citations, 81% related
Image segmentation: 79.6K papers, 1.8M citations, 80% related
Convolutional neural network: 74.7K papers, 2M citations, 79% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    5
2022    19
2021    34
2020    19
2019    14
2018    9