Topic

Document layout analysis

About: Document layout analysis is a research topic. Over the lifetime, 1462 publications have been published within this topic receiving 34021 citations.

Journal Article

A Zone Classification Approach for Arabic Documents using Hybrid Features

Amany M. Hesham, Sherif M. Abdou, Amr Badr, Mohsen A. Rashwan, Hassanin M. Al-Barhamtoshy

01 Jan 2016-International Journal of Advanced Computer Science and Applications

TL;DR: System evaluation shows that the proposed zone classification works well on multi-font and multi-size documents with a variety of layouts, even on historical documents.

Abstract: Zone segmentation and classification is an important step in document layout analysis. It decomposes a given scanned document into zones. Zones need to be classified into text and non-text, so that only text zones are provided to a recognition engine. This eliminates garbage output resulting from sending non-text zones to the engine. This paper proposes a framework for zone segmentation and classification. Zones are segmented using morphological operations and connected component analysis. Features are then extracted from each zone for the purpose of classification into text and non-text. The features are a hybrid of texture-based and connected-component-based features. Effective features are selected using a genetic algorithm. Selected features are fed into a linear SVM classifier for zone classification. System evaluation shows that the proposed zone classification works well on multi-font and multi-size documents with a variety of layouts, even on historical documents.
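The segmentation step named in the abstract (a morphological operation followed by connected component analysis) can be illustrated with a minimal sketch. It is not the authors' implementation: the rectangular structuring element, its size, the 4-connectivity, and the function names `dilate`, `connected_zones`, and `segment_zones` are all assumptions made for illustration. Feature extraction, genetic feature selection, and the SVM classifier are not covered.

```python
from collections import deque


def dilate(img, half_w, half_h):
    """Binary dilation with a (2*half_h+1) x (2*half_w+1) rectangle.

    Smearing ink pixels horizontally merges nearby glyphs into one blob.
    """
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if img[r][c]:
                for rr in range(max(0, r - half_h), min(rows, r + half_h + 1)):
                    for cc in range(max(0, c - half_w), min(cols, c + half_w + 1)):
                        out[rr][cc] = 1
    return out


def connected_zones(img):
    """Return bounding boxes (top, left, bottom, right) of 4-connected blobs."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue = deque([(r, c)])
                top, left, bottom, right = r, c, r, c
                while queue:
                    y, x = queue.popleft()
                    top, bottom = min(top, y), max(bottom, y)
                    left, right = min(left, x), max(right, x)
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return boxes


def segment_zones(img, half_w=1, half_h=0):
    """Dilate, then label: each blob of the dilated image is one candidate zone.

    The boxes describe the dilated blobs, so they can extend slightly
    beyond the original ink pixels.
    """
    return connected_zones(dilate(img, half_w, half_h))


# Toy binary page (1 = ink): two close marks and one distant mark on the
# first row, two close marks on the last row.
page = [
    [1, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0],
]
zones = segment_zones(page)
print(zones)
```

On the toy page, the two close marks on each row merge into one zone while the distant mark stays separate. In practice the structuring element would be sized relative to the expected character and line spacing.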

6 citations

Document analysis at DFKI. - Part 1: Image analysis and text recognition

Majdi Ben Hadj Ali¹, Frank Fein, Frank Hönes, Thorsten Jäger, Achim Weigel

Kaiserslautern University of Technology¹

01 Jan 1995

TL;DR: A series of three research reports presents the work of the document analysis and office automation department at DFKI; the third report describes the concept for a specialized document analysis knowledge representation language.

Abstract: Document analysis is responsible for essential progress in office automation. This paper is part of an overview of the combined research efforts in document analysis at the DFKI. Common to all document analysis projects is the global goal of providing a high-level electronic representation of documents in terms of iconic, structural, textual, and semantic information. These symbolic document descriptions enable "intelligent" access to a document database. Currently there are three ongoing document analysis projects at DFKI: INCA, OMEGA, and PASCAL2000/PASCAL+. Though the projects pursue different goals in different application domains, they all share the same problems, which have to be resolved with similar techniques. For that reason the activities in these projects are bundled to avoid redundant work. At DFKI we have divided the problem of document analysis into two main tasks, text recognition and text analysis, which themselves are divided into a set of subtasks. In a series of three research reports the work of the document analysis and office automation department at DFKI is presented. The first report discusses the problem of text recognition, the second that of text analysis. In a third report we describe our concept for a specialized document analysis knowledge representation language. The report in hand describes the activities dealing with the text recognition task. Text recognition covers the phase starting with capturing a document image up to identifying the written words. This comprises the following subtasks: preprocessing the pictorial information; segmenting into blocks, lines, words, and characters; classifying characters; and identifying the input words. For each subtask several competing solution algorithms, called specialists or knowledge sources, may exist. To efficiently control and organize these specialists, an intelligent situation-based planning component is necessary, which is also described in this report. It should be mentioned that the planning component is responsible for controlling the overall document analysis system, not only the text recognition phase.

6 citations

Patent

Optical character recognition of text in an image according to a prioritized processing sequence

Pierre Hamel, Alain Belanger, Eric Beauchamp

17 Apr 2014

TL;DR: A computer-implemented method for providing a text-based representation of a region of interest of an image to a user includes a step of identifying text zones within the image, each text zone including textual content and having a respective rank assigned based on the arrangement of the text zones within the image.

Abstract: A computer-implemented method for providing a text-based representation of a region of interest of an image to a user is provided that includes a step of identifying text zones within the image, each text zone including textual content and having a respective rank assigned thereto based on an arrangement of the text zones within the image. The method also includes determining a processing sequence for performing optical character recognition (OCR) on the text zones. The processing sequence is based, firstly, on an arrangement of the text zones with respect to the region of interest and, secondly, on the ranks assigned to the text zones. The method further includes performing an OCR process on the text zones according to the processing sequence to progressively obtain a machine-encoded representation of the region of interest, and concurrently presenting the machine-encoded representation to the user, via an output device, as the text-based representation.
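The two-level ordering described in the abstract can be sketched as a sort with a compound key. This is an illustrative reading, not the patented method: the abstract does not say how the "arrangement with respect to the region of interest" is measured, so the sketch assumes simple rectangle overlap. The `Zone` class, the `overlaps` helper, and the `(left, top, right, bottom)` box convention are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    name: str
    rank: int    # assumed: lower rank = earlier in the page arrangement
    box: tuple   # (left, top, right, bottom), hypothetical convention


def overlaps(a, b):
    """True if rectangles a and b share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def processing_sequence(zones, roi):
    """Order zones for OCR: first by relation to the region of interest
    (assumed here: overlapping zones go first), then by assigned rank.
    """
    # False sorts before True, so zones overlapping the ROI come first.
    return sorted(zones, key=lambda z: (not overlaps(z.box, roi), z.rank))


zones = [
    Zone("title", 1, (0, 0, 100, 20)),
    Zone("left column", 2, (0, 30, 45, 100)),
    Zone("right column", 3, (55, 30, 100, 100)),
    Zone("footer", 4, (0, 110, 100, 120)),
]
roi = (50, 25, 100, 105)  # hypothetical region the user is looking at

order = [z.name for z in processing_sequence(zones, roi)]
print(order)
```

With this region of interest, only the right column overlaps, so it is recognized first and the remaining zones follow in rank order. A real system could substitute a distance or containment measure for the overlap test without changing the sort structure.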

6 citations

Patent

Document-processing device, document-processing method, and document-processing program

Koji Fujiwara, 浩次藤原

20 Jul 2007

TL;DR: The patent describes a document-processing device that adequately generates annex information for identifying the position of a document element in a document image when an electronic document including the document image is generated.

Abstract: PROBLEM TO BE SOLVED: To provide: a document-processing device which adequately generates annex information for identifying the position of a document element in a document image when an electronic document including the document image is generated; a document-processing method; and a document-processing program. SOLUTION: A displaying part 24 displays the number of document elements included in a document image for each kind of element after an element kind determination part 20 analyzes the kinds of extracted document elements. Selection conditions relating to the kinds are set to the display of the analysis result by the displaying part 24. The element kind determination part 20 selects the document elements, which satisfy the selection conditions, from among the extracted document elements in response to the selection conditions concerned, and outputs the kinds of the selected document elements and position information for identifying their positions in the document image to a bookmark data-generating part 22. The bookmark data-generating part 22 generates bookmark data, and an electronic document-generating part 16 generates an electronic document by adding the bookmark data to the (compressed) document image from a compression processing part 14. COPYRIGHT: (C)2009, JPO&INPIT

6 citations

Proceedings Article

Quality enhancement in information extraction from scanned documents

Atsuhiro Takasu¹, Kenro Aihara¹

National Institute of Informatics¹

10 Oct 2006

TL;DR: A robust reference extraction method for academic articles scanned with OCR mark-up is proposed and applied to articles appearing in various journals, and experiments showed that the proposed method achieved a recognition accuracy of more than 94%.

Abstract: When constructing a large document archive, an important element is the digitizing of printed documents. Although various techniques for document image analysis such as Optical Character Recognition (OCR) have been developed, error handling is required in constructing real document archive systems. This paper discusses the problem from the quality enhancement perspective and proposes a robust reference extraction method for academic articles scanned with OCR mark-up. We applied the proposed method to articles appearing in various journals, and these experiments showed that the proposed method achieved a recognition accuracy of more than 94%. This paper also discusses manual correction and investigates experimentally the relationship between extraction accuracy and cost reduction.

6 citations

Metrics

Papers: 1,488
Citations: 35,779

No. of papers in the topic in previous years
Year	Papers
2023	5
2022	19
2021	34
2020	19
2019	14
2018	9