Open Access, Proceedings Article

High Performance Layout Analysis of Medieval European Document Images.

TLDR
High-performance page segmentation techniques for medieval European document images are presented, including a novel main-body and side-notes segregation and an improved version of OCRopus-based text line extraction.
Abstract
Layout analysis, which mainly comprises binarization and page segmentation, is one of the most important performance-determining steps of an OCR system for complex medieval document images, which contain noise, distortions, and irregular layouts. In this paper, we present high-performance page segmentation techniques for medieval European document images, which include a novel main-body and side-notes segregation and an improved version of OCRopus-based text line extraction. To complete the high-performance layout analysis pipeline, we also present the application of percentile-based binarization (Afzal et al., 2014) and multiresolution-morphology-based text and non-text segmentation (Bukhari et al., 2011) to historical document images. The presented layout analysis techniques are applied to a collection of 15th-century Latin document images, achieving more than 90% accuracy for each of the segmentation techniques.


Citations
Proceedings Article

anyOCR: An Open-Source OCR System for Historical Archives

TL;DR: The current state of the anyOCR system, its architecture, and its major features are described, with emphasis on the techniques required for digitizing a historical archive with high accuracy.
Journal Article

Multi-scale Gated Fully Convolutional DenseNets for semantic labeling of historical newspaper images

TL;DR: This work proposes a fully convolutional network (FCN) architecture that outputs a pixel labeling of the various semantic entities that occur in historical newspaper images and demonstrates that the proposed architecture outperforms standard FCN architectures.
Journal Article

Segmentation-Less Extraction of Text and Non-Text Regions From JPEG 2000 Compressed Document Images Through Partial and Intelligent Decompression

01 Jan 2023
TL;DR: The authors propose operating directly on JPEG 2000 compressed documents to extract text and non-text regions without using any segmentation algorithm. This avoids full decompression of the compressed document, in contrast to conventional methods, which fully decompress and then process.
References
Journal Article

Document analysis system

TL;DR: The requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing, are outlined; several critical functions are investigated, and the technical approaches are discussed.
Journal Article

The document spectrum for page layout analysis

TL;DR: The document spectrum (or docstrum) is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components; it yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks.
Journal Article

Twenty years of document image analysis in PAMI

TL;DR: The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and evaluated.
Journal Article

A prototype document image analysis system for technical journals

TL;DR: The document image acquisition process and the knowledge base that must be entered into the system to process a family of page images are described, along with the process by which the X-Y tree data structure converts a 2-D page-segmentation problem into a series of 1-D string-parsing problems that can be tackled using conventional compiler tools.