Proceedings Article

Syntactic and Semantic Labeling of Hierarchically Organized Document Image Components of Indian Scripts

Abstract
In this paper, we describe our document image analysis system, which performs segmentation, content characterization, and semantic labeling of components. Segmentation relies on white space and yields components arranged in a hierarchy. Semantic labeling uses domain knowledge, specified where possible as a document model applicable to a class of documents. The novelty of the system lies in its suite of methods, which are capable of handling documents in Indian scripts. We have obtained promising results for semantic segmentation of over 30 categories of documents in Indian scripts.

