Proceedings ArticleDOI

Syntactic and Semantic Labeling of Hierarchically Organized Document Image Components of Indian Scripts

TL;DR: A document image analysis system is described that performs segmentation, content characterization, and semantic labeling of components, with promising results for semantic segmentation of over 30 categories of documents in Indian scripts.
Abstract: In this paper we describe our document image analysis system, which performs segmentation, content characterization, and semantic labeling of components. Segmentation is done using white spaces and yields the segmented components arranged in a hierarchy. Semantic labeling is done using domain knowledge, specified where possible in the form of a document model applicable to a class of documents. The novelty of the system lies in the suite of methods it employs, which are capable of handling documents in Indian scripts. We have obtained promising results for semantic segmentation of over 30 categories of documents in Indian scripts.
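
As a rough illustration of the white-space segmentation step described above (a sketch, not the authors' implementation), the following Python alternates horizontal and vertical cuts at sufficiently wide blank gaps and returns the components as a hierarchy. The min_gap and depth parameters are invented for illustration, and child regions are expressed relative to their parent.

    import numpy as np

    def gaps(binary, axis, min_gap):
        # Maximal runs of blank rows (axis=0) or columns (axis=1)
        # that are at least min_gap wide, as (start, end) pairs.
        blank = binary.sum(axis=1 - axis) == 0
        runs, start = [], None
        for i, b in enumerate(blank):
            if b and start is None:
                start = i
            elif not b and start is not None:
                if i - start >= min_gap:
                    runs.append((start, i))
                start = None
        if start is not None and len(blank) - start >= min_gap:
            runs.append((start, len(blank)))
        return runs

    def segment(binary, min_gap=10, axis=0, depth=4):
        # Recursive white-space segmentation: split this region at its
        # blank gaps, then recurse on each part along the other axis.
        node = {"shape": binary.shape, "children": []}
        g = gaps(binary, axis, min_gap)
        if depth == 0 or not g:
            return node
        edges = [0] + [e for run in g for e in run] + [binary.shape[axis]]
        for a, b in zip(edges[::2], edges[1::2]):
            if b <= a:
                continue
            sub = binary[a:b, :] if axis == 0 else binary[:, a:b]
            node["children"].append(segment(sub, min_gap, 1 - axis, depth - 1))
        return node
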
Citations
Proceedings ArticleDOI
24 Aug 2013
TL;DR: A novel framework is proposed for learning optimal parameters for text-graphics separation in the presence of the complex layouts of Indian newspapers.
Abstract: Digitization of newspaper articles is important for registering historical events. Layout analysis of Indian newspapers is a challenging task due to the variety of font sizes and styles and the random placement of text and non-text regions. In this paper we propose a novel framework for learning optimal parameters for text-graphics separation in the presence of complex layouts. The learning problem is formulated as an optimization problem, using the EM algorithm to learn optimal parameters depending on the nature of the document content.
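
The abstract does not give the model details, but the EM flavor of such parameter learning can be sketched with a hand-rolled two-component 1-D Gaussian mixture. One might, for instance, fit a connected-component feature such as height, with the two components standing in for text and graphics; the feature choice and all names here are assumptions, not the paper's formulation.

    import numpy as np

    def em_two_gaussians(x, iters=50):
        # EM for a two-component 1-D Gaussian mixture over feature x
        # (x: 1-D array, e.g. connected-component heights).
        x = np.asarray(x, dtype=float)
        mu = np.percentile(x, [25.0, 75.0])
        var = np.array([x.var(), x.var()]) + 1e-6
        pi = np.array([0.5, 0.5])
        for _ in range(iters):
            # E-step: responsibility of each component for each point
            p = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
                / np.sqrt(2 * np.pi * var)
            r = p / p.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means, and variances
            n = r.sum(axis=0)
            pi = n / len(x)
            mu = (r * x[:, None]).sum(axis=0) / n
            var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
        return pi, mu, var
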

3 citations


Cites methods from "Syntactic and Semantic Labeling of ..."

  • ...The method used for evaluating the performance of our algorithm is based on counting the number of matches between the pixels segmented by the algorithm and the pixels in the ground truth [11]....
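
Expressed in terms of binary masks, the pixel-matching evaluation quoted above amounts to counting overlapping foreground pixels. A minimal version (our paraphrase, not the cited implementation) is:

    import numpy as np

    def pixel_match(pred, truth):
        # pred, truth: boolean masks of the same shape.
        tp = np.logical_and(pred, truth).sum()
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(truth.sum(), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-12)
        return precision, recall, f1
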


Proceedings ArticleDOI
01 May 2014
TL;DR: A model is proposed that labels the different components of a printed document image, i.e. identifies headings, subheadings, captions, articles, and photos, and it gives promising results on printed documents in different scripts.
Abstract: A document image contains text and non-text regions; it may be printed, handwritten, or a hybrid of both. In this paper we deal with printed documents, where the textual regions consist of printed characters and the non-text regions are mainly photographs. We propose a model that labels the different components of a printed document image, i.e. identifies headings, subheadings, captions, articles, and photos. Our method begins with a preprocessing stage in which fuzzy c-means clustering segments the document image into printed (object) regions and background. A Hough transform is then used to find the white-line dividers of the object region, and a grid-structure examination extracts the non-text portion. After that, we use a horizontal histogram to find text lines and then label the different components. Our method gives promising results on printed documents in different scripts.
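
The fuzzy c-means preprocessing can be sketched compactly for the two-cluster, gray-level case. This is generic FCM on intensities, not the authors' exact setup; the fuzzifier m and the iteration count are assumptions.

    import numpy as np

    def fcm_ink_mask(gray, m=2.0, iters=30):
        # Two-cluster fuzzy c-means on gray levels; returns a boolean
        # mask of pixels whose membership in the darker (ink) cluster
        # exceeds 0.5.
        x = gray.astype(float).ravel()
        c = np.array([x.min(), x.max()])              # initial centres
        for _ in range(iters):
            d = np.abs(x[:, None] - c) + 1e-9         # point-centre distances
            u = d ** (-2.0 / (m - 1.0))               # memberships (unnormalised)
            u /= u.sum(axis=1, keepdims=True)
            c = (u ** m * x[:, None]).sum(axis=0) / (u ** m).sum(axis=0)
        ink = int(np.argmin(c))                       # darker centre = ink
        return (u[:, ink] > 0.5).reshape(gray.shape)
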
References
Journal ArticleDOI
TL;DR: An integrated approach is presented that provides solutions to problems related to newspaper page image enhancement, segmentation of pages into various items, article identification and reconstruction, and, finally, recognition of the textual components.
Abstract: Digital preservation of newspaper archives aims both at the salvage of endangered material (paper) and at the creation of digital library services that allow full utilization of the archives by all interested parties. In this paper, we address a series of issues pertaining to the retro-conversion of newspapers, i.e., the conversion of newspaper pages into digital resources. An integrated approach is presented that provides solutions to problems related to newspaper page image enhancement, segmentation of pages into various items (titles, text, images, etc.), article identification and reconstruction, and, finally, recognition of the textual components. Emphasis is placed on the most difficult intermediate stages of page segmentation and article identification and reconstruction. Detailed experimental results, obtained from a large testbed of old newspaper issues, are presented which clearly demonstrate the applicability of our methodology to the successful retro-conversion of newspaper material.

30 citations


Additional excerpts

  • ...Notable work includes [1], [3], [4], [12], [14]....


Proceedings ArticleDOI
25 Aug 1996
TL;DR: The present method differs from the traditional split-and-merge segmentation method in that it orthogonally splits regions using thresholds adaptively computed from projection profiles.
Abstract: This paper describes a generic document segmentation and geometric relation labeling method with applications to document analysis. Unlike previous document segmentation methods, which rely on text spacing, border lines, and/or template processing based on a priori layout models, the present method begins with a hierarchy of partitioned image layers in which inhomogeneous higher-level regions are recursively partitioned into lower-level rectangular subregions while, at the same time, smaller homogeneous lower-level regions are merged into larger homogeneous regions. The present method differs from the traditional split-and-merge segmentation method in that it orthogonally splits regions using thresholds adaptively computed from projection profiles.
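
The adaptive-threshold split can be illustrated as follows. The fraction-of-mean threshold here is our own stand-in for whatever rule the paper actually computes from the projection profiles, so treat this as a sketch of the idea only.

    import numpy as np

    def adaptive_splits(binary, axis=0, frac=0.1):
        # Candidate split positions along one axis: midpoints of runs
        # where the projection profile dips below an adaptively chosen
        # threshold (here, a fraction of the mean profile value).
        profile = binary.sum(axis=1 - axis).astype(float)
        thresh = frac * profile.mean()
        low = profile <= thresh
        cuts, start = [], None
        for i, flag in enumerate(low):
            if flag and start is None:
                start = i
            elif not flag and start is not None:
                cuts.append((start + i) // 2)
                start = None
        if start is not None:
            cuts.append((start + len(low)) // 2)
        return cuts
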

20 citations


"Syntactic and Semantic Labeling of ..." refers methods in this paper

  • ...However polygons with h/v sides have been obtained in the work of Liu et al [8] by the application of adaptive split and merge for segmentation....


  • ...Well known methods are XY-cut [10], the smearing algorithm [15], white space analysis [2], Docstrum [11], the Voronoi-diagram based approach [5], and other variants like [6], [8], [13]....


Journal ArticleDOI
TL;DR: This work proposes a comprehensive knowledge-centered approach that models not only comparatively static knowledge concerning document properties and analysis results in a declarative formalism, but also includes the analysis task and the current context of the system environment within the same formalism.
Abstract: Knowledge-based systems for document analysis and understanding (DAU) are quite useful whenever analysis has to deal with changing free-form document types that require different analysis components. In this case, declarative modeling is a good way to achieve flexibility. An important application domain for such systems is the business-letter domain, where high accuracy and correct assignment to the right people and the right processes are crucial success factors. Our solution is a comprehensive knowledge-centered approach: we model not only comparatively static knowledge concerning document properties and analysis results within the same declarative formalism, but we also include the analysis task and the current context of the system environment within that formalism. This allows an easy definition of new analysis tasks as well as efficient and accurate analysis that uses expectations about incoming documents as context information. The approach described has been implemented within the VOPR (Virtual Office PRototype) system. This DAU system gains the required context information from a commercial workflow management system (WfMS) through a constant exchange of expectations and analysis tasks. Further interaction between the two systems covers the delivery of results from DAU to the WfMS and the delivery of corrected results in the opposite direction.
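
To make the idea of a single declarative formalism concrete, here is a toy model in which document properties, the analysis task, and WfMS-supplied context expectations sit in one structure. All field names and values are invented for illustration; this is not VOPR's actual notation.

    # Toy declarative model: document properties, the analysis task,
    # and workflow-context expectations expressed in one structure.
    business_letter_model = {
        "type": "business_letter",
        "components": {
            "sender":    {"region": "top-left",  "content": "address"},
            "recipient": {"region": "top-right", "content": "address"},
            "subject":   {"region": "upper",     "content": "line"},
            "body":      {"region": "centre",    "content": "paragraphs"},
        },
        "task": "assign_to_process",
        "context": {                      # expectations from the WfMS
            "expected_senders": ["ACME GmbH"],
            "open_processes": ["order-4711"],
        },
    }
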

14 citations


Additional excerpts

  • ...Notable work includes [1], [3], [4], [12], [14]....


Proceedings ArticleDOI
19 Aug 2001
TL;DR: Since block extraction scans only the border pixels of paragraphs, and characters must be extracted during OCR in any case, this algorithm is faster and incurs less overhead than algorithms that must access every pixel of a document.
Abstract: This paper describes a fast and efficient method for page segmentation of a document containing a non-rectangular block. The presented method is based on a mixed top-down and bottom-up approach to document analysis. Segmentation is based on column blocks (paragraphs) extracted by a modified edge-following algorithm. Instead of a single pixel, a window of 32 by 32 pixels is used in the algorithm, so that a paragraph can be extracted instead of a character. The document is scanned at 300 dpi, and more than one column can be extracted into a block. Then, characters in the block are extracted using the edge-following algorithm, and their boundaries are used to detect multi-column cases (bottom-up). Since block extraction scans only the border pixels of paragraphs, and characters must be extracted during the OCR process in any case, this algorithm is faster and incurs less overhead than algorithms that need to access all pixels of a document.
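
A simplified rendering of the 32 by 32 window idea follows: the page is collapsed onto a coarse grid of windows, and blocks are found on that grid. A plain flood fill stands in for the paper's edge-following step, which we do not reproduce.

    import numpy as np

    def coarse_blocks(binary, win=32):
        # Collapse the page onto a grid of win x win windows: a cell is
        # "ink" if any pixel inside it is. Blocks are the connected
        # components of that grid, labelled with a flood fill.
        h, w = binary.shape
        gh, gw = h // win, w // win
        grid = binary[:gh * win, :gw * win] \
            .reshape(gh, win, gw, win).any(axis=(1, 3))
        labels = np.zeros((gh, gw), dtype=int)
        n_labels = 0
        for sy in range(gh):
            for sx in range(gw):
                if not grid[sy, sx] or labels[sy, sx]:
                    continue
                n_labels += 1
                stack = [(sy, sx)]
                labels[sy, sx] = n_labels
                while stack:
                    y, x = stack.pop()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < gh and 0 <= nx < gw
                                and grid[ny, nx] and not labels[ny, nx]):
                            labels[ny, nx] = n_labels
                            stack.append((ny, nx))
        return labels, n_labels
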

13 citations


"Syntactic and Semantic Labeling of ..." refers methods in this paper

  • ...Well known methods are XY-cut [10], the smearing algorithm [15], white space analysis [2], Docstrum [11], the Voronoi-diagram based approach [5], and other variants like [6], [8], [13]....


Proceedings ArticleDOI
10 Sep 2001
TL;DR: A single-parameter text-line extraction algorithm is described, along with an efficient technique for estimating the optimal value of the parameter for individual images without the need for ground truth.
Abstract: A single-parameter text-line extraction algorithm is described along with an efficient technique for estimating the optimal value for the parameter for individual images without need for ground truth. The algorithm is based on three simple tree operations, cut, glue and flip. An XY-tree representing the segmentation is incrementally transformed to reflect a change in the parameter while intrinsic measures of the cost of the transformation are used to detect when specific tree operations would cause an error if they were performed, allowing these errors to be avoided. The algorithm correctly identified 98.8% of the area of the ground truth bounding boxes and committed no column bridging errors on a set of 97 test images selected from a variety of technical journals.
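
The cut/glue/flip tree machinery is beyond a short sketch, but the single-parameter idea itself is easy to show: with a horizontal projection profile, one spacing parameter t decides which ink rows are glued into the same text line, and sweeping t while scoring each resulting segmentation with an intrinsic cost would mimic the paper's ground-truth-free parameter estimation. The function below is our simplification, not the paper's algorithm.

    import numpy as np

    def lines_for_gap(row_profile, t):
        # Glue ink rows into one text line whenever the blank gap
        # between them is shorter than t; cut otherwise. Returns
        # (first_row, last_row) pairs, inclusive.
        ink = np.flatnonzero(np.asarray(row_profile) > 0)
        if ink.size == 0:
            return []
        lines, start, prev = [], ink[0], ink[0]
        for r in ink[1:]:
            if r - prev > t:                 # gap too wide: cut here
                lines.append((start, prev))
                start = r
            prev = r
        lines.append((start, prev))
        return lines
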

8 citations


"Syntactic and Semantic Labeling of ..." refers methods in this paper

  • ...Well known methods are XY-cut [10], the smearing algorithm [15], white space analysis [2], Docstrum [11], the Voronoi-diagram based approach [5], and other variants like [6], [8], [13]....
