Proceedings ArticleDOI

Syntactic and Semantic Labeling of Hierarchically Organized Document Image Components of Indian Scripts

04 Feb 2009 - pp. 314-317

TL;DR: A document image analysis system which performs segmentation, content characterization, and semantic labeling of components; promising results are obtained for semantic segmentation of over 30 categories of documents in Indian scripts.

Abstract: In this paper we describe our document image analysis system, which performs segmentation, content characterization, and semantic labeling of components. Segmentation is done using white spaces and yields the segmented components arranged in a hierarchy. Semantic labeling is done using domain knowledge, specified where possible in the form of a document model applicable to a class of documents. The novelty of the system lies in the suite of methods it employs, which are capable of handling documents in Indian scripts. We have obtained promising results for semantic segmentation of over 30 categories of documents in Indian scripts.
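The white-space segmentation step described above can be pictured with a small sketch. This is not the authors' code: the recursion strategy, the `min_gap` threshold, and the node representation are all hypothetical, but it shows how splitting a binary page at its widest white gap naturally produces a hierarchy of components.

```python
import numpy as np

def widest_gap(profile, min_gap):
    """Widest run of zeros (white space) in a 1-D projection profile,
    returned as (start, end), or None if no run is at least min_gap long."""
    best, start = None, None
    for i, v in enumerate(list(profile) + [1]):     # sentinel closes a trailing run
        if v == 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_gap:
                if best is None or i - start > best[1] - best[0]:
                    best = (start, i)
            start = None
    return best

def split_tree(img, y0=0, y1=None, x0=0, x1=None, min_gap=5):
    """Recursively split a binary page image (1 = ink) at the widest white
    gap, trying horizontal cuts before vertical ones; returns a hierarchy of
    {"box": (y0, x0, y1, x1), "children": [...]} nodes."""
    y1 = img.shape[0] if y1 is None else y1
    x1 = img.shape[1] if x1 is None else x1
    region = img[y0:y1, x0:x1]
    if region.sum() > 0:
        for axis in (0, 1):                         # 0: horizontal cut, 1: vertical cut
            gap = widest_gap(region.sum(axis=1 - axis), min_gap)
            if gap is not None:
                cut = (gap[0] + gap[1]) // 2        # cut through the middle of the gap
                if axis == 0:
                    kids = [split_tree(img, y0, y0 + cut, x0, x1, min_gap),
                            split_tree(img, y0 + cut, y1, x0, x1, min_gap)]
                else:
                    kids = [split_tree(img, y0, y1, x0, x0 + cut, min_gap),
                            split_tree(img, y0, y1, x0 + cut, x1, min_gap)]
                return {"box": (y0, x0, y1, x1), "children": kids}
    return {"box": (y0, x0, y1, x1), "children": []}
```

Unlike XY-cut, the paper's decomposed blocks need not be rectangular; this sketch keeps rectangles purely for brevity.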



Citations
Proceedings ArticleDOI
24 Aug 2013
TL;DR: A novel framework is proposed for learning optimal parameters for text-graphic separation in the presence of the complex layouts of Indian newspapers.
Abstract: Digitization of newspaper articles is important for registering historical events. Layout analysis of Indian newspapers is a challenging task due to the presence of different font sizes, font styles, and the random placement of text and non-text regions. In this paper we propose a novel framework for learning optimal parameters for text-graphic separation in the presence of complex layouts. The learning problem is formulated as an optimization problem, using the EM algorithm to learn optimal parameters depending on the nature of the document content.

3 citations


Cites methods from "Syntactic and Semantic Labeling of ..."

  • ...The method used for evaluating the performance of our algorithm is based on counting the number of matches between the pixels segmented by the algorithm and the pixels in the ground truth [11]....

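The pixel-counting evaluation mentioned in the snippet above can be sketched as follows. This is an illustrative reading, not the paper's exact protocol: it simply scores a predicted segmentation mask against a ground-truth mask by counting matching foreground pixels.

```python
import numpy as np

def pixel_match_scores(pred, gt):
    """Pixel-level precision, recall, and F1 between a predicted
    segmentation mask and a ground-truth mask (same-shape binary arrays),
    counting matching foreground pixels."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    tp = np.logical_and(pred, gt).sum()          # correctly segmented pixels
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return float(precision), float(recall), float(f1)
```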

Proceedings ArticleDOI
01 May 2014
TL;DR: A model is proposed which labels the different components of a printed document image, i.e., identifies headings, subheadings, captions, articles, and photos, and it gives promising results on printed documents of different scripts.
Abstract: A document image contains text and non-text regions; it may be printed, handwritten, or a hybrid of both. In this paper we deal with printed documents, where the textual regions consist of printed characters and the non-text regions are mainly photo images. We propose a model which labels the different components of a printed document image, i.e., identifies headings, subheadings, captions, articles, and photos. Our method consists of a preprocessing stage in which fuzzy c-means clustering is used to segment the document image into printed (object) regions and background. The Hough transform is then used to find white-line dividers of the object region, and grid-structure examination is used to extract the non-text portion. After that, we use a horizontal histogram to find text lines and then label the different components. Our method gives promising results on printed documents of different scripts.
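The horizontal-histogram step for finding text lines can be sketched as below. This is an illustration of the general technique, not the authors' implementation; the `thresh` parameter is hypothetical.

```python
import numpy as np

def text_lines(binary, thresh=0):
    """Locate text lines as maximal runs of rows whose ink count exceeds
    `thresh` in the horizontal projection histogram; returns (top, bottom)
    row ranges, bottom exclusive."""
    profile = (binary > 0).sum(axis=1)           # ink pixels per row
    lines, start = [], None
    for i, v in enumerate(list(profile) + [0]):  # sentinel closes a trailing run
        if v > thresh and start is None:
            start = i
        elif v <= thresh and start is not None:
            lines.append((start, i))
            start = None
    return lines
```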

References
Journal ArticleDOI
TL;DR: The requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing, are outlined; several critical functions have been investigated and the technical approaches are discussed.
Abstract: This paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing. Several critical functions have been investigated and the technical approaches are discussed. The first is the segmentation and classification of digitized printed documents into regions of text and images. A nonlinear, run-length smoothing algorithm has been used for this purpose. By using the regular features of text lines, a linear adaptive classification scheme discriminates text regions from others. The second technique studied is an adaptive approach to the recognition of the hundreds of font styles and sizes that can occur on printed documents. A preclassifier is constructed during the input process and used to speed up a well-known pattern-matching method for clustering characters from an arbitrary print source into a small sample of prototypes. Experimental results are included.
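The run-length smoothing idea described above can be given as a minimal sketch. It is illustrative only: the threshold values are hypothetical, and the original (nonlinear) algorithm combines tuned horizontal and vertical smearing with a logical AND to form candidate blocks.

```python
import numpy as np

def rlsa_1d(row, threshold):
    """Fill interior background runs (0s) shorter than `threshold` with
    foreground (1s); leading and trailing runs are left untouched."""
    out = row.copy()
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:
                j += 1
            if 0 < i and j < n and j - i < threshold:   # interior short gap
                out[i:j] = 1
            i = j
        else:
            i += 1
    return out

def rlsa(img, h_thresh=30, v_thresh=50):
    """Run-length smoothing: smear each row and each column, then AND the
    two results so only mutually agreed regions survive (1 = ink)."""
    horiz = np.array([rlsa_1d(r, h_thresh) for r in img])
    vert = np.array([rlsa_1d(c, v_thresh) for c in img.T]).T
    return horiz & vert
```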

701 citations


"Syntactic and Semantic Labeling of ..." refers to methods in this paper

  • ...Well known methods are XY-cut [10], the smearing algorithm [15], white space analysis [2], Docstrum [11], the Voronoi-diagram based approach [5], and other variants like [6], [8], [13]....


Book
01 Jan 1995
Lawrence O'Gorman
TL;DR: The document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components, yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks.
Abstract: Page layout analysis is a document processing technique used to determine the format of a page. This paper describes the document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components. The method yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks. It is advantageous over many other methods in three main ways: independence from skew angle, independence from different text spacings, and the ability to process local regions of different text orientations within the same image. Results of the method shown for several different page formats and for randomly oriented subpages on the same image illustrate the versatility of the method. We also discuss the differences, advantages, and disadvantages of the docstrum with respect to other layout methods.
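The docstrum's nearest-neighbour statistics can be sketched roughly as below. This is a simplified illustration, not O'Gorman's implementation: the original estimates skew from the peak of an angle histogram and reads the two spacing estimates off peaks in the distance histogram, whereas this sketch just takes a median.

```python
import numpy as np

def docstrum_stats(centroids, k=5):
    """Docstrum-style statistics: for each connected-component centroid,
    take its k nearest neighbours and record the neighbour angle and
    distance.  The dominant angle estimates page skew; peaks in the
    distance distribution estimate within-line and between-line spacing."""
    pts = np.asarray(centroids, dtype=float)
    angles, dists = [], []
    for i, p in enumerate(pts):
        d = np.linalg.norm(pts - p, axis=1)
        d[i] = np.inf                          # exclude the point itself
        for j in np.argsort(d)[:min(k, len(pts) - 1)]:
            dx, dy = pts[j] - p
            ang = np.degrees(np.arctan2(dy, dx))
            ang = (ang + 90.0) % 180.0 - 90.0  # fold direction into [-90, 90)
            angles.append(ang)
            dists.append(d[j])
    skew = float(np.median(angles))            # crude stand-in for the
                                               # histogram-peak estimate
    return skew, np.array(dists)
```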

628 citations


"Syntactic and Semantic Labeling of ..." refers to methods in this paper

  • ...Well known methods are XY-cut [10], the smearing algorithm [15], white space analysis [2], Docstrum [11], the Voronoi-diagram based approach [5], and other variants like [6], [8], [13]....



Journal ArticleDOI
TL;DR: The document image acquisition process and the knowledge base that must be entered into the system to process a family of page images are described, and the process by which the X-Y tree data structure converts a 2-D page-segmentation problem into a series of 1-D string-parsing problems that can be tackled using conventional compiler tools.
Abstract: Gobbledoc, a system providing remote access to stored documents, which is based on syntactic document analysis and optical character recognition (OCR), is discussed. In Gobbledoc, image processing, document analysis, and OCR operations take place in batch mode when the documents are acquired. The document image acquisition process and the knowledge base that must be entered into the system to process a family of page images are described. The process by which the X-Y tree data structure converts a 2-D page-segmentation problem into a series of 1-D string-parsing problems that can be tackled using conventional compiler tools is also described. Syntactic analysis is used in Gobbledoc to divide each page into labeled rectangular blocks. Blocks labeled text are converted by OCR to obtain a secondary (ASCII) document representation. Since such symbolic files are better suited for computerized search than for human access to the document content and because too many visual layout clues are lost in the OCR process (including some special characters), Gobbledoc preserves the original block images for human browsing. Storage, networking, and display issues specific to document images are also discussed.
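The X-Y-tree idea of reducing 2-D layout labeling to 1-D string parsing can be illustrated with a toy grammar. The symbols and the grammar below are entirely hypothetical, not Gobbledoc's actual rules: each block in a top-to-bottom cut sequence is encoded as one character, and a regular expression plays the role of the compiler-style parser.

```python
import re

# Each block type maps to one symbol so a vertical block sequence becomes a string.
BLOCK_SYMBOLS = {"title": "T", "author": "A", "image": "I", "text": "X"}

# Toy page grammar: a title, an optional author line, then one or more text
# blocks, each optionally preceded by an image.
PAGE_GRAMMAR = re.compile(r"^TA?(X|IX)+$")

def is_valid_page(block_labels):
    """Check a top-to-bottom block-label sequence against the page grammar."""
    s = "".join(BLOCK_SYMBOLS[b] for b in block_labels)
    return bool(PAGE_GRAMMAR.match(s))
```

In the real system each level of the X-Y tree gets its own 1-D grammar, so nested cuts are parsed recursively rather than by a single flat pattern.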

456 citations


"Syntactic and Semantic Labeling of ..." refers to methods in this paper

  • ...Our method differs from XY-cut since the decomposed blocks need not be rectangles but can be general polygons....


  • ...Recursive XY-cut produces a hierarchical organization of document components....


  • ...Well known methods are XY-cut [10], the smearing algorithm [15], white space analysis [2], Docstrum [11], the Voronoi-diagram based approach [5], and other variants like [6], [8], [13]....


  • ...The output of the recursive procedure is a hierarchical arrangement of the segmented blocks, as can be seen in Fig 2. It is worthwhile to compare our approach with the XY-cut algorithm, which also makes use of white spaces....


Journal ArticleDOI
TL;DR: It is confirmed that the proposed method of page segmentation based on the approximated area Voronoi diagram is effective for extraction of body text regions, and it is as efficient as other methods based on connected component analysis.
Abstract: This paper presents a method of page segmentation based on the approximated area Voronoi diagram. The characteristics of the proposed method are as follows: (1) The Voronoi diagram enables us to obtain the candidates of boundaries of document components from page images with non-Manhattan layout and a skew. (2) The candidates are utilized to estimate the intercharacter and interline gaps without the use of domain-specific parameters to select the boundaries. From the experimental results for 128 images with non-Manhattan layout and the skew of 0°~45° as well as 98 images with Manhattan layout, we have confirmed that the method is effective for extraction of body text regions, and it is as efficient as other methods based on connected component analysis.
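The grouping effect of pruning area-Voronoi edges can be approximated with a simple sketch. This is illustrative only: a distance-threshold union-find over component centroids stands in for the Voronoi analysis, and `gap_thresh` is a hypothetical hand-set parameter, whereas the paper's point is precisely that the gap thresholds are estimated from the diagram without domain-specific parameters.

```python
import numpy as np

def group_components(centroids, gap_thresh):
    """Merge components whose centroids lie within `gap_thresh` of each
    other (a crude stand-in for keeping only short area-Voronoi edges);
    returns one group label per centroid."""
    pts = np.asarray(centroids, dtype=float)
    n = len(pts)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for i in range(n):                      # O(n^2) pair scan; fine for a sketch
        for j in range(i + 1, n):
            if np.linalg.norm(pts[i] - pts[j]) <= gap_thresh:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]
```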

275 citations


"Syntactic and Semantic Labeling of ..." refers to methods in this paper

  • ...Well known methods are XY-cut [10], the smearing algorithm [15], white space analysis [2], Docstrum [11], the Voronoi-diagram based approach [5], and other variants like [6], [8], [13]....
