Proceedings ArticleDOI

Syntactic and Semantic Labeling of Hierarchically Organized Document Image Components of Indian Scripts

TL;DR: A document image analysis system that performs segmentation, content characterization, and semantic labeling of components, with promising results for semantic segmentation of over 30 categories of documents in Indian scripts.
Abstract: In this paper we describe our document image analysis system, which performs segmentation, content characterization, and semantic labeling of components. Segmentation is done using white spaces and yields the segmented components arranged in a hierarchy. Semantic labeling is done using domain knowledge, specified where possible in the form of a document model applicable to a class of documents. The novelty of the system lies in the suite of methods it employs, which are capable of handling documents in Indian scripts. We have obtained promising results for semantic segmentation of over 30 categories of documents in Indian scripts.
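The abstract does not give algorithmic details of the white-space segmentation. As a rough illustration only, the sketch below implements a generic recursive XY-cut: it is not the authors' method, but it shows how cutting a binarized page at wide whitespace gaps naturally yields components arranged in a hierarchy. All function names, thresholds, and parameters are assumptions.

```python
import numpy as np

def runs_of_zeros(profile, min_len):
    """(start, end) pairs of whitespace runs in a projection profile."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(profile) - start >= min_len:
        runs.append((start, len(profile)))
    return runs

def xy_cut(page, top=0, left=0, depth=0, max_depth=6, min_gap=10, min_size=20):
    """Recursive whitespace (XY-cut) segmentation of a binary image
    (1 = ink, 0 = background).  Returns a hierarchy of nodes of the form
    {'box': (top, left, bottom, right), 'children': [...]}."""
    h, w = page.shape
    node = {'box': (top, left, top + h, left + w), 'children': []}
    if depth >= max_depth or h < min_size or w < min_size:
        return node
    horizontal = depth % 2 == 0                      # alternate cut direction
    profile = page.sum(axis=1) if horizontal else page.sum(axis=0)
    gaps = runs_of_zeros(profile, min_gap)
    if not gaps:
        return node
    start = 0
    for g0, g1 in gaps + [(len(profile), len(profile))]:
        if g0 > start:                               # a non-empty content strip
            if horizontal:
                child = xy_cut(page[start:g0, :], top + start, left,
                               depth + 1, max_depth, min_gap, min_size)
            else:
                child = xy_cut(page[:, start:g0], top, left + start,
                               depth + 1, max_depth, min_gap, min_size)
            node['children'].append(child)
        start = g1
    return node
```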
Citations
Proceedings ArticleDOI
24 Aug 2013
TL;DR: A novel framework is proposed for learning optimal parameters for text-graphics separation in the presence of the complex layouts of Indian newspapers.
Abstract: Digitization of newspaper articles is important for registering historical events. Layout analysis of Indian newspapers is a challenging task due to the presence of different font sizes, font styles, and random placement of text and non-text regions. In this paper we propose a novel framework for learning optimal parameters for text-graphics separation in the presence of complex layouts. The learning problem is formulated as an optimization problem and solved with the EM algorithm, yielding optimal parameters that depend on the nature of the document content.
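The abstract does not spell out the EM formulation, so the sketch below only illustrates the generic EM pattern it alludes to: a two-component 1-D Gaussian mixture fitted to a hypothetical per-block feature (edge density is an assumed example) to split blocks into text-like and graphics-like groups. It is not the framework proposed in the paper.

```python
import numpy as np

def em_two_gaussians(x, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture to features x with EM.
    Returns (weights, means, stds, responsibilities)."""
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])                 # crude initialisation
    sigma = np.array([x.std() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        lik = np.stack([pi[k] / (sigma[k] * np.sqrt(2 * np.pi))
                        * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
                        for k in range(2)], axis=1)
        resp = lik / (lik.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate mixture weights, means and spreads
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return pi, mu, sigma, resp

# Hypothetical usage: one feature value per segmented block, then label each
# block by the component with the larger responsibility.
# _, _, _, resp = em_two_gaussians(edge_density)
# labels = resp.argmax(axis=1)   # which cluster is "text" depends on the feature
```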

3 citations


Cites methods from "Syntactic and Semantic Labeling of ..."

  • ...The method used for evaluating the performance of our algorithm is based on counting the number of matches between the pixels segmented by the algorithm and the pixels in the ground truth [11]....

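The excerpt above describes evaluation by counting matches between the pixels segmented by the algorithm and the pixels in the ground truth. A minimal sketch of that kind of pixel-level scoring (the metric names are illustrative, not the cited paper's) might look like:

```python
import numpy as np

def pixel_match_score(pred_mask, gt_mask):
    """Count matching pixels for one region class.
    Returns (precision, recall): the fraction of predicted pixels that are in
    the ground truth, and the fraction of ground-truth pixels that were found."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    matches = np.logical_and(pred, gt).sum()
    precision = matches / max(pred.sum(), 1)
    recall = matches / max(gt.sum(), 1)
    return precision, recall
```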

Proceedings ArticleDOI
01 May 2014
TL;DR: A model is proposed that labels the components of a printed document image, i.e. identifies headings, subheadings, captions, articles, and photos, and gives promising results on printed documents in different scripts.
Abstract: A document image contains text and non-text regions; it may be printed, handwritten, or a hybrid of both. In this paper we deal with printed documents, where the textual regions consist of printed characters and the non-text regions are mainly photographic images. We propose a model that labels the different components of a printed document image, i.e. identifies headings, subheadings, captions, articles, and photos. Our method consists of a preprocessing stage in which fuzzy c-means clustering is used to segment the document image into printed (object) regions and background. The Hough transform is then used to find white-line dividers of the object regions, and a grid-structure examination is used to extract the non-text portions. After that, we use a horizontal histogram to find text lines and then label the different components. Our method gives promising results on printed documents in different scripts.
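Of the steps listed in this abstract, the horizontal-histogram search for text lines is the simplest to sketch. The code below is a generic projection-profile version with assumed thresholds; the fuzzy c-means clustering, the Hough-transform white-line detection, and the grid-structure examination are not reproduced.

```python
def text_lines_from_histogram(region, min_ink=1, min_height=5):
    """region: 2-D NumPy array (1 = ink) for a text block.
    Returns (top, bottom) row intervals of text lines: maximal runs of rows
    whose ink count is at least min_ink and whose height is at least min_height."""
    ink_per_row = region.sum(axis=1)
    lines, start = [], None
    for y, ink in enumerate(ink_per_row):
        if ink >= min_ink and start is None:
            start = y
        elif ink < min_ink and start is not None:
            if y - start >= min_height:
                lines.append((start, y))
            start = None
    if start is not None and len(ink_per_row) - start >= min_height:
        lines.append((start, len(ink_per_row)))
    return lines
```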
References
Journal ArticleDOI
TL;DR: A new document model that preserves top-down generation information is proposed; based on it, a document is logically represented for interactive editing, storage, retrieval, transfer, and logical analysis.
Abstract: Transforming a paper document to its electronic version in a form suitable for efficient storage, retrieval, and interpretation continues to be a challenging problem. An efficient representation scheme for document images is necessary to solve this problem. Document representation involves techniques of thresholding, skew detection, geometric layout analysis, and logical layout analysis. The derived representation can then be used in document storage and retrieval. Page segmentation is an important stage in representing document images obtained by scanning journal pages. The performance of a document understanding system greatly depends on the correctness of page segmentation and labeling of different regions such as text, tables, images, drawings, and rulers. We use the traditional bottom-up approach based on connected component extraction to efficiently implement page segmentation and region identification. A new document model that preserves top-down generation information is proposed; based on it, a document is logically represented for interactive editing, storage, retrieval, transfer, and logical analysis. Our algorithm has a high accuracy and takes approximately 1.4 seconds on an SGI Indy workstation for model creation, including orientation estimation, segmentation, and labeling (text, table, image, drawing, and ruler) for a 2550 × 3300 image of a typical journal page scanned at 300 dpi. This method is applicable to documents from various technical journals and can accommodate moderate amounts of skew and noise.
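As a hedged sketch of the bottom-up starting point described here, the code below extracts connected components from a binarized page and returns their bounding boxes; the subsequent region labeling (text, table, image, drawing, ruler) and the proposed document model are not reproduced.

```python
from scipy import ndimage

def connected_component_boxes(binary, min_area=20):
    """binary: 2-D array with 1 = ink.  Returns (top, left, bottom, right, area)
    for every connected component whose pixel count is at least min_area."""
    labels, n_components = ndimage.label(binary)
    boxes = []
    for i, sl in enumerate(ndimage.find_objects(labels), start=1):
        if sl is None:
            continue
        area = int((labels[sl] == i).sum())          # pixels belonging to component i
        if area >= min_area:
            boxes.append((sl[0].start, sl[1].start, sl[0].stop, sl[1].stop, area))
    return boxes
```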

239 citations


Additional excerpts

  • ...Notable work includes [1], [3], [4], [12], [14]....


Journal ArticleDOI
TL;DR: A clustering-based technique is devised for estimating globally matched wavelet filters from a collection of ground-truth images, and a text extraction scheme is extended to segment document images into text, background, and picture components.
Abstract: In this paper, we have proposed a novel scheme for the extraction of textual areas of an image using globally matched wavelet filters. A clustering-based technique has been devised for estimating globally matched wavelet filters using a collection of ground-truth images. We have extended our text extraction scheme for the segmentation of document images into text, background, and picture components (which include graphics and continuous-tone images). Multiple, two-class Fisher classifiers have been used for this purpose. We also exploit contextual information by using a Markov random field formulation-based pixel labeling scheme for refinement of the segmentation results. Experimental results have established the effectiveness of our approach.
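The two-class Fisher classifier mentioned here can be sketched as a standard Fisher linear discriminant. The globally matched wavelet feature extraction and the MRF refinement are not reproduced; feature vectors for the two classes are assumed to be given, and the class names and thresholding rule are illustrative.

```python
import numpy as np

class FisherClassifier:
    """Minimal two-class Fisher linear discriminant (classes 0 and 1)."""

    def fit(self, X0, X1):
        """X0, X1: (n_samples, n_features) arrays, one per class."""
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        # pooled within-class scatter matrix, lightly regularised
        Sw = np.atleast_2d(np.cov(X0, rowvar=False) * (len(X0) - 1)
                           + np.cov(X1, rowvar=False) * (len(X1) - 1))
        self.w = np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), m1 - m0)
        # decision threshold at the midpoint of the projected class means
        self.threshold = 0.5 * ((X0 @ self.w).mean() + (X1 @ self.w).mean())
        return self

    def predict(self, X):
        """Return 1 (e.g. 'text') where the projection exceeds the threshold."""
        return (X @ self.w > self.threshold).astype(int)
```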

159 citations


"Syntactic and Semantic Labeling of ..." refers methods in this paper

  • ...Segmentation (explained in section 2) yields component blocks which are then characterized for syntactic content − picture, text/graphics, using a globally matched wavelet based feature extraction and a Fisher classifier adopted from [7]....


  • ...) on the computation of an appropriate threshold for binarization, we have first separated the text portions from the rest of the document page using a wavelet-based text, non-text separation tool [7]....


Journal ArticleDOI
TL;DR: The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats.
Abstract: The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.
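As a sketch of the final conversion step described above, the code below emits an XML view of a page once its blocks have been segmented, classified, and read by an OCR. The element and attribute names are illustrative assumptions, not WISDOM++'s actual schema.

```python
import xml.etree.ElementTree as ET

def layout_to_xml(blocks):
    """blocks: list of dicts with 'label', 'box' = (top, left, bottom, right),
    and 'text'.  Returns an XML string describing the page layout."""
    page = ET.Element('page')
    for b in blocks:
        el = ET.SubElement(page, b['label'],
                           bbox=','.join(str(v) for v in b['box']))
        el.text = b.get('text', '')
    return ET.tostring(page, encoding='unicode')

# Example (hypothetical block list):
# layout_to_xml([{'label': 'title', 'box': (10, 10, 60, 500), 'text': 'Heading'}])
```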

129 citations


Additional excerpts

  • ...Notable work includes [1], [3], [4], [12], [14]....


Book
01 Jan 1998
TL;DR: The key idea is that analyses of the text contours at appropriate levels of granularity offer a rich source of information about document structure, which can provide the basis for flexible document manipulation tools in heterogeneous collections.
Abstract: The availability of large, heterogeneous repositories of electronic documents is increasing rapidly, and the need for flexible, sophisticated document manipulation tools is growing correspondingly. These tools can benefit greatly by exploiting logical structure, a hierarchy of visually observable organizational components of a document, such as paragraphs, lists, sections, etc. Knowledge of this structure can enable a multiplicity of applications, including hierarchical browsing, structural hyperlinking, logical component-based retrieval, and style translation. Most work on the problem of deriving logical structure from document layout either relies on knowledge of the particular document style or finds a single flat set of text blocks. This thesis describes an implemented approach to discovering a full logical hierarchy in generic text documents, based primarily on layout information. Since the styles of the documents are not known a priori, the precise layout effects of the logical structure are unknown. Nonetheless, typographical capabilities and conventions provide cues that can be used to deduce a logical structure for a generic document. In particular, the key idea is that analyses of the text contours at appropriate levels of granularity offer a rich source of information about document structure. The problem of logical structure discovery is divided into problems of segmentation, which separates the text into logical pieces, and classification, which labels the pieces with structure types. The segmentation algorithm relies entirely on layout-based cues, and the classification algorithm uses word-based information only when this is demonstrably unavoidable. Thus, this approach is particularly appropriate for scanned-in documents, since it is more robust with respect to OCR errors than a content-oriented approach would be. It is applicable, however, to the problem of analyzing any electronic document whose original formatting style rules remain unknown; thus, it can provide the basis for flexible document manipulation tools in heterogeneous collections.
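The text-contour cue can be illustrated with a small sketch: compute the left contour (indentation) of successive text lines and propose logical boundaries where it jumps, as at paragraph indents. This shows the cue only, not the thesis's full segmentation and classification algorithms, and the jump threshold is an assumption.

```python
import numpy as np

def left_contour(line_boxes):
    """line_boxes: list of (top, left, bottom, right) per text line, in reading
    order.  Returns the sequence of left edges (the left text contour)."""
    return np.array([left for (_, left, _, _) in line_boxes])

def segment_boundaries(contour, jump=15):
    """Indices i such that line i starts a new logical piece because its
    indentation differs from the previous line by more than `jump` pixels."""
    return [i for i in range(1, len(contour))
            if abs(int(contour[i]) - int(contour[i - 1])) > jump]
```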

49 citations


Additional excerpts

  • ...Notable work includes [1], [3], [4], [12], [14]....


Journal ArticleDOI
TL;DR: An adaptive algorithm is presented for preprocessing document images prior to binarization in character recognition problems; it uses a quadratic system model to provide edge enhancement for input images corrupted by noise and other distortions during scanning.
Abstract: This paper presents an adaptive algorithm for preprocessing document images prior to binarization in character recognition problems. Our method is similar in its approach to the blind adaptive equalization of binary communication channels. The adaptive filter utilizes a quadratic system model to provide edge enhancement for input images that have been corrupted by noise and other types of distortions during the scanning process. Experimental results demonstrating significant improvement in the quality of the binarized images over both direct binarization and a previously available preprocessing technique are also included.
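The adaptive quadratic (Volterra-type) filter trained in the spirit of blind equalization is not reproduced here. The sketch below substitutes a fixed unsharp-mask enhancement followed by Otsu binarization purely to show where edge enhancement sits relative to binarization; all parameters are assumptions.

```python
import numpy as np
from scipy import ndimage

def enhance_then_binarize(gray, amount=1.0, sigma=1.0):
    """gray: 2-D float array in [0, 255].  Returns a binary image (1 = ink)."""
    blurred = ndimage.gaussian_filter(gray, sigma)
    enhanced = np.clip(gray + amount * (gray - blurred), 0, 255)  # unsharp mask
    t = otsu_threshold(enhanced)
    return (enhanced < t).astype(np.uint8)            # dark pixels are ink

def otsu_threshold(img, bins=256):
    """Global Otsu threshold maximizing between-class variance."""
    hist, edges = np.histogram(img, bins=bins, range=(0, 255))
    p = hist.astype(float) / max(hist.sum(), 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)                                  # class-0 weight up to t
    m = np.cumsum(p * centers)                         # class-0 first moment
    mt = m[-1]
    w1 = 1 - w0
    valid = (w0 > 0) & (w1 > 0)
    between = np.zeros_like(w0)
    between[valid] = (mt * w0[valid] - m[valid]) ** 2 / (w0[valid] * w1[valid])
    return centers[np.argmax(between)]
```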

37 citations


"Syntactic and Semantic Labeling of ..." refers methods in this paper

  • ...We have used the approach suggested by [9] for pre-processing and binarization which makes use of non-linear adaptive filters....
