Proceedings ArticleDOI

Syntactic and Semantic Labeling of Hierarchically Organized Document Image Components of Indian Scripts

TL;DR: A document image analysis system that performs segmentation, content characterization, and semantic labeling of components, with promising results for semantic segmentation of over 30 categories of documents in Indian scripts.
Abstract: In this paper we describe our document image analysis system, which performs segmentation, content characterization, and semantic labeling of components. Segmentation is done using white spaces and yields the segmented components arranged in a hierarchy. Semantic labeling is done using domain knowledge, specified where possible in the form of a document model applicable to a class of documents. The novelty of the system lies in the suite of methods it employs, which are capable of handling documents in Indian scripts. We have obtained promising results for semantic segmentation of over 30 categories of documents in Indian scripts.
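The abstract does not give algorithmic details of the white-space segmentation. As a rough illustration only, the sketch below implements a generic recursive XY-cut: it is not the authors' method, but it shows how cutting a binarized page at wide whitespace gaps naturally yields components arranged in a hierarchy. All function names, thresholds, and parameters are assumptions.

```python
import numpy as np

def runs_of_zeros(profile, min_len):
    """(start, end) pairs of whitespace runs in a projection profile."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(profile) - start >= min_len:
        runs.append((start, len(profile)))
    return runs

def xy_cut(page, top=0, left=0, depth=0, max_depth=6, min_gap=10, min_size=20):
    """Recursive whitespace (XY-cut) segmentation of a binary image
    (1 = ink, 0 = background).  Returns a hierarchy of nodes of the form
    {'box': (top, left, bottom, right), 'children': [...]}."""
    h, w = page.shape
    node = {'box': (top, left, top + h, left + w), 'children': []}
    if depth >= max_depth or h < min_size or w < min_size:
        return node
    horizontal = depth % 2 == 0                      # alternate cut direction
    profile = page.sum(axis=1) if horizontal else page.sum(axis=0)
    gaps = runs_of_zeros(profile, min_gap)
    if not gaps:
        return node
    start = 0
    for g0, g1 in gaps + [(len(profile), len(profile))]:
        if g0 > start:                               # a non-empty content strip
            if horizontal:
                child = xy_cut(page[start:g0, :], top + start, left,
                               depth + 1, max_depth, min_gap, min_size)
            else:
                child = xy_cut(page[:, start:g0], top, left + start,
                               depth + 1, max_depth, min_gap, min_size)
            node['children'].append(child)
        start = g1
    return node
```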
Citations
Proceedings ArticleDOI
24 Aug 2013
TL;DR: A novel framework is proposed for learning optimal parameters for text-graphics separation in the presence of the complex layouts of Indian newspapers.
Abstract: Digitization of newspaper articles is important for registering historical events. Layout analysis of Indian newspapers is a challenging task due to the presence of different font sizes, font styles, and random placement of text and non-text regions. In this paper we propose a novel framework for learning optimal parameters for text-graphics separation in the presence of complex layouts. The learning problem is formulated as an optimization problem and solved with the EM algorithm, yielding optimal parameters that depend on the nature of the document content.
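The abstract does not spell out the EM formulation, so the sketch below only illustrates the generic EM pattern it alludes to: a two-component 1-D Gaussian mixture fitted to a hypothetical per-block feature (edge density is an assumed example) to split blocks into text-like and graphics-like groups. It is not the framework proposed in the paper.

```python
import numpy as np

def em_two_gaussians(x, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture to features x with EM.
    Returns (weights, means, stds, responsibilities)."""
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])                 # crude initialisation
    sigma = np.array([x.std() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        lik = np.stack([pi[k] / (sigma[k] * np.sqrt(2 * np.pi))
                        * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
                        for k in range(2)], axis=1)
        resp = lik / (lik.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate mixture weights, means and spreads
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return pi, mu, sigma, resp

# Hypothetical usage: one feature value per segmented block, then label each
# block by the component with the larger responsibility.
# _, _, _, resp = em_two_gaussians(edge_density)
# labels = resp.argmax(axis=1)   # which cluster is "text" depends on the feature
```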

3 citations


Cites methods from "Syntactic and Semantic Labeling of ..."

  • ...The method used for evaluating the performance of our algorithm is based on counting the number of matches between the pixels segmented by the algorithm and the pixels in the ground truth [11]....

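The excerpt above describes evaluation by counting matches between the pixels segmented by the algorithm and the pixels in the ground truth. A minimal sketch of that kind of pixel-level scoring (the metric names are illustrative, not the cited paper's) might look like:

```python
import numpy as np

def pixel_match_score(pred_mask, gt_mask):
    """Count matching pixels for one region class.
    Returns (precision, recall): the fraction of predicted pixels that are in
    the ground truth, and the fraction of ground-truth pixels that were found."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    matches = np.logical_and(pred, gt).sum()
    precision = matches / max(pred.sum(), 1)
    recall = matches / max(gt.sum(), 1)
    return precision, recall
```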

Proceedings ArticleDOI
01 May 2014
TL;DR: A model is proposed that labels the components of a printed document image, i.e. identifies headings, subheadings, captions, articles, and photos, and gives promising results on printed documents in different scripts.
Abstract: A document image contains text and non-text regions; it may be printed, handwritten, or a hybrid of both. In this paper we deal with printed documents, where the textual regions consist of printed characters and the non-text regions are mainly photographic images. We propose a model that labels the different components of a printed document image, i.e. identifies headings, subheadings, captions, articles, and photos. Our method consists of a preprocessing stage in which fuzzy c-means clustering is used to segment the document image into printed (object) regions and background. The Hough transform is then used to find white-line dividers of the object regions, and a grid-structure examination is used to extract the non-text portions. After that, we use a horizontal histogram to find text lines and then label the different components. Our method gives promising results on printed documents in different scripts.
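Of the steps listed in this abstract, the horizontal-histogram search for text lines is the simplest to sketch. The code below is a generic projection-profile version with assumed thresholds; the fuzzy c-means clustering, the Hough-transform white-line detection, and the grid-structure examination are not reproduced.

```python
def text_lines_from_histogram(region, min_ink=1, min_height=5):
    """region: 2-D NumPy array (1 = ink) for a text block.
    Returns (top, bottom) row intervals of text lines: maximal runs of rows
    whose ink count is at least min_ink and whose height is at least min_height."""
    ink_per_row = region.sum(axis=1)
    lines, start = [], None
    for y, ink in enumerate(ink_per_row):
        if ink >= min_ink and start is None:
            start = y
        elif ink < min_ink and start is not None:
            if y - start >= min_height:
                lines.append((start, y))
            start = None
    if start is not None and len(ink_per_row) - start >= min_height:
        lines.append((start, len(ink_per_row)))
    return lines
```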
References
Journal ArticleDOI
TL;DR: A new document model that preserves top-down generation information is proposed; based on it, a document is logically represented for interactive editing, storage, retrieval, transfer, and logical analysis.
Abstract: Transforming a paper document to its electronic version in a form suitable for efficient storage, retrieval, and interpretation continues to be a challenging problem. An efficient representation scheme for document images is necessary to solve this problem. Document representation involves techniques of thresholding, skew detection, geometric layout analysis, and logical layout analysis. The derived representation can then be used in document storage and retrieval. Page segmentation is an important stage in representing document images obtained by scanning journal pages. The performance of a document understanding system greatly depends on the correctness of page segmentation and labeling of different regions such as text, tables, images, drawings, and rulers. We use the traditional bottom-up approach based on connected component extraction to efficiently implement page segmentation and region identification. A new document model that preserves top-down generation information is proposed; based on it, a document is logically represented for interactive editing, storage, retrieval, transfer, and logical analysis. Our algorithm has a high accuracy and takes approximately 1.4 seconds on an SGI Indy workstation for model creation, including orientation estimation, segmentation, and labeling (text, table, image, drawing, and ruler) for a 2550 × 3300 image of a typical journal page scanned at 300 dpi. This method is applicable to documents from various technical journals and can accommodate moderate amounts of skew and noise.
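As a hedged sketch of the bottom-up starting point described here, the code below extracts connected components from a binarized page and returns their bounding boxes; the subsequent region labeling (text, table, image, drawing, ruler) and the proposed document model are not reproduced.

```python
from scipy import ndimage

def connected_component_boxes(binary, min_area=20):
    """binary: 2-D array with 1 = ink.  Returns (top, left, bottom, right, area)
    for every connected component whose pixel count is at least min_area."""
    labels, n_components = ndimage.label(binary)
    boxes = []
    for i, sl in enumerate(ndimage.find_objects(labels), start=1):
        if sl is None:
            continue
        area = int((labels[sl] == i).sum())          # pixels belonging to component i
        if area >= min_area:
            boxes.append((sl[0].start, sl[1].start, sl[0].stop, sl[1].stop, area))
    return boxes
```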

239 citations


Additional excerpts

  • ...Notable work includes [1], [3], [4], [12], [14]....


Journal ArticleDOI
TL;DR: A clustering-based technique is devised for estimating globally matched wavelet filters from a collection of ground-truth images, and a text extraction scheme is extended to segment document images into text, background, and picture components.
Abstract: In this paper, we have proposed a novel scheme for the extraction of textual areas of an image using globally matched wavelet filters. A clustering-based technique has been devised for estimating globally matched wavelet filters using a collection of ground-truth images. We have extended our text extraction scheme for the segmentation of document images into text, background, and picture components (which include graphics and continuous-tone images). Multiple, two-class Fisher classifiers have been used for this purpose. We also exploit contextual information by using a Markov random field formulation-based pixel labeling scheme for refinement of the segmentation results. Experimental results have established the effectiveness of our approach.
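The two-class Fisher classifier mentioned here can be sketched as a standard Fisher linear discriminant. The globally matched wavelet feature extraction and the MRF refinement are not reproduced; feature vectors for the two classes are assumed to be given, and the class names and thresholding rule are illustrative.

```python
import numpy as np

class FisherClassifier:
    """Minimal two-class Fisher linear discriminant (classes 0 and 1)."""

    def fit(self, X0, X1):
        """X0, X1: (n_samples, n_features) arrays, one per class."""
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        # pooled within-class scatter matrix, lightly regularised
        Sw = np.atleast_2d(np.cov(X0, rowvar=False) * (len(X0) - 1)
                           + np.cov(X1, rowvar=False) * (len(X1) - 1))
        self.w = np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), m1 - m0)
        # decision threshold at the midpoint of the projected class means
        self.threshold = 0.5 * ((X0 @ self.w).mean() + (X1 @ self.w).mean())
        return self

    def predict(self, X):
        """Return 1 (e.g. 'text') where the projection exceeds the threshold."""
        return (X @ self.w > self.threshold).astype(int)
```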

159 citations


"Syntactic and Semantic Labeling of ..." refers methods in this paper

  • ...Segmentation (explained in section 2) yields component blocks which are then characterized for syntactic content − picture, text/graphics, using a globally matched wavelet based feature extraction and a Fisher classifier adopted from [7]....


  • ...) on the computation of an appropriate threshold for binarization, we have first separated the text portions from the rest of the document page using a wavelet-based text, non-text separation tool [7]....


Journal ArticleDOI
TL;DR: The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats.
Abstract: The transformation of scanned paper documents to a form suitable for an Internet browser is a complex process that requires solutions to several problems. The application of an OCR to some parts of the document image is only one of the problems. In fact, the generation of documents in HTML format is easier when the layout structure of a page has been extracted by means of a document analysis process. The adoption of an XML format is even better, since it can facilitate the retrieval of documents in the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with an OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are: the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using techniques from machine learning, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmarking of the system components implementing these innovative aspects is reported.
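As a sketch of the final conversion step described above, the code below emits an XML view of a page once its blocks have been segmented, classified, and read by an OCR. The element and attribute names are illustrative assumptions, not WISDOM++'s actual schema.

```python
import xml.etree.ElementTree as ET

def layout_to_xml(blocks):
    """blocks: list of dicts with 'label', 'box' = (top, left, bottom, right),
    and 'text'.  Returns an XML string describing the page layout."""
    page = ET.Element('page')
    for b in blocks:
        el = ET.SubElement(page, b['label'],
                           bbox=','.join(str(v) for v in b['box']))
        el.text = b.get('text', '')
    return ET.tostring(page, encoding='unicode')

# Example (hypothetical block list):
# layout_to_xml([{'label': 'title', 'box': (10, 10, 60, 500), 'text': 'Heading'}])
```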

129 citations


Additional excerpts

  • ...Notable work includes [1], [3], [4], [12], [14]....


Book
01 Jan 1998
TL;DR: The key idea is that analyses of the text contours at appropriate levels of granularity offer a rich source of information about document structure, which can provide the basis for flexible document manipulation tools in heterogeneous collections.
Abstract: The availability of large, heterogeneous repositories of electronic documents is increasing rapidly, and the need for flexible, sophisticated document manipulation tools is growing correspondingly. These tools can benefit greatly by exploiting logical structure, a hierarchy of visually observable organizational components of a document, such as paragraphs, lists, sections, etc. Knowledge of this structure can enable a multiplicity of applications, including hierarchical browsing, structural hyperlinking, logical component-based retrieval, and style translation. Most work on the problem of deriving logical structure from document layout either relies on knowledge of the particular document style or finds a single flat set of text blocks. This thesis describes an implemented approach to discovering a full logical hierarchy in generic text documents, based primarily on layout information. Since the styles of the documents are not known a priori, the precise layout effects of the logical structure are unknown. Nonetheless, typographical capabilities and conventions provide cues that can be used to deduce a logical structure for a generic document. In particular, the key idea is that analyses of the text contours at appropriate levels of granularity offer a rich source of information about document structure. The problem of logical structure discovery is divided into problems of segmentation, which separates the text into logical pieces, and classification, which labels the pieces with structure types. The segmentation algorithm relies entirely on layout-based cues, and the classification algorithm uses word-based information only when this is demonstrably unavoidable. Thus, this approach is particularly appropriate for scanned-in documents, since it is more robust with respect to OCR errors than a content-oriented approach would be. It is applicable, however, to the problem of analyzing any electronic document whose original formatting style rules remain unknown; thus, it can provide the basis for flexible document manipulation tools in heterogeneous collections.
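The text-contour cue can be illustrated with a small sketch: compute the left contour (indentation) of successive text lines and propose logical boundaries where it jumps, as at paragraph indents. This shows the cue only, not the thesis's full segmentation and classification algorithms, and the jump threshold is an assumption.

```python
import numpy as np

def left_contour(line_boxes):
    """line_boxes: list of (top, left, bottom, right) per text line, in reading
    order.  Returns the sequence of left edges (the left text contour)."""
    return np.array([left for (_, left, _, _) in line_boxes])

def segment_boundaries(contour, jump=15):
    """Indices i such that line i starts a new logical piece because its
    indentation differs from the previous line by more than `jump` pixels."""
    return [i for i in range(1, len(contour))
            if abs(int(contour[i]) - int(contour[i - 1])) > jump]
```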

49 citations


Additional excerpts

  • ...Notable work includes [1], [3], [4], [12], [14]....


Journal ArticleDOI
TL;DR: An adaptive algorithm is presented for preprocessing document images prior to binarization in character recognition problems; it uses a quadratic system model to provide edge enhancement for input images corrupted by noise and other distortions during scanning.
Abstract: This paper presents an adaptive algorithm for preprocessing document images prior to binarization in character recognition problems. Our method is similar in its approach to the blind adaptive equalization of binary communication channels. The adaptive filter utilizes a quadratic system model to provide edge enhancement for input images that have been corrupted by noise and other types of distortions during the scanning process. Experimental results demonstrating significant improvement in the quality of the binarized images over both direct binarization and a previously available preprocessing technique are also included.
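The adaptive quadratic (Volterra-type) filter trained in the spirit of blind equalization is not reproduced here. The sketch below substitutes a fixed unsharp-mask enhancement followed by Otsu binarization purely to show where edge enhancement sits relative to binarization; all parameters are assumptions.

```python
import numpy as np
from scipy import ndimage

def enhance_then_binarize(gray, amount=1.0, sigma=1.0):
    """gray: 2-D float array in [0, 255].  Returns a binary image (1 = ink)."""
    blurred = ndimage.gaussian_filter(gray, sigma)
    enhanced = np.clip(gray + amount * (gray - blurred), 0, 255)  # unsharp mask
    t = otsu_threshold(enhanced)
    return (enhanced < t).astype(np.uint8)            # dark pixels are ink

def otsu_threshold(img, bins=256):
    """Global Otsu threshold maximizing between-class variance."""
    hist, edges = np.histogram(img, bins=bins, range=(0, 255))
    p = hist.astype(float) / max(hist.sum(), 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)                                  # class-0 weight up to t
    m = np.cumsum(p * centers)                         # class-0 first moment
    mt = m[-1]
    w1 = 1 - w0
    valid = (w0 > 0) & (w1 > 0)
    between = np.zeros_like(w0)
    between[valid] = (mt * w0[valid] - m[valid]) ** 2 / (w0[valid] * w1[valid])
    return centers[np.argmax(between)]
```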

37 citations


"Syntactic and Semantic Labeling of ..." refers methods in this paper

  • ...We have used the approach suggested by [9] for pre-processing and binarization which makes use of non-linear adaptive filters....
