Proceedings Article

An approach for printed document labeling

01 May 2014, pp. 1-4
TL;DR: This paper proposes a model that labels the components of a printed document image, i.e. identifies headings, subheadings, captions, articles and photos, and reports promising results on printed documents in different scripts.
Abstract: A document image contains text and non-text regions, and it may be printed, handwritten, or a hybrid of both. In this paper we deal with printed documents, where the textual regions consist of printed characters and the non-text regions are mainly photographic images. We propose a model that labels the different components of a printed document image, i.e. identifies headings, subheadings, captions, articles and photos. Our method begins with a preprocessing stage in which fuzzy c-means clustering segments the document image into printed (object) region and background. The Hough transform is then used to find the white-line dividers of the object region, and grid structure examination is used to extract the non-text portion. After that, we use a horizontal histogram to find text lines, and then we label the different components. Our method gives promising results on printed documents in different scripts.
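
Two of the stages named in the abstract, fuzzy c-means segmentation into object and background and the horizontal histogram used to find text lines, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the `min_pixels` threshold, the two-cluster setup on raw gray levels, and the choice to treat the darker cluster as the object region are all assumptions made for illustration. The Hough transform and grid structure examination steps are not shown.

```python
def fuzzy_c_means_1d(values, m=2.0, max_iter=100, tol=1e-4):
    """Two-cluster fuzzy c-means on scalar intensities (illustrative sketch).

    Returns (centers, memberships); memberships[i][k] is the degree to which
    values[i] belongs to cluster k, evaluated against the centres of the last
    iteration (which differ from the returned ones by less than tol).
    """
    centers = [float(min(values)), float(max(values))]  # assumed initialisation
    if centers[0] == centers[1]:
        raise ValueError("need at least two distinct intensities")
    power = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(max_iter):
        memberships = []
        for x in values:
            dists = [abs(x - c) for c in centers]
            if 0.0 in dists:
                # The point sits exactly on a centre: full membership there.
                row = [1.0 if d == 0.0 else 0.0 for d in dists]
            else:
                row = [1.0 / sum((dk / dj) ** power for dj in dists) for dk in dists]
            memberships.append(row)
        new_centers = []
        for k in range(2):
            weights = [row[k] ** m for row in memberships]
            new_centers.append(sum(w * x for w, x in zip(weights, values)) / sum(weights))
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships


def object_mask(gray):
    """Mark a pixel 1 when its membership in the darker cluster exceeds 0.5."""
    flat = [p for row in gray for p in row]
    centers, u = fuzzy_c_means_1d(flat)
    dark = centers.index(min(centers))  # assumption: printed matter is dark
    width = len(gray[0])
    return [[1 if u[y * width + x][dark] > 0.5 else 0 for x in range(width)]
            for y in range(len(gray))]


def text_line_bands(mask, min_pixels=1):
    """Horizontal histogram: count object pixels per row, then group rows."""
    profile = [sum(row) for row in mask]
    bands, start = [], None
    for y, count in enumerate(profile):
        if count >= min_pixels and start is None:
            start = y
        elif count < min_pixels and start is not None:
            bands.append((start, y - 1))
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return profile, bands


# Toy grayscale page: two dark "text lines" on a light background.
page = [
    [250, 250, 250, 250, 250, 250],
    [250,  20,  30,  25, 250, 250],
    [250,  15, 250,  35,  20, 250],
    [250, 250, 250, 250, 250, 250],
    [245, 250, 250, 255, 250, 250],
    [250,  30,  25,  20,  30, 250],
    [250, 250, 250, 250, 250, 250],
]
profile, bands = text_line_bands(object_mask(page))
print(profile)  # object pixels per row
print(bands)    # inclusive (first_row, last_row) pairs, one per detected line
```

On this toy page the row bands correspond to the two dark strips. A real page would need a `min_pixels` value tuned to suppress noise, and the band heights could then feed whatever rule distinguishes headings from body text.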
References
Journal Article
01 Jul 1992
TL;DR: Both template matching and structure analysis approaches to R&D are considered and it is noted that the two approaches are coming closer and tending to merge.
Abstract: Research and development of OCR systems are considered from a historical point of view. The historical development of commercial systems is included. Both template matching and structure analysis approaches to R&D are considered. It is noted that the two approaches are coming closer and tending to merge. Commercial products are divided into three generations, for each of which some representative OCR systems are chosen and described in some detail. Some comments are made on recent techniques applied to OCR, such as expert systems and neural networks, and some open problems are indicated. The authors' views and hopes regarding future trends are presented.

892 citations


"An approach for printed document la..." refers background in this paper

  • ...The presented scheme works reasonably well as a precursor to different online OCRs....

  • ...Keywords—Document Image Analysis; Document Labeling; Fuzzy C-Means Clustering; Hough Transform; Optical Character Recognition I. INTRODUCTION In the field of document image analysis [1-3], the Optical Character Recognition (OCR) [4] has gained interest due to its utility in advancement of different applications (e.g. newspaper/magazine data automation, commercial/educational form processing, card/ banner/number plate analyzer, support system for blind readers etc.)....

  • ...Document labeling is a precursor stage of an OCR system....

  • ...After proper labeling of object region of a document image, an OCR can be fed well (the texts are fed to OCR and non-texts are sent to graphics processing system)....

Journal Article
TL;DR: The requirements and components of a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing, are outlined; several critical functions are investigated and their technical approaches discussed.
Abstract: This paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing. Several critical functions have been investigated and the technical approaches are discussed. The first is the segmentation and classification of digitized printed documents into regions of text and images. A nonlinear, run-length smoothing algorithm has been used for this purpose. By using the regular features of text lines, a linear adaptive classification scheme discriminates text regions from others. The second technique studied is an adaptive approach to the recognition of the hundreds of font styles and sizes that can occur on printed documents. A preclassifier is constructed during the input process and used to speed up a well-known pattern-matching method for clustering characters from an arbitrary print source into a small sample of prototypes. Experimental results are included.
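
The run-length smoothing step mentioned in this abstract can be sketched as follows. This is one common formulation, assuming a binary image with 1 for black and 0 for white: white runs no longer than a limit that lie between two black pixels are filled, rows and columns are smoothed independently, and the two results are combined with a logical AND. The function names, the limits, and the decision to leave runs that touch the image border untouched are illustrative assumptions, not details taken from the abstract.

```python
def smooth_runs(bits, limit):
    """Fill white runs (0s) of length <= limit that lie between two black pixels."""
    out = list(bits)
    last_black = None
    for i, b in enumerate(bits):
        if b == 1:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap <= limit:
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


def rlsa(image, h_limit, v_limit):
    """Smooth rows and columns separately, then AND the two bitmaps."""
    horizontal = [smooth_runs(row, h_limit) for row in image]
    smoothed_columns = [smooth_runs(list(col), v_limit) for col in zip(*image)]
    vertical = [list(row) for row in zip(*smoothed_columns)]  # transpose back
    return [[h & v for h, v in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]


row = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
print(smooth_runs(row, 2))  # the 2-pixel gap is filled, the 3-pixel gap is kept
```

Choosing a horizontal limit larger than the typical inter-character gap but smaller than the column gutter tends to merge characters into line-shaped blobs while keeping columns apart; suitable values depend on scan resolution.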

718 citations


"An approach for printed document la..." refers background in this paper

  • ...In the field of document image analysis [1-3], the Optical Character Recognition (OCR) [4] has gained interest due to its utility in advancement of different applications (e....

Journal Article
TL;DR: The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and evaluated.
Abstract: The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and evaluated.

544 citations


"An approach for printed document la..." refers background in this paper

  • ...In the field of document image analysis [1-3], the Optical Character Recognition (OCR) [4] has gained interest due to its utility in advancement of different applications (e....

Book
01 Jan 1994
TL;DR: A collection of chapters on document image analysis, including a perfectly parallel thinning algorithm, model-based analysis and understanding of check forms, and an adaptive modular neural network with application to unconstrained character recognition.
Abstract: Contents: a perfectly parallel thinning algorithm, Y.Y. Zhang and P.S.P. Wang; background structure in document images, H.S. Baird; analysis of form images, D.-C Wang and S.N. Srihari; model-based analysis and understanding of check forms, T.H. Minh and H. Bunke; document structures - a survey, Y.Y. Tang and C.Y. Suen; automatic input of logic diagrams by recognizing loop-symbols and rectilinear connections, S.H. Kim and J.H. Kim; syntactic analysis of technical drawing dimensions, S. Collin and D. Colnet; recognition of elevation value in topographic maps by multi-angled parallelism, H. Yamada et al; character recognition by signature approximation, N. Papamarkos et al; an adaptive modular neural network with application to unconstrained character recognition, L. Mui et al; a model-based split-and-merge method for character string recognition, H. Nishida and S. Mori; handprinted Chinese character recognition using probability distribution feature, T.F. Li and S.S. Yu; an algorithm for matching OCR-generated text strings, S.V. Rice et al.

13 citations


"An approach for printed document la..." refers background in this paper

  • ...In the field of document image analysis [1-3], the Optical Character Recognition (OCR) [4] has gained interest due to its utility in advancement of different applications (e....

Proceedings Article
04 Feb 2009
TL;DR: A document image analysis system which performs segmentation, content characterization as well as semantic labeling of components, and has obtained promising results for semantic segmentation of over 30 categories of documents in Indian scripts.
Abstract: In this paper we describe our document image analysis system which performs segmentation, content characterization as well as semantic labeling of components. Segmentation is done using white spaces and gives the segmented components arranged in a hierarchy. Semantic labeling is done using domain knowledge which is specified where possible in the form of a document model applicable to a class of documents. The novelty of the system lies in the suite of methods it employs which are capable of handling documents in Indian scripts. We have obtained promising results for semantic segmentation of over 30 categories of documents in Indian scripts.
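
To make the idea of whitespace-based segmentation into a hierarchy concrete, here is a minimal sketch of a recursive X-Y cut, a well-known technique of that kind. The abstract does not say that the described system uses this particular algorithm, so the approach is a stand-in; the function names, the `min_gap` threshold, and the nested-dict output format are all hypothetical.

```python
def content_runs(profile, min_gap):
    """Group non-empty profile entries into runs split by >= min_gap empty ones."""
    runs, start, last = [], None, None
    for i, value in enumerate(profile):
        if value:
            if start is None:
                start = i
            elif i - last - 1 >= min_gap:
                runs.append((start, last))
                start = i
            last = i
    if start is not None:
        runs.append((start, last))
    return runs


def xy_cut(mask, box, min_gap=2, horizontal=True, stalled=False):
    """Recursively split box = (top, left, bottom, right), bounds inclusive.

    Returns {'box': ..., 'children': [...]} or None for an empty region.
    """
    top, left, bottom, right = box
    if horizontal:  # cut along white rows
        profile = [sum(mask[y][left:right + 1]) for y in range(top, bottom + 1)]
        parts = [(top + a, left, top + b, right)
                 for a, b in content_runs(profile, min_gap)]
    else:           # cut along white columns
        profile = [sum(mask[y][x] for y in range(top, bottom + 1))
                   for x in range(left, right + 1)]
        parts = [(top, left + a, bottom, left + b)
                 for a, b in content_runs(profile, min_gap)]
    if not parts:
        return None
    if len(parts) == 1:
        if stalled:  # neither direction splits: this is a leaf block
            return {"box": parts[0], "children": []}
        return xy_cut(mask, parts[0], min_gap, not horizontal, stalled=True)
    children = [xy_cut(mask, part, min_gap, not horizontal) for part in parts]
    return {"box": box, "children": children}


# Toy binary page: a full-width heading above two columns.
mask = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
]
tree = xy_cut(mask, (0, 0, 4, 6))
print(tree)
```

The resulting tree has the heading strip and the two-column strip as siblings, with the two columns nested under the latter. A semantic labeling stage of the kind the abstract describes would then assign labels to these nodes using a document model, which this sketch does not attempt.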

2 citations