An approach for printed document labeling

doi:10.1109/ACES.2014.6808032

Proceedings Article

01 May 2014, pp. 1-4

TL;DR: The paper proposes a model that labels the components of a printed document image (heading, subheading, caption, article, and photo) and reports promising results on printed documents in different scripts.

Abstract: A document image contains text and non-text regions, and it may be printed, handwritten, or a hybrid of both. In this paper we deal with printed documents, in which the textual regions consist of printed characters and the non-text regions are mainly photo images. We propose a model that labels the different components of a printed document image, i.e. it identifies heading, subheading, caption, article, and photo. Our method begins with a preprocessing stage in which fuzzy c-means clustering segments the document image into printed (object) region and background. The Hough transform is then used to find the white-line dividers of the object region, and grid structure examination is used to extract the non-text portion. After that, we use a horizontal histogram to find text lines, and then we label the different components. Our method gives promising results on printed documents in different scripts.
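As a rough illustration of two stages named in the abstract, the Python sketch below first clusters pixel intensities into object and background with a minimal two-cluster fuzzy c-means, then scans the horizontal histogram (the row-wise count of object pixels) for text lines. It is not the authors' implementation: the function names, the fuzzifier m = 2, the evenly spaced initial centers, the assumption that the darker cluster is the printed region, and the synthetic test page are all assumptions made for this example. The Hough-transform and grid-examination steps are omitted.

```python
import numpy as np


def fuzzy_cmeans_1d(values, n_clusters=2, m=2.0, n_iter=50, tol=1e-5):
    """Minimal fuzzy c-means on 1-D data; returns (centers, memberships)."""
    x = np.asarray(values, dtype=float).ravel()
    # Assumption: deterministic start, centers spread evenly over the data range.
    centers = np.linspace(x.min(), x.max(), n_clusters)
    for _ in range(n_iter):
        # Distance of every sample to every center, shape (n_clusters, n_samples).
        dist = np.abs(x[None, :] - centers[:, None]) + 1e-9
        # Membership update: u_ik is proportional to d_ik ** (-2 / (m - 1)).
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=0, keepdims=True)
        # Center update: mean of the samples weighted by u ** m.
        w = u ** m
        new_centers = (w @ x) / w.sum(axis=1)
        shift = np.max(np.abs(new_centers - centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, u


def segment_object(gray):
    """Boolean mask, True where a pixel falls in the darker cluster."""
    gray = np.asarray(gray)
    centers, u = fuzzy_cmeans_1d(gray)
    dark = int(np.argmin(centers))      # assumption: print is darker than paper
    labels = np.argmax(u, axis=0)       # harden the fuzzy memberships
    return (labels == dark).reshape(gray.shape)


def find_text_lines(mask, min_pixels=1):
    """Return (top, bottom) row ranges where the horizontal histogram is non-empty."""
    profile = mask.sum(axis=1)          # object pixels per row
    rows = profile >= min_pixels
    lines, start = [], None
    for y, on in enumerate(rows):
        if on and start is None:
            start = y                   # a text line begins
        elif not on and start is not None:
            lines.append((start, y - 1))  # the line ended on the previous row
            start = None
    if start is not None:               # line touching the bottom edge
        lines.append((start, len(rows) - 1))
    return lines


# Synthetic page: white background with two dark bands standing in for text lines.
page = np.full((12, 20), 255)
page[2:4, 3:17] = 20
page[7:10, 3:17] = 30

mask = segment_object(page)
print(find_text_lines(mask))            # prints [(2, 3), (7, 9)]
```

On a real scan, `min_pixels` would likely need to be raised so that isolated noise pixels do not register as lines.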

References

Journal Article

Historical review of OCR research and development

Shunji Mori¹, Ching Y. Suen², Kazuhiko Yamamoto

Ricoh¹, Concordia University²

01 Jul 1992

TL;DR: Both template matching and structure analysis approaches to R&D are considered and it is noted that the two approaches are coming closer and tending to merge.

Abstract: Research and development of OCR systems are considered from a historical point of view. The historical development of commercial systems is included. Both template matching and structure analysis approaches to R&D are considered. It is noted that the two approaches are coming closer and tending to merge. Commercial products are divided into three generations, for each of which some representative OCR systems are chosen and described in some detail. Some comments are made on recent techniques applied to OCR, such as expert systems and neural networks, and some open problems are indicated. The authors' views and hopes regarding future trends are presented.

892 citations

"An approach for printed document la..." refers background in this paper

...The presented scheme works reasonably well as a precursor to different online OCRs....

...Keywords: Document Image Analysis; Document Labeling; Fuzzy C-Means Clustering; Hough Transform; Optical Character Recognition I. INTRODUCTION In the field of document image analysis [1-3], the Optical Character Recognition (OCR) [4] has gained interest due to its utility in advancement of different applications (e.g. newspaper/magazine data automation, commercial/educational form processing, card/banner/number plate analyzer, support system for blind readers etc.)....

...Document labeling is a precursor stage of an OCR system....

...After proper labeling of object region of a document image, an OCR can be fed well (the texts are fed to OCR and non-texts are sent to graphics processing system)....

Journal Article

Document analysis system

Kwan Y. Wong¹, Richard G. Casey¹, Friedrich M. Wahl

IBM¹

01 Nov 1982, IBM Journal of Research and Development

TL;DR: Outlines the requirements and components of a proposed Document Analysis System that assists a user in encoding printed documents for computer processing, and discusses the technical approaches to several critical functions.

Abstract: This paper outlines the requirements and components for a proposed Document Analysis System, which assists a user in encoding printed documents for computer processing. Several critical functions have been investigated and the technical approaches are discussed. The first is the segmentation and classification of digitized printed documents into regions of text and images. A nonlinear, run-length smoothing algorithm has been used for this purpose. By using the regular features of text lines, a linear adaptive classification scheme discriminates text regions from others. The second technique studied is an adaptive approach to the recognition of the hundreds of font styles and sizes that can occur on printed documents. A preclassifier is constructed during the input process and used to speed up a well-known pattern-matching method for clustering characters from an arbitrary print source into a small sample of prototypes. Experimental results are included.

718 citations

"An approach for printed document la..." refers background in this paper

...In the field of document image analysis [1-3], the Optical Character Recognition (OCR) [4] has gained interest due to its utility in advancement of different applications (e....

Journal Article

Twenty years of document image analysis in PAMI

George Nagy¹

Rensselaer Polytechnic Institute¹

01 Jan 2000, IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and evaluated.

Abstract: The contributions to document image analysis of 99 papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) are clustered, summarized, interpolated, interpreted, and evaluated.

544 citations

"An approach for printed document la..." refers background in this paper

...In the field of document image analysis [1-3], the Optical Character Recognition (OCR) [4] has gained interest due to its utility in advancement of different applications (e....

Book

Document Image Analysis

Patrick S. P. Wang, Henry S. Baird, Horst Bunke

01 Jan 1994

TL;DR: An edited collection on document image analysis, with chapters ranging from a perfectly parallel thinning algorithm and model-based analysis and understanding of check forms to an adaptive modular neural network with application to unconstrained character recognition.

Abstract (table of contents): A perfectly parallel thinning algorithm (Y.Y. Zhang and P.S.P. Wang); Background structure in document images (H.S. Baird); Analysis of form images (D.-C Wang and S.N. Srihari); Model-based analysis and understanding of check forms (T.H. Minh and H. Bunke); Document structures - a survey (Y.Y. Tang and C.Y. Suen); Automatic input of logic diagrams by recognizing loop-symbols and rectilinear connections (S.H. Kim and J.H. Kim); Syntactic analysis of technical drawing dimensions (S. Collin and D. Colnet); Recognition of elevation value in topographic maps by multi-angled parallelism (H. Yamada et al.); Character recognition by signature approximation (N. Papamarkos et al.); An adaptive modular neural network with application to unconstrained character recognition (L. Mui et al.); A model-based split-and-merge method for character string recognition (H. Nishida and S. Mori); Handprinted Chinese character recognition using probability distribution feature (T.F. Li and S.S. Yu); An algorithm for matching OCR-generated text strings (S.V. Rice et al.).

13 citations

"An approach for printed document la..." refers background in this paper

...In the field of document image analysis [1-3], the Optical Character Recognition (OCR) [4] has gained interest due to its utility in advancement of different applications (e....

Proceedings Article

Syntactic and Semantic Labeling of Hierarchically Organized Document Image Components of Indian Scripts

Gaurav Harit¹, Ritu Garg², Santanu Chaudhury²

Indian Institute of Technology Kharagpur¹, Indian Institute of Technology Delhi²

04 Feb 2009

TL;DR: A document image analysis system which performs segmentation, content characterization as well as semantic labeling of components, and has obtained promising results for semantic segmentation of over 30 categories of documents in Indian scripts.

Abstract: In this paper we describe our document image analysis system which performs segmentation, content characterization as well as semantic labeling of components. Segmentation is done using white spaces and gives the segmented components arranged in a hierarchy. Semantic labeling is done using domain knowledge which is specified where possible in the form of a document model applicable to a class of documents. The novelty of the system lies in the suite of methods it employs which are capable of handling documents in Indian scripts. We have obtained promising results for semantic segmentation of over 30 categories of documents in Indian scripts.

2 citations
