Showing papers by "Umapada Pal" published in 2004


Journal ArticleDOI
TL;DR: A review of the OCR work done on Indian language scripts and the scope of future work and further steps needed for Indian script OCR development is presented.

592 citations


Journal ArticleDOI
01 Aug 2004
TL;DR: A novel scheme, mainly based on the concept of water reservoir analogy, to extract individual text lines from printed Indian documents containing multioriented and/or curved text lines is proposed.
Abstract: Some printed artistic documents contain text lines that are not parallel to each other within a single page. These text lines may have different orientations, or they may be curved. For the optical character recognition (OCR) of such documents, the lines must be extracted properly. In this paper, we propose a novel scheme, mainly based on the concept of water reservoir analogy, to extract individual text lines from printed Indian documents containing multioriented and/or curved text lines. A reservoir is a metaphor for the cavity region of a character where water can be stored. In the proposed scheme, connected components are first labeled and identified as either isolated or touching. Next, each touching component is classified as either straight type (S-type) or curve type (C-type), depending on the reservoir base-area and envelope points of the component. Based on its type (S-type or C-type), two candidate points are computed from each touching component. Finally, the candidate regions (neighborhoods of the candidate points) of each component are detected, and after analyzing these candidate regions, components are grouped to get individual text lines.
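
The reservoir metaphor above can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a binary image given as a list of rows (1 = black), pours water only from the top, and approximates the component by its upper profile, so overhanging strokes are ignored. The function names `top_profile` and `top_reservoir` are hypothetical.

```python
def top_profile(image):
    """Height of the top-most black pixel in each column, measured from the bottom row."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    heights = []
    for c in range(cols):
        h = 0  # a column with no black pixel has height 0
        for r in range(rows):
            if image[r][c]:
                h = rows - r
                break
        heights.append(h)
    return heights


def top_reservoir(image):
    """Return (area, depth) of the water held when it is poured from the top.

    Simplification (assumption): water in a column rises to the lower of the
    highest profile points on its left and on its right.
    """
    heights = top_profile(image)
    n = len(heights)
    if n == 0:
        return 0, 0
    # Highest profile point seen so far, scanning left to right.
    left_max = [0] * n
    running = 0
    for i in range(n):
        running = max(running, heights[i])
        left_max[i] = running
    area = 0
    depth = 0
    running = 0
    # Scan right to left, tracking the highest point on the right.
    for i in range(n - 1, -1, -1):
        running = max(running, heights[i])
        water = min(left_max[i], running) - heights[i]
        area += water
        depth = max(depth, water)
    return area, depth


u_shape = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
area, depth = top_reservoir(u_shape)
```

For the U-shaped component, the cavity between the two vertical strokes holds water, while a solid block holds none. Reservoirs from other sides could be obtained in the same way after rotating the image.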

83 citations


Proceedings ArticleDOI
26 Oct 2004
TL;DR: A water-reservoir-concept based scheme is proposed for the segmentation of unconstrained Oriya handwritten text into individual characters; touching characters of a word are segmented using structural, topological and water-reservoir-concept based features.
Abstract: Segmentation of handwritten text into lines, words and characters is one of the important steps in a handwriting recognition system. For the segmentation of unconstrained Oriya handwritten text into individual characters, a water reservoir-concept based scheme is proposed in this paper. First, the text image is segmented into lines, then lines are segmented into individual words, and words are segmented into individual characters. For line segmentation the document is divided into vertical stripes. The width of a stripe is calculated by analyzing the heights of the water reservoirs obtained from different components of the document. Stripe-wise horizontal histograms are then computed and the relationship of the peak-valley points of the histograms is used for line segmentation. Based on the vertical projection profile and structural features of Oriya characters, text lines are segmented into words. For character segmentation, isolated and connected (touching) characters in a word are first detected. The touching characters of the word are then segmented using structural, topological and water-reservoir-concept based features.
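
A minimal sketch of the stripe-wise horizontal histogram step follows. It makes two assumptions that differ from the abstract: the stripe width is passed in as a parameter rather than derived from reservoir heights, and a valley is taken to be a run of completely empty rows between two runs of text rows, which is cruder than the peak-valley analysis the paper describes. The names `stripe_histograms` and `valley_rows` are hypothetical.

```python
def stripe_histograms(image, stripe_width):
    """Split a binary image (list of rows, 1 = black) into vertical stripes
    and return one row-wise black-pixel histogram per stripe."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    histograms = []
    for start in range(0, cols, stripe_width):
        end = min(start + stripe_width, cols)
        histograms.append([sum(image[r][start:end]) for r in range(rows)])
    return histograms


def valley_rows(histogram):
    """Middle row of each run of empty rows that lies between two runs of text rows."""
    valleys = []
    seen_text = False
    run_start = None
    for r, count in enumerate(histogram):
        if count == 0:
            # Start a valley only after some text has been seen above it.
            if seen_text and run_start is None:
                run_start = r
        else:
            if run_start is not None:
                valleys.append((run_start + r - 1) // 2)
                run_start = None
            seen_text = True
    return valleys


page = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
histograms = stripe_histograms(page, stripe_width=2)
cuts = [valley_rows(h) for h in histograms]
```

Because each stripe is analysed separately, the cut rows can differ from stripe to stripe, which is what lets this kind of approach follow skewed or undulating handwritten lines.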

74 citations


Proceedings ArticleDOI
26 Oct 2004
TL;DR: A two-stage MLP based classifier is employed to recognise Bangla and Arabic numerals for the sorting of postal documents written in Arabic and a local language, Bangla, for postal automation in India.
Abstract: In this paper, we present a system towards Indian postal automation. In the proposed system, we first decompose the image into blocks using the run length smoothing algorithm (RLSA). Based on the black pixel density and the number of components inside a block, non-text blocks (postal stamp, postal seal, etc.) are detected. Using positional information, the destination address block (DAB) is identified from the text blocks. Next, the pin-code box is detected in the DAB and the numerals are extracted from the pin-code box. Since India is a multi-lingual and multi-script country, the address part may be written in a combination of two languages: Arabic and a local language. For the sorting of postal documents written in Arabic and a local language, Bangla, a two-stage MLP based classifier is employed to recognise Bangla and Arabic numerals. At present, the accuracy of the handwritten numeral recognition module is 92.10%.
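
The run length smoothing step can be sketched as below. This is a generic textbook-style RLSA, not necessarily the variant used in the paper: white runs no longer than a threshold that are bounded by black pixels on both sides are filled, and the horizontal and vertical results are combined with a logical AND. The thresholds and function names are assumptions chosen for illustration.

```python
def rlsa_horizontal(image, threshold):
    """Fill white runs of length <= threshold that lie between two black pixels in a row."""
    smoothed = [row[:] for row in image]  # work on a copy, leave the input untouched
    for row in smoothed:
        last_black = None
        for c, pixel in enumerate(row):
            if pixel:
                if last_black is not None:
                    gap = c - last_black - 1
                    if 0 < gap <= threshold:
                        for k in range(last_black + 1, c):
                            row[k] = 1
                last_black = c
    return smoothed


def rlsa_vertical(image, threshold):
    """Apply the same smoothing along columns by transposing the image."""
    transposed = [list(col) for col in zip(*image)]
    smoothed = rlsa_horizontal(transposed, threshold)
    return [list(row) for row in zip(*smoothed)]


def rlsa(image, h_threshold, v_threshold):
    """Combine horizontal and vertical smoothing with a pixel-wise AND."""
    horizontal = rlsa_horizontal(image, h_threshold)
    vertical = rlsa_vertical(image, v_threshold)
    return [
        [h & v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]


line = [[1, 0, 0, 1, 0, 0, 0, 0, 1]]
merged = rlsa_horizontal(line, threshold=2)
```

After smoothing, nearby characters merge into solid blocks, and each block can then be examined for properties such as black pixel density and the number of components it contains.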

59 citations


Proceedings ArticleDOI
20 Dec 2004
TL;DR: In the proposed scheme at first document skew is detected and corrected, non-text parts are then segmented from the document using run length smoothing algorithm (RLSA), and a tree classifier is generated for word-wise Bangla/Devnagari and English scripts identification.
Abstract: Postal automation has been a topic of research over the last few years. There are many works towards postal automation in the USA, UK, Japan and Australia, but for Indian postal automation there is no significant work. This paper deals with word-wise handwritten script identification for Indian postal automation. In the proposed scheme, document skew is first detected and corrected. Non-text parts are then segmented from the document using the run length smoothing algorithm (RLSA). Next, using a piece-wise projection method, the destination address block (DAB) is segmented first into lines and then the lines into words. Using the water reservoir concept, we compute the busy-zone of the word. Finally, using matra/Shirorekha, water reservoir concept based features, etc., a tree classifier is generated for word-wise identification of Bangla/Devnagari and English scripts.

41 citations


Book ChapterDOI
08 Sep 2004
TL;DR: A robust technique is proposed for word-wise script identification in Indian doublet form documents, using different topological and structural features to separate words of different scripts in such documents.
Abstract: In a country like India, a single text line of most official documents contains words in two different scripts. Under the two-language formula, Indian documents are written in English and the state official language. For Optical Character Recognition (OCR) of such a document page, it is necessary to separate the words of different scripts before feeding them to the OCRs of the individual scripts. In this paper a robust technique is proposed for word-wise script identification in Indian doublet form documents. First, the document is segmented into lines and then the lines are segmented into words. Using different topological and structural features (like number of loops, headline feature, water reservoir concept based features, profile features, etc.), individual script words are identified from the documents. The proposed scheme was tested on 24210 words of different doublets and obtained more than 97% accuracy on average.

24 citations


Proceedings Article
01 Jan 2004
TL;DR: A system towards recognition of Bangla pincode numerals for Indian postal automation is presented; it combines a Neural Network with a tree classifier based approach and at present achieves 94.21% overall accuracy.
Abstract: In this paper, we present a system towards recognition of Bangla pincode numerals for Indian postal automation. In the proposed system, at first, using structural features the broken numerals are joined. Next, combining Neural Network (NN) and tree classifier based approach, the numerals are recognized. Considering similar shaped numerals, at first, NN classifies the 10 numerals into six groups. Next, tree classifier is used for final recognition. The features used for the NN based recognition are the number and position of end points, junction points, position of the centre of gravity, and distance between the centre of the bounding box and the centre of gravity, etc. of a numeral. Different features used for tree classifier are based on water reservoir concept, structural features, and topological features. Overall accuracy of the proposed system is at present 94.21%.
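
One of the features listed above, the distance between the centre of the bounding box and the centre of gravity, is simple enough to sketch directly. The sketch assumes a binary numeral image given as a list of rows (1 = black) and measures the distance in pixel units; the paper may normalise the feature differently, and the function name is hypothetical.

```python
import math


def cog_to_box_centre_distance(image):
    """Euclidean distance between the centre of gravity of the black pixels
    and the centre of their bounding box."""
    points = [
        (r, c)
        for r, row in enumerate(image)
        for c, pixel in enumerate(row)
        if pixel
    ]
    if not points:
        raise ValueError("image has no black pixels")
    rows = [r for r, _ in points]
    cols = [c for _, c in points]
    # Centre of gravity: mean row and column of all black pixels.
    cog = (sum(rows) / len(points), sum(cols) / len(points))
    # Bounding-box centre: midpoint of the extreme rows and columns.
    box_centre = ((min(rows) + max(rows)) / 2, (min(cols) + max(cols)) / 2)
    return math.hypot(cog[0] - box_centre[0], cog[1] - box_centre[1])


square = [[1, 1], [1, 1]]
l_shape = [[1, 0], [1, 1]]
symmetric_distance = cog_to_box_centre_distance(square)
skewed_distance = cog_to_box_centre_distance(l_shape)
```

A symmetric shape gives a distance of zero, while a shape whose ink is concentrated on one side gives a larger value, which is why the feature helps to separate numerals with different mass distributions.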

14 citations


Proceedings Article
01 Jan 2004
TL;DR: A recognition scheme for isolated off-line unconstrained Malayalam handwritten numerals is proposed, based mainly on the water-reservoir concept together with topological and structural features.
Abstract: The main problem in handwriting recognition is the huge variability and distortion of patterns. To take care of the writing variability of different individuals, a recognition scheme for isolated off-line unconstrained Malayalam handwritten numerals is proposed here. The main features used in the scheme are based on the water-reservoir concept. A reservoir is a metaphor for the cavity region of the numeral where water can be stored if water is poured from a side of the numeral. The important reservoir based features used in the scheme are: (i) number of reservoirs, (ii) positions of reservoirs with respect to the bounding box of the touching pattern, (iii) height and width of the reservoirs, (iv) water flow direction, etc. Topological and structural features are also used for the recognition along with water reservoir concept based features. Closed-loop features (number of closed loops, position of loops with respect to the bounding box of the component) are the main topological features used here. For the structural features, we consider the morphological pattern of the numeral. At present we obtained 96.34% overall recognition accuracy.
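
The closed-loop count mentioned above can be sketched as a flood fill over the background: every white region that never reaches the image border is enclosed by ink and is counted as one loop. This is a common way to count holes and is offered as an assumption, not as the paper's procedure. The image is a list of rows with 1 = black, background regions use 4-connectivity, and the function name is hypothetical.

```python
from collections import deque


def count_closed_loops(image):
    """Count white regions that are completely enclosed by black pixels."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    loops = 0
    for r0 in range(rows):
        for c0 in range(cols):
            if image[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill of one white region.
            touches_border = False
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                if r in (0, rows - 1) or c in (0, cols - 1):
                    touches_border = True
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (
                        0 <= nr < rows
                        and 0 <= nc < cols
                        and not image[nr][nc]
                        and not seen[nr][nc]
                    ):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if not touches_border:
                loops += 1
    return loops


zero_like = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
eight_like = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
stroke = [[0, 1, 0], [0, 1, 0]]
loop_counts = [count_closed_loops(img) for img in (zero_like, eight_like, stroke)]
```

The position of each loop relative to the bounding box could be recorded in the same pass by keeping the coordinates visited during the fill.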

3 citations