
Showing papers on "Optical character recognition published in 2008"


Journal ArticleDOI
12 Sep 2008-Science
TL;DR: This research explored whether human effort can be channeled into a useful purpose: helping to digitize old printed material by asking users to decipher scanned words from books that computerized optical character recognition failed to recognize.
Abstract: CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) are widespread security measures on the World Wide Web that prevent automated programs from abusing online services. They do so by asking humans to perform a task that computers cannot yet perform, such as deciphering distorted characters. Our research explored whether such human effort can be channeled into a useful purpose: helping to digitize old printed material by asking users to decipher scanned words from books that computerized optical character recognition failed to recognize. We showed that this method can transcribe text with a word accuracy exceeding 99%, matching the guarantee of professional human transcribers. Our apparatus is deployed in more than 40,000 Web sites and has transcribed over 440 million words.

1,155 citations


Journal ArticleDOI
TL;DR: This paper offers to researchers a link to a public image database to define a common reference point for LPR algorithmic assessment and issues such as processing time, computational power, and recognition rate are addressed.
Abstract: License plate recognition (LPR) algorithms in images or videos are generally composed of the following three processing steps: 1) extraction of a license plate region; 2) segmentation of the plate characters; and 3) recognition of each character. This task is quite challenging due to the diversity of plate formats and the nonuniform outdoor illumination conditions during image acquisition. Therefore, most approaches work only under restricted conditions such as fixed illumination, limited vehicle speed, designated routes, and stationary backgrounds. Numerous techniques have been developed for LPR in still images or video sequences, and the purpose of this paper is to categorize and assess them. Issues such as processing time, computational power, and recognition rate are also addressed, when available. Finally, this paper offers to researchers a link to a public image database to define a common reference point for LPR algorithmic assessment.

575 citations


Proceedings ArticleDOI
27 Jan 2008
TL;DR: The current status of the OCR system, its general architecture, as well as the major algorithms currently being used for layout analysis and text line recognition are described.
Abstract: OCRopus is a new, open source OCR system emphasizing modularity, easy extensibility, and reuse, aimed at both the research community and large scale commercial document conversions. This paper describes the current status of the system, its general architecture, as well as the major algorithms currently being used for layout analysis and text line recognition.

239 citations


Journal ArticleDOI
TL;DR: This work presents a geometric rectification framework for restoring the frontal-flat view of a document from a single camera-captured image and estimates the 3D document shape from texture flow information obtained directly from the image without requiring additional 3D/metric data or prior camera calibration.
Abstract: Compared to typical scanners, handheld cameras offer convenient, flexible, portable, and noncontact image capture, which enables many new applications and breathes new life into existing ones. However, camera-captured documents may suffer from distortions caused by a nonplanar document shape and perspective projection, which lead to the failure of current optical character recognition (OCR) technologies. We present a geometric rectification framework for restoring the frontal-flat view of a document from a single camera-captured image. Our approach estimates the 3D document shape from texture flow information obtained directly from the image without requiring additional 3D/metric data or prior camera calibration. Our framework provides a unified solution for both planar and curved documents and can be applied in many, especially mobile, camera-based document analysis applications. Experiments show that our method produces results that are significantly more OCR compatible than the original images.

184 citations


Journal ArticleDOI
Hiromichi Fujisawa1
TL;DR: An overview of the last 40 years of technical advances in the field of character and document recognition in Japan is presented, and robustness design principles, which have proven effective in solving complex problems in postal address recognition, are discussed.

158 citations


Patent
C. Philipp Schloter1, Jiang Gao1
10 Mar 2008
TL;DR: In this paper, a device for switching between code-based searching, optical character recognition (OCR) searching and visual searching is presented, which includes a media content input for receiving media content from a camera or other element of the device and transferring this media content to a switch.
Abstract: A device for switching between code-based searching, optical character recognition (OCR) searching and visual searching is provided. The device includes a media content input for receiving media content from a camera or other element of the device and transferring this media content to a switch. Additionally, the device includes a meta-information input capable of receiving meta- information from an element of the device and transferring the meta-information to the switch. The switch is able to utilize the received media content and the meta- information to select and/or switch between a visual search algorithm, an OCR algorithm and a code-based algorithm.

143 citations


Journal ArticleDOI
TL;DR: The memory requirements are uniquely designed to be extremely low, which enables usage of smaller FPGAs, and the resulting hardware is suitable for applications where cost, compactness, and efficiency are system design constraints.
Abstract: In this paper, a video processing methodology for a field-programmable gate array (FPGA)-based license plate recognition (LPR) system is presented. Raster scan video is used as the input with low memory utilization. In the design, Gabor filter, thresholding, and connected component labeling (CCL) algorithms are used to obtain the license plate region. This region is segmented into disjoint characters for the character recognition phase, where a self-organizing map (SOM) neural network is used to identify the characters. The system is portable and relatively faster than computer-based recognition systems. The robustness of the system has been tested with a large database acquired from parking lots and a highway. The memory requirements are uniquely designed to be extremely low, which enables usage of smaller FPGAs. The resulting hardware is suitable for applications where cost, compactness, and efficiency are system design constraints.
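The threshold and CCL stages of such a pipeline can be sketched in software (the paper implements them in FPGA hardware and adds a Gabor filter front end, omitted here; the function names and the size/aspect heuristics below are illustrative assumptions, not the authors' design):

```python
import numpy as np
from collections import deque

def label_components(binary):
    """4-connected component labeling (CCL) via breadth-first flood fill."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    current = 0
    for y in range(h):
        for x in range(w):
            if binary[y, x] and labels[y, x] == 0:
                current += 1
                labels[y, x] = current
                q = deque([(y, x)])
                while q:
                    cy, cx = q.popleft()
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = current
                            q.append((ny, nx))
    return labels, current

def candidate_plate_regions(gray, thresh=128, min_area=500, max_aspect=6.0):
    """Binarize, label components, and keep bounding boxes whose area and
    width/height ratio are plausible for a license plate (toy heuristics)."""
    labels, n = label_components(gray > thresh)
    regions = []
    for lbl in range(1, n + 1):
        ys, xs = np.nonzero(labels == lbl)
        hgt = ys.max() - ys.min() + 1
        wdt = xs.max() - xs.min() + 1
        if hgt * wdt >= min_area and wdt / hgt <= max_aspect:
            regions.append((ys.min(), xs.min(), hgt, wdt))
    return regions
```

On hardware, the same labeling is done in a single raster-scan pass to keep memory low; the BFS version here is only for clarity.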

122 citations


Journal ArticleDOI
TL;DR: The proposed technique retrieves document images by a new word shape coding scheme, which captures the document content through annotating each word image by a word shape code.
Abstract: This paper presents a document retrieval technique that is capable of searching document images without optical character recognition (OCR). The proposed technique retrieves document images by a new word shape coding scheme, which captures the document content through annotating each word image by a word shape code. In particular, we annotate word images by using a set of topological shape features including character ascenders/descenders, character holes, and character water reservoirs. With the annotated word shape codes, document images can be retrieved by either query keywords or a query document image. Experimental results show that the proposed document image retrieval technique is fast, efficient, and tolerant to various types of document degradation.
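As an illustration of the word shape coding idea (not the authors' exact scheme, which also uses character holes and water reservoirs), a word can be annotated by a code derived from each character's vertical extent. The function name and the band-based rule below are hypothetical:

```python
def word_shape_code(char_boxes, x_top, x_bottom):
    """Toy word shape code: label each character 'a' (ascender),
    'd' (descender), 'b' (both), or 'x' (x-height only), judged from its
    vertical extent relative to the x-height band [x_top, x_bottom].
    char_boxes: (top, bottom) pairs in image coordinates (top < bottom)."""
    code = []
    for top, bottom in char_boxes:
        asc = top < x_top          # rises above the x-height band
        desc = bottom > x_bottom   # drops below the baseline
        code.append('b' if asc and desc else 'a' if asc else 'd' if desc else 'x')
    return ''.join(code)
```

Retrieval then reduces to matching these code strings instead of running OCR, which is what makes the approach fast and degradation-tolerant.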

111 citations


Proceedings ArticleDOI
01 Dec 2008
TL;DR: In this paper, the authors used four feature extraction techniques namely, intersection, shadow feature, chain code histogram, and straight line fitting features for handwritten Devnagari characters recognition using weighted majority voting technique.
Abstract: In this paper, we present an OCR for handwritten Devnagari characters. Basic symbols are recognized by a neural classifier. We have used four feature extraction techniques, namely intersection, shadow, chain code histogram, and straight line fitting features. Shadow features are computed globally for the character image, while intersection, chain code histogram, and line fitting features are computed by dividing the character image into different segments. A weighted majority voting technique is used to combine the classification decisions obtained from four multilayer perceptron (MLP)-based classifiers. In experiments with a dataset of 4900 samples, the overall recognition rate observed is 92.80% when the top five choices are considered. This method is compared with other recent methods for handwritten Devnagari character recognition and has been observed to have a better success rate.

110 citations


01 Dec 2008
TL;DR: This paper seeks to provide a comprehensive review of the methods of off-line handwriting text line segmentation proposed by researchers to develop a reliable OCR system for handwriting recognition.
Abstract: Text line segmentation is an essential pre-processing stage for off-line handwriting recognition in many Optical Character Recognition (OCR) systems. It is an important step because inaccurately segmented text lines will cause errors in the recognition stage. Text line segmentation of handwritten documents is still one of the most complicated problems in developing a reliable OCR system. The nature of handwriting makes the process of text line segmentation very challenging. Several techniques for segmenting handwritten text lines have been proposed in the past. This paper seeks to provide a comprehensive review of the off-line handwriting text line segmentation methods proposed by researchers.

91 citations


Proceedings ArticleDOI
16 Sep 2008
TL;DR: In this paper a complete OCR methodology for recognizing historical documents, either printed or handwritten, without any knowledge of the font is presented.
Abstract: In this paper a complete OCR methodology for recognizing historical documents, either printed or handwritten, without any knowledge of the font is presented. This methodology consists of three steps: the first two steps refer to creating a database for training using a set of documents, while the third refers to the recognition of new document images. First, a pre-processing step that includes image binarization and enhancement takes place. In the second step, a top-down segmentation approach is used in order to detect text lines, words, and characters. A clustering scheme is then adopted in order to group characters of similar shape. This is a semi-automatic procedure, since the user is able to interact at any time in order to correct possible clustering errors and assign an ASCII label. After this step, a database is created for use in recognition. Finally, in the third step, for every new document image the above segmentation approach takes place, while recognition is based on the character database produced in the previous step.

P. Vanaja Ranjan1
01 Jan 2008
TL;DR: A zone centroid and image centroid based distance metric feature extraction system is proposed, and 99%, 99%, 96%, and 95% recognition rates for Kannada, Telugu, Tamil, and Malayalam numerals, respectively, are obtained.
Abstract: Character recognition is an important area in the image processing and pattern recognition fields. Handwritten character recognition has received extensive attention in academia and industry. A recognition system can be either on-line or off-line. Off-line handwriting recognition is a subfield of optical character recognition. India is a multi-lingual, multi-script country, where eighteen official scripts are accepted and over a hundred regional languages are spoken. In this paper we propose a zone centroid and image centroid based distance metric feature extraction system. The character centroid is computed and the image (character/numeral) is further divided into n equal zones. The average distance from the character centroid to each pixel present in a zone is computed. Similarly, the zone centroid is computed, and the average distance from the zone centroid to each pixel present in the zone is computed. This procedure is repeated for all the zones/grids/boxes present in the numeral image. If a zone is empty, the value for that zone in the feature vector is zero. Finally, 2*n such features are extracted. Nearest neighbor and feed-forward back-propagation neural network classifiers are used for subsequent classification and recognition. We obtained 99%, 99%, 96%, and 95% recognition rates for Kannada, Telugu, Tamil, and Malayalam numerals, respectively.
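A minimal sketch of the described 2*n feature computation, assuming horizontal zones for simplicity (the paper's zones/grids may be a 2-D partition, and the function name is ours, not the authors'):

```python
import numpy as np

def centroid_distance_features(img, n_zones=4):
    """For a binary character image, return 2*n_zones features: per zone,
    (a) the mean distance of foreground pixels to the whole-image centroid
    and (b) the mean distance to the zone's own centroid. Empty zones
    contribute 0, as in the described scheme."""
    ys, xs = np.nonzero(img)
    icy, icx = ys.mean(), xs.mean()              # image (character) centroid
    h = img.shape[0]
    feats = []
    for z in range(n_zones):
        top, bot = z * h // n_zones, (z + 1) * h // n_zones
        zys, zxs = np.nonzero(img[top:bot, :])
        if zys.size == 0:                        # empty zone -> feature value 0
            feats += [0.0, 0.0]
            continue
        zys = zys + top                          # back to whole-image row coords
        feats.append(float(np.hypot(zys - icy, zxs - icx).mean()))
        zcy, zcx = zys.mean(), zxs.mean()        # zone centroid
        feats.append(float(np.hypot(zys - zcy, zxs - zcx).mean()))
    return feats
```

The resulting 2*n-dimensional vector would then feed the nearest neighbor or neural network classifier mentioned in the abstract.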

Journal ArticleDOI
TL;DR: The presented technique for the recognition of optical off-line handwritten Arabic (Indian) numerals using hidden Markov models (HMM) is writer independent as separate writers' data were used in training of the classifiers and other writers' information was used in the testing phase.

Journal ArticleDOI
TL;DR: A geometric matching algorithm is used to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property and shows that by removing characters outside the computed page frame, the OCR error rate is reduced.
Abstract: When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3 to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based document image retrieval application decreases the retrieval error rates by 30%.

Book ChapterDOI
01 Jan 2008
TL;DR: The most widely known applications of DAR are related to the processing of office documents and to the automatic mail sorting and the use of inexpensive high-resolution scanning devices combined with powerful computers, state-of-the-art OCR packages can solve simple recognition tasks for most users.
Abstract: Document Analysis and Recognition (DAR) aims at the automatic extraction of information presented on paper and initially addressed to human comprehension. The desired output of DAR systems is usually a suitable symbolic representation that can subsequently be processed by computers. Over the centuries, paper documents have been the principal instrument to make permanent the progress of humankind. Nowadays, most information is still recorded, stored, and distributed in paper format. The widespread use of computers for document editing, with the introduction of PCs and word processors in the late 1980s, had the effect of increasing, instead of reducing, the amount of information held on paper. Even if current technological trends seem to move towards a paperless world, some studies demonstrated that the use of paper as a medium for information exchange is still increasing [1]. Moreover, there are still application domains where paper persists as the preferred medium [2]. The most widely known applications of DAR are related to the processing of office documents (such as invoices, bank documents, business letters, and checks) and to automatic mail sorting. With the current availability of inexpensive high-resolution scanning devices, combined with powerful computers, state-of-the-art OCR packages can solve simple recognition tasks for most users. Recent research directions are widening the use of DAR techniques; significant examples are the processing of ancient/historical documents in digital libraries, information extraction from "digital born" documents, such as PDF and HTML, and the analysis of natural images (acquired with mobile phones and digital cameras) containing textual information. The development of a DAR system requires the integration of several competences in computer science, among others: image processing, pattern recognition, natural language processing, artificial intelligence, and database systems. DAR applications are
particularly suitable for the incorporation of

Journal ArticleDOI
TL;DR: This work investigates the use of support vector machines to improve the classification of InftyReader, a free system for the OCR of mathematical documents, and describes a successful approach to multi-class classification with SVM, utilizing the ranking of alternatives within InftyReader's confusion clusters.

Journal ArticleDOI
TL;DR: Empirical study shows that the proposed artificial immune system (AIS)-based pattern classification approach exhibits very good generalization ability in generating a smaller prototype library from a larger one and at the same time giving a substantial improvement in the classification accuracy of the underlying NN classifier.
Abstract: The artificial immune system (AIS)-based pattern classification approach is relatively new in the field of pattern recognition. The study explores the potential of this paradigm in the context of the prototype selection task, which is primarily effective in improving the classification performance of the nearest-neighbor (NN) classifier and also partially in reducing its storage and computing time requirements. The clonal selection model of immunology has been incorporated to condense the original prototype set, and performance is verified by employing the proposed technique in a practical optical character recognition (OCR) system as well as for training and testing on a set of benchmark databases available in the public domain. The effect of control parameters is analyzed, and the efficiency of the method is compared with other existing techniques often used for prototype selection. In the case of the OCR system, empirical study shows that the proposed approach exhibits very good generalization ability in generating a smaller prototype library from a larger one while at the same time giving a substantial improvement in the classification accuracy of the underlying NN classifier. The improvement in performance has been statistically verified. Consideration of both OCR data and public domain datasets demonstrates that the proposed method gives results better than, or at least comparable to, those of some existing techniques.

Proceedings Article
13 Feb 2008
TL;DR: This research explores best sets of feature extraction techniques and studies the accuracy of well-known classifiers for Arabic letters and found out that a subset of 25 features is needed to get 84% recognition accuracy using a linear discriminant classifier, and using more features does not substantially improve this accuracy.
Abstract: Users are still waiting for accurate optical character recognition solutions for Arabic handwritten scripts. This research explores the best sets of feature extraction techniques and studies the accuracy of well-known classifiers for Arabic letters. Depending on their position in the word, Arabic letters are drawn in four forms: isolated, initial, medial, and final. The principal component analysis technique is used to select the best subset of features from a large number of extracted features. We used parametric and non-parametric classifiers and found that a subset of 25 features is needed to reach 84% recognition accuracy using a linear discriminant classifier, and that using more features does not substantially improve this accuracy. However, with fewer than 25 features, a quadratic discriminant classifier is more accurate than the linear classifier. Classifiers that are parameterized for the four individual forms score better accuracy than classifiers that do not make use of this input information.

Proceedings ArticleDOI
13 Dec 2008
TL;DR: In this paper, a system for offline recognition of cursive handwritten Tamil characters using Hidden Markov Models (HMM) is presented, which uses a combination of time domain and frequency domain features.
Abstract: With respect to optical character recognition, handwriting has persisted as a means of communication and of recording information in day-to-day life even with the introduction of new technologies. Hidden Markov Models (HMM) have long been a popular choice for Western cursive handwriting recognition, following their success in speech recognition. However, when it comes to Indic script recognition, the published work employing HMMs is limited and generally focused on isolated character recognition. A system for offline recognition of cursive handwritten Tamil characters is presented. In this effort, an offline cursive handwriting recognition system for Tamil based on HMMs, using a combination of time domain and frequency domain features, is proposed. The system proves to be flexible and robust, tolerating the complexities that arise from font variations. A high degree of accuracy has been obtained with the implementation of this approach on a comprehensive database. These initial results are promising and warrant further research in this direction. The results are also encouraging for exploring the adoption of the approach for other Indic scripts.

Journal ArticleDOI
TL;DR: This paper focuses on the applicability of the features inspired by the visual ventral stream for handwritten character recognition, and an analysis is conducted to evaluate the robustness of this approach to orientation, scale and translation distortions.
Abstract: This paper focuses on the applicability of features inspired by the visual ventral stream to handwritten character recognition. A set of scale- and translation-invariant C2 features are first extracted from all images in the dataset. Three standard classifiers, kNN, ANN, and SVM, are then trained over a training set and compared over a separate test set. In order to achieve a higher recognition rate, a two-stage classifier was designed with different preprocessing in the second stage. Experiments performed to validate the method on the well-known MNIST database and on standard Farsi digits and characters exhibit high recognition rates and compete with some of the best existing approaches. Moreover, an analysis is conducted to evaluate the robustness of this approach to orientation, scale, and translation distortions.

Proceedings ArticleDOI
24 Jul 2008
TL;DR: A new paradigm is proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging, which formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach.
Abstract: Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to down-stream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. The problem of identifying tabular structures that should not be parsed as sentential text is also discussed.

Journal ArticleDOI
TL;DR: This paper considers the intrinsic characteristics of text by using a stroke filter and designs a new, robust algorithm for text segmentation based on local region analysis.

Proceedings ArticleDOI
01 Nov 2008
TL;DR: A zone and distance metric based feature extraction system is presented, and 98% and 96% recognition rates for Kannada and Telugu numerals, respectively, are obtained.
Abstract: Character recognition is an important area in the image processing and pattern recognition fields. Handwritten character recognition has received extensive attention in academia and industry. A recognition system can be either on-line or off-line. Off-line handwriting recognition is a subfield of optical character recognition. India is a multi-lingual, multi-script country, where eighteen official scripts are accepted and over a hundred regional languages are spoken. In this paper we present a zone and distance metric based feature extraction system. The character centroid is computed and the image is further divided into n equal zones. The average distance from the character centroid to each pixel present in a zone is computed. This procedure is repeated for all the zones present in the numeral image. Finally, n such features are extracted for classification and recognition. A feed-forward back-propagation neural network is designed for subsequent classification and recognition. We obtained 98% and 96% recognition rates for Kannada and Telugu numerals, respectively.

Proceedings ArticleDOI
16 Jul 2008
TL;DR: The projection distance metric and zoning based scheme for numeral recognition and a nearest neighbor classifier is used for subsequent purpose and gives around 93% and 90% of recognition accuracy for Kannada and Tamil numerals respectively.
Abstract: Handwritten character recognition has received extensive attention in academia and industry. A recognition system can be either on-line or off-line. There is a large demand for optical character recognition of handwritten documents. India is a multi-lingual, multi-script country, where eighteen official scripts are accepted and over a hundred regional languages are spoken. In this paper we propose a projection distance metric and zoning based scheme for numeral recognition. We tested the proposed method on Kannada and Tamil numerals, using a nearest neighbor classifier for classification. The proposed method gives around 93% and 90% recognition accuracy for Kannada and Tamil numerals, respectively.
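A rough sketch of projection-plus-zoning features with a 1-NN classifier, under the assumption that "projection distance" compares column-wise projection profiles per zone (the paper's exact metric may differ; function names are illustrative):

```python
import numpy as np

def projection_features(img, n_zones=4):
    """For each of n horizontal zones of a binary numeral image, take the
    vertical projection profile (column-wise foreground counts), normalized
    by zone area, and concatenate the profiles into one feature vector."""
    h, w = img.shape
    feats = []
    for z in range(n_zones):
        zone = img[z * h // n_zones:(z + 1) * h // n_zones, :]
        profile = zone.sum(axis=0) / max(zone.size, 1)
        feats.extend(profile.tolist())
    return np.array(feats)

def nearest_neighbor(query, prototypes, labels):
    """1-NN classification by Euclidean distance between feature vectors."""
    dists = [np.linalg.norm(query - p) for p in prototypes]
    return labels[int(np.argmin(dists))]
```

A query numeral is classified by computing its feature vector and returning the label of the closest training prototype.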

Patent
25 Nov 2008
TL;DR: In this article, a revenue sharing and data security system is disclosed for encouraging competitors to make their data available to the system in a way that lexical data providers, the OS provider, the LSC provider, and the user may all mutually benefit.
Abstract: Embodiments can include means for categorizing lexical data, means for accurately describing the structure of hierarchical data, means for accommodating lexicons having disparate data structures, means for pooling data from separate lexicons into aggregate lists, means for gathering data from participating users, and specified interfaces for handwriting recognition, optical character recognition, and text-to-speech and speech-to-text conversion. Embodiments can provide significant enhancements in data description, data connectivity and access, data presentation, data enhancement, and input functionality. The input means may be coupled with an electronic implementation of the character lookup invention by the same inventor to facilitate the lookup of individual characters. An exemplary embodiment can comprise a linguistic services center that interfaces with various natural language processing modules such that users of one module can take advantage of the wealth of linguistic information provided in the system. The resulting system may greatly minimize the frustration and inconvenience users typically experience when using Japanese, Chinese, or Korean in electronic contexts. A revenue sharing and data security system is disclosed for encouraging competitors to make their data available to the system in a way that lexical data providers, the OS provider, the LSC provider, and the user may all mutually benefit.

Proceedings ArticleDOI
16 Sep 2008
TL;DR: This paper finds a robust and pixel accurate scanner independent alignment of the scanned image with the electronic document, allowing the extraction of accurate ground truth character information.
Abstract: Most optical character recognition (OCR) systems need to be trained and tested on the symbols that are to be recognized. Therefore, ground truth data is needed. This data consists of character images together with their ASCII code. Among the approaches for generating ground truth of real world data, one promising technique is to use electronic version of the scanned documents. Using an alignment method, the character bounding boxes extracted from the electronic document are matched to the scanned image. Current alignment methods are not robust to different similarity transforms. They also need calibration to deal with non-linear local distortions introduced by the printing/scanning process. In this paper we present a significant improvement over existing methods, allowing to skip the calibration step and having a more accurate alignment, under all similarity transforms. Our method finds a robust and pixel accurate scanner independent alignment of the scanned image with the electronic document, allowing the extraction of accurate ground truth character information. The accuracy of the alignment is demonstrated using documents from the UW3 dataset. The results show that the mean distance between the estimated and the ground truth character bounding box position is less than one pixel.

Journal Article
TL;DR: Experimental results show that application of genetic algorithms (GA) to feature subset selection in a Farsi OCR results in lower computational complexity and enhanced recognition rate.
Abstract: Dealing with hundreds of features in character recognition systems is not unusual. This large number of features increases the computational workload of the recognition process. Many methods have been proposed to remove unnecessary or redundant features and reduce feature dimensionality. Moreover, because of the characteristics of Farsi script, it is not possible to apply algorithms developed for other languages to Farsi directly. In this paper, some methods for feature subset selection using genetic algorithms are applied to a Farsi optical character recognition (OCR) system. Experimental results show that applying genetic algorithms (GA) to feature subset selection in a Farsi OCR yields lower computational complexity and an enhanced recognition rate. Keywords—Feature Subset Selection, Genetic Algorithms, Optical Character Recognition.

Proceedings ArticleDOI
16 Jul 2008
TL;DR: A text line extraction method for multi-skewed handwritten document images that assumes that hypothetical water flows, from both left and right sides of the image frame, face obstruction from characters of text lines.
Abstract: A text line extraction method is presented for multi-skewed handwritten document images. First, the whole document image is divided into several sub-images with overlapping areas. Then, for every sub-image, hypothetical water flows from both the left and right sides of the image frame are assumed to face obstruction from the characters of text lines. The stripes of areas left unwetted on the sub-images are labeled for the extraction of text lines. Finally, touching lines are segmented by detecting bottom edge points. The experimental results show the approach is effective and able to extract curved text lines.

Proceedings ArticleDOI
12 Dec 2008
TL;DR: A new method for the automatic detection of music staff lines based on a connected path approach is presented and results show that the proposed technique consistently outperforms well-established algorithms.
Abstract: The preservation of many music works produced in the past entails their digitization and consequent accessibility in an easy-to-manage digital format. Carrying out this task manually is very time consuming and error prone. While optical music recognition systems usually perform well on printed scores, the processing of handwritten musical scores by computers remains far from ideal. One of the fundamental stages of this task is staff line detection. In this paper a new method for the automatic detection of music staff lines based on a connected path approach is presented. Lines affected by curvature, discontinuities, and inclination are robustly detected. Experimental results show that the proposed technique consistently outperforms well-established algorithms.

Journal ArticleDOI
TL;DR: The technique has been extensively tested on a variety of document images, and its accuracy and robustness are compared with those of other existing techniques.
Abstract: In this paper we propose a technique for detecting and correcting the skew of text areas in a document. The documents we work with may contain several areas of text with different skew angles. First, a text localization procedure based on connected component analysis is applied. Specifically, the connected components of the document are extracted and filtered according to their size and geometric characteristics. Next, the candidate characters are grouped using a nearest neighbor approach to form words, and text lines of any skew are then constructed from these words. The top-line and baseline of each text line are estimated using linear regression. Text lines in nearby locations with similar skew angles are grown to form text areas. For each text area a local skew angle is estimated, and the text areas are then skew-corrected independently to a horizontal or vertical orientation. The technique has been extensively tested on a variety of document images, and its accuracy and robustness are compared with those of other existing techniques.
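The baseline-fitting step described above can be sketched as a least-squares line fit through the bottom points of character bounding boxes; the helper below is a hypothetical illustration of that step, not the authors' code:

```python
import numpy as np

def estimate_skew_deg(char_boxes):
    """Estimate a text line's skew angle (degrees) by fitting a baseline,
    via least-squares linear regression, through the bottom-centre points
    of the character bounding boxes.
    char_boxes: (x, y_top, w, h) tuples in image coordinates."""
    xs = np.array([x + w / 2 for x, _, w, _ in char_boxes])
    ys = np.array([y + h for _, y, _, h in char_boxes])  # box bottoms
    slope, _intercept = np.polyfit(xs, ys, 1)            # y = slope*x + b
    return np.degrees(np.arctan(slope))
```

The top-line is fitted the same way through box tops; text lines with similar fitted angles can then be grouped into a text area and rotated by the negated local angle.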