Script and Language Identification in Noisy and Degraded Document Images

doi:10.1109/TPAMI.2007.1158

Journal Article

Lu Shijian, +1 more

- 01 Jan 2008 -

IEEE Transactions on Pattern Analysis an...

- Vol. 30, Iss: 1, pp 14-24

TLDR

Experimental results show that the proposed identification technique is accurate, easy to extend, and tolerant to noise and various types of document degradation.

Abstract:

This paper reports an identification technique that detects the scripts and languages of noisy and degraded document images. In the proposed technique, scripts and languages are identified through document vectorization, which converts each document image into a document vector that characterizes the shape and frequency of the character or word images it contains. Document images are vectorized using vertical component cuts and character extremum points, both of which are tolerant to variation in text fonts and styles, noise, and various types of document degradation. For each script or language under study, a script or language template is first constructed through a training process. The scripts and languages of document images are then determined according to the distances between the converted document vectors and the preconstructed script and language templates. Experimental results show that the proposed technique is accurate, easy to extend, and tolerant to noise and various types of document degradation.
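
The train-then-match flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the image-processing stage (vertical component cuts and character extremum points) has already reduced each document to a sequence of discrete shape codes, and the code names, the mean-vector template, and the Euclidean distance are all hypothetical choices made for illustration. The abstract only states that templates are built by training and that documents are matched by distance.

```python
from collections import Counter
import math


def vectorize(codes, vocabulary):
    """Turn a document's shape codes into a normalized frequency vector."""
    counts = Counter(codes)
    total = sum(counts[c] for c in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[c] / total for c in vocabulary]


def build_template(training_docs, vocabulary):
    """Average the vectors of the training documents of one script or language.

    Averaging is an assumption; the abstract does not say how templates are built.
    """
    vectors = [vectorize(doc, vocabulary) for doc in training_docs]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def euclidean(u, v):
    """Euclidean distance, used here as a stand-in for the paper's distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify(codes, templates, vocabulary):
    """Return the label whose template is closest to the document vector."""
    vector = vectorize(codes, vocabulary)
    return min(templates, key=lambda label: euclidean(vector, templates[label]))


# Hypothetical shape codes "a", "b", "c" and two made-up script labels.
vocabulary = ["a", "b", "c"]
templates = {
    "script_1": build_template([["a", "a", "b"], ["a", "b", "a", "a"]], vocabulary),
    "script_2": build_template([["c", "c", "b"], ["c", "b", "c", "c"]], vocabulary),
}
print(identify(["a", "a", "a", "b"], templates, vocabulary))
```

Under these assumptions, supporting a new script or language only requires building one more template from training documents, which is one way to read the abstract's claim that the technique is easy to extend.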

Citations

Journal Article

Document Image Retrieval through Word Shape Coding

Shijian Lu, +2 more

- 01 Nov 2008 -

IEEE Transactions on Pattern Analysis an...

TL;DR: The proposed technique retrieves document images by a new word shape coding scheme, which captures the document content through annotating each word image by a word shape code.

Proceedings Article

An Efficient Edge Based Technique for Text Detection in Video Frames

Palaiahnakote Shivakumara, +2 more

TL;DR: Presents a novel technique for detecting both graphic text and scene text in video images. It first finds segments containing text in an input image, then uses statistical features such as vertical and horizontal bars for edges in those segments to detect true text blocks efficiently.

Journal Article

Script Identification of Multi-Script Documents: A Survey

Kurban Ubul, +5 more

- 30 Mar 2017 -

IEEE Access

TL;DR: The most vital processes in script identification are addressed in detail: identification and discrimination methods, feature extraction (local and global), and classification.

Proceedings Article

Video Script Identification Based on Text Lines

Trung Quy Phan, +4 more

TL;DR: Presents a new method for video script identification, which is essential before choosing an appropriate OCR engine for identifying text lines when a video frame contains more than one language.

Journal Article

New Gradient-Spatial-Structural Features for video script identification

Palaiahnakote Shivakumara, +4 more

- 01 Jan 2015 -

Computer Vision and Image Understanding

TL;DR: This paper proposes to integrate the spatial and the structural features based on end points, intersection points, junction points and straightness of the skeleton of text components in a novel way to identify the scripts.

References

Journal Article

A threshold selection method from gray level histograms

Nobuyuki Otsu

- 01 Jan 1979 -

IEEE Transactions on Systems, Man, and C...

N-gram-based text categorization

W.B. Cavnar, +1 more

TL;DR: An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.

Journal Article

Center weighted median filters and their applications to image enhancement

Sung-Jea Ko, +1 more

- 01 Sep 1991 -

IEEE Transactions on Circuits and System...

TL;DR: The center weighted median (CWM) filter is a weighted median filter that gives more weight only to the central value of each window. It can preserve image details while suppressing additive white and/or impulsive-type noise.

Journal Article

Evaluation of binarization methods for document images

O.D. Trier, +1 more

- 01 Mar 1995 -

IEEE Transactions on Pattern Analysis an...

TL;DR: This paper presents an evaluation of eleven locally adaptive binarization methods for gray scale images with low contrast, variable background intensity, and noise. Niblack's method with the addition of the postprocessing step of Yanowitz and Bruckstein's method (1989) performed the best and was also one of the fastest binarization methods.

Journal Article

Rotation invariant texture features and their use in automatic script identification

Tieniu Tan

- 01 Jul 1998 -

IEEE Transactions on Pattern Analysis an...

TL;DR: Rotation invariant texture features are computed based on an extension of the popular multi-channel Gabor filtering technique, and their effectiveness is tested with 300 randomly rotated samples of 15 Brodatz textures to solve a practical but hitherto mostly overlooked problem in document image processing.
