Proceedings Article

Information extraction and text mining of Ancient Vattezhuthu characters in historical documents using image zoning

TLDR
A system is developed that recognizes Brahmi, Grantha and Vattezhuthu characters in palm manuscripts of ancient Tamil historical documents, analyzes the text, and machine-translates it into present-day Tamil digital text.
Abstract
The aim of this paper is to develop a system that recognizes Brahmi, Grantha and Vattezhuthu characters in palm manuscripts of ancient Tamil historical documents, analyzes the text, and machine-translates it into present-day Tamil digital text. Although many researchers have implemented algorithms and techniques for character recognition in different languages, converting ancient characters remains a major challenge. Image recognition technology has reached near-perfection for scanned text in English and some other languages, but optical character recognition (OCR) software capable of digitizing printed Tamil text with high accuracy is still elusive. Only a few people are familiar with the ancient characters, and they attempt to convert them into written documents manually. The proposed system addresses this by converting ancient historical documents from inscriptions and palm manuscripts into Tamil digital text, encoded in Tamil Unicode. Our algorithm comprises four stages: i) image preprocessing, ii) feature extraction, iii) character recognition and iv) digital text conversion. In the first phase, our algorithm converts Brahmi script with an accuracy of 91.57% using a neural network and the image zoning method. The second phase, covering the Vattezhuthu character set, is to be implemented; the conversion accuracy reported for Vattezhuthu is 89.75%.
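The abstract names image zoning for feature extraction and Tamil Unicode for the output, but this page does not show the authors' implementation. The following is only a minimal sketch of those two ideas under stated assumptions: the glyph is already binarized and cropped, the zones form a uniform grid, and the feature is ink density per zone. The function names, the grid size and the three-entry label table are hypothetical and chosen for illustration; the recognizer itself (the neural network) is omitted.

```python
import numpy as np


def zoning_features(glyph: np.ndarray, rows: int = 4, cols: int = 4) -> np.ndarray:
    """Return the ink density of each zone in a rows x cols grid.

    Assumption: `glyph` is a 2-D array in which non-zero pixels are ink.
    """
    if glyph.ndim != 2:
        raise ValueError("expected a 2-D binary image")
    ink = (glyph > 0).astype(np.float64)
    features = []
    # np.array_split tolerates image sizes that the grid does not divide evenly.
    for band in np.array_split(ink, rows, axis=0):
        for zone in np.array_split(band, cols, axis=1):
            features.append(zone.mean() if zone.size else 0.0)
    return np.asarray(features)


# Hypothetical class-label table: recognizer output index -> Tamil Unicode letter.
LABEL_TO_TAMIL = {
    0: "\u0B85",  # TAMIL LETTER A
    1: "\u0B86",  # TAMIL LETTER AA
    2: "\u0B95",  # TAMIL LETTER KA
}


def labels_to_text(labels) -> str:
    """Join predicted class labels into a Tamil Unicode string."""
    return "".join(LABEL_TO_TAMIL[label] for label in labels)
```

In a pipeline of the kind the abstract outlines, the vector returned by `zoning_features` would be the input to the classifier, and `labels_to_text` would run on the classifier's predicted labels to produce the digital text.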


Citations
Journal Article

An analytical study of information extraction from unstructured and multidimensional big data

TL;DR: This research work addresses the competency and limitations of the existing IE techniques related to data pre-processing, data extraction and transformation, and representations for huge volumes of multidimensional unstructured data and presents a systematic literature review of state-of-the-art techniques for a variety of big data.
Journal Article

Efficient English text classification using selected Machine Learning Techniques

TL;DR: The Support Vector Machines (SVM) model in classifying English text and documents is implemented and it is observed that the classification rate exceeds 90% when using more than 4000 features.
Journal Article

Limitations of information extraction methods and techniques for heterogeneous unstructured big data

TL;DR: The review finds that advanced techniques for IE, particularly for multifaceted unstructured big data sets, are the utmost requirement of the organizations to manage big data and derive strategic information.
Journal Article

Brahmi character recognition based on SVM (support vector machine) classifier using image gradient features

TL;DR: A recognition system for Brahmi characters is presented that uses a linear Support Vector Machine classifier, trained on a feature set of 24 images of each character, and achieves an accuracy of 91.6%.
Proceedings Article

Character Recognition in Historical Handwritten Documents – A Survey

TL;DR: The paper reviews some of the major works carried out in HCR for Ancient handwritten documents and states that promising results have not been achieved.