A Fuzzy Matching based Image Classification System for Printed and Handwritten Text Documents

doi:10.4018/JITR.2020040110

Home
/
Papers
/
A Fuzzy Matching based Image Classification System for Printed and Handwritten Text Documents

Journal Article•DOI•

A Fuzzy Matching based Image Classification System for Printed and Handwritten Text Documents

Shalini Puri¹, Satya Prakash Singh¹•Institutions (1)

Birla Institute of Technology and Science¹

01 Apr 2020-Journal of Information Technology Research (IGI Global)-Vol. 13, Iss: 2, pp 155-194

TL;DR: This article proposes a system that performs better than existing systems, and shows the results of experiments on this and other proposed systems.

read less

Abstract: This article proposes a bi-leveled image classification system to classify printed and handwritten English documents into mutually exclusive predefined categories. The proposed system follows the steps of preprocessing, segmentation, feature extraction, and SVM based character classification at level 1, and word association and fuzzy matching based document classification at level 2. The system architecture and its modular structure discuss various task stages and their functionalities. Further, a case study on document classification is discussed to show the internal score computations of words and keywords with fuzzy matching. The experiments on proposed system illustrate that the system achieves promising results in the time-efficient manner and achieves better accuracy with less computation time for printed documents than handwritten ones. Finally, the performance of the proposed system is compared with the existing systems and it is observed that proposed system performs better than many other systems. KeywoRDS Confidence Computation, Document Image, Fuzzy Matching, Handwritten Documents, Performance Analysis, Printed Documents, SVM, Text Image Classification, Word Association

...read moreread less

Citations

PDF

Open Access

More filters

Journal Article•DOI•

Intelligent Recognition and Teaching of English Fuzzy Texts Based on Fuzzy Computing and Big Data

[...]

Ling Liu, Sang-Bing Tsai¹•Institutions (1)

Wuyi University¹

10 Jul 2021-Wireless Communications and Mobile Computing

TL;DR: In this paper, in-depth research and analysis is conducted on the intelligent recognition and teaching of English fuzzy text through parallel projection and region expansion and the substring representation in the model ensures the generation of unregistered word vectors.

...read moreread less

Abstract: In this paper, we conduct in-depth research and analysis on the intelligent recognition and teaching of English fuzzy text through parallel projection and region expansion. Multisense Soft Cluster Vector (MSCVec), a multisense word vector model based on nonnegative matrix decomposition and sparse soft clustering, is constructed. The MSCVec model is a monolingual word vector model, which uses nonnegative matrix decomposition of positive point mutual information between words and contexts to extract low-rank expressions of mixed semantics of multisense words and then uses sparse. It uses the nonnegative matrix decomposition of the positive pointwise mutual information between words and contexts to extract the low-rank expressions of the mixed semantics of the polysemous words and then uses the sparse soft clustering algorithm to partition the multiple word senses of the polysemous words and also obtains the global sense of the polysemous word affiliation distribution; the specific polysemous word cluster classes are determined based on the negative mean log-likelihood of the global affiliation between the contextual semantics and the polysemous words, and finally, the polysemous word vectors are learned using the Fast text model under the extended dictionary word set. The advantage of the MSCVec model is that it is an unsupervised learning process without any knowledge base, and the substring representation in the model ensures the generation of unregistered word vectors; in addition, the global affiliation of the MSCVec model can also expect polysemantic word vectors to single word vectors. Compared with the traditional static word vectors, MSCVec shows excellent results in both word similarity and downstream text classification task experiments. The two sets of features are then fused and extended into new semantic features, and similarity classification experiments and stack generalization experiments are designed for comparison. In the cross-lingual sentence-level similarity detection task, SCLVec cross-lingual word vector lexical-level features outperform MSCVec multisense word vector features as the input embedding layer; deep semantic sentence-level features trained by twin recurrent neural networks outperform the semantic features of twin convolutional neural networks; extensions of traditional statistical features can effectively improve cross-lingual similarity detection performance, especially cross-lingual topic model (BL-LDA); the stack generalization integration approach maximizes the error rate of the underlying classifier and improves the detection accuracy.

...read moreread less

10 citations

Cites background from "A Fuzzy Matching based Image Classi..."

...The important means of education informatization is to apply information technology and network technology to education to realize the mode of “Internet + education” [1]....
[...]

Applications of Text Detection and its Challenges: A Review

[...]

M. P. Nevetha¹, A. Baskar¹•Institutions (1)

Amrita Vishwa Vidyapeetham¹

01 Jan 2015

TL;DR: In this article, the authors present a survey of text detection and recognition from images to a large extent, including general documents such as newspapers, books and magazines, forms, scientific documents, maps, architectural and engineering drawings, and scene images with textual information.

...read moreread less

Abstract: The rising need for automation of systems has effected the development of text detection and recognition from images to a large extent. Text recognition has a wide range of applications, each with scenario dependent challenges and complications. How can these challenges be mitigated? What image processing techniques can be applied to make the text in the image machine readable? How can text be localized and separated from non textual information? How can the text image be converted to digital text format? This paper attempts to answer these questions in chosen scenarios. The types of document images that we have surveyed include general documents such as newspapers, books and magazines, forms, scientific documents, unconstrained documents such as maps, architectural and engineering drawings, and scene images with textual information.

...read moreread less

9 citations

Journal Article•DOI•

Advanced Applications on Bilingual Document Analysis and Processing Systems

[...]

Shalini Puri¹, Satya Prakash Singh¹•Institutions (1)

Birla Institute of Technology, Mesra¹

01 Oct 2020-International Journal of Applied Metaheuristic Computing

TL;DR: A journey of bilingual NLP and image-based document classification systems is discussed and an overview of their methods, feature extraction techniques, document sets, classifiers, and accuracy for English-Hindi and other language pairs is provided.

...read moreread less

Abstract: Today, rapid digitization requires efficient bilingual non-image and image document classification systems. Although many bilingual NLP and image-based systems provide solutions for real-world problems, they primarily focus on text extraction, identification, and recognition tasks with limited document types. This article discusses a journey of these systems and provides an overview of their methods, feature extraction techniques, document sets, classifiers, and accuracy for English-Hindi and other language pairs. The gaps found lead toward the idea of a generic and integrated bilingual English-Hindi document classification system, which classifies heterogeneous documents using a dual class feeder and two character corpora. Its non-image and image modules include pre- and post-processing stages and pre-and post-segmentation stages to classify documents into predefined classes. This article discusses many real-life applications on societal and commercial issues. The analytical results show important findings of existing and proposed systems.

...read moreread less

8 citations

Journal Article•DOI•

Opinion Classification of Product Reviews Using Naïve Bayes, Logistic Regression and Sentiwordnet: Challenges and Survey

[...]

A Dadhich, B Thankachan

01 Mar 2021

TL;DR: A detailed review and comparative analysis of various existing sentiment analysis algorithms especially for the Amazon products, which have worked upon the supervised learning techniques called Naïve Bayes, logistic regression and SentiWordNet are presented.

...read moreread less

Abstract: In recent years, automated opinion classification has evolved as one of the most demanding area in natural language processing. Many such systems have been implemented and developed for the summarization and classification of text and reviews of online products. There are many data sources and domains which sells the online products, such as Amazon, Flipkart, Snapdeal etc. In the same direction, this paper is intended to present a detailed review and comparative analysis of various existing sentiment analysis algorithms especially for the Amazon products, which have worked upon the supervised learning techniques called Naïve Bayes, logistic regression and SentiWordNet. Various key parameters and aspects of such a comparative tour are the use of feature reduction method, sentiment polarity, dataset domain and sources, product name, data set size and classifier. Further this paper includes the discussion on their accuracy results; additional results including important findings; and needs, challenges and limitations. Lastly, the performance of these algorithms is evaluated by comparing the % usage of the key parameters.

...read moreread less

2 citations

Book Chapter•DOI•

Standards for Web-Based Integration Adapters

[...]

Bill Karakostas¹•Institutions (1)

City University London¹

01 Jan 2005

1 citations

References

PDF

Open Access

More filters

Journal Article•DOI•

Indian script character recognition: a survey

[...]

Umapada Pal¹, Bidyut B. Chaudhuri¹•Institutions (1)

Indian Statistical Institute¹

01 Sep 2004-Pattern Recognition

TL;DR: A review of the OCR work done on Indian language scripts and the scope of future work and further steps needed for Indian script OCR development is presented.

...read moreread less

592 citations

Journal Article•DOI•

Script Recognition—A Review

[...]

Debashis Ghosh¹, Tulika Dube, A P Shivaprasad•Institutions (1)

Indian Institute of Technology Roorkee¹

01 Dec 2010-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: An overview of the different script identification methodologies under each of the two broad categories-structure-based and visual-appearance-based techniques is given.

...read moreread less

Abstract: A variety of different scripts are used in writing languages throughout the world. In a multiscript, multilingual environment, it is essential to know the script used in writing a document before an appropriate character recognition and document analysis algorithm can be chosen. In view of this, several methods for automatic script identification have been developed so far. They mainly belong to two broad categories-structure-based and visual-appearance-based techniques. This survey report gives an overview of the different script identification methodologies under each of these categories. Methods for script identification in online data and video-texts are also presented. It is noted that the research in this field is relatively thin and still more research is to be done, particularly in the case of handwritten documents.

...read moreread less

234 citations

Journal Article•DOI•

A survey of document image classification: problem statement, classifier architecture and performance evaluation

[...]

Nawei Chen¹, Dorothea Blostein¹•Institutions (1)

Queen's University¹

22 May 2007-International Journal on Document Analysis and Recognition

TL;DR: This work focuses on techniques that classify single-page typeset document images without using OCR results, and brings to light important issues in designing a document classifier, including the definition of document classes, the choices of document features and feature representation, and the choice of classification algorithm and learning mechanism.

...read moreread less

Abstract: Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.

...read moreread less

181 citations

Proceedings Article•DOI•

CNN based common approach to handwritten character recognition of multiple scripts

[...]

Durjoy Sen Maitra¹, Ujjwal Bhattacharya¹, Swapan K. Parui¹•Institutions (1)

Indian Statistical Institute¹

23 Aug 2015

TL;DR: A convolutional neural network trained for a larger class recognition problem towards feature extraction of samples of several smaller class recognition problems of English, Devanagari, Bangla, Telugu and Oriya each of which is an official Indian script.

...read moreread less

Abstract: There are many scripts in the world, several of which are used by hundreds of millions of people. Handwritten character recognition studies of several of these scripts are found in the literature. Different hand-crafted feature sets have been used in these recognition studies. However, convolutional neural network (CNN) has recently been used as an efficient unsupervised feature vector extractor. Although such a network can be used as a unified framework for both feature extraction and classification, it is more efficient as a feature extractor than as a classifier. In the present study, we performed certain amount of training of a 5-layer CNN for a moderately large class character recognition problem. We used this CNN trained for a larger class recognition problem towards feature extraction of samples of several smaller class recognition problems. In each case, a distinct Support Vector Machine (SVM) was used as the corresponding classifier. In particular, the CNN of the present study is trained using samples of a standard 50-class Bangla basic character database and features have been extracted for 5 different 10-class numeral recognition problems of English, Devanagari, Bangla, Telugu and Oriya each of which is an official Indian script. Recognition accuracies are comparable with the state-of-the-art.

...read moreread less

117 citations

Journal Article•DOI•

A new scheme for unconstrained handwritten text-line segmentation

[...]

Alireza Alaei¹, Umapada Pal², P. Nagabhushan¹•Institutions (2)

University of Mysore¹, Indian Statistical Institute²

01 Apr 2011-Pattern Recognition

TL;DR: A novel approach for unconstrained handwritten text-line segmentation is proposed using a new painting technique that enhances the separability between the foreground and background portions enabling easy detection of text-lines.

...read moreread less

105 citations