Journal Article

Semi-supervised cluster-and-label with feature based re-clustering to reduce noise in Thai document images

TLDR
A novel noise reduction method that applies a machine learning technique to classify and reduce noise in document images, together with an enhanced labeling method for the semi-supervised cluster-and-label approach that can significantly improve the accuracy of labeling examples and the performance of classification.
Abstract
We propose a novel noise reduction method for document images. Semi-supervised learning is applied to separate noise from character components. The proposed method is suitable for non-Latin scripts such as Thai document images. We also propose an enhanced labeling method for the semi-supervised cluster-and-label approach. The proposed methods perform significantly better than the comparison methods.

Noise components are a major cause of poor performance in document analysis. To reduce undesired components, most recent research has applied image processing techniques. However, these techniques are effective only for Latin script documents, not for non-Latin script documents. Non-Latin scripts such as Thai are considerably more complicated than Latin script: they have many levels of character alignment, no word or sentence separators, and variable character sizes. As a result, applying an image processing technique to a Thai document usually removes characters that closely resemble noise. Hence, in this paper, we propose a novel noise reduction method that applies a machine learning technique to classify and reduce noise in document images. The proposed method uses a semi-supervised cluster-and-label approach with an improved labeling method, namely, feature selected sub-cluster labeling. Feature selected sub-cluster labeling focuses on the clusters that are incorrectly labeled by conventional labeling methods. These clusters are re-clustered into small groups with a new feature set that is selected according to class labels. The experimental results show that this method can significantly improve the accuracy of labeling examples and the performance of classification.
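The cluster-and-label idea with re-clustering of mixed clusters can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the plain k-means routine, the purity threshold `purity_min`, the feature score in `select_features`, and all function names and toy data are hypothetical choices made for the example, since the abstract does not specify them.

```python
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain k-means; the first k points serve as initial centers (assumption)."""
    k = min(k, len(points))
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def majority_label(labels):
    """Return (label, purity) over the labeled members, or (None, 0.0) if none."""
    known = [l for l in labels if l is not None]
    if not known:
        return None, 0.0
    label, count = Counter(known).most_common(1)[0]
    return label, count / len(known)

def select_features(points, labels, n_keep):
    """Rank features by the gap between class means, scaled by the feature range.
    This score is an assumed stand-in for the paper's selection criterion."""
    by_class = {}
    for p, l in zip(points, labels):
        if l is not None:
            by_class.setdefault(l, []).append(p)
    scores = []
    for f in range(len(points[0])):
        means = [sum(p[f] for p in ps) / len(ps) for ps in by_class.values()]
        values = [p[f] for p in points]
        scores.append((max(means) - min(means)) / (max(values) - min(values) + 1e-9))
    ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    return sorted(ranked[:n_keep])

def cluster_and_label(points, labels, k, purity_min=0.9, sub_k=2, n_keep=1):
    """Label unlabeled points by cluster majority; re-cluster mixed clusters
    on a feature subset chosen from the labeled members."""
    assign = kmeans(points, k)
    out = list(labels)
    for c in set(assign):
        idx = [i for i, a in enumerate(assign) if a == c]
        label, purity = majority_label([labels[i] for i in idx])
        if label is None:
            continue  # no labeled member: leave the cluster unlabeled
        sub_assign = [0] * len(idx)
        if purity < purity_min:
            # Mixed cluster: split it again using only the selected features.
            feats = select_features([points[i] for i in idx],
                                    [labels[i] for i in idx], n_keep)
            sub_assign = kmeans([[points[i][f] for f in feats] for i in idx], sub_k)
        for s in set(sub_assign):
            sub_idx = [i for i, a in zip(idx, sub_assign) if a == s]
            sub_label, _ = majority_label([labels[i] for i in sub_idx])
            for i in sub_idx:
                if labels[i] is None:
                    out[i] = sub_label if sub_label is not None else label
    return out

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [0, 100], [1, 101], [0, 102],
          [10, 100], [11, 101], [10, 102]]
labels = ["char", None, None, None, "char", None, None, "noise", None, None]
print(cluster_and_label(points, labels, k=2))
```

In this toy run the second cluster contains both a labeled character and a labeled noise component, so it is split again on feature 0 alone, and each sub-cluster then takes the label of its own labeled member.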
We compared the performance of noise reduction and character preservation between the proposed method and two related noise reduction approaches: a two-phased stroke-like pattern noise (SPN) removal method and a commercial noise reduction software package, ScanFix Xpress 6.0. The results show that semi-supervised noise reduction performs significantly better than the compared methods, with F-measures of 86.01 for characters and 97.82 for noise.
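For readers unfamiliar with the metric, an F-measure for one class (character or noise) can be computed from component counts as sketched below. The counts in the usage line are made up for illustration, the function name is hypothetical, and the abstract does not state the exact evaluation protocol; the reported values appear to be on a 0 to 100 scale, which is assumed here.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure in percent from true positives, false positives, false negatives.

    beta=1.0 gives the harmonic mean of precision and recall (F1).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return 100.0 * (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 80 noise components correctly removed,
# 20 characters wrongly removed as noise, 20 noise components missed.
print(round(f_measure(tp=80, fp=20, fn=20), 2))
```

Computing the score separately for the character class and the noise class reflects the two goals named in the abstract: preserving characters and removing noise.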

