Semi-supervised cluster-and-label with feature based re-clustering to reduce noise in Thai document images

doi:10.1016/J.KNOSYS.2015.09.033

Journal Article

N. Piroonsup, +1 more

01 Dec 2015

Knowledge-Based Systems

Vol. 90, pp. 58-69

TLDR

A novel noise reduction method that applies a machine learning technique to classify and reduce noise in document images, together with an enhanced labeling method for the semi-supervised cluster-and-label approach that significantly improves both the accuracy of the labeled examples and classification performance.

Abstract:

We propose a novel noise reduction method for document images. Semi-supervised learning is applied to separate noise from character components. The proposed method is suitable for non-Latin scripts, e.g., Thai document images. We propose an enhanced labeling method for the semi-supervised cluster-and-label approach. The proposed methods perform significantly better than the comparison methods.

Noise components are a major cause of poor performance in document analysis. To reduce undesired components, most recent work has applied image processing techniques. However, these techniques are effective only for Latin-script documents, not for non-Latin-script documents. Non-Latin scripts such as Thai are considerably more complicated than Latin scripts: they have many levels of character alignment, no word or sentence separators, and variable character sizes. When an image processing technique is applied to a Thai document, characters that closely resemble noise are usually removed as well. Hence, in this paper, we propose a novel noise reduction method that applies a machine learning technique to classify and reduce noise in document images. The proposed method uses a semi-supervised cluster-and-label approach with an improved labeling method, namely, feature selected sub-cluster labeling. Feature selected sub-cluster labeling focuses on the clusters that are incorrectly labeled by conventional labeling methods. These clusters are re-clustered into small groups with a new feature set that is selected according to class labels. The experimental results show that this method can significantly improve the accuracy of the labeled examples and the performance of classification.

We compared noise reduction and character preservation between the proposed method and two related noise reduction approaches, i.e., a two-phased stroke-like pattern noise (SPN) removal and a commercial noise reduction software package, ScanFix Xpress 6.0. The results show that semi-supervised noise reduction is significantly better than the compared methods, with F-measures of 86.01 for characters and 97.82 for noise, respectively.
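
The labeling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the k-means clusterer, the Fisher-style feature score, the `purity` threshold, the `n_keep` parameter, and the two-feature toy data are all assumptions chosen for illustration, since the abstract does not specify them. Clusters whose labeled members agree take the majority label; mixed clusters are re-clustered on the features that best separate the classes, and each sub-cluster is labeled separately.

```python
from collections import Counter

def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation (assumed clusterer)."""
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(points, key=lambda p: min(sqdist(p, c) for c in centers))))
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def fisher_scores(points, labels):
    """Per-feature between-class / within-class spread, using labeled points only."""
    seen = [(p, l) for p, l in zip(points, labels) if l is not None]
    scores = []
    for j in range(len(points[0])):
        vals = [p[j] for p, _ in seen]
        mean = sum(vals) / len(vals)
        between = within = 0.0
        for cls in set(l for _, l in seen):
            cv = [p[j] for p, l in seen if l == cls]
            m = sum(cv) / len(cv)
            between += len(cv) * (m - mean) ** 2
            within += sum((v - m) ** 2 for v in cv)
        scores.append(between / (within + 1e-9))
    return scores

def majority(labels):
    """Majority label among the known labels and its share (purity)."""
    known = [l for l in labels if l is not None]
    if not known:
        return None, 0.0
    label, count = Counter(known).most_common(1)[0]
    return label, count / len(known)

def cluster_and_label(points, labels, k, purity=0.9, n_keep=1):
    out = [None] * len(points)
    assign = kmeans(points, k)
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if not idx:
            continue
        label, pur = majority([labels[i] for i in idx])
        if label is None or pur >= purity:
            # Pure cluster (or no labeled member): conventional majority labeling.
            for i in idx:
                out[i] = label
            continue
        # Mixed cluster: pick class-discriminative features, then re-cluster.
        sub_pts = [points[i] for i in idx]
        sub_lab = [labels[i] for i in idx]
        scores = fisher_scores(sub_pts, sub_lab)
        keep = sorted(range(len(scores)), key=lambda j: -scores[j])[:n_keep]
        proj = [[p[j] for j in keep] for p in sub_pts]
        n_sub = min(len(set(l for l in sub_lab if l is not None)), len(idx))
        sub_assign = kmeans(proj, n_sub)
        for s in range(n_sub):
            sidx = [i for i, a in zip(idx, sub_assign) if a == s]
            sub_label, _ = majority([labels[i] for i in sidx])
            for i in sidx:
                out[i] = sub_label
    return out

# Hypothetical connected components described by two made-up features.
# Only three of the eight components carry a label.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (95, 0), (105, 0.5), (95, 5), (105, 5.5)]
labels = ["char", None, None, None, "char", None, "noise", None]
predicted = cluster_and_label(points, labels, k=2)
print(predicted)
```

In this toy data the second cluster contains one labeled character and one labeled noise component, so plain majority labeling would mislabel half of it. Re-clustering on the single feature that separates the two labeled examples splits it into a character group and a noise group.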
