Journal ArticleDOI

Text classification method based on self-training and LDA topic models

TLDR
The proposed ST LDA method, which classifies text in a semi-supervised manner using representations based on topic models, may well help to improve text classification tasks, which are essential in many advanced expert and intelligent systems, especially when only scarce labeled texts are available.
Abstract
A novel text classification method for learning from a very small labeled set. The method uses a text representation based on the LDA topic model. Self-training is used to enlarge the labeled set with unlabeled instances. A model for setting the method's parameters for any document collection is proposed.

Supervised text classification methods are efficient when they can learn with reasonably sized labeled sets. On the other hand, when only a small set of labeled documents is available, semi-supervised methods become more appropriate. These methods are based on comparing distributions between labeled and unlabeled instances; it is therefore important to focus on the representation and its discrimination abilities. In this paper we present ST LDA, a method for semi-supervised text classification with representations based on topic models. The proposed method comprises a semi-supervised text classification algorithm based on self-training and a model that determines parameter settings for any new document collection. Self-training is used to enlarge the small initial labeled set with the help of information from unlabeled data. We investigate how the topic-based representation affects prediction accuracy by applying the NBMN and SVM classification algorithms to the enlarged labeled set and then comparing the results with the same method on a typical TF-IDF representation. We also compare ST LDA with supervised classification methods and other well-known semi-supervised methods. Experiments were conducted on 11 very small initial labeled sets sampled from six publicly available document collections. The results show that our ST LDA method, when used in combination with NBMN, performed significantly better in terms of classification accuracy than other comparable methods and variations. In this manner, the ST LDA method proved to be a competitive classification method for different text collections when only a small set of labeled instances is available. As such, the proposed ST LDA method may well help to improve text classification tasks, which are essential in many advanced expert and intelligent systems, especially in the case of a scarcity of labeled texts.
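As a rough sketch of the self-training loop described in the abstract, the following Python fragment uses scikit-learn's LatentDirichletAllocation for the topic-based representation and MultinomialNB in place of NBMN; the topic count, confidence threshold, and stopping rule are illustrative assumptions, not the parameter-setting model proposed in the paper.

# Illustrative self-training with an LDA topic representation
# (not the authors' exact ST LDA algorithm; parameters are assumptions).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB

def st_lda_sketch(labeled_texts, labels, unlabeled_texts,
                  n_topics=50, confidence=0.9, max_rounds=10):
    # Fit the topic model on all available documents (labeled + unlabeled).
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(labeled_texts + unlabeled_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    topics = lda.fit_transform(counts)            # document-topic proportions

    n_labeled = len(labeled_texts)
    X_lab, y_lab = topics[:n_labeled], np.array(labels)
    X_unlab = topics[n_labeled:]
    pool = np.arange(X_unlab.shape[0])            # indices still unlabeled

    clf = MultinomialNB()
    for _ in range(max_rounds):
        clf.fit(X_lab, y_lab)
        if pool.size == 0:
            break
        proba = clf.predict_proba(X_unlab[pool])
        confident = proba.max(axis=1) >= confidence   # selection step
        if not confident.any():
            break
        # Move confidently predicted documents into the labeled set.
        new_X = X_unlab[pool[confident]]
        X_lab = np.vstack([X_lab, new_X])
        y_lab = np.concatenate([y_lab, clf.predict(new_X)])
        pool = pool[~confident]
    return clf

In the paper, the enlarged labeled set is then used to train NBMN or SVM; the sketch above keeps only the NBMN-style branch, and the paper's parameter-setting model would replace the hand-fixed n_topics and confidence values.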


Citations
Journal Article

When is nearest neighbor meaningful?

TL;DR: In this article, the authors explore the effect of dimensionality on the nearest neighbor problem and show that under a broad set of conditions (much broader than independent and identically distributed dimensions), as dimensionality increases, the distance to the nearest data point approaches the distance to the farthest data point.
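The effect can be reproduced with a small numerical experiment (assuming i.i.d. uniform data, which is just one instance of the broad conditions the paper covers):

# Distance concentration in high dimensions: the farthest-to-nearest
# distance ratio shrinks toward 1 as dimensionality grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    data = rng.random((1000, d))                  # 1000 random points in [0, 1]^d
    query = rng.random(d)
    dists = np.linalg.norm(data - query, axis=1)
    print(d, round(dists.max() / dists.min(), 2))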
Journal ArticleDOI

A recent overview of the state-of-the-art elements of text classification

TL;DR: Six baseline elements of text classification are described: data collection, data analysis for labelling, feature construction and weighting, feature selection and projection, training of a classification model, and solution evaluation.
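A compact sketch of the later elements in that list (feature construction and weighting, feature selection, model training, and evaluation); the corpus, feature count, and classifier are arbitrary choices standing in for the data collection and labelling steps:

# TF-IDF weighting -> chi-squared feature selection -> Naive Bayes -> 5-fold CV
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
pipe = make_pipeline(TfidfVectorizer(stop_words="english"),
                     SelectKBest(chi2, k=2000),
                     MultinomialNB())
print(cross_val_score(pipe, data.data, data.target, cv=5).mean())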
Journal ArticleDOI

Understanding the use of Virtual Reality in Marketing: a text mining-based review

TL;DR: A text-mining approach using a Bayesian statistical topic model called latent Dirichlet allocation is employed to conduct a comprehensive analysis of 150 articles from 115 journals, revealing seven relevant topics.
Journal ArticleDOI

Application of optimized machine learning techniques for prediction of occupational accidents

TL;DR: Optimized machine learning algorithms have been applied to predict accident outcomes such as injury, near miss, and property damage using occupational accident data, and PSO-based SVM outperforms the other algorithms with the highest accuracy and robustness.
Journal ArticleDOI

Systematic reviews in sentiment analysis: a tertiary study

TL;DR: According to this analysis, LSTM and CNN algorithms are the most used deep learning algorithms for sentiment analysis.
References
Journal ArticleDOI

LIBSVM: A library for support vector machines

TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
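For readers who reach LIBSVM through scikit-learn, whose SVC class wraps it, probability estimates and parameter selection look roughly as follows; the grid values and toy data are assumptions for illustration only:

# Grid search over C and gamma, with LIBSVM-style probability estimates.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(kernel="rbf", probability=True), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)                                # selected C and gamma
print(search.best_estimator_.predict_proba(X_test)[:3])   # class probabilities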
Journal ArticleDOI

Latent Dirichlet allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Proceedings Article

Latent Dirichlet Allocation

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
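For reference, the generative process behind this model can be written in the paper's notation, with per-document topic proportions \theta_d, per-word topic assignments z_{d,n}, and topic-word distributions \beta:

\theta_d \sim \operatorname{Dir}(\alpha), \qquad
z_{d,n} \mid \theta_d \sim \operatorname{Mult}(\theta_d), \qquad
w_{d,n} \mid z_{d,n} \sim \operatorname{Mult}(\beta_{z_{d,n}})

p(\mathbf{w}_d \mid \alpha, \beta) = \int p(\theta_d \mid \alpha)
  \prod_{n=1}^{N_d} \sum_{k=1}^{K} p(z_{d,n}=k \mid \theta_d)\, p(w_{d,n} \mid \beta_k)\; d\theta_d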
Journal ArticleDOI

The WEKA data mining software: an update

TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Journal ArticleDOI

Indexing by Latent Semantic Analysis

TL;DR: A new method for automatic indexing and retrieval that takes advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
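A minimal sketch of that idea in the common truncated-SVD formulation over a TF-IDF term-document matrix; the toy corpus and two-dimensional latent space are illustrative assumptions:

# Latent semantic indexing: project TF-IDF vectors onto a low-rank SVD basis.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car is driven on the road",
        "the truck is driven on the highway",
        "a cat sat on the mat"]
tfidf = TfidfVectorizer().fit_transform(docs)
doc_vectors = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(cosine_similarity(doc_vectors))     # document similarity in the latent space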