Proceedings ArticleDOI

Least squares twin support vector machines for text categorization

TLDR
A new text categorization system is proposed that combines distributional clustering of words for document representation with linear LSTSVM for document classification; its effectiveness is verified by experiments on two benchmark text corpora and comparison with SVMlight-based classification in a similar setting.
Abstract
Least squares twin support vector machine (LSTSVM) [1] is a popular kernel-based SVM formulation for binary classification tasks. LSTSVM learns linear/nonlinear classification boundaries efficiently, as it requires only the solution of two linear systems of equations. LSTSVM has been applied to text categorization with a simple bag-of-words representation and conventional feature selection. The disadvantage of this approach is that it is likely to hurt classification performance, since information is lost when the lowest-ranked features are discarded. However, because LSTSVM training involves solving linear systems of equations whose size equals the input space dimension, it is extremely important to keep the input dimension small, and hence not all features can be used for training. Thus there is a need to learn a "dense" concept that combines many features without discarding them and produces a compact representation. Distributional clustering of words is an efficient alternative to traditional feature selection measures. Unlike feature selection measures, which discard low-ranked features, it generates an extremely compact representation of text documents in word-cluster space. It has been shown that SVM classification performance with this new representation is better than or on par with the traditional bag-of-words, in addition to the advantage of greatly reduced document dimensionality. In this paper, we propose a new text categorization system combining distributional clustering of words for document representation with linear LSTSVM for document classification. We verify its effectiveness by conducting experiments on two benchmark text corpora, WebKB and SRAA, and by comparing its results with SVMlight-based classification in a similar setting.
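Since the efficiency claim above is that linear LSTSVM training reduces to solving two linear systems, a minimal sketch may be useful. The following NumPy code is an illustrative implementation of the standard linear LSTSVM solution described in [1]; the function names, the small ridge term `reg` added for numerical stability, and the default parameter values are our own assumptions, not taken from the paper.

```python
import numpy as np

def lstsvm_train(A, B, c1=1.0, c2=1.0, reg=1e-6):
    """Linear LSTSVM: fit two nonparallel hyperplanes by solving two
    linear systems (no quadratic programming needed).
    A: (m1, d) samples of class +1; B: (m2, d) samples of class -1.
    Returns ((w1, b1), (w2, b2)). `reg` is an illustrative ridge term."""
    E = np.hstack([A, np.ones((A.shape[0], 1))])  # [A  e]
    F = np.hstack([B, np.ones((B.shape[0], 1))])  # [B  e]
    I = reg * np.eye(E.shape[1])                  # stabilizes the solves
    e1 = np.ones(A.shape[0])
    e2 = np.ones(B.shape[0])
    # Plane 1 lies close to class +1 and at unit distance from class -1:
    # [w1; b1] = -(F'F + (1/c1) E'E)^{-1} F' e2
    z1 = -np.linalg.solve(F.T @ F + (1.0 / c1) * (E.T @ E) + I, F.T @ e2)
    # Plane 2 lies close to class -1 and at unit distance from class +1:
    # [w2; b2] =  (E'E + (1/c2) F'F)^{-1} E' e1
    z2 = np.linalg.solve(E.T @ E + (1.0 / c2) * (F.T @ F) + I, E.T @ e1)
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])

def lstsvm_predict(X, plane1, plane2):
    """Assign each row of X to the class whose plane is nearer."""
    (w1, b1), (w2, b2) = plane1, plane2
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)
```

Note that each solve involves a (d+1)-by-(d+1) matrix, where d is the input dimension. This is exactly why the abstract stresses keeping the input dimension small, and why a compact word-cluster representation is attractive for text.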


Citations
Journal ArticleDOI

A Novel Twin Support Vector Machine with Generalized Pinball Loss Function for Pattern Classification

TL;DR: A novel twin support vector machine with a generalized pinball loss function (GPin-TSVM) is proposed for data classification; it is less sensitive to noise and preserves the sparsity of the solution.
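For context on the loss being generalized: below is a sketch of the standard pinball loss used in pinball-loss SVMs. The cited paper's generalized version adds further slope and insensitivity parameters that we do not attempt to reproduce here; this snippet is purely illustrative.

```python
import numpy as np

def pinball_loss(u, tau=0.5):
    """Standard pinball loss: L_tau(u) = u for u >= 0, -tau*u for u < 0,
    where u = 1 - y*f(x) is the margin shortfall. Penalizing both sides
    of the margin (tau > 0) is what makes pinball-loss SVMs less
    sensitive to feature noise than the hinge loss (tau = 0)."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0, u, -tau * u)
```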
References
Proceedings ArticleDOI

Advances in kernel methods: support vector learning

TL;DR: An edited volume on support vector learning whose chapters include "Support vector machines for dynamic reconstruction of a chaotic system" (Klaus-Robert Müller et al.) and "Pairwise classification and support vector machines" (Ulrich Kressel).
Posted ContentDOI

Making large scale SVM learning practical

TL;DR: SVMlight, as discussed by the authors, is an implementation of an SVM learner that addresses the problem of training with many examples, making large-scale SVM learning practical.

The information bottleneck method

TL;DR: The variational principle provides a surprisingly rich framework for discussing a variety of problems in signal processing and learning, as will be described in detail elsewhere.
Journal ArticleDOI

Twin Support Vector Machines for Pattern Classification

TL;DR: A binary SVM classifier that determines two nonparallel planes by solving two related SVM-type problems, each smaller than a conventional SVM; it shows good generalization on several benchmark data sets.
Proceedings ArticleDOI

Distributional clustering of English words

TL;DR: In this article, a method for clustering words according to their distribution in particular syntactic contexts is described and evaluated experimentally. Words are represented by the relative frequency distributions of the contexts in which they appear, and the relative entropy between those distributions is used as the similarity measure for clustering.
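The similarity measure in this reference is the relative entropy (KL divergence) between the context distributions of words. A minimal sketch of that computation follows; the toy context-frequency vectors and the `eps` smoothing constant are hypothetical, not from the paper.

```python
import numpy as np

def relative_entropy(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete context distributions.
    Inputs are raw context counts; they are normalized to sum to 1, and
    `eps` (an illustrative choice) guards against log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical context-frequency vectors for two words over 4 contexts:
word_a = [10, 5, 1, 0]
word_b = [8, 6, 2, 1]
print(relative_entropy(word_a, word_b))  # small value -> similar distributions
```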