Proceedings ArticleDOI

Least squares twin support vector machines for text categorization

TLDR
A new text categorization system is proposed that combines distributional clustering of words for document representation with linear LSTSVM for document classification; its effectiveness is verified by experiments on two benchmark text corpora and comparison with SVMlight-based classification in a similar setting.
Abstract
Least squares twin support vector machine (LSTSVM) [1] is a popular kernel-based SVM formulation for binary classification tasks. LSTSVM learns linear/nonlinear classification boundaries efficiently, as it requires only the solution of two linear systems of equations. LSTSVM has been applied to text categorization with a simple bag-of-words representation and conventional feature selection. The disadvantage of this approach is that it is likely to hurt classification performance, since information is lost when the lowest-ranked features are discarded. However, because LSTSVM training involves solving linear systems of equations whose size equals the input space dimension, it is extremely important to keep the input dimension small, and hence not all features can be used for training. Thus there is a need to learn a "dense" concept that combines many features without discarding them and produces a compact representation. Distributional clustering of words is an efficient alternative to traditional feature selection measures. Unlike feature selection measures, which discard low-ranked features, it generates an extremely compact representation of text documents in word-cluster space. It has been shown that SVM classification performance with this new representation is better than or on par with the traditional bag-of-words, in addition to the advantage of greatly reduced document dimensionality. In this paper, we propose a new text categorization system combining distributional clustering of words for document representation with linear LSTSVM for document classification. We verify its effectiveness by conducting experiments on two benchmark text corpora, WebKB and SRAA, and by comparing its results with SVMlight-based classification in a similar setting.
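Since the efficiency claim above is that linear LSTSVM training reduces to solving two linear systems, a minimal sketch may be useful. The following NumPy code is an illustrative implementation of the standard linear LSTSVM solution described in [1]; the function names, the small ridge term `reg` added for numerical stability, and the default parameter values are our own assumptions, not taken from the paper.

```python
import numpy as np

def lstsvm_train(A, B, c1=1.0, c2=1.0, reg=1e-6):
    """Linear LSTSVM: fit two nonparallel hyperplanes by solving two
    linear systems (no quadratic programming needed).
    A: (m1, d) samples of class +1; B: (m2, d) samples of class -1.
    Returns ((w1, b1), (w2, b2)). `reg` is an illustrative ridge term."""
    E = np.hstack([A, np.ones((A.shape[0], 1))])  # [A  e]
    F = np.hstack([B, np.ones((B.shape[0], 1))])  # [B  e]
    I = reg * np.eye(E.shape[1])                  # stabilizes the solves
    e1 = np.ones(A.shape[0])
    e2 = np.ones(B.shape[0])
    # Plane 1 lies close to class +1 and at unit distance from class -1:
    # [w1; b1] = -(F'F + (1/c1) E'E)^{-1} F' e2
    z1 = -np.linalg.solve(F.T @ F + (1.0 / c1) * (E.T @ E) + I, F.T @ e2)
    # Plane 2 lies close to class -1 and at unit distance from class +1:
    # [w2; b2] =  (E'E + (1/c2) F'F)^{-1} E' e1
    z2 = np.linalg.solve(E.T @ E + (1.0 / c2) * (F.T @ F) + I, E.T @ e1)
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])

def lstsvm_predict(X, plane1, plane2):
    """Assign each row of X to the class whose plane is nearer."""
    (w1, b1), (w2, b2) = plane1, plane2
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)
```

Note that each solve involves a (d+1)-by-(d+1) matrix, where d is the input dimension. This is exactly why the abstract stresses keeping the input dimension small, and why a compact word-cluster representation is attractive for text.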


Citations
Journal ArticleDOI

A Novel Twin Support Vector Machine with Generalized Pinball Loss Function for Pattern Classification

TL;DR: A novel twin support vector machine with a generalized pinball loss function (GPin-TSVM) is proposed for data classification; it is less sensitive to noise and preserves the sparsity of the solution.
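For context on the loss being generalized: below is a sketch of the standard pinball loss used in pinball-loss SVMs. The cited paper's generalized version adds further slope and insensitivity parameters that we do not attempt to reproduce here; this snippet is purely illustrative.

```python
import numpy as np

def pinball_loss(u, tau=0.5):
    """Standard pinball loss: L_tau(u) = u for u >= 0, -tau*u for u < 0,
    where u = 1 - y*f(x) is the margin shortfall. Penalizing both sides
    of the margin (tau > 0) is what makes pinball-loss SVMs less
    sensitive to feature noise than the hinge loss (tau = 0)."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0, u, -tau * u)
```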
References
Proceedings ArticleDOI

Advances in kernel methods: support vector learning

TL;DR: An edited volume on support vector learning whose chapters include "Support vector machines for dynamic reconstruction of a chaotic system" (Klaus-Robert Müller et al.) and "Pairwise classification and support vector machines" (Ulrich Kressel).
Posted ContentDOI

Making large scale SVM learning practical

TL;DR: SVMlight, as discussed by the authors, is an implementation of an SVM learner that addresses the problem of training with many examples, making large-scale SVM learning practical.

The information bottleneck method

TL;DR: The variational principle provides a surprisingly rich framework for discussing a variety of problems in signal processing and learning, as will be described in detail elsewhere.
Journal ArticleDOI

Twin Support Vector Machines for Pattern Classification

TL;DR: A binary SVM classifier that determines two nonparallel planes by solving two related SVM-type problems, each smaller than a conventional SVM; it shows good generalization on several benchmark data sets.
Proceedings ArticleDOI

Distributional clustering of English words

TL;DR: In this article, a method for clustering words according to their distribution in particular syntactic contexts is described and evaluated experimentally. Words are represented by the relative frequency distributions of the contexts in which they appear, and the relative entropy between those distributions is used as the similarity measure for clustering.
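The similarity measure in this reference is the relative entropy (KL divergence) between the context distributions of words. A minimal sketch of that computation follows; the toy context-frequency vectors and the `eps` smoothing constant are hypothetical, not from the paper.

```python
import numpy as np

def relative_entropy(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete context distributions.
    Inputs are raw context counts; they are normalized to sum to 1, and
    `eps` (an illustrative choice) guards against log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical context-frequency vectors for two words over 4 contexts:
word_a = [10, 5, 1, 0]
word_b = [8, 6, 2, 1]
print(relative_entropy(word_a, word_b))  # small value -> similar distributions
```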