Book Chapter

Text Categorization Using Fuzzy Proximal SVM and Distributional Clustering of Words

19 Apr 2009, pp. 52-61
TL;DR: This paper applies linear PSVM and its fuzzy extension FPSVM, together with the recently proposed distributional clustering (DC) of words, to text categorization, and reveals the merits of PSVM and FPSVM over other linear SVMs.
Abstract: Text Categorization (TC) remains a promising application area for linear support vector machines (SVMs). Among the numerous linear SVM formulations, we bring forward linear PSVM, together with the recently proposed distributional clustering (DC) of words, to realize its potential in the TC realm. DC has been presented as an efficient alternative to the feature selection conventionally used in TC. It has been shown that DC together with a linear SVM drastically reduces the dimensionality of text documents without any compromise in classification performance. In this paper we use linear PSVM and its extension Fuzzy PSVM (FPSVM) together with DC for TC. We present experimental results comparing PSVM/FPSVM with linear SVMlight and SVMlin on the popular WebKB text corpus. Through numerous experiments on subsets of WebKB, we reveal the merits of PSVM and FPSVM over other linear SVMs.
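The key computational appeal of the proximal SVM is that it replaces the standard SVM quadratic program with a single linear system: points are classified by proximity to one of two parallel planes. A minimal sketch of linear PSVM training in the spirit of Fung and Mangasarian's formulation (the variable names and the nu parameter are illustrative, not taken from this paper):

```python
import numpy as np

def train_linear_psvm(A, d, nu=1.0):
    """Train a linear proximal SVM.

    A  : (m, n) document-feature matrix
    d  : (m,) labels in {-1, +1}
    nu : regularization trade-off
    Solves one (n+1) x (n+1) linear system instead of a QP.
    """
    m, n = A.shape
    E = np.hstack([A, -np.ones((m, 1))])       # augment with bias column
    lhs = np.eye(n + 1) / nu + E.T @ E
    rhs = E.T @ d                              # = E^T D e with D = diag(d)
    sol = np.linalg.solve(lhs, rhs)
    w, gamma = sol[:-1], sol[-1]
    return w, gamma

def predict(A, w, gamma):
    # Classify by which side of the separating plane x.w - gamma = 0
    return np.sign(A @ w - gamma)
```

In the TC setting of the paper, the rows of `A` would be documents in the compact word-cluster representation produced by DC, which keeps the linear system small.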
Citations
Journal Article
TL;DR: This research work was funded by the Spanish Government through the FPI research programme associated with the project "TEXT-MESS", and was also partially funded by project grants TIN2009-133991-C04 and PROMETEO/2009/199 from the Spanish and Valencian Governments, respectively.
Abstract: Information plays a very important role in today's society: processed and handled correctly, it offers great advantages to its users. However, owing to its exponential growth, users are unable to process all of this information, so Human Language Technologies (HLT) are essential for managing it efficiently and effectively, and are of great help to users. Automatic summarization is an area of HLT whose goal is to process, synthesize, and present information to the user in condensed form, sparing users from having to read a multitude of documents and extract the most important content from each one. The research developed in this doctoral thesis focuses on this area, specifically on automatic summary generation, showing that automatic summaries are beneficial both to users and to other HLT applications. After an exhaustive analysis of the state of the art in both summarization approaches and their evaluation, the summarization tool COMPENDIUM is proposed. The tool follows a cognitive approach based on the theories of (Van Dijk, 1980) and (Van Dijk & Kintsch, 1983), which explain how humans produce summaries, complemented by a computational component (Hovy, 2005) that makes automation possible. COMPENDIUM can generate different types of summaries of English text. Summary length is determined by a fixed number of words or by a compression rate. Regarding input, summaries can be generated from one or several documents (mono- or multi-document, respectively). As output, summaries follow an extractive paradigm (extracts) or an abstractive-oriented one.
Finally, regarding purpose, summaries can be generic, topic-oriented, or subjective, and in all cases they are intended to serve as informative substitutes for the original document. The proposed COMPENDIUM architecture comprises two kinds of stages: those forming the core of the tool, whose output is generic extracts, and a series of additional stages for generating specific summary types: topic-oriented summaries, subjective summaries, and abstractive-oriented summaries. The core stages of COMPENDIUM are: i) linguistic analysis; ii) redundancy detection; iii) topic identification; iv) relevance detection; and v) summary generation. The additional stages are: i) similarity with the query; ii) subjective information detection; and iii) information compression and fusion. Some of these stages rely on novel methods and approaches. In particular, textual entailment recognition is used as a method to detect and remove redundancy in a document, while the code quantity principle, together with word frequency, is proposed to identify which sentences carry the most relevant information. A word-graph-based method is also proposed that combines extractive and abstractive information, producing abstractive-oriented summaries. COMPENDIUM has been evaluated both intrinsically and extrinsically. For the intrinsic evaluation, different text types from diverse domains were used: news articles, image descriptions, blogs, and scientific articles from the medical domain. For the extrinsic evaluation, COMPENDIUM was integrated into opinion mining, question answering, and text classification.
The goal of integrating COMPENDIUM into the first of these applications is to improve subjective summary generation over approaches that do not use summarization techniques. For the second application, topic-oriented summaries were used instead of the snippets returned by search engines, so that a question answering system can find answers to factual questions more effectively. Finally, in the third, COMPENDIUM was used to generate summaries that help predict the rating associated with a review, instead of processing the full review. All of this shows that the automatic summaries generated with COMPENDIUM are suitable both for standalone use and for integration into other HLT applications in order to improve their performance.
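As a rough illustration of frequency-based relevance detection, one of the ideas the thesis abstract describes, a minimal extractive summarizer can score each sentence by the average corpus frequency of its words and keep the top-ranked sentences. This is a toy sketch, not COMPENDIUM's actual pipeline (which also uses the code quantity principle, entailment-based redundancy removal, and other stages):

```python
import re
from collections import Counter

def extract_summary(text, n_sentences=2):
    """Frequency-based extractive summarizer (illustrative sketch)."""
    # Naive sentence split on terminal punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Document-level word frequencies
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency of the sentence's words
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Emit selected sentences in their original order
    return " ".join(s for s in sentences if s in top)
```

A real system would at minimum down-weight stopwords and remove redundant sentences before selection; this sketch only shows the scoring skeleton.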

22 citations

Proceedings Article
05 Jun 2010
TL;DR: Preliminary results show that some types of summaries can be as effective as, or better than, full documents for the rating-inference problem; the effect of text summarisation on this task is investigated.
Abstract: We investigate the effect of text summarisation on the problem of rating-inference -- the task of associating a fine-grained numerical rating with an opinionated document. We set up a comparison framework to study the effect of different summarisation algorithms at various compression rates on this task, and compare the classification accuracy of summaries and full documents for assigning documents to classes. We use SVM algorithms to associate numerical ratings with opinionated documents. The algorithms are informed by linguistic and sentiment-based features computed from full documents and summaries. Preliminary results show that some types of summaries can be as effective as, or better than, full documents for this problem.

22 citations


Cites methods from "Text Categorization Using Fuzzy Pro..."

  • ...For the experiments reported here, we adopt a Support Vector Machine (SVM) learning paradigm not only because it has recently been used with success in different tasks in natural language processing (Isozaki and Kazawa, 2002), but also because it has been shown to be particularly suitable for text categorization (Kumar and Gopal, 2009), where the feature space is huge, as it is in our case....


Proceedings Article
01 Dec 2015
TL;DR: A new text categorization system combining distributional clustering of words for document representation with linear LSTSVM for document classification is proposed, and verified by experiments on two benchmark text corpora, comparing its results with SVMlight-based classification in a similar setting.
Abstract: Least squares twin support vector machines (LSTSVM) [1] is a popular kernel-based SVM formulation for binary classification tasks. LSTSVM is an efficient algorithm for learning linear/nonlinear classification boundaries, as it requires only the solution of two linear systems of equations. LSTSVM has been applied to text categorization with a simple bag-of-words representation and conventional feature selection. The disadvantage of this approach is that it is likely to hurt classification performance, because information is lost when the lowest-ranked features are discarded. However, since LSTSVM training involves solving linear systems of equations whose size is the input space dimension, it is extremely important to keep the input dimension small, and hence not all features can be considered for training. Thus there is a need to learn a "dense" concept that combines many features without discarding them and produces a compact representation. Distributional clustering of words is an efficient alternative to traditional feature selection measures. Unlike feature selection measures, which discard low-ranked features, it generates an extremely compact representation of text documents in word-cluster space. It has been shown that SVM classification performance with this new representation is better than or on par with the traditional bag-of-words representation, despite the reduced dimensionality of the text documents. In this paper, we propose a new text categorization system combining distributional clustering of words for document representation with linear LSTSVM for document classification. We verified its effectiveness by conducting experiments on two benchmark text corpora, WebKB and SRAA, and comparing its results with SVMlight-based classification in a similar setting.
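The two linear systems the abstract mentions can be sketched concretely. In linear LSTSVM, each class gets its own (nonparallel) plane: plane 1 passes close to the positive class while staying away from the negative class, and vice versa. A minimal illustrative implementation (the names and the `c1`/`c2` trade-off parameters are ours, not the authors'):

```python
import numpy as np

def train_lstsvm(A, B, c1=1.0, c2=1.0):
    """Linear LSTSVM sketch: A holds class +1 points, B class -1 points.
    Each plane x.w + b = 0 comes from one (n+1) x (n+1) linear system."""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    E = np.hstack([A, e1])                 # augmented class +1 matrix
    F = np.hstack([B, e2])                 # augmented class -1 matrix
    # Plane 1: minimize ||E z1||^2 + c1 ||F z1 + e2||^2
    z1 = -np.linalg.solve(F.T @ F + (1.0 / c1) * (E.T @ E), F.T @ e2).ravel()
    # Plane 2: minimize ||F z2||^2 + c2 ||E z2 - e1||^2
    z2 = np.linalg.solve(E.T @ E + (1.0 / c2) * (F.T @ F), E.T @ e1).ravel()
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])

def predict_lstsvm(X, plane1, plane2):
    (w1, b1), (w2, b2) = plane1, plane2
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)       # assign to the nearer plane
```

Because each system is of size (input dimension + 1), the compact word-cluster representation from distributional clustering is what keeps this tractable for text, which is exactly the combination the paper proposes.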

1 citation


Cites result from "Text Categorization Using Fuzzy Pro..."

  • ...We have investigated word clusters along with PSVM and observed improved results over conventional SVM [9]....


References
01 Jan 1998
TL;DR: Presenting a method for determining the necessary and sufficient conditions for the consistency of the learning process, the author covers function estimation from small data pools, the application of these estimates to real-life problems, and much more.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for the consistency of the learning process, the author covers function estimation from small data pools, the application of these estimates to real-life problems, and much more.

26,531 citations


"Text Categorization Using Fuzzy Pro..." refers background in this paper

  • ...An important characteristic of SVM is that it can be extended in a relatively straightforward manner to create nonlinear decision boundaries [16]....


Book Chapter
21 Apr 1998
TL;DR: This paper explores the use of Support Vector Machines for learning text classifiers from examples and analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task.
Abstract: This paper explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It analyzes the particular properties of learning with text data and identifies why SVMs are appropriate for this task. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods and behave robustly over a variety of different learning tasks. Furthermore they are fully automatic, eliminating the need for manual parameter tuning.

8,658 citations


"Text Categorization Using Fuzzy Pro..." refers methods in this paper

  • ...Among all machine learning methods, linear SVM achieved impressive performance on TC tasks [2-3]....


Journal Article
TL;DR: An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL, and performs slightly better than a much more elaborate system with which it has been compared.
Abstract: The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
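The stepwise suffix removal the abstract describes is easy to illustrate with step 1a of the algorithm, which handles plural suffixes. This is a toy sketch of a single step, not a full Porter stemmer (which has five steps and conditions the removal on a syllable-length measure of the remaining stem):

```python
def porter_step1a(word):
    """Step 1a of Porter's algorithm: strip plural suffixes.
    Longer suffixes are matched first, mirroring the original rules."""
    if word.endswith("sses"):
        return word[:-2]          # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]          # ponies -> poni
    if word.endswith("ss"):
        return word               # caress -> caress (no change)
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word
```

The example words above are the ones used in Porter's original paper; the later steps apply the same longest-match idea to derivational suffixes, guarded by the stem-measure condition.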

7,572 citations


"Text Categorization Using Fuzzy Pro..." refers methods in this paper

  • ...We also performed word stemming using the Porter stemmer algorithm [13]....


Proceedings Article
08 Feb 1999
TL;DR: An edited collection on support vector learning, covering theory, implementations, applications, and extensions of the algorithm.
Abstract: Introduction to support vector learning roadmap.
Part 1 Theory: three remarks on the support vector method of function estimation, Vladimir Vapnik; generalization performance of support vector machines and other pattern classifiers, Peter Bartlett and John Shawe-Taylor; Bayesian voting schemes and large margin classifiers, Nello Cristianini and John Shawe-Taylor; support vector machines, reproducing kernel Hilbert spaces, and randomized GACV, Grace Wahba; geometry and invariance in kernel based methods, Christopher J.C. Burges; on the annealed VC entropy for margin classifiers - a statistical mechanics study, Manfred Opper; entropy numbers, operators and support vector kernels, Robert C. Williamson et al.
Part 2 Implementations: solving the quadratic programming problem arising in support vector classification, Linda Kaufman; making large-scale support vector machine learning practical, Thorsten Joachims; fast training of support vector machines using sequential minimal optimization, John C. Platt.
Part 3 Applications: support vector machines for dynamic reconstruction of a chaotic system, Davide Mattera and Simon Haykin; using support vector machines for time series prediction, Klaus-Robert Muller et al; pairwise classification and support vector machines, Ulrich Kressel.
Part 4 Extensions of the algorithm: reducing the run-time complexity in support vector machines, Edgar E. Osuna and Federico Girosi; support vector regression with ANOVA decomposition kernels, Mark O. Stitson et al; support vector density estimation, Jason Weston et al; combining support vector and mathematical programming methods for classification, Bernhard Scholkopf et al.

5,506 citations

01 Jan 1999

4,584 citations


"Text Categorization Using Fuzzy Pro..." refers methods in this paper

  • ...For all implementations, after training SVM for a category, we fit a two-parameter sigmoid trained with regularized binomial maximum likelihood [19], so that the SVM outputs posterior probabilities....

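The two-parameter sigmoid fit mentioned in the quote (Platt, 1999) maps raw SVM decision values f to posterior probabilities P(y=1|f) = 1 / (1 + exp(A*f + B)). A sketch using plain gradient descent on the regularized likelihood (Platt's pseudo-code uses a Newton-style optimizer; this simplified variant and its names are ours):

```python
import numpy as np

def platt_scale(scores, labels, n_iter=2000, lr=0.1):
    """Fit P(y=1|f) = 1 / (1 + exp(A*f + B)) by regularized MLE."""
    n_pos = np.sum(labels == 1)
    n_neg = len(labels) - n_pos
    # Soft targets instead of hard 0/1 labels (Platt's prior correction)
    t = np.where(labels == 1,
                 (n_pos + 1.0) / (n_pos + 2.0),
                 1.0 / (n_neg + 2.0))
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        g = t - p                          # dNLL/d(A*f + B) per sample
        A -= lr * np.mean(g * scores)
        B -= lr * np.mean(g)
    return A, B

def posterior(f, A, B):
    return 1.0 / (1.0 + np.exp(A * f + B))
```

Note the fit is done on held-out decision values in practice, since calibrating on the training scores biases the sigmoid; for a score scale where positives get large positive f, the fitted A comes out negative.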