Open Access · Proceedings Article · DOI

Sparse Latent Semantic Analysis.

TLDR
A new model called Sparse LSA is proposed, which produces a sparse projection matrix via ℓ1 regularization; it achieves performance similar to LSA but is more efficient in projection computation and storage, and better explains topic-word relationships.
Abstract
Latent semantic analysis (LSA), one of the most popular unsupervised dimension reduction tools, has a wide range of applications in text mining and information retrieval. The key idea of LSA is to learn a projection matrix that maps the high dimensional vector space representations of documents to a lower dimensional latent space, i.e., the so-called latent topic space. In this paper, we propose a new model called Sparse LSA, which produces a sparse projection matrix via ℓ1 regularization. Compared to traditional LSA, Sparse LSA selects only a small number of relevant words for each topic and hence provides a compact representation of topic-word relationships. Moreover, Sparse LSA is computationally very efficient, with much less memory usage for storing the projection matrix. Furthermore, we propose two important extensions of Sparse LSA: group structured Sparse LSA and non-negative Sparse LSA. We conduct experiments on several benchmark datasets and compare Sparse LSA and its extensions with several widely used methods, e.g. LSA, Sparse Coding, and LDA. Empirical results suggest that Sparse LSA achieves performance similar to LSA, but is more efficient in projection computation and storage, and also better explains topic-word relationships.
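The abstract's core idea, a projection matrix made sparse by ℓ1 regularization, can be illustrated with a small alternating-minimization sketch: minimize ||X − UA||²_F + λ||A||₁ with orthonormal U, so the A-step reduces to soft-thresholding. This is an illustrative reconstruction rather than the paper's reference implementation; the function name `sparse_lsa`, the random initialization, and the fixed iteration count are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_lsa(X, k, lam=0.1, n_iter=50, seed=0):
    """Sketch of Sparse LSA via alternating minimization (names/initialization
    are assumptions, not the paper's code):
        min_{U,A} 0.5 * ||X - U A||_F^2 + lam * ||A||_1   s.t.  U^T U = I
    X is the n x d document-term matrix; A (k x d) is the sparse projection
    that maps a d-dimensional document vector into the k-dimensional topic space.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    A = 0.01 * rng.standard_normal((k, d))        # small random start
    for _ in range(n_iter):
        # U-step: orthogonal Procrustes problem; with X A^T = P S Q^T,
        # the optimum over orthonormal U is U = P Q^T.
        P, _, Qt = np.linalg.svd(X @ A.T, full_matrices=False)
        U = P @ Qt
        # A-step: because U^T U = I, the subproblem has the closed form
        # A = soft_threshold(U^T X, lam), which zeroes out weak topic-word links.
        A = soft_threshold(U.T @ X, lam)
    return U, A
```

Larger `lam` drives more entries of A to exactly zero, which is what gives Sparse LSA its compact topic-word representation and cheap projection.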



Citations
Proceedings Article

Angular Quantization-based Binary Codes for Fast Similarity Search

TL;DR: This work introduces a novel angular quantization-based binary coding (AQBC) technique for high-dimensional non-negative data, which arises in vision and text applications where counts or frequencies are used as features. It proposes a method for mapping feature vectors to their smallest-angle binary vertices that scales as O(d log d).
Proceedings Article

Learning Topics in Short Texts by Non-negative Matrix Factorization on Term Correlation Matrix

TL;DR: This paper introduces a novel way to compute term correlation in short texts by representing each term with its co-occurring terms, and formulates the topic learning problem as symmetric non-negative matrix factorization of the term correlation matrix.
Journal Article · DOI

Automated risk identification using NLP in cloud based development environments

TL;DR: This work addresses the need for automated risk assessment, using NLP to automatically identify risks through analysis of weaknesses and vulnerabilities.
Proceedings Article · DOI

Regularized latent semantic indexing

TL;DR: Regularized Latent Semantic Indexing (RLSI), a new method designed for parallelization, is introduced; it is as effective as existing topic models and scales to larger datasets without reducing the input vocabulary.
Proceedings Article

Harmonious hashing

TL;DR: A novel hashing algorithm called Harmonious Hashing is introduced, which aims to learn hash functions with low information loss; it learns a set of optimized projections that preserve the maximum cumulative energy while, as far as possible, satisfying the constraint of equal variance on each dimension.
References
Journal Article · DOI

LIBSVM: A library for support vector machines

TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Journal Article · DOI

Regression Shrinkage and Selection via the Lasso

TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
Journal Article · DOI

Gene Ontology: tool for the unification of biology

TL;DR: The goal of the Gene Ontology Consortium is to produce a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.
Journal Article · DOI

Latent dirichlet allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Proceedings Article

Latent Dirichlet Allocation

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).