
Showing papers on "Probabilistic latent semantic analysis published in 1999"


Journal ArticleDOI
01 Aug 1999
TL;DR: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data.
Abstract: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
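For concreteness, the EM fit of the latent class (aspect) model can be written compactly in numpy. The following is a minimal illustrative sketch, not Hofmann's code: all names are mine, and tempering, stopping criteria, and sparse-matrix handling are omitted.

import numpy as np

def plsi(counts, n_topics, n_iter=100, seed=0):
    # Fit P(d,w) = sum_z P(z) P(d|z) P(w|z) by EM.
    # counts: (n_docs, n_words) term-frequency matrix n(d, w).
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z = np.full(n_topics, 1.0 / n_topics)                      # P(z)
    p_d_z = rng.random((n_topics, n_docs))                       # P(d|z)
    p_d_z /= p_d_z.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))                      # P(w|z)
    p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), shape (z, d, w).
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        post = joint / joint.sum(0, keepdims=True).clip(1e-12)
        # M-step: reweight the responsibilities by the observed counts.
        weighted = post * counts[None, :, :]
        p_w_z = weighted.sum(1)
        p_w_z /= p_w_z.sum(1, keepdims=True)
        p_d_z = weighted.sum(2)
        p_d_z /= p_d_z.sum(1, keepdims=True)
        p_z = weighted.sum((1, 2))
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z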

4,577 citations


Proceedings Article
30 Jul 1999
TL;DR: This work proposes a widely applicable generalization of maximum likelihood model fitting by tempered EM, based on a mixture decomposition derived from a latent class model, resulting in a more principled approach with a solid foundation in statistics.
Abstract: Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM. Our approach yields substantial and consistent improvements over Latent Semantic Analysis in a number of experiments.
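The tempered E-step differs from the standard one only by an exponent beta that smooths the posteriors; beta starts at 1 and is reduced whenever held-out likelihood stops improving. Below is a hedged sketch reusing the array shapes from the pLSI sketch above; the placement of beta and the multiplicative annealing schedule follow one common reading of tempered EM and may differ from the paper in detail.

def tempered_posteriors(p_z, p_d_z, p_w_z, beta):
    # Raising the likelihood terms to beta < 1 flattens P(z|d,w),
    # which damps overfitting; beta = 1 recovers standard EM.
    joint = p_z[:, None, None] * (p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta
    return joint / joint.sum(0, keepdims=True).clip(1e-12)

# Typical schedule (illustrative): start with beta = 1.0; whenever held-out
# likelihood degrades, set beta *= 0.9 and continue EM from the current
# parameters.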

2,306 citations


Proceedings Article
30 Jul 1999
TL;DR: The Variational Bayes framework as discussed by the authors approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner without resorting to sampling methods.
Abstract: Current methods for learning graphical models with latent variables and a fixed structure estimate optimal values for the model parameters. Whereas this approach usually produces overfitting and suboptimal generalization performance, carrying out the Bayesian program of computing the full posterior distributions over the parameters remains a difficult problem. Moreover, learning the structure of models with latent variables, for which the Bayesian approach is crucial, is yet a harder problem. In this paper I present the Variational Bayes framework, which provides a solution to these problems. This approach approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner without resorting to sampling methods. Unlike in the Laplace approximation, these posteriors are generally non-Gaussian and no Hessian needs to be computed. The resulting algorithm generalizes the standard Expectation Maximization algorithm, and its convergence is guaranteed. I demonstrate that this algorithm can be applied to a large class of models in several domains, including unsupervised clustering and blind source separation.
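As a toy instance of the framework (my example, not one of the paper's models), here is coordinate-ascent variational Bayes for a single Gaussian with unknown mean mu and precision tau under conjugate priors, using a fully factorized posterior q(mu)q(tau); the posteriors come out Gaussian and Gamma analytically, with no sampling and no Hessian.

import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    # Model: x_i ~ N(mu, 1/tau), mu ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0).
    # Returns parameters of q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n).
    n, xbar, xsq = len(x), x.mean(), (x ** 2).sum()
    a_n = a0 + (n + 1) / 2.0          # fixed point: depends only on n
    e_tau = a0 / b0                   # initial E[tau]
    for _ in range(n_iter):
        mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        var_n = 1.0 / lam_n
        # E_q(mu)[ sum_i (x_i - mu)^2 + lam0 * (mu - mu0)^2 ]
        e_quad = (xsq - 2 * mu_n * n * xbar + n * (mu_n ** 2 + var_n)
                  + lam0 * ((mu_n - mu0) ** 2 + var_n))
        b_n = b0 + 0.5 * e_quad
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n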

615 citations


Proceedings Article
31 Jul 1999
TL;DR: This paper presents a statistical approach to collaborative filtering and investigates the use of latent class models for predicting individual choices and preferences based on observed preference behavior and presents EM algorithms for different variants of the aspect model.
Abstract: This paper presents a statistical approach to collaborative filtering and investigates the use of latent class models for predicting individual choices and preferences based on observed preference behavior. Two models are discussed and compared: the aspect model, a probabilistic latent space model which models individual preferences as a convex combination of preference factors, and the two-sided clustering model, which simultaneously partitions persons and objects into clusters. We present EM algorithms for different variants of the aspect model and derive an approximate EM algorithm based on a variational principle for the two-sided clustering model. The benefits of the different models are experimentally investigated on a large movie data set.
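Treating persons as "documents" and items as "words", prediction in the aspect model reduces to the same mixture decomposition, P(item|person) = sum_z P(z|person) P(item|z). A minimal sketch assuming parameters fitted as in the pLSI sketch earlier; names are illustrative, and the paper's EM variants differ in parametrization.

def predict_preferences(p_z, p_u_z, p_v_z):
    # p_z: (z,) P(z); p_u_z: (z, users) P(u|z); p_v_z: (z, items) P(v|z).
    p_z_u = p_z[:, None] * p_u_z              # proportional to P(z, u)
    p_z_u /= p_z_u.sum(0, keepdims=True)      # Bayes' rule: P(z|u)
    return p_z_u.T @ p_v_z                    # (users, items): P(v|u)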

510 citations


Dissertation
01 Jan 1999
TL;DR: This dissertation attempts to meet this need, extending and applying the modern tools of latent variable modeling to problems in neural data analysis, by proposing a new, extremely general, optimization algorithm that may be used to learn the optimal parameter values of arbitrary latent variable models.
Abstract: The brain is perhaps the most complex system to have ever been subjected to rigorous scientific investigation. The scale is staggering: over 10^11 neurons, each making an average of 10^3 synapses, with computation occurring on scales ranging from a single dendritic spine to an entire cortical area. Slowly, we are beginning to acquire experimental tools that can gather the massive amounts of data needed to characterize this system. However, to understand and interpret these data will also require substantial strides in inferential and statistical techniques. This dissertation attempts to meet this need, extending and applying the modern tools of latent variable modeling to problems in neural data analysis. It is divided into two parts. The first begins with an exposition of the general techniques of latent variable modeling. A new, extremely general, optimization algorithm is proposed - called Relaxation Expectation Maximization (REM) - that may be used to learn the optimal parameter values of arbitrary latent variable models. This algorithm appears to alleviate the common problem of convergence to local, sub-optimal likelihood maxima. REM leads to a natural framework for model size selection; in combination with standard model selection techniques, the quality of fits may be further improved, while the appropriate model size is automatically and efficiently determined. Next, a new latent variable model, the mixture of sparse hidden Markov models, is introduced, and approximate inference and learning algorithms are derived for it. This model is applied in the second part of the thesis. The second part brings the technology of part I to bear on two important problems in experimental neuroscience. The first is known as spike sorting; this is the problem of separating the spikes from different neurons embedded within an extracellular recording. The dissertation offers the first thorough statistical analysis of this problem, which then yields the first powerful probabilistic solution. The second problem addressed is that of characterizing the distribution of spike trains recorded from the same neuron under identical experimental conditions. A latent variable model is proposed. Inference and learning in this model leads to new principled algorithms for smoothing and clustering of spike data.

158 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: A dual probability model is constructed for the Latent Semantic Indexing using the cosine similarity measure, establishing a statistical framework for LSI and leading to a statistical criterion for the optimal semantic dimensions.
Abstract: A dual probability model is constructed for Latent Semantic Indexing (LSI) using the cosine similarity measure. Both the document-document similarity matrix and the term-term similarity matrix naturally arise from the maximum likelihood estimation of the model parameters, and the optimal solutions are the latent semantic vectors of LSI. Dimensionality reduction is justified by the statistical significance of latent semantic vectors as measured by the likelihood of the model. This leads to a statistical criterion for the optimal semantic dimensions, answering a critical open question in LSI with practical importance. Thus the model establishes a statistical framework for LSI. Ambiguities related to statistical modeling of LSI are clarified.

152 citations


Patent
Jerome R. Bellegarda
12 Mar 1999
TL;DR: In this paper, a hybrid stochastic language model is proposed that integrates a latent semantic analysis language model into an n-gram probability language model for the direct processing of speech signals.
Abstract: Speech or acoustic signals are processed directly using a hybrid stochastic language model produced by integrating a latent semantic analysis language model into an n-gram probability language model. The latent semantic analysis language model probability is computed using a first pseudo-document vector that is derived from a second pseudo-document vector with the pseudo-document vectors representing pseudo-documents created from the signals received at different times. The first pseudo-document vector is derived from the second pseudo-document vector by updating the second pseudo-document vector directly in latent semantic analysis space in response to at least one addition of a candidate word of the received speech signals to the pseudo-document represented by the second pseudo-document vector. Updating precludes mapping a sparse representation for a pseudo-document into the latent semantic space to produce the first pseudo-document vector. A linguistic message representative of the received speech signals is generated.
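Because the standard LSA mapping from a term-count vector d to the latent space, v = d @ U_k @ inv(S_k), is linear, appending one word to a pseudo-document can be absorbed as an additive update directly in latent space, which matches the patent's stated goal of avoiding a re-projection of the sparse representation. A hedged sketch; names are mine, and the patent's term weighting and normalization are omitted.

def fold_in_update(v_old, word_row, s_k):
    # v_old: current pseudo-document vector in the k-dim latent space.
    # word_row: the added word's row of U_k; s_k: top-k singular values.
    # Adding the indicator e_w to the counts adds U_k[w] / S_k to v.
    return v_old + word_row / s_k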

150 citations


Proceedings ArticleDOI
12 Oct 1999
TL;DR: Applying Latent Semantic Analysis to the domain of source code and internal documentation for the support of software reuse is a new application of this method and a departure from the normal application domain of natural language.
Abstract: The paper describes the initial results of applying Latent Semantic Analysis (LSA) to program source code and associated documentation. Latent Semantic Analysis is a corpus based statistical method for inducing and representing aspects of the meanings of words and passages (of natural language) as reflected in their usage. This methodology is assessed for application to the domain of software components (i.e., source code and its accompanying documentation). The intent of applying Latent Semantic Analysis to software components is to automatically induce a specific semantic meaning of a given component. Here LSA is used as the basis to cluster software components. Results of applying this method to the LEDA library and MINIX operating system are given. Applying Latent Semantic Analysis to the domain of source code and internal documentation for the support of software reuse is a new application of this method and a departure from the normal application domain of natural language.

83 citations


Proceedings Article
31 Jul 1999
TL;DR: This work has used LSA as a mechanism for evaluating the quality of student responses in an intelligent tutoring system, and its performance equals that of human raters with intermediate domain knowledge.
Abstract: Latent Semantic Analysis (LSA) is a statistical, corpus-based text comparison mechanism that was originally developed for the task of information retrieval, but in recent years has produced remarkably human-like abilities in a variety of language tasks. LSA has taken the Test of English as a Foreign Language and performed as well as non-native English speakers who were successful college applicants. It has shown an ability to learn words at a rate similar to humans. It has even graded papers as reliably as human graders. We have used LSA as a mechanism for evaluating the quality of student responses in an intelligent tutoring system, and its performance equals that of human raters with intermediate domain knowledge. It has been claimed that LSA's text-comparison abilities stem primarily from its use of a statistical technique called singular value decomposition (SVD) which compresses a large amount of term and document co-occurrence information into a smaller space. This compression is said to capture the semantic information that is latent in the corpus itself. We test this claim by comparing LSA to a version of LSA without SVD, as well as a simple keyword matching model.
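The comparison described here can be illustrated with a small sketch: score a student answer against an ideal answer by cosine similarity once in the raw term space (keyword matching) and once in an SVD-reduced space (LSA). Everything below, including the folding convention and the choice of k, is an assumption for illustration, not the paper's setup.

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def scores(term_doc, answer_counts, ideal_counts, k=100):
    # term_doc: (terms, docs) training matrix; answers are term-count vectors.
    u, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    terms_k = u[:, :k]                       # one common folding convention
    keyword = cosine(answer_counts, ideal_counts)
    lsa = cosine(answer_counts @ terms_k, ideal_counts @ terms_k)
    return keyword, lsa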

75 citations



Book ChapterDOI
01 Jan 1999
TL;DR: Methods for enhancing the indexing of spoken documents using latent semantic analysis and self-organizing maps are presented and tested; they extract extra information from the structure of the document collection and use it for more accurate indexing by generating new index terms and stochastic index weights.
Abstract: This paper describes an important application for state-of-the-art automatic speech recognition, natural language processing and information retrieval systems. Methods for enhancing the indexing of spoken documents by using latent semantic analysis and self-organizing maps are presented, motivated and tested. The idea is to extract extra information from the structure of the document collection and use it for more accurate indexing by generating new index terms and stochastic index weights. Indexing methods are evaluated for two broadcast news databases (one French and one English) using the average document perplexity defined in this paper and test queries analyzed by human experts.

Proceedings Article
01 Jan 1999
TL;DR: A new method for multivariate discretization based on the use of a latent variable model is described, proposed as a tool to extend the scope of applicability of machine learning algorithms that handle discrete variables only.
Abstract: We describe a new method for multivariate discretization based on the use of a latent variable model. The method is proposed as a tool to extend the scope of applicability of machine learning algorithms that handle discrete variables only.

Journal ArticleDOI
01 Sep 1999
TL;DR: This study explores a new technique for data mining, latent semantic indexing (LSI), an efficient information retrieval method for textual documents that introduces a promising way to discover knowledge in important commercial application areas such as retail and consumer banking.
Abstract: One important focus of data mining research is the development of algorithms for extracting valuable information from large databases in order to facilitate business decisions. This study explores a new technique for data mining: latent semantic indexing (LSI). LSI is an efficient information retrieval method for textual documents. By determining the singular value decomposition (SVD) of a large sparse term-by-document matrix, LSI constructs an approximate vector space model which represents important associative relationships between terms and documents that are not evident in individual documents. This paper explores the applicability of the LSI model to numerical databases, namely consumer product data. By properly choosing attributes of data records as terms or documents, a term-by-document frequency matrix is built, from which a distribution-based indexing scheme is employed to construct a correlated distribution matrix (CDM). An LSI-like vector space model is then used to detect useful or hidden patterns in the numerical data. The extracted information can then be validated using statistical hypothesis testing or resampling. LSI is an automatic yet intelligent indexing method. Its application to numerical data introduces a promising way to discover knowledge in important commercial application areas such as retail and consumer banking.

Journal ArticleDOI
TL;DR: A detailed analysis of matrices satisfying the so-called low-rank-plus-shift property in connection with the computation of their partial singular value decomposition (SVD) is presented, where the term-document matrices generated from a text corpus approximately satisfy this property.
Abstract: We present a detailed analysis of matrices satisfying the so-called low-rank-plus-shift property in connection with the computation of their partial singular value decomposition (SVD). The application we have in mind is latent semantic indexing for information retrieval, where the term-document matrices generated from a text corpus approximately satisfy this property. The analysis is motivated by developing more efficient methods for computing and updating partial SVD of large term-document matrices and gaining deeper understanding of the behavior of the methods in the presence of noise.

Proceedings ArticleDOI
01 May 1999
TL;DR: Latent Semantic Analysis (LSA) was used to compute semantic similarity between task descriptions and labels in an application's menu system; when the labels in the menu system were semantically similar to the task descriptions, subjects performed the tasks faster.
Abstract: Models of learning and performing by exploration assume that the semantic similarity between task descriptions and labels on display objects (e.g., menus, tool bars) in part controls the user's search strategies. Nevertheless, none of the models has an objective way to compute semantic similarity. In this study, Latent Semantic Analysis (LSA) was used to compute semantic similarity between task descriptions and labels in an application's menu system. Participants performed twelve tasks by exploration and were tested for recall after a 1-week delay. When the labels in the menu system were semantically similar to the task descriptions, subjects performed the tasks faster. LSA could be incorporated into any of the current models, and it could be used to automate the evaluation of computer applications for ease of learning and performing by exploration.

Book ChapterDOI
01 Jan 1999
TL;DR: This paper uses an alternate decomposition, the semi-discrete decomposition (SDD), and shows that for equal query times, the SDD does as well as the SVD and uses less than one-tenth the storage for the MEDLINE test set.
Abstract: With the electronic storage of documents comes the possibility of building search engines that can automatically choose documents relevant to a given set of topics. In information retrieval, we wish to match queries with relevant documents. Documents can be represented by the terms that appear within them, but literal matching of terms does not necessarily retrieve all relevant documents. There are a number of information retrieval systems based on inexact matches. Latent Semantic Indexing represents documents by approximations and tends to cluster documents on similar topics even if their term profiles are somewhat different. This approximate representation is usually accomplished using a low-rank singular value decomposition (SVD) approximation. In this paper, we use an alternate decomposition, the semi-discrete decomposition (SDD). For equal query times, the SDD does as well as the SVD and uses less than one-tenth the storage for the MEDLINE test set.

Book ChapterDOI
01 Aug 1999
TL;DR: A probabilistic approach which combines a statistical, model-based analysis with a topological visualization principle is presented which can be utilized to derive topic maps which represent topical information by characteristic keyword distributions arranged in a two-dimensional spatial layout.
Abstract: The visualization of large text databases and document collections is an important step towards more flexible and interactive types of information retrieval. This paper presents a probabilistic approach which combines a statistical, model-based analysis with a topological visualization principle. Our method can be utilized to derive topic maps which represent topical information by characteristic keyword distributions arranged in a two-dimensional spatial layout. Combined with multi-resolution techniques this provides a three-dimensional space for interactive information navigation in large text collections.

Patent
Stuermer Thomas
30 Jun 1999
TL;DR: In this paper, a procedure for the automatic generation of a textual expression from a semantic representation by a computer-system is described, in which a statistical model is determined by the computer system on a plurality of pre-determined pairs of semantic representations and associated expressions and stored.
Abstract: A procedure for the automatic generation of a textual expression from a semantic representation by a computer system is described. In the procedure, a statistical model is determined by the computer system from a plurality of pre-determined pairs of semantic representations and associated expressions, and stored. A semantic representation, from which an associated expression is determined by the computer system by means of the statistical model, is presented to the computer system. These steps are repeated by the computer system for further semantic representations as necessary.


01 Jan 1999
TL;DR: This paper proposes a stochastic index for the documents based on the Latent Semantic Analysis (LSA) of the decoded document contents using Random Mapping and Self-Organizing Maps (SOM).
Abstract: An important problem for the information retrieval from spoken documents is how to extract those relevant documents which are poorly decoded by the speech recognizer. In this paper we propose a stochastic index for the documents based on the Latent Semantic Analysis (LSA) of the decoded document contents. The original LSA approach uses Singular Value Decomposition to reduce the dimensionality of the documents. As an alternative, we propose a computationally more feasible solution using Random Mapping (RM) and Self-Organizing Maps (SOM). The motivation for clustering the documents by SOM is to reduce the effect of recognition errors and to extract new characteristic index terms. Experimental indexing results are presented using relevance judgments for the retrieval results of test queries and using a document perplexity defined in this paper to measure the power of the index models.
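Random Mapping replaces the SVD with a fixed random projection that approximately preserves inner products. A minimal sketch; the dense Gaussian matrix and target dimension are illustrative (sparse random vectors also work), and the paper follows the projection with a SOM.

import numpy as np

def random_mapping(doc_term, dim=100, seed=0):
    # doc_term: (docs, terms) count matrix; returns (docs, dim) codes.
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((doc_term.shape[1], dim))
    r /= np.linalg.norm(r, axis=0, keepdims=True)    # unit-norm columns
    return doc_term @ r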

Journal ArticleDOI
TL;DR: Deerwester (1990)'s Latent Semantic Analysis, which performs latent semantic retrieval by exploiting words that co-occur across different documents, is applied to a relatively large collection of Japanese documents.
Abstract: We applied Deerwester (1990)'s Latent Semantic Analysis, which performs latent semantic retrieval by focusing on words that co-occur across different documents, to a relatively large collection of Japanese documents. In the process, we compared singular value decomposition algorithms for large sparse matrices and identified a method well suited to Japanese document retrieval. Trials on actual newspaper articles indicated that the approach is effective for document retrieval and for displaying related terms. As an implementation detail, we found that normalization by document size is necessary for related-document retrieval. We also experimented with word clustering that permits overlapping clusters.


Proceedings Article
01 Jan 1999
TL;DR: This work uses a frequency based statistical method combined with general hidden markov models in order to learn domain speci c knowledge within a semantic network formalism and uses a dialogue system for German train timetable information.
Abstract: For an e cient linguistic analysis of spoken queries a lot of domain speci c knowledge is needed and usually has to be entered manually into the knowledge base of each domain. This makes the adaption of dialogue systems which base on explicit knowledge representation to new domains a very costly procedure. We use a frequency based statistical method combined with general hidden markov models in order to learn domain speci c knowledge within a semantic network formalism. As a framework we use a dialogue system for German train timetable information. By means of experiments we show that our statistical approach is not only able to reach, but even outperforms previous results with manually entered restrictions.