Showing papers on "Probabilistic latent semantic analysis published in 1998"


Journal ArticleDOI
TL;DR: The adequacy of LSA's reflection of human knowledge has been established in a variety of ways, for example, its scores overlap those of humans on standard vocabulary and subject matter tests; it mimics human word sorting and category judgments; it simulates word‐word and passage‐word lexical priming data.
Abstract: Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual‐usage meaning of words by statistical computations applied to a large corpus of text (Landauer & Dumais, 1997). The underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. The adequacy of LSA's reflection of human knowledge has been established in a variety of ways. For example, its scores overlap those of humans on standard vocabulary and subject matter tests; it mimics human word sorting and category judgments; it simulates word‐word and passage‐word lexical priming data; and, as reported in 3 following articles in this issue, it accurately estimates passage coherence, learnability of passages by individual students, and the quality and quantity of knowledge contained in an essay.
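
As a rough illustration of the statistical computation the abstract refers to, the sketch below builds a small term-document count matrix, reduces it with a truncated SVD, and compares two words by cosine similarity in the reduced space. The toy corpus and the number of dimensions k are illustrative choices, not the corpus or settings used in the paper.

    # Minimal LSA sketch: term-document counts, truncated SVD, cosine similarity
    # of words in the reduced "semantic" space. Toy corpus and k are illustrative only.
    import numpy as np

    docs = [
        "human interface computer",
        "survey user computer system response time",
        "user interface system",
        "trees graph minors",
        "graph minors survey",
    ]
    vocab = sorted({w for d in docs for w in d.split()})
    A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

    k = 2                                    # number of latent dimensions (illustrative)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    word_vecs = U[:, :k] * s[:k]             # each row: one word in the latent space

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    i, j = vocab.index("user"), vocab.index("interface")
    print("similarity(user, interface) =", round(cosine(word_vecs[i], word_vecs[j]), 3))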

4,391 citations


Journal ArticleDOI
TL;DR: A form of nonlinear latent variable model called the generative topographic mapping, for which the parameters of the model can be determined using the expectation-maximization algorithm, is introduced.
Abstract: Latent variable models represent the probability density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. A familiar example is factor analysis, which is based on a linear transformation between the latent space and the data space. In this paper we introduce a form of non-linear latent variable model called the Generative Topographic Mapping, for which the parameters of the model can be determined using the EM algorithm. GTM provides a principled alternative to the widely used Self-Organizing Map (SOM) of Kohonen (1982), and overcomes most of the significant limitations of the SOM. We demonstrate the performance of the GTM algorithm on a toy problem and on simulated data from flow diagnostics for a multi-phase oil pipeline.
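
The following is a much-simplified sketch of the kind of EM loop the abstract describes, assuming a one-dimensional latent grid, Gaussian RBF basis functions, and an isotropic Gaussian noise model; the grid size, basis width, regularizer, and toy data are all illustrative, and details of the published algorithm (such as PCA-based initialization) are omitted.

    # GTM-style EM (simplified): latent grid points are mapped through RBF basis
    # functions and a weight matrix W into data space; the E-step computes
    # responsibilities, the M-step re-solves for W and the noise precision beta.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-2, 2, 200)
    X = np.column_stack([x, np.sin(x) + 0.1 * rng.normal(size=x.size)])  # noisy curve in 2-D
    N, D = X.shape

    K, M = 20, 6                                        # latent grid points, RBF basis functions
    Z = np.linspace(-2, 2, K)[:, None]
    centres = np.linspace(-2, 2, M)[:, None]
    Phi = np.exp(-(Z - centres.T) ** 2 / (2 * 0.5 ** 2))    # K x M design matrix

    W = rng.normal(scale=0.1, size=(M, D))
    beta = 1.0
    for _ in range(50):
        Y = Phi @ W                                             # grid points mapped to data space
        d2 = ((X[None, :, :] - Y[:, None, :]) ** 2).sum(-1)    # K x N squared distances
        R = np.exp(-0.5 * beta * (d2 - d2.min(axis=0)))         # E-step (shifted for stability)
        R /= R.sum(axis=0, keepdims=True)
        G = np.diag(R.sum(axis=1))
        W = np.linalg.solve(Phi.T @ G @ Phi + 1e-6 * np.eye(M), Phi.T @ R @ X)  # M-step for W
        d2 = ((X[None, :, :] - (Phi @ W)[:, None, :]) ** 2).sum(-1)
        beta = N * D / (R * d2).sum()                           # M-step for the noise precision

    print("learned noise std:", round(beta ** -0.5, 3))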

1,469 citations


Proceedings ArticleDOI
01 May 1998
TL;DR: It is proved that under certain conditions LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance.
Abstract: Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that under certain conditions LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. We also propose the technique of random projection as a way of speeding up LSI. We complement our theorems with encouraging experimental results. We also argue that our results may be viewed in a more general framework as a theoretical basis for the use of spectral methods in a wider class of applications, such as collaborative filtering.
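
A minimal sketch of the random-projection speed-up mentioned in the abstract: the term-document matrix is first mapped to a lower-dimensional row space with a random Gaussian matrix, and the (now much cheaper) truncated SVD is computed on the projected matrix. The matrix sizes, projection dimension, and rank below are illustrative assumptions.

    # Random projection followed by LSI: reduce the row dimension of the term-document
    # matrix with a random Gaussian map, then take the SVD of the smaller matrix.
    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.poisson(0.05, size=(5000, 800)).astype(float)    # toy term-document matrix

    d = 300                                                   # projection dimension
    R = rng.normal(size=(d, A.shape[0])) / np.sqrt(d)         # random Gaussian projection
    B = R @ A                                                 # d x docs, much smaller than A

    k = 50
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    doc_vecs = Vt[:k].T * s[:k]                               # documents in k-dim LSI space
    print(doc_vecs.shape)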

1,235 citations


Journal ArticleDOI
TL;DR: The approach for predicting coherence is illustrated by reanalyzing sets of texts from 2 studies that manipulated the coherence of texts and assessed readers’ comprehension; the results indicate that the method is able to predict the effect of text coherence on comprehension and is more effective than simple term‐term overlap measures.
Abstract: Latent Semantic Analysis (LSA) is used as a technique for measuring the coherence of texts. By comparing the vectors for 2 adjoining segments of text in a high‐dimensional semantic space, the method provides a characterization of the degree of semantic relatedness between the segments. We illustrate the approach for predicting coherence through reanalyzing sets of texts from 2 studies that manipulated the coherence of texts and assessed readers’ comprehension. The results indicate that the method is able to predict the effect of text coherence on comprehension and is more effective than simple term‐term overlap measures. In this manner, LSA can be applied as an automated method that produces coherence predictions similar to propositional modeling. We describe additional studies investigating the application of LSA to analyzing discourse structure and examine the potential of LSA as a psychological model of coherence effects in text comprehension.
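
As a loose illustration of the scoring idea described above, the sketch below represents each text segment as the sum of its word vectors and takes the mean cosine between consecutive segments as a coherence estimate. Random vectors stand in for a trained LSA space, so the numbers are not meaningful; only the procedure is.

    # Coherence as mean cosine between adjacent segment vectors in a semantic space.
    # Random word vectors stand in for a trained LSA space (illustration only).
    import numpy as np

    rng = np.random.default_rng(2)
    lsa_dim = 300
    word_vecs = {}                           # stand-in for trained LSA word vectors

    def vec(word):
        return word_vecs.setdefault(word, rng.normal(size=lsa_dim))

    def segment_vector(segment):
        words = segment.lower().replace(".", "").split()
        return sum((vec(w) for w in words), np.zeros(lsa_dim))

    def coherence(segments):
        sims = []
        for a, b in zip(segments, segments[1:]):
            u, v = segment_vector(a), segment_vector(b)
            sims.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
        return float(np.mean(sims))

    text = ["The heart pumps blood through the body.",
            "Blood carries oxygen to the tissues.",
            "Oxygen is released where it is needed."]
    print("coherence estimate:", round(coherence(text), 3))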

776 citations


Book ChapterDOI
01 Jan 1998
TL;DR: This work describes a method for fully automated cross-language document retrieval in which no query translation is required and provides some evidence that this automatic method performs comparably to a retrieval method based on machine translation (MT-LSI).
Abstract: We describe a method for fully automated cross-language document retrieval in which no query translation is required. Queries in one language can retrieve documents in other languages (as well as the original language). This is accomplished by a method that automatically constructs a multi-lingual semantic space using Latent Semantic Indexing (LSI). We present strong preliminary test results for our cross-language LSI (CL-LSI) method for a French-English collection. We also provide some evidence that this automatic method performs comparably to a retrieval method based on machine translation (MT-LSI).
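
A minimal sketch of the training step described in the abstract: each training document is the concatenation of a passage and its translation, so terms from both languages co-occur in the same columns of the term-document matrix and land in one shared latent space after the SVD. The toy sentence pairs and the dimension k are illustrative, not the French-English collection used in the paper.

    # Cross-language LSI sketch: dual-language training documents put English and
    # French terms into one shared latent space. Toy pairs and k are illustrative.
    import numpy as np

    pairs = [
        ("the cat sleeps", "le chat dort"),
        ("the dog eats", "le chien mange"),
        ("the cat eats fish", "le chat mange du poisson"),
    ]
    dual_docs = [(en + " " + fr).split() for en, fr in pairs]
    vocab = sorted({w for d in dual_docs for w in d})
    A = np.array([[d.count(w) for d in dual_docs] for w in vocab], dtype=float)

    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    # semantically matched terms from different languages get similar vectors
    print("cos(cat, chat) =", round(cos(term_vecs[vocab.index("cat")],
                                        term_vecs[vocab.index("chat")]), 3))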

225 citations


Book ChapterDOI
26 Mar 1998
TL;DR: In this article, the authors provide an overview of latent variable models for representing continuous variables and show how a particular form of linear latent variable model can be used to provide a probabilistic formulation of the well-known technique of principal components analysis (PCA).
Abstract: A powerful approach to probabilistic modelling involves supplementing a set of observed variables with additional latent, or hidden, variables. By defining a joint distribution over visible and latent variables, the corresponding distribution of the observed variables is then obtained by marginalization. This allows relatively complex distributions to be expressed in terms of more tractable joint distributions over the expanded variable space. One well-known example of a hidden variable model is the mixture distribution in which the hidden variable is the discrete component label. In the case of continuous latent variables we obtain models such as factor analysis. The structure of such probabilistic models can be made particularly transparent by giving them a graphical representation, usually in terms of a directed acyclic graph, or Bayesian network. In this chapter we provide an overview of latent variable models for representing continuous variables. We show how a particular form of linear latent variable model can be used to provide a probabilistic formulation of the well-known technique of principal components analysis (PCA). By extending this technique to mixtures, and hierarchical mixtures, of probabilistic PCA models we are led to a powerful interactive algorithm for data visualization. We also show how the probabilistic PCA approach can be generalized to non-linear latent variable models leading to the Generative Topographic Mapping algorithm (GTM). Finally, we show how GTM can itself be extended to model temporal data.
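
For the probabilistic formulation of PCA mentioned above, the maximum-likelihood solution has a closed form (Tipping & Bishop): the noise variance is the mean of the discarded covariance eigenvalues and the loading matrix is built from the leading eigenvectors. A short sketch with illustrative toy data:

    # Probabilistic PCA (ML solution): sigma^2 = mean of discarded eigenvalues,
    # W = U_q (L_q - sigma^2 I)^{1/2} from the leading eigenpairs of the covariance.
    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.2])   # toy data, D = 3
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / len(X)                                      # sample covariance

    q = 2                                                        # latent dimensionality
    evals, evecs = np.linalg.eigh(S)
    evals, evecs = evals[::-1], evecs[:, ::-1]                   # sort descending

    sigma2 = evals[q:].mean()                                    # noise variance (ML)
    W = evecs[:, :q] @ np.diag(np.sqrt(evals[:q] - sigma2))      # factor loadings (ML)
    print("sigma^2 =", round(float(sigma2), 4))
    print("W =\n", np.round(W, 3))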

132 citations


Patent
30 Sep 1998
TL;DR: In this paper, a natural language understanding system is described which provides for the generation of concept codes from free-text medical data using a probabilistic model of lexical semantics, in the preferred embodiment of the invention implemented by means of a Bayesian network.
Abstract: A natural language understanding system is described which provides for the generation of concept codes from free-text medical data. A probabilistic model of lexical semantics, in the preferred embodiment of the invention implemented by means of a Bayesian network, is used to determine the most probable concept or meaning associated with a sentence or phrase. The inventive method and system includes the steps of checking for synonyms, checking spelling, performing syntactic parsing, transforming text to its “deep” or semantic form, and performing a semantic analysis based on a probabilistic model of lexical semantics. In the preferred embodiment of the invention, spell checking and transformational processing as well as semantic analysis make use of semantic probabilistic determinations.
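
The patented Bayesian-network model is not reproduced here; purely as a loose illustration of choosing the most probable concept code for a phrase from a probabilistic model of lexical semantics, the sketch below uses a naive Bayes scorer over made-up concept priors and word probabilities.

    # Loose illustration (not the patented Bayesian network): pick the most probable
    # concept code for a phrase with a naive Bayes model. All numbers are toy values.
    import math

    priors = {"chest_pain": 0.5, "fracture": 0.5}
    word_probs = {
        "chest_pain": {"chest": 0.3, "pain": 0.3, "pressure": 0.2},
        "fracture":   {"bone": 0.3, "pain": 0.2, "fall": 0.2},
    }
    UNSEEN = 1e-4   # smoothing for words missing from a concept's table

    def most_probable_concept(phrase):
        words = phrase.lower().split()
        scores = {}
        for concept, prior in priors.items():
            score = math.log(prior)
            for w in words:
                score += math.log(word_probs[concept].get(w, UNSEEN))
            scores[concept] = score
        return max(scores, key=scores.get)

    print(most_probable_concept("pain and pressure in the chest"))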

101 citations


Proceedings Article
01 Jan 1998
TL;DR: It is shown that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with N-grams produces a more robust language model which has a lower perplexity on a Wall Street Journal test set than a baseline N-gram model.
Abstract: We introduce a number of techniques designed to help integrate semantic knowledge with N-gram language models for automatic speech recognition. Our techniques allow us to integrate Latent Semantic Analysis (LSA), a word-similarity algorithm based on word co-occurrence information, with N-gram models. While LSA is good at predicting content words which are coherent with the rest of a text, it is a bad predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with N-grams. We show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with N-grams produces a more robust language model which has a lower perplexity on a Wall Street Journal testset than a baseline N-gram model.
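
A small sketch of the geometric combination the abstract argues for, contrasted with linear interpolation: the combined distribution is proportional to P_ngram^(1-lambda) * P_lsa^lambda and is renormalized over the vocabulary. The toy distributions and the weight lambda are illustrative values, not those tuned in the paper.

    # Combining an N-gram model with an LSA-based model geometrically rather than
    # linearly: P(w|h) ∝ P_ngram(w|h)**(1 - lam) * P_lsa(w|h)**lam, renormalized.
    import numpy as np

    vocab = ["the", "fed", "raised", "interest", "rates"]
    p_ngram = np.array([0.40, 0.05, 0.10, 0.25, 0.20])   # toy N-gram predictions
    p_lsa   = np.array([0.10, 0.30, 0.05, 0.30, 0.25])   # toy LSA predictions

    def combine_linear(p1, p2, lam):
        return (1 - lam) * p1 + lam * p2

    def combine_geometric(p1, p2, lam):
        p = p1 ** (1 - lam) * p2 ** lam
        return p / p.sum()                               # renormalize over the vocabulary

    lam = 0.3
    for name, p in [("linear", combine_linear(p_ngram, p_lsa, lam)),
                    ("geometric", combine_geometric(p_ngram, p_lsa, lam))]:
        print(name, np.round(p, 3))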

99 citations


Journal ArticleDOI
TL;DR: Latent Semantic Analysis (LSA) as mentioned in this paper is a theory of how word meaning is derived from statistics of experience, and of how passage meaning is represented by combinations of words.
Abstract: Latent semantic analysis (LSA) is a theory of how word meaning—and possibly other knowledge—is derived from statistics of experience, and of how passage meaning is represented by combinations of words. Given a large and representative sample of text, LSA combines the way thousands of words are used in thousands of contexts to map a point for each into a common semantic space. LSA goes beyond pair-wise co-occurrence or correlation to find latent dimensions of meaning that best relate every word and passage to every other. After learning from comparable bodies of text, LSA has scored almost as well as humans on vocabulary and subject-matter tests, accurately simulated many aspects of human judgment and behavior based on verbal meaning, and been successfully applied to measure the coherence and conceptual content of text. The surprising success of LSA has implications for the nature of generalization and language.

98 citations


Journal ArticleDOI
TL;DR: Latent class analysis (LCA) is an extremely useful and flexible technique for the analysis of categorical data, measured at the nominal, ordinal, or interval level as discussed by the authors.
Abstract: Latent class analysis (LCA) is an extremely useful and flexible technique for the analysis of categorical data, measured at the nominal, ordinal, or interval level (the latter with fixed or estimat...

47 citations



Journal ArticleDOI
TL;DR: In this paper, Bayesian network modelling is applied to a multidimensional model of depression; the characterization of the probabilistic model exploits expert knowledge to associate latent concentrations of neurotransmitters with symptoms.


Proceedings Article
24 Jul 1998
TL;DR: This work extends cross-language latent semantic indexing to the case in which English-Spanish translations are not available, but instead, translations for documents in both languages are available in a third "bridge" language, say, French.
Abstract: Cross-language latent semantic indexing is a method that learns useful language-independent vector representations of terms through a statistical analysis of document-aligned text. This is accomplished by taking a collection of, say, English paragraphs and their translations in Spanish and processing them by singular value decomposition to yield a high-dimensional vector representation for each term in the collection. These term vectors have the property that semantically similar terms have vectors with high cosine measure, regardless of their source language. In the present work, we extend this approach to the case in which English-Spanish translations are not available, but instead, translations for documents in both languages are available in a third "bridge" language, say, French. Thus, although no aligned English-Spanish documents are used, our method creates a representation in which English and Spanish terms can be compared. The resulting vector representation of terms can be useful in natural language applications such as cross-language information retrieval and machine translation.

Book ChapterDOI
01 Jan 1998
TL;DR: In this paper, the authors present a graphic display of the latent budget analysis (LBA) and latent class analysis (LCA), with special reference to the correspondence analysis (CA).
Abstract: This chapter presents a graphic display of latent budget analysis (LBA) and latent class analysis (LCA), with special reference to correspondence analysis (CA). LBA and LCA are methods used for the analysis of contingency tables. LBA can best be used when one explanatory and one response variable are present and the question of interest is how the expected budgets can be composed of a smaller number of typical or latent budgets. LCA can best be used when the relation between two or more discrete response variables is studied; the question of interest is whether the sample can be split into K latent classes such that the relation among the variables is satisfactorily explained by the classes. CA, on the other hand, visualizes how row profiles can be explained by continuous axes, which can be interpreted as latent traits. The chapter shows the visualization of results for LBA and LCA and how these visualizations are related to those of correspondence analysis, and concludes with a discussion of the latent budget model and the latent class model.

ReportDOI
01 Aug 1998
TL;DR: A detailed analysis is presented of matrices satisfying the so-called low-rank-plus-shift property in connection with the computation of their partial singular value decomposition, with application to Latent Semantic Indexing for information retrieval, where the term-document matrices generated from a text corpus approximately satisfy this property.
Abstract: The authors present a detailed analysis of matrices satisfying the so-called low-rank-plus-shift property in connection with the computation of their partial singular value decomposition. The application they have in mind is Latent Semantic Indexing for information retrieval, where the term-document matrices generated from a text corpus approximately satisfy this property. The analysis is motivated by the goals of developing more efficient methods for computing and updating the partial SVD of large term-document matrices and of gaining a deeper understanding of the behavior of these methods in the presence of noise.
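
A quick numerical illustration of the property being analyzed, under the simplifying assumption that it holds exactly: if A^T A equals a rank-k positive semidefinite matrix plus sigma^2 times the identity, the singular values of A are sqrt(lambda_i + sigma^2) for the top k and exactly sigma thereafter, so a rank-k partial SVD captures everything above the flat noise floor. The sizes, k, and sigma below are arbitrary.

    # Construct a matrix satisfying the low-rank-plus-shift property exactly and
    # check that its singular-value spectrum is k dominant values over a flat tail.
    import numpy as np

    rng = np.random.default_rng(4)
    n_terms, n_docs, k, sigma = 400, 100, 10, 0.5

    U, _ = np.linalg.qr(rng.normal(size=(n_terms, n_docs)))   # orthonormal columns
    V, _ = np.linalg.qr(rng.normal(size=(n_docs, n_docs)))
    s = np.full(n_docs, sigma)
    s[:k] = np.linspace(8.0, 2.0, k)                          # k dominant singular values
    A = (U * s) @ V.T                                         # A^T A = rank-k PSD + sigma^2 I

    sv = np.linalg.svd(A, compute_uv=False)
    print("top-k singular values:", np.round(sv[:k], 2))
    print("tail (should sit at sigma):", np.round(sv[k:k + 5], 2))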

Journal ArticleDOI
TL;DR: Simple examples show the effectiveness of latent profile analysis and of mean-field approximations in the E-step of the EM algorithm, approximations that were first proposed in the engineering and neural-computing literatures.

DOI
01 Aug 1998
TL;DR: In this paper, a Latent Semantic Indexing (LSI) approach is proposed for Chinese news filtering agents that use a character-based and hierarchical filtering scheme, where the traditional vector space model is employed as an information filtering model and each document is converted into a vector of weights of terms.
Abstract: We assess the Latent Semantic Indexing (LSI) approach to Chinese information filtering. In particular, the approach is for Chinese news filtering agents that use a character-based and hierarchical filtering scheme. The traditional vector space model is employed as an information filtering model, and each document is converted into a vector of weights of terms. Instead of using words as terms, as is traditional in information retrieval, terms refer to Chinese characters. LSI captures the semantic relationship between documents and Chinese characters. We use the Singular Value Decomposition (SVD) technique to compress the term space into a lower dimension, which achieves latent association between documents and terms. The results of experiments show that the recall and precision rates of Chinese news filtering using the character-based approach incorporating the LSI technique are satisfactory.
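
A minimal sketch of the character-based scheme described: individual Chinese characters serve as terms, the character-document matrix is reduced by SVD, and an incoming document is matched against a filtering profile by cosine similarity in the reduced space. The toy documents, the profile text, and the dimension k are illustrative stand-ins for the news collection used in the experiments.

    # Character-based LSI filtering sketch: characters as terms, SVD reduction,
    # cosine match against a profile. Toy documents and k are illustrative only.
    import numpy as np

    docs = ["股市上涨", "股票价格上升", "足球比赛结果", "球队赢得比赛"]
    chars = sorted({c for d in docs for c in d})
    A = np.array([[d.count(c) for d in docs] for c in chars], dtype=float)

    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk = U[:, :k], s[:k]

    def fold_in(text):
        q = np.array([text.count(c) for c in chars], dtype=float)
        return (q @ Uk) / sk          # project a new document into the latent space

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    profile = fold_in("股价上涨")      # the user's interest profile
    for d in docs:
        print(d, round(cos(profile, fold_in(d)), 3))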

Book ChapterDOI
01 Jan 1998
TL;DR: An overview of latent variable models is provided, and it is shown how a particular form of linear latent variable model can be used to provide a probabilistic formulation of the well-known technique of principal components analysis (PCA).
Abstract: Most pattern recognition tasks, such as regression, classification and novelty detection, can be viewed in terms of probability density estimation. A powerful approach to probabilistic modelling is to represent the observed variables in terms of a number of hidden, or latent, variables. One well-known example of a hidden variable model is the mixture distribution in which the hidden variable is the discrete component label. In the case of continuous latent variables we obtain models such as factor analysis. In this paper we provide an overview of latent variable models, and we show how a particular form of linear latent variable model can be used to provide a probabilistic formulation of the well-known technique of principal components analysis (PCA). By extending this technique to mixtures, and hierarchical mixtures, of probabilistic PCA models we are led to a powerful interactive algorithm for data visualization. We also show how the probabilistic PCA approach can be generalized to non-linear latent variable models leading to the Generative Topographic Mapping algorithm (GTM). Finally, we show how GTM can itself be extended to model temporal data.

Proceedings Article
01 Dec 1998
TL;DR: Two developments of nonlinear latent variable models based on radial basis functions are discussed: in the first, the use of priors or constraints on allowable models is considered as a means of preserving data structure in low-dimensional representations for visualisation purposes.
Abstract: Two developments of nonlinear latent variable models based on radial basis functions are discussed: in the first, the use of priors or constraints on allowable models is considered as a means of preserving data structure in low-dimensional representations for visualisation purposes; in the second, a resampling approach is introduced which makes more effective use of the latent samples in evaluating the likelihood.