
Showing papers on "Probabilistic latent semantic analysis published in 1999"


Journal ArticleDOI
01 Aug 1999
TL;DR: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data.
Abstract: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
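For concreteness, the EM fit of the latent class (aspect) model can be written compactly in numpy. The following is a minimal illustrative sketch, not Hofmann's code: all names are mine, and tempering, stopping criteria, and sparse-matrix handling are omitted.

import numpy as np

def plsi(counts, n_topics, n_iter=100, seed=0):
    # Fit P(d,w) = sum_z P(z) P(d|z) P(w|z) by EM.
    # counts: (n_docs, n_words) term-frequency matrix n(d, w).
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z = np.full(n_topics, 1.0 / n_topics)                      # P(z)
    p_d_z = rng.random((n_topics, n_docs))                       # P(d|z)
    p_d_z /= p_d_z.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))                      # P(w|z)
    p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), shape (z, d, w).
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        post = joint / joint.sum(0, keepdims=True).clip(1e-12)
        # M-step: reweight the responsibilities by the observed counts.
        weighted = post * counts[None, :, :]
        p_w_z = weighted.sum(1)
        p_w_z /= p_w_z.sum(1, keepdims=True)
        p_d_z = weighted.sum(2)
        p_d_z /= p_d_z.sum(1, keepdims=True)
        p_z = weighted.sum((1, 2))
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z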

4,577 citations


Proceedings Article
30 Jul 1999
TL;DR: This work proposes a widely applicable generalization of maximum likelihood model fitting by tempered EM, based on a mixture decomposition derived from a latent class model, resulting in a more principled approach with a solid foundation in statistics.
Abstract: Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM. Our approach yields substantial and consistent improvements over Latent Semantic Analysis in a number of experiments.
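The tempered E-step differs from the standard one only by an exponent beta that smooths the posteriors; beta starts at 1 and is reduced whenever held-out likelihood stops improving. Below is a hedged sketch reusing the array shapes from the pLSI sketch above; the placement of beta and the multiplicative annealing schedule follow one common reading of tempered EM and may differ from the paper in detail.

def tempered_posteriors(p_z, p_d_z, p_w_z, beta):
    # Raising the likelihood terms to beta < 1 flattens P(z|d,w),
    # which damps overfitting; beta = 1 recovers standard EM.
    joint = p_z[:, None, None] * (p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta
    return joint / joint.sum(0, keepdims=True).clip(1e-12)

# Typical schedule (illustrative): start with beta = 1.0; whenever held-out
# likelihood degrades, set beta *= 0.9 and continue EM from the current
# parameters.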

2,306 citations


Proceedings Article
30 Jul 1999
TL;DR: The Variational Bayes framework as discussed by the authors approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner without resorting to sampling methods.
Abstract: Current methods for learning graphical models with latent variables and a fixed structure estimate optimal values for the model parameters. Whereas this approach usually produces overfitting and suboptimal generalization performance, carrying out the Bayesian program of computing the full posterior distributions over the parameters remains a difficult problem. Moreover, learning the structure of models with latent variables, for which the Bayesian approach is crucial, is yet a harder problem. In this paper I present the Variational Bayes framework, which provides a solution to these problems. This approach approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner without resorting to sampling methods. Unlike in the Laplace approximation, these posteriors are generally non-Gaussian and no Hessian needs to be computed. The resulting algorithm generalizes the standard Expectation Maximization algorithm, and its convergence is guaranteed. I demonstrate that this algorithm can be applied to a large class of models in several domains, including unsupervised clustering and blind source separation.
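As a toy instance of the framework (my example, not one of the paper's models), here is coordinate-ascent variational Bayes for a single Gaussian with unknown mean mu and precision tau under conjugate priors, using a fully factorized posterior q(mu)q(tau); the posteriors come out Gaussian and Gamma analytically, with no sampling and no Hessian.

import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    # Model: x_i ~ N(mu, 1/tau), mu ~ N(mu0, 1/(lam0*tau)), tau ~ Gamma(a0, b0).
    # Returns parameters of q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n).
    n, xbar, xsq = len(x), x.mean(), (x ** 2).sum()
    a_n = a0 + (n + 1) / 2.0          # fixed point: depends only on n
    e_tau = a0 / b0                   # initial E[tau]
    for _ in range(n_iter):
        mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        var_n = 1.0 / lam_n
        # E_q(mu)[ sum_i (x_i - mu)^2 + lam0 * (mu - mu0)^2 ]
        e_quad = (xsq - 2 * mu_n * n * xbar + n * (mu_n ** 2 + var_n)
                  + lam0 * ((mu_n - mu0) ** 2 + var_n))
        b_n = b0 + 0.5 * e_quad
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n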

615 citations


Proceedings Article
31 Jul 1999
TL;DR: This paper presents a statistical approach to collaborative filtering and investigates the use of latent class models for predicting individual choices and preferences based on observed preference behavior and presents EM algorithms for different variants of the aspect model.
Abstract: This paper presents a statistical approach to collaborative filtering and investigates the use of latent class models for predicting individual choices and preferences based on observed preference behavior. Two models are discussed and compared: the aspect model, a probabilistic latent space model which models individual preferences as a convex combination of preference factors, and the two-sided clustering model, which simultaneously partitions persons and objects into clusters. We present EM algorithms for different variants of the aspect model and derive an approximate EM algorithm based on a variational principle for the two-sided clustering model. The benefits of the different models are experimentally investigated on a large movie data set.
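Treating persons as "documents" and items as "words", prediction in the aspect model reduces to the same mixture decomposition, P(item|person) = sum_z P(z|person) P(item|z). A minimal sketch assuming parameters fitted as in the pLSI sketch earlier; names are illustrative, and the paper's EM variants differ in parametrization.

def predict_preferences(p_z, p_u_z, p_v_z):
    # p_z: (z,) P(z); p_u_z: (z, users) P(u|z); p_v_z: (z, items) P(v|z).
    p_z_u = p_z[:, None] * p_u_z              # proportional to P(z, u)
    p_z_u /= p_z_u.sum(0, keepdims=True)      # Bayes' rule: P(z|u)
    return p_z_u.T @ p_v_z                    # (users, items): P(v|u)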

510 citations


Dissertation
01 Jan 1999
TL;DR: This dissertation attempts to meet this need, extending and applying the modern tools of latent variable modeling to problems in neural data analysis, by proposing a new, extremely general, optimization algorithm that may be used to learn the optimal parameter values of arbitrary latent variable models.
Abstract: The brain is perhaps the most complex system to have ever been subjected to rigorous scientific investigation. The scale is staggering: over 10^11 neurons, each making an average of 10^3 synapses, with computation occurring on scales ranging from a single dendritic spine to an entire cortical area. Slowly, we are beginning to acquire experimental tools that can gather the massive amounts of data needed to characterize this system. However, to understand and interpret these data will also require substantial strides in inferential and statistical techniques. This dissertation attempts to meet this need, extending and applying the modern tools of latent variable modeling to problems in neural data analysis. It is divided into two parts. The first begins with an exposition of the general techniques of latent variable modeling. A new, extremely general, optimization algorithm is proposed - called Relaxation Expectation Maximization (REM) - that may be used to learn the optimal parameter values of arbitrary latent variable models. This algorithm appears to alleviate the common problem of convergence to local, sub-optimal likelihood maxima. REM leads to a natural framework for model size selection; in combination with standard model selection techniques, the quality of fits may be further improved, while the appropriate model size is automatically and efficiently determined. Next, a new latent variable model, the mixture of sparse hidden Markov models, is introduced, and approximate inference and learning algorithms are derived for it. This model is applied in the second part of the thesis. The second part brings the technology of part I to bear on two important problems in experimental neuroscience. The first is known as spike sorting; this is the problem of separating the spikes from different neurons embedded within an extracellular recording. The dissertation offers the first thorough statistical analysis of this problem, which then yields the first powerful probabilistic solution. The second problem addressed is that of characterizing the distribution of spike trains recorded from the same neuron under identical experimental conditions. A latent variable model is proposed. Inference and learning in this model leads to new principled algorithms for smoothing and clustering of spike data.

158 citations


Proceedings ArticleDOI
01 Aug 1999
TL;DR: A dual probability model is constructed for the Latent Semantic Indexing using the cosine similarity measure, establishing a statistical framework for LSI and leading to a statistical criterion for the optimal semantic dimensions.
Abstract: A dual probability model is constructed for Latent Semantic Indexing (LSI) using the cosine similarity measure. Both the document-document similarity matrix and the term-term similarity matrix naturally arise from the maximum likelihood estimation of the model parameters, and the optimal solutions are the latent semantic vectors of LSI. Dimensionality reduction is justified by the statistical significance of latent semantic vectors as measured by the likelihood of the model. This leads to a statistical criterion for the optimal semantic dimensions, answering a critical open question in LSI with practical importance. Thus the model establishes a statistical framework for LSI. Ambiguities related to statistical modeling of LSI are clarified.

152 citations


Patent
Jerome R. Bellegarda
12 Mar 1999
TL;DR: In this paper, a hybrid stochastic language model is proposed that integrates a latent semantic analysis language model into an n-gram probability language model for the direct processing of speech signals.
Abstract: Speech or acoustic signals are processed directly using a hybrid stochastic language model produced by integrating a latent semantic analysis language model into an n-gram probability language model. The latent semantic analysis language model probability is computed using a first pseudo-document vector that is derived from a second pseudo-document vector with the pseudo-document vectors representing pseudo-documents created from the signals received at different times. The first pseudo-document vector is derived from the second pseudo-document vector by updating the second pseudo-document vector directly in latent semantic analysis space in response to at least one addition of a candidate word of the received speech signals to the pseudo-document represented by the second pseudo-document vector. Updating precludes mapping a sparse representation for a pseudo-document into the latent semantic space to produce the first pseudo-document vector. A linguistic message representative of the received speech signals is generated.
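Because the standard LSA mapping from a term-count vector d to the latent space, v = d @ U_k @ inv(S_k), is linear, appending one word to a pseudo-document can be absorbed as an additive update directly in latent space, which matches the patent's stated goal of avoiding a re-projection of the sparse representation. A hedged sketch; names are mine, and the patent's term weighting and normalization are omitted.

def fold_in_update(v_old, word_row, s_k):
    # v_old: current pseudo-document vector in the k-dim latent space.
    # word_row: the added word's row of U_k; s_k: top-k singular values.
    # Adding the indicator e_w to the counts adds U_k[w] / S_k to v.
    return v_old + word_row / s_k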

150 citations


Proceedings ArticleDOI
12 Oct 1999
TL;DR: Applying Latent Semantic Analysis to the domain of source code and internal documentation for the support of software reuse is a new application of this method and a departure from the normal application domain of natural language.
Abstract: The paper describes the initial results of applying Latent Semantic Analysis (LSA) to program source code and associated documentation. Latent Semantic Analysis is a corpus based statistical method for inducing and representing aspects of the meanings of words and passages (of natural language) as reflected in their usage. This methodology is assessed for application to the domain of software components (i.e., source code and its accompanying documentation). The intent of applying Latent Semantic Analysis to software components is to automatically induce a specific semantic meaning of a given component. Here LSA is used as the basis to cluster software components. Results of applying this method to the LEDA library and MINIX operating system are given. Applying Latent Semantic Analysis to the domain of source code and internal documentation for the support of software reuse is a new application of this method and a departure from the normal application domain of natural language.

83 citations


Proceedings Article
31 Jul 1999
TL;DR: This work has used LSA as a mechanism for evaluating the quality of student responses in an intelligent tutoring system, and its performance equals that of human raters with intermediate domain knowledge.
Abstract: Latent Semantic Analysis (LSA) is a statistical, corpus-based text comparison mechanism that was originally developed for the task of information retrieval, but in recent years has produced remarkably human-like abilities in a variety of language tasks. LSA has taken the Test of English as a Foreign Language and performed as well as non-native English speakers who were successful college applicants. It has shown an ability to learn words at a rate similar to humans. It has even graded papers as reliably as human graders. We have used LSA as a mechanism for evaluating the quality of student responses in an intelligent tutoring system, and its performance equals that of human raters with intermediate domain knowledge. It has been claimed that LSA's text-comparison abilities stem primarily from its use of a statistical technique called singular value decomposition (SVD) which compresses a large amount of term and document co-occurrence information into a smaller space. This compression is said to capture the semantic information that is latent in the corpus itself. We test this claim by comparing LSA to a version of LSA without SVD, as well as a simple keyword matching model.
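The comparison described here can be illustrated with a small sketch: score a student answer against an ideal answer by cosine similarity once in the raw term space (keyword matching) and once in an SVD-reduced space (LSA). Everything below, including the folding convention and the choice of k, is an assumption for illustration, not the paper's setup.

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def scores(term_doc, answer_counts, ideal_counts, k=100):
    # term_doc: (terms, docs) training matrix; answers are term-count vectors.
    u, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    terms_k = u[:, :k]                       # one common folding convention
    keyword = cosine(answer_counts, ideal_counts)
    lsa = cosine(answer_counts @ terms_k, ideal_counts @ terms_k)
    return keyword, lsa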

75 citations



Book ChapterDOI
01 Jan 1999
TL;DR: Methods for enhancing the indexing of spoken documents using latent semantic analysis and self-organizing maps are presented and tested; they extract extra information from the structure of the document collection and use it for more accurate indexing by generating new index terms and stochastic index weights.
Abstract: This paper describes an important application for state-of-the-art automatic speech recognition, natural language processing and information retrieval systems. Methods for enhancing the indexing of spoken documents by using latent semantic analysis and self-organizing maps are presented, motivated and tested. The idea is to extract extra information from the structure of the document collection and use it for more accurate indexing by generating new index terms and stochastic index weights. Indexing methods are evaluated for two broadcast news databases (one French and one English) using the average document perplexity defined in this paper and test queries analyzed by human experts.

Proceedings Article
01 Jan 1999
TL;DR: A new method for multivariate discretization based on the use of a latent variable model is described, proposed as a tool to extend the scope of applicability of machine learning algorithms that handle discrete variables only.
Abstract: We describe a new method for multivariate discretization based on the use of a latent variable model. The method is proposed as a tool to extend the scope of applicability of machine learning algorithms that handle discrete variables only.

Journal ArticleDOI
01 Sep 1999
TL;DR: This study explores a new technique for data mining, latent semantic indexing (LSI), an efficient information retrieval method for textual documents that introduces a promising way to discover knowledge in important commercial application areas such as retail and consumer banking.
Abstract: One important focus of data mining research is the development of algorithms for extracting valuable information from large databases in order to facilitate business decisions. This study explores a new technique for data mining: latent semantic indexing (LSI). LSI is an efficient information retrieval method for textual documents. By determining the singular value decomposition (SVD) of a large sparse term-by-document matrix, LSI constructs an approximate vector space model which represents important associative relationships between terms and documents that are not evident in individual documents. This paper explores the applicability of the LSI model to numerical databases, namely consumer product data. By properly choosing attributes of data records as terms or documents, a term-by-document frequency matrix is built, from which a distribution-based indexing scheme is employed to construct a correlated distribution matrix (CDM). An LSI-like vector space model is then used to detect useful or hidden patterns in the numerical data. The extracted information can then be validated using statistical hypothesis testing or resampling. LSI is an automatic yet intelligent indexing method. Its application to numerical data introduces a promising way to discover knowledge in important commercial application areas such as retail and consumer banking.

Journal ArticleDOI
TL;DR: A detailed analysis of matrices satisfying the so-called low-rank-plus-shift property in connection with the computation of their partial singular value decomposition (SVD) is presented, where the term-document matrices generated from a text corpus approximately satisfy this property.
Abstract: We present a detailed analysis of matrices satisfying the so-called low-rank-plus-shift property in connection with the computation of their partial singular value decomposition (SVD). The application we have in mind is latent semantic indexing for information retrieval, where the term-document matrices generated from a text corpus approximately satisfy this property. The analysis is motivated by developing more efficient methods for computing and updating partial SVD of large term-document matrices and gaining deeper understanding of the behavior of the methods in the presence of noise.

Proceedings ArticleDOI
01 May 1999
TL;DR: Latent Semantic Analysis (LSA) was used to compute semantic similarity between task descriptions and labels in an application's menu system; when the labels in the menu system were semantically similar to the task descriptions, subjects performed the tasks faster.
Abstract: Models of learning and performing by exploration assume that the semantic similarity between task descriptions and labels on display objects (e.g., menus, tool bars) in part controls the user's search strategies. Nevertheless, none of the models has an objective way to compute semantic similarity. In this study, Latent Semantic Analysis (LSA) was used to compute semantic similarity between task descriptions and labels in an application's menu system. Participants performed twelve tasks by exploration and were tested for recall after a 1-week delay. When the labels in the menu system were semantically similar to the task descriptions, subjects performed the tasks faster. LSA could be incorporated into any of the current models, and it could be used to automate the evaluation of computer applications for ease of learning and performing by exploration.

Book ChapterDOI
01 Jan 1999
TL;DR: This paper uses an alternate decomposition, the semi-discrete decomposition (SDD), and shows that for equal query times, the SDD does as well as the SVD and uses less than one-tenth the storage for the MEDLINE test set.
Abstract: With the electronic storage of documents comes the possibility of building search engines that can automatically choose documents relevant to a given set of topics. In information retrieval, we wish to match queries with relevant documents. Documents can be represented by the terms that appear within them, but literal matching of terms does not necessarily retrieve all relevant documents. There are a number of information retrieval systems based on inexact matches. Latent Semantic Indexing represents documents by approximations and tends to cluster documents on similar topics even if their term profiles are somewhat different. This approximate representation is usually accomplished using a low-rank singular value decomposition (SVD) approximation. In this paper, we use an alternate decomposition, the semi-discrete decomposition (SDD). For equal query times, the SDD does as well as the SVD and uses less than one-tenth the storage for the MEDLINE test set.

Book ChapterDOI
01 Aug 1999
TL;DR: A probabilistic approach which combines a statistical, model-based analysis with a topological visualization principle is presented which can be utilized to derive topic maps which represent topical information by characteristic keyword distributions arranged in a two-dimensional spatial layout.
Abstract: The visualization of large text databases and document collections is an important step towards more flexible and interactive types of information retrieval. This paper presents a probabilistic approach which combines a statistical, model-based analysis with a topological visualization principle. Our method can be utilized to derive topic maps which represent topical information by characteristic keyword distributions arranged in a two-dimensional spatial layout. Combined with multi-resolution techniques this provides a three-dimensional space for interactive information navigation in large text collections.

Patent
Stuermer Thomas
30 Jun 1999
TL;DR: In this paper, a procedure for the automatic generation of a textual expression from a semantic representation by a computer-system is described, in which a statistical model is determined by the computer system on a plurality of pre-determined pairs of semantic representations and associated expressions and stored.
Abstract: A procedure for the automatic generation of a textual expression from a semantic representation by a computer system is described. In the procedure, a statistical model is determined by the computer system from a plurality of pre-determined pairs of semantic representations and associated expressions, and stored. A semantic representation, from which an associated expression is determined by the computer system by means of the statistical model, is presented to the computer system. These steps are repeated by the computer system for further semantic representations as necessary.


01 Jan 1999
TL;DR: This paper proposes a stochastic index for the documents based on the Latent Semantic Analysis (LSA) of the decoded document contents using Random Mapping and Self-Organizing Maps (SOM).
Abstract: An important problem for the information retrieval from spoken documents is how to extract those relevant documents which are poorly decoded by the speech recognizer. In this paper we propose a stochastic index for the documents based on the Latent Semantic Analysis (LSA) of the decoded document contents. The original LSA approach uses Singular Value Decomposition to reduce the dimensionality of the documents. As an alternative, we propose a computationally more feasible solution using Random Mapping (RM) and Self-Organizing Maps (SOM). The motivation for clustering the documents by SOM is to reduce the effect of recognition errors and to extract new characteristic index terms. Experimental indexing results are presented using relevance judgments for the retrieval results of test queries and using a document perplexity defined in this paper to measure the power of the index models.
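Random Mapping replaces the SVD with a fixed random projection that approximately preserves inner products. A minimal sketch; the dense Gaussian matrix and target dimension are illustrative (sparse random vectors also work), and the paper follows the projection with a SOM.

import numpy as np

def random_mapping(doc_term, dim=100, seed=0):
    # doc_term: (docs, terms) count matrix; returns (docs, dim) codes.
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((doc_term.shape[1], dim))
    r /= np.linalg.norm(r, axis=0, keepdims=True)    # unit-norm columns
    return doc_term @ r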

Journal ArticleDOI
TL;DR: Deerwester (1990)'s Latent Semantic Analysis, which performs latent semantic retrieval by exploiting words that co-occur across different documents, is applied to a relatively large collection of Japanese documents.
Abstract: We applied Deerwester (1990)'s Latent Semantic Analysis, which performs latent semantic retrieval by focusing on words that co-occur across different documents, to a relatively large collection of Japanese documents. In the process, we compared singular value decomposition algorithms for large sparse matrices and identified a method well suited to Japanese document retrieval. Trials on actual newspaper articles indicated that the approach is effective for document retrieval and for displaying related terms. As an implementation detail, we found that normalization by document size is necessary for related-document retrieval. We also experimented with word clustering that permits overlapping clusters.


Proceedings Article
01 Jan 1999
TL;DR: This work uses a frequency based statistical method combined with general hidden markov models in order to learn domain speci c knowledge within a semantic network formalism and uses a dialogue system for German train timetable information.
Abstract: For an e cient linguistic analysis of spoken queries a lot of domain speci c knowledge is needed and usually has to be entered manually into the knowledge base of each domain. This makes the adaption of dialogue systems which base on explicit knowledge representation to new domains a very costly procedure. We use a frequency based statistical method combined with general hidden markov models in order to learn domain speci c knowledge within a semantic network formalism. As a framework we use a dialogue system for German train timetable information. By means of experiments we show that our statistical approach is not only able to reach, but even outperforms previous results with manually entered restrictions.