
Showing papers on "Probabilistic latent semantic analysis published in 2001"


Proceedings Article
03 Jan 2001
TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, the mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

25,546 citations
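As a hedged illustration of the generative story in the abstract (per-document Dirichlet topic proportions, per-word topic draws), here is a minimal numpy sketch; the topic count, vocabulary size, and hyperparameter values are arbitrary choices, not the paper's.

```python
# Illustrative sketch of the generative process described above (not the
# paper's code): each document draws topic proportions from a Dirichlet,
# then each word draws a topic and then a term.
import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 1000                      # number of topics, vocabulary size (illustrative)
alpha = np.full(K, 0.1)             # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.full(V, 0.01), size=K)  # per-topic word distributions

def generate_document(n_words):
    theta = rng.dirichlet(alpha)                        # latent topic mixture for this document
    topics = rng.choice(K, size=n_words, p=theta)       # topic assignment per word
    words = [rng.choice(V, p=beta[z]) for z in topics]  # word draw per assigned topic
    return words

doc = generate_document(50)
```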


Journal ArticleDOI
Thomas Hofmann
TL;DR: This paper proposes a temperature-controlled version of the Expectation Maximization algorithm for model fitting, which has shown excellent performance in practice and results in a more principled approach with a solid foundation in statistical inference.
Abstract: This paper presents a novel statistical method for factor analysis of binary and count data which is closely related to a technique known as Latent Semantic Analysis. In contrast to the latter method, which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed technique uses a generative latent class model to perform a probabilistic mixture decomposition. This results in a more principled approach with a solid foundation in statistical inference. More precisely, we propose to make use of a temperature-controlled version of the Expectation Maximization algorithm for model fitting, which has shown excellent performance in practice. Probabilistic Latent Semantic Analysis has many applications, most prominently in information retrieval, natural language processing, machine learning from text, and in related areas. The paper presents perplexity results for different types of text and linguistic data collections and discusses an application in automated document indexing. The experiments indicate substantial and consistent improvements of the probabilistic method over standard Latent Semantic Analysis.

2,574 citations
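A minimal sketch of what a tempered EM fit for the aspect model might look like, assuming a documents-by-words count matrix `n` and a temperature `beta_temp` that dampens the E-step posteriors; initialisation and stopping rules are simplified and not taken from the paper.

```python
# Hedged sketch of tempered EM for the aspect model (pLSA-style), following
# the abstract's description; all settings are illustrative.
import numpy as np

def plsa_tempered_em(n, K, beta_temp=0.9, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    D, W = n.shape
    p_z_d = rng.dirichlet(np.ones(K), size=D)      # P(z|d), one row per document
    p_w_z = rng.dirichlet(np.ones(W), size=K)      # P(w|z), one row per aspect
    for _ in range(iters):
        # E-step: tempered posterior P(z|d,w) proportional to (P(z|d) P(w|z))^beta_temp
        post = (p_z_d[:, :, None] * p_w_z[None, :, :]) ** beta_temp   # D x K x W
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from expected counts
        exp_counts = n[:, None, :] * post                              # D x K x W
        p_w_z = exp_counts.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = exp_counts.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```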


Patent
26 Jul 2001
TL;DR: In this paper, a probabilistic Latent Semantic Analysis (PLSA) model is used to integrate textual and other content descriptions of items to be searched, user profiles, demographic information, query logs of previous searches, and explicit user ratings of items.
Abstract: The disclosed system implements a novel method for personalized filtering of information and automated generation of user-specific recommendations. The system uses a statistical latent class model, also known as Probabilistic Latent Semantic Analysis, to integrate data including textual and other content descriptions of items to be searched, user profiles, demographic information, query logs of previous searches, and explicit user ratings of items. The disclosed system learns one or more statistical models based on available data. The learning may be repeated as additional data becomes available. The statistical model, once learned, is utilized in various ways: to make predictions about item relevance and user preferences on un-rated items, to generate recommendation lists of items, to generate personalized search result lists, to disambiguate a user's query, to refine a search, to compute similarities between items or users, and for data mining purposes such as identifying user communities.

645 citations
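A schematic of the prediction step the patent describes, assuming latent class parameters `p_z_u` (class given user) and `p_i_z` (item given class) have already been fitted, e.g. by a pLSA-style EM; all names are illustrative.

```python
# Illustrative scoring of un-rated items for a user by summing over latent
# classes; not the patent's actual implementation.
import numpy as np

def recommend(p_z_u, p_i_z, user, top_n=10):
    # P(item|user) = sum_z P(z|user) P(item|z)
    scores = p_z_u[user] @ p_i_z
    return np.argsort(scores)[::-1][:top_n]   # highest-scoring items first
```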


Proceedings ArticleDOI
06 Jul 2001
TL;DR: A model for framing data mining tasks and a unified approach to solving the resulting data mining problems using spectral analysis are presented, giving strong justification for the use of spectral techniques for latent semantic indexing, collaborative filtering, and web site ranking.
Abstract: Experimental evidence suggests that spectral techniques are valuable for a wide range of applications. A partial list of such applications includes (i) semantic analysis of documents used to cluster documents into areas of interest, (ii) collaborative filtering --- the reconstruction of missing data items, and (iii) determining the relative importance of documents based on citation/link structure. Intuitive arguments can explain some of the phenomena that have been observed, but little theoretical study has been done. In this paper we present a model for framing data mining tasks and a unified approach to solving the resulting data mining problems using spectral analysis. These results give strong justification for the use of spectral techniques for latent semantic indexing, collaborative filtering, and web site ranking.

322 citations
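The spectral core shared by these applications is a truncated SVD of the data matrix; a minimal numpy sketch, with the rank `k` as an illustrative parameter:

```python
# Rank-k spectral approximation of a term-document (or user-item) matrix A.
import numpy as np

def truncated_svd(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# U_k @ np.diag(s_k) @ Vt_k reconstructs the best rank-k approximation of A.
```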


Journal ArticleDOI
28 Jun 2001
TL;DR: This paper describes how the LSI approach can be implemented in a kernel-defined feature space; experimental results demonstrate that the approach can significantly improve performance and does not impair it.
Abstract: Kernel methods like support vector machines have successfully been used for text categorization. A standard choice of kernel function has been the inner product between the vector-space representations of two documents, in analogy with classical information retrieval (IR) approaches. Latent semantic indexing (LSI) has been successfully used for IR purposes as a technique for capturing semantic relations between terms and inserting them into the similarity measure between two documents. One of its main drawbacks, in IR, is its computational cost. In this paper we describe how the LSI approach can be implemented in a kernel-defined feature space. We provide experimental results demonstrating that the approach can significantly improve performance, and that it does not impair it.

303 citations
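One way to realise LSI in a kernel-defined feature space, sketched under the assumption that the eigenvectors of the documents' Gram matrix stand in for the right singular vectors of the term-document matrix (kernel-PCA-style); centering and kernel choice are omitted:

```python
# Hedged sketch: project documents into a k-dimensional latent semantic
# space using only their kernel (Gram) matrix.
import numpy as np

def latent_semantic_projection(K_gram, k):
    # K_gram[i, j] = kernel between documents i and j (e.g., inner product)
    w, V = np.linalg.eigh(K_gram)               # eigenvalues in ascending order
    w, V = w[::-1][:k], V[:, ::-1][:, :k]       # keep the top-k eigenpairs
    return V * np.sqrt(np.maximum(w, 0))        # document coordinates in k dims
```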


Journal ArticleDOI
TL;DR: Analyses over several data sets suggest that LC factor models typically fit data better and provide results that are easier to interpret than the corresponding LC cluster models.
Abstract: We propose an alternative method of conducting exploratory latent class analysis that utilizes latent class factor models, and compare it to the more traditional approach based on latent class cluster models. We show that when formulated in terms of R mutually independent, dichotomous latent factors, the LC factor model has the same number of distinct parameters as an LC cluster model with R+1 clusters. Analyses over several data sets suggest that LC factor models typically fit data better and provide results that are easier to interpret than the corresponding LC cluster models. We also introduce a new graphical bi-plot display for LC factor models and compare it to similar plots used in correspondence analysis and to a barycentric coordinate display for LC cluster models. New results on identification of LC models are also presented. We conclude by describing various model extensions and an approach for eliminating boundary solutions in identified and unidentified LC models, which we have implemented in a new computer program.

290 citations
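As a hedged rendering of the structure being compared, the LC factor model with R mutually independent dichotomous factors can be written as follows (our notation, assuming the usual conditional independence of items given the factors):

```latex
% Schematic LC factor model: R mutually independent dichotomous latent
% factors X_1,...,X_R; observed items Y_1,...,Y_J conditionally independent
% given the factor pattern x.
P(\mathbf{Y}=\mathbf{y}) \;=\;
  \sum_{\mathbf{x}\in\{0,1\}^{R}}
  \left[\prod_{r=1}^{R} P(X_r = x_r)\right]
  \prod_{j=1}^{J} P\!\left(Y_j = y_j \mid \mathbf{X}=\mathbf{x}\right)
```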



Patent
08 May 2001
TL;DR: A computer-based information search and retrieval system and method for retrieving textual digital objects is presented; it makes full use of the projections of documents onto both the reduced document space characterized by the singular value decomposition-based latent semantic structure and its orthogonal space.
Abstract: A computer-based information search and retrieval system and method for retrieving textual digital objects that makes full use of the projections of documents onto both the reduced document space characterized by the singular value decomposition-based latent semantic structure and its orthogonal space. The resulting system and method has increased robustness, mitigating the instability of traditional keyword search engines caused by the synonymy and/or polysemy of natural language, and is therefore particularly suitable for web document searching over a distributed computer network such as the Internet.

218 citations
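A minimal sketch of the dual representation the patent claims, assuming `U_k` holds the top-k left singular vectors of the term-document matrix; the document is described both by its coordinates in the latent subspace and by the size of its residual in the orthogonal space:

```python
# Illustrative dual representation: latent-subspace coordinates plus the
# orthogonal-complement residual norm. Names are ours, not the patent's.
import numpy as np

def dual_representation(U_k, doc_vec):
    coords = U_k.T @ doc_vec               # coordinates in the latent subspace
    residual = doc_vec - U_k @ coords      # component in the orthogonal space
    return coords, np.linalg.norm(residual)
```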


01 Jan 2001
TL;DR: Experimental results on the use of LSA for the analysis of English literature texts are presented; several preliminary transformations of the frequency text-document matrix with different weight functions are tested on the basis of control subsets.
Abstract: This paper presents experimental results on the use of LSA for the analysis of English literature texts. Several preliminary transformations of the frequency text-document matrix with different weight functions are tested on the basis of control subsets. Additional clustering based on the correlation matrix is applied in order to reveal the latent structure. The algorithm creates a shaded form matrix via singular values and vectors. The results are interpreted as a measure of the quality of the transformations and compared to the control set tests.

129 citations
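The specific weight functions tested in the paper are not given here; as one plausible example of such a preliminary transformation, a log-entropy weighting of the frequency text-document matrix might look like this (a hedged sketch, our choice of weighting):

```python
# Local log weighting combined with a global entropy weight, applied before
# the SVD. Purely illustrative of the family of transformations described.
import numpy as np

def log_entropy_weight(F):
    # F[i, j] = frequency of term i in document j
    n_docs = F.shape[1]
    p = F / np.maximum(F.sum(axis=1, keepdims=True), 1e-12)
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = np.nansum(p * np.log(p), axis=1) / np.log(n_docs)
    g = 1.0 + ent                          # global entropy weight per term, in [0, 1]
    return g[:, None] * np.log1p(F)        # weighted matrix, ready for the SVD
```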


Journal ArticleDOI
TL;DR: A novel statistical method for factor analysis of binary and count data which is closely related to a technique known as Latent Semantic Analysis is presented.
Abstract: This paper presents a novel statistical method for factor analysis of binary and count data which is closely related to a technique known as Latent Semantic Analysis. In contrast to the latter method, which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed technique uses a generative latent class model to perform a probabilistic mixture decomposition.

103 citations


11 Oct 2001
TL;DR: A Bayesian mixture model for probabilistic latent semantic analysis of documents with images and text is presented; it enables a priori knowledge, such as word and image preferences, to be encoded.
Abstract: We present a Bayesian mixture model for probabilistic latent semantic analysis of documents with images and text. The Bayesian perspective allows us to perform automatic regularisation to obtain sparser and more coherent clustering models. It also enables us to encode a priori knowledge, such as word and image preferences. The learnt model can be used for browsing digital databases, information retrieval with image and/or text queries, image annotation (adding words to an image) and text illustration (adding images to a text).
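One common way such a priori preferences can enter a Bayesian mixture model is as Dirichlet pseudo-counts added in the M-step; this is a hedged sketch of that general idea, not the paper's actual inference scheme:

```python
# Schematic MAP-style update: prior pseudo-counts regularise the estimates
# and can bias them toward preferred words/images. Illustrative only.
import numpy as np

def map_update(expected_counts, prior_counts):
    # expected_counts: K x V expected word counts per cluster (from an E-step)
    # prior_counts:    K x V Dirichlet pseudo-counts encoding preferences
    posterior = expected_counts + prior_counts
    return posterior / posterior.sum(axis=1, keepdims=True)
```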

01 Jan 2001
TL;DR: Latent semantic indexing in conjunction with two different ordination techniques is employed to construct a semantic Reuters news wire space; topological information helps to identify the appropriate levels of granularity at which the information space can be visually explored.
Abstract: The geographic concepts of region and scale can be preserved in semantic information spaces and depicted cartographically. Region and scale are fundamental to geographical analysis, and are also associated with cognitive and experiential properties of the real world. Scale is important when graphically representing a spatialization, as it affects the amount of detail that can be shown. Latent semantic indexing in conjunction with two different ordination techniques is employed to construct a semantic Reuters news wire space. Intramax, a hierarchical clustering algorithm, is applied to delineate semantic regions in the Reuters database based on a functional distance measure. This topological information helps to identify the appropriate levels of granularity at which the information space can be visually explored. Amplification of ordination techniques with the Intramax procedure is a useful strategy for creating scale-dependent information spaces that facilitate the exploration of abstract, complex data archives.

04 Aug 2001
TL;DR: Experimental results show that accounting for semantic information in fact decreases performance compared to standalone LSI; the main weaknesses of the current hybrid scheme are discussed and several directions for improvement are sketched.
Abstract: A new approach for constructing pseudo-keywords, referred to as Sense Units, is proposed. Sense Units are obtained by a word clustering process, where the underlying similarity reflects both statistical and semantic properties, respectively detected through Latent Semantic Analysis and WordNet. Sense Units are used to recode documents and are evaluated by the performance increase they permit in classification tasks. Experimental results show that accounting for semantic information in fact decreases performance compared to standalone LSI. The main weaknesses of the current hybrid scheme are discussed and several directions for improvement are sketched.
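A hedged sketch of the hybrid similarity idea: blend an LSA-based similarity with a WordNet-derived one and cluster words on the result. The blend weight, the clustering method, and all names are our illustrative choices, not the paper's:

```python
# Combine statistical and semantic word similarities, then cluster words
# into pseudo-keyword groups with agglomerative clustering.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def sense_units(lsa_sim, wordnet_sim, weight=0.5, n_clusters=100):
    # lsa_sim, wordnet_sim: symmetric words x words similarity matrices in [0, 1]
    sim = weight * lsa_sim + (1 - weight) * wordnet_sim
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")  # cluster id per word
```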

Journal Article
TL;DR: An object-oriented model in which the semantic features of the UMLS are made available through four major classes for representing Metathesaurus concepts, semantic types, inter-concept relationships and Semantic Network relationships is proposed.
Abstract: Several information models have been developed for the Unified Medical Language System (UMLS). While some models are term-oriented, a knowledge-oriented model is needed for representing semantic locality, i.e. the various semantic links among concepts. We propose an object-oriented model in which the semantic features of the UMLS are made available through four major classes for representing Metathesaurus concepts, semantic types, inter-concept relationships and Semantic Network relationships. Additional semantic methods for reducing the complexity of the hierarchical relationships represented in the UMLS are proposed. Implementation details are presented, as well as examples of use. The interest of this approach is discussed.
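A hypothetical Python rendering of the four major classes the abstract names; every class and attribute name here is our own illustration, not the authors' implementation:

```python
# Hypothetical sketch of the four major classes described in the abstract.
from dataclasses import dataclass, field

@dataclass
class SemanticType:
    name: str                                  # a Semantic Network type

@dataclass
class Concept:                                 # a Metathesaurus concept
    cui: str
    terms: list[str] = field(default_factory=list)
    semantic_types: list[SemanticType] = field(default_factory=list)

@dataclass
class InterConceptRelationship:                # link between two concepts
    source: Concept
    target: Concept
    label: str

@dataclass
class SemanticNetworkRelationship:             # link between semantic types
    source: SemanticType
    target: SemanticType
    label: str
```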


01 Jan 2001
TL;DR: A signal processing algorithm which discovers the hierarchical organization of a document or media presentation is described, using latent semantic indexing to describe the semantic content of the signal, and scale-space segmentation to describe its features at many different scales.
Abstract: This paper describes a signal processing algorithm which discovers the hierarchical organization of a document or media presentation. We use latent semantic indexing to describe the semantic content of the signal, and scale-space segmentation to describe its features at many different scales. We represent the semantic content of the document as a signal that varies through the document. We lowpass filter this signal to compute the document's semantic path at many different time scales and then look for changes. The changes are sorted by their strength to form a hierarchical segmentation. We present results from a text document and a video transcript.

1. THE PROBLEM

As prices decline and storage and computational horsepower increase, we will soon be swamped in multimedia data. Unfortunately, given an audio or a video signal, there is little information readily available that can help us find our way around such a time-based signal. Technical papers are structured into major and minor headings, imposing a hierarchical structure. Often professional or high-quality audio–visual (AV) presentations are also structured; however, this information is hidden in the signal. Our goal is to use the semantic information in the AV signal to create a hierarchical table of contents that describes the associated signal. Towards this end we combine two powerful concepts: scale-space (SS) filtering and Latent Semantic Indexing (LSI). We use LSI to provide a continuously valued feature that describes the semantic content of an AV signal. By doing this we reduce the dimensionality of the problem and, more importantly, we address synonymy and polysemy as LSI does. The combined approach remains language independent. We use scale-space techniques to represent the semantic signal over many different time scales. We are looking for changes in the signal, and scale space allows us to talk about features of the document that span from a single sentence to entire chapters. The scale parameter specifies the level of detail for our analysis. Intuitively, at small scales we are looking at the individual trees, and at large scales we are seeing the entire forest. We look at a wide range of scales to determine when the content of the signal has changed. In Section 5 we use Latent Semantic Indexing (LSI) as a means to describe the semantic content of a signal.
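A minimal sketch of the two-stage idea, assuming sentence-level LSI vectors are available; the Gaussian filter widths and the change score are our simplifications of the scale-space analysis:

```python
# Smooth the "semantic signal" at several scales and accumulate the size of
# the change at each candidate boundary; stronger boundaries segment coarser
# units. Illustrative, not the paper's exact algorithm.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def boundary_strength(lsi_vectors, scales=(2, 4, 8, 16)):
    # lsi_vectors: sentences x k matrix of LSI coordinates
    strength = np.zeros(len(lsi_vectors) - 1)
    for s in scales:
        smooth = gaussian_filter1d(lsi_vectors, sigma=s, axis=0)
        step = np.linalg.norm(np.diff(smooth, axis=0), axis=1)
        strength += step                  # accumulate change across scales
    return strength   # sort descending to obtain a hierarchical segmentation
```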





Proceedings ArticleDOI
01 Sep 2001
TL;DR: This paper shows how LSI is based on a unitary transformation, for which there are computationally more attractive alternatives, exemplified by the Haar transform, which is memory efficient and can be computed in linear to sublinear time.
Abstract: Latent Semantic Indexing (LSI) dramatically reduces the dimension of the document space by mapping it into a space spanned by conceptual indices. Empirically, the number of concepts that can represent the documents is far smaller than the great variety of words in the textual representation. Although this almost obviates the problem of lexical matching, the mapping incurs a high computational cost compared to document parsing, indexing, query matching, and updating. This paper shows how LSI is based on a unitary transformation, for which there are computationally more attractive alternatives. This is exemplified by the Haar transform, which is memory efficient and can be computed in linear to sublinear time. The principal advantages of LSI are thus preserved while the computational costs are drastically reduced.
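A single-level illustration of the alternative the paper motivates: an orthogonal Haar averaging/differencing step along the term axis, keeping only the coarse half (assumes an even number of rows; the full transform recurses on the coarse part):

```python
# One Haar averaging/differencing step over a terms x documents matrix,
# computable in linear time, as a sketch of the SVD alternative.
import numpy as np

def haar_reduce(A):
    # A: terms x documents, with an even number of term rows (assumed)
    avg = (A[0::2] + A[1::2]) / np.sqrt(2)        # coarse (low-frequency) part
    # detail = (A[0::2] - A[1::2]) / np.sqrt(2)   # discarded high-frequency part
    return avg                                    # dimension halved in linear time
```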

Book ChapterDOI
01 Jan 2001
TL;DR: All variables, latent and observable, are treated as random variables whose joint distribution constitutes the model; this approach can readily be extended to the study of relationships between latent variables, as exemplified by Joreskog's LISREL and similar models.
Abstract: Many of the quantities which appear in social science are latent, meaning that they are not directly observable. General intelligence, or ‘g,’ was an early example but latent variables (or factors) are now used to represent many social and psychological characteristics including attitudes and abilities. During the twentieth century a large body of methods was introduced to identify latent variables and to provide measurement scales for them. Factor analysis, introduced by Spearman in 1904, is used where the observable and latent variables are continuous. Latent structure analysis was introduced by Lazarsfeld in 1950 and covers the cases where either or both of the observable and latent variables are categorical. Latent trait models, used in educational testing, and latent class models have been particularly prominent in the literature. The application of all such models has been extended greatly by the wide availability of powerful desktop computers. Until recently these methods have been developed separately, yet all have a common basis and purpose. This article sets all of the models within a common conceptual framework from which they emerge as special cases. The key idea is that all variables, latent and observable, are treated as random variables whose joint distribution constitutes the model. This approach can readily be extended to the study of relationships between latent variables as exemplified by Joreskog's LISREL and similar models. It provides a basis for their further generalization and for their critical evaluation.
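The common framework the article describes can be written schematically as follows (our notation): observables are modelled as conditionally independent given the latent variables, and the special cases arise from the continuous or categorical status of each side:

```latex
% Common latent variable framework: observables x_1,...,x_p and latent
% variables z are jointly random; observables are typically assumed
% conditionally independent given z. Factor analysis, latent trait, latent
% class, and latent profile models are special cases.
f(\mathbf{x}) \;=\; \int h(\mathbf{z}) \, \prod_{i=1}^{p} g_i\!\left(x_i \mid \mathbf{z}\right) d\mathbf{z}
```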


01 Jan 2001
TL;DR: The problem has been addressed by the use of latent semantic analysis for the comparison and assessment of scientific texts and of the knowledge expressed by students in the form of free verbal statements.
Abstract: Research on the effects of study is hindered by the limitations of the techniques and methods for registering, measuring and assessing actually formed knowledge, i.e., information represented in memory with the appropriate correlations among its units. The problem has been addressed by using latent semantic analysis for the comparison and assessment of scientific texts and of the knowledge expressed by students in the form of free verbal statements.

Dissertation
01 Jan 2001
TL;DR: The main emphasis of the work is on the derivation and construction of computationally efficient algorithms that perform well on both synthetic tasks and real-life problems, and that can be used as alternatives to other existing methods wherever appropriate.
Abstract: What is a latent variable? Simply defined, a latent variable is a variable that cannot be directly measured or observed. A latent variable model or latent structure model is a model whose structure contains one or many latent variables. The subject of this thesis is the study of various topics that arise during the analysis and/or use of latent structure models. Two classical models, namely the factor analysis (FA) model and the finite mixture (FM) model, are first considered and examined extensively, after which the mixture of factor analysers (MFA) model, constructed using ingredients from both FA and FM is introduced and studied at length. Several extensions of the MFA model are also presented, one of which consists of the incorporation of fixed observed covariates into the model. Common to all the models considered are such topics as: (a) model selection which consists of the determination or estimation of the dimensionality of the latent space; (b) parameter estimation which consists of estimating the parameters of the postulated model in order to interpret and characterise the mechanism that produced the observed data; (c) prediction which consists of estimating responses for future unseen observations. Other important topics such as identifiability (for unique solution, interpretability and parameter meaningfulness), density estimation, and to a certain extent aspects of unsupervised learning and exploration of group structure (through clustering, data visualisation in 2D) are also covered. We approach such topics as parameter estimation and model selection from both the likelihood-based and Bayesian perspectives, with a concentration on Maximum Likelihood Estimation via the EM algorithm, and Bayesian Analysis via Stochastic Simulation (derivation of efficient Markov Chain Monte Carlo algorithms). The main emphasis of our work is on the derivation and construction of computationally efficient algorithms that perform well on both synthetic tasks and real-life problems, and that can be used as alternatives to other existing methods wherever appropriate.
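For reference, the MFA density studied in the thesis has the following well-known schematic form (our notation; component-specific noise covariances are also used in some formulations):

```latex
% Mixture of factor analysers: K components, each a factor analysis model
% with loading matrix \Lambda_k, mean \mu_k, and diagonal noise covariance
% \Psi (component-specific \Psi_k variants also exist).
p(\mathbf{x}) \;=\; \sum_{k=1}^{K} \pi_k\,
  \mathcal{N}\!\left(\mathbf{x};\ \boldsymbol{\mu}_k,\ \Lambda_k \Lambda_k^{\top} + \Psi\right),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```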

Proceedings ArticleDOI
09 Dec 2001
TL;DR: A unified probabilistic framework for statistical language modeling, the latent maximum entropy principle, is described; it shows that the hidden causal hierarchical dependency structure can be encoded into the statistical model in a principled way by mixtures of exponential families with rich expressive power.
Abstract: We describe a unified probabilistic framework for statistical language modeling, the latent maximum entropy principle. The salient feature of this approach is that the hidden causal hierarchical dependency structure can be encoded into the statistical model in a principled way by mixtures of exponential families with rich expressive power. We first present the problem formulation, solution, and certain convergence properties. We then describe how to use this machine learning technique to model various aspects of natural language, such as the syntactic structure of sentences and semantic information in documents. Finally, we draw conclusions and point out future research directions.
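A hedged rendering of the latent maximum entropy principle as we read the abstract: maximise the entropy of a joint model over observed data x and hidden structure y, subject to feature constraints whose hidden side is filled in by the model itself:

```latex
% Latent maximum entropy, schematically (our notation): \tilde{p} is the
% empirical distribution over observed data, f_i are feature functions over
% observed and hidden parts jointly.
\max_{p}\; H(p) = -\sum_{x,y} p(x,y)\,\log p(x,y)
\quad \text{s.t.} \quad
\sum_{x,y} p(x,y)\, f_i(x,y) \;=\;
\sum_{x} \tilde{p}(x) \sum_{y} p(y \mid x)\, f_i(x,y), \quad \forall i
```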

Proceedings ArticleDOI
25 Jul 2001
TL;DR: A probabilistic DEDICOM model is proposed for mobility tables that captures asymmetry in observed mobility tables by asymmetric latent mobility tables and a maximum penalized likelihood (MPL) method is developed for parameter estimation.
Abstract: A probabilistic DEDICOM model is proposed for mobility tables. The model attempts to explain observed transition probabilities by a latent mobility table and a set of transition probabilities from latent classes to observed classes. The model captures asymmetry in observed mobility tables by asymmetric latent mobility tables. It may be viewed as a special case of both the latent class model and DEDICOM with special constraints. A maximum penalized likelihood (MPL) method is developed for parameter estimation. The EM algorithm is adapted for the MPL estimation. An example is given to illustrate the proposed method.
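A schematic, approximate reading of the model in DEDICOM form (our notation; a_{ik} relates observed class i to latent class k, and M is the asymmetric latent mobility table):

```latex
% Schematic probabilistic DEDICOM decomposition of an observed mobility
% table: latent mobility table M = (m_{kl}), with the asymmetry of the
% observed table captured by the asymmetry of M.
p_{ij} \;\approx\; \sum_{k}\sum_{l} a_{ik}\, m_{kl}\, a_{jl},
\qquad m_{kl} \neq m_{lk} \text{ in general}
```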

Book ChapterDOI
01 Jan 2001
TL;DR: The range of different approaches to capturing semantic similarity is discussed; multidimensional scaling models and featural models (e.g., Tversky's contrast model) are introduced alongside newer approaches such as structural alignment accounts, context vectors, connectionist models, and generative probabilistic models.
Abstract: This article discusses the range of different approaches to capturing semantic similarity. Specifically, multidimensional scaling models and featural models (e.g., Tversky's contrast model) are introduced alongside newer approaches such as structural alignment accounts, context vectors, connectionist models, and generative probabilistic models. In addition, references are made to several cognitive abilities in which semantic similarity plays a role, including categorization, inductive reasoning, and memory.