
Showing papers on "Probabilistic latent semantic analysis published in 2002"


Journal ArticleDOI
TL;DR: Different applications of latent variables are discussed in a unifying framework that brings together in one general model such different analysis types as factor models, growth curve models, multilevel models, latent class models, and discrete-time survival models.
Abstract: This article gives an overview of statistical analysis with latent variables. Using traditional structural equation modeling as a starting point, it shows how the idea of latent variables captures a wide variety of statistical concepts, including random effects, missing data, sources of variation in hierarchical data, finite mixtures, latent classes, and clusters. These latent variable applications go beyond the traditional latent variable usage in psychometrics, with its focus on measurement error and hypothetical constructs measured by multiple indicators. The article argues for the value of integrating statistical and psychometric modeling ideas. Different applications are discussed in a unifying framework that brings together in one general model such different analysis types as factor models, growth curve models, multilevel models, latent class models, and discrete-time survival models. Several possible combinations and extensions of these models are made clear by the unifying framework.

978 citations


01 Jan 2002
TL;DR: The authors compare latent class (LC) clustering with K-means using data simulated from a setting where true group membership is known; the results indicate that LC substantially outperforms the K-means technique.
Abstract: Recent developments in latent class (LC) analysis and associated software to include continuous variables offer a model-based alternative to more traditional clustering approaches such as K-means. In this paper, the authors compare these two approaches using data simulated from a setting where true group membership is known. The authors choose a setting favourable to K-means by simulating data according to the assumptions made in both discriminant analysis (DISC) and K-means clustering. Since the information on true group membership is used in DISC but not in clustering approaches in general, the authors use the results obtained from DISC as a gold standard in determining an upper bound on the best possible outcome that might be expected from a clustering technique. The results indicate that LC substantially outperforms the K-means technique. A truly surprising result is that the LC performance is so good that it is virtually indistinguishable from the performance of DISC.
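A minimal sketch of the kind of comparison described, using scikit-learn's GaussianMixture as a stand-in for model-based latent class clustering of continuous data; the sample sizes, means, and covariances below are illustrative assumptions, not the settings used in the paper.

```python
# Sketch: compare K-means with model-based (Gaussian mixture) clustering on
# simulated data where true group membership is known. Illustrative only; the
# parameters below are assumptions, not those used in the paper.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Two classes with different covariance structure (unfavourable to K-means).
n = 500
x1 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], n)
x2 = rng.multivariate_normal([2, 2], [[1.0, -0.6], [-0.6, 1.0]], n)
X = np.vstack([x1, x2])
truth = np.repeat([0, 1], n)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

print("K-means ARI:", adjusted_rand_score(truth, km.labels_))
print("Mixture ARI:", adjusted_rand_score(truth, gm.predict(X)))
```

Agreement with the known labels (here via the adjusted Rand index) plays the role of the gold-standard comparison described in the abstract.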

700 citations


Book ChapterDOI
15 Jul 2002
TL;DR: This paper proposes an approach to semantic search that matches conceptual graphs, computing semantic similarities between concepts, relations, and whole conceptual graphs from detailed definitions of semantic similarity.
Abstract: Semantic search has become a research hotspot. The combined use of linguistic ontologies and structured semantic matching is one of the promising ways to improve both recall and precision. In this paper, we propose an approach for semantic search by matching conceptual graphs. Detailed definitions of semantic similarities between concepts, relations and conceptual graphs are given. According to these definitions of semantic similarity, we propose a conceptual graph matching algorithm that calculates the semantic similarity. The computational complexity of this algorithm is constrained to be polynomial. A prototype of our approach is currently under development with IBM China Research Lab.

256 citations


Proceedings ArticleDOI
28 Jul 2002
TL;DR: A search-based algorithm for learning hierarchical latent class models from data is proposed and evaluated on both synthetic and real-world data.
Abstract: Latent class models are used for cluster analysis of categorical data. Underlying such a model is the assumption that the observed variables are mutually independent given the class variable. A serious problem with the use of latent class models, known as local dependence, is that this assumption is often untrue. In this paper we propose hierarchical latent class models as a framework where the local dependence problem can be addressed in a principled manner. We develop a search-based algorithm for learning hierarchical latent class models from data. The algorithm is evaluated using both synthetic and real-world data.

227 citations


Book ChapterDOI
01 Jan 2002
TL;DR: This paper illustrates that the large-scale structure of this representation has statistical properties that correspond well with those of semantic networks produced by humans, and traces this to the fidelity with which it reproduces the natural statistics of language.
Abstract: A probabilistic approach to semantic representation. Thomas L. Griffiths & Mark Steyvers, Department of Psychology, Stanford University. Semantic networks produced from human data have statistical properties that cannot be easily captured by spatial representations. We explore a probabilistic approach to semantic representation that explicitly models the probability with which words occur in different contexts, and hence captures the probabilistic relationships between words. We show that this representation has statistical properties consistent with the large-scale structure of semantic networks constructed by humans, and trace the origins of these properties.

Contemporary accounts of semantic representation suggest that we should consider words to be either points in a high-dimensional space (e.g., Landauer & Dumais, 1997) or interconnected nodes in a semantic network (e.g., Collins & Loftus, 1975). Both of these ways of representing semantic information provide important insights, but also have shortcomings. Spatial approaches illustrate the importance of dimensionality reduction and employ simple algorithms, but are limited by Euclidean geometry. Semantic networks are less constrained, but their graphical structure lacks a clear interpretation. In this paper, we view the function of associative semantic memory to be efficient prediction of the concepts likely to occur in a given context. We take a probabilistic approach to this problem, modeling documents as expressing information related to a small number of topics (cf. Blei, Ng, & Jordan, 2002). The topics of a language can then be learned from the words that occur in different documents. We illustrate that the large-scale structure of this representation has statistical properties that correspond well with those of semantic networks produced by humans, and trace this to the fidelity with which it reproduces the natural statistics of language.

Approaches to semantic representation. Spatial approaches: Latent Semantic Analysis (LSA; Landauer & Dumais, 1997) is a procedure for finding a high-dimensional spatial representation for words. LSA uses singular value decomposition to factorize a word-document co-occurrence matrix. An approximation to the original matrix can be obtained by using fewer singular values than its rank. One component of this approximation is a matrix that gives each word a location in a high-dimensional space. Distances in this space are predictive in many tasks that require the use of semantic information. Performance is best for approximations that use fewer singular values than the rank of the matrix, illustrating that reducing the dimensionality of the representation can reduce the effects of statistical noise and increase efficiency. While the methods behind LSA were novel in scale and subject, the suggestion that similarity relates to distance in psychological space has a long history (Shepard, 1957). Critics have argued that human similarity judgments do not satisfy the properties of Euclidean distances, such as symmetry or the triangle inequality. Tversky and Hutchinson (1986) pointed out that Euclidean geometry places strong constraints on the number of points to which a particular point can be the nearest neighbor, and that many sets of stimuli violate these constraints. The number of nearest neighbors in similarity judgments has an analogue in semantic representation. Nelson, McEvoy and Schreiber (1999) had people perform a word association task in which they named an associated word in response to a set of target words. Steyvers and Tenenbaum (submitted) noted that the number of unique words produced for each target follows a power law distribution: if k is the number of words, P(k) ∝ k^{-γ}. For reasons similar to those of Tversky and Hutchinson, it is difficult to produce a power law distribution by thresholding cosine or distance in Euclidean space. This is shown in Figure 1. Power law distributions appear linear in log-log coordinates. LSA produces curved log-log plots, more consistent with an exponential distribution.

Semantic networks: Semantic networks were proposed by Collins and Quillian (1969) as a means of storing semantic knowledge. The original networks were inheritance hierarchies, but Collins and Loftus (1975) generalized the notion to cover arbitrary graphical structures. The interpretation of this graphical structure is vague, being based on connecting nodes that "activate" one another. Steyvers and Tenenbaum (submitted) constructed a semantic network from the word association norms of Nelson et al.
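A minimal sketch of the LSA procedure described above, assuming a tiny toy word-document matrix: factorise with SVD, keep fewer singular values than the rank, and compare words in the reduced space.

```python
# Sketch: LSA-style dimensionality reduction of a word-document count matrix
# via truncated SVD. The tiny matrix below is illustrative only.
import numpy as np

# rows = words, columns = documents
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                # keep fewer singular values than the rank
word_vecs = U[:, :k] * s[:k]         # word locations in the latent space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(word_vecs[0], word_vecs[1]))  # similarity of words 0 and 1
```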

222 citations



Journal ArticleDOI
TL;DR: A general framework is presented for data analysis of latent finite partially ordered classification models and it is demonstrated that sequential analytic methods can dramatically reduce the amount of testing that is needed to make accurate classifications.
Abstract: Summary. A general framework is presented for data analysis of latent finite partially ordered classification models. When the latent models are complex, data analytic validation of model fits and of the analysis of the statistical properties of the experiments is essential for obtaining reliable and accurate results. Empirical results are analysed from an application to cognitive modelling in educational testing. It is demonstrated that sequential analytic methods can dramatically reduce the amount of testing that is needed to make accurate classifications.

174 citations


Proceedings ArticleDOI
04 Nov 2002
TL;DR: This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that bear on different topics. The method combines the Probabilistic Latent Semantic Analysis (PLSA) model with the selection of segmentation points based on similarity values between pairs of adjacent blocks.
Abstract: This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that bear on different topics. The method combines the use of the Probabilistic Latent Semantic Analysis (PLSA) model with the method of selecting segmentation points based on the similarity values between pairs of adjacent blocks. The use of PLSA allows for a better representation of sparse information in a text block, such as a sentence or a sequence of sentences. Furthermore, segmentation performance is improved by combining different instantiations of the same model, either using different random initializations or different numbers of latent classes. Results on commonly available data sets are significantly better than those of other state-of-the-art systems.
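A minimal sketch of the boundary-selection step, assuming the per-block topic distributions P(z | block) have already been obtained from a fitted PLSA model; the dummy vectors and threshold below are illustrative, not the paper's actual procedure.

```python
# Sketch: choose segmentation points from similarities between adjacent text
# blocks represented in a latent-topic space. The topic vectors would come
# from a fitted PLSA model (P(z | block)); here they are dummy values.
import numpy as np

blocks = np.array([          # one row per block, columns = latent classes
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],         # topic shift between blocks 1 and 2
    [0.2, 0.7, 0.1],
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]
threshold = 0.5              # illustrative; the paper derives boundaries differently
boundaries = [i + 1 for i, s in enumerate(sims) if s < threshold]
print(sims, boundaries)      # a boundary is proposed before block 2
```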

174 citations


Book ChapterDOI
19 Aug 2002
TL;DR: In this paper, a review of the basic theory of the variational extension to the expectation-maximization algorithm is given, and discrete component finding algorithms are then presented in that light.
Abstract: Several authors in recent years have proposed discrete analogues to principal component analysis intended to handle discrete or positive-only data, for instance suited to analyzing sets of documents. Methods include non-negative matrix factorization, probabilistic latent semantic analysis, and latent Dirichlet allocation. This paper begins with a review of the basic theory of the variational extension to the expectation-maximization algorithm, and then presents discrete component finding algorithms in that light. Experiments are conducted on both bigram word data and document bag-of-words data to expose some of the subtleties of this new class of algorithms.
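For reference, the variational view of EM reviewed in the paper rests on the standard decomposition of the log likelihood: for any distribution q over the latent variables z,

```latex
\log p(x \mid \theta)
  = \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right]}_{\mathcal{L}(q,\theta)}
  + \mathrm{KL}\!\left(q(z)\,\|\,p(z \mid x, \theta)\right)
  \;\ge\; \mathcal{L}(q,\theta).
```

The E-step maximises the bound L(q, θ) over q (exactly, q equals the posterior, in ordinary EM; within a restricted family in the variational extension), and the M-step maximises it over θ.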

166 citations


Proceedings Article
01 Jan 2002
TL;DR: This paper proposes two methods for inferring semantic similarity from a corpus, one of them by means of a diffusion process on a graph defined by lexicon and co-occurrence information, and shows how the alignment measure can be used to successfully perform model selection over the kernels' real-valued parameter.
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word similarity based on document similarity and vice versa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results.
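A minimal sketch of the second method's diffusion idea, assuming a small symmetric term co-occurrence graph: exponentiating the adjacency matrix propagates similarity along indirect paths and yields a valid kernel. The graph and decay parameter below are illustrative only.

```python
# Sketch: a diffusion (exponential) kernel over a term graph built from
# co-occurrence counts. G and the decay parameter lam are illustrative only.
import numpy as np
from scipy.linalg import expm

G = np.array([               # symmetric term-term co-occurrence graph
    [0, 2, 0, 0],
    [2, 0, 1, 0],
    [0, 1, 0, 3],
    [0, 0, 3, 0],
], dtype=float)

lam = 0.1                    # diffusion / decay parameter
K = expm(lam * G)            # K = sum_n (lam^n / n!) G^n, positive definite for symmetric G
print(K[0, 3])               # terms 0 and 3 never co-occur, yet K links them via paths
```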

151 citations


Proceedings Article
01 Jan 2002
TL;DR: It is argued that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language.
Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language.

Journal ArticleDOI
TL;DR: The use of latent semantic indexing (LSI) in conjunction with normalization and term weighting for content-based image retrieval is examined, using two different approaches to image feature representation.

MonographDOI
01 Jan 2002
TL;DR: This edited volume collects chapters on latent variable and latent structure modelling, ranging from nonparametric IRT and latent trait estimation to structural equation and multilevel factor models, together with practical data analysis methods.
Abstract: Contents: Preface. D.J. Bartholomew, Old and New Approaches to Latent Variable Modelling. I. Moustaki, C. O'Muircheartaigh, Locating "Don't Know," "No Answer" and Middle Alternatives on an Attitude Scale: A Latent Variable Approach. L.A. van der Ark, B.T. Hemker, K. Sijtsma, Hierarchically Related Nonparametric IRT Models, and Practical Data Analysis Methods. P. Tzamourani, M. Knott, Fully Semiparametric Estimation of the Two-Parameter Latent Trait Model for Binary Data. P. Rivera, A. Satorra, Analyzing Group Differences: A Comparison of SEM Approaches. R.D. Wiggins, A. Sacker, Strategies for Handling Missing Data in SEM: A User's Perspective. T. Raykov, S. Penev, Exploring Structural Equation Model Misspecifications Via Latent Individual Residuals. J-Q. Shi, S-Y. Lee, B-C. Wei, On Confidence Regions of SEM Models. P. Filzmoser, Robust Factor Analysis: Methods and Applications. M. Croon, Using Predicted Latent Scores in General Latent Structure Models. H. Goldstein, W. Browne, Multilevel Factor Analysis Modelling Using Markov Chain Monte Carlo Estimation. J-P. Fox, C.A.W. Glas, Modelling Measurement Error in Structural Multilevel Models.

Proceedings ArticleDOI
11 Aug 2002
TL;DR: The view presented in this paper is that the fundamental vocabulary of the system is the images in the database and that relevance feedback forms a document whose words are images; such a document expresses the semantic intent of the user over that query.
Abstract: This paper proposes a novel view of the information generated by relevance feedback. The latent semantic analysis is adapted to this view to extract useful inter-query information. The view presented in this paper is that the fundamental vocabulary of the system is the images in the database and that relevance feedback is a document whose words are the images. A relevance feedback document contains the intra-query information which expresses the semantic intent of the user over that query. The inter-query information then takes the form of a collection of documents which can be subjected to latent semantic analysis. An algorithm to query the latent semantic index is presented and evaluated against real data sets.

Journal ArticleDOI
01 Mar 2002
TL;DR: It is demonstrated that the probabilistic corpus model which emerges from the automatic or unsupervised hierarchical organisation of a document collection can be further exploited to create a kernel which boosts the performance of state-of-the-art support vector machine document classifiers.
Abstract: This paper presents a probabilistic mixture modeling framework for the hierarchic organisation of document collections. It is demonstrated that the probabilistic corpus model which emerges from the automatic or unsupervised hierarchical organisation of a document collection can be further exploited to create a kernel which boosts the performance of state-of-the-art support vector machine document classifiers. It is shown that the performance of such a classifier is further enhanced when employing the kernel derived from an appropriate hierarchic mixture model used for partitioning a document corpus rather than the kernel associated with a flat non-hierarchic mixture model. This has important implications for document classification when a hierarchic ordering of topics exists. This can be considered as the effective combination of documents with no topic or class labels (unlabeled data), labeled documents, and prior domain knowledge (in the form of the known hierarchic structure), in providing enhanced document classification performance.

01 Jan 2002
TL;DR: According to Kaufman and Rousseeuw (1990), cluster analysis is "the classification of similar objects into groups, where the number of groups, as well as their forms are unknown".
Abstract: According to Kaufman and Rousseeuw (1990), cluster analysis is "the classification of similar objects into groups, where the number of groups, as well as their forms are unknown". This same definition could be used for exploratory Latent Class (LC) analysis where a K-class latent variable is used to explain the associations among a set of observed variables. Each latent class, like each cluster, groups together similar cases.

Journal ArticleDOI
01 Mar 2002
TL;DR: A novel probabilistic method is proposed, based on latent variable models, for unsupervised topographic visualisation of dynamically evolving, coherent textual information, which can be seen as a complementary tool for topic detection and tracking applications.
Abstract: We propose a novel probabilistic method, based on latent variable models, for unsupervised topographic visualisation of dynamically evolving, coherent textual information. This can be seen as a complementary tool for topic detection and tracking applications. It is achieved by exploiting the a priori domain knowledge available, namely that there are relatively homogeneous temporal segments in the data stream. In a different manner from topographical techniques previously utilized for static text collections, the topography is an outcome of the coherence in time of the data stream in the proposed model. Simulation results on both toy-data settings and an actual application to Internet chat line discussion analysis are presented by way of demonstration.

Journal ArticleDOI
TL;DR: A latent class extension of signal detection theory is presented and applications are illustrated, useful for situations where observers attempt to detect latent categorical events or where the goal of the analysis is to select or classify cases.
Abstract: A latent class extension of signal detection theory is presented and applications are illustrated. The approach is useful for situations where observers attempt to detect latent categorical events or where the goal of the analysis is to select or classify cases. Signal detection theory is shown to offer a simple summary of the observers' performance in terms of detection and response criteria. Implications of the view via signal detection for the training of raters are noted, as are approaches to validating the parameters and classifications. An extension of the signal detection model to more than two latent classes, with a simple restriction on the detection parameters, is introduced. Sample programs to fit the models using software for latent class analysis or software for second generation structural equation modeling are provided.

Journal ArticleDOI
01 Jan 2002
TL;DR: This work relies on LSA to represent the student model in a tutoring system and designs tutoring strategies to automatically detect lexeme misunderstandings and to select, among the various examples of a domain, the one to which it is best to expose the student.
Abstract: Latent semantic analysis (LSA) is a tool for extracting semantic information from texts as well as a model of language learning based on the exposure to texts. We rely on LSA to represent the student model in a tutoring system. Domain examples and student productions are represented in a high-dimensional semantic space, automatically built from a statistical analysis of the co-occurrences of their lexemes. We also designed tutoring strategies to automatically detect lexeme misunderstandings and to select, among the various examples of a domain, the one to which it is best to expose the student. Two systems are presented: the first one successively presents texts to be read by the student, selecting the next one according to the student's comprehension of the prior ones. The second plays a board game (kalah) with the student in such a way that the next configuration of the board is intended to be the most appropriate with respect to the semantic structure of the domain and the student's previous moves.

01 Jan 2002
TL;DR: This work shows the use of Probabilistic Latent Semantic Analysis for finding similar documents in natural language document collections, and evaluates the system on a collection of photocopier repair tips.
Abstract: Finding similar documents in natural language document collections is a difficult task that requires general and domain-specific world knowledge, deep analysis of the documents, and inference. However, a large portion of the pairs of similar documents can be identified by simpler, purely word-based methods. We show the use of Probabilistic Latent Semantic Analysis for finding similar documents. We evaluate our system on a collection of photocopier repair tips. Among the 100 top-ranked pairs, 88 are true positives. A manual analysis of the 12 false positives suggests the use of more semantic information in the retrieval model.

Proceedings Article
07 Aug 2002
TL;DR: This work introduces an algorithm for discovering partitions of observed variables such that members of a class share only a single latent common cause; the algorithm requires no prior knowledge of the number of latent variables and does not depend on the mathematical form of the relationships among them.
Abstract: Observed associations in a database may be due in whole or part to variations in unrecorded ("latent") variables. Identifying such variables and their causal relationships with one another is a principal goal in many scientific and practical domains. Previous work shows that, given a partition of observed variables such that members of a class share only a single latent common cause, standard search algorithms for causal Bayes nets can infer structural relations between latent variables. We introduce an algorithm for discovering such partitions when they exist. Uniquely among available procedures, the algorithm is (asymptotically) correct under standard assumptions in causal Bayes net search algorithms, requires no prior knowledge of the number of latent variables, and does not depend on the mathematical form of the relationships among the latent variables. We evaluate the algorithm on a variety of simulated data sets.

01 Jan 2002
TL;DR: This work empirically demonstrates that LSI uses up to fifth-order term co-occurrence, and proves mathematically that a connectivity path exists for every nonzero element in the truncated term-term matrix computed by LSI.
Abstract: Current research in Latent Semantic Indexing (LSI) shows improvements in performance for a wide variety of information retrieval systems. We propose the development of a theoretical foundation for understanding the values produced in the reduced form of the term-term matrix. We assert that LSI's use of higher orders of co-occurrence is a critical component of this study. In this work we present experiments that precisely determine the degree of co-occurrence used in LSI. We empirically demonstrate that LSI uses up to fifth order term co-occurrence. We also prove mathematically that a connectivity path exists for every nonzero element in the truncated term-term matrix computed by LSI. A complete understanding of this term transitivity is key to understanding LSI.

1. INTRODUCTION. The use of co-occurrence information in textual data has led to improvements in performance when applied to a variety of applications in information retrieval, computational linguistics and textual data mining. Furthermore, many researchers in these fields have developed techniques that explicitly employ second and third order term co-occurrence. Examples include applications such as literature search [14], word sense disambiguation [12], ranking of relevant documents [15], and word selection [8]. Other authors have developed algorithms that implicitly rely on the use of term co-occurrence for applications such as search and retrieval [5], trend detection [14], and stemming [17]. In what follows we refer to various degrees of term transitivity as orders of co-occurrence – first order if two terms co-occur, second order if two terms are linked only by a third, etc. An example of second order co-occurrence follows. Assume that a collection has one document that contains the terms
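A minimal sketch of the second-order effect the paper analyses, assuming a three-term, two-document toy collection: two terms that never co-occur directly acquire a nonzero entry in the truncated term-term matrix because a connectivity path exists through a shared term.

```python
# Sketch: second-order co-occurrence surfacing in the truncated term-term
# matrix of LSI. Terms 0 and 2 never co-occur directly but share term 1.
# The toy term-document matrix is illustrative only.
import numpy as np

A = np.array([               # rows = terms, columns = documents
    [1, 0],                  # term 0 appears only in doc 0
    [1, 1],                  # term 1 appears in both docs
    [0, 1],                  # term 2 appears only in doc 1
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1                                        # truncate to rank 1
T_k = (U[:, :k] * s[:k] ** 2) @ U[:, :k].T   # truncated term-term matrix A_k A_k^T

print(A @ A.T)   # entry (0, 2) is zero: no first-order co-occurrence
print(T_k)       # entry (0, 2) is nonzero: a connectivity path exists via term 1
```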

Journal ArticleDOI
TL;DR: A class of constrained latent class models in which the conditional probabilities are a nonlinear function of basic parameters that pertain to each of the two modes are introduced.
Abstract: The latent class model for two-way two-mode data can easily be extended to the case of three-way three-mode data as a tool to cluster the elements of one mode on the basis of two other modes simultaneously. However, as the number of manifest variables is typically very large in this type of analysis, the number of conditional probabilities rapidly increases with the number of latent classes, which may lead to an overparameterized model. To solve this problem, we introduce a class of constrained latent class models in which the conditional probabilities are a nonlinear function of basic parameters that pertain to each of the two modes. The models can be regarded as a probabilistic extension of related deterministic models or as a generalization of related probabilistic models. For parameter estimation, an EM algorithm can be used to locate the posterior mode, and a Gibbs sampling algorithm can be used to compute a sample of the posterior distribution. Furthermore, model selection criteria and measures to check the fit of the model are discussed. Finally the models are applied to study the types of reactions that occur when one is angry at a person in a certain situation.

Proceedings Article
01 Jan 2002
TL;DR: A discriminative term selection process based on information gain is used to improve the performance of latent semantic indexing (LSI); the approach is almost independent of task-dependent language resources, making it highly portable to various information retrieval and natural language understanding tasks.
Abstract: In this paper, we describe an approach of using a discriminative term selection process based on information gain (IG) to improve the performance of latent semantic indexing (LSI). The discriminative power of a term is measured by entropy variations averaged over all categories, conditioned upon whether the term is present or absent. The proposed approach is applied to the task of natural language call routing (NLCR), where natural language based classifiers are used to route calls to desired destinations. Various experimental studies are performed. Significant performance gains of 27% on precision and 26.5% on recall are observed. Most importantly, the proposed approach is almost independent of task-dependent language resources and robust to term variations, making it highly portable to various information retrieval and natural language understanding tasks.
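A minimal sketch of the information-gain criterion described, assuming toy call labels and term occurrences; the routing data and any selection threshold are illustrative, not the paper's.

```python
# Sketch: information-gain scoring of a term for category discrimination,
# IG(t) = H(C) - [P(t) H(C|t) + P(~t) H(C|~t)].  Toy data, illustrative only.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def information_gain(term_present, labels):
    classes = np.unique(labels)
    prior = np.array([(labels == c).mean() for c in classes])
    ig = entropy(prior)
    for mask in (term_present, ~term_present):
        if mask.any():
            cond = np.array([(labels[mask] == c).mean() for c in classes])
            ig -= mask.mean() * entropy(cond)
    return ig

labels = np.array(["billing", "billing", "repair", "repair", "repair"])
term_present = np.array([True, True, False, False, False])   # a discriminative term
print(information_gain(term_present, labels))                 # equals H(C) here
```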

Proceedings Article
01 Jan 2002
TL;DR: A statistical scheme that dynamically determines the unit scope in the generalization stage is presented; the combination of all the techniques leads to a 14% perplexity reduction on a subset of the Wall Street Journal corpus compared with the trigram model.
Abstract: We describe an extension to the use of Latent Semantic Analysis (LSA) for language modeling. This technique makes it easier to exploit long distance relationships in natural language for which the traditional n-gram is unsuited. However, with the growth of length, the semantic representation of the history may be contaminated by irrelevant information, increasing the uncertainty in predicting the next word. To address this problem, we propose a multilevel framework dividing the history into three levels corresponding to document, paragraph and sentence. To combine the three levels of information with the n-gram, a Softmax network is used. We further present a statistical scheme that dynamically determines the unit scope in the generalization stage. The combination of all the techniques leads to a 14% perplexity reduction on a subset of Wall Street Journal, compared with the trigram model.
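The combination step can be pictured as an interpolation of the n-gram prediction with the document-, paragraph-, and sentence-level LSA predictions, with mixture weights produced by the Softmax network from the history; the notation below is a plausible rendering of that idea, not the paper's exact formulation.

```latex
P(w_t \mid h) \;=\; \sum_{i} \lambda_i(h)\, P_i(w_t \mid h),
\qquad
\lambda_i(h) \;=\; \frac{\exp\!\big(g_i(h)\big)}{\sum_{j} \exp\!\big(g_j(h)\big)},
\qquad
i \in \{\text{n-gram},\ \text{document},\ \text{paragraph},\ \text{sentence}\},
```

where the g_i(h) are the Softmax network's scores for the history h.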

Journal ArticleDOI
TL;DR: Experimental results demonstrate that a projection-based symmetrical factorisation method for extracting semantic features from collections of text documents stored in a Latent Semantic space yields a comparable representation to that provided by a novel probabilistic approach.
Abstract: This paper proposes a projection-based symmetrical factorisation method for extracting semantic features from collections of text documents stored in a Latent Semantic space. Preliminary experimental results demonstrate that this yields a representation comparable to that provided by a novel probabilistic approach which reconsiders the entire indexing problem of text documents and works directly in the original high-dimensional vector-space representation of text. The employed projection index is derived here from the a priori constraints on the problem. The principal advantage of this approach is computational efficiency, obtained by exploiting Latent Semantic Indexing as a preprocessing stage. Simulation results on subsets of the 20-Newsgroups text corpus in various settings are provided.

Journal ArticleDOI
TL;DR: A new sampling-based Bayesian technique, called the DA-T-Gibbs sampler, which relies on the particular latent data structure of latent response models to simplify the computations involved in parameter estimation.
Abstract: This paper introduces a new technique for estimating the parameters of models with continuous latent data. Using the Rasch model as an example, it is shown that existing Bayesian techniques for parameter estimation, such as the Gibbs sampler, are not always easy to implement. Then, a new sampling-based Bayesian technique, called the DA-T-Gibbs sampler, is introduced. The DA-T-Gibbs sampler relies on the particular latent data structure of latent response models to simplify the computations involved in parameter estimation.
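For reference, the Rasch model used as the running example specifies, for person ability θ_i and item difficulty b_j,

```latex
P(X_{ij} = 1 \mid \theta_i, b_j) \;=\; \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)},
```

and it is the parameters θ_i and b_j that the sampling-based Bayesian techniques discussed in the paper estimate.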

01 Jan 2002
TL;DR: The paper presents a method of combining corpus information on word collocations with the probabilistic model of information retrieval based on the term independence assumption and the results of the lexico-semantic analysis of significant collocates.
Abstract: The paper presents a method of combining corpus information on word collocations with the probabilistic model of information retrieval. Corpus term dependencies are used to modify the probabilistic retrieval based on the term independence assumption. Collocates are derived from windows around term occurrences in the corpus. Statistical measures of mutual information and Z score are applied to select significantly associated collocates which are later used in query expansion. The results of the lexico-semantic analysis of significant collocates and their comparison with engineered term networks and thesauri are also discussed.
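A minimal sketch of the two association measures named, assuming toy window counts; the exact mutual-information and z-score formulations used in the paper may differ.

```python
# Sketch: scoring a candidate collocate by (pointwise) mutual information and
# a z score against an independence baseline. Counts are illustrative only.
import math

N = 100_000          # total windows examined
f_x, f_y = 500, 300  # occurrences of the node word and the collocate
f_xy = 40            # co-occurrences within the window

p_x, p_y, p_xy = f_x / N, f_y / N, f_xy / N

mi = math.log2(p_xy / (p_x * p_y))           # pointwise mutual information
expected = p_x * p_y * N                     # expected co-occurrences if independent
z = (f_xy - expected) / math.sqrt(expected)  # z score of the observed count

print(round(mi, 2), round(z, 2))
```

Collocates whose scores exceed a significance cutoff would then be retained for query expansion, as the abstract describes.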

Book ChapterDOI
01 Jan 2002
TL;DR: This work casts the passages of a large and representative text corpus as a system of simultaneous linear equations in which passage meaning equals the sum of word meanings, and produces a high-dimensional vector representing the average contribution to passage meanings of every word.
Abstract: Latent Semantic Analysis (LSA) treats language learning and representation as a problem in mathematical induction. It casts the passages of a large and representative text corpus as a system of simultaneous linear equations in which passage meaning equals the sum of word meanings. LSA simulates human language understanding with surprising fidelity. Successes to date disprove the poverty of the stimulus argument for lexical meaning and recast the problem of syntax learning, but leave much room for improvement. Semantic atoms are not only single words; idioms need lexicalization. Syntax surely matters; LSA ignores word order. LSA's knowledge resembles intuition; people also use language for logic. Relations to other input matter. LSA represents perceptual phenomena vicariously, e.g. color relations. Demonstrations that people think in other modes, or that LSA does not exhaust linguistic meaning, do not question LSA's validity, but call for more modeling, testing, and integration.

Proceedings Article
01 Jan 2002
TL;DR: This work suggests an alternative approach in which the latent variables are modelled using deterministic conditional probability tables, which has the advantage of tractable inference even for highly complex non-linear/non-Gaussian visible conditional probability tables.
Abstract: The application of latent/hidden variable Dynamic Bayesian Networks is constrained by the complexity of marginalising over latent variables. For this reason either small latent dimensions or Gaussian latent conditional tables linearly dependent on past states are typically considered in order that inference is tractable. We suggest an alternative approach in which the latent variables are modelled using deterministic conditional probability tables. This specialisation has the advantage of tractable inference even for highly complex non-linear/non-Gaussian visible conditional probability tables. This approach enables the consideration of highly complex latent dynamics whilst retaining the benefits of a tractable probabilistic model.