
Showing papers on "Probabilistic latent semantic analysis published in 2000"


Proceedings ArticleDOI
03 Oct 2000
TL;DR: This work presents a system for identifying the semantic relationships, or semantic roles, filled by constituents of a sentence within a semantic frame, using statistical classifiers trained on lexical and syntactic features derived from parse trees and hand-annotated training data.
Abstract: We present a system for identifying the semantic relationships, or semantic roles, filled by constituents of a sentence within a semantic frame. Various lexical and syntactic features are derived from parse trees and used to derive statistical classifiers from hand-annotated training data.

944 citations
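The abstract does not enumerate the features, but a classic syntactic feature in this line of work is the path through the parse tree from a constituent to the predicate. A minimal sketch with nltk, where the example parse, the tree positions, and the path encoding are illustrative assumptions rather than the paper's exact feature set:

```python
from nltk import Tree

# A tiny hand-built parse of "He ate the pasta".
t = Tree.fromstring("(S (NP (PRP He)) (VP (VBD ate) (NP (DT the) (NN pasta))))")

def common_prefix_len(p, q):
    n = 0
    while n < min(len(p), len(q)) and p[n] == q[n]:
        n += 1
    return n

def path_feature(tree, const, pred):
    """Category path from a constituent up to the lowest common ancestor and
    down to the predicate, e.g. NP^S_VP_VBD ('^' = up, '_' = down)."""
    n = common_prefix_len(const, pred)
    up = [tree[const[:k]].label() for k in range(len(const), n - 1, -1)]
    down = [tree[pred[:k]].label() for k in range(n + 1, len(pred) + 1)]
    return "^".join(up) + "_" + "_".join(down)

# Path from the subject NP (position (0,)) to the verb (position (1, 0)).
print(path_feature(t, (0,), (1, 0)))   # -> NP^S_VP_VBD
```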


Journal ArticleDOI
J.R. Bellegarda1
01 Aug 2000
TL;DR: This paper focuses on the use of latent semantic analysis, a paradigm that automatically uncovers the salient semantic relationships between words and documents in a given corpus, and proposes an integrative formulation for harnessing this synergy.
Abstract: Statistical language models used in large-vocabulary speech recognition must properly encapsulate the various constraints, both local and global, present in the language. While local constraints are readily captured through n-gram modeling, global constraints, such as long-term semantic dependencies, have been more difficult to handle within a data-driven formalism. This paper focuses on the use of latent semantic analysis, a paradigm that automatically uncovers the salient semantic relationships between words and documents in a given corpus. In this approach, (discrete) words and documents are mapped onto a (continuous) semantic vector space, in which familiar clustering techniques can be applied. This leads to the specification of a powerful framework for automatic semantic classification, as well as the derivation of several language model families with various smoothing properties. Because of their large-span nature, these language models are well suited to complement conventional n-grams. An integrative formulation is proposed for harnessing this synergy, in which the latent semantic information is used to adjust the standard n-gram probability. Such hybrid language modeling compares favorably with the corresponding n-gram baseline: experiments conducted on the Wall Street Journal domain show a reduction in average word error rate of over 20%. This paper concludes with a discussion of intrinsic tradeoffs, such as the influence of training data selection on the resulting performance.

565 citations
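A rough sketch of the hybrid idea: use similarity in a latent semantic space to rescale n-gram probabilities, then renormalize. The stand-in word vectors, the averaged pseudo-document vector for the history, and the exponential combination rule below are simplifying assumptions, not the paper's exact integrative formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["stock", "market", "banana", "index"]
W = rng.normal(size=(len(vocab), 2))            # stand-in LSA word vectors

def lsa_history_vector(history):
    """Pseudo-document vector: average of the history's word vectors."""
    idx = [vocab.index(w) for w in history if w in vocab]
    return W[idx].mean(axis=0)

def hybrid_probs(ngram_probs, history, gamma=1.0):
    """Rescale n-gram probabilities by exp(gamma * cosine(word, history))."""
    h = lsa_history_vector(history)
    cos = W @ h / (np.linalg.norm(W, axis=1) * np.linalg.norm(h) + 1e-12)
    p = ngram_probs * np.exp(gamma * cos)
    return p / p.sum()                          # renormalize over the vocabulary

ngram = np.array([0.3, 0.3, 0.2, 0.2])          # toy P(w | last word)
print(hybrid_probs(ngram, ["stock", "market"]))
```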


Journal ArticleDOI
01 Oct 2000
TL;DR: It is proved that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance, and the technique of random projection is proposed as a way of speeding up LSI.
Abstract: Latent semantic indexing (LSI) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had heretofore been without rigorous prediction and explanation. We prove that, under certain conditions, LSI does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. We propose the technique of random projection as a way of speeding up LSI. We complement our theorems with encouraging experimental results. We also argue that our results may be viewed in a more general framework, as a theoretical basis for the use of spectral methods in a wider class of applications such as collaborative filtering.

399 citations
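A minimal sketch of the proposed speed-up: randomly project the term dimension of the term-document matrix before computing the truncated SVD, so the expensive decomposition runs on a much smaller matrix. Matrix sizes, the projection dimension d, and the rank k below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(0.05, size=(5000, 300)).astype(float)  # terms x documents

d = 400                                   # random-projection dimension
R = rng.normal(size=(d, A.shape[0])) / np.sqrt(d)
B = R @ A                                 # d x documents; distances roughly preserved

k = 50                                    # LSI rank
U, s, Vt = np.linalg.svd(B, full_matrices=False)
doc_vectors = (s[:k, None] * Vt[:k]).T    # documents in the k-dim LSI space
print(doc_vectors.shape)                  # (300, 50)
```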


Journal ArticleDOI
TL;DR: Simulations and a psychiatric example are presented to demonstrate the effective use of procedures for assessing Markov chain Monte Carlo convergence and model diagnosis, and for selecting the number of categories for the latent variable based on evidence in the data, using Markov chain Monte Carlo techniques.
Abstract: In many areas of medical research, such as psychiatry and gerontology, latent class variables are used to classify individuals into disease categories, often with the intention of hierarchical modeling. Problems arise when it is not clear how many disease classes are appropriate, creating a need for model selection and diagnostic techniques. Previous work has shown that the Pearson χ² statistic and the log-likelihood ratio G² statistic are not valid test statistics for evaluating latent class models. Other methods, such as information criteria, provide decision rules without providing explicit information about where discrepancies occur between a model and the data. Identifiability issues further complicate these problems. This paper develops procedures for assessing Markov chain Monte Carlo convergence and model diagnosis and for selecting the number of categories for the latent variable based on evidence in the data using Markov chain Monte Carlo techniques. Simulations and a psychiatric example are presented to demonstrate the effective use of these methods.

254 citations
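The abstract does not name its convergence diagnostics; one standard choice for assessing Markov chain Monte Carlo convergence is the Gelman-Rubin potential scale reduction factor, sketched here on synthetic chains as a generic illustration:

```python
import numpy as np

def gelman_rubin(chains):
    """chains: (m, n) array of m chains, n draws each, for one scalar parameter."""
    m, n = chains.shape
    means = chains.mean(axis=1)
    B = n * means.var(ddof=1)                  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled posterior variance estimate
    return np.sqrt(var_hat / W)                # values near 1.0 suggest convergence

rng = np.random.default_rng(0)
chains = rng.normal(size=(4, 1000))            # four well-mixed toy chains
print(gelman_rubin(chains))                    # close to 1
```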


Journal ArticleDOI
TL;DR: A unified maximum likelihood method for estimating the parameters of the generalized latent trait model will be presented and in addition the scoring of individuals on the latent dimensions is discussed.
Abstract: In this paper we discuss a general model framework within which manifest variables with different distributions in the exponential family can be analyzed with a latent trait model. A unified maximum likelihood method for estimating the parameters of the generalized latent trait model is presented. In addition, we discuss the scoring of individuals on the latent dimensions. The general framework presented allows not only the analysis of manifest variables all of one type, but also the simultaneous analysis of a collection of variables with different distributions. The approach analyzes the data as they are, by making assumptions about the distribution of the manifest variables directly.

246 citations


Proceedings ArticleDOI
13 Sep 2000
TL;DR: A semantics-only algorithm for learning morphology which only proposes affixes when the stem and stem-plus-affix are sufficiently similar semantically and it is shown that this approach provides morphology induction results that rival a current state-of-the-art system.
Abstract: Morphology induction is a subproblem of important tasks like automatic learning of machine-readable dictionaries and grammar induction. Previous morphology induction approaches have relied solely on statistics of hypothesized stems and affixes to choose which affixes to consider legitimate. Relying on stem-and-affix statistics rather than semantic knowledge leads to a number of problems, such as the inappropriate use of valid affixes ("ally" stemming to "all"). We introduce a semantic-based algorithm for learning morphology which only proposes affixes when the stem and stem-plus-affix are sufficiently similar semantically. We implement our approach using Latent Semantic Analysis and show that our semantics-only approach provides morphology induction results that rival a current state-of-the-art system.

233 citations
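A toy sketch of the semantic filter described in the abstract: a candidate affix is accepted only when the stem and the stem-plus-affix word are sufficiently similar semantically. The vectors and the threshold below are hypothetical stand-ins for LSA vectors trained on a corpus:

```python
import numpy as np

vectors = {                      # hypothetical LSA word vectors
    "walk":   np.array([0.9, 0.1, 0.0]),
    "walked": np.array([0.85, 0.15, 0.05]),
    "all":    np.array([0.1, 0.2, 0.9]),
    "ally":   np.array([0.7, 0.6, 0.1]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def accept_affix(stem, word, threshold=0.8):
    """Propose the affix only if stem and stem-plus-affix are semantically close."""
    return cosine(vectors[stem], vectors[word]) >= threshold

print(accept_affix("walk", "walked"))  # True: "-ed" looks like a real affix
print(accept_affix("all", "ally"))     # False: "ally" should not stem to "all"
```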


Patent
18 Oct 2000
TL;DR: In this paper, state vectors representing the semantic content of a document are superpositioned to construct a single vector representing a semantic abstract for the document, which can be used to locate documents with similar semantic content.
Abstract: State vectors representing the semantic content of a document are created. The state vectors are superpositioned to construct a single vector representing a semantic abstract for the document. The single vector can be normalized. Once constructed, the single vector semantic abstract can be compared with semantic abstracts for other documents to measure a semantic distance between the documents, and can be used to locate documents with similar semantic content.

180 citations
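A minimal sketch of the patent's construction, assuming per-word state vectors are already available: superpose (sum) them into a single vector per document, normalize, and measure semantic distance as the angle between the resulting abstracts. The vocabulary and vectors are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate("court ruling appeal game score team".split())}
W = rng.normal(size=(len(vocab), 3))         # stand-in state vectors per word

def semantic_abstract(words):
    """Superposition of word state vectors, normalized to unit length."""
    v = sum(W[vocab[w]] for w in words if w in vocab)
    return v / np.linalg.norm(v)

def semantic_distance(a, b):
    """Angle between two semantic abstracts."""
    return float(np.arccos(np.clip(a @ b, -1.0, 1.0)))

d1 = semantic_abstract("court ruling appeal".split())
d2 = semantic_abstract("game score team".split())
print(semantic_distance(d1, d2))
```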


Proceedings ArticleDOI
13 Nov 2000
TL;DR: The paper describes the results of applying Latent Semantic Analysis (LSA), an advanced information retrieval method, to program source code and associated documentation to assist in the understanding of a nontrivial software system, namely a version of Mosaic.
Abstract: The paper describes the results of applying Latent Semantic Analysis (LSA), an advanced information retrieval method, to program source code and associated documentation. Latent semantic analysis is a corpus-based statistical method for inducing and representing aspects of the meanings of words and passages (of natural language) reflective of their usage. This methodology is assessed for application to the domain of software components (i.e., source code and its accompanying documentation). Here LSA is used as the basis to cluster software components. This clustering is used to assist in the understanding of a nontrivial software system, namely a version of Mosaic. Applying latent semantic analysis to the domain of source code and internal documentation for the support of program understanding is a new application of this method and a departure from the normal application domain of natural language.

110 citations
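A rough sketch of the pipeline on toy "source files": treat each component's identifiers and comment words as a document, project into the LSA subspace via SVD, and cluster. The file contents, the rank, and the use of k-means are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

components = {                                # toy stand-ins for real files
    "http_fetch.c": "open socket connect host send request receive response",
    "html_parse.c": "parse tag attribute token element tree render",
    "cache.c":      "cache entry lookup store evict disk file",
}
X = CountVectorizer().fit_transform(components.values()).astype(float)

U, s, Vt = np.linalg.svd(X.toarray(), full_matrices=False)
k = 2
docs = U[:, :k] * s[:k]                       # components in the LSA subspace
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(docs)
print(dict(zip(components, labels)))          # cluster assignment per component
```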


Proceedings Article
12 Apr 2000
TL;DR: Experiments with an on-line newspaper archive show that Latent Semantic Indexing can outperform both content based and context based approaches and that it is a promising approach for indexing visual and multi-modal data.
Abstract: In this paper, we introduce a new approach to image retrieval. This new approach takes the best from two worlds, combines image features (content) and words from collateral text (context) into one semantic space. Our approach uses Latent Semantic Indexing, a method that uses co-occurrence statistics to uncover hidden semantics. This paper shows how this method, that has proven successful in both monolingual and cross lingual text retrieval, can be used for multi-modal and cross-modal information retrieval. Experiments with an on-line newspaper archive show that Latent Semantic Indexing can outperform both content based and context based approaches and that it is a promising approach for indexing visual and multi-modal data.

105 citations
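A minimal sketch of the joint space: stack image features (content) and term counts from collateral text (context) into one matrix, compute the SVD, and answer a text-only query by zero-padding the missing modality. All data and dimensions below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_img, n_txt = 40, 8, 12
img = rng.normal(size=(n_items, n_img))            # content: image features
txt = rng.poisson(1.0, size=(n_items, n_txt))      # context: collateral text
X = np.hstack([img, txt]).astype(float)            # one joint semantic space

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5
items = U[:, :k] * s[:k]                           # items in the latent space

query_txt = rng.poisson(1.0, size=n_txt)           # text-only query
q = np.concatenate([np.zeros(n_img), query_txt])   # missing modality zeroed
q_lat = q @ Vt[:k].T                               # fold the query into the space
scores = items @ q_lat
print(np.argsort(-scores)[:5])                     # top-5 cross-modal matches
```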


Proceedings ArticleDOI
01 Jul 2000
TL;DR: A novel algorithm that creates document vectors with reduced dimensionality by iteratively "scaling" vectors and computing eigenvectors is presented, which breaks the symmetry of documents and terms to capture information more evenly across documents.
Abstract: We present a novel algorithm that creates document vectors with reduced dimensionality. This work was motivated by an application characterizing relationships among documents in a collection. Our algorithm yielded inter-document similarities with an average precision up to 17.8% higher than that of singular value decomposition (SVD) used for Latent Semantic Indexing. The best performance was achieved with dimensional reduction rates that were 43% higher than SVD on average. Our algorithm creates basis vectors for a reduced space by iteratively “scaling” vectors and computing eigenvectors. Unlike SVD, it breaks the symmetry of documents and terms to capture information more evenly across documents. We also discuss correlation with a probabilistic model and evaluate a method for selecting the dimensionality using log-likelihood estimation.

85 citations


Proceedings ArticleDOI
05 Jun 2000
TL;DR: A new latent semantic indexing (LSI) method for spoken audio documents, in which smoothing by the closest document clusters is important because the documents are often short and have a high word error rate (WER).
Abstract: This paper describes a new latent semantic indexing (LSI) method for spoken audio documents. The framework is indexing broadcast news from radio and TV as a combination of large vocabulary continuous speech recognition (LVCSR), natural language processing (NLP) and information retrieval (IR). For indexing, the documents are presented as vectors of word counts, whose dimensionality is rapidly reduced by random mapping (RM). The obtained vectors are projected into the latent semantic subspace determined by SVD, where the vectors are then smoothed by a self-organizing map (SOM). The smoothing by the closest document clusters is important here, because the documents are often short and have a high word error rate (WER). As the clusters in the semantic subspace reflect the news topics, the SOMs provide an easy way to visualize the index and query results and to explore the database. Test results are reported for TREC's spoken document retrieval databases (www.idiap.ch/kurimo/thisl.html).
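A rough sketch of the indexing pipeline with synthetic data: word-count vectors are reduced by random mapping, projected into the SVD subspace, and smoothed toward nearby document clusters. Here k-means stands in for the paper's self-organizing map, and all sizes are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
counts = rng.poisson(0.1, size=(200, 20000)).astype(float)  # docs x vocabulary

R = rng.normal(size=(20000, 300)) / np.sqrt(300)   # random mapping (RM) matrix
X = counts @ R                                     # fast dimensionality cut

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 30
docs = U[:, :k] * s[:k]                            # latent semantic subspace

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(docs)
centers = km.cluster_centers_[km.labels_]          # each doc's cluster centroid
alpha = 0.5                                        # smoothing weight
smoothed = (1 - alpha) * docs + alpha * centers    # pull noisy docs toward topics
print(smoothed.shape)
```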

Book
06 Apr 2000
TL;DR: Topics covered include flexible discriminant and mixture models, neural networks for unsupervised learning based on information theory, radial basis function networks and statistics, robust prediction in many-parameter models, latent variable models, and data visualisation.

Abstract: Flexible discriminant and mixture models; neural networks for unsupervised learning based on information theory; radial basis function networks and statistics; robust prediction in many-parameter models; density networks; latent variable models and data visualisation; analysis of latent structure models with multidimensional latent variables; artificial neural networks and multivariate statistics.

Journal ArticleDOI
TL;DR: In this article, a probabilistic clustering model for mixed data is proposed, which allows analysis of variables of mixed type: the variables may be nominal, ordinal and/or quantitative.
Abstract: This paper develops a probabilistic clustering model for mixed data. The model allows analysis of variables of mixed type: the variables may be nominal, ordinal and/or quantitative. The model contains the well-known models of latent class analysis as submodels. As in latent class analysis, local independence of the variables is assumed. The parameters of the model are estimated by the EM algorithm. Test statistics and goodness-of-fit measures are proposed for model selection. Two artificial data sets show the usefulness of these tests. An empirical example completes the presentation.
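A minimal EM sketch for the simplest special case, a two-class latent class model over binary items with local independence; the paper's model is more general (nominal, ordinal, and quantitative variables), so the data and update rules below cover only this toy case:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two latent classes with different item-response probabilities.
true_p = np.array([[0.9, 0.8, 0.1], [0.2, 0.3, 0.85]])
z = rng.integers(0, 2, size=500)
X = (rng.random((500, 3)) < true_p[z]).astype(float)

pi = np.array([0.5, 0.5])                    # class weights
p = rng.uniform(0.3, 0.7, size=(2, 3))       # item probabilities per class
for _ in range(100):
    # E-step: posterior responsibility of each class for each observation.
    like = (p[None] ** X[:, None] * (1 - p[None]) ** (1 - X[:, None])).prod(axis=2)
    post = like * pi
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate class weights and conditional response probabilities.
    pi = post.mean(axis=0)
    p = (post.T @ X) / post.sum(axis=0)[:, None]
print(np.round(p, 2))                        # recovered item probabilities
```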

Proceedings ArticleDOI
01 Jun 2000
TL;DR: An overview of the use of LSA for the analysis of textual data; the potential of LSA is demonstrated on a selected corpus of religious and sacred texts.
Abstract: The paper presents an overview of the usage of LSA for the analysis of textual data. The mathematical apparatus is explained in brief, and special attention is paid to the key parameters that influence the quality of the results obtained. The potential of LSA is demonstrated on a selected corpus of religious and sacred texts. The results of an experimental application of LSA for educational purposes are also presented.

Proceedings Article
30 Jun 2000
TL;DR: This paper introduces a methodology for criticizing models both globally (a BN in its entirety) and locally (observable nodes), and explores its value in identifying several kinds of misfit: node errors, edge errors, state errors, and prior probability errors in the latent structure.
Abstract: The application of Bayesian networks (BNs) to cognitive assessment and intelligent tutoring systems poses new challenges for model construction. When cognitive task analyses suggest constructing a BN with several latent variables, empirical model criticism of the latent structure becomes both critical and complex. This paper introduces a methodology for criticizing models both globally (a BN in its entirety) and locally (observable nodes), and explores its value in identifying several kinds of misfit: node errors, edge errors, state errors, and prior probability errors in the latent structure. The results suggest the indices have potential for detecting model misfit and assisting in locating problematic components of the model.

Book ChapterDOI
Eirik Hektoen1
01 Jan 2000
TL;DR: This paper presents a new technique, 'Semco', for selecting the correct parse of ambiguous sentences based on a probabilistic analysis of lexical co-occurrences in semantic forms; it uses Bayesian estimation of the co-occurrence probabilities to achieve higher accuracy on sparse data than the more common maximum likelihood estimation would.
Abstract: This chapter presents a new technique for selecting the correct parse of ambiguous sentences based on a probabilistic analysis of lexical co-occurrences in semantic forms. The method is called 'Semco' (for semantic co-occurrence analysis) and is specifically targeted at the differential distribution of such co-occurrences in correct and incorrect parses. It uses Bayesian estimation for the co-occurrence probabilities to achieve higher accuracy for sparse data than the more common maximum likelihood estimation would. It has been tested on the Wall Street Journal corpus (in the Penn Treebank) and shown to find the correct parse of 60.9% of parseable sentences of 6–20 words.
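A toy illustration of the Bayesian-versus-MLE point: under a symmetric Dirichlet prior, the posterior mean shrinks sparse co-occurrence counts toward uniform instead of assigning unseen pairs probability zero. The counts and the prior strength alpha are assumptions for illustration:

```python
import numpy as np

counts = np.array([3, 1, 0, 0])        # co-occurrence counts for one head word
V = len(counts)
alpha = 0.5                            # Dirichlet prior strength (an assumption)

mle = counts / counts.sum()                            # zero for unseen pairs
bayes = (counts + alpha) / (counts.sum() + alpha * V)  # posterior mean

print(mle)    # [0.75 0.25 0.   0.  ]
print(bayes)  # unseen pairs keep small nonzero probability
```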

Journal ArticleDOI
TL;DR: A Latent Semantic Index was constructed from arguments made by navy officers concerning events in an anti-air warfare scenario and a model based on LSI factor values predicted level of domain expertise with 89% accuracy.
Abstract: A Latent Semantic Index (LSI) was constructed from arguments made by navy officers concerning events in an anti-air warfare scenario. A model based on LSI factor values predicted level of domain expertise with 89% accuracy. The LSI factor space was reduced using MDS to five dimensions: aircraft route, aircraft response, kinematics, localization, and an unclassifiable element. Arguments in the localization category were reliably more common among officers with the greatest expertise. Automated classification of arguments into these elements achieved 84% accuracy. LSI may be a useful tool for automating aspects of modeling expertise and diagnosing knowledge deficiencies.

01 Jan 2000
TL;DR: A Bayesian treatment of mixtures of latent variable models is proposed to avoid having to choose a value for the dimension of the latent subspace by a computationally expensive search technique such as cross-validation.
Abstract: This paper deals with the problem of probability density estimation with the goal of finding a good probabilistic representation of the data. One of the most popular density estimation methods is the Gaussian mixture model (GMM). A promising alternative to GMMs are the recently proposed mixtures of latent variable models. Examples of the latter are principal component analysis and factor analysis. The advantage of these models is that they are capable of representing the covariance structure with less parameters by choosing the dimension of a subspace in a suitable way. An empirical evaluation on a large number of data sets shows that mixtures of latent variable models almost always outperform various GMMs both in density estimation and Bayes classifiers. To avoid having to choose a value for the dimension of the latent subspace by a computationally expensive search technique such as cross-validation, a Bayesian treatment of mixtures of latent variable models is proposed. This framework makes it possible to determine the appropriate dimension during training and experiments illustrate its viability.


Proceedings Article
01 May 2000
TL;DR: The utilization of semantic knowledge acquired from an MRD for language modelling tasks in relation to speech recognition applications is described, providing evidence that limited or incomplete knowledge from lexical resources such as MRDs can be useful for domain independent language modelling.
Abstract: Machine Readable Dictionaries (MRDs) have been used in a variety of language processing tasks including word sense disambiguation, text segmentation, information retrieval and information extraction. In this paper we describe the utilization of semantic knowledge acquired from an MRD for language modelling tasks in relation to speech recognition applications. A semantic model of language has been derived using the dictionary definitions in order to compute the semantic association between the words. The model is capable of capturing phenomena of latent semantic dependencies between the words in texts and reducing the language ambiguity by a considerable factor. The results of experiments suggest that the semantic model can improve the word recognition rates in “noisy-channel” applications. This research provides evidence that limited or incomplete knowledge from lexical resources such as MRDs can be useful for domain independent language modelling.

Proceedings ArticleDOI
01 Jul 2000
TL;DR: A new model named Boolean Latent Semantic Indexing model based on the Singular Value Decomposition and Boolean query formulation is introduced, which can help users to make precise representation of their information search needs.
Abstract: A new model named Boolean Latent Semantic Indexing model based on the Singular Value Decomposition and Boolean query formulation is introduced. While the Singular Value Decomposition alleviates the problems of lexical matching in the traditional information retrieval model, Boolean query formulation can help users to make precise representation of their information search needs. Retrieval experiments on a number of test collections seem to show that the proposed model achieves substantial performance gains over the Latent Semantic Indexing model.
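A minimal sketch of the combination on synthetic data: the Boolean query filters the candidate set exactly, and LSI similarity ranks the survivors. The term indices, query, and rank k below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(0.5, size=(100, 12)).astype(float)   # docs x terms

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 4
docs = U[:, :k] * s[:k]                              # docs in the LSI space

def boolean_mask(A, must_have, must_not=()):
    """Docs matching (AND over must_have) AND NOT (OR over must_not)."""
    ok = (A[:, list(must_have)] > 0).all(axis=1)
    if must_not:
        ok &= ~(A[:, list(must_not)] > 0).any(axis=1)
    return ok

q = np.zeros(12)
q[[1, 3]] = 1.0                                      # query: term 1 AND term 3
scores = docs @ (q @ Vt[:k].T)                       # LSI similarity ranking
mask = boolean_mask(A, must_have=[1, 3], must_not=[7])
ranked = [d for d in np.argsort(-scores) if mask[d]]
print(ranked[:5])                                    # Boolean-filtered LSI rank
```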


01 Jan 2000
TL;DR: This paper discusses two different forms of (exploratory) LC analysis which are implemented in a new computer program called Latent GOLD.
Abstract: Statistical Innovations Inc., P.O. Box 1, Belmont, MA 02478, USAKeywords. latent class analysis, factor analysis, cluster analysis, mixture models,categorical data, graphical displays, bi-plot, tri-plot, statistical softwareLatent class (LC) analysis is becoming one of the standard data analysis toolsin social, biomedical, and marketing research. This paper discusses two differentforms of (exploratory) LC analysis which are implemented in a new computerprogram called Latent GOLD

Proceedings Article
20 Aug 2000
TL;DR: This paper presents an approach that builds on user feedback across multiple queries in order to improve the retrieval quality of novel queries and demonstrates that REGRESSOR automatically improves on the performance of Latent Semantic Indexing by utilizing the feedback information from past queries.
Abstract: In several information retrieval (IR) systems there is a possibility for user feedback. Many machine learning methods have been proposed that learn from the feedback information in a long-term fashion. In this paper, we present an approach that builds on user feedback across multiple queries in order to improve the retrieval quality of novel queries. This allows users of an IR system to retrieve relevant documents at a reduced effort. Two algorithms for long-term learning across multiple queries in the scope of the retrieval system Latent Semantic Indexing have been implemented in a system, REGRESSOR, in order to test these ideas. The algorithms are based on k-nearest-neighbor searching and back propagation neural networks. Training examples are query vectors, and by using Latent Semantic Indexing, the examples are reduced to a fixed and manageable size. In order to evaluate the methods, we performed a set of experiments where we compared the performance of Latent Semantic Indexing and REGRESSOR. The results demonstrate that REGRESSOR automatically improves on the performance of Latent Semantic Indexing by utilizing the feedback information from past queries.
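A rough sketch of the k-nearest-neighbour variant on synthetic data: represent past queries as fixed-size vectors in the LSI space and, for a novel query, average the document feedback of its most similar predecessors. The dimensions and the blending scheme are assumptions, not REGRESSOR's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)
past_queries = rng.normal(size=(30, 10))       # past queries in the LSI space
feedback = rng.random((30, 200))               # relevance feedback they earned

def knn_scores(new_q, k=3):
    """Average the document feedback of the k most similar past queries."""
    sims = past_queries @ new_q / (
        np.linalg.norm(past_queries, axis=1) * np.linalg.norm(new_q))
    nearest = np.argsort(-sims)[:k]
    return feedback[nearest].mean(axis=0)

new_query = rng.normal(size=10)
boost = knn_scores(new_query)                  # to be combined with LSI ranking
print(np.argsort(-boost)[:5])                  # docs favored by past feedback
```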

Journal ArticleDOI
TL;DR: The proposed method uses latent semantic analysis to retrieve semantic dependencies between words and classifies documents based on these dependencies.
Abstract: In this paper, the problem of automatic document classification by a set of given topics is considered. The proposed method is based on the use of latent semantic analysis to retrieve semantic dependencies between words; the classification of documents is based on these dependencies. The results of experiments performed on the standard test data set TREC (Text REtrieval Conference) confirm the attractiveness of this approach. The relatively low computational complexity of the method at the classification stage makes it applicable to the classification of document streams.
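A minimal sketch of classification in LSA space on a toy corpus: fold training documents into the latent space, form one centroid per topic, and assign a new document to the nearest centroid. The corpus, topics, and rank are stand-ins:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["stocks fell sharply", "market rally gains", "team won the match",
         "player scored twice"]
labels = np.array([0, 0, 1, 1])                 # 0 = finance, 1 = sports

vec = TfidfVectorizer()
X = vec.fit_transform(train).toarray()
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs = X @ Vt[:k].T                             # training docs in LSA space
centroids = np.array([docs[labels == c].mean(axis=0) for c in (0, 1)])

test = vec.transform(["the team match result"]).toarray() @ Vt[:k].T
sims = (centroids @ test.T).ravel()             # similarity to each topic
print(int(np.argmax(sims)))                     # -> 1 (sports)
```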



Journal Article
TL;DR: Text browsing based on Latent Semantic Indexing (LSI) is presented in this paper, and it combines LSI with concept tagging to improve the efficiency of users reading.
Abstract: Text browsing is an assistant reading mechanism that helps users browse online texts. Text browsing based on Latent Semantic Indexing (LSI) is presented in this paper; it combines LSI with concept tagging to improve the efficiency of users' reading. It applies LSI to reduce the skew intersections and calculates the similarity between terms and texts in the semantic space; it also divides the terms into several semantic classes and determines the meanings of the classes. In addition, it implements information navigation based on a conceptual tree.

01 Jan 2000
TL;DR: The meta-analysis carried out in this research is an attempt to alleviate inconsistent results in previous studies.
Abstract: Semantic data modeling, such as entity-relationship (ER) modeling and extended/enhanced entity-relationship (EER) modeling, has emerged as an alternative to relational data modeling. The majority of research in data modeling suggests that the use of semantic data models leads to better performance. However, the findings are not conclusive and are sometimes inconsistent. In this research, we investigate modeling relationship correctness in relational and semantic models. The meta-analysis carried out in this research is an attempt to alleviate inconsistent results in previous studies.

Posted Content
TL;DR: A semantic parsing approach for unrestricted texts that obtains a case-role analysis, in which the semantic roles of the verb are identified, and correctly identifies more than 73% of possible semantic case-roles.
Abstract: This paper presents a semantic parsing approach for unrestricted texts. Semantic parsing is one of the major bottlenecks of Natural Language Understanding (NLU) systems and usually requires building expensive resources not easily portable to other domains. Our approach obtains a case-role analysis, in which the semantic roles of the verb are identified. In order to cover all the possible syntactic realisations of a verb, our system combines their argument structure with a set of general semantically labelled diathesis models. Combining them, the system builds a set of syntactic-semantic patterns with their own case-role representation. Once the patterns are built, we use an approximate tree pattern-matching algorithm to identify the most reliable pattern for a sentence. The pattern matching is performed between the syntactic-semantic patterns and the feature-structure tree representing the morphological, syntactic, and semantic information of the analysed sentence. For sentences assigned to the correct model, the semantic parsing system we present correctly identifies more than 73% of possible semantic case-roles.