
Showing papers on "Probabilistic latent semantic analysis published in 2005"


Proceedings ArticleDOI
17 Oct 2005
TL;DR: This work treats object categories as topics, so that an image containing instances of several categories is modeled as a mixture of topics, and applies a model developed in the statistical text literature: probabilistic latent semantic analysis (pLSA).
Abstract: We seek to discover the object categories depicted in a set of unlabelled images. We achieve this using a model developed in the statistical text literature: probabilistic latent semantic analysis (pLSA). In text analysis, this is used to discover topics in a corpus using the bag-of-words document representation. Here we treat object categories as topics, so that an image containing instances of several categories is modeled as a mixture of topics. The model is applied to images by using a visual analogue of a word, formed by vector quantizing SIFT-like region descriptors. The topic discovery approach successfully translates to the visual domain: for a small set of objects, we show that both the object categories and their approximate spatial layout are found without supervision. Performance of this unsupervised method is compared to the supervised approach of Fergus et al. (2003) on a set of unseen images containing only one object per image. We also extend the bag-of-words vocabulary to include 'doublets' which encode spatially local co-occurring regions. It is demonstrated that this extended vocabulary gives a cleaner image segmentation. Finally, the classification and segmentation methods are applied to a set of images containing multiple objects per image. These results demonstrate that we can successfully build object class models from an unsupervised analysis of images.

1,129 citations
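To make the pLSA machinery used above concrete, here is a minimal sketch of the EM updates on a toy "images as documents of visual words" count matrix. The toy counts, number of topics, and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Minimal pLSA via EM on a documents x words count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # P(w|z) and P(z|d), randomly initialised and normalised
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z|d,w) for every (document, word) pair
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # docs x topics x words
        p_z_dw = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        expected = counts[:, None, :] * p_z_dw               # docs x topics x words
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# toy data: 6 images described by counts over 8 visual words, 2 latent topics
X = np.array([[5, 4, 3, 0, 0, 0, 1, 0],
              [4, 5, 2, 1, 0, 0, 0, 0],
              [6, 3, 4, 0, 1, 0, 0, 0],
              [0, 0, 1, 5, 4, 3, 0, 1],
              [0, 1, 0, 4, 5, 4, 1, 0],
              [1, 0, 0, 3, 4, 5, 0, 0]])
p_w_z, p_z_d = plsa(X, n_topics=2)
print(np.round(p_z_d, 2))   # each image expressed as a mixture over the two discovered topics
```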


25 Feb 2005
TL;DR: Given a set of images containing multiple object categories, this work seeks to discover those categories and their image locations without supervision using generative models from the statistical text literature: probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA).
Abstract: Given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. We achieve this using generative models from the statistical text literature: probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA). In text analysis these are used to discover topics in a corpus using the bag-of-words document representation. Here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. The models are applied to images by using a visual analogue of a word, formed by vector quantizing SIFT-like region descriptors. We investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. The object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. We also demonstrate classification of unseen images and images containing multiple objects. Performance of the proposed unsupervised method is compared to the semi-supervised approach of [7]. (This work was sponsored in part by the EU Project CogViSys, the University of Oxford, Shell Oil, and the National Geospatial-Intelligence Agency.)

524 citations
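As a companion sketch, the LDA side of this comparison can be fitted to the same kind of bag-of-visual-words counts with an off-the-shelf implementation; scikit-learn is an assumed tool here, not the authors' implementation, and the data below are random placeholders for real quantised descriptor counts.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# X: images x visual-words count matrix (toy data; in the paper these counts
# come from vector-quantised SIFT-like region descriptors)
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(100, 300))

lda = LatentDirichletAllocation(n_components=5, learning_method="batch", random_state=0)
doc_topics = lda.fit_transform(X)                      # per-image topic mixtures, rows sum to ~1
top_words = lda.components_.argsort(axis=1)[:, -10:]   # 10 strongest visual words per topic
print(doc_topics[0], top_words[0])
```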


Proceedings ArticleDOI
17 Oct 2005
TL;DR: Probabilistic latent semantic analysis generates a compact scene representation, discriminative for accurate classification, and significantly more robust when less training data are available, and the ability of PLSA to automatically extract visually meaningful aspects is exploited to propose new algorithms for aspect-based image ranking and context-sensitive image segmentation.
Abstract: We present a new approach to model visual scenes in image collections, based on local invariant features and probabilistic latent space models. Our formulation provides answers to three open questions: (1) whether the invariant local features are suitable for scene (rather than object) classification; (2) whether unsupervised latent space models can be used for feature extraction in the classification task; and (3) whether the latent space formulation can discover visual co-occurrence patterns, motivating novel approaches for image organization and segmentation. Using a 9500-image dataset, our approach is validated on each of these issues. First, we show with extensive experiments on binary and multi-class scene classification tasks that a bag-of-visterm representation, derived from local invariant descriptors, consistently outperforms state-of-the-art approaches. Second, we show that probabilistic latent semantic analysis (PLSA) generates a compact scene representation, discriminative for accurate classification, and significantly more robust when less training data are available. Third, we have exploited the ability of PLSA to automatically extract visually meaningful aspects to propose new algorithms for aspect-based image ranking and context-sensitive image segmentation.

410 citations


Proceedings Article
05 Dec 2005
TL;DR: In this paper, the authors generalize a successful static model of relationships into a dynamic model that accounts for friendships drifting over time, and show how to make it tractable to learn such models from data, even as the number of entities n gets large.
Abstract: This paper explores two aspects of social network modeling. First, we generalize a successful static model of relationships into a dynamic model that accounts for friendships drifting over time. Second, we show how to make it tractable to learn such models from data, even as the number of entities n gets large. The generalized model associates each entity with a point in p-dimensional Euclidean latent space. The points can move as time progresses but large moves in latent space are improbable. Observed links between entities are more likely if the entities are close in latent space. We show how to make such a model tractable (sub-quadratic in the number of entities) by the use of appropriate kernel functions for similarity in latent space; the use of low-dimensional kd-trees; a new efficient dynamic adaptation of multidimensional scaling for a first pass of approximate projection of entities into latent space; and an efficient conjugate gradient update rule for non-linear local optimization in which amortized time per entity during an update is O(log n). We use both synthetic and real-world data on up to 11,000 entities, which indicate linear scaling in computation time and improved performance over four alternative approaches. We also illustrate the system operating on twelve years of NIPS co-publication data. We present a detailed version of this work in [1].

364 citations
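A toy illustration of the core modelling idea (links are more probable between entities that are close in latent space, and positions drift over time). The sigmoid link function and its parameters are illustrative assumptions, not the kernel used in the paper.

```python
import numpy as np

def link_probability(xi, xj, alpha=1.0, r=2.0):
    """Toy link model: entities close in latent space are more likely to connect.
    Sigmoid of (radius - distance); alpha and r are made-up parameters."""
    d = np.linalg.norm(xi - xj)
    return 1.0 / (1.0 + np.exp(alpha * (d - r)))

# two entities drifting apart over three time steps in a 2-D latent space
positions = {
    "A": [np.array([0.0, 0.0]), np.array([0.2, 0.1]), np.array([0.5, 0.4])],
    "B": [np.array([1.0, 0.5]), np.array([1.8, 1.2]), np.array([3.0, 2.5])],
}
for t in range(3):
    print(t, round(link_probability(positions["A"][t], positions["B"][t]), 3))
# the link probability decreases as the two points drift apart in latent space
```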


Proceedings ArticleDOI
Eric Gaussier1, Cyril Goutte1
15 Aug 2005
TL;DR: It is shown that PLSA solves the problem of NMF with KL divergence, and the implications of this relationship are explored.
Abstract: Non-negative Matrix Factorization (NMF, [5]) and Probabilistic Latent Semantic Analysis (PLSA, [4]) have been successfully applied to a number of text analysis tasks such as document clustering. Despite their different inspirations, both methods are instances of multinomial PCA [1]. We further explore this relationship and first show that PLSA solves the problem of NMF with KL divergence, and then explore the implications of this relationship.

305 citations
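The stated relationship can be made concrete with a small sketch: fit NMF under the generalised KL divergence and renormalise the factors into pLSA-style probabilities, reading X ≈ WH so that P(w|z) comes from the rows of H and the joint (document, topic) weight from W scaled by the row sums of H. scikit-learn's NMF and the toy counts below are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(40, 60)).astype(float)   # toy document-term counts

# NMF with the generalised KL divergence; multiplicative updates ('mu') are needed for this loss
nmf = NMF(n_components=5, beta_loss="kullback-leibler", solver="mu",
          init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)       # documents x topics
H = nmf.components_            # topics x terms

# Renormalising the factorisation gives pLSA-style probabilities
p_w_given_z = H / H.sum(axis=1, keepdims=True)        # P(w|z) from the rows of H
p_dz = (W * H.sum(axis=1)) / (W @ H).sum()            # joint weight of (document, topic)
print(p_w_given_z.shape, p_dz.sum())                  # (5, 60) and 1.0
```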


Proceedings Article
19 Aug 2005
TL;DR: This work proposes a novel sampling algorithm for collective entity resolution which is unsupervised and also takes entity relations into account, and demonstrates the utility and practicality of the relational entity resolution approach for author resolution in two real-world bibliographic datasets.
Abstract: Entity resolution has received considerable attention in recent years. Given many references to underlying entities, the goal is to predict which references correspond to the same entity. We show how to extend the Latent Dirichlet Allocation model for this task and propose a probabilistic model for collective entity resolution for relational domains where references are connected to each other. Our approach differs from other recently proposed entity resolution approaches in that it is a) generative, b) does not make pair-wise decisions and c) captures relations between entities through a hidden group variable. We propose a novel sampling algorithm for collective entity resolution which is unsupervised and also takes entity relations into account. Additionally, we do not assume the domain of entities to be known and show how to infer the number of entities from the data. We demonstrate the utility and practicality of our relational entity resolution approach for author resolution in two real-world bibliographic datasets. In addition, we present preliminary results on characterizing conditions under which relational information is useful.

293 citations


Proceedings ArticleDOI
15 Aug 2005
TL;DR: This paper introduces the multi-label informed latent semantic indexing (MLSI) algorithm, which preserves the information in the inputs while capturing the correlations between the multiple outputs and incorporating the human-annotated category information.
Abstract: Latent semantic indexing (LSI) is a well-known unsupervised approach for dimensionality reduction in information retrieval. However, if the output information (i.e. category labels) is available, it is often beneficial to derive the indexing not only based on the inputs but also on the target values in the training data set. This is of particular importance in applications with multiple labels, in which each document can belong to several categories simultaneously. In this paper we introduce the multi-label informed latent semantic indexing (MLSI) algorithm, which preserves the information in the inputs while capturing the correlations between the multiple outputs. The recovered "latent semantics" thus incorporate the human-annotated category information and can be used to greatly improve the prediction accuracy. An empirical study based on two data sets, Reuters-21578 and RCV1, demonstrates very encouraging results.

253 citations


01 Jan 2005
TL;DR: This paper presents a general-purpose knowledge integration framework that employs Bayesian networks in integrating both low-level and semantic features, and demonstrates that effective inference engines can be built within this powerful and flexible framework according to specific domain knowledge and available training data.
Abstract: Current research in content-based semantic image understanding is largely confined to exemplar-based approaches built on low-level feature extraction and classification. The ability to extract both low-level and semantic features and perform knowledge integration of different types of features is expected to raise semantic image understanding to a new level. Belief networks, or Bayesian networks (BN), have proven to be an effective knowledge representation and inference engine in artificial intelligence and expert systems research. Their effectiveness is due to the ability to explicitly integrate domain knowledge in the network structure and to reduce a joint probability distribution to conditional independence relationships. In this paper, we present a general-purpose knowledge integration framework that employs BN in integrating both low-level and semantic features. The efficacy of this framework is demonstrated via three applications involving semantic understanding of pictorial images. The first application aims at detecting main photographic subjects in an image, the second aims at selecting the most appealing image in an event, and the third aims at classifying images into indoor or outdoor scenes. With these diverse examples, we demonstrate that effective inference engines can be built within this powerful and flexible framework according to specific domain knowledge and available training data to solve inherently uncertain vision problems.

138 citations


MonographDOI
01 Jan 2005
TL;DR: The linear factor analysis (FA) model is a popular tool for exploratory data analysis or, more precisely, for assessing the dimensionality of sets of items as mentioned in this paper, but although it is meant for continuous observed indicators it is often used with discrete variables, yielding results that might be incorrect.
Abstract: The linear factor analysis (FA) model is a popular tool for exploratory data analysis or, more precisely, for assessing the dimensionality of sets of items. Although it is well known that it is meant for continuous observed indicators, it is often used with dichotomous, ordinal, and other types of discrete variables, yielding results that might be incorrect. Not only may parameter estimates be biased, but goodness-of-fit indices also cannot be trusted. Magidson and Vermunt (2001) presented a nonlinear factor-analytic model based on latent class (LC) analysis that is especially suited for dealing with categorical indicators, such as dichotomous, ordinal, and nominal variables.

112 citations


Proceedings ArticleDOI
04 Sep 2005
TL;DR: An unsupervised dynamic language model (LM) adaptation framework using long-distance latent topic mixtures is proposed; the LDA model is combined with the trigram language model using linear interpolation, reducing the perplexity and character error rate.
Abstract: We propose an unsupervised dynamic language model (LM) adaptation framework using long-distance latent topic mixtures. The framework employs the Latent Dirichlet Allocation model (LDA) which models the latent topics of a document collection in an unsupervised and Bayesian fashion. In the LDA model, each word is modeled as a mixture of latent topics. Varying topics within a context can be modeled by re-sampling the mixture weights of the latent topics from a prior Dirichlet distribution. The model can be trained using the variational Bayes Expectation Maximization algorithm. During decoding, mixture weights of the latent topics are adapted dynamically using the hypotheses of previously decoded utterances. In our work, the LDA model is combined with the trigram language model using linear interpolation. We evaluated the approach on the CCTV episode of the RT04 Mandarin Broadcast News test set. Results show that the proposed approach reduces the perplexity by up to 15.4% relative and the character error rate by 4.9% relative depending on the size and setup of the training set.

87 citations
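A minimal sketch of the interpolation step described above: the LDA model contributes a unigram probability computed from adapted topic weights, which is linearly interpolated with the trigram probability. All probabilities, weights, and dimensions below are made-up toy values; the dynamic re-estimation of topic weights from decoded hypotheses is not reproduced here.

```python
import numpy as np

def lda_unigram_prob(word_id, topic_weights, p_word_given_topic):
    """P_LDA(w) = sum_z P(z) * P(w|z), with topic weights adapted from
    previously decoded utterances (here simply given as an array)."""
    return float(topic_weights @ p_word_given_topic[:, word_id])

def interpolated_prob(p_trigram, p_lda, lam=0.7):
    """Linear interpolation of the trigram LM with the LDA unigram model."""
    return lam * p_trigram + (1.0 - lam) * p_lda

# toy numbers: 3 topics over a vocabulary of 5 words
p_w_given_z = np.array([[0.4, 0.3, 0.1, 0.1, 0.1],
                        [0.1, 0.1, 0.5, 0.2, 0.1],
                        [0.2, 0.2, 0.2, 0.2, 0.2]])
topic_weights = np.array([0.6, 0.3, 0.1])   # would be re-estimated as decoding proceeds
p = interpolated_prob(p_trigram=0.02,
                      p_lda=lda_unigram_prob(2, topic_weights, p_w_given_z))
print(round(p, 4))
```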


Journal ArticleDOI
TL;DR: In this article, structural equation models are used to semiparametrically model nonlinear latent variable regression functions, where the latent classes are estimated only in the service of more flexibly modeling the characteristics of the aggregate population as a whole.
Abstract: To date, finite mixtures of structural equation models (SEMMs) have been developed and applied almost exclusively for the purpose of providing model-based cluster analyses. This type of analysis constitutes a direct application of the model wherein the estimated component distributions of the latent classes are thought to represent the characteristics of distinct unobserved subgroups of the population. This article instead considers an indirect application of the SEMM in which the latent classes are estimated only in the service of more flexibly modeling the characteristics of the aggregate population as a whole. More specifically, the SEMM is used to semiparametrically model nonlinear latent variable regression functions. This approach is first developed analytically and then demonstrated empirically through analyses of simulated and real data.

Proceedings ArticleDOI
13 Mar 2005
TL;DR: The proposed approach consists of identifying important concepts in documents using two criteria, co-occurrence and semantic relatedness, and then disambiguating them via an external general-purpose ontology, namely WordNet.
Abstract: This paper deals with the use of ontologies for Information Retrieval. Roughly, the proposed approach consists of identifying important concepts in documents using two criteria, co-occurrence and semantic relatedness, and then disambiguating them via an external general-purpose ontology, namely WordNet. Matching the ontology and a document results in a set of scored concept-senses (nodes) with weighted links. This representation, called the semantic core of a document, best reveals the semantic content of the document. We regard our approach, of which the first evaluation results are encouraging, as a short but strong step toward the long-term goal of Intelligent Indexing and Semantic Retrieval.
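A rough sketch of the disambiguation idea: score candidate WordNet senses of a term against co-occurring terms. NLTK's WordNet interface and path similarity are assumed stand-ins for illustration; the paper's actual concept-sense scoring and weighting are not reproduced here.

```python
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def best_sense(word, context_words):
    """Pick the WordNet sense of `word` most related to the context,
    using path similarity as a simple relatedness measure."""
    best, best_score = None, -1.0
    for sense in wn.synsets(word):
        score = 0.0
        for ctx in context_words:
            sims = [sense.path_similarity(s) or 0.0 for s in wn.synsets(ctx)]
            score += max(sims, default=0.0)
        if score > best_score:
            best, best_score = sense, score
    return best, best_score

sense, score = best_sense("bank", ["money", "deposit", "loan"])
print(sense, round(score, 3))
```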

Journal Article
TL;DR: LSA is shown to be an efficient method in text-based research in three respects: matching summaries to the texts read, essay grading, and the measurement of textual coherence, of which the last plays the key role.
Abstract: Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. It has proved to be an efficient method in text-based research in three respects: matching summaries to the texts read, essay grading, and the measurement of textual coherence, of which the last plays the key role. LSA can measure the amount of semantic overlap between adjoining sections of text to calculate coherence.
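A minimal sketch of the coherence measurement described above: project adjoining sections of text into an LSA space and take the cosine similarity of neighbouring vectors. The corpus, dimensionality, and scikit-learn tooling are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

sections = [
    "the cell membrane controls what enters and leaves the cell",
    "transport across the membrane can be passive or active",
    "active transport requires energy supplied by the cell",
    "the stock market fell sharply after the announcement",
]
X = CountVectorizer().fit_transform(sections)          # term counts per section
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# coherence = cosine similarity between each pair of adjoining sections
for i in range(len(sections) - 1):
    sim = cosine_similarity(lsa[i:i+1], lsa[i+1:i+2])[0, 0]
    print(f"sections {i}-{i+1}: {sim:.2f}")
```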

Proceedings Article
30 Jul 2005
TL;DR: A new method for extracting meaningful relations from unstructured natural language sources based on information made available by shallow semantic parsers that surpassed the results of kernel-based models employing only semantic class information.
Abstract: This paper presents a new method for extracting meaningful relations from unstructured natural language sources. The method is based on information made available by shallow semantic parsers. Semantic information was used (1) to enhance a dependency tree kernel; and (2) to build semantic dependency structures used for enhanced relation extraction for several semantic classifiers. In our experiments the quality of the extracted relations surpassed the results of kernel-based models employing only semantic class information.

Journal ArticleDOI
TL;DR: A new dual probability model based on the similarity concepts is introduced to provide deeper understanding of LSI, indicating that LSI dimensions represent latent concepts.
Abstract: Latent Semantic Indexing (LSI), when applied to semantic space built on text collections, improves information retrieval, information filtering, and word sense disambiguation. A new dual probability model based on the similarity concepts is introduced to provide deeper understanding of LSI. Semantic associations can be quantitatively characterized by their statistical significance, the likelihood. Semantic dimensions containing redundant and noisy information can be separated out and should be ignored because of their negative contribution to the overall statistical significance. LSI is the optimal solution of the model. The peak in the likelihood curve indicates the existence of an intrinsic semantic dimension. The importance of LSI dimensions follows the Zipf distribution, indicating that LSI dimensions represent latent concepts. Document frequency of words follows the Zipf distribution, and the number of distinct words follows a log-normal distribution. Experiments on five standard document collections confirm and illustrate the analysis.

Journal ArticleDOI
TL;DR: A semi-supervised semantic clustering method based on Support Vector Machines (SVM) to organize the 3D models semantically and proposes a unified search strategy which applies semantic constraints to the retrieval by using the resulting clusters.
Abstract: In this paper, we present a semi-supervised semantic clustering method based on Support Vector Machines (SVM) to organize the 3D models semantically. Ground truth data is used to identify the pattern of each semantic category by supervised learning. The unknown data is then automatically classified and clustered based on the resulting pattern. We also propose a unified search strategy which applies semantic constraints to the retrieval by using the resulting clusters. A query is first labeled with its semantic concept therefore shape-based search is only conducted in the corresponding cluster. Experiments are performed to evaluate the effects of the semantic clustering and retrieval respectively by using our prototypical 3D Engineering Shape Search System (3DESS).

Proceedings ArticleDOI
29 Jun 2005
TL;DR: Results comparing PLSA and LSA on three essay sets from various subjects show the two methods to be almost equal in accuracy, measured by the Spearman correlation between the grades given by the system and a human.
Abstract: Probabilistic Latent Semantic Analysis (PLSA) is an information retrieval technique proposed to improve the problems found in Latent Semantic Analysis (LSA). We have applied both LSA and PLSA in our system for grading essays written in Finnish, called Automatic Essay Assessor (AEA). We report the results comparing PLSA and LSA with three essay sets from various subjects. The methods were found to be almost equal in the accuracy measured by Spearman correlation between the grades given by the system and a human. Furthermore, we propose methods for improving the usage of PLSA in essay grading.
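The accuracy measure used above, Spearman rank correlation between system grades and human grades, can be computed directly with SciPy. The grade vectors below are invented toy data, not results from the paper.

```python
from scipy.stats import spearmanr

human_grades  = [3, 5, 2, 4, 1, 4, 5, 2, 3, 1]   # toy data
system_grades = [3, 4, 2, 5, 1, 3, 5, 1, 4, 2]   # e.g. grades from an LSA/PLSA-based grader

rho, p_value = spearmanr(human_grades, system_grades)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```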

Proceedings ArticleDOI
07 Nov 2005
TL;DR: A novel approach for describing image color semantic including regional and global semantic description is developed and presented, which allows the users to query images with emotional semantic words.
Abstract: Describing images in semantic terms is an important and challenging problem in content-based image retrieval. According to the strong relationship between colors and human emotions, an emotional semantic query model based on image color semantic description is proposed in this study. First, images are segmented into regions through a new color image segmentation algorithm. Then, term sets are generated through a fuzzy clustering algorithm so that colors can be interpreted in semantic terms. We extend the method to extract the color semantic of image regions, and develop a novel approach for describing image color semantic including regional and global semantic description. Finally, we present an image query scheme through image color semantic description, which allows the users to query images with emotional semantic words. Experimental results demonstrate the effectiveness of our approach.

Journal ArticleDOI
TL;DR: This study proposes three new language modeling techniques that use semantic analysis for spoken dialog systems, and shows that as the semantic information utilized is increased and as the tightness of integration between lexical and semantic items is increased, the two types of models become more complementary in nature.

Journal ArticleDOI
TL;DR: Experimental results show that the performance of the proposed semantic learning scheme is excellent when compared with that of the traditional text-based semantic retrieval techniques and content-based image retrieval methods.
Abstract: In this paper, a new semantic learning method for content-based image retrieval using the analytic hierarchical process (AHP) is proposed. AHP, proposed by Saaty, uses a systematic way to solve multi-criteria preference problems involving qualitative data and has been widely applied to a great diversity of areas. In general, the interpretations of an image are multiple and hard to describe in terms of low-level features due to the lack of a complete image understanding model. The AHP provides a good way to evaluate the fitness of a semantic description used to interpret an image. According to a predefined concept hierarchy, a semantic vector, consisting of the fitness values of semantic descriptions of a given image, is used to represent the semantic content of the image. Based on the semantic vectors, the database images are clustered. For each semantic cluster, the weightings of the low-level features (i.e. color, shape, and texture) used to represent the content of the images are calculated by analyzing the homogeneity of the class. The weightings assigned to the three low-level feature types therefore differ across semantic clusters for retrieval. The proposed semantic learning scheme provides a way to bridge the gap between the high-level semantic concept and the low-level features for content-based image retrieval. Experimental results show that the performance of the proposed method is excellent when compared with that of the traditional text-based semantic retrieval techniques and content-based image retrieval methods.

Proceedings Article
01 Jan 2005
TL;DR: This paper replaces the DP prior with a simpler alternative, Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP, and applies efficient variational inference based on DMA.
Abstract: This paper describes nonparametric Bayesian treatments for analyzing records containing occurrences of items. The introduced model retains the strength of previous approaches that explore the latent factors of each record (e.g. topics of documents), and further uncovers the clustering structure of records, which reflects the statistical dependencies of the latent factors. The nonparametric model induced by a Dirichlet process (DP) flexibly adapts model complexity to reveal the clustering structure of the data. To avoid the problems of dealing with infinite dimensions, we further replace the DP prior by a simpler alternative, namely Dirichlet-multinomial allocation (DMA), which maintains the main modelling properties of the DP. Instead of relying on Markov chain Monte Carlo (MCMC) for inference, this paper applies efficient variational inference based on DMA. The proposed approach yields encouraging empirical results on both a toy problem and text data. The results show that the proposed algorithm uncovers not only the latent factors, but also the clustering structure.

Book ChapterDOI
20 Jul 2005
TL;DR: Methods for using vector-space retrieval model and Latent Semantic Indexing techniques in combination with an invariant image representation based on local descriptors of salient regions to find images with similar semantic labels are presented.
Abstract: The vector-space retrieval model and Latent Semantic Indexing approaches to retrieval have been used heavily in the field of text information retrieval over the past years. The use of these approaches in image retrieval, however, has been somewhat limited. In this paper, we present methods for using these techniques in combination with an invariant image representation based on local descriptors of salient regions. The paper also presents an evaluation in which the two techniques are used to find images with similar semantic labels.

Journal ArticleDOI
Thorsten Brants1
TL;DR: A new test-data likelihood substitute is derived for PLSA and an empirical evaluation shows that the new likelihood substitute produces the best predictions about accuracies in two different IR tasks and is therefore best suited to determine the number of EM steps when training PLSA models.
Abstract: Probabilistic Latent Semantic Analysis (PLSA) is a statistical latent class model that has recently received considerable attention. In its usual formulation it cannot assign likelihoods to unseen documents. Furthermore, it assigns a probability of zero to unseen documents during training. We point out that one of the two existing alternative formulations of the Expectation-Maximization algorithm for PLSA does not require this assumption. However, even that formulation does not allow calculation of the actual likelihood values. We therefore derive a new test-data likelihood substitute for PLSA and compare it to three existing likelihood substitutes. An empirical evaluation shows that our new likelihood substitute produces the best predictions about accuracies in two different IR tasks and is therefore best suited to determine the number of EM steps when training PLSA models. The new likelihood measure and its evaluation also suggest that PLSA is not very sensitive to overfitting for the two tasks considered. This renders additions like tempered EM, which specifically address overfitting, unnecessary.
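For context, the usual way to attach a trained pLSA model to an unseen document is folding-in: hold P(w|z) fixed and fit only the document's topic mixture by EM. The sketch below shows that baseline procedure, not the paper's proposed likelihood substitute; all numbers are toy values.

```python
import numpy as np

def fold_in(new_counts, p_w_z, n_iter=50):
    """Fold an unseen document into a trained pLSA model: keep P(w|z) fixed
    and fit only the topic mixture P(z|d_new) with EM."""
    n_topics = p_w_z.shape[0]
    p_z_d = np.full(n_topics, 1.0 / n_topics)
    for _ in range(n_iter):
        # E-step: P(z | d_new, w) for every word
        joint = p_z_d[:, None] * p_w_z                      # topics x words
        p_z_dw = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
        # M-step: re-estimate the topic mixture from expected counts
        p_z_d = (p_z_dw * new_counts[None, :]).sum(axis=1)
        p_z_d /= p_z_d.sum() + 1e-12
    return p_z_d

# trained P(w|z) for 2 topics over 6 words (toy values), plus an unseen document
p_w_z = np.array([[0.40, 0.30, 0.20, 0.05, 0.03, 0.02],
                  [0.02, 0.03, 0.05, 0.20, 0.30, 0.40]])
new_doc = np.array([0, 1, 0, 3, 4, 5])
print(np.round(fold_in(new_doc, p_w_z), 3))   # mixture should lean toward the second topic
```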

Proceedings ArticleDOI
04 Sep 2005
TL;DR: The Generalized Hebbian Algorithm is shown to be equivalent to Latent Semantic Analysis and applicable to a range of LSA-style tasks, and its use allows very large datasets to be processed.
Abstract: The Generalized Hebbian Algorithm is shown to be equivalent to Latent Semantic Analysis, and applicable to a range of LSA-style tasks. GHA is a learning algorithm which converges on an approximation of the eigen decomposition of an unseen frequency matrix given observations presented in sequence. Use of GHA allows very large datasets to be processed.
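A minimal sketch of the Generalized Hebbian Algorithm (Sanger's rule) learning leading components from observations presented one at a time. The learning rate, toy data, and number of passes are illustrative assumptions, not the setup used in the paper.

```python
import numpy as np

def gha(stream, n_components, lr=0.01, n_passes=20, seed=0):
    """Generalized Hebbian Algorithm (Sanger's rule): incrementally approximates the
    leading eigenvectors of the data covariance from observations seen in sequence."""
    rng = np.random.default_rng(seed)
    dim = stream.shape[1]
    W = rng.normal(scale=0.1, size=(n_components, dim))
    for _ in range(n_passes):
        for x in stream:
            y = W @ x
            # Sanger's update: Hebbian term minus projections onto earlier components
            W += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

rng = np.random.default_rng(1)
# correlated toy data standing in for rows of a term co-occurrence matrix
data = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
W = gha(data, n_components=2)
print(np.round(W / np.linalg.norm(W, axis=1, keepdims=True), 2))  # approximate eigenvectors
```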

01 Jun 2005
TL;DR: The aim of this work is to apply Latent Semantic Analysis to a general text corpus, and for a test vocabulary, the lexical relations between a test word and its closest neighbours are analysed.
Abstract: In the past decade, Latent Semantic Analysis (LSA) has been used in many NLP approaches, sometimes with remarkable success. However, its ability to express semantic relatedness has not yet been systematically investigated. This is the aim of our work, where LSA is applied to a general text corpus (a German newspaper) and, for a test vocabulary, the lexical relations between a test word and its closest neighbours are analysed. These results are compared to the results of a collocation analysis.

Book ChapterDOI
18 May 2005
TL;DR: A regularized probabilistic latent semantic analysis model (RPLSA) is proposed that can properly adjust the amount of model flexibility so that not only can the training data be fit well, but the model is also robust against overfitting.
Abstract: Mixture models, such as Gaussian Mixture Model, have been widely used in many applications for modeling data. Gaussian mixture model (GMM) assumes that data points are generated from a set of Gaussian models with the same set of mixture weights. A natural extension of GMM is the probabilistic latent semantic analysis (PLSA) model, which assigns different mixture weights for each data point. Thus, PLSA is more flexible than the GMM method. However, as a tradeoff, PLSA usually suffers from the overfitting problem. In this paper, we propose a regularized probabilistic latent semantic analysis model (RPLSA), which can properly adjust the amount of model flexibility so that not only the training data can be fit well but also the model is robust to avoid the overfitting problem. We conduct empirical study for the application of speaker identification to show the effectiveness of the new model. The experiment results on the NIST speaker recognition dataset indicate that the RPLSA model outperforms both the GMM and PLSA models substantially. The principle of RPLSA of appropriately adjusting model flexibility can be naturally extended to other applications and other types of mixture models.

Journal ArticleDOI
01 Jan 2005
TL;DR: It is shown that knowledge structures can be interpreted as a special type of constrained latent class model, which offers a possibility to construct a knowledge structure by exploratory data analysis from observed response patterns.
Abstract: This paper tries to establish a connection between knowledge structures and latent class models. We will show that knowledge structures can be interpreted as a special type of constrained latent class model. Latent class models offer a well-founded theoretical framework to investigate the connection of a given latent class model to observed data. If we establish a connection between latent class models and knowledge structures, we can also use this framework in knowledge structure theory. We will show that the connection to latent class models offers us a possibility to construct a knowledge structure by exploratory data analysis from observed response patterns. Other possible applications are the empirical comparison of hypothetical knowledge structures and the statistical test of a given knowledge structure.

Journal ArticleDOI
TL;DR: This paper proposes a new set of models for classification in continuous domains, termed latent classification models, presents algorithms for learning both the parameters and the structure of a latent classification model, and demonstrates empirically that the accuracy of the proposed model is significantly higher than the accuracy of other probabilistic classifiers.
Abstract: One of the simplest, and yet most consistently well-performing set of classifiers is the Naive Bayes models. These models rely on two assumptions: (i) All the attributes used to describe an instance are conditionally independent given the class of that instance, and (ii) all attributes follow a specific parametric family of distributions. In this paper we propose a new set of models for classification in continuous domains, termed latent classification models. The latent classification model can roughly be seen as combining the Naive Bayes model with a mixture of factor analyzers, thereby relaxing the assumptions of the Naive Bayes classifier. In the proposed model the continuous attributes are described by a mixture of multivariate Gaussians, where the conditional dependencies among the attributes are encoded using latent variables. We present algorithms for learning both the parameters and the structure of a latent classification model, and we demonstrate empirically that the accuracy of the proposed model is significantly higher than the accuracy of other probabilistic classifiers.

Journal Article
TL;DR: This work investigates the efficiency of applying Latent Semantic Indexing, an automatic indexing method of information retrieval, to some classes of patent documents from the United States Patent Classification System and compares the performance of the LSI to the Vector Space Model technique applied to real life text documents.
Abstract: Since the huge database of patent documents is continuously increasing, the issue of classifying, updating and retrieving patent documents turned into an acute necessity. Therefore, we investigate the efficiency of applying Latent Semantic Indexing, an automatic indexing method of information retrieval, to some classes of patent documents from the United States Patent Classification System. We present some experiments that provide the optimal number of dimensions for the Latent Semantic Space and we compare the performance of Latent Semantic Indexing (LSI) to the Vector Space Model (VSM) technique applied to real life text documents, namely, patent documents. However, we do not strongly recommend the LSI as an improved alternative method to the VSM, since the results are not significantly better.

01 Jan 2005
TL;DR: This chapter analyzes the values used by Latent Semantic Indexing (LSI) for information retrieval by manipulating the values in the Singular Value Decomposition (SVD) matrices, and finds that a significant fraction of the values have little effect on overall performance and can thus be removed (changed to zero), which allows the dense term-by-dimension and document-by-dimension matrices to be converted into sparse matrices.
Abstract: In this chapter we analyze the values used by Latent Semantic Indexing (LSI) for information retrieval. By manipulating the values in the Singular Value Decomposition (SVD) matrices, we find that a significant fraction of the values have little effect on overall performance, and can thus be removed (changed to zero). This allows us to convert the dense term-by-dimension and document-by-dimension matrices into sparse matrices by identifying and removing those entries. We empirically show that these entries are unimportant by presenting retrieval and runtime performance results, using seven collections, which show that removal of up to 70% of the values in the term-by-dimension matrix results in similar or improved retrieval performance (as compared to LSI). Removal of 90% of the values degrades retrieval performance slightly for smaller collections, but improves retrieval performance by 60% on the large collection we tested. Our approach additionally has the computational benefit of reducing memory requirements and query response time.
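A small sketch of the thresholding idea, assuming SciPy's sparse SVD on a toy term-by-document matrix: zero out the smallest-magnitude entries of the term-by-dimension matrix and store it sparsely. The 70% level mirrors the figure quoted above; the matrix sizes and data are invented for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.poisson(0.3, size=(500, 200)).astype(float)   # toy term-by-document matrix

k = 50
U, s, Vt = svds(A, k=k)            # truncated SVD: U is the term-by-dimension matrix

# zero out the 70% of entries in U with the smallest magnitude, then store sparsely
threshold = np.quantile(np.abs(U), 0.70)
U_sparse = csr_matrix(np.where(np.abs(U) >= threshold, U, 0.0))
print(U_sparse.nnz / U.size)       # roughly 0.30 of the entries remain
```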