Showing papers on "Probabilistic latent semantic analysis published in 2011"


Journal ArticleDOI
TL;DR: poLCA is a software package for the estimation of latent class and latent class regression models for polytomous outcome variables, implemented in the R statistical computing environment using expectation-maximization and Newton-Raphson algorithms to find maximum likelihood estimates of the model parameters.
Abstract: poLCA is a software package for the estimation of latent class and latent class regression models for polytomous outcome variables, implemented in the R statistical computing environment. Both models can be called using a single simple command line. The basic latent class model is a finite mixture model in which the component distributions are assumed to be multi-way cross-classification tables with all variables mutually independent. The latent class regression model further enables the researcher to estimate the effects of covariates on predicting latent class membership. poLCA uses expectation-maximization and Newton-Raphson algorithms to find maximum likelihood estimates of the model parameters.
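
A rough Python sketch of the EM loop that poLCA runs under the hood (poLCA itself is an R package); the model is simplified here to binary indicators, all names are ours, and the real package additionally handles general polytomous items and Newton-Raphson steps for the regression coefficients.

```python
import numpy as np

def latent_class_em(Y, n_classes, n_iter=200, seed=0):
    """EM for a basic latent class model with binary indicators.

    Y: (n_obs, n_items) 0/1 array; items are assumed conditionally
    independent given the latent class (the latent class assumption).
    Returns class priors and per-class item-response probabilities.
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    prior = np.full(n_classes, 1.0 / n_classes)           # P(class)
    theta = rng.uniform(0.25, 0.75, size=(n_classes, m))  # P(item=1 | class)
    for _ in range(n_iter):
        # E-step: posterior class membership for each observation
        log_post = (Y @ np.log(theta.T) + (1 - Y) @ np.log(1 - theta.T)
                    + np.log(prior))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors and item-response probabilities
        prior = post.mean(axis=0)
        theta = (post.T @ Y) / post.sum(axis=0)[:, None]
        theta = theta.clip(1e-6, 1 - 1e-6)
    return prior, theta
```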

991 citations


Book
08 Aug 2011
TL;DR: Provides a unified statistical approach showing that such apparently diverse methods as Latent Class Analysis and Factor Analysis are members of the same family of latent variable models.
Abstract: Latent Variable Models and Factor Analysis provides a comprehensive and unified approach to factor analysis and latent variable modeling from a statistical perspective. This book presents a general framework to enable the derivation of the commonly used models, along with updated numerical examples. Nature and interpretation of a latent variable is also introduced along with related techniques for investigating dependency. This book: Provides a unified approach showing how such apparently diverse methods as Latent Class Analysis and Factor Analysis are actually members of the same family. Presents new material on ordered manifest variables, MCMC methods, non-linear models as well as a new chapter on related techniques for investigating dependency. Includes new sections on structural equation models (SEM) and Markov Chain Monte Carlo methods for parameter estimation, along with new illustrative examples. Looks at recent developments on goodness-of-fit test statistics and on non-linear models and models with mixed latent variables, both categorical and continuous. No prior acquaintance with latent variable modelling is pre-supposed but a broad understanding of statistical theory will make it easier to see the approach in its proper perspective. Applied statisticians, psychometricians, medical statisticians, biostatisticians, economists and social science researchers will benefit from this book.

597 citations


Journal ArticleDOI
TL;DR: This paper conducts a systematic investigation of two representative probabilistic topic models, probabilistic latent semantic analysis (PLSA) and Latent Dirichlet Allocation (LDA), using three representative text mining tasks, including document clustering, text categorization, and ad-hoc retrieval.
Abstract: Probabilistic topic models have recently attracted much attention because of their successful applications in many text mining tasks such as retrieval, summarization, categorization, and clustering. Although many existing studies have reported promising performance of these topic models, none of the work has systematically investigated the task performance of topic models; as a result, some critical questions that may affect the performance of all applications of topic models are mostly unanswered, particularly how to choose between competing models, how multiple local maxima affect task performance, and how to set parameters in topic models. In this paper, we address these questions by conducting a systematic investigation of two representative probabilistic topic models, probabilistic latent semantic analysis (PLSA) and Latent Dirichlet Allocation (LDA), using three representative text mining tasks, including document clustering, text categorization, and ad-hoc retrieval. The analysis of our experimental results provides deeper understanding of topic models and many useful insights about how to optimize the performance of topic models for these typical tasks. The task-based evaluation framework is generalizable to other topic models in the family of either PLSA or LDA.
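
For readers who want the mechanics behind the PLSA side of the comparison, here is a minimal EM loop for plain PLSA in numpy; it is a sketch under our own naming, not the experimental code of the paper (which also studies LDA, local maxima, and parameter settings).

```python
import numpy as np

def plsa(X, n_topics, n_iter=100, seed=0):
    """Plain PLSA fitted by EM on a document-term count matrix X (D x W).

    Returns P(topic|doc) of shape (D, K) and P(word|topic) of shape (K, W).
    """
    rng = np.random.default_rng(seed)
    D, W = X.shape
    p_z_d = rng.dirichlet(np.ones(n_topics), size=D)    # P(z|d)
    p_w_z = rng.dirichlet(np.ones(W), size=n_topics)    # P(w|z)
    for _ in range(n_iter):
        # E- and M-steps folded together via expected counts:
        # P(z|d,w) is proportional to P(z|d) * P(w|z)
        mix = np.maximum(p_z_d @ p_w_z, 1e-12)          # P(w|d), (D, W)
        R = X / mix
        n_dz = p_z_d * (R @ p_w_z.T)                    # expected n(d, z)
        n_zw = p_w_z * (p_z_d.T @ R)                    # expected n(z, w)
        p_z_d = n_dz / n_dz.sum(axis=1, keepdims=True)
        p_w_z = n_zw / n_zw.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```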

236 citations


Journal ArticleDOI
TL;DR: This article studies the problem of learning a latent tree graphical model where samples are available only from a subset of variables, and proposes two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes.
Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set.
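
A small numpy sketch of the quantity that drives recursive grouping: in the Gaussian case the paper's information distance reduces to minus the log of the absolute pairwise correlation, and sibling groups are identified from how these distances differ across other nodes.

```python
import numpy as np

def information_distances(samples):
    """Pairwise information distances for jointly Gaussian variables:
    d(i, j) = -log |rho_ij|, with rho the correlation matrix.
    samples: (n_samples, n_vars) array."""
    rho = np.corrcoef(samples, rowvar=False)
    d = -np.log(np.clip(np.abs(rho), 1e-12, 1.0))
    np.fill_diagonal(d, 0.0)
    return d

# Sibling intuition used by recursive grouping: if leaves i and j share a
# parent, then d(i, k) - d(j, k) is the same for every other node k.
```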

231 citations


Proceedings ArticleDOI
24 Jul 2011
TL;DR: This paper introduces the Interdependent Latent Dirichlet Allocation (ILDA) model, a probabilistic graphical model which aims to extract aspects and corresponding ratings of products from online reviews, and conducts experiments on a real-life dataset from Epinions.com.
Abstract: Today, more and more product reviews become available on the Internet, e.g., product review forums, discussion groups, and Blogs. However, it is almost impossible for a customer to read all of the different and possibly even contradictory opinions and make an informed decision. Therefore, mining online reviews (opinion mining) has emerged as an interesting new research direction. Extracting aspects and the corresponding ratings is an important challenge in opinion mining. An aspect is an attribute or component of a product, e.g. 'screen' for a digital camera. It is common that reviewers use different words to describe an aspect (e.g. 'LCD', 'display', 'screen'). A rating is an intended interpretation of the user satisfaction in terms of numerical values. Reviewers usually express the rating of an aspect by a set of sentiments, e.g. 'blurry screen'. In this paper we present three probabilistic graphical models which aim to extract aspects and corresponding ratings of products from online reviews. The first two models extend standard PLSI and LDA to generate a rated aspect summary of product reviews. As our main contribution, we introduce Interdependent Latent Dirichlet Allocation (ILDA) model. This model is more natural for our task since the underlying probabilistic assumptions (interdependency between aspects and ratings) are appropriate for our problem domain. We conduct experiments on a real life dataset, Epinions.com, demonstrating the improved effectiveness of the ILDA model in terms of the likelihood of a held-out test set, and the accuracy of aspects and aspect ratings.

198 citations


01 May 2011
TL;DR: This work proposes two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes, and applies these algorithms to both discrete and Gaussian random variables.

179 citations


Journal ArticleDOI
TL;DR: It is argued that the amount of perceptual and other semantic information that can be learned from purely distributional statistics has been underappreciated and that future focus should be on understanding the cognitive mechanisms humans use to integrate the two sources.
Abstract: Since their inception, distributional models of semantics have been criticized as inadequate cognitive theories of human semantic learning and representation. A principal challenge is that the representations derived by distributional models are purely symbolic and are not grounded in perception and action; this challenge has led many to favor feature-based models of semantic representation. We argue that the amount of perceptual and other semantic information that can be learned from purely distributional statistics has been underappreciated. We compare the representations of three feature-based and nine distributional models using a semantic clustering task. Several distributional models demonstrated semantic clustering comparable with clustering based on feature-based representations. Furthermore, when trained on child-directed speech, the same distributional models perform as well as sensorimotor-based feature representations of children's lexical semantic knowledge. These results suggest that, to a large extent, information relevant for extracting semantic categories is redundantly coded in perceptual and linguistic experience. Detailed analyses of the semantic clusters of the feature-based and distributional models also reveal that the models make use of complementary cues to semantic organization from the two data streams. Rather than conceptualizing feature-based and distributional models as competing theories, we argue that future focus should be on understanding the cognitive mechanisms humans use to integrate the two sources.

140 citations


Journal ArticleDOI
TL;DR: Different LSA-based summarization algorithms are explained, two of which are proposed by the authors of this paper, and their performances are compared using ROUGE scores.
Abstract: Text summarization solves the problem of presenting the information needed by a user in a compact form. There are different approaches to creating well-formed summaries. One of the newest methods is Latent Semantic Analysis (LSA). In this paper, different LSA-based summarization algorithms are explained, two of which are proposed by the authors of this paper. The algorithms are evaluated on Turkish and English documents, and their performances are compared using their ROUGE scores. One of our algorithms produces the best scores, and both algorithms perform equally well on Turkish and English document sets.
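
As a concrete (hedged) illustration of the LSA machinery involved, here is the classic Gong and Liu selection rule in numpy: take the SVD of the term-sentence matrix and, for each leading singular vector, pick the sentence that weights it most. This is a standard baseline, not either of the two algorithms the paper proposes.

```python
import numpy as np

def lsa_summarize(term_sentence, k):
    """Select k sentence indices from a (n_terms, n_sentences) matrix.
    Each of the top-k right singular vectors is treated as a latent
    topic; the sentence with the largest weight on it is selected.
    Assumes k <= n_sentences."""
    _, _, Vt = np.linalg.svd(term_sentence, full_matrices=False)
    chosen = []
    for topic in Vt[:k]:
        order = np.argsort(-np.abs(topic))
        chosen.append(next(i for i in order if i not in chosen))
    return sorted(chosen)
```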

133 citations


Journal ArticleDOI
TL;DR: This communication provides an introduction, an example, and pointers to relevant software, and summarizes the choices the analyst can make, so that visualization ("semantic mapping") becomes more accessible.

112 citations


Proceedings ArticleDOI
24 Jul 2011
TL;DR: Experiments with the 100K and 1M MovieLens datasets show that including emotion and semantic information significantly improves the accuracy of prediction and improves upon state-of-the-art CF techniques.
Abstract: Collaborative filtering (CF) aims to recommend items based on prior user interaction. Despite their success, CF techniques do not handle data sparsity well, especially in the case of the cold start problem where there is no past rating for an item. In this paper, we provide a framework, which is able to tackle such issues by considering item-related emotions and semantic data. In order to predict the rating of an item for a given user, this framework relies on an extension of Latent Dirichlet Allocation, and on gradient boosted trees for the final prediction. We apply this framework to movie recommendation and consider two emotion spaces extracted from the movie plot summary and the reviews, and three semantic spaces: actor, director, and genre. Experiments with the 100K and 1M MovieLens datasets show that including emotion and semantic information significantly improves the accuracy of prediction and improves upon the state-of-the-art CF techniques. We also analyse the importance of each feature space and describe some uncovered latent groups.
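
A minimal scikit-learn sketch of the pipeline shape the abstract describes, with plain LDA standing in for the paper's LDA extension and gradient boosted trees for the final prediction; `plots` (one plot summary per movie) and the training pairs are hypothetical placeholders.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import CountVectorizer

def movie_topic_features(plots, n_topics=20):
    """Map each movie's plot summary to a topic mixture (its feature row)."""
    counts = CountVectorizer(max_features=5000, stop_words="english")
    X = counts.fit_transform(plots)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(X)          # (n_movies, n_topics)

def train_rating_model(features, ratings):
    """Gradient boosted trees on the latent features, as in the framework."""
    return GradientBoostingRegressor(n_estimators=200, max_depth=3).fit(
        features, ratings)
```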

108 citations


Proceedings ArticleDOI
24 Jul 2011
TL;DR: Two new document ranking models for Web search based upon the methods of semantic representation and the statistical translation-based approach to information retrieval (IR) are presented.
Abstract: This paper presents two new document ranking models for Web search based upon the methods of semantic representation and the statistical translation-based approach to information retrieval (IR). Assuming that a query is parallel to the titles of the documents clicked on for that query, large amounts of query-title pairs are constructed from clickthrough data; two latent semantic models are learned from this data. One is a bilingual topic model within the language modeling framework. It ranks documents for a query by the likelihood of the query being a semantics-based translation of the documents. The semantic representation is language independent and learned from query-title pairs, with the assumption that a query and its paired titles share the same distribution over semantic topics. The other is a discriminative projection model within the vector space modeling framework. Unlike Latent Semantic Analysis and its variants, the projection matrix in our model, which is used to map from term vectors into semantic space, is learned discriminatively such that the distance between a query and its paired title, both represented as vectors in the projected semantic space, is smaller than that between the query and the titles of other documents which have no clicks for that query. These models are evaluated on the Web search task using a real world data set. Results show that they significantly outperform their corresponding baseline models, which are state-of-the-art.

Book ChapterDOI
01 Jan 2011
TL;DR: This chapter discusses the findings and the responses that have been investigated, especially detection of attack profiles and the implementation of robust recommendation algorithms.
Abstract: Collaborative recommender systems are vulnerable to malicious users who seek to bias their output, causing them to recommend (or not recommend) particular items. This problem has been an active research topic since 2002. Researchers have found that the most widely-studied memory-based algorithms have significant vulnerabilities to attacks that can be fairly easily mounted. This chapter discusses these findings and the responses that have been investigated, especially detection of attack profiles and the implementation of robust recommendation algorithms.

01 Jan 2011
TL;DR: This paper presents a method that uses Latent Semantic Analysis (Landauer, Foltz & Laham, 1998) to automatically track and identify semantic changes across a corpus and demonstrates its potential by applying it to several well-known examples of semantic change in the history of the English language.
Abstract: Research in historical semantics relies on the examination, selection, and interpretation of texts from corpora. Changes in meaning are tracked through the collection and careful inspection of examples that span decades and centuries. This process is inextricably tied to the researcher's expertise and familiarity with the corpus. Consequently, the results tend to be difficult to quantify and put on an objective footing, and "big-picture" information about statistical trends and changes other than the specific ones under investigation are likely to be discarded. In this paper we present a method that uses Latent Semantic Analysis (Landauer, Foltz & Laham, 1998) to automatically track and identify semantic changes across a corpus. This method can take the entire corpus into account when tracing changes in the use of words and phrases, thus potentially allowing researchers to better account for the larger context in which these changes occurred, while at the same time considerably reducing the amount of work required. Moreover, because this measure relies on readily observable co-occurrence data, it affords the study of semantic change a measure of objectivity that was previously difficult to attain. In this paper we describe our method and demonstrate its potential by applying it to several well-known examples of semantic change in the history of the English language.
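
One simple way to operationalize this (a sketch under our own assumptions, not the authors' exact procedure) is to build an LSA space per time slice and watch how a target word's nearest neighbours drift across slices; comparing neighbour sets sidesteps the fact that the latent spaces of different periods are not directly aligned.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def neighbours_by_period(docs_by_period, target, n_dims=100, top_n=10):
    """docs_by_period: {period: list of document strings}.
    Returns the target word's top LSA neighbours in each period."""
    result = {}
    for period, docs in docs_by_period.items():
        vec = CountVectorizer()
        X = vec.fit_transform(docs)                    # docs x terms
        svd = TruncatedSVD(n_components=min(n_dims, X.shape[0] - 1))
        term_vecs = svd.fit_transform(X.T)             # terms x dims
        if target not in vec.vocabulary_:
            continue
        row = term_vecs[vec.vocabulary_[target]][None, :]
        sims = cosine_similarity(row, term_vecs)[0]
        terms = vec.get_feature_names_out()
        result[period] = [terms[i] for i in np.argsort(-sims)[1:top_n + 1]]
    return result
```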

Proceedings Article
28 Jun 2011
TL;DR: This work proposes a fast, local-minimum-free spectral algorithm for learning latent variable models with arbitrary tree topologies, and shows that the joint distribution of the observed variables can be reconstructed from the marginals of triples of observed variables irrespective of the maximum degree of the tree.
Abstract: Latent variable models are powerful tools for probabilistic modeling, and have been successfully applied to various domains, such as speech analysis and bioinformatics. However, parameter learning algorithms for latent variable models have predominantly relied on local search heuristics such as expectation maximization (EM). We propose a fast, local-minimum-free spectral algorithm for learning latent variable models with arbitrary tree topologies, and show that the joint distribution of the observed variables can be reconstructed from the marginals of triples of observed variables irrespective of the maximum degree of the tree. We demonstrate the performance of our spectral algorithm on synthetic and real datasets; for large training sizes, our algorithm performs comparable to or better than EM while being orders of magnitude faster.

Proceedings ArticleDOI
24 Jul 2011
TL;DR: This paper introduces Regularized Latent Semantic Indexing (RLSI), a new method designed for parallelization that is as effective as existing topic models and scales to larger datasets without reducing the input vocabulary.
Abstract: Topic modeling can boost the performance of information retrieval, but its real-world application is limited due to scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps such as vastly reducing input vocabulary. We introduce Regularized Latent Semantic Indexing (RLSI), a new method which is designed for parallelization. It is as effective as existing topic models, and scales to larger datasets without reducing input vocabulary. RLSI formalizes topic modeling as a problem of minimizing a quadratic loss function regularized by l₂ and/or l₁ norm. This formulation allows the learning process to be decomposed into multiple sub-optimization problems which can be optimized in parallel, for example via MapReduce. We particularly propose adopting l₂ norm on topics and l₁ norm on document representations, to create a model with compact and readable topics and useful for retrieval. Relevance ranking experiments on three TREC datasets show that RLSI performs better than LSI, PLSI, and LDA, and the improvements are sometimes statistically significant. Experiments on a web dataset, containing about 1.6 million documents and 7 million terms, demonstrate a similar boost in performance on a larger corpus and vocabulary than in previous studies.
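
A serial numpy sketch of the alternating RLSI updates (constant factors folded into the regularization weights): the topic matrix gets a ridge-style closed-form update, and the document representations get a proximal-gradient (soft-thresholding) step for the l₁ term. The paper's actual contribution, decomposing these into per-column subproblems solved in parallel (e.g. via MapReduce), is omitted here.

```python
import numpy as np

def soft_threshold(A, t):
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def rlsi(D, k, lam1=0.1, lam2=0.1, n_iter=50, seed=0):
    """Toy RLSI: minimize ||D - U V||_F^2 + lam2 ||U||_F^2 + lam1 ||V||_1,
    with D the term-document matrix (m x n), U the topics (m x k,
    l2-regularized) and V the document representations (k x n,
    l1-regularized)."""
    rng = np.random.default_rng(seed)
    m, n = D.shape
    U = 0.01 * rng.standard_normal((m, k))
    V = 0.01 * rng.standard_normal((k, n))
    for _ in range(n_iter):
        # U-step: ridge regression, closed form
        U = D @ V.T @ np.linalg.inv(V @ V.T + lam2 * np.eye(k))
        # V-step: one proximal-gradient (ISTA) step for the l1 subproblem
        step = 1.0 / (np.linalg.norm(U, 2) ** 2 + 1e-12)
        V = soft_threshold(V - step * (U.T @ (U @ V - D)), step * lam1)
    return U, V
```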

Journal ArticleDOI
28 Feb 2011-PLOS ONE
TL;DR: A stochastic model of a content-based network, based on a copy-and-mutation algorithm and on Heaps' law, is introduced that is able to capture the main statistical properties of the analysed semantic space, including Zipf's law for the word frequency distribution.
Abstract: In this paper we extract the topology of the semantic space in its encyclopedic sense, measuring the semantic flow between the different entries of the largest modern encyclopedia, Wikipedia, and thus creating a directed complex network of semantic flows. Notably, at the percolation threshold the semantic space is characterised by scale-free behaviour at different levels of complexity, and this relates the semantic space to a wide range of biological, social and linguistic phenomena. In particular we find that the cluster size distribution, representing the size of different semantic areas, is scale-free. Moreover the topology of the resulting semantic space is scale-free in the connectivity distribution and displays small-world properties. However its statistical properties do not allow a classical interpretation via a generative model based on a simple multiplicative process. After giving a detailed description and interpretation of the topological properties of the semantic space, we introduce a stochastic model of a content-based network, based on a copy-and-mutation algorithm and on Heaps' law, that is able to capture the main statistical properties of the analysed semantic space, including Zipf's law for the word frequency distribution.
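
Since the model is built on Heaps' law and reproduces Zipf's law, a small self-contained check of both empirical laws on raw text may help; the log-log least-squares fits are crude but expose the exponents the model targets.

```python
import re
from collections import Counter

import numpy as np

def zipf_heaps_exponents(text):
    """Crude log-log fits of Zipf (freq ~ rank^-alpha) and
    Heaps (vocabulary size ~ n^beta) on a plain-text corpus."""
    words = re.findall(r"[a-z]+", text.lower())
    freqs = np.sort(np.array(list(Counter(words).values()), float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    alpha = -np.polyfit(np.log(ranks), np.log(freqs), 1)[0]
    seen, growth = set(), []
    for w in words:                      # vocabulary growth along the text
        seen.add(w)
        growth.append(len(seen))
    n = np.arange(1, len(words) + 1)
    beta = np.polyfit(np.log(n), np.log(growth), 1)[0]
    return alpha, beta
```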

Proceedings Article
12 Dec 2011
TL;DR: An efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities and an incremental algorithm for the online setting which can update the latent space without extensive relearning are presented.
Abstract: A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data. Existing approaches however, are either too simplistic (linear), too complex to learn, or can only learn latent spaces from "simple data", i.e., single activities such as walking or running. In this paper, we present an efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities. Furthermore, we derive an incremental algorithm for the online setting which can update the latent space without extensive relearning. We demonstrate the effectiveness of our approach on the task of monocular and multi-view tracking and show that our approach outperforms the state-of-the-art.

Journal ArticleDOI
TL;DR: It is shown that even when restricted to binary trees, HLC models of comparable quality to Zhang's solutions are obtained while being generally faster to compute, and it is demonstrated that the methods can estimate interpretable latent structures on real-world data with a large number of variables.
Abstract: Inferring latent structures from observations helps to model and possibly also understand underlying data generating processes. A rich class of latent structures is the latent trees, i.e., tree-structured distributions involving latent variables where the visible variables are leaves. These are also called hierarchical latent class (HLC) models. Zhang and Kocka proposed a search algorithm for learning such models in the spirit of Bayesian network structure learning. While such an approach can find good solutions, it can be computationally expensive. As an alternative, we investigate two greedy procedures: The BIN-G algorithm determines both the structure of the tree and the cardinality of the latent variables in a bottom-up fashion. The BIN-A algorithm first determines the tree structure using agglomerative hierarchical clustering, and then determines the cardinality of the latent variables as for BIN-G. We show that even with restricting ourselves to binary trees, we obtain HLC models of comparable quality to Zhang's solutions (in terms of cross-validated log-likelihood), while being generally faster to compute. This claim is validated by a comprehensive comparison on several data sets. Furthermore, we demonstrate that our methods are able to estimate interpretable latent structures on real-world data with a large number of variables. By applying our method to a restricted version of the 20 newsgroups data, these models turn out to be related to topic models, and on data from the PASCAL Visual Object Classes (VOC) 2007 challenge, we show how such tree-structured models help us understand how objects co-occur in images. For reproducibility of all experiments in this paper, all code and data sets (or links to data) are available at http://people.kyb.tuebingen.mpg.de/harmeling/code/ltt-1.4.tar.
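
The first stage of a BIN-A-style procedure, fixing a binary tree by agglomerative hierarchical clustering of the observed variables, can be sketched with scipy; the correlation distance used here is an illustrative assumption, not necessarily the paper's choice, and the second stage (choosing latent cardinalities bottom-up as in BIN-G) is not shown.

```python
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def binary_tree_structure(X):
    """X: (n_samples, n_vars). Returns a scipy linkage matrix whose rows
    (node_i, node_j, dist, size) define a binary tree over the observed
    variables; each internal merge corresponds to a latent node."""
    dist = pdist(X.T, metric="correlation")   # 1 - correlation per pair
    return linkage(dist, method="average")
```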

Journal ArticleDOI
TL;DR: This paper proposes a two-stage feature selection algorithm that first reduces the dimension of terms with a novel feature selection method and then constructs a new reduced semantic space between terms based on latent semantic indexing.
Abstract: Feature selection for text categorization is a well-studied problem whose goal is to improve the effectiveness of categorization, the efficiency of computation, or both. Text categorization systems based on traditional term matching represent documents in a vector space model; however, this requires a high-dimensional space and ignores the semantic relationships between terms, which leads to poor categorization accuracy. The latent semantic indexing method can overcome this problem by using statistically derived conceptual indices to replace the individual terms. With the purpose of improving the accuracy and efficiency of categorization, in this paper we propose a two-stage feature selection method. Firstly, we apply a novel feature selection method to reduce the dimension of terms; then we construct a new semantic space between terms based on the latent semantic indexing method. Through applications involving spam database categorization, we find that our two-stage feature selection method performs better.
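
The two-stage idea maps naturally onto a standard scikit-learn pipeline; in this sketch a chi-squared filter stands in for the paper's novel first-stage selector, truncated SVD plays the latent semantic indexing role, and all parameter values are placeholders.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Stage 1 prunes terms; stage 2 maps the survivors into a latent
# semantic space; a linear classifier does the categorization.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("select", SelectKBest(chi2, k=2000)),    # stage 1: term reduction
    ("lsi", TruncatedSVD(n_components=100)),  # stage 2: semantic space
    ("clf", LinearSVC()),
])
# Usage: pipeline.fit(train_texts, train_labels); pipeline.predict(test_texts)
```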

Proceedings Article
01 Jan 2011
TL;DR: The four stages: individual model building, non-linear blending, linear ensemble and post-processing lead to a successful final solution, within which techniques on feature engineering and aggregation (blending and ensemble learning) play crucial roles.
Abstract: Track 1 of KDDCup 2011 aims at predicting the rating behavior of users in the Yahoo! Music system. At National Taiwan University, we organize a course that teams up students to work on both tracks of KDDCup 2011. For track 1, we first tackle the problem by building variants of existing individual models, including Matrix Factorization, Restricted Boltzmann Machine, k-Nearest Neighbors, Probabilistic Latent Semantic Analysis, Probabilistic Principal Component Analysis and Supervised Regression. We then blend the individual models along with some carefully extracted features in a non-linear manner. A large linear ensemble that contains both the individual and the blended models is learned and taken through some post-processing steps to form the final solution. The four stages: individual model building, non-linear blending, linear ensemble and post-processing lead to a successful final solution, within which techniques on feature engineering and aggregation (blending and ensemble learning) play crucial roles. Our team is the first prize winner of both tracks of KDD Cup 2011.

Proceedings Article
01 Aug 2011
TL;DR: Probabilistic latent semantic analysis is used to model the co-occurrence of overlapping sound events in audio recordings from everyday environments such as offices, streets, or shops; using PLSA to estimate prior probabilities of events increases event detection accuracy to 35%, compared to 30% with uniform priors.
Abstract: This paper presents the use of probabilistic latent semantic analysis (PLSA) for modeling co-occurrence of overlapping sound events in audio recordings from everyday audio environments such as office, street or shop. Co-occurrence of events is represented as the degree of their overlapping in a fixed length segment of polyphonic audio. In the training stage, PLSA is used to learn the relationships between individual events. In detection, the PLSA model continuously adjusts the probabilities of events according to the history of events detected so far. The event probabilities provided by the model are integrated into a sound event detection system that outputs a monophonic sequence of events. The model offers a very good representation of the data, having low perplexity on test recordings. Using PLSA for estimating prior probabilities of events provides an increase of event detection accuracy to 35%, compared to 30% for using uniform priors for the events. There are different levels of performance increase in different audio contexts, with few contexts showing significant improvement.

01 Jan 2011
TL;DR: An inverse kinematics solver based on Gaussian process latent variable models (GP-LVM) is presented, which weights the representative poses and optimizes the weights, combined with constraints on the end effectors, to synthesize an optimized pose.
Abstract: We present an inverse kinematics solver based on Gaussian process latent variable models (GP-LVM). Because motion capture data is high-dimensional, analyzing it directly is very hard. We map the motion capture data from the high-dimensional observation space to a two-dimensional latent space using GP-LVM, and then find the representative poses of the virtual character by clustering the motion capture data in the latent space. Finally, we weight the representative poses and optimize the weights, combined with constraints on the end effectors, to synthesize an optimized pose. The experiments show that our method achieves satisfying results.

Proceedings Article
07 Aug 2011
TL;DR: A methodology for integrating non-parametric tree methods into probabilistic latent variable models by extending functional gradient boosting is presented in the context of occupancy-detection modeling, where the goal is to model the distribution of a species from imperfect detections.
Abstract: Important ecological phenomena are often observed indirectly. Consequently, probabilistic latent variable models provide an important tool, because they can include explicit models of the ecological phenomenon of interest and the process by which it is observed. However, existing latent variable methods rely on hand-formulated parametric models, which are expensive to design and require extensive preprocessing of the data. Nonparametric methods (such as regression trees) automate these decisions and produce highly accurate models. However, existing tree methods learn direct mappings from inputs to outputs—they cannot be applied to latent variable models. This paper describes a methodology for integrating non-parametric tree methods into probabilistic latent variable models by extending functional gradient boosting. The approach is presented in the context of occupancy-detection (OD) modeling, where the goal is to model the distribution of a species from imperfect detections. Experiments on 12 real and 3 synthetic bird species compare standard and tree-boosted OD models (latent variable models) with standard and tree-boosted logistic regression models (without latent structure). All methods perform similarly when predicting the observed variables, but the OD models learn better representations of the latent process. Most importantly, tree-boosted OD models learn the best latent representations when non-linearities and interactions are present.

Journal ArticleDOI
TL;DR: The proposed two-phase algorithm evaluates the semantic similarity for two or more sentences via a semantic vector space and has outstanding performance in handling long sentences with complex syntax.
Abstract: Research highlights: ▶ This research takes advantage of corpus-based ontology and Information Retrieval technologies to evaluate the semantic similarity between irregular sentences. ▶ The part-of-speech concept was taken into account and integrated into the proposed semantic-VSM measure. ▶ This research tries to quantify the semantic similarity of natural language sentences. A novel sentence similarity measure for semantic-based expert systems is presented. The well-known problem in the fields of semantic processing, such as QA systems, is to evaluate the semantic similarity between irregular sentences. This paper takes advantage of corpus-based ontology to overcome this problem. A transformed vector space model is introduced in this article. The proposed two-phase algorithm evaluates the semantic similarity for two or more sentences via a semantic vector space. The first phase builds part-of-speech (POS) based subspaces from the raw data, and the second carries out a cosine evaluation and adopts the WordNet ontology to construct the semantic vectors. Unlike other related research that focused only on short sentences, our algorithm is applicable to short (4-5 words), medium (8-12 words), and even long sentences (over 12 words). The experiment demonstrates that the proposed algorithm has outstanding performance in handling long sentences with complex syntax. The significance of this research lies in the semantic similarity extraction of sentences with arbitrary structures.

Proceedings Article
12 Dec 2011
TL;DR: This work presents a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables that can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference.
Abstract: Latent tree graphical models are natural tools for expressing long range and hierarchical dependencies among many variables which are common in computer vision, bioinformatics and natural language processing problems. However, existing models are largely restricted to discrete and Gaussian variables due to computational constraints; furthermore, algorithms for estimating the latent tree structure and learning the model parameters are largely restricted to heuristic local search. We present a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables. Our method can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference. Experiments on simulated and real data show the advantage of our proposed approach.

Journal ArticleDOI
TL;DR: The models used in this article are secondary dimension mixture models with the potential to explain differential item functioning (DIF) between latent classes, called latent DIF.
Abstract: The models used in this article are secondary dimension mixture models with the potential to explain differential item functioning (DIF) between latent classes, called latent DIF. The focus is on models with a secondary dimension that is at the same time specific to the DIF latent class and linked to an item property. A description of the models is provided along with a means of estimating model parameters using easily available software and a description of how the models behave in two applications. One application concerns a test that is sensitive to speededness and the other is based on an arithmetic operations test where the division items show latent DIF.

Proceedings ArticleDOI
Xi Chen1, Yanjun Qi1, Bing Bai1, Qihang Lin, Jaime G. Carbonell 
01 Jan 2011
TL;DR: A new model called Sparse LSA is proposed, which produces a sparse projection matrix via l₁ regularization; it achieves performance gains similar to LSA, but is more efficient in projection computation and storage, and better explains the topic-word relationships.
Abstract: Latent semantic analysis (LSA), as one of the most popular unsupervised dimension reduction tools, has a wide range of applications in text mining and information retrieval. The key idea of LSA is to learn a projection matrix that maps the high dimensional vector space representations of documents to a lower dimensional latent space, i.e. the so-called latent topic space. In this paper, we propose a new model called Sparse LSA, which produces a sparse projection matrix via l₁ regularization. Compared to the traditional LSA, Sparse LSA selects only a small number of relevant words for each topic and hence provides a compact representation of topic-word relationships. Moreover, Sparse LSA is computationally very efficient with much less memory usage for storing the projection matrix. Furthermore, we propose two important extensions of Sparse LSA: group structured Sparse LSA and non-negative Sparse LSA. We conduct experiments on several benchmark datasets and compare Sparse LSA and its extensions with several widely used methods, e.g. LSA, Sparse Coding and LDA. Empirical results suggest that Sparse LSA achieves performance gains similar to LSA, but is more efficient in projection computation and storage, and also better explains the topic-word relationships.
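
A compact numpy sketch of one alternating scheme consistent with the abstract: with an orthonormal document-side factor U, the l₁-regularized projection matrix A has a closed-form soft-thresholding update, and U itself solves an orthogonal Procrustes problem. The authors' actual algorithm and its group-structured and non-negative extensions may differ in detail.

```python
import numpy as np

def sparse_lsa(X, k, lam=0.1, n_iter=50, seed=0):
    """min_{U, A} 0.5*||X - U A||_F^2 + lam*||A||_1  s.t.  U^T U = I,
    with X (docs x terms), U (docs x k), and A (k x terms) the sparse
    projection matrix mapping term vectors into the latent space."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((X.shape[0], k)))
    for _ in range(n_iter):
        # A-step: exact soft-thresholding, using U^T U = I
        A = U.T @ X
        A = np.sign(A) * np.maximum(np.abs(A) - lam, 0.0)
        # U-step: orthogonal Procrustes via thin SVD of X A^T
        P, _, Qt = np.linalg.svd(X @ A.T, full_matrices=False)
        U = P @ Qt
    return U, A
```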

Patent
Jonathan Murray1
27 Jan 2011
TL;DR: In this paper, the authors present a system and method for using modified Latent Semantic Analysis techniques to structure data for efficient search and display, through a process of optimal agglomerative clustering.
Abstract: The disclosed embodiments provide a system and method for using modified Latent Semantic Analysis techniques to structure data for efficient search and display. The present invention creates a hierarchy of clustered documents, representing the topics of a domain corpus, through a process of optimal agglomerative clustering. The output from a search query is displayed in a fisheye view corresponding to the hierarchy of clustered documents. The fisheye view may link to a two-dimensional self-organizing map that represents semantic relationships between documents.

Journal Article
TL;DR: A simple categorization of topic models derived from LDA is made, and representative models of each category are introduced, helping to clarify the relationships among works in the development of topic models.
Abstract: Topic models are receiving extensive attention in natural language processing. In this field, a topic is regarded as a probability distribution over terms. Topic models extract semantic topics using the co-occurrence of terms at the document level, and are used to transform documents from term space to topic space, obtaining a low-dimensional representation of documents. This paper starts from Latent Semantic Indexing (LSI), the origin of topic models, and describes pLSI and LDA, the fundamental works in the development of topic models, with a focus on the relationships among these works. As a generative model, LDA can be easily extended to other models. This paper gives a simple categorization of topic models derived from LDA and introduces representative models of each category. Furthermore, EM algorithms for parameter estimation of topic models are analyzed, which helps in understanding the relationships among works during the development of topic models.
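
As a concrete companion to the survey, fitting LDA takes a few lines with scikit-learn (whose implementation uses variational EM, one of the estimation families the survey analyzes); corpus choice and parameter values here are arbitrary.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Fit LDA on 20 newsgroups and print the top words per topic, illustrating
# the mapping from term space to a low-dimensional topic space.
docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data
vec = CountVectorizer(max_df=0.95, min_df=5, stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=20, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for t, comp in enumerate(lda.components_):
    top = comp.argsort()[-8:][::-1]
    print(t, " ".join(terms[i] for i in top))
doc_topics = lda.transform(X)  # documents in topic space, shape (n_docs, 20)
```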

Journal ArticleDOI
TL;DR: Experimental results over a QUICKBIRD image show that the clusters produced by the proposed algorithm are better than those of K-means and the Iterative Self-Organizing Data Analysis Technique Algorithm (ISODATA) in terms of object-oriented properties.
Abstract: In this letter, we present a novel object-oriented semantic clustering algorithm for high-spatial-resolution remote sensing images using the probabilistic latent semantic analysis (PLSA) model coupled with neighborhood spatial information. First of all, an image collection is generated by partitioning a large satellite image into densely overlapped subimages. Then, the PLSA model is employed to model the image collection. Specifically, the image collection is partitioned into two subsets. One is used to learn topic models, where the number of topics is determined using a minimum description length criterion. The other is folded in using the learned topic models. Therefore, every pixel in each subimage has been allocated a topic label. Finally, the cluster label of every pixel in the large satellite image is derived from the topic labels of the multiple subimages which cover the pixel in the image collection. Experimental results over a QUICKBIRD image show that the clusters produced by the proposed algorithm are better than those of K-means and the Iterative Self-Organizing Data Analysis Technique Algorithm (ISODATA) in terms of object-oriented properties.