
Showing papers on "Probabilistic latent semantic analysis" published in 2009


Proceedings Article
07 Dec 2009
TL;DR: New quantitative methods for measuring semantic meaning in inferred topics are presented, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood.
Abstract: Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for measuring semantic meaning in inferred topics. We back these measures with large-scale user studies, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood. Surprisingly, topic models which perform better on held-out likelihood may infer less semantically meaningful topics.

1,878 citations
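To make the word-intrusion idea in this abstract concrete, here is a minimal, hypothetical sketch: given a fitted topic-word distribution, build intrusion tasks by mixing a topic's top words with one low-probability "intruder" taken from another topic; how reliably readers spot the intruder is the interpretability signal the paper measures with user studies. The names, toy vocabulary, and intruder-selection heuristic below are illustrative assumptions, not the authors' protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_word_intrusion_tasks(topic_word, vocab, n_top=5):
    """For each topic, pick its n_top most probable words and add one
    'intruder': a word that is unlikely under this topic but highly probable
    under some other topic (the task described in the abstract)."""
    tasks = []
    for k, dist in enumerate(topic_word):
        top = np.argsort(dist)[::-1][:n_top]
        # candidate intruders: low probability here, high probability elsewhere
        other = rng.integers(len(topic_word))
        while other == k:
            other = rng.integers(len(topic_word))
        intruder_pool = [w for w in np.argsort(topic_word[other])[::-1][:n_top]
                         if w not in top and dist[w] < np.median(dist)]
        if not intruder_pool:
            continue
        intruder = intruder_pool[0]
        words = list(top) + [intruder]
        rng.shuffle(words)
        tasks.append({"topic": k,
                      "words": [vocab[w] for w in words],
                      "intruder": vocab[intruder]})
    return tasks

# toy example: 2 topics over a 6-word vocabulary
vocab = ["cat", "dog", "pet", "bank", "loan", "money"]
topic_word = np.array([[.40, .30, .25, .02, .02, .01],
                       [.01, .02, .02, .35, .30, .30]])
for task in build_word_intrusion_tasks(topic_word, vocab, n_top=3):
    print(task)
```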


Reference EntryDOI
30 Nov 2009
TL;DR: In this article, the authors introduce latent class analysis, its extension to repeated measures, and several recent developments that further extend the latent class model.
Abstract: Often quantities of interest in psychology cannot be observed directly. These unobservable quantities are known as latent variables. By using multiple items as indicators of the latent variable, we can obtain a more complete picture of the construct of interest and estimate measurement error. One approach to latent variable modeling is latent class analysis, a method appropriate for examining the relationship between discrete observed variables and a discrete latent variable. The present chapter will introduce latent class analysis, its extension to repeated measures, and recent developments further extending the latent class model. First, the concept of a latent class and the mathematical model are presented. This is followed by a discussion of parameter restrictions, model fit, and the measurement quality of categorical items. Second, latent class analysis is demonstrated through an examination of the prevalence of depression types in adolescents. Third, longitudinal extensions of the latent class model are presented. This section also contains an empirical example on adolescent depression types, where the previous analysis is extended to examine the stability and change in depression types over time. Finally, several recent developments that further extend the latent class model are introduced. Keywords: categorical variables; depression types; latent class analysis; latent transition analysis; latent variables; longitudinal

932 citations


Proceedings ArticleDOI
14 Jun 2009
TL;DR: A large-margin formulation and algorithm for structured output prediction that allows the use of latent variables; the generality and performance of the approach are demonstrated through three applications: motif finding, noun-phrase coreference resolution, and optimizing precision at k in information retrieval.
Abstract: We present a large-margin formulation and algorithm for structured output prediction that allows the use of latent variables. Our proposal covers a large range of application problems, with an optimization problem that can be solved efficiently using Concave-Convex Programming. The generality and performance of the approach are demonstrated through three applications: motif finding, noun-phrase coreference resolution, and optimizing precision at k in information retrieval.

729 citations


Proceedings Article
07 Dec 2009
TL;DR: This work pursues a similar approach with a richer kind of latent variable (latent features), using a Bayesian nonparametric approach to simultaneously infer the number of features while learning which entities have each feature, and combines these inferred features with known covariates in order to perform link prediction.
Abstract: As the availability and importance of relational data—such as the friendships summarized on a social networking website—increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks have been relatively limited. In particular, the machine learning community has focused on latent class models, adapting Bayesian nonparametric methods to jointly infer how many latent classes there are while learning which entities belong to each class. We pursue a similar approach with a richer kind of latent variable—latent features—using a Bayesian nonparametric approach to simultaneously infer the number of features at the same time we learn which entities have each feature. Our model combines these inferred features with known covariates in order to perform link prediction. We demonstrate that the greater expressiveness of this approach allows us to improve performance on three datasets.

448 citations
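The scoring rule described in the abstract can be sketched as a logistic link probability over binary latent features plus known covariates. The sketch below only shows that scoring form; the Bayesian nonparametric inference of the feature matrix (and of the number of features) is not shown, and all names and dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def link_probability(Z, W, X=None, beta=None):
    """Probability of an edge between every pair of entities.
    Z    : (n, k) binary latent feature matrix (which entity has which feature)
    W    : (k, k) real-valued feature-interaction weights
    X    : optional (n, n, p) known pairwise covariates
    beta : optional (p,) covariate weights
    A hypothetical sketch of the scoring rule described in the abstract;
    the Bayesian nonparametric inference of Z is not shown."""
    score = Z @ W @ Z.T
    if X is not None and beta is not None:
        score = score + X @ beta
    return sigmoid(score)

# toy example: 3 entities, 2 latent features
Z = np.array([[1, 0],
              [1, 1],
              [0, 1]])
W = np.array([[2.0, -1.0],
              [-1.0, 1.5]])
print(np.round(link_probability(Z, W), 3))
```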


Journal ArticleDOI
TL;DR: A statistical model of social network data derived from matrix representations and symmetry considerations is discussed that allows for the graphical description of a social network via the latent factors of the nodes, and provides a framework for the prediction of missing links in network data.
Abstract: We discuss a statistical model of social network data derived from matrix representations and symmetry considerations. The model can include known predictor information in the form of a regression term, and can represent additional structure via sender-specific and receiver-specific latent factors. This approach allows for the graphical description of a social network via the latent factors of the nodes, and provides a framework for the prediction of missing links in network data.

220 citations


Proceedings ArticleDOI
07 Jan 2009
TL;DR: A model of natural language inference which identifies valid inferences by their lexical and syntactic features, without full semantic interpretation is proposed, extending past work in natural logic by incorporating both semantic exclusion and implicativity.
Abstract: We propose a model of natural language inference which identifies valid inferences by their lexical and syntactic features, without full semantic interpretation. We extend past work in natural logic, which has focused on semantic containment and monotonicity, by incorporating both semantic exclusion and implicativity. Our model decomposes an inference problem into a sequence of atomic edits linking premise to hypothesis; predicts a lexical semantic relation for each edit; propagates these relations upward through a semantic composition tree according to properties of intermediate nodes; and joins the resulting semantic relations across the edit sequence. A computational implementation of the model achieves 70% accuracy and 89% precision on the FraCaS test suite. Moreover, including this model as a component in an existing system yields significant performance gains on the Recognizing Textual Entailment challenge.

212 citations


Proceedings ArticleDOI
14 Jun 2009
TL;DR: This work introduces a probabilistic framework for modeling both the topical and geometrical structure of the dyadic data that explicitly takes into account the local manifold structure.
Abstract: Dyadic data arises in many real world applications such as social network analysis and information retrieval. In order to discover the underlying or hidden structure in dyadic data, many topic modeling techniques have been proposed. Typical algorithms include Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). The probability density functions obtained by both algorithms are supported on the Euclidean space. However, many previous studies have shown that naturally occurring data may reside on or close to an underlying submanifold. We introduce a probabilistic framework for modeling both the topical and geometrical structure of the dyadic data that explicitly takes into account the local manifold structure. Specifically, the local manifold structure is modeled by a graph. The graph Laplacian, analogous to the Laplace-Beltrami operator on manifolds, is applied to smooth the probability density functions. As a result, the obtained probabilistic distributions are concentrated around the data manifold. Experimental results on real data sets demonstrate the effectiveness of the proposed approach.

176 citations
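A minimal sketch of the graph-Laplacian idea from the abstract, under illustrative assumptions: build a k-nearest-neighbour graph over documents, form the graph Laplacian, and smooth the per-document topic distributions toward their graph neighbours so that the distributions concentrate around the data manifold. This is a simplified stand-in for the regularised EM the authors describe, not their algorithm.

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric k-nearest-neighbour affinity matrix (binary weights)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d2[i])[:k]] = 1.0
    return np.maximum(W, W.T)                     # symmetrise

def laplacian_penalty(theta, W):
    """Smoothness of per-document topic distributions over the graph:
    tr(theta^T L theta) with L = D - W; small when linked documents have
    similar topic distributions (the quantity a Laplacian regulariser
    pushes down)."""
    L = np.diag(W.sum(1)) - W
    return float(np.trace(theta.T @ L @ theta))

def laplacian_smooth(theta, W, lam=0.5):
    """One crude smoothing step: move each document's distribution toward
    the average of its graph neighbours, then renormalise (a stand-in for
    the regularised EM described in the abstract)."""
    deg = np.maximum(W.sum(1, keepdims=True), 1e-12)
    smoothed = (1 - lam) * theta + lam * (W @ theta) / deg
    return smoothed / smoothed.sum(1, keepdims=True)

# toy example: 6 documents, 3 topics
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))                       # features used to build the graph
theta = rng.dirichlet(np.ones(3), size=6)         # per-document topic distributions
W = knn_graph(X, k=2)
print(laplacian_penalty(theta, W))
print(laplacian_penalty(laplacian_smooth(theta, W), W))   # typically smaller after smoothing
```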


Proceedings ArticleDOI
20 Jun 2009
TL;DR: A novel approach to automatically learn a semantic visual vocabulary from abundant quantized midlevel features by using diffusion maps to capture the local intrinsic geometric relations between the midlevel feature points on the manifold.
Abstract: In this paper, we propose a novel approach for learning generic visual vocabulary. We use diffusion maps to automatically learn a semantic visual vocabulary from abundant quantized midlevel features. Each midlevel feature is represented by the vector of pointwise mutual information (PMI). In this midlevel feature space, we believe the features produced by similar sources must lie on a certain manifold. To capture the intrinsic geometric relations between features, we measure their dissimilarity using diffusion distance. The underlying idea is to embed the midlevel features into a semantic lower-dimensional space. Our goal is to construct a compact yet discriminative semantic visual vocabulary. Although the conventional approach using k-means is good for vocabulary construction, its performance is sensitive to the size of the visual vocabulary. In addition, the learnt visual words are not semantically meaningful since the clustering criterion is based on appearance similarity only. Our proposed approach can effectively overcome these problems by capturing the semantic and geometric relations of the feature space using diffusion maps. Unlike some of the supervised vocabulary construction approaches, and the unsupervised methods such as pLSA and LDA, diffusion maps can capture the local intrinsic geometric relations between the midlevel feature points on the manifold. We have tested our approach on the KTH action dataset, our own YouTube action dataset and the fifteen scene dataset, and have obtained very promising results.

173 citations
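A rough sketch of the pipeline the abstract outlines, with illustrative parameters: embed mid-level feature vectors with a diffusion map (Gaussian affinity, row-stochastic Markov matrix, leading non-trivial eigenvectors scaled by their eigenvalues), then cluster the embedded points with k-means to form a compact vocabulary. The PMI feature construction and the datasets are not reproduced here.

```python
import numpy as np

def diffusion_map(X, sigma=1.0, n_dims=2, t=1):
    """Embed points into diffusion coordinates.
    X: (n, d) mid-level feature vectors (e.g. PMI vectors in the paper)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))           # Gaussian affinity
    P = K / K.sum(1, keepdims=True)              # row-stochastic Markov matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # skip the trivial constant eigenvector, scale by eigenvalue^t
    return vecs[:, 1:n_dims + 1] * (vals[1:n_dims + 1] ** t)

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means on the embedded points to form the visual vocabulary."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels, centers

# toy example: 30 mid-level features in a 10-d space, 4 visual words
rng = np.random.default_rng(2)
feats = rng.normal(size=(30, 10))
emb = diffusion_map(feats, sigma=2.0, n_dims=3)
labels, _ = kmeans(emb, k=4)
print(labels)
```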


Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper studies three collaborative filtering frameworks (low-rank matrix approximation, probabilistic latent semantic analysis, and maximum-margin matrix factorization) and proposes two novel algorithms for large-scale OCCF that allow the unknowns to be weighted.
Abstract: One-Class Collaborative Filtering (OCCF) is a task that naturally emerges in recommender system settings. Typical characteristics include: Only positive examples can be observed, classes are highly imbalanced, and the vast majority of data points are missing. The idea of introducing weights for missing parts of a matrix has recently been shown to help in OCCF. While existing weighting approaches mitigate the first two problems above, a sparsity preserving solution that would allow efficient use of data sets with, e.g., hundreds of thousands of users and items has not yet been reported. In this paper, we study three different collaborative filtering frameworks: Low-rank matrix approximation, probabilistic latent semantic analysis, and maximum-margin matrix factorization. We propose two novel algorithms for large-scale OCCF that allow the unknowns to be weighted. Our experimental results demonstrate their effectiveness and efficiency on different problems, including the Netflix Prize data.

163 citations
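A generic sketch of weighted low-rank matrix approximation in the one-class setting described above: observed positives get full confidence, missing entries are treated as zeros with a small weight, and the factors are fit by alternating least squares with per-entry weights. This illustrates the weighting idea only; it is not the paper's two proposed algorithms, and all parameter values are assumptions.

```python
import numpy as np

def weighted_als(R, W, k=5, lam=0.1, iters=10, seed=0):
    """Weighted low-rank factorisation R ~ U V^T.
    R : (m, n) implicit feedback matrix (1 = observed positive, 0 = missing)
    W : (m, n) per-entry confidence weights (missing entries get a low weight)
    A generic sketch of the weighted-ALS idea, not the paper's exact algorithms."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    I = np.eye(k)
    for _ in range(iters):
        for u in range(m):
            Wu = np.diag(W[u])
            U[u] = np.linalg.solve(V.T @ Wu @ V + lam * I, V.T @ Wu @ R[u])
        for i in range(n):
            Wi = np.diag(W[:, i])
            V[i] = np.linalg.solve(U.T @ Wi @ U + lam * I, U.T @ Wi @ R[:, i])
    return U, V

# toy example: 4 users x 5 items, uniform low weight on the unknowns
R = np.array([[1, 0, 1, 0, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 1, 0],
              [0, 0, 1, 0, 1]], dtype=float)
W = np.where(R > 0, 1.0, 0.1)
U, V = weighted_als(R, W, k=2)
print(np.round(U @ V.T, 2))     # reconstructed preference scores
```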


Journal ArticleDOI
TL;DR: This work evaluates a simple metric of pointwise mutual information and demonstrates that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models.
Abstract: Computational models of lexical semantics, such as latent semantic analysis, can automatically generate semantic similarity measures between words from statistical redundancies in text. These measures are useful for experimental stimulus selection and for evaluating a model’s cognitive plausibility as a mechanism that people might use to organize meaning in memory. Although humans are exposed to enormous quantities of speech, practical constraints limit the amount of data that many current computational models can learn from. We follow up on previous work evaluating a simple metric of pointwise mutual information. Controlling for confounds in previous work, we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models. We also present a simple tool for building simple and scalable models from large corpora quickly and efficiently.

153 citations
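A minimal sketch of the pointwise mutual information metric the study evaluates, computed from windowed co-occurrence counts over a toy corpus; the corpus, window size, and counting details are illustrative assumptions rather than the authors' exact setup.

```python
import math
from collections import Counter

def pmi_table(sentences, window=5):
    """Pointwise mutual information between word pairs, estimated from
    co-occurrence within a fixed window: PMI(x, y) = log p(x, y) / (p(x) p(y))."""
    word_counts, pair_counts, total = Counter(), Counter(), 0
    for tokens in sentences:
        for i, w in enumerate(tokens):
            word_counts[w] += 1
            total += 1
            for v in tokens[i + 1:i + 1 + window]:
                pair_counts[tuple(sorted((w, v)))] += 1
    n_pairs = sum(pair_counts.values())
    pmi = {}
    for (x, y), c in pair_counts.items():
        p_xy = c / n_pairs
        p_x, p_y = word_counts[x] / total, word_counts[y] / total
        pmi[(x, y)] = math.log(p_xy / (p_x * p_y))
    return pmi

# toy corpus (the paper trains on vastly larger text collections)
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"],
          ["cats", "and", "dogs", "are", "pets"]]
scores = pmi_table(corpus, window=3)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:5])
```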


Proceedings ArticleDOI
04 Jun 2009
TL;DR: This work proposes a mechanism for adding partial supervision, called topic-in-set knowledge, to latent topic modeling, to encourage the recovery of topics which are more relevant to user modeling goals than the topics which would be recovered otherwise.
Abstract: Latent Dirichlet Allocation is an unsupervised graphical model which can discover latent topics in unlabeled data. We propose a mechanism for adding partial supervision, called topic-in-set knowledge, to latent topic modeling. This type of supervision can be used to encourage the recovery of topics which are more relevant to user modeling goals than the topics which would be recovered otherwise. Preliminary experiments on text datasets are presented to demonstrate the potential effectiveness of this method.
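One plausible way to realise "topic-in-set knowledge" is to restrict the topics a seed word may be assigned to during collapsed Gibbs sampling for LDA. The sketch below shows a single restricted sampling step under that assumption; it is not the authors' implementation, and the counts and hyperparameters are toy values.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_topic(word, doc_topic_counts, topic_word_counts, topic_totals,
                 alpha, beta, vocab_size, allowed_topics=None):
    """One collapsed-Gibbs topic draw for a single token.
    If `allowed_topics` is given for this word (topic-in-set knowledge),
    the proposal is restricted to that set; otherwise all topics are allowed."""
    K = len(topic_totals)
    probs = (doc_topic_counts + alpha) * \
            (topic_word_counts[:, word] + beta) / (topic_totals + vocab_size * beta)
    if allowed_topics is not None:
        mask = np.zeros(K)
        mask[list(allowed_topics)] = 1.0
        probs = probs * mask
    probs = probs / probs.sum()
    return rng.choice(K, p=probs)

# toy example: 3 topics, 5-word vocabulary, word 2 is constrained to topic {0}
doc_topic_counts = np.array([2.0, 1.0, 4.0])
topic_word_counts = np.abs(rng.normal(size=(3, 5))) + 1
topic_totals = topic_word_counts.sum(1)
z = sample_topic(word=2, doc_topic_counts=doc_topic_counts,
                 topic_word_counts=topic_word_counts, topic_totals=topic_totals,
                 alpha=0.1, beta=0.01, vocab_size=5, allowed_topics={0})
print(z)   # always 0 because of the topic-in-set constraint
```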

Proceedings ArticleDOI
Honglei Guo, Huijia Zhu, Zhili Guo, Xiaoxun Zhang, Zhong Su
02 Nov 2009
TL;DR: This paper proposes an unsupervised product-feature categorization method with multilevel latent semantic association that achieves better performance compared with the existing approaches and is language- and domain-independent.
Abstract: In recent years, the number of freely available online reviews has been increasing rapidly. Aspect-based opinion mining techniques have been employed to find out reviewers' opinions toward different product aspects. Such finer-grained opinion mining is valuable for potential customers making their purchase decisions. Product-feature extraction and categorization is very important for better mining of aspect-oriented opinions. Since people usually use different words to describe the same aspect in reviews, product-feature extraction and categorization becomes more challenging. Manual product-feature extraction and categorization is tedious and time-consuming, and practically infeasible for the massive number of products. In this paper, we propose an unsupervised product-feature categorization method with multilevel latent semantic association. After extracting product-features from the semi-structured reviews, we construct the first latent semantic association (LaSA) model to group words into a set of concepts according to their virtual context documents. It generates the latent semantic structure for each product-feature. The second LaSA model is constructed to categorize the product-features according to their latent semantic structures and context snippets in the reviews. Experimental results demonstrate that our method achieves better performance compared with the existing approaches. Moreover, the proposed method is language- and domain-independent.

Proceedings ArticleDOI
31 Mar 2009
TL;DR: A new statistical method for detecting and tracking changes in word meaning, based on Latent Semantic Analysis, which allows researchers to make statistical inferences on questions such as whether the meaning of a word changed across time or if a phonetic cluster is associated with a specific meaning.
Abstract: This paper presents a new statistical method for detecting and tracking changes in word meaning, based on Latent Semantic Analysis. By comparing the density of semantic vector clusters this method allows researchers to make statistical inferences on questions such as whether the meaning of a word changed across time or if a phonetic cluster is associated with a specific meaning. Possible applications of this method are then illustrated in tracing the semantic change of 'dog', 'do', and 'deer' in early English and examining and comparing phonaesthemes.

Proceedings ArticleDOI
01 Sep 2009
TL;DR: An unsupervised learning approach relying on probabilistic Latent Semantic Analysis (pLSA) applied to a rich set of visual features, including motion and size activities, for discovering relevant activity patterns occurring in busy traffic scenes, and shows how the discovered patterns can directly be used to segment the scene into regions with clear semantic activity content.
Abstract: Automatic analysis and understanding of common activities and detection of deviant behaviors is a challenging task in computer vision. This is particularly true in surveillance data, where busy traffic scenes are rich with multifarious activities, many of them occurring simultaneously. In this paper, we address these issues with an unsupervised learning approach relying on probabilistic Latent Semantic Analysis (pLSA) applied to a rich set of visual features, including motion and size activities, for discovering relevant activity patterns occurring in such scenes. We then show how the discovered patterns can directly be used to segment the scene into regions with clear semantic activity content. Furthermore, we introduce novel abnormality detection measures within the scope of the adopted modeling approach, and investigate in detail their performance with respect to various issues. Experiments are conducted on 45 minutes of video captured from a busy traffic scene and involving abnormal events.
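Since pLSA underlies this and several other papers on this page, here is a minimal sketch of the standard pLSA EM updates (Hofmann-style) on a document-word count matrix. It shows the generic model only, not the visual-feature extraction or abnormality measures of this particular paper; the toy counts are illustrative.

```python
import numpy as np

def plsa(counts, n_topics, iters=50, seed=0):
    """Standard pLSA EM on a document-word count matrix.
    Returns P(z|d) and P(w|z). A generic sketch of the model these papers
    build on, not any single paper's implementation."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.dirichlet(np.ones(n_topics), size=n_docs)      # P(z|d)
    p_w_z = rng.dirichlet(np.ones(n_words), size=n_topics)     # P(w|z)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(z|d) P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]           # shape (d, z, w)
        post = joint / np.maximum(joint.sum(1, keepdims=True), 1e-12)
        # M-step: reweight by the observed counts
        weighted = counts[:, None, :] * post                    # shape (d, z, w)
        p_w_z = weighted.sum(0)
        p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = weighted.sum(2)
        p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z

# toy corpus: 4 documents over a 6-word vocabulary
counts = np.array([[4, 3, 2, 0, 0, 0],
                   [3, 4, 1, 0, 1, 0],
                   [0, 0, 1, 4, 3, 2],
                   [0, 1, 0, 3, 4, 3]], dtype=float)
p_z_d, p_w_z = plsa(counts, n_topics=2)
print(np.round(p_z_d, 2))
```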

Proceedings ArticleDOI
02 Nov 2009
TL;DR: This article proposes Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match and proposes several improvements to the basic model, including low rank (but diagonal preserving) representations, and correlated feature hashing (CFH).
Abstract: In this article we propose Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as online advertising placement. Dealing with models on all pairs of words features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but diagonal preserving) representations, and correlated feature hashing (CFH). We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.
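The "low rank (but diagonal preserving)" form mentioned in the abstract can be sketched as a match score f(q, d) = q^T (U^T V + I) d, where the identity keeps exact word overlap and the low-rank term captures learned word correlations. The sketch below assumes that form and toy dimensions; the supervised training on (query, document) pairs and the feature hashing are not shown.

```python
import numpy as np

def ssi_score(q, d, U, V):
    """Supervised Semantic Indexing style match score
    f(q, d) = q^T W d with W = U^T V + I (low-rank plus identity, so exact
    word overlap is preserved alongside learned word correlations).
    A sketch of the scoring-function form only; training is not shown."""
    return q @ (U.T @ (V @ d)) + q @ d     # low-rank term + identity term

# toy example: vocabulary of 8 words, rank-3 factors
rng = np.random.default_rng(4)
U = rng.normal(scale=0.1, size=(3, 8))
V = rng.normal(scale=0.1, size=(3, 8))
q = np.zeros(8); q[[1, 4]] = 1.0           # sparse tf-idf-like query vector
d = np.zeros(8); d[[1, 2, 5]] = 1.0        # sparse document vector
print(float(ssi_score(q, d, U, V)))
```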

Proceedings ArticleDOI
08 Jul 2009
TL;DR: It is shown that the best variant of the proposed mm-pLSA system outperforms the unimodal systems by approximately 19% in the authors' query-by-example task.
Abstract: It is current state of knowledge that our neocortex consists of six layers [10]. We take this knowledge from neuroscience as an inspiration to extend the standard single-layer probabilistic Latent Semantic Analysis (pLSA) [13] to multiple layers. As multiple layers should naturally handle multiple modalities and a hierarchy of abstractions, we denote this new approach multilayer multimodal probabilistic Latent Semantic Analysis (mm-pLSA). We derive the training and inference rules for the smallest possible non-degenerated mm-pLSA model: a model with two leaf-pLSAs (here from two different data modalities: image tags and visual image features) and a single top-level pLSA node merging the two leaf-pLSAs. From this derivation it is obvious how to extend the learning and inference rules to more modalities and more layers. We also propose a fast and strictly stepwise forward procedure to initialize bottom-up the mm-pLSA model, which in turn can then be post-optimized by the general mm-pLSA learning algorithm. We evaluate the proposed approach experimentally in a query-by-example retrieval task using 50-dimensional topic vectors as image models. We compare various variants of our mm-pLSA system to systems relying solely on visual features or tag features and analyze possible pitfalls of the mm-pLSA training. It is shown that the best variant of the proposed mm-pLSA system outperforms the unimodal systems by approximately 19% in our query-by-example task.

Proceedings Article
01 Sep 2009
TL;DR: This work proposes a new method based on probabilistic latent semantic analysis, which allows for sentences and queries to be represented as probability distributions over latent topics, to estimate the summary relevance of sentences.
Abstract: We consider the problem of query-focused multi-document summarization, where a summary containing the information most relevant to a user's information need is produced from a set of topic-related documents. We propose a new method based on probabilistic latent semantic analysis, which allows us to represent sentences and queries as probability distributions over latent topics. Our approach combines query-focused and thematic features computed in the latent topic space to estimate the summary relevance of sentences. In addition, we evaluate several different similarity measures for computing sentence-level feature scores. Experimental results show that our approach outperforms the best reported results on DUC 2006 data, and also compares well on DUC 2007 data.

Proceedings ArticleDOI
09 Feb 2009
TL;DR: This paper extends the probabilistic latent semantic analysis (PLSA) approach and presents a unified recommendation model which evolves from item-user and item-tag co-occurrences in parallel, which reduces known collaborative filtering problems related to overfitting and allows for higher quality recommendations.
Abstract: In this paper we consider the problem of item recommendation in collaborative tagging communities, so called folksonomies, where users annotate interesting items with tags. Rather than following a collaborative filtering or annotation-based approach to recommendation, we extend the probabilistic latent semantic analysis (PLSA) approach and present a unified recommendation model which evolves from item-user and item-tag co-occurrences in parallel. The inclusion of tags reduces known collaborative filtering problems related to overfitting and allows for higher quality recommendations. Experimental results on a large snapshot of the delicious bookmarking service show the scalability of our approach and an improved recommendation quality compared to two-mode collaborative or annotation based methods.

Proceedings Article
11 Jul 2009
TL;DR: This paper compares the recently proposed ESA model with two latent models (LSI and LDA), showing that the former is clearly superior to both, and contributes to clarifying the role of explicit vs. implicitly derived or latent concepts in (cross-language) information retrieval research.
Abstract: The field of information retrieval and text manipulation (classification, clustering) still strives for models allowing semantic information to be folded in to improve performance with respect to standard bag-of-word based models. Many approaches aim at a concept-based retrieval, but differ in the nature of the concepts, which range from linguistic concepts as defined in lexical resources such as WordNet, over latent topics derived from the data itself, as in Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA), to Wikipedia articles as proxies for concepts, as in the recently proposed Explicit Semantic Analysis (ESA) model. A crucial question which has not been answered so far is whether models based on explicitly given concepts (as in the ESA model for instance) perform inherently better than retrieval models based on "latent" concepts (as in LSI and/or LDA). In this paper we investigate this question closer in the context of a cross-language setting, which inherently requires concept-based retrieval bridging between different languages. In particular, we compare the recently proposed ESA model with two latent models (LSI and LDA), showing that the former is clearly superior to both. From a general perspective, our results contribute to clarifying the role of explicit vs. implicitly derived or latent concepts in (cross-language) information retrieval research.

Proceedings ArticleDOI
Mingcheng Qu, Guang Qiu, Xiaofei He, Cheng Zhang, Hao Wu, Jiajun Bu, Chun Chen
20 Apr 2009
TL;DR: This paper adopts the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and proposes a novel metric to evaluate the performance of the approach; the experimental results show the recommendation approach is effective.
Abstract: User-Interactive Question Answering (QA) communities such as Yahoo! Answers are growing in popularity. However, as these QA sites always have thousands of new questions posted daily, it is difficult for users to find the questions that are of interest to them. Consequently, this may delay the answering of the new questions. This gives rise to question recommendation techniques that help users locate interesting questions. In this paper, we adopt the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and propose a novel metric to evaluate the performance of our approach. The experimental results show our recommendation approach is effective.

Proceedings ArticleDOI
06 Aug 2009
TL;DR: A novel statistical language model is proposed to capture long-range semantic dependencies by applying the concept of semantic composition to the problem of constructing predictive history representations for upcoming words.
Abstract: In this paper we propose a novel statistical language model to capture long-range semantic dependencies. Specifically, we apply the concept of semantic composition to the problem of constructing predictive history representations for upcoming words. We also examine the influence of the underlying semantic space on the composition task by comparing spatial semantic representations against topic-based ones. The composition models yield reductions in perplexity when combined with a standard n-gram language model over the n-gram model alone. We also obtain perplexity reductions when integrating our models with a structured language model.

Proceedings ArticleDOI
19 Jul 2009
TL;DR: A probabilistic multi-view clustering model based on the latent modeling of cluster-cluster relationships is derived and shown to outperform an early-fusion approach.
Abstract: Multi-view clustering is an important problem in information retrieval due to the abundance of data offering many perspectives and generating multi-view representations. We investigate in this short note a late fusion approach for multi-view clustering based on the latent modeling of cluster-cluster relationships. We derive a probabilistic multi-view clustering model outperforming an early-fusion approach based on multi-view feature correlation analysis.

Proceedings ArticleDOI
06 Aug 2009
TL;DR: The Latent Words Language Model is presented, a language model that learns word similarities from unlabeled texts; these similarities are used in different semi-supervised SRL methods, either as additional features or to automatically expand a small training set.
Abstract: Semantic Role Labeling (SRL) has proved to be a valuable tool for performing automatic analysis of natural language texts. Currently however, most systems rely on a large training set, which is manually annotated, an effort that needs to be repeated whenever different languages or a different set of semantic roles is used in a certain application. A possible solution for this problem is semi-supervised learning, where a small set of training examples is automatically expanded using unlabeled texts. We present the Latent Words Language Model, which is a language model that learns word similarities from unlabeled texts. We use these similarities for different semi-supervised SRL methods as additional features or to automatically expand a small training set. We evaluate the methods on the PropBank dataset and find that for small training sizes our best performing system achieves an error reduction of 33.27% F1-measure compared to a state-of-the-art supervised baseline.

Proceedings Article
01 Jan 2009
TL;DR: This study uses Probabilistic Latent Semantic Analysis (PLSA) for topic modeling and proposes new folding-in techniques for topic adaptation under an evolving vocabulary, in order to monitor and understand topic and vocabulary evolution over an infinite document sequence, i.e. a stream.
Abstract: Document collections evolve over time, new topics emerge and old ones decline. At the same time, the terminology evolves as well. Much literature is devoted to topic evolution in finite document sequences assuming a fixed vocabulary. In this study, we propose "Topic Monitor" for the monitoring and understanding of topic and vocabulary evolution over an infinite document sequence, i.e. a stream. We use Probabilistic Latent Semantic Analysis (PLSA) for topic modeling and propose new folding-in techniques for topic adaptation under an evolving vocabulary. We extract a series of models, on which we detect index-based topic threads as human-interpretable descriptions of topic evolution.
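A minimal sketch of standard pLSA folding-in, the operation the paper adapts to an evolving vocabulary: keep the trained P(w|z) fixed and estimate P(z|d) for a new document by EM over that single distribution. The vocabulary-adaptation and thread-detection parts of "Topic Monitor" are not shown, and the toy topics are assumptions.

```python
import numpy as np

def fold_in(new_doc_counts, p_w_z, iters=30, seed=0):
    """Standard pLSA folding-in: keep the trained P(w|z) fixed and estimate
    P(z|d_new) for an unseen document by running EM over that single
    distribution (the paper's vocabulary-adaptation extensions are not
    shown here)."""
    rng = np.random.default_rng(seed)
    n_topics = p_w_z.shape[0]
    p_z_d = rng.dirichlet(np.ones(n_topics))
    for _ in range(iters):
        joint = p_z_d[:, None] * p_w_z                          # shape (z, w)
        post = joint / np.maximum(joint.sum(0, keepdims=True), 1e-12)
        p_z_d = (post * new_doc_counts[None, :]).sum(1)
        p_z_d /= p_z_d.sum()
    return p_z_d

# toy example: 2 trained topics over a 6-word vocabulary, one new document
p_w_z = np.array([[.4, .3, .2, .05, .03, .02],
                  [.02, .03, .05, .3, .3, .3]])
new_doc = np.array([0, 1, 0, 3, 2, 4], dtype=float)
print(np.round(fold_in(new_doc, p_w_z), 3))
```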

Journal ArticleDOI
TL;DR: A variable string length genetic algorithm is proposed that automatically evolves the proper number of clusters and provides near-optimal data set clustering.
Abstract: In this paper, we develop a genetic algorithm method based on a latent semantic model (GAL) for text clustering. The main difficulty in the application of genetic algorithms (GAs) for document clustering is the thousands or even tens of thousands of dimensions in feature space which are typical for textual data, because the most straightforward and popular approach represents texts with the vector space model (VSM), in which each unique term in the vocabulary represents one dimension. Latent semantic indexing (LSI) is a successful technology in information retrieval which attempts to explore the latent semantics implied by a query or a document through representing them in a dimension-reduced space. Meanwhile, LSI takes into account the effects of synonymy and polysemy, which constructs a semantic structure in textual data. GA belongs to search techniques that can efficiently evolve the optimal solution in the reduced space. We propose a variable string length genetic algorithm which has been exploited for automatically evolving the proper number of clusters as well as providing near optimal data set clustering. GA can be used in conjunction with the reduced latent semantic structure and improve clustering efficiency and accuracy. The superiority of the GAL approach over a conventional GA applied in the VSM model is demonstrated by good clustering results on Reuters documents.
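For reference, a minimal sketch of the LSI reduction in which the GAL approach clusters: a rank-k truncated SVD of the term-document matrix gives low-dimensional document coordinates. The genetic algorithm itself is not shown, and the toy matrix and k are illustrative.

```python
import numpy as np

def lsi_reduce(term_doc, k=2):
    """Latent Semantic Indexing: rank-k truncated SVD of the term-document
    matrix. Documents are then represented as rows of V_k * S_k, the reduced
    space in which the genetic algorithm would cluster (GA not shown)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return Vt[:k].T * s[:k]               # (n_docs, k) document coordinates

# toy term-document matrix (terms x documents)
term_doc = np.array([[2, 0, 1, 0],
                     [1, 0, 2, 0],
                     [0, 3, 0, 1],
                     [0, 1, 0, 2]], dtype=float)
print(np.round(lsi_reduce(term_doc, k=2), 2))
```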

Proceedings Article
11 Jul 2009
TL;DR: Compared to existing probabilistic models of latent variables, the proposed perceptron-style algorithm lowers the training cost significantly yet with comparable or even superior classification accuracy.
Abstract: We propose a perceptron-style algorithm for fast discriminative training of structured latent variable models, and analyze its convergence properties. Our method extends the perceptron algorithm to learning tasks with latent dependencies, which may not be captured by traditional models. It relies on Viterbi decoding over latent variables, combined with simple additive updates. Compared to existing probabilistic models of latent variables, our method lowers the training cost significantly yet achieves comparable or even superior classification accuracy.

Proceedings Article
11 Jul 2009
TL;DR: This paper investigates a generative history-based parsing model that synchronises the derivation of non-planar graphs representing semantic dependencies with the derivation of dependency trees representing syntactic structures, achieving a relative error reduction of 12% in semantic F score over previously proposed synchronous models that cannot process non-planarity online.
Abstract: This paper investigates a generative history-based parsing model that synchronises the derivation of non-planar graphs representing semantic dependencies with the derivation of dependency trees representing syntactic structures. To process non-planarity online, the semantic transition-based parser uses a new technique to dynamically reorder nodes during the derivation. While the synchronised derivations allow different structures to be built for the semantic non-planar graphs and syntactic dependency trees, useful statistical dependencies between these structures are modeled using latent variables. The resulting synchronous parser achieves competitive performance on the CoNLL-2008 shared task, with a relative error reduction of 12% in semantic F score over previously proposed synchronous models that cannot process non-planarity online.

Proceedings ArticleDOI
31 May 2009
TL;DR: It is argued that the use of latent variables can help capture long range dependencies and improve the recall on segmenting long words, e.g., named-entities.
Abstract: Conventional approaches to Chinese word segmentation treat the problem as a character-based tagging task. Recently, semi-Markov models have been applied to the problem, incorporating features based on complete words. In this paper, we propose an alternative, a latent variable model, which uses hybrid information based on both word sequences and character sequences. We argue that the use of latent variables can help capture long range dependencies and improve the recall on segmenting long words, e.g., named-entities. Experimental results show that this is indeed the case. With this improvement, evaluations on the data of the second SIGHAN CWS bakeoff show that our system is competitive with the best ones in the literature.

Proceedings ArticleDOI
04 Jun 2009
TL;DR: This work took a pre-existing generative latent variable model of joint syntactic-semantic dependency parsing, developed for English, and applied it to six new languages with minimal adjustments, resulting in a parser that was ranked third overall and robustness across languages indicates that this parser has a very general feature set.
Abstract: Motivated by the large number of languages (seven) and the short development time (two months) of the 2009 CoNLL shared task, we exploited latent variables to avoid the costly process of hand-crafted feature engineering, allowing the latent variables to induce features from the data. We took a pre-existing generative latent variable model of joint syntactic-semantic dependency parsing, developed for English, and applied it to six new languages with minimal adjustments. The parser's robustness across languages indicates that this parser has a very general feature set. The parser's high performance indicates that its latent variables succeeded in inducing effective features. This system was ranked third overall with a macro averaged F1 score of 82.14%, only 0.5% worse than the best system.

Book ChapterDOI
18 Mar 2009
TL;DR: In this article, the authors address the question: how can the distance between different semantic spaces be computed? By representing each Semantic Space as a subspace of a more general Hilbert space, the relationship between Semantic Spaces can be computed by means of the subspace distance.
Abstract: Semantic Space models, which provide a numerical representation of word meaning extracted from a corpus of documents, have been formalized in terms of Hermitian operators over real valued Hilbert spaces by Bruza et al. [1]. The collapse of a word into a particular meaning has been investigated applying the notion of quantum collapse of superpositional states [2]. While the semantic association between words in a Semantic Space can be computed by means of the Minkowski distance [3] or the cosine of the angle between the vector representation of each pair of words, a new procedure is needed in order to establish relations between two or more Semantic Spaces. We address the question: how can the distance between different Semantic Spaces be computed? By representing each Semantic Space as a subspace of a more general Hilbert space, the relationship between Semantic Spaces can be computed by means of the subspace distance. Such a distance needs to take into account the difference in dimension between subspaces. The availability of a distance for comparing different Semantic Subspaces would enable a deeper understanding of the geometry of Semantic Spaces, which would possibly translate into better effectiveness in Information Retrieval tasks.
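One common way to compute a distance between two such subspaces is via principal angles: orthonormalise each basis, take the singular values of the product of the bases (the cosines of the principal angles), and combine them into a projection-metric style distance that tolerates different dimensions. The sketch below is a generic construction under that assumption, not necessarily the exact measure the chapter proposes.

```python
import numpy as np

def subspace_distance(A, B):
    """A generic distance between the column spans of A and B (the semantic
    subspaces): orthonormalise each basis, compute the singular values of
    Qa^T Qb (cosines of the principal angles), and use a projection-metric
    style distance that tolerates subspaces of different dimension.
    Not necessarily the exact measure the chapter proposes."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    cosines = np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), 0.0, 1.0)
    k = max(Qa.shape[1], Qb.shape[1])
    return np.sqrt(k - np.sum(cosines ** 2))

# toy example: two 3-dimensional subspaces of a 6-dimensional word space
rng = np.random.default_rng(5)
A = rng.normal(size=(6, 3))
B = 0.7 * A + 0.3 * rng.normal(size=(6, 3))        # a perturbed copy of A
print(round(float(subspace_distance(A, B)), 3))
print(round(float(subspace_distance(A, A)), 3))    # approximately 0 for identical spans
```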