
Showing papers on "Probabilistic latent semantic analysis" published in 2009


Proceedings Article
07 Dec 2009
TL;DR: New quantitative methods for measuring semantic meaning in inferred topics are presented, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood.
Abstract: Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Practitioners typically assume that the latent space is semantically meaningful. It is used to check models, summarize the corpus, and guide exploration of its contents. However, whether the latent space is interpretable is in need of quantitative evaluation. In this paper, we present new quantitative methods for measuring semantic meaning in inferred topics. We back these measures with large-scale user studies, showing that they capture aspects of the model that are undetected by previous measures of model quality based on held-out likelihood. Surprisingly, topic models which perform better on held-out likelihood may infer less semantically meaningful topics.

1,878 citations
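To make the word-intrusion idea in this abstract concrete, here is a minimal, hypothetical sketch: given a fitted topic-word distribution, build intrusion tasks by mixing a topic's top words with one low-probability "intruder" taken from another topic; how reliably readers spot the intruder is the interpretability signal the paper measures with user studies. The names, toy vocabulary, and intruder-selection heuristic below are illustrative assumptions, not the authors' protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_word_intrusion_tasks(topic_word, vocab, n_top=5):
    """For each topic, pick its n_top most probable words and add one
    'intruder': a word that is unlikely under this topic but highly probable
    under some other topic (the task described in the abstract)."""
    tasks = []
    for k, dist in enumerate(topic_word):
        top = np.argsort(dist)[::-1][:n_top]
        # candidate intruders: low probability here, high probability elsewhere
        other = rng.integers(len(topic_word))
        while other == k:
            other = rng.integers(len(topic_word))
        intruder_pool = [w for w in np.argsort(topic_word[other])[::-1][:n_top]
                         if w not in top and dist[w] < np.median(dist)]
        if not intruder_pool:
            continue
        intruder = intruder_pool[0]
        words = list(top) + [intruder]
        rng.shuffle(words)
        tasks.append({"topic": k,
                      "words": [vocab[w] for w in words],
                      "intruder": vocab[intruder]})
    return tasks

# toy example: 2 topics over a 6-word vocabulary
vocab = ["cat", "dog", "pet", "bank", "loan", "money"]
topic_word = np.array([[.40, .30, .25, .02, .02, .01],
                       [.01, .02, .02, .35, .30, .30]])
for task in build_word_intrusion_tasks(topic_word, vocab, n_top=3):
    print(task)
```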


Reference EntryDOI
30 Nov 2009
TL;DR: In this article, the authors introduce latent class analysis, its extension to repeated measures, and several recent developments that further extend the latent class model.
Abstract: Often quantities of interest in psychology cannot be observed directly. These unobservable quantities are known as latent variables. By using multiple items as indicators of the latent variable, we can obtain a more complete picture of the construct of interest and estimate measurement error. One approach to latent variable modeling is latent class analysis, a method appropriate for examining the relationship between discrete observed variables and a discrete latent variable. The present chapter will introduce latent class analysis, its extension to repeated measures, and recent developments further extending the latent class model. First, the concept of a latent class and the mathematical model are presented. This is followed by a discussion of parameter restrictions, model fit, and the measurement quality of categorical items. Second, latent class analysis is demonstrated through an examination of the prevalence of depression types in adolescents. Third, longitudinal extensions of the latent class model are presented. This section also contains an empirical example on adolescent depression types, where the previous analysis is extended to examine the stability and change in depression types over time. Finally, several recent developments that further extend the latent class model are introduced. Keywords: categorical variables; depression types; latent class analysis; latent transition analysis; latent variables; longitudinal

932 citations


Proceedings ArticleDOI
14 Jun 2009
TL;DR: A large-margin formulation and algorithm for structured output prediction that allows the use of latent variables; the generality and performance of the approach are demonstrated through three applications: motif finding, noun-phrase coreference resolution, and optimizing precision at k in information retrieval.
Abstract: We present a large-margin formulation and algorithm for structured output prediction that allows the use of latent variables. Our proposal covers a large range of application problems, with an optimization problem that can be solved efficiently using Concave-Convex Programming. The generality and performance of the approach are demonstrated through three applications: motif finding, noun-phrase coreference resolution, and optimizing precision at k in information retrieval.

729 citations


Proceedings Article
07 Dec 2009
TL;DR: This work pursues a similar approach with a richer kind of latent variable (latent features), using a Bayesian nonparametric approach to simultaneously infer the number of features while learning which entities have each feature, and combines these inferred features with known covariates in order to perform link prediction.
Abstract: As the availability and importance of relational data—such as the friendships summarized on a social networking website—increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks have been relatively limited. In particular, the machine learning community has focused on latent class models, adapting Bayesian nonparametric methods to jointly infer how many latent classes there are while learning which entities belong to each class. We pursue a similar approach with a richer kind of latent variable—latent features—using a Bayesian nonparametric approach to simultaneously infer the number of features at the same time we learn which entities have each feature. Our model combines these inferred features with known covariates in order to perform link prediction. We demonstrate that the greater expressiveness of this approach allows us to improve performance on three datasets.

448 citations
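The scoring rule described in the abstract can be sketched as a logistic link probability over binary latent features plus known covariates. The sketch below only shows that scoring form; the Bayesian nonparametric inference of the feature matrix (and of the number of features) is not shown, and all names and dimensions are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def link_probability(Z, W, X=None, beta=None):
    """Probability of an edge between every pair of entities.
    Z    : (n, k) binary latent feature matrix (which entity has which feature)
    W    : (k, k) real-valued feature-interaction weights
    X    : optional (n, n, p) known pairwise covariates
    beta : optional (p,) covariate weights
    A hypothetical sketch of the scoring rule described in the abstract;
    the Bayesian nonparametric inference of Z is not shown."""
    score = Z @ W @ Z.T
    if X is not None and beta is not None:
        score = score + X @ beta
    return sigmoid(score)

# toy example: 3 entities, 2 latent features
Z = np.array([[1, 0],
              [1, 1],
              [0, 1]])
W = np.array([[2.0, -1.0],
              [-1.0, 1.5]])
print(np.round(link_probability(Z, W), 3))
```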


Journal ArticleDOI
TL;DR: A statistical model of social network data derived from matrix representations and symmetry considerations is discussed that allows for the graphical description of a social network via the latent factors of the nodes, and provides a framework for the prediction of missing links in network data.
Abstract: We discuss a statistical model of social network data derived from matrix representations and symmetry considerations. The model can include known predictor information in the form of a regression term, and can represent additional structure via sender-specific and receiver-specific latent factors. This approach allows for the graphical description of a social network via the latent factors of the nodes, and provides a framework for the prediction of missing links in network data.

220 citations


Proceedings ArticleDOI
07 Jan 2009
TL;DR: A model of natural language inference which identifies valid inferences by their lexical and syntactic features, without full semantic interpretation is proposed, extending past work in natural logic by incorporating both semantic exclusion and implicativity.
Abstract: We propose a model of natural language inference which identifies valid inferences by their lexical and syntactic features, without full semantic interpretation. We extend past work in natural logic, which has focused on semantic containment and monotonicity, by incorporating both semantic exclusion and implicativity. Our model decomposes an inference problem into a sequence of atomic edits linking premise to hypothesis; predicts a lexical semantic relation for each edit; propagates these relations upward through a semantic composition tree according to properties of intermediate nodes; and joins the resulting semantic relations across the edit sequence. A computational implementation of the model achieves 70% accuracy and 89% precision on the FraCaS test suite. Moreover, including this model as a component in an existing system yields significant performance gains on the Recognizing Textual Entailment challenge.

212 citations


Proceedings ArticleDOI
14 Jun 2009
TL;DR: This work introduces a probabilistic framework for modeling both the topical and geometrical structure of the dyadic data that explicitly takes into account the local manifold structure.
Abstract: Dyadic data arises in many real world applications such as social network analysis and information retrieval. In order to discover the underlying or hidden structure in dyadic data, many topic modeling techniques have been proposed. Typical algorithms include Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). The probability density functions obtained by both algorithms are supported on the Euclidean space. However, many previous studies have shown that naturally occurring data may reside on or close to an underlying submanifold. We introduce a probabilistic framework for modeling both the topical and geometrical structure of the dyadic data that explicitly takes into account the local manifold structure. Specifically, the local manifold structure is modeled by a graph. The graph Laplacian, analogous to the Laplace-Beltrami operator on manifolds, is applied to smooth the probability density functions. As a result, the obtained probabilistic distributions are concentrated around the data manifold. Experimental results on real data sets demonstrate the effectiveness of the proposed approach.

176 citations
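A minimal sketch of the graph-Laplacian idea from the abstract, under illustrative assumptions: build a k-nearest-neighbour graph over documents, form the graph Laplacian, and smooth the per-document topic distributions toward their graph neighbours so that the distributions concentrate around the data manifold. This is a simplified stand-in for the regularised EM the authors describe, not their algorithm.

```python
import numpy as np

def knn_graph(X, k=5):
    """Symmetric k-nearest-neighbour affinity matrix (binary weights)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d2[i])[:k]] = 1.0
    return np.maximum(W, W.T)                     # symmetrise

def laplacian_penalty(theta, W):
    """Smoothness of per-document topic distributions over the graph:
    tr(theta^T L theta) with L = D - W; small when linked documents have
    similar topic distributions (the quantity a Laplacian regulariser
    pushes down)."""
    L = np.diag(W.sum(1)) - W
    return float(np.trace(theta.T @ L @ theta))

def laplacian_smooth(theta, W, lam=0.5):
    """One crude smoothing step: move each document's distribution toward
    the average of its graph neighbours, then renormalise (a stand-in for
    the regularised EM described in the abstract)."""
    deg = np.maximum(W.sum(1, keepdims=True), 1e-12)
    smoothed = (1 - lam) * theta + lam * (W @ theta) / deg
    return smoothed / smoothed.sum(1, keepdims=True)

# toy example: 6 documents, 3 topics
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))                       # features used to build the graph
theta = rng.dirichlet(np.ones(3), size=6)         # per-document topic distributions
W = knn_graph(X, k=2)
print(laplacian_penalty(theta, W))
print(laplacian_penalty(laplacian_smooth(theta, W), W))   # typically smaller after smoothing
```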


Proceedings ArticleDOI
20 Jun 2009
TL;DR: A novel approach to automatically learn a semantic visual vocabulary from abundant quantized midlevel features by using diffusion maps to capture the local intrinsic geometric relations between the midlevel feature points on the manifold.
Abstract: In this paper, we propose a novel approach for learning generic visual vocabulary. We use diffusion maps to automatically learn a semantic visual vocabulary from abundant quantized midlevel features. Each midlevel feature is represented by the vector of pointwise mutual information (PMI). In this midlevel feature space, we believe the features produced by similar sources must lie on a certain manifold. To capture the intrinsic geometric relations between features, we measure their dissimilarity using diffusion distance. The underlying idea is to embed the midlevel features into a semantic lower-dimensional space. Our goal is to construct a compact yet discriminative semantic visual vocabulary. Although the conventional approach using k-means is good for vocabulary construction, its performance is sensitive to the size of the visual vocabulary. In addition, the learnt visual words are not semantically meaningful since the clustering criterion is based on appearance similarity only. Our proposed approach can effectively overcome these problems by capturing the semantic and geometric relations of the feature space using diffusion maps. Unlike some of the supervised vocabulary construction approaches, and the unsupervised methods such as pLSA and LDA, diffusion maps can capture the local intrinsic geometric relations between the midlevel feature points on the manifold. We have tested our approach on the KTH action dataset, our own YouTube action dataset and the fifteen scene dataset, and have obtained very promising results.

173 citations
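A rough sketch of the pipeline the abstract outlines, with illustrative parameters: embed mid-level feature vectors with a diffusion map (Gaussian affinity, row-stochastic Markov matrix, leading non-trivial eigenvectors scaled by their eigenvalues), then cluster the embedded points with k-means to form a compact vocabulary. The PMI feature construction and the datasets are not reproduced here.

```python
import numpy as np

def diffusion_map(X, sigma=1.0, n_dims=2, t=1):
    """Embed points into diffusion coordinates.
    X: (n, d) mid-level feature vectors (e.g. PMI vectors in the paper)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))           # Gaussian affinity
    P = K / K.sum(1, keepdims=True)              # row-stochastic Markov matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # skip the trivial constant eigenvector, scale by eigenvalue^t
    return vecs[:, 1:n_dims + 1] * (vals[1:n_dims + 1] ** t)

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means on the embedded points to form the visual vocabulary."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels, centers

# toy example: 30 mid-level features in a 10-d space, 4 visual words
rng = np.random.default_rng(2)
feats = rng.normal(size=(30, 10))
emb = diffusion_map(feats, sigma=2.0, n_dims=3)
labels, _ = kmeans(emb, k=4)
print(labels)
```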


Proceedings ArticleDOI
28 Jun 2009
TL;DR: This paper studies three collaborative filtering frameworks (low-rank matrix approximation, probabilistic latent semantic analysis, and maximum-margin matrix factorization) and proposes two novel algorithms for large-scale OCCF that allow the unknowns to be weighted.
Abstract: One-Class Collaborative Filtering (OCCF) is a task that naturally emerges in recommender system settings. Typical characteristics include: Only positive examples can be observed, classes are highly imbalanced, and the vast majority of data points are missing. The idea of introducing weights for missing parts of a matrix has recently been shown to help in OCCF. While existing weighting approaches mitigate the first two problems above, a sparsity preserving solution that would allow efficient use of data sets with, e.g., hundreds of thousands of users and items has not yet been reported. In this paper, we study three different collaborative filtering frameworks: Low-rank matrix approximation, probabilistic latent semantic analysis, and maximum-margin matrix factorization. We propose two novel algorithms for large-scale OCCF that allow the unknowns to be weighted. Our experimental results demonstrate their effectiveness and efficiency on different problems, including the Netflix Prize data.

163 citations
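A generic sketch of weighted low-rank matrix approximation in the one-class setting described above: observed positives get full confidence, missing entries are treated as zeros with a small weight, and the factors are fit by alternating least squares with per-entry weights. This illustrates the weighting idea only; it is not the paper's two proposed algorithms, and all parameter values are assumptions.

```python
import numpy as np

def weighted_als(R, W, k=5, lam=0.1, iters=10, seed=0):
    """Weighted low-rank factorisation R ~ U V^T.
    R : (m, n) implicit feedback matrix (1 = observed positive, 0 = missing)
    W : (m, n) per-entry confidence weights (missing entries get a low weight)
    A generic sketch of the weighted-ALS idea, not the paper's exact algorithms."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    I = np.eye(k)
    for _ in range(iters):
        for u in range(m):
            Wu = np.diag(W[u])
            U[u] = np.linalg.solve(V.T @ Wu @ V + lam * I, V.T @ Wu @ R[u])
        for i in range(n):
            Wi = np.diag(W[:, i])
            V[i] = np.linalg.solve(U.T @ Wi @ U + lam * I, U.T @ Wi @ R[:, i])
    return U, V

# toy example: 4 users x 5 items, uniform low weight on the unknowns
R = np.array([[1, 0, 1, 0, 0],
              [0, 1, 0, 0, 1],
              [1, 0, 0, 1, 0],
              [0, 0, 1, 0, 1]], dtype=float)
W = np.where(R > 0, 1.0, 0.1)
U, V = weighted_als(R, W, k=2)
print(np.round(U @ V.T, 2))     # reconstructed preference scores
```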


Journal ArticleDOI
TL;DR: This work evaluates a simple metric of pointwise mutual information and demonstrates that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models.
Abstract: Computational models of lexical semantics, such as latent semantic analysis, can automatically generate semantic similarity measures between words from statistical redundancies in text. These measures are useful for experimental stimulus selection and for evaluating a model’s cognitive plausibility as a mechanism that people might use to organize meaning in memory. Although humans are exposed to enormous quantities of speech, practical constraints limit the amount of data that many current computational models can learn from. We follow up on previous work evaluating a simple metric of pointwise mutual information. Controlling for confounds in previous work, we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models. We also present a simple tool for building simple and scalable models from large corpora quickly and efficiently.

153 citations
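A minimal sketch of the pointwise mutual information metric the study evaluates, computed from windowed co-occurrence counts over a toy corpus; the corpus, window size, and counting details are illustrative assumptions rather than the authors' exact setup.

```python
import math
from collections import Counter

def pmi_table(sentences, window=5):
    """Pointwise mutual information between word pairs, estimated from
    co-occurrence within a fixed window: PMI(x, y) = log p(x, y) / (p(x) p(y))."""
    word_counts, pair_counts, total = Counter(), Counter(), 0
    for tokens in sentences:
        for i, w in enumerate(tokens):
            word_counts[w] += 1
            total += 1
            for v in tokens[i + 1:i + 1 + window]:
                pair_counts[tuple(sorted((w, v)))] += 1
    n_pairs = sum(pair_counts.values())
    pmi = {}
    for (x, y), c in pair_counts.items():
        p_xy = c / n_pairs
        p_x, p_y = word_counts[x] / total, word_counts[y] / total
        pmi[(x, y)] = math.log(p_xy / (p_x * p_y))
    return pmi

# toy corpus (the paper trains on vastly larger text collections)
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"],
          ["cats", "and", "dogs", "are", "pets"]]
scores = pmi_table(corpus, window=3)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:5])
```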


Proceedings ArticleDOI
04 Jun 2009
TL;DR: This work proposes a mechanism for adding partial supervision, called topic-in-set knowledge, to latent topic modeling, to encourage the recovery of topics which are more relevant to user modeling goals than the topics which would be recovered otherwise.
Abstract: Latent Dirichlet Allocation is an unsupervised graphical model which can discover latent topics in unlabeled data. We propose a mechanism for adding partial supervision, called topic-in-set knowledge, to latent topic modeling. This type of supervision can be used to encourage the recovery of topics which are more relevant to user modeling goals than the topics which would be recovered otherwise. Preliminary experiments on text datasets are presented to demonstrate the potential effectiveness of this method.
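One plausible way to realise "topic-in-set knowledge" is to restrict the topics a seed word may be assigned to during collapsed Gibbs sampling for LDA. The sketch below shows a single restricted sampling step under that assumption; it is not the authors' implementation, and the counts and hyperparameters are toy values.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_topic(word, doc_topic_counts, topic_word_counts, topic_totals,
                 alpha, beta, vocab_size, allowed_topics=None):
    """One collapsed-Gibbs topic draw for a single token.
    If `allowed_topics` is given for this word (topic-in-set knowledge),
    the proposal is restricted to that set; otherwise all topics are allowed."""
    K = len(topic_totals)
    probs = (doc_topic_counts + alpha) * \
            (topic_word_counts[:, word] + beta) / (topic_totals + vocab_size * beta)
    if allowed_topics is not None:
        mask = np.zeros(K)
        mask[list(allowed_topics)] = 1.0
        probs = probs * mask
    probs = probs / probs.sum()
    return rng.choice(K, p=probs)

# toy example: 3 topics, 5-word vocabulary, word 2 is constrained to topic {0}
doc_topic_counts = np.array([2.0, 1.0, 4.0])
topic_word_counts = np.abs(rng.normal(size=(3, 5))) + 1
topic_totals = topic_word_counts.sum(1)
z = sample_topic(word=2, doc_topic_counts=doc_topic_counts,
                 topic_word_counts=topic_word_counts, topic_totals=topic_totals,
                 alpha=0.1, beta=0.01, vocab_size=5, allowed_topics={0})
print(z)   # always 0 because of the topic-in-set constraint
```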

Proceedings ArticleDOI
Honglei Guo, Huijia Zhu, Zhili Guo, Xiaoxun Zhang, Zhong Su
02 Nov 2009
TL;DR: This paper proposes an unsupervised product-feature categorization method with multilevel latent semantic association that achieves better performance compared with the existing approaches and is language- and domain-independent.
Abstract: In recent years, the number of freely available online reviews has been increasing rapidly. Aspect-based opinion mining techniques have been employed to find out reviewers' opinions toward different product aspects. Such finer-grained opinion mining is valuable for potential customers making their purchase decisions. Product-feature extraction and categorization is very important for better mining of aspect-oriented opinions. Since people usually use different words to describe the same aspect in reviews, product-feature extraction and categorization becomes more challenging. Manual product-feature extraction and categorization is tedious and time-consuming, and practically infeasible for the massive number of products. In this paper, we propose an unsupervised product-feature categorization method with multilevel latent semantic association. After extracting product-features from the semi-structured reviews, we construct the first latent semantic association (LaSA) model to group words into a set of concepts according to their virtual context documents. It generates the latent semantic structure for each product-feature. The second LaSA model is constructed to categorize the product-features according to their latent semantic structures and context snippets in the reviews. Experimental results demonstrate that our method achieves better performance compared with the existing approaches. Moreover, the proposed method is language- and domain-independent.

Proceedings ArticleDOI
31 Mar 2009
TL;DR: A new statistical method for detecting and tracking changes in word meaning, based on Latent Semantic Analysis, which allows researchers to make statistical inferences on questions such as whether the meaning of a word changed across time or if a phonetic cluster is associated with a specific meaning.
Abstract: This paper presents a new statistical method for detecting and tracking changes in word meaning, based on Latent Semantic Analysis. By comparing the density of semantic vector clusters this method allows researchers to make statistical inferences on questions such as whether the meaning of a word changed across time or if a phonetic cluster is associated with a specific meaning. Possible applications of this method are then illustrated in tracing the semantic change of 'dog', 'do', and 'deer' in early English and examining and comparing phonaesthemes.

Proceedings ArticleDOI
01 Sep 2009
TL;DR: An unsupervised learning approach relying on probabilistic Latent Semantic Analysis (pLSA) applied to a rich set of visual features, including motion and size activities, for discovering relevant activity patterns occurring in busy traffic scenes, and shows how the discovered patterns can directly be used to segment the scene into regions with clear semantic activity content.
Abstract: Automatic analysis and understanding of common activities and detection of deviant behaviors is a challenging task in computer vision. This is particularly true in surveillance data, where busy traffic scenes are rich with multifarious activities, many of them occurring simultaneously. In this paper, we address these issues with an unsupervised learning approach relying on probabilistic Latent Semantic Analysis (pLSA) applied to a rich set of visual features, including motion and size activities, for discovering relevant activity patterns occurring in such scenes. We then show how the discovered patterns can directly be used to segment the scene into regions with clear semantic activity content. Furthermore, we introduce novel abnormality detection measures within the scope of the adopted modeling approach, and investigate in detail their performance with respect to various issues. Experiments are conducted on 45 minutes of video captured from a busy traffic scene and involving abnormal events.
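Since pLSA underlies this and several other papers on this page, here is a minimal sketch of the standard pLSA EM updates (Hofmann-style) on a document-word count matrix. It shows the generic model only, not the visual-feature extraction or abnormality measures of this particular paper; the toy counts are illustrative.

```python
import numpy as np

def plsa(counts, n_topics, iters=50, seed=0):
    """Standard pLSA EM on a document-word count matrix.
    Returns P(z|d) and P(w|z). A generic sketch of the model these papers
    build on, not any single paper's implementation."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.dirichlet(np.ones(n_topics), size=n_docs)      # P(z|d)
    p_w_z = rng.dirichlet(np.ones(n_words), size=n_topics)     # P(w|z)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(z|d) P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]           # shape (d, z, w)
        post = joint / np.maximum(joint.sum(1, keepdims=True), 1e-12)
        # M-step: reweight by the observed counts
        weighted = counts[:, None, :] * post                    # shape (d, z, w)
        p_w_z = weighted.sum(0)
        p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = weighted.sum(2)
        p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z

# toy corpus: 4 documents over a 6-word vocabulary
counts = np.array([[4, 3, 2, 0, 0, 0],
                   [3, 4, 1, 0, 1, 0],
                   [0, 0, 1, 4, 3, 2],
                   [0, 1, 0, 3, 4, 3]], dtype=float)
p_z_d, p_w_z = plsa(counts, n_topics=2)
print(np.round(p_z_d, 2))
```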

Proceedings ArticleDOI
02 Nov 2009
TL;DR: This article proposes Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match and proposes several improvements to the basic model, including low rank (but diagonal preserving) representations, and correlated feature hashing (CFH).
Abstract: In this article we propose Supervised Semantic Indexing (SSI), an algorithm that is trained on (query, document) pairs of text documents to predict the quality of their match. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI our models are trained with a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as online advertising placement. Dealing with models on all pairs of words features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low rank (but diagonal preserving) representations, and correlated feature hashing (CFH). We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.
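The "low rank (but diagonal preserving)" form mentioned in the abstract can be sketched as a match score f(q, d) = q^T (U^T V + I) d, where the identity keeps exact word overlap and the low-rank term captures learned word correlations. The sketch below assumes that form and toy dimensions; the supervised training on (query, document) pairs and the feature hashing are not shown.

```python
import numpy as np

def ssi_score(q, d, U, V):
    """Supervised Semantic Indexing style match score
    f(q, d) = q^T W d with W = U^T V + I (low-rank plus identity, so exact
    word overlap is preserved alongside learned word correlations).
    A sketch of the scoring-function form only; training is not shown."""
    return q @ (U.T @ (V @ d)) + q @ d     # low-rank term + identity term

# toy example: vocabulary of 8 words, rank-3 factors
rng = np.random.default_rng(4)
U = rng.normal(scale=0.1, size=(3, 8))
V = rng.normal(scale=0.1, size=(3, 8))
q = np.zeros(8); q[[1, 4]] = 1.0           # sparse tf-idf-like query vector
d = np.zeros(8); d[[1, 2, 5]] = 1.0        # sparse document vector
print(float(ssi_score(q, d, U, V)))
```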

Proceedings ArticleDOI
08 Jul 2009
TL;DR: It is shown that the best variant of the proposed mm-pLSA system outperforms the unimodal systems by approximately 19% in the authors' query-by-example task.
Abstract: It is current state of knowledge that our neocortex consists of six layers [10]. We take this knowledge from neuroscience as an inspiration to extend the standard single-layer probabilistic Latent Semantic Analysis (pLSA) [13] to multiple layers. As multiple layers should naturally handle multiple modalities and a hierarchy of abstractions, we denote this new approach multilayer multimodal probabilistic Latent Semantic Analysis (mm-pLSA). We derive the training and inference rules for the smallest possible non-degenerated mm-pLSA model: a model with two leaf-pLSAs (here from two different data modalities: image tags and visual image features) and a single top-level pLSA node merging the two leaf-pLSAs. From this derivation it is obvious how to extend the learning and inference rules to more modalities and more layers. We also propose a fast and strictly stepwise forward procedure to initialize bottom-up the mm-pLSA model, which in turn can then be post-optimized by the general mm-pLSA learning algorithm. We evaluate the proposed approach experimentally in a query-by-example retrieval task using 50-dimensional topic vectors as image models. We compare various variants of our mm-pLSA system to systems relying solely on visual features or tag features and analyze possible pitfalls of the mm-pLSA training. It is shown that the best variant of the proposed mm-pLSA system outperforms the unimodal systems by approximately 19% in our query-by-example task.

Proceedings Article
01 Sep 2009
TL;DR: This work proposes a new method based on probabilistic latent semantic analysis, which allows for sentences and queries to be represented as probability distributions over latent topics, to estimate the summary relevance of sentences.
Abstract: We consider the problem of query-focused multi-document summarization, where a summary containing the information most relevant to a user's information need is produced from a set of topic-related documents. We propose a new method based on probabilistic latent semantic analysis, which allows us to represent sentences and queries as probability distributions over latent topics. Our approach combines query-focused and thematic features computed in the latent topic space to estimate the summary relevance of sentences. In addition, we evaluate several different similarity measures for computing sentence-level feature scores. Experimental results show that our approach outperforms the best reported results on DUC 2006 data, and also compares well on DUC 2007 data.

Proceedings ArticleDOI
09 Feb 2009
TL;DR: This paper extends the probabilistic latent semantic analysis (PLSA) approach and presents a unified recommendation model which evolves from item-user and item-tag co-occurrences in parallel, which reduces known collaborative filtering problems related to overfitting and allows for higher quality recommendations.
Abstract: In this paper we consider the problem of item recommendation in collaborative tagging communities, so called folksonomies, where users annotate interesting items with tags. Rather than following a collaborative filtering or annotation-based approach to recommendation, we extend the probabilistic latent semantic analysis (PLSA) approach and present a unified recommendation model which evolves from item-user and item-tag co-occurrences in parallel. The inclusion of tags reduces known collaborative filtering problems related to overfitting and allows for higher quality recommendations. Experimental results on a large snapshot of the delicious bookmarking service show the scalability of our approach and an improved recommendation quality compared to two-mode collaborative or annotation based methods.

Proceedings Article
11 Jul 2009
TL;DR: This paper compares the recently proposed ESA model with two latent models (LSI and LDA), showing that the former is clearly superior to both, and contributes to clarifying the role of explicit vs. implicitly derived or latent concepts in (cross-language) information retrieval research.
Abstract: The field of information retrieval and text manipulation (classification, clustering) still strives for models allowing semantic information to be folded in to improve performance with respect to standard bag-of-word based models. Many approaches aim at a concept-based retrieval, but differ in the nature of the concepts, which range from linguistic concepts as defined in lexical resources such as WordNet, over latent topics derived from the data itself, as in Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA), to Wikipedia articles as proxies for concepts, as in the recently proposed Explicit Semantic Analysis (ESA) model. A crucial question which has not been answered so far is whether models based on explicitly given concepts (as in the ESA model for instance) perform inherently better than retrieval models based on "latent" concepts (as in LSI and/or LDA). In this paper we investigate this question closer in the context of a cross-language setting, which inherently requires concept-based retrieval bridging between different languages. In particular, we compare the recently proposed ESA model with two latent models (LSI and LDA), showing that the former is clearly superior to both. From a general perspective, our results contribute to clarifying the role of explicit vs. implicitly derived or latent concepts in (cross-language) information retrieval research.

Proceedings ArticleDOI
Mingcheng Qu, Guang Qiu, Xiaofei He, Cheng Zhang, Hao Wu, Jiajun Bu, Chun Chen
20 Apr 2009
TL;DR: This paper adopts the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and proposes a novel metric to evaluate the performance of the approach; the experimental results show the recommendation approach is effective.
Abstract: User-Interactive Question Answering (QA) communities such as Yahoo! Answers are growing in popularity. However, as these QA sites always have thousands of new questions posted daily, it is difficult for users to find the questions that are of interest to them. Consequently, this may delay the answering of the new questions. This gives rise to question recommendation techniques that help users locate interesting questions. In this paper, we adopt the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and propose a novel metric to evaluate the performance of our approach. The experimental results show our recommendation approach is effective.

Proceedings ArticleDOI
06 Aug 2009
TL;DR: A novel statistical language model is proposed to capture long-range semantic dependencies by applying the concept of semantic composition to the problem of constructing predictive history representations for upcoming words.
Abstract: In this paper we propose a novel statistical language model to capture long-range semantic dependencies. Specifically, we apply the concept of semantic composition to the problem of constructing predictive history representations for upcoming words. We also examine the influence of the underlying semantic space on the composition task by comparing spatial semantic representations against topic-based ones. The composition models yield reductions in perplexity when combined with a standard n-gram language model over the n-gram model alone. We also obtain perplexity reductions when integrating our models with a structured language model.

Proceedings ArticleDOI
19 Jul 2009
TL;DR: A probabilistic multi-view clustering model based on the latent modeling of cluster-cluster relationships is derived and shown to outperform an early-fusion approach.
Abstract: Multi-view clustering is an important problem in information retrieval due to the abundance of data offering many perspectives and generating multi-view representations. We investigate in this short note a late fusion approach for multi-view clustering based on the latent modeling of cluster-cluster relationships. We derive a probabilistic multi-view clustering model outperforming an early-fusion approach based on multi-view feature correlation analysis.

Proceedings ArticleDOI
06 Aug 2009
TL;DR: The Latent Words Language Model is presented, a language model that learns word similarities from unlabeled texts; these similarities are used in different semi-supervised SRL methods, either as additional features or to automatically expand a small training set.
Abstract: Semantic Role Labeling (SRL) has proved to be a valuable tool for performing automatic analysis of natural language texts. Currently however, most systems rely on a large training set, which is manually annotated, an effort that needs to be repeated whenever different languages or a different set of semantic roles is used in a certain application. A possible solution for this problem is semi-supervised learning, where a small set of training examples is automatically expanded using unlabeled texts. We present the Latent Words Language Model, which is a language model that learns word similarities from unlabeled texts. We use these similarities for different semi-supervised SRL methods as additional features or to automatically expand a small training set. We evaluate the methods on the PropBank dataset and find that for small training sizes our best performing system achieves an error reduction of 33.27% F1-measure compared to a state-of-the-art supervised baseline.

Proceedings Article
01 Jan 2009
TL;DR: This study uses Probabilistic Latent Semantic Analysis (PLSA) for topic modeling and proposes new folding-in techniques for topic adaptation under an evolving vocabulary, in order to monitor and understand topic and vocabulary evolution over an infinite document sequence, i.e. a stream.
Abstract: Document collections evolve over time, new topics emerge and old ones decline. At the same time, the terminology evolves as well. Much literature is devoted to topic evolution in finite document sequences assuming a fixed vocabulary. In this study, we propose "Topic Monitor" for the monitoring and understanding of topic and vocabulary evolution over an infinite document sequence, i.e. a stream. We use Probabilistic Latent Semantic Analysis (PLSA) for topic modeling and propose new folding-in techniques for topic adaptation under an evolving vocabulary. We extract a series of models, on which we detect index-based topic threads as human-interpretable descriptions of topic evolution.
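A minimal sketch of standard pLSA folding-in, the operation the paper adapts to an evolving vocabulary: keep the trained P(w|z) fixed and estimate P(z|d) for a new document by EM over that single distribution. The vocabulary-adaptation and thread-detection parts of "Topic Monitor" are not shown, and the toy topics are assumptions.

```python
import numpy as np

def fold_in(new_doc_counts, p_w_z, iters=30, seed=0):
    """Standard pLSA folding-in: keep the trained P(w|z) fixed and estimate
    P(z|d_new) for an unseen document by running EM over that single
    distribution (the paper's vocabulary-adaptation extensions are not
    shown here)."""
    rng = np.random.default_rng(seed)
    n_topics = p_w_z.shape[0]
    p_z_d = rng.dirichlet(np.ones(n_topics))
    for _ in range(iters):
        joint = p_z_d[:, None] * p_w_z                          # shape (z, w)
        post = joint / np.maximum(joint.sum(0, keepdims=True), 1e-12)
        p_z_d = (post * new_doc_counts[None, :]).sum(1)
        p_z_d /= p_z_d.sum()
    return p_z_d

# toy example: 2 trained topics over a 6-word vocabulary, one new document
p_w_z = np.array([[.4, .3, .2, .05, .03, .02],
                  [.02, .03, .05, .3, .3, .3]])
new_doc = np.array([0, 1, 0, 3, 2, 4], dtype=float)
print(np.round(fold_in(new_doc, p_w_z), 3))
```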

Journal ArticleDOI
TL;DR: A variable string length genetic algorithm is proposed that automatically evolves the proper number of clusters and provides near-optimal data set clustering.
Abstract: In this paper, we develop a genetic algorithm method based on a latent semantic model (GAL) for text clustering. The main difficulty in the application of genetic algorithms (GAs) for document clustering is the thousands or even tens of thousands of dimensions in feature space which are typical for textual data, because the most straightforward and popular approach represents texts with the vector space model (VSM), in which each unique term in the vocabulary represents one dimension. Latent semantic indexing (LSI) is a successful technology in information retrieval which attempts to explore the latent semantics implied by a query or a document through representing them in a dimension-reduced space. Meanwhile, LSI takes into account the effects of synonymy and polysemy, which constructs a semantic structure in textual data. GA belongs to search techniques that can efficiently evolve the optimal solution in the reduced space. We propose a variable string length genetic algorithm which has been exploited for automatically evolving the proper number of clusters as well as providing near optimal data set clustering. GA can be used in conjunction with the reduced latent semantic structure and improve clustering efficiency and accuracy. The superiority of the GAL approach over a conventional GA applied in the VSM model is demonstrated by good clustering results on Reuters documents.
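For reference, a minimal sketch of the LSI reduction in which the GAL approach clusters: a rank-k truncated SVD of the term-document matrix gives low-dimensional document coordinates. The genetic algorithm itself is not shown, and the toy matrix and k are illustrative.

```python
import numpy as np

def lsi_reduce(term_doc, k=2):
    """Latent Semantic Indexing: rank-k truncated SVD of the term-document
    matrix. Documents are then represented as rows of V_k * S_k, the reduced
    space in which the genetic algorithm would cluster (GA not shown)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return Vt[:k].T * s[:k]               # (n_docs, k) document coordinates

# toy term-document matrix (terms x documents)
term_doc = np.array([[2, 0, 1, 0],
                     [1, 0, 2, 0],
                     [0, 3, 0, 1],
                     [0, 1, 0, 2]], dtype=float)
print(np.round(lsi_reduce(term_doc, k=2), 2))
```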

Proceedings Article
11 Jul 2009
TL;DR: Compared to existing probabilistic models of latent variables, the proposed perceptron-style algorithm lowers the training cost significantly yet with comparable or even superior classification accuracy.
Abstract: We propose a perceptron-style algorithm for fast discriminative training of structured latent variable models, and analyze its convergence properties. Our method extends the perceptron algorithm to learning tasks with latent dependencies, which may not be captured by traditional models. It relies on Viterbi decoding over latent variables, combined with simple additive updates. Compared to existing probabilistic models of latent variables, our method lowers the training cost significantly yet achieves comparable or even superior classification accuracy.

Proceedings Article
11 Jul 2009
TL;DR: This paper investigates a generative history-based parsing model that synchronises the derivation of non-planar graphs representing semantic dependencies with the derivation of dependency trees representing syntactic structures, achieving a relative error reduction of 12% in semantic F score over previously proposed synchronous models that cannot process non-planarity online.
Abstract: This paper investigates a generative history-based parsing model that synchronises the derivation of non-planar graphs representing semantic dependencies with the derivation of dependency trees representing syntactic structures. To process non-planarity online, the semantic transition-based parser uses a new technique to dynamically reorder nodes during the derivation. While the synchronised derivations allow different structures to be built for the semantic non-planar graphs and syntactic dependency trees, useful statistical dependencies between these structures are modeled using latent variables. The resulting synchronous parser achieves competitive performance on the CoNLL-2008 shared task, with a relative error reduction of 12% in semantic F score over previously proposed synchronous models that cannot process non-planarity online.

Proceedings ArticleDOI
31 May 2009
TL;DR: It is argued that the use of latent variables can help capture long range dependencies and improve the recall on segmenting long words, e.g., named-entities.
Abstract: Conventional approaches to Chinese word segmentation treat the problem as a character-based tagging task. Recently, semi-Markov models have been applied to the problem, incorporating features based on complete words. In this paper, we propose an alternative, a latent variable model, which uses hybrid information based on both word sequences and character sequences. We argue that the use of latent variables can help capture long range dependencies and improve the recall on segmenting long words, e.g., named-entities. Experimental results show that this is indeed the case. With this improvement, evaluations on the data of the second SIGHAN CWS bakeoff show that our system is competitive with the best ones in the literature.

Proceedings ArticleDOI
04 Jun 2009
TL;DR: This work took a pre-existing generative latent variable model of joint syntactic-semantic dependency parsing, developed for English, and applied it to six new languages with minimal adjustments, resulting in a parser that was ranked third overall and robustness across languages indicates that this parser has a very general feature set.
Abstract: Motivated by the large number of languages (seven) and the short development time (two months) of the 2009 CoNLL shared task, we exploited latent variables to avoid the costly process of hand-crafted feature engineering, allowing the latent variables to induce features from the data. We took a pre-existing generative latent variable model of joint syntactic-semantic dependency parsing, developed for English, and applied it to six new languages with minimal adjustments. The parser's robustness across languages indicates that this parser has a very general feature set. The parser's high performance indicates that its latent variables succeeded in inducing effective features. This system was ranked third overall with a macro averaged F1 score of 82.14%, only 0.5% worse than the best system.

Book ChapterDOI
18 Mar 2009
TL;DR: In this article, the authors address the question: how can the distance between different semantic spaces be computed? By representing each Semantic Space as a subspace of a more general Hilbert space, the relationship between Semantic Spaces can be computed by means of the subspace distance.
Abstract: Semantic Space models, which provide a numerical representation of word meaning extracted from a corpus of documents, have been formalized in terms of Hermitian operators over real valued Hilbert spaces by Bruza et al. [1]. The collapse of a word into a particular meaning has been investigated applying the notion of quantum collapse of superpositional states [2]. While the semantic association between words in a Semantic Space can be computed by means of the Minkowski distance [3] or the cosine of the angle between the vector representation of each pair of words, a new procedure is needed in order to establish relations between two or more Semantic Spaces. We address the question: how can the distance between different Semantic Spaces be computed? By representing each Semantic Space as a subspace of a more general Hilbert space, the relationship between Semantic Spaces can be computed by means of the subspace distance. Such a distance needs to take into account the difference in dimension between subspaces. The availability of a distance for comparing different Semantic Subspaces would enable a deeper understanding of the geometry of Semantic Spaces, which would possibly translate into better effectiveness in Information Retrieval tasks.
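One common way to compute a distance between two such subspaces is via principal angles: orthonormalise each basis, take the singular values of the product of the bases (the cosines of the principal angles), and combine them into a projection-metric style distance that tolerates different dimensions. The sketch below is a generic construction under that assumption, not necessarily the exact measure the chapter proposes.

```python
import numpy as np

def subspace_distance(A, B):
    """A generic distance between the column spans of A and B (the semantic
    subspaces): orthonormalise each basis, compute the singular values of
    Qa^T Qb (cosines of the principal angles), and use a projection-metric
    style distance that tolerates subspaces of different dimension.
    Not necessarily the exact measure the chapter proposes."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    cosines = np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), 0.0, 1.0)
    k = max(Qa.shape[1], Qb.shape[1])
    return np.sqrt(k - np.sum(cosines ** 2))

# toy example: two 3-dimensional subspaces of a 6-dimensional word space
rng = np.random.default_rng(5)
A = rng.normal(size=(6, 3))
B = 0.7 * A + 0.3 * rng.normal(size=(6, 3))        # a perturbed copy of A
print(round(float(subspace_distance(A, B)), 3))
print(round(float(subspace_distance(A, A)), 3))    # approximately 0 for identical spans
```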