
Showing papers on "Probabilistic latent semantic analysis" published in 2010


Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.

10,501 citations
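
A minimal numpy/scikit-learn sketch of the alternation described in the abstract: latent placements are fixed for the positive examples under the current weights, after which the remaining problem is a standard convex linear SVM. The helper names and data layout are illustrative, not the authors' implementation.

import numpy as np
from sklearn.svm import LinearSVC

def best_latent_features(w, candidates):
    # Pick the candidate feature vector (latent placement) that scores highest under w.
    return candidates[np.argmax(candidates @ w)]

def train_latent_svm(pos_candidates, neg_features, dim, n_iters=5, C=1.0):
    # pos_candidates: list of (n_candidates_i, dim) arrays, one per positive example
    # neg_features:   (n_neg, dim) array of mined hard-negative feature vectors
    w = np.zeros(dim)
    for _ in range(n_iters):
        # Step 1: fix latent values for the positives using the current model.
        X_pos = np.stack([best_latent_features(w, c) for c in pos_candidates])
        # Step 2: with latent values fixed, optimize the (now convex) SVM objective.
        X = np.vstack([X_pos, neg_features])
        y = np.concatenate([np.ones(len(X_pos)), -np.ones(len(neg_features))])
        clf = LinearSVC(C=C).fit(X, y)
        w = clf.coef_.ravel()
    return w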


Journal Article
TL;DR: This work presents an efficient algorithm for learning with posterior regularization and illustrates its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.
Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.

570 citations
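
The core computational step of the framework is a KL projection of the model posterior onto the constraint set. The toy sketch below, with a single expectation constraint and a hypothetical indicator feature, shows how such a projection can be computed; it illustrates the idea rather than reproducing the authors' algorithm.

import numpy as np
from scipy.optimize import brentq

def project_posterior(q0, f, b):
    # Find q minimizing KL(q || q0) subject to E_q[f] <= b (a single constraint);
    # the solution has the form q ∝ q0 * exp(-lam * f) for some lam >= 0.
    def expectation(lam):
        q = q0 * np.exp(-lam * f)
        return (q / q.sum()) @ f
    lam = 0.0 if expectation(0.0) <= b else brentq(lambda l: expectation(l) - b, 0.0, 100.0)
    q = q0 * np.exp(-lam * f)
    return q / q.sum()

q0 = np.array([0.7, 0.2, 0.1])          # unconstrained posterior over 3 latent states
f = np.array([1.0, 0.0, 0.0])           # feature: indicator of state 0
print(project_posterior(q0, f, b=0.5))  # expected indicator forced down to 0.5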


Book ChapterDOI
01 Jan 2010
TL;DR: A statistical model can be called a latent class (LC) or mixture model if it assumes that some of its parameters differ across unobserved subgroups, LCs, or mixture components as mentioned in this paper.
Abstract: A statistical model can be called a latent class (LC) or mixture model if it assumes that some of its parameters differ across unobserved subgroups, LCs, or mixture components. This rather general idea has several seemingly unrelated applications, the most important of which are clustering, scaling, density estimation, and random-effects modeling. This article describes simple LC models for clustering, restricted LC models for scaling, and mixture regression models for nonparametric random-effects modeling, as well as gives an overview of recent developments in the field of LC analysis. Moreover, attention is paid to topics such as maximum likelihood estimation, identification issues, model selection, and software.

431 citations
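
As a concrete illustration of the simple LC models for clustering mentioned above, the sketch below fits a latent class (Bernoulli mixture) model to binary item responses with EM; the data and the choice of two classes are toy assumptions.

import numpy as np

def latent_class_em(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                 # latent class proportions
    theta = rng.uniform(0.25, 0.75, (K, d))  # item-response probabilities per class
    for _ in range(n_iter):
        # E-step: posterior class memberships (responsibilities)
        log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(pi)
        log_lik -= log_lik.max(axis=1, keepdims=True)
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update proportions and conditional response probabilities
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = np.clip((resp.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta, resp

# Toy example: two latent classes generating 6 binary items
rng = np.random.default_rng(1)
X = np.vstack([rng.binomial(1, 0.8, (100, 6)), rng.binomial(1, 0.2, (100, 6))])
pi, theta, resp = latent_class_em(X, K=2)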


Proceedings Article
31 Mar 2010
TL;DR: In this article, a variational inference framework for training the Gaussian process latent variable model and thus performing Bayesian nonlinear dimensionality reduction is introduced, which can automatically select the dimensionality of the nonlinear latent space.
Abstract: We introduce a variational inference framework for training the Gaussian process latent variable model and thus performing Bayesian nonlinear dimensionality reduction. This method allows us to variationally integrate out the input variables of the Gaussian process and compute a lower bound on the exact marginal likelihood of the nonlinear latent variable model. The maximization of the variational lower bound provides a Bayesian training procedure that is robust to overfitting and can automatically select the dimensionality of the nonlinear latent space. We demonstrate our method on real world datasets. The focus in this paper is on dimensionality reduction problems, but the methodology is more general. For example, our algorithm is immediately applicable for training Gaussian process models in the presence of missing or uncertain inputs.

338 citations
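
A brief sketch of how such a Bayesian GP-LVM might be trained with the GPy library (the availability of GPy.models.BayesianGPLVM with a default ARD kernel is an assumption of this sketch, and the data are toy noise). The automatic selection of latent dimensionality shows up as ARD lengthscales that grow large for unused latent dimensions.

import numpy as np
import GPy  # assumed available; provides a Bayesian GP-LVM implementation

Y = np.random.default_rng(0).standard_normal((50, 12))   # toy high-dimensional observations
m = GPy.models.BayesianGPLVM(Y, input_dim=5, num_inducing=10)
m.optimize(messages=False, max_iters=200)
print(m.kern.lengthscale)  # large lengthscales indicate latent dimensions pruned away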


Journal ArticleDOI
TL;DR: An effective static technique for automatic bug localization can be built around Latent Dirichlet allocation (LDA), and there is no significant relationship between the accuracy of the LDA-based technique and the size of the subject software system or the stability of its source code base.
Abstract: Context: Some recent static techniques for automatic bug localization have been built around modern information retrieval (IR) models such as latent semantic indexing (LSI). Latent Dirichlet allocation (LDA) is a generative statistical model that has significant advantages, in modularity and extensibility, over both LSI and probabilistic LSI (pLSI). Moreover, LDA has been shown effective in topic model based information retrieval. In this paper, we present a static LDA-based technique for automatic bug localization and evaluate its effectiveness. Objective: We evaluate the accuracy and scalability of the LDA-based technique and investigate whether it is suitable for use with open-source software systems of varying size, including those developed using agile methods. Method: We present five case studies designed to determine the accuracy and scalability of the LDA-based technique, as well as its relationships to software system size and to source code stability. The studies examine over 300 bugs across more than 25 iterations of three software systems. Results: The results of the studies show that the LDA-based technique maintains sufficient accuracy across all bugs in a single iteration of a software system and is scalable to a large number of bugs across multiple revisions of two software systems. The results of the studies also indicate that the accuracy of the LDA-based technique is not affected by the size of the subject software system or by the stability of its source code base. Conclusion: We conclude that an effective static technique for automatic bug localization can be built around LDA. We also conclude that there is no significant relationship between the accuracy of the LDA-based technique and the size of the subject software system or the stability of its source code base. Thus, the LDA-based technique is widely applicable.

299 citations
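
A hypothetical end-to-end sketch of an LDA-based bug-localization pipeline of the kind evaluated above: topics are fit over source files, the bug report is projected into topic space, and files are ranked by similarity. The file contents and the query are toy stand-ins, not the paper's data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

source_files = {
    "parser.py": "parse token stream grammar syntax error recovery",
    "render.py": "draw widget layout paint screen refresh",
    "auth.py":   "login password token session authentication failure",
}
bug_report = "authentication fails after session token expires"

vec = CountVectorizer()
X = vec.fit_transform(source_files.values())
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

doc_topics = lda.transform(X)                               # file x topic proportions
query_topics = lda.transform(vec.transform([bug_report]))   # bug report in topic space
scores = cosine_similarity(query_topics, doc_topics).ravel()
ranking = sorted(zip(source_files, scores), key=lambda p: -p[1])
print(ranking)  # files most likely to contain the reported bug come first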


Proceedings ArticleDOI
06 Aug 2010
TL;DR: The modeling framework can be viewed as a combination of dimensionality reduction and graphical modeling (to capture remaining statistical structure not attributable to the latent variables) and it consistently estimates both the number of hidden components and the conditional graphical model structure among the observed variables.
Abstract: Suppose we have samples of a subset of a collection of random variables. No additional information is provided about the number of latent variables, nor of the relationship between the latent and observed variables. Is it possible to discover the number of hidden components, and to learn a statistical model over the entire collection of variables? We address this question in the setting in which the latent and observed variables are jointly Gaussian, with the conditional statistics of the observed variables conditioned on the latent variables being specified by a graphical model. As a first step we give natural conditions under which such latent-variable Gaussian graphical models are identifiable given marginal statistics of only the observed variables. Essentially these conditions require that the conditional graphical model among the observed variables is sparse, while the effect of the latent variables is “spread out” over most of the observed variables. Next we propose a tractable convex program based on regularized maximum-likelihood for model selection in this latent-variable setting; the regularizer uses both the ℓ1 norm and the nuclear norm. Our modeling framework can be viewed as a combination of dimensionality reduction (to identify latent variables) and graphical modeling (to capture remaining statistical structure not attributable to the latent variables), and it consistently estimates both the number of hidden components and the conditional graphical model structure among the observed variables. These results are applicable in the high-dimensional setting in which the number of latent/observed variables grows with the number of samples of the observed variables. The geometric properties of the algebraic varieties of sparse matrices and of low-rank matrices play an important role in our analysis.

157 citations
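
The convex program described above can be sketched directly in cvxpy: a sparse conditional precision term S, a positive semidefinite low-rank term L, an l1 penalty on S, and a nuclear-norm penalty on L (which reduces to a trace penalty since L is constrained PSD). The data and the regularization weights below are illustrative only.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
p, n = 8, 200
data = rng.standard_normal((n, p))
Sigma = np.cov(data, rowvar=False)           # empirical covariance of observed variables

alpha, beta = 0.05, 0.5                      # sparsity / rank trade-off (illustrative)
S = cp.Variable((p, p), symmetric=True)      # conditional precision among observed variables
L = cp.Variable((p, p), PSD=True)            # effect of the hidden variables (low rank)

objective = cp.Minimize(
    -cp.log_det(S - L) + cp.trace(Sigma @ (S - L))
    + alpha * cp.sum(cp.abs(S)) + beta * cp.trace(L)
)
prob = cp.Problem(objective, [S - L >> 0])
prob.solve()
print("estimated number of latent variables:",
      int(np.sum(np.linalg.eigvalsh(L.value) > 1e-4)))  # rank of L = number of hidden components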


Journal ArticleDOI
TL;DR: A novel hierarchical LDA-RF (latent Dirichlet allocation-random forest) model to predict human protein-protein interactions directly from protein primary sequences is proposed, which is featured by a high success rate and a strong ability to handle large-scale data sets by mining the hidden internal structures buried in the noisy amino acid sequences in a low-dimensional latent semantic space.
Abstract: Protein-protein interaction (PPI) is at the core of the entire interactomic system of any living organism. Although many human protein-protein interaction links have been experimentally determined, the number is still small compared to the estimate that there are ~300,000 protein-protein interactions in human beings. Hence, it is still urgent and challenging to develop automated computational methods to accurately and efficiently predict protein-protein interactions. In this paper, we propose a novel hierarchical LDA-RF (latent Dirichlet allocation-random forest) model to predict human protein-protein interactions directly from protein primary sequences, which is featured by a high success rate and a strong ability to handle large-scale data sets by mining the hidden internal structures buried in the noisy amino acid sequences in a low-dimensional latent semantic space. First, the local sequential features represented by conjoint triads are constructed from sequences. Then the gene...

156 citations
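
An illustrative scikit-learn sketch of the LDA-RF idea on toy data: conjoint-triad-style count features are mapped to a low-dimensional latent topic space with LDA, and a random forest is trained on the latent representation of each protein pair. The feature dimensions and labels here are placeholders, not the paper's data or exact pipeline.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_pairs, n_features = 200, 343        # 343 = 7^3 conjoint-triad bins per protein (illustrative)
counts = rng.poisson(1.0, (n_pairs, n_features))   # stand-in for conjoint-triad counts
labels = rng.integers(0, 2, n_pairs)               # 1 = interacting pair, 0 = non-interacting

lda = LatentDirichletAllocation(n_components=20, random_state=0)
latent = lda.fit_transform(counts)                 # low-dimensional latent semantic features

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(latent, labels)
print("training accuracy on toy data:", rf.score(latent, labels))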


Patent
Ji-hyun Lee, Chin-Wan Chung
12 Feb 2010
TL;DR: In this paper, a semantic search system using a semantic ranking scheme is presented, which includes an ontology analyzer analyzing ontology data related to a search target to determine a weight value of each property according to a weighting method for each property; and a semantic path extractor extracting all the semantic paths between resources and query keywords and determining a weight value for each extracted semantic path according to the semantic path weight value determination scheme.
Abstract: A semantic search system using a semantic ranking scheme including: an ontology analyzer analyzing ontology data related to a search target to determine a weight value of each property according to a weighting method for each property; a semantic path extractor extracting all the semantic paths between resources and query keywords and determining a weight value of each extracted semantic path according to the semantic path weight value determination scheme by using the weight value of each property; a relevant resource searcher traversing an instance graph of ontology based on a semantic path having a pre-set length and a weight value above an expectation level to search for resources that have a semantic relationship with the query keywords and are declared as a type presented in the query; and a semantic relevance ranker selecting the top-k results having the highest rank from among the candidate results extracted by the relevant resource searcher by using a relevance scoring function.

127 citations


Proceedings Article
09 Oct 2010
TL;DR: This work uses discriminative training to create a projection of documents from multiple languages into a single translingual vector space and evaluates these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters.
Abstract: Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corresponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best.

126 citations


Proceedings Article
23 Aug 2010
TL;DR: The authors proposed semantic role features for a tree-to-string transducer to model the reordering/deletion of source-side semantic roles, which significantly outperformed systems trained based on Max-Likelihood and EM.
Abstract: We propose semantic role features for a Tree-to-String transducer to model the reordering/deletion of source-side semantic roles. These semantic features, as well as the Tree-to-String templates, are trained based on a conditional log-linear model and are shown to significantly outperform systems trained based on Max-Likelihood and EM. We also show significant improvement in sentence fluency by using the semantic role features in the log-linear model, based on manual evaluation.

120 citations


Proceedings Article
31 Mar 2010
TL;DR: This paper proposes a robust approach to factorizing the latent space into shared and private spaces by introducing orthogonality constraints, which penalize redundant latent representations.
Abstract: Existing approaches to multi-view learning are particularly effective when the views are either independent (i.e., multi-kernel approaches) or fully dependent (i.e., shared latent spaces). However, in real scenarios, these assumptions are almost never truly satisfied. Recently, two methods have attempted to tackle this problem by factorizing the information and learning separate latent spaces for modeling the shared (i.e., correlated) and private (i.e., independent) parts of the data. However, these approaches are very sensitive to parameter settings or initialization. In this paper we propose a robust approach to factorizing the latent space into shared and private spaces by introducing orthogonality constraints, which penalize redundant latent representations. Furthermore, unlike previous approaches, we simultaneously learn the structure and dimensionality of the latent spaces by relying on a regularizer that encourages the latent space of each data stream to be low dimensional. To demonstrate the benefits of our approach, we apply it to two existing shared latent space models that assume full dependence of the views, the sGPLVM and the sKIE, and show that our constraints improve the performance of these models on the task of pose estimation from monocular images.
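
A toy numpy sketch of the kind of orthogonality penalty referred to above: the cross-correlation between shared and private latent coordinates is penalized so the two subspaces do not encode redundant information. The exact formulation in the paper may differ.

import numpy as np

def orthogonality_penalty(Z_shared, Z_private):
    # Squared Frobenius norm of the cross-product; zero when the two subspaces are orthogonal.
    return np.linalg.norm(Z_shared.T @ Z_private, "fro") ** 2

rng = np.random.default_rng(0)
Z_shared = rng.standard_normal((100, 3))    # shared latent coordinates (100 samples)
Z_private = rng.standard_normal((100, 2))   # private latent coordinates for one view
print(orthogonality_penalty(Z_shared, Z_private))  # added to the model objective as a regularizer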

Proceedings ArticleDOI
06 Dec 2010
TL;DR: A large-margin learning framework to discover a predictive latent subspace representation shared by multiple views based on an undirected latent space Markov network that fulfills a weak conditional independence assumption that multi-view observations and response variables are independent given a set of latent variables.
Abstract: Learning from multi-view data is important in many applications, such as image classification and annotation. In this paper, we present a large-margin learning framework to discover a predictive latent subspace representation shared by multiple views. Our approach is based on an undirected latent space Markov network that fulfills a weak conditional independence assumption that multi-view observations and response variables are independent given a set of latent variables. We provide efficient inference and parameter estimation methods for the latent sub-space model. Finally, we demonstrate the advantages of large-margin learning on real video and web image data for discovering predictive latent representations and improving the performance on image classification, annotation and retrieval.

Journal ArticleDOI
10 Jun 2010
TL;DR: A simple approach to learning models of visual object categories from images gathered from Internet image search engines, derived from the probabilistic latent semantic analysis technique for text document analysis, that can be used to automatically learn object models from these data.
Abstract: In this paper, we describe a simple approach to learning models of visual object categories from images gathered from Internet image search engines. The images for a given keyword are typically highly variable, with a large fraction being unrelated to the query term, and thus pose a challenging environment from which to learn. By training our models directly from Internet images, we remove the need to laboriously compile training data sets, required by most other recognition approaches; this opens up the possibility of learning object category models “on-the-fly.” We describe two simple approaches, derived from the probabilistic latent semantic analysis (pLSA) technique for text document analysis, that can be used to automatically learn object models from these data. We show two applications of the learned model: first, to rerank the images returned by the search engine, thus improving the quality of the search engine; and second, to recognize objects in other image data sets.
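
Since both object-category approaches are derived from pLSA, the compact numpy sketch below spells out the pLSA EM updates on a toy term-document (or visual-word) count matrix; it is a generic pLSA implementation, not the authors' object models.

import numpy as np

def plsa(N, K, n_iter=50, seed=0):
    # N: (docs x words) count matrix; returns P(z|d) and P(w|z).
    rng = np.random.default_rng(seed)
    D, W = N.shape
    P_z_d = rng.dirichlet(np.ones(K), size=D)       # (D, K) document-topic distributions
    P_w_z = rng.dirichlet(np.ones(W), size=K)       # (K, W) topic-word distributions
    for _ in range(n_iter):
        # E-step: responsibilities P(z | d, w), shape (D, K, W)
        joint = P_z_d[:, :, None] * P_w_z[None, :, :]
        post = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate the topic-word and document-topic distributions
        weighted = N[:, None, :] * post             # n(d, w) * P(z | d, w)
        P_w_z = weighted.sum(axis=0) + 1e-12
        P_w_z /= P_w_z.sum(axis=1, keepdims=True)
        P_z_d = weighted.sum(axis=2) + 1e-12
        P_z_d /= P_z_d.sum(axis=1, keepdims=True)
    return P_z_d, P_w_z

N = np.random.default_rng(1).poisson(1.0, (20, 30))   # toy "visual word" counts
P_z_d, P_w_z = plsa(N, K=3)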

Proceedings Article
05 Jun 2010
TL;DR: Experiments show that a categorical model using NMF results in better performance for SemEval and fairy tales, whereas a dimensional model performs better with ISEAR.
Abstract: In this paper we present an evaluation of new techniques for automatically detecting emotions in text. The study evaluates a categorical model and a dimensional model for the recognition of four affective states: Anger, Fear, Joy, and Sadness, which are common emotions across three datasets: SemEval-2007 "Affective Text", ISEAR (International Survey on Emotion Antecedents and Reactions), and children's fairy tales. In the first model, WordNet-Affect is used as a linguistic lexical resource and three dimensionality reduction techniques are evaluated: Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), and Non-negative Matrix Factorization (NMF). In the second model, ANEW (Affective Norms for English Words), a normative database with affective terms, is employed. Experiments show that a categorical model using NMF results in better performance for SemEval and fairy tales, whereas a dimensional model performs better with ISEAR.
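
A small scikit-learn sketch of the dimensionality-reduction step used by the categorical model: factorizing a term-document matrix with NMF. The toy sentences stand in for the datasets, and the mapping of the resulting factors to the four emotions via WordNet-Affect is not reproduced here.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

sentences = [
    "the storm terrified the children",
    "she was furious about the broken promise",
    "the reunion filled everyone with joy",
    "he wept quietly over the loss",
]
X = TfidfVectorizer().fit_transform(sentences)
nmf = NMF(n_components=4, init="nndsvda", random_state=0)
doc_factors = nmf.fit_transform(X)        # sentence loadings on 4 latent factors
print(doc_factors.round(2))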

Journal ArticleDOI
TL;DR: A novel method for automatically classifying consumer video clips based on their soundtracks using a set of 25 overlapping semantic classes, chosen for their usefulness to users, viability of automatic detection and of annotator labeling, and sufficiency of representation in available video collections.
Abstract: This paper presents a novel method for automatically classifying consumer video clips based on their soundtracks. We use a set of 25 overlapping semantic classes, chosen for their usefulness to users, viability of automatic detection and of annotator labeling, and sufficiency of representation in available video collections. A set of 1873 videos from real users has been annotated with these concepts. Starting with a basic representation of each video clip as a sequence of mel-frequency cepstral coefficient (MFCC) frames, we experiment with three clip-level representations: single Gaussian modeling, Gaussian mixture modeling, and probabilistic latent semantic analysis of a Gaussian component histogram. Using such summary features, we produce support vector machine (SVM) classifiers based on the Kullback-Leibler, Bhattacharyya, or Mahalanobis distance measures. Quantitative evaluation shows that our approaches are effective for detecting interesting concepts in a large collection of real-world consumer video clips.
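
A hedged sketch of the simplest of the three clip-level representations, a single Gaussian over MFCC frames, combined with an SVM on a symmetric-KL kernel; the clips below are synthetic noise and the diagonal-covariance KL is a simplification of the full-covariance case described above.

import numpy as np
import librosa
from sklearn.svm import SVC

def clip_summary(y, sr=22050):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, frames)
    return mfcc.mean(axis=1), mfcc.var(axis=1) + 1e-6    # diagonal single-Gaussian summary

def sym_kl(p, q):
    # Symmetric KL divergence between two diagonal Gaussians.
    (mu1, v1), (mu2, v2) = p, q
    kl = lambda m1, s1, m2, s2: 0.5 * np.sum(s1 / s2 + (m2 - m1) ** 2 / s2 - 1 + np.log(s2 / s1))
    return kl(mu1, v1, mu2, v2) + kl(mu2, v2, mu1, v1)

rng = np.random.default_rng(0)
clips = [rng.standard_normal(22050) * (1 + 0.5 * i) for i in range(8)]  # toy "soundtracks"
labels = [0, 0, 0, 0, 1, 1, 1, 1]

summaries = [clip_summary(y) for y in clips]
D = np.array([[sym_kl(a, b) for b in summaries] for a in summaries])
K = np.exp(-0.1 * D)                                      # distance -> precomputed kernel
clf = SVC(kernel="precomputed").fit(K, labels)
print("training accuracy on toy clips:", clf.score(K, labels))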

Proceedings ArticleDOI
05 Jan 2010
TL;DR: This paper sheds light on the theory that underlies text mining methods and provides guidance for researchers who seek to apply these methods.
Abstract: The amount of textual data that is available for researchers and businesses to analyze is increasing at a dramatic rate. This reality has led IS researchers to investigate various text mining techniques. This essay examines four text mining methods that are frequently used in order to identify their advantages and limitations. The four methods that we examine are (1) latent semantic analysis, (2) probabilistic latent semantic analysis, (3) latent Dirichlet allocation, and (4) the correlated topic model. We compare these four methods and highlight the optimal conditions under which to apply the various methods. Our paper sheds light on the theory that underlies text mining methods and provides guidance for researchers who seek to apply these methods.
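
A brief scikit-learn sketch contrasting two of the four methods discussed above on the same toy corpus: LSA via truncated SVD of a TF-IDF matrix, and LDA on raw counts. pLSA and the correlated topic model have no standard scikit-learn implementation and are omitted.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = [
    "stock markets fell on inflation fears",
    "the central bank raised interest rates",
    "the team won the championship final",
    "the striker scored twice in the match",
]

lsa_X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(lsa_X)

lda_X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(lda_X)

print("LSA document coordinates:\n", lsa.round(2))
print("LDA document-topic proportions:\n", lda.round(2))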

Proceedings Article
11 Jul 2010
TL;DR: A new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) is proposed which extends the probabilistic LatentSemantic Analysis model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary.
Abstract: Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.

Proceedings Article
23 Aug 2010
TL;DR: Experimental results on the Robocup sportscasting corpora in both English and Korean indicate that the probabilistic generative model presented produces more accurate semantic alignments than existing methods and also produces competitive semantic parsers and improved language generators.
Abstract: We present a probabilistic generative model for learning semantic parsers from ambiguous supervision. Our approach learns from natural language sentences paired with world states consisting of multiple potential logical meaning representations. It disambiguates the meaning of each sentence while simultaneously learning a semantic parser that maps sentences into logical form. Compared to a previous generative model for semantic alignment, it also supports full semantic parsing. Experimental results on the Robocup sportscasting corpora in both English and Korean indicate that our approach produces more accurate semantic alignments than existing methods and also produces competitive semantic parsers and improved language generators.

Book ChapterDOI
05 Sep 2010
TL;DR: A novel probabilistic inference algorithm for 3D shape estimation is proposed, based on maximum likelihood estimates of the GPLVM latent variables and the camera parameters that best fit the generated 3D shapes to the given silhouettes.
Abstract: In this paper we propose a probabilistic framework that models shape variations and infers dense and detailed 3D shapes from a single silhouette. We model two types of shape variations, the object phenotype variation and its pose variation, using two independent Gaussian Process Latent Variable Models (GPLVMs). The proposed shape variation models are learnt from 3D samples without prior knowledge about the object class, e.g. object parts and skeletons, and are combined to fully span the 3D shape space. A novel probabilistic inference algorithm for 3D shape estimation is proposed, based on maximum likelihood estimates of the GPLVM latent variables and the camera parameters that best fit the generated 3D shapes to the given silhouettes. The proposed inference involves a small number of latent variables and is computationally efficient. Experiments on both human body and shark data demonstrate the efficacy of our new approach.

Proceedings Article
02 Jun 2010
TL;DR: It is shown that the 'problem' of high-frequency words can be dealt with more elegantly, and in a way that to the authors' knowledge has not been considered in LDA, through the use of appropriate weighting schemes comparable to those sometimes used in Latent Semantic Indexing (LSI).
Abstract: Many implementations of Latent Dirichlet Allocation (LDA), including those described in Blei et al. (2003), rely at some point on the removal of stopwords, words which are assumed to contribute little to the meaning of the text. This step is considered necessary because otherwise high-frequency words tend to end up scattered across many of the latent topics without much rhyme or reason. We show, however, that the 'problem' of high-frequency words can be dealt with more elegantly, and in a way that to our knowledge has not been considered in LDA, through the use of appropriate weighting schemes comparable to those sometimes used in Latent Semantic Indexing (LSI). Our proposed weighting methods not only make theoretical sense, but can also be shown to improve precision significantly on a non-trivial cross-language retrieval task.
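
A sketch of one LSI-style weighting scheme of the kind the paper advocates, log-entropy weighting of a term-document count matrix; it illustrates the weighting idea only and is not the paper's modified LDA inference.

import numpy as np

def log_entropy_weight(counts):
    # counts: (docs x terms) array; returns an equally shaped weighted matrix.
    n_docs = counts.shape[0]
    gf = counts.sum(axis=0) + 1e-12                    # global term frequencies
    p = counts / gf                                    # P(doc | term)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    g = 1.0 + plogp.sum(axis=0) / np.log(n_docs)       # global entropy weight per term
    return np.log1p(counts) * g                        # local log weight x global weight

counts = np.array([[10, 0, 1], [8, 1, 0], [9, 0, 2], [0, 7, 5]])
print(log_entropy_weight(counts).round(2))
# High-frequency terms spread evenly across documents receive weights near zero,
# which is how the weighting curbs their influence on the latent topics.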

Journal ArticleDOI
TL;DR: A broad class of semiparametric Bayesian SEMs, which allow mixed categorical and continuous manifest variables while also allowing the latent variables to have unknown distributions, is proposed, based on centered Dirichlet process (CDP) and CDP mixture models.
Abstract: Structural equation models (SEMs) with latent variables are widely useful for sparse covariance structure modeling and for inferring relationships among latent variables. Bayesian SEMs are appealing in allowing for the incorporation of prior information and in providing exact posterior distributions of unknowns, including the latent variables. In this article, we propose a broad class of semiparametric Bayesian SEMs, which allow mixed categorical and continuous manifest variables while also allowing the latent variables to have unknown distributions. In order to include typical identifiability restrictions on the latent variable distributions, we rely on centered Dirichlet process (CDP) and CDP mixture (CDPM) models. The CDP will induce a latent class model with an unknown number of classes, while the CDPM will induce a latent trait model with unknown densities for the latent traits. A simple and efficient Markov chain Monte Carlo algorithm is developed for posterior computation, and the methods are illustrated using simulated examples and several applications.

Proceedings ArticleDOI
25 Jul 2010
TL;DR: A novel topic model using the Pitman-Yor (PY) process is proposed, called the PY topic model, which captures two properties of a document: a power-law word distribution and the presence of multiple topics.
Abstract: One important approach for knowledge discovery and data mining is to estimate unobserved variables, because latent variables can indicate hidden specific properties of observed data. The latent factor model assumes that each item in a record has a latent factor; the co-occurrence of items can then be modeled by latent factors. In document modeling, a record indicates a document represented as a "bag of words," meaning that the order of words is ignored, an item indicates a word, and a latent factor indicates a topic. Latent Dirichlet allocation (LDA) is a widely used Bayesian topic model applying the Dirichlet distribution over the latent topic distribution of a document having multiple topics. LDA assumes that latent topics, i.e., discrete latent variables, are distributed according to a multinomial distribution whose parameters are generated from the Dirichlet distribution. LDA also models a word distribution by using a multinomial distribution whose parameters follow the Dirichlet distribution. This Dirichlet-multinomial setting, however, cannot capture the power-law phenomenon of a word distribution, which is known as Zipf's law in linguistics. We therefore propose a novel topic model using the Pitman-Yor (PY) process, called the PY topic model. The PY topic model captures two properties of a document: a power-law word distribution and the presence of multiple topics. In an experiment using real data, this model outperformed LDA in document modeling in terms of perplexity.

Journal ArticleDOI
TL;DR: The schema theory for SLN, including its concepts, rule-constraint normal forms, and relevant algorithms, is proposed, providing the basis for normalized management of SLN and its applications.

Proceedings Article
23 Aug 2010
TL;DR: Two new LSA-based summarization algorithms are proposed, and their performances are compared using ROUGE-L scores.
Abstract: Text summarization addresses the problem of extracting important information from huge amounts of text data. There are various methods in the literature that aim to produce well-formed summaries. One of the most commonly used methods is Latent Semantic Analysis (LSA). In this paper, different LSA-based summarization algorithms are explained and two new LSA-based summarization algorithms are proposed. The algorithms are evaluated on Turkish documents, and their performances are compared using their ROUGE-L scores. One of our algorithms produces the best scores.
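
For reference, the sketch below shows the classic LSA summarization step that such algorithms build on: SVD of a sentence-term matrix followed by selecting one sentence per leading singular vector (Gong and Liu style). The two proposed algorithms modify this selection step in ways not reproduced here, and the sentences are toy data.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The committee approved the new budget yesterday.",
    "Spending on infrastructure will rise sharply next year.",
    "A local team celebrated its first championship in decades.",
    "Officials said the budget prioritizes road and rail projects.",
]
A = TfidfVectorizer().fit_transform(sentences).toarray().T   # terms x sentences
U, s, Vt = np.linalg.svd(A, full_matrices=False)

summary_size = 2
chosen = [int(np.argmax(np.abs(Vt[k]))) for k in range(summary_size)]  # one sentence per concept
for idx in sorted(set(chosen)):
    print(sentences[idx])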

Proceedings ArticleDOI
12 Sep 2010
TL;DR: A series of Latent Dirichlet Allocation models with varying topic counts are generated to evaluate the ability of the model to identify related source code blocks, and demonstrate the consequences of choosing too few or too many latent topics.
Abstract: The optimal number of latent topics required to model the most accurate latent substructure for a source code corpus is an open question in source code analysis. Most estimates about the number of latent topics that exist in a software corpus are based on the assumption that the data is similar to natural language, but there is little empirical evidence to support this. In order to help determine the appropriate number of topics needed to accurately represent the source code, we generate a series of Latent Dirichlet Allocation models with varying topic counts. We use a heuristic to evaluate the ability of the model to identify related source code blocks, and demonstrate the consequences of choosing too few or too many latent topics.
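
A hedged sketch of a topic-count sweep in the spirit of the study above, using held-out perplexity from scikit-learn's LDA as the evaluation criterion; the paper uses its own heuristic over related source code blocks and real source code rather than the toy documents below.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

documents = [
    "open file read buffer close file",
    "allocate buffer copy bytes free buffer",
    "parse token build tree emit error",
    "lex input produce token stream",
    "connect socket send packet receive ack",
    "bind port listen accept connection",
] * 5                                        # toy stand-in for source code blocks

X = CountVectorizer().fit_transform(documents)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

for k in (2, 4, 8, 16):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_train)
    print(f"{k:>3} topics -> held-out perplexity {lda.perplexity(X_test):8.1f}")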

Journal ArticleDOI
TL;DR: The mixture-of-experts framework allows covariates to enter the latent position cluster model in a number of ways, yielding different model interpretations, and is demonstrated through an illustrative example detailing relationships among a group of lawyers in the USA.

Journal ArticleDOI
11 Jun 2010-PLOS ONE
TL;DR: A low-dimensional, context-independent semantic map of natural language that represents simultaneously synonymy and antonymy is constructed, and provides a foundational metric system for the quantitative analysis of word meaning.
Abstract: Metric systems for semantics, or semantic cognitive maps, are allocations of words or other representations in a metric space based on their meaning. Existing methods for semantic mapping, such as Latent Semantic Analysis and Latent Dirichlet Allocation, are based on paradigms involving dissimilarity metrics. They typically do not take into account relations of antonymy and yield a large number of domain-specific semantic dimensions. Here, using a novel self-organization approach, we construct a low-dimensional, context-independent semantic map of natural language that represents simultaneously synonymy and antonymy. Emergent semantics of the map principal components are clearly identifiable: the first three correspond to the meanings of “good/bad” (valence), “calm/excited” (arousal), and “open/closed” (freedom), respectively. The semantic map is sufficiently robust to allow the automated extraction of synonyms and antonyms not originally in the dictionaries used to construct the map and to predict connotation from their coordinates. The map geometric characteristics include a limited number (∼4) of statistically significant dimensions, a bimodal distribution of the first component, increasing kurtosis of subsequent (unimodal) components, and a U-shaped maximum-spread planar projection. Both the semantic content and the main geometric features of the map are consistent between dictionaries (Microsoft Word and Princeton's WordNet), among Western languages (English, French, German, and Spanish), and with previously established psychometric measures. By defining the semantics of its dimensions, the constructed map provides a foundational metric system for the quantitative analysis of word meaning. Language can be viewed as a cumulative product of human experiences. Therefore, the extracted principal semantic dimensions may be useful to characterize the general semantic dimensions of the content of mental states. This is a fundamental step toward a universal metric system for semantics of human experiences, which is necessary for developing a rigorous science of the mind.

Journal ArticleDOI
TL;DR: A model that learns semantic representations from the distributional statistics of language, and infers semantic representations by taking into account the inherent sequential nature of linguistic data is described.
Abstract: In this paper, we describe a model that learns semantic representations from the distributional statistics of language. This model, however, goes beyond the common bag-of-words paradigm, and infers semantic representations by taking into account the inherent sequential nature of linguistic data. The model we describe, which we refer to as a Hidden Markov Topics model, is a natural extension of the current state of the art in Bayesian bag-of-words models, that is, the Topics model of Griffiths, Steyvers, and Tenenbaum (2007), preserving its strengths while extending its scope to incorporate more fine-grained linguistic information.

Journal ArticleDOI
TL;DR: A flexible approach to modeling relations in development among two or more discrete, multidimensional latent variables based on the general framework of loglinear modeling with latent variables called associative latent transition analysis (ALTA).
Abstract: To understand one developmental process, it is often helpful to investigate its relations with other developmental processes. Statistical methods that model development in multiple processes simultaneously over time include latent growth curve models with time-varying covariates, multivariate latent growth curve models, and dual trajectory models. These models are designed for growth represented by continuous, unidimensional trajectories. The purpose of this article is to present a flexible approach to modeling relations in development among two or more discrete, multidimensional latent variables based on the general framework of loglinear modeling with latent variables called associative latent transition analysis (ALTA). Focus is given to the substantive interpretation of different associative latent transition models, and exactly what hypotheses are expressed in each model. An empirical demonstration of ALTA is presented to examine the association between the development of alcohol use and sexual risk ...

Journal ArticleDOI
TL;DR: In this article, a semiparametric latent variable model is developed, in which outcome latent variables are related to explanatory latent variables and covariates through an additive structural equation formulated by a series of unspecified smooth functions.
Abstract: This article aims to develop a semiparametric latent variable model, in which outcome latent variables are related to explanatory latent variables and covariates through an additive structural equation formulated by a series of unspecified smooth functions. The Bayesian P-splines approach, together with a Markov chain Monte Carlo algorithm, is proposed to estimate smooth functions, unknown parameters, and latent variables in the model. The performance of the developed methodology is demonstrated by a simulation study. An illustrative example in analyzing bone mineral density in older men is provided. An Appendix which includes technical details of the proposed MCMC algorithm and an R code in implementing the algorithm are available as the online supplemental materials.