
Showing papers on "Probabilistic latent semantic analysis" published in 2012


Journal ArticleDOI
TL;DR: This manuscript discusses a typology of (second-order) hierarchical latent variable models that include formative relationships, and provides an overview of different approaches that can be used to estimate the parameters in these models.

1,361 citations


Proceedings ArticleDOI
08 Feb 2012
TL;DR: A scalable parallel framework for efficient inference in latent variable models over streaming web-scale data, built on a novel delta-based aggregation system with a bandwidth-efficient communication protocol, schedule-aware out-of-core storage, and approximate forward sampling to rapidly incorporate new data.
Abstract: Latent variable techniques are pivotal in tasks ranging from predicting user click patterns and targeting ads to organizing the news and managing user generated content. Latent variable techniques like topic modeling, clustering, and subspace estimation provide substantial insight into the latent structure of complex data with little or no external guidance, making them ideal for reasoning about large-scale, rapidly evolving datasets. Unfortunately, due to the data dependencies and global state introduced by latent variables and the iterative nature of latent variable inference, latent-variable techniques are often prohibitively expensive to apply to large-scale, streaming datasets. In this paper we present a scalable parallel framework for efficient inference in latent variable models over streaming web-scale data. Our framework addresses three key challenges: 1) synchronizing the global state which includes global latent variables (e.g., cluster centers and dictionaries); 2) efficiently storing and retrieving the large local state which includes the data-points and their corresponding latent variables (e.g., cluster membership); and 3) sequentially incorporating streaming data (e.g., the news). We address these challenges by introducing: 1) a novel delta-based aggregation system with a bandwidth-efficient communication protocol; 2) schedule-aware out-of-core storage; and 3) approximate forward sampling to rapidly incorporate new data. We demonstrate state-of-the-art performance of our framework by easily tackling datasets two orders of magnitude larger than those addressed by the current state-of-the-art. Furthermore, we provide an optimized and easily customizable open-source implementation of the framework.

287 citations
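
A rough sketch of the "approximate forward sampling" idea described above, in Python: a new document is folded into an existing LDA-style model by Gibbs-sampling its topic assignments while the global topic-word counts stay frozen, so no global synchronization is needed. The collapsed-Gibbs formulation, hyperparameters, and toy data are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def forward_sample_new_doc(doc_words, topic_word_counts, alpha=0.1, beta=0.01,
                           n_sweeps=5, seed=0):
    """Assign topics to a new document's words while keeping the global
    topic-word counts frozen (no update of shared state)."""
    rng = np.random.default_rng(seed)
    K, V = topic_word_counts.shape
    topic_totals = topic_word_counts.sum(axis=1)        # n_k, fixed global state
    doc_topic_counts = np.zeros(K)                       # n_dk, local state only
    z = rng.integers(K, size=len(doc_words))             # random initialization
    for i, _ in enumerate(doc_words):
        doc_topic_counts[z[i]] += 1
    for _ in range(n_sweeps):
        for i, w in enumerate(doc_words):
            doc_topic_counts[z[i]] -= 1
            p = (topic_word_counts[:, w] + beta) / (topic_totals + V * beta) \
                * (doc_topic_counts + alpha)
            z[i] = rng.choice(K, p=p / p.sum())
            doc_topic_counts[z[i]] += 1
    return doc_topic_counts / doc_topic_counts.sum()      # topic mixture of the new doc

# toy usage: 3 topics over a 6-word vocabulary, document given as word ids
counts = np.array([[20, 20, 1, 1, 1, 1],
                   [1, 1, 20, 20, 1, 1],
                   [1, 1, 1, 1, 20, 20]], dtype=float)
print(forward_sample_new_doc([0, 1, 0, 1, 2], counts))
```

In the streaming setting, the resulting local state would only later be merged back into the global counts through the delta-based aggregation layer.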


Proceedings ArticleDOI
14 May 2012
TL;DR: A probabilistic framework for semantic mapping that combines heterogeneous, uncertain information such as object observations, the shape, size, and appearance of rooms, and human input; it relies on the concept of spatial properties, which make the semantic map more descriptive and the system more scalable and better adapted for human interaction.
Abstract: This paper presents a probabilistic framework for semantic mapping that combines heterogeneous, uncertain information such as object observations, the shape, size, and appearance of rooms, and human input. It abstracts multi-modal sensory information and integrates it with conceptual common-sense knowledge in a fully probabilistic fashion. It relies on the concept of spatial properties, which make the semantic map more descriptive and the system more scalable and better adapted for human interaction. A probabilistic graphical model, a chain graph, is used to represent the conceptual information and perform spatial reasoning. Experimental results from online system tests in a large unstructured office environment highlight the system's ability to infer semantic room categories, predict the existence of objects and the values of other spatial properties, and reason about unexplored space.

277 citations


Book
29 Oct 2012
TL;DR: This book provides an overview of latent Markov modeling for longitudinal data, covering the basic model and its constrained, covariate, random-effects, and multilevel extensions, maximum likelihood and Bayesian estimation (including inference via reversible jump), model selection and hypothesis testing, and applications.
Abstract (table of contents):
Overview on Latent Markov Modeling: Introduction; Literature review on latent Markov models; Alternative approaches; Example datasets.
Background on Latent Variable and Markov Chain Models: Introduction; Latent variable models; Expectation-Maximization algorithm; Standard errors; Latent class model; Selection of the number of latent classes; Applications; Markov chain model for longitudinal data; Applications.
Basic Latent Markov Model: Introduction; Univariate formulation; Multivariate formulation; Model identifiability; Maximum likelihood estimation; Selection of the number of latent states; Applications.
Constrained Latent Markov Models: Introduction; Constraints on the measurement model; Constraints on the latent model; Maximum likelihood estimation; Model selection and hypothesis testing; Applications.
Including Individual Covariates and Relaxing Basic Model Assumptions: Introduction; Notation; Covariates in the measurement model; Covariates in the latent model; Interpretation of the resulting models; Maximum likelihood estimation; Observed information matrix, identifiability, and standard errors; Relaxing local independence; Higher order extensions; Applications.
Including Random Effects and Extension to Multilevel Data: Introduction; Random-effects formulation; Maximum likelihood estimation; Multilevel formulation; Application to the student math achievement dataset.
Advanced Topics about Latent Markov Modeling: Introduction; Dealing with continuous response variables; Dealing with missing responses; Additional computational issues; Decoding and forecasting; Selection of the number of latent states.
Bayesian Latent Markov Models: Introduction; Prior distributions; Bayesian inference via reversible jump; Alternative sampling; Application to the labor market dataset.
Appendix: Software. List of Main Symbols. Bibliography. Index.

181 citations
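
The basic latent Markov model at the core of the book is a hidden Markov model for longitudinal categorical responses. As a minimal illustration (not taken from the book), the scaled forward recursion below computes the log-likelihood that maximum likelihood estimation via EM would optimize; the parameter values are arbitrary placeholders.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of an observed sequence under a latent Markov model:
    pi = initial state probabilities, A = transition matrix, B = emission matrix."""
    alpha = pi * B[:, obs[0]]                 # forward variables at t = 0
    log_lik = 0.0
    for t in range(1, len(obs)):
        scale = alpha.sum()                   # rescale to avoid underflow
        log_lik += np.log(scale)
        alpha = (alpha / scale) @ A * B[:, obs[t]]
    return log_lik + np.log(alpha.sum())

# toy example: 2 latent states, 3 response categories (values are assumptions)
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
print(forward_log_likelihood([0, 0, 2, 1, 2], pi, A, B))
```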


Journal ArticleDOI
TL;DR: A new algorithm that mines both context and content links in social media networks to discover the underlying latent semantic space, mapping multimedia objects into latent feature vectors and thereby enabling the use of any off-the-shelf multimedia retrieval algorithm.
Abstract: Social media networks contain both content and context-specific information. Most existing methods work with either of the two for the purpose of multimedia mining and retrieval. In reality, both content and context information are rich sources of information for mining, and the full power of mining and processing algorithms can be realized only with the use of a combination of the two. This paper proposes a new algorithm which mines both context and content links in social media networks to discover the underlying latent semantic space. This mapping of the multimedia objects into latent feature vectors enables the use of any off-the-shelf multimedia retrieval algorithms. Compared to the state-of-the-art latent methods in multimedia analysis, this algorithm effectively solves the problem of sparse context links by mining the geometric structure underlying the content links between multimedia objects. Specifically for multimedia annotation, we show that an effective algorithm can be developed to directly construct annotation models by simultaneously leveraging both context and content information based on latent structure between correlated semantic concepts. We conduct experiments on the Flickr data set, which contains user tags linked with images. We illustrate the advantages of our approach over the state-of-the-art multimedia retrieval techniques.

140 citations


Journal ArticleDOI
TL;DR: A highly unsupervised, training-free, no-reference image quality assessment (IQA) model is proposed, based on the hypothesis that distorted images have certain latent characteristics that differ from those of “natural” or “pristine” images.
Abstract: We propose a highly unsupervised, training-free, no-reference image quality assessment (IQA) model that is based on the hypothesis that distorted images have certain latent characteristics that differ from those of “natural” or “pristine” images. These latent characteristics are uncovered by applying a “topic model” to visual words extracted from an assortment of pristine and distorted images. For the latent characteristics to be discriminatory between pristine and distorted images, the choice of the visual words is important. We extract quality-aware visual words that are based on natural scene statistic features [1]. We show that the similarity between the probability of occurrence of the different topics in an unseen image and the distribution of latent topics averaged over a large number of pristine natural images yields a quality measure. This measure correlates well with human difference mean opinion scores on the LIVE IQA database [2].

131 citations
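
A minimal sketch of the scoring step described in the abstract, using scikit-learn's LDA as a stand-in for the paper's topic model and random counts in place of the quality-aware visual words; the Bhattacharyya-style similarity is one reasonable choice, not necessarily the measure used in the paper.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# rows = images, columns = visual-word counts (placeholders for the NSS-based words)
rng = np.random.default_rng(0)
X_pristine = rng.integers(0, 20, size=(200, 500))
x_test = rng.integers(0, 20, size=(1, 500))

lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(X_pristine)

ref = lda.transform(X_pristine).mean(axis=0)    # average topic mixture of pristine images
test = lda.transform(x_test)[0]                 # topic mixture of the unseen image

quality = np.sum(np.sqrt(ref * test))           # similarity in [0, 1]; lower suggests distortion
print(f"quality score: {quality:.3f}")
```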


Journal ArticleDOI
TL;DR: The paper provides a full theoretical foundation for the causal discovery procedure first presented by Eberhardt et al. (2010), and adapts the procedure to the problem of cellular network inference, applying it to the biologically realistic data of the DREAM challenges.
Abstract: Identifying cause-effect relationships between variables of interest is a central problem in science. Given a set of experiments we describe a procedure that identifies linear models that may contain cycles and latent variables. We provide a detailed description of the model family, full proofs of the necessary and sufficient conditions for identifiability, a search algorithm that is complete, and a discussion of what can be done when the identifiability conditions are not satisfied. The algorithm is comprehensively tested in simulations, comparing it to competing algorithms in the literature. Furthermore, we adapt the procedure to the problem of cellular network inference, applying it to the biologically realistic data of the DREAM challenges. The paper provides a full theoretical foundation for the causal discovery procedure first presented by Eberhardt et al. (2010) and Hyttinen et al. (2010).

112 citations


Proceedings Article
03 Dec 2012
TL;DR: This work revisits the independence assumptions of probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP), leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model.
Abstract: Probabilistic latent variable models are one of the cornerstones of machine learning. They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. Such models are useful for exploratory analysis and visualization, for building density models of data, and for providing features that can be used for later discriminative tasks. A significant limitation of these models, however, is that draws from the prior are often highly redundant due to i.i.d. assumptions on internal parameters. For example, there is no preference in the prior of a mixture model to make components non-overlapping, or in a topic model to ensure that co-occurring words only appear in a small number of topics. In this work, we revisit these independence assumptions for probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP). The DPP allows us to specify a preference for diversity in our latent variables using a positive definite kernel function. Using a kernel between probability distributions, we are able to define a DPP on probability measures. We show how to perform MAP inference with DPP priors in latent Dirichlet allocation and in mixture models, leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model.

99 citations
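
The key quantity the DPP prior adds is a log-determinant diversity term over the latent components. The toy sketch below scores two sets of topic-word distributions with a Bhattacharyya (probability product) kernel; the MAP inference machinery of the paper is omitted, and this kernel is only one of the choices the framework allows.

```python
import numpy as np

def dpp_diversity(topics, rho=0.5):
    """log det of the kernel matrix between topic-word distributions;
    larger values mean the set of topics is more diverse."""
    K = np.array([[np.sum((p ** rho) * (q ** rho)) for q in topics] for p in topics])
    sign, logdet = np.linalg.slogdet(K)
    return logdet if sign > 0 else -np.inf

redundant = [np.array([.7, .1, .1, .1]), np.array([.68, .12, .1, .1]), np.array([.1, .1, .1, .7])]
diverse   = [np.array([.7, .1, .1, .1]), np.array([.1, .7, .1, .1]), np.array([.1, .1, .7, .1])]
print(dpp_diversity(redundant), dpp_diversity(diverse))   # the diverse set scores higher
```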


Proceedings ArticleDOI
16 Jun 2012
TL;DR: This paper introduces a constrained latent variable model whose generated output inherently accounts for prior knowledge about the specific problem at hand, and proposes an approach that explicitly imposes equality and inequality constraints on the model's output during learning, thus avoiding the computational burden of having to account for these constraints at inference.
Abstract: Latent variable models provide valuable compact representations for learning and inference in many computer vision tasks. However, most existing models cannot directly encode prior knowledge about the specific problem at hand. In this paper, we introduce a constrained latent variable model whose generated output inherently accounts for such knowledge. To this end, we propose an approach that explicitly imposes equality and inequality constraints on the model's output during learning, thus avoiding the computational burden of having to account for these constraints at inference. Our learning mechanism can exploit non-linear kernels, while only involving sequential closed-form updates of the model parameters. We demonstrate the effectiveness of our constrained latent variable model on the problem of non-rigid 3D reconstruction from monocular images, and show that it yields qualitative and quantitative improvements over several baselines.

95 citations


Book ChapterDOI
01 Jan 2012
TL;DR: This chapter surveys two influential forms of dimension reduction for text, latent semantic indexing and topic modeling (including probabilistic latent semantic indexing and latent Dirichlet allocation), describes the basic technologies in detail, and exposes the underlying mechanism.
Abstract: The bag-of-words representation commonly used in text analysis can be analyzed very efficiently and retains a great deal of useful information, but it is also troublesome because the same thought can be expressed using many different terms or one term can have very different meanings. Dimension reduction can collapse together terms that have the same semantics, to identify and disambiguate terms with multiple meanings and to provide a lower-dimensional representation of documents that reflects concepts instead of raw terms. In this chapter, we survey two influential forms of dimension reduction. Latent semantic indexing uses spectral decomposition to identify a lower-dimensional representation that maintains semantic properties of the documents. Topic modeling, including probabilistic latent semantic indexing and latent Dirichlet allocation, is a form of dimension reduction that uses a probabilistic model to find the co-occurrence patterns of terms that correspond to semantic topics in a collection of documents. We describe the basic technologies in detail and expose the underlying mechanism. We also discuss recent advances that have made it possible to apply these techniques to very large and evolving text collections and to incorporate network structure or other contextual information.

91 citations
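
For the latent semantic indexing half of the chapter, the standard recipe is a truncated SVD of the (weighted) term-document matrix. A small scikit-learn sketch, with a toy corpus chosen purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a feline rested on a rug",
    "stocks fell sharply on wall street",
    "the market dropped as shares tumbled",
]

X = TfidfVectorizer().fit_transform(docs)                            # bag-of-words term weights
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)    # spectral decomposition

# pairwise similarities in the low-dimensional concept space; related documents
# can end up close even when they share few or no raw terms
print(cosine_similarity(Z).round(2))
```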


Journal ArticleDOI
TL;DR: Using the eigen-decomposition of the adjacency matrix, this work shows that feature maps for latent position graphs with positive definite link function $\kappa$ can be consistently estimated, provided that the latent positions are i.i.d. from some distribution F.
Abstract: In this work we show that, using the eigen-decomposition of the adjacency matrix, we can consistently estimate feature maps for latent position graphs with positive definite link function $\kappa$, provided that the latent positions are i.i.d. from some distribution F. We then consider the exploitation task of vertex classification, where the link function $\kappa$ belongs to the class of universal kernels, class labels are observed for a number of vertices tending to infinity, and the remaining vertices are to be classified. We show that minimization of the empirical $\varphi$-risk for some convex surrogate $\varphi$ of 0-1 loss over a class of linear classifiers with increasing complexities yields a universally consistent classifier, that is, a classification rule with error converging to Bayes optimal for any distribution F.
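
A small numerical illustration of the two steps in the abstract: embed the vertices with the top eigenpairs of the adjacency matrix, then train a linear classifier on the vertices whose labels are observed. The two-block random graph and the logistic regression classifier are illustrative assumptions, not the paper's experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def adjacency_spectral_embedding(A, d):
    """Feature map estimate: top-d eigenvectors of A scaled by sqrt of |eigenvalues|."""
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))

rng = np.random.default_rng(0)
n = 100
labels = np.repeat([0, 1], n // 2)
P = np.where(labels[:, None] == labels[None, :], 0.4, 0.05)    # latent-position-style model
A = np.triu(rng.binomial(1, P), 1)
A = A + A.T                                                     # symmetric, hollow adjacency

X = adjacency_spectral_embedding(A, d=2)
observed = rng.random(n) < 0.5                                  # labels seen for ~half the vertices
clf = LogisticRegression().fit(X[observed], labels[observed])   # linear classifier on the embedding
print("held-out accuracy:", clf.score(X[~observed], labels[~observed]))
```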

Journal ArticleDOI
TL;DR: A novel method based on non-negative matrix factorization that generates multimodal image representations integrating visual features and text information, and that substantially outperforms multimodal latent semantic spaces generated by singular value decomposition on both evaluated tasks.
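
One plausible reading of such a method is a joint factorization of visual and text features; the sketch below simply concatenates the two modality matrices and factorizes them with scikit-learn's NMF, with an SVD baseline for comparison. The data, dimensions, and fusion scheme are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from sklearn.decomposition import NMF, TruncatedSVD

rng = np.random.default_rng(0)
V = np.abs(rng.normal(size=(300, 128)))     # non-negative visual features per image (assumed)
T = rng.integers(0, 3, size=(300, 50))      # tag counts per image (assumed)

X = np.hstack([V, T])                       # early fusion of the two modalities
W = NMF(n_components=20, init="nndsvda", max_iter=500, random_state=0).fit_transform(X)
# W: 20-dimensional multimodal representation per image, usable for retrieval or annotation

Z = TruncatedSVD(n_components=20, random_state=0).fit_transform(X)   # SVD-based comparison point
```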

Journal ArticleDOI
TL;DR: A robust server-side methodology to detect phishing attacks, called phishGILLNET, which incorporates the power of natural language processing and machine learning techniques and outperforms state-of-the-art phishing detection methods.
Abstract: Identity theft is one of the most profitable crimes committed by felons. In the cyber space, this is commonly achieved using phishing. We propose here a robust server-side methodology to detect phishing attacks, called phishGILLNET, which incorporates the power of natural language processing and machine learning techniques. phishGILLNET is a multi-layered approach to detect phishing attacks. The first layer (phishGILLNET1) employs Probabilistic Latent Semantic Analysis (PLSA) to build a topic model. The topic model handles synonymy (multiple words with similar meaning), polysemy (words with multiple meanings), and other linguistic variations found in phishing. Intentionally misspelled words found in phishing are handled using Levenshtein editing and Google APIs for correction. Using a term-document frequency matrix as input, PLSA finds phishing and non-phishing topics via tempered expectation maximization. The performance of phishGILLNET1 is evaluated using the PLSA fold-in technique, and classification is achieved using Fisher similarity. The second layer of phishGILLNET (phishGILLNET2) employs AdaBoost to build a robust classifier, using probability distributions of the best PLSA topics as features. The third layer (phishGILLNET3) further extends phishGILLNET2 by building a classifier from labeled and unlabeled examples via Co-Training. Experiments were conducted using one of the largest public corpora of email data, containing 400,000 emails. Results show that phishGILLNET3 outperforms state-of-the-art phishing detection methods and achieves an F-measure of 100%. Moreover, phishGILLNET3 requires only a small percentage (10%) of the data to be annotated, thus saving significant time and labor and avoiding errors incurred in human annotation.
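
A compact sketch of the first two layers, assuming plain (untempered) EM for PLSA rather than the tempered variant with fold-in used by phishGILLNET1, and random placeholder data instead of the email corpus; the topic mixtures are then fed to AdaBoost as in phishGILLNET2.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def plsa(X, K=10, n_iter=50, seed=0):
    """PLSA on a document-term count matrix X via EM: learns P(w|z) and P(z|d)."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)   # P(w|z)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)   # P(z|d)
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(w|z) * P(z|d)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]
        post /= post.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from expected counts n(d,w) * P(z|d,w)
        nz = X[:, None, :] * post
        p_w_z = nz.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = nz.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 300))      # placeholder term frequencies
y = rng.integers(0, 2, size=200)             # placeholder phishing / legitimate labels

theta, _ = plsa(X, K=10)
clf = AdaBoostClassifier(random_state=0).fit(theta, y)   # topic mixtures as features
print("training accuracy:", clf.score(theta, y))
```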

Journal ArticleDOI
TL;DR: This paper proposes a novel Attribute-Restricted Latent Topic Model (ARLTM) to encode targets into semantic topics and shows that this model performs best by imposing semantic restrictions onto the generation of human-specific attributes.

Book ChapterDOI
28 May 2012
TL;DR: The experimental results show that the PLSA recommender with the probability utilization outperforms other combinations of recommenders and utilizations for recommending locations to users on LBSNs.
Abstract: The development of location-based social networking (LBSN) services is growing rapidly these days. Users of LBSN services are interested in sharing tips and experiences of their visits to various locations. In this paper, we study recommending locations to users on LBSNs with collaborative filtering (CF) recommenders based only on users' check-in data. We first design and develop a distributed crawler to collect a large amount of check-in data from Gowalla. Then, we use three ways to utilize the check-in data, namely the binary utilization, the FIF utilization, and the probability utilization. For the different utilizations, we introduce different CF recommenders, namely user-based, item-based, and probabilistic latent semantic analysis (PLSA) recommenders, to perform the location recommendation. Finally, we conduct a set of experiments to compare the performance of the different recommenders using the different check-in utilizations. The experimental results show that the PLSA recommender with the probability utilization outperforms the other combinations of recommenders and utilizations for recommending locations to users on LBSNs.
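
A small sketch of the "probability utilization" together with a user-based CF recommender, on placeholder check-in counts rather than the Gowalla crawl; the PLSA recommender that performed best in the paper would replace the cosine-similarity step with topic mixtures learned from the same normalized matrix.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
C = rng.poisson(0.3, size=(50, 200))           # rows = users, columns = locations (toy counts)

# probability utilization: each user's check-ins become a distribution over locations
P = C / np.maximum(C.sum(axis=1, keepdims=True), 1)

# user-based CF: score unvisited locations by similarity-weighted neighbour behaviour
sim = cosine_similarity(P)
np.fill_diagonal(sim, 0)
scores = sim @ P
scores[C > 0] = -np.inf                        # do not re-recommend visited places
top5 = np.argsort(-scores, axis=1)[:, :5]      # top-5 location ids per user
print(top5[0])
```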

Journal ArticleDOI
TL;DR: This paper presents a principled approach to learning a semantic vocabulary from a large number of video words using Diffusion Maps embedding, and conjectures that the mid-level features produced by similar video sources must lie on a certain manifold.

Proceedings Article
26 Jun 2012
TL;DR: In this article, a unified framework for structured prediction with latent variables is proposed, which includes hidden conditional random fields and latent structured support vector machines as special cases, and a local entropy approximation for this general formulation using duality is derived.
Abstract: In this paper we propose a unified framework for structured prediction with latent variables which includes hidden conditional random fields and latent structured support vector machines as special cases. We describe a local entropy approximation for this general formulation using duality, and derive an efficient message passing algorithm that is guaranteed to converge. We demonstrate its effectiveness in the tasks of image segmentation as well as 3D indoor scene understanding from single images, showing that our approach is superior to latent structured support vector machines and hidden conditional random fields.

Proceedings Article
08 Jul 2012
TL;DR: It is found that by fusing local and global information, algorithms can exceed 50% accuracy on this task of distinguishing sense from nonsense based on a variety of sentence-level phenomena (the chance baseline is 20%), and some avenues for further research are suggested.
Abstract: This paper studies the problem of sentence-level semantic coherence by answering SAT-style sentence completion questions. These questions test the ability of algorithms to distinguish sense from nonsense based on a variety of sentence-level phenomena. We tackle the problem with two approaches: methods that use local lexical information, such as the n-grams of a classical language model; and methods that evaluate global coherence, such as latent semantic analysis. We evaluate these methods on a suite of practice SAT questions, and on a recently released sentence completion task based on data taken from five Conan Doyle novels. We find that by fusing local and global information, we can exceed 50% on this task (chance baseline is 20%), and we suggest some avenues for further research.
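
A toy sketch of fusing the two signal types on a sentence-completion candidate: an add-one-smoothed bigram score for local lexical fit and an LSA cosine score for global coherence, combined by naive linear interpolation. The paper's corpus, models, and combination method are far richer; everything below is a placeholder.

```python
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the detective examined the evidence at the scene",
          "holmes examined the letter with great care",
          "the violin lay silent in its case"]          # tiny placeholder corpus

vec = TfidfVectorizer().fit(corpus)
lsa = TruncatedSVD(n_components=2, random_state=0).fit(vec.transform(corpus))

def lsa_score(sentence, context):                       # global coherence
    a = lsa.transform(vec.transform([sentence]))
    b = lsa.transform(vec.transform([context]))
    return float(cosine_similarity(a, b)[0, 0])

bigrams = Counter(b for s in corpus for b in zip(s.split(), s.split()[1:]))
unigrams = Counter(w for s in corpus for w in s.split())
V = len(unigrams)

def ngram_score(sentence):                              # local lexical information
    ws = sentence.split()
    return sum(np.log((bigrams[(a, b)] + 1) / (unigrams[a] + V)) for a, b in zip(ws, ws[1:]))

def combined(sentence, context, lam=0.5):
    return lam * ngram_score(sentence) + (1 - lam) * lsa_score(sentence, context)

context = "the detective studied the scene"
candidates = ["holmes examined the evidence", "holmes examined the banana"]
print(max(candidates, key=lambda s: combined(s, context)))
```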

Journal ArticleDOI
TL;DR: This model uses a latent variable for every word in a text, representing synonyms or related words in the given context, and shows that for both semantic role labeling and word sense disambiguation the performance of a supervised classifier increases when these variables are incorporated as extra features.

Proceedings Article
26 Jun 2012
TL;DR: In this paper, a fully Bayesian latent variable model is proposed that captures structure underlying extremely high dimensional spaces by exploiting conditional nonlinear (in)dependence structures to learn an efficient latent representation.
Abstract: In this paper we present a fully Bayesian latent variable model which exploits conditional nonlinear (in)dependence structures to learn an efficient latent representation. The latent space is factorized to represent shared and private information from multiple views of the data. In contrast to previous approaches, we introduce a relaxation to the discrete segmentation and allow for a "softly" shared latent space. Further, Bayesian techniques allow us to automatically estimate the dimensionality of the latent spaces. The model is capable of capturing structure underlying extremely high dimensional spaces. This is illustrated by modelling unprocessed images with tens of thousands of pixels. This also allows us to directly generate novel images from the trained model by sampling from the discovered latent spaces. We also demonstrate the model by prediction of human pose in an ambiguous setting. Our Bayesian framework allows us to perform disambiguation in a principled manner by including latent space priors which incorporate the dynamic nature of the data.

Journal ArticleDOI
01 Jan 2012
TL;DR: An unsupervised learning approach is introduced that employs the Scale Invariant Feature Transform (SIFT) for extraction of local image features and the probabilistic latent semantic analysis (pLSA) model, used in linguistic content analysis, for data clustering.
Abstract: Since wireless capsule endoscopy (WCE) is a novel technology for recording videos of the digestive tract of a patient, the problem of segmenting the WCE video of the digestive tract into subvideos corresponding to the entrance, stomach, small intestine, and large intestine regions is not well addressed in the literature. A select few papers addressing this problem follow supervised learning approaches that presume the availability of a large database of correctly labeled training samples. Considering the difficulties in procuring sizable WCE training data sets needed for achieving high classification accuracy, we introduce in this paper an unsupervised learning approach that employs the Scale Invariant Feature Transform (SIFT) for extraction of local image features and the probabilistic latent semantic analysis (pLSA) model, used in linguistic content analysis, for data clustering. Results of experimentation indicate that this method compares well in classification accuracy with the state-of-the-art supervised classification approaches to WCE video segmentation.
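
A rough outline of such an unsupervised pipeline in code, with OpenCV SIFT and a KMeans codebook for the visual words; scikit-learn's LDA is used here as a convenient stand-in for the pLSA clustering step described in the paper, and the function names are purely illustrative.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

def bag_of_visual_words(frames, n_words=200):
    """SIFT descriptors -> KMeans codebook -> per-frame visual-word histograms."""
    sift = cv2.SIFT_create()
    per_frame = []
    for f in frames:
        gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        _, desc = sift.detectAndCompute(gray, None)
        per_frame.append(desc if desc is not None else np.empty((0, 128), np.float32))
    codebook = KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(np.vstack(per_frame))
    return np.array([np.bincount(codebook.predict(d), minlength=n_words)
                     if len(d) else np.zeros(n_words) for d in per_frame])

def segment_frames(frames, n_regions=4):
    """Cluster WCE frames into digestive-tract regions by their dominant latent topic."""
    hists = bag_of_visual_words(frames)
    topics = LatentDirichletAllocation(n_components=n_regions, random_state=0).fit_transform(hists)
    return topics.argmax(axis=1)     # unsupervised region label per frame (no training labels)
```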

Proceedings Article
03 Dec 2012
TL;DR: Factorial LDA is introduced, a multi-dimensional model in which a document is influenced by K different factors and each word token depends on a K-dimensional vector of latent variables; the model incorporates structured word priors and learns a sparse product of factors.
Abstract: Latent variable models can be enriched with a multi-dimensional structure to consider the many latent factors in a text corpus, such as topic, author perspective and sentiment. We introduce factorial LDA, a multi-dimensional model in which a document is influenced by K different factors, and each word token depends on a K-dimensional vector of latent variables. Our model incorporates structured word priors and learns a sparse product of factors. Experiments on research abstracts show that our model can learn latent factors such as research topic, scientific discipline, and focus (methods vs. applications). Our modeling improvements reduce test perplexity and improve human interpretability of the discovered factors.

Journal ArticleDOI
TL;DR: A generative statistical model is proposed to simultaneously capture both the distinction and the commonality among multiple domains; experimental results show that CD-PLSA with collaborative training is more tolerant of distribution differences, and that the local refinement gains significant further improvement in classification accuracy.
Abstract: The distribution difference among multiple domains has been exploited for cross-domain text categorization in recent years. Along this line, we show two new observations in this study. First, the data distribution difference is often due to the fact that different domains use different index words to express the same concept. Second, the association between the conceptual feature and the document class can be stable across domains. These two observations actually indicate the distinction and commonality across domains. Inspired by the above observations, we propose a generative statistical model, named Collaborative Dual-PLSA (CD-PLSA), to simultaneously capture both the domain distinction and commonality among multiple domains. Different from Probabilistic Latent Semantic Analysis (PLSA) with only one latent variable, the proposed model has two latent factors y and z, corresponding to word concept and document class, respectively. The shared commonality intertwines with the distinctions over multiple domains, and is also used as the bridge for knowledge transformation. An Expectation Maximization (EM) algorithm is developed to solve the CD-PLSA model, and a distributed version is further exploited to avoid uploading all the raw data to a centralized location and to help mitigate privacy concerns. After the training phase with all the data from multiple domains, we propose to refine the immediate outputs using only the corresponding local data. In summary, we propose a two-phase method for cross-domain text classification: the first phase for collaborative training with all the data, and the second phase for local refinement. Finally, we conduct extensive experiments over hundreds of classification tasks with multiple source domains and multiple target domains to validate the superiority of the proposed method over existing state-of-the-art methods of supervised and transfer learning. Notably, as shown by the experimental results, CD-PLSA with collaborative training is more tolerant of distribution differences, and the local refinement also gains significant improvement in terms of classification accuracy.

Proceedings ArticleDOI
10 Jun 2012
TL;DR: The effectiveness of the pLSAnorm prediction method, a modified Probabilistic Latent Semantic Analysis algorithm, is demonstrated by performing k-fold cross-validation of the GO annotations of two organisms, Gallus gallus and Bos taurus.
Abstract: Consistency and completeness of biomolecular annotations are key to the correct interpretation of biological experiments. Yet, the correctly annotated associations between genes (or proteins) and features are only some of all the existing ones. As time goes by, they increase in number and become more useful, but they remain incomplete and some of them incorrect. To support and quicken the time-consuming curation procedure and to improve the consistency of available annotations, computational methods able to supply a ranked list of predicted annotations are hence extremely useful. Starting from a previous work on the automatic prediction of Gene Ontology (GO) annotations based on the Singular Value Decomposition of the annotation matrix, where every matrix element corresponds to the association of a gene with a feature, we propose the use of a modified Probabilistic Latent Semantic Analysis (pLSA) algorithm, named pLSAnorm, to better perform such prediction. pLSA is a statistical technique from the natural language processing field that has not yet been used in bioinformatics annotation prediction; it takes advantage of the latent information contained in the analyzed data co-occurrences. We prove the effectiveness of the pLSAnorm prediction method by performing k-fold cross-validation of the GO annotations of two organisms, Gallus gallus and Bos taurus. The obtained results demonstrate the efficacy of our approach.

Proceedings Article
26 Jun 2012
TL;DR: In this paper, a nonparametric Bayesian model that posits a mixture of factor analyzers structure on the tasks is proposed to learn the "right" task structure in a data-driven manner.
Abstract: Multitask learning algorithms are typically designed assuming some fixed, a priori known latent structure shared by all the tasks. However, it is usually unclear what type of latent task structure is the most appropriate for a given multitask learning problem. Ideally, the "right" latent task structure should be learned in a data-driven manner. We present a flexible, nonparametric Bayesian model that posits a mixture of factor analyzers structure on the tasks. The nonparametric aspect makes the model expressive enough to subsume many existing models of latent task structures (e.g, mean-regularized tasks, clustered tasks, low-rank or linear/non-linear subspace assumption on tasks, etc.). Moreover, it can also learn more general task structures, addressing the shortcomings of such models. We present a variational inference algorithm for our model. Experimental results on synthetic and real-world datasets, on both regression and classification problems, demonstrate the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: This paper proposes a new generative model for Multi-view Learning via Probabilistic Latent Semantic Analysis, called MVPLSA, which jointly models the co-occurrences of features and documents from different views.

Journal ArticleDOI
TL;DR: The objective of this paper is to systematically analyze VSM, LSI and FCA for the task of IR using standard and real life datasets.
Abstract: Latent Semantic Indexing (LSI), a variant of classical Vector Space Model (VSM), is an Information Retrieval (IR) model that attempts to capture the latent semantic relationship between the data items. Mathematical lattices, under the framework of Formal Concept Analysis (FCA), represent conceptual hierarchies in data and retrieve the information. However, both LSI and FCA use the data represented in the form of matrices. The objective of this paper is to systematically analyze VSM, LSI and FCA for the task of IR using standard and real life datasets.

Proceedings Article
03 Dec 2012
TL;DR: A latent variable model for supervised dimensionality reduction and distance metric learning is described and it is shown that inference is completely tractable and an Expectation-Maximization (EM) algorithm for parameter estimation is derived.
Abstract: We describe a latent variable model for supervised dimensionality reduction and distance metric learning. The model discovers linear projections of high dimensional data that shrink the distance between similarly labeled inputs and expand the distance between differently labeled ones. The model's continuous latent variables locate pairs of examples in a latent space of lower dimensionality. The model differs significantly from classical factor analysis in that the posterior distribution over these latent variables is not always multivariate Gaussian. Nevertheless we show that inference is completely tractable and derive an Expectation-Maximization (EM) algorithm for parameter estimation. We also compare the model to other approaches in distance metric learning. The model's main advantage is its simplicity: at each iteration of the EM algorithm, the distance metric is re-estimated by solving an unconstrained least-squares problem. Experiments show that these simple updates are highly effective.

Journal ArticleDOI
TL;DR: A new approach is proposed, Discriminative Topic Model (DTM), which separates non-neighboring pairs from each other in addition to bringing neighboring pairs closer together, thereby preserving the global manifold structure as well as improving local consistency.
Abstract: Topic modeling has become a popular method used for data analysis in various domains including text documents. Previous topic model approaches, such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA), have shown impressive success in discovering low-rank hidden structures for modeling text documents. These approaches, however, do not take into account the manifold structure of the data, which is generally informative for nonlinear dimensionality reduction mapping. More recent topic model approaches, Laplacian PLSI (LapPLSI) and Locally-consistent Topic Model (LTM), have incorporated the local manifold structure into topic models and have shown resulting benefits. But they fall short of achieving the full discriminating power of manifold learning as they only enhance the proximity between the low-rank representations of neighboring pairs without any consideration for non-neighboring pairs. In this article, we propose a new approach, Discriminative Topic Model (DTM), which separates non-neighboring pairs from each other in addition to bringing neighboring pairs closer together, thereby preserving the global manifold structure as well as improving local consistency. We also present a novel model-fitting algorithm based on the generalized EM algorithm and the concept of Pareto improvement. We empirically demonstrate the success of DTM in terms of unsupervised clustering and semisupervised classification accuracies on text corpora and robustness to parameters compared to state-of-the-art techniques.

Proceedings ArticleDOI
19 Mar 2012
TL;DR: The source coding theorem of classical information theory is extended to encompass semantics in the source and it is shown that by utilizing semantic relations between source symbols, higher rate of lossless compression may be achieved compared to traditional syntactic compression methods.
Abstract: We show how semantic relationships that exist within an information-rich source can be exploited for achieving parsimonious communication between a pair of semantically-aware nodes that preserves quality of information. We extend the source coding theorem of classical information theory to encompass semantics in the source and show that by utilizing semantic relations between source symbols, higher rate of lossless compression may be achieved compared to traditional syntactic compression methods. We define the capacity of a semantic source as the mutual information between its models and syntactic messages, and show that it equals the average semantic entropy of its messages. We further show the duality of semantic redundancy and semantic ambiguity in compressing semantic data, and establish the semantic capacity of a source as the lower bound on semantic compression. Finally, we give a practical semantic compression algorithm that exploits the graph structure of a shared knowledge base to facilitate semantic communication between a pair of nodes.
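
A toy numerical illustration of the central claim that exploiting semantic redundancy lowers the achievable code rate: mapping synonymous source symbols to shared concepts from a knowledge base reduces the entropy that lower-bounds lossless code length, provided the receiver shares the knowledge base. The vocabulary and synonym table are invented for illustration; the paper's algorithm exploits the graph structure of the knowledge base rather than a flat mapping.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy in bits/symbol: the lower bound on lossless code length."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())

message = ["car", "auto", "automobile", "car", "vehicle", "road", "street", "road"]
knowledge_base = {"car": "CAR", "auto": "CAR", "automobile": "CAR", "vehicle": "CAR",
                  "road": "ROAD", "street": "ROAD"}             # shared by sender and receiver

semantic_message = [knowledge_base[w] for w in message]          # semantically lossless recoding

print(f"syntactic entropy: {entropy_bits(message):.2f} bits/symbol")
print(f"semantic entropy:  {entropy_bits(semantic_message):.2f} bits/symbol")
```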