Topic
Latent Dirichlet allocation
About: Latent Dirichlet allocation is a research topic. Over its lifetime, 5351 publications have been published within this topic, receiving 212555 citations. The topic is also known as: LDA.
Papers
01 Jan 2014
TL;DR: This paper studies two complementary cross-modal prediction tasks: predicting text given an image (“Im2Text”), and predicting image(s) given a piece of text (“Text2Im”), and proposes a novel Structural SVM-based unified formulation for these two tasks.
Abstract: Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given an image (“Im2Text”), and (ii) predicting image(s) given a piece of text (“Text2Im”). We make no assumption on the specific form of text; i.e., it could be either a set of labels, phrases, or even captions. We pose both these tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an independent text corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of unannotated images (i.e., images without any associated textual meta-data). We propose a novel Structural SVM-based unified formulation for these two tasks. For both visual and textual data, two types of representations are investigated. These are based on: (1) unimodal probability distributions over topics learned using latent Dirichlet allocation, and (2) explicitly learned multi-modal correlations using canonical correlation analysis. Extensive experiments on three popular datasets (two medium and one web-scale) demonstrate that our framework gives promising results compared to existing models under various settings, thus confirming its efficacy for both tasks.
52 citations
15 Sep 2014
TL;DR: A new method for topic detection, where a topic is determined by identifying words that appear with high frequency in the topic and low frequency in other topics, is proposed using a hierarchy of discrete latent variables.
Abstract: In the LDA approach to topic detection, a topic is determined by identifying the words that are used with high frequency when writing about the topic. However, high-frequency words in one topic may also be used with high frequency in other topics. Thus they may not be the best words to characterize the topic. In this paper, we propose a new method for topic detection, where a topic is determined by identifying words that appear with high frequency in the topic and low frequency in other topics. We model patterns of word co-occurrence and co-occurrences of those patterns using a hierarchy of discrete latent variables. The states of the latent variables represent clusters of documents and they are interpreted as topics. The words that best distinguish a cluster from other clusters are selected to characterize the topic. Empirical results show that the new method yields topics with clearer thematic characterizations than the alternative approaches.
52 citations
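The idea in the abstract above, preferring words that are frequent in one topic and rare in the others, can be illustrated with a small scoring sketch. This is not the paper's method (which relies on a hierarchy of discrete latent variables); it is a simplified stand-in that assumes a topic-word probability table is already available. The function name `distinctive_words`, the toy vocabulary, and the probabilities are all hypothetical.

```python
import math


def distinctive_words(topic_word, vocab, top_n=2, eps=1e-12):
    """Rank words per topic by how much more probable they are inside the
    topic than on average in the other topics (a log-ratio score).

    topic_word[k][w] is an assumed, precomputed estimate of p(word w | topic k).
    Requires at least two topics.
    """
    n_topics = len(topic_word)
    ranked = []
    for k in range(n_topics):
        scores = []
        for w, word in enumerate(vocab):
            p_in = topic_word[k][w]
            others = [topic_word[j][w] for j in range(n_topics) if j != k]
            p_out = sum(others) / len(others)
            # eps guards against log(0) for words absent from a topic
            score = math.log(p_in + eps) - math.log(p_out + eps)
            scores.append((score, word))
        scores.sort(key=lambda pair: pair[0], reverse=True)
        ranked.append([word for _, word in scores[:top_n]])
    return ranked


# Hypothetical toy data: "the" is frequent everywhere, so it is uninformative.
vocab = ["the", "goal", "match", "vote", "election"]
topic_word = [
    [0.50, 0.25, 0.20, 0.03, 0.02],  # a made-up "sports" topic
    [0.50, 0.02, 0.03, 0.25, 0.20],  # a made-up "politics" topic
]

# Plain frequency ranking puts "the" first in both topics.
top_by_frequency = [
    max(range(len(vocab)), key=lambda w: row[w]) for row in topic_word
]
top_distinctive = distinctive_words(topic_word, vocab)
print([vocab[w] for w in top_by_frequency])
print(top_distinctive)
```

In this toy example the most frequent word in both topics is "the", while the log-ratio score surfaces "goal" and "match" for the first topic and "election" and "vote" for the second.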
24 Mar 2014
TL;DR: Experimental results show that the nonparametric framework presented can automatically learn the appropriate model parameters from sensor data without any form of model selection procedure and can outperform traditional parametric approaches for human routine discovery tasks.
Abstract: People engage in routine behaviors. Automatic routine discovery goes beyond low-level activity recognition such as sitting or standing and analyzes human behaviors at a higher level (e.g., commuting to work). With recent developments in ubiquitous sensor technologies, it has become easier to acquire massive amounts of sensor data. One main line of research is to mine human routines from sensor data using parametric topic models such as latent Dirichlet allocation. The main shortcoming of parametric models is that they assume a fixed, pre-specified parameter regardless of the data. Choosing an appropriate parameter usually requires an inefficient trial-and-error model selection process. Furthermore, it is even more difficult to find optimal parameter values in advance for personalized applications. In this paper, we present a novel nonparametric framework for human routine discovery that can infer high-level routines without knowing the number of latent topics beforehand. Our approach is evaluated on public datasets in two routine domains: a 34-daily-activity dataset and a transportation mode dataset. Experimental results show that our nonparametric framework can automatically learn the appropriate model parameters from sensor data without any form of model selection procedure and can outperform traditional parametric approaches for human routine discovery tasks.
52 citations
04 Oct 2015
TL;DR: This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models, including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation, which exploits a certain tensor structure in their low-order observable moments (typically of second and third order).
Abstract: This note is a short version of that in [1]. It is intended as a survey for the 2015 Algorithmic Learning Theory (ALT) conference.
This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models, including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation, which exploits a certain tensor structure in their low-order observable moments (typically of second and third order). Specifically, parameter estimation is reduced to the problem of extracting a certain orthogonal decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches similar to the case of matrices. A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
51 citations
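The power-iteration idea mentioned in the abstract above can be sketched on a tiny example. The code below is a plain tensor power iteration with deflation, not the robust variant analyzed in the paper (which, among other things, has to cope with perturbed moment estimates). The tensor is built directly from hypothetical weights and orthonormal vectors rather than estimated from data, and all function names are illustrative.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_tensor(weights, vectors):
    """Return T = sum_i weights[i] * (v_i outer v_i outer v_i) as nested lists."""
    d = len(vectors[0])
    T = [[[0.0] * d for _ in range(d)] for _ in range(d)]
    for lam, v in zip(weights, vectors):
        for i in range(d):
            for j in range(d):
                for k in range(d):
                    T[i][j][k] += lam * v[i] * v[j] * v[k]
    return T


def contract(T, u):
    """Return the vector T(I, u, u), i.e. T contracted with u on two modes."""
    d = len(u)
    return [
        sum(T[i][j][k] * u[j] * u[k] for j in range(d) for k in range(d))
        for i in range(d)
    ]


def power_iteration(T, start, n_iter=50):
    """Repeat u <- T(I, u, u) / ||T(I, u, u)|| and return (eigenvalue, u)."""
    u = normalize(start)
    for _ in range(n_iter):
        u = normalize(contract(T, u))
    eigenvalue = sum(ui * ti for ui, ti in zip(u, contract(T, u)))  # T(u, u, u)
    return eigenvalue, u


def deflate(T, eigenvalue, u):
    """Subtract the recovered rank-one component from T."""
    d = len(u)
    return [
        [[T[i][j][k] - eigenvalue * u[i] * u[j] * u[k] for k in range(d)]
         for j in range(d)]
        for i in range(d)
    ]


# Hypothetical ground truth: two orthonormal components with weights 2 and 1.
weights = [2.0, 1.0]
components = [[0.6, 0.8], [-0.8, 0.6]]
T = build_tensor(weights, components)

start = [1.0, 0.2]  # arbitrary starting point, chosen for illustration
lam1, v1 = power_iteration(T, start)
lam2, v2 = power_iteration(deflate(T, lam1, v1), start)
print(lam1, v1)
print(lam2, v2)
```

With these inputs the first run converges to approximately weight 2 and vector (0.6, 0.8), and the run on the deflated tensor recovers approximately weight 1 and vector (-0.8, 0.6). In practice the starting point is usually drawn at random and several restarts are tried; a single fixed start is used here only to keep the sketch deterministic.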
TL;DR: A novel model for short text topic modeling, referred to as the Conditional Random Field regularized Topic Model (CRFTM), which not only develops a generalized solution to alleviate the sparsity problem by aggregating short texts into pseudo-documents, but also leverages a Conditional Random Field regularized model that encourages semantically related words to share the same topic assignment.
Abstract: Short texts have become the prevalent format of information on the Internet. Inferring the topics of this type of message has become a critical and challenging task for many applications. Due to the limited length of short texts, conventional topic models (e.g., latent Dirichlet allocation and its variants) suffer from a severe data sparsity problem, which makes topic modeling of short texts difficult and unreliable. Recently, word embeddings have proven effective at capturing semantic and syntactic information about words, which can be used to induce similarity measures and semantic correlations among words. Inspired by this, in this paper we design a novel model for short text topic modeling, referred to as the Conditional Random Field regularized Topic Model (CRFTM). CRFTM not only develops a generalized solution to alleviate the sparsity problem by aggregating short texts into pseudo-documents, but also leverages a Conditional Random Field regularized model that encourages semantically related words to share the same topic assignment. Experimental results on two real-world datasets show that our method can extract more coherent topics and significantly outperforms state-of-the-art baselines on several evaluation metrics.
51 citations