Patent

Collapsed Gibbs sampler for sparse topic models and discrete matrix factorization

TLDR
In this article, a topic model defining a set of topics is inferred by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution.
Abstract
In an inference system for organizing a corpus of objects, feature representations are generated comprising distributions over a set of features corresponding to the objects. A topic model defining a set of topics is inferred by performing latent Dirichlet allocation (LDA) with an Indian Buffet Process (IBP) compound Dirichlet prior probability distribution. The inference is performed using a collapsed Gibbs sampling algorithm by iteratively sampling (1) topic allocation variables of the LDA and (2) binary activation variables of the IBP compound Dirichlet prior. In some embodiments the inference is configured such that each inferred topic model is a clean topic model with topics defined as distributions over sub-sets of the set of features selected by the prior. In some embodiments the inference is configured such that the inferred topic model associates a focused sub-set of the set of topics to each object of the training corpus.
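
The alternation described in the abstract can be illustrated with a minimal sketch. It is not the patented algorithm: it assumes a finite number of topics K, a fixed activation probability `pi` in place of the IBP, and a binary mask only on the document side (the "focused" case), with theta_d ~ Dirichlet(alpha * b_d) integrated out. The names `init_state`, `gibbs_sweep`, `alpha`, `beta` and `pi` are hypothetical and chosen for illustration.

```python
import math
import random

def init_state(docs, K, V, rng):
    """Random topic allocations, matching count tables, all topics active."""
    z = [[rng.randrange(K) for _ in words] for words in docs]
    n_dk = [[0] * K for _ in docs]        # tokens of doc d assigned to topic k
    n_kw = [[0] * V for _ in range(K)]    # tokens of word w assigned to topic k
    n_k = [0] * K                         # tokens assigned to topic k overall
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    b = [[1] * K for _ in docs]           # binary activation variables
    return z, b, n_dk, n_kw, n_k

def gibbs_sweep(docs, z, b, n_dk, n_kw, n_k, alpha, beta, pi, V, rng):
    """One sweep: (1) topic allocations z, then (2) binary activations b."""
    K = len(n_k)
    for d, words in enumerate(docs):
        if not words:
            continue  # assumption: empty documents carry no information
        # (1) Resample each token's topic among the topics active in doc d.
        for i, w in enumerate(words):
            k_old = z[d][i]
            n_dk[d][k_old] -= 1
            n_kw[k_old][w] -= 1
            n_k[k_old] -= 1
            weights = [
                b[d][k] * (n_dk[d][k] + alpha)
                * (n_kw[k][w] + beta) / (n_k[k] + V * beta)
                for k in range(K)
            ]
            k_new = rng.choices(range(K), weights=weights)[0]
            z[d][i] = k_new
            n_dk[d][k_new] += 1
            n_kw[k_new][w] += 1
            n_k[k_new] += 1
        # (2) Resample activations of topics that doc d does not use.
        n_d = len(words)
        for k in range(K):
            if n_dk[d][k] > 0:
                b[d][k] = 1   # a topic holding tokens must stay active
                continue
            others = sum(b[d]) - b[d][k]   # >= 1 because the doc is non-empty
            a_off = alpha * others
            a_on = alpha * (others + 1)
            # Collapsed likelihood term Gamma(A) / Gamma(A + n_d), in logs.
            log_on = math.log(pi) + math.lgamma(a_on) - math.lgamma(a_on + n_d)
            log_off = (math.log(1 - pi) + math.lgamma(a_off)
                       - math.lgamma(a_off + n_d))
            p_on = 1.0 / (1.0 + math.exp(log_off - log_on))
            b[d][k] = 1 if rng.random() < p_on else 0

# Toy corpus of word ids (illustrative data only).
rng = random.Random(0)
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 2], [0, 1, 1, 0]]
K, V = 3, 5
state = init_state(docs, K, V, rng)
for _ in range(50):
    gibbs_sweep(docs, *state, alpha=0.5, beta=0.1, pi=0.3, V=V, rng=rng)
z, b, n_dk, n_kw, n_k = state
```

After the sweeps, each row of `b` marks the sub-set of topics a document is allowed to use, which is the sense in which the abstract speaks of a focused sub-set of topics per object. A mask on the topic-word side (the "clean topic" case) would follow the same pattern but is left out of this sketch.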


Citations
Proceedings ArticleDOI

Knowledge Discovery from Citation Networks

TL;DR: A Bernoulli Process Topic (BPT) model is proposed which models the corpus at two levels: document level and citation level, where each document has two different representations in the latent topic space associated with its roles.
Patent

Multi-tiered approach to e-mail prioritization

TL;DR: In this article, the authors propose an apparatus for automating the prioritization of an incoming message, including a batch learning module that generates a global classifier based on training data that is input to the batch learning module.
Patent

Method of automated discovery of topics relatedness

TL;DR: In this article, a computer system and method for automated discovery of topic relatedness are disclosed, where topics within documents from a corpus may be discovered by applying multiple topic identification (ID) models, such as multi-component latent Dirichlet allocation (MC-LDA) or similar methods.
Patent

Access and presentation of files based on semantic proximity to current interests

TL;DR: In this article, a computer program product for managing and rendering one or more information nodes relative to a current focus is presented. It relies on an analysis of the information nodes, a topic vector derived from the similarity of a first information node to each of the principal topics, and a map from the topic vector to a storage location of the first node.
Patent

Parallel Processing Of Data Sets

TL;DR: In this article, a data set may be partitioned into a plurality of data partitions that may be distributed to two or more processors, such as a graphics processing unit, to determine local counts associated with the partitions.
References
Journal ArticleDOI

Latent Dirichlet allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Proceedings Article

Probabilistic latent semantic analysis

TL;DR: This work proposes a widely applicable generalization of maximum likelihood model fitting by tempered EM, based on a mixture decomposition derived from a latent class model which results in a more principled approach which has a solid foundation in statistics.

Estimating a Dirichlet Distribution

Tom Minka
TL;DR: The Dirichlet distribution and its compound variant, the Dirichlet-multinomial, are two of the most basic models for proportional data, such as the mix of vocabulary words in a text document. The maximum likelihood estimate of these distributions is not available in closed form.
Proceedings Article

Infinite latent feature models and the Indian buffet process

TL;DR: A probability distribution over equivalence classes of binary matrices with a finite number of rows and an unbounded number of columns is defined, suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features.
Patent

Apparatus and method for retrieving and grouping images representing text files based on the relevance of key words extracted from a selected file to the text files

TL;DR: In this paper, a natural language recognition algorithm is used to determine the subject words of the selected file and then a statistical comparison between subject words and the contents of files in a database is performed.