Proceedings Article

Modeling Content and Users: Structured Probabilistic Representation and Scalable Inference Algorithms

Amr Ahmed
- pp 1
TLDR
This thesis addresses the problem of organizing online document collections: it provides algorithms that create a structured representation of otherwise unstructured content and introduces a non-parametric Bayes model called the recurrent Chinese restaurant process (RCRP).
Abstract
Online content has become an important medium for disseminating information and expressing opinions. With the proliferation of online document collections, users risk missing the big picture in a sea of irrelevant or diverse content. In this thesis, we address the problem of information organization of online document collections and provide algorithms that create a structured representation of the otherwise unstructured content. We leverage the expressiveness of latent probabilistic models (e.g., topic models) and non-parametric Bayes techniques (e.g., Dirichlet processes), and give online and distributed inference algorithms that scale to terabyte datasets and adapt the inferred representation as new documents arrive. Throughout the thesis, we consider two domains, research publications and social media (news articles and blog posts), and focus on modeling two facets of content: temporal dynamics and structural correspondence. To model the temporal dynamics of document collections, we introduce a non-parametric Bayes model that we call the recurrent Chinese restaurant process (RCRP). The RCRP is a framework for modeling complex longitudinal data in which the number of mixture components at each time point is unbounded. On top of this process, we develop a hierarchical extension and use it to build an infinite dynamic topic model that recovers the timeline of ideas in research publications. Despite its expressiveness, this model fails to capture the essential element of dynamics in social media: stories. To remedy this, we develop a multi-resolution model that treats stories as first-class objects and combines long-term, high-level topics with short-lived, tightly focused storylines. Inference in the new model is carried out via a sequential Monte Carlo algorithm that processes new documents in real time.
We then consider the problem of structural correspondence in document collections across both modalities and communities. In research publications, this problem arises from the multimodal nature of research papers and the pressing need for systems that can retrieve relevant documents based on any of these modalities (e.g., figures, text, and named entities). In social media, this problem arises from the ideological bias of a document's author, which mixes facts with opinions. For both problems we develop a series of factored models. In research publications, the developed model represents ideas across modalities and as such can solve the aforementioned retrieval problem. In social media, the model contrasts the same idea across different ideologies, and as such can explain the bias of a given document at the topical level and help the user stay informed by providing documents that express alternative views. Finally, we address the problem of inferring users' intent when they interact with document collections, and how this intent changes over time. The induced user model can then be used to match users with relevant content.
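
As a rough illustration of the recurrent Chinese restaurant process named in the abstract, the sketch below simulates table (mixture component) assignments across epochs. The abstract does not spell out the seating rule, so the code assumes a simple first-order form: within an epoch, a customer joins an existing table with weight equal to that table's count in the previous epoch plus its count so far in the current epoch, or opens a new table with weight alpha. The function name `simulate_rcrp` and its parameters are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def simulate_rcrp(epoch_sizes, alpha, seed=0):
    """Simulate table counts per epoch under an assumed first-order RCRP.

    epoch_sizes: number of customers (documents) arriving in each epoch.
    alpha: weight for opening a new table (assumed concentration parameter).
    Returns a list of dicts mapping table id -> count, one dict per epoch.
    """
    rng = random.Random(seed)
    prev = {}        # table counts from the previous epoch
    next_id = 0      # globally unique id for the next new table
    history = []
    for n in epoch_sizes:
        curr = Counter()  # table counts accumulated in the current epoch
        for _ in range(n):
            # Candidate tables: alive in the previous epoch or already
            # used in the current one.
            tables = sorted(set(prev) | set(curr))
            weights = [prev.get(k, 0) + curr.get(k, 0) for k in tables]
            # A brand-new table is always available with weight alpha,
            # so the number of components is unbounded.
            tables.append(next_id)
            weights.append(alpha)
            choice = rng.choices(tables, weights=weights, k=1)[0]
            if choice == next_id:
                next_id += 1
            curr[choice] += 1
        # Tables that received no customers in this epoch are dropped
        # and cannot be revived (first-order assumption).
        prev = dict(curr)
        history.append(prev)
    return history


if __name__ == "__main__":
    for t, counts in enumerate(simulate_rcrp([30, 30, 30], alpha=1.0)):
        print(t, counts)
```

Under these assumptions, popular tables tend to persist from one epoch to the next, unused tables die out, and new tables can appear at any time, which is one way to read the claim that the number of components at each time point is unbounded.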

Citations
Journal Article

Continuous-Time User Modeling in Presence of Badges: A Probabilistic Approach

TL;DR: This article proposes interdependent multi-dimensional temporal point processes that capture the impact of badges on user participation in addition to peer influence and content factors, together with an inference algorithm based on Variational Expectation-Maximization that can efficiently learn the model parameters.
Proceedings Article

Scalable dynamic nonparametric Bayesian models of content and users

TL;DR: This paper addresses the problem of information organization of online document collections, and provides algorithms that create a structured representation of the otherwise unstructured content, using the expressiveness of latent probabilistic models and non-parametric Bayes techniques.
Posted Content

Continuous-Time User Modeling in the Presence of Badges: A Probabilistic Approach

TL;DR: This paper proposes interdependent multi-dimensional temporal point processes that capture the impact of badges on user participation in addition to peer influence and content factors, together with an inference algorithm based on Variational-EM that can efficiently learn the model parameters.
References
Journal Article

Latent Dirichlet allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Journal Article

Inference of population structure using multilocus genotype data

TL;DR: This paper proposes a model-based clustering method that uses multilocus genotype data to infer population structure and assign individuals to populations; it can be applied to most of the commonly used genetic markers, provided that they are not closely linked.
Proceedings Article

Latent Dirichlet Allocation

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Book

Bayesian Data Analysis

TL;DR: Detailed notes are provided on Bayesian computation, the basics of Markov chain simulation, regression models, and asymptotic theorems.