Modeling Content and Users: Structured Probabilistic Representation and Scalable Inference Algorithms

doi:10.1145/2339530.2378373

Proceedings Article

Amr Ahmed

- pp 1

TLDR

This thesis addresses the problem of information organization of online document collections, provides algorithms that create a structured representation of the otherwise unstructured content, and introduces a non-parametric Bayes model called the recurrent Chinese restaurant process (RCRP).

Abstract:

Online content has become an important medium for disseminating information and expressing opinions. With the proliferation of online document collections, users risk missing the big picture in a sea of irrelevant and/or diverse content. In this thesis, we address the problem of information organization of online document collections, and provide algorithms that create a structured representation of the otherwise unstructured content. We leverage the expressiveness of latent probabilistic models (e.g., topic models) and non-parametric Bayes techniques (e.g., Dirichlet processes), and give online and distributed inference algorithms that scale to terabyte datasets and adapt the inferred representation with the arrival of new documents. Throughout the thesis, we consider two different domains, research publications and social media (news articles and blog posts), and focus on modeling two facets of content: temporal dynamics and structural correspondence. To model the temporal dynamics of document collections, we introduce a non-parametric Bayes model that we call the recurrent Chinese restaurant process (RCRP). RCRP is a framework for modeling complex longitudinal data, in which the number of mixture components at each time point is unbounded. On top of this process, we develop a hierarchical extension and use it to build an infinite dynamic topic model that recovers the timeline of ideas in research publications. Despite its expressiveness, this model fails to capture the essential element of dynamics in social media: stories. To remedy this, we develop a multi-resolution model that treats stories as first-class objects and combines long-term, high-level topics with short-lived, tightly focused storylines. Inference in the new model is carried out via a sequential Monte Carlo algorithm that processes new documents in real time.
We then consider the problem of structural correspondence in document collections, both across modalities and across communities. In research publications, this problem arises from the multimodal nature of research papers and the pressing need for systems that can retrieve relevant documents based on any of these modalities (e.g., figures, text, and named entities). In social media, this problem arises from the ideological bias of a document's author, which mixes facts with opinions. For both problems, we develop a series of factored models. In research publications, the developed model represents ideas across modalities and can therefore solve the aforementioned retrieval problem. In social media, the model contrasts the same idea across different ideologies, and can therefore explain the bias of a given document at the topical level and help the user stay informed by providing documents that express alternative views. Finally, we address the problem of inferring users' intent when they interact with document collections, and how this intent changes over time. The induced user model can then be used to match users with relevant content.
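
The abstract names the recurrent Chinese restaurant process but does not spell out its mechanics. As a rough illustration only, the sketch below simulates cluster assignments under a simplified RCRP-style prior, assuming a memory of exactly one epoch and no decay kernel: a document joins an existing cluster with weight equal to that cluster's size in the previous epoch plus its size so far in the current epoch, or opens a new cluster with weight `alpha`. The function name `rcrp_assign`, the parameter `alpha`, and the one-epoch window are assumptions made for this example; this is not the thesis's inference algorithm and it ignores the document likelihood entirely.

```python
import random
from collections import Counter


def rcrp_assign(docs_per_epoch, alpha=1.0, seed=0):
    """Simulate cluster ids under a simplified RCRP-style prior.

    Assumptions (not from the source): one-epoch memory, no decay,
    and no likelihood term, so this samples from the prior only.
    """
    rng = random.Random(seed)
    next_id = 0                 # id that a newly opened cluster will receive
    prev_counts = Counter()     # cluster sizes in the previous epoch
    history = []

    for n_docs in docs_per_epoch:
        cur_counts = Counter()  # cluster sizes so far in the current epoch
        assignments = []
        for _ in range(n_docs):
            # Candidate clusters: alive in the previous or current epoch.
            ids = sorted(set(prev_counts) | set(cur_counts))
            weights = [prev_counts[k] + cur_counts[k] for k in ids]
            # Opening a brand-new cluster always has weight alpha,
            # so the number of clusters per epoch is unbounded.
            ids.append(next_id)
            weights.append(alpha)

            k = rng.choices(ids, weights=weights, k=1)[0]
            if k == next_id:
                next_id += 1
            cur_counts[k] += 1
            assignments.append(k)

        history.append(assignments)
        # Clusters that received no documents this epoch are forgotten.
        prev_counts = cur_counts

    return history


if __name__ == "__main__":
    for epoch, ids in enumerate(rcrp_assign([8, 8, 8], alpha=1.0)):
        print(f"epoch {epoch}: {ids}")
```

Under these assumptions, popular clusters tend to persist across epochs, clusters that attract no documents in an epoch disappear, and new clusters can appear at any time, which is the qualitative behavior the abstract attributes to the RCRP.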

Citations

Scalable dynamic nonparametric Bayesian models of content and users

Continuous-Time User Modeling in the Presence of Badges: A Probabilistic Approach

Related Papers (5)

ParallelTopics: A probabilistic approach to exploring document collections

User-directed Non-Disruptive Topic Model Update for Effective Exploration of Dynamic Content

Predicting Interesting Things in Text

Topic based semantic clustering using Wikipedia knowledge

Extracting insights from social media with large-scale matrix approximations