Abstract: In this work, we address the twin problems of unsupervised topic discovery and estimation of topic-specific influence of blogs. We propose a new model that can be used to provide a user with highly influential blog postings on the topic of the user’s interest. We adopt the framework of an unsupervised model called Latent Dirichlet Allocation (Blei, Ng, & Jordan 2003), known for its effectiveness in topic discovery. An extension of this model, which we call Link-LDA (Erosheva, Fienberg, & Lafferty 2004), defines a generative model for hyperlinks and thereby models topic-specific influence of documents, the problem of our interest. However, this model does not exploit the topical relationship between the documents on either side of a hyperlink, i.e., the notion that documents tend to link to other documents on the same topic. We propose a new model, called Link-PLSA-LDA, that combines PLSA (Hofmann 1999) and LDA (Blei, Ng, & Jordan 2003) into a single framework, and explicitly models the topical relationship between the linking and the linked document. The output of the new model on blog data reveals interesting visualizations of topics and of the influential blogs on each topic. We also perform a quantitative evaluation of the model using the log-likelihood of unseen data and the task of link prediction. Both experiments show that the new model performs better, suggesting its superiority over Link-LDA in modeling topics and topic-specific influence of blogs.

Introduction

The proliferation of blogs in the recent past has posed several new and interesting challenges to researchers in the information retrieval and data mining communities. In particular, there is an increasing need for automatic techniques that help users quickly access blogs that are not only informative and popular, but also relevant to their topics of interest. Significant progress has been made toward this objective in the recent past. For example, Java et al. (Java et al.
2006) studied the performance of various algorithms, such as PageRank, HITS and in-degree, on modeling the influence of blogs. Kale et al. (Kale et al. 2006) exploited the polarity (agreement/disagreement) of hyperlinks and applied a trust propagation algorithm to model the propagation of influence between blogs.

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

The above-mentioned papers address modeling influence in general, but it is also important to model the influence of blogs with respect to the topic of the user’s interest. This problem has been addressed by Haveliwala (Haveliwala 2002) in the context of keyword search. In that work, the PageRanks of documents are pre-computed for a certain number of topics. At query time, for each document matching the query, its PageRanks for the various topics are combined based on the similarity of the query to each topic, to obtain a topic-sensitive PageRank. The author shows that the new PageRank performs better than the traditional PageRank on keyword search. The topics used in the algorithm are, however, obtained from an external repository. Ideally, it would be very useful to mine these topics automatically as well.

The problem of automatic topic mining from blogs has been addressed by Glance et al. (Natalie S. Glance & Tomokiyo 2006), who used a combination of NLP techniques, clustering and heuristics to mine topics and trends from blogs. However, this work does not address modeling the influence of blog postings with respect to the topics discovered. In our work, we aim to address both these problems simultaneously, i.e., topic discovery as well as modeling topic-specific influence of blogs, in a completely unsupervised fashion.
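
As a rough illustration of the query-time step in topic-sensitive PageRank described above, the following sketch combines a document's pre-computed per-topic PageRank values into a single score, weighting each topic by the similarity of the query to it. The normalized weighted sum, the function name and the toy values are assumptions made for illustration; they are not taken from (Haveliwala 2002).

```python
def topic_sensitive_score(topic_ranks, query_topic_sim):
    """Combine the per-topic PageRanks of one document into a single score.

    topic_ranks: dict mapping topic -> pre-computed PageRank of the document
                 for that topic (hypothetical input format).
    query_topic_sim: dict mapping topic -> non-negative similarity of the
                     query to that topic (hypothetical input format).
    """
    total = sum(query_topic_sim.values())
    if total <= 0:
        raise ValueError("the query has no similarity to any topic")
    # Assumed combination rule: similarity-weighted average over topics.
    return sum(
        (sim / total) * topic_ranks.get(topic, 0.0)
        for topic, sim in query_topic_sim.items()
    )


# Toy values, chosen only to show the arithmetic.
ranks = {"sports": 0.02, "politics": 0.10}
sims = {"sports": 1.0, "politics": 3.0}
score = topic_sensitive_score(ranks, sims)  # 0.25 * 0.02 + 0.75 * 0.10
```

Note that the topics here are fixed inputs, which mirrors the limitation pointed out above: they have to come from somewhere outside the ranking procedure.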
For joint topic discovery and influence modeling, we employ the probabilistic framework of latent topic models such as Latent Dirichlet Allocation (Blei, Ng, & Jordan 2003), and propose a new model in this framework.

The rest of the paper is organized as follows. We first discuss some of the past work on joint models of topics and influence in the framework of latent topic models. We then describe our new model and report the results of our experiments on blog data. We conclude with a few remarks on directions for future work.

Note that in the rest of the paper, we use the terms ‘citation’ and ‘hyperlink’ interchangeably. Likewise, ‘citing’ is synonymous with ‘linking’, and ‘cited’ with ‘linked’. The reader is also referred to Table 1 for notation used frequently in this paper.

Table 1: Frequently used notation

Symbol       Description
M            Total number of documents
M_←          Number of cited documents
M_→          Number of citing documents
V            Vocabulary size
K            Number of topics
N_←          Total number of words in the cited set
d            A citing document
d′           A cited document
∆(p)         A simplex of dimension (p − 1)
c(d, d′)     Citation from d to d′
Dir(·|α)     Dirichlet distribution with parameter α
Mult(·|β)    Multinomial distribution with parameter β
L_d          Number of hyperlinks in document d
N_d          Number of words in document d
β_kw         Probability of word w w.r.t. topic k
Ω_kd′        Probability of hyperlink to document d′ w.r.t. topic k
π_k          Probability of topic k in the cited document set
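
To make the Dirichlet and multinomial notation above concrete, the following is a minimal sketch of the standard LDA generative process for a single document: draw topic proportions from a Dirichlet distribution, then for each of the document's words draw a topic and, given that topic, a word from the topic's row of word probabilities. It covers plain LDA only, not the hyperlink components of Link-LDA or Link-PLSA-LDA. The function names, the toy parameter values and the use of Python's standard library are assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(.|alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(alpha, beta, n_words, rng):
    """Generate one document under plain LDA.

    alpha:   list of K Dirichlet parameters.
    beta:    K x V nested list; beta[k][w] is the probability of word w
             under topic k (each row sums to 1).
    n_words: number of words in the document.
    """
    num_topics = len(alpha)
    vocab_size = len(beta[0])
    theta = sample_dirichlet(alpha, rng)  # topic proportions of this document
    words = []
    for _ in range(n_words):
        k = rng.choices(range(num_topics), weights=theta)[0]   # topic ~ Mult(.|theta)
        w = rng.choices(range(vocab_size), weights=beta[k])[0]  # word ~ Mult(.|beta[k])
        words.append(w)
    return theta, words


# Toy setting (hypothetical values): K = 2 topics, V = 4 word types.
rng = random.Random(0)
alpha = [0.5, 0.5]
beta = [
    [0.7, 0.2, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
]
theta, words = generate_document(alpha, beta, n_words=10, rng=rng)
```

Here `theta` is a point on the simplex ∆(K), and `words` holds indices into the vocabulary.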