Probabilistic Question Recommendation for Question Answering Communities

Mingcheng Qu¹, Guang Qiu¹, Xiaofei He², Cheng Zhang³, Hao Wu¹, Jiajun Bu¹, and Chun Chen¹

¹,² College of Computer Science and Technology, Zhejiang University, China
³ China Disabled Persons' Federation Information Center

¹ {qumingcheng, qiuguang, haowu, bjj, chenc}@zju.edu.cn, ² xiaofeihe@cad.zju.edu.cn, ³ zhangcheng@cdpf.org.cn
ABSTRACT
User-Interactive Question Answering (QA) communities such as Yahoo! Answers are growing in popularity. However, as these QA sites always have thousands of new questions posted daily, it is difficult for users to find the questions that are of interest to them. Consequently, this may delay the answering of the new questions. This gives rise to question recommendation techniques that help users locate interesting questions. In this paper, we adopt the Probabilistic Latent Semantic Analysis (PLSA) model for question recommendation and propose a novel metric to evaluate the performance of our approach. The experimental results show our recommendation approach is effective.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: information
filtering
General Terms
Algorithms, Design, Experimentation
Keywords
Question Recommendation, Question Answering, PLSA
1. INTRODUCTION
Nowadays, the User-Interactive Question Answering (QA) community has become a popular medium for online information seeking and knowledge sharing. For example, Yahoo! Answers¹, one of the largest QA communities nowadays, has approximately 23 million resolved questions, which are posted and answered by users. In addition, thousands of new questions are posted daily. However, with the exponential growth in data volume, it is becoming more and more time-consuming for users to find the questions that are of interest to them. As a result, the asker would have to wait for a long time before getting answers to his/her question.

∗ Supported by the National Key Technology R&D Program of China (No. 2008BAH26B02)
¹ http://answers.yahoo.com

Copyright is held by the author/owner(s).
WWW 2009, April 20–24, 2009, Madrid, Spain.
ACM 978-1-60558-487-4/09/04.
To help users find interesting questions and expedite the answering of new questions, some question recommendation attempts can be seen in QA communities like Yahoo! Answers, such as maintaining in user home pages a question list automatically generated based on features like posting time and ratings.

However, these systems are not typical recommender systems in essence, in that they do not take users' interest into account. In our work, we employ PLSA [3] to analyze a user's interest by investigating his/her previously asked questions, and accordingly generate fine-grained question recommendations. Meanwhile, because traditional evaluation metrics cannot meet the special requirements of QA communities, we also propose a novel metric to evaluate the recommendation performance. Experimental results show the PLSA model works effectively for recommending questions.
2. PLSA FOR QUESTION RECOMMENDATION
Aiming to improve a QA community's efficiency, question recommendation is to recommend questions to users who are interested in, and capable of answering, them. Therefore, the key to a question recommender is to capture users' interest.

In our work, we propose to analyze a user's interest by investigating his/her previously asked questions. In a typical question answering cycle, users always answer questions by first identifying their topics in an implicit way. The PLSA model [2], known for its ability to capture underlying topics, suits our problem well. The latent variables in PLSA denote the topics of the corresponding questions. Therefore, given a question collection, the joint distribution of users and their answered questions can be formulated as follows:
Pr(u, q) = \sum_{z} Pr(u|z) Pr(q|z) Pr(z)    (1)

where u ∈ {u_1, u_2, ..., u_n} are users, q ∈ {q_1, q_2, ..., q_m} are questions, and z ∈ {z_1, z_2, ..., z_k} are k latent topic variables, each capturing one topic.
However, in a real QA community, each user can only answer a small percentage of the overall questions, which means that most observations (u, q) are zero. To deal with this sparsity, we use a user-word aspect model instead, where the co-occurrence data represent the event that users type words in a particular question:
Pr(u, w) = \sum_{z} Pr(u|z) Pr(w|z) Pr(z)    (2)
where w ∈ {w_1, w_2, ..., w_l} are the words that the questions contain.
Note that the PLSA model allows multiple topics per user, reflecting the fact that each user has a variety of interests.
The log likelihood L of the question collection is then

L = \sum_{u,w} c(u, w) \log Pr(u, w)    (3)

where c(u, w) is the sum of word w's counts over all the questions that user u answers.
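For concreteness, a minimal sketch of building the c(u, w) counts follows; the record format and the naive whitespace tokenizer are our own illustrative assumptions, not details from the paper (a real system would add stopword removal and stemming).

from collections import defaultdict

def build_counts(answer_records):
    """Build c(u, w): the total count of word w over all questions user u answers.

    `answer_records` is a hypothetical list of (user_id, question_text) pairs,
    one pair per answer a user posted; tokenization is a lowercase whitespace
    split, purely for illustration.
    """
    c = defaultdict(int)
    for user_id, question_text in answer_records:
        for word in question_text.lower().split():
            c[(user_id, word)] += 1  # accumulate across all answered questions
    return c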
The model parameters can be learned using Expectation Maximization (EM), which finds a local maximum of the log likelihood of the question collection:
Pr(z|u, w) = \frac{Pr(u|z) Pr(w|z) Pr(z)}{\sum_{z'} Pr(u|z') Pr(w|z') Pr(z')}    (4)

Pr(u|z) ∝ \sum_{w} c(u, w) Pr(z|u, w)    (5)

Pr(w|z) ∝ \sum_{u} c(u, w) Pr(z|u, w)    (6)

Pr(z) ∝ \sum_{u,w} c(u, w) Pr(z|u, w)    (7)
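A dense-matrix EM sketch of updates (4)-(7) follows, assuming the counts have been packed into a NumPy array; it is a toy illustration (a real QA corpus would require sparse iteration over the observed (u, w) pairs), and all names are ours.

import numpy as np

def train_plsa(c, k, n_iter=50, seed=0):
    """EM for the user-word aspect model, eqs. (4)-(7).

    c: (n_users, n_words) array with c[u, w] = count of word w for user u.
    Returns Pr(u|z), Pr(w|z), Pr(z) with shapes (n_users, k), (n_words, k), (k,).
    """
    rng = np.random.default_rng(seed)
    n_users, n_words = c.shape
    # Random normalized initialization: each column (fixed z) sums to 1.
    p_u_z = rng.random((n_users, k)); p_u_z /= p_u_z.sum(axis=0)  # Pr(u|z)
    p_w_z = rng.random((n_words, k)); p_w_z /= p_w_z.sum(axis=0)  # Pr(w|z)
    p_z = np.full(k, 1.0 / k)                                     # Pr(z)
    for _ in range(n_iter):
        # E-step, eq. (4): Pr(z|u, w) over an (n_users, n_words, k) tensor.
        joint = p_u_z[:, None, :] * p_w_z[None, :, :] * p_z
        post = joint / joint.sum(axis=2, keepdims=True)
        # M-step, eqs. (5)-(7): weight by counts, then renormalize.
        weighted = c[:, :, None] * post          # c(u, w) Pr(z|u, w)
        p_u_z = weighted.sum(axis=1); p_u_z /= p_u_z.sum(axis=0)
        p_w_z = weighted.sum(axis=0); p_w_z /= p_w_z.sum(axis=0)
        p_z = weighted.sum(axis=(0, 1)); p_z /= p_z.sum()
    return p_u_z, p_w_z, p_z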
We then model recommending questions to users via the posterior probability Pr(u|q), that is, according to how likely it is that user u will access the corresponding question q. By Bayes' rule, Pr(u|q) ∝ Pr(u, q), which we calculate as the product of the probabilities of the words q contains, normalized by the question length:
Pr(u, q) = \left( \prod_{i} Pr(u, w_i) \right)^{1/|q|}    (8)
where w_i are the words in the question q, and |q| is the length of q. Consequently, a ranking list of users will be maintained for the question q according to the score. The recommendation can be conducted by recommending q to the top-n users.
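As a sketch, eq. (8) can be evaluated in log space (a geometric mean of the per-word joints from eq. (2)), reusing the factors returned by the EM sketch above; the function and variable names are again illustrative, not from the paper.

def score_question_for_user(p_u_z, p_w_z, p_z, u, word_ids):
    """Eq. (8) for one user: geometric mean of Pr(u, w_i) over q's words."""
    # Pr(u, w) = sum_z Pr(u|z) Pr(w|z) Pr(z), eq. (2), one value per word.
    p_uw = (p_u_z[u] * p_w_z[word_ids] * p_z).sum(axis=1)
    # product^(1/|q|) computed in log space; epsilon guards against log(0).
    return float(np.exp(np.log(p_uw + 1e-12).mean()))

def recommend(p_u_z, p_w_z, p_z, word_ids, top_n=10):
    """Rank all users for a new question and return the top-n user indices."""
    scores = np.array([score_question_for_user(p_u_z, p_w_z, p_z, u, word_ids)
                       for u in range(p_u_z.shape[0])])
    return np.argsort(scores)[::-1][:top_n]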
3. EXPERIMENTS AND RESULTS
To obtain the data sets for our experiments, we crawled questions from three categories of Yahoo! Answers: Astronomy, Global Warming, and Philosophy, and filtered out all questions that have only one answer. Questions in each data set are already labeled with their best answers. The data set statistics are listed in Table 1. For each category, a PLSA model is trained on 85% of the question set (questions and their corresponding answers), and the rest is used for testing. We empirically choose the number of latent variables k = 100.
In traditional recommender systems, precision can be used to evaluate performance. However, the precision metric does not suit the QA context: users in a QA community can access only a small portion of all questions. While the questions a user has accessed are those he/she is interested in, there is no guarantee that the questions he/she has not accessed are those he/she does not like.
Table 1: Yahoo! Answers data set.
Category          Questions   Answers    Users
Astronomy             8,920    49,297   16,391
Global Warming        8,330    82,788   22,015
Philosophy            9,477    84,953   22,822

Table 2: Comparison of recommending methods.
Category          Cosine    PLSA
Astronomy          0.621   0.648
Global Warming     0.627   0.674
Philosophy         0.634   0.709

Here we propose a new metric for the evaluation of question recommendation. For each question in the testing data, we recommend it only to the users who actually answered it, instead of to all possible users in the whole data set. The accuracy for this question is then defined according to the rank of the user who provided the best answer. Since the choice of the best answer is subject to the asker's personal viewpoint, one may question whether the best answer is objectively the best of all, or just reflects the asker's prejudice. Adamic et al. [1] examined questions from different categories in Yahoo! Answers and concluded that answers selected as best answers are mostly indeed the best answers to their questions. Therefore, in this paper we use the best answerer's rank as the ground truth of our evaluation metric:
accuracy = \frac{|R| - R_B}{|R| - 1}    (9)

where |R| is the length of the recommendation list, which equals the number of answers to the question, and R_B is the rank of the best answerer.
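In code, the metric is a one-liner; this sketch assumes a 1-based rank for the best answerer and relies on the single-answer questions having been filtered out (so |R| ≥ 2 and the denominator is nonzero).

def accuracy(ranked_answerers, best_answerer):
    """Eq. (9): 1.0 if the best answerer is ranked first, 0.0 if ranked last."""
    R = len(ranked_answerers)                        # |R|: number of answerers
    R_B = ranked_answerers.index(best_answerer) + 1  # 1-based rank of best answerer
    return (R - R_B) / (R - 1)                       # safe: |R| >= 2 after filtering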
As there is no previous work on recommending questions to users according to their interest in QA communities, for comparison we implement the cosine similarity between user and question vectors, with tf.idf weights:
s(u, q) = \frac{\sum_{w} tf.idf(u, w) \cdot tf.idf(q, w)}{\sqrt{\sum_{w} tf.idf(u, w)^2} \cdot \sqrt{\sum_{w} tf.idf(q, w)^2}}    (10)
where tf.idf(q, w) is the word w’s tf.idf weight in q, and
tf.idf(u, w) is the sum of w’s tf.idf weights in questions that
u posts/answers.
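A sketch of this baseline, assuming the tf.idf vectors are represented as sparse dicts mapping words to weights (our representation, not the paper's):

import math

def cosine_score(u_vec, q_vec):
    """Eq. (10): cosine similarity between user and question tf.idf vectors."""
    # Only words present in both vectors contribute to the dot product.
    num = sum(u_vec[w] * q_vec[w] for w in u_vec.keys() & q_vec.keys())
    denom = (math.sqrt(sum(x * x for x in u_vec.values()))
             * math.sqrt(sum(x * x for x in q_vec.values())))
    return num / denom if denom else 0.0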
Table 2 shows the experimental results. We observe that our PLSA model outperforms the cosine similarity measure on all three data sets. This shows PLSA can capture users' interest and recommend questions effectively.
4. CONCLUSION
In this paper, we introduce the novel problem of question recommendation in Question Answering communities and adopt the PLSA model to tackle it. We also propose a novel evaluation metric to measure the performance. The results show that the PLSA model can improve the quality of recommendation. In conclusion, our study opens a promising direction for question recommendation.
5. REFERENCES
[1] L. A. Adamic, J. Zhang, E. Bakshy, and M. S. Ackerman. Knowledge sharing and Yahoo Answers: everyone knows something. In WWW '08.
[2] T. Hofmann. Probabilistic latent semantic indexing. In SIGIR '99.
[3] A. Popescul, L. H. Ungar, D. M. Pennock, and S. Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In UAI '01.