Proceedings ArticleDOI

Finding high-quality content in social media

11 Feb 2008, pp. 183-194
TL;DR: This paper introduces a general classification framework for combining the evidence from different sources of information that can be tuned automatically for a given social media type and quality definition, and shows that the resulting system is able to separate high-quality items from the rest with an accuracy close to that of humans.
Abstract: The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions -- social media sites -- becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high quality content. As a test case, we focus on Yahoo! Answers, a large community question/answering portal that is particularly rich in the amount and types of content and social interactions available in it. We introduce a general classification framework for combining the evidence from different sources of information, that can be tuned automatically for a given social media type and quality definition. In particular, for the community question/answering domain, we show that our system is able to separate high-quality items from the rest with an accuracy close to that of humans.

Summary (5 min read)

1. INTRODUCTION

  • Recent years have seen a transformation in the type of content available on the web.
  • Community-driven question/answering portals are a particular form of user-generated content that is gaining a large audience in recent years.

2.1 Yahoo! Answers

  • Yahoo! Answers is a question/answering system where people ask and answer questions on any topic.
  • What makes this system interesting is that around a seemingly trivial question/answer paradigm, users are forming a social network characterized by heterogeneous interactions.
  • As a matter of fact, users do not only limit their activity to asking and answering questions, but they also actively participate in regulating the whole system.
  • A user can vote for answers of other users, mark interesting questions, and even report abusive behavior.
  • Thus, overall, each user has a threefold role: asker, answerer and evaluator.

3. CONTENT QUALITY ANALYSIS IN SOCIAL MEDIA

  • The authors now focus on the task of finding high quality content, and describe their overall approach to solving this problem.
  • Evaluation of content quality is an essential module for performing more advanced information-retrieval tasks on the question/answering system.
  • In particular, the authors model the intrinsic content quality (Section 3.1), the interactions between content creators and users (Section 3.2), as well as the content usage statistics (Section 3.3).
  • All these feature types are used as an input to a classifier that can be tuned for the quality definition for the particular media type (Section 3.4).
  • In the next section, the authors will expand and refine the feature set specifically to match their main application domain of community question/answering portals.

3.1 Intrinsic content quality

  • The intrinsic quality metrics (i.e., the quality of the content of each item) that the authors use in this research are mostly text-related, given that the social media items they evaluate are primarily textual in nature.
  • Several of their features capture the visual quality of the text, attempting to model these irregularities; among these are features measuring punctuation, capitalization, and spacing density (percent of all characters), as well as features measuring the character-level entropy of the text (see the sketch after this list).
  • Additional features used to represent grammatical properties of the text are its formality score [16], and the distance between its language model and several given language models, such as the Wikipedia language model or the language model of the Yahoo! Answers corpus itself.
  • To identify out-of-vocabulary words, the authors construct multiple lists of the k most frequent words in Yahoo! Answers, with several k values ranging between 50 and 5000.
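
A minimal sketch of such surface-level features (the character-class definitions and feature names here are illustrative assumptions; the paper does not spell out its exact implementation):

```python
import math
import string

def visual_quality_features(text: str) -> dict:
    """Punctuation/capitalization/spacing density and character-level entropy."""
    n = max(len(text), 1)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # Entropy of the character unigram distribution: very repetitive or very
    # noisy text moves this away from the range typical of well-formed prose.
    char_entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "punct_density": sum(ch in string.punctuation for ch in text) / n,
        "caps_density": sum(ch.isupper() for ch in text) / n,
        "space_density": sum(ch.isspace() for ch in text) / n,
        "char_entropy": char_entropy,
    }

print(visual_quality_features("WHY does this keep happening??!!"))
```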

3.2 User relationships

  • A significant amount of quality information can be inferred from the relationships between users and items.
  • The authors could apply link-analysis algorithms for propagating quality scores among the entities of the question/answer system, e.g., using the intuition that “good” answerers write “good” answers and vote for other “good” answerers.
  • These relationships are represented as edges in a graph, with content items and users as nodes.
  • The resulting user-user graph is extremely rich and heterogeneous, and is unlike traditional graphs studied in the web link analysis setting.
  • Hence, for each type of link the authors performed a separate computation of each link-analysis algorithm.
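
A minimal sketch of this per-link-type computation, assuming the networkx library and an illustrative set of interaction types (the paper's graphs are built from Yahoo! Answers interactions such as “answers,” “votes for,” and “gives best answer”):

```python
import networkx as nx

# Hypothetical typed edges: (source_user, target_user, interaction_type).
interactions = [
    ("u1", "u2", "answers"),
    ("u3", "u2", "answers"),
    ("u1", "u3", "votes_for"),
    ("u2", "u1", "votes_for"),
]

link_scores = {}
for link_type in {t for _, _, t in interactions}:
    # One subgraph per interaction type, with a separate run of each algorithm.
    g = nx.DiGraph((u, v) for u, v, t in interactions if t == link_type)
    hubs, authorities = nx.hits(g)
    link_scores[link_type] = {
        "pagerank": nx.pagerank(g),
        "hubs": hubs,
        "authorities": authorities,
    }
```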

3.3 Usage statistics

  • Readers of the content (who may or may not also be contributors) provide valuable information about the items they find interesting.
  • In particular, usage statistics such as the number of clicks on the item and dwell time have been shown useful in the context of identifying high quality web search results, and are complementary to link-analysis based methods.
  • Intuitively, usage statistics measures are useful for social media content, but require different interpretation from the previously studied settings.
  • All items within a popular category such as celebrity images or popular culture topics may receive orders of magnitude more clicks than, for instance, science topics.
  • The specific usage statistics that the authors use are described in Section 4.3.

3.4 Overall classification framework

  • The authors cast the problem of quality ranking as a binary classification problem, in which a system must learn automatically to separate high-quality content from the rest.
  • The best performance was obtained with stochastic gradient boosted trees: a sequence of (typically simple) decision trees is constructed so that each tree minimizes the error on the residuals of the preceding sequence of trees; a stochastic element is added by randomly sampling the data repeatedly before each tree construction, to prevent overfitting (see the sketch after this list).
  • Given a set of human-labeled quality judgments, the classifier is trained on all available features, combining evidence from semantic, user relationship, and content usage sources.
  • The judgments are tuned for the particular goal.
  • In the case of community question/answering, described next, their goal is to discover interesting, well-formulated, and factually accurate content.
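
A minimal sketch of this setup using scikit-learn's gradient boosting (an assumption; the paper does not name its implementation), where subsample < 1.0 supplies the stochastic resampling described above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: each row concatenates content, user-relationship, and
# usage features for one item; y holds binary human quality judgments.
rng = np.random.default_rng(0)
X = rng.random((1000, 40))
y = rng.integers(0, 2, 1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=200,   # length of the tree sequence
    max_depth=3,        # typically simple trees
    subsample=0.8,      # random sampling before each tree: the stochastic element
)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```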

4. MODELING CONTENT QUALITY IN COMMUNITY QUESTION/ANSWERING

  • The authors' goal is to automatically assess the quality of questions and answers provided by users of the system.
  • The authors believe that this particular sub-problem of quality evaluation is an essential module for performing more advanced information-retrieval tasks on the question/answering or web search system.
  • A quality score can be used as a feature for ranking search results in this system.
  • Yahoo! Answers is question-centric: the interactions of users are organized around questions. The main forms of interaction among the users are (i) asking a question, (ii) answering a question, (iii) selecting a best answer, and (iv) voting on an answer.
  • These relationships are explicitly modeled in the relational features described next.

4.1 Application-specific user relationships

  • The relationships between questions, users asking and answering questions, and answers can be captured by a tripartite graph outlined in Figure 2, where an edge represents an explicit relationship between the different node types.
  • To streamline the process of exploring new features, the authors suggest naming the features with respect to their position in this tree.
  • The types of features on the question subtree are: Q (features from the question being answered), QU (features from the asker of that question), and QA (features from the other answers to the same question).
  • The authors represent user relationships around a question similarly to representing relationships around an answer.
  • The authors also denote by p′x the vector of PageRank scores in the transposed graph.

4.2 Content features for QA

  • As the base content-quality features for the question and answer texts individually, the authors directly use the semantic features from Section 3.1.
  • The authors rely on feature selection methods and the classifier to identify the most salient features for the specific tasks of question or answer quality classification.
  • Intuitively, a copy of a Wall Street Journal article about economy may have good quality, but would not be a good answer to a question about celebrity fashion.
  • Hence, the authors explicitly model the relationship between the question and the answer.
  • To represent this, the authors include the KL-divergence between the language models of the two texts, their non-stopword overlap, the ratio between their lengths, and other similar features (see the sketch after this list).
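
A minimal sketch of these question-answer relatedness features (the smoothing scheme, stopword list, and tokenization below are illustrative assumptions):

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "for"}  # toy list

def unigram_lm(tokens, vocab, alpha=0.01):
    # Add-alpha smoothed unigram language model over a shared vocabulary.
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def qa_relatedness(question: str, answer: str) -> dict:
    q, a = question.lower().split(), answer.lower().split()
    vocab = set(q) | set(a)
    pq, pa = unigram_lm(q, vocab), unigram_lm(a, vocab)
    q_content, a_content = set(q) - STOPWORDS, set(a) - STOPWORDS
    return {
        # KL(question || answer): grows when the answer's vocabulary
        # diverges from the question's topic.
        "kl_divergence": sum(pq[w] * math.log(pq[w] / pa[w]) for w in vocab),
        "nonstop_overlap": len(q_content & a_content)
                           / max(len(q_content | a_content), 1),
        "length_ratio": len(a) / max(len(q), 1),
    }

print(qa_relatedness("how do I fix a flat tire", "remove the wheel and patch the tube"))
```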

4.3 Usage features for QA

  • Recall that community QA is question-centric: a question thread is usually viewed as a whole, and the content usage statistics are available primarily for the complete question thread.
  • In addition, the authors exploit the rich set of metadata available for each question.
  • This includes temporal statistics, e.g., how long ago the question was posted, which allows them to give a better interpretation to the number of views of a question.
  • One of the features is the click frequency normalized per category: the expected click frequency for the question's category is subtracted, and the result is divided by the standard deviation of click frequency for that category (see the sketch after this list).
  • As the authors will show in the empirical evaluation presented in the next sections, both the generally applicable, and the domain specific features turn out to be significant for quality identification.
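
A minimal sketch of that per-category normalization (a standard z-score; variable names are illustrative):

```python
import statistics

def normalized_click_frequency(item_clicks: float, category_clicks: list) -> float:
    # Subtract the category's expected (mean) click frequency and divide by
    # its standard deviation, making scores comparable across popular and
    # niche categories.
    mu = statistics.mean(category_clicks)
    sigma = statistics.stdev(category_clicks)
    return (item_clicks - mu) / sigma

# 500 clicks is exceptional for a science question even if celebrity items
# routinely receive thousands.
print(normalized_click_frequency(500, [120, 80, 200, 150, 90]))
```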

5.1 Dataset

  • The authors' dataset consists of 6,665 questions and 8,366 question/answer pairs.
  • The base usage features (page views or clicks) were obtained from the total number of times a question thread was clicked (e.g., in response to a search result).
  • Starting from the questions and answers included in the evaluation dataset the authors considered related questions and answers as follows.
  • Figure 5 depicts the process of finding related items.
  • The relative size of the portion the authors used (depicted with thick lines) is exaggerated for illustration purposes: actually the data they use is a tiny fraction of the whole collection.

5.2 Dataset statistics

  • The degree distributions of the user interaction graphs described earlier are very skewed.
  • The cumulative distribution of the number of answers, best answers, and votes given and received is shown in Figure 6.
  • Note that in all cases the authors execute HITS and PageRank on a subgraph of the graph induced by the whole dataset, so the results might differ from those one would obtain by executing those algorithms on the whole graph.
  • The distributions of answers given and received are very similar to each other, in contrast to [12] where there were clearly “askers” and “answerers” with different types of behaviors.
  • This observation is an important consideration for feature design.

5.3 Evaluation metrics and methodology

  • Recall that the authors want to automatically separate high-quality content from the rest.
  • The authors also report the area under the ROC curve for the classifiers, as a non-parametric single estimator of their accuracy (see the sketch after this list).
  • For their classification task, the authors used the 6,665 questions and 8,366 question/answer pairs of their base dataset, i.e., the sets Q0 and A0.
  • The classification tasks are performed using their in-house classification software.
  • The sets Q1, U1, A1, and A2 are used only for extracting the additional user-relationship features for the sets Q0 and A0.
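
A minimal sketch of this evaluation, assuming scikit-learn and a trained classifier like the one sketched in Section 3.4 (variable contents are illustrative):

```python
from sklearn.metrics import roc_auc_score

# y_true: binary human quality labels for held-out items;
# y_score: classifier scores, e.g. clf.predict_proba(X_te)[:, 1].
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.5]

# AUC is the probability that a randomly chosen high-quality item is scored
# above a randomly chosen low-quality one: a single non-parametric
# estimator of accuracy.
print("ROC AUC:", roc_auc_score(y_true, y_score))
```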

6. EXPERIMENTAL RESULTS

  • In this section the authors show the results for answer and question content quality.
  • Recall that as a baseline the authors use only textual features for the current item (answer/question) at the level ∅ of the trees introduced in Section 4.1.
  • In the experiments reported here, 80% of their data was used as a training set and the rest for testing.

6.1 Question quality

  • Table 2 shows the classification performance of the question classifier, using different subsets of their feature set.
  • The KL-divergence between the question's language model and a model estimated from a collection of questions answered by the Yahoo editorial team (available at http://ask.yahoo.com) (level ∅).
  • The total number of answers of the asker that have been selected as the “best answer”.
  • Features derived from page-view statistics, as described in Section 3.3.
  • Because of the effectiveness of the relational and usage features in independently identifying high-quality content, the authors hypothesized that a variant of co-training or co-boosting [10], or a Maximum Entropy classifier [5], would be effective for expanding the training set in a partially supervised setting.

6.2 Answer quality

  • Table 4 shows the classification performance of the answer classifier, again examining different subsets of their feature set.
  • The 20 most significant features for answer quality, according to a chi-squared test, included the answer length (a feature of the answer itself, level ∅).
  • The average number of abuse reports received by the answerer over his/her answers (level UAV).
  • The number of “thumbs up” minus “thumbs down” received by the answerer.
  • The entropy of the unigram character-level model of the answer.

7. CONCLUSIONS

  • The authors presented a general classification framework for quality estimation in social media.
  • As part of their work the authors developed a comprehensive graph-based model of contributor relationships and combined it with content- and usage-based features.
  • The authors have successfully applied their framework to identifying high quality items in a web-scale community question answering portal, resulting in a high level of accuracy on the question and answer quality classification task.
  • The authors investigated the contributions of the different sources of quality evidence, and have shown that some of the sources are complementary, i.e., they capture the same high-quality content from different perspectives.
  • The combination of several types of sources of information is likely to increase the classifier's robustness to spam, as an adversary is required not only to create content that deceives the classifier, but also to simulate realistic user relationships or usage statistics.




Finding High-Quality Content in Social Media

Eugene Agichtein, Emory University, Atlanta, USA (eugene@mathcs.emory.edu)
Carlos Castillo, Yahoo! Research, Barcelona, Spain (chato@yahoo-inc.com)
Debora Donato, Yahoo! Research, Barcelona, Spain (debora@yahoo-inc.com)
Aristides Gionis, Yahoo! Research, Barcelona, Spain (gionis@yahoo-inc.com)
Gilad Mishne, Search and Advertising Sciences, Yahoo! (gilad@yahoo-inc.com)
ABSTRACT
The quality of user-generated content varies drastically from excellent to abuse and spam. As the availability of such content increases, the task of identifying high-quality content in sites based on user contributions—social media sites—becomes increasingly important. Social media in general exhibit a rich variety of information sources: in addition to the content itself, there is a wide array of non-content information available, such as links between items and explicit quality ratings from members of the community. In this paper we investigate methods for exploiting such community feedback to automatically identify high quality content. As a test case, we focus on Yahoo! Answers, a large community question/answering portal that is particularly rich in the amount and types of content and social interactions available in it. We introduce a general classification framework for combining the evidence from different sources of information, that can be tuned automatically for a given social media type and quality definition. In particular, for the community question/answering domain, we show that our system is able to separate high-quality items from the rest with an accuracy close to that of humans.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing: indexing methods, linguistic processing; H.3.3 Information Search and Retrieval: information filtering, search process.
General Terms
Algorithms, Design, Experimentation.
Keywords
Social media, Community Question Answering, User Interactions.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WSDM'08, February 11-12, 2008, Palo Alto, California, USA.
Copyright 2008 ACM 978-1-59593-927-9/08/0002 ...$5.00.
1. INTRODUCTION
Recent years have seen a transformation in the type of content available on the web. During the first decade of the web's prominence—from the early 1990s onwards—most online content resembled traditional published material: the majority of web users were consumers of content, created by a relatively small number of publishers. From the early 2000s, user-generated content has become increasingly popular on the web: more and more users participate in content creation, rather than just consumption. Popular user-generated content (or social media) domains include blogs and web forums, social bookmarking sites, photo and video sharing communities, as well as social networking platforms such as Facebook and MySpace, which offer a combination of all of these with an emphasis on the relationships among the users of the community.
Community-driven question/answering portals are a particular form of user-generated content that is gaining a large audience in recent years. These portals, in which users answer questions posed by other users, provide an alternative channel for obtaining information on the web: rather than browsing results of search engines, users present detailed information needs—and get direct responses authored by humans. In some markets, this information-seeking behavior is coming to dominate traditional web search [29].
An important difference between user-generated content and traditional content that is particularly significant for knowledge-based media such as question/answering portals is the variance in the quality of the content. As Anderson [3] describes, in traditional publishing—mediated by a publisher—the typical range of quality is substantially narrower than in niche, unmediated markets. The main challenge posed by content in social media sites is the fact that the distribution of quality has high variance: from very high-quality items to low-quality, sometimes abusive content. This makes the tasks of filtering and ranking in such systems more complex than in other domains. However, for information-retrieval tasks, social media systems present inherent advantages over traditional collections of documents: their rich structure offers more available data than in other domains. In addition to document content and link structure, social media exhibit a wide variety of user-to-document relation types, and user-to-user interactions.
In this paper we address the task of identifying high-quality content in community-driven question/answering sites, exploring the benefits of having additional sources of information in this domain. As a test case, we focus on Yahoo! Answers, a large portal that is particularly rich in the amount and types of content and social interaction available in it. We focus on the following research questions:

1. What are the elements of social media that can be used to facilitate automated discovery of high-quality content? In addition to the content itself, there is a wide array of non-content information available, from links between items to explicit and implicit quality ratings from members of the community. What is the utility of each source of information to the task of estimating quality?

2. How are these different factors related? Is content alone enough for identifying high-quality items?

3. Can community feedback approximate judgments of specialists?

To our knowledge, this is the first large-scale study of combining the analysis of the content with the user feedback in social media. In particular, we model all user interactions in a principled graph-based framework (Section 3 and Section 4), allowing us to effectively combine the different sources of evidence in a classification formulation. Furthermore, we investigate the utility of the different sources of feedback in a large-scale, experimental setting (Section 5) over the market-leading question/answering portal. Our experimental results show that these sources of evidence are complementary, and allow our system to exhibit high accuracy in the task of identifying content of high quality (Section 6). We discuss our findings and directions for future work in Section 7, which concludes this paper.
2. BACKGROUND AND RELATED WORK

Social media content has become indispensable to millions of users. In particular, community question/answering portals are a popular destination for users looking for help with a particular situation, for entertainment, and for community interaction. Hence, in this paper we focus on one particularly important manifestation of social media: community question/answering sites, specifically Yahoo! Answers. Our work draws on a significant amount of prior research on social media, and we outline the related work before introducing our framework in Section 3.
2.1 Yahoo! Answers
Yahoo! Answers (http://answers.yahoo.com/) is a question/answering system where people ask and answer questions on any topic. What makes this system interesting is that around a seemingly trivial question/answer paradigm, users are forming a social network characterized by heterogeneous interactions. As a matter of fact, users do not only limit their activity to asking and answering questions, but they also actively participate in regulating the whole system. A user can vote for answers of other users, mark interesting questions, and even report abusive behavior. Thus, overall, each user has a threefold role: asker, answerer, and evaluator.

The central element of the Yahoo! Answers system is questions. Each question has a lifecycle. It starts in an “open” state, where it receives answers. Then at some point (decided by the asker, or by an automatic timeout in the system), the question is considered “closed,” and can receive no further answers. At this stage, a “best answer” is selected either by the asker or through a voting procedure from other users; once a best answer is chosen, the question is “resolved.”

As previously noted, the system is partially moderated by the community: any user may report another user's question or answer as violating the community guidelines (e.g., containing spam, adult-oriented content, copyrighted material, etc.). A user can also award a question a “star,” marking it as an interesting question, can sometimes vote for the best answer to a question, and can give any answer a “thumbs up” or “thumbs down” rating, corresponding to a positive or negative vote respectively.

Yahoo! Answers is a very popular service (according to some reports, it reached a market share of close to 100% about a year after its launch [27]); as a result, it hosts a very large amount of questions and answers on a wide variety of topics, making it a particularly useful domain for examining content quality in social media. Similar existing and past services (some with a different model) include Amazon's Askville (http://askville.amazon.com/), Google Answers (http://answers.google.com/), and Yedda (http://yedda.com/).
2.2 Related work
Link analysis in social media. Link-based methods have been shown to be successful for several tasks in social media [30]. In particular, link-based ranking algorithms that were successful in estimating the quality of web pages have been applied in this context. Two of the most prominent link-based ranking algorithms are PageRank [25] and HITS [22].

Consider a graph G = (V, E) with vertex set V corresponding to the users of a question/answer system, and having a directed edge e = (u, v) ∈ E from a user u ∈ V to a user v ∈ V if user u has answered at least one question of user v. ExpertiseRank [32] corresponds to PageRank over the transposed graph G′ = (V, E′); that is, a score is propagated from the person receiving the answer to the person giving the answer. The recursion implies that if person u was able to provide an answer to person v, and person v was able to provide an answer to person w, then u should receive some extra points, given that he/she was able to provide an answer to a person with a certain degree of expertise.

The HITS algorithm was applied over the same graph [8, 19] and was shown to produce good results in finding experts and/or good answers. The mutual reinforcement process in this case can be interpreted as “good questions attract good answers” and “good answers are given to good questions”; we examine this assumption in Section 5.2.
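
A minimal sketch of both computations on a toy answer graph, assuming the networkx library (not part of the original paper):

```python
import networkx as nx

# Edge (u, v): user u answered at least one question asked by user v.
g = nx.DiGraph([("ann", "bob"), ("bob", "carl"), ("ann", "carl")])

# ExpertiseRank [32]: PageRank over the transposed graph G' = (V, E'),
# so score flows from the asker to the answerer; ann gets extra credit
# for answering bob, who could himself answer carl.
expertise = nx.pagerank(g.reverse())

# The HITS variant applied in [8, 19] runs over the original graph.
hubs, authorities = nx.hits(g)
print(sorted(expertise, key=expertise.get, reverse=True))
```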
Propagating reputation. Guha et al. [14] study the problem of propagating trust and distrust among Epinions (http://epinions.com/) users, who may assign positive (trust) and negative (distrust) ratings to each other. The authors study ways of combining trust and distrust and observe that, while considering trust as a transitive property makes sense, distrust cannot be considered transitive.

Ziegler and Lausen [33] also study models for propagation of trust. They present a taxonomy of trust metrics and discuss ways of incorporating information about distrust into the rating scores.
Question/answering portals and forums. The particular context of question/answering communities we focus on in this paper has been the object of some study in recent years. According to Su et al. [31], the quality of answers in question/answering portals is good on average, but the quality of specific answers varies significantly. In particular, in a study of the answers to a set of questions in Yahoo! Answers, the authors found that the fraction of correct answers to specific questions asked by the authors of the study varied from 17% to 45%. The fraction of questions in their sample with at least one good answer was much higher, varying from 65% to 90%, meaning that a method for finding high-quality answers can have a significant impact on the user's satisfaction with the system.

Jeon et al. [17] extracted a set of features from a sample of answers in Naver (http://naver.com/), a Korean question/answering portal similar to Yahoo! Answers. They built a model for answer quality based on features derived from the particular answer being analyzed, such as answer length, number of points received, etc., as well as user features, such as fraction of best answers, number of answers given, etc. Our work expands on this by exploring a substantially larger range of features, including structural, textual, and community features, and by identifying the quality of questions in addition to answer quality.
Expert nding.
Zhang et al. [32] analyze data from an on-
line forum, seeking to identify users with high expertise.
They study the user answers graph in which there is a link
between users u and v if u answers a question by v, ap-
plying both ExpertiseRank and HITS to identify users with
high expertise. Their results show high correlation between
link-based metrics and the answer quality. The authors also
develop synthetic models that capture some of the charac-
teristics of the interactions among users in their dataset.
Jurczyk and Agichtein [20] show an application of the
HITS algorithm [22] to a question/answering portal. The
HITS algorithm is run on the user-answer graph. The re-
sults demonstrate that HITS is a promising approach, as the
obtained authority score is b etter correlated with the num-
ber of votes that the items receive, than simply counting the
number of answers the answerer has given in the past.
Campbell et al. [8] computed the authority score of HITS
over the user-user graph in a network of e-mail exchanges,
showing that it is more correlated to quality than other sim-
pler metrics. Dom et al. [11] studied the performance of
several link-based algorithms to rank people by expertise on
a network of e-mail exchanges, testing on both real and syn-
thetic data, and showing that in real data ExpertiseRank
outperforms HITS.
Text analysis for content quality. Most work on estimating the quality of text has been in the field of Automated Essay Scoring (AES), where writings of students are graded by machines on several aspects, including compositionality, style, accuracy, and soundness. AES systems are typically built as text classification tools, and use a range of properties derived from the text as features. Some of the features employed in such systems are lexical, such as word length, measures of vocabulary irregularity via repetitiveness [7] or uncharacteristic co-occurrence [9], and measures of topicality through word and phrase frequencies [28]. Other features take into account usage of punctuation and detection of common grammatical errors (such as subject-verb disagreements) via predefined templates [4, 24]. Most platforms are commercial and do not disclose full details of their internal feature set; overall, AES systems have been shown to correlate very well with human judgments [6, 24].

A different area of study involving text quality is readability; here, the difficulty of text is analyzed to determine the minimal age group able to comprehend it. Several measures of text readability have been proposed, including the Gunning-Fog Index [15], the Flesch-Kincaid Formula [21], and SMOG Grading [23]. All measures combine the number of syllables or words in the text with the number of sentences—the first being a crude approximation of the syntactic complexity and the second of the semantic complexity. Although simplistic and controversial, these methods are widely used and provide a rough estimation of the difficulty of text.
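
For reference, the standard published forms of these three measures, which the paper cites but does not restate, are:

```latex
\text{Gunning-Fog} = 0.4\left(\frac{\text{words}}{\text{sentences}}
    + 100\,\frac{\text{complex words}}{\text{words}}\right)

\text{Flesch-Kincaid grade} = 0.39\,\frac{\text{words}}{\text{sentences}}
    + 11.8\,\frac{\text{syllables}}{\text{words}} - 15.59

\text{SMOG} = 1.0430\sqrt{30\,\frac{\text{polysyllables}}{\text{sentences}}} + 3.1291
```

Each combines a sentence-length term (the crude proxy for syntactic complexity) with a word-difficulty term (the proxy for semantic complexity), matching the description above.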
Implicit feedback for ranking. Implicit feedback from millions of web users has been shown to be a valuable source of result quality and ranking information. In particular, clicks on results, and methods for interpreting the clicks, have been studied in references [1, 18, 2]. We apply the results on click interpretation of web search results from these studies as a source of quality information in social media. As we will show, content usage statistics are valuable, but require different interpretation from the web search domain.
3. CONTENT QUALITY ANALYSIS IN SOCIAL MEDIA

We now focus on the task of finding high quality content, and describe our overall approach to solving this problem. Evaluation of content quality is an essential module for performing more advanced information-retrieval tasks on the question/answering system. For instance, a quality score can be used as input to ranking algorithms. On a high level, our approach is to exploit features of social media that are intuitively correlated with quality, and then train a classifier to appropriately select and weight the features for each specific type of item, task, and quality definition.

In this section we identify a set of features of social media and interactions that can be applied to the task of content-quality identification. In particular, we model the intrinsic content quality (Section 3.1), the interactions between content creators and users (Section 3.2), as well as the content usage statistics (Section 3.3). All these feature types are used as input to a classifier that can be tuned for the quality definition for the particular media type (Section 3.4). In the next section, we will expand and refine the feature set specifically to match our main application domain of community question/answering portals.
3.1 Intrinsic content quality

The intrinsic quality metrics (i.e., the quality of the content of each item) that we use in this research are mostly text-related, given that the social media items we evaluate are primarily textual in nature. For user-generated content of other types (e.g., photos or bookmarks), intrinsic quality may be modeled differently.

As a baseline, we use textual features only—with all word n-grams up to length 5 that appear in the collection more than 3 times used as features. This straightforward approach is the de-facto standard for text classification tasks, both for classifying the topic and for other facets (e.g., sentiment classification [26]).

Additionally, we use a large number of semantic features, organized as follows:

Punctuation and typos. Poor quality text, and particularly of the type found in online sources, is often marked with low conformance to common writing practices. For example, capitalization rules may be ignored; excessive punctuation—particularly repeated ellipsis and question marks—may be used; or spacing may be irregular. Several of our features capture the visual quality of the text, attempting to model these irregularities; among these are features measuring punctuation, capitalization, and spacing density (percent of all characters), as well as features measuring the character-level entropy of the text. A particular form of low visual quality is misspellings and typos; additional features in our set quantify the number of spelling mistakes, as well as the number of out-of-vocabulary words (see the note at the end of this section).
Syntactic and semantic complexity. Advancing from the punctuation level to more involved layers of the text, other features in this subset quantify its syntactic and semantic complexity. These include simple proxies for complexity, such as the average number of syllables per word or the entropy of word lengths, as well as more intricate ones, such as the readability measures [15, 21, 23] mentioned in Section 2.2.
Grammaticality. Finally, to measure the grammatical quality of the text, we use several linguistically-oriented features. We annotate the content with part-of-speech (POS) tags, and use the tag n-grams (again, up to length 5) as features. This allows us to capture, to some degree, the level of “correctness” of the grammar used.

Some part-of-speech sequences are typical of correctly-formed questions: e.g., the sequence “when|how|why to (verb)” (as in “how to identify...”) is typical of lower-quality questions, whereas the sequence “when|how|why (verb) (personal pronoun) (verb)” (as in “how do I remove...”) is more typical of correctly-formed content.

Additional features used to represent grammatical properties of the text are its formality score [16], and the distance between its (trigram) language model and several given language models, such as the Wikipedia language model or the language model of the Yahoo! Answers corpus itself (the distance is measured with KL-divergence).

Note on out-of-vocabulary words: to identify out-of-vocabulary words, we construct multiple lists of the k most frequent words in Yahoo! Answers, with several k values ranging between 50 and 5000. These lists are then used to calculate a set of “out-of-vocabulary” features, where each feature assumes the list of top-k words for some k is the vocabulary. An example feature created this way is “the fraction of words in an answer that do not appear in the top-1000 words of the collection.”
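
A minimal sketch of this feature family (list construction from corpus counts is elided; names and placeholder vocabularies are illustrative):

```python
def oov_fraction(tokens, top_k_words):
    """Fraction of tokens not found in the given top-k vocabulary list."""
    if not tokens:
        return 0.0
    return sum(w not in top_k_words for w in tokens) / len(tokens)

# One vocabulary list per k between 50 and 5000, built from collection
# frequencies (tiny placeholder sets here).
top_words_by_k = {50: {"the", "a", "is"}, 1000: {"the", "a", "is", "answer"}}

answer_tokens = "the answer is flabbergastingly wrong".split()
features = {f"oov_frac_top{k}": oov_fraction(answer_tokens, vocab)
            for k, vocab in top_words_by_k.items()}
print(features)
```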
3.2 User relationships

A significant amount of quality information can be inferred from the relationships between users and items. For example, we could apply link-analysis algorithms for propagating quality scores among the entities of the question/answer system, e.g., using the intuition that “good” answerers write “good” answers, or vote for other “good” answerers. The main challenge we have to face is that our dataset, viewed as a graph, often contains nodes of multiple types (e.g., questions, answers, users), and edges represent a set of interactions among the nodes having different semantics (e.g., “answers,” “gives best answer,” “votes for,” “gives a star to”).

These relationships are represented as edges in a graph, with content items and users as nodes. The edges are typed, i.e., labeled with the particular type of interaction (e.g., “User u answers question q”). Besides the user-item relationship graph, we also consider the user-user graph. This is the graph G = (V, E) in which the set of vertices V is composed of the set of users, and the set E represents implicit relationships between users. For example, a user-user relationship could be “User u has answered a question from user v.”

The resulting user-user graph is extremely rich and heterogeneous, and is unlike traditional graphs studied in the web link analysis setting. However, we believe that (in our classification framework) traditional link analysis algorithms may provide useful evidence for quality classification, tuned for the particular domain. Hence, for each type of link we performed a separate computation of each link-analysis algorithm. We computed the hubs and authorities scores (as in the HITS algorithm [22]), and the PageRank scores [25]. In Section 4 we discuss the specific relationships and node types developed for community question/answering.
3.3 Usage statistics

Readers of the content (who may or may not also be contributors) provide valuable information about the items they find interesting. In particular, usage statistics such as the number of clicks on the item and dwell time have been shown useful in the context of identifying high quality web search results, and are complementary to link-analysis based methods. Intuitively, usage statistics measures are useful for social media content, but require different interpretation from the previously studied settings.

For example, all items within a popular category such as celebrity images or popular culture topics may receive orders of magnitude more clicks than, for instance, science topics. Nevertheless, when normalized by the item category, the deviation from the expected number of clicks can be used to infer quality directly, or can be incorporated into the classification framework. The specific usage statistics that we use are described in Section 4.3.
3.4 Overall classification framework

We cast the problem of quality ranking as a binary classification problem, in which a system must learn automatically to separate high-quality content from the rest.

We experimented with several classification algorithms, including those reported to achieve good performance on text classification tasks, such as support vector machines and log-linear classifiers; the best performance among the techniques we tested was obtained with stochastic gradient boosted trees [13]. In this classification framework, a sequence of (typically simple) decision trees is constructed so that each tree minimizes the error on the residuals of the preceding sequence of trees; a stochastic element is added by randomly sampling the data repeatedly before each tree construction, to prevent overfitting. A particularly useful aspect of boosted trees for our settings is their ability to utilize combinations of sparse and dense features.

Given a set of human-labeled quality judgments, the classifier is trained on all available features, combining evidence from semantic, user relationship, and content usage sources. The judgments are tuned for the particular goal. For example, we could use this framework to classify questions by genre or asker expertise. In the case of community question/answers, described next, our goal is to discover interesting, well formulated and factually accurate content.
4. MODELING CONTENT QUALITY IN COMMUNITY QUESTION/ANSWERING

Our goal is to automatically assess the quality of questions and answers provided by users of the system. We believe that this particular sub-problem of quality evaluation is an essential module for performing more advanced information-retrieval tasks on the question/answering or web search system. For example, a quality score can be used as a feature for ranking search results in this system.

Note that Yahoo! Answers is question-centric: the interactions of users are organized around questions. The main forms of interaction among the users are (i) asking a question, (ii) answering a question, (iii) selecting a best answer, and (iv) voting on an answer. These relationships are explicitly modeled in the relational features described next.
4.1 Application-specific user relationships

Our dataset, viewed as a graph, contains multiple types of nodes and multiple types of interactions, as illustrated in Figure 1.

Figure 1: Partial entity-relationship diagram of answers.

The relationships between questions, users asking and answering questions, and answers can be captured by a tripartite graph outlined in Figure 2, where an edge represents an explicit relationship between the different node types. Since a user is not allowed to answer his/her own questions, there are no triangles in the graph, so in fact all cycles in the graph have length at least 6.

Figure 2: Interaction of users-questions-answers modeled as a tripartite graph.

We use multi-relational features to describe multiple classes of objects and multiple types of relationships between these objects. In this section, we expand on the general user relationship ideas of the previous section to develop specific relational features that exploit the unique characteristics of the community question/answering domain.

Answer features. In Figure 3, we show the user relationship data that is available for a particular answer. The types of the data related to a particular answer form a tree, in which the type “Answer” is the root. So, an answer a ∈ A is at the 0-th level of the tree; the question q that a answers, and the user u who posted a, are in the first level of the tree, and so on.

To streamline the process of exploring new features, we suggest naming the features with respect to their position in this tree. Each feature corresponds to a data type, which resides in a specific node in the tree, and thus it is characterized by the path from the root of the tree to that node.

Figure 3: Types of features available for inferring the quality of an answer.

Hence, each specific feature can be represented by a path in the tree (following the direction of the edges). For instance, a feature of the type QU represents the information about a question (Q) and the user (U) who asked that question. In Figure 3, we can see two subtrees starting from the answer being evaluated: one related to the question being answered, and the other related to the user contributing the answer.

The types of features on the question subtree are:
Q Features from the question being answered
QU Features from the asker of the question being answered
QA Features from the other answers to the same question
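
As an illustration of this naming scheme, one might flatten raw values into path-prefixed feature names like so (a hypothetical sketch; the values, the helper, and the user-side path U are assumptions based on the description above):

```python
# Each key is the path from the root ("" = the answer itself) to the node
# the raw value comes from, per the tree in Figure 3.
raw_features = {
    "":   {"answer_length": 312},            # the answer (level 0)
    "Q":  {"question_length": 45},           # the question being answered
    "QU": {"asker_num_questions": 17},       # the asker of that question
    "QA": {"num_other_answers": 6},          # other answers to the question
    "U":  {"answerer_best_answers": 42},     # the user giving the answer
}

flat = {(f"{path}.{name}" if path else name): value
        for path, group in raw_features.items()
        for name, value in group.items()}
print(flat)  # e.g. {'answer_length': 312, 'Q.question_length': 45, ...}
```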

Citations
Proceedings ArticleDOI
28 Mar 2011
TL;DR: There are measurable differences in the way messages propagate that can be used to classify them automatically as credible or not credible, with precision and recall in the range of 70% to 80%.
Abstract: We analyze the information credibility of news propagated through Twitter, a popular microblogging service. Previous research has shown that most of the messages posted on Twitter are truthful, but the service is also used to spread misinformation and false rumors, often unintentionally. In this paper we focus on automatic methods for assessing the credibility of a given set of tweets. Specifically, we analyze microblog postings related to "trending" topics, and classify them as credible or not credible, based on features extracted from them. We use features from users' posting and re-posting ("re-tweeting") behavior, from the text of the posts, and from citations to external sources. We evaluate our methods using a significant number of human assessments about the credibility of items on a recent sample of Twitter postings. Our results show that there are measurable differences in the way messages propagate that can be used to classify them automatically as credible or not credible, with precision and recall in the range of 70% to 80%.

2,123 citations


Cites background from "Finding high-quality content in social media"

  • ...Many of the features follow previous works including [1, 2, 12, 26]....

Journal ArticleDOI
TL;DR: A portion of the findings on students' perceptions of learning with mobile computing devices and the roles social media played are presented.
Abstract: The purpose of this research was to explore teaching and learning when mobile computing devices, such as cellphones and smartphones, were implemented in higher education. This paper presents a portion of the findings on students' perceptions of learning with mobile computing devices and the roles social media played. This qualitative research study focused on students from three universities across the US. The students' teachers had been integrating mobile computing devices, such as cellphones and smartphones, into their courses for at least two semesters. Data were collected through student focus group interviews. Two specific themes emerged from the interview data: (a) advantages of mobile computing devices for student learning and (b) frustrations from learning with mobile computing devices. Mobile computing devices and the use of social media created opportunities for interaction, provided opportunities for collaboration, as well as allowed students to engage in content creation and communication using social media and Web 2.0 tools with the assistance of constant connectivity.

1,196 citations

Book
16 Feb 2009
TL;DR: This text provides the background and tools needed to evaluate, compare, and modify search engines; numerous programming exercises make extensive use of Galago, a Java-based open source search engine.
Abstract: KEY BENEFIT: Written by a leader in the field of information retrieval, this text provides the background and tools needed to evaluate, compare and modify search engines. KEY TOPICS: Coverage of the underlying IR and mathematical models reinforce key concepts. Numerous programming exercises make extensive use of Galago, a Java-based open source search engine. MARKET: A valuable tool for search engine and information retrieval professionals.

1,050 citations

Proceedings ArticleDOI
21 Apr 2008
TL;DR: This paper analyzes YA's forum categories and cluster them according to content characteristics and patterns of interaction among the users, finding that lower entropy correlates with receiving higher answer ratings, but only for categories where factual expertise is primarily sought after.
Abstract: Yahoo Answers (YA) is a large and diverse question-answer forum, acting not only as a medium for sharing technical knowledge, but as a place where one can seek advice, gather opinions, and satisfy one's curiosity about a countless number of things. In this paper, we seek to understand YA's knowledge sharing and activity. We analyze the forum categories and cluster them according to content characteristics and patterns of interaction among the users. While interactions in some categories resemble expertise sharing forums, others incorporate discussion, everyday advice, and support. With such a diversity of categories in which one can participate, we find that some users focus narrowly on specific topics, while others participate across categories. This not only allows us to map related categories, but to characterize the entropy of the users' interests. We find that lower entropy correlates with receiving higher answer ratings, but only for categories where factual expertise is primarily sought after. We combine both user attributes and answer characteristics to predict, within a given category, whether a particular answer will be chosen as the best answer by the asker.

799 citations


Cites background from "Finding high-quality content in social media"

  • ...A complementary, and concurrent, study of question and answer quality was performed by Agichtein et al. [1]....

Journal ArticleDOI
TL;DR: In this paper, the authors explored the relationship between social media marketing activities and consumers' behavior towards a brand and found that SMMEs have a significant positive effect on brand equity and on the two main dimensions of brand awareness and brand image.

698 citations


Cites background from "Finding high-quality content in social media"

  • ...Entertainment is the result of the fun and play emerging from the social media experience (Agichtein et al., 2008)....

References
Proceedings Article
11 Nov 1999
TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.
Abstract: The importance of a Web page is an inherently subjective matter, which depends on the reader's interests, knowledge, and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.

14,400 citations


"Finding high-quality content in soc..." refers methods in this paper

  • ...Two of the most prominent link-based ranking algorithms are PageRank [25] and HITS [22]....

  • ...We computed the hubs and authorities scores (as in HITS algorithm [22]), and the PageRank scores [25]....

  • ...{a, b, v, s, +, -}, we computed the hubs and authorities scores (as in HITS algorithm [22]), and the PageRank scores [25]....

  • ...Note that in all cases we execute HITS and PageRank on a subgraph of the graph induced by the whole dataset, so the results might be different than the results that one would obtain if executing those algorithms on the whole graph....

Journal ArticleDOI
Jon Kleinberg
TL;DR: This work proposes and tests an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure; the formulation has connections to the eigenvectors of certain matrices associated with the link graph.
Abstract: The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.

8,328 citations

Proceedings ArticleDOI
06 Jul 2002
TL;DR: This work considers the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative, and concludes by examining factors that make the sentiment classification problem more challenging.
Abstract: We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. However, the three machine learning methods we employed (Naive Bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. We conclude by examining factors that make the sentiment classification problem more challenging.

6,626 citations

Book
30 Dec 1991
TL;DR: Covers Networks and Relations; The Development of Social Network Analysis; Handling Relational Data; Lines, Direction and Density; Centrality and Centralization; Components, Cores, and Cliques; Positions, Roles and Clusters; Dimensions and Displays; and an appendix on social network packages.
Abstract: Networks and Relations; The Development of Social Network Analysis; Handling Relational Data; Lines, Direction and Density; Centrality and Centralization; Components, Cores, and Cliques; Positions, Roles, and Clusters; Dimensions and Displays; Appendix: Social Network Packages.

5,638 citations


"Finding high-quality content in soc..." refers methods in this paper

  • ...Link-based methods have been shown to be successful for several tasks in social media [30]....

    [...]