Finding high-quality content in social media
Summary
1. INTRODUCTION
- Recent years have seen a transformation in the type of content available on the web.
- Community-driven question/answering portals are a particular form of user-generated content that has been gaining a large audience.
2.1 Yahoo! Answers
- Yahoo! Answers is a question/answering system where people ask and answer questions on any topic.
- What makes this system interesting is that around a seemingly trivial question/answer paradigm, users are forming a social network characterized by heterogeneous interactions.
- As a matter of fact, users do not only limit their activity to asking and answering questions, but they also actively participate in regulating the whole system.
- A user can vote for answers of other users, mark interesting questions, and even report abusive behavior.
- Thus, overall, each user has a threefold role: asker, answerer and evaluator.
3. CONTENT QUALITY ANALYSIS IN SOCIAL MEDIA
- The authors now focus on the task of finding high quality content, and describe their overall approach to solving this problem.
- Evaluation of content quality is an essential module for performing more advanced information-retrieval tasks on the question/answering system.
- In particular, the authors model the intrinsic content quality (Section 3.1), the interactions between content creators and users (Section 3.2), as well as the content usage statistics (Section 3.3).
- All these feature types are used as an input to a classifier that can be tuned for the quality definition for the particular media type (Section 3.4).
- In the next section, the authors will expand and refine the feature set specifically to match their main application domain of community question/answering portals.
3.1 Intrinsic content quality
- The intrinsic quality metrics (i.e., the quality of the content of each item) that the authors use in this research are mostly text-related, given that the social media items they evaluate are primarily textual in nature.
- Several of their features capture the visual quality of the text, attempting to model surface irregularities; among these are features measuring punctuation, capitalization, and spacing density (percent of all characters), as well as features measuring the character-level entropy of the text.
- Additional features used to represent grammatical properties of the text are its formality score [16], and the distance between its language model and several reference language models, such as the Wikipedia language model or the Yahoo! Answers language model.
- To identify out-of-vocabulary words, the authors construct multiple lists of the k most frequent words in Yahoo! Answers, with several values of k ranging between 50 and 5000.
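The surface-level features above can be computed directly from raw text. The sketch below is illustrative (the feature names and exact definitions are assumptions, not the paper's): it derives punctuation, capitalization, and spacing density, plus character-level entropy.

```python
import math
from collections import Counter

def text_quality_features(text):
    """Illustrative intrinsic quality features: punctuation, capitalization,
    and spacing density (percent of all characters), plus the entropy of the
    character distribution of the text."""
    n = max(len(text), 1)
    punct = sum(ch in ".,;:!?'\"()-" for ch in text) / n
    caps = sum(ch.isupper() for ch in text) / n
    space = sum(ch.isspace() for ch in text) / n
    counts = Counter(text)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"punct_density": punct, "caps_density": caps,
            "space_density": space, "char_entropy": entropy}

# A shouty, over-punctuated answer scores high on these irregularity signals
feats = text_quality_features("OMG!!! BEST ANSWER EVER!!!")
```

Such densities are cheap to compute and, as the authors note, are fed to the classifier alongside the other feature families rather than used as hard rules.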
3.2 User relationships
- A significant amount of quality information can be inferred from the relationships between users and items.
- The authors apply link-analysis algorithms for propagating quality scores among the entities of the question/answer system, following the intuition that "good" answerers write "good" answers and vote for other "good" answerers.
- These relationships are represented as edges in a graph, with content items and users as nodes.
- The resulting user-user graph is extremely rich and heterogeneous, and is unlike traditional graphs studied in the web link analysis setting.
- Hence, the authors ran each link-analysis algorithm separately for each type of link.
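The per-link-type computation can be sketched with a minimal power-iteration PageRank restricted to a single edge type. The "votes for" graph, damping factor, and user names below are all hypothetical; the paper's actual graphs and parameters are not reproduced here.

```python
def pagerank(edges, nodes, d=0.85, iters=50):
    """Minimal power-iteration PageRank over one edge type, mirroring the
    authors' scheme of running link analysis separately per link type."""
    out = {u: [] for u in nodes}
    for u, v in edges:
        out[u].append(v)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            if out[u]:
                share = d * pr[u] / len(out[u])
                for v in out[u]:
                    nxt[v] += share
            else:  # dangling node: spread its mass uniformly
                for v in nodes:
                    nxt[v] += d * pr[u] / n
        pr = nxt
    return pr

# Hypothetical "votes for" edges among three users
votes = [("alice", "bob"), ("carol", "bob"), ("bob", "alice")]
scores = pagerank(votes, ["alice", "bob", "carol"])
```

Running the same routine once per edge type ("answers", "votes", "best answer", etc.) yields one score vector per relationship, each usable as a separate classifier feature.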
3.3 Usage statistics
- Readers of the content (who may or may not also be contributors) provide valuable information about the items they find interesting.
- In particular, usage statistics such as the number of clicks on the item and dwell time have been shown useful in the context of identifying high quality web search results, and are complementary to link-analysis based methods.
- Intuitively, usage statistics measures are useful for social media content, but require different interpretation from the previously studied settings.
- All items within a popular category such as celebrity images or popular culture topics may receive orders of magnitude more clicks than, for instance, science topics.
- The specific usage statistics that the authors use are described in Section 4.3.
3.4 Overall classification framework
- The authors cast the problem of quality ranking as a binary classification problem, in which a system must learn automatically to separate high-quality content from the rest.
- A sequence of (typically simple) decision trees is constructed so that each tree minimizes the error on the residuals of the preceding sequence of trees; a stochastic element is added by randomly sampling the data repeatedly before each tree construction, to prevent overfitting.
- Given a set of human-labeled quality judgments, the classifier is trained on all available features, combining evidence from semantic, user relationship, and content usage sources.
- The judgments are tuned for the particular goal.
- In the case of community question/answers, described next, their goal is to discover interesting, well formulated and factually accurate content.
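The residual-fitting scheme with random subsampling described above can be sketched with hand-rolled regression stumps on a toy one-dimensional feature. Everything here is illustrative (the authors used their own in-house classifier, not this code), but the mechanics match: each stump fits the residuals of the ensemble so far, and a fresh random subsample before each round supplies the stochastic element.

```python
import random

def fit_stump(xs, ys):
    """Best single-split regression stump minimizing squared error."""
    best = None
    for thr in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def stochastic_boost(xs, ys, rounds=20, lr=0.5, sample=0.8, seed=0):
    """Each stump fits the residuals of the preceding ensemble; the data is
    randomly subsampled before each tree is built."""
    rng = random.Random(seed)
    trees, pred = [], [0.0] * len(xs)
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        idx = rng.sample(range(len(xs)), max(2, int(sample * len(xs))))
        t = fit_stump([xs[i] for i in idx], [resid[i] for i in idx])
        trees.append(t)
        pred = [p + lr * t(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * t(x) for t in trees)

# Toy data: items with feature value > 0.5 are "high quality" (label 1.0)
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
model = stochastic_boost(xs, ys)
```

In practice the same idea is available off the shelf, e.g. scikit-learn's `GradientBoostingClassifier` with `subsample < 1.0`.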
4. MODELING CONTENT QUALITY IN COMMUNITY QUESTION/ANSWERING
- The authors' goal is to automatically assess the quality of questions and answers provided by users of the system.
- The authors believe that this particular sub-problem of quality evaluation is an essential module for performing more advanced information-retrieval tasks on the question/answering or web search system.
- A quality score can be used as a feature for ranking search results in this system.
- Yahoo! Answers is question-centric: the interactions of users are organized around questions, and the main forms of interaction are (i) asking a question, (ii) answering a question, (iii) selecting the best answer, and (iv) voting on an answer.
- These relationships are explicitly modeled in the relational features described next.
4.1 Application-specific user relationships
- The relationships between questions, users asking and answering questions, and answers can be captured by a tripartite graph outlined in Figure 2, where an edge represents an explicit relationship between the different node types.
- To streamline the process of exploring new features, the authors suggest naming the features with respect to their position in this tree.
- The types of features on the question subtree are:
  - Q: features from the question being answered
  - QU: features from the asker of the question being answered
  - QA: features from the other answers to the same question
- The authors represent user relationships around a question similarly to representing relationships around an answer.
- The authors also denote by p′x the vector of PageRank scores in the transposed graph.
4.2 Content features for QA
- As the base content quality features for both questions and answer text individually the authors use directly the semantic features from Section 3.1.
- The authors rely on feature selection methods and the classifier to identify the most salient features for the specific tasks of question or answer quality classification.
- Intuitively, a copy of a Wall Street Journal article about the economy may have good quality, but would not be a good answer to a question about celebrity fashion.
- Hence, the authors explicitly model the relationship between the question and the answer.
- To represent this the authors include the KL-divergence between the language models of the two texts, their non-stopword overlap, the ratio between their lengths, and other similar features.
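A rough sketch of such question-answer relatedness features follows. The add-alpha smoothing scheme and the tiny stopword list are assumptions for illustration, not the paper's exact choices.

```python
import math
from collections import Counter

def qa_relatedness(question, answer, alpha=0.5):
    """Illustrative relatedness features: smoothed unigram KL-divergence,
    non-stopword overlap, and the length ratio of answer to question."""
    stop = {"the", "a", "is", "to", "of", "and", "in", "what", "how"}
    q, a = question.lower().split(), answer.lower().split()
    vocab = set(q) | set(a)
    qc, ac = Counter(q), Counter(a)
    def prob(counts, total, w):  # add-alpha smoothing over joint vocabulary
        return (counts[w] + alpha) / (total + alpha * len(vocab))
    kl = sum(prob(qc, len(q), w) * math.log(prob(qc, len(q), w)
             / prob(ac, len(a), w)) for w in vocab)
    overlap = len((set(q) - stop) & (set(a) - stop)) / max(len(set(q) - stop), 1)
    return {"kl": kl, "overlap": overlap,
            "len_ratio": len(a) / max(len(q), 1)}

good = qa_relatedness("what is the capital of france",
                      "the capital of france is paris")
bad = qa_relatedness("what is the capital of france",
                     "buy cheap watches online now")
```

An on-topic answer yields low KL-divergence and high overlap, while an unrelated (even well-written) answer scores poorly on both, which is exactly the distinction the relationship features are meant to capture.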
4.3 Usage features for QA
- Recall that community QA is question-centric: a question thread is usually viewed as a whole, and the content usage statistics are available primarily for the complete question thread.
- In addition, the authors exploit the rich set of metadata available for each question.
- This includes temporal statistics, e.g., how long ago the question was posted, which allows us to give a better interpretation to the number of views of a question.
- One of the features is computed as the click frequency normalized by subtracting the expected click frequency for that category, divided by the standard deviation of click frequency for the category.
- As the authors will show in the empirical evaluation presented in the next sections, both the generally applicable, and the domain specific features turn out to be significant for quality identification.
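The category normalization described above amounts to a z-score over a category's click counts. All numbers below are made up for illustration; the point is that a modest absolute click count in a quiet category can be just as notable as a huge one in a popular category.

```python
from statistics import mean, pstdev

def normalized_clicks(clicks, category_clicks):
    """Category-normalized click frequency: subtract the category's expected
    click count and divide by its standard deviation, so items in quiet
    categories are not dwarfed by items in popular ones."""
    mu, sigma = mean(category_clicks), pstdev(category_clicks)
    return (clicks - mu) / sigma if sigma else 0.0

celebrity = [900, 1100, 1000, 950, 1050]  # hypothetical per-item clicks
science = [9, 11, 10, 8, 12]

# 1,100 clicks in a hot category vs. 12 clicks in a quiet one
z_celeb = normalized_clicks(1100, celebrity)
z_sci = normalized_clicks(12, science)
```

Here both items land at the same normalized score even though their raw click counts differ by two orders of magnitude.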
5.1 Dataset
- The authors' dataset consists of 6,665 questions and 8,366 question/answer pairs.
- The base usage features (page views or clicks) were obtained from the total number of times a question thread was clicked (e.g., in response to a search result).
- Starting from the questions and answers included in the evaluation dataset the authors considered related questions and answers as follows.
- Figure 5 depicts the process of finding related items.
- The relative size of the portion the authors used (depicted with thick lines) is exaggerated for illustration purposes: actually the data they use is a tiny fraction of the whole collection.
5.2 Dataset statistics
- The degree distributions of the user interaction graphs described earlier are very skewed.
- The cumulative distribution of the number of answers, best answers, and votes given and received is shown in Figure 6.
- Note that in all cases the authors execute HITS and PageRank on a subgraph of the graph induced by the whole dataset, so the results might be different than the results that one would obtain if executing those algorithms on the whole graph.
- The distributions of answers given and received are very similar to each other, in contrast to [12] where there were clearly “askers” and “answerers” with different types of behaviors.
- This observation is an important consideration for feature design.
5.3 Evaluation metrics and methodology
- Recall that the authors want to automatically separate high-quality content from the rest.
- The authors also report the area under the ROC curve for the classifiers, as a non-parametric single estimator of their accuracy.
- For their classification task the authors used the 6,665 questions and 8,366 question/answer pairs of their base dataset, i.e., on the sets Q0 and A0.
- The classification tasks are performed using their in-house classification software.
- The sets Q1, U1, A1, and A2 are used only for extracting the additional user-relationship features for the sets Q0 and A0.
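The area under the ROC curve used as the evaluation metric can be computed via the rank-sum (Mann-Whitney) identity: the probability that a randomly chosen positive item is scored above a randomly chosen negative one. The labels and scores below are hypothetical.

```python
def auc(labels, scores):
    """AUC via the rank-sum identity: fraction of positive/negative pairs
    where the positive is scored higher (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]          # 1 = high quality
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]  # classifier confidences
value = auc(labels, scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the authors use it as a single non-parametric estimator of classifier accuracy.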
6. EXPERIMENTAL RESULTS
- In this section the authors show the results for answer and question content quality.
- Recall that as a baseline the authors use only textual features for the current item (answer/question) at the level ∅ of the trees introduced in Section 4.1.
- In the experiments reported here, 80% of their data was used as a training set and the rest for testing.
6.1 Question quality
- Table 2 shows the classification performance of the question classifier, using different subsets of their feature set.
- The KL-divergence between the question's language model and a model estimated from a collection of questions answered by the Yahoo! editorial team (available at http://ask.yahoo.com).
- The total number of answers of the asker that have been selected as the “best answer”.
- These features are derived from page views statistics as described in Section 3.3.
- Because the relational and usage features can independently identify high-quality content, the authors hypothesized that a variant of co-training or co-boosting [10], or a Maximum Entropy classifier [5], would be more effective for expanding the training set in a partially supervised setting.
6.2 Answer quality
- Table 4 shows the classification performance of the answer classifier, again examining different subsets of their feature set.
- The 20 most significant features for answer quality, according to a chi-squared test, included:
  - ∅: answer length
  - UAV: average number of abuse reports received by the answerer over his/her answers
  - ∅: the number of "thumbs up" minus "thumbs down" received by the answerer
  - the entropy of the unigram character-level model of the answer
7. CONCLUSIONS
- The authors presented a general classification framework for quality estimation in social media.
- As part of their work the authors developed a comprehensive graph-based model of contributor relationships and combined it with content- and usage-based features.
- The authors have successfully applied their framework to identifying high quality items in a web-scale community question answering portal, resulting in a high level of accuracy on the question and answer quality classification task.
- The authors investigated the contributions of the different sources of quality evidence, and have shown that some of the sources are complementary, i.e., they capture the same high-quality content from different perspectives.
- The combination of several types of sources of information is likely to increase the classifier's robustness to spam, as an adversary is required to not only create content that deceives the classifier, but also simulate realistic user relationships or usage statistics.