
Book ChapterDOI

Exploiting conversational features to detect high-quality blog comments

25 May 2011, pp. 122-127


Exploiting Conversational Features to Detect
High-Quality Blog Comments
Nicholas FitzGerald, Giuseppe Carenini, Gabriel Murray and Shafiq Joty
University of British Columbia
{nfitz,carenini,gabrielm,rjoty}@cs.ubc.ca
Abstract. In this work, we present a method for classifying the qual-
ity of blog comments using Linear-Chain Conditional Random Fields
(CRFs). This approach is found to yield high accuracy on binary classifi-
cation of high-quality comments, with conversational features contribut-
ing strongly to the accuracy. We also present a new corpus of blog data
in conversational form, complete with user-generated quality moderation
labels from the science and technology news blog Slashdot.
1 Introduction and Background
As the amount of content available on the Internet continues to increase expo-
nentially, the need for tools which can analyze and summarize large amounts
of text has become increasingly pronounced. Traditionally, most work on auto-
matic summarization has focused on extractive methods, where representative
sentences are chosen from the input corpus ([5]). In contrast, recent work (e.g.
[10], [2]) has taken an abstractive approach, where information is first extracted
from the input corpus, and then expressed through novel sentences created with
Natural Language Generation techniques. This approach, though more difficult,
has been shown to produce superior summaries in terms of readability and co-
herence.
Several recent works have focused on summarization of multi-participant
conversations ([9], [10]). [10] describes an abstractive summarization system for
face-to-face meeting transcripts. The approach is to use a series of classifiers to
identify different types of messages in the transcripts; for example, utterances
expressing a decision being made, or a positive opinion being expressed. The
summarizer then selects a set of messages which maximize a function encom-
passing information about the sentences in which messages appear, and passes
these messages to the NLG system.
In this paper, we present our work on detecting high-quality comments in
blogs using CRFs. In future work, this will be combined with classification on
other axes, such as the message's rhetorical role (e.g. Question, Response,
Criticism), to provide the messages for an abstractive summarization
system.
CRFs ([7]) are discriminative probabilistic models which have gained much
popularity in Natural Language Processing and Bio-informatics applications.

One benefit of using linear chain CRFs over more traditional linear classification
algorithms is that the sequence of labels is considered. Several works have shown
the effectiveness of CRFs on similar Natural Language Processing tasks which
involve sequential dependencies ([1], [4]). [11] uses Linear-Chain CRFs to classify
summary sentences to create extractive summaries of news articles, showing their
effectiveness on this task. [6] test CRFs against two other classifiers (Support
Vector Machines and Naive-Bayes) on the task of classifying dialogue acts in live-
chat conversations. They also show the usefulness of structural features, which
are similar to our conversational features (see Sect. 2.3).
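The benefit noted above, that a linear-chain model considers the sequence of labels rather than each comment in isolation, can be illustrated with a toy Viterbi decoder over the two labels used in this paper. The emission and transition scores below are invented for illustration; a trained CRF would derive them from the features described in Sect. 2.3.

```python
import math

LABELS = ["GOOD", "BAD"]

def viterbi(emissions, transition):
    """emissions: list of {label: log-score}, one dict per comment;
    transition: {(prev_label, label): log-score}.
    Returns the highest-scoring label sequence."""
    best = [{lab: (emissions[0][lab], [lab]) for lab in LABELS}]
    for em in emissions[1:]:
        cur = {}
        for lab in LABELS:
            # Pick the predecessor label maximising score + transition.
            prev_lab, (score, path) = max(
                best[-1].items(),
                key=lambda kv: kv[1][0] + transition[(kv[0], lab)])
            cur[lab] = (score + transition[(prev_lab, lab)] + em[lab],
                        path + [lab])
        best.append(cur)
    return max(best[-1].values(), key=lambda sp: sp[0])[1]

# Invented scores: GOOD comments tend to be followed by GOOD replies.
trans = {("GOOD", "GOOD"): math.log(0.7), ("GOOD", "BAD"): math.log(0.3),
         ("BAD", "GOOD"): math.log(0.4), ("BAD", "BAD"): math.log(0.6)}
ems = [{"GOOD": math.log(0.6), "BAD": math.log(0.4)},
       {"GOOD": math.log(0.5), "BAD": math.log(0.5)},   # ambiguous on its own
       {"GOOD": math.log(0.8), "BAD": math.log(0.2)}]
print(viterbi(ems, trans))
```

The middle comment is a 50/50 call on its own evidence, but the sequence context pulls it to GOOD, which an independent classifier could not do.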
2 Automatic Comment Rating System
2.1 The Slashdot Corpus
We compiled a new corpus comprising articles and their subsequent user com-
ments from the science and technology news aggregation website Slashdot
(http://slashdot.org). This site was chosen for several reasons. Comments on
Slashdot are moderated by users of the site, meaning that each comment has a
score from -1 to +5 indicating the net result of the moderations assigned, with
each moderator able to modify
the score of a given comment by +1 or -1. Furthermore, each moderation assigns
a classification to the comment: for good comments, the classes are: Interesting,
Insightful, Informative and Funny. For bad comments, the classes are: Flame-
bait, Troll, Off-Topic and Redundant. Since the goal of this work was to identify
high-quality comments, most of our experiments were conducted with comments
grouped into GOOD and BAD.
Slashdot comments are displayed in a threaded conversation-tree type layout.
Users can directly reply to a given comment, and their reply will be placed
underneath that comment in a nested structure. This conversational structure
allows us to use Conversational Features in our classification approach (see Sect.
2.3).
Some comments were not successfully crawled, which meant that some com-
ments in the corpus referred to parent comments which had not been collected.
To address this, comments whose parents were missing were excluded
from the corpus. After this cleanup, the collection totalled 425,853 comments on
4320 articles.
2.2 Transformation into Sequences
As mentioned above, Slashdot commenters can reply directly to other comments,
forming several tree-like conversations for each article. This creates a problem for
our use of Linear-Chain CRFs, which require linear sequences.
In order to solve this problem, each conversation tree is transformed into
multiple Threads, one for each leaf-comment in the tree. The Thread is the
sequence of comments from the root comment to the leaf comment. Each Thread

is then treated as a separate sequence by the classifier. One consequence of this
is that any comment with more than one reply will occur multiple times in the
training or testing set. This weighting makes some intuitive sense for training:
comments higher in the conversation tree are likely more important to the
conversation as a whole, since the earlier a comment appears in a thread, the
greater its effect on the course of the conversation down-thread. We describe the process of re-merging
these comment threads, and investigate the effect this has on accuracy, in Sect.
3.3.
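The decomposition described above can be sketched as follows. The comment dictionary structure here is hypothetical, chosen for illustration rather than taken from the corpus format:

```python
def threads(comment, path=()):
    """Decompose a conversation tree into root-to-leaf Threads.
    `comment` is a hypothetical dict: {"id": ..., "replies": [...]}."""
    path = path + (comment["id"],)
    if not comment["replies"]:            # leaf: one complete Thread
        return [list(path)]
    out = []
    for reply in comment["replies"]:      # a non-leaf comment appears in
        out.extend(threads(reply, path))  # every descendant leaf's Thread
    return out

# A root with two replies, one of which has its own reply, yields two
# Threads; the root comment occurs in both.
tree = {"id": "c1", "replies": [
    {"id": "c2", "replies": [{"id": "c4", "replies": []}]},
    {"id": "c3", "replies": []},
]}
print(threads(tree))   # [['c1', 'c2', 'c4'], ['c1', 'c3']]
```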
2.3 Features
Each comment in a given sequence was represented as a series of features. In
addition to simple unigram (bag-of-words) features, we experimented with two
other classes of features: lexical similarity, and conversational features. These
are described below:
Similarity Features Three features were used which capture the lexical simi-
larity between two comments: TF-IDF, LSA ([5]) and Lexical Cohesion ([3]). For
each comment, each of these three scores was calculated for both the preceding
and following comment (0 if there was no comment before or after), giving a
total of six similarity features. These features were previously shown in [12] to
be useful in the task of topic-modelling in email conversations. However, in con-
trast to [12], where similarity was calculated between sentences, these metrics
were adapted to calculate similarity between entire comments.
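As one illustration, a plain TF-IDF cosine similarity between adjacent comments might be computed as below. This is a sketch: the paper does not specify its exact TF-IDF formulation, and the LSA and Lexical Cohesion metrics are omitted entirely.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Plain tf-idf over whitespace tokens; a stand-in for the paper's
    similarity feature, not its actual implementation."""
    n = len(docs)
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(tok for toks in tokenised for tok in set(toks))
    return [{tok: tf * math.log(n / df[tok])
             for tok, tf in Counter(toks).items()}
            for toks in tokenised]

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

comments = ["the kernel patch looks solid",
            "agreed, the patch fixes the kernel bug",
            "first post"]
vecs = tfidf_vectors(comments)
# Similarity with the preceding comment; 0 when there is none, as in the paper.
prev_sim = [0.0] + [cosine(vecs[i], vecs[i - 1]) for i in range(1, len(vecs))]
print(prev_sim)
```

The second comment shares vocabulary with the first and scores above zero; the off-topic third comment scores exactly zero against its predecessor.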
Conversational Features The conversational features capture information
about how the comment is situated in the conversation as a whole. The list
is as follows:
ThreadIndex: The index of the comment in the current thread (starting at 0).
NumReplies: The number of child comments replying to this comment.
WordLength and SentenceLength: The length of this comment in words and
sentences, respectively.
AvgReplyWordLength and AvgReplySentLength: The average length of replies
to this comment in words and sentences, respectively.
TotalReplyWordLength and TotalReplySentLength: The total length of all replies
to this comment in words and sentences, respectively.
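A minimal sketch of computing these features for a single comment follows. The comment dictionary and the naive punctuation-based sentence counter are assumptions for illustration, not the paper's implementation:

```python
def conversational_features(comment, thread_index):
    """Compute the conversational features listed above for one comment.
    `comment` is a hypothetical dict with `text` and a list of `replies`
    (each itself a dict with `text`)."""
    def n_words(text):
        return len(text.split())

    def n_sents(text):
        # Naive sentence count: terminal punctuation marks.
        return max(1, text.count(".") + text.count("?") + text.count("!"))

    replies = comment["replies"]
    reply_words = [n_words(r["text"]) for r in replies]
    reply_sents = [n_sents(r["text"]) for r in replies]
    return {
        "ThreadIndex": thread_index,
        "NumReplies": len(replies),
        "WordLength": n_words(comment["text"]),
        "SentenceLength": n_sents(comment["text"]),
        "AvgReplyWordLength": sum(reply_words) / len(replies) if replies else 0,
        "AvgReplySentLength": sum(reply_sents) / len(replies) if replies else 0,
        "TotalReplyWordLength": sum(reply_words),
        "TotalReplySentLength": sum(reply_sents),
    }

c = {"text": "Nice find. Does it scale?",
     "replies": [{"text": "It does."}, {"text": "Only up to a point."}]}
feats = conversational_features(c, thread_index=0)
print(feats["NumReplies"], feats["WordLength"], feats["TotalReplyWordLength"])
```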
2.4 Training
The popular Natural Language Machine Learning toolkit MALLET
(http://mallet.cs.umass.edu) was used to train the CRF model. A 1000-article
subset of the entire Slashdot corpus
was divided 90%-10% between the training and testing set. The training set
consisted of 93,841 Threads from 900 articles, while the testing set consisted of
10,053 Threads from 100 articles.

(a)
              BAD   GOOD
    BAD      5991   1965
    GOOD     1426   8814

    P: 0.818   R: 0.861   F: 0.839

(b)
                      P       R      F
    all good        0.563   1.000  0.720
    uni             0.708   0.699  0.703
    sim             0.802   0.900  0.848
    conv            0.818³  0.855  0.836
    uni sim         0.780   0.847  0.812
    uni conv        0.818³  0.855  0.836
    sim conv        0.818³  0.855  0.836
    uni sim conv    0.818³  0.855  0.836

(c)
              BAD   GOOD
    BAD      4160    467
    GOOD      862   1090

    P: 0.700   R: 0.558   F: 0.621

Table 1: (a) Confusion matrix for binary classification of comment threads
(rows: actual label, columns: predicted label). (b) Results of feature analysis
on the 3 feature classes. (c) Confusion matrix for re-merged comment threads
(rows: actual, columns: predicted).
3 Experimental Results
3.1 Classification
Experiment 1 was to train the CRF using data where the full set of moderation
labels had been grouped into GOOD and BAD. The Conditional
Random Field classifier was trained on the full set of features presented in Sect.
2.3. The Confusion-Matrix of this experiment is presented in Tab. 1a. We can
see that the CRF performs well on this formulation of the task, with a precision
of 0.818, a recall of 0.861, and an F-score of 0.839. This compares very favourably
to a baseline of assigning GOOD to all comments, which yields a precision score of 0.563. The
CRF result also performs favourably against a non-sequential Support Vector
Machine classifier (P = .799, R = .773) which confirms the existence of sequential
dependencies in this problem.
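The reported metrics follow directly from the confusion matrix in Tab. 1a, treating GOOD as the positive class:

```python
# Reproducing the Tab. 1a metrics from the confusion counts.
tp, fn = 8814, 1426   # actual GOOD comments labelled GOOD / BAD
fp, tn = 1965, 5991   # actual BAD comments labelled GOOD / BAD

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
# 0.818 0.861 0.839
```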
3.2 Feature Analysis
To investigate the relative importance of the three types of features (unigrams, simi-
larity, and conversational) we experiment with training the classifier with differ-
ent groupings of features. The results of this feature analysis are presented in Tab.
1b. All three sets of features can provide relatively good results by themselves,
but the similarity and conversational features greatly out-perform the unigram
features. Similarity features have a slight edge in terms of recall and f-score, while
the Conversational features provide the edge in precision, seeming to dominate
Similarity features when both are used. In fact, the results of this analysis seem
to show that whenever the conversational features are used, they dominate the
effect of the other features, since all sets of features which include Conversational
³ These results were not identical, though close enough that precision, recall, and
f-score were identical to the third decimal place.

features have the same results as using the Conversational features alone. This
would seem to indicate that the most relevant factors in deciding the quality of a
given comment are conversational in nature, including the number of replies it
receives and the nature of those replies. This effect could be reinforced by the
fact that comments which have previously been moderated as GOOD are more
likely to be read by future readers, which will naturally increase the number of
comments they receive in reply. However, since the unigram and, more notably,
the similarity features can still perform quite well without use of the conversational
features, our method is not overly dependent on this effect.
3.3 Re-Merging Conversation Trees
As described in Sect. 2.2, conversation trees were decomposed into multiple
threads in order to cast the problem in the form of sequence labelling. The result
of this is that after classification, each non-leaf comment has been classified
multiple times, equal to the number of sub-comments of that comment. These
different classifications need not be the same, i.e. a given comment might well
have been classified as GOOD in one sequence and BAD in another. We next
recombined these sequences, such that there is only one classification per com-
ment. Comments which appeared in multiple sequences, and thus received multi-
ple classifications, were marked GOOD if they were classified as GOOD at least
once (GOOD if |{c_i ∈ C : c_i = good}| ≥ 1, where C is the set of classifications
of comment i).⁴
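The at-least-once merging rule can be sketched as follows. The input format, per-Thread lists of (comment id, label) pairs, is hypothetical:

```python
from collections import defaultdict

def remerge(thread_predictions):
    """Collapse per-Thread predictions into one label per comment using the
    at-least-once rule: GOOD if any Thread labelled the comment GOOD.
    `thread_predictions` is a list of [(comment_id, label), ...] sequences."""
    labels = defaultdict(set)
    for thread in thread_predictions:
        for comment_id, label in thread:
            labels[comment_id].add(label)
    return {cid: "GOOD" if "GOOD" in seen else "BAD"
            for cid, seen in labels.items()}

# c1 appears in two Threads with conflicting labels; at-least-once keeps GOOD.
preds = [[("c1", "GOOD"), ("c2", "BAD")],
         [("c1", "BAD"), ("c3", "GOOD")]]
print(remerge(preds))   # {'c1': 'GOOD', 'c2': 'BAD', 'c3': 'GOOD'}
```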
There are two ways to evaluate the merged classifications. The first way is to
reassign the newly-merged classifications back onto the thread sequences. This
preserves the proportions of observations in the original experiments, which al-
lows us to determine whether merging has affected the accuracy of classification.
Doing so showed that there was no significant effect on the performance of the
classifier; precision and recall remained .818 and .861, respectively.
The other method is to look at the comment-level accuracy. This removes
duplicates from the data, and gives the overall accuracy for determining the
classification of a given comment. The results of this are given in Table 1c. The
precision and recall in this measure are significantly lower than in the thread-
based measure, which indicates that the classification of "leaf" comments tended
to be less accurate than that of non-leaf comments which subsequently appeared
in more than one thread. The precision of .700 is still much greater than the
baseline of assigning GOOD to all comments, which would yield a precision of
.297. This indicates that our approach can successfully identify good comments.
4 Conclusion and Future Work
In this work, we have presented an approach to identifying high-quality com-
ments in blog comment conversations. By casting the problem as one of binary
⁴ This was compared to similar metrics such as a majority-vote metric (GOOD if
|{c_i ∈ C : c_i = good}| ≥ |{c_i ∈ C : c_i = bad}|), and performed the best (though the
difference was negligible).

Citations

Posted Content
Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, et al.
TL;DR: This work is the first to model Twitter conversations as a tree structure in this manner, introducing a novel way of tackling NLP tasks on Twitter conversations.
Abstract: Rumour stance classification, the task that determines if each tweet in a collection discussing a rumour is supporting, denying, questioning or simply commenting on the rumour, has been attracting substantial interest. Here we introduce a novel approach that makes use of the sequence of transitions observed in tree-structured conversation threads in Twitter. The conversation threads are formed by harvesting users' replies to one another, which results in a nested tree-like structure. Previous work addressing the stance classification task has treated each tweet as a separate unit. Here we analyse tweets by virtue of their position in a sequence and test two sequential classifiers, Linear-Chain CRF and Tree CRF, each of which makes different assumptions about the conversational structure. We experiment with eight Twitter datasets, collected during breaking news, and show that exploiting the sequential structure of Twitter conversations achieves significant improvements over the non-sequential methods. Our work is the first to model Twitter conversations as a tree structure in this manner, introducing a novel way of tackling NLP tasks on Twitter conversations.

66 citations


Cites methods from "Exploiting conversational features ..."

  • ...FitzGerald et al. (2011) used a linear-chain CRF to identify high-quality comments in threads responding to blog posts....


Journal ArticleDOI
Enamul Hoque and Giuseppe Carenini
1 Jun 2014
TL;DR: A visual text analytic system that tightly integrates interactive visualization with novel text mining and summarization techniques to fulfill information needs of users in exploring conversations is presented.
Abstract: Today it is quite common for people to exchange hundreds of comments in online conversations e.g., blogs. Often, it can be very difficult to analyze and gain insights from such long conversations. To address this problem, we present a visual text analytic system that tightly integrates interactive visualization with novel text mining and summarization techniques to fulfill information needs of users in exploring conversations. At first, we perform a user requirement analysis for the domain of blog conversations to derive a set of design principles. Following these principles, we present an interface that visualizes a combination of various metadata and textual analysis results, supporting the user to interactively explore the blog conversations. We conclude with an informal user evaluation, which provides anecdotal evidence about the effectiveness of our system and directions for further design.

63 citations


Proceedings Article
Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, et al.
21 Sep 2016

61 citations


Journal ArticleDOI
Inbal Yahav, Onn Shehory and David G. Schwartz
TL;DR: This paper reveals the bias introduced by between-participants’ discourse to the study of comments in social media, and proposes an adjustment to tf-idf that accounts for this bias.
Abstract: Text mining has gained great momentum in recent years, with user-generated content becoming widely available. One key use is comment mining, with much attention being given to sentiment analysis and opinion mining. An essential step in the process of comment mining is text pre-processing, a step in which each linguistic term is assigned a weight that commonly increases with its appearance in the studied text, yet is offset by the frequency of the term in the domain of interest. A common practice is to use the well-known tf-idf formula to compute these weights. This paper reveals the bias introduced by between-participants' discourse to the study of comments in social media, and proposes an adjustment. We find that content extracted from discourse is often highly correlated, resulting in dependency structures between observations in the study, thus introducing a statistical bias. Ignoring this bias can manifest in a non-robust analysis at best and can lead to an entirely wrong conclusion at worst. We propose an adjustment to tf-idf that accounts for this bias. We illustrate the effects of both the bias and the correction with data from seven Facebook fan pages, covering different domains, including news, finance, politics, sport, shopping, and entertainment.

43 citations


Cites background from "Exploiting conversational features ..."

  • ..., [31], [32], [33]); interestingness of comments [44]; etc....

  • ...in rumors [30], informativeness relative to the document posted, thread topic, past comments [31], [32], uniqueness of comments [33], and much more....

  • ...The relationship between comments, both structural and lexical, is highly recognized as useful in classification as seen in [31] who study the quality of comments in a threaded blog discussion....


Proceedings Article
01 Jan 2017-
TL;DR: A new task is defined of identifying “good” conversations, which are called ERICs—Engaging, Respectful, and/or Informative Conversations, posted in response to online news articles and in debate forums.
Abstract: Online news platforms curate high-quality content for their readers and, in many cases, users can post comments in response. While comment threads routinely contain unproductive banter, insults, or users “shouting” over each other, there are often good discussions buried among the noise. In this paper, we define a new task of identifying “good” conversations, which we call ERICs—Engaging, Respectful, and/or Informative Conversations. Our model successfully identifies ERICs posted in response to online news articles with F1 = 0.73 and F1 = 0.91 in debate forums.

25 citations


Cites background from "Exploiting conversational features ..."

  • ...Recent research aims to improve comment quality by identifying engaging comments (FitzGerald et al. 2011; Backstrom et al. 2013), ranking reddit comments by karma (Jaech et al....

  • ...More closely related work have measured the quality of individual comments (FitzGerald et al. 2011) or threads on Slashdot using non-linguistic features (Lee, Yang, and Rim 2014)....

  • ...Recent research aims to improve comment quality by identifying engaging comments (FitzGerald et al. 2011; Backstrom et al. 2013), ranking reddit comments by karma (Jaech et al. 2015), filtering inflammatory comments (Lin et al. 2012; Nobata et al. 2016) or trolling (Mihaylov and Nakov 2016; Cheng,…...


References

Proceedings Article
28 Jun 2001-
TL;DR: This work presents iterative parameter estimation algorithms for conditional random fields and compares the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
Abstract: We present conditional random fields , a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.

12,343 citations



Book
01 Jan 2000-
TL;DR: This book takes an empirical approach to language processing, based on applying statistical and other machine-learning algorithms to large corpora, to demonstrate how the same algorithm can be used for speech recognition and word-sense disambiguation.
Abstract: From the Publisher: This book takes an empirical approach to language processing, based on applying statistical and other machine-learning algorithms to large corpora.Methodology boxes are included in each chapter. Each chapter is built around one or more worked examples to demonstrate the main idea of the chapter. Covers the fundamental algorithms of various fields, whether originally proposed for spoken or written language to demonstrate how the same algorithm can be used for speech recognition and word-sense disambiguation. Emphasis on web and other practical applications. Emphasis on scientific evaluation. Useful as a reference for professionals in any of the areas of speech and language processing.

3,602 citations


Journal ArticleDOI
Pat Langley
25 Mar 1986, Machine Learning

995 citations


Posted Content
Charles Sutton and Andrew McCallum
Abstract: Often we wish to predict a large number of variables that depend on each other as well as on other observed variables. Structured prediction methods are essentially a combination of classification and graphical modeling, combining the ability of graphical models to compactly model multivariate data with the ability of classification methods to perform prediction using large sets of input features. This tutorial describes conditional random fields, a popular probabilistic method for structured prediction. CRFs have seen wide application in natural language processing, computer vision, and bioinformatics. We describe methods for inference and parameter estimation for CRFs, including practical issues for implementing large scale CRFs. We do not assume previous knowledge of graphical modeling, so this tutorial is intended to be useful to practitioners in a wide variety of fields.

785 citations


Performance Metrics

No. of citations received by the paper in previous years:

    Year   Citations
    2019   2
    2018   1
    2017   2
    2016   3
    2015   1
    2014   2