scispace - formally typeset
Open AccessProceedings ArticleDOI

Is This Post Persuasive? Ranking Argumentative Comments in Online Forum

Reads0
Chats0
TLDR
The experiments show that the surface textual features do not perform well compared to the argumentation based features, and the social interaction based features are effective especially when more users participate in the discussion.
Abstract
In this paper we study how to identify persuasive posts in the online forum discussions, using data from Change My View sub-Reddit. Our analysis confirms that the users’ voting score for a comment is highly correlated with its metadata information such as published time and author reputation. In this work, we propose and evaluate other features to rank comments for their persuasive scores, including textual information in the comments and social interaction related features. Our experiments show that the surface textual features do not perform well compared to the argumentation based features, and the social interaction based features are effective especially when more users participate in the discussion.

read more

Content maybe subject to copyright    Report

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 195–200,
Berlin, Germany, August 7-12, 2016.
c
2016 Association for Computational Linguistics
Is This Post Pers uasive?
Ranking Argumentative Comments in the Online Forum
Zhongyu Wei
12
, Yang Liu
2
and Yi Li
2
1
School of Data Science, Fudan University, Shanghai, P.R.China
2
Computer Science Department, The University of Texas at Dallas
Richardson, Texas 75080, USA
{zywei,yangl,yili}@h lt.utdallas.edu
Abstract
In this paper we study how to identify per-
suasive posts in the online forum discus-
sions, using data from Change My View
sub-Reddit. Our analysis confirms that
the users’ voting score for a comment is
highly correlated with its metadata infor-
mation such as published time and author
reputation. In this work, we propose and
evaluate other features to rank comments
for their persuasive scores, including tex-
tual information in the comments and so-
cial interaction related features. Our ex-
periments show that the surface textual
features do not perform well compared to
the argumentation based features, and the
social interaction based features are effec-
tive especially when more users partici-
pate in the discussion.
1 Introduction
With the popularity of online forums such as ide-
bate
1
and convinceme
2
, researchers have been
paying increasing attentions to analyzing per-
suasive content, including identification of argu-
ing expressions in online debates (Trabelsi and
Zaıane, 2014), recognition of stance in ideolog-
ical online debates (Somasundaran and Wiebe,
2010; Hasan and Ng, 2014; Ranade et al., 2013b),
and debate summarization (Ranade et al., 2013a).
However, how to automatically determine if a text
is persuasive is still an unsolved problem.
Text quality and popularity evaluation has been
studied in different domains in the past few
years (Louis and Nenkova, 2013; Tan et al., 2014;
Park et al., 2016; Guerini et al., 2015). However,
1
http://idebate.org/
2
http://convinceme.net
quality evaluation of argumentative text in the on-
line forum has some unique characterisitcs. First,
persuasive text contains argument that is not com-
mon in other genres. Second, beside the text it-
self, the interplay between a comment and what it
responds to is crucial. Third, the community re-
action to the comment also needs to be taken into
consideration.
In this paper, we propose several sets of features
to capture the above mentioned characteristics for
persuasive comment identification in the online fo-
rum. We constructed a dataset from a sub-forum
of Reddit
3
, namely change my view (CMV)
4
. We
first analyze the corpus and show the correlation
between the human voting score for an argumen-
tative comment and its entry order and author rep-
utation. Then for the comment ranking task, we
propose three sets of features including surface
text features, social interaction based features and
argumentation based features. Our experimental
results show that the argumentation based features
work the best in the early stage of the discussion
and the effectiveness of social interaction features
increases when the number of comments in the
discussion grows.
2 Dataset and Task
2.1 Data
On CMV, people initiate a discussion thread with
a post expressing their thoughts toward a specific
topic and other users reply with arguments from
the opposite side in order to change the initiator’s
mind. The writing quality on CVM is quite good
since the discussions are monitored by modera-
tors. Besides commenting, users can vote on dif-
ferent replies to indicate which one is more per-
suasive than others. The total amount of upvotes
3
https://www.reddit.com
4
https://www.reddit.com/r/changemyview
195

Thread # 1,785
Comment # 374,472
Comment # / Thread #
209.79
Author #
32,639
Unique Author # / Thread #
70.67
Delta Awarded Thread # 886 (49.6%)
Delta Awarded Comment #
2,056 (0.5%)
Table 1: Statistics of the CMV dataset.
minus the down votes is called karma, indicating
the persuasiveness of the reply. Users can also
give delta to a comment if it changes their orig-
inal mind about the topic. The comment is then
named delta awarded comment (DAC), and the
thread containing a DAC is noted as delta awarded
thread.
We use a corpus collected from CMV.
5
The
original corpus contains all the threads published
between Jan. 2014 and Jan. 2015. We kept the
threads with more than 100 comments to form our
experimental dataset
6
. The basic statistics of the
dataset can be seen in Table 1.
Figure 1a shows the distribution of the karma
scores in the dataset. We can see that the karma
score is highly skewed, similar to what is reported
in (Jaech et al., 2015). 42% of comments obtain
a karma score of exactly one (i.e., no votes be-
yond the author), and around 15% of comments
have a score less than one. Figure 1b and 1c show
the correlation of the karma score with two meta-
data features, author reputation
7
and entry order,
respectively. We can see the karma score of a com-
ment is highly related to its entry order. In gen-
eral, the earlier a comment is posted, the higher
karma score it obtains. The average score is less
than one when it is posted after 30 comments. Fig-
ure 1c shows that authors of comments with higher
karma scores tend to have higher reputation on av-
erage.
2.2 Task
Tan et al. (2016) explored the task of mind change
by focusing on delta awarded comments using
their CMV data. However, the percentage of delta
awarded comments is quite low, as shown in Ta-
ble 1 (the percentage of comments obtained delta
is as low as 0.5%). In addition, a persuasive com-
ment is not necessarily delta awarded. It can be
5
The data was shared with us by researchers at the Uni-
versity of Washington.
6
Please contact authors about sharing the data set.
7
This is the number of deltas the author has received.
of high quality but does not change other people’s
mind. Our research thus uses the karma score
of a comment, instead of delta, as the reference
to represent the persuasiveness of the comment.
Our analysis also shows that delta awarded com-
ments generally have high karma scores (78.7% of
DACs obtain a higher karma score than the median
value in each delta awarded thread), indicating the
karma score is correlated with the delta value.
Using karma scores as ground truth, Jaech et
al. (2015) proposed a comment ranking task on
several sub-forums of Reddit. In order to reduce
the impact of timing, they rank each set of 10
connective comments. However, their setting is
not suitable for our task. First, at the later stage
of the discussion, comments posted connectively
in terms of time can belong to different sub-trees
of the discussion, and thus can be viewed or re-
acted with great difference. Second, as shown in
Figure 1b, comments entered in later stage obtain
little attention from audience. This makes their
karma scores less reliable as the ground-truth of
persuasiveness.
To further control the factor of timing, we define
the task as ranking the first-N comments in each
thread. The final karma scores of these N com-
ments are used to determine their reference rank
for evaluation. We study two setups for this rank-
ing task. First we use information until the time
point when the thread contains only these N com-
ments. Second we allow the system to access more
comments than N . Our goal is to investigate if we
can predict whether a comment is persuasive and
how the community reacts to a comment in the fu-
ture.
3 Methods
3.1 Ranking Model
A pair-wise learning-to-rank model (Ranking
SVM (Joachims, 2002)) is used in our task. We
first construct the training set including pairs of
comments. In each pair, the first comment is more
persuasive than the second one. Considering that
two samples with similar karma scores might not
be significantly different in terms of their persua-
siveness, we propose to use a modified score to
form training pairs in order to improve the learn-
ing efficacy. We group comments into 7 buckets
based on their karma scores, [-, 0], (0, 1], (1, 5],
(5, 10], (10, 20], (20, 50] and (50, +]. We then
use the bucket number (0 - 6) of each comment
196

0 10 20 30 40 50
karma score
percentage
0%
10%
20%
30%
40%
50%
(a) Karma score distribution
−40
−20
0
20
40
60
entry order
karma score
1 5 9 13 17 21 25 29
(b) Karma score vs entry order
0
10
20
30
40
50
rank of karma score
author reputation
1 5 9 13 17 21 25 29
(c) Karma score vs author reputation
Figure 1: Karma value distributions in the CMV dataset.
Feature Category Feature Name Feature Description
Surface Text Features
length # of the words, sentences and paragraphs in c.
url # of urls contained in c.
unique # of words # of unique words in c.
punctuation # of punctuation marks in c.
unique # of POS # of unique POS tags in c.
Social Interaction Features
tree size The tree size generated by c and rc.
reply num The number of replies obtained by c and rc.
tree height The height of the tree generated by by c and rc.
Is root reply Is c a root reply of the post?
Is leaf Is c a leaf of the tree generated by rc?
location The position of c in the tree generated by rc.
Argumentation Related Features
connective words Number of connective words in c.
modal verbs Number of modal verbs included in c.
argumentative sentence Number and percentage of argumentative sentences.
argument relevance Similarity with the original post and parent comment.
argument originality Maximum similarity with comments published earlier.
Table 2: Feature list (c: the comment; rc: the root comment of c.)
as its modified score. We use all the formed pairs
to train our ranker. In order to be consistent, we
use the first-N comments in the training threads to
construct the training samples to predict the rank
for the first-N comments in a test thread.
3.2 Features
We propose several key features that we hypoth-
esize are predictive of persuasive comments. The
full feature list is given in Table 2.
Surface Text Features
8
: In order to capture the
basic textual information, we use the comment
length and content diversity represented as the
number of words, POS tags, URLs, and punctu-
ation marks. We also explored unigram features
and named entity based features, but they did
not improve system performance and are thus
not included.
Social Interaction Features: We hypothesize
that if a comment attracts more social attention
8
Stanford CoreNLP (Manning et al., 2014) was used to
preprocess the text (i.e., comment splitting, sentence tok-
enization, POS tagging and NER recognition.).
from the community, it is more likely to be per-
suasive, therefore we propose several social in-
teraction features to capture the community re-
action to a comment. Besides the reply tree gen-
erated by the comment, we also consider the re-
ply tree generated by the root comment
9
for fea-
ture computing. The tree size is the number of
comments in the reply tree. The position of c is
its level in the reply tree (the level of root node
is zero).
Argumentation Related Features: We believe
a comment’s argumentation quality is a good in-
dicator of its persuasiveness. In order to cap-
ture the argumentation related information, we
propose two sub-groups of features based on
the comment itself and the interplay between
the comment and other comments in the discus-
sion. a) Local features: we trained a binary
classifier to classify sentences as argumentative
and non-argumentative using features proposed
in (Stab and Gurevych, 2014). We then use the
number and percentage of argumentative sen-
9
It is a comment that replies to the original post directly.
197

Approach NDCG@1 NDCG@5 NDCG@10
random 0.258 0.440 0.564
author 0.382 0.567 0.664
entry-order
0.460 0.600 0.689
LTR
text
0.372 0.558 0.658
LTR
social
0.475
0.650
0.718
LTR
arg
0.475
0.652
0.725
LTR
text+social
0.494
0.666
0.733
LTR
text+arg
0.485
0.654
0.729
LTR
social+arg
0.502
0.674
0.740
LTR
T +S+A
0.508
0.676
0.743
LTR
all
0.521
0.685
0.752
Table 3: Performance of first-10 comments rank-
ing (T+S+A: the combination of the three sets of
features we proposed; all: the combination of two
meta-data features and our features; bold: the best
performance in each column; : the approach is
significantly better than both metadata baselines
(p <0.01); : the approach is significantly better
than LTR approaches using a single category of
features (p <0.01).).
tences predicted by the classifier as features.
Besides, we include some features used in the
classifier directly (i.e. number of connective
words
10
and modal verbs). b) Interactive fea-
tures: for these features, we consider the simi-
larity of a comment and its parent comment, the
original post, and all the previously published
comments. We use cosine similarity computed
based on the term frequency vector representa-
tion. Intuitively a comment needs to be relevant
to the discussed topic and possibly have some
original convincing opinions or arguments to re-
ceive a high karma score.
4 Experimental Results
We use 5-fold cross-validation in our experiments.
Normalized discounted cumulative gain (NDCG)
score (J¨arvelin and Kek¨al¨ainen, 2000) is used as
the evaluation metric for our First-N comments
ranking task. In this study, N is10.
4.1 Experiment I: Using N Comments for
Ranking
Table 3 shows the results for rst-10 comments
ranking using information from only these 10
comments. As shown in Figure 1, metadata fea-
tures, entry order and author’s reputation are cor-
related with the karma score of a comment. We
10
We constructed a list of connective words including 55
entries (e.g., because, therefore etc.).
thus use these two values as baselines. We also
include the performance of the random baseline
for comparison
11
. For our ranking based models
(LTR
), we compare using the three sets of fea-
tures described in Section. 3.2 (noted as text, so-
cial and arg respectively), individually or in com-
bination. We report NDCG scores for position 1,
5 and 10 respectively. The followings are some
findings.
Both metadata based baselines generate signif-
icantly
12
better results compared to the random
baseline. Baseline entry-order performs much
better than author, suggesting that the entry or-
der is m ore indicative for the karma score of a
comment.
The surface text features are least effective
among the three sets of features, and the per-
formance using them is even worse than the
two metadata baselines. This might be because
the general writing quality of the comments in
CMV is high because of the policy of the forum.
Therefore, the surface text features we used are
not very discriminative for comment ranking.
A further analysis of features in this category
shows that length is the most effective feature.
Argumentation based features have the best per-
formance among the three categories. Its per-
formance is significantly better than surface text
features, consistent w ith our expectation that ar-
gumentation related features are useful for per-
suasiveness evaluation. Our additional experi-
ments show that interactive features are more
effective than local features. T his might be be-
cause the argumentation features and models
we use are not perfect. Future research is still
needed to better represent argumentation infor-
mation in the text.
When combining two categories of features,
the performance of the ranker increases con-
sistently. The performance can be further im-
proved by combining all the three categories of
features we proposed (the improvement com-
pared to using a single feature category is signif-
icant). The best results are achieved by LTR
all
,
i.e., combining two metadata features and fea-
tures we proposed.
11
The performance of random baseline is high because of
the tie of reference karma scores.
12
Significance is computed by two tailed t-test.
198

10 20 30 40 50
0.66
0.68
0.70
0.72
0.74
0.76
0.78
0.80
# of comments
NDCG@10
LTR
text
LTR
social
LTR
arg
LTR
T+S+A
Figure 2: Results using various number of com-
ments in the thread for ranking.
4.2 Experiment II: Using Varying Numbers
of Comments for Rankin g
With the evolving discussion, there will be more
comments joining the thread providing more in-
formation for social interaction based features. In
order to show the impact of different features at
different discussion stage, we conduct another ex-
periment by ranking first-10 comments with vary-
ing numbers of comments in the test thread for fea-
ture computing. The result of the experiment is
shown in Figure 2. The performance of LTR
text
and LTR
arg
remain the same since their feature
values are not affected by the new coming com-
ments. The performance of LTR
social
increases
consistently when the number of comments grows,
and it outperforms LTR
arg
when the number of
comments is more than 20. LTR
T +S+A
has always
the best performance, benefiting from the combi-
nation of different types of features.
5 Related Work
Our work is m ost related to two lines of work,
including text quality evaluation and research on
Reddit.com.
Text q uality: Text quality and popularity eval-
uation has been studied in different domains in the
past few years. Louis and Nenkova (2013) imple-
mented features to capture aspects of great writing
in science journalism domain. Tan et al. (2014)
looked into the effect of wording while predict-
ing the popularity of social m edia content. Park et
al. (2016) developed an interactive system to as-
sist human moderators to select high quality news.
Guerini et al. (2015) modeled a notion of euphony
and explored the impact of sounds on different
forms of persuasiveness. Their research focused
on the phonetic aspect instead of language usage.
Reddit based research: Reddit has been used
recently for research on social news analysis
and recommendation (e.g., (Buntain and Golbeck,
2014)). Researchers also analyzed the language
use on Reddit. Jaech et al. (2015) studied how
language use affects community reaction to com-
ments in Reddit. Tan et al. (2016) analyzed the
interaction dynamics and persuasion strategies in
CMV.
6 Conclusion
In this paper, we studied the impact of different
sets of features on the identification of persuasive
comments in the online forum. Our experiment re-
sults show that argumentation based features work
the best in the early stage of the discussion, while
the effectiveness of social interaction based fea-
tures increases when the number of comments in
the thread grows.
There are three major future directions for this
research. First, the approach for argument mod-
eling in this paper is lexical based, which limits
the effectiveness of argumentation related features
for our task. It is thus crucial to study more ef-
fective ways for argument modeling. Second, we
will explore persuasion behavior of the argumen-
tative comments and study the correlation between
the strength of the argument and different persua-
sion behaviors. Third, we plan to automatically
construct an argumentation corpus including pairs
of arguments from two opposite sides of the topic
from CMV, and use this for automatic disputing
argument generation.
Acknowledgments
We thank the anonymous reviewers for their de-
tailed and insightful comments on this paper. The
work is partially supported by DARPA Contract
No. FA8750-13-2-0041 and AFOSR award No.
FA9550-15-1-0346. Any opinions, findings, and
conclusions or recommendations expressed are
those of the authors and do not necessarily reflect
the views of the funding agencies. We thank Trang
Tran, Hao Fang and Mari Ostendorf at University
of Washington for sharing the Reddit data they
collected.
199

Citations
More filters
Proceedings ArticleDOI

Computational Argumentation Quality Assessment in Natural Language

TL;DR: This paper presents the first holistic work on computational argumentation quality in natural language, comprehensively survey the diverse existing theories and approaches to assess logical, rhetorical, and dialectical quality dimensions, and derives a systematic taxonomy from these.
Proceedings ArticleDOI

Analyzing the Semantic Types of Claims and Premises in an Online Persuasive Forum

TL;DR: A two-tiered annotation scheme to label claims and premises and their semantic types in an online persuasive forum, Change My View, with the long-term goal of understanding what makes a message persuasive.
Proceedings ArticleDOI

Recognizing Insufficiently Supported Arguments in Argumentative Essays

TL;DR: This work shows that human annotators substantially agree on the sufficiency criterion and introduces a novel annotated corpus and the final corpus as well as the annotation guideline are freely available for encouraging future research on argument quality.
Proceedings ArticleDOI

Let’s Make Your Request More Persuasive: Modeling Persuasive Strategies via Semi-Supervised Neural Nets on Crowdfunding Platforms

TL;DR: A semi-supervised hierarchical neural network model is proposed to quantify persuasiveness and identify the persuasive strategies in advocacy requests and has applications for other situations with document-level supervision but only partial sentence supervision.
Proceedings ArticleDOI

Deconfounded Lexicon Induction for Interpretable Social Science

TL;DR: Two deep learning algorithms are introduced that are more predictive and less confound-related than those of standard feature weighting and lexicon induction techniques like regression and log odds and used to induce lexicons that are predictive of timely responses to consumer complaints, enrollment from course descriptions, and sales from product descriptions.
References
More filters
Proceedings ArticleDOI

The Stanford CoreNLP Natural Language Processing Toolkit

TL;DR: The design and use of the Stanford CoreNLP toolkit is described, an extensible pipeline that provides core natural language analysis, and it is suggested that this follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.
Proceedings ArticleDOI

Optimizing search engines using clickthrough data

TL;DR: The goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking.
Journal ArticleDOI

IR evaluation methods for retrieving highly relevant documents

TL;DR: The novel evaluation methods and the case demonstrate that non-dichotomous relevance assessments are applicable in IR experiments, may reveal interesting phenomena, and allow harder testing of IR methods.
Proceedings ArticleDOI

Identifying Argumentative Discourse Structures in Persuasive Essays

TL;DR: A novel approach for identifying argumentative discourse structures in persuasive essays by evaluating several classifiers and proposing novel feature sets including structural, lexical, syntactic and contextual features.
Proceedings Article

Recognizing Stances in Ideological On-Line Debates

TL;DR: This work constructs an arguing lexicon automatically from a manually annotated corpus and builds supervised systems employing sentiment and arguing opinions and their targets as features, which perform substantially better than a distribution-based baseline.
Related Papers (5)