Is This Post Persuasive? Ranking Argumentative Comments in Online Forum

doi:10.18653/V1/P16-2032

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 195–200, Berlin, Germany, August 7-12, 2016.

© 2016 Association for Computational Linguistics


Zhongyu Wei (1, 2), Yang Liu (2) and Yi Li (2)

(1) School of Data Science, Fudan University, Shanghai, P.R. China

(2) Computer Science Department, The University of Texas at Dallas, Richardson, Texas 75080, USA

{zywei,yangl,yili}@hlt.utdallas.edu

Abstract

In this paper, we study how to identify persuasive posts in online forum discussions, using data from the Change My View subreddit. Our analysis confirms that the users' voting score for a comment is highly correlated with its metadata, such as publication time and author reputation. In this work, we propose and evaluate other features to rank comments by their persuasiveness scores, including textual information in the comments and social interaction related features. Our experiments show that the surface textual features do not perform well compared to the argumentation based features, and that the social interaction based features are effective, especially when more users participate in the discussion.

1 Introduction

With the popularity of online forums such as idebate [1] and convinceme [2], researchers have been paying increasing attention to analyzing persuasive content, including identification of arguing expressions in online debates (Trabelsi and Zaïane, 2014), recognition of stance in ideological online debates (Somasundaran and Wiebe, 2010; Hasan and Ng, 2014; Ranade et al., 2013b), and debate summarization (Ranade et al., 2013a). However, how to automatically determine whether a text is persuasive is still an unsolved problem.

Text quality and popularity evaluation has been studied in different domains in the past few years (Louis and Nenkova, 2013; Tan et al., 2014; Park et al., 2016; Guerini et al., 2015). However, quality evaluation of argumentative text in an online forum has some unique characteristics. First, persuasive text contains arguments, which are not common in other genres. Second, besides the text itself, the interplay between a comment and what it responds to is crucial. Third, the community reaction to the comment also needs to be taken into consideration.

[1] http://idebate.org/

[2] http://convinceme.net

In this paper, we propose several sets of features to capture the above-mentioned characteristics for persuasive comment identification in an online forum. We constructed a dataset from a sub-forum of Reddit [3], namely Change My View (CMV) [4]. We first analyze the corpus and show the correlation between the human voting score for an argumentative comment and its entry order and author reputation. Then, for the comment ranking task, we propose three sets of features: surface text features, social interaction based features and argumentation based features. Our experimental results show that the argumentation based features work the best in the early stage of the discussion and that the effectiveness of the social interaction features increases as the number of comments in the discussion grows.

2 Dataset and Task

2.1 Data

On CMV, people initiate a discussion thread with a post expressing their thoughts on a specific topic, and other users reply with arguments from the opposite side in order to change the initiator's mind. The writing quality on CMV is quite good since the discussions are monitored by moderators. Besides commenting, users can vote on different replies to indicate which one is more persuasive than others. The total number of upvotes minus the downvotes is called karma, indicating the persuasiveness of the reply. Users can also give a delta to a comment if it changes their original mind about the topic. The comment is then called a delta awarded comment (DAC), and a thread containing a DAC is called a delta awarded thread.

[3] https://www.reddit.com

[4] https://www.reddit.com/r/changemyview

Statistic                      Value
Thread #                       1,785
Comment #                      374,472
Comment # / Thread #           209.79
Author #                       32,639
Unique Author # / Thread #     70.67
Delta Awarded Thread #         886 (49.6%)
Delta Awarded Comment #        2,056 (0.5%)

Table 1: Statistics of the CMV dataset.

We use a corpus collected from CMV [5]. The original corpus contains all the threads published between Jan. 2014 and Jan. 2015. We kept the threads with more than 100 comments to form our experimental dataset [6]. The basic statistics of the dataset are given in Table 1.

Figure 1a shows the distribution of the karma scores in the dataset. The karma score is highly skewed, similar to what is reported by Jaech et al. (2015): 42% of comments obtain a karma score of exactly one (i.e., no votes beyond the author), and around 15% of comments have a score less than one. Figures 1b and 1c show the correlation of the karma score with two metadata features, entry order and author reputation [7], respectively. The karma score of a comment is highly related to its entry order (Figure 1b). In general, the earlier a comment is posted, the higher the karma score it obtains. The average score is less than one when a comment is posted after 30 comments. Figure 1c shows that authors of comments with higher karma scores tend to have higher reputation on average.

2.2 Task

Tan et al. (2016) explored the task of mind change by focusing on delta awarded comments using their CMV data. However, the percentage of delta awarded comments is quite low: as shown in Table 1, only 0.5% of comments obtained a delta. In addition, a persuasive comment is not necessarily delta awarded. It can be of high quality without changing other people's minds. Our research thus uses the karma score of a comment, instead of delta, as the reference for the persuasiveness of the comment. Our analysis also shows that delta awarded comments generally have high karma scores (78.7% of DACs obtain a karma score higher than the median value in their delta awarded thread), indicating that the karma score is correlated with the delta value.

[5] The data was shared with us by researchers at the University of Washington.

[6] Please contact the authors about sharing the dataset.

[7] This is the number of deltas the author has received.

Using karma scores as ground truth, Jaech et al. (2015) proposed a comment ranking task on several sub-forums of Reddit. In order to reduce the impact of timing, they rank each set of 10 consecutive comments. However, their setting is not suitable for our task. First, at the later stage of the discussion, comments posted consecutively in time can belong to different sub-trees of the discussion, and thus can be viewed or reacted to very differently. Second, as shown in Figure 1b, comments entered at a later stage obtain little attention from the audience. This makes their karma scores less reliable as the ground truth of persuasiveness.

To further control the factor of timing, we define the task as ranking the first-N comments in each thread. The final karma scores of these N comments are used to determine their reference rank for evaluation. We study two setups for this ranking task. In the first, we use information only up to the time point when the thread contains just these N comments. In the second, we allow the system to access more than N comments. Our goal is to investigate whether we can predict if a comment is persuasive and how the community will react to a comment in the future.
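The task definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Comment` record and its field names (`comment_id`, `created_utc`, `karma`) are hypothetical and not taken from the paper's data format, and ties in karma are left in entry order because the text does not say how they are broken.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    comment_id: str      # hypothetical identifier
    created_utc: float   # posting time (hypothetical field name)
    karma: int           # final karma score of the comment


def first_n(comments, n=10):
    """Return the first n comments of a thread in order of posting time."""
    return sorted(comments, key=lambda c: c.created_utc)[:n]


def reference_ranking(comments, n=10):
    """Rank the first n comments by final karma, highest first.

    Python's sort is stable, so comments with equal karma stay in entry order.
    """
    return sorted(first_n(comments, n), key=lambda c: c.karma, reverse=True)
```

Note that a comment posted after the first N never enters the reference ranking, however high its final karma is.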

3 Methods

3.1 Ranking Model

A pair-wise learning-to-rank model (Ranking SVM (Joachims, 2002)) is used in our task. We first construct the training set consisting of pairs of comments. In each pair, the first comment is more persuasive than the second one. Considering that two samples with similar karma scores might not be significantly different in terms of their persuasiveness, we propose to use a modified score to form training pairs in order to improve the learning efficacy. We group comments into 7 buckets based on their karma scores: (−∞, 0], (0, 1], (1, 5], (5, 10], (10, 20], (20, 50] and (50, +∞). We then use the bucket number (0 to 6) of each comment as its modified score. We use all the formed pairs to train our ranker. To be consistent, we use the first-N comments in the training threads to construct the training samples when predicting the rank of the first-N comments in a test thread.

[Figure 1, three panels: (a) Karma score distribution (axes: karma score, percentage); (b) Karma score vs entry order (axes: entry order, karma score); (c) Karma score vs author reputation (axes: rank of karma score, author reputation).]

Figure 1: Karma value distributions in the CMV dataset.

Feature Category                  Feature Name              Feature Description
Surface Text Features             length                    # of words, sentences and paragraphs in c.
                                  url                       # of URLs contained in c.
                                  unique # of words         # of unique words in c.
                                  punctuation               # of punctuation marks in c.
                                  unique # of POS           # of unique POS tags in c.
Social Interaction Features       tree size                 The size of the tree generated by c and rc.
                                  reply num                 The number of replies obtained by c and rc.
                                  tree height               The height of the tree generated by c and rc.
                                  is root reply             Is c a root reply of the post?
                                  is leaf                   Is c a leaf of the tree generated by rc?
                                  location                  The position of c in the tree generated by rc.
Argumentation Related Features    connective words          Number of connective words in c.
                                  modal verbs               Number of modal verbs included in c.
                                  argumentative sentence    Number and percentage of argumentative sentences.
                                  argument relevance        Similarity with the original post and parent comment.
                                  argument originality      Maximum similarity with comments published earlier.

Table 2: Feature list (c: the comment; rc: the root comment of c).
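The bucketing and pair construction described in Section 3.1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and skipping pairs whose two comments fall into the same bucket is an assumption inferred from the requirement that the first comment in each pair be more persuasive than the second.

```python
import bisect
from itertools import combinations

# Upper bounds of the first six buckets:
# (-inf, 0], (0, 1], (1, 5], (5, 10], (10, 20], (20, 50]; bucket 6 is (50, +inf).
BUCKET_UPPER_BOUNDS = [0, 1, 5, 10, 20, 50]


def karma_bucket(karma):
    """Map a karma score to its bucket number (0 to 6)."""
    # bisect_left returns the index of the first bound >= karma,
    # which matches intervals that are closed on the right.
    return bisect.bisect_left(BUCKET_UPPER_BOUNDS, karma)


def training_pairs(comments):
    """Build (better, worse) pairs from a list of (features, karma) tuples.

    Pairs whose comments share a bucket are skipped (an assumption).
    """
    pairs = []
    for (xa, ka), (xb, kb) in combinations(comments, 2):
        ba, bb = karma_bucket(ka), karma_bucket(kb)
        if ba > bb:
            pairs.append((xa, xb))
        elif bb > ba:
            pairs.append((xb, xa))
    return pairs
```

With this scheme, two comments with karma 3 and 4 share bucket 2 and produce no training pair, while a comment with karma 60 is preferred over both.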

3.2 Features

We propose several key features that we hypothesize are predictive of persuasive comments. The full feature list is given in Table 2.

• Surface Text Features [8]: In order to capture the basic textual information, we use the comment length and content diversity represented as the number of words, POS tags, URLs, and punctuation marks. We also explored unigram features and named entity based features, but they did not improve system performance and are thus not included.

• Social Interaction Features: We hypothesize that if a comment attracts more social attention from the community, it is more likely to be persuasive. We therefore propose several social interaction features to capture the community reaction to a comment. Besides the reply tree generated by the comment, we also consider the reply tree generated by the root comment [9] for feature computation. The tree size is the number of comments in the reply tree. The position of c is its level in the reply tree (the level of the root node is zero).

• Argumentation Related Features: We believe a comment's argumentation quality is a good indicator of its persuasiveness. In order to capture the argumentation related information, we propose two sub-groups of features based on the comment itself and the interplay between the comment and other comments in the discussion. a) Local features: we trained a binary classifier to classify sentences as argumentative or non-argumentative using features proposed by Stab and Gurevych (2014). We then use the number and percentage of argumentative sentences predicted by the classifier as features. In addition, we include some features used in the classifier directly (i.e., the number of connective words [10] and modal verbs). b) Interactive features: for these features, we consider the similarity of a comment to its parent comment, the original post, and all the previously published comments. We use cosine similarity computed on term frequency vector representations. Intuitively, a comment needs to be relevant to the discussed topic and possibly have some original, convincing opinions or arguments to receive a high karma score.

[8] Stanford CoreNLP (Manning et al., 2014) was used to preprocess the text (i.e., comment splitting, sentence tokenization, POS tagging and named entity recognition).

[9] This is a comment that replies to the original post directly.

Approach            NDCG@1      NDCG@5      NDCG@10
random              0.258       0.440       0.564
author              0.382       0.567       0.664
entry-order         0.460       0.600       0.689
LTR_text            0.372       0.558       0.658
LTR_social          0.475†      0.650†      0.718†
LTR_arg             0.475†      0.652†      0.725†
LTR_text+social     0.494†      0.666†      0.733†
LTR_text+arg        0.485†      0.654†      0.729†
LTR_social+arg      0.502†‡     0.674†‡     0.740†
LTR_T+S+A           0.508†‡     0.676†‡     0.743†‡
LTR_all             0.521†‡     0.685†‡     0.752†‡

Table 3: Performance of first-10 comments ranking (T+S+A: the combination of the three sets of features we proposed; all: the combination of two metadata features and our features; bold: the best performance in each column; †: the approach is significantly better than both metadata baselines (p < 0.01); ‡: the approach is significantly better than LTR approaches using a single category of features (p < 0.01)).
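The interactive argumentation features (argument relevance and argument originality in Table 2) rely on cosine similarity over term frequency vectors. The sketch below shows one way to compute them with the standard library only. The regular-expression tokenizer is a stand-in assumption (the paper preprocesses with Stanford CoreNLP), and the function and key names are hypothetical.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Term frequency vector using a simple lowercase tokenizer (an assumption)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def interactive_features(comment, original_post, parent, earlier_comments):
    """Similarity of a comment to the post, its parent, and earlier comments."""
    c = tf_vector(comment)
    return {
        "sim_original_post": cosine(c, tf_vector(original_post)),
        "sim_parent": cosine(c, tf_vector(parent)),
        # Table 2 defines originality through the maximum similarity
        # with comments published earlier; 0.0 if there are none.
        "max_sim_earlier": max((cosine(c, tf_vector(e)) for e in earlier_comments), default=0.0),
    }
```

A high `max_sim_earlier` value suggests the comment repeats what was already said, so a ranker would be expected to learn a negative weight for it, although the paper does not report feature weights.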

4 Experimental Results

We use 5-fold cross-validation in our experiments. The normalized discounted cumulative gain (NDCG) score (Järvelin and Kekäläinen, 2000) is used as the evaluation metric for our first-N comments ranking task. In this study, N is 10.
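For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@k. It uses a common formulation with gain 2^rel − 1 and a log2 position discount. The paper does not state which gain and discount variant it uses, nor how karma is mapped to relevance, so both are assumptions here.

```python
import math


def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top k relevance values."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg_at_k(rels_in_predicted_order, k):
    """NDCG@k: DCG of the predicted order divided by DCG of the ideal order."""
    ideal = dcg_at_k(sorted(rels_in_predicted_order, reverse=True), k)
    return dcg_at_k(rels_in_predicted_order, k) / ideal if ideal > 0 else 0.0
```

Here `rels_in_predicted_order` lists the reference relevance of each comment in the order produced by the ranker. When several comments share the same nonzero relevance, any ordering among them yields the same score, so a thread with many ties is easy to rank well.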

4.1 Experiment I: Using N Comments for Ranking

Table 3 shows the results for first-10 comments ranking using information from only these 10 comments. As shown in Figure 1, the metadata features, entry order and author reputation, are correlated with the karma score of a comment. We thus use these two values as baselines. We also include the performance of a random baseline for comparison [11]. For our ranking based models (LTR_*), we compare using the three sets of features described in Section 3.2 (denoted as text, social and arg, respectively), individually or in combination. We report NDCG scores at positions 1, 5 and 10. Our main findings are as follows.

• Both metadata based baselines generate significantly [12] better results than the random baseline. The entry-order baseline performs much better than the author baseline, suggesting that the entry order is more indicative of the karma score of a comment.

• The surface text features are the least effective among the three sets of features, and the performance using them is even worse than that of the two metadata baselines. This might be because the general writing quality of the comments in CMV is high as a result of the forum's policy. Therefore, the surface text features we used are not very discriminative for comment ranking. A further analysis of the features in this category shows that length is the most effective feature.

• Argumentation based features have the best performance among the three categories. Their performance is significantly better than that of the surface text features, consistent with our expectation that argumentation related features are useful for persuasiveness evaluation. Our additional experiments show that interactive features are more effective than local features. This might be because the argumentation features and models we use are not perfect. Future research is still needed to better represent argumentation information in the text.

• When combining two categories of features, the performance of the ranker increases consistently. The performance can be further improved by combining all three categories of features we proposed (the improvement compared to using a single feature category is significant). The best results are achieved by LTR_all, i.e., combining the two metadata features and the features we proposed.

[10] We constructed a list of connective words including 55 entries (e.g., because, therefore, etc.).

[11] The performance of the random baseline is high because of ties in the reference karma scores.

[12] Significance is computed by a two-tailed t-test.

[Figure 2: plot of NDCG@10 against the number of comments in the thread (# of comments: 10 to 50) for LTR_text, LTR_social, LTR_arg and LTR_T+S+A.]

Figure 2: Results using various numbers of comments in the thread for ranking.

4.2 Experiment II: Using Varying Numbers of Comments for Ranking

As the discussion evolves, more comments join the thread, providing more information for the social interaction based features. In order to show the impact of different features at different discussion stages, we conduct another experiment by ranking the first-10 comments with varying numbers of comments in the test thread used for feature computation. The result of the experiment is shown in Figure 2. The performance of LTR_text and LTR_arg remains the same since their feature values are not affected by newly arriving comments. The performance of LTR_social increases consistently as the number of comments grows, and it outperforms LTR_arg when the number of comments is more than 20. LTR_T+S+A always has the best performance, benefiting from the combination of different types of features.
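To illustrate why only the social interaction features change in this experiment, the sketch below recomputes three reply-tree features from Table 2 (tree size, reply num, tree height) using only the comments visible so far. It is a simplified illustration with hypothetical names. Counting the root itself in the tree size and counting only direct replies in reply num are assumptions, since the paper does not spell out these details.

```python
from collections import defaultdict


def reply_tree_features(root_id, parent_of, visible_ids):
    """Reply-tree features of the comment root_id, restricted to visible comments.

    parent_of maps a comment id to the id of what it replies to
    (the original post for root replies); visible_ids is the set of
    comment ids available at this stage of the discussion.
    """
    children = defaultdict(list)
    for cid in visible_ids:
        parent = parent_of.get(cid)
        if parent in visible_ids:
            children[parent].append(cid)

    size, height = 0, 0
    stack = [(root_id, 0)]  # (node, level); the root is at level zero
    while stack:
        node, level = stack.pop()
        size += 1
        height = max(height, level)
        stack.extend((child, level + 1) for child in children[node])

    return {"tree_size": size, "reply_num": len(children[root_id]), "tree_height": height}
```

For example, if comment c1 has received one reply when two comments are visible and three more descendants later, its tree size grows from 2 to 4 while its text and argumentation features stay fixed.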

5 Related Work

Our work is most related to two lines of work: text quality evaluation and research on Reddit.com.

Text quality: Text quality and popularity evaluation has been studied in different domains in the past few years. Louis and Nenkova (2013) implemented features to capture aspects of great writing in the science journalism domain. Tan et al. (2014) looked into the effect of wording when predicting the popularity of social media content. Park et al. (2016) developed an interactive system to assist human moderators in selecting high quality news. Guerini et al. (2015) modeled a notion of euphony and explored the impact of sounds on different forms of persuasiveness. Their research focused on the phonetic aspect instead of language usage.

Reddit based research: Reddit has been used recently for research on social news analysis and recommendation (e.g., Buntain and Golbeck, 2014). Researchers have also analyzed the language use on Reddit. Jaech et al. (2015) studied how language use affects community reaction to comments on Reddit. Tan et al. (2016) analyzed the interaction dynamics and persuasion strategies in CMV.

6 Conclusion

In this paper, we studied the impact of different sets of features on the identification of persuasive comments in an online forum. Our experimental results show that argumentation based features work the best in the early stage of the discussion, while the effectiveness of social interaction based features increases as the number of comments in the thread grows.

There are three major future directions for this research. First, the approach to argument modeling in this paper is lexically based, which limits the effectiveness of argumentation related features for our task. It is thus crucial to study more effective ways of modeling arguments. Second, we will explore the persuasion behavior of argumentative comments and study the correlation between the strength of the argument and different persuasion behaviors. Third, we plan to automatically construct an argumentation corpus from CMV consisting of pairs of arguments from two opposite sides of a topic, and use it for automatic generation of disputing arguments.

Acknowledgments

We thank the anonymous reviewers for their detailed and insightful comments on this paper. The work is partially supported by DARPA Contract No. FA8750-13-2-0041 and AFOSR award No. FA9550-15-1-0346. Any opinions, findings, and conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of the funding agencies. We thank Trang Tran, Hao Fang and Mari Ostendorf at the University of Washington for sharing the Reddit data they collected.


References

Manning et al. (2014). The Stanford CoreNLP Natural Language Processing Toolkit.

Joachims (2002). Optimizing search engines using clickthrough data.

Järvelin and Kekäläinen (2000). IR evaluation methods for retrieving highly relevant documents.

Stab and Gurevych (2014). Identifying Argumentative Discourse Structures in Persuasive Essays.

Somasundaran and Wiebe (2010). Recognizing Stances in Ideological On-Line Debates.
