Overview of the CLEF-2018 CheckThat! Lab
on Automatic Identification and Verification
of Political Claims
Preslav Nakov¹, Alberto Barrón-Cedeño¹, Tamer Elsayed², Reem Suwaileh²,
Lluís Màrquez³, Wajdi Zaghouani⁴, Pepa Atanasova⁵, Spas Kyuchukov⁶,
and Giovanni Da San Martino¹

¹ Qatar Computing Research Institute, HBKU, Doha, Qatar
  {pnakov, albarron, gmartino}@qf.org.qa
² Computer Science and Engineering Department, Qatar University, Doha, Qatar
  {telsayed, reem.suwaileh}@qu.edu.qa
³ Amazon, Barcelona, Spain
  lluismv@amazon.com
⁴ College of Humanities and Social Sciences, HBKU, Doha, Qatar
  wzaghouani@hbku.edu.qa
⁵ SiteGround, Sofia, Bulgaria
  pepa.gencheva@siteground.com
⁶ Sofia University “St. Kliment Ohridski”, Sofia, Bulgaria
  spas.kyuchukov@gmail.com
Abstract. We present an overview of the CLEF-2018 CheckThat! Lab
on Automatic Identification and Verification of Political Claims. In its
starting year, the lab featured two tasks. Task 1 asked to predict which
(potential) claims in a political debate should be prioritized for fact-
checking; in particular, given a debate or a political speech, the goal was
to produce a ranked list of its sentences based on their worthiness for
fact-checking. Task 2 asked to assess whether a given check-worthy claim
made by a politician in the context of a debate/speech is factually true,
half-true, or false. We offered both tasks in English and in Arabic. In
terms of data, for both tasks, we focused on debates from the 2016 US
Presidential Campaign, as well as on some speeches during and after the
campaign (we also provided translations in Arabic), and we relied on
comments and factuality judgments from factcheck.org and snopes.com,
which we further refined manually. A total of 30 teams registered to
participate in the lab, and 9 of them actually submitted runs. The eval-
uation results show that the most successful approaches used various
neural networks (esp. for Task 1) and evidence retrieval from the Web
(esp. for Task 2). We release all datasets, the evaluation scripts, and the
submissions by the participants, which should enable further research in
both check-worthiness estimation and automatic claim verification.
Keywords: Computational journalism · Check-worthiness estimation ·
Fact-checking · Veracity
This is a post-peer-review, pre-copyedit version of an article published in
Experimental IR Meets Multilinguality, Multimodality, and Interaction. The
final authenticated version is available online at:
https://doi.org/10.1007/978-3-319-98932-7_32

1 Introduction
The current coverage of the political landscape in both the press and social
media has led to an unprecedented situation. Like never before, a statement in
an interview, a press release, a blog note, or a tweet can spread almost
instantaneously across the globe. This speed of proliferation has left little
time for double-checking claims against the facts, which has proven critical in
politics. For instance, the 2016 US Presidential Campaign was arguably
influenced by fake news in social media and by false claims. Indeed, some
politicians were quick to notice that when it comes to shaping public opinion,
facts were secondary, and that appealing to emotions and beliefs worked better.
It has even been proposed that this marked the dawn of a post-truth age.
As the problem became evident, a number of fact-checking initiatives have
started, led by organizations such as FactCheck⁷ and Snopes⁸ among many others.
Yet, this has proved to be a very demanding manual effort, which means that
only a relatively small number of claims could be fact-checked.⁹ This makes
it important to prioritize the claims that fact-checkers should consider first,
and then to help them discover the veracity of those claims.
The CheckThat! Lab at CLEF-2018 aims to help in that respect by promoting the
development of tools for computational journalism. Figure 1 illustrates the
fact-checking pipeline, which includes three steps: (i) check-worthiness
estimation, (ii) claim normalization, and (iii) fact-checking. The CheckThat!
Lab focuses on the first and the last of these steps, while taking for granted
(and thus excluding) the intermediate claim normalization step.
Fig. 1. The general fact-checking pipeline. First, the input document is analyzed to
identify sentences containing check-worthy claims, then these claims are extracted and
normalized, and finally they are fact-checked.
⁷ http://www.factcheck.org
⁸ http://www.snopes.com
⁹ Fully automating the process of fact-checking is not yet a viable alternative,
partly because of limitations of the existing technology, and partly due to low
trust in such methods by human users.
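
To make the pipeline of Fig. 1 concrete, the following minimal Python sketch wires the three steps together. The function names, the toy length-based scorer, and the placeholder label are illustrative only and are not part of any official CheckThat! release; the lab itself handles step (ii) manually.

# Minimal sketch of the pipeline in Fig. 1; function names and toy logic are
# illustrative only (the lab performs step (ii) manually).
from typing import List, Tuple


def estimate_check_worthiness(sentences: List[str]) -> List[Tuple[str, float]]:
    """Step (i), Task 1: score each sentence and return them ranked."""
    scored = [(s, float(len(s.split()))) for s in sentences]  # naive length-based scorer
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def normalize_claim(sentence: str) -> str:
    """Step (ii): claim normalization (provided manually in the lab, so a no-op here)."""
    return sentence


def fact_check(claim: str) -> str:
    """Step (iii), Task 2: predict TRUE / HALF-TRUE / FALSE (placeholder)."""
    return "HALF-TRUE"


if __name__ == "__main__":
    debate = [
        "We have the biggest tax cut in history.",
        "Thank you all for coming tonight.",
    ]
    for sentence, _score in estimate_check_worthiness(debate)[:1]:
        print(fact_check(normalize_claim(sentence)))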

Task 1 (Check-Worthiness) aims to help fact-checkers prioritize their efforts.
In particular, it asks participants to build systems that can mimic the selection
strategies of a particular fact-checking organization: factcheck.org. The task is
defined as follows:
Given a transcription of a political debate/speech, predict
which claims should be prioritized for fact-checking.
Task 1 is a ranking task. The goal is to produce a ranked list of sentences,
ordered by their worthiness for fact-checking. Each of the identified claims then
becomes an input for the next step (after being manually normalized).
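
As an illustration of this ranking setup, the sketch below scores each sentence of a toy debate transcript and outputs the sentence indices in decreasing order of estimated check-worthiness. The numeric-token heuristic is only a stand-in for a trained model, and the output format is illustrative rather than the official submission format.

# Hedged sketch of Task 1 as ranking: score each sentence, emit most
# check-worthy first. The heuristic (numbers often signal factual claims)
# is a placeholder for a real model.
import re
from typing import List, Tuple


def score_sentence(sentence: str) -> float:
    """Toy check-worthiness score: count numeric tokens and percent signs."""
    return float(len(re.findall(r"\d+|%", sentence)))


def rank_debate(sentences: List[str]) -> List[Tuple[int, float]]:
    """Return (original line index, score) pairs, most check-worthy first."""
    scored = [(i, score_sentence(s)) for i, s in enumerate(sentences, start=1)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    debate = [
        "Good evening, and welcome to the debate.",
        "Under my plan, taxes will go down by 35 percent for the middle class.",
        "Unemployment is at 4.9%, the lowest in a decade.",
    ]
    for line_no, score in rank_debate(debate):
        print(line_no, score)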
Task 2 (Fact-Checking) focuses on tools intended to verify the factuality of
a check-worthy claim. The task is defined as follows:
Given a check-worthy claim in the form of a (tran-
scribed) sentence, determine whether the claim is likely
to be true, half-true, or false.
Task 2 is a classification task. The goal is to label each check-worthy claim
with an estimated/predicted veracity. Note that we provide the participants not
only with the normalized claim, but also with the original sentence from which it
originated, which is in turn given in the context of the entire debate/speech.
Thus, this is a novel task of fact-checking claims in context, an aspect that has
been largely ignored in previous research on fact-checking.
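
The sketch below illustrates this in-context setup: each instance bundles the normalized claim, the original sentence, and its surrounding debate context, and a system maps it to one of the three labels. The data structure and the placeholder predictor are hypothetical, not the lab's official input format.

# Hedged sketch of Task 2 as three-way classification over claims in context.
from dataclasses import dataclass
from typing import List

LABELS = ("TRUE", "HALF-TRUE", "FALSE")


@dataclass
class ClaimInstance:
    normalized_claim: str      # manually normalized claim (provided by the lab)
    original_sentence: str     # sentence as uttered in the debate/speech
    debate_context: List[str]  # surrounding sentences from the transcript


def predict_veracity(instance: ClaimInstance) -> str:
    """Placeholder predictor; a real system would retrieve evidence from the
    Web and reason over it together with the debate context."""
    return "HALF-TRUE"


if __name__ == "__main__":
    example = ClaimInstance(
        normalized_claim="Unemployment is at its lowest level in a decade.",
        original_sentence="And unemployment? Lowest in a decade, folks.",
        debate_context=["Let's talk about the economy.",
                        "And unemployment? Lowest in a decade, folks."],
    )
    assert predict_veracity(example) in LABELS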
Note that the intermediate task of claim normalization is a challenging problem
that requires dealing with anaphora resolution, paraphrasing, and dialogue
analysis, and thus we decided to skip it and to provide participants with
readily normalized claims.
We produced data starting from professional fact-checking annotations of
debates and speeches from factcheck.org, thus creating CT-C-18, the CheckThat!
2018 corpus, which combines two sub-corpora: CT-CWC-18 to predict check-
worthiness, and CT-FCC-18 to assess the veracity of claims. We offered each
of the two tasks in two languages: English and Arabic. For Arabic, we hired
professional translators to translate the English data, and we also had a separate
Arabic-only part for Task 2, based on claims from snopes.com.
Nine teams participated in CheckThat! this year. The most successful systems
relied on supervised models using a variety of representations. We believe that
there is still large room for improvement, and thus we release the corpora, the
evaluation scripts, and the participants’ predictions, which should enable further
research in check-worthiness estimation and automatic claim verification.¹⁰
The remainder of the paper is organized as follows. Section 2 presents an
overview of related work. Section 3 describes the datasets. Section 4 discusses
Task 1 (check-worthiness) in detail, including the evaluation framework and the
setup, the approaches used by the participating teams, and the official results.
Section 5 provides similar detail for Task 2 (fact-checking). Finally, Section 6
discusses the lessons learned.
¹⁰ https://github.com/clef2018-factchecking

2 Related Work
Journalists, online users, and researchers are well aware of the proliferation of
false information, and topics such as credibility and fact-checking are becoming
increasingly important. For example, there was a 2016 special issue of the ACM
Transactions on Information Systems journal on Trust and Veracity of Information
in Social Media [20], and there is a Workshop on Fact Extraction and Verification
at EMNLP’2018. Moreover, there have been several related evaluation campaigns:
a SemEval-2017 shared task on Rumor Detection [6], an ongoing FEVER challenge on
Fact Extraction and VERification at EMNLP’2018, the present CLEF-2018 lab on
Automatic Identification and Verification of Claims in Political Debates, and an
upcoming task at SemEval-2019 on Fact-Checking in Community Question Answering
Forums.
Automatic fact-checking was envisioned in [25] as a multi-step process that
includes (i) identifying check-worthy statements [8, 13, 16], (ii) generating ques-
tions to be asked about these statements [18], (iii) retrieving relevant infor-
mation to create a knowledge base [24], and (iv) inferring the veracity of the
statements, e.g., using text analysis [5, 23] or external sources [18, 22].
The first work to target check-worthiness was the ClaimBuster system [14].
It was trained on data that was manually annotated by students, professors,
and journalists, where each sentence was annotated as non-factual, unimportant
factual, or check-worthy factual. The data consisted of transcripts of historical
US election debates covering the period from 1960 until 2012 for a total of 30
debates and 28,029 transcribed sentences. For each sentence, the speaker was
marked: candidate vs. moderator. ClaimBuster used an SVM classifier and a
variety of features, such as sentiment, TF.IDF word representations,
part-of-speech (POS) tags, and named entities. It produced a check-worthiness
ranking on the basis of the SVM prediction scores. The ClaimBuster system did not
try to mimic the check-worthiness decisions of any specific fact-checking
organization; yet, it was later evaluated against CNN and PolitiFact [15]. In
contrast, our dataset is based on actual annotations by a fact-checking
organization, and we freely release all data and associated scripts (while theirs
are not available).
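
As a rough, hedged approximation of such a feature-based check-worthiness ranker, the sketch below trains an SVM on TF.IDF word representations and ranks unseen sentences by the SVM decision score. It is not the original ClaimBuster code (which additionally used sentiment, POS, and named-entity features), and the tiny training set is purely illustrative.

# Approximation of a TF.IDF + SVM check-worthiness ranker; toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_sentences = [
    "We created 200,000 new jobs last month.",          # check-worthy factual
    "Crime in inner cities has doubled since 2014.",    # check-worthy factual
    "Thank you, it is wonderful to be here tonight.",   # non-factual
    "I believe our best days are ahead of us.",         # non-factual
]
train_labels = [1, 1, 0, 0]  # 1 = check-worthy, 0 = not

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_sentences, train_labels)

test_sentences = [
    "Good evening to everyone watching at home.",
    "My opponent voted to raise taxes 40 times.",
]
scores = model.decision_function(test_sentences)
ranking = sorted(zip(test_sentences, scores), key=lambda p: p[1], reverse=True)
for sentence, score in ranking:
    print(f"{score:+.3f}  {sentence}")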
More relevant to the setup of Task 1 of this lab is the work of [7], who focused
on debates from the 2016 US Presidential Campaign and used pre-existing
annotations from nine respected fact-checking organizations (PolitiFact,
FactCheck, ABC, CNN, NPR, NYT, Chicago Tribune, The Guardian, and Washington
Post): a total of four debates and 5,415 sentences. Besides most of the features
borrowed from ClaimBuster (together with sentiment, tense, and some other
features), their model pays special attention to the context of each sentence.
This includes whether it is part of a long intervention by one of the actors, and
even its position within such an intervention. The authors predicted both
(i) whether any of the fact-checking organizations would select the target
sentence and (ii) whether a specific one would select it.
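
A small sketch of such positional context features follows. The exact feature set of [7] differs; the feature names and the length threshold below are assumptions made for illustration.

# Illustrative positional context features for a sentence inside one speaker's
# intervention (a contiguous block of sentences by the same speaker).
from typing import Dict, List


def context_features(intervention: List[str], i: int,
                     long_threshold: int = 5) -> Dict[str, float]:
    """Features for the i-th (0-based) sentence of a speaker's intervention."""
    n = len(intervention)
    return {
        "is_long_intervention": float(n >= long_threshold),
        "relative_position": i / max(n - 1, 1),  # 0.0 = first, 1.0 = last sentence
        "intervention_length": float(n),
    }


# Example: the 3rd sentence of a 6-sentence answer by one candidate.
print(context_features(["s1", "s2", "s3", "s4", "s5", "s6"], i=2))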
In follow-up work, [16] developed ClaimRank, which can mimic the claim selection
strategies of each of the nine fact-checking organizations individually, as well
as of their union. Even though it is trained on English, ClaimRank also supports
Arabic, which is achieved via cross-language English-Arabic embeddings.

We follow a similar setup for Task 1, but we manually verify the selected
sentences, e.g., to adjust the boundaries of the check-worthy claim, and also to
include all instances of a selected check-worthy claim (as fact-checkers would
only comment on one instance of a claim). We further have a larger dataset, and
we have an Arabic version of the dataset; however, we are limited to a single
fact-checking organization.
The work of [21] also focused on the 2016 US election campaign, and they also
used data from nine fact-checking organizations (but a slightly different set
from the above). They used presidential debates (three presidential and one
vice-presidential) and primary debates (seven Republican and eight Democratic),
for a total of 21,700 sentences. Their setup asks systems to predict whether any
of the fact-checking sources would select the target sentence. They use a
boosting-like model in which separate SVMs focus on different clusters of the
dataset, and the final outcome is taken from the most confident classifier. The
features considered range from LDA topic modeling to POS tuples and bag-of-words
representations.
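
The sketch below is one possible reading of that "most confident classifier" scheme: cluster the training data, train one SVM per cluster, and at test time keep the prediction of the SVM with the largest absolute decision score. It is an interpretation for illustration, not the authors' actual code, and it assumes every cluster contains examples of both classes.

# One possible reading of the cluster-wise SVM ensemble described above.
# Assumes each cluster contains training examples of both classes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def train_cluster_svms(texts, labels, n_clusters=2):
    """Cluster the training sentences and fit one SVM per cluster."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=0).fit_predict(X)
    svms = []
    for c in range(n_clusters):
        idx = np.where(cluster_ids == c)[0]
        svms.append(LinearSVC().fit(X[idx], np.asarray(labels)[idx]))
    return vectorizer, svms


def predict_most_confident(vectorizer, svms, sentence):
    """Return the label chosen by the most confident SVM and its score."""
    x = vectorizer.transform([sentence])
    scores = [svm.decision_function(x)[0] for svm in svms]
    best = int(np.argmax(np.abs(scores)))  # most confident SVM wins
    return int(scores[best] > 0), scores[best]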
The Fact Extraction and VERification corpus (FEVER) was released by the
EMNLP’2018 Workshop on Fact Extraction and Verification, targeting the
verification of information against textual sources. FEVER consists of 185,445
claims created by modifying a selection of sentences from Wikipedia, which were
later verified without knowledge of the sentences they were derived from. The
claims were classified by the annotators as supported, refuted, or lacking the
necessary details to make a decision.
There have also been several related shared tasks, such as the SemEval-2017
shared task on Rumor Detection [6], with a total of 5,599 annotated rumorous
tweets, and the upcoming SemEval-2019 task on Fact-Checking in Community
Question Answering Forums.
3 Corpora
We produced the corpus CT-C-18, which stands for CheckThat! 2018 corpus. It is
composed of CT-CWC-18 (the check-worthiness corpus) and CT-FCC-18 (the
fact-checking corpus). CT-C-18 includes transcripts of debates, together with
political speeches and isolated claims. Table 1 gives an overview.
The training sets for both tasks come from the first and the second presidential
debates and the vice-presidential debate in the 2016 US campaign. The labels for
both tasks were derived from manual journalist judgments published at
FactCheck.org. For Task 1, a claim was considered check-worthy if a journalist
had fact-checked it. For Task 2, the judgment of the journalist was adopted:
true, half-true, or false. We followed the same procedure for the texts in the
test set: two other debates and five speeches by D. Trump, which occurred after
he took office as president. It is worth noting that there are cases in which
the number of claims intended for the prediction of factuality is lower than the
reported number of check-worthy claims. The reason is that some claims were
formulated more than once in the debates and speeches and, whereas we

It was trained on data that was manually annotated by students, professors, and journalists, where each sentence was annotated as non-factual, unimportant factual, or check-worthy factual.