Overview of the CLEF-2018 CheckThat! Lab
on Automatic Identification and Verification
of Political Claims
Preslav Nakov¹, Alberto Barrón-Cedeño¹, Tamer Elsayed², Reem Suwaileh²,
Lluís Màrquez³, Wajdi Zaghouani⁴, Pepa Atanasova⁵, Spas Kyuchukov⁶,
and Giovanni Da San Martino¹

¹ Qatar Computing Research Institute, HBKU, Doha, Qatar
  {pnakov, albarron, gmartino}@qf.org.qa
² Computer Science and Engineering Department, Qatar University, Doha, Qatar
  {telsayed, reem.suwaileh}@qu.edu.qa
³ Amazon, Barcelona, Spain
  lluismv@amazon.com
⁴ College of Humanities and Social Sciences, HBKU, Doha, Qatar
  wzaghouani@hbku.edu.qa
⁵ SiteGround, Sofia, Bulgaria
  pepa.gencheva@siteground.com
⁶ Sofia University “St. Kliment Ohridski”, Sofia, Bulgaria
  spas.kyuchukov@gmail.com
Abstract. We present an overview of the CLEF-2018 CheckThat! Lab
on Automatic Identification and Verification of Political Claims. In its
starting year, the lab featured two tasks. Task 1 asked to predict which
(potential) claims in a political debate should be prioritized for fact-
checking; in particular, given a debate or a political speech, the goal was
to produce a ranked list of its sentences based on their worthiness for
fact-checking. Task 2 asked to assess whether a given check-worthy claim
made by a politician in the context of a debate/speech is factually true,
half-true, or false. We offered both tasks in English and in Arabic. In
terms of data, for both tasks, we focused on debates from the 2016 US
Presidential Campaign, as well as on some speeches during and after the
campaign (we also provided translations in Arabic), and we relied on
comments and factuality judgments from factcheck.org and snopes.com,
which we further refined manually. A total of 30 teams registered to
participate in the lab, and 9 of them actually submitted runs. The eval-
uation results show that the most successful approaches used various
neural networks (esp. for Task 1) and evidence retrieval from the Web
(esp. for Task 2). We release all datasets, the evaluation scripts, and the
submissions by the participants, which should enable further research in
both check-worthiness estimation and automatic claim verification.
Keywords: Computational journalism · Check-worthiness estimation ·
Fact-checking · Veracity
This is a post-peer-review, pre-copyedit version of an article published in
Experimental IR Meets Multilinguality, Multimodality, and Interaction. The
final authenticated version is available online at:
https://doi.org/10.1007/978-3-319-98932-7_32

1 Introduction
The current coverage of the political landscape in both the press and social
media has led to an unprecedented situation. Like never before, a statement in
an interview, a press release, a blog note, or a tweet can spread almost
instantaneously across the globe. This speed of proliferation has left little
time for double-checking claims against the facts, which has proven critical in
politics. For instance, the 2016 US Presidential Campaign was arguably
influenced by fake news in social media and by false claims. Indeed, some
politicians were quick to notice that when it comes to shaping public opinion,
facts were secondary, and that appealing to emotions and beliefs worked better.
It has even been proposed that this marked the dawn of a post-truth age.
As the problem became evident, a number of fact-checking initiatives have
started, led by organizations such as FactCheck⁷ and Snopes⁸ among many others.
Yet, this has proved to be a very demanding manual effort, which means that
only a relatively small number of claims could be fact-checked.⁹ This makes
it important to prioritize the claims that fact-checkers should consider first,
and then to help them discover the veracity of those claims.
The CheckThat! Lab at CLEF-2018 aims to help in that respect by promoting the
development of tools for computational journalism. Figure 1 illustrates the
fact-checking pipeline, which includes three steps: (i) check-worthiness
estimation, (ii) claim normalization, and (iii) fact-checking. The CheckThat!
Lab focuses on the first and the last of these steps, while taking for granted
(and thus excluding) the intermediate claim normalization step.
Fig. 1. The general fact-checking pipeline. First, the input document is analyzed to
identify sentences containing check-worthy claims, then these claims are extracted and
normalized, and finally they are fact-checked.
⁷ http://www.factcheck.org
⁸ http://www.snopes.com
⁹ Fully automating the process of fact-checking is not yet a viable alternative,
partly because of limitations of the existing technology, and partly due to low
trust in such methods by human users.
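
To make the pipeline of Fig. 1 concrete, the following minimal Python sketch wires the three steps together. The function names, the toy length-based scorer, and the placeholder label are illustrative only and are not part of any official CheckThat! release; the lab itself handles step (ii) manually.

# Minimal sketch of the pipeline in Fig. 1; function names and toy logic are
# illustrative only (the lab performs step (ii) manually).
from typing import List, Tuple


def estimate_check_worthiness(sentences: List[str]) -> List[Tuple[str, float]]:
    """Step (i), Task 1: score each sentence and return them ranked."""
    scored = [(s, float(len(s.split()))) for s in sentences]  # naive length-based scorer
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def normalize_claim(sentence: str) -> str:
    """Step (ii): claim normalization (provided manually in the lab, so a no-op here)."""
    return sentence


def fact_check(claim: str) -> str:
    """Step (iii), Task 2: predict TRUE / HALF-TRUE / FALSE (placeholder)."""
    return "HALF-TRUE"


if __name__ == "__main__":
    debate = [
        "We have the biggest tax cut in history.",
        "Thank you all for coming tonight.",
    ]
    for sentence, _score in estimate_check_worthiness(debate)[:1]:
        print(fact_check(normalize_claim(sentence)))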

Task 1 (Check-Worthiness) aims to help fact-checkers prioritize their efforts.
In particular, it asks participants to build systems that can mimic the selection
strategies of a particular fact-checking organization: factcheck.org. The task is
defined as follows:
Given a transcription of a political debate/speech, predict
which claims should be prioritized for fact-checking.
Task 1 is a ranking task. The goal is to produce a ranked list of sentences,
ordered by their worthiness for fact-checking. Each of the identified claims then
becomes an input for the next step (after being manually normalized).
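
As an illustration of this ranking setup, the sketch below scores each sentence of a toy debate transcript and outputs the sentence indices in decreasing order of estimated check-worthiness. The numeric-token heuristic is only a stand-in for a trained model, and the output format is illustrative rather than the official submission format.

# Hedged sketch of Task 1 as ranking: score each sentence, emit most
# check-worthy first. The heuristic (numbers often signal factual claims)
# is a placeholder for a real model.
import re
from typing import List, Tuple


def score_sentence(sentence: str) -> float:
    """Toy check-worthiness score: count numeric tokens and percent signs."""
    return float(len(re.findall(r"\d+|%", sentence)))


def rank_debate(sentences: List[str]) -> List[Tuple[int, float]]:
    """Return (original line index, score) pairs, most check-worthy first."""
    scored = [(i, score_sentence(s)) for i, s in enumerate(sentences, start=1)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    debate = [
        "Good evening, and welcome to the debate.",
        "Under my plan, taxes will go down by 35 percent for the middle class.",
        "Unemployment is at 4.9%, the lowest in a decade.",
    ]
    for line_no, score in rank_debate(debate):
        print(line_no, score)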
Task 2 (Fact-Checking) focuses on tools intended to verify the factuality of
a check-worthy claim. The task is defined as follows:
Given a check-worthy claim in the form of a (tran-
scribed) sentence, determine whether the claim is likely
to be true, half-true, or false.
Task 2 is a classification task. The goal is to label each check-worthy claim
with an estimated/predicted veracity. Note that we provide the participants not
only with the normalized claim, but also with the original sentence from which it
originated, which is in turn given in the context of the entire debate/speech.
Thus, this is a novel task of fact-checking claims in context, an aspect that has
been largely ignored in previous research on fact-checking.
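
The sketch below illustrates this in-context setup: each instance bundles the normalized claim, the original sentence, and its surrounding debate context, and a system maps it to one of the three labels. The data structure and the placeholder predictor are hypothetical, not the lab's official input format.

# Hedged sketch of Task 2 as three-way classification over claims in context.
from dataclasses import dataclass
from typing import List

LABELS = ("TRUE", "HALF-TRUE", "FALSE")


@dataclass
class ClaimInstance:
    normalized_claim: str      # manually normalized claim (provided by the lab)
    original_sentence: str     # sentence as uttered in the debate/speech
    debate_context: List[str]  # surrounding sentences from the transcript


def predict_veracity(instance: ClaimInstance) -> str:
    """Placeholder predictor; a real system would retrieve evidence from the
    Web and reason over it together with the debate context."""
    return "HALF-TRUE"


if __name__ == "__main__":
    example = ClaimInstance(
        normalized_claim="Unemployment is at its lowest level in a decade.",
        original_sentence="And unemployment? Lowest in a decade, folks.",
        debate_context=["Let's talk about the economy.",
                        "And unemployment? Lowest in a decade, folks."],
    )
    assert predict_veracity(example) in LABELS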
Note that the intermediate task of claim normalization is a challenging problem
that requires dealing with anaphora resolution, paraphrasing, and dialogue
analysis, and thus we decided to skip it and to provide participants with
readily normalized claims.
We produced data starting from professional fact-checking annotations of
debates and speeches from factcheck.org, thus creating CT-C-18, the CheckThat!
2018 corpus, which combines two sub-corpora: CT-CWC-18 to predict check-
worthiness, and CT-FCC-18 to assess the veracity of claims. We offered each
of the two tasks in two languages: English and Arabic. For Arabic, we hired
professional translators to translate the English data, and we also had a separate
Arabic-only part for Task 2, based on claims from snopes.com.
Nine teams participated in CheckThat! this year. The most successful systems
relied on supervised models using a variety of representations. We believe that
there is still large room for improvement, and thus we release the corpora, the
evaluation scripts, and the participants’ predictions, which should enable further
research in check-worthiness estimation and automatic claim verification.¹⁰
The remainder of the paper is organized as follows. Section 2 presents an
overview of related work. Section 3 describes the datasets. Section 4 discusses
Task 1 (check-worthiness) in detail, including the evaluation framework and the
setup, the approaches used by the participating teams, and the official results.
Section 5 provides similar detail for Task 2 (fact-checking). Finally, Section 6
discusses the lessons learned.
¹⁰ https://github.com/clef2018-factchecking

2 Related Work
Journalists, online users, and researchers are well aware of the proliferation of
false information, and topics such as credibility and fact-checking are becoming
increasingly important. For example, there was a 2016 special issue of the ACM
Transactions on Information Systems journal on Trust and Veracity of Information
in Social Media [20], and there is a Workshop on Fact Extraction and Verification
at EMNLP’2018. Moreover, there have been several related evaluation campaigns:
a SemEval-2017 shared task on Rumor Detection [6], an ongoing FEVER challenge on
Fact Extraction and VERification at EMNLP’2018, the present CLEF-2018 lab on
Automatic Identification and Verification of Claims in Political Debates, and an
upcoming task at SemEval-2019 on Fact-Checking in Community Question Answering
Forums.
Automatic fact-checking was envisioned in [25] as a multi-step process that
includes (i) identifying check-worthy statements [8, 13, 16], (ii) generating ques-
tions to be asked about these statements [18], (iii) retrieving relevant infor-
mation to create a knowledge base [24], and (iv) inferring the veracity of the
statements, e.g., using text analysis [5, 23] or external sources [18, 22].
The first work to target check-worthiness was the ClaimBuster system [14].
It was trained on data that was manually annotated by students, professors,
and journalists, where each sentence was annotated as non-factual, unimportant
factual, or check-worthy factual. The data consisted of transcripts of historical
US election debates covering the period from 1960 until 2012 for a total of 30
debates and 28,029 transcribed sentences. For each sentence, the speaker was
marked: candidate vs. moderator. ClaimBuster used an SVM classifier and a
variety of features, such as sentiment, TF.IDF word representations,
part-of-speech (POS) tags, and named entities. It produced a check-worthiness
ranking on the basis of the SVM prediction scores. The ClaimBuster system did not
try to mimic the check-worthiness decisions of any specific fact-checking
organization; yet, it was later evaluated against CNN and PolitiFact [15]. In
contrast, our dataset is based on actual annotations by a fact-checking
organization, and we freely release all data and associated scripts (while theirs
are not available).
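
As a rough, hedged approximation of such a feature-based check-worthiness ranker, the sketch below trains an SVM on TF.IDF word representations and ranks unseen sentences by the SVM decision score. It is not the original ClaimBuster code (which additionally used sentiment, POS, and named-entity features), and the tiny training set is purely illustrative.

# Approximation of a TF.IDF + SVM check-worthiness ranker; toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_sentences = [
    "We created 200,000 new jobs last month.",          # check-worthy factual
    "Crime in inner cities has doubled since 2014.",    # check-worthy factual
    "Thank you, it is wonderful to be here tonight.",   # non-factual
    "I believe our best days are ahead of us.",         # non-factual
]
train_labels = [1, 1, 0, 0]  # 1 = check-worthy, 0 = not

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_sentences, train_labels)

test_sentences = [
    "Good evening to everyone watching at home.",
    "My opponent voted to raise taxes 40 times.",
]
scores = model.decision_function(test_sentences)
ranking = sorted(zip(test_sentences, scores), key=lambda p: p[1], reverse=True)
for sentence, score in ranking:
    print(f"{score:+.3f}  {sentence}")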
More relevant to the setup of Task 1 of this lab is the work of [7], who focused
on debates from the 2016 US Presidential Campaign and used pre-existing
annotations from nine respected fact-checking organizations (PolitiFact,
FactCheck, ABC, CNN, NPR, NYT, Chicago Tribune, The Guardian, and Washington
Post): a total of four debates and 5,415 sentences. Besides most of the features
borrowed from ClaimBuster (together with sentiment, tense, and some other
features), their model pays special attention to the context of each sentence.
This includes whether it is part of a long intervention by one of the actors, and
even its position within such an intervention. The authors predicted both
(i) whether any of the fact-checking organizations would select the target
sentence and (ii) whether a specific one would select it.
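
A small sketch of such positional context features follows. The exact feature set of [7] differs; the feature names and the length threshold below are assumptions made for illustration.

# Illustrative positional context features for a sentence inside one speaker's
# intervention (a contiguous block of sentences by the same speaker).
from typing import Dict, List


def context_features(intervention: List[str], i: int,
                     long_threshold: int = 5) -> Dict[str, float]:
    """Features for the i-th (0-based) sentence of a speaker's intervention."""
    n = len(intervention)
    return {
        "is_long_intervention": float(n >= long_threshold),
        "relative_position": i / max(n - 1, 1),  # 0.0 = first, 1.0 = last sentence
        "intervention_length": float(n),
    }


# Example: the 3rd sentence of a 6-sentence answer by one candidate.
print(context_features(["s1", "s2", "s3", "s4", "s5", "s6"], i=2))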
In follow-up work, [16] developed ClaimRank, which can mimic the claim selection
strategies of each of the nine fact-checking organizations individually, as well
as of their union. Even though it is trained on English, ClaimRank also supports
Arabic, which is achieved via cross-language English-Arabic embeddings.

We follow a similar setup for Task 1, but we manually verify the selected
sentences, e.g., to adjust the boundaries of the check-worthy claim, and also to
include all instances of a selected check-worthy claim (as fact-checkers would
only comment on one instance of a claim). We further have a larger dataset, and
we have an Arabic version of the dataset; however, we are limited to a single
fact-checking organization.
The work of [21] also focused on the 2016 US election campaign, and they also
used data from nine fact-checking organizations (but a slightly different set
from the above). They used presidential debates (three presidential and one
vice-presidential) and primary debates (seven Republican and eight Democratic),
for a total of 21,700 sentences. Their setup asks systems to predict whether any
of the fact-checking sources would select the target sentence. They use a
boosting-like model in which separate SVMs focus on different clusters of the
dataset, and the final outcome is taken from the most confident classifier. The
features considered range from LDA topic modeling to POS tuples and bag-of-words
representations.
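
The sketch below is one possible reading of that "most confident classifier" scheme: cluster the training data, train one SVM per cluster, and at test time keep the prediction of the SVM with the largest absolute decision score. It is an interpretation for illustration, not the authors' actual code, and it assumes every cluster contains examples of both classes.

# One possible reading of the cluster-wise SVM ensemble described above.
# Assumes each cluster contains training examples of both classes.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def train_cluster_svms(texts, labels, n_clusters=2):
    """Cluster the training sentences and fit one SVM per cluster."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=0).fit_predict(X)
    svms = []
    for c in range(n_clusters):
        idx = np.where(cluster_ids == c)[0]
        svms.append(LinearSVC().fit(X[idx], np.asarray(labels)[idx]))
    return vectorizer, svms


def predict_most_confident(vectorizer, svms, sentence):
    """Return the label chosen by the most confident SVM and its score."""
    x = vectorizer.transform([sentence])
    scores = [svm.decision_function(x)[0] for svm in svms]
    best = int(np.argmax(np.abs(scores)))  # most confident SVM wins
    return int(scores[best] > 0), scores[best]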
The Fact Extraction and VERification corpus (FEVER) was released by the
EMNLP’2018 Workshop on Fact Extraction and Verification, targeting the
verification of information against textual sources. FEVER consists of 185,445
claims created by modifying a selection of sentences from Wikipedia, which were
later verified without knowledge of the sentences they were derived from. The
claims were classified by the annotators as supported, refuted, or lacking the
necessary details to make a decision.
There have also been several related shared tasks, such as the SemEval-2017
shared task on Rumor Detection [6], with a total of 5,599 annotated rumorous
tweets, and the upcoming SemEval-2019 task on Fact-Checking in Community
Question Answering Forums.
3 Corpora
We produced the corpus CT-C-18, which stands for CheckThat! 2018 corpus. It is
composed of CT-CWC-18 (the check-worthiness corpus) and CT-FCC-18 (the
fact-checking corpus). CT-C-18 includes transcripts of debates, together with
political speeches and isolated claims. Table 1 gives an overview.
The training sets for both tasks come from the first and the second presidential
debates and the vice-presidential debate in the 2016 US campaign. The labels for
both tasks were derived from manual journalist judgments published at
FactCheck.org. For Task 1, a claim was considered check-worthy if a journalist
had fact-checked it. For Task 2, the judgment of the journalist was adopted:
true, half-true, or false. We followed the same procedure for the texts in the
test set: two other debates and five speeches by D. Trump, which occurred after
he took office as president. It is worth noting that there are cases in which
the number of claims intended for the prediction of factuality is lower than the
reported number of check-worthy claims. The reason is that some claims were
formulated more than once in the debates and speeches and, whereas we

It was trained on data that was manually annotated by students, professors, and journalists, where each sentence was annotated as non-factual, unimportant factual, or check-worthy factual.