Proceedings ArticleDOI

Results of the WMT17 metrics shared task

01 Sep 2017 - pp. 489-513
TL;DR: This year, the WMT17 Metrics Shared Task builds upon two types of manual judgements: direct assessment (DA) and HUME manual semantic judgements.
Abstract: This paper presents the results of the WMT17 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT17 news translation task and Neural MT training task. We collected scores of 14 metrics from 8 research groups. In addition to that, we computed scores of 7 standard metrics (BLEU, SentBLEU, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric’s scores correlate with WMT17 official manual ranking of systems) and in terms of segment level correlation (how often a metric agrees with humans in judging the quality of a particular sentence). This year, we build upon two types of manual judgements: direct assessment (DA) and HUME manual semantic judgements.
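
As a rough illustration of the two evaluation levels described above (not the shared task's official scoring scripts, and with made-up numbers), system-level correlation compares a metric's one-score-per-system output against the human DA scores, while segment-level correlation checks per-sentence agreement:

```python
# Illustrative sketch (not the official WMT17 scoring scripts): comparing a
# metric's scores against human judgements at the system and segment level.
from scipy.stats import pearsonr, kendalltau

# Hypothetical data: one score per MT system from the metric and from human DA.
metric_system_scores = [0.312, 0.287, 0.341, 0.298]
human_da_scores      = [68.2, 64.5, 71.0, 66.1]

# System-level agreement: Pearson correlation between the two score vectors.
r, _ = pearsonr(metric_system_scores, human_da_scores)
print(f"system-level Pearson r = {r:.3f}")

# Segment-level agreement (one of several possible formulations): rank
# correlation between per-sentence metric scores and per-sentence human scores.
metric_segment_scores = [0.55, 0.12, 0.43, 0.80, 0.31]
human_segment_scores  = [72.0, 35.0, 51.0, 88.0, 40.0]
tau, _ = kendalltau(metric_segment_scores, human_segment_scores)
print(f"segment-level Kendall tau = {tau:.3f}")
```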


Citations
Posted Content
TL;DR: This work proposes BERTScore, an automatic evaluation metric for text generation that correlates better with human judgments and provides stronger model selection performance than existing metrics.
Abstract: We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.
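
A minimal sketch of the token-matching idea described in this abstract, assuming a Hugging Face `bert-base-uncased` encoder; it omits BERTScore's IDF weighting and baseline rescaling, so it is an approximation rather than the authors' reference implementation (released separately as the `bert-score` package):

```python
# Minimal sketch of BERTScore-style greedy matching over contextual embeddings.
# Simplified: no IDF weighting, no baseline rescaling, special tokens included.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    """Return L2-normalised contextual embeddings, one per token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

def bertscore_f1(candidate: str, reference: str) -> float:
    cand, ref = embed(candidate), embed(reference)
    sim = cand @ ref.T                                      # cosine similarity matrix
    recall = sim.max(dim=0).values.mean()                   # each reference token -> best candidate token
    precision = sim.max(dim=1).values.mean()                # each candidate token -> best reference token
    return (2 * precision * recall / (precision + recall)).item()

print(bertscore_f1("the cat sat on the mat", "a cat was sitting on the mat"))
```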

1,456 citations


Cites background or methods from "Results of the WMT17 metrics shared..."

  • ...Machine Translation We use the WMT17 metric evaluation dataset (Bojar et al., 2017), which contains translation systems outputs, gold reference translations, and two types of human judgment scores....

  • ...In machine translation, BERTSCORE correlates better with segment-level human judgment than existing metrics on the common WMT17 benchmark (Bojar et al., 2017), including outperforming metrics learned specifically for this dataset....

Proceedings ArticleDOI
01 Jul 2019
TL;DR: This article showed that M-BERT is surprisingly good at zero-shot cross-lingual model transfer, in which task-specific annotations in one language are used to fine-tune the model for evaluation in another language.
Abstract: In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2018) as a single language model pre-trained from monolingual corpora in 104 languages, is surprisingly good at zero-shot cross-lingual model transfer, in which task-specific annotations in one language are used to fine-tune the model for evaluation in another language. To understand why, we present a large number of probing experiments, showing that transfer is possible even to languages in different scripts, that transfer works best between typologically similar languages, that monolingual corpora can train models for code-switching, and that the model can find translation pairs. From these results, we can conclude that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs.
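
A schematic sketch of the zero-shot transfer setup the abstract describes, with toy sentiment examples standing in for the paper's probing tasks: fine-tune Multilingual BERT on labelled data in one language, then evaluate directly on another language with no target-language training data.

```python
# Schematic sketch of zero-shot cross-lingual transfer with Multilingual BERT.
# Data and labels are toy placeholders, not the probing tasks from the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Fine-tuning on annotations in one language (a single toy step, in English).
english_batch = tokenizer(["this film was wonderful", "this film was terrible"],
                          return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])
loss = model(**english_batch, labels=labels).loss
loss.backward()
optimizer.step()

# Zero-shot evaluation in another language: no target-language training data.
german_batch = tokenizer(["dieser Film war wunderbar"], return_tensors="pt")
with torch.no_grad():
    prediction = model(**german_batch).logits.argmax(dim=-1)
print(prediction)
```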

543 citations

Posted Content
TL;DR: BLEURT, a learned evaluation metric for English based on BERT, can model human judgment with a few thousand possibly biased training examples and yields superior results even when the training data is scarce and out-of-distribution.
Abstract: Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.
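
A schematic sketch of the learned-metric idea, assuming a generic BERT encoder with a regression head trained on (reference, candidate, human score) triples; BLEURT's distinguishing synthetic pre-training stage is omitted, so this illustrates only the fine-tuning step, not the released BLEURT code or checkpoints.

```python
# Schematic sketch of a learned metric: regress a human rating from a BERT
# encoding of the (reference, candidate) pair. BLEURT's synthetic pre-training
# stage is omitted; this is not the official implementation.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
regression_head = torch.nn.Linear(encoder.config.hidden_size, 1)
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(regression_head.parameters()), lr=2e-5)

# One toy training step on a (reference, candidate, human score) triple.
batch = tokenizer("the cat sat on the mat", "a cat was on the mat",
                  return_tensors="pt")
human_score = torch.tensor([[0.85]])            # e.g. a rescaled DA judgement
cls = encoder(**batch).last_hidden_state[:, 0]  # [CLS] representation of the pair
predicted = regression_head(cls)
loss = torch.nn.functional.mse_loss(predicted, human_score)
loss.backward()
optimizer.step()
print(float(predicted))
```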

465 citations


Cites background or methods from "Results of the WMT17 metrics shared..."

  • ...Although this approach is quite straightforward, we will show in Section 5 that it gives state-of-the-art results on WMT Metrics Shared Task 17-19, which makes it a high-performing evaluation metric....

  • ...First, we benchmark BLEURT against existing text generation metrics on the last 3 years of the WMT Metrics Shared Task (Bojar et al., 2017)....

  • ...The organizers managed to collect 15 adequacy scores for each translation, and thus the ratings are almost perfectly repeatable (Bojar et al., 2017). Results: Figure 2 presents BLEURT’s performance as we vary the train and test skew independently....

  • ...All the experiments that follow are based on the WMT Metrics Shared Task 2017, because the ratings for this edition are particularly reliable. Methodology: We create increasingly challenging datasets by sub-sampling the records from the WMT Metrics shared task, keeping low-rated translations for training and high-rated translations for test....

  • ...To illustrate, consider the WMT Metrics Shared Task, an annual benchmark in which translation metrics are compared on their ability to imitate human assessments....

Proceedings ArticleDOI
14 Aug 2019
TL;DR: This paper investigates strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality, and validates the new metric, namely MoverScore, on a number of text generation tasks.
Abstract: A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Our findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease of use we make our metrics available as a web service.
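
A rough sketch of the "contextualized representations plus a distance measure" recipe the abstract points to, using a relaxed Word Mover's Distance in which each token simply travels to its nearest counterpart; MoverScore itself solves an optimal-transport problem over IDF-weighted n-gram embeddings, so treat this as an approximation, not the published metric.

```python
# Rough sketch of an embedding-plus-distance metric in the spirit of MoverScore:
# a relaxed Word Mover's Distance over contextual embeddings. The real metric
# uses optimal transport with IDF-weighted n-gram embeddings; this is simplified.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0]         # (tokens, dim)

def relaxed_wmd(system_output: str, reference: str) -> float:
    sys_emb, ref_emb = embed(system_output), embed(reference)
    dist = torch.cdist(sys_emb, ref_emb)                     # pairwise Euclidean distances
    # Relaxed transport: each token moves to its closest counterpart, both directions.
    return 0.5 * (dist.min(dim=1).values.mean() + dist.min(dim=0).values.mean()).item()

print(relaxed_wmd("the cat sat on the mat", "a cat was sitting on the mat"))
```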

387 citations


Cites methods from "Results of the WMT17 metrics shared..."

  • ...Data We obtain the source language sentences, their system and reference translations from the WMT 2017 news translation shared task (Bojar et al., 2017)....

  • ...Other metrics include SentBLEU, NIST, chrF, TER, WER, PER, CDER, and METEOR (Lavie and Agarwal, 2007) that are used and described in the WMT metrics shared task (Bojar et al., 2017; Ma et al., 2018)....

References
Proceedings Article
01 Jan 2015
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Abstract: Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consist of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.
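
A minimal sketch of the (soft-)search mechanism the abstract describes, i.e. additive attention: alignment scores between the current decoder state and every encoder state are normalised into weights that form a context vector. Shapes and parameter names here are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of additive ("Bahdanau") attention: the decoder soft-searches
# over all encoder states instead of relying on one fixed-length vector.
import torch

hidden_dim, source_len = 256, 7
encoder_states = torch.randn(source_len, hidden_dim)   # one state per source token
decoder_state = torch.randn(hidden_dim)                # current decoder hidden state

W_enc = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
W_dec = torch.nn.Linear(hidden_dim, hidden_dim, bias=False)
v = torch.nn.Linear(hidden_dim, 1, bias=False)

# Alignment scores e_j = v^T tanh(W_dec s + W_enc h_j), one per source position.
scores = v(torch.tanh(W_dec(decoder_state) + W_enc(encoder_states))).squeeze(-1)
weights = torch.softmax(scores, dim=0)                 # soft alignment over the source
context = weights @ encoder_states                     # weighted sum fed to the decoder
print(weights, context.shape)
```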

20,027 citations

Proceedings ArticleDOI
24 Mar 2002
TL;DR: DARPA commissioned NIST to develop an MT evaluation facility based on the IBM work, which is now available from NIST and serves as the primary evaluation measure for TIDES MT research.
Abstract: Evaluation is recognized as an extremely helpful forcing function in Human Language Technology R&D. Unfortunately, evaluation has not been a very powerful tool in machine translation (MT) research because it requires human judgments and is thus expensive and time-consuming and not easily factored into the MT research agenda. However, at the July 2001 TIDES PI meeting in Philadelphia, IBM described an automatic MT evaluation technique that can provide immediate feedback and guidance in MT research. Their idea, which they call an "evaluation understudy", compares MT output with expert reference translations in terms of the statistics of short sequences of words (word N-grams). The more of these N-grams that a translation shares with the reference translations, the better the translation is judged to be. The idea is elegant in its simplicity. But far more important, IBM showed a strong correlation between these automatically generated scores and human judgments of translation quality. As a result, DARPA commissioned NIST to develop an MT evaluation facility based on the IBM work. This utility is now available from NIST and serves as the primary evaluation measure for TIDES MT research.
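
A toy illustration of the word n-gram statistic the abstract describes (modified n-gram precision, as in BLEU); the NIST utility additionally weights n-grams by their informativeness, and both BLEU and NIST combine these precisions with a brevity penalty, which is left out here.

```python
# Toy illustration of n-gram overlap scoring in the spirit of BLEU: count how
# many candidate n-grams also appear in the reference, with clipped counts.
# This is not the NIST tool, which also weights n-grams by informativeness.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_ngram_precision(candidate, reference, n):
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

candidate = "the cat is on the mat"
reference = "there is a cat on the mat"
for n in (1, 2, 3):
    print(n, modified_ngram_precision(candidate, reference, n))
```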

1,734 citations

Proceedings ArticleDOI
08 Jun 2006
TL;DR: This work evaluated machine translation performance for six European language pairs that participated in a shared task: translating French, German, Spanish texts to English and back.
Abstract: We evaluated machine translation performance for six European language pairs that participated in a shared task: translating French, German, Spanish texts to English and back. Evaluation was done automatically using the Bleu score and manually on fluency and adequacy.

299 citations

Proceedings Article
01 Aug 2013
TL;DR: UCCA is presented, a novel multi-layered framework for semantic representation that aims to accommodate the semantic distinctions expressed through linguistic utterances, and its relative insensitivity to meaning-preserving syntactic variation is demonstrated.
Abstract: Syntactic structures, by their nature, reflect first and foremost the formal constructions used for expressing meanings. This renders them sensitive to formal variation both within and across languages, and limits their value to semantic applications. We present UCCA, a novel multi-layered framework for semantic representation that aims to accommodate the semantic distinctions expressed through linguistic utterances. We demonstrate UCCA’s portability across domains and languages, and its relative insensitivity to meaning-preserving syntactic variation. We also show that UCCA can be effectively and quickly learned by annotators with no linguistic background, and describe the compilation of a UCCA-annotated corpus.

231 citations

Journal ArticleDOI
TL;DR: A new methodology for crowd-sourcing human assessments of translation quality is presented, which allows individual workers to develop their own individual assessment strategy and has a substantially increased ability to identify significant differences between translation systems.
Abstract: Crowd-sourced assessments of machine translation quality allow evaluations to be carried out cheaply and on a large scale. It is essential, however, that the crowd's work be filtered to avoid contamination of results through the inclusion of false assessments. One method is to filter via agreement with experts, but even amongst experts agreement levels may not be high. In this paper, we present a new methodology for crowd-sourcing human assessments of translation quality, which allows individual workers to develop their own individual assessment strategy. Agreement with experts is no longer required, and a worker is deemed reliable if they are consistent relative to their own previous work. Individual translations are assessed in isolation from all others in the form of direct estimates of translation quality. This allows more meaningful statistics to be computed for systems and enables significance to be determined on smaller sets of assessments. We demonstrate the methodology's feasibility in large-scale human evaluation through replication of the human evaluation component of Workshop on Statistical Machine Translation shared translation task for two language pairs, Spanish-to-English and English-to-Spanish. Results for measurement based solely on crowd-sourced assessments show system rankings in line with those of the original evaluation. Comparison of results produced by the relative preference approach and the direct estimate method described here demonstrate that the direct estimate method has a substantially increased ability to identify significant differences between translation systems.
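
A small sketch of one standard way to aggregate direct-assessment ratings of this kind: standardise each worker's raw 0-100 scores into per-worker z-scores so that differing rating habits become comparable, then average per system. The worker self-consistency filtering the paper introduces is not shown, and the data here are hypothetical.

```python
# Small sketch of handling direct-assessment (DA) ratings: standardise each
# worker's raw 0-100 scores into per-worker z-scores, then average per system.
# Toy data; the paper's quality control additionally filters workers by
# consistency against their own repeated items.
from collections import defaultdict
from statistics import mean, stdev

# (worker, system, raw 0-100 adequacy score) -- hypothetical ratings.
ratings = [("w1", "sysA", 80), ("w1", "sysB", 60), ("w1", "sysA", 90),
           ("w2", "sysA", 55), ("w2", "sysB", 35), ("w2", "sysB", 40)]

# Per-worker mean and standard deviation.
by_worker = defaultdict(list)
for worker, _, score in ratings:
    by_worker[worker].append(score)
stats = {w: (mean(s), stdev(s)) for w, s in by_worker.items()}

# Z-standardise each rating and aggregate per system.
by_system = defaultdict(list)
for worker, system, score in ratings:
    mu, sigma = stats[worker]
    by_system[system].append((score - mu) / sigma)

for system, scores in by_system.items():
    print(system, round(mean(scores), 3))
```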

174 citations