Proceedings Article

Findings of the 2013 Workshop on Statistical Machine Translation

TL;DR: The results of the WMT13 shared tasks are presented, including a translation task, a task for run-time estimation of machine translation quality, and an unofficial metrics task.
Abstract: We present the results of the WMT13 shared tasks, which included a translation task, a task for run-time estimation of machine translation quality, and an unofficial metrics task. This year, 143 machine translation systems from 23 institutions were submitted to the ten translation tasks. An additional 6 anonymized systems were included, and all systems were then evaluated both automatically and manually, in our largest manual evaluation to date. The quality estimation task had four subtasks, with a total of 14 teams submitting 55 entries.


Citations
Proceedings ArticleDOI
TL;DR: The STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017), providing insight into the limitations of existing models.
Abstract: Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).

1,124 citations


Cites methods from "Findings of the 2013 Workshop on St..."

  • ...We release one thousand new Spanish-English STS pairs sourced from the 2013 WMT translation task and produced by a phrase-based Moses SMT system (Bojar et al., 2013)....


Proceedings ArticleDOI
01 Jan 2017
TL;DR: The 2017 Semantic Textual Similarity (STS) shared task as discussed by the authors focused on multilingual and cross-lingual sentence pairs, with one sub-track exploring MT quality estimation (MTQE) data.
Abstract: Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).

929 citations

Proceedings ArticleDOI
01 Sep 2015
TL;DR: The proposed use of character n-gram F-score for automatic evaluation of machine translation output shows very promising results, especially for the CHRF3 score: for translation from English, this variant showed the highest segment-level correlations, outperforming even the best metrics on the WMT14 shared evaluation task.
Abstract: We propose the use of character n-gram F-score for automatic evaluation of machine translation output. Character n-grams have already been used as a part of more complex metrics, but their individual potential has not been investigated yet. We report system-level correlations with human rankings for 6-gram F1-score (CHRF) on the WMT12, WMT13 and WMT14 data, as well as segment-level correlation for 6-gram F1 (CHRF) and F3 scores (CHRF3) on WMT14 data for all available target languages. The results are very promising, especially for the CHRF3 score: for translation from English, this variant showed the highest segment-level correlations, outperforming even the best metrics on the WMT14 shared evaluation task.
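As a rough illustration of the metric this abstract describes, a character n-gram F-score can be sketched as below. This is a simplified sketch only: the function names are ours, spaces are treated as ordinary characters, and details such as word-boundary handling in the actual CHRF implementation may differ.

```python
from collections import Counter

def char_ngrams(text, n):
    # Multiset of character n-grams of a string (spaces kept, one simple choice).
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=1.0):
    # Average n-gram precision and recall over n = 1..max_n, then combine
    # with an F-beta score (beta=1 gives CHRF, beta=3 gives CHRF3).
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)
```

An identical hypothesis and reference score 1.0, and fully disjoint strings score 0.0; CHRF3 weights recall three times as heavily as precision via beta.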

743 citations


Cites methods from "Findings of the 2013 Workshop on St..."

  • ..., 2012), WMT13 (Bojar et al., 2013) and WMT14 (Bojar et al....


  • ...System-level correlations The evaluation metrics were compared with human rankings on the system-level by means of Spearman’s correlation coefficients ρ for the WMT12 and WMT13 data and Pearson’s correlation coefficients r for the WMT14 data....


  • ...The CHRF scores were calculated for all available translation outputs from the WMT12 (Callison-Burch et al., 2012), WMT13 (Bojar et al., 2013) and WMT14 (Bojar et al., 2014) shared tasks, and then compared with human rankings....

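The system-level comparison described in these excerpts relies on Spearman's ρ and Pearson's r. A minimal pure-Python sketch of both (our own names; no tie handling for the rank transform, and non-constant inputs are assumed):

```python
def pearson(xs, ys):
    # Pearson's r: covariance normalized by the product of standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    # Spearman's rho: Pearson's r computed on ranks (ties not handled here).
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    return pearson(ranks(xs), ranks(ys))
```

The practical difference is visible on nonlinear but monotone data: Spearman's ρ is 1.0 for any strictly increasing relationship, while Pearson's r rewards only linear agreement.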

Proceedings ArticleDOI
12 Aug 2016
TL;DR: The results of the WMT16 shared tasks are presented, which included five machine translation (MT) tasks (standard news, IT-domain, biomedical, multimodal, pronoun), three evaluation tasks (metrics, tuning, run-time estimation of MT quality), an automatic post-editing task, and a bilingual document alignment task.
Abstract: This paper presents the results of the WMT16 shared tasks, which included five machine translation (MT) tasks (standard news, IT-domain, biomedical, multimodal, pronoun), three evaluation tasks (metrics, tuning, run-time estimation of MT quality), an automatic post-editing task, and a bilingual document alignment task. This year, 102 MT systems from 24 institutions (plus 36 anonymized online systems) were submitted to the 12 translation directions in the news translation task. The IT-domain task received 31 submissions from 12 institutions in 7 directions, and the Biomedical task received 15 submissions from 5 institutions. Evaluation was both automatic and manual (relative ranking and 100-point scale assessments). The quality estimation task had three subtasks, with a total of 14 teams submitting 39 entries. The automatic post-editing task had a total of 6 teams submitting 11 entries.

616 citations


Cites background or methods from "Findings of the 2013 Workshop on St..."

  • ...This conference builds on nine previous WMT workshops (Koehn and Monz, 2006; Callison-Burch et al., 2007, 2008, 2009, 2010, 2011, 2012; Bojar et al., 2013, 2014, 2015)....


  • ...a trivial “all-BAD” baseline outperforms many real systems in terms of F1-BAD score (Bojar et al., 2013)....


  • ...…the WMT shared task on quality estimation (QE) of machine translation (MT) builds on the previous editions of the task (Callison-Burch et al., 2012; Bojar et al., 2013, 2014, 2015), with “traditional” tasks at sentence and word levels, a new task for entire documents quality prediction, and a…...

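The point about the trivial "all-BAD" baseline can be made concrete with a small sketch: when BAD labels dominate the gold standard, predicting BAD everywhere already gets recall 1.0 and high precision, hence a strong F1-BAD. Illustrative code only; the label names follow the task, everything else (function name, toy data) is ours.

```python
def f1_for_label(gold, pred, label="BAD"):
    # Precision, recall, and F1 treating `label` as the positive class.
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy gold standard where 80% of the words are BAD: the trivial baseline
# gets precision 0.8 and recall 1.0, i.e. F1-BAD ≈ 0.89.
gold = ["BAD"] * 8 + ["OK"] * 2
all_bad = ["BAD"] * len(gold)
```

This is why later QE editions moved toward metrics that are harder to game with a constant prediction.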

Proceedings ArticleDOI
01 Jun 2014
TL;DR: The results of the WMT14 shared tasks are presented, which included a standard news translation task, a separate medical translation task, a task for run-time estimation of machine translation quality, and a metrics task.
Abstract: This paper presents the results of the WMT14 shared tasks, which included a standard news translation task, a separate medical translation task, a task for run-time estimation of machine translation quality, and a metrics task. This year, 143 machine translation systems from 23 institutions were submitted to the ten translation directions in the standard translation task. An additional 6 anonymized systems were included, and were then evaluated both automatically and manually. The quality estimation task had four subtasks, with a total of 10 teams submitting 57 entries.

511 citations


Cites background or methods or result from "Findings of the 2013 Workshop on St..."

  • ...Compared to the results regarding time prediction in the Quality Evaluation shared task from 2013 (Bojar et al., 2013), we note that this time all submissions were able to beat the baseline system (compared to only 1/3 of the submissions in 2013)....


  • ...• Ranking: DeltaAvg (primary metric) (Bojar et al., 2013) and Spearman’s rank correlation....


  • ...(Bojar et al., 2013) has focused on prediction of automatically derived labels, generally due to practical considerations as the manual annotation is labour intensive....


  • ...This workshop builds on eight previous WMT workshops (Koehn and Monz, 2006; Callison-Burch et al., 2007, 2008, 2009, 2010, 2011, 2012; Bojar et al., 2013)....


  • ...It has proved robust across a range of language pairs, MT systems, and text domains for predicting various forms of post-editing effort (Callison-Burch et al., 2012; Bojar et al., 2013)....


References
Journal ArticleDOI
TL;DR: A general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies is presented; tests for interobserver bias are given in terms of first-order marginal homogeneity, and measures of interobserver agreement are developed as generalized kappa-type statistics.
Abstract: This paper presents a general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies. The procedure essentially involves the construction of functions of the observed proportions which are directed at the extent to which the observers agree among themselves and the construction of test statistics for hypotheses involving these functions. Tests for interobserver bias are presented in terms of first-order marginal homogeneity and measures of interobserver agreement are developed as generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis example from the epidemiological literature.

64,109 citations


"Findings of the 2013 Workshop on St..." refers background in this paper

  • ...The exact interpretation of the kappa coefficient is difficult, but according to Landis and Koch (1977), 0–0.2 is slight, 0.2–0.4 is fair, 0.4–0.6 is moderate, 0.6–0.8 is substantial, and 0.8–1.0 is almost perfect....

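The Landis and Koch (1977) bands quoted above can be written as a small lookup. An illustrative sketch only; the quote does not specify how negative kappa values are labeled, so they fall into the lowest band here.

```python
def interpret_kappa(kappa):
    # Qualitative bands for Cohen's kappa per Landis and Koch (1977):
    # 0-0.2 slight, 0.2-0.4 fair, 0.4-0.6 moderate,
    # 0.6-0.8 substantial, 0.8-1.0 almost perfect.
    bands = [(0.2, "slight"), (0.4, "fair"), (0.6, "moderate"),
             (0.8, "substantial"), (1.0, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"
```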

Book
01 Jan 1989
TL;DR: Hosmer and Lemeshow as discussed by the authors provide an accessible introduction to the logistic regression model while incorporating advances of the last decade, including a variety of software packages for the analysis of data sets.
Abstract: From the reviews of the First Edition: "An interesting, useful, and well-written book on logistic regression models... Hosmer and Lemeshow have used very little mathematics, have presented difficult concepts heuristically and through illustrative examples, and have included references." - Choice. "Well written, clearly organized, and comprehensive... the authors carefully walk the reader through the estimation and interpretation of coefficients from a wide variety of logistic regression models... their careful explication of the quantitative re-expression of coefficients from these various models is excellent." - Contemporary Sociology. "An extremely well-written book that will certainly prove an invaluable acquisition to the practicing statistician who finds other literature on analysis of discrete data hard to follow or heavily theoretical." - The Statistician. In this revised and updated edition of their popular book, David Hosmer and Stanley Lemeshow continue to provide an amazingly accessible introduction to the logistic regression model while incorporating advances of the last decade, including a variety of software packages for the analysis of data sets. Hosmer and Lemeshow extend the discussion from biostatistics and epidemiology to cutting-edge applications in data mining and machine learning, guiding readers step-by-step through the use of modeling techniques for dichotomous data in diverse fields. Ample new topics and expanded discussions of existing material are accompanied by a wealth of real-world examples, with extensive data sets available over the Internet.

35,847 citations

Journal ArticleDOI
Jacob Cohen
TL;DR: In this article, the author presents a procedure for having two or more judges independently categorize a sample of units and for determining the degree and significance of their agreement, since it is important to establish the extent to which such judgments are reproducible, i.e., reliable.
Abstract: CONSIDER Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of measurement obtainable is nominal scaling (Stevens, 1951, pp. 25-26), i.e. placement in a set of k unordered categories. Because the categorizing of the units is a consequence of some complex judgment process performed by a "two-legged meter" (Stevens, 1958), it becomes important to determine the extent to which these judgments are reproducible, i.e., reliable. The procedure which suggests itself is that of having two (or more) judges independently categorize a sample of units and determine the degree, significance, and

34,965 citations


"Findings of the 2013 Workshop on St..." refers methods in this paper


  • ...We measured pairwise agreement among annotators using Cohen’s kappa coefficient (κ) (Cohen, 1960), which is defined as κ = (P(A) − P(E)) / (1 − P(E)), where P(A) is the proportion of times that the annotators agree, and P(E) is the proportion of times that they would agree by chance....

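The kappa definition quoted in this context can be sketched directly. Illustrative Python only; here P(E) is estimated from each annotator's empirical label distribution, as in the standard two-annotator formulation.

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    # kappa = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed
    # agreement rate and P(E) the agreement expected by chance given
    # each annotator's marginal label frequencies.
    n = len(ann_a)
    p_a = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    counts_a, counts_b = Counter(ann_a), Counter(ann_b)
    p_e = sum((counts_a[label] / n) * (counts_b[label] / n)
              for label in counts_a)
    return (p_a - p_e) / (1 - p_e)
```

Perfect agreement yields kappa 1.0, while agreement no better than chance yields 0.0.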

Journal ArticleDOI
TL;DR: Applied Logistic Regression, Third Edition provides an easily accessible introduction to the logistic regression model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables.
Abstract: "A new edition of the definitive guide to logistic regression modeling for health science and other applications. This thoroughly expanded Third Edition provides an easily accessible introduction to the logistic regression (LR) model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables. Applied Logistic Regression, Third Edition emphasizes applications in the health sciences and handpicks topics that best suit the use of modern statistical software. The book provides readers with state-of-the-art techniques for building, interpreting, and assessing the performance of LR models. New and updated features include: a chapter on the analysis of correlated outcome data; a wealth of additional material for topics ranging from Bayesian methods to assessing model fit; rich data sets from real-world studies that demonstrate each method under discussion; and detailed examples and interpretation of the presented results, as well as exercises throughout. Applied Logistic Regression, Third Edition is a must-have guide for professionals and researchers who need to model nominal or ordinal scaled outcome variables in public health, medicine, and the social sciences as well as a wide range of other fields and disciplines."

30,190 citations


"Findings of the 2013 Workshop on St..." refers methods in this paper


  • ...For German-English, LogReg was trained with Stepwise Feature Selection (Hosmer, 1989) on two feature sets: Feature Set 24 includes basic counts augmented with PCFG parsing features (number of VPs, alternative parses, parse probability) on both source and target sentences (Avramidis et al., 2011), and pseudo-reference METEOR score; the most successful set, Feature Set 33, combines those 24 features with the 17 baseline features....


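The stepwise feature selection mentioned in this excerpt is, at its core, a greedy search over feature subsets. A schematic forward-selection sketch follows; this is not the authors' implementation, and score_fn is a stand-in for a model-fit criterion such as the log-likelihood of a logistic regression fitted on the candidate subset.

```python
def forward_stepwise(features, score_fn, min_gain=1e-3):
    # Greedy forward selection: repeatedly add the single feature that
    # most improves the score, stopping once no candidate improves it
    # by at least min_gain.
    selected = []
    best = score_fn([])
    remaining = list(features)
    while remaining:
        gain, feat = max((score_fn(selected + [f]) - best, f)
                         for f in remaining)
        if gain < min_gain:
            break
        selected.append(feat)
        remaining.remove(feat)
        best += gain
    return selected
```

Backward elimination works the same way in reverse, starting from the full set and dropping the least useful feature each round.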

Journal ArticleDOI
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

19,603 citations


"Findings of the 2013 Workshop on St..." refers methods in this paper

  • ...The prediction models were trained using four classifiers in the Weka toolkit (Hall et al., 2009): linear regression, M5P trees, multi-layer perceptron and SVM regression....

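Of the four regression models listed, the simplest, univariate linear regression, can be sketched by ordinary least squares. Illustrative only; the cited systems used Weka's multivariate implementations, and the function name here is ours.

```python
def fit_linear(xs, ys):
    # Ordinary least squares for y = w*x + b in one dimension:
    # w is the covariance of x and y divided by the variance of x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b
```

On exactly linear data the fit recovers the generating slope and intercept; the other three models (M5P trees, multi-layer perceptron, SVM regression) trade this closed form for the ability to capture nonlinear relationships.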