Proceedings Article

Findings of the 2013 Workshop on Statistical Machine Translation

TL;DR: The results of the WMT13 shared tasks are presented, including a translation task, a task for run-time estimation of machine translation quality, and an unofficial metrics task.
Abstract: We present the results of the WMT13 shared tasks, which included a translation task, a task for run-time estimation of machine translation quality, and an unofficial metrics task. This year, 143 machine translation systems from 23 institutions were submitted to the ten translation tasks. An additional 6 anonymized systems were included, and all systems were then evaluated both automatically and manually, in our largest manual evaluation to date. The quality estimation task had four subtasks, with a total of 14 teams submitting 55 entries.


Citations
Proceedings ArticleDOI
TL;DR: The STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017), providing insight into the limitations of existing models.
Abstract: Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).

1,124 citations


Cites methods from "Findings of the 2013 Workshop on St..."

  • ...We release one thousand new Spanish-English STS pairs sourced from the 2013 WMT translation task and produced by a phrase-based Moses SMT system (Bojar et al., 2013)....


Proceedings ArticleDOI
01 Jan 2017
TL;DR: The 2017 Semantic Textual Similarity (STS) shared task as discussed by the authors focused on multilingual and cross-lingual sentence pairs, with one sub-track exploring MT quality estimation (MTQE) data.
Abstract: Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).

929 citations

Proceedings ArticleDOI
01 Sep 2015
TL;DR: The proposed use of character n-gram F-score for automatic evaluation of machine translation output shows very promising results, especially for the CHRF3 score: for translation from English, this variant showed the highest segment-level correlations, outperforming even the best metrics on the WMT14 shared evaluation task.
Abstract: We propose the use of character n-gram F-score for automatic evaluation of machine translation output. Character n-grams have already been used as a part of more complex metrics, but their individual potential has not been investigated yet. We report system-level correlations with human rankings for 6-gram F1-score (CHRF) on the WMT12, WMT13 and WMT14 data, as well as segment-level correlation for 6-gram F1 (CHRF) and F3 scores (CHRF3) on WMT14 data for all available target languages. The results are very promising, especially for the CHRF3 score: for translation from English, this variant showed the highest segment-level correlations, outperforming even the best metrics on the WMT14 shared evaluation task.
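As a rough illustration of the metric this abstract describes, a character n-gram F-score can be sketched as below. This is a simplified sketch only: the function names are ours, spaces are treated as ordinary characters, and details such as word-boundary handling in the actual CHRF implementation may differ.

```python
from collections import Counter

def char_ngrams(text, n):
    # Multiset of character n-grams of a string (spaces kept, one simple choice).
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=1.0):
    # Average n-gram precision and recall over n = 1..max_n, then combine
    # with an F-beta score (beta=1 gives CHRF, beta=3 gives CHRF3).
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p, r = sum(precisions) / max_n, sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)
```

An identical hypothesis and reference score 1.0, and fully disjoint strings score 0.0; CHRF3 weights recall three times as heavily as precision via beta.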

743 citations


Cites methods from "Findings of the 2013 Workshop on St..."

  • ..., 2012), WMT13 (Bojar et al., 2013) and WMT14 (Bojar et al....


  • ...System-level correlations The evaluation metrics were compared with human rankings on the system-level by means of Spearman’s correlation coefficients ρ for the WMT12 and WMT13 data and Pearson’s correlation coefficients r for the WMT14 data....


  • ...The CHRF scores were calculated for all available translation outputs from the WMT12 (Callison-Burch et al., 2012), WMT13 (Bojar et al., 2013) and WMT14 (Bojar et al., 2014) shared tasks, and then compared with human rankings....

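The system-level comparison described in these excerpts relies on Spearman's ρ and Pearson's r. A minimal pure-Python sketch of both (our own names; no tie handling for the rank transform, and non-constant inputs are assumed):

```python
def pearson(xs, ys):
    # Pearson's r: covariance normalized by the product of standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    # Spearman's rho: Pearson's r computed on ranks (ties not handled here).
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    return pearson(ranks(xs), ranks(ys))
```

The practical difference is visible on nonlinear but monotone data: Spearman's ρ is 1.0 for any strictly increasing relationship, while Pearson's r rewards only linear agreement.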

Proceedings ArticleDOI
12 Aug 2016
TL;DR: The results of the WMT16 shared tasks are presented, which included five machine translation (MT) tasks (standard news, IT-domain, biomedical, multimodal, pronoun), three evaluation tasks (metrics, tuning, run-time estimation of MT quality), an automatic post-editing task, and a bilingual document alignment task.
Abstract: This paper presents the results of the WMT16 shared tasks, which included five machine translation (MT) tasks (standard news, IT-domain, biomedical, multimodal, pronoun), three evaluation tasks (metrics, tuning, run-time estimation of MT quality), an automatic post-editing task, and a bilingual document alignment task. This year, 102 MT systems from 24 institutions (plus 36 anonymized online systems) were submitted to the 12 translation directions in the news translation task. The IT-domain task received 31 submissions from 12 institutions in 7 directions, and the Biomedical task received 15 submissions from 5 institutions. Evaluation was both automatic and manual (relative ranking and 100-point scale assessments). The quality estimation task had three subtasks, with a total of 14 teams submitting 39 entries. The automatic post-editing task had a total of 6 teams submitting 11 entries.

616 citations


Cites background or methods from "Findings of the 2013 Workshop on St..."

  • ...This conference builds on nine previous WMT workshops (Koehn and Monz, 2006; Callison-Burch et al., 2007, 2008, 2009, 2010, 2011, 2012; Bojar et al., 2013, 2014, 2015)....


  • ...a trivial “all-BAD” baseline outperforms many real systems in terms of F1-BAD score (Bojar et al., 2013)....


  • ...…the WMT shared task on quality estimation (QE) of machine translation (MT) builds on the previous editions of the task (Callison-Burch et al., 2012; Bojar et al., 2013, 2014, 2015), with “traditional” tasks at sentence and word levels, a new task for entire documents quality prediction, and a…...

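The point about the trivial "all-BAD" baseline can be made concrete with a small sketch: when BAD labels dominate the gold standard, predicting BAD everywhere already gets recall 1.0 and high precision, hence a strong F1-BAD. Illustrative code only; the label names follow the task, everything else (function name, toy data) is ours.

```python
def f1_for_label(gold, pred, label="BAD"):
    # Precision, recall, and F1 treating `label` as the positive class.
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy gold standard where 80% of the words are BAD: the trivial baseline
# gets precision 0.8 and recall 1.0, i.e. F1-BAD ≈ 0.89.
gold = ["BAD"] * 8 + ["OK"] * 2
all_bad = ["BAD"] * len(gold)
```

This is why later QE editions moved toward metrics that are harder to game with a constant prediction.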

Proceedings ArticleDOI
01 Jun 2014
TL;DR: The results of the WMT14 shared tasks are presented, which included a standard news translation task, a separate medical translation task, a task for run-time estimation of machine translation quality, and a metrics task.
Abstract: This paper presents the results of the WMT14 shared tasks, which included a standard news translation task, a separate medical translation task, a task for run-time estimation of machine translation quality, and a metrics task. This year, 143 machine translation systems from 23 institutions were submitted to the ten translation directions in the standard translation task. An additional 6 anonymized systems were included, and were then evaluated both automatically and manually. The quality estimation task had four subtasks, with a total of 10 teams submitting 57 entries.

511 citations


Cites background or methods or result from "Findings of the 2013 Workshop on St..."

  • ...Compared to the results regarding time prediction in the Quality Evaluation shared task from 2013 (Bojar et al., 2013), we note that this time all submissions were able to beat the baseline system (compared to only 1/3 of the submissions in 2013)....


  • ...• Ranking: DeltaAvg (primary metric) (Bojar et al., 2013) and Spearman’s rank correlation....


  • ...(Bojar et al., 2013) has focused on prediction of automatically derived labels, generally due to practical considerations as the manual annotation is labour intensive....


  • ...This workshop builds on eight previous WMT workshops (Koehn and Monz, 2006; Callison-Burch et al., 2007, 2008, 2009, 2010, 2011, 2012; Bojar et al., 2013)....


  • ...It has proved robust across a range of language pairs, MT systems, and text domains for predicting various forms of post-editing effort (Callison-Burch et al., 2012; Bojar et al., 2013)....


References
Journal ArticleDOI
TL;DR: A general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies is presented; tests for interobserver bias are given in terms of first-order marginal homogeneity, and measures of interobserver agreement are developed as generalized kappa-type statistics.
Abstract: This paper presents a general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies. The procedure essentially involves the construction of functions of the observed proportions which are directed at the extent to which the observers agree among themselves and the construction of test statistics for hypotheses involving these functions. Tests for interobserver bias are presented in terms of first-order marginal homogeneity and measures of interobserver agreement are developed as generalized kappa-type statistics. These procedures are illustrated with a clinical diagnosis example from the epidemiological literature.

64,109 citations


"Findings of the 2013 Workshop on St..." refers background in this paper

  • ...The exact interpretation of the kappa coefficient is difficult, but according to Landis and Koch (1977), 0–0.2 is slight, 0.2–0.4 is fair, 0.4–0.6 is moderate, 0.6–0.8 is substantial, and 0.8–1.0 is almost perfect....

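The Landis and Koch (1977) bands quoted above can be written as a small lookup. An illustrative sketch only; the quote does not specify how negative kappa values are labeled, so they fall into the lowest band here.

```python
def interpret_kappa(kappa):
    # Qualitative bands for Cohen's kappa per Landis and Koch (1977):
    # 0-0.2 slight, 0.2-0.4 fair, 0.4-0.6 moderate,
    # 0.6-0.8 substantial, 0.8-1.0 almost perfect.
    bands = [(0.2, "slight"), (0.4, "fair"), (0.6, "moderate"),
             (0.8, "substantial"), (1.0, "almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"
```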

Book
01 Jan 1989
TL;DR: Hosmer and Lemeshow as discussed by the authors provide an accessible introduction to the logistic regression model while incorporating advances of the last decade, including a variety of software packages for the analysis of data sets.
Abstract: From the reviews of the First Edition: "An interesting, useful, and well-written book on logistic regression models... Hosmer and Lemeshow have used very little mathematics, have presented difficult concepts heuristically and through illustrative examples, and have included references." - Choice. "Well written, clearly organized, and comprehensive... the authors carefully walk the reader through the estimation and interpretation of coefficients from a wide variety of logistic regression models... their careful explication of the quantitative re-expression of coefficients from these various models is excellent." - Contemporary Sociology. "An extremely well-written book that will certainly prove an invaluable acquisition to the practicing statistician who finds other literature on analysis of discrete data hard to follow or heavily theoretical." - The Statistician. In this revised and updated edition of their popular book, David Hosmer and Stanley Lemeshow continue to provide an amazingly accessible introduction to the logistic regression model while incorporating advances of the last decade, including a variety of software packages for the analysis of data sets. Hosmer and Lemeshow extend the discussion from biostatistics and epidemiology to cutting-edge applications in data mining and machine learning, guiding readers step-by-step through the use of modeling techniques for dichotomous data in diverse fields. Ample new topics and expanded discussions of existing material are accompanied by a wealth of real-world examples, with extensive data sets available over the Internet.

35,847 citations

Journal ArticleDOI
Jacob Cohen
TL;DR: In this article, the author presents a procedure for having two or more judges independently categorize a sample of units and for determining the degree and significance of their agreement, since it is important to establish the extent to which such judgments are reproducible, i.e., reliable.
Abstract: CONSIDER Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of measurement obtainable is nominal scaling (Stevens, 1951, pp. 25-26), i.e. placement in a set of k unordered categories. Because the categorizing of the units is a consequence of some complex judgment process performed by a "two-legged meter" (Stevens, 1958), it becomes important to determine the extent to which these judgments are reproducible, i.e., reliable. The procedure which suggests itself is that of having two (or more) judges independently categorize a sample of units and determine the degree, significance, and

34,965 citations


"Findings of the 2013 Workshop on St..." refers methods in this paper


  • ...We measured pairwise agreement among annotators using Cohen’s kappa coefficient (κ) (Cohen, 1960), which is defined as κ = (P(A) − P(E)) / (1 − P(E)), where P(A) is the proportion of times that the annotators agree, and P(E) is the proportion of times that they would agree by chance....

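The kappa definition quoted in this context can be sketched directly. Illustrative Python only; here P(E) is estimated from each annotator's empirical label distribution, as in the standard two-annotator formulation.

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    # kappa = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed
    # agreement rate and P(E) the agreement expected by chance given
    # each annotator's marginal label frequencies.
    n = len(ann_a)
    p_a = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    counts_a, counts_b = Counter(ann_a), Counter(ann_b)
    p_e = sum((counts_a[label] / n) * (counts_b[label] / n)
              for label in counts_a)
    return (p_a - p_e) / (1 - p_e)
```

Perfect agreement yields kappa 1.0, while agreement no better than chance yields 0.0.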

Journal ArticleDOI
TL;DR: Applied Logistic Regression, Third Edition provides an easily accessible introduction to the logistic regression model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables.
Abstract: "A new edition of the definitive guide to logistic regression modeling for health science and other applications. This thoroughly expanded Third Edition provides an easily accessible introduction to the logistic regression (LR) model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables. Applied Logistic Regression, Third Edition emphasizes applications in the health sciences and handpicks topics that best suit the use of modern statistical software. The book provides readers with state-of-the-art techniques for building, interpreting, and assessing the performance of LR models. New and updated features include: a chapter on the analysis of correlated outcome data; a wealth of additional material for topics ranging from Bayesian methods to assessing model fit; rich data sets from real-world studies that demonstrate each method under discussion; and detailed examples and interpretation of the presented results, as well as exercises throughout. Applied Logistic Regression, Third Edition is a must-have guide for professionals and researchers who need to model nominal or ordinal scaled outcome variables in public health, medicine, and the social sciences as well as a wide range of other fields and disciplines."

30,190 citations


"Findings of the 2013 Workshop on St..." refers methods in this paper


  • ...For German-English, LogReg was trained with Stepwise Feature Selection (Hosmer, 1989) on two feature sets: Feature Set 24 includes basic counts augmented with PCFG parsing features (number of VPs, alternative parses, parse probability) on both source and target sentences (Avramidis et al., 2011), and pseudo-reference METEOR score; the most successful set, Feature Set 33, combines those 24 features with the 17 baseline features....


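The stepwise feature selection mentioned in this excerpt is, at its core, a greedy search over feature subsets. A schematic forward-selection sketch follows; this is not the authors' implementation, and score_fn is a stand-in for a model-fit criterion such as the log-likelihood of a logistic regression fitted on the candidate subset.

```python
def forward_stepwise(features, score_fn, min_gain=1e-3):
    # Greedy forward selection: repeatedly add the single feature that
    # most improves the score, stopping once no candidate improves it
    # by at least min_gain.
    selected = []
    best = score_fn([])
    remaining = list(features)
    while remaining:
        gain, feat = max((score_fn(selected + [f]) - best, f)
                         for f in remaining)
        if gain < min_gain:
            break
        selected.append(feat)
        remaining.remove(feat)
        best += gain
    return selected
```

Backward elimination works the same way in reverse, starting from the full set and dropping the least useful feature each round.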

Journal ArticleDOI
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

19,603 citations


"Findings of the 2013 Workshop on St..." refers methods in this paper

  • ...The prediction models were trained using four classifiers in the Weka toolkit (Hall et al., 2009): linear regression, M5P trees, multi-layer perceptron and SVM regression....

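Of the four regression models listed, the simplest, univariate linear regression, can be sketched by ordinary least squares. Illustrative only; the cited systems used Weka's multivariate implementations, and the function name here is ours.

```python
def fit_linear(xs, ys):
    # Ordinary least squares for y = w*x + b in one dimension:
    # w is the covariance of x and y divided by the variance of x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx
    return w, b
```

On exactly linear data the fit recovers the generating slope and intercept; the other three models (M5P trees, multi-layer perceptron, SVM regression) trade this closed form for the ability to capture nonlinear relationships.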