Proceedings ArticleDOI

Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction

TL;DR: ERRANT is a grammatical ERRor ANnotation Toolkit that automatically extracts edits from parallel original and corrected sentences and classifies them according to a new, dataset-agnostic, rule-based framework, facilitating error type evaluation at different levels of granularity.
Abstract: Until now, error type performance for Grammatical Error Correction (GEC) systems could only be measured in terms of recall because system output is not annotated. To overcome this problem, we introduce ERRANT, a grammatical ERRor ANnotation Toolkit designed to automatically extract edits from parallel original and corrected sentences and classify them according to a new, dataset-agnostic, rule-based framework. This not only facilitates error type evaluation at different levels of granularity, but can also be used to reduce annotator workload and standardise existing GEC datasets. Human experts rated the automatic edits as “Good” or “Acceptable” in at least 95% of cases, so we applied ERRANT to the system output of the CoNLL-2014 shared task to carry out a detailed error type analysis for the first time.
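
As a quick illustration of the annotation workflow described above, the sketch below uses the pip-installable errant package (a later release of the toolkit) together with a spaCy English model; the example sentences are invented.

```python
# Minimal usage sketch, assuming the `errant` Python package and an installed
# spaCy English model.
import errant

annotator = errant.load("en")
orig = annotator.parse("This are a sentence .")
cor = annotator.parse("This is a sentence .")

# Extract edits from the parallel pair and classify each one.
for e in annotator.annotate(orig, cor):
    print(e.o_start, e.o_end, repr(e.o_str), "->", repr(e.c_str), e.type)
# Expected output along the lines of: 1 2 'are' -> 'is' R:VERB:SVA
```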


Citations
Proceedings ArticleDOI
01 Aug 2019
TL;DR: This paper reports on the BEA-2019 Shared Task on Grammatical Error Correction (GEC), which introduces a new dataset, the Write&Improve+LOCNESS corpus, representing a wider range of native and learner English levels and abilities.
Abstract: This paper reports on the BEA-2019 Shared Task on Grammatical Error Correction (GEC). As with the CoNLL-2014 shared task, participants are required to correct all types of errors in test data. One of the main contributions of the BEA-2019 shared task is the introduction of a new dataset, the Write&Improve+LOCNESS corpus, which represents a wider range of native and learner English levels and abilities. Another contribution is the introduction of tracks, which control the amount of annotated data available to participants. Systems are evaluated in terms of ERRANT F_0.5, which allows us to report a much wider range of performance statistics. The competition was hosted on Codalab and remains open for further submissions on the blind test set.
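
For reference, the ERRANT F_0.5 metric mentioned above is the standard F_beta computed from true positives, false positives and false negatives, with beta = 0.5 weighting precision twice as heavily as recall. A minimal worked example (the counts are invented):

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Span-based F_beta from true positives, false positives and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

# beta = 0.5 weights precision twice as heavily as recall.
print(round(f_beta(tp=40, fp=10, fn=60), 3))  # P = 0.8, R = 0.4 -> 0.667
```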

235 citations


Cites background or methods from "Automatic Annotation and Evaluation..."

  • ...5 scores in both the gold and auto reference settings, we note that MaxMatch exploits a dynamic alignment to artificially minimise the false positive rate and hence produces slightly inflated scores (Bryant et al., 2017)....

  • ...Systems are evaluated on the W&I+LOCNESS test set using the ERRANT scorer (Bryant et al., 2017), an improved version of the MaxMatch scorer (Dahlmeier and Ng, 2012) that was previously used in the CoNLL shared tasks....

  • ...Since FCE and NUCLE were annotated according to different error type frameworks and Lang8 and W&I+LOCNESS were not annotated with error types at all, we re-annotated all corpora automatically using ERRANT (Bryant et al., 2017)....

  • ...Unlike CoNLL-2014 however, this is calculated using the ERRANT scorer (Bryant et al., 2017), rather than the M2 scorer (Dahlmeier and Ng, 2012), because the ERRANT scorer can provide much more detailed feedback, e.g. in terms of performance on specific error types....

Proceedings ArticleDOI
26 May 2020
TL;DR: This paper presents a simple and efficient GEC sequence tagger using a Transformer encoder, pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora.
Abstract: In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F_0.5 of 65.3/66.5 on CoNLL-2014 (test) and F_0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.
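
The token-level transformations mentioned in the abstract map each source token to an edit tag. The sketch below is a simplified assumption of how tags in the paper's $KEEP/$DELETE/$APPEND_t/$REPLACE_t style could be applied; it is not the released implementation, which also handles grammatical transformations and iterative refinement.

```python
# Illustrative application of GECToR-style token-level tags; the tag names
# follow the paper's scheme, but this apply function is a simplified sketch.
def apply_tags(tokens, tags):
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "$DELETE":
            continue                                  # drop the token
        if tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])        # substitute a new token
        elif tag.startswith("$APPEND_"):
            out.extend([tok, tag[len("$APPEND_"):]])  # keep token, insert after it
        else:
            out.append(tok)                           # $KEEP or unknown tag
    return out

tokens = ["He", "go", "to", "to", "school"]
tags = ["$KEEP", "$REPLACE_goes", "$KEEP", "$DELETE", "$APPEND_."]
print(" ".join(apply_tags(tokens, tags)))  # He goes to school .
```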

175 citations


Cites background or methods from "Automatic Annotation and Evaluation..."

  • ...We report results on CoNLL2014 test set (Ng et al., 2014) evaluated by official M2 scorer (Dahlmeier and Ng, 2012), and on BEA-2019 dev and test sets evaluated by ERRANT (Bryant et al., 2017)....

  • ...…large amounts of training data and (iii) interpretability and explainability; they require additional functionality to explain corrections, e.g., grammatical error type classification (Bryant et al., 2017)....

Proceedings ArticleDOI
02 Aug 2019
TL;DR: This work proposes a simple and surprisingly effective unsupervised synthetic error generation method based on confusion sets extracted from a spellchecker to increase the amount of training data.
Abstract: Considerable effort has been made to address the data sparsity problem in neural grammatical error correction. In this work, we propose a simple and surprisingly effective unsupervised synthetic error generation method based on confusion sets extracted from a spellchecker to increase the amount of training data. Synthetic data is used to pre-train a Transformer sequence-to-sequence model, which not only improves over a strong baseline trained on authentic error-annotated data, but also enables the development of a practical GEC system in a scenario where little genuine error-annotated data is available. The developed systems placed first in the BEA19 shared task, achieving 69.47 and 64.24 F0.5 in the restricted and low-resource tracks respectively, both on the W&I+LOCNESS test set. On the popular CoNLL 2014 test set, we report state-of-the-art results of 64.16 M² for the submitted system, and 61.30 M² for the constrained system trained on the NUCLE and Lang-8 data.
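
As a rough sketch of the confusion-set idea (the confusion sets and noising probability below are toy assumptions, not the authors' spellchecker-derived sets):

```python
import random

# Corrupt clean text with confusable words to create synthetic
# (source, target) pairs for pre-training a GEC model.
CONFUSIONS = {
    "their": ["there", "they're"],
    "is": ["are", "was"],
    "than": ["then"],
}

def add_noise(tokens, prob=0.3, seed=0):
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        options = CONFUSIONS.get(tok.lower())
        if options and rng.random() < prob:
            noisy.append(rng.choice(options))  # swap in a confusable word
        else:
            noisy.append(tok)
    return noisy

clean = "Their plan is better than ours".split()
noisy = add_noise(clean, prob=1.0)
print(" ".join(noisy), "->", " ".join(clean))  # synthetic (source, target) pair
```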

155 citations


Cites methods from "Automatic Annotation and Evaluation..."

  • ...The performance of participating systems was evaluated using the ERRANT scorer (Bryant et al., 2017) which reports a F0....

Proceedings ArticleDOI
Jared Lichtarge, Chris Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar, Simon Tong
01 Jun 2019
TL;DR: It is demonstrated that neural GEC models trained using either type of corpora give similar performance, and systematic analysis is presented that compares the two approaches to data generation and highlights the effectiveness of ensembling.
Abstract: Grammatical Error Correction (GEC) has been recently modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from the lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. The first method extracts source-target pairs from Wikipedia edit histories with minimal filtration heuristics while the second method introduces noise into Wikipedia sentences via round-trip translation through bridge languages. Both strategies yield similar sized parallel corpora containing around 4B tokens. We employ an iterative decoding strategy that is tailored to the loosely supervised nature of our constructed corpora. We demonstrate that neural GEC models trained using either type of corpora give similar performance. Fine-tuning these models on the Lang-8 corpus and ensembling allows us to surpass the state of the art on both the CoNLL ‘14 benchmark and the JFLEG task. We present systematic analysis that compares the two approaches to data generation and highlights the effectiveness of ensembling.
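
A minimal sketch of the round-trip translation strategy; translate(text, src, tgt) is a placeholder for any machine translation system and is not the paper's pipeline.

```python
# Hypothetical interface: translate(text, src, tgt) is any MT function.
def round_trip_pair(sentence, translate, bridge="fr"):
    """Return a (noisy_source, clean_target) pair for GEC pre-training.

    Translating out to a bridge language and back introduces natural-looking
    noise into an otherwise clean sentence."""
    noisy = translate(translate(sentence, src="en", tgt=bridge), src=bridge, tgt="en")
    return noisy, sentence
```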

131 citations


Cites methods from "Automatic Annotation and Evaluation..."

  • ...The error categories were tagged using the approach in Bryant et al. (2017)....

Proceedings ArticleDOI
01 Jul 2018
TL;DR: Experiments show the proposed approaches improve the performance of seq2seq models for GEC, achieving state-of-the-art results on both CoNLL-2014 and JFLEG benchmark datasets.
Abstract: Most of the neural sequence-to-sequence (seq2seq) models for grammatical error correction (GEC) have two limitations: (1) a seq2seq model may not be well generalized with only limited error-corrected data; (2) a seq2seq model may fail to completely correct a sentence with multiple errors through normal seq2seq inference. We attempt to address these limitations by proposing a fluency boost learning and inference mechanism. Fluency boosting learning generates fluency-boost sentence pairs during training, enabling the error correction model to learn how to improve a sentence’s fluency from more instances, while fluency boosting inference allows the model to correct a sentence incrementally with multiple inference steps until the sentence’s fluency stops increasing. Experiments show our approaches improve the performance of seq2seq models for GEC, achieving state-of-the-art results on both CoNLL-2014 and JFLEG benchmark datasets.
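
A minimal sketch of the multi-step inference idea: keep re-correcting the sentence until a fluency score stops improving. Both correct and fluency below are assumed interfaces (any single-pass GEC model and any sentence-level fluency estimate), not the paper's code.

```python
def multi_step_correct(sentence, correct, fluency, max_steps=5):
    """Apply a GEC model repeatedly while the fluency score keeps increasing."""
    best, best_score = sentence, fluency(sentence)
    for _ in range(max_steps):
        candidate = correct(best)
        score = fluency(candidate)
        if score <= best_score:   # stop once fluency no longer increases
            break
        best, best_score = candidate, score
    return best
```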

113 citations


Additional excerpts

  • ...…error detection (Leacock et al., 2010; Rei and Yannakoudakis, 2016; Kaneko et al., 2017) and GEC evaluation (Tetreault et al., 2010b; Madnani et al., 2011; Dahlmeier and Ng, 2012c; Napoles et al., 2015; Sakaguchi et al., 2016; Napoles et al., 2016; Bryant et al., 2017; Asano et al., 2017)....

References
Book
01 Jan 1993
TL;DR: This article presents bootstrap methods for estimation, using simple arguments, with Minitab macros for implementing these methods, as well as some examples of how these methods could be used for estimation purposes.
Abstract: This article presents bootstrap methods for estimation, using simple arguments. Minitab macros for implementing these methods are given.
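
As a reminder of what the bootstrap provides, here is a minimal resampling sketch (illustrative only, unrelated to the Minitab macros mentioned above):

```python
import random

# Estimate a statistic and a 95% percentile interval by resampling with replacement.
def bootstrap(data, stat, n_resamples=2000, seed=0):
    rng = random.Random(seed)
    estimates = sorted(stat([rng.choice(data) for _ in data])
                       for _ in range(n_resamples))
    lo = estimates[int(0.025 * n_resamples)]
    hi = estimates[int(0.975 * n_resamples)]
    return stat(data), (lo, hi)

mean = lambda xs: sum(xs) / len(xs)
print(bootstrap([2.1, 2.4, 1.9, 2.8, 2.2, 2.5], mean))
```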

37,183 citations

Proceedings Article
19 Jun 2011
TL;DR: It is demonstrated how supervised discriminative machine learning techniques can be used to automate the assessment of 'English as a Second or Other Language' (ESOL) examination scripts by using rank preference learning to explicitly model the grade relationships between scripts.
Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of 'English as a Second or Other Language' (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publically available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of 'outlier' texts, we test the validity of our model and identify cases where the model's scores diverge from that of a human examiner.
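
A toy sketch of pairwise rank preference learning under simplifying assumptions (perceptron updates on feature-difference vectors; the features and grades are invented, and this is not the paper's system):

```python
import random

# Learn weights so that, for any two scripts, the higher-graded one scores higher.
def train_pairwise_ranker(features, grades, epochs=20, lr=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    pairs = [(i, j) for i in range(len(grades))
                    for j in range(len(grades)) if grades[i] > grades[j]]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for i, j in pairs:
            diff = [a - b for a, b in zip(features[i], features[j])]
            if sum(wk * dk for wk, dk in zip(w, diff)) <= 0:  # pair mis-ordered
                w = [wk + lr * dk for wk, dk in zip(w, diff)]
    return w

# Two toy features per script (e.g. error rate, length); grades are exam scores.
print(train_pairwise_ranker([[0.9, 1.0], [0.5, 1.2], [0.1, 0.8]], [2, 3, 5]))
```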

521 citations


"Automatic Annotation and Evaluation..." refers methods in this paper

  • ...For example, a classifier trained on the First Certificate in English (FCE) corpus (Yannakoudakis et al., 2011) is unlikely to perform as well on the National University of Singapore Corpus of Learner English (NUCLE) (Dahlmeier and Ng, 2012) or vice versa, because both corpora have been annotated according to different standards (cf....

14 Oct 2005
TL;DR: Free-marginal multirater kappa (multirater κfree), like its birater free-marginal counterparts (PABAK, S, RE, and κm), is appropriate for the typical agreement study, in which raters’ distributions of cases into categories are not restricted.
Abstract: Fleiss’ popular multirater kappa is known to be influenced by prevalence and bias, which can lead to the paradox of high agreement but low kappa. It also assumes that raters are restricted in how they can distribute cases across categories, which is not a typical feature of many agreement studies. In this article, a free-marginal, multirater alternative to Fleiss’ multirater kappa is introduced. Free-marginal multirater kappa (multirater κfree), like its birater free-marginal counterparts (PABAK, S, RE, and κm), is not influenced by prevalence and bias and is appropriate for the typical agreement study, in which raters’ distributions of cases into categories are not restricted. Recommendations for the proper use of multirater κfree are included.
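
For concreteness, free-marginal multirater kappa can be written as kappa_free = (P_o - 1/k) / (1 - 1/k), where P_o is the observed per-item pairwise agreement (computed as in Fleiss' kappa) and k is the number of categories. A small sketch with invented rating counts:

```python
def kappa_free(item_counts, k):
    """item_counts: one dict per item mapping category -> number of raters
    who chose it (the number of raters per item is assumed constant)."""
    n = sum(item_counts[0].values())
    p_o = sum(sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
              for counts in item_counts) / len(item_counts)
    p_e = 1.0 / k                       # free-marginal chance agreement
    return (p_o - p_e) / (1 - p_e)

ratings = [{"Good": 5, "Acceptable": 0}, {"Good": 3, "Acceptable": 2}]
print(round(kappa_free(ratings, k=2), 3))  # 0.4 for this toy data
```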

502 citations

Proceedings ArticleDOI
01 Jun 2014
TL;DR: The CoNLL-2014 shared task was devoted to grammatical error correction, in which a participating system is expected to detect and correct grammatical errors of all types.
Abstract: The CoNLL-2014 shared task was devoted to grammatical error correction of all error types. In this paper, we give the task definition, present the data sets, and describe the evaluation metric and scorer used in the shared task. We also give an overview of the various approaches adopted by the participating teams, and present the evaluation results. Compared to the CoNLL-2013 shared task, we have introduced the following changes in CoNLL-2014: (1) A participating system is expected to detect and correct grammatical errors of all types, instead of just the five error types in CoNLL-2013; (2) The evaluation metric was changed from F1 to F0.5, to emphasize precision over recall; and (3) We have two human annotators who independently annotated the test essays, compared to just one human annotator in CoNLL-2013.

484 citations

Proceedings Article
03 Jun 2012
TL;DR: This work presents a novel method for evaluating grammatical error correction that is an algorithm for efficiently computing the sequence of phrase-level edits between a source sentence and a system hypothesis that achieves the highest overlap with the gold-standard annotation.
Abstract: We present a novel method for evaluating grammatical error correction. The core of our method, which we call MaxMatch (M2), is an algorithm for efficiently computing the sequence of phrase-level edits between a source sentence and a system hypothesis that achieves the highest overlap with the gold-standard annotation. This optimal edit sequence is subsequently scored using F1 measure. We test our M2 scorer on the Helping Our Own (HOO) shared task data and show that our method results in more accurate evaluation for grammatical error correction.
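
The snippet below is a much-simplified illustration of phrase-level edit extraction between a source sentence and a hypothesis; difflib stands in for the scorer's Levenshtein-style alignment, and the real MaxMatch additionally chooses the edit sequence that maximises overlap with the gold-standard annotation.

```python
import difflib

# Return (start, end, source_span, hypothesis_span) tuples for non-matching spans.
def extract_edits(src_tokens, hyp_tokens):
    sm = difflib.SequenceMatcher(None, src_tokens, hyp_tokens)
    return [(i1, i2, " ".join(src_tokens[i1:i2]), " ".join(hyp_tokens[j1:j2]))
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

src = "This are a sentence .".split()
hyp = "This is a sentence .".split()
print(extract_edits(src, hyp))  # [(1, 2, 'are', 'is')]
```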

322 citations


"Automatic Annotation and Evaluation..." refers background or methods in this paper

  • ...To show that automatic references are feasible alternatives to gold references, we evaluated each team in the CoNLL-2014 shared task using both types of reference with the M2 scorer (Dahlmeier and Ng, 2012), the de facto standard of GEC evaluation, and our own scorer....

  • ...Since no scorer is currently capable of calculating error type performance however (Dahlmeier and Ng, 2012; Felice and Briscoe, 2015; Napoles et al., 2015), we instead built our own....

  • ...It is worth mentioning that despite an increased interest in GEC evaluation in recent years (Dahlmeier and Ng, 2012; Felice and Briscoe, 2015; Bryant and Ng, 2015; Napoles et al., 2015; Grundkiewicz et al., 2015; Sakaguchi et al., 2016), ERRANT is the only toolkit currently capable of producing error types scores....

  • ...For example, a classifier trained on the First Certificate in English (FCE) corpus (Yannakoudakis et al., 2011) is unlikely to perform as well on the National University of Singapore Corpus of Learner English (NUCLE) (Dahlmeier and Ng, 2012) or vice versa, because both corpora have been annotated according to different standards (cf. Xue and Hwa (2014))....
