Proceedings ArticleDOI

Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction

TL;DR: ERRANT is a grammatical ERRor ANnotation Toolkit that automatically extracts edits from parallel original and corrected sentences and classifies them according to a new, dataset-agnostic, rule-based framework, facilitating error type evaluation at different levels of granularity.
Abstract: Until now, error type performance for Grammatical Error Correction (GEC) systems could only be measured in terms of recall because system output is not annotated. To overcome this problem, we introduce ERRANT, a grammatical ERRor ANnotation Toolkit designed to automatically extract edits from parallel original and corrected sentences and classify them according to a new, dataset-agnostic, rule-based framework. This not only facilitates error type evaluation at different levels of granularity, but can also be used to reduce annotator workload and standardise existing GEC datasets. Human experts rated the automatic edits as “Good” or “Acceptable” in at least 95% of cases, so we applied ERRANT to the system output of the CoNLL-2014 shared task to carry out a detailed error type analysis for the first time.
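
As a quick illustration of the annotation workflow described above, the sketch below uses the pip-installable errant package (a later release of the toolkit) together with a spaCy English model; the example sentences are invented.

```python
# Minimal usage sketch, assuming the `errant` Python package and an installed
# spaCy English model.
import errant

annotator = errant.load("en")
orig = annotator.parse("This are a sentence .")
cor = annotator.parse("This is a sentence .")

# Extract edits from the parallel pair and classify each one.
for e in annotator.annotate(orig, cor):
    print(e.o_start, e.o_end, repr(e.o_str), "->", repr(e.c_str), e.type)
# Expected output along the lines of: 1 2 'are' -> 'is' R:VERB:SVA
```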


Citations
Proceedings ArticleDOI
01 Aug 2019
TL;DR: This paper reports on the BEA-2019 Shared Task on Grammatical Error Correction (GEC), which introduces a new dataset, the Write&Improve+LOCNESS corpus, representing a wider range of native and learner English levels and abilities.
Abstract: This paper reports on the BEA-2019 Shared Task on Grammatical Error Correction (GEC). As with the CoNLL-2014 shared task, participants are required to correct all types of errors in test data. One of the main contributions of the BEA-2019 shared task is the introduction of a new dataset, the Write&Improve+LOCNESS corpus, which represents a wider range of native and learner English levels and abilities. Another contribution is the introduction of tracks, which control the amount of annotated data available to participants. Systems are evaluated in terms of ERRANT F_0.5, which allows us to report a much wider range of performance statistics. The competition was hosted on Codalab and remains open for further submissions on the blind test set.
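
For reference, the ERRANT F_0.5 metric mentioned above is the standard F_beta computed from true positives, false positives and false negatives, with beta = 0.5 weighting precision twice as heavily as recall. A minimal worked example (the counts are invented):

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Span-based F_beta from true positives, false positives and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

# beta = 0.5 weights precision twice as heavily as recall.
print(round(f_beta(tp=40, fp=10, fn=60), 3))  # P = 0.8, R = 0.4 -> 0.667
```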

235 citations


Cites background or methods from "Automatic Annotation and Evaluation..."

  • ...5 scores in both the gold and auto reference settings, we note that MaxMatch exploits a dynamic alignment to artificially minimise the false positive rate and hence produces slightly inflated scores (Bryant et al., 2017)....

  • ...Systems are evaluated on the W&I+LOCNESS test set using the ERRANT scorer (Bryant et al., 2017), an improved version of the MaxMatch scorer (Dahlmeier and Ng, 2012) that was previously used in the CoNLL shared tasks....

  • ...Since FCE and NUCLE were annotated according to different error type frameworks and Lang8 and W&I+LOCNESS were not annotated with error types at all, we re-annotated all corpora automatically using ERRANT (Bryant et al., 2017)....

  • ...Unlike CoNLL-2014 however, this is calculated using the ERRANT scorer (Bryant et al., 2017), rather than the M2 scorer (Dahlmeier and Ng, 2012), because the ERRANT scorer can provide much more detailed feedback, e.g. in terms of performance on specific error types....

Proceedings ArticleDOI
26 May 2020
TL;DR: This paper presents a simple and efficient GEC sequence tagger using a Transformer encoder, pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora.
Abstract: In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F_0.5 of 65.3/66.5 on CoNLL-2014 (test) and F_0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.
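
The token-level transformations mentioned in the abstract map each source token to an edit tag. The sketch below is a simplified assumption of how tags in the paper's $KEEP/$DELETE/$APPEND_t/$REPLACE_t style could be applied; it is not the released implementation, which also handles grammatical transformations and iterative refinement.

```python
# Illustrative application of GECToR-style token-level tags; the tag names
# follow the paper's scheme, but this apply function is a simplified sketch.
def apply_tags(tokens, tags):
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "$DELETE":
            continue                                  # drop the token
        if tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])        # substitute a new token
        elif tag.startswith("$APPEND_"):
            out.extend([tok, tag[len("$APPEND_"):]])  # keep token, insert after it
        else:
            out.append(tok)                           # $KEEP or unknown tag
    return out

tokens = ["He", "go", "to", "to", "school"]
tags = ["$KEEP", "$REPLACE_goes", "$KEEP", "$DELETE", "$APPEND_."]
print(" ".join(apply_tags(tokens, tags)))  # He goes to school .
```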

175 citations


Cites background or methods from "Automatic Annotation and Evaluation..."

  • ...We report results on CoNLL2014 test set (Ng et al., 2014) evaluated by official M2 scorer (Dahlmeier and Ng, 2012), and on BEA-2019 dev and test sets evaluated by ERRANT (Bryant et al., 2017)....

  • ...…large amounts of training data and (iii) interpretability and explainability; they require additional functionality to explain corrections, e.g., grammatical error type classification (Bryant et al., 2017)....

Proceedings ArticleDOI
02 Aug 2019
TL;DR: This work proposes a simple and surprisingly effective unsupervised synthetic error generation method based on confusion sets extracted from a spellchecker to increase the amount of training data.
Abstract: Considerable effort has been made to address the data sparsity problem in neural grammatical error correction. In this work, we propose a simple and surprisingly effective unsupervised synthetic error generation method based on confusion sets extracted from a spellchecker to increase the amount of training data. Synthetic data is used to pre-train a Transformer sequence-to-sequence model, which not only improves over a strong baseline trained on authentic error-annotated data, but also enables the development of a practical GEC system in a scenario where little genuine error-annotated data is available. The developed systems placed first in the BEA19 shared task, achieving 69.47 and 64.24 F0.5 in the restricted and low-resource tracks respectively, both on the W&I+LOCNESS test set. On the popular CoNLL 2014 test set, we report state-of-the-art results of 64.16 M² for the submitted system, and 61.30 M² for the constrained system trained on the NUCLE and Lang-8 data.
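
As a rough sketch of the confusion-set idea (the confusion sets and noising probability below are toy assumptions, not the authors' spellchecker-derived sets):

```python
import random

# Corrupt clean text with confusable words to create synthetic
# (source, target) pairs for pre-training a GEC model.
CONFUSIONS = {
    "their": ["there", "they're"],
    "is": ["are", "was"],
    "than": ["then"],
}

def add_noise(tokens, prob=0.3, seed=0):
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        options = CONFUSIONS.get(tok.lower())
        if options and rng.random() < prob:
            noisy.append(rng.choice(options))  # swap in a confusable word
        else:
            noisy.append(tok)
    return noisy

clean = "Their plan is better than ours".split()
noisy = add_noise(clean, prob=1.0)
print(" ".join(noisy), "->", " ".join(clean))  # synthetic (source, target) pair
```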

155 citations


Cites methods from "Automatic Annotation and Evaluation..."

  • ...The performance of participating systems was evaluated using the ERRANT scorer (Bryant et al., 2017) which reports a F0....

Proceedings ArticleDOI
Jared Lichtarge, Chris Alberti, Shankar Kumar, Noam Shazeer, Niki Parmar, Simon Tong
01 Jun 2019
TL;DR: It is demonstrated that neural GEC models trained using either type of corpora give similar performance, and systematic analysis is presented that compares the two approaches to data generation and highlights the effectiveness of ensembling.
Abstract: Grammatical Error Correction (GEC) has been recently modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from the lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. The first method extracts source-target pairs from Wikipedia edit histories with minimal filtration heuristics while the second method introduces noise into Wikipedia sentences via round-trip translation through bridge languages. Both strategies yield similar sized parallel corpora containing around 4B tokens. We employ an iterative decoding strategy that is tailored to the loosely supervised nature of our constructed corpora. We demonstrate that neural GEC models trained using either type of corpora give similar performance. Fine-tuning these models on the Lang-8 corpus and ensembling allows us to surpass the state of the art on both the CoNLL ‘14 benchmark and the JFLEG task. We present systematic analysis that compares the two approaches to data generation and highlights the effectiveness of ensembling.
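
A minimal sketch of the round-trip translation strategy; translate(text, src, tgt) is a placeholder for any machine translation system and is not the paper's pipeline.

```python
# Hypothetical interface: translate(text, src, tgt) is any MT function.
def round_trip_pair(sentence, translate, bridge="fr"):
    """Return a (noisy_source, clean_target) pair for GEC pre-training.

    Translating out to a bridge language and back introduces natural-looking
    noise into an otherwise clean sentence."""
    noisy = translate(translate(sentence, src="en", tgt=bridge), src=bridge, tgt="en")
    return noisy, sentence
```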

131 citations


Cites methods from "Automatic Annotation and Evaluation..."

  • ...The error categories were tagged using the approach in Bryant et al. (2017)....

Proceedings ArticleDOI
01 Jul 2018
TL;DR: Experiments show the proposed approaches improve the performance of seq2seq models for GEC, achieving state-of-the-art results on both CoNLL-2014 and JFLEG benchmark datasets.
Abstract: Most of the neural sequence-to-sequence (seq2seq) models for grammatical error correction (GEC) have two limitations: (1) a seq2seq model may not be well generalized with only limited error-corrected data; (2) a seq2seq model may fail to completely correct a sentence with multiple errors through normal seq2seq inference. We attempt to address these limitations by proposing a fluency boost learning and inference mechanism. Fluency boosting learning generates fluency-boost sentence pairs during training, enabling the error correction model to learn how to improve a sentence’s fluency from more instances, while fluency boosting inference allows the model to correct a sentence incrementally with multiple inference steps until the sentence’s fluency stops increasing. Experiments show our approaches improve the performance of seq2seq models for GEC, achieving state-of-the-art results on both CoNLL-2014 and JFLEG benchmark datasets.
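
A minimal sketch of the multi-step inference idea: keep re-correcting the sentence until a fluency score stops improving. Both correct and fluency below are assumed interfaces (any single-pass GEC model and any sentence-level fluency estimate), not the paper's code.

```python
def multi_step_correct(sentence, correct, fluency, max_steps=5):
    """Apply a GEC model repeatedly while the fluency score keeps increasing."""
    best, best_score = sentence, fluency(sentence)
    for _ in range(max_steps):
        candidate = correct(best)
        score = fluency(candidate)
        if score <= best_score:   # stop once fluency no longer increases
            break
        best, best_score = candidate, score
    return best
```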

113 citations


Additional excerpts

  • ...…error detection (Leacock et al., 2010; Rei and Yannakoudakis, 2016; Kaneko et al., 2017) and GEC evaluation (Tetreault et al., 2010b; Madnani et al., 2011; Dahlmeier and Ng, 2012c; Napoles et al., 2015; Sakaguchi et al., 2016; Napoles et al., 2016; Bryant et al., 2017; Asano et al., 2017)....

References
Book
01 Jan 1993
TL;DR: This article presents bootstrap methods for estimation, using simple arguments, with Minitab macros for implementing these methods, as well as some examples of how these methods could be used for estimation purposes.
Abstract: This article presents bootstrap methods for estimation, using simple arguments. Minitab macros for implementing these methods are given.
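
As a reminder of what the bootstrap provides, here is a minimal resampling sketch (illustrative only, unrelated to the Minitab macros mentioned above):

```python
import random

# Estimate a statistic and a 95% percentile interval by resampling with replacement.
def bootstrap(data, stat, n_resamples=2000, seed=0):
    rng = random.Random(seed)
    estimates = sorted(stat([rng.choice(data) for _ in data])
                       for _ in range(n_resamples))
    lo = estimates[int(0.025 * n_resamples)]
    hi = estimates[int(0.975 * n_resamples)]
    return stat(data), (lo, hi)

mean = lambda xs: sum(xs) / len(xs)
print(bootstrap([2.1, 2.4, 1.9, 2.8, 2.2, 2.5], mean))
```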

37,183 citations

Proceedings Article
19 Jun 2011
TL;DR: It is demonstrated how supervised discriminative machine learning techniques can be used to automate the assessment of 'English as a Second or Other Language' (ESOL) examination scripts by using rank preference learning to explicitly model the grade relationships between scripts.
Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of 'English as a Second or Other Language' (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publically available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of 'outlier' texts, we test the validity of our model and identify cases where the model's scores diverge from that of a human examiner.
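
A toy sketch of pairwise rank preference learning under simplifying assumptions (perceptron updates on feature-difference vectors; the features and grades are invented, and this is not the paper's system):

```python
import random

# Learn weights so that, for any two scripts, the higher-graded one scores higher.
def train_pairwise_ranker(features, grades, epochs=20, lr=0.1, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    pairs = [(i, j) for i in range(len(grades))
                    for j in range(len(grades)) if grades[i] > grades[j]]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for i, j in pairs:
            diff = [a - b for a, b in zip(features[i], features[j])]
            if sum(wk * dk for wk, dk in zip(w, diff)) <= 0:  # pair mis-ordered
                w = [wk + lr * dk for wk, dk in zip(w, diff)]
    return w

# Two toy features per script (e.g. error rate, length); grades are exam scores.
print(train_pairwise_ranker([[0.9, 1.0], [0.5, 1.2], [0.1, 0.8]], [2, 3, 5]))
```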

521 citations


"Automatic Annotation and Evaluation..." refers methods in this paper

  • ...For example, a classifier trained on the First Certificate in English (FCE) corpus (Yannakoudakis et al., 2011) is unlikely to perform as well on the National University of Singapore Corpus of Learner English (NUCLE) (Dahlmeier and Ng, 2012) or vice versa, because both corpora have been annotated according to different standards (cf....

14 Oct 2005
TL;DR: Free-marginal multirater kappa (multirater κfree), like its birater free-marginal counterparts (PABAK, S, RE, and κm), is appropriate for the typical agreement study, in which raters’ distributions of cases into categories are not restricted.
Abstract: Fleiss’ popular multirater kappa is known to be influenced by prevalence and bias, which can lead to the paradox of high agreement but low kappa. It also assumes that raters are restricted in how they can distribute cases across categories, which is not a typical feature of many agreement studies. In this article, a free-marginal, multirater alternative to Fleiss’ multirater kappa is introduced. Free-marginal multirater kappa (multirater κfree), like its birater free-marginal counterparts (PABAK, S, RE, and κm), is not influenced by prevalence and bias and is appropriate for the typical agreement study, in which raters’ distributions of cases into categories are not restricted. Recommendations for the proper use of multirater κfree are included.
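
For concreteness, free-marginal multirater kappa can be written as kappa_free = (P_o - 1/k) / (1 - 1/k), where P_o is the observed per-item pairwise agreement (computed as in Fleiss' kappa) and k is the number of categories. A small sketch with invented rating counts:

```python
def kappa_free(item_counts, k):
    """item_counts: one dict per item mapping category -> number of raters
    who chose it (the number of raters per item is assumed constant)."""
    n = sum(item_counts[0].values())
    p_o = sum(sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
              for counts in item_counts) / len(item_counts)
    p_e = 1.0 / k                       # free-marginal chance agreement
    return (p_o - p_e) / (1 - p_e)

ratings = [{"Good": 5, "Acceptable": 0}, {"Good": 3, "Acceptable": 2}]
print(round(kappa_free(ratings, k=2), 3))  # 0.4 for this toy data
```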

502 citations

Proceedings ArticleDOI
01 Jun 2014
TL;DR: The CoNLL-2014 shared task was devoted to grammatical error correction, in which a participating system is expected to detect and correct grammatical errors of all types.
Abstract: The CoNLL-2014 shared task was devoted to grammatical error correction of all error types. In this paper, we give the task definition, present the data sets, and describe the evaluation metric and scorer used in the shared task. We also give an overview of the various approaches adopted by the participating teams, and present the evaluation results. Compared to the CoNLL-2013 shared task, we have introduced the following changes in CoNLL-2014: (1) A participating system is expected to detect and correct grammatical errors of all types, instead of just the five error types in CoNLL-2013; (2) The evaluation metric was changed from F1 to F0.5, to emphasize precision over recall; and (3) We have two human annotators who independently annotated the test essays, compared to just one human annotator in CoNLL-2013.

484 citations

Proceedings Article
03 Jun 2012
TL;DR: This work presents a novel method for evaluating grammatical error correction that is an algorithm for efficiently computing the sequence of phrase-level edits between a source sentence and a system hypothesis that achieves the highest overlap with the gold-standard annotation.
Abstract: We present a novel method for evaluating grammatical error correction. The core of our method, which we call MaxMatch (M2), is an algorithm for efficiently computing the sequence of phrase-level edits between a source sentence and a system hypothesis that achieves the highest overlap with the gold-standard annotation. This optimal edit sequence is subsequently scored using F1 measure. We test our M2 scorer on the Helping Our Own (HOO) shared task data and show that our method results in more accurate evaluation for grammatical error correction.
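
The snippet below is a much-simplified illustration of phrase-level edit extraction between a source sentence and a hypothesis; difflib stands in for the scorer's Levenshtein-style alignment, and the real MaxMatch additionally chooses the edit sequence that maximises overlap with the gold-standard annotation.

```python
import difflib

# Return (start, end, source_span, hypothesis_span) tuples for non-matching spans.
def extract_edits(src_tokens, hyp_tokens):
    sm = difflib.SequenceMatcher(None, src_tokens, hyp_tokens)
    return [(i1, i2, " ".join(src_tokens[i1:i2]), " ".join(hyp_tokens[j1:j2]))
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]

src = "This are a sentence .".split()
hyp = "This is a sentence .".split()
print(extract_edits(src, hyp))  # [(1, 2, 'are', 'is')]
```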

322 citations


"Automatic Annotation and Evaluation..." refers background or methods in this paper

  • ...To show that automatic references are feasible alternatives to gold references, we evaluated each team in the CoNLL-2014 shared task using both types of reference with the M2 scorer (Dahlmeier and Ng, 2012), the de facto standard of GEC evaluation, and our own scorer....

  • ...Since no scorer is currently capable of calculating error type performance however (Dahlmeier and Ng, 2012; Felice and Briscoe, 2015; Napoles et al., 2015), we instead built our own....

  • ...It is worth mentioning that despite an increased interest in GEC evaluation in recent years (Dahlmeier and Ng, 2012; Felice and Briscoe, 2015; Bryant and Ng, 2015; Napoles et al., 2015; Grundkiewicz et al., 2015; Sakaguchi et al., 2016), ERRANT is the only toolkit currently capable of producing error types scores....

  • ...For example, a classifier trained on the First Certificate in English (FCE) corpus (Yannakoudakis et al., 2011) is unlikely to perform as well on the National University of Singapore Corpus of Learner English (NUCLE) (Dahlmeier and Ng, 2012) or vice versa, because both corpora have been annotated according to different standards (cf. Xue and Hwa (2014))....
