Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data

Roman Grundkiewicz, Marcin Junczys-Dowmunt, Kenneth Heafield

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Considerable effort has been made to address the data sparsity problem in neural grammatical error correction. In this work, we propose a simple and surprisingly effective unsupervised synthetic error generation method based on confusion sets extracted from a spellchecker to increase the amount of training data. Synthetic data is used to pre-train a Transformer sequence-to-sequence model, which not only improves over a strong baseline trained on authentic error-annotated data, but also enables the development of a practical GEC system in a scenario where little genuine error-annotated data is available. The developed systems placed first in the BEA 2019 shared task, achieving 69.47 and 64.24 F0.5 in the restricted and low-resource tracks respectively, both on the W&I+LOCNESS test set. On the popular CoNLL-2014 test set, we report state-of-the-art results of 64.16 M² for the submitted system, and 61.30 M² for the constrained system trained on the NUCLE and Lang-8 data.
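To illustrate the core idea of the abstract, the following is a minimal sketch of confusion-set-based synthetic error generation. The confusion sets and the `add_synthetic_errors` helper are hypothetical stand-ins: the paper extracts confusion sets from a spellchecker rather than hardcoding them, and its full method is more elaborate than this.

```python
import random

# Hypothetical confusion sets for illustration only; in the paper these
# are extracted from a spellchecker's suggestions, not written by hand.
CONFUSION_SETS = {
    "their": ["there", "they're"],
    "affect": ["effect"],
    "than": ["then"],
    "to": ["too", "two"],
}

def add_synthetic_errors(tokens, error_prob=0.15, rng=None):
    """Corrupt a clean token sequence by swapping words for confusion-set
    alternatives, yielding a (noisy source, clean target) training pair."""
    rng = rng or random.Random(0)
    noisy = []
    for tok in tokens:
        alts = CONFUSION_SETS.get(tok.lower())
        if alts and rng.random() < error_prob:
            noisy.append(rng.choice(alts))  # inject a plausible error
        else:
            noisy.append(tok)               # keep the token unchanged
    return noisy

clean = "I like their house more than ours".split()
noisy = add_synthetic_errors(clean, error_prob=0.5)
```

Applied at scale to clean monolingual text, each `(noisy, clean)` pair becomes a synthetic training example for pre-training the sequence-to-sequence model, which is then fine-tuned on authentic error-annotated data.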
Original language: English
Title of host publication: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
Editors: Helen Yannakoudakis, Ekaterina Kochmar, Claudia Leacock, Nitin Madnani, Ildikó Pilán, Torsten Zesch
Place of Publication: Florence, Italy
Publisher: Association for Computational Linguistics
Pages: 252–263
Number of pages: 12
Publication status: E-pub ahead of print - 2 Aug 2019
Event: 14th Workshop on Innovative Use of NLP for Building Educational Applications - Florence, Italy
Duration: 2 Aug 2019 – 2 Aug 2019
Conference number: 14
https://sig-edu.org/bea/current

Conference

Conference: 14th Workshop on Innovative Use of NLP for Building Educational Applications
Abbreviated title: BEA 2019
Country/Territory: Italy
City: Florence
Period: 2/08/19 – 2/08/19
Internet address: https://sig-edu.org/bea/current

