The WikEd Error Corpus: A Corpus of Corrective Wikipedia Edits and its Application to Grammatical Error Correction

Roman Grundkiewicz, Marcin Junczys-Dowmunt

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

This paper introduces the freely available WikEd Error Corpus. We describe the data mining process from Wikipedia revision histories, corpus content and format. The corpus consists of more than 12 million sentences with a total of 14 million edits of various types. As one possible application, we show that WikEd can be successfully adapted to improve a strong baseline in an ESL grammatical error correction task by 2.63%. Used together with an ESL error corpus, a composed system gains 1.64% when compared to the ESL-trained system.
Original language: English
Title of host publication: Advances in Natural Language Processing -- Lecture Notes in Computer Science
Subtitle of host publication: 9th International Conference on NLP, PolTAL 2014, Warsaw, Poland, September 17-19, 2014. Proceedings
Editors: Adam Przepiórkowski, Maciej Ogrodniczuk
Place of Publication: Cham
Publisher: Springer
Pages: 478-490
Number of pages: 13
Volume: 8686
ISBN (Electronic): 978-3-319-10888-9
ISBN (Print): 978-3-319-10888-9
Publication status: Published - 2014
Event: 9th International Conference on Natural Language Processing (PolTAL 2014) - Warsaw, Poland
Duration: 17 Sept 2014 to 19 Sept 2014

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer International Publishing
Volume: 8686
ISSN (Print): 0302-9743

Conference

Conference: 9th International Conference on Natural Language Processing (PolTAL 2014)
Country/Territory: Poland
City: Warsaw
Period: 17/09/14 to 19/09/14

Keywords / Materials (for Non-textual outputs)

  • error corpus
  • wikipedia revision histories
  • grammatical error correction
