Automatic Extraction of Polish Language Errors from Text Edition History

Roman Grundkiewicz

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


Large error corpora are unavailable for many languages, even though such corpora have multiple applications in natural language processing. The main reason is the high cost of creating corpora manually. In this paper we present methods for automatically extracting various kinds of errors (spelling, typographical, grammatical, syntactic, semantic, and stylistic) from text edition histories. By applying these methods to Wikipedia's article revision history, we created a large, publicly available corpus of naturally occurring language errors for Polish, called PlEWi. Finally, we analyse and evaluate the error categories detected in our corpus.
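As a rough illustration of the general idea of mining corrections from an edition history, the sketch below compares two revisions of a text at the word level and keeps only short substitutions. It is not the procedure used in the paper: the function name `extract_corrections`, the `max_span` threshold, and the whitespace tokenisation are all assumptions made for this example, and a real pipeline would also need filtering for vandalism, content changes, and markup.

```python
import difflib


def extract_corrections(old_text, new_text, max_span=2):
    """Return (before, after) word-level substitutions between two revisions.

    Hypothetical helper for illustration only; max_span is an assumed
    threshold, not a value taken from the paper.
    """
    old_tokens = old_text.split()
    new_tokens = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)

    corrections = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only short replacements: long rewrites are more likely to be
        # content changes than corrections of a language error.
        if tag == "replace" and (i2 - i1) <= max_span and (j2 - j1) <= max_span:
            before = " ".join(old_tokens[i1:i2])
            after = " ".join(new_tokens[j1:j2])
            corrections.append((before, after))
    return corrections


if __name__ == "__main__":
    older = "To jest bardzo wazny tekst"
    newer = "To jest bardzo ważny tekst"
    print(extract_corrections(older, newer))
```

Restricting the output to short replacements is one simple way to bias the result towards error corrections; insertions and deletions are ignored here purely to keep the sketch small.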
Original language: English
Title of host publication: Text, Speech, and Dialogue
Subtitle of host publication: 16th International Conference, TSD 2013, Pilsen, Czech Republic, September 1-5, 2013. Proceedings
Publisher: Springer Berlin Heidelberg
Number of pages: 8
ISBN (Electronic): 978-3-642-40585-3
ISBN (Print): 978-3-642-40584-6
Publication status: Published - 2013

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer Berlin Heidelberg
ISSN (Print): 0302-9743

