TY - GEN
T1 - Automatic Extraction of Polish Language Errors from Text Edition History
AU - Grundkiewicz, Roman
PY - 2013
Y1 - 2013
N2 - Large error corpora are unavailable for many languages, even though such corpora have multiple applications in natural language processing. The main reason is the high cost of creating corpora manually. In this paper we present methods for automatically extracting various kinds of errors (spelling, typographical, grammatical, syntactic, semantic, and stylistic) from text edition histories. By applying these methods to Wikipedia’s article revision history, we created a large, publicly available corpus of naturally occurring language errors for Polish, called PlEWi. Finally, we analyse and evaluate the error categories detected in our corpus.
AB - Large error corpora are unavailable for many languages, even though such corpora have multiple applications in natural language processing. The main reason is the high cost of creating corpora manually. In this paper we present methods for automatically extracting various kinds of errors (spelling, typographical, grammatical, syntactic, semantic, and stylistic) from text edition histories. By applying these methods to Wikipedia’s article revision history, we created a large, publicly available corpus of naturally occurring language errors for Polish, called PlEWi. Finally, we analyse and evaluate the error categories detected in our corpus.
U2 - 10.1007/978-3-642-40585-3_17
DO - 10.1007/978-3-642-40585-3_17
M3 - Conference contribution
SN - 978-3-642-40584-6
T3 - Lecture Notes in Computer Science
SP - 129
EP - 136
BT - Text, Speech, and Dialogue
PB - Springer Berlin Heidelberg
ER -