From historic books to annotated XML: Building a large multilingual diachronic corpus

Magdalena Jitca, Rico Sennrich, Martin Volk

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

This paper introduces our approach to annotating a large heritage corpus, which spans over 100 years of alpine literature. The corpus consists of over 16,000 articles from the yearbooks of the Swiss Alpine Club: 60% are German texts, 38% French, 1% Italian, and the remaining 1% Swiss German and Romansh. We describe the inherent difficulties of processing a multilingual corpus by focusing on the most challenging annotation phases: article identification, correction of optical character recognition (OCR) errors, tokenization, and language identification. The paper aims to raise awareness of the effort involved in building and annotating multilingual corpora rather than to evaluate each individual annotation phase.

Keywords: multilingual corpora, cultural heritage, corpus annotation, text digitization
Original language: English
Title of host publication: Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011
Place of Publication: Hamburg, Germany
Publisher: Universität Hamburg
Pages: 75-80
Number of pages: 6
Publication status: Published - 1 Sep 2011
Event: Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011 - Hamburg, Germany
Duration: 28 Sep 2011 to 30 Sep 2011

Publication series

Name: Arbeiten zur Mehrsprachigkeit, Folge B. Working Papers in Multilingualism, Series B
Publisher: Universität Hamburg

Conference

Conference: Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011
Country/Territory: Germany
City: Hamburg
Period: 28/09/11 to 30/09/11
