Digitised Historical Text: Does it have to be mediOCRe?

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

This paper reports on experiments to improve the Optical Character Recognition
(OCR) quality of historical text as a preliminary step in text mining.
We analyse the quality of OCRed text compared to a gold standard and show
how it can be improved by performing two automatic correction steps. We also
demonstrate the impact this can have on named entity recognition in a preliminary
extrinsic evaluation. This work was performed as part of the Trading
Consequences project, which is focussed on text mining of historical documents
for the study of nineteenth-century trade in the British Empire.
Original language: English
Title of host publication: Proceedings of KONVENS 2012 (LThist 2012 workshop)
Number of pages: 9
Publication status: Published - 2012

