Estimating and Rating the Quality of Optically Character Recognised Text

Beatrice Alex, John Burns

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The focus of this paper is on the quality of historical text digitised through optical character recognition (OCR) and how it affects text mining. We study the effect OCR errors have on named entity recognition (NER) and show that in a random sample of documents picked from several historical text collections, 30.6% of false negative commodity and location mentions and 13.3% of all manually annotated commodity and location mentions contain OCR errors. We introduce a simple method for estimating text quality of OCRed text and examine how well human raters can evaluate it. We also illustrate how automatic text quality estimation compares to manual rating with the aim of determining a quality threshold below which documents could potentially be discarded or would require extensive correction first. This work was conducted during the Trading Consequences project, which focussed on text mining and visualisation of historical documents for the study of nineteenth-century trade.
Original language: English
Title of host publication: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage
Place of Publication: Madrid, Spain
Publisher: ACM Association for Computing Machinery
Pages: 97
Number of pages: 102
Publication status: Published - 19 May 2014

  • Trading Consequences

    Klein, E. (Principal Investigator)

    Other

    1/01/12 → 31/12/13

    Project: Research
