Omni font OCR error correction with effect on retrieval

W. Magdy, K. Darwish

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Recent library digitization projects attempt to provide large collections of printed material from varying sources in a searchable format. The scanned documents are typically processed using Optical Character Recognition (OCR), which often introduces errors into the text. This paper proposes a technique for correcting OCR-degraded text that is independent of character-level OCR errors, and hence independent of the scanned document source. It is based on language modeling in conjunction with a uniform character model that uses edit distance only. The technique compares well to state-of-the-art correction techniques that are based on language modeling and source-specific character error models. Although the proposed technique yielded lower correction effectiveness, its impact on retrieval effectiveness is statistically significant and on par with state-of-the-art correction techniques. The main requirement of the proposed technique is the training of a “good” language model matching genre, style, and temporal coverage. The advantage of being independent of character-level errors is clear in applications where printed documents vary in source, font, and degradation level.
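
The general idea of pairing a language model with a uniform, edit-distance-only character model can be illustrated with a minimal sketch. This is not the authors' implementation: the unigram word model, the toy corpus, and the names `correct`, `max_dist`, and `edit_penalty` are assumptions made purely for illustration, and the paper's actual language model and scoring are not specified in this abstract.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def correct(token, counts, max_dist=2, edit_penalty=4.0):
    """Pick the vocabulary word maximizing log P(word) - edit_penalty * distance.

    The character model is "uniform": every edit costs the same, no matter
    which characters are involved, so no source-specific confusion statistics
    are needed. edit_penalty is a hypothetical tuning constant.
    """
    total = sum(counts.values())
    best_word, best_score = token, float("-inf")
    for word, count in counts.items():
        dist = edit_distance(token, word)
        if dist > max_dist:
            continue  # too far away to be a plausible correction
        score = math.log(count / total) - edit_penalty * dist
        if score > best_score:
            best_word, best_score = word, score
    return best_word  # unchanged token if no candidate is close enough


# Toy "language model": unigram counts from a tiny clean corpus (assumption).
corpus = "the library scanned the printed book and the library indexed the text".split()
counts = Counter(corpus)

noisy = "tbe librarv scanned the bo0k".split()
print([correct(t, counts) for t in noisy])
```

Because every edit is charged the same cost, nothing in the sketch depends on the font or scanner that produced the errors; all source-specific knowledge lives in the language model, which is why its match to the collection matters.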
Original language: English
Title of host publication: 2010 10th International Conference on Intelligent Systems Design and Applications
Publisher: Institute of Electrical and Electronics Engineers
Pages: 415-420
Number of pages: 6
ISBN (Electronic): 978-1-4244-8136-1
ISBN (Print): 978-1-4244-8134-7
Publication status: Published - 1 Nov 2010
