Abstract
Recent library digitization projects attempt to provide large collections of printed material from varying sources in a searchable format. The scanned documents are typically processed using Optical Character Recognition (OCR), which often introduces errors in the text. This paper proposes a technique for correcting OCR-degraded text that is independent of character-level OCR errors, and hence independent of the scanned document source. It is based on language modeling in conjunction with a uniform character model that uses edit distance only. The technique compares well to state-of-the-art correction techniques that are based on language modeling and source-specific character error models. Although the proposed technique yielded lower correction effectiveness, its impact on retrieval effectiveness is statistically significant and on par with state-of-the-art correction techniques. The main requirement of the proposed technique is the training of a “good” language model matching genre, style, and temporal coverage. The advantage of being independent of character-level errors is clear in applications where printed documents vary in source, font, and degradation level.
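To illustrate the general idea of pairing a language model with a uniform, edit-distance-only character model, the following is a minimal sketch and not the authors' implementation. It assumes a toy add-one-smoothed bigram language model, a fixed log penalty per edit operation, and a candidate set drawn from the training vocabulary. The names `edit_distance`, `correct_token`, `edit_log_penalty`, and `max_edits`, as well as the toy corpus, are hypothetical choices made for this example.

```python
import math
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def correct_token(token, prev_word, unigrams, bigrams,
                  edit_log_penalty=math.log(0.01), max_edits=2):
    """Pick the vocabulary word maximizing LM score plus a uniform edit penalty.

    Every edit costs the same (assumed) log penalty, regardless of which
    characters are involved, so no source-specific confusion table is needed.
    """
    vocab_size = len(unigrams)
    best_word, best_score = token, -math.inf
    for cand in unigrams:
        d = edit_distance(token, cand)
        if d > max_edits:
            continue
        # Add-one smoothed bigram probability (an assumption of this sketch).
        lm = math.log((bigrams[(prev_word, cand)] + 1)
                      / (unigrams[prev_word] + vocab_size))
        score = lm + d * edit_log_penalty
        if score > best_score:
            best_word, best_score = cand, score
    return best_word  # unchanged token if no candidate is close enough


# Toy training text standing in for a genre-matched corpus.
tokens = "the quick brown fox jumps over the lazy dog the quick brown dog".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

corrected = correct_token("qu1ck", "the", unigrams, bigrams)
```

Because the character model treats all edits alike, the only source-dependent ingredient in this sketch is the text used to train the language model, which mirrors the abstract's point that a well-matched language model is the main requirement.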
Original language | English |
---|---|
Title of host publication | 2010 10th International Conference on Intelligent Systems Design and Applications |
Publisher | Institute of Electrical and Electronics Engineers |
Pages | 415-420 |
Number of pages | 6 |
ISBN (Electronic) | 978-1-4244-8136-1 |
ISBN (Print) | 978-1-4244-8134-7 |
Publication status | Published - 1 Nov 2010 |