A Cost Efficient Approach to Correct OCR Errors in Large Document Collections

Deepayan Das, Jerin Philip, Minesh Mathew, C. V. Jawahar

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

Abstract / Description of output

The word error rate of an OCR system is often higher than its character error rate. This is especially true when the OCR is designed to recognize individual characters. High word accuracy is critical for many practical applications such as content creation and text-to-speech systems. To detect and correct misrecognised words, it is common for an OCR to employ a post-processor module that improves word accuracy. However, conventional approaches to post-processing, such as dictionary lookup or a statistical language model (SLM), are still limited. In many such scenarios, the remaining errors often have to be removed manually.

We observe that traditional post-processing schemes look at error words sequentially, since OCRs process documents one at a time. We propose a cost-efficient model that addresses error words in batches rather than correcting them individually. We exploit the fact that a collection of documents (e.g., a book), unlike a single document, has a structure that leads to repetition of words. Such words, if efficiently grouped and corrected together, can significantly reduce the correction effort. Error correction can be fully automatic or performed with a human in the loop. We compare the performance of our method with various baseline approaches, including the case where all errors are removed by a human. We demonstrate the efficacy of our solution empirically by reporting a reduction of more than 70% in human effort with near-perfect error correction. We validate our method on books in both English and Hindi.
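
The batching idea can be illustrated with a minimal Python sketch. This is not the authors' method: the abstract does not describe how words are grouped, so the sketch simply groups out-of-vocabulary tokens by identical surface form and asks for one correction per group. The names `group_error_words`, `correct_in_batches`, and `correct_fn` are hypothetical, and the vocabulary and tokens are toy data.

```python
from collections import defaultdict


def group_error_words(tokens, vocabulary):
    """Map each out-of-vocabulary surface form to the positions where it occurs."""
    groups = defaultdict(list)
    for position, token in enumerate(tokens):
        if token.lower() not in vocabulary:
            groups[token].append(position)
    return dict(groups)


def correct_in_batches(tokens, vocabulary, correct_fn):
    """Correct every group of identical error words with a single query.

    correct_fn stands in for the corrector (a human or an automatic model);
    it is called once per group instead of once per error occurrence.
    Returns the corrected tokens and the fraction of queries saved.
    """
    groups = group_error_words(tokens, vocabulary)
    corrected = list(tokens)
    for wrong_form, positions in groups.items():
        fix = correct_fn(wrong_form)  # one query for the whole group
        for position in positions:
            corrected[position] = fix  # propagate the fix to every occurrence
    total_errors = sum(len(positions) for positions in groups.values())
    saved = 1 - len(groups) / total_errors if total_errors else 0.0
    return corrected, saved


# Toy usage: a dictionary lookup plays the role of the human corrector.
toy_vocabulary = {"the", "of", "words", "cluster"}
toy_tokens = ["the", "c1uster", "of", "c1uster", "words", "c1uster", "wrods"]
toy_fixes = {"c1uster": "cluster", "wrods": "words"}
fixed_tokens, fraction_saved = correct_in_batches(
    toy_tokens, toy_vocabulary, toy_fixes.get
)
```

Grouping by exact string match is the simplest possible choice and assumes that identical error strings share the same correction, which need not hold in real OCR output. The saving grows with how often the same error repeats across the collection.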
Original language: English
Title of host publication: 2019 International Conference on Document Analysis and Recognition (ICDAR)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 655-662
Number of pages: 8
ISBN (Electronic): 978-1-7281-3014-9
ISBN (Print): 978-1-7281-3015-6
Publication status: Published - 20 Sept 2019
Event: 15th International Conference on Document Analysis and Recognition - Sydney, Australia
Duration: 20 Sept 2019 to 25 Sept 2019
http://icdar2019.org/

Publication series

Publisher: IEEE
ISSN (Print): 1520-5363
ISSN (Electronic): 2379-2140

Conference

Conference: 15th International Conference on Document Analysis and Recognition
Abbreviated title: ICDAR 2019
Country/Territory: Australia
City: Sydney
Period: 20/09/19 to 25/09/19

Keywords / Materials (for Non-textual outputs)

  • OCR
  • Batch Correction
  • Clustering
  • Post-Processing
