Improving IR Performance from OCRed Text Using Cooccurrence

Kripabandhu Ghosh, Anirban Chakraborty, Swapan Kumar Parui

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

Information Retrieval performance is hurt to a great extent by OCR errors. Much research has been reported on modelling and correction of OCR errors. However, all the existing systems make use of language-dependent resources or training texts to study the nature of errors. No research has been reported on improving retrieval performance from erroneous text when no training data is available. We propose a novel algorithm for automatic detection of OCR errors and improvement of retrieval performance from the erroneous corpus. Our algorithm does not use any training data or any language-specific resources such as a thesaurus. It also does not use any knowledge about the language except that the word delimiter is a blank space. We have tested our algorithm on the erroneous OCRed Bangla FIRE collection offered in the RISOT 2012 track and obtained about 9% improvement over the OCRed baseline. However, the improvement is not statistically significant.
Original language: English
Title of host publication: Post-Proceedings of the 4th and 5th Workshops of the Forum for Information Retrieval Evaluation
Editors: Prasenjit Majumder, Mandar Mitra, Madhulika Agrawal, Parth Mehta
Place of Publication: New York, NY, USA
Publisher: Association for Computing Machinery (ACM)
Number of pages: 7
ISBN (Print): 9781450328302
Publication status: Published - 4 Dec 2013
Event: 5th Workshop of the Forum for Information Retrieval Evaluation (FIRE) 2013 - New Delhi, India
Duration: 4 Dec 2013 to 6 Dec 2013
Conference number: 5

Publication series

Name: FIRE '12 & '13
Publisher: Association for Computing Machinery


Workshop: 5th Workshop of the Forum for Information Retrieval Evaluation (FIRE) 2013
Abbreviated title: FIRE 2013
City: New Delhi

Keywords / Materials (for Non-textual outputs)

  • Query Expansion
  • Cooccurrence

