A Word Association Based Approach for Improving Retrieval Performance from Noisy OCRed Text

Anirban Chakraborty, Kripabandhu Ghosh, Utpal Roy

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

OCR errors substantially degrade retrieval performance. Research has been done on modelling and correcting OCR errors. However, most existing systems rely on language-dependent resources or training texts to study the nature of the errors. Little research has been reported on improving retrieval performance from erroneous text when no training data is available. We propose an algorithm for detecting OCR errors and improving retrieval performance from an erroneous corpus. We present two versions of the algorithm: one based on word cooccurrence and the other based on Pointwise Mutual Information. Our algorithm does not use any training data or any language-specific resources such as a thesaurus. It also does not use any knowledge about the language except that the word delimiter is a blank space. We have tested our algorithm on the erroneous Bangla FIRE collection and obtained significant improvements.
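
The abstract names two association measures, word cooccurrence and Pointwise Mutual Information, but this record does not give their formulas or how the paper applies them. As a minimal sketch only, and not the authors' algorithm, the following Python shows one common way to compute both measures from a corpus whose only assumed structure is blank-space word delimiters. The function name `cooccurrence_and_pmi`, the document-level cooccurrence window, the base-2 logarithm, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_and_pmi(documents):
    """Return (cooc, pmi), both keyed by alphabetically sorted word pairs.

    Assumption: two words co-occur when they appear in the same document,
    and each document contributes at most one count per pair.
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words

    for doc in documents:
        # The only language knowledge used: words are separated by blank space.
        vocab = sorted(set(doc.split()))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    pmi = {}
    for (w1, w2), joint in pair_df.items():
        p_joint = joint / n_docs
        p1 = word_df[w1] / n_docs
        p2 = word_df[w2] / n_docs
        # PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )
        pmi[(w1, w2)] = math.log2(p_joint / (p1 * p2))
    return pair_df, pmi


# Hypothetical toy corpus; "rivcr" stands in for an OCR-corrupted "river".
corpus = ["river bank water", "rivcr bank water", "stock market"]
cooc, pmi = cooccurrence_and_pmi(corpus)
print(cooc[("bank", "water")], pmi[("bank", "water")])
```

In this toy corpus the corrupted form `rivcr` shares the context words `bank` and `water` with `river`, which is the kind of association signal such measures expose. How the paper turns that signal into error detection is not stated in this record.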
Original language: English
Title of host publication: Proceedings of the International Conference on Knowledge Discovery and Information Retrieval
Editors: Ana Fred, Joaquim Filipe
Publisher: SCITEPRESS
Pages: 450-456
Number of pages: 17
ISBN (Electronic): 978-989-758-048-2
Publication status: Published - 24 Oct 2014
Event: 6th International Conference on Knowledge Discovery and Information Retrieval (KDIR) - Rome, Italy
Duration: 21 Oct 2014 to 24 Oct 2014
Conference number: 6
https://kdir.scitevents.org/Home.aspx?y=2014

Publication series

Name: Proceedings of the International Conference on Knowledge Discovery and Information Retrieval
Publisher: Scitepress
ISSN (Electronic): 2184-3228

Conference

Conference: 6th International Conference on Knowledge Discovery and Information Retrieval (KDIR)
Abbreviated title: KDIR 2014
Country/Territory: Italy
City: Rome
Period: 21/10/14 to 24/10/14

Keywords

  • Erroneous Text
  • Cooccurrence
  • Pointwise Mutual Information
