Retrieval from Noisy E-Discovery Corpus in the Absence of Training Data

Anirban Chakraborty, Kripabandhu Ghosh, Swapan Kumar Parui

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

OCR errors can severely degrade retrieval performance. Prior work has addressed the modelling and correction of OCR errors, but most existing systems rely on language-dependent resources or training texts to study the nature of the errors. Little research has been reported on improving retrieval from erroneous text when no training data is available. We propose a novel algorithm for detecting OCR errors and improving retrieval performance on an E-Discovery corpus. Our contribution is twofold: (1) identifying erroneous variants of query terms to improve retrieval performance, and (2) opening up the possibility of error modelling in an erroneous corpus for which no clean ground-truth text is available for comparison. Our algorithm uses neither training data nor language-specific resources such as a thesaurus. It also assumes no knowledge of the language beyond the fact that words are delimited by blank spaces. The proposed approach obtained statistically significant improvements in recall over state-of-the-art baselines.
Original language: English
Title of host publication: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
Place of Publication: New York, NY, USA
Publisher: Association for Computing Machinery, Inc
Pages: 755–758
Number of pages: 4
ISBN (Print): 9781450336215
Publication status: Published - 9 Aug 2015
Event: 38th International ACM SIGIR Conference on Research and Development in Information Retrieval - Santiago, Chile
Duration: 9 Aug 2015 – 13 Aug 2015
Conference number: 38

Publication series

Name: SIGIR '15
Publisher: Association for Computing Machinery

Conference

Conference: 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
Abbreviated title: SIGIR 2015
Country/Territory: Chile
City: Santiago
Period: 9/08/15 – 13/08/15

Keywords

  • e-discovery
  • co-occurrence
  • noisy data
