Abstract
OCR errors substantially degrade retrieval performance. Prior work has modelled and corrected OCR errors, but most existing systems rely on language-dependent resources or training texts to study the nature of the errors. Little research has been reported on improving retrieval from erroneous text when no training data is available. We propose an algorithm for detecting OCR errors and improving retrieval performance on the erroneous corpus. We present two versions of the algorithm: one based on word cooccurrence and the other on Pointwise Mutual Information. The algorithm uses no training data and no language-specific resources such as a thesaurus, and it assumes nothing about the language except that the word delimiter is a blank space. We tested the algorithm on the erroneous Bangla FIRE collection and obtained significant improvements.
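For readers unfamiliar with the two statistics named in the abstract, the following is a minimal sketch of how document-level cooccurrence counts and Pointwise Mutual Information can be computed from whitespace-tokenised text. It is not the authors' algorithm: the function name `pmi_table`, the choice of the whole document as the cooccurrence window, and the base-2 logarithm are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Return {(w1, w2): PMI} for word pairs that cooccur in a document.

    Hypothetical helper, not taken from the paper. Probabilities are
    document-level: P(w) is the fraction of documents containing w,
    and P(w1, w2) is the fraction containing both words.
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing each word pair
    for text in documents:
        # Split on whitespace only: the single language assumption
        # stated in the abstract.
        vocab = sorted(set(text.split()))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
    #     = log2( N * df(w1, w2) / (df(w1) * df(w2)) )
    return {
        (w1, w2): math.log2(n_docs * df / (word_df[w1] * word_df[w2]))
        for (w1, w2), df in pair_df.items()
    }
```

Pairs are keyed in sorted order, so `pmi_table(docs)[("a", "b")]` looks up the score for words `a` and `b`. A positive score means the two words share documents more often than independence would predict, and a negative score means less often.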
Original language | English |
---|---|
Title of host publication | Proceedings of the International Conference on Knowledge Discovery and Information Retrieval |
Editors | Ana Fred, Joaquim Filipe |
Publisher | SCITEPRESS |
Pages | 450-456 |
Number of pages | 17 |
ISBN (Electronic) | 978-989-758-048-2 |
Publication status | Published - 24 Oct 2014 |
Event | 6th International Conference on Knowledge Discovery and Information Retrieval (KDIR), Rome, Italy. Duration: 21 Oct 2014 → 24 Oct 2014. Conference number: 6. https://kdir.scitevents.org/Home.aspx?y=2014 |
Publication series
Name | Proceedings of the International Conference on Knowledge Discovery and Information Retrieval |
---|---|
Publisher | SCITEPRESS |
ISSN (Electronic) | 2184-3228 |
Conference
Conference | 6th International Conference on Knowledge Discovery and Information Retrieval (KDIR) |
---|---|
Abbreviated title | KDIR 2014 |
Country/Territory | Italy |
City | Rome |
Period | 21/10/14 → 24/10/14 |
Keywords
- Erroneous Text
- Cooccurrence
- Pointwise Mutual Information