Abstract
Information retrieval performance is substantially degraded by OCR errors. Much research has been reported on modelling and correcting OCR errors. However, all existing systems rely on language-dependent resources or training texts to study the nature of the errors. No research has been reported on improving retrieval performance from erroneous text when no training data is available. We propose a novel algorithm for automatically detecting OCR errors and improving retrieval performance on the erroneous corpus. Our algorithm uses neither training data nor language-specific resources such as a thesaurus. It also assumes no knowledge of the language except that words are delimited by blank spaces. We tested our algorithm on the erroneous OCRed Bangla FIRE collection offered in the RISOT 2012 track and obtained an improvement of about 9% over the OCRed baseline. However, the improvement is not statistically significant.
Original language | English |
---|---|
Title of host publication | Post-Proceedings of the 4th and 5th Workshops of the Forum for Information Retrieval Evaluation |
Editors | Prasenjit Majumder, Mandar Mitra, Madhulika Agrawal, Parth Mehta |
Place of Publication | New York, NY, USA |
Publisher | Association for Computing Machinery (ACM) |
Number of pages | 7 |
ISBN (Print) | 9781450328302 |
Publication status | Published - 4 Dec 2013 |
Event | 5th Workshop of the Forum for Information Retrieval Evaluation (FIRE) 2013, New Delhi, India; Duration: 4 Dec 2013 → 6 Dec 2013; Conference number: 5 |
Publication series
Name | FIRE '12 & '13 |
---|---|
Publisher | Association for Computing Machinery |
Workshop
Workshop | 5th Workshop of the Forum for Information Retrieval Evaluation (FIRE) 2013 |
---|---|
Abbreviated title | FIRE 2013 |
Country/Territory | India |
City | New Delhi |
Period | 4 Dec 2013 → 6 Dec 2013 |
Keywords / Materials (for Non-textual outputs)
- Query Expansion
- Cooccurrence