Abstract / Description of output
This paper reports on experiments to improve the Optical Character Recognition
(OCR) quality of historical text as a preliminary step in text mining.
We analyse the quality of OCRed text compared to a gold standard and show
how it can be improved by performing two automatic correction steps. We also
demonstrate the impact this can have on named entity recognition in a
preliminary extrinsic evaluation. This work was performed as part of the Trading
Consequences project, which focuses on text mining of historical documents
for the study of nineteenth-century trade in the British Empire.
Original language | English |
---|---|
Title of host publication | Proceedings of KONVENS 2012 (LThist 2012 workshop) |
Pages | 401-409 |
Number of pages | 9 |
Publication status | Published - 2012 |