Tools to address the interdependence between tokenisation and standoff annotation

Claire Grover, Michael Matthews, Richard Tobin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper we discuss technical issues arising from the interdependence between tokenisation and XML-based annotation tools, in particular those which use standoff annotation in the form of pointers to word tokens. It is common practice for an XML-based annotation tool to use word tokens as the target units for annotating such things as named entities, because they provide appropriate units for standoff annotation. Furthermore, these units can be easily selected, swept out or snapped to by the annotators, and certain classes of annotation mistakes can be prevented by building a tool that does not permit selection of a substring which does not entirely span one or more XML elements. There is a downside to this method of annotation, however, in that it assumes that for any given data set, in whatever domain, the optimal tokenisation is known before any annotation is performed. If mistakes are made in the initial tokenisation and the word boundaries conflict with the annotators’ desired actions, then either the annotation is inaccurate or expensive retokenisation and reannotation will be required. Here we describe the methods we have developed to address this problem. We also describe experiments which explore the effects of different granularities of tokenisation on NER tagger performance.
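
The boundary conflict described in the abstract can be illustrated with a minimal sketch. It is not the authors' tooling: the `Token` type, the `tokenise` and `to_standoff` helpers, the two regular expressions and the sample sentence are all hypothetical, chosen only to show how a token-pointer annotation fails when the initial tokenisation is too coarse for the span an annotator wants to mark.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    id: str     # identifier a standoff annotation would point to
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def tokenise(text: str, pattern: str) -> List[Token]:
    """Split text into tokens using a regex (illustrative tokeniser only)."""
    return [
        Token(f"w{i}", m.start(), m.end())
        for i, m in enumerate(re.finditer(pattern, text), start=1)
    ]


def to_standoff(tokens: List[Token], start: int, end: int) -> Optional[Tuple[str, str]]:
    """Return (first_token_id, last_token_id) if [start, end) covers whole
    tokens exactly; return None if the span cuts through a token."""
    covered = [t for t in tokens if t.start < end and t.end > start]
    if not covered:
        return None
    if covered[0].start != start or covered[-1].end != end:
        return None
    return covered[0].id, covered[-1].id


text = "the p53-mediated response"   # invented example sentence
entity = (4, 7)                      # the annotator wants to mark "p53"

coarse = tokenise(text, r"\S+")           # whitespace tokens only
fine = tokenise(text, r"\w+|[^\w\s]")     # punctuation split off as well

coarse_result = to_standoff(coarse, *entity)
fine_result = to_standoff(fine, *entity)
```

Under the coarse tokenisation, "p53-mediated" is a single token, so the desired span cannot be expressed as pointers to whole tokens and `coarse_result` is `None`. Under the finer tokenisation, "p53" is a token in its own right and `fine_result` points to it.
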
Original language: English
Title of host publication: Proceedings of the 5th Workshop on NLP and XML: Multi-Dimensional Markup in Natural Language Processing
Publisher: Association for Computational Linguistics
Pages: 19-26
Number of pages: 8
Publication status: Published - 2006
