Abstract
In this paper we discuss technical issues arising from the interdependence between tokenisation and XML-based annotation tools, in particular those which use stand-off annotation in the form of pointers to word tokens. It is common practice for an XML-based annotation tool to use word tokens as the target units for annotating such things as named entities because they provide appropriate units for stand-off annotation. Furthermore, these units can be easily selected, swept out or snapped to by the annotators, and certain classes of annotation mistakes can be prevented by building a tool that does not permit selection of a substring which does not entirely span one or more XML elements. There is a downside to this method of annotation, however, in that it assumes that for any given data set, in whatever domain, the optimal tokenisation is known before any annotation is performed. If mistakes are made in the initial tokenisation and the word boundaries conflict with the annotators’ desired actions, then either the annotation is inaccurate or expensive retokenisation and reannotation will be required. Here we describe the methods we have developed to address this problem. We also describe experiments which explore the effects of different granularities of tokenisation on NER tagger performance.
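
The boundary conflict described above can be illustrated with a minimal Python sketch. It is not the tool or the method from the paper: the token ids, the two regular-expression tokenisers, the entity type and the example string are all assumptions chosen for illustration. The sketch snaps a character selection to whole tokens, stores the annotation stand-off as pointers to token ids, and reports a conflict when the desired span cuts through a token.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    id: str     # hypothetical token identifier, e.g. "w0"
    start: int  # character offset of the first character
    end: int    # character offset one past the last character


def tokenise(text, pattern):
    """Build tokens from a regular expression that matches one token at a time."""
    return [Token(f"w{i}", m.start(), m.end())
            for i, m in enumerate(re.finditer(pattern, text))]


def snap_to_tokens(tokens, sel_start, sel_end):
    """Return every token that overlaps the character selection."""
    return [t for t in tokens if t.start < sel_end and sel_start < t.end]


def conflicts(tokens, sel_start, sel_end):
    """True if the selection does not begin and end exactly on token boundaries."""
    covered = snap_to_tokens(tokens, sel_start, sel_end)
    if not covered:
        return True
    return covered[0].start != sel_start or covered[-1].end != sel_end


# Hypothetical example: the annotator wants to mark only "p53".
text = "the anti-p53 antibody"
want = (text.index("p53"), text.index("p53") + len("p53"))

coarse = tokenise(text, r"\S+")        # assumed coarse scheme: whitespace only
fine = tokenise(text, r"\w+|[^\w\s]")  # assumed finer scheme: split at punctuation

coarse_hit = snap_to_tokens(coarse, *want)  # snaps to the whole of "anti-p53"
fine_hit = snap_to_tokens(fine, *want)      # snaps to "p53" alone

# Stand-off annotation: the entity points at token ids, not at the text itself.
entity = {"type": "protein", "tokens": [t.id for t in fine_hit]}

print(conflicts(coarse, *want), conflicts(fine, *want), entity)
```

Under the coarse scheme the selection can only snap to the whole token, so the stored annotation would be wider than the annotator intended. Under the finer scheme the same selection aligns with a single token and no conflict arises.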
Field | Value |
---|---|
Original language | English |
Title of host publication | Proceedings of the 5th Workshop on NLP and XML: Multi-Dimensional Markup in Natural Language Processing |
Publisher | Association for Computational Linguistics |
Pages | 19-26 |
Number of pages | 8 |
Publication status | Published - 2006 |