Abstract
We present a novel representation, evaluation
measure, and supervised models for the task of
identifying the multiword expressions (MWEs)
in a sentence, resulting in a lexical semantic
segmentation. Our approach generalizes
a standard chunking representation to encode
MWEs containing gaps, thereby enabling efficient sequence tagging algorithms for feature-rich
discriminative models. Experiments on a
new dataset of English web text offer the first
linguistically-driven evaluation of MWE identification
with truly heterogeneous expression
types. Our statistical sequence model greatly
outperforms a lookup-based segmentation procedure,
achieving nearly 60% F1 for MWE
identification.
Original language | English |
---|---|
Pages (from-to) | 193-206 |
Number of pages | 14 |
Journal | Transactions of the Association for Computational Linguistics |
Volume | 2 |
Publication status | Published - 1 Apr 2014 |