Abstract / Description of output
Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abilities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on non-English corpora.
Original language | English |
---|---|
Title of host publication | Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP 2010, 9-11 October 2010, MIT Stata Center, Massachusetts, USA, A meeting of SIGDAT, a Special Interest Group of the ACL |
Publisher | Association for Computational Linguistics |
Pages | 575-584 |
Number of pages | 10 |
Publication status | Published - 2010 |
Title | Two Decades of Unsupervised POS Induction: How Far Have We Come? |