Abstract / Description of output
We consider the problem of part-of-speech tagging for informal, online conversational text. We systematically evaluate the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy. With these features, our system achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks; Twitter tagging is improved from 90% to 93% accuracy (more than 3% absolute). Qualitative analysis of these word clusters yields insights about NLP and linguistic phenomena in this genre. Additionally, we contribute the first POS annotation guidelines for such text and release a new dataset of English-language tweets annotated using these guidelines. Tagging software, annotation guidelines, and large-scale word clusters are available at: http://www.ark.cs.cmu.edu/TweetNLP

This paper describes release 0.3 of the “CMU Twitter Part-of-Speech Tagger” and annotated data.
Original language | English |
---|---|
Title of host publication | Human Language Technologies: Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA |
Publisher | Association for Computational Linguistics |
Pages | 380-390 |
Number of pages | 11 |
Publication status | Published - 2013 |