Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, Noah A. Smith

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

We consider the problem of part-of-speech tagging for informal, online conversational text. We systematically evaluate the use of large-scale unsupervised word clustering and new lexical features to improve tagging accuracy. With these features, our system achieves state-of-the-art tagging results on both Twitter and IRC POS tagging tasks; Twitter tagging is improved from 90% to 93% accuracy (more than 3% absolute). Qualitative analysis of these word clusters yields insights about NLP and linguistic phenomena in this genre. Additionally, we contribute the first POS annotation guidelines for such text and release a new dataset of English-language tweets annotated using these guidelines. Tagging software, annotation guidelines, and large-scale word clusters are available at http://www.ark.cs.cmu.edu/TweetNLP. This paper describes release 0.3 of the “CMU Twitter Part-of-Speech Tagger” and annotated data.
Original language: English
Title of host publication: Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA
Publisher: Association for Computational Linguistics
Pages: 380-390
Number of pages: 11
Publication status: Published - 2013
