Using paraphrases for improving first story detection in news and Twitter

Sasa Petrovic, Miles Osborne, Victor Lavrenko

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


First story detection (FSD) involves identifying the first stories about events in a continuous stream of documents. A major problem in this task is the high degree of lexical variation in documents, which makes it very difficult to detect stories that describe the same event in different words. We suggest using paraphrases to alleviate this problem, making this the first work to use paraphrases for FSD. We show a novel way of integrating paraphrases with locality-sensitive hashing (LSH) to obtain an efficient FSD system that can scale to very large datasets. Our system achieves state-of-the-art results on the first story detection task, beating both the best supervised and unsupervised systems. To test our approach on large data, we construct a corpus of events for Twitter, consisting of 50 million documents, and show that paraphrasing is also beneficial in this domain.
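
To make the role of paraphrases concrete, the following is a minimal sketch and not the system described in the abstract. It assumes a bag-of-words representation, a cosine-based novelty score, and a tiny hand-written paraphrase table (`PARAPHRASES`, purely hypothetical). It also replaces LSH with an exhaustive comparison against all earlier documents, which is only workable for toy streams. The function names and the 0.6 threshold are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy paraphrase table (an assumption for illustration only);
# a real system would draw on a large paraphrase resource.
PARAPHRASES = {
    "quake": ["earthquake"],
    "earthquake": ["quake"],
    "hits": ["strikes"],
    "strikes": ["hits"],
}


def term_vector(text, paraphrases=None):
    """Bag-of-words counts, optionally expanded with paraphrases of each term."""
    counts = Counter(text.lower().split())
    if not paraphrases:
        return counts
    expanded = Counter(counts)
    for term, count in counts.items():
        for alternative in paraphrases.get(term, []):
            expanded[alternative] += count
    return expanded


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_first_stories(stream, threshold=0.6, paraphrases=None):
    """Return indices of documents whose novelty score exceeds the threshold.

    Novelty is 1 minus the similarity to the nearest earlier document.
    The nearest neighbour is found by exhaustive search here, standing in
    for the approximate LSH lookup a scalable system would use.
    """
    seen, first_stories = [], []
    for index, text in enumerate(stream):
        vector = term_vector(text, paraphrases)
        nearest = max((cosine(vector, old) for old in seen), default=0.0)
        if 1.0 - nearest > threshold:
            first_stories.append(index)
        seen.append(vector)
    return first_stories


stream = [
    "earthquake hits city",
    "quake strikes city",
    "election results announced",
]

without_paraphrases = detect_first_stories(stream)
with_paraphrases = detect_first_stories(stream, paraphrases=PARAPHRASES)
```

In this toy stream, the second document shares only the word "city" with the first, so without paraphrases it is flagged as a new story. After expansion, both documents map to the same term vector and the second is no longer treated as novel.
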
Original language: English
Title of host publication: Human Language Technologies: Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings
Subtitle of host publication: June 3-8, 2012, Montréal, Canada
Number of pages: 9
Publication status: Published - Jun 2012

