The Web as a Baseline: Evaluating the Performance of Unsupervised Web-based Models for a Range of NLP Tasks

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Previous work demonstrated that web counts can be used to approximate bigram frequencies, and thus should be useful for a wide variety of NLP tasks. So far, only two generation tasks (candidate selection for machine translation and confusion-set disambiguation) have been tested using web-scale data sets. The present paper investigates whether these results generalize to tasks covering both syntax and semantics, both generation and analysis, and a larger range of n-grams. For the majority of tasks, we find that simple, unsupervised models perform better when n-gram frequencies are obtained from the web rather than from a large corpus. However, in most cases, web-based models fail to outperform more sophisticated state-of-the-art models trained on small corpora. We argue that web-based models should therefore be used as a baseline for, rather than an alternative to, standard models.
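To make the approach concrete, the sketch below illustrates one of the tasks the abstract mentions, confusion-set disambiguation, using bigram counts in the way simple unsupervised web-based models do: pick the candidate whose surrounding n-grams are most frequent. This is a minimal illustration under stated assumptions, not the paper's implementation; the `WEB_BIGRAM_COUNTS` table and `web_count` helper are hypothetical stand-ins for counts that would, in the paper's setting, come from web search hits.

```python
# A minimal sketch of confusion-set disambiguation driven by bigram
# counts, in the spirit of the unsupervised web-based models the paper
# evaluates. The counts below are invented stand-ins: real web counts
# would come from search engine queries, not a hard-coded table.

# Toy stand-in for web-derived bigram counts (hypothetical figures).
WEB_BIGRAM_COUNTS: dict[tuple[str, str], int] = {
    ("over", "their"): 300_000,
    ("over", "there"): 2_500_000,
    ("their", "house"): 1_200_000,
    ("there", "house"): 40_000,
}


def web_count(bigram: tuple[str, str]) -> int:
    """Return the stand-in web count for a bigram (0 if unseen)."""
    return WEB_BIGRAM_COUNTS.get(bigram, 0)


def disambiguate(left: str, right: str, confusion_set: list[str]) -> str:
    """Pick the candidate whose surrounding bigrams are most frequent.

    Each candidate is scored by the product of the counts of the bigrams
    it forms with its left and right neighbours; add-one smoothing keeps
    unseen bigrams from zeroing out a score.
    """
    def score(candidate: str) -> int:
        return (web_count((left, candidate)) + 1) * \
               (web_count((candidate, right)) + 1)

    return max(confusion_set, key=score)


if __name__ == "__main__":
    # Fill the blank in "over ___ house" from the {their, there} set.
    print(disambiguate("over", "house", ["their", "there"]))  # -> "their"
```

The paper's finding is that such web-scaled counts usually beat the same model fed counts from a fixed corpus, but not sophisticated supervised models, hence the recommendation to treat them as a baseline.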
Original language: English
Title of host publication: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004
Publisher: Association for Computational Linguistics
Pages: 121-128
Number of pages: 8
Publication status: Published - 2004
Event: HLT-NAACL 2004 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics - Boston, MA, United States
Duration: 2 May 2004 – 7 May 2004

Conference

Conference: HLT-NAACL 2004 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics
Country/Territory: United States
City: Boston, MA
Period: 2/05/04 – 7/05/04
