Using the Web to Overcome Data Sparseness

Frank Keller, Maria Lapata, Olga Ourioupina

Research output: Chapter in Book/Report/Conference proceeding (Chapter)

Abstract

This paper shows that the web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the web by querying a search engine. We evaluate this method by demonstrating that web frequencies correlate with frequencies obtained from a carefully edited, balanced corpus. We also perform a task-based evaluation, showing that web frequencies can reliably predict human plausibility judgments.
Original language: English
Title of host publication: EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing
Place of Publication: Stroudsburg, PA
Publisher: Association for Computational Linguistics
Pages: 230-237
Number of pages: 8
Volume: 10
Publication status: Published - 2002
Event: 7th Conference on Empirical Methods in Natural Language Processing (EMNLP 2002) - University of Pennsylvania, Philadelphia, PA, United States
Duration: 6 Jul 2002 to 7 Jul 2002

Conference

Conference: 7th Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)
Country/Territory: United States
City: Philadelphia, PA
Period: 6/07/02 to 7/07/02
