Abstract
Previous work demonstrated that web counts can be used to approximate bigram frequencies, and thus should be useful for a wide variety of NLP tasks. So far, only two generation tasks (candidate selection for machine translation and confusion-set disambiguation) have been tested using web-scale data sets. The present paper investigates if these results generalize to tasks covering both syntax and semantics, both generation and analysis, and a larger range of n-grams. For the majority of tasks, we find that simple, unsupervised models perform better when n-gram frequencies are obtained from the web rather than from a large corpus. However, in most cases, web-based models fail to outperform more sophisticated state-of-the-art models trained on small corpora. We argue that web-based models should therefore be used as a baseline for, rather than an alternative to, standard models.
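To illustrate the kind of simple, unsupervised model the abstract refers to, the sketch below ranks candidate strings by their bigram frequencies, as in confusion-set disambiguation. This is not the paper's implementation; all names (`toy_corpus`, `select_candidate`, etc.) are hypothetical, and the toy corpus merely stands in for whichever source of counts is used, whether a large corpus or web hit counts.

```python
from collections import Counter

# Hypothetical illustration: a frequency-based model that picks the most
# plausible candidate string. The counts could equally come from web queries.
toy_corpus = "i will accept the offer . she will accept the job .".split()

def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a token list."""
    return Counter(zip(tokens, tokens[1:]))

COUNTS = bigram_counts(toy_corpus)

def score(phrase):
    """Score a candidate by summing its bigram counts (a crude frequency model)."""
    tokens = phrase.lower().split()
    return sum(COUNTS[bg] for bg in zip(tokens, tokens[1:]))

def select_candidate(candidates):
    """Pick the highest-scoring candidate, e.g. for confusion-set disambiguation."""
    return max(candidates, key=score)

print(select_candidate(["I will accept the offer", "I will except the offer"]))
# -> "I will accept the offer"
```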
Original language | English |
---|---|
Title of host publication | Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004 |
Publisher | Association for Computational Linguistics |
Pages | 121-128 |
Number of pages | 8 |
Publication status | Published - 2004 |
Event | HLT-NAACL 2004 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics - Boston, MA, United States |
Duration | 2 May 2004 → 7 May 2004 |
Conference
Conference | HLT-NAACL 2004 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics |
---|---|
Country/Territory | United States |
City | Boston, MA |
Period | 2 May 2004 → 7 May 2004 |