Abstract
We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.
Original language | English |
---|---|
Title of host publication | Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 4555–4567 |
Number of pages | 13 |
ISBN (Electronic) | 978-1-952148-25-5 |
Publication status | Published - 10 Jul 2020 |
Event | 2020 Annual Conference of the Association for Computational Linguistics - Hyatt Regency Seattle, Virtual conference, United States. Duration: 5 Jul 2020 → 10 Jul 2020. Conference number: 58. https://acl2020.org/ |
Conference
Conference | 2020 Annual Conference of the Association for Computational Linguistics |
---|---|
Abbreviated title | ACL 2020 |
Country/Territory | United States |
City | Virtual conference |
Period | 5/07/20 → 10/07/20 |
Internet address | https://acl2020.org/ |
Fingerprint
Dive into the research topics of 'ParaCrawl: Web-Scale Acquisition of Parallel Corpora'. Together they form a unique fingerprint.
Projects
- 1 Finished
- Paracrawl 3: Continued Web-Scale Provision of Parallel Corpora for European Languages
  Koehn, P. (Principal Investigator), Heafield, K. (Co-investigator) & Waites, W. (Researcher)
  1/10/19 → 30/09/21
  Project: Research