ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz-Rojas, Leopoldo Pla, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, Jaume Zaragoza

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

Abstract

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.
Original language: English
Title of host publication: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Publisher: Association for Computational Linguistics (ACL)
Pages: 4555–4567
Number of pages: 13
ISBN (Electronic): 978-1-952148-25-5
Publication status: Published - 10 Jul 2020
Event: 2020 Annual Conference of the Association for Computational Linguistics - Hyatt Regency Seattle, Virtual conference, United States
Duration: 5 Jul 2020 – 10 Jul 2020
Conference number: 58
https://acl2020.org/

Conference

Conference: 2020 Annual Conference of the Association for Computational Linguistics
Abbreviated title: ACL 2020
Country/Territory: United States
City: Virtual conference
Period: 5/07/20 – 10/07/20
