Edinburgh Research Explorer

ParaCrawl: Web-Scale Acquisition of Parallel Corpora

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Open Access permissions: Open

Documents

  • Accepted author manuscript, 491 KB, PDF document

    Licence: Creative Commons: Attribution (CC-BY)

  • Final published version, 604 KB, PDF document

    Licence: Creative Commons: Attribution (CC-BY)

https://www.aclweb.org/anthology/2020.acl-main.417
Original language: English
Title of host publication: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Publisher: Association for Computational Linguistics (ACL)
Pages: 4555–4567
Number of pages: 13
ISBN (Electronic): 978-1-952148-25-5
Publication status: Published - 10 Jul 2020
Event: 2020 Annual Conference of the Association for Computational Linguistics - Hyatt Regency Seattle, Virtual conference, United States
Duration: 5 Jul 2020 to 10 Jul 2020
Conference number: 58
https://acl2020.org/

Conference

Conference: 2020 Annual Conference of the Association for Computational Linguistics
Abbreviated title: ACL 2020
Country: United States
City: Virtual conference
Period: 5/07/20 to 10/07/20

Abstract

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.

ID: 146714196