Abstract / Description of output
Although document-level machine translation has inherent advantages over sentence-level machine translation, owing to the additional information that document context makes available to a model, most translation systems continue to operate at the sentence level. This is primarily due to the severe lack of publicly available large-scale parallel corpora at the document level. We release a large-scale open parallel corpus with document context, extracted from ParaCrawl in five language pairs, along with code to compile document-level datasets for any language pair supported by ParaCrawl. We train context-aware models on these datasets and find improvements in both overall translation quality and targeted document-level phenomena. We also analyse how much long-range information is useful for modelling some of these discourse phenomena and find that models are able to utilise context from several preceding sentences.
Original language | English |
---|---|
Title of host publication | The 62nd Annual Meeting of the Association for Computational Linguistics |
Publisher | Association for Computational Linguistics |
Pages | 13185–13197 |
Number of pages | 13 |
ISBN (Electronic) | 9798891760943 |
Publication status | Published - 16 Aug 2024 |
Event | The 62nd Annual Meeting of the Association for Computational Linguistics - Centara Grand and Bangkok Convention Centre at CentralWorld, Bangkok, Thailand; Duration: 11 Aug 2024 → 16 Aug 2024; Conference number: 62; https://2024.aclweb.org/ |
Publication series
Name | Annual Meeting of the Association for Computational Linguistics |
---|---|
Publisher | Association for Computational Linguistics |
ISSN (Electronic) | 0736-587X |
Conference
Conference | The 62nd Annual Meeting of the Association for Computational Linguistics |
---|---|
Abbreviated title | ACL 2024 |
Country/Territory | Thailand |
City | Bangkok |
Period | 11/08/24 → 16/08/24 |
Internet address |