Document-level machine translation with large-scale public parallel corpora

Proyag Pal, Alexandra Birch-Mayne, Kenneth Heafield

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

Despite the fact that document-level machine translation has inherent advantages over sentence-level machine translation due to additional information available to a model from document context, most translation systems continue to operate at a sentence level. This is primarily due to the severe lack of publicly available large-scale parallel corpora at the document level. We release a large-scale open parallel corpus with document context extracted from ParaCrawl in five language pairs, along with code to compile document-level datasets for any language pair supported by ParaCrawl. We train context-aware models on these datasets and find improvements in terms of overall translation quality and targeted document-level phenomena. We also analyse how much long-range information is useful to model some of these discourse phenomena and find models are able to utilise context from several preceding sentences.
Original language: English
Title of host publication: The 62nd Annual Meeting of the Association for Computational Linguistics
Publisher: Association for Computational Linguistics
Pages: 13185–13197
Number of pages: 13
ISBN (Electronic): 9798891760943
Publication status: Published - 16 Aug 2024
Event: The 62nd Annual Meeting of the Association for Computational Linguistics - Centara Grand and Bangkok Convention Centre at CentralWorld, Bangkok, Thailand
Duration: 11 Aug 2024 to 16 Aug 2024
Conference number: 62
https://2024.aclweb.org/

Publication series

Name: Annual Meeting of the Association for Computational Linguistics
Publisher: Association for Computational Linguistics
ISSN (Electronic): 0736-587X

Conference

Conference: The 62nd Annual Meeting of the Association for Computational Linguistics
Abbreviated title: ACL 2024
Country/Territory: Thailand
City: Bangkok
Period: 11/08/24 to 16/08/24
