Abstract / Description of output
The High Performance Language Technologies (HPLT) project is a 3-year EU-funded project that started in September 2022. It aims to deliver free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing. We describe the first results of the project. The data release includes monolingual data in 75 languages at 5.6T tokens and parallel data in 18 language pairs at 96M pairs, derived from 1.8 petabytes of web crawls. Building upon automated and transparent pipelines, the first machine translation (MT) models as well as large language models (LLMs) have been trained and released. Multiple data processing tools and pipelines have also been made public.
Original language | English |
---|---|
Title of host publication | Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2) |
Editors | Carolina Scarton, Charlotte Prescott, Chris Bayliss, Chris Oakley, Joanna Wright, Stuart Wrigley, Xingyi Song, Edward Gow-Smith, Mikel Forcada, Helena Moniz |
Publisher | European Association for Machine Translation (EAMT) |
Pages | 53-54 |
Number of pages | 2 |
Volume | 2 |
ISBN (Electronic) | 9781068690716 |
Publication status | Published - 27 Jun 2024 |
Event | The 25th Annual Conference of The European Association for Machine Translation, University of Sheffield, Sheffield, United Kingdom; Duration: 24 Jun 2024 → 27 Jun 2024; Conference number: 25; https://eamt2024.sheffield.ac.uk/ |
Conference
Conference | The 25th Annual Conference of The European Association for Machine Translation |
---|---|
Abbreviated title | EAMT 2024 |
Country/Territory | United Kingdom |
City | Sheffield |
Period | 24/06/24 → 27/06/24 |
Internet address | https://eamt2024.sheffield.ac.uk/ |