HPLT's first release of data and models

Nikolay Arefyev, Mikko Aulamo, Pinzhen Chen, Ona De Gibert Bonet, Barry Haddow, Jindřich Helcl, Bhavitvya Malik, Gema Ramirez-Sánchez, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Jaume Zaragoza-Bernabeu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

The High Performance Language Technologies (HPLT) project is a 3-year EU-funded project that started in September 2022. It aims to deliver free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing. We describe the first results of the project. The data release includes monolingual data in 75 languages at 5.6T tokens and parallel data in 18 language pairs at 96M pairs, derived from 1.8 petabytes of web crawls. Building upon automated and transparent pipelines, the first machine translation (MT) models as well as large language models (LLMs) have been trained and released. Multiple data processing tools and pipelines have also been made public.
Original language: English
Title of host publication: Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)
Editors: Carolina Scarton, Charlotte Prescott, Chris Bayliss, Chris Oakley, Joanna Wright, Stuart Wrigley, Xingyi Song, Edward Gow-Smith, Mikel Forcada, Helena Moniz
Publisher: European Association for Machine Translation (EAMT)
Pages: 53-54
Number of pages: 2
Volume: 2
ISBN (Electronic): 9781068690716
Publication status: Published - 27 Jun 2024
Event: The 25th Annual Conference of The European Association for Machine Translation - University of Sheffield, Sheffield, United Kingdom
Duration: 24 Jun 2024 to 27 Jun 2024
Conference number: 25
https://eamt2024.sheffield.ac.uk/

Conference

Conference: The 25th Annual Conference of The European Association for Machine Translation
Abbreviated title: EAMT 2024
Country/Territory: United Kingdom
City: Sheffield
Period: 24/06/24 to 27/06/24
