The XWikis Corpus (Perez-Beltrachini and Lapata, 2021) provides datasets covering different language pairs and directions for cross-lingual abstractive document summarisation. The current version includes four languages: English, German, French, and Czech. The dataset is derived from Wikipedia and rests on the observation that, for a given Wikipedia title, the lead section gives an overview conveying salient information, while the body provides the details. The body and the lead paragraph are therefore treated as a document-summary pair. Because a Wikipedia title can be associated with articles in several languages, cross-lingual pairs are derived in two steps: 1) Wikipedia’s Interlanguage Links are used to find titles across languages, and 2) given any two related Wikipedia titles, e.g., Huile d’Olive (French) and Olive Oil (English), the lead paragraph of one title is paired with the body of the other.
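
The pairing step described above can be sketched roughly as follows. This is only an illustration of the idea, not the authors' extraction pipeline: the `Article` class, the `cross_lingual_pairs` function, the output field names, and the placeholder texts are all hypothetical, and the sketch assumes the articles passed in have already been linked to one another through Interlanguage Links.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical container for one Wikipedia article (not part of the corpus release)."""
    lang: str   # language code, e.g. "fr" or "en"
    title: str
    lead: str   # lead section, used as the summary
    body: str   # remaining sections, used as the document


def cross_lingual_pairs(linked_articles):
    """Pair the body of each article with the lead of every linked article
    written in a different language.

    Assumes `linked_articles` all describe the same topic and were matched
    beforehand (e.g. via Interlanguage Links).
    """
    pairs = []
    for src in linked_articles:
        for tgt in linked_articles:
            if src.lang == tgt.lang:
                continue  # same language: not a cross-lingual pair
            pairs.append({
                "src_lang": src.lang,
                "tgt_lang": tgt.lang,
                "document": src.body,  # document in the source language
                "summary": tgt.lead,   # summary in the target language
            })
    return pairs


# Placeholder texts, invented for illustration only.
linked = [
    Article("fr", "Huile d’Olive", "Résumé en français.", "Corps de l'article en français."),
    Article("en", "Olive Oil", "English lead paragraph.", "English article body."),
]
pairs = cross_lingual_pairs(linked)
```

With two linked articles, the sketch yields one pair per direction (French document with English summary, and English document with French summary), which mirrors the "language pairs and directions" mentioned in the description.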

Data Citation

Perez-Beltrachini, Laura; Lapata, Mirella. (2021). XWikis Corpus, 2020 [text]. University of Edinburgh. School of Informatics. ILCC. https://doi.org/10.7488/ds/3259.
Date made available: 10 Dec 2021
Publisher: Edinburgh DataShare
Temporal coverage: 20 Jun 2020 - 20 Jun 2020

