AltGen: 1.3M Plausible Alternatives From Neural Text Generators

  • Mario Giulianelli (Creator)
  • Sarenne Wallbridge (Creator)
  • Raquel Fernández (Creator)

Dataset

Description

The AltGen dataset contains 1.3 million English texts generated by neural language generators conditioned on contexts from three corpora of acceptability judgements and two corpora of reading times. For each corpus, each text generator, and each sampling algorithm, 100 generations are sampled, for a total of 1,257,300 generations. Details about the language generators and the corpora are presented in a paper published at EMNLP 2023 (in particular, Section 4).

Please cite this paper if you use any version of the dataset in your work:

Mario Giulianelli, Sarenne Wallbridge, and Raquel Fernández. 2023. Information Value: Measuring Utterance Predictability as Distance from Plausible Alternatives. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

The files are in jsonl format. Each entry includes a context_id field, which allows retrieving the relevant entry from the original corpus, and an alternatives field, which contains the language model generations. Please note that the alternatives are not post-processed (see code and footnote 2 in the paper for further details).

Filenames are built as follows:

DecodingAlgorithm_DecodingParameter-nNumAlternatives-maxlen_MaxGenerationLength-sep_Separator.jsonl
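As an illustration of the file layout described above, the following is a minimal Python sketch for reading one file and splitting its name into components. Only the context_id and alternatives field names and the filename scheme come from the description. The function names, the regular expression, and the example filename are hypothetical. The sketch also assumes that each line is a single JSON object, that context_id values are hashable (for example strings or integers), and that each context_id appears at most once per file. Check these against the actual files before relying on it.

```python
import json
import re
from pathlib import Path

# Assumed pattern, derived from the naming scheme in the description:
# DecodingAlgorithm_DecodingParameter-nNumAlternatives-maxlen_MaxGenerationLength-sep_Separator.jsonl
FILENAME_RE = re.compile(
    r"^(?P<algorithm>[^_]+)_(?P<parameter>.+?)"
    r"-n(?P<num_alternatives>\d+)"
    r"-maxlen_(?P<max_length>\d+)"
    r"-sep_(?P<separator>.+)\.jsonl$"
)


def parse_filename(name):
    """Split an AltGen-style filename into its parts (all values are strings).

    Returns None if the name does not match the assumed pattern.
    """
    match = FILENAME_RE.match(Path(name).name)
    return match.groupdict() if match else None


def load_alternatives(path):
    """Read one jsonl file and map each context_id to its alternatives."""
    by_context = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumption: one record per context_id; later duplicates overwrite.
            by_context[record["context_id"]] = record["alternatives"]
    return by_context
```

For a hypothetical file named nucleus_0.9-n100-maxlen_50-sep_none.jsonl, parse_filename would return the algorithm "nucleus", the parameter "0.9", and so on, all as strings. Because the alternatives are not post-processed, any cleaning (for example truncating at a separator) is left to the caller.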

Data Citation

Giulianelli, M., Wallbridge, S., & Fernández, R. (2023). AltGen: 1.3M Plausible Alternatives From Neural Text Generators [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10006413
Date made available: 20 Oct 2023
Publisher: Zenodo
