Description
AltGen: 1.3M Plausible Alternatives From Neural Text Generators

The AltGen dataset contains 1.3 million English texts generated by neural language generators conditioned on contexts from three corpora of acceptability judgements and two corpora of reading times. For each corpus, each text generator, and each sampling algorithm, 100 generations are sampled, for a total of 1,257,300 generations.

Details about the language generators and the corpora are presented in a paper published at EMNLP 2023 (in particular, Section 4). Please cite this paper if you use any version of the dataset in your work:

Mario Giulianelli, Sarenne Wallbridge, and Raquel Fernández. 2023. Information Value: Measuring Utterance Predictability as Distance from Plausible Alternatives. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

The files are in jsonl format. Each entry includes a context_id field, which allows retrieving the relevant entry from the original corpus, and an alternatives field, which contains the language model generations. Please note that the alternatives are not post-processed (see the code and footnote 2 in the paper for further details).

Filenames are built as follows: DecodingAlgorithm_DecodingParameter-nNumAlternatives-maxlen_MaxGenerationLength-sep_Separator.jsonl.
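As a rough illustration of the file layout and naming scheme described above, the following Python sketch reads a jsonl file line by line and splits a filename into its components. The function names, the regular expression, and the example filename are assumptions made for illustration and are not part of the dataset or its accompanying code. In particular, the sketch assumes that NumAlternatives and MaxGenerationLength are integers.

```python
import json
import re
from typing import Dict, Iterator

# Assumed pattern, derived from the naming scheme
# DecodingAlgorithm_DecodingParameter-nNumAlternatives-maxlen_MaxGenerationLength-sep_Separator.jsonl
# Real filenames may contain characters this regex does not anticipate.
FILENAME_RE = re.compile(
    r"^(?P<algorithm>[^_]+)_(?P<parameter>.+?)"
    r"-n(?P<num_alternatives>\d+)"
    r"-maxlen_(?P<max_length>\d+)"
    r"-sep_(?P<separator>.+)\.jsonl$"
)


def parse_filename(name: str) -> Dict[str, str]:
    """Split a filename into its components; raise ValueError on mismatch."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name}")
    return match.groupdict()


def iter_records(path: str) -> Iterator[dict]:
    """Yield one JSON object per non-empty line of a jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical filename, used only to show the parsed components.
print(parse_filename("nucleus_0.9-n100-maxlen_50-sep_eos.jsonl"))
```

Each record yielded by iter_records should expose the context_id and alternatives fields mentioned above. The exact type of the alternatives value is not stated here, so it is worth inspecting one record before relying on a particular structure.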
Data Citation
Giulianelli, M., Wallbridge, S., & Fernández, R. (2023). AltGen: 1.3M Plausible Alternatives From Neural Text Generators [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10006413
| Field | Value |
|---|---|
| Date made available | 20 Oct 2023 |
| Publisher | Zenodo |