REYD Yiddish TTS Corpus

  • Isaac Bleaman (Creator)
  • Samuel Lo (Creator)
  • Simon King (Creator)
  • Jacob Webber (Creator)

Dataset

Abstract

* The Reading Electronic Yiddish Documents (REYD) Dataset. The REYD TTS dataset is a speech dataset for Yiddish consisting of 4,892 short audio clips, with a total duration of 475.7 minutes. The recordings feature three speakers, two of whom speak the Lithuanian Yiddish dialect and one of whom speaks the Polish Yiddish dialect. The source texts are in standard literary Yiddish and are mostly works of fiction from the late 19th and early 20th centuries. Audio was recorded at the Montreal Jewish Public Library and the University of Haifa. All source texts and audio are public domain. The surviving relatives of the three readers have granted permission for this work to be made public. This work has been used to train a TTS system. For an interactive demo and other information, please see our GitHub project page at https://github.com/REYD-TTS. A paper describing the work of assembling this dataset has been submitted for publication and will be linked from the project page if accepted.

* Creation. The dataset was prepared using hand-corrected text that was segmented automatically. The results were checked for accuracy. The code used for creating the dataset is available, along with manually corrected source texts, in this repository: https://github.com/REYD-TTS/yiddish-tts-texts.

* License. If your use of this work results in a publication, we request that you cite the paper listed at https://github.com/REYD-TTS.

Data Citation

Webber, Jacob; Bleaman, Isaac; Lo, Samuel; King, Simon. (2022). REYD Yiddish TTS Corpus, [dataset]. Centre for Speech Technology Research (CSTR). https://doi.org/10.7488/ds/3424.
Date made available: 2 Apr 2022
Publisher: Edinburgh DataShare
