Parallel Audiobook Corpus

  • Manuel Ribeiro (Creator)

Dataset

Abstract

The Parallel Audiobook Corpus (version 1.0) is a collection of parallel readings of audiobooks. The corpus consists of approximately 121 hours of speech at 22.05 kHz across 4 books and 59 speakers. The data is provided in two formats. Chapter data contains the audiobook recording at the chapter level. Each chapter-level waveform is accompanied by the text and its respective word-level alignment. This format can be used if you are looking for a segmentation that does not correspond to utterance-level units. Segmented data provides a more traditional format for the corpus. The chapter-level alignment was segmented into utterances with waveforms organized by speaker. Note that, within each book, utterance identifiers are consistent across speakers, making it simple to find parallel data.
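
Because utterance identifiers are shared across speakers within a book, parallel readings can be collected by grouping files on their identifier. The following Python sketch illustrates the idea. The directory layout it assumes (one folder per speaker inside a book folder, with each waveform named after its utterance identifier) is hypothetical and is not taken from the corpus documentation, as are the function and parameter names, so adjust the glob pattern to the actual structure of the download.

```python
from collections import defaultdict
from pathlib import Path


def find_parallel_utterances(book_dir, min_speakers=2):
    """Group waveforms by utterance identifier across speakers.

    ASSUMPTION (hypothetical layout, not confirmed by the corpus docs):
        <book_dir>/<speaker>/<utterance_id>.wav
    Returns {utterance_id: {speaker: path}} for every identifier that
    was read by at least `min_speakers` speakers.
    """
    by_utterance = defaultdict(dict)
    for wav_path in sorted(Path(book_dir).glob("*/*.wav")):
        speaker = wav_path.parent.name  # folder name taken as the speaker ID
        by_utterance[wav_path.stem][speaker] = wav_path
    return {
        utt_id: speakers
        for utt_id, speakers in by_utterance.items()
        if len(speakers) >= min_speakers
    }
```

Calling `find_parallel_utterances("segmented/some_book")` (a placeholder path) would return only those utterances available from two or more speakers, each mapped to its per-speaker waveform paths.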

Data Citation

Ribeiro, Manuel Sam. (2018). Parallel Audiobook Corpus, [dataset]. University of Edinburgh. School of Informatics. http://dx.doi.org/10.7488/ds/2468
Date made available: 12 Nov 2018
Publisher: Edinburgh DataShare
