Lightly supervised GMM VAD to use audiobook for speech synthesiser

Y. Mamiya, J. Yamagishi, Oliver Watts, R.A.J. Clark, S. King, A. Stan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

Audiobooks have attracted attention as promising data for training Text-to-Speech (TTS) systems. However, they usually come without an alignment between the audio and the text, and they are usually divided only into chapter-sized units. In practice, the audio and text must be aligned before they can be used to build TTS synthesisers. However, aligning audio and text data is time-consuming, involves manual labour, and requires people skilled in speech processing. We previously proposed using graphemes to align speech and text data automatically. This paper further integrates a lightly supervised voice activity detection (VAD) technique that detects sentence boundaries as a pre-processing step before the grapheme approach. The lightly supervised technique requires time stamps of speech and silence for only the first fifty sentences. Combining the two, we can semi-automatically build TTS systems from audiobooks with minimal manual intervention. Using subjective evaluations, we analyse how the grapheme-based aligner and/or the proposed VAD technique affect the quality of HMM-based speech synthesisers trained on audiobooks.
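The lightly supervised idea in the abstract can be illustrated with a minimal sketch: fit one Gaussian mixture to frames hand-labelled as speech and another to frames labelled as silence (for example, frames from the first fifty sentences), classify the remaining frames by comparing likelihoods, and treat long silent runs as candidate sentence boundaries. This is not the authors' implementation. The 1-D log-energy feature, the two-component mixtures, the plain likelihood comparison, the pause-length threshold, and every function name below are assumptions made for illustration.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def gmm_loglik(x, comps):
    """Log-likelihood of x under a mixture given as (weight, mean, var) tuples."""
    return logsumexp([math.log(w) + log_gauss(x, m, v) for w, m, v in comps])


def fit_gmm_1d(data, k=2, iters=30, var_floor=1e-3):
    """Fit a k-component 1-D GMM with EM (hypothetical helper)."""
    srt = sorted(data)
    n = len(srt)
    gmean = sum(srt) / n
    gvar = max(sum((x - gmean) ** 2 for x in srt) / n, var_floor)
    # Initialise means at evenly spaced quantiles with a shared variance.
    comps = [(1.0 / k, srt[int((i + 0.5) * n / k)], gvar) for i in range(k)]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each frame.
        resp = []
        for x in data:
            logs = [math.log(w) + log_gauss(x, m, v) for w, m, v in comps]
            z = logsumexp(logs)
            resp.append([math.exp(l - z) for l in logs])
        # M-step: re-estimate weights, means, and floored variances.
        new_comps = []
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-8  # avoid a zero weight
            mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
            new_comps.append((nj / n, mean, max(var, var_floor)))
        comps = new_comps
    return comps


def train_vad(frames, labels, k=2):
    """Fit speech and silence GMMs from labelled frames (1 = speech, 0 = silence)."""
    speech = [f for f, lab in zip(frames, labels) if lab == 1]
    silence = [f for f, lab in zip(frames, labels) if lab == 0]
    return fit_gmm_1d(speech, k), fit_gmm_1d(silence, k)


def classify(frames, speech_gmm, silence_gmm):
    """Label each frame 1 (speech) or 0 (silence) by comparing log-likelihoods."""
    return [1 if gmm_loglik(x, speech_gmm) > gmm_loglik(x, silence_gmm) else 0
            for x in frames]


def long_pauses(decisions, min_frames):
    """Return (start, end) frame indices, end exclusive, of silent runs
    that last at least min_frames frames."""
    pauses, start = [], None
    for i, d in enumerate(list(decisions) + [1]):  # sentinel closes a trailing run
        if d == 0 and start is None:
            start = i
        elif d == 1 and start is not None:
            if i - start >= min_frames:
                pauses.append((start, i))
            start = None
    return pauses
```

In use, `train_vad` would receive the frames covered by the manual time stamps, `classify` would label the rest of the chapter, and `long_pauses` would propose boundaries; with 10 ms frames, for instance, `min_frames=30` would correspond to a 0.3 s pause, a value chosen here purely as an example.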
Original language: English
Title of host publication: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 5
Publication status: Published - 2013

Keywords / Materials (for Non-textual outputs)

  • hidden Markov models
  • signal detection
  • speech synthesis
  • HMM-based speech synthesisers
  • VAD technique
  • audiobook
  • grapheme-based aligner approach
  • lightly supervised GMM VAD
  • lightly supervised voice activity detection technique
  • minimum manual intervention
  • semiautomatically build TTS systems
  • sentence boundary detection
  • speech processing
  • text data
  • text-to-speech system training
  • Buildings
  • Hidden Markov models
  • Manuals
  • Speech
  • Speech synthesis
  • Synthesizers
  • HMM-based speech synthesis
  • lightly supervised
  • voice activity detection

