Investigating the robustness of sequence-to-sequence text-to-speech models to imperfectly-transcribed training data

Jason Fong, Pilar Oplustil Gallegos, Zack Hodari, Simon King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Sequence-to-sequence (S2S) text-to-speech (TTS) models can synthesise high-quality speech when large amounts of annotated training data are available. Transcription errors exist in all data and are especially prevalent in found data such as audiobooks. In previous generations of TTS technology, alignment using Hidden Markov Models (HMMs) was widely used to identify and eliminate bad data. In S2S models, attention replaces HMM-based alignment, and there is no explicit mechanism for removing bad data. It is not yet understood how such models deal with transcription errors in the training data. We evaluate the quality of speech from S2S-TTS models trained on data with imperfect transcripts, either simulated using corruption or provided by an Automatic Speech Recogniser (ASR). We find that attention can skip over extraneous words in the input sequence, providing robustness to insertion errors. But substitutions and deletions pose a problem, because there is no ground-truth input available to align to the ground-truth acoustics during teacher-forced training. We conclude that S2S-TTS systems are only partially robust to training on imperfectly-transcribed data, and that further work is needed.
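The simulated corruption described in the abstract — introducing insertion, substitution, and deletion errors into otherwise-correct training transcripts — can be sketched as follows. This is an illustrative sketch only: the function name, per-word error rate, and uniform choice over error types are assumptions, not the paper's actual corruption procedure.

```python
import random

def corrupt_transcript(words, error_rate=0.1, vocab=None, seed=None):
    """Simulate ASR-style transcription errors on a word sequence.

    With probability `error_rate` per word, apply one randomly chosen
    operation: insert an extraneous word, delete the word, or
    substitute it with another word. Hypothetical sketch; the paper's
    exact corruption method may differ.
    """
    rng = random.Random(seed)
    # Draw replacement/inserted words from the transcript's own
    # vocabulary unless one is supplied (sorted for determinism).
    vocab = vocab if vocab is not None else sorted(set(words))
    out = []
    for w in words:
        if rng.random() < error_rate:
            op = rng.choice(["insert", "delete", "substitute"])
            if op == "insert":
                out.append(w)
                out.append(rng.choice(vocab))  # extraneous word added
            elif op == "substitute":
                out.append(rng.choice(vocab))  # word replaced
            # "delete": the word is simply dropped
        else:
            out.append(w)
    return out
```

With `error_rate=0` the transcript passes through unchanged; raising the rate (or restricting to a single error type) would let one probe, as the paper does, which error types the attention mechanism can tolerate during teacher-forced training.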

Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: ISCA
Pages: 1546-1550
Number of pages: 5
Volume: 2019-September
DOIs
Publication status: Published - 19 Sept 2019
Event: Interspeech 2019 - Graz, Austria
Duration: 15 Sept 2019 - 19 Sept 2019
https://www.interspeech2019.org/

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISSN (Print): 2308-457X

Conference

Conference: Interspeech 2019
Country/Territory: Austria
City: Graz
Period: 15/09/19 - 19/09/19

Keywords

  • found data
  • sequence-to-sequence models
  • speech synthesis
