Analysis of pronunciation learning in end-to-end speech synthesis

Jason Taylor, Korin Richmond

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


Ensuring correct pronunciation for the widest possible variety of text input is vital for deployed text-to-speech (TTS) systems. For languages such as English that do not have trivial spelling, systems have always relied heavily upon a lexicon, both for pronunciation lookup and for training letter-to-sound (LTS) models as a fall-back to handle out-of-vocabulary words (OOVs). In contrast, recently proposed models that are trained “end-to-end” (E2E) aim to avoid linguistic text analysis and any explicit phone representation, instead learning pronunciation implicitly as part of a direct mapping from input characters to speech audio. This might be termed implicit LTS. In this paper, we explore the nature of this approach by training explicit LTS models with datasets commonly used to build E2E systems. We compare their performance with LTS models trained on a high quality English lexicon. We find that LTS errors for words with ambiguous or unpredictable pronunciations are mirrored as mispronunciations by an E2E model. Overall, our analysis suggests that limited and unbalanced lexical coverage in E2E training data may pose significant confounding factors that complicate learning accurate pronunciations in a purely E2E system.

Original language: English
Title of host publication: Proceedings of Interspeech 2019
Publisher: International Speech Communication Association
Number of pages: 5
Publication status: Published - 15 Sep 2019
Event: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language - Graz, Austria
Duration: 15 Sep 2019 - 19 Sep 2019

Publication series

ISSN (Electronic): 1990-9772


Conference: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language
Abbreviated title: INTERSPEECH 2019


  • End-to-End
  • Grapheme-to-Phoneme
  • Letter-to-Sound
  • Speech Synthesis

