Analysis of pronunciation learning in end-to-end speech synthesis

Jason Taylor, Korin Richmond

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

Ensuring correct pronunciation for the widest possible variety of text input is vital for deployed text-to-speech (TTS) systems. For languages such as English that do not have trivial spelling, systems have always relied heavily upon a lexicon, both for pronunciation lookup and for training letter-to-sound (LTS) models as a fall-back to handle out-of-vocabulary words (OOVs). In contrast, recently proposed models that are trained “end-to-end” (E2E) aim to avoid linguistic text analysis and any explicit phone representation, instead learning pronunciation implicitly as part of a direct mapping from input characters to speech audio. This might be termed implicit LTS. In this paper, we explore the nature of this approach by training explicit LTS models with datasets commonly used to build E2E systems. We compare their performance with LTS models trained on a high quality English lexicon. We find that LTS errors for words with ambiguous or unpredictable pronunciations are mirrored as mispronunciations by an E2E model. Overall, our analysis suggests that limited and unbalanced lexical coverage in E2E training data may pose significant confounding factors that complicate learning accurate pronunciations in a purely E2E system.

Original language: English
Title of host publication: Proceedings of Interspeech 2019
Publisher: International Speech Communication Association
Pages: 2070-2074
Number of pages: 5
Publication status: Published - 15 Sept 2019
Event: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language - Graz, Austria
Duration: 15 Sept 2019 to 19 Sept 2019
https://www.interspeech2019.org/

Publication series

Publisher: ISCA
ISSN (Electronic): 1990-9772

Conference

Conference: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language
Abbreviated title: INTERSPEECH 2019
Country/Territory: Austria
City: Graz
Period: 15/09/19 to 19/09/19
Keywords / Materials (for Non-textual outputs)

  • End-to-End
  • Grapheme-to-Phoneme
  • Letter-to-Sound
  • Speech Synthesis
