Testing the limits of representation mixing for pronunciation correction in end-to-end speech synthesis

Jason Fong, Jason Taylor, Simon King

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

Accurate pronunciation is an essential requirement for text-to-speech (TTS) systems. Systems trained on raw text exhibit pronunciation errors in output speech due to ambiguous letter-to-sound relations. Without an intermediate phonemic representation, it is difficult to intervene and correct these errors. Retaining explicit control over pronunciation runs counter to the current drive toward end-to-end (E2E) TTS using sequence-to-sequence models. On the one hand, E2E TTS aims to eliminate manual intervention, especially expert skill such as phonemic transcription of words in a lexicon. On the other, a system making difficult-to-correct pronunciation errors is of little practical use. Some intervention is necessary. We explore the minimal set of linguistic features required to correct pronunciation errors in an otherwise E2E TTS system that accepts graphemic input. We use representation mixing: within each sequence the system accepts graphemic input, phonemic input, or a mix of both. We quantify how little training data needs to be phonemically labelled - that is, how small a lexicon must be written - to ensure control over pronunciation. We find modest correction is possible with 500 phonemised word types from the LJ Speech dataset, but correction works best when the majority of word types are phonemised with syllable boundaries.
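To make the idea of representation mixing concrete, the following is a minimal sketch of how a single input sequence might combine graphemes and phonemes. It is not the authors' implementation: the function name `mix_tokens`, the `@` prefix used to mark phoneme tokens, the `.` syllable-boundary symbol, and the toy lexicon entries are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon (illustrative entries only, not from the paper).
# Each word maps to a phoneme list; "." marks an assumed syllable boundary.
TOY_LEXICON: Dict[str, List[str]] = {
    "read": ["R", "EH", "D"],
    "record": ["R", "EH", ".", "K", "ER", "D"],
}


def mix_tokens(
    words: List[str],
    lexicon: Dict[str, List[str]],
    phonemise: Set[str],
) -> List[str]:
    """Build one mixed input sequence.

    Words listed in `phonemise` (and present in `lexicon`) are emitted as
    phoneme tokens prefixed with "@"; all other words are emitted as
    lower-cased characters. Words are separated by a single space token.
    """
    tokens: List[str] = []
    for i, word in enumerate(words):
        key = word.lower()
        if key in phonemise and key in lexicon:
            # Phonemic representation for this word.
            tokens.extend(f"@{p}" for p in lexicon[key])
        else:
            # Graphemic representation: one token per character.
            tokens.extend(key)
        if i < len(words) - 1:
            tokens.append(" ")
    return tokens


# Force the past-tense pronunciation of "read" while leaving the rest as text.
example = mix_tokens(["I", "read", "it"], TOY_LEXICON, {"read"})
```

In a setup like this, correcting a mispronounced word at synthesis time would amount to adding it to `phonemise` with the desired lexicon entry, while the rest of the sentence stays graphemic.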
Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association
Pages: 4019-4023
Number of pages: 5
Volume: 2020-October
Publication status: Published - 31 Oct 2020
Event: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China
Duration: 25 Oct 2020 to 29 Oct 2020

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISSN (Print): 2308-457X

Conference

Conference: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Country/Territory: China
City: Shanghai
Period: 25/10/20 to 29/10/20

Keywords / Materials (for Non-textual outputs)

  • pronunciation control
  • representation mixing
  • speech synthesis
