Abstract / Description of output
Ensuring correct pronunciation for the widest possible variety of text input is vital for deployed text-to-speech (TTS) systems. For languages such as English that do not have trivial spelling, systems have always relied heavily upon a lexicon, both for pronunciation lookup and for training letter-to-sound (LTS) models as a fall-back to handle out-of-vocabulary words (OOVs). In contrast, recently proposed models that are trained “end-to-end” (E2E) aim to avoid linguistic text analysis and any explicit phone representation, instead learning pronunciation implicitly as part of a direct mapping from input characters to speech audio. This might be termed implicit LTS. In this paper, we explore the nature of this approach by training explicit LTS models with datasets commonly used to build E2E systems. We compare their performance with LTS models trained on a high-quality English lexicon. We find that LTS errors for words with ambiguous or unpredictable pronunciations are mirrored as mispronunciations by an E2E model. Overall, our analysis suggests that limited and unbalanced lexical coverage in E2E training data may pose significant confounding factors that complicate learning accurate pronunciations in a purely E2E system.
Original language | English |
---|---|
Title of host publication | Proceedings of Interspeech 2019 |
Publisher | International Speech Communication Association |
Pages | 2070-2074 |
Number of pages | 5 |
Publication status | Published - 15 Sept 2019 |
Event | 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language - Graz, Austria; Duration: 15 Sept 2019 → 19 Sept 2019; https://www.interspeech2019.org/ |
Publication series
Name | |
---|---|
Publisher | ISCA |
ISSN (Electronic) | 1990-9772 |
Conference
Conference | 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language |
---|---|
Abbreviated title | INTERSPEECH 2019 |
Country/Territory | Austria |
City | Graz |
Period | 15/09/19 → 19/09/19 |
Keywords / Materials (for Non-textual outputs)
- End-to-End
- Grapheme-to-Phoneme
- Letter-to-Sound
- Speech Synthesis