Enhancing Sequence-to-Sequence Text-to-Speech with Morphology

Jason Taylor, Korin Richmond

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

Neural sequence-to-sequence (S2S) modelling encodes a single, unified representation for each input sequence. When used for text-to-speech synthesis (TTS), such representations must embed ambiguities between English spelling and pronunciation. For example, in pothole and there the character sequence th sounds different. This can be problematic when predicting pronunciation directly from letters. We posit pronunciation becomes easier to predict when letters are grouped into subword units like morphemes (e.g. a boundary lies between t and h in pothole but not there). Moreover, morphological boundaries can reduce the total number of, and increase the counts of, seen unit subsequences. Accordingly, we test here the effect of augmenting input sequences of letters with morphological boundaries. We find morphological boundaries substantially lower the Word and Phone Error Rates (WER and PER) for a Bi-LSTM performing G2P on one hand, and also increase the naturalness scores of Tacotrons performing TTS in a MUSHRA listening test on the other. The improvements to TTS quality are such that grapheme input augmented with morphological boundaries outperforms phone input without boundaries. Since morphological segmentation may be predicted with high accuracy, we highlight this simple pre-processing step has important potential for S2S modelling in TTS.
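The input augmentation described in the abstract can be illustrated with a minimal sketch. It assumes a morphological segmentation is already available for each word. The boundary symbol `+`, the toy segmentation table, and the function names are hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical boundary token; the symbol actually used in the paper is not stated here.
BOUNDARY = "+"

# Hypothetical hand-written segmentations, for illustration only.
# In practice these would come from a lexicon or a trained segmenter.
TOY_SEGMENTATIONS: Dict[str, List[str]] = {
    "pothole": ["pot", "hole"],
    "there": ["there"],
}


def augment_with_boundaries(
    word: str,
    segmentations: Dict[str, List[str]] = TOY_SEGMENTATIONS,
    boundary: str = BOUNDARY,
) -> List[str]:
    """Return the letters of `word` with a boundary token between morphemes.

    Words missing from the table are treated as a single morpheme.
    """
    morphemes = segmentations.get(word, [word])
    tokens: List[str] = []
    for index, morpheme in enumerate(morphemes):
        if index > 0:
            tokens.append(boundary)
        tokens.extend(morpheme)  # one token per letter
    return tokens


def has_adjacent_pair(tokens: List[str], first: str, second: str) -> bool:
    """Check whether `first` is immediately followed by `second` in `tokens`."""
    return any(a == first and b == second for a, b in zip(tokens, tokens[1:]))


if __name__ == "__main__":
    for word in ("pothole", "there"):
        tokens = augment_with_boundaries(word)
        print(word, tokens, has_adjacent_pair(tokens, "t", "h"))
```

With the boundary inserted, t and h are no longer adjacent in the input for pothole, while they remain adjacent in there, so a model reading the augmented sequence sees two distinct contexts for the two pronunciations.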
Original language: English
Title of host publication: Interspeech 2020
Place of Publication: Shanghai, China
Publisher: International Speech Communication Association
Number of pages: 5
Publication status: Published - 25 Oct 2020
Event: Interspeech 2020 - Virtual Conference, China
Duration: 25 Oct 2020 to 29 Oct 2020
http://www.interspeech2020.org/

Publication series

Name: Interspeech
ISSN (Print): 1990-9772

Conference

Conference: Interspeech 2020
Abbreviated title: INTERSPEECH 2020
Country/Territory: China
City: Virtual Conference
Period: 25/10/20 to 29/10/20

Keywords / Materials (for Non-textual outputs)

  • Speech Synthesis
  • Sequence-to-Sequence
  • Morphology
  • Pronunciation
