Abstract / Description of output
Neural sequence-to-sequence (S2S) modelling encodes a single, unified representation for each input sequence. When used for text-to-speech synthesis (TTS), such representations must embed ambiguities between English spelling and pronunciation. For example, the character sequence th is pronounced differently in pothole and in there. This can be problematic when predicting pronunciation directly from letters. We posit that pronunciation becomes easier to predict when letters are grouped into subword units such as morphemes (e.g. a boundary lies between t and h in pothole but not in there). Moreover, morphological boundaries can reduce the total number of seen unit subsequences and increase the counts of those that are seen. Accordingly, we test here the effect of augmenting input sequences of letters with morphological boundaries. We find that morphological boundaries substantially lower the Word and Phone Error Rates (WER and PER) of a Bi-LSTM performing grapheme-to-phoneme conversion (G2P), and also increase the naturalness scores of Tacotron models performing TTS in a MUSHRA listening test. The improvements to TTS quality are such that grapheme input augmented with morphological boundaries outperforms phone input without boundaries. Since morphological segmentation may be predicted with high accuracy, we highlight that this simple pre-processing step has important potential for S2S modelling in TTS.
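As a rough illustration of the pre-processing step and the two G2P metrics named in the abstract, the following is a minimal Python sketch and not the authors' implementation. The boundary symbol `|`, the function names, and the assumption that words arrive already segmented into morphemes are all hypothetical. WER is taken here in the usual G2P sense (the fraction of words with at least one phone error) and PER as the total phone edit distance divided by the number of reference phones; the abstract does not spell out either definition.

```python
from typing import List, Sequence

# Hypothetical boundary symbol; the token actually used in the paper is not stated here.
BOUNDARY = "|"


def augment_with_boundaries(morphemes: Sequence[str], boundary: str = BOUNDARY) -> List[str]:
    """Flatten a pre-segmented word into letters, with a boundary token between morphemes."""
    tokens: List[str] = []
    for index, morpheme in enumerate(morphemes):
        if index > 0:
            tokens.append(boundary)
        tokens.extend(morpheme)  # a string extends the list letter by letter
    return tokens


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]) -> float:
    """Assumed definition: total phone edits divided by total reference phones."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def word_error_rate(refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]) -> float:
    """Assumed definition: fraction of words whose predicted phones are not an exact match."""
    wrong = sum(1 for r, h in zip(refs, hyps) if list(r) != list(h))
    return wrong / len(refs)


# The boundary separates t from h in "pothole"; "there" is a single morpheme.
pothole = augment_with_boundaries(["pot", "hole"])  # p o t | h o l e
there = augment_with_boundaries(["there"])          # t h e r e
```

With the boundary in place, the model never sees t and h as adjacent input symbols in pothole, which is the disambiguation the abstract describes.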
Original language | English |
---|---|
Title of host publication | Interspeech 2020 |
Place of Publication | Shanghai, China |
Publisher | International Speech Communication Association |
Number of pages | 5 |
Publication status | Published - 25 Oct 2020 |
Event | Interspeech 2020, Virtual Conference, China. Duration: 25 Oct 2020 → 29 Oct 2020 (http://www.interspeech2020.org/) |
Publication series
Name | Interspeech |
---|---|
ISSN (Print) | 1990-9772 |
Conference
Conference | Interspeech 2020 |
---|---|
Abbreviated title | INTERSPEECH 2020 |
Country/Territory | China |
City | Virtual Conference |
Period | 25/10/20 → 29/10/20 |
Keywords / Materials (for Non-textual outputs)
- Speech Synthesis
- Sequence-to-Sequence
- Morphology
- Pronunciation