Liaison and pronunciation learning in end-to-end text-to-speech in French

Jason Taylor, Sébastien Le Maguer, Korin Richmond

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

Sequence-to-sequence (S2S) TTS models like Tacotron have grapheme-only inputs when trained fully end-to-end. Grapheme inputs map to phone sounds depending on context, which traditionally is handled by extensive preprocessing in the TTS front-end. However, French orthography does not provide a clear one-to-one mapping between graphemes and sounds, and in English, which similarly has rather non-phonetic orthography, pronunciations are a significant cause of error in S2S-TTS with grapheme-inputs. In this paper, we test implicit pronunciation knowledge where graphemes do not map directly to phones. Implicit pronunciation knowledge learnt in S2S-TTS is similar to a standalone grapheme-to-phoneme (G2P) model, which makes explicit phone predictions at the sequential level. We find grapheme-input S2S-TTS makes implicit pronunciation errors similar to explicit G2P models - notably for foreign names. In a traditional front-end pipeline, there are also post-lexical rules which modify G2P output at the sequential level. In French, post-lexical rules require a deep knowledge of linguistic structure in a process called Liaison. Without explicit rules, we find S2S-TTS with grapheme-inputs over-inserts Liaison sounds, leading to a significant preference for a phone-based equivalent. By testing with linguistically-motivated stimuli, we observe differences that would otherwise go undetected.
Original language: English
Title of host publication: Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)
Publication status: Published - 28 Aug 2021
Event: The 11th ISCA Speech Synthesis Workshop (SSW11) - Gárdony, Hungary
Duration: 26 Aug 2021 - 28 Aug 2021
Conference number: 11


Conference: The 11th ISCA Speech Synthesis Workshop (SSW11)
Abbreviated title: SSW11

Keywords / Materials (for Non-textual outputs)

  • text-to-speech
  • phoneme
  • liaison
  • enchaînement

