Intonation control for neural text-to-speech synthesis with polynomial models of F0

Niamh Corkey, Johannah O'Mahony, Simon King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

We present a novel, user-friendly approach for controlling patterns of intonation (a fundamental aspect of prosody) within a neural TTS system. This involves concisely representing F0 contours with the coefficients of their Legendre polynomial series expansion, and implementing a model (based on FastPitch) which is conditioned on these sets of coefficients during training. At inference time, the model either explicitly predicts a coefficient set, or a user (e.g. a human-in-the-loop) can provide a target coefficient set that audibly alters the intonation of the output speech based on just a few values. This is particularly effective for intonation transfer, where the coefficient targets are extracted from a ground-truth recording, making the synthesised utterance more closely reflect the intonation of the real speaker.
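The core idea of the abstract, compressing an F0 contour into a handful of Legendre series coefficients that a user could then edit, can be illustrated with a short sketch. This is not the authors' implementation: the polynomial order, the synthetic contour, and the mapping of frame indices onto the Legendre domain [-1, 1] are illustrative assumptions, using NumPy's standard Legendre fitting routines.

# Minimal sketch (not the paper's code): fit a low-order Legendre series to an
# F0 contour, then reconstruct an approximation of the contour from the few
# resulting coefficients. The contour, the order, and the frame-to-[-1, 1]
# normalisation are illustrative assumptions.
import numpy as np
from numpy.polynomial import legendre

def f0_to_legendre(f0: np.ndarray, order: int = 4) -> np.ndarray:
    """Fit a Legendre series of the given order to an F0 contour (Hz per frame)."""
    x = np.linspace(-1.0, 1.0, len(f0))       # map frame index onto the Legendre domain
    return legendre.legfit(x, f0, deg=order)  # returns (order + 1) coefficients

def legendre_to_f0(coeffs: np.ndarray, n_frames: int) -> np.ndarray:
    """Reconstruct an F0 contour of n_frames values from Legendre coefficients."""
    x = np.linspace(-1.0, 1.0, n_frames)
    return legendre.legval(x, coeffs)

# Example: a rise-fall contour compressed to 5 coefficients.
frames = 200
f0 = 120 + 40 * np.sin(np.linspace(0, np.pi, frames))  # synthetic contour in Hz
coeffs = f0_to_legendre(f0, order=4)
f0_hat = legendre_to_f0(coeffs, frames)
print(coeffs)                          # a few values a user could edit to alter intonation
print(np.max(np.abs(f0 - f0_hat)))     # reconstruction error of the low-order fit

In this illustration, editing one or two coefficients (e.g. the linear term to tilt the contour upwards) and re-synthesising from the new set mirrors the kind of control the paper describes, where the TTS model is conditioned on such a coefficient set rather than on the full F0 contour.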
Original language: English
Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association
Subtitle of host publication: Interspeech 2023
Editors: Naomi Harte, Julie Carson-Berndsen, Gareth Jones
Place of Publication: Dublin
Publisher: ISCA
Pages: 2014-2015
Number of pages: 2
Publication status: Published - Sept 2023
Event: Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023
Conference number: 24
https://www.interspeech2023.org/

Publication series

Name: Interspeech - Annual Conference of the International Speech Communication Association
Publisher: ISCA
ISSN (Electronic): 2308-457X

Conference

Conference: Interspeech 2023
Country/Territory: Ireland
City: Dublin
Period: 20/08/23 - 24/08/23
Internet address: https://www.interspeech2023.org/

Keywords / Materials (for Non-textual outputs)

  • text-to-speech
  • speech synthesis
  • intonation modelling
  • prosody control
  • prosody transfer
