Where do the improvements come from in sequence-to-sequence neural TTS?

Oliver Watts, Gustav Henter, Jason Fong, Cassia Valentini-Botinhao

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

Sequence-to-sequence neural networks with attention mechanisms have recently been widely adopted for text-to-speech. Compared with older, more modular statistical parametric synthesis systems, sequence-to-sequence systems feature three prominent innovations: 1) They replace substantial parts of traditional fixed front-end processing pipelines (like Festival’s) with learned text analysis; 2) They jointly learn to align text and speech and to synthesise speech audio from text; 3) They operate autoregressively on previously-generated acoustics. Naturalness improvements have been reported relative to earlier systems which do not contain these innovations. It would be useful to know how much each of these innovations contributes to the improved performance. Here we propose one way of associating the separately-learned components of a representative older modular system, specifically Merlin, with the different sub-networks within recent neural sequence-to-sequence architectures, specifically Tacotron 2 and DCTTS. This allows us to swap in and out various components and subnets to produce intermediate systems that step between the two paradigms; subjective evaluation of these systems then allows us to isolate the perceptual effects of the various innovations. We report on the design, evaluation, and findings of such an experiment.
Original language: English
Title of host publication: 10th ISCA Speech Synthesis Workshop
Publisher: International Speech Communication Association
Pages: 217-222
Number of pages: 6
Publication status: Published - 22 Sept 2019
Event: The 10th ISCA Speech Synthesis Workshop - Austrian museum of folk life and folk art in Vienna, Vienna, Austria
Duration: 20 Sept 2019 to 22 Sept 2019
Conference number: 10
http://ssw10.oeaw.ac.at/index.html

Publication series

Publisher: International Speech Communication Association
ISSN (Electronic): 2312-2846

Conference

Conference: The 10th ISCA Speech Synthesis Workshop
Abbreviated title: SSW 2019
Country/Territory: Austria
City: Vienna
Period: 20/09/19 to 22/09/19

Keywords / Materials (for Non-textual outputs)

  • Speech synthesis
  • end-to-end
  • SPSS
  • naturalness
