Abstract / Description of output
Sequence-to-sequence neural networks with attention mechanisms have recently been widely adopted for text-to-speech. Compared with older, more modular statistical parametric synthesis systems, sequence-to-sequence systems feature three prominent innovations: 1) They replace substantial parts of traditional fixed front-end processing pipelines (like Festival’s) with learned text analysis; 2) They jointly learn to align text and speech and to synthesise speech audio from text; 3) They operate autoregressively on previously-generated acoustics. Naturalness improvements have been reported relative to earlier systems which do not contain these innovations. It would be useful to know how much each of these innovations contributes to the improved performance. Here we propose one way of associating the separately-learned components of a representative older modular system, specifically Merlin, with the different sub-networks within recent neural sequence-to-sequence architectures, specifically Tacotron 2 and DCTTS. This allows us to swap various components and subnets in and out to produce intermediate systems that step between the two paradigms; subjective evaluation of these systems then allows us to isolate the perceptual effects of the various innovations. We report on the design, evaluation, and findings of such an experiment.
Original language | English |
---|---|
Title of host publication | 10th ISCA Speech Synthesis Workshop |
Publisher | International Speech Communication Association |
Pages | 217-222 |
Number of pages | 6 |
Publication status | Published - 22 Sept 2019 |
Event | The 10th ISCA Speech Synthesis Workshop, Austrian Museum of Folk Life and Folk Art, Vienna, Austria. Duration: 20 Sept 2019 → 22 Sept 2019. Conference number: 10. http://ssw10.oeaw.ac.at/index.html |
Publication series
Name | |
---|---|
Publisher | International Speech Communication Association |
ISSN (Electronic) | 2312-2846 |
Conference
Conference | The 10th ISCA Speech Synthesis Workshop |
---|---|
Abbreviated title | SSW 2019 |
Country/Territory | Austria |
City | Vienna |
Period | 20/09/19 → 22/09/19 |
Keywords / Materials (for Non-textual outputs)
- Speech synthesis
- end-to-end
- SPSS
- naturalness
Projects
- 1 Finished
- SCRIPT: Speech Synthesis for Spoken Content Production
  Yamagishi, J., King, S. & Watts, O.
  1/12/16 → 30/11/19
  Project: Research
Datasets
- Listening-test materials for "Where do the improvements come from in sequence-to-sequence neural TTS?"
  Valentini Botinhao, C. (Creator), Fong, J. (Creator), Watts, O. (Creator) & Henter, G. E. (Creator), Edinburgh DataShare, 17 Nov 2020
  DOI: 10.7488/ds/2952
  Dataset