Abstract / Description of output
Current approaches to statistical parametric speech synthesis using neural networks generally require input at the same temporal resolution as the output, typically a frame every 5 ms, or in some cases at the waveform sampling rate. It is therefore necessary to fabricate highly redundant frame-level (or sample-level) linguistic features at the input. This paper proposes the use of a hierarchical encoder-decoder model to perform the sequence-to-sequence regression in a way that takes the input linguistic features at their original timescales and preserves the relationships between words, syllables and phones. The proposed model is designed to make more effective use of suprasegmental features than conventional architectures, as well as being computationally efficient. Experiments were conducted on prosodically varied audiobook material because the use of suprasegmental features is thought to be particularly important in this case. Both objective measures and results from subjective listening tests, which asked listeners to focus on prosody, show that the proposed method performs significantly better than a conventional architecture that requires the linguistic input to be at the acoustic frame rate.
We provide code and a recipe to enable our system to be reproduced using the Merlin toolkit.
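The redundancy of frame-level input mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's model or the Merlin recipe; the function name `upsample_to_frames`, the two-dimensional feature vectors, and the phone durations are all hypothetical, and only the 5 ms frame shift comes from the abstract. The sketch shows the conventional step of repeating each phone-level feature vector once per acoustic frame, which is what a model working at the original timescales would avoid.

```python
from typing import List

FRAME_SHIFT_MS = 5  # frame shift mentioned in the abstract


def upsample_to_frames(
    phone_features: List[List[float]],
    durations_ms: List[int],
    frame_shift_ms: int = FRAME_SHIFT_MS,
) -> List[List[float]]:
    """Repeat each phone-level vector once per acoustic frame (hypothetical helper)."""
    if len(phone_features) != len(durations_ms):
        raise ValueError("need exactly one duration per phone")
    frames: List[List[float]] = []
    for vector, duration in zip(phone_features, durations_ms):
        # Every phone covers at least one frame.
        n_frames = max(1, duration // frame_shift_ms)
        # Copy the vector for each frame so frames do not share one list object.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


# Illustrative values only: two phones lasting 50 ms and 100 ms.
phone_features = [[1.0, 0.0], [0.0, 1.0]]
durations_ms = [50, 100]

frames = upsample_to_frames(phone_features, durations_ms)
print(len(phone_features), "phone vectors become", len(frames), "frame vectors")
```

In this toy case two distinct vectors are expanded into thirty frame-level copies, so the frame-rate input carries no information beyond the phone-level sequence and its durations.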
Original language | English |
---|---|
Title of host publication | Proceedings Interspeech 2017 |
Pages | 1133-1137 |
Number of pages | 5 |
DOIs | |
Publication status | Published - 24 Aug 2017 |
Event | Interspeech 2017 - Stockholm, Sweden Duration: 20 Aug 2017 → 24 Aug 2017 http://www.interspeech2017.org/ |
Publication series
Name | |
---|---|
Publisher | ISCA |
ISSN (Electronic) | 1990-9772 |
Conference
Conference | Interspeech 2017 |
---|---|
Country/Territory | Sweden |
City | Stockholm |
Period | 20/08/17 → 24/08/17 |
Internet address |
Fingerprint
Dive into the research topics of 'A Hierarchical Encoder-Decoder Model for Statistical Parametric Speech Synthesis'. Together they form a unique fingerprint.
Projects
- 1 Finished
- SCRIPT: Speech Synthesis for Spoken Content Production
Yamagishi, J., King, S. & Watts, O.
1/12/16 → 30/11/19
Project: Research