Abstract
This paper explores whether adding Discourse Relation (DR) features improves the naturalness of neural statistical parametric speech synthesis (SPSS) in English. In light of several previous studies, we first hypothesize that DRs have a dedicated prosodic encoding. Second, we hypothesize that encoding DRs in a speech synthesizer's input will improve the naturalness of its output. To test these hypotheses, we prepare a dataset of DR-annotated transcriptions of English audiobooks. We then perform an acoustic analysis of the corpus, which supports our first hypothesis that DRs are acoustically encoded in speech prosody: the analysis reveals significant correlations between specific DR categories and acoustic features such as F0 and intensity. Next, we use the corpus to train a neural SPSS system in two configurations: a baseline that uses only conventional linguistic features, and an experimental configuration in which these are supplemented with DR features. Augmenting the inputs with DR features improves objective acoustic scores on a test set and leads to a significant listener preference in a forced-choice AB naturalness test.
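The difference between the two configurations can be illustrated with a minimal sketch. The abstract does not say how DRs are encoded in the synthesizer's input, so the one-hot encoding, the label inventory `DR_LABELS`, and the helper names `one_hot` and `augment` below are all assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical DR label inventory; the actual categories used in the paper
# are not listed in the abstract.
DR_LABELS = ["none", "contrast", "cause", "elaboration"]


def one_hot(label, inventory=DR_LABELS):
    """Encode a DR label as a one-hot vector over the assumed inventory."""
    if label not in inventory:
        raise ValueError(f"unknown DR label: {label!r}")
    return [1.0 if label == item else 0.0 for item in inventory]


def augment(linguistic_features, dr_label):
    """Return the baseline feature vector with a one-hot DR code appended."""
    return list(linguistic_features) + one_hot(dr_label)


# Placeholder values standing in for conventional linguistic features.
baseline = [0.2, 0.0, 1.0]
# Experimental input: the same features plus the DR code for this unit.
experimental = augment(baseline, "contrast")
```

Under these assumptions the baseline system would be trained on `baseline`-style vectors and the experimental system on `experimental`-style vectors, so the only difference between the two is the appended DR code.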
Original language | English |
---|---|
Title of host publication | Interspeech 2019 |
Publisher | ISCA |
Pages | 4470-4474 |
Volume | 2019-September |
Publication status | Published - 19 Sept 2019 |
Event | 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, Graz, Austria. Duration: 15 Sept 2019 → 19 Sept 2019. https://www.interspeech2019.org/ |
Publication series
Name | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
---|---|
ISSN (Print) | 2308-457X |
Conference
Conference | 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language |
---|---|
Abbreviated title | INTERSPEECH 2019 |
Country/Territory | Austria |
City | Graz |
Period | 15/09/19 → 19/09/19 |
Internet address | https://www.interspeech2019.org/ |
Keywords / Materials (for Non-textual outputs)
- Discourse
- Prosody
- Speech synthesis