Median-based generation of synthetic speech durations using a non-parametric approach

S. Ronanki, O. Watts, Simon King, G. E. Henter

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract / Description of output

This paper proposes a new approach to duration modelling for statistical parametric speech synthesis in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame). Unlike conventional approaches to duration modelling - which assume that duration distributions have a particular form (e.g., a Gaussian) and use the mean of that distribution for synthesis - our approach can in principle model any distribution supported on the non-negative integers. Generation from this model can be performed in many ways; here we consider output generation based on the median predicted duration. The median is more typical (more probable) than the conventional mean duration, is robust to training-data irregularities, and enables incremental generation. Furthermore, a frame-level approach to duration prediction is consistent with a longer-term goal of modelling durations and acoustic features together. Results indicate that the proposed method is competitive with baseline approaches in approximating the median duration of held-out natural speech.
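The frame-level idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the model has already produced one transition probability per frame for a phone, that frames are counted from 1, and that the median is taken as the smallest duration whose cumulative probability reaches 0.5. The function names `duration_pmf` and `median_duration` and the example probabilities are hypothetical.

```python
from typing import Iterable, List


def duration_pmf(transition_probs: List[float]) -> List[float]:
    """Return P(D = d) for d = 1..len(transition_probs).

    Assumption: transition_probs[t-1] is the probability that the phone
    ends at frame t, given that it has not ended before frame t.
    """
    pmf = []
    survival = 1.0  # probability that the phone is still ongoing
    for p in transition_probs:
        pmf.append(survival * p)  # survive so far, then transition now
        survival *= 1.0 - p
    return pmf


def median_duration(transition_probs: Iterable[float]) -> int:
    """Return the smallest d with P(D <= d) >= 0.5.

    Frames are consumed one at a time, so the result is available as soon
    as the survival probability falls to 0.5 or below. Later frames are
    never inspected, which is what makes incremental generation possible.
    """
    survival = 1.0
    frames = 0
    for p in transition_probs:
        frames += 1
        survival *= 1.0 - p
        if survival <= 0.5:
            return frames
    raise ValueError("cumulative probability never reached 0.5")


# Made-up per-frame transition probabilities for a single phone.
example_probs = [0.1, 0.2, 0.5, 0.9]
example_pmf = duration_pmf(example_probs)
example_median = median_duration(example_probs)
```

No distributional form is imposed here: any sequence of per-frame probabilities defines a valid distribution over positive integer durations, and the median depends only on the frames seen up to the point where half of the probability mass has been passed.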
Original language: English
Title of host publication: 2016 IEEE Spoken Language Technology Workshop (SLT)
Publisher: Institute of Electrical and Electronics Engineers
Pages: 686-692
Number of pages: 7
ISBN (Electronic): 978-1-5090-4903-5
ISBN (Print): 978-1-5090-4904-2
Publication status: Published - 9 Feb 2017
Event: 2016 IEEE Spoken Language Technology Workshop - San Diego, United States
Duration: 13 Dec 2016 to 16 Dec 2016
https://www2.securecms.com/SLT2016//Default.asp

Conference

Conference: 2016 IEEE Spoken Language Technology Workshop
Abbreviated title: IEEE SLT 2016
Country/Territory: United States
City: San Diego
Period: 13/12/16 to 16/12/16