Median-based generation of synthetic speech durations using a non-parametric approach

S. Ronanki, O. Watts, Simon King, G. E. Henter

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

This paper proposes a new approach to duration modelling for statistical parametric speech synthesis in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame). Unlike conventional approaches to duration modelling - which assume that duration distributions have a particular form (e.g., a Gaussian) and use the mean of that distribution for synthesis - our approach can in principle model any distribution supported on the non-negative integers. Generation from this model can be performed in many ways; here we consider output generation based on the median predicted duration. The median is more typical (more probable) than the conventional mean duration, is robust to training-data irregularities, and enables incremental generation. Furthermore, a frame-level approach to duration prediction is consistent with a longer-term goal of modelling durations and acoustic features together. Results indicate that the proposed method is competitive with baseline approaches in approximating the median duration of held-out natural speech.
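
The median-based generation described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the model emits, for each frame of a phone, the probability of transitioning to the next phone given that no transition has happened yet, and that frames are counted from 1. The function name `median_duration` and the example probabilities are hypothetical. Under these assumptions the probability that the phone is still ongoing after frame d is the product of (1 - p_t) over frames 1 to d, and the median is the first frame at which that product falls to 0.5 or below.

```python
from typing import Iterable, Optional


def median_duration(transition_probs: Iterable[float]) -> Optional[int]:
    """Return the smallest duration d (in frames) with P(D <= d) >= 0.5.

    transition_probs: hypothetical per-frame transition probabilities, each
    conditional on the phone not having ended at an earlier frame.
    Returns None if the cumulative probability never reaches 0.5.
    """
    survival = 1.0  # probability that the phone has not yet ended
    for frame, p in enumerate(transition_probs, start=1):
        survival *= 1.0 - p
        if survival <= 0.5:
            # Stop at the first frame where half the mass is covered;
            # later frames are never inspected.
            return frame
    return None


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    print(median_duration([0.1, 0.2, 0.3, 0.4, 0.5]))
```

Because the loop returns as soon as the cumulative probability reaches one half, it never needs predictions for later frames, which is one way to read the abstract's remark that the median enables incremental generation.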
Original language: English
Title of host publication: 2016 IEEE Spoken Language Technology Workshop (SLT)
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Number of pages: 7
ISBN (Electronic): 978-1-5090-4903-5
ISBN (Print): 978-1-5090-4904-2
Publication status: Published - 9 Feb 2017
Event: 2016 IEEE Spoken Language Technology Workshop - San Diego, United States
Duration: 13 Dec 2016 to 16 Dec 2016


Conference: 2016 IEEE Spoken Language Technology Workshop
Abbreviated title: IEEE SLT 2016
Country/Territory: United States
City: San Diego