Synthesising Filled Pauses: Representation and Datamixing

Rasmus Dall, Marcus Tomalin, Mirjam Wester

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Filled pauses occur frequently in spontaneous human speech, yet modern text-to-speech synthesis systems rarely model these disfluencies overtly, and consequently they do not output convincing synthetic filled pauses. This paper presents a text-to-speech system that is specifically designed to model these disfluencies more effectively. A preparatory investigation shows that a synthetic voice trained exclusively on spontaneous speech is perceived to be inferior in quality to a voice trained entirely on read speech, even though the latter does not handle filled pauses well. This motivates an investigation into the phonetic representation of filled pauses, which shows that, in a preference test, a distinct phone for filled pauses is preferred over both the standard /V/ phone and the alternative /@/ phone. In addition, we present a variety of data-mixing techniques that combine the strengths of standard synthesis systems trained on read speech corpora with the supplementary advantages offered by systems trained on spontaneous speech. In a MUSHRA-style test, the best overall quality is obtained by combining the two types of corpora using a source-marking technique. Specifically, general speech is synthesised with a standard mark, while filled pauses are synthesised with a spontaneous mark, which has the added benefit of producing comparatively well-synthesised filled pauses.
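The source-marking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `FP` phone symbol, the filled-pause word list, the toy lexicon, and the `Token` and `mark_for_synthesis` names are all assumptions made for illustration. The sketch only shows how each word of an input utterance might be tagged with a source mark before synthesis, with filled pauses receiving the spontaneous mark and a distinct phone while all other words receive the standard mark.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumption: a tiny filled-pause inventory mapped to a hypothetical
# distinct phone "FP" (the abstract does not name the symbol used).
FILLED_PAUSES: Dict[str, List[str]] = {
    "uh": ["FP"],
    "um": ["FP", "m"],
}


@dataclass
class Token:
    word: str
    phones: List[str]
    source: str  # "standard" (read speech) or "spontaneous"


def mark_for_synthesis(words: List[str], lexicon: Dict[str, List[str]]) -> List[Token]:
    """Attach a source mark to every word of an utterance.

    Filled pauses get the spontaneous mark and the distinct phone;
    everything else gets the standard mark and its lexicon phones.
    """
    tokens = []
    for word in words:
        key = word.lower()
        if key in FILLED_PAUSES:
            tokens.append(Token(word, list(FILLED_PAUSES[key]), "spontaneous"))
        else:
            tokens.append(Token(word, list(lexicon[key]), "standard"))
    return tokens


# Toy lexicon with illustrative phone strings (assumed, not from the paper).
toy_lexicon = {"i": ["aI"], "think": ["T", "I", "N", "k"]}
tokens = mark_for_synthesis(["I", "um", "think"], toy_lexicon)
for t in tokens:
    print(t.word, t.phones, t.source)
```

In a real system the mark would typically enter the model as an additional contextual feature during both training and synthesis; the sketch stops at the tagging step because the abstract gives no further detail.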
Original language: English
Title of host publication: Proceedings of 9th ISCA Workshop on Speech Synthesis
Number of pages: 7
Publication status: Published - 15 Sep 2016
Event: 9th ISCA Workshop on Speech Synthesis 2017 - Sunnyvale, United States
Duration: 13 Sep 2017 → 15 Jun 2018


Conference: 9th ISCA Workshop on Speech Synthesis 2017
Abbreviated title: ISCA 2017
Country: United States