Wavelets for intonation modeling in HMM speech synthesis

Antti Suni, Daniel Aalto, Tuomo Raitio, Paavo Alku, Martti Vainio

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


The pitch contour in speech contains information about different linguistic units at several distinct temporal scales. At the finest level, the microprosodic cues are purely segmental in nature, whereas at the coarser time scales, lexical tones, word accents, and phrase accents appear with both linguistic and paralinguistic functions. Consequently, the pitch movements happen on different temporal scales: the segmental perturbations are faster than typical pitch accents, and so forth. In the HMM-based speech synthesis paradigm, slower intonation patterns are not easy to model. The statistical procedure of decision tree clustering highlights instances that are more common, resulting in good reproduction of microprosody and declination, but with less variation at the word and phrase level compared to human speech. Here we present a system that uses wavelets to decompose the pitch contour into five temporal scales ranging from microprosody to the utterance level. Each component is then individually trained within the HMM framework and used in a superpositional manner at the synthesis stage. The resulting system is compared to a baseline where only one decision tree is trained to generate the pitch contour.
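The decomposition and superposition idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' transform: it swaps in a difference-of-Gaussians band-pass split, which only approximates a wavelet analysis but makes the five components sum back to the input exactly. The function names (`gaussian_smooth`, `decompose`, `superpose`), the `base_sigma` value, the octave spacing of the smoothing widths, and the synthetic contour are all assumptions made for illustration. The sketch also assumes the log-F0 contour has already been interpolated over unvoiced frames.

```python
import numpy as np


def gaussian_smooth(x, sigma):
    """Smooth a 1-D signal with a normalized Gaussian kernel (edge-padded)."""
    radius = int(np.ceil(3 * sigma))
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(x, radius, mode="edge")
    # "valid" on the padded signal returns exactly len(x) samples.
    return np.convolve(padded, kernel, mode="valid")


def decompose(log_f0, n_scales=5, base_sigma=2.0):
    """Split a contour into n_scales components, fastest first.

    Hypothetical stand-in for a wavelet decomposition: each component is the
    difference between two successively smoother versions of the contour,
    with smoothing widths one octave apart (an assumption). The last
    component is the smoothest version, i.e. the slowest trend.
    """
    signal = np.asarray(log_f0, dtype=float)
    smoothed = [signal]
    for k in range(n_scales - 1):
        smoothed.append(gaussian_smooth(signal, base_sigma * 2 ** k))
    components = [smoothed[k] - smoothed[k + 1] for k in range(n_scales - 1)]
    components.append(smoothed[-1])
    return np.stack(components)


def superpose(components):
    """Recombine the components by simple summation."""
    return np.sum(components, axis=0)


# Synthetic log-F0 contour (illustrative only): declination, two slow
# accent-like bumps, and small fast perturbations.
rng = np.random.default_rng(0)
frames = np.arange(400)
contour = (
    5.0 - 0.001 * frames
    + 0.15 * np.exp(-0.5 * ((frames - 100) / 20.0) ** 2)
    + 0.10 * np.exp(-0.5 * ((frames - 280) / 30.0) ** 2)
    + 0.01 * rng.standard_normal(frames.size)
)

parts = decompose(contour)
rebuilt = superpose(parts)
print("components:", parts.shape[0])
print("max reconstruction error:", np.max(np.abs(rebuilt - contour)))
```

In a system like the one described, each row of `parts` would be modeled as a separate stream and the generated streams summed at synthesis time; the telescoping differences used here guarantee that summing unmodified components recovers the original contour up to floating-point error.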
Original language: English
Title of host publication: Proc. 8th ISCA Speech Synthesis Workshop, 2013
Number of pages: 6
Publication status: Published - 2013


  • HMM-based synthesis, intonation modeling, wavelet decomposition

