Deep neural network context embeddings for model selection in rich-context HMM synthesis

Thomas Merritt, Junichi Yamagishi, Zhizheng Wu, Oliver Watts, Simon King

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


This paper introduces a novel form of parametric synthesis that uses context embeddings produced by the bottleneck layer of a deep neural network to guide the selection of models in a rich-context HMM-based synthesiser. Rich-context synthesis – in which Gaussian distributions estimated from single linguistic contexts seen in the training data are used for synthesis, rather than more conventional decision-tree-tied models – was originally proposed to address over-smoothing due to averaging across contexts. Our previous investigations have confirmed experimentally that averaging across different contexts is indeed one of the largest factors contributing to the limited quality of statistical parametric speech synthesis. However, a possible weakness of the rich-context approach as previously formulated is that a conventional tied model is still used to guide selection of Gaussians at synthesis time. Our proposed approach replaces this with context embeddings derived from a neural network.
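The core idea in the abstract – represent each training context by a bottleneck-layer embedding, then pick the rich-context Gaussian whose embedding is closest to the target context's embedding – can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names (`select_rich_context`, `cosine_similarity`), the use of cosine similarity as the distance measure, and the toy embeddings are all assumptions for demonstration; the paper itself should be consulted for the actual selection criterion.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_rich_context(target: np.ndarray, bank: np.ndarray) -> int:
    """Return the index of the stored context embedding in `bank`
    (one row per linguistic context seen in training) that is most
    similar to the target context's embedding.

    The selected index would then identify which single-context
    Gaussian to use at synthesis time, in place of a tied model.
    """
    # Row-wise cosine similarity against the target embedding.
    sims = bank @ target / (np.linalg.norm(bank, axis=1) * np.linalg.norm(target))
    return int(np.argmax(sims))

# Toy usage: three stored bottleneck embeddings, one target context.
bank = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.7, 0.7]])
target = np.array([0.6, 0.8])
best = select_rich_context(target, bank)  # index of the closest context
```

In a real system the bank would hold one embedding per training context (extracted from the trained DNN's bottleneck layer) and the lookup would need to be fast, e.g. via an approximate nearest-neighbour index, but the selection principle is the same.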
Original language: English
Title of host publication: Proceedings of Interspeech 2015
Place of publication: Dresden
Publisher: International Speech Communication Association
Publication status: Published - 6 Sep 2015
Event: Interspeech 2015 - Dresden, Germany
Duration: 6 Sep 2015 – 9 Sep 2015


Conference: Interspeech 2015


Keywords:

  • Speech synthesis
  • Deep neural networks
  • Statistical parametric speech synthesis

